M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning

July 11, 2025 · 6 min read

Ant Group

📖 Technical Report | 🤗 Hugging Face | 🤖 ModelScope

Introduction

We introduce M2-Reasoning-7B, a model designed to excel in both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality data samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy with step-wise optimization to mitigate conflicts between data, and task-specific rewards for delivering tailored incentive signals. This combination of curated data and advanced training allows M2-Reasoning-7B to set a new state-of-the-art (SOTA) across 8 benchmarks, showcasing superior performance in both general and spatial reasoning domains.

📌 Updates

[2025.07.14] 🔥 Our Technical Report is now publicly available on arXiv.
[2025.07.11] 🔥 We release M2-Reasoning on 🤗 Hugging Face and 🤖 ModelScope.

Key Features

A High-Quality Data Construction Pipeline: We design and implement a multi-stage data synthesis and curation pipeline that generates 294.2K reasoning samples (168K for cold-start fine-tuning and 126.2K for RLVR).
A Dynamic Multi-Task Training Strategy: We propose a sophisticated training strategy that effectively handles data heterogeneity. It features step-wise dynamic optimization to mitigate conflicts between different data sources and a task-specific reward formulation to provide tailored incentive signals.
Unified General and Spatial Reasoning Model: We propose M2-Reasoning-7B, an MLLM engineered for both abstract and spatial reasoning. Evaluations on 8 distinct benchmarks show that, with our custom data and training pipelines, M2-Reasoning-7B establishes new state-of-the-art (SOTA) results across both general and spatial reasoning domains.
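To make the idea of task-specific rewards concrete, here is a minimal sketch. This post does not give the reward definitions, so everything below is a hypothetical illustration rather than the formulation used to train M2-Reasoning: the task names, the exact-match rule for choice-style answers, and the exponentially decaying numeric reward are all assumptions.

```python
import math
from typing import Union


def exact_match_reward(pred: str, gold: str) -> float:
    """Binary reward for choice-style answers (hypothetical rule)."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def numeric_reward(pred: float, gold: float, scale: float = 1.0) -> float:
    """Reward that decays smoothly with relative error (hypothetical form)."""
    if gold == 0:
        return 1.0 if pred == 0 else 0.0
    rel_err = abs(pred - gold) / abs(gold)
    return math.exp(-rel_err / scale)


def compute_reward(task: str, pred: Union[str, float], gold: Union[str, float]) -> float:
    """Dispatch to a reward function by task type (task names are assumed)."""
    if task == "general":
        return exact_match_reward(str(pred), str(gold))
    if task == "spatial_numeric":
        return numeric_reward(float(pred), float(gold))
    raise ValueError(f"unknown task: {task!r}")


print(compute_reward("general", "C", "c"))
print(compute_reward("spatial_numeric", 9.0, 10.0))
```

The point of the split is that a near-miss on a distance estimate can still earn partial credit, while a multiple-choice answer is either right or wrong.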

Evaluation

We conduct a comprehensive evaluation of our models across two key domains: general and spatial reasoning. Our evaluation utilizes a diverse set of public benchmarks, grouped by the primary capability they measure:

General Reasoning (Mathematical & Logical): To evaluate this capability, we employ six benchmarks: MathVista, MathVision, MathVerse, DynaMath, WeMath, and LogicVista.

Models	MathVista	MathVision	MathVerse	DynaMath	WeMath	LogicVista	Avg. (Δ)
Base-Scale General Models
InternVL3-8B	70.5	30.0	38.5	25.7	39.5	44.5	41.4
InternVL3-9B	69.0	29.3	37.9	25.1	34.8	49.0	40.8
Qwen2.5-VL-7B	68.1	25.4	41.1	21.8	36.2	47.9	40.1
MUG-U-7B	74.8	26.1	35.4	17.2	26.5	39.8	36.6
SAIL-VL-1.6-8B	74.2	23.2	33.4	14.0	29.6	41.4	36.0
Base-Scale Reasoning Models
WeThink-VL-7B	71.6	26.0	44.2	24.8	48.0	51.2	44.3 (+4.2)
Taichu-VLR-7B	72.3	27.1	46.7	23.0	44.0	48.3	43.6
VLAA-Thinker-7B	68.0	26.4	48.2	22.4	41.5	48.5	42.5 (+2.4)
URSA-8B-PS-GRPO	67.8	31.8	41.5	22.4	38.3	44.7	41.1 (+8.2)
Ovis2-8B	71.8	25.9	42.3	20.4	27.2	39.4	37.8
Our Models
Base Model	70.2	25.9	30.5	20.2	27.2	37.8	35.5
M2-Reasoning-CI-7B	71.7	29.2	42.1	25.0	42.8	46.8	42.9 (+7.4)
M2-Reasoning-7B	75.0	31.5	44.7	26.8	41.8	50.0	45.0 (+9.5)
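As a reading aid for the last column, the short sketch below recomputes the M2-Reasoning-7B average from its six scores and its gain over the Base Model. It assumes that Δ is the difference between a model's average and the average of its own base model, which matches the two "Our Models" rows; the base average of 35.5 is taken as listed in the table.

```python
def average(scores):
    """Mean of the benchmark scores, rounded to one decimal as in the table."""
    return round(sum(scores) / len(scores), 1)


# MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista
m2_scores = [75.0, 31.5, 44.7, 26.8, 41.8, 50.0]
base_avg = 35.5  # Base Model average, as listed in the table

m2_avg = average(m2_scores)
delta = round(m2_avg - base_avg, 1)

print(f"{m2_avg:.1f} (+{delta:.1f})")  # 45.0 (+9.5)
```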

Spatial Reasoning: We assess this skill using two benchmarks: CV-Bench and VSI-Bench.

CV-Bench:

Models	Count	Relation	Depth	Distance	Avg.
Large-Scale Models
GPT-4o	65.9	85.7	87.8	78.2	78.9
Gemini-1.5-pro	70.4	85.2	82.4	72.8	77.4
Base-Scale Models
InternVL3-8B	74.0	90.6	84.3	81.0	82.0
Qwen2.5-VL-7B-Instruct	65.2	86.6	70.6	79.8	75.0
LLaVA-NeXT-Video-7B	59.3	77.0	71.3	54.7	65.2
Our Models
M2-Reasoning-7B	66.6	92.8	89.3	84.3	82.3

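VSI-Bench (OC: object count, AD: absolute distance, OS: object size, RS: room size, RDs: relative distance, RDr: relative direction, RP: route plan, AO: appearance order):

Models	OC	AD	OS	RS	RDs	RDr	RP	AO	Avg.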
Large-Scale Models
Gemini-1.5-pro	56.2	30.9	64.1	43.6	51.3	46.3	36.0	34.6	45.4
GPT-4o	46.2	5.3	43.8	38.2	37.0	41.3	31.5	28.5	34.0
Base-Scale Models
InternVL3-8B	68.1	39.0	48.4	33.6	48.3	36.4	27.3	35.4	42.1
Video-R1-7B	-	-	-	-	-	-	-	-	37.1
Qwen2.5-VL-7B-Instruct	37.7	20.1	49.7	37.4	38.5	40.4	31.4	32.0	35.9
LLaVA-NeXT-Video-7B	48.5	14.0	47.8	24.2	43.5	42.4	34.0	30.6	35.6
Our Models
M2-Reasoning-7B	41.0	34.0	60.9	55.4	40.7	47.3	29.9	28.8	42.3

Model Downloads

You can download the model from both Hugging Face and ModelScope.

If you're in mainland China, we strongly recommend downloading the model from ModelScope.
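The download step can be scripted roughly as follows. This is a sketch under assumptions: it requires the `huggingface_hub` or `modelscope` package to be installed, the Hugging Face repo id is the default used by the example script in this post, and the ModelScope id is assumed to be the same. The helper name `download_model` is made up for illustration.

```python
# Repo id taken from the example script's default; assumed identical on ModelScope.
DEFAULT_REPO_ID = "inclusionAI/M2-Reasoning"


def download_model(repo_id: str = DEFAULT_REPO_ID, source: str = "huggingface") -> str:
    """Download a model snapshot and return the local directory path."""
    if source == "huggingface":
        # Imported lazily so only the hub you actually use must be installed.
        from huggingface_hub import snapshot_download
        return snapshot_download(repo_id=repo_id)
    if source == "modelscope":
        from modelscope import snapshot_download
        return snapshot_download(repo_id)
    raise ValueError(f"unknown source: {source!r}")


# Usage (performs a network download, so it is left commented out):
# local_dir = download_model(source="modelscope")
# print(local_dir)
```

The returned directory can then be passed as `--model_name_or_path` to the example script.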

Example Usage

The reference environment is python=3.10, torch=2.6.0+cu124, transformers=4.49.0.
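Before running the example, it can help to compare the local setup against these versions. The check below is only a sketch: the function name `check_environment` is hypothetical, and matching on a version prefix (so that `2.6.0+cu124` counts as `2.6.0`) is a choice made here, not a requirement stated by the authors.

```python
import sys
from importlib import metadata

# Versions listed in this post; the CUDA build suffix is ignored by the prefix match.
EXPECTED = {"torch": "2.6.0", "transformers": "4.49.0"}


def check_environment(expected=EXPECTED):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        matches = installed is not None and installed.startswith(wanted)
        report[package] = (installed, matches)
    return report


print("python", "%d.%d" % sys.version_info[:2])
for package, (installed, matches) in check_environment().items():
    print(package, installed, "OK" if matches else "differs from the listed version")
```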

The following is a small example of how to use this repo.

import os
import torch

from transformers import (
    AutoProcessor,
    AutoTokenizer,
)

import warnings
import argparse
# Model and processor classes come from this repo's own modules, not from transformers.
from modeling_bailing_qwen2_5 import Bailing_qwen2_5NativeForConditionalGeneration
from processing_bailing_qwen2_5 import Bailing_qwen2_5Processor

warnings.filterwarnings("ignore")

class BailingMMInfer:
    def __init__(self,
        model_name_or_path,
        device="cuda",
        max_pixels=None,
        min_pixels=None,
        video_max_pixels=768 * 28 * 28,
        video_min_pixels=128 * 28 * 28,
        generation_config=None
    ):
        super().__init__()
        self.model_name_or_path = model_name_or_path

        self.device = device

        self.device_map = device

        self.video_max_pixels = video_max_pixels if video_max_pixels is not None else 768 * 28 * 28
        self.video_min_pixels = video_min_pixels if video_min_pixels is not None else 128 * 28 * 28

        self.model, self.tokenizer, self.processor = self.load_model_processor()
        if max_pixels is not None:
            self.processor.max_pixels = max_pixels
        if min_pixels is not None:
            self.processor.min_pixels = min_pixels
        if generation_config is None:
            generation_config = {
                "num_beams": 1,
                "do_sample": True,
                "temperature": 0.9
            }

        self.generation_config = generation_config


    def load_model_processor(self):

        model = Bailing_qwen2_5NativeForConditionalGeneration.from_pretrained(
            self.model_name_or_path,
            torch_dtype=torch.bfloat16,
            device_map=self.device_map,
            _attn_implementation="flash_attention_2"
        ).eval()

        tokenizer = AutoTokenizer.from_pretrained(self.model_name_or_path, add_bos_token=True, trust_remote_code=True)
        processor = Bailing_qwen2_5Processor.from_pretrained(self.model_name_or_path, trust_remote_code=True)

        return model, tokenizer, processor

    def generate(self, messages, max_new_tokens=512):
        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True, use_system=True
        )

        image_inputs, video_inputs = self.processor.process_vision_info(messages)


        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
        )
        # print(inputs)
        print(self.tokenizer.decode(inputs['input_ids'][0]))

        inputs = inputs.to(self.device)

        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)

        with torch.no_grad():
            generated_ids = self.model.generate(
                inputs,
                max_new_tokens=max_new_tokens,
                eos_token_id=self.processor.tokenizer.eos_token_id,
                **self.generation_config,
            )

        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]

        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
        )[0]

        return output_text

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', type=str, default="inclusionAI/M2-Reasoning")
    parser.add_argument('--max_pixels', type=int, default=401408)
    parser.add_argument('--min_pixels', type=int, default=401408)
    parser.add_argument('--max_new_tokens', type=int, default=4096)

    args = parser.parse_args()

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # model_name_or_path = os.path.join(args.input_dir, args.model_name_or_path)
    bailing2 = BailingMMInfer(
        args.model_name_or_path,
        device=device,
        max_pixels=args.max_pixels,
        min_pixels=args.min_pixels
    )

    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <think>...</think> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}."}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "./assets/example1.png"},
                {"type": "text", "text": "\nQuestion:\n\nRhombus $QRST$ has an area of 137.9 square meters. If $RT$ is 12.2 meters, find $QS$.\nA. 11.3\nB. 22.4\nC. 22.6\nD. 25.6"},
            ],
        },
    ]
    output_text = bailing2.generate(messages, max_new_tokens=args.max_new_tokens)
    print(output_text)



'''
[Output]:

<think>
To find the length of \( QS \) in the rhombus \( QRST \), we can use the formula for the area of a rhombus, which is given by:

\[
\text{Area} = \frac{1}{2} \times d_1 \times d_2
\]

where \( d_1 \) and \( d_2 \) are the lengths of the diagonals. In this problem, we are given:
- The area of the rhombus is 137.9 square meters.
- One of the diagonals, ...
'''
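Because the system prompt asks for `<think>...</think>`, `<answer>...</answer>`, and a `\boxed{}` result, the generated text can be split into those parts after generation. The helper below is a minimal sketch with assumed names (`parse_response` is not part of the repo), and the sample string is hand-written for illustration, not actual model output. The last line checks the expected diagonal directly from the rhombus area formula, QS = 2 × Area / RT.

```python
import re
from typing import Dict, Optional


def _first(pattern: str, text: str) -> Optional[str]:
    """Return the first captured group, stripped, or None if there is no match."""
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_response(text: str) -> Dict[str, Optional[str]]:
    """Split a response into its reasoning, answer, and boxed result."""
    answer = _first(r"<answer>(.*?)</answer>", text)
    return {
        "think": _first(r"<think>(.*?)</think>", text),
        "answer": answer,
        # Look for \boxed{...} in the answer if present, else in the whole text.
        # Nested braces inside \boxed{} are not handled by this simple pattern.
        "boxed": _first(r"\\boxed\{([^{}]*)\}", answer or text),
    }


# Hand-written sample in the requested format (not real model output).
sample = "<think>QS = 2 * 137.9 / 12.2</think><answer>\\boxed{C}</answer>"
parsed = parse_response(sample)
print(parsed["boxed"])               # C
print(round(2 * 137.9 / 12.2, 1))    # 22.6, which is option C
```

A response cut off before `</think>` yields `None` for every field, which makes truncation (for example from a small `max_new_tokens`) easy to detect.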
