📖 Technical Report | 🤗 Hugging Face | 🤖 ModelScope

Introduction

We introduce M2-Reasoning-7B, a model that excels at both general and spatial reasoning. Our approach combines two key innovations: (1) a novel data pipeline that generates 294.2K high-quality samples (168K for cold-start fine-tuning and 126.2K for RLVR), featuring logically coherent reasoning trajectories that have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy that mitigates conflicts between data sources through step-wise optimization and provides tailored incentive signals via task-specific rewards. By combining this carefully curated data with advanced training methods, M2-Reasoning-7B sets new state-of-the-art (SOTA) results across 8 benchmarks, demonstrating superior performance in both general and spatial reasoning.

📌 Updates

Key Features

  • High-quality data construction pipeline: We design and implement a multi-stage data synthesis and curation pipeline that produces large volumes of reasoning data.
  • Dynamic multi-task training strategy: We propose an efficient training strategy that effectively handles data heterogeneity. It combines step-wise dynamic optimization, which mitigates conflicts between different data sources, with a task-specific reward mechanism that provides tailored incentive signals (see the illustrative sketch after this list).
  • Unified general and spatial reasoning model: We present M2-Reasoning-7B, a multimodal large language model (MLLM) designed for both general and spatial reasoning tasks. Extensive evaluation on 8 distinct benchmarks shows that, with our tailored data and training pipeline, M2-Reasoning achieves new SOTA results in both general and spatial reasoning.
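This repository does not ship reference code for the reward mechanism; the snippet below is only a minimal, hypothetical sketch of how task-specific verifiable rewards for RLVR could be routed by task type. All function names, the \boxed{} parsing, and the numeric tolerance are illustrative assumptions, not the authors' implementation.

import re

def extract_boxed(text):
    """Return the content of the last \\boxed{...} in a response (illustrative)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def general_reward(response, answer):
    """Exact-match reward for math/logic questions."""
    pred = extract_boxed(response)
    return 1.0 if pred is not None and pred == str(answer).strip() else 0.0

def spatial_reward(response, answer, rel_tol=0.10):
    """Tolerance-based reward for numeric spatial estimates (distance, size, ...)."""
    pred = extract_boxed(response)
    try:
        value = float(pred)
    except (TypeError, ValueError):
        return 0.0
    return 1.0 if abs(value - float(answer)) <= rel_tol * abs(float(answer)) else 0.0

def task_specific_reward(task, response, answer):
    """Dispatch each rollout to the reward that matches its task type."""
    reward_fn = {"general": general_reward, "spatial": spatial_reward}[task]
    return reward_fn(response, answer)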

Evaluation

We comprehensively evaluate the model on both general and spatial reasoning. Our evaluation uses a diverse set of public benchmarks, grouped by the capability they primarily measure:

  • General reasoning (mathematics and logic): To assess this capability, we adopt six benchmarks: MathVista, MathVision, MathVerse, DynaMath, WeMath, and LogicVista.
| Models | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Avg. (Δ) |
|---|---|---|---|---|---|---|---|
| **Base-scale general models** | | | | | | | |
| InternVL3-8B | 70.5 | 30.0 | 38.5 | 25.7 | 39.5 | 44.5 | 41.4 |
| InternVL3-9B | 69.0 | 29.3 | 37.9 | 25.1 | 34.8 | 49.0 | 40.8 |
| Qwen2.5-VL-7B | 68.1 | 25.4 | 41.1 | 21.8 | 36.2 | 47.9 | 40.1 |
| MUG-U-7B | 74.8 | 26.1 | 35.4 | 17.2 | 26.5 | 39.8 | 36.6 |
| SAIL-VL-1.6-8B | 74.2 | 23.2 | 33.4 | 14.0 | 29.6 | 41.4 | 36.0 |
| **Base-scale reasoning models** | | | | | | | |
| WeThink-VL-7B | 71.6 | 26.0 | 44.2 | 24.8 | 48.0 | 51.2 | 44.3 (+4.2) |
| Taichu-VLR-7B | 72.3 | 27.1 | 46.7 | 23.0 | 44.0 | 48.3 | 43.6 |
| VLAA-Thinker-7B | 68.0 | 26.4 | 48.2 | 22.4 | 41.5 | 48.5 | 42.5 (+2.4) |
| URSA-8B-PS-GRPO | 67.8 | 31.8 | 41.5 | 22.4 | 38.3 | 44.7 | 41.1 (+8.2) |
| Ovis2-8B | 71.8 | 25.9 | 42.3 | 20.4 | 27.2 | 39.4 | 37.8 |
| **Ours** | | | | | | | |
| Base Model | 70.2 | 25.9 | 30.5 | 20.2 | 27.2 | 37.8 | 35.5 |
| M2-Reasoning-CI-7B | 71.7 | 29.2 | 42.1 | 25.0 | 42.8 | 46.8 | 42.9 (+7.4) |
| M2-Reasoning-7B | 75.0 | 31.5 | 44.7 | 26.8 | 41.8 | 50.0 | 45.0 (+9.5) |

Δ denotes the gain of a reasoning model over its corresponding base model (for our models, the Base Model row above); a short check of how Avg. and Δ are read follows the table.
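To make the Avg. (Δ) column concrete: averaging the six benchmark scores of M2-Reasoning-7B reproduces the reported 45.0, and subtracting the Base Model average (35.5) gives the +9.5 gain.

scores = {
    "MathVista": 75.0, "MathVision": 31.5, "MathVerse": 44.7,
    "DynaMath": 26.8, "WeMath": 41.8, "LogicVista": 50.0,
}
avg = sum(scores.values()) / len(scores)   # 44.97, reported as 45.0
delta = round(avg, 1) - 35.5               # gain over the Base Model average
print(f"Avg. = {avg:.1f} (Δ = +{delta:.1f})")   # Avg. = 45.0 (Δ = +9.5)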
  • Spatial reasoning: We use two benchmarks to evaluate this capability: CV-Bench and VSI-Bench.

    • CV-Bench:

    | Models | Count | Relation | Depth | Distance | Avg. |
    |---|---|---|---|---|---|
    | **Large-scale models** | | | | | |
    | GPT-4o | 65.9 | 85.7 | 87.8 | 78.2 | 78.9 |
    | Gemini-1.5-pro | 70.4 | 85.2 | 82.4 | 72.8 | 77.4 |
    | **Base-scale models** | | | | | |
    | InternVL3-8B | 74.0 | 90.6 | 84.3 | 81.0 | 82.0 |
    | Qwen2.5-VL-7B-Instruct | 65.2 | 86.6 | 70.6 | 79.8 | 75.0 |
    | LLaVA-NeXT-Video-7B | 59.3 | 77.0 | 71.3 | 54.7 | 65.2 |
    | **Ours** | | | | | |
    | M2-Reasoning-7B | 66.6 | 92.8 | 89.3 | 84.3 | 82.3 |
    • VSI-Bench:

    | Models | OC | AD | OS | RS | RDs | RDr | RP | AO | Avg. |
    |---|---|---|---|---|---|---|---|---|---|
    | **Large-scale models** | | | | | | | | | |
    | Gemini-1.5-pro | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 | 45.4 |
    | GPT-4o | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 | 34.0 |
    | **Base-scale models** | | | | | | | | | |
    | InternVL3-8B | 68.1 | 39.0 | 48.4 | 33.6 | 48.3 | 36.4 | 27.3 | 35.4 | 42.1 |
    | Video-R1-7B | - | - | - | - | - | - | - | - | 37.1 |
    | Qwen2.5-VL-7B-Instruct | 37.7 | 20.1 | 49.7 | 37.4 | 38.5 | 40.4 | 31.4 | 32.0 | 35.9 |
    | LLaVA-NeXT-Video-7B | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | 34.0 | 30.6 | 35.6 |
    | **Ours** | | | | | | | | | |
    | M2-Reasoning-7B | 41.0 | 34.0 | 60.9 | 55.4 | 40.7 | 47.3 | 29.9 | 28.8 | 42.3 |

    Column abbreviations: OC = object count, AD = absolute distance, OS = object size, RS = room size, RDs = relative distance, RDr = relative direction, RP = route plan, AO = appearance order.

Model Download

You can download the model from either Hugging Face or ModelScope. If you are in mainland China, we recommend downloading from ModelScope.
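For instance, with the huggingface_hub Python API (the repo id inclusionAI/M2-Reasoning is taken from the example script below; ModelScope offers an analogous snapshot_download in its own SDK):

from huggingface_hub import snapshot_download

# Download the weights to the local Hugging Face cache and print the resulting path.
local_dir = snapshot_download(repo_id="inclusionAI/M2-Reasoning")
print(local_dir)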

Usage Example

The base environment is: python=3.10, torch=2.6.0+cu124, transformers=4.49.0.
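Note that the example below loads the model with _attn_implementation="flash_attention_2", so the flash-attn package must also be installed. A quick, minimal sanity check of the environment (version pins as listed above) could look like:

import importlib.util

import torch
import transformers

# Confirm the pinned versions and that a CUDA device plus flash-attn are available.
print("torch:", torch.__version__)                # expected: 2.6.0+cu124
print("transformers:", transformers.__version__)  # expected: 4.49.0
print("CUDA available:", torch.cuda.is_available())
print("flash-attn installed:", importlib.util.find_spec("flash_attn") is not None)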

We provide a simple example below showing how to use the model.

import os
import torch

from transformers import (
    AutoProcessor,
    AutoTokenizer,
)

import warnings
import argparse
from modeling_bailing_qwen2_5 import Bailing_qwen2_5NativeForConditionalGeneration
from processing_bailing_qwen2_5 import Bailing_qwen2_5Processor

warnings.filterwarnings("ignore")

class BailingMMInfer:
    def __init__(self,
        model_name_or_path,
        device="cuda",
        max_pixels=None,
        min_pixels=None,
        video_max_pixels=768 * 28 * 28,
        video_min_pixels=128 * 28 * 28,
        generation_config=None
    ):
        super().__init__()
        self.model_name_or_path = model_name_or_path

        self.device = device

        self.device_map = device

        self.video_max_pixels = video_max_pixels if video_max_pixels is not None else 768 * 28 * 28
        self.video_min_pixels = video_min_pixels if video_min_pixels is not None else 128 * 28 * 28

        self.model, self.tokenizer, self.processor = self.load_model_processor()
        if max_pixels is not None:
            self.processor.max_pixels = max_pixels
        if min_pixels is not None:
            self.processor.min_pixels = min_pixels
        if generation_config is None:
            generation_config = {
                "num_beams": 1,
                "do_sample": True,
                "temperature": 0.9
            }

        self.generation_config = generation_config


    def load_model_processor(self):
        
        model = Bailing_qwen2_5NativeForConditionalGeneration.from_pretrained(
            self.model_name_or_path,
            torch_dtype=torch.bfloat16,
            device_map=self.device_map,
            _attn_implementation="flash_attention_2"
        ).eval()

        tokenizer = AutoTokenizer.from_pretrained(self.model_name_or_path, add_bos_token=True, trust_remote_code=True)
        processor = Bailing_qwen2_5Processor.from_pretrained(self.model_name_or_path, trust_remote_code=True)

        return model, tokenizer, processor

    def generate(self, messages, max_new_tokens=512):
        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True, use_system=True
        )

        image_inputs, video_inputs = self.processor.process_vision_info(messages)


        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
        )
        # Print the fully rendered prompt so the chat template can be inspected.
        print(self.tokenizer.decode(inputs['input_ids'][0]))

        inputs = inputs.to(self.device)

        # Cast image/video features to bfloat16 to match the model weights.
        for k in inputs.keys():
            if k in ("pixel_values", "pixel_values_videos"):
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)

        with torch.no_grad():
            generated_ids = self.model.generate(
                inputs,
                max_new_tokens=max_new_tokens,
                eos_token_id=self.processor.tokenizer.eos_token_id,
                **self.generation_config,
            )

        # Strip the prompt tokens so only the newly generated response is decoded.
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]

        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
        )[0]

        return output_text

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', type=str, default="inclusionAI/M2-Reasoning")
    parser.add_argument('--max_pixels', type=int, default=401408)
    parser.add_argument('--min_pixels', type=int, default=401408)
    parser.add_argument('--max_new_tokens', type=int, default=4096)

    args = parser.parse_args()

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # model_name_or_path = os.path.join(args.input_dir, args.model_name_or_path)
    bailing2 = BailingMMInfer(
        args.model_name_or_path, 
        device=device, 
        max_pixels=args.max_pixels, 
        min_pixels=args.min_pixels
    )

    messages = [
        {
            "role": "system", 
            "content": [
                {"type": "text", "text": "You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <think>...</think> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}."}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "./assets/example1.png"},
                {"type": "text", "text": "\nQuestion:\n\nRhombus $QRST$ has an area of 137.9 square meters. If $RT$ is 12.2 meters, find $QS$.\nA. 11.3\nB. 22.4\nC. 22.6\nD. 25.6"},
            ],
        },
    ]
    output_text = bailing2.generate(messages, max_new_tokens=args.max_new_tokens)
    print(output_text)



'''
[Output]:

<think>
To find the length of \( QS \) in the rhombus \( QRST \), we can use the formula for the area of a rhombus, which is given by:

\[
\text{Area} = \frac{1}{2} \times d_1 \times d_2
\]

where \( d_1 \) and \( d_2 \) are the lengths of the diagonals. In this problem, we are given:
- The area of the rhombus is 137.9 square meters.
- One of the diagonals,