
Ming-Omni: A Unified Multimodal Model for Perception and Generation

· 10 min read
inclusionAI
Ant Group

GITHUB | 📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-lite-omni is a light version of Ming-omni, derived from Ling-lite, with 2.8 billion activated parameters. Ming-lite-omni is a unified multimodal model that can process images, text, audio, and video, and shows strong capabilities in speech and image generation. Ming-lite-omni uses dedicated encoders to extract tokens from different modalities, which are then processed by Ling, a MoE architecture equipped with newly proposed modality-specific routers. This design allows a single model to efficiently process and fuse multimodal inputs within a unified framework, supporting diverse tasks without separate models, task-specific fine-tuning, or structural changes. Importantly, Ming-lite-omni goes beyond conventional multimodal models by supporting both audio and image generation: it integrates an advanced audio decoder for natural speech and leverages Ming-Lite-Uni for high-quality image generation, enabling context-aware chat, text-to-speech, and versatile image editing. Our experimental results show that Ming-lite-omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-lite-omni is, to our knowledge, the first open-source model to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
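As a rough illustration of the modality-specific routing idea (a toy sketch, not the actual Ming-lite-omni implementation; all class and parameter names are hypothetical), a MoE layer can share one expert pool while keeping a separate router per modality:

import torch
import torch.nn as nn

class ModalityRoutedMoE(nn.Module):
    """Toy MoE layer: one shared expert pool, one dedicated router per modality."""

    def __init__(self, dim=512, num_experts=8, top_k=2, modalities=("text", "image", "audio", "video")):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        # One router per modality, so routing is learned separately for tokens of each modality.
        self.routers = nn.ModuleDict({m: nn.Linear(dim, num_experts) for m in modalities})

    def forward(self, tokens, modality):
        # tokens: (num_tokens, dim), already produced by the modality's dedicated encoder
        logits = self.routers[modality](tokens)                          # (num_tokens, num_experts)
        weights, indices = logits.softmax(-1).topk(self.top_k, dim=-1)   # route each token to top-k experts
        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(tokens[mask])
        return out

layer = ModalityRoutedMoE()
print(layer(torch.randn(4, 512), modality="image").shape)  # torch.Size([4, 512])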

📌 Updates

  • [2025.06.12] 🔥 Our technical report has been publicly released on arXiv.
  • [2025.05.28] 🔥 The official version of Ming-lite-omni is released, with better performance and image-generation support.
  • [2025.05.04] 🔥 The test version of Ming-lite-omni is released: Ming-lite-omni-Preview.

Key Features

  • Unified omni-modal perception: Built on Ling, an MoE-architecture LLM, Ming-lite-omni resolves task conflicts through modality-specific routers and ensures coherent fusion of tokens from different modalities.

  • Unified perception and generation: Ming-lite-omni unifies understanding and generation, allowing the model to interpret multimodal instructions and user intent during generation, which improves generation quality and makes multi-task usage more convenient.

  • Innovative generation capabilities: Ming-lite-omni perceives all modalities while generating high-quality text, real-time speech, and vivid images, delivering excellent cross-modal performance on diverse tasks such as image perception, audio-visual interaction, and image generation.

Evaluation

Ming-lite-omni delivers strong cross-modal performance on image perception, audio-visual interaction, and image generation tasks. Specifically, on image perception it matches Qwen2.5-VL-7B while activating only 2.8 billion parameters. It outperforms Qwen2.5-Omni and Kimi-Audio on end-to-end speech understanding and instruction following. It also supports native-resolution image generation, editing, and style transfer, reaching a GenEval score of 0.64 and surpassing mainstream models such as SDXL. On the FID metric, Ming-lite-omni achieves 4.85, a new state of the art among existing methods.

Image benchmark

| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
|---|---|---|---|
| AI2D | 83.1 | 84.4 | 84.5 |
| HallusionBench | 55.0 | 55.8 | 51.7 |
| MMBench_TEST_V11 | 80.8 | 82.8 | 82.0 |
| MMMU | 56.3 | 56.6 | 54.8 |
| MMStar | 64.7 | 65.3 | 65.2 |
| MMVet | 71.3 | 71.6 | 68.1 |
| MathVista | 71.6 | 68.1 | 67.9 |
| OCRBench | 88.4 | 87.8 | 88.2 |
| Average | 71.4 | 71.5 | 70.3 |

Encyclopedia Benchmarks

| Object Recognition | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| Plants | 54.96 | 47.8 |
| Animals | 56.7 | 50.85 |
| Vehicles | 41.91 | 42.29 |
| Food & Ingredients | 62.28 | 54.09 |
| Dishes | 44.3 | 39.07 |
| General | 91.08 | 92.42 |
| Average | 58.54 | 54.43 |

Video benchmark

| Benchmarks | Ming-lite-omni | Qwen2.5VL-7B-Instruct |
|---|---|---|
| VideoMME | 67.0 | 67.3 |
| MVBench | 67.7 | 67.4 |
| Video-MMMU | 46.3 | 47.4 |
| LongVideoBench | 56.6 | 54.7 |
| Average | 59.4 | 59.2 |

Note: All models are evaluated based on 128 uniformly sampled frames.
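For reference, uniformly sampling 128 frames from a clip can be done along these lines (a minimal sketch; the decoding library and file path are illustrative choices, not part of this repo):

import numpy as np
from decord import VideoReader  # illustrative video decoder

def sample_uniform_frames(video_path, num_frames=128):
    vr = VideoReader(video_path)
    # Evenly spaced frame indices spanning the whole clip.
    indices = np.linspace(0, len(vr) - 1, num_frames).round().astype(int)
    return vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3)

frames = sample_uniform_frames("example.mp4")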

Audio benchmark

SpeechQA

| Model | Average | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
|---|---|---|---|---|---|---|---|---|
| Qwen2-Audio-chat | 3.545 | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Baichuan-Audio | 3.695 | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
| GLM-4-Voice | 3.77 | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Kimi-Audio | 4.215 | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.21 | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| Ming-lite-omni | 4.34 | 4.63 | 4.06 | 58.84 | 47.53 | 61.98 | 58.36 | 99.04 |

ASR

| Model | aishell1 | aishell2_android | aishell2_ios | cv15_zh | fleurs_zh | wenetspeech_meeting | wenetspeech_net | librispeech_test_clean | librispeech_test_other | multilingual_librispeech | cv15_en | fleurs_en | voxpopuli_v1.0_en |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ming-lite-omni | 1.47 | 2.55 | 2.52 | 6.31 | 2.96 | 5.95 | 5.46 | 1.44 | 2.80 | 4.15 | 6.89 | 3.39 | 5.80 |
| Qwen2.5-Omni | 1.18 | 2.75 | 2.63 | 5.20 | 3.00 | 5.90 | 7.70 | 1.80 | 3.40 | 7.56 | 7.60 | 4.10 | 5.80 |
| Qwen2-Audio | 1.53 | 2.92 | 2.92 | 6.90 | 7.50 | 7.16 | 8.42 | 1.60 | 3.60 | 5.40 | 8.60 | 6.90 | 6.84 |
| Kimi-Audio | 0.60 | 2.64 | 2.56 | 7.21 | 2.69 | 6.28 | 5.37 | 1.28 | 2.42 | 5.88 | 10.31 | 4.44 | 7.97 |

Information-Seeking Benchmark

| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
|---|---|---|---|
| GPT-4o | 36.05 | - | - |
| PaLI-X | 22.06 | 23.5 | 20.8 |
| Qwen2.5-vl-32B | 19.35 | 20.55 | 18.28 |
| Ming-lite-omni | 27.7 | 30.4 | 25.4 |

OCR

| Model | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| ChartQA_TEST | 85.1 | 87.3 |
| DocVQA_TEST | 93 | 95.7 |
| OCRBenchV2_en/zh | 53.3/52 | 56.3/57.2 |
| OmniDocBench↓ | 34/34.4 | 30.8/39.8 |
| TextVQA_VAL | 82.8 | 84.9 |

GUI

| Model | Ming-lite-omni | InternVL3 8B | Qwen2.5-VL-7B-Instruct |
|---|---|---|---|
| ScreenSpot | 82.1 | 79.5 | 78.9* |
| ScreenSpot-V2 | 84.1 | 81.4 | - |
| AITZ(EM) | 66.6 | - | 57.6* |

Note: * denotes the reproduced results.

Unified Generation Benchmark

| Model | single_object | two_object | counting | colors | position | color_attr | GENEVAL | DPGBench | FID↓ |
|---|---|---|---|---|---|---|---|---|---|
| Ming-lite-omni | 0.9875 | 0.7727 | 0.6812 | 0.7872 | 0.31 | 0.29 | 0.64 | 81.72 | 4.85 |
| Metaquery-XL | - | - | - | - | - | - | 0.61 | 82.05 | 6.02 |
| SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 | 68.09 | 26.96 |
| Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 | 80.60 | - |
| SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 74.65 | 8.76 |
| Janus | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 | 79.68 | 10.10 |
| JanusFlow | - | - | - | - | - | - | 0.63 | 80.09 | 9.51 |

Please refer to our technical report for more comprehensive evaluation results.

Model Downloads

You can download the model from both Hugging Face and ModelScope.

| Model | Input modality | Output modality | Download |
|---|---|---|---|
| Ming-Lite-Omni | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace · 🤖 ModelScope |

If you are located in mainland China, we strongly recommend downloading the model from 🤖 ModelScope.
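For example, a minimal download sketch using the huggingface_hub Python API (the local directory is illustrative):

from huggingface_hub import snapshot_download

# Fetch the full model repository into a local folder.
snapshot_download("inclusionAI/Ming-Lite-Omni", local_dir="inclusionAI/Ming-Lite-Omni")

The ModelScope CLI used elsewhere in these posts (modelscope download --model <model_id> --local_dir <target_dir>) works the same way.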

Environment Preparation

Installation with pip

pip install -r requirements.txt
# for python 3.10
pip install data/matcha_tts-0.0.5.1-cp310-cp310-linux_x86_64.whl
# for python 3.8
# pip install data/matcha_tts-0.0.5.1-cp38-cp38-linux_x86_64.whl
pip install diffusers==0.33.0
pip install nvidia-cublas-cu12==12.4.5.8 # for H20 GPU

Installation with docker

You can also initialize the environment by building the docker image. First clone this repository:

git clone --depth 1 https://github.com/inclusionAI/Ming.git
cd Ming

Then build the docker image with the provided Dockerfile in docker/docker-py310-cu121. This step might take a while:

docker build -t ming:py310-cu121 docker/docker-py310-cu121

At last, start the container with the current repo directory mounted:

docker run -it --gpus all -v "$(pwd)":/workspace/Ming ming:py310-cu121 /bin/bash

You can run the model with the Python interface. You may first download the Hugging Face model into the repo directory (.../Ming/), or mount the downloaded model path when starting the container.

Example Usage

We provide a step-by-step running example:

Step 1 - Download the source code

git clone https://github.com/inclusionAI/Ming.git
cd Ming

Step 2 - Download the model weights and create a soft link to the source code directory

Download our model following Model Downloads

mkdir inclusionAI
ln -s /path/to/inclusionAI/Ming-Lite-Omni inclusionAI/Ming-Lite-Omni

Step 3 - Enter the code directory, then refer to the following code to run the Ming-Lite-Omni model.

jupyter notebook cookbook.ipynb

We also provide a simple example on the usage of this repo. For detailed usage, please refer to cookbook.ipynb.

import torch
from transformers import AutoProcessor, GenerationConfig
from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# load model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    "inclusionAI/Ming-Lite-Omni",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True
).to("cuda")

# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True)

# qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
        ],
    },
]

# 1. Format inputs using chat template
text = processor.apply_chat_template(messages, add_generation_prompt=True)

# 2. Extract vision/audio data
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)

# 3. Prepare tensor inputs
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    audios=audio_inputs,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
for k in inputs.keys():
    if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
        inputs[k] = inputs[k].to(dtype=torch.bfloat16)

# 4. Configure generation
generation_config = GenerationConfig.from_dict({'no_repeat_ngram_size': 10})
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=True,
    eos_token_id=processor.gen_terminator,
    generation_config=generation_config,
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

# 5. Decode output
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
# Output:

# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍:
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。
# ......

Note: We test the examples on hardware of NVIDIA H800-80GB/H20-96G with CUDA 12.4. Loading inclusionAI/Ming-Lite-Omni in bfloat16 takes about 62G GPU memory.

License and Legal Notice

This code repository is licensed under the MIT License; the legal notice can be found in the LEGAL.md file in the project root directory.

Citation

If you find our work helpful, please feel free to cite it.


@misc{Mingomni2025,
title = {Ming-Omni: A Unified Multimodal Model for Perception and Generation},
author = {Inclusion AI},
year = {2025},
eprint = {2506.09344},
archivePrefix = {arXiv},
url = {https://arxiv.org/abs/2506.09344}
}

Ling: A MoE LLM Provided and Open-sourced by InclusionAI

· 阅读需 11 分钟
inclusionAI
Ant Group

🤗 Hugging Face | 🤖 ModelScope

Introduction

Ling is a MoE LLM provided and open-sourced by InclusionAI. We introduce two different sizes, which are Ling-lite and Ling-plus. Ling-lite has 16.8 billion parameters with 2.75 billion activated parameters, while Ling-plus has 290 billion parameters with 28.8 billion activated parameters. Both models demonstrate impressive performance compared to existing models in the industry.

Their structure makes it easy to scale up and down and adapt to different tasks, so users can use these models for a wide range of tasks, from processing natural language to solving complex problems. Furthermore, the open-source nature of Ling promotes collaboration and innovation within the AI community, fostering a diverse range of use cases and enhancements.

As more developers and researchers engage with the platform, we can expect rapid advancements and improvements, leading to even more sophisticated applications. This collaborative approach accelerates development and ensures that the models remain at the forefront of technology, addressing emerging challenges in various fields.

Update

  • [2025-5-10] Ling-lite-1.5 has been released! It achieves significant progress in reasoning ability compared with previous Ling-lite.
  • [2025-4-15] Ling-lite is upgraded to Ling-lite-0415. The new model demonstrates notable improvements over its predecessor, Ling-lite-0220, especially on code and math.

Model Downloads

The following table lists the available models and their parameters so you can pick the one that fits your use case. If you are located in mainland China, we also provide the models on ModelScope.cn to speed up the download process.

| Model | #Total Params | #Activated Params | Context Length | Download |
|---|---|---|---|---|
| Ling-lite-base-1.5 | 16.8B | 2.75B | 128K | 🤗 HuggingFace · 🤖 ModelScope |
| Ling-lite-1.5 | 16.8B | 2.75B | 128K | 🤗 HuggingFace · 🤖 ModelScope |
| Ling-plus-base | 290B | 28.8B | 64K | 🤗 HuggingFace · 🤖 ModelScope |
| Ling-plus | 290B | 28.8B | 64K | 🤗 HuggingFace · 🤖 ModelScope |
| Ling-coder-lite-base | 16.8B | 2.75B | 16K | 🤗 HuggingFace · 🤖 ModelScope |
| Ling-coder-lite | 16.8B | 2.75B | 16K | 🤗 HuggingFace · 🤖 ModelScope |

Note: If you are interested in previous versions, please visit the past model collections on Hugging Face or ModelScope.

Evaluation

Ling-lite

Standard Benchmarks

| Benchmark | #shots | Ling-lite-1.5 | Ling-lite | Qwen3-4B-Instruct | Qwen3-8B-Instruct | Moonlight-16B-A3B-Instruct | LLaMA3.1-8B |
|---|---|---|---|---|---|---|---|
| MMLU(EM) | 5 | 74.33 | 71.27 | 70.09 | 75.97 | 70.74 | 68.67 |
| GPQA(Pass@1) | 0 | 36.55 | 29.73 | 40.4 | 47.10 | 19.51 | 27.59 |
| HumanEval(Pass@1) | 0 | 87.27 | 84.38 | 81.94 | 85.29 | 72.94 | 67.23 |
| LiveCodeBench 2408-2502 (Pass@1) | 0 | 22.7 | 18.94 | 21.8 | 26.88 | 14.76 | 18.41 |
| LCBench(pass@1) | 0 | 60.37 | 46.57 | 48.61 | 60.03 | 28.39 | 23.13 |
| Math(EM) | 0 | 82.62 | 72.80 | 81.46 | 82.70 | 67.1 | 52.42 |
| AIME2024(pass@1) | 0 | 21.88 | 10.21 | 20.62 | 26.25 | 6.88 | 7.29 |
| OlympiadBench(pass@1) | 0 | 52.30 | 36.44 | 54.33 | 56.11 | 32.85 | 17.04 |
| BBH(EM) | 0 | 75.75 | 66.38 | 78.21 | 79.33 | 63.45 | 68.05 |
| IFEval(Prompt Strict) | 0 | 77.70 | 77.99 | 81.06 | 83.55 | 49.01 | 73.01 |
| BFCL_live | 0 | 72.15 | 67.93 | 65.35 | 69.83 | 47.14 | 49.98 |

Context Window

Evaluation results on the Needle In A Haystack (NIAH) tests. Ling-lite-1.5 has improved long text generation capability and performs well across most context window lengths up to 128K.

Quickstart

🤗 Hugging Face Transformers

Here is a code snippet to show you how to use the chat model with transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ling-lite-1.5"

model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
{"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
**model_inputs,
max_new_tokens=512
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

🤖 ModelScope

If you're in mainland China, we strongly recommend you to use our model from 🤖 ModelScope.

Deployment

vLLM

vLLM supports offline batched inference or launching an OpenAI-Compatible API Service for online inference.

Environment Preparation

Since the Pull Request (PR) has not been submitted to the vLLM community at this stage, please prepare the environment by following the steps below:

git clone -b  v0.7.3 https://github.com/vllm-project/vllm.git
cd vllm
git apply Ling/inference/vllm/bailing_moe.patch
pip install -e .

Offline Inference:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ling-lite-1.5")

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=16384)

llm = LLM(model="inclusionAI/Ling-lite", dtype='bfloat16')
prompt = "Give me a short introduction to large language models."
messages = [
{"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
{"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
outputs = llm.generate([text], sampling_params)

Online Inference:

vllm serve inclusionAI/Ling-lite \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--use-v2-block-manager \
--gpu-memory-utilization 0.90

To handle long context in vLLM using YaRN, we need to follow these two steps:

  1. Add a rope_scaling field to the model's config.json file (a programmatic sketch follows after this list), for example:
{
...,
"rope_scaling": {
"factor": 4.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
}
}
  2. Use an additional parameter --max-model-len to specify the desired maximum context length when starting the vLLM service.
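As a small sketch of step 1 (the local checkpoint path is illustrative), the rope_scaling field can be patched into config.json programmatically:

import json

config_path = "inclusionAI/Ling-lite/config.json"  # illustrative local checkpoint path
with open(config_path) as f:
    config = json.load(f)

# YaRN scaling: factor 4.0 on top of the original 32768-token window, as in the example above.
config["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

Then pass --max-model-len (e.g. 131072 for a 4x factor) to vllm serve as described in step 2.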

For detailed guidance, please refer to the vLLM instructions.

MindIE

This section outlines the primary process for running a Ling MoE model on the specified hardware with the MindIE inference framework.

Configuration preparation

Create a model directory on the host for the downloads, for example /root/models; it will later be mounted into the docker container.

Download the MindIE-related configuration from GitHub:

cd /root/models
git clone git@github.com:inclusionAI/Ling.git

Machine network environment check

# Check the physical link
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Check the links
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check your network health
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# Check whether the detected IP address is correctly configured
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# Check whether the gateway is configured correctly
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# Check the consistency of the NPU's underlying TLS verification behavior; all values are recommended to be 0
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
# Set the NPU's underlying TLS verification to 0
for i in {0..7}; do hccn_tool -i $i -tls -s enable 0; done

Pull the image

Go to Ascend Community/Development Resources and pull the mindie image

Image version: 1.0.0-800I-A2-py311-openeuler24.03-lts

The versions of each component are as follows:

| Component | Version |
|---|---|
| MindIE | 1.0.0 |
| CANN | 8.0.0 |
| PTA | 6.0.0.beta1 |
| HDK | 24.1.0 |

Container startup and configuration changes

Start the container

Execute the following startup command (for reference; replace the container name and the image tag with those of the image you loaded):

docker run -itd --privileged --name=<container_name> --net=host \
--shm-size 500g \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device /dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/models:/home/HwHiAiUser/Ascend \
mindie:1.0.0-XXX-800I-A2-arm64-py3.11 \
bash
Download the model

In this case, we use ModelScope to download the model, and install ModelScope first:

pip install modelscope

Download the model:

# The downloads take a long time and can run in the background
nohup modelscope download --model inclusionAI/Ling-plus --local_dir /home/HwHiAiUser/Ascend/Ling_plus > /tmp/ling_plus.log 2>&1 &

nohup modelscope download --model inclusionAI/Ling-plus-base --local_dir /home/HwHiAiUser/Ascend/Ling_plus_base > /tmp/ling_plus_base.log 2>&1 &

nohup modelscope download --model inclusionAI/Ling-lite --local_dir /home/HwHiAiUser/Ascend/Ling_lite > /tmp/ling_lite.log 2>&1 &

nohup modelscope download --model inclusionAI/Ling-lite-base --local_dir /home/HwHiAiUser/Ascend/Ling_lite_base > /tmp/ling_lite_base.log 2>&1 &

After the download is completed, you need to change the file permissions, otherwise an error will be reported when MindIE-Service is started:

chmod -R 750 *.json *.py
Model weight format conversion

This section applies to the Ling Lite model; the Ling Plus model does not need this step.

MindIE supports weights in safetensors format. If the downloaded weights are not in safetensors format, you need to convert them. Taking Ling Lite as an example, the conversion commands are as follows:

# Convert Ling lite
python /home/HwHiAiUser/Ascend/Ling/inference/mindie/convert_bin_to_safetensor.py

cd /home/HwHiAiUser/Ascend/Ling_lite
cp README.md configuration.json config.json special_tokens_map.json modeling_bailing_moe.py tokenizer.json tokenizer_config.json ../Ling_lite_safetensor/

# Convert Ling lite base
python /home/HwHiAiUser/Ascend/Ling/inference/mindie/convert_bin_to_safetensor_base.py

cd /home/HwHiAiUser/Ascend/Ling_lite_base
cp README.md configuration.json config.json special_tokens_map.json modeling_bailing_moe.py tokenizer.json tokenizer_config.json ../Ling_lite_base_safetensor/

The loading path of the Ling Lite model then becomes '/home/HwHiAiUser/Ascend/Ling_lite_safetensor', and that of the Ling Lite Base model becomes '/home/HwHiAiUser/Ascend/Ling_lite_base_safetensor'.

Change the model configuration

MindIE cannot load the default model configuration file (config.json) directly; it needs to be changed:

# Adapt to mindie's Ling lite model configuration
cp /home/HwHiAiUser/Ascend/Ling_lite_safetensor/config.json /home/HwHiAiUser/Ascend/Ling_lite_safetensor/config.json.bak
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/model_chat_config.json /home/HwHiAiUser/Ascend/Ling_lite_safetensor/config.json
chmod 750 /home/HwHiAiUser/Ascend/Ling_lite_safetensor/config.json

# Adapt to mindie's Ling lite base model configuration
cp /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor/config.json /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor/config.json.bak
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/model_base_config.json /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor/config.json
chmod 750 /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor/config.json

# Adapt to mindie's Ling plus model configuration
cp /home/HwHiAiUser/Ascend/Ling_plus/config.json /home/HwHiAiUser/Ascend/Ling_plus/config.json.bak
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/plus/model_chat_config.json /home/HwHiAiUser/Ascend/Ling_plus/config.json
chmod 750 /home/HwHiAiUser/Ascend/Ling_plus/config.json

# Adapt to mindie's Ling plus base model configuration
cp /home/HwHiAiUser/Ascend/Ling_plus_base/config.json /home/HwHiAiUser/Ascend/Ling_plus_base/config.json.bak
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/plus/model_base_config.json /home/HwHiAiUser/Ascend/Ling_plus_base/config.json
chmod 750 /home/HwHiAiUser/Ascend/Ling_plus_base/config.json

Execute the shell script that adapts the mindie to the Ling model:

bash /home/HwHiAiUser/Ascend/Ling/inference/mindie/patch_atb_llm.sh

Single-machine service inference (Ling Lite)

Set the underlying environment variables:

source /usr/local/Ascend/atb-models/set_env.sh

Set different mindie configurations according to the model type:

# Ling Lite
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/config.json /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json

# Ling Lite base
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/config.base.json /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json

Start the mindie service:

chmod 640 /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json

cd $MIES_INSTALL_PATH
nohup ./bin/mindieservice_daemon > /tmp/service.log 2>&1 &

Check /tmp/service.log for the message Daemon start success!; if it appears, MindIE-Service has started successfully.

Test if the request is correct:

# Chat model
wget -O- --post-data="{\"messages\":[{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Who are you?\"}], \"stream\": false, \"max_tokens\":100, \"model\": \"bailing_moe\", \"temperature\":0}" \
--header='Content-Type:application/json' \
'http://127.0.0.1:1025/v1/chat/completions'

# base model

wget -O- --post-data='{"inputs":"My name is Olivier and I","stream":false,"parameters":{"temperature":1,"max_new_tokens":100,"do_sample":false}}' \
--header='Content-Type:application/json' \
'http://127.0.0.1:1025/infer'

Multi-machine service-based inference (Ling plus)

All of the following commands need to be executed simultaneously on all machines.

To enable multi-machine service-based inference, you need to configure a multi-machine ranktable file.

  • Get the

Ming-Lite-Uni: Advancements in a Unified Architecture for Natural Multimodal Interaction

· 7 min read
inclusionAI
Ant Group

GITHUB | 📑 Technical Report | 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-Lite-Uni is an open-source multimodal framework that includes a newly designed unified visual generator and a native multimodal autoregressive model for integrating vision and language.

This project provides an open-source implementation that integrates the MetaQueries and M2-omni frameworks, and introduces the novel multi-scale learnable token mechanism and multi-scale representation alignment strategy. Ming-Lite-Uni uses a frozen MLLM together with a trainable diffusion model, enabling the native multimodal AR model to support not only text-to-image generation but also instruction-based image editing, thereby extending its capabilities beyond visual understanding alone. Experimental results show that Ming-Lite-Uni delivers strong performance and a highly fluent interactive experience. The project is currently in the alpha stage and will continue to be refined.

Thank you all for your support and interest! We are making steady progress on the project, and more updates are on the way. Stay tuned!

📌 Changelog

Why It Matters

The unified architecture of Ming-Lite-Uni overcomes fundamental limitations of traditional approaches:

| Traditional Approaches | Ming-Lite-Uni's Advantages |
|---|---|
| Modular pipelines (e.g., CLIP/SigLIP + diffusion models) | End-to-end unified model that seamlessly fuses understanding and generation |
| Discrete-token autoregression (limited visual grounding) | Continuous token space with native support for fine-grained visual concepts |
| Fixed-resolution processing (upsampling introduces artifacts) | Multi-scale adaptation with consistent quality across resolutions |
| Separate editing pipelines (manual alignment required) | Dialogue-driven control: natural language guides pixel-level edits |
| Understanding bottleneck (visual-semantic misalignment) | Joint representation learning: understanding and generation reinforce each other |

Core Enhancements

  • Unified visual understanding and generation architecture: Ming-Lite-Uni reaches an understanding score of 69.7 on the OpenCompass leaderboard, outperforming DeepSeek-VL2 (66.4), while scoring 0.62 on the GenEval image-generation benchmark, ahead of SDXL (0.55).
  • Multi-scale learnable tokens: Hierarchical tokens at 4×/8×/16× scales capture the overall image layout (low resolution), object structure (medium resolution), and fine texture details (high resolution), improving the GenEval score by 3.5%.
  • Multi-scale representation alignment: A scale-consistency loss, optimized at native resolution, keeps the representations at each level consistent with the final output, improving image-reconstruction quality by more than 2 dB PSNR and the GenEval score by 1.5% (a sketch of such a loss follows this list).
  • An AGI-capable system: Supports chained instructions such as "generate a castle → add a sunset → change the viewpoint" with a response time under 1 second (tested on an RTX 4090). The system supports instruction-driven generation and editing and is aligned with GPT-4o (the industry benchmark as of March 2025).
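A rough sketch of what such a scale-consistency alignment loss could look like, purely for illustration (this is not the exact Ming-Lite-Uni formulation; tensor shapes and names are assumptions):

import torch
import torch.nn.functional as F

def scale_consistency_loss(scale_feats, final_feat):
    """Pull the pooled features of each scale (e.g. 4x/8x/16x token maps) toward the final representation.

    scale_feats: list of (B, N_s, D) token features, one entry per scale.
    final_feat:  (B, D) representation of the final output.
    """
    loss = final_feat.new_zeros(())
    for feat in scale_feats:
        pooled = feat.mean(dim=1)                     # (B, D) summary of this scale
        loss = loss + F.mse_loss(pooled, final_feat)  # keep every scale consistent with the final result
    return loss / len(scale_feats)

# Example with three scales holding increasingly many tokens.
feats = [torch.randn(2, n, 256) for n in (16, 64, 256)]
print(scale_consistency_loss(feats, torch.randn(2, 256)))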

Empowering Multimodal Interaction

Ming-Lite-Uni is a unified multimodal understanding model that goes beyond traditional NLP and visual understanding, additionally supporting interactive generation tasks such as image generation, image editing, and style transfer.

Model Architecture

Ming-Lite-Uni is a unified multimodal model for image understanding and high-fidelity image generation. It compresses image representations into continuous visual tokens, which are processed together with text tokens by an autoregressive Transformer; generation is then carried out by an externally trained diffusion model (SANA) that takes the tokens produced by the Transformer as input.

Architecture diagram

Benchmark Evaluation

We quantitatively evaluate Ming-Lite-Uni's multimodal understanding and text-to-image generation capabilities on public benchmarks, separately. For multimodal understanding, we compare against traditional image-and-text-in, text-out models as well as recent models with visual generation capabilities. For multimodal generation, we evaluate text-to-image performance on the GenEval benchmark. Please refer to our technical report for details.

Multimodal Understanding

| Type | Model | Avg. | MMB | MMS | MMMU | MathV | Hall | AI2D | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|
| Und. Only | LLaVA-72B | 68.0 | 84.5 | 65.8 | 56.6 | 68.4 | 47.9 | 86.2 | 60.6 |
| | Qwen2.5-VL-7B | 76.2 | 87.8 | 71.1 | 67.9 | 70.8 | 58.8 | 88.2 | 76.7 |
| | Emu3-Chat | - | 58.5 | - | 31.6 | - | - | - | 37.2 |
| | InternVL2.5-78B | 75.2 | 87.5 | 69.5 | 70 | 71.4 | 57.4 | 89.1 | 71.8 |
| | DeepSeek-VL2 | 66.4 | 81.2 | 61.0 | 50.7 | 59.4 | 51.5 | 84.5 | 60.0 |
| | GPT-4o-20241120 (closed) | 72.0 | 84.3 | 65.1 | 70.7 | 59.9 | 56.2 | 84.9 | 74.5 |
| | Step-1o (closed) | 77.7 | 87.3 | 69.3 | 69.9 | 74.7 | 55.8 | 89.1 | 82.8 |
| Und. and Gen. | TokenFlow-XL | - | 68.9 | - | 38.7 | - | - | - | 40.7 |
| | Janus-Pro-7B | - | 79.2 | - | 41.0 | - | - | - | 50.0 |
| | Ours (Ming-Lite-Uni) | 69.7 | 80.7 | 60.5 | 51.2 | 68.3 | 51.8 | 84.5 | 72.3 |

Image Generation

| Type | Method | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. | Overall |
|---|---|---|---|---|---|---|---|---|
| Gen. Only | LlamaGen | 0.71 | 0.34 | 0.21 | 0.58 | 0.07 | 0.04 | 0.32 |
| | SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 |
| | Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 |
| | SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| | DALL-E 3 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
| | SD3-Medium | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
| Und. and Gen. | Show-o | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 |
| | TokenFlow-XL | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 |
| | Janus-Pro-1B | 0.98 | 0.82 | 0.51 | 0.89 | 0.65 | 0.56 | 0.73 |
| | Ours (Ming-Lite-Uni) | 0.99 | 0.76 | 0.53 | 0.87 | 0.26 | 0.30 | 0.62 |

Example Usage

System Requirements

  • Python: >= 3.8
  • PyTorch: >= 2.4.1+cu12.2 (CUDA 12.2 compatible)
  • flash-attn: >= 2.6.3

Installation

We recommend installing the following versions to set up your environment using pip:

pip install -r requirements.txt
Usage Guide

Below is an example of how to load and use the model:

import torch
import os
from Ming_Uni.MingUniInference import Ming_Uni_Inference
from Ming_Uni.process import MyProcessor
device = torch.cuda.current_device()
device = torch.device(device)

model_path='../Ming-Lite-Uni/'
model = Ming_Uni_Inference(model_path)
model.to(torch.bfloat16)
model.to(device)
model.eval()

llm_model=os.path.join(model_path, 'qwen2_5_llm')
my_proc=MyProcessor(llm_model)

image_file = "tests/cake.jpg"
prompt = "add a candle on top of the cake"
inputs = my_proc.process(image_file=image_file, prompt=prompt, device=device)

result = model.image_gen_generate(inputs, steps=30, seed=42, cfg=5.0, height=512, width=512)[1]
result.save("result.png")

For more advanced usage, such as fine-tuning or generating images, refer to the documentation.

Acknowledgements

This project is still at an early stage. Although some preliminary results are encouraging, substantial progress is still needed to achieve seamless integration of understanding and generation. Both the code and the models require further polishing and optimization, which is why we have chosen to open-source the project. We welcome contributions from the community to improve and extend it together. If you have suggestions or find issues in the code, please contribute via Pull Requests. Thank you for your support and attention!

Open Collaboration

We open-source Ming-Lite-Uni to accelerate progress toward artificial general intelligence (AGI), featuring:

  • 📂 Full model weights and test code
  • 🧩 A modular architecture that is easy to extend
  • 📊 Comprehensive benchmarks (compared against GPT-4V, SDXL, and others)

"The simultaneous release of image generation in ChatGPT-4 in March 2025 confirms our vision that unified multimodal AI is the next paradigm."

Contact

If you need help or run into issues while using this project, please open an issue on GitHub.

License and Legal Notice

Ming is licensed under the MIT License; the legal notice can be found in the LEGAL.md file in the project root directory.

Citation

If you find our work helpful, please feel free to cite it.

@article{Mingunify2025,
title = {Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction},
author = {Inclusion AI, Ant Group},
journal = {arXiv preprint},
year = {2025}
}

Ming-Lite-Omni-Preview: A Multimodal Large Model with an MoE Architecture

· 9 min read
inclusionAI
Ant Group

GITHUB | 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-Lite-Omni-Preview is built on Ling-Lite, a MoE (Mixture of Experts) model that can perceive multiple modalities, including text, images, audio, and video, and generate text and natural speech in a streaming manner. To handle multimodal inputs more naturally, we enhanced Ling-Lite with dedicated routing modules for each modality. As a result, Ming-Omni excels at processing multimodal information and scales well.

Key Features

  • Omni and Novel MoE Architecture: An innovative omni architecture based on a Mixture of Experts (MoE), which achieves leading performance on multiple multimodal benchmarks.

  • Video understanding: Supports dynamic KV-Cache compression of visual tokens, enabling both the understanding of hours-long videos and fine-grained analysis of clips lasting only a few seconds.

  • Natural Speech Generation and Fine-grained Voice Dialogue: Supports dialect understanding and generation in end-to-end conversations, one-shot voice cloning, and improved prosodic expressiveness via audio-tokenizer compression.

Evaluation Results

Image benchmark

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
|---|---|---|---|
| AI2D | 83.84 | 83.9 | 84.5 |
| HallusionBench | 54.68 | 51.9 | 51.7 |
| MMBench_TEST_V11 | 79.63 | 84.3 | 82.0 |
| MMMU | 57.0 | 58.6 | 54.8 |
| MMStar | 62.0 | 63.9 | 65.2 |
| MMVet | 73.6 | 67.1 | 68.1 |
| MathVista | 69.0 | 68.2 | 67.9 |
| OCRBench | 87.9 | 86.4 | 88.2 |
| Average | 70.96 | 70.5 | 70.3 |

Object Recognition

| Object Recognition | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B | InternVL-2.5-8B |
|---|---|---|---|
| Plants | 52.1 | 55.3 | 32.8 |
| Animals | 52.6 | 54.8 | 36.5 |
| Home appliances & furniture | 93.5 | 97.4 | 90.9 |
| Personal Electronics | 96.1 | 95.1 | 93.2 |
| Food & Ingredients | 57.5 | 60.0 | 48.7 |
| Tableware | 96.6 | 94.9 | 88.1 |
| Vehicles | 31.9 | 40.9 | 31.9 |
| Average | 68.6 | 71.2 | 60.3 |

Video benchmark

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5VL-7B |
|---|---|---|
| VideoMME wo/w sub. | 63.9/67.6 | 65.1/71.6 |
| MVBench | 67.0 | 72.0 |
| Video-MMMU | 45.4 | 47.44 |
| LongVideoBench | 53.7 | 60.0 |

Audio benchmark

SpeechQA

| Model | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
|---|---|---|---|---|---|---|---|
| Qwen2-Audio-chat | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Baichuan-Audio | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
| GLM-4-Voice | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Kimi-Audio | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| Ming-Lite-Omni-Preview | 4.25 | 3.88 | 58.95 | 46.06 | 60.00 | 46.71 | 96.53 |

ASR

| Model | Aishell-1 | Aishell-2 ios | Wenetspeech test-net | Wenet test-meeting | Librispeech test-clean | Librispeech test-other |
|---|---|---|---|---|---|---|
| Whisper Large-v3 | 5.14 | 4.76 | 9.68 | 18.54 | 1.9 | 3.65 |
| Qwen2-Audio | 1.53 | 3.06 | 7.72 | 8.4 | 1.6 | 3.6 |
| GLM-4-voice Base | 2.46 | - | - | - | 2.82 | 7.66 |
| Baichuan-Omni-1.5 | - | - | 6.9 | 8.4 | - | - |
| Qwen2.5-Omni | 1.18 | 2.36 | 5.9 | 7.7 | 1.8 | 3.4 |
| Ming-Lite-Omni-Preview | 1.62 | 2.82 | 6.23 | 6.9 | 2.34 | 5.74 |

Knowledge

| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
|---|---|---|---|
| GPT-4o | 36.05 | - | - |
| PaLI-X | 22.06 | 23.5 | 20.8 |
| Qwen2.5-vl-32B | 19.35 | 20.55 | 18.28 |
| Ming-Lite-Omni-Preview | 27.3 | 28.9 | 25.9 |

OCR&GUI

| Model | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| ChartQA_TEST | 85.2 | 87.3 |
| DocVQA_TEST | 93.2 | 95.7 |
| OCRBenchV2_en/zh | 52.2/51.6 | 56.3/57.2 |
| OmniDocBench↓ | 34.7/34.5 | 30.8/39.8 |
| TextVQA_VAL | 82.36 | 84.9 |
| ScreenSpot | 79.3 | 84.7 |

Model Downloads

You can download the model from both Hugging Face and ModelScope.

| Model | Input modality | Output modality | Download |
|---|---|---|---|
| Ming-Lite-Omni-Preview | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace · 🤖 ModelScope |

If you are in mainland China, we strongly recommend downloading the model from 🤖 ModelScope.

Use Cases

Video and Audio QA

MultiModal Input | QA
Q: <audio> (audio content: Please describe the video content.)
A: The video features a woman performing a series of yoga poses on a rooftop with a scenic view of mountains and a clear blue sky.
Q: Is there any food in front of me?
A: Yes, there's candy on the table.

Speech-to-Speech (dialects supported)

Quickstart

Please download our model following Model Downloads; you can then refer to the following code to run the Ming-Lite-Omni-Preview model.

import os
import torch
from transformers import AutoProcessor
from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# build model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    "inclusionAI/Ming-Lite-Omni",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True
).to("cuda")

assets_path = YOUR_ASSETS_PATH

# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True)
# qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
],
},
]
# Output:

# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍:
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。
# ......

# image qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "image", "image": os.path.join(assets_path, "flowers.jpg")},
{"type": "text", "text": "What kind of flower is this?"},
],
},
]
# Output:

# The flowers in this image are forget-me-nots. These delicate blooms are known for their small, five-petaled flowers that come in various shades of blue, pink, and white.

To enable thinking before response, adding the following system prompt before your question:

cot_prompt = "SYSTEM: You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <thinking>...</thinking> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}.\n"
# And your input message should be like this:
messages = [
{
"role": "HUMAN",
"content": [
{"type": "image", "image": os.path.join(assets_path, "reasoning.png")},
{"type": "text", "text": cot_prompt + "In the rectangle $A B C D$ pictured, $M_{1}$ is the midpoint of $D C, M_{2}$ the midpoint of $A M_{1}, M_{3}$ the midpoint of $B M_{2}$ and $M_{4}$ the midpoint of $C M_{3}$. Determine the ratio of the area of the quadrilateral $M_{1} M_{2} M_{3} M_{4}$ to the area of the rectangle $A B C D$.\nChoices:\n(A) $\frac{7}{16}$\n(B) $\frac{3}{16}$\n(C) $\frac{7}{32}$\n(D) $\frac{9}{32}$\n(E) $\frac{1}{5}$"},
],
},
]
# Output:
# \<think\>\nOkay, so I have this problem about a rectangle ABCD ... (thinking process omitted) ... So, the correct answer is C.\n\</think\>\n\<answer\>\\boxed{C}\</answer\>\n\n
# video qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "video", "video": os.path.join(assets_path, "yoga.mp4")},
{"type": "text", "text": "What is the woman doing?"},
],
},
]
# Output:

# The image shows a woman performing a yoga pose on a rooftop. She's in a dynamic yoga pose, with her arms and legs extended in various positions.

# multi-turn chat
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "中国的首都是哪里?"},
],
},
{
"role": "ASSISTANT",
"content": [
{"type": "text", "text": "北京"},
],
},
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "它的占地面积是多少?有多少常住人口?"},
],
},
]
# Output:

# 北京市的总面积约为16,410.54平方公里,常住人口约为21,542,000人。
# Preparation for inference
text = processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
audios=audio_inputs,
return_tensors="pt",
)
inputs = inputs.to(model.device)
for k in inputs.keys():
    if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
        inputs[k] = inputs[k].to(dtype=torch.bfloat16)

# call generate
generated_ids = model.generate(
**inputs,
max_new_tokens=512,
use_cache=False,
eos_token_id=processor.gen_terminator,
)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
# ASR
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "Please recognize the language of this speech and transcribe it. Format: oral."},
{"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
],
},
]
outputs = model.generate(messages, max_new_tokens=512)
print(outputs)
# speech2speech
messages = [
{
"role": "HUMAN",
"content": [
{"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
],
},
]
outputs = model.generate(messages, max_new_tokens=512, speaker='luna', output_audio_path='out.wav', output_audio=True)
print(outputs)

License and Legal Notice

This code repository is licensed under the MIT License; the legal disclaimer can be found in the LEGAL.md file in the project root directory.

Agentic Learning

· 4 min read
inclusionAI
Ant Group

Introduction

Agents exhibit powerful capabilities by interacting with the external environment and making decisions based on the feedback they receive from it. For complex problems, an agent often needs multiple turns of interaction with the environment to reach a solution. The complexity and dynamism of environments, coupled with the necessity for multi-turn interactions, pose numerous challenges for training agents.

We introduce AgenticLearning, an open-source agent training paradigm designed to empower researchers to train and evaluate autonomous agents effectively. AgenticLearning offers a framework for multi-turn interactions with the environment, enabling models to learn how to interact with the environment and make decisions based on its feedback, thereby enhancing the models' ability to leverage the environment to solve complex problems.
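At its core, this assumes a rollout loop in which the model acts, the environment responds, and the feedback is folded into the context of the next turn. A minimal sketch of such a loop (the environment and policy interfaces below are hypothetical illustrations, not the AgenticLearning API):

from typing import Protocol

class Environment(Protocol):
    def reset(self) -> str: ...
    def step(self, action: str) -> tuple[str, float, bool]: ...  # observation, reward, done

def rollout(policy, env, max_turns: int = 8):
    """Collect one multi-turn trajectory of (context, action, reward) for later training."""
    context, trajectory = env.reset(), []
    for _ in range(max_turns):
        action = policy(context)                      # e.g. an LLM call producing a tool call or an answer
        observation, reward, done = env.step(action)  # environment feedback
        trajectory.append((context, action, reward))
        context = context + "\n" + observation        # feed the feedback into the next turn
        if done:
            break
    return trajectory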

| Advancements | Models | Tools | Environment | Training Framework |
|---|---|---|---|---|
| RAG-R1 | Qwen2.5-7b-instruct | offline retrieval, online search | AWorld | LLaMA-Factory, verl, AReaL |
| FunReason | Qwen2.5-7b-Coder-instruct | BFCL | AWorld | LLaMA-Factory, verl |

News

[2025/07/01] 🔥🔥🔥RAG-R1 We propose RAG-R1, a deep-search training framework that incentivizes the search and reasoning capabilities of LLMs through multi-query parallelism.

[2025/05/16] 🔥🔥🔥FunReason We propose FunReason, a novel framework that enhances LLMs' function calling capabilities through an automated data refinement strategy and a Self-Refinement Multiscale Loss approach.

Advancements

Deepsearch

RAG-R1

  • Tools: Search Engines (offline or online)
  • LLM: Qwen2.5-7b-instruct

RAG-R1-framework

Overall framework of RAG-R1.

RAG-R1-result

Performance comparisons on QA benchmarks under the EM metric. The best and second best results are bold and underlined, respectively.

FunctionCall

FunReason

  • Tools: Real Human Function calling (BFCLv2 live&non-live)
  • LLM: Qwen2.5-7b-Coder-instruct

FunReason is a framework designed to enhance LLMs' function calling capabilities, achieving GPT-4o-comparable performance on BFCL, surpassing RL-based methods, mitigating catastrophic forgetting on HumanEval and MBPP, and using a data refinement strategy where natural CoT data outperforms artificial ones.

FunReason-Performance

Data refinement pipeline of FunReason.

Overview of FunReason's data refinement pipeline. The pipeline consists of five stages: Function Call Classification, Query and Tool Identification, CoT Identification, Function and Parameter Identification, and Format Identification. Each stage ensures specific aspects of data quality, with failing examples either being discarded or regenerated.
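As a purely illustrative schematic of such a staged filter (the stage checks below are hypothetical placeholders, not FunReason's code):

from typing import Callable, Optional

Stage = tuple[str, Callable[[dict], bool]]  # (stage name, check that returns True if the example passes)

STAGES: list[Stage] = [
    ("function_call_classification", lambda ex: "label" in ex),
    ("query_and_tool_identification", lambda ex: "query" in ex and "tools" in ex),
    ("cot_identification", lambda ex: "cot" in ex),
    ("function_and_parameter_identification", lambda ex: "call" in ex),
    ("format_identification", lambda ex: isinstance(ex.get("call"), dict)),
]

def refine(example: dict, regenerate: Optional[Callable[[dict, str], Optional[dict]]] = None) -> Optional[dict]:
    """Run an example through the staged checks; failing examples are regenerated once or discarded."""
    for name, check in STAGES:
        if not check(example):
            if regenerate is None:
                return None                       # discard
            example = regenerate(example, name)   # try to repair the failing aspect
            if example is None or not check(example):
                return None
    return example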

FunReason-Performance

Performance of FunReason.

Citation

Please cite our repo if our works are helpful for your research.

@article{RAG-R1,
title={RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism},
author={Zhiwen Tan and Jiaming Huang and Qintong Wu and Hongxuan Zhang and Chenyi Zhuang and Jinjie Gu},
journal={arXiv preprint arXiv:2507.02962},
year={2025}
}

@article{FunReason,
title={FunReason: Enhancing Large Language Models' Function Calling via Self-Refinement Multiscale Loss and Automated Data Refinement},
author={Bingguang Hao and Maolin Wang and Zengzhuang Xu and Cunyin Peng and Yicheng Chen and Xiangyu Zhao and Jinjie Gu and Chenyi Zhuang},
journal={arXiv preprint arXiv:2505.20192},
year={2025}
}

Contact

For any question or feedback, please reach out to us at ender.tzw@antgroup.com or chenyi.zcy@antgroup.com

License

This project is licensed under the MIT License - see the LICENSE file for details.

AReaL: Ant Reasoning Reinforcement Learning for LLMs

· 11 min read
inclusionAI
Ant Group

| Paper | Documentation | Ask DeepWiki | 🤗 Models & Data | WeChat Group |

AReaL (Ant Reasoning RL) is an open-source fully asynchronous reinforcement learning training system for large reasoning models developed at the RL Lab, Ant Research. Built upon the open-source project RealHF, we are fully committed to open-source by providing training details, data, and infrastructure required to reproduce results along with the model itself. AReaL aims to help everyone build their own AI agents easily and affordably. Our team loves milk tea because it's delicious, customizable, and affordable. We hope you enjoy our project just like how you enjoy real-world milk tea (cheers).

AReaL Highlights

  • 🔥 [NEW] Asynchronous RL: With algorithm-system co-design, AReaL supports fully asynchronous RL for the fastest training! Experimental support for multi-turn agentic RL is also provided.
  • 🛠️ Open & Reproducible: We continuously release all code, datasets, and training recipes for RL training of LLMs.
  • 🚀 Scalability: AReaL can seamlessly adapt to different computational resource settings, ranging from a single node to 1K GPUs.
  • 🔪 Cutting-Edge Performance: AReaL can produce models with cutting-edge reasoning capabilities in math and coding. We are also actively working on agentic tasks.

News

[2025/06/03] (v0.3, boba²) We release boba² (double-boba) for fully asynchronous RL training, which achieves a 2.77x speedup while obtaining on-par or even better training performance compared to synchronous systems. Moreover, asynchronous RL makes it extremely easy to set up multi-turn agentic RL training! Check out our v0.3 overview blog and the research paper.

[2025/03/31] (v0.2, boba) Here comes our next milestone release - boba! Please call it A-ReaL-boba! This release includes much faster training with SGLang support and SOTA 7B and 32B models on math reasoning. Check our v0.2 technical blog.

[2025/02/24] (v0.1) Our initial release includes reproducible results for 1.5B and 7B LRMs. Check our v0.1 technical blog.

Release Highlights

In our AReaL-boba² (A-ReaL-double-boba) release, we highlight the top 3 most important features:

  • A fully asynchronous RL training pipeline with system and RL algorithm co-design, achieving over 2.77x speedup without any performance drop. Check the benchmark scripts and instructions here.

  • SOTA coding models, i.e., a 14B model with a 69.1 score on LCB-v5. To reproduce, check the configs and instructions.

  • Experimental support for multi-turn agentic RL training. Check our complete example.

For the complete system design and more training details, please check our v0.3 blog and our research paper.

Jump to the quickstart section if you want to quickly run an experiment and get your hands dirty! 😈

Overview of Asynchronous RL Training

During the synchronous RL training process, a generation step must wait until the longest sequence completes within the batch of LLM outputs. Due to the varying output lengths for LRMs, a synchronous RL system suffers from massive GPU idle time, leading to training inefficiency. Some recent works (DeepCoder, Intellect) propose overlapping a single training step with a single generation step to accelerate training. However, the largest bottleneck remains unchanged: the samples within a batch are still from the same model version, leading to waiting and GPU idle time.

Synchronous vs One-step Overlap RL

Fig.1. Left: Execution timeline of synchronous RL training. Right: Execution timeline of one-step overlap RL system.

AReaL adopts a fully asynchronous RL training framework that completely decouples generation from training. In AReaL, LLM generation runs in a streaming manner, with each rollout worker continuously producing outputs without waiting. Meanwhile, trainer workers perform parallel model updates upon receiving training batches.
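Conceptually, the decoupling can be pictured as a queue between streaming rollout workers and a trainer that consumes whatever samples are ready. The sketch below is a toy illustration with Python threads, not AReaL's actual system code (all names are hypothetical):

import queue
import threading
import time

sample_queue = queue.Queue(maxsize=1024)
BATCH_SIZE = 8

def rollout_worker(worker_id):
    """Streams finished generations continuously; never waits for the trainer."""
    while True:
        sample = {"worker": worker_id, "tokens": [0, 1, 2]}  # stand-in for a generated sequence and its reward
        sample_queue.put(sample)
        time.sleep(0.01)                                     # stands in for LLM generation latency

def trainer(num_steps=5):
    """Consumes whatever samples are ready and updates the model in parallel with generation."""
    for step in range(num_steps):
        batch = [sample_queue.get() for _ in range(BATCH_SIZE)]
        # ... compute a staleness-aware policy-gradient update on `batch` here ...
        print(f"step {step + 1}: trained on samples from workers {sorted({s['worker'] for s in batch})}")

workers = [threading.Thread(target=rollout_worker, args=(i,), daemon=True) for i in range(4)]
for w in workers:
    w.start()
trainer()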

Asynchronous RL Training

Fig 2. Execution timeline of our fully asynchronous RL system.

AReaL follows a system-algorithm co-design principle: on the system side, AReaL efficiently syncs model parameters and carefully controls the staleness of each training sample; on the algorithm side, AReaL improves the objective of PPO to make async-RL stable.

We compare the scalability of asynchronous RL training based on our AReaL-boba² system with classical synchronous RL training (we adopt the fastest open-source system veRL, main branch on 05/07/2025) across different model sizes and different numbers of H800 GPUs. AReaL demonstrates much improved scaling capabilities with respect to training throughput. This is also partially due to AReaL decoupling training and generation, leading to much fewer GPU memory fragments.

Scaling Comparison

Fig.3 The scaling trend of asynchronous RL (based on AReaL-boba2) and classical synchronous RL (based on veRL) with different model sizes. Dotted lines indicate ideal linear scaling.

SOTA Code Generation Model by AReaL-boba²

We use Qwen3 as our base model. After asynchronous RL training, we achieve SOTA results on LiveCodeBench, Codeforces, and CodeContests benchmarks.

| Model (8B) | LiveCodeBench v5 (2024.10-2025.2) | Codeforces | CodeContests |
|---|---|---|---|
| Qwen3-8B | 58.8 | 1879/96.7% | 31.4 |
| DeepSeek-R1-0528-Qwen3-8B | 58.4 | 1945/97.3% | 31.0 |
| 🤗 AReaL-boba²-8B-Open | 62.0 | 1933/97.2% | 41.4 |
| 🤗 AReaL-boba²-8B | 63.0 | 1962/97.5% | 40.8 |

| Model (14B) | LiveCodeBench v5 (2024.10-2025.2) | Codeforces | CodeContests |
|---|---|---|---|
| Qwen3-14B | 65.4 | 1978/97.7% | 38.3 |
| DeepCoder-14B-Preview | 60.6 | 1936/95.3% | 40.1 |
| 🤗 AReaL-boba²-14B-Open | 67.3 | 1990/97.8% | 46.2 |
| 🤗 AReaL-boba²-14B | 69.1 | 2044/98.2% | 46.1 |

| Larger Models | LiveCodeBench v5 (2024.10-2025.2) | Codeforces | CodeContests |
|---|---|---|---|
| Qwen3-235B | 70.7 | 2056 | - |
| DeepSeek-R1 | 64.3 | 2029 | - |
| OpenAI-o3-mini (Medium) | 66.3 | 2036 | - |

Table 1: Coding Task Performance Comparison. AReaL-boba²-8B/14B-Open denotes training results on open-source data. AReaL-boba²-8B/14B models are trained with an additional small amount of internal data and achieve SOTA performance on LiveCodeBench, Codeforces & CodeContests.

We highlight the tutorials and code walkthroughs about the following key features for asynchronous training:

RL Training for Multi-turn Agent

AReaL-boba² allows you to independently customize the dataset, rollout behavior, and the training algorithm, without needing to modify the heavy system-level code.

In particular, we show a simple example to develop a multi-turn math agent for RL training. Please see the learning curve below and reference the step-by-step guide if you want to implement your own agentic RL project.

Getting Started

Obtain the training data:

For code training data, a simple preprocessing script is provided in examples/data_preprocess/preprocess_training_data.py:

python3 preprocess_training_data.py --data_path $original_data_path --output_path $training_data_path

Train Qwen3 1.7B locally (Remember to modify dataset.path in the script below):

bash examples/run_async_ppo.sh

Evaluation:

cd evaluation
# Evaluate the model
python eval_and_aggregate.py \
--model_path ${MODEL_PATH} \
--output_path ${OUTPUT_PATH} \
--data_names aime24,aime25 \
--max_gen_tokens 32768 \
--data_names codeforces,lcb_v5 \
--prompt_type qwen3-think-pure \
--temperature 1.0

Resources

Quickstart

Benchmark and Reproduction

Customization Guide

System Code Walkthrough

Future Plan

AReaL is under active development. We plan to have minor releases weekly and major releases monthly. Community engagement and contributions are extremely welcome. We are also hiring interns and full-time employees with open positions in both the US and China.

For the research and development plan already in place, please see the following list:

System Development

  • Support for SGLang
  • RL training with coding problems
  • Asynchronous generation and RL training
  • Optimizations for distributed training: expert parallel for MOE and zero-bubble pipelining
  • RL for vision-language models (VLM)
  • Multi-turn agentic RL
  • Function calling and tool use

Algorithm Development

  • RL training recipes for 1.5B and 7B models
  • A complete RL training recipe for 32B models
  • Sample-efficient multi-task RL algorithms
  • Agentic capabilities with end-to-end RL
  • Stable RL training for larger MOE models

Acknowledgement

We would like to note that major contributors are from the RL Lab at Ant Research and the Institute for Interdisciplinary Information Sciences, Tsinghua University.

Our team has also received invaluable assistance from the Data Intelligence Lab at Ant Research for data support and from the Super Computing Technology (SCT) team at Ant Group, particularly in the realm of large-scale cluster operations and maintenance.

We also appreciate all the pioneering works from the community, particularly the ReaLHF project from OpenPsi Inc. and other projects, including but not limited to DeepScaleR, Open-Reasoner-Zero, OpenRLHF, VeRL, SGLang, QwQ, Light-R1 and DAPO.

Citation

@inproceedings{mei2025real,
author = {Mei, Zhiyu and Fu, Wei and Li, Kaiwei and Wang, Guangju and Zhang, Huanchen and Wu, Yi},
title = {ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation},
booktitle = {Proceedings of the Eighth Conference on Machine Learning and Systems,
MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025},
publisher = {mlsys.org},
year = {2025},
}
@misc{fu2025areal,
title={AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning},
author={Wei Fu and Jiaxuan Gao and Xujie Shen and Chen Zhu and Zhiyu Mei and Chuyi He and Shusheng Xu and Guo Wei and Jun Mei and Jiashu Wang and Tongkai Yang and Binhang Yuan and Yi Wu},
year={2025},
eprint={2505.24298},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.24298},
}

PromptCoT & PromptCoT-Mamba: Advancing the Frontiers of Reasoning

· 5 min read
inclusionAI
Ant Group

News

  • May 30, 2025: PromptCoT-Mamba released! Introducing an attention-free foundation model for reasoning tasks.
  • Apr 11, 2025: PromptCoT-QwQ-32B model and its training data released, achieving new state-of-the-art results.
  • Mar 7, 2025: PromptCoT project launched, including the problem generation model, distilled models (PromptCoT-DS series), and associated datasets.

Overview

This repository unifies two synergistic projects aimed at advancing the frontiers of mathematical and code reasoning in Large Language Models (LLMs): PromptCoT and PromptCoT-Mamba.

PromptCoT (Synthesizing Olympiad-Level Problems for Mathematical Reasoning in Large Language Models) addresses the critical challenge of acquiring high-quality, complex problems for training advanced LLMs. It introduces a novel methodology to systematically generate Olympiad-level mathematical problems by modeling the rationale behind expert problem design. This approach not only enhances problem diversity and difficulty but also ensures logical consistency in problem construction, providing a scalable solution for creating robust training datasets.

PromptCoT-Mamba (Scaling Reasoning without Attention) leverages the problem generation capabilities of the PromptCoT pipeline to train PromptCoT-Mamba-7B, the first attention-free foundation model based on the Mamba-2 architecture. This model demonstrates that structured training curricula can enable attention-free models to surpass strong Transformer baselines on a wide array of competition-level math and code reasoning tasks, all while maintaining constant-memory inference without KV caching.

Together, these projects offer a powerful suite of tools, models, and datasets for researchers and developers working on the cutting edge of AI reasoning.


Highlights & Key Results

1. PromptCoT: Problem Generation & Distilled Models

  • ✨ The Missing Piece for Test-Time Scaling: A lightweight yet powerful problem generation model enabling the construction of prompt sets at any scale with sufficient quality, perfect for SFT or RL post-training.
  • 📖 A Fully Open Project: All models (generation, distilled LLMs) and datasets (generation inputs, SFT data) are open-sourced.
  • 🏆 Superior Performance of Distilled Models:
    • PromptCoT-DS-7B consistently surpasses its base model, DeepSeek-R1-Distill-Qwen-7B, with significant gains:
      • +0.9% on MATH-500 (93.7%)
      • +3.2% on AIME2024 (58.7%)
      • +9.2% on AIME2025 (49.2%)
    • PromptCoT-DS-7B (7B parameters) achieves results comparable to larger 32B models like S1-32B and LIMO-32B.
    • PromptCoT-QwQ-32B sets a new standard, outperforming other 32B models by a significant margin:
      • MATH-500: 96.7% ± 0.5%
      • AIME2024: 83.8% ± 2.8%
      • AIME2025: 75.4% ± 4.7%
    • PromptCoT-DS-1.5B demonstrates competitive performance against RL-based models purely through distillation.
  • ⚡ Efficiency Without Compromise: PromptCoT-DS-1.5B achieves 40+% AIME scores using over 15× fewer A100 GPU hours compared to models like DeepScaleR-1.5B-Preview.

2. PromptCoT-Mamba: Attention-Free Reasoning

  • 🚀 First Attention-Free SOTA: PromptCoT-Mamba-7B is the first attention-free model (Mamba-2 architecture) to outperform strong Transformer baselines in math and code reasoning.
  • 🧠 Trained with PromptCoT Pipeline: Utilizes a structured, two-stage curriculum with data generated by PromptCoT.
  • 💪 Strong General Performance: PromptCoT-Mamba-7B consistently outperforms 7B-scale Transformer and hybrid Mamba-Transformer baselines.
    • MATH-500: 84.6%
    • AIME 2024: 35.2%
    • AIME 2025: 24.6%
    • Livecodebench: 29.9%
  • 🎯 Math Specialization: The math-specialized variant, PromptCoT-Mamba-Math-7B, further boosts math performance:
    • MATH-500: 88.0%
    • AIME 2024: 42.9% (+7.7% over generalist)
    • AIME 2025: 30.8% (+6.2% over generalist)
  • Inference Efficiency: Offers substantial speedups (e.g., 3.66× faster on 24GB GPU for long sequences) and constant-memory inference, ideal for cost-sensitive or long-context workloads.

Performance Details

PromptCoT Series Performance

| Model | GSM8K | MATH-500 | AIME2024 | AIME2025 |
|---|---|---|---|---|
| 🔹 1.5B Models | | | | |
| DeepSeek-R1-Distill-Qwen-1.5B | - | 83.9% | 28.9% | 28.1% |
| STILL-3-1.5B-preview | - | 85.5% | 39.3% | - |
| DeepScaleR-1.5B-Preview | - | 🟢 87.8% | 🟢 43.1% | 🟢 37.1% |
| PromptCoT-DS-1.5B (ours) | 🟢 87.6% ± 0.5% | 85.3% ± 1.1% | 41.2% ± 6.9% | 36.7% ± 6.2% |
| 🔹 7B Models | | | | |
| DeepSeek-R1-Distill-Qwen-7B | - | 92.8% | 55.5% | 40.0% |
| Qwen2.5-7B-SimpleRL | - | 82.4% | 26.7% | - |
| OpenThinker-7B | - | 89.6% | 30.0% | 33.3% |
| OpenR1-Qwen-7B | - | 90.6% | 36.7% | 40.0% |
| PromptCoT-DS-7B (ours) | 🔥 92.8% ± 0.5% | 🔥 93.7% ± 0.7% | 🔥 58.7% ± 3.1% | 🔥 49.2% ± 7.9% |
| 🔹 32B Models | | | | |
| DeepSeek-R1-Distill-Qwen-32B | - | 94.3% | 72.6% | - |
| S1-32B | - | 93.0% | 56.7% | 26.6% |
| LIMO-32B | - | 94.8% | 57.1% | 46.6% |
| QwQ-32B | - | - | 82.1% | 70.8% |
| PromptCoT-QwQ-32B (ours) | 🔥🔥 96.4% ± 0.2% | 🔥🔥 96.7% ± 0.5% | 🔥🔥 83.8% ± 2.8% | 🔥🔥 75.4% ± 4.7% |

PromptCoT-Mamba Performance

General Performance:

| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | Livecodebench |
|---|---|---|---|---|---|---|---|
| PromptCoT-Mamba-7B | 84.6 | 🔥🔥 35.2 | 🔥🔥 24.6 | 50.7 | 81.7 | 75.0 | 🔥🔥 29.9 |
| Gemma3-27B | 89.0 | 32.6 | 24.0 | 54.2 | 86.0 | 78.0 | 26.9 |
| Gemma3-12B | 83.8 | 22.9 | 19.2 | 49.9 | 81.1 | 73.2 | 22.2 |
| Sky-T1-7B | 85.0 | 19.2 | 19.2 | 49.2 | 41.5 | 37.2 | 18.3 |
| S1.1-7B | 82.0 | 19.2 | 17.5 | 43.1 | 64.0 | 56.7 | 13.3 |
| Bespoke-Stratos-7B | 81.2 | 18.3 | 16.3 | 45.0 | 73.2 | 68.3 | 8.6 |
| Nemotron-H-8B | 77.6 | -- | -- | -- | 79.3 | 74.4 | -- |
| M1-3B | 81.7 | 23.0 | 22.0 | 43.6 | -- | -- | -- |

Math Specialization vs. Generalist:

| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | Livecodebench |
|---|---|---|---|---|---|---|---|
| PromptCoT-Mamba-Math-7B | 🔥🔥 88.0 | 🔥🔥 42.9 | 🔥🔥 30.8 | 🔥🔥 52.1 | 71.3 | 66.5 | 20.3 |
| PromptCoT-Mamba-7B | 84.6 | 35.2 | 24.6 | 50.7 | 81.7 | 75.0 | 29.9 |

Citation

If you find PromptCoT or PromptCoT-Mamba useful in your research, please consider citing the respective papers:

For PromptCoT:

@article{zhao2025promptcot,
author = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Kong, Lingpeng},
title = {PromptCoT: Synthesizing Olympiad-Level Problems for Mathematical Reasoning in Large Language Models},
year = {2025},
journal = {arXiv preprint arXiv:2503.02324},
url = {http://arxiv.org/abs/2503.02324}
}

For PromptCoT-Mamba:

@article{zhao2025scaling,
author = {Xueliang Zhao and Wei Wu and Lingpeng Kong},
title = {Scaling Reasoning without Attention},
journal = {arXiv preprint arXiv:2505.22425},
year = {2025},
url = {https://arxiv.org/abs/2505.22425}
}

Ring: A Reasoning MoE LLM Provided and Open-sourced by InclusionAI

· 2 min read
inclusionAI
Ant Group

🤗 Hugging Face | 🤖 ModelScope

News

  • [2025-06]:🎉 Add Ring-lite Model
  • [2025-04]:🎉 Add Ring-lite-linear-preview Model

Introduction

Ring is a reasoning MoE LLM provided and open-sourced by InclusionAI, derived from Ling. We introduce Ring-lite-distill-preview, which has 16.8 billion parameters with 2.75 billion activated parameters. This model demonstrates impressive reasoning performance compared to existing models in the industry.

Model Downloads

The following table lists the available models and their parameters so you can pick the one that fits your use case. If you are located in mainland China, we also provide the models on ModelScope.cn to speed up the download process.

| Model | #Total Params | #Activated Params | Context Length | Download |
|---|---|---|---|---|
| Ring-lite-distill-preview | 16.8B | 2.75B | 64K | 🤗 HuggingFace · 🤖 ModelScope |
| Ring-lite | 16.8B | 2.75B | 128K | 🤗 HuggingFace · 🤖 ModelScope |

Quickstart

🤗 Hugging Face Transformers

Here is a code snippet to show you how to use the chat model with transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ring-lite"

model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
{"role": "system", "content": "You are Ring, an assistant created by inclusionAI"},
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
**model_inputs,
max_new_tokens=8192
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

🤖 ModelScope

If you're in mainland China, we strongly recommend you to use our model from 🤖 ModelScope.

Deployment

Please refer to Ling

Finetuning

Please refer to Ling

License

This code repository is licensed under the MIT License.

Citation

[TBD]