
Ming-Omni: A Unified Multimodal Model for Perception and Generation

· 10 min read
inclusionAI
Ant Group

GITHUB | 📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-lite-omni is a light version of Ming-omni, derived from Ling-lite, with 2.8 billion activated parameters. It is a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-lite-omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-lite-omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allows the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results show that Ming-lite-omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-lite-omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
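
The modality-specific routing idea can be pictured with a small sketch. Below is a minimal, illustrative PyTorch example (not the released Ming/Ling implementation; the expert count, top-k value, and modality names are assumptions made only for illustration): each modality owns a lightweight router that scores a shared pool of experts.

# Minimal, illustrative sketch of modality-specific routing in an MoE layer.
# NOT the released Ming/Ling code; sizes and names are assumptions.
import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    def __init__(self, d_model=1024, num_experts=8, top_k=2,
                 modalities=("text", "image", "audio", "video")):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )
        # One lightweight router per modality instead of a single shared router.
        self.routers = nn.ModuleDict({m: nn.Linear(d_model, num_experts) for m in modalities})

    def forward(self, x, modality):
        # x: (batch, seq, d_model) tokens of a single modality, for simplicity.
        weights, idx = self.routers[modality](x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = ModalityAwareMoE()
image_tokens = torch.randn(2, 16, 1024)
print(moe(image_tokens, modality="image").shape)  # torch.Size([2, 16, 1024])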

📌 Updates

  • [2025.06.12] 🔥 Our Technical Report is publicly available on arXiv.
  • [2025.05.28] 🔥 The official version of Ming-lite-omni is released, with better performance and image generation support.
  • [2025.05.04] 🔥 We release the test version of Ming-lite-omni: Ming-lite-omni-Preview.

Key Features

  • Unified Omni-Modality Perception: Ming-lite-omni, built on Ling, an MoE architecture LLM, resolves task conflicts and ensures coherent integration of tokens from different modalities through modality-specific routers.

  • Unified Perception and Generation: Ming-lite-omni achieves unified understanding and generation, enabling the model to interpret multimodal instructions and user intent during generation, which helps enhance generation quality and improves usability across multiple tasks.

  • Innovative Generation Capabilities: Ming-lite-omni can perceive all modalities and generate high-quality text, real-time speech, and vivid images simultaneously, delivering exceptional cross-modal performance across diverse tasks including image perception, audio-visual interaction, and image generation.

Evaluation

Ming-lite-omni delivers exceptional cross-modal performance, as validated across image perception, audio-visual interaction, and image generation tasks. Specifically, on image perception tasks, Ming-lite-omni attains performance comparable to Qwen2.5-VL-7B while activating only 2.8B parameters. It delivers superior performance in end-to-end speech understanding and instruction following, surpassing Qwen2.5-Omni and Kimi-Audio. It also supports native-resolution image generation, editing, and style transfer, achieving a GenEval score of 0.64 and outperforming mainstream models such as SDXL. In terms of FID, Ming-lite-omni reaches 4.85, setting a new state of the art among existing methods.

Image benchmark

| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
| --- | --- | --- | --- |
| AI2D | 83.1 | 84.4 | 84.5 |
| HallusionBench | 55.0 | 55.8 | 51.7 |
| MMBench_TEST_V11 | 80.8 | 82.8 | 82.0 |
| MMMU | 56.3 | 56.6 | 54.8 |
| MMStar | 64.7 | 65.3 | 65.2 |
| MMVet | 71.3 | 71.6 | 68.1 |
| MathVista | 71.6 | 68.1 | 67.9 |
| OCRBench | 88.4 | 87.8 | 88.2 |
| Average | 71.4 | 71.5 | 70.3 |

Encyclopedia Benchmarks

| Object Recognition | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
| --- | --- | --- |
| Plants | 54.96 | 47.8 |
| Animals | 56.7 | 50.85 |
| Vehicles | 41.91 | 42.29 |
| Food & Ingredients | 62.28 | 54.09 |
| Dishes | 44.3 | 39.07 |
| General | 91.08 | 92.42 |
| Average | 58.54 | 54.43 |

Video benchmark

| Benchmarks | Ming-lite-omni | Qwen2.5VL-7B-Instruct |
| --- | --- | --- |
| VideoMME | 67.0 | 67.3 |
| MVBench | 67.7 | 67.4 |
| Video-MMMU | 46.3 | 47.4 |
| LongVideoBench | 56.6 | 54.7 |
| Average | 59.4 | 59.2 |

Note: All models are evaluated based on 128 uniformly sampled frames.

Audio benchmark

SpeechQA

| Model | Average | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-Audio-chat | 3.545 | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Baichuan-Audio | 3.695 | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
| GLM-4-Voice | 3.77 | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Kimi-Audio | 4.215 | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.21 | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| Ming-lite-omni | 4.34 | 4.63 | 4.06 | 58.84 | 47.53 | 61.98 | 58.36 | 99.04 |

ASR

| Model | aishell1 | aishell2_android | aishell2_ios | cv15_zh | fleurs_zh | wenetspeech_meeting | wenetspeech_net | librispeech_test_clean | librispeech_test_other | multilingual_librispeech | cv15_en | fleurs_en | voxpopuli_v1.0_en |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ming-lite-omni | 1.47 | 2.55 | 2.52 | 6.31 | 2.96 | 5.95 | 5.46 | 1.44 | 2.80 | 4.15 | 6.89 | 3.39 | 5.80 |
| Qwen2.5-Omni | 1.18 | 2.75 | 2.63 | 5.20 | 3.00 | 5.90 | 7.70 | 1.80 | 3.40 | 7.56 | 7.60 | 4.10 | 5.80 |
| Qwen2-Audio | 1.53 | 2.92 | 2.92 | 6.90 | 7.50 | 7.16 | 8.42 | 1.60 | 3.60 | 5.40 | 8.60 | 6.90 | 6.84 |
| Kimi-Audio | 0.60 | 2.64 | 2.56 | 7.21 | 2.69 | 6.28 | 5.37 | 1.28 | 2.42 | 5.88 | 10.31 | 4.44 | 7.97 |

Information-Seeking Benchmark

| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
| --- | --- | --- | --- |
| GPT-4o | 36.05 | - | - |
| PaLI-X | 22.06 | 23.5 | 20.8 |
| Qwen2.5-vl-32B | 19.35 | 20.55 | 18.28 |
| Ming-lite-omni | 27.7 | 30.4 | 25.4 |

OCR

| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
| --- | --- | --- |
| ChartQA_TEST | 85.1 | 87.3 |
| DocVQA_TEST | 93 | 95.7 |
| OCRBenchV2_en/zh | 53.3/52 | 56.3/57.2 |
| OmniDocBench↓ | 34/34.4 | 30.8/39.8 |
| TextVQA_VAL | 82.8 | 84.9 |

GUI

| Benchmarks | Ming-lite-omni | InternVL3 8B | Qwen2.5-VL-7B-Instruct |
| --- | --- | --- | --- |
| ScreenSpot | 82.1 | 79.5 | 78.9* |
| ScreenSpot-V2 | 84.1 | 81.4 | - |
| AITZ(EM) | 66.6 | - | 57.6* |

Note: * denotes the reproduced results.

Unified Generation Benchmark

| Model | single_object | two_object | counting | colors | position | color_attr | GENEVAL | DPGBench | FID↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ming-lite-omni | 0.9875 | 0.7727 | 0.6812 | 0.7872 | 0.31 | 0.29 | 0.64 | 81.72 | 4.85 |
| Metaquery-XL | - | - | - | - | - | - | 0.61 | 82.05 | 6.02 |
| SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 | 68.09 | 26.96 |
| Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 | 80.60 | - |
| SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 74.65 | 8.76 |
| Janus | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 | 79.68 | 10.10 |
| JanusFlow | - | - | - | - | - | - | 0.63 | 80.09 | 9.51 |

Please refer to our technical report for more comprehensive evaluation results.

Model Downloads

You can download the model from both Huggingface and ModelScope.

| Model | Input modality | Output modality | Download |
| --- | --- | --- | --- |
| Ming-Lite-Omni | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace / 🤖 ModelScope |

If you're in mainland China, we strongly recommend downloading our model from 🤖 ModelScope.

pip install modelscope
modelscope download --model inclusionAI/Ming-Lite-Omni --local_dir inclusionAI/Ming-Lite-Omni --revision master

Note: This download process will take several minutes to several hours, depending on your network conditions.

Use Cases

Additional demonstration cases are available on our project page.

Environment Preparation

Installation with pip

pip install -r requirements.txt
# for python 3.10
pip install data/matcha_tts-0.0.5.1-cp310-cp310-linux_x86_64.whl
# for python 3.8
# pip install data/matcha_tts-0.0.5.1-cp38-cp38-linux_x86_64.whl
pip install diffusers==0.33.0
pip install nvidia-cublas-cu12==12.4.5.8 # for H20 GPU

Installation with docker

You can also initialize the environment by building the docker image. First clone this repository:

git clone --depth 1 https://github.com/inclusionAI/Ming.git
cd Ming

Then build the docker image with the provided Dockerfile in docker/docker-py310-cu121. This step might take a while:

docker build -t ming:py310-cu121 docker/docker-py310-cu121

At last, start the container with the current repo directory mounted:

docker run -it --gpus all -v "$(pwd)":/workspace/Ming ming:py310-cu121 /bin/bash

You can run the model through the Python interface. You may download the Hugging Face model into the repo directory first (.../Ming/) or mount the downloaded model path when starting the container.

Example Usage

We provide a step-by-step running example:

Step 1 - Download the source code

git clone https://github.com/inclusionAI/Ming.git
cd Ming

Step 2 - Download the model weights and create a soft link to the source code directory

Download our model following Model Downloads

mkdir inclusionAI
ln -s /path/to/inclusionAI/Ming-Lite-Omni inclusionAI/Ming-Lite-Omni

Step 3 - Enter the code directory and refer to the following code to run the Ming-Lite-Omni model.

jupyter notebook cookbook.ipynb

We also provide a simple usage example below. For detailed usage, please refer to cookbook.ipynb.

import torch
from transformers import AutoProcessor, GenerationConfig
from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# load model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
"inclusionAI/Ming-Lite-Omni",
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True
).to("cuda")

# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True)

# qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
],
},
]

# 1. Format inputs using chat template
text = processor.apply_chat_template(messages, add_generation_prompt=True)

# 2. Extract vision/audio data
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)

# 3. Prepare tensor inputs
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
audios=audio_inputs,
return_tensors="pt",
)
inputs = inputs.to(model.device)
for k in inputs.keys():
    if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
        inputs[k] = inputs[k].to(dtype=torch.bfloat16)

# 4. Configure generation
generation_config = GenerationConfig.from_dict({'no_repeat_ngram_size': 10})
generated_ids = model.generate(
**inputs,
max_new_tokens=512,
use_cache=True,
eos_token_id=processor.gen_terminator,
generation_config=generation_config,
)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

# 5. Decode output
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
# Output:

# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍:
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。
# ......

Note: We tested the examples on NVIDIA H800-80GB/H20-96G hardware with CUDA 12.4. Loading inclusionAI/Ming-Lite-Omni in bfloat16 takes about 62 GB of GPU memory.

This code repository is licensed under the MIT License, and the Legal Disclaimer is located in the LEGAL.md file under the project's root directory.

Citation

If you find our work helpful, feel free to cite it.


@misc{Mingomni2025,
title = {Ming-Omni: A Unified Multimodal Model for Perception and Generation},
author = {Inclusion AI},
year = {2025},
eprint = {2506.09344},
archivePrefix = {arXiv},
url = {https://arxiv.org/abs/2506.09344}
}

Ling: A MoE LLM Provided and Open-sourced by InclusionAI

· 11 min read
inclusionAI
Ant Group

🤗 Hugging Face | 🤖 ModelScope

Introduction

Ling is a MoE LLM provided and open-sourced by InclusionAI. We introduce two different sizes, which are Ling-lite and Ling-plus. Ling-lite has 16.8 billion parameters with 2.75 billion activated parameters, while Ling-plus has 290 billion parameters with 28.8 billion activated parameters. Both models demonstrate impressive performance compared to existing models in the industry.

Their structure makes it easy to scale up and down and adapt to different tasks, so users can use these models for a wide range of tasks, from processing natural language to solving complex problems. Furthermore, the open-source nature of Ling promotes collaboration and innovation within the AI community, fostering a diverse range of use cases and enhancements.

As more developers and researchers engage with the platform, we can expect rapid advancements and improvements, leading to even more sophisticated applications. This collaborative approach accelerates development and ensures that the models remain at the forefront of technology, addressing emerging challenges in various fields.

Update

  • [2025-5-10] Ling-lite-1.5 has been released! It achieves significant progress in reasoning ability compared with previous Ling-lite.
  • [2025-4-15] Ling-lite is upgraded to Ling-lite-0415. The new model demonstrates notable improvements over its predecessor, Ling-lite-0220, especially on code and math.

Model Downloads

The following table lists the model variants and their key parameters so you can choose the one that fits your use case. If you are located in mainland China, we also provide the models on ModelScope.cn to speed up the download process.

| Model | #Total Params | #Activated Params | Context Length | Download |
| --- | --- | --- | --- | --- |
| Ling-lite-base-1.5 | 16.8B | 2.75B | 128K | 🤗 HuggingFace / 🤖 ModelScope |
| Ling-lite-1.5 | 16.8B | 2.75B | 128K | 🤗 HuggingFace / 🤖 ModelScope |
| Ling-plus-base | 290B | 28.8B | 64K | 🤗 HuggingFace / 🤖 ModelScope |
| Ling-plus | 290B | 28.8B | 64K | 🤗 HuggingFace / 🤖 ModelScope |
| Ling-coder-lite-base | 16.8B | 2.75B | 16K | 🤗 HuggingFace / 🤖 ModelScope |
| Ling-coder-lite | 16.8B | 2.75B | 16K | 🤗 HuggingFace / 🤖 ModelScope |

Note: If you are interested in previous versions, please visit the past model collections on Hugging Face or ModelScope.

Evaluation

Ling-lite

Standard Benchmarks

| Benchmark | #shots | Ling-lite-1.5 | Ling-lite | Qwen3-4B-Instruct | Qwen3-8B-Instruct | Moonlight-16B-A3B-Instruct | LLaMA3.1-8B |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU(EM) | 5 | 74.33 | 71.27 | 70.09 | 75.97 | 70.74 | 68.67 |
| GPQA(Pass@1) | 0 | 36.55 | 29.73 | 40.4 | 47.10 | 19.51 | 27.59 |
| HumanEval(Pass@1) | 0 | 87.27 | 84.38 | 81.94 | 85.29 | 72.94 | 67.23 |
| LiveCodeBench 2408-2502 (Pass@1) | 0 | 22.7 | 18.94 | 21.8 | 26.88 | 14.76 | 18.41 |
| LCBench(pass@1) | 0 | 60.37 | 46.57 | 48.61 | 60.03 | 28.39 | 23.13 |
| Math(EM) | 0 | 82.62 | 72.80 | 81.46 | 82.70 | 67.1 | 52.42 |
| AIME2024(pass@1) | 0 | 21.88 | 10.21 | 20.62 | 26.25 | 6.88 | 7.29 |
| OlympiadBench(pass@1) | 0 | 52.30 | 36.44 | 54.33 | 56.11 | 32.85 | 17.04 |
| BBH(EM) | 0 | 75.75 | 66.38 | 78.21 | 79.33 | 63.45 | 68.05 |
| IFEval(Prompt Strict) | 0 | 77.70 | 77.99 | 81.06 | 83.55 | 49.01 | 73.01 |
| BFCL_live | 0 | 72.15 | 67.93 | 65.35 | 69.83 | 47.14 | 49.98 |

Context Window

Evaluation results on the Needle In A Haystack (NIAH) tests. Ling-lite-1.5 has improved long text generation capability and performs well across most context window lengths up to 128K.

Quickstart

🤗 Hugging Face Transformers

Here is a code snippet to show you how to use the chat model with transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ling-lite-1.5"

model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
{"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
**model_inputs,
max_new_tokens=512
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

🤖 ModelScope

If you're in mainland China, we strongly recommend using our model from 🤖 ModelScope.

Deployment

vLLM

vLLM supports offline batched inference or launching an OpenAI-Compatible API Service for online inference.

Environment Preparation

Since the Pull Request (PR) has not been submitted to the vLLM community at this stage, please prepare the environment by following the steps below:

git clone -b  v0.7.3 https://github.com/vllm-project/vllm.git
cd vllm
git apply Ling/inference/vllm/bailing_moe.patch
pip install -e .

Offline Inference:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ling-lite-1.5")

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=16384)

llm = LLM(model="inclusionAI/Ling-lite", dtype='bfloat16')
prompt = "Give me a short introduction to large language models."
messages = [
{"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
{"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
outputs = llm.generate([text], sampling_params)

Online Inference:

vllm serve inclusionAI/Ling-lite \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--use-v2-block-manager \
--gpu-memory-utilization 0.90
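
Once the server is up, you can query the OpenAI-compatible endpoint. Below is a minimal sketch using the openai Python client, assuming the default vLLM port 8000; adjust base_url and the served model name to match your deployment.

from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="inclusionAI/Ling-lite",
    messages=[
        {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)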

To handle long context in vLLM using YaRN, we need to follow these two steps:

  1. Add a rope_scaling field to the model's config.json file, for example:
{
...,
"rope_scaling": {
"factor": 4.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
}
}
  2. Use an additional parameter --max-model-len to specify the desired maximum context length when starting the vLLM service.

For detailed guidance, please refer to the vLLM instructions.

MindIE

This section outlines the primary process for running a Ling MoE model on the specified hardware with the MindIE inference framework.

Configuration preparation

Create a model directory on the host for the downloads, for example /root/models; this directory will be mounted into the docker container later.

Download the mindie-related configuration from github:

cd /root/models
git clone git@github.com:inclusionAI/Ling.git

Machine network environment check

# Check the physical link
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Check the links
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check your network health
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# Check whether the detected IP address is correctly configured
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# Check whether the gateway is configured correctly
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# Check the consistency of the NPU's underlying TLS verification behavior; all values should be 0
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
# Set the NPU's underlying TLS check to 0
for i in {0..7}; do hccn_tool -i $i -tls -s enable 0; done

Pull the image

Go to Ascend Community/Development Resources and pull the mindie image

Image version: 1.0.0-800I-A2-py311-openeuler24.03-lts

The versions of each component are as follows:

| Component | Version |
| --- | --- |
| MindIE | 1.0.0 |
| CANN | 8.0.0 |
| PTA | 6.0.0.beta1 |
| HDK | 24.1.0 |

Container startup and configuration changes

Start the container

Execute the following startup command (reference):

docker run -itd --privileged --name=<container_name> --net=host \
--shm-size 500g \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device /dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/models:/home/HwHiAiUser/Ascend \
mindie:1.0.0-XXX-800I-A2-arm64-py3.11 \
bash

(Replace the image tag above with the name of the image you loaded.)

Download the model

In this case, we use ModelScope to download the model, and install ModelScope first:

pip install modelscope

Download the model:

# The model takes a long time to download and can be executed in the background
nohup modelscope download --model inclusionAI/Ling-plus --local_dir /home/HwHiAiUser/Ascend/Ling_plus 2>&1 > /tmp/ling_plus.log &

nohup modelscope download --model inclusionAI/Ling-plus-base --local_dir /home/HwHiAiUser/Ascend/Ling_plus_base 2>&1 > /tmp/ling_plus_base.log &

nohup modelscope download --model inclusionAI/Ling-lite --local_dir /home/HwHiAiUser/Ascend/Ling_lite 2>&1 > /tmp/ling_lite.log &

nohup modelscope download --model inclusionAI/Ling-lite-base --local_dir /home/HwHiAiUser/Ascend/Ling_lite_base 2>&1 > /tmp/ling_lite_base.log &

After the download is completed, you need to change the file permissions, otherwise an error will be reported when MindIE-Service is started:

chmod -R 750 *.json *.py
Model weight format conversion

This section applies to the Ling Lite model only; it can be skipped for the Ling Plus model.

MindIE supports weights in safetensors format. If the downloaded weights are not in safetensors format, they need to be converted. Taking Ling Lite as an example, the conversion commands are as follows:

# Convert Ling lite
python /home/HwHiAiUser/Ascend/Ling/inference/mindie/convert_bin_to_safetensor.py

cd /home/HwHiAiUser/Ascend/Ling_lite
cp README.md configuration.json config.json special_tokens_map.json modeling_bailing_moe.py tokenizer.json tokenizer_config.json ../Ling_lite_safetensor/

# Convert Ling lite base
python /home/HwHiAiUser/Ascend/Ling/inference/mindie/convert_bin_to_safetensor_base.py

cd /home/HwHiAiUser/Ascend/Ling_lite_base
cp README.md configuration.json config.json special_tokens_map.json modeling_bailing_moe.py tokenizer.json tokenizer_config.json ../Ling_lite_base_safetensor/

The path for loading the Ling Lite model then becomes /home/HwHiAiUser/Ascend/Ling_lite_safetensor, and the path for the Ling Lite Base model becomes /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor.

Change the model configuration

The default model configuration file (config.json) cannot be loaded by MindIE directly and needs to be replaced:

# Adapt to mindie's Ling lite model configuration
cp /home/HwHiAiUser/Ascend/Ling_lite_safetensor/config.json /home/HwHiAiUser/Ascend/Ling_lite_safetensor/config.json.bak
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/model_chat_config.json /home/HwHiAiUser/Ascend/Ling_lite_safetensor/config.json
chmod 750 /home/HwHiAiUser/Ascend/Ling_lite_safetensor/config.json

# Adapt to mindie's Ling lite base model configuration
cp /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor/config.json /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor/config.json.bak
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/model_base_config.json /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor/config.json
chmod 750 /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor/config.json

# Adapt to mindie's Ling plus model configuration
cp /home/HwHiAiUser/Ascend/Ling_plus/config.json /home/HwHiAiUser/Ascend/Ling_plus/config.json.bak
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/plus/model_chat_config.json /home/HwHiAiUser/Ascend/Ling_plus/config.json
chmod 750 /home/HwHiAiUser/Ascend/Ling_plus/config.json

# Adapt to mindie's Ling plus base model configuration
cp /home/HwHiAiUser/Ascend/Ling_plus_base/config.json /home/HwHiAiUser/Ascend/Ling_plus_base/config.json.bak
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/plus/model_base_config.json /home/HwHiAiUser/Ascend/Ling_plus_base/config.json
chmod 750 /home/HwHiAiUser/Ascend/Ling_plus_base/config.json

Execute the shell script that adapts the mindie to the Ling model:

bash /home/HwHiAiUser/Ascend/Ling/inference/mindie/patch_atb_llm.sh

Single-machine service-based inference (Ling Lite)

Set the underlying environment variables:

source /usr/local/Ascend/atb-models/set_env.sh

Set different mindie configurations according to the model type:

# Ling Lite
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/config.json /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json

# Ling Lite base
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/config.base.json /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json

Start the mindie service:

chmod 640 /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json

cd $MIES_INSTALL_PATH
nohup ./bin/mindieservice_daemon > /tmp/service.log 2>&1 &

Check /tmp/service.log for the message Daemon start success!; if it appears, MindIE-Service has started successfully.

Send a test request to verify the service:

# Chat model
wget -O- --post-data="{\"messages\":[{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Who are you?\"}], \"stream\": false, \"max_tokens\":100, \"model\": \"bailing_moe\", \"temperature\":0}" \
--header='Content-Type:application/json' \
'http://127.0.0.1:1025/v1/chat/completions'

# base model

wget -O- --post-data='{"inputs":"My name is Olivier and I","stream":false,"parameters":{"temperature":1,"max_new_tokens":100,"do_sample":false}}' \
--header='Content-Type:application/json' \
'http://127.0.0.1:1025/infer'

Multi-machine service-based inference (Ling plus)

All of the following commands need to be executed simultaneously on all machines.

To enable multi-machine service-based inference, you need to configure a multi-machine ranktable file.

  • Get the

Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction

· 7 min read
inclusionAI
Ant Group

GITHUB | 📑 Paper | 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-Lite-Uni is an open-source multimodal framework that includes a newly developed unified visual generator and a native multimodal autoregressive model designed to integrate vision and language.

This project offers an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing innovative multi-scale learnable tokens and a multi-scale representation alignment strategy. Ming-Lite-Uni utilizes a fixed MLLM and a learnable diffusion model, allowing native multimodal AR models to perform text-to-image generation and instruction-based image editing, extending their capabilities beyond visual comprehension. Our experimental findings demonstrate the strong performance of Ming-Lite-Uni and highlight the fluidity of its interactive process. Ming-Lite-Uni is currently in the alpha phase and will undergo further refinement.

We appreciate everyone's ongoing support and attention. We sincerely value your patience as we progressively improve our solutions and model performance. We are making steady progress and observing favorable results, with further updates coming soon. Stay tuned!

📌 Updates

Why It Matters

Ming-Lite-Uni's unified architecture overcomes fundamental limitations of conventional approaches:

| Conventional Methods | Ming-Lite-Uni's Advantages |
| --- | --- |
| Modular Pipelines (CLIP/SigLIP + Diffusion Models) | End-to-End Unified Model: seamless understanding-generation integration |
| Discrete Token AR (limited visual grounding) | Continuous Token Space: native support for fine-grained visual concepts |
| Fixed-Resolution Processing (artifacts in upscaling) | Multi-Scale Adaptation: consistent quality across resolutions |
| Separate Editing Workflows (manual alignment required) | Dialog-Driven Control: natural-language-guided pixel-level editing |
| Understanding Bottlenecks (visual-semantic mismatch) | Joint Representation Learning: mutually enhanced comprehension and generation |

Key Enhancements

  • Unified Visual Understanding & Generation Architecture. Ming-Lite-Uni achieves an average understanding score of 69.7 on the OpenCompass leaderboard, surpassing DeepSeek-VL2 (66.4). At the same time, it achieves an image generation score of 0.62 on the GenEval benchmark, outperforming SDXL (0.55).
  • Multi-Scale Learnable Tokens. We employ a novel mechanism to establish feature correlations across resolutions of 4×/8×/16×. By introducing hierarchical tokens, the model captures global layout (low-res), object structures (mid-res), and fine textures (high-res), improving GenEval by 3.5%.
  • Multi-Scale Representation Alignment. We introduce a novel scale-wise consistency loss to enforce alignment between hierarchical representations and final outputs through native-resolution optimization (a minimal sketch follows this list). This strategy directly enhances high-resolution reconstruction quality (>2 dB PSNR) and boosts GenEval by 1.5%.
  • AGI-Capable System. Our model supports complex chained operations, such as "generate castle → add sunset → adjust perspective", with a swift response time of under 1 second (benchmarked on an RTX 4090). The system is designed to handle instruction-driven generation and editing and is synchronized with ChatGPT-4o (aligned with the industry milestone of March 2025).
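
As a rough illustration of the scale-wise consistency loss mentioned above, the sketch below compares up-sampled hierarchical feature maps against the final native-resolution features. The shapes, scale factors, and weights are assumptions for illustration, not the actual training code.

import torch
import torch.nn.functional as F

def scale_wise_consistency_loss(multi_scale_feats, final_feats, weights=(0.25, 0.5, 1.0)):
    # multi_scale_feats: list of (B, C, H/s, W/s) maps for s in (16, 8, 4)
    # final_feats: (B, C, H, W) native-resolution features
    loss = final_feats.new_zeros(())
    for w, feat in zip(weights, multi_scale_feats):
        # Upsample each intermediate scale to the native resolution before comparing.
        up = F.interpolate(feat, size=final_feats.shape[-2:], mode="bilinear", align_corners=False)
        loss = loss + w * F.mse_loss(up, final_feats)
    return loss

B, C, H, W = 2, 64, 64, 64
feats = [torch.randn(B, C, H // s, W // s) for s in (16, 8, 4)]
print(scale_wise_consistency_loss(feats, torch.randn(B, C, H, W)))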

Empowering Multimodal Interaction with Ming-Lite-Uni

Ming-Lite-Uni acts as a unified model for multimodal understanding, extending beyond traditional NLP tasks and multimodal comprehension to enable interactive multimodal generation. This includes capabilities such as image generation, image editing, and style transfer.

Model Structure

Ming-Lite-Uni is a unified multimodal model designed for both image understanding and high-fidelity image generation. It achieves this by compressing image representations into continuous visual tokens, which are processed alongside discrete text tokens using a scaled auto-regressive Transformer. The generation capability is powered by an externally trained diffusion model (SANA), conditioned on tokens produced by the Transformer.
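
The data flow described above can be summarized in a short, purely illustrative sketch: an autoregressive backbone consumes discrete text tokens plus a set of learnable queries and emits continuous visual tokens, which then condition an external diffusion decoder. The class names, sizes, and the stub decoder below are assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn

class UnifiedARBackbone(nn.Module):
    """Jointly processes discrete text tokens and continuous visual queries."""
    def __init__(self, d_model=512, vocab=32000, n_visual_queries=64):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, d_model)
        self.visual_queries = nn.Parameter(torch.randn(n_visual_queries, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_ids):
        txt = self.text_emb(text_ids)                          # (B, T, D)
        q = self.visual_queries.expand(txt.size(0), -1, -1)    # (B, Q, D)
        h = self.backbone(torch.cat([txt, q], dim=1))
        return h[:, -q.size(1):]                               # continuous visual tokens

class DiffusionDecoderStub(nn.Module):
    """Stand-in for the external diffusion model (e.g., SANA) conditioned on the tokens."""
    def __init__(self, d_model=512):
        super().__init__()
        self.to_pixels = nn.Linear(d_model, 3 * 16 * 16)

    def forward(self, cond):
        # A real decoder runs iterative denoising; this stub just maps tokens to patches.
        b, q, _ = cond.shape
        return self.to_pixels(cond).view(b, q, 3, 16, 16)

backbone, decoder = UnifiedARBackbone(), DiffusionDecoderStub()
visual_tokens = backbone(torch.randint(0, 32000, (1, 12)))
print(decoder(visual_tokens).shape)  # torch.Size([1, 64, 3, 16, 16])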


Benchmark Evaluations

We conduct separate quantitative evaluations of Ming-Lite-Uni on multimodal understanding and text-to-image generation using public benchmarks. For multimodal understanding, we compare against traditional models that take images and text as input and output text, as well as against recent models with visual generative capabilities. For multimodal generation, we evaluate text-to-image performance on GenEval. Please refer to our TechReport for details.

Multimodal Understanding

| Type | Model | Avg. | MMB | MMS | MMMU | MathV | Hall | AI2D | MM-Vet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Und. Only | LLaVA-72B | 68.0 | 84.5 | 65.8 | 56.6 | 68.4 | 47.9 | 86.2 | 60.6 |
|  | Qwen2.5-VL-7B | 76.2 | 87.8 | 71.1 | 67.9 | 70.8 | 58.8 | 88.2 | 76.7 |
|  | Emu3-Chat | - | 58.5 | - | 31.6 | - | - | - | 37.2 |
|  | InternVL2.5-78B | 75.2 | 87.5 | 69.5 | 70 | 71.4 | 57.4 | 89.1 | 71.8 |
|  | DeepSeek-VL2 | 66.4 | 81.2 | 61.0 | 50.7 | 59.4 | 51.5 | 84.5 | 60.0 |
|  | GPT-4o-20241120 (closed) | 72.0 | 84.3 | 65.1 | 70.7 | 59.9 | 56.2 | 84.9 | 74.5 |
|  | Step-1o (closed) | 77.7 | 87.3 | 69.3 | 69.9 | 74.7 | 55.8 | 89.1 | 82.8 |
| Und. and Gen. | TokenFlow-XL | - | 68.9 | - | 38.7 | - | - | - | 40.7 |
|  | Janus-Pro-7B | - | 79.2 | - | 41.0 | - | - | - | 50.0 |
|  | Ours (Ming-Lite-Uni) | 69.7 | 80.7 | 60.5 | 51.2 | 68.3 | 51.8 | 84.5 | 72.3 |

Image Generation

| Type | Method | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gen. Only | LlamaGen | 0.71 | 0.34 | 0.21 | 0.58 | 0.07 | 0.04 | 0.32 |
|  | SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 |
|  | Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 |
|  | SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
|  | DALL-E 3 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
|  | SD3-Medium | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
| Und. and Gen. | Show-o | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 |
|  | TokenFlow-XL | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 |
|  | Janus-Pro-1B | 0.98 | 0.82 | 0.51 | 0.89 | 0.65 | 0.56 | 0.73 |
|  | Ours (Ming-Lite-Uni) | 0.99 | 0.76 | 0.53 | 0.87 | 0.26 | 0.30 | 0.62 |

Example Usage

System Requirements

  • Python: >= 3.8
  • PyTorch: >= 2.4.1+cu12.2 (CUDA 12.2 compatible)
  • flash-attn: >= 2.6.3

Installation

We recommend installing the following versions to set up your environment using pip:

pip install -r requirements.txt
Usage Guide

Below is an example of how to load and use the model:

import torch
import os
from Ming_Uni.MingUniInference import Ming_Uni_Inference
from Ming_Uni.process import MyProcessor
device = torch.cuda.current_device()
device = torch.device(device)

model_path='../Ming-Lite-Uni/'
model = Ming_Uni_Inference(model_path)
model.to(torch.bfloat16)
model.to(device)
model.eval()

llm_model=os.path.join(model_path, 'qwen2_5_llm')
my_proc=MyProcessor(llm_model)

image_file = "tests/cake.jpg"
prompt = "add a candle on top of the cake"
inputs = my_proc.process(image_file=image_file, prompt=prompt, device=device)

result = model.image_gen_generate(inputs, steps=30, seed=42, cfg=5.0, height=512, width=512)[1]
result.save("result.png")

For more advanced usage, such as fine-tuning or generating images, refer to the documentation.

Acknowledgments

The project is currently in its early stages. While some preliminary results have been promising, substantial progress is needed to achieve seamless integration of understanding and generation. Both the code and models require further refinement and optimization, which is why we have chosen to open-source the project. We invite contributions from the community to help enhance and develop it collaboratively. If you have any suggestions or identify issues within the code, please contribute via Pull Requests. Thank you for your support and interest!

Open Collaboration

We're open-sourcing Ming-Lite-Uni to accelerate progress toward AGI, featuring:

  • 📂 Full model weights & test code
  • 🧩 Modular architecture for easy extension
  • 📊 Comprehensive benchmarks (vs GPT-4V, SDXL, etc.)

"The simultaneous release of ChatGPT-4's image generation in March 2025 confirms our vision of unified multimodal AI as the next paradigm."

Contact Information

If you require assistance or encounter troubles while utilizing our project, please open a GitHub issue.

Ming is licensed under the MIT License, and the Legal Disclaimer is located in the LEGAL.md file under the project's root directory.

Citation

If you find our work helpful, feel free to cite it.

@article{Mingunify2025,
title = {Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction},
author = {Inclusion AI, Ant Group},
journal = {arXiv preprint},
year = {2025}
}

Ming-Lite-Omni-Preview: A MoE Model Designed to Perceive a Wide Range of Modalities

· 8 min read
inclusionAI
Ant Group

GITHUB 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-Lite-Omni-Preview is an MoE model built upon Ling-Lite, designed to perceive a wide range of modalities, including text, images, audio, and video, and to generate text and natural speech in a streaming manner. To naturally handle these diverse modalities, we have enhanced Ling-Lite by incorporating modality-specific routers. As a result, Ming-Omni excels at handling information from diverse modalities and is highly scalable.

Key Features

  • Omni and Novel MoE Architecture: An innovative Omni architecture based on Mixture of Experts (MoE) that achieves competitive performance across multiple modality benchmarks.

  • Video understanding: Supports KV-Cache dynamic compression of visual tokens. It can understand hours-long videos while also providing fine-grained understanding of short clips of just a few seconds.

  • Natural Speech Generation and Fine-grained Voice Dialogue: Supports dialect understanding and generation in end-to-end conversations, enables one-shot voice cloning, and enhances prosody through audio tokenizer compression.

Evaluation

Image benchmark

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
| --- | --- | --- | --- |
| AI2D | 83.84 | 83.9 | 84.5 |
| HallusionBench | 54.68 | 51.9 | 51.7 |
| MMBench_TEST_V11 | 79.63 | 84.3 | 82.0 |
| MMMU | 57.0 | 58.6 | 54.8 |
| MMStar | 62.0 | 63.9 | 65.2 |
| MMVet | 73.6 | 67.1 | 68.1 |
| MathVista | 69.0 | 68.2 | 67.9 |
| OCRBench | 87.9 | 86.4 | 88.2 |
| Average | 70.96 | 70.5 | 70.3 |

Object Recognition

| Object Recognition | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B | InternVL-2.5-8B |
| --- | --- | --- | --- |
| Plants | 52.1 | 55.3 | 32.8 |
| Animals | 52.6 | 54.8 | 36.5 |
| Home appliances & furniture | 93.5 | 97.4 | 90.9 |
| Personal Electronics | 96.1 | 95.1 | 93.2 |
| Food & Ingredients | 57.5 | 60.0 | 48.7 |
| Tableware | 96.6 | 94.9 | 88.1 |
| Vehicles | 31.9 | 40.9 | 31.9 |
| Average | 68.6 | 71.2 | 60.3 |

Video benchmark

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5VL-7B |
| --- | --- | --- |
| VideoMME wo/w sub. | 63.9/67.6 | 65.1/71.6 |
| MVBench | 67.0 | 72.0 |
| Video-MMMU | 45.4 | 47.44 |
| LongVideoBench | 53.7 | 60.0 |

Audio benchmark

SpeechQA

| Model | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-Audio-chat | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Baichuan-Audio | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
| GLM-4-Voice | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Kimi-Audio | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| Ming-Lite-Omni-Preview | 4.25 | 3.88 | 58.95 | 46.06 | 60.00 | 46.71 | 96.53 |

ASR

| Model | Aishell-1 | Aishell-2 ios | Wenetspeech test-net | Wenet test-meeting | Librispeech test-clean | Librispeech test-other |
| --- | --- | --- | --- | --- | --- | --- |
| Whisper Large-v3 | 5.14 | 4.76 | 9.68 | 18.54 | 1.9 | 3.65 |
| Qwen2-Audio | 1.53 | 3.06 | 7.72 | 8.4 | 1.6 | 3.6 |
| GLM-4-voice Base | 2.46 | - | - | - | 2.82 | 7.66 |
| Baichuan-Omni-1.5 | - | - | 6.9 | 8.4 | - | - |
| Qwen2.5-Omni | 1.18 | 2.36 | 5.9 | 7.7 | 1.8 | 3.4 |
| Ming-Lite-Omni-Preview | 1.62 | 2.82 | 6.23 | 6.9 | 2.34 | 5.74 |

Knowledge

| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
| --- | --- | --- | --- |
| GPT-4o | 36.05 | - | - |
| PaLI-X | 22.06 | 23.5 | 20.8 |
| Qwen2.5-vl-32B | 19.35 | 20.55 | 18.28 |
| Ming-Lite-Omni-Preview | 27.3 | 28.9 | 25.9 |

OCR&GUI

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct |
| --- | --- | --- |
| ChartQA_TEST | 85.2 | 87.3 |
| DocVQA_TEST | 93.2 | 95.7 |
| OCRBenchV2_en/zh | 52.2/51.6 | 56.3/57.2 |
| OmniDocBench↓ | 34.7/34.5 | 30.8/39.8 |
| TextVQA_VAL | 82.36 | 84.9 |
| ScreenSpot | 79.3 | 84.7 |

Model Downloads

You can download the model from both Huggingface and ModelScope.

| Model | Input modality | Output modality | Download |
| --- | --- | --- | --- |
| Ming-Lite-Omni-Preview | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace / 🤖 ModelScope |

If you're in mainland China, we strongly recommend downloading our model from 🤖 ModelScope.

Use Cases

Video-Audio-QA

Multimodal input (video + audio) QA:
Q: <audio> (audio content: 请描述视频内容。)
A: The video features a woman performing a series of yoga poses on a rooftop with a scenic view of mountains and a clear blue sky.
Q: Is there any food in front of me?
A: Yes, there's candy on the table.

Speech2Speech (supports dialect)

Quickstart

Please download our model following Model Downloads, then refer to the following code to run the Ming-Lite-Omni-Preview model.

import os
import torch
from transformers import AutoProcessor
from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# build model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
"inclusionAI/Ming-Lite-Omni",
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True
).to("cuda")

assets_path = YOUR_ASSETS_PATH

# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True)
# qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
],
},
]
# Output:

# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍:
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。
# ......

# image qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "image", "image": os.path.join(assets_path, "flowers.jpg")},
{"type": "text", "text": "What kind of flower is this?"},
],
},
]
# Output:

# The flowers in this image are forget-me-nots. These delicate blooms are known for their small, five-petaled flowers that come in various shades of blue, pink, and white.

To enable thinking before responding, add the following system prompt before your question:

cot_prompt = "SYSTEM: You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <thinking>...</thinking> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}.\n"
# And your input message should be like this:
messages = [
{
"role": "HUMAN",
"content": [
{"type": "image", "image": os.path.join(assets_path, "reasoning.png")},
{"type": "text", "text": cot_prompt + "In the rectangle $A B C D$ pictured, $M_{1}$ is the midpoint of $D C, M_{2}$ the midpoint of $A M_{1}, M_{3}$ the midpoint of $B M_{2}$ and $M_{4}$ the midpoint of $C M_{3}$. Determine the ratio of the area of the quadrilateral $M_{1} M_{2} M_{3} M_{4}$ to the area of the rectangle $A B C D$.\nChoices:\n(A) $\frac{7}{16}$\n(B) $\frac{3}{16}$\n(C) $\frac{7}{32}$\n(D) $\frac{9}{32}$\n(E) $\frac{1}{5}$"},
],
},
]
# Output:
# \<think\>\nOkay, so I have this problem about a rectangle ABCD ... (thinking process omitted) ... So, the correct answer is C.\n\</think\>\n\<answer\>\\boxed{C}\</answer\>\n\n
# video qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "video", "video": os.path.join(assets_path, "yoga.mp4")},
{"type": "text", "text": "What is the woman doing?"},
],
},
]
# Output:

# The image shows a woman performing a yoga pose on a rooftop. She's in a dynamic yoga pose, with her arms and legs extended in various positions.

# multi-turn chat
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "中国的首都是哪里?"},
],
},
{
"role": "ASSISTANT",
"content": [
{"type": "text", "text": "北京"},
],
},
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "它的占地面积是多少?有多少常住人口?"},
],
},
]
# Output:

# 北京市的总面积约为16,410.54平方公里,常住人口约为21,542,000人。
# Preparation for inference
text = processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
audios=audio_inputs,
return_tensors="pt",
)
inputs = inputs.to(model.device)
for k in inputs.keys():
    if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
        inputs[k] = inputs[k].to(dtype=torch.bfloat16)

# call generate
generated_ids = model.generate(
**inputs,
max_new_tokens=512,
use_cache=False,
eos_token_id=processor.gen_terminator,
)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
# ASR
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "Please recognize the language of this speech and transcribe it. Format: oral."},
{"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
],
},
]
outputs = model.generate(messages, max_new_tokens=512)
print(outputs)
# speech2speech
messages = [
{
"role": "HUMAN",
"content": [
{"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
],
},
]
outputs = model.generate(messages, max_new_tokens=512, speaker='luna', output_audio_path='out.wav', output_audio=True)
print(outputs)

This code repository is licensed under the MIT License, and the Legal Disclaimer is located in the LEGAL.md file under the project's root directory.

Agentic Learning

· 4 min read
inclusionAI
Ant Group

Introduction

Agents exhibit powerful capabilities by interacting with the external environment and making decisions based on the feedback they receive from it. For complex problems, an agent often needs multi-turn interactions with the environment to reach a solution. The complexity and dynamism of environments, coupled with the necessity for multi-turn interactions, pose numerous challenges in training agents.

We introduce AgenticLearning, an open-source agent training paradigm designed to empower researchers to train and evaluate autonomous agents effectively. AgenticLearning offers a framework for multi-turn interactions with the environment, enabling models to learn how to interact with the environment and make decisions based on its feedback, thereby enhancing the models' ability to leverage the environment to solve complex problems.

| Advancements | Models | Tools | Environment | Training Framework |
| --- | --- | --- | --- | --- |
| RAG-R1 | Qwen2.5-7b-instruct | offline retrieval, online search | AWorld | LLaMA-Factory, verl, AReaL |
| FunReason | Qwen2.5-7b-Coder-instruct | BFCL | AWorld | LLaMA-Factory, verl |

News

[2025/07/01] 🔥🔥🔥 RAG-R1 We propose RAG-R1, a deepsearch training framework that incentivizes the search and reasoning capabilities of LLMs through multi-query parallelism.

[2025/05/16] 🔥🔥🔥FunReason We propose FunReason, a novel framework that enhances LLMs' function calling capabilities through an automated data refinement strategy and a Self-Refinement Multiscale Loss approach.

Advancements

Deepsearch

RAG-R1

  • Tools: Search Engines (offline or online)
  • LLM: Qwen2.5-7b-instruct


Overall framework of RAG-R1.


Performance comparisons on QA benchmarks under the EM metric. The best and second best results are bold and underlined, respectively.

FunctionCall

FunReason

  • Tools: Real Human Function calling (BFCLv2 live&non-live)
  • LLM: Qwen2.5-7b-Coder-instruct

FunReason is a framework designed to enhance LLMs' function calling capabilities, achieving GPT-4o-comparable performance on BFCL, surpassing RL-based methods, mitigating catastrophic forgetting on HumanEval and MBPP, and using a data refinement strategy where natural CoT data outperforms artificial ones.


Data refinement pipeline of FunReason.

Overview of FunReason's data refinement pipeline. The pipeline consists of five stages: Function Call Classification, Query and Tool Identification, CoT Identification, Function and Parameter Identification, and Format Identification. Each stage ensures specific aspects of data quality, with failing examples either being discarded or regenerated.


Performance of FunReason.

Citation

Please cite our repo if our work is helpful for your research.

@article{RAG-R1,
title={RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism},
author={Zhiwen Tan and Jiaming Huang and Qintong Wu and Hongxuan Zhang and Chenyi Zhuang and Jinjie Gu},
journal={arXiv preprint arXiv:2507.02962},
year={2025}
}

@article{FunReason,
title={FunReason: Enhancing Large Language Models' Function Calling via Self-Refinement Multiscale Loss and Automated Data Refinement},
author={Bingguang Hao and Maolin Wang and Zengzhuang Xu and Cunyin Peng and Yicheng Chen and Xiangyu Zhao and Jinjie Gu and Chenyi Zhuang},
journal={arXiv preprint arXiv:2505.20192},
year={2025}
}

Contact

For any question or feedback, please reach out to us at ender.tzw@antgroup.com or chenyi.zcy@antgroup.com

License

This project is licensed under the MIT License - see the LICENSE file for details.

AReaL: Ant Reasoning Reinforcement Learning for LLMs

· 11 min read
inclusionAI
Ant Group

| Paper | Documentation | Ask DeepWiki | 🤗 Models & Data | WeChat Group |

AReaL (Ant Reasoning RL) is an open-source fully asynchronous reinforcement learning training system for large reasoning models developed at the RL Lab, Ant Research. Built upon the open-source project RealHF, we are fully committed to open-source by providing training details, data, and infrastructure required to reproduce results along with the model itself. AReaL aims to help everyone build their own AI agents easily and affordably. Our team loves milk tea because it's delicious, customizable, and affordable. We hope you enjoy our project just like how you enjoy real-world milk tea (cheers).

AReaL Highlights

  • 🔥 [NEW] Asynchronous RL: With algorithm-system co-design, AReaL supports fully asynchronous RL for the fastest training! Experimental support for multi-turn agentic RL is also provided.
  • 🛠️ Open & Reproducible: We continuously release all code, datasets, and training recipes for RL training of LLMs.
  • 🚀 Scalability: AReaL can seamlessly adapt to different computational resource settings, ranging from a single node to 1K GPUs.
  • 🔪 Cutting-Edge Performance: AReaL can produce models with cutting-edge reasoning capabilities in math and coding. We are also actively working on agentic tasks.

News

[2025/06/03] (v0.3, boba²) We release boba² (double-boba) for fully asynchronous RL training, which achieves a 2.77x speedup while obtaining on-par or even better training performance compared to synchronous systems. Moreover, asynchronous RL makes it extremely easy to set up multi-turn agentic RL training! Check out our v0.3 overview blog and the research paper.

[2025/03/31] (v0.2, boba) Here comes our next milestone release - boba! Please call it A-ReaL-boba! This release includes much faster training with SGLang support and SOTA 7B and 32B models on math reasoning. Check our v0.2 technical blog.

[2025/02/24] (v0.1) Our initial release includes reproducible results for 1.5B and 7B LRMs. Check our v0.1 technical blog.

Release Highlights

In our AReaL-boba² (A-ReaL-double-boba) release, we highlight the top 3 most important features:

  • A fully asynchronous RL training pipeline with system and RL algorithm co-design, achieving over 2.77x speedup without any performance drop. Check the benchmark scripts and instructions here.

  • SOTA coding models, i.e., a 14B model with a 69.1 score on LCB-v5. To reproduce, check the configs and instructions.

  • Experimental support for multi-turn agentic RL training. Check our complete example.

For the complete system design and more training details, please check our v0.3 blog and our research paper.

Jump to the quickstart section if you want to quickly run an experiment and get your hands dirty! 😈

Overview of Asynchronous RL Training

During the synchronous RL training process, a generation step must wait until the longest sequence completes within the batch of LLM outputs. Due to the varying output lengths for LRMs, a synchronous RL system suffers from massive GPU idle time, leading to training inefficiency. Some recent works (DeepCoder, Intellect) propose overlapping a single training step with a single generation step to accelerate training. However, the largest bottleneck remains unchanged: the samples within a batch are still from the same model version, leading to waiting and GPU idle time.


Fig.1. Left: Execution timeline of synchronous RL training. Right: Execution timeline of one-step overlap RL system.

AReaL adopts a fully asynchronous RL training framework that completely decouples generation from training. In AReaL, LLM generation runs in a streaming manner, with each rollout worker continuously producing outputs without waiting. Meanwhile, trainer workers perform parallel model updates upon receiving training batches.
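
To make the decoupling concrete, here is a minimal, purely illustrative sketch (not AReaL's actual code) of the pattern: rollout workers stream samples tagged with the policy version that produced them into a bounded queue, while the trainer consumes batches and drops samples that exceed a staleness bound.

import queue
import threading
import time

MAX_STALENESS = 2          # assumed hyperparameter controlling sample staleness
sample_queue = queue.Queue(maxsize=1024)
policy_version = 0

def rollout_worker():
    # In a real system this would call the inference engine in streaming mode.
    while True:
        sample_queue.put({"tokens": [1, 2, 3], "version": policy_version})
        time.sleep(0.01)

def trainer(num_steps=5, batch_size=8):
    global policy_version
    for _ in range(num_steps):
        batch = []
        while len(batch) < batch_size:
            s = sample_queue.get()
            if policy_version - s["version"] <= MAX_STALENESS:  # staleness control
                batch.append(s)
        # update_policy(batch) would run a gradient step (e.g., a decoupled PPO objective) here.
        policy_version += 1

threading.Thread(target=rollout_worker, daemon=True).start()
trainer()
print("finished", policy_version, "trainer steps")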


Fig 2. Execution timeline of our fully asynchronous RL system.

AReaL follows a system-algorithm co-design principle: on the system side, AReaL efficiently syncs model parameters and carefully controls the staleness of each training sample; on the algorithm side, AReaL improves the objective of PPO to make async-RL stable.

We compare the scalability of asynchronous RL training based on our AReaL-boba² system with classical synchronous RL training (we adopt the fastest open-source system veRL, main branch on 05/07/2025) across different model sizes and different numbers of H800 GPUs. AReaL demonstrates much improved scaling capabilities with respect to training throughput. This is also partially due to AReaL decoupling training and generation, leading to much fewer GPU memory fragments.


Fig.3 The scaling trend of asynchronous RL (based on AReaL-boba2) and classical synchronous RL (based on veRL) with different model sizes. Dotted lines indicate ideal linear scaling.

SOTA Code Generation Model by AReaL-boba²

We use Qwen3 as our base model. After asynchronous RL training, we achieve SOTA results on LiveCodeBench, Codeforces, and CodeContests benchmarks.

| Model (8B) | LiveCodeBench v5 (2024.10-2025.2) | Codeforces | CodeContests |
| --- | --- | --- | --- |
| Qwen3-8B | 58.8 | 1879 / 96.7% | 31.4 |
| DeepSeek-R1-0528-Qwen3-8B | 58.4 | 1945 / 97.3% | 31.0 |
| 🤗 AReaL-boba²-8B-Open | 62.0 | 1933 / 97.2% | 41.4 |
| 🤗 AReaL-boba²-8B | 63.0 | 1962 / 97.5% | 40.8 |

| Model (14B) | LiveCodeBench v5 (2024.10-2025.2) | Codeforces | CodeContests |
| --- | --- | --- | --- |
| Qwen3-14B | 65.4 | 1978 / 97.7% | 38.3 |
| DeepCoder-14B-Preview | 60.6 | 1936 / 95.3% | 40.1 |
| 🤗 AReaL-boba²-14B-Open | 67.3 | 1990 / 97.8% | 46.2 |
| 🤗 AReaL-boba²-14B | 69.1 | 2044 / 98.2% | 46.1 |

| Larger Models | LiveCodeBench v5 (2024.10-2025.2) | Codeforces | CodeContests |
| --- | --- | --- | --- |
| Qwen3-235B | 70.7 | 2056 | - |
| DeepSeek-R1 | 64.3 | 2029 | - |
| OpenAI-o3-mini (Medium) | 66.3 | 2036 | - |

Table 1: Coding Task Performance Comparison. AReaL-boba²-8B/14B-Open denotes training results on open-source data. AReaL-boba²-8B/14B models are trained with an additional small amount of internal data and achieve SOTA performance on LiveCodeBench, Codeforces & CodeContests.

We highlight the tutorials and code walkthroughs about the following key features for asynchronous training:

RL Training for Multi-turn Agent

AReaL-boba² allows you to independently customize the dataset, rollout behavior, and the training algorithm, without needing to modify the heavy system-level code.

In particular, we show a simple example to develop a multi-turn math agent for RL training. Please see the learning curve below and reference the step-by-step guide if you want to implement your own agentic RL project.
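
As a flavor of what such a customization looks like, the following is a purely illustrative sketch of a multi-turn math-agent rollout loop; it does not use AReaL's real interfaces, and the environment, reward, and generate_answer stub are assumptions made only for illustration.

import random

def generate_answer(question, feedback):
    # Stand-in for a policy-model call; a real agent would condition on the feedback.
    return random.choice(["42", "7", "13"])

def math_env_step(question, answer, ground_truth):
    # Environment feedback: reward 1.0 for a correct final answer, plus a textual hint.
    correct = answer.strip() == ground_truth
    return (1.0 if correct else 0.0), ("correct" if correct else "incorrect, try again")

def rollout(question, ground_truth, max_turns=3):
    trajectory, feedback = [], ""
    for _ in range(max_turns):
        answer = generate_answer(question, feedback)
        reward, feedback = math_env_step(question, answer, ground_truth)
        trajectory.append((answer, reward, feedback))
        if reward > 0:
            break
    return trajectory  # fed back to the RL trainer as one multi-turn sample

print(rollout("What is 6 * 7?", "42"))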

Getting Started

Obtain the training data:

For code training data, a simple preprocessing script is provided in examples/data_preprocess/preprocess_training_data.py:

python3 preprocess_training_data.py --data_path $original_data_path --output_path $training_data_path

Train Qwen3 1.7B locally (Remember to modify dataset.path in the script below):

bash examples/run_async_ppo.sh

Evaluation:

cd evaluation
# Evaluate the model
python eval_and_aggregate.py \
--model_path ${MODEL_PATH} \
--output_path ${OUTPUT_PATH} \
--data_names aime24,aime25 \
--max_gen_tokens 32768 \
--data_names codeforces,lcb_v5 \
--prompt_type qwen3-think-pure \
--temperature 1.0

Resources

Quickstart

Benchmark and Reproduction

Customization Guide

System Code Walkthrough

Future Plan

AReaL is under active development. We plan to have minor releases weekly and major releases monthly. Community engagement and contributions are extremely welcome. We are also hiring interns and full-time employees with open positions in both the US and China.

For the research and development plan already in place, please see the following list:

System Development

  • Support for SGLang
  • RL training with coding problems
  • Asynchronous generation and RL training
  • Optimizations for distributed training: expert parallel for MOE and zero-bubble pipelining
  • RL for vision-language models (VLM)
  • Multi-turn agentic RL
  • Function calling and tool use

Algorithm Development

  • RL training recipes for 1.5B and 7B models
  • A complete RL training recipe for 32B models
  • Sample-efficient multi-task RL algorithms
  • Agentic capabilities with end-to-end RL
  • Stable RL training for larger MOE models

Acknowledgement

We would like to note that major contributors are from the RL Lab at Ant Research and the Institute for Interdisciplinary Information Sciences, Tsinghua University.

Our team has also received invaluable assistance from the Data Intelligence Lab at Ant Research for data support and from the Super Computing Technology (SCT) team at Ant Group, particularly in the realm of large-scale cluster operations and maintenance.

We also appreciate all the pioneering works from the community, particularly the ReaLHF project from OpenPsi Inc. and other projects, including but not limited to DeepScaleR, Open-Reasoner-Zero, OpenRLHF, VeRL, SGLang, QwQ, Light-R1 and DAPO.

Citation

@inproceedings{mei2025real,
author = {Mei, Zhiyu and Fu, Wei and Li, Kaiwei and Wang, Guangju and Zhang, Huanchen and Wu, Yi},
title = {ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation},
booktitle = {Proceedings of the Eighth Conference on Machine Learning and Systems,
MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025},
publisher = {mlsys.org},
year = {2025},
}
@misc{fu2025areal,
title={AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning},
author={Wei Fu and Jiaxuan Gao and Xujie Shen and Chen Zhu and Zhiyu Mei and Chuyi He and Shusheng Xu and Guo Wei and Jun Mei and Jiashu Wang and Tongkai Yang and Binhang Yuan and Yi Wu},
year={2025},
eprint={2505.24298},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.24298},
}

PromptCoT & PromptCoT-Mamba: Advancing the Frontiers of Reasoning

· 5 min read
inclusionAI
Ant Group

News

  • May 30, 2025: PromptCoT-Mamba released! Introducing an attention-free foundation model for reasoning tasks.
  • Apr 11, 2025: PromptCoT-QwQ-32B model and its training data released, achieving new state-of-the-art results.
  • Mar 7, 2025: PromptCoT project launched, including the problem generation model, distilled models (PromptCoT-DS series), and associated datasets.

Overview

This repository unifies two synergistic projects aimed at advancing the frontiers of mathematical and code reasoning in Large Language Models (LLMs): PromptCoT and PromptCoT-Mamba.

PromptCoT (Synthesizing Olympiad-Level Problems for Mathematical Reasoning in Large Language Models) addresses the critical challenge of acquiring high-quality, complex problems for training advanced LLMs. It introduces a novel methodology to systematically generate Olympiad-level mathematical problems by modeling the rationale behind expert problem design. This approach not only enhances problem diversity and difficulty but also ensures logical consistency in problem construction, providing a scalable solution for creating robust training datasets.

PromptCoT-Mamba (Scaling Reasoning without Attention) leverages the problem generation capabilities of the PromptCoT pipeline to train PromptCoT-Mamba-7B, the first attention-free foundation model based on the Mamba-2 architecture. This model demonstrates that structured training curricula can enable attention-free models to surpass strong Transformer baselines on a wide array of competition-level math and code reasoning tasks, all while maintaining constant-memory inference without KV caching.

Together, these projects offer a powerful suite of tools, models, and datasets for researchers and developers working on the cutting edge of AI reasoning.


Highlights & Key Results

1. PromptCoT: Problem Generation & Distilled Models

  • ✨ The Missing Piece for Test-Time Scaling: A lightweight yet powerful problem generation model enabling the construction of prompt sets at any scale with sufficient quality, perfect for SFT or RL post-training.
  • 📖 A Fully Open Project: All models (generation, distilled LLMs) and datasets (generation inputs, SFT data) are open-sourced.
  • 🏆 Superior Performance of Distilled Models:
    • PromptCoT-DS-7B consistently surpasses its base model, DeepSeek-R1-Distill-Qwen-7B, with significant gains:
      • +0.9% on MATH-500 (93.7%)
      • +3.2% on AIME2024 (58.7%)
      • +9.2% on AIME2025 (49.2%)
    • PromptCoT-DS-7B (7B parameters) achieves results comparable to larger 32B models like S1-32B and LIMO-32B.
    • PromptCoT-QwQ-32B sets a new standard, outperforming other 32B models by a significant margin:
      • MATH-500: 96.7% ± 0.5%
      • AIME2024: 83.8% ± 2.8%
      • AIME2025: 75.4% ± 4.7%
    • PromptCoT-DS-1.5B demonstrates competitive performance against RL-based models purely through distillation.
  • ⚡ Efficiency Without Compromise: PromptCoT-DS-1.5B achieves 40+% AIME scores using over 15× fewer A100 GPU hours compared to models like DeepScaleR-1.5B-Preview.

2. PromptCoT-Mamba: Attention-Free Reasoning

  • 🚀 First Attention-Free SOTA: PromptCoT-Mamba-7B is the first attention-free model (Mamba-2 architecture) to outperform strong Transformer baselines in math and code reasoning.
  • 🧠 Trained with PromptCoT Pipeline: Utilizes a structured, two-stage curriculum with data generated by PromptCoT.
  • 💪 Strong General Performance: PromptCoT-Mamba-7B consistently outperforms 7B-scale Transformer and hybrid Mamba-Transformer baselines.
    • MATH-500: 84.6%
    • AIME 2024: 35.2%
    • AIME 2025: 24.6%
    • Livecodebench: 29.9%
  • 🎯 Math Specialization: The math-specialized variant, PromptCoT-Mamba-Math-7B, further boosts math performance:
    • MATH-500: 88.0%
    • AIME 2024: 42.9% (+7.7% over generalist)
    • AIME 2025: 30.8% (+6.2% over generalist)
  • Inference Efficiency: Offers substantial speedups (e.g., 3.66× faster on 24GB GPU for long sequences) and constant-memory inference, ideal for cost-sensitive or long-context workloads.

Performance Details

PromptCoT Series Performance

| Model | GSM8K | MATH-500 | AIME2024 | AIME2025 |
| --- | --- | --- | --- | --- |
| 🔹 1.5B Models |  |  |  |  |
| DeepSeek-R1-Distill-Qwen-1.5B | - | 83.9% | 28.9% | 28.1% |
| STILL-3-1.5B-preview | - | 85.5% | 39.3% | - |
| DeepScaleR-1.5B-Preview | - | 🟢 87.8% | 🟢 43.1% | 🟢 37.1% |
| PromptCoT-DS-1.5B (ours) | 🟢 87.6% ± 0.5% | 85.3% ± 1.1% | 41.2% ± 6.9% | 36.7% ± 6.2% |
| 🔹 7B Models |  |  |  |  |
| DeepSeek-R1-Distill-Qwen-7B | - | 92.8% | 55.5% | 40.0% |
| Qwen2.5-7B-SimpleRL | - | 82.4% | 26.7% | - |
| OpenThinker-7B | - | 89.6% | 30.0% | 33.3% |
| OpenR1-Qwen-7B | - | 90.6% | 36.7% | 40.0% |
| PromptCoT-DS-7B (ours) | 🔥 92.8% ± 0.5% | 🔥 93.7% ± 0.7% | 🔥 58.7% ± 3.1% | 🔥 49.2% ± 7.9% |
| 🔹 32B Models |  |  |  |  |
| DeepSeek-R1-Distill-Qwen-32B | - | 94.3% | 72.6% | - |
| S1-32B | - | 93.0% | 56.7% | 26.6% |
| LIMO-32B | - | 94.8% | 57.1% | 46.6% |
| QwQ-32B | - | - | 82.1% | 70.8% |
| PromptCoT-QwQ-32B (ours) | 🔥🔥 96.4% ± 0.2% | 🔥🔥 96.7% ± 0.5% | 🔥🔥 83.8% ± 2.8% | 🔥🔥 75.4% ± 4.7% |

PromptCoT-Mamba Performance

General Performance:

| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | Livecodebench |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PromptCoT-Mamba-7B | 84.6 | 🔥🔥 35.2 | 🔥🔥 24.6 | 50.7 | 81.7 | 75.0 | 🔥🔥 29.9 |
| Gemma3-27B | 89.0 | 32.6 | 24.0 | 54.2 | 86.0 | 78.0 | 26.9 |
| Gemma3-12B | 83.8 | 22.9 | 19.2 | 49.9 | 81.1 | 73.2 | 22.2 |
| Sky-T1-7B | 85.0 | 19.2 | 19.2 | 49.2 | 41.5 | 37.2 | 18.3 |
| S1.1-7B | 82.0 | 19.2 | 17.5 | 43.1 | 64.0 | 56.7 | 13.3 |
| Bespoke-Stratos-7B | 81.2 | 18.3 | 16.3 | 45.0 | 73.2 | 68.3 | 8.6 |
| Nemotron-H-8B | 77.6 | -- | -- | -- | 79.3 | 74.4 | -- |
| M1-3B | 81.7 | 23.0 | 22.0 | 43.6 | -- | -- | -- |

Math Specialization vs. Generalist:

| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | Livecodebench |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PromptCoT-Mamba-Math-7B | 🔥🔥 88.0 | 🔥🔥 42.9 | 🔥🔥 30.8 | 🔥🔥 52.1 | 71.3 | 66.5 | 20.3 |
| PromptCoT-Mamba-7B | 84.6 | 35.2 | 24.6 | 50.7 | 81.7 | 75.0 | 29.9 |

Citation

If you find PromptCoT or PromptCoT-Mamba useful in your research, please consider citing the respective papers:

For PromptCoT:

@article{zhao2025promptcot,
author = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Kong, Lingpeng},
title = {PromptCoT: Synthesizing Olympiad-Level Problems for Mathematical Reasoning in Large Language Models},
year = {2025},
journal = {arXiv preprint arXiv:2503.02324},
url = {http://arxiv.org/abs/2503.02324}
}

For PromptCoT-Mamba:

@article{zhao2025scaling,
author = {Xueliang Zhao and Wei Wu and Lingpeng Kong},
title = {Scaling Reasoning without Attention},
journal = {arXiv preprint arXiv:2505.22425},
year = {2025},
url = {https://arxiv.org/abs/2505.22425}
}

Ring: A Reasoning MoE LLM Provided and Open-sourced by InclusionAI

· 2 min read
inclusionAI
Ant Group

🤗 Hugging Face  |  🤖 ModelScope

News

  • [2025-06]:🎉 Add Ring-lite Model
  • [2025-04]:🎉 Add Ring-lite-linear-preview Model

Introduction

Ring is a reasoning MoE LLM provided and open-sourced by InclusionAI, derived from Ling. We introduce Ring-lite-distill-preview, which has 16.8 billion parameters with 2.75 billion activated parameters. This model demonstrates impressive reasoning performance compared to existing models in the industry.

Model Downloads

The following table lists the model variants and their key parameters so you can choose the one that fits your use case. If you are located in mainland China, we also provide the models on ModelScope.cn to speed up the download process.

| Model | #Total Params | #Activated Params | Context Length | Download |
| --- | --- | --- | --- | --- |
| Ring-lite-distill-preview | 16.8B | 2.75B | 64K | 🤗 HuggingFace / 🤖 ModelScope |
| Ring-lite | 16.8B | 2.75B | 128K | 🤗 HuggingFace / 🤖 ModelScope |

Quickstart

🤗 Hugging Face Transformers

Here is a code snippet to show you how to use the chat model with transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ring-lite"

model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
{"role": "system", "content": "You are Ring, an assistant created by inclusionAI"},
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
**model_inputs,
max_new_tokens=8192
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

🤖 ModelScope

If you're in mainland China, we strongly recommend using our model from 🤖 ModelScope.

Deployment

Please refer to Ling

Finetuning

Please refer to Ling

License

This code repository is licensed under the MIT License.

Citation

[TBD]