GitHub | 📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-lite-omni is a light version of Ming-omni. It is derived from Ling-lite and has 2.8 billion activated parameters. Ming-lite-omni is a unified multimodal model that can process images, text, audio, and video, and shows strong capability in speech and image generation. Ming-lite-omni uses dedicated encoders to extract tokens from the different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design lets a single model efficiently process and fuse multimodal inputs within a unified framework, supporting diverse tasks without separate models, task-specific fine-tuning, or structural modifications. Importantly, Ming-lite-omni goes beyond conventional multimodal models by supporting both audio and image generation: it integrates an advanced audio decoder for natural speech and uses Ming-Lite-Uni for high-quality image generation, which also enables context-aware chat, text-to-speech, and versatile image editing. Our experimental results show that Ming-lite-omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-lite-omni is, to the best of our knowledge, the first open-source model whose modality support matches GPT-4o, and we release all code and model weights to encourage further research and development in the community.

📌 Updates

  • [2025.06.12] 🔥 Our technical report has been publicly released on arXiv.
  • [2025.05.28] 🔥 Official release of Ming-lite-omni, with better performance and support for image generation.
  • [2025.05.04] 🔥 Released the preview version of Ming-lite-omni: Ming-lite-omni-Preview.

Key Features

  • Unified omni-modal perception: Ming-lite-omni is built on Ling, an MoE-architecture large language model, and resolves task conflicts through modality-specific routers, ensuring coherent fusion of tokens from different modalities (a minimal illustrative sketch follows this list).

  • Unified perception and generation: Ming-lite-omni achieves unified understanding and generation, enabling the model to interpret multimodal instructions and user intent during generation, which improves generation quality and makes multi-task usage more convenient.

  • Innovative generative capabilities: Ming-lite-omni can perceive all modalities while generating high-quality text, real-time speech, and vivid images, delivering excellent cross-modal performance on diverse tasks including image perception, audio-visual interaction, and image generation.
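
The modality-specific router is only described at a high level here; as an illustration, the sketch below shows one way tokens could be routed with a separate gate per modality. The class name ModalityAwareRouter, the gating scheme, and all dimensions are hypothetical and are not the actual Ming/Ling implementation.

import torch
import torch.nn as nn

class ModalityAwareRouter(nn.Module):
    """Hypothetical per-modality MoE gate, for illustration only (not the Ling implementation)."""

    def __init__(self, hidden_size: int, num_experts: int, modalities=("text", "image", "audio", "video")):
        super().__init__()
        # One gating network per modality, so routing statistics are learned separately per modality.
        self.gates = nn.ModuleDict({m: nn.Linear(hidden_size, num_experts) for m in modalities})

    def forward(self, hidden_states: torch.Tensor, modality: str, top_k: int = 2):
        # Score experts with the gate belonging to this token's modality, then keep the top-k experts.
        logits = self.gates[modality](hidden_states)                       # (batch, seq, num_experts)
        weights, expert_ids = torch.topk(logits.softmax(dim=-1), k=top_k)  # both (batch, seq, top_k)
        return weights, expert_ids


router = ModalityAwareRouter(hidden_size=2048, num_experts=8)
weights, expert_ids = router(torch.randn(1, 16, 2048), modality="image")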

Evaluation

Ming-lite-omni demonstrates strong cross-modal performance on image perception, audio-visual interaction, and image-generation tasks. Specifically, on image perception tasks Ming-lite-omni matches the performance of Qwen2.5-VL-7B while activating only 2.8 billion parameters. It outperforms Qwen2.5-Omni and Kimi-Audio on end-to-end speech understanding and instruction following. It also supports native-resolution image generation, editing, and style transfer, reaching a GenEval score of 0.64 and surpassing mainstream models such as SDXL. On the FID metric, Ming-lite-omni achieves 4.85, setting a new state of the art among existing methods.

Image benchmark

| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
|---|---|---|---|
| AI2D | 83.1 | 84.4 | 84.5 |
| HallusionBench | 55.0 | 55.8 | 51.7 |
| MMBench_TEST_V11 | 80.8 | 82.8 | 82.0 |
| MMMU | 56.3 | 56.6 | 54.8 |
| MMStar | 64.7 | 65.3 | 65.2 |
| MMVet | 71.3 | 71.6 | 68.1 |
| MathVista | 71.6 | 68.1 | 67.9 |
| OCRBench | 88.4 | 87.8 | 88.2 |
| Average | 71.4 | 71.5 | 70.3 |

Encyclopedia Benchmarks

| Object Recognition | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| Plants | 54.96 | 47.8 |
| Animals | 56.7 | 50.85 |
| Vehicles | 41.91 | 42.29 |
| Food & Ingredients | 62.28 | 54.09 |
| Dishes | 44.3 | 39.07 |
| General | 91.08 | 92.42 |
| Average | 58.54 | 54.43 |

Video benchmark

| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| VideoMME | 67.0 | 67.3 |
| MVBench | 67.7 | 67.4 |
| Video-MMMU | 46.3 | 47.4 |
| LongVideoBench | 56.6 | 54.7 |
| Average | 59.4 | 59.2 |
Note: All models are evaluated based on 128 uniformly sampled frames.
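
For reference, "uniformly sampled" means selecting frame indices evenly spaced over the whole video. The helper below is a minimal sketch of such sampling; it is not the exact code used for evaluation.

import numpy as np

def uniform_frame_indices(total_frames: int, num_samples: int = 128) -> np.ndarray:
    # Evenly spaced frame indices covering the whole clip (first and last frame included).
    return np.linspace(0, total_frames - 1, num=num_samples).round().astype(int)

uniform_frame_indices(total_frames=3000)  # 128 indices spread evenly across a 3000-frame video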

Audio benchmark

SpeechQA

| Model | Average | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
|---|---|---|---|---|---|---|---|---|
| Qwen2-Audio-chat | 3.545 | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Baichuan-Audio | 3.695 | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
| GLM-4-Voice | 3.77 | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Kimi-Audio | 4.215 | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.21 | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| Ming-lite-omni | 4.34 | 4.63 | 4.06 | 58.84 | 47.53 | 61.98 | 58.36 | 99.04 |

ASR

| Model | aishell1 | aishell2_android | aishell2_ios | cv15_zh | fleurs_zh | wenetspeech_meeting | wenetspeech_net | librispeech_test_clean | librispeech_test_other | multilingual_librispeech | cv15_en | fleurs_en | voxpopuli_v1.0_en |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ming-lite-omni | 1.47 | 2.55 | 2.52 | 6.31 | 2.96 | 5.95 | 5.46 | 1.44 | 2.80 | 4.15 | 6.89 | 3.39 | 5.80 |
| Qwen2.5-Omni | 1.18 | 2.75 | 2.63 | 5.20 | 3.00 | 5.90 | 7.70 | 1.80 | 3.40 | 7.56 | 7.60 | 4.10 | 5.80 |
| Qwen2-Audio | 1.53 | 2.92 | 2.92 | 6.90 | 7.50 | 7.16 | 8.42 | 1.60 | 3.60 | 5.40 | 8.60 | 6.90 | 6.84 |
| Kimi-Audio | 0.60 | 2.64 | 2.56 | 7.21 | 2.69 | 6.28 | 5.37 | 1.28 | 2.42 | 5.88 | 10.31 | 4.44 | 7.97 |

Information-Seeking Benchmark

| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
|---|---|---|---|
| GPT-4o | 36.05 | - | - |
| PaLI-X | 22.06 | 23.5 | 20.8 |
| Qwen2.5-vl-32B | 19.35 | 20.55 | 18.28 |
| Ming-lite-omni | 27.7 | 30.4 | 25.4 |

OCR

| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| ChartQA_TEST | 85.1 | 87.3 |
| DocVQA_TEST | 93 | 95.7 |
| OCRBenchV2_en/zh | 53.3/52 | 56.3/57.2 |
| OmniDocBench↓ | 34/34.4 | 30.8/39.8 |
| TextVQA_VAL | 82.8 | 84.9 |

GUI

| Benchmarks | Ming-lite-omni | InternVL3-8B | Qwen2.5-VL-7B-Instruct |
|---|---|---|---|
| ScreenSpot | 82.1 | 79.5 | 78.9* |
| ScreenSpot-V2 | 84.1 | 81.4 | - |
| AITZ(EM) | 66.6 | - | 57.6* |
Note: * denotes the reproduced results.

Unified Generation Benchmark

| Model | single_object | two_object | counting | colors | position | color_attr | GENEVAL | DPGBench | FID↓ |
|---|---|---|---|---|---|---|---|---|---|
| Ming-lite-omni | 0.9875 | 0.7727 | 0.6812 | 0.7872 | 0.31 | 0.29 | 0.64 | 81.72 | 4.85 |
| Metaquery-XL | - | - | - | - | - | - | 0.61 | 82.05 | 6.02 |
| SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 | 68.09 | 26.96 |
| Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 | 80.60 | - |
| SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 74.65 | 8.76 |
| Janus | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 | 79.68 | 10.10 |
| JanusFlow | - | - | - | - | - | - | 0.63 | 80.09 | 9.51 |

Please refer to our technical report for more comprehensive evaluation results.

Model Downloads

You can download the model from either Huggingface or ModelScope.

| Model | Input Modalities | Output Modalities | Download |
|---|---|---|---|
| Ming-Lite-Omni | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace / 🤖 ModelScope |

If you are in mainland China, we strongly recommend downloading the model from 🤖 ModelScope.
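
As an example, assuming huggingface_hub (or a recent modelscope) is installed, the weights can be fetched with snapshot_download; the local directory name below is only a suggestion:

# Download from Hugging Face.
from huggingface_hub import snapshot_download
snapshot_download(repo_id="inclusionAI/Ming-Lite-Omni", local_dir="inclusionAI/Ming-Lite-Omni")

# ModelScope alternative (recommended in mainland China; local_dir requires a recent modelscope release):
# from modelscope import snapshot_download
# snapshot_download("inclusionAI/Ming-Lite-Omni", local_dir="inclusionAI/Ming-Lite-Omni")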

Environment Setup

Installation with pip

pip install -r requirements.txt
# for python 3.10
pip install data/matcha_tts-0.0.5.1-cp310-cp310-linux_x86_64.whl 
# for python 3.8 
# pip install data/matcha_tts-0.0.5.1-cp38-cp38-linux_x86_64.whl
pip install diffusers==0.33.0
pip install nvidia-cublas-cu12==12.4.5.8  # for H20 GPU
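
Optionally, a quick sanity check (not part of the official setup) can confirm that CUDA-enabled PyTorch and the pinned diffusers version are importable:

python -c "import torch, diffusers; print(torch.__version__, torch.cuda.is_available(), diffusers.__version__)"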

Installation with docker

You can also initialize the environment by building the docker image. First clone this repository:

git clone --depth 1 https://github.com/inclusionAI/Ming.git
cd Ming

Then build the docker image with the provided Dockerfile in docker/docker-py310-cu121. This step might take a while:

docker build -t ming:py310-cu121 docker/docker-py310-cu121

Finally, start the container with the current repo directory mounted:

docker run -it --gpus all -v "$(pwd)":/workspace/Ming ming:py310-cu121 /bin/bash

You can then run the model through the Python interface. You may download the Hugging Face model into the repo directory (.../Ming/) first, or mount the downloaded model path when starting the container.
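
For example, if the weights were downloaded to /data/models/Ming-Lite-Omni on the host (a hypothetical path), an extra mount makes them visible at the relative path the usage example below expects:

docker run -it --gpus all \
  -v "$(pwd)":/workspace/Ming \
  -v /data/models/Ming-Lite-Omni:/workspace/Ming/inclusionAI/Ming-Lite-Omni \
  ming:py310-cu121 /bin/bash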

Example Usage

We provide a step-by-step running example:

Step 1 - Download the source code

git clone https://github.com/inclusionAI/Ming.git 
cd Ming

Step 2 - Download the model weights and create a soft link to the source code directory

Download our model as described in Model Downloads.

mkdir inclusionAI 
ln -s /path/to/inclusionAI/Ming-Lite-Omni inclusionAI/Ming-Lite-Omni

Step 3 - Enter the code directory; you can refer to the following code to run the Ming-Lite-Omni model.

jupyter notebook cookbook.ipynb

We also provide a simple example of how to use this repo below. For detailed usage, please refer to cookbook.ipynb.

import torch
from transformers import AutoProcessor, GenerationConfig
from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# load model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    "inclusionAI/Ming-Lite-Omni",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True
).to("cuda")

# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True)

# qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
        ],
    },
]

# 1. Format inputs using chat template
text = processor.apply_chat_template(messages, add_generation_prompt=True)

# 2. Extract vision/audio data
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)

# 3. Prepare tensor inputs
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    audios=audio_inputs,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
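# Cast vision/audio feature tensors to bfloat16 so they match the model's weight dtype.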
for k in inputs.keys():
    if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
        inputs[k] = inputs[k].to(dtype=torch.bfloat16)

# 4. Configure generation
generation_config = GenerationConfig.from_dict({'no_repeat_ngram_size': 10})
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=True,
    eos_token_id=processor.gen_terminator,
    generation_config=generation_config,
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

# 5. Decode output
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
# Output:

# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍:
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。
# ......

Note: We tested the examples on NVIDIA H800-80GB/H20-96G hardware with CUDA 12.4. Loading inclusionAI/Ming-Lite-Omni in bfloat16 takes about 62 GB of GPU memory.
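
If a single GPU cannot hold roughly 62 GB, the transformers device_map="auto" option (with accelerate installed) may shard the weights across several GPUs. Whether the custom BailingMMNativeForConditionalGeneration class supports this dispatch has not been verified here, so treat the snippet below as an untested sketch:

# Untested sketch: shard the model across available GPUs instead of a single .to("cuda").
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    "inclusionAI/Ming-Lite-Omni",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    device_map="auto",
)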

License and Legal Notice

This code repository is licensed under the MIT License. For the legal notice, see the LEGAL.md file in the project's root directory.

Citation

If you find our work helpful, please feel free to cite it.


@misc{Mingomni2025,
      title  = {Ming-Omni: A Unified Multimodal Model for Perception and Generation}, 
      author = {Inclusion AI},
      year = {2025},
      eprint = {2506.09344},
      archivePrefix = {arXiv},
      url = {https://arxiv.org/abs/2506.09344}
}