GitHub | 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-Lite-Omni-Preview is built upon Ling-Lite, a MoE model designed to perceive a wide range of modalities, including text, images, audio, and video, and to generate text and natural speech in a streaming manner. To handle these diverse modalities naturally, we have enhanced Ling-Lite with modality-specific routers for each modality. As a result, Ming-Lite-Omni-Preview excels at handling information from diverse modalities and is highly scalable.
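
For intuition, the sketch below shows what per-modality routing in an MoE layer can look like. It is a minimal, illustrative toy: the class ModalityAwareMoE, its parameters, and the top-k gating scheme are assumptions made for this example, not the actual Ling-Lite/Ming implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAwareMoE(nn.Module):
    """Toy MoE layer with one router (gating network) per modality.

    All modalities share the same expert FFNs, but tokens of each modality
    are dispatched to experts by that modality's own router.
    """

    def __init__(self, hidden_size=512, num_experts=8, top_k=2,
                 modalities=("text", "image", "audio", "video")):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        )
        # One gating network per modality: the "modality-specific router".
        self.routers = nn.ModuleDict(
            {m: nn.Linear(hidden_size, num_experts) for m in modalities}
        )

    def forward(self, x, modality):
        # x: (num_tokens, hidden_size) tokens that all belong to `modality`.
        gate_logits = self.routers[modality](x)                  # (T, E)
        weights, expert_ids = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                     # (T, k)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Tokens of different modalities pass through the same experts but their own gate.
layer = ModalityAwareMoE()
y_text = layer(torch.randn(16, 512), "text")
y_image = layer(torch.randn(64, 512), "image")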

Key Features

  • Omni and Novel MoE Architecture: An innovative Omni architecture based on Mixture of Experts (MoE) that achieves competitive performance across multiple modality benchmarks.

  • Video understanding: Supports KV-Cache dynamic compression of visual tokens, enabling understanding of hour-long videos as well as more detailed understanding of short clips lasting only a few seconds (see the illustrative sketch after this list).

  • Natural Speech Generation and Fine-grained Voice Dialogue: Supports dialect understanding and generation in end-to-end conversations, enables one-shot voice cloning, and enhances prosody through audio tokenizer compression.
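
As a rough intuition for the visual-token compression mentioned in the video-understanding bullet above, the sketch below average-pools groups of adjacent frame tokens before they would enter the KV cache, trading per-frame detail for a smaller cache on long videos. The function name, the pooling strategy, and the ratio knob are assumptions made purely for illustration; they do not describe the model's actual compression mechanism.

import torch

def compress_visual_tokens(frame_tokens: torch.Tensor, ratio: int = 4) -> torch.Tensor:
    # Illustrative only: merge every `ratio` adjacent visual tokens into one
    # by average pooling, so long videos contribute fewer entries to the KV cache.
    num_tokens, hidden = frame_tokens.shape
    pad = (-num_tokens) % ratio
    if pad:
        frame_tokens = torch.cat([frame_tokens, frame_tokens.new_zeros(pad, hidden)], dim=0)
    return frame_tokens.view(-1, ratio, hidden).mean(dim=1)

# A long video can be cached aggressively (ratio=4); a short clip keeps full detail (ratio=1).
print(compress_visual_tokens(torch.randn(1024, 512), ratio=4).shape)  # torch.Size([256, 512])
print(compress_visual_tokens(torch.randn(256, 512), ratio=1).shape)   # torch.Size([256, 512])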

Evaluation

Image benchmark

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
|---|---|---|---|
| AI2D | 83.84 | 83.9 | 84.5 |
| HallusionBench | 54.68 | 51.9 | 51.7 |
| MMBench_TEST_V11 | 79.63 | 84.3 | 82.0 |
| MMMU | 57.0 | 58.6 | 54.8 |
| MMStar | 62.0 | 63.9 | 65.2 |
| MMVet | 73.6 | 67.1 | 68.1 |
| MathVista | 69.0 | 68.2 | 67.9 |
| OCRBench | 87.9 | 86.4 | 88.2 |
| Average | 70.96 | 70.5 | 70.3 |

Object Recognition

| Object Recognition | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B | InternVL-2.5-8B |
|---|---|---|---|
| Plants | 52.1 | 55.3 | 32.8 |
| Animals | 52.6 | 54.8 | 36.5 |
| Home appliances & furniture | 93.5 | 97.4 | 90.9 |
| Personal Electronics | 96.1 | 95.1 | 93.2 |
| Food & Ingredients | 57.5 | 60.0 | 48.7 |
| Tableware | 96.6 | 94.9 | 88.1 |
| Vehicles | 31.9 | 40.9 | 31.9 |
| Average | 68.6 | 71.2 | 60.3 |

Video benchmark

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5VL-7B |
|---|---|---|
| VideoMME wo/w sub. | 63.9/67.6 | 65.1/71.6 |
| MVBench | 67.0 | 72.0 |
| Video-MMMU | 45.4 | 47.44 |
| LongVideoBench | 53.7 | 60.0 |

Audio benchmark

SpeechQA

| Model | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
|---|---|---|---|---|---|---|---|
| Qwen2-Audio-chat | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Baichuan-Audio | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
| GLM-4-Voice | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Kimi-Audio | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| Ming-Lite-Omni-Preview | 4.25 | 3.88 | 58.95 | 46.06 | 60.00 | 46.71 | 96.53 |

ASR

| Model | Aishell-1 | Aishell-2 ios | Wenetspeech test-net | Wenet test-meeting | Librispeech test-clean | Librispeech test-other |
|---|---|---|---|---|---|---|
| Whisper Large-v3 | 5.14 | 4.76 | 9.68 | 18.54 | 1.9 | 3.65 |
| Qwen2-Audio | 1.53 | 3.06 | 7.72 | 8.4 | 1.6 | 3.6 |
| GLM-4-voice Base | 2.46 | - | - | - | 2.82 | 7.66 |
| Baichuan-Omni-1.5 | - | - | 6.9 | 8.4 | - | - |
| Qwen2.5-Omni | 1.18 | 2.36 | 5.9 | 7.7 | 1.8 | 3.4 |
| Ming-Lite-Omni-Preview | 1.62 | 2.82 | 6.23 | 6.9 | 2.34 | 5.74 |

Knowledge

| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
|---|---|---|---|
| GPT-4o | 36.05 | - | - |
| PaLI-X | 22.06 | 23.5 | 20.8 |
| Qwen2.5-vl-32B | 19.35 | 20.55 | 18.28 |
| Ming-Lite-Omni-Preview | 27.3 | 28.9 | 25.9 |

OCR&GUI

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| ChartQA_TEST | 85.2 | 87.3 |
| DocVQA_TEST | 93.2 | 95.7 |
| OCRBenchV2_en/zh | 52.2/51.6 | 56.3/57.2 |
| OmniDocBench↓ | 34.7/34.5 | 30.8/39.8 |
| TextVQA_VAL | 82.36 | 84.9 |
| ScreenSpot | 79.3 | 84.7 |

Model Downloads

You can download the model from both Hugging Face and ModelScope.

| Model | Input modality | Output modality | Download |
|---|---|---|---|
| Ming-Lite-Omni-Preview | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace / 🤖 ModelScope |
If you're in mainland China, we strongly recommend downloading our model from 🤖 ModelScope.
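
If you prefer to fetch the weights from a script, the snippet below uses snapshot_download from huggingface_hub. The repo id is taken from the Quickstart section below; replace it with the id linked in the table above if it differs. ModelScope offers an analogous modelscope.snapshot_download.

from huggingface_hub import snapshot_download

# Repo id taken from the Quickstart section; swap in the link from the table
# above if your model lives at a different id.
local_dir = snapshot_download(
    repo_id="inclusionAI/Ming-Lite-Omni",
    local_dir="./Ming-Lite-Omni-Preview",
)
print(local_dir)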

Use Cases

Video-Audio-QA

Example multimodal (video + audio) input and QA:

Q: <audio> (audio content: 请描述视频内容。 "Please describe the video content.")
A: The video features a woman performing a series of yoga poses on a rooftop with a scenic view of mountains and a clear blue sky.
Q: Is there any food in front of me?
A: Yes, there’s candy on the table.

Speech2Speech (supports dialects)

Quickstart

Please download our model following Model Downloads, then refer to the following code to run the Ming-Lite-Omni-Preview model.

import os
import torch
from transformers import AutoProcessor
from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# build model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    "inclusionAI/Ming-Lite-Omni",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True
).to("cuda")

assets_path = YOUR_ASSETS_PATH

# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True)
# qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
        ],
    },
]
# Output:

# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍:
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。
# ......
# image qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join(assets_path, "flowers.jpg")},
            {"type": "text", "text": "What kind of flower is this?"},
        ],
    },
]
# Output:

# The flowers in this image are forget-me-nots. These delicate blooms are known for their small, five-petaled flowers that come in various shades of blue, pink, and white.

To enable thinking before responding, add the following system prompt before your question:

cot_prompt = "SYSTEM: You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <thinking>...</thinking> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}.\n"
# And your input message should be like this:
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join(assets_path, "reasoning.png")},
            {"type": "text", "text": cot_prompt + "In the rectangle $A B C D$ pictured, $M_{1}$ is the midpoint of $D C, M_{2}$ the midpoint of $A M_{1}, M_{3}$ the midpoint of $B M_{2}$ and $M_{4}$ the midpoint of $C M_{3}$. Determine the ratio of the area of the quadrilateral $M_{1} M_{2} M_{3} M_{4}$ to the area of the rectangle $A B C D$.\nChoices:\n(A) $\frac{7}{16}$\n(B) $\frac{3}{16}$\n(C) $\frac{7}{32}$\n(D) $\frac{9}{32}$\n(E) $\frac{1}{5}$"},
        ],
    },
]
# Output:
# \<think\>\nOkay, so I have this problem about a rectangle ABCD ... (thinking process omitted) ... So, the correct answer is C.\n\</think\>\n\<answer\>\\boxed{C}\</answer\>\n\n
# video qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "video", "video": os.path.join(assets_path, "yoga.mp4")},
            {"type": "text", "text": "What is the woman doing?"},
        ],
    },
]
# Output:

# The image shows a woman performing a yoga pose on a rooftop. She's in a dynamic yoga pose, with her arms and legs extended in various positions.
# multi-turn chat
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "中国的首都是哪里?"},
        ],
    },
    {
        "role": "ASSISTANT",
        "content": [
            {"type": "text", "text": "北京"},
        ],
    },
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "它的占地面积是多少?有多少常住人口?"},
        ],
    },
]
# Output:

# 北京市的总面积约为16,410.54平方公里,常住人口约为21,542,000人。
# Preparation for inference
text = processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    audios=audio_inputs,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
for k in inputs.keys():
    if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
        inputs[k] = inputs[k].to(dtype=torch.bfloat16)

# call generate
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=False,
    eos_token_id=processor.gen_terminator,
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
# ASR
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "Please recognize the language of this speech and transcribe it. Format: oral."},
            {"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
        ],
    },
]
outputs = model.generate(messages, max_new_tokens=512)
print(outputs)
# speech2speech
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
        ],
    },
]
outputs = model.generate(messages, max_new_tokens=512, speaker='luna', output_audio_path='out.wav', output_audio=True)
print(outputs)

This code repository is licensed under the MIT License, and the Legal Disclaimer is located in the LEGAL.md file under the project’s root directory.