GITHUB 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-Lite-Omni-Preview is built on Ling-Lite, a MoE (Mixture-of-Experts) model that perceives multiple modalities — text, images, audio, and video — and generates text and natural speech in a streaming fashion. To handle multimodal inputs more naturally, we enhanced Ling-Lite with dedicated routing modules for each modality. As a result, Ming-Omni excels at processing multimodal information and is highly scalable.
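
The per-modality routing can be pictured with a small sketch: each modality keeps its own router over a shared pool of experts, so tokens from different modalities can be dispatched differently without duplicating the expert FFNs. The PyTorch snippet below is a minimal illustration under assumed names (ModalityAwareMoE, routers, top_k); it is not the actual Ling-Lite/Ming-Omni implementation.

import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    # Toy MoE layer with one router per modality (hypothetical structure, for illustration only).
    def __init__(self, hidden_size=1024, num_experts=8, top_k=2,
                 modalities=("text", "image", "audio", "video")):
        super().__init__()
        self.top_k = top_k
        # Shared pool of expert FFNs.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, 4 * hidden_size), nn.GELU(),
                          nn.Linear(4 * hidden_size, hidden_size))
            for _ in range(num_experts)
        )
        # One dedicated router per modality, as described above.
        self.routers = nn.ModuleDict({m: nn.Linear(hidden_size, num_experts) for m in modalities})

    def forward(self, x, modality):
        # x: (num_tokens, hidden_size); `modality` selects which router scores the tokens.
        weights, idx = torch.topk(self.routers[modality](x).softmax(dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out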

Key Features

  • Omni and Novel MoE Architecture: An innovative Omni architecture based on Mixture of Experts (MoE) that achieves leading performance across multiple multimodal benchmarks.

  • Video understanding: Supports dynamic KV-Cache compression of visual tokens, enabling both comprehension of hours-long videos and fine-grained analysis of clips only a few seconds long (a toy illustration follows this list).

  • Natural Speech Generation and Fine-grained Voice Dialogue: Supports dialect understanding and generation in end-to-end conversations, provides one-shot voice cloning, and enhances prosodic expressiveness through audio-tokenizer compression.
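
As a rough intuition for the dynamic KV-Cache compression of visual tokens mentioned in the video-understanding feature above, the toy function below simply average-pools adjacent cached key/value entries so that a very long video does not blow up the cache. The real mechanism in Ming-Lite-Omni-Preview is dynamic and learned; the function name and fixed ratio here are assumptions for illustration only.

import torch

def compress_visual_kv(keys: torch.Tensor, values: torch.Tensor, ratio: int = 4):
    """Merge every `ratio` adjacent visual tokens in the KV cache by mean pooling.

    keys/values: (num_visual_tokens, head_dim). Returns tensors roughly `ratio`x shorter.
    (Hypothetical helper; not the actual Ming-Lite-Omni implementation.)
    """
    n, d = keys.shape
    pad = (-n) % ratio
    if pad:  # repeat the last token so the length divides evenly
        keys = torch.cat([keys, keys[-1:].expand(pad, d)])
        values = torch.cat([values, values[-1:].expand(pad, d)])
    keys = keys.reshape(-1, ratio, d).mean(dim=1)
    values = values.reshape(-1, ratio, d).mean(dim=1)
    return keys, values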

Evaluation Results

Image benchmark

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
|:-----------|:----------------------:|:----------------------:|:------------------:|
| AI2D | 83.84 | 83.9 | 84.5 |
| HallusionBench | 54.68 | 51.9 | 51.7 |
| MMBench_TEST_V11 | 79.63 | 84.3 | 82.0 |
| MMMU | 57.0 | 58.6 | 54.8 |
| MMStar | 62.0 | 63.9 | 65.2 |
| MMVet | 73.6 | 67.1 | 68.1 |
| MathVista | 69.0 | 68.2 | 67.9 |
| OCRBench | 87.9 | 86.4 | 88.2 |
| Average | 70.96 | 70.5 | 70.3 |

Object Recognition

| Object Recognition | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B | InternVL-2.5-8B |
|:-------------------|:----------------------:|:-------------:|:---------------:|
| Plants | 52.1 | 55.3 | 32.8 |
| Animals | 52.6 | 54.8 | 36.5 |
| Home appliances & furniture | 93.5 | 97.4 | 90.9 |
| Personal Electronics | 96.1 | 95.1 | 93.2 |
| Food & Ingredients | 57.5 | 60.0 | 48.7 |
| Tableware | 96.6 | 94.9 | 88.1 |
| Vehicles | 31.9 | 40.9 | 31.9 |
| Average | 68.6 | 71.2 | 60.3 |

Video benchmark

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5VL-7B |
|:-----------|:----------------------:|:------------:|
| VideoMME wo/w sub. | 63.9/67.6 | 65.1/71.6 |
| MVBench | 67.0 | 72.0 |
| Video-MMMU | 45.4 | 47.44 |
| LongVideoBench | 53.7 | 60.0 |

Audio benchmark

SpeechQA

| Model | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
|:------|:----------:|:----------:|:-----:|:----:|:----------:|:------:|:--------:|
| Qwen2-Audio-chat | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Baichuan-Audio | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
| GLM-4-Voice | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Kimi-Audio | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| Ming-Lite-Omni-Preview | 4.25 | 3.88 | 58.95 | 46.06 | 60.00 | 46.71 | 96.53 |

ASR

| Model | Aishell-1 | Aishell-2 ios | Wenetspeech test-net | Wenet test-meeting | Librispeech test-clean | Librispeech test-other |
|:------|:---------:|:-------------:|:--------------------:|:------------------:|:----------------------:|:----------------------:|
| Whisper Large-v3 | 5.14 | 4.76 | 9.68 | 18.54 | 1.9 | 3.65 |
| Qwen2-Audio | 1.53 | 3.06 | 7.72 | 8.4 | 1.6 | 3.6 |
| GLM-4-voice Base | 2.46 | - | - | - | 2.82 | 7.66 |
| Baichuan-Omni-1.5 | - | - | 6.9 | 8.4 | - | - |
| Qwen2.5-Omni | 1.18 | 2.36 | 5.9 | 7.7 | 1.8 | 3.4 |
| Ming-Lite-Omni-Preview | 1.62 | 2.82 | 6.23 | 6.9 | 2.34 | 5.74 |

Knowledge

| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
|:------|:---------------:|:------------------------:|:----------------------:|
| GPT-4o | 36.05 | - | - |
| PaLI-X | 22.06 | 23.5 | 20.8 |
| Qwen2.5-vl-32B | 19.35 | 20.55 | 18.28 |
| Ming-Lite-Omni-Preview | 27.3 | 28.9 | 25.9 |

OCR&GUI

| Model | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct |
|:------|:----------------------:|:----------------------:|
| ChartQA_TEST | 85.2 | 87.3 |
| DocVQA_TEST | 93.2 | 95.7 |
| OCRBenchV2_en/zh | 52.2/51.6 | 56.3/57.2 |
| OmniDocBench↓ | 34.7/34.5 | 30.8/39.8 |
| TextVQA_VAL | 82.36 | 84.9 |
| ScreenSpot | 79.3 | 84.7 |

Model Downloads

You can download the model from both Huggingface and ModelScope.

| Model | Input modality | Output modality | Download |
|:------|:---------------|:----------------|:---------|
| Ming-Lite-Omni-Preview | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace / 🤖 ModelScope |
If you are in mainland China, we strongly recommend downloading the model from 🤖 ModelScope.
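
For example, a programmatic download can look like the sketch below. The repo id follows the one used in the Quick Start code and may differ for the Preview checkpoint; adjust it to the id shown on the model page.

# Download from the Hugging Face Hub (repo id taken from the Quick Start code; verify it on the model page).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="inclusionAI/Ming-Lite-Omni")

# Or, from mainland China, via ModelScope:
# from modelscope import snapshot_download
# local_dir = snapshot_download("inclusionAI/Ming-Lite-Omni")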

Use Cases

Video and Audio QA

(Multimodal input: a video with an accompanying audio question; media omitted here.)
Q: <audio> (audio content: "Please describe the video content.")
A: The video features a woman performing a series of yoga poses on a rooftop with a scenic view of mountains and a clear blue sky.
Q: Is there any food in front of me?
A: Yes, there’s candy on the table.

Speech-to-Speech (with dialect support)

Quick Start

Please download the model by following Model Downloads, then refer to the code below to run the Ming-Lite-Omni-Preview model.

import os
import torch
from transformers import AutoProcessor
from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# build model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    "inclusionAI/Ming-Lite-Omni",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True
).to("cuda")

assets_path = YOUR_ASSETS_PATH

# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True)
# text qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
        ],
    },
]
# Output:

# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍:
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。
# ......
# image qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join(assets_path, "flowers.jpg")},
            {"type": "text", "text": "What kind of flower is this?"},
        ],
    },
]
# Output:

# The flowers in this image are forget-me-nots. These delicate blooms are known for their small, five-petaled flowers that come in various shades of blue, pink, and white.

To enable thinking before the response, add the following system prompt before your question:

cot_prompt = "SYSTEM: You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <thinking>...</thinking> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}.\n"
# And your input message should be like this:
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join(assets_path, "reasoning.png")},
            {"type": "text", "text": cot_prompt + "In the rectangle $A B C D$ pictured, $M_{1}$ is the midpoint of $D C, M_{2}$ the midpoint of $A M_{1}, M_{3}$ the midpoint of $B M_{2}$ and $M_{4}$ the midpoint of $C M_{3}$. Determine the ratio of the area of the quadrilateral $M_{1} M_{2} M_{3} M_{4}$ to the area of the rectangle $A B C D$.\nChoices:\n(A) $\frac{7}{16}$\n(B) $\frac{3}{16}$\n(C) $\frac{7}{32}$\n(D) $\frac{9}{32}$\n(E) $\frac{1}{5}$"},
        ],
    },
]
# Output:
# <think>\nOkay, so I have this problem about a rectangle ABCD ... (thinking process omitted) ... So, the correct answer is C.\n</think>\n<answer>\\boxed{C}</answer>\n\n
# video qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "video", "video": os.path.join(assets_path, "yoga.mp4")},
            {"type": "text", "text": "What is the woman doing?"},
        ],
    },
]
# Output:

# The image shows a woman performing a yoga pose on a rooftop. She's in a dynamic yoga pose, with her arms and legs extended in various positions.
# multi-turn chat
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "中国的首都是哪里?"},
        ],
    },
    {
        "role": "ASSISTANT",
        "content": [
            {"type": "text", "text": "北京"},
        ],
    },
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "它的占地面积是多少?有多少常住人口?"},
        ],
    },
]
# Output:

# 北京市的总面积约为16,410.54平方公里,常住人口约为21,542,000人。
# Preparation for inference
text = processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    audios=audio_inputs,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
for k in inputs.keys():
    if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
        inputs[k] = inputs[k].to(dtype=torch.bfloat16)

# call generate
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=False,
    eos_token_id=processor.gen_terminator,
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
# ASR
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "Please recognize the language of this speech and transcribe it. Format: oral."},
            {"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
        ],
    },
]
outputs = model.generate(messages, max_new_tokens=512)
print(outputs)
# speech2speech
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
        ],
    },
]
outputs = model.generate(messages, max_new_tokens=512, speaker='luna', output_audio_path='out.wav', output_audio=True)
print(outputs)

License and Legal Disclaimer

This code repository is licensed under the MIT License. See the LEGAL.md file in the project root for the legal disclaimer.