GITHUB 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-Lite-Omni-Preview is built on Ling-Lite, a MoE (Mixture-of-Experts) model that perceives multiple modalities — text, images, audio, and video — and generates text and natural speech in a streaming fashion. To handle multimodal inputs more naturally, we enhanced Ling-Lite with dedicated routing modules for each modality. As a result, Ming-Omni excels at processing multimodal information and is highly scalable.
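
The per-modality routing can be pictured with a small sketch: each modality keeps its own router over a shared pool of experts, so tokens from different modalities can be dispatched differently without duplicating the expert FFNs. The PyTorch snippet below is a minimal illustration under assumed names (ModalityAwareMoE, routers, top_k); it is not the actual Ling-Lite/Ming-Omni implementation.

import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    # Toy MoE layer with one router per modality (hypothetical structure, for illustration only).
    def __init__(self, hidden_size=1024, num_experts=8, top_k=2,
                 modalities=("text", "image", "audio", "video")):
        super().__init__()
        self.top_k = top_k
        # Shared pool of expert FFNs.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, 4 * hidden_size), nn.GELU(),
                          nn.Linear(4 * hidden_size, hidden_size))
            for _ in range(num_experts)
        )
        # One dedicated router per modality, as described above.
        self.routers = nn.ModuleDict({m: nn.Linear(hidden_size, num_experts) for m in modalities})

    def forward(self, x, modality):
        # x: (num_tokens, hidden_size); `modality` selects which router scores the tokens.
        weights, idx = torch.topk(self.routers[modality](x).softmax(dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out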

Key Features

  • Omni and Novel MoE Architecture: An innovative Omni architecture based on Mixture of Experts (MoE) that achieves leading performance across multiple multimodal benchmarks.

  • Video understanding: Supports dynamic KV-Cache compression of visual tokens, enabling both comprehension of hours-long videos and fine-grained analysis of clips only a few seconds long (a toy illustration follows this list).

  • Natural Speech Generation and Fine-grained Voice Dialogue: Supports dialect understanding and generation in end-to-end conversations, provides one-shot voice cloning, and enhances prosodic expressiveness through audio-tokenizer compression.
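
As a rough intuition for the dynamic KV-Cache compression of visual tokens mentioned in the video-understanding feature above, the toy function below simply average-pools adjacent cached key/value entries so that a very long video does not blow up the cache. The real mechanism in Ming-Lite-Omni-Preview is dynamic and learned; the function name and fixed ratio here are assumptions for illustration only.

import torch

def compress_visual_kv(keys: torch.Tensor, values: torch.Tensor, ratio: int = 4):
    """Merge every `ratio` adjacent visual tokens in the KV cache by mean pooling.

    keys/values: (num_visual_tokens, head_dim). Returns tensors roughly `ratio`x shorter.
    (Hypothetical helper; not the actual Ming-Lite-Omni implementation.)
    """
    n, d = keys.shape
    pad = (-n) % ratio
    if pad:  # repeat the last token so the length divides evenly
        keys = torch.cat([keys, keys[-1:].expand(pad, d)])
        values = torch.cat([values, values[-1:].expand(pad, d)])
    keys = keys.reshape(-1, ratio, d).mean(dim=1)
    values = values.reshape(-1, ratio, d).mean(dim=1)
    return keys, values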

Evaluation Results

Image benchmark

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
|:-----------|:----------------------:|:----------------------:|:------------------:|
| AI2D | 83.84 | 83.9 | 84.5 |
| HallusionBench | 54.68 | 51.9 | 51.7 |
| MMBench_TEST_V11 | 79.63 | 84.3 | 82.0 |
| MMMU | 57.0 | 58.6 | 54.8 |
| MMStar | 62.0 | 63.9 | 65.2 |
| MMVet | 73.6 | 67.1 | 68.1 |
| MathVista | 69.0 | 68.2 | 67.9 |
| OCRBench | 87.9 | 86.4 | 88.2 |
| Average | 70.96 | 70.5 | 70.3 |

Object Recognition

| Object Recognition | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B | InternVL-2.5-8B |
|:-------------------|:----------------------:|:-------------:|:---------------:|
| Plants | 52.1 | 55.3 | 32.8 |
| Animals | 52.6 | 54.8 | 36.5 |
| Home appliances & furniture | 93.5 | 97.4 | 90.9 |
| Personal Electronics | 96.1 | 95.1 | 93.2 |
| Food & Ingredients | 57.5 | 60.0 | 48.7 |
| Tableware | 96.6 | 94.9 | 88.1 |
| Vehicles | 31.9 | 40.9 | 31.9 |
| Average | 68.6 | 71.2 | 60.3 |

Video benchmark

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5VL-7B |
|:-----------|:----------------------:|:------------:|
| VideoMME wo/w sub. | 63.9/67.6 | 65.1/71.6 |
| MVBench | 67.0 | 72.0 |
| Video-MMMU | 45.4 | 47.44 |
| LongVideoBench | 53.7 | 60.0 |

Audio benchmark

SpeechQA

| Model | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
|:------|:----------:|:----------:|:-----:|:----:|:----------:|:------:|:--------:|
| Qwen2-Audio-chat | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Baichuan-Audio | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
| GLM-4-Voice | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Kimi-Audio | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| Ming-Lite-Omni-Preview | 4.25 | 3.88 | 58.95 | 46.06 | 60.00 | 46.71 | 96.53 |

ASR

| Model | Aishell-1 | Aishell-2 ios | Wenetspeech test-net | Wenet test-meeting | Librispeech test-clean | Librispeech test-other |
|:------|:---------:|:-------------:|:--------------------:|:------------------:|:----------------------:|:----------------------:|
| Whisper Large-v3 | 5.14 | 4.76 | 9.68 | 18.54 | 1.9 | 3.65 |
| Qwen2-Audio | 1.53 | 3.06 | 7.72 | 8.4 | 1.6 | 3.6 |
| GLM-4-voice Base | 2.46 | - | - | - | 2.82 | 7.66 |
| Baichuan-Omni-1.5 | - | - | 6.9 | 8.4 | - | - |
| Qwen2.5-Omni | 1.18 | 2.36 | 5.9 | 7.7 | 1.8 | 3.4 |
| Ming-Lite-Omni-Preview | 1.62 | 2.82 | 6.23 | 6.9 | 2.34 | 5.74 |

Knowledge

| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
|:------|:---------------:|:------------------------:|:----------------------:|
| GPT-4o | 36.05 | - | - |
| PaLI-X | 22.06 | 23.5 | 20.8 |
| Qwen2.5-vl-32B | 19.35 | 20.55 | 18.28 |
| Ming-Lite-Omni-Preview | 27.3 | 28.9 | 25.9 |

OCR&GUI

| Model | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct |
|:------|:----------------------:|:----------------------:|
| ChartQA_TEST | 85.2 | 87.3 |
| DocVQA_TEST | 93.2 | 95.7 |
| OCRBenchV2_en/zh | 52.2/51.6 | 56.3/57.2 |
| OmniDocBench↓ | 34.7/34.5 | 30.8/39.8 |
| TextVQA_VAL | 82.36 | 84.9 |
| ScreenSpot | 79.3 | 84.7 |

Model Downloads

You can download the model from both Huggingface and ModelScope.

| Model | Input modality | Output modality | Download |
|:------|:---------------|:----------------|:---------|
| Ming-Lite-Omni-Preview | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace / 🤖 ModelScope |
If you are in mainland China, we strongly recommend downloading the model from 🤖 ModelScope.
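
For example, a programmatic download can look like the sketch below. The repo id follows the one used in the Quick Start code and may differ for the Preview checkpoint; adjust it to the id shown on the model page.

# Download from the Hugging Face Hub (repo id taken from the Quick Start code; verify it on the model page).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="inclusionAI/Ming-Lite-Omni")

# Or, from mainland China, via ModelScope:
# from modelscope import snapshot_download
# local_dir = snapshot_download("inclusionAI/Ming-Lite-Omni")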

Use Cases

Video and Audio QA

(Multimodal input: a video with an accompanying audio question; media omitted here.)
Q: <audio> (audio content: "Please describe the video content.")
A: The video features a woman performing a series of yoga poses on a rooftop with a scenic view of mountains and a clear blue sky.
Q: Is there any food in front of me?
A: Yes, there’s candy on the table.

Speech-to-Speech (with dialect support)

Quick Start

Please download the model by following Model Downloads, then refer to the code below to run the Ming-Lite-Omni-Preview model.

import os
import torch
from transformers import AutoProcessor
from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# build model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    "inclusionAI/Ming-Lite-Omni",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True
).to("cuda")

assets_path = YOUR_ASSETS_PATH

# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True)
# text qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
        ],
    },
]
# Output:

# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍:
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。
# ......
# image qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join(assets_path, "flowers.jpg")},
            {"type": "text", "text": "What kind of flower is this?"},
        ],
    },
]
# Output:

# The flowers in this image are forget-me-nots. These delicate blooms are known for their small, five-petaled flowers that come in various shades of blue, pink, and white.

To enable thinking before the response, add the following system prompt before your question:

cot_prompt = "SYSTEM: You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <thinking>...</thinking> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}.\n"
# And your input message should be like this:
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join(assets_path, "reasoning.png")},
            {"type": "text", "text": cot_prompt + "In the rectangle $A B C D$ pictured, $M_{1}$ is the midpoint of $D C, M_{2}$ the midpoint of $A M_{1}, M_{3}$ the midpoint of $B M_{2}$ and $M_{4}$ the midpoint of $C M_{3}$. Determine the ratio of the area of the quadrilateral $M_{1} M_{2} M_{3} M_{4}$ to the area of the rectangle $A B C D$.\nChoices:\n(A) $\frac{7}{16}$\n(B) $\frac{3}{16}$\n(C) $\frac{7}{32}$\n(D) $\frac{9}{32}$\n(E) $\frac{1}{5}$"},
        ],
    },
]
# Output:
# <think>\nOkay, so I have this problem about a rectangle ABCD ... (thinking process omitted) ... So, the correct answer is C.\n</think>\n<answer>\\boxed{C}</answer>\n\n
# video qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "video", "video": os.path.join(assets_path, "yoga.mp4")},
            {"type": "text", "text": "What is the woman doing?"},
        ],
    },
]
# Output:

# The image shows a woman performing a yoga pose on a rooftop. She's in a dynamic yoga pose, with her arms and legs extended in various positions.
# multi-turn chat
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "中国的首都是哪里?"},
        ],
    },
    {
        "role": "ASSISTANT",
        "content": [
            {"type": "text", "text": "北京"},
        ],
    },
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "它的占地面积是多少?有多少常住人口?"},
        ],
    },
]
# Output:

# 北京市的总面积约为16,410.54平方公里,常住人口约为21,542,000人。
# Preparation for inference
text = processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    audios=audio_inputs,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
for k in inputs.keys():
    if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
        inputs[k] = inputs[k].to(dtype=torch.bfloat16)

# call generate
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=False,
    eos_token_id=processor.gen_terminator,
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
# ASR
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "Please recognize the language of this speech and transcribe it. Format: oral."},
            {"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
        ],
    },
]
outputs = model.generate(messages, max_new_tokens=512)
print(outputs)
# speech2speech
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
        ],
    },
]
outputs = model.generate(messages, max_new_tokens=512, speaker='luna', output_audio_path='out.wav', output_audio=True)
print(outputs)

License and Legal Disclaimer

This code repository is licensed under the MIT License. See the LEGAL.md file in the project root for the legal disclaimer.