GitHub | 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-Lite-Omni-Preview is built upon Ling-Lite, a MoE model designed to perceive a wide range of modalities, including text, images, audio, and video, and to generate text and natural speech in a streaming manner. To handle these diverse modalities naturally, we have enhanced Ling-Lite with modality-specific routers for each modality. As a result, Ming-Lite-Omni-Preview excels at handling information from diverse modalities and is highly scalable.
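
For intuition, the sketch below shows what per-modality routing in an MoE layer can look like. It is a minimal, illustrative toy: the class ModalityAwareMoE, its parameters, and the top-k gating scheme are assumptions made for this example, not the actual Ling-Lite/Ming implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAwareMoE(nn.Module):
    """Toy MoE layer with one router (gating network) per modality.

    All modalities share the same expert FFNs, but tokens of each modality
    are dispatched to experts by that modality's own router.
    """

    def __init__(self, hidden_size=512, num_experts=8, top_k=2,
                 modalities=("text", "image", "audio", "video")):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        )
        # One gating network per modality: the "modality-specific router".
        self.routers = nn.ModuleDict(
            {m: nn.Linear(hidden_size, num_experts) for m in modalities}
        )

    def forward(self, x, modality):
        # x: (num_tokens, hidden_size) tokens that all belong to `modality`.
        gate_logits = self.routers[modality](x)                  # (T, E)
        weights, expert_ids = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                     # (T, k)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Tokens of different modalities pass through the same experts but their own gate.
layer = ModalityAwareMoE()
y_text = layer(torch.randn(16, 512), "text")
y_image = layer(torch.randn(64, 512), "image")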

Key Features

  • Omni and Novel MoE Architecture: An innovative Omni architecture based on Mixture of Experts (MoE) that achieves competitive performance across multiple modality benchmarks.

  • Video understanding: Supports KV-Cache dynamic compression of visual tokens, enabling understanding of hour-long videos as well as more detailed understanding of short clips lasting only a few seconds (see the illustrative sketch after this list).

  • Natural Speech Generation and Fine-grained Voice Dialogue: Supports dialect understanding and generation in end-to-end conversations, enables one-shot voice cloning, and enhances prosody through audio tokenizer compression.
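
As a rough intuition for the visual-token compression mentioned in the video-understanding bullet above, the sketch below average-pools groups of adjacent frame tokens before they would enter the KV cache, trading per-frame detail for a smaller cache on long videos. The function name, the pooling strategy, and the ratio knob are assumptions made purely for illustration; they do not describe the model's actual compression mechanism.

import torch

def compress_visual_tokens(frame_tokens: torch.Tensor, ratio: int = 4) -> torch.Tensor:
    # Illustrative only: merge every `ratio` adjacent visual tokens into one
    # by average pooling, so long videos contribute fewer entries to the KV cache.
    num_tokens, hidden = frame_tokens.shape
    pad = (-num_tokens) % ratio
    if pad:
        frame_tokens = torch.cat([frame_tokens, frame_tokens.new_zeros(pad, hidden)], dim=0)
    return frame_tokens.view(-1, ratio, hidden).mean(dim=1)

# A long video can be cached aggressively (ratio=4); a short clip keeps full detail (ratio=1).
print(compress_visual_tokens(torch.randn(1024, 512), ratio=4).shape)  # torch.Size([256, 512])
print(compress_visual_tokens(torch.randn(256, 512), ratio=1).shape)   # torch.Size([256, 512])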

Evaluation

Image benchmark

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
|---|---|---|---|
| AI2D | 83.84 | 83.9 | 84.5 |
| HallusionBench | 54.68 | 51.9 | 51.7 |
| MMBench_TEST_V11 | 79.63 | 84.3 | 82.0 |
| MMMU | 57.0 | 58.6 | 54.8 |
| MMStar | 62.0 | 63.9 | 65.2 |
| MMVet | 73.6 | 67.1 | 68.1 |
| MathVista | 69.0 | 68.2 | 67.9 |
| OCRBench | 87.9 | 86.4 | 88.2 |
| Average | 70.96 | 70.5 | 70.3 |

Object Recognition

| Object Recognition | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B | InternVL-2.5-8B |
|---|---|---|---|
| Plants | 52.1 | 55.3 | 32.8 |
| Animals | 52.6 | 54.8 | 36.5 |
| Home appliances & furniture | 93.5 | 97.4 | 90.9 |
| Personal Electronics | 96.1 | 95.1 | 93.2 |
| Food & Ingredients | 57.5 | 60.0 | 48.7 |
| Tableware | 96.6 | 94.9 | 88.1 |
| Vehicles | 31.9 | 40.9 | 31.9 |
| Average | 68.6 | 71.2 | 60.3 |

Video benchmark

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5VL-7B |
|---|---|---|
| VideoMME wo/w sub. | 63.9/67.6 | 65.1/71.6 |
| MVBench | 67.0 | 72.0 |
| Video-MMMU | 45.4 | 47.44 |
| LongVideoBench | 53.7 | 60.0 |

Audio benchmark

SpeechQA

| Model | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
|---|---|---|---|---|---|---|---|
| Qwen2-Audio-chat | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Baichuan-Audio | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
| GLM-4-Voice | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Kimi-Audio | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| Ming-Lite-Omni-Preview | 4.25 | 3.88 | 58.95 | 46.06 | 60.00 | 46.71 | 96.53 |

ASR

| Model | Aishell-1 | Aishell-2 ios | Wenetspeech test-net | Wenet test-meeting | Librispeech test-clean | Librispeech test-other |
|---|---|---|---|---|---|---|
| Whisper Large-v3 | 5.14 | 4.76 | 9.68 | 18.54 | 1.9 | 3.65 |
| Qwen2-Audio | 1.53 | 3.06 | 7.72 | 8.4 | 1.6 | 3.6 |
| GLM-4-voice Base | 2.46 | - | - | - | 2.82 | 7.66 |
| Baichuan-Omni-1.5 | - | - | 6.9 | 8.4 | - | - |
| Qwen2.5-Omni | 1.18 | 2.36 | 5.9 | 7.7 | 1.8 | 3.4 |
| Ming-Lite-Omni-Preview | 1.62 | 2.82 | 6.23 | 6.9 | 2.34 | 5.74 |

Knowledge

| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
|---|---|---|---|
| GPT-4o | 36.05 | - | - |
| PaLI-X | 22.06 | 23.5 | 20.8 |
| Qwen2.5-vl-32B | 19.35 | 20.55 | 18.28 |
| Ming-Lite-Omni-Preview | 27.3 | 28.9 | 25.9 |

OCR&GUI

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| ChartQA_TEST | 85.2 | 87.3 |
| DocVQA_TEST | 93.2 | 95.7 |
| OCRBenchV2_en/zh | 52.2/51.6 | 56.3/57.2 |
| OmniDocBench↓ | 34.7/34.5 | 30.8/39.8 |
| TextVQA_VAL | 82.36 | 84.9 |
| ScreenSpot | 79.3 | 84.7 |

Model Downloads

You can download the model from both Hugging Face and ModelScope.

| Model | Input modality | Output modality | Download |
|---|---|---|---|
| Ming-Lite-Omni-Preview | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace / 🤖 ModelScope |
If you're in mainland China, we strongly recommend downloading our model from 🤖 ModelScope.
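
If you prefer to fetch the weights from a script, the snippet below uses snapshot_download from huggingface_hub. The repo id is taken from the Quickstart section below; replace it with the id linked in the table above if it differs. ModelScope offers an analogous modelscope.snapshot_download.

from huggingface_hub import snapshot_download

# Repo id taken from the Quickstart section; swap in the link from the table
# above if your model lives at a different id.
local_dir = snapshot_download(
    repo_id="inclusionAI/Ming-Lite-Omni",
    local_dir="./Ming-Lite-Omni-Preview",
)
print(local_dir)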

Use Cases

Video-Audio-QA

Example multimodal (video + audio) input and QA:

Q: <audio> (audio content: 请描述视频内容。 "Please describe the video content.")
A: The video features a woman performing a series of yoga poses on a rooftop with a scenic view of mountains and a clear blue sky.
Q: Is there any food in front of me?
A: Yes, there’s candy on the table.

Speech2Speech (supports dialects)

Quickstart

Please download our model following Model Downloads, then refer to the following code to run the Ming-Lite-Omni-Preview model.

import os
import torch
from transformers import AutoProcessor
from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# build model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    "inclusionAI/Ming-Lite-Omni",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True
).to("cuda")

assets_path = YOUR_ASSETS_PATH

# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True)
# qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
        ],
    },
]
# Output:

# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍:
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。
# ......
# image qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join(assets_path, "flowers.jpg")},
            {"type": "text", "text": "What kind of flower is this?"},
        ],
    },
]
# Output:

# The flowers in this image are forget-me-nots. These delicate blooms are known for their small, five-petaled flowers that come in various shades of blue, pink, and white.

To enable thinking before responding, add the following system prompt before your question:

cot_prompt = "SYSTEM: You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <thinking>...</thinking> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}.\n"
# And your input message should be like this:
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join(assets_path, "reasoning.png")},
            {"type": "text", "text": cot_prompt + "In the rectangle $A B C D$ pictured, $M_{1}$ is the midpoint of $D C, M_{2}$ the midpoint of $A M_{1}, M_{3}$ the midpoint of $B M_{2}$ and $M_{4}$ the midpoint of $C M_{3}$. Determine the ratio of the area of the quadrilateral $M_{1} M_{2} M_{3} M_{4}$ to the area of the rectangle $A B C D$.\nChoices:\n(A) $\frac{7}{16}$\n(B) $\frac{3}{16}$\n(C) $\frac{7}{32}$\n(D) $\frac{9}{32}$\n(E) $\frac{1}{5}$"},
        ],
    },
]
# Output:
# \<think\>\nOkay, so I have this problem about a rectangle ABCD ... (thinking process omitted) ... So, the correct answer is C.\n\</think\>\n\<answer\>\\boxed{C}\</answer\>\n\n
# video qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "video", "video": os.path.join(assets_path, "yoga.mp4")},
            {"type": "text", "text": "What is the woman doing?"},
        ],
    },
]
# Output:

# The image shows a woman performing a yoga pose on a rooftop. She's in a dynamic yoga pose, with her arms and legs extended in various positions.
# multi-turn chat
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "中国的首都是哪里?"},
        ],
    },
    {
        "role": "ASSISTANT",
        "content": [
            {"type": "text", "text": "北京"},
        ],
    },
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "它的占地面积是多少?有多少常住人口?"},
        ],
    },
]
# Output:

# 北京市的总面积约为16,410.54平方公里,常住人口约为21,542,000人。
# Preparation for inference
text = processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    audios=audio_inputs,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
for k in inputs.keys():
    if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
        inputs[k] = inputs[k].to(dtype=torch.bfloat16)

# call generate
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=False,
    eos_token_id=processor.gen_terminator,
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
# ASR
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "Please recognize the language of this speech and transcribe it. Format: oral."},
            {"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
        ],
    },
]
outputs = model.generate(messages, max_new_tokens=512)
print(outputs)
# speech2speech
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
        ],
    },
]
outputs = model.generate(messages, max_new_tokens=512, speaker='luna', output_audio_path='out.wav', output_audio=True)
print(outputs)

This code repository is licensed under the MIT License, and the Legal Disclaimer is located in the LEGAL.md file under the project’s root directory.