
Ming-Omni-TTS: Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control

· 27 min read
inclusionAI
Ant Group

GITHUB | 🤗 Hugging Face | 🤖 ModelScope

The Introduction Video of Ming-Omni-TTS

Ming-Omni-TTS is a high-performance unified audio generation model that achieves precise control over speech attributes and enables single-channel synthesis of speech, environmental sounds, and music. Powered by a custom 12.5 Hz continuous tokenizer and Patch-by-Patch compression, it delivers competitive inference efficiency (an LLM frame rate of 3.1 Hz). Additionally, the model features robust text normalization capabilities for the accurate and natural narration of complex mathematical and chemical expressions.

  • 🔊 Fine-grained Vocal Control: Enables precise control over speech rate, pitch, volume, emotion, and dialects via simple instructions. It achieves 93% accuracy for Cantonese and 46.7% for emotional control, outperforming CosyVoice3.
  • 🌌 Intelligent Voice Design: Features 100+ premium built-in voices and supports zero-shot voice design through natural language descriptions. Its performance on the Instruct-TTS-Eval-zh benchmark is on par with Qwen3-TTS.
  • 🎶 Immersive Unified Generation: The industry's first autoregressive model to jointly generate speech, ambient sound, and music in a single channel. Built on a custom 12.5Hz continuous tokenizer and a DiT head architecture, it delivers a seamless, "in-the-scene" auditory experience.
  • ⚡ High-efficiency Inference: Introduces a "Patch-by-Patch" compression strategy that reduces the LLM inference frame rate to 3.1 Hz. This significantly cuts latency and enables podcast-style audio generation while preserving naturalness and audio detail (a sketch of the idea follows this list).
  • 🧪 Professional Text Normalization: The model accurately parses and narrates complex formats, including mathematical expressions and chemical equations, ensuring natural-sounding output for specialized applications.
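To make the frame-rate arithmetic concrete, here is a minimal, hypothetical sketch of what patch-by-patch compression could look like. The post only states the two rates (a 12.5 Hz tokenizer feeding a ~3.1 Hz LLM), which implies merging roughly four consecutive frame embeddings into one LLM-side token (12.5 / 4 = 3.125 Hz). The module and dimension names below are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

PATCH_SIZE = 4  # frames per patch; 12.5 Hz / 4 ≈ 3.1 Hz (assumed ratio)

class PatchCompressor(nn.Module):
    """Hypothetical sketch: concatenate PATCH_SIZE consecutive 12.5 Hz frame
    embeddings and project them to a single LLM-rate token (~3.1 Hz)."""

    def __init__(self, frame_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(frame_dim * PATCH_SIZE, llm_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, T, frame_dim) at 12.5 Hz
        b, t, d = frames.shape
        t = t - t % PATCH_SIZE  # drop the ragged tail for simplicity
        patches = frames[:, :t].reshape(b, t // PATCH_SIZE, d * PATCH_SIZE)
        return self.proj(patches)  # (batch, T // PATCH_SIZE, llm_dim)

frames = torch.randn(1, 100, 256)  # 8 s of audio at 12.5 Hz
print(PatchCompressor(256, 1024)(frames).shape)  # torch.Size([1, 25, 1024])
```

In the real model, the DiT head presumably recovers the fine-grained audio detail downstream of these low-rate tokens; that part is omitted here.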

Model Structure

Ming-Omni-TTS is a unified audio language model for generating speech, music, and sound, built on a unified continuous audio tokenizer.

Unified Continuous Audio Tokenizer.

Diagram of the Unified Continuous Audio Tokenizer architecture.

Unified Audio Language Model for Speech, Music and Sound Generation.

Diagram of the Unified Audio Language Model for Speech, Music and Sound Generation architecture.

Benchmark Evaluations

Table showing benchmark evaluation results.

Voice Control – Support Structured and Natural Command Control

Basic Attributes Control: Speed, Volume and Pitch Control for Voice Generating

| Input Prompt | Target Text | Instruction 1 | TTS Result | Instruction 2 | TTS Result |
| --- | --- | --- | --- | --- | --- |
| (audio) | 导航开始,全程二十五公里,预计需要十二分钟。 | 语速:慢速 | (audio) | 语速:快速 | (audio) |
| (audio) | 烟雨弥漫下,山环绕着水耸立着,水环绕着山流淌着。 | 语速慢一点 | (audio) | 语速快一点 | (audio) |
| (audio) | 目前共享出行市场处于高速增长阶段。 | 音量:低 | (audio) | 音量:高 | (audio) |
| (audio) | 北京在出行规模,城市影响力方面表现优异。 | 音量尽量低一点 | (audio) | 音量尽量高一点 | (audio) |
| (audio) | 他们脱掉笨重的冬衣,走起路来腰杆挺直步履轻盈。 | 基频:低 | (audio) | 基频:高 | (audio) |
| (audio) | 自动驾驶将大幅提升出行安全,效率。 | 基频低一点 | (audio) | 基频高一点 | (audio) |

Same Dialect/Cross-Dialect Control: Generating Cantonese and Sichuanese from Mandarin and Native Prompts

| Instruction | Input Prompt | Conversion Type | Target Text | TTS Result |
| --- | --- | --- | --- | --- |
| 方言:广粤话 | (audio) | 广粤话 -> 广粤话 | 佢系头大冇脑脑大生草种 | (audio) |
| 方言:广粤话 | (audio) | 广粤话 -> 广粤话 | 今个周末全场货品低至五折,数量有限,卖晒就冇喇。 | (audio) |
| 请用广粤话表达 | (audio) | 广粤话 -> 广粤话 | 我觉得社会企业同个人都有责任 | (audio) |
| 用广粤语说,越地道越好。 | (audio) | 普通话 -> 广粤话 | 你嚟探我,我真系好感动,好耐冇见你啦! | (audio) |
| 以广粤话的口语风格来表达。 | (audio) | 普通话 -> 广粤话 | 快啲啦,唔好再拖拖拉拉,大家都等紧你开会呀 | (audio) |
| 方言:川渝话 | (audio) | 川渝话 -> 川渝话 | 你要自己打扮,不穿咋个晓得穿起漂不漂亮嘛?看我们这新款多时尚。 | (audio) |
| 方言:川渝话 | (audio) | 川渝话 -> 川渝话 | 赛尔号那个时候,才出来的时候,还是他那个机制,还是特别好耍的。 | (audio) |
| 请用川渝话表达 | (audio) | 川渝话 -> 川渝话 | 哎,刚刚晚上想吃点啥子?煮点火锅要得。 | (audio) |
| 模仿川渝话的语气来表达 | (audio) | 普通话 -> 川渝话 | 你晓不晓得?你啥我都喜欢,嗯,就是有一点不喜欢装。 | (audio) |
| 挑战一下用川渝话的味儿来朗读 | (audio) | 普通话 -> 川渝话 | 你那哈屋头还有电脑,那时候就已经先进了。 | (audio) |

Cross-Emotion Control: Cross-Emotion Synthesis Using a Single Neutral Prompt

| Instruction | Input Prompt | Conversion Type | Target Text | TTS Result |
| --- | --- | --- | --- | --- |
| 情感: 高兴 | (audio) | 中性 -> 高兴 | If these examinations are held orally, they may be known colloquially as "orals". | (audio) |
| 情感: 愤怒 | (audio) | 中性 -> 愤怒 | I'm done arguing with you. You're not worth my time! | (audio) |
| 情感: 愤怒 | (audio) | 中性 -> 愤怒 | In cities, driving speeds are set by which lane a driver is in. | (audio) |
| 情感: 悲伤 | (audio) | 中性 -> 悲伤 | Everything has changed. The promises and dreams we once had are shattered. How should I face this? | (audio) |
| 情感: 高兴 | (audio) | 中性 -> 高兴 | But it does not allow for adding new members to interfaces. | (audio) |
| 情感: 愤怒 | (audio) | 愤怒 -> 愤怒 | 港湾道是每年农历新年举行的香港新春花车巡游的路线之一。 | (audio) |
| 情感: 悲伤 | (audio) | 悲伤 -> 悲伤 | 我觉得自己好像在黑暗中迷失了,再也找不到出口了。 | (audio) |
| 情感: 高兴 | (audio) | 中性 -> 高兴 | 我竟然抢到了陈奕迅的演唱会门票!太棒了!终于可以现场听一听他的歌声了! | (audio) |
| 情感: 悲伤 | (audio) | 悲伤 -> 悲伤 | 我们俩从一开始就君子之交,都说好啦,背信弃义出尔反尔的是她,我告诉你这件事我是受害者。 | (audio) |
| 表达时要悲伤一点。 | (audio) | 悲伤 -> 悲伤 | 有些软体开发者也注意到软体度量已成为软体开发过程中的一部份。 | (audio) |
| 把这件事说得高兴一点。 | (audio) | 高兴 -> 高兴 | I bought my first mountain bike with my own earnings, a Merida Warrior 500! Go me! | (audio) |
| 表达时,请务必流露出高兴的情感。 | (audio) | 中性 -> 高兴 | I ran into a teacher I hadn't seen in years at the coffee shop today. He still remembered me, and we talked about so many fun memories. | (audio) |

Built-in Premium Sounds: Over 100 Built-in, High-Quality Timbres

| Instruction | Description | Target Text | TTS Result |
| --- | --- | --- | --- |
| 克隆一下灵小甄的说话腔调。 | 销售、直播带货: 声音明亮清脆,语速轻快且充满活力,语气中带有强烈的推荐感和亲和力,典型的带货主播风格。 | 这款产品的名字,叫变态坑爹牛肉丸。 | (audio) |
| 模仿灵梦的风格。 | 虚拟恋人: 充满糖分的高甜少女音,语气娇憨任性,完美演绎了想要人陪伴时的撒娇状态。 | 认为在中文歌曲里,夹杂几句英文就很时髦。 | (audio) |
| 麻烦学一下灵岩的口音 | 新闻、客服: 声音清晰正式且专业 | 届时会按照原定计划,与国防部签署相关以地换地协议。 | (audio) |
| 克隆一下灵娇的说话腔调。 | 邻家女孩、女大学生、Vlog博主: 清甜明亮的少女音,语感轻快活泼,在讲述生活趣事时充满画面感与青春朝气,极具感染力。 | 总裁问,刚才皮皮鲁唱的歌是谁的词谁的曲,大手笔呀。 | (audio) |
| 克隆一下妩媚妲己的说话腔调。 | 妩媚角色: 声音甜美清脆,语调轻盈上扬,表现性感妩媚 | 新娘是一位俄国公主,坐着六只驯鹿拉的雪车,从芬兰一路而来。 | (audio) |
| 克隆一下灵绮木的说话腔调。 | 透着刻薄与傲慢的冷艳御姐音 | 这就是它第二个特色——灵活的音色设计能力,你可以直接用文字描述,比如"知性女主播的声音",它就能给你生成。要是懒得想,它还内置了一百多种精品音色,什么动漫角色、短视频配音统统搞定! | (audio) |
| 克隆一下灵若虚的说话腔调。 | 老奶奶形象,声音饱含岁月的温暖与慈爱,语速舒缓,透着对生活细节的满足感,极具治愈力。 | 这就是它第二个特色——灵活的音色设计能力,你可以直接用文字描述,比如"知性女主播的声音",它就能给你生成。要是懒得想,它还内置了一百多种精品音色,什么动漫角色、短视频配音统统搞定! | (audio) |
| 克隆一下花小呗的说话腔调。 | 儿童角色,声音清脆甜美,带有明显的幼态特征,语调轻快活泼 | 这就是它第二个特色——灵活的音色设计能力,你可以直接用文字描述,比如"知性女主播的声音",它就能给你生成。要是懒得想,它还内置了一百多种精品音色,什么动漫角色、短视频配音统统搞定! | (audio) |
| 克隆一下灵浅忧的说话腔调。 | 小男孩,声音清脆明亮,充满元气 | 今天天气不错,要出去玩了。 | (audio) |

Voice Design: Zero-Shot Synthesis of Custom Vocal Identities via Natural Language Descriptions

| Instruction | Target Text | TTS Result |
| --- | --- | --- |
| 性别: 女童声音. 音高: 音高尖锐,持续偏高. 语速: 语速迅捷,语气急促. 音量: 音量响亮,情绪饱满. 年龄: 学龄儿童. 清晰度: 吐字清晰,发音用力. 流畅度: 表达流畅,伴强调性重复. 口音: 标准普通话. 音色质感: 童声清亮,略显尖锐. 情绪: 激动委屈,带有抗议. 语调: 声调高昂,语势急切. 性格: 急躁率真,不甘示弱. | 人家从那走过,他们就说我故意偷听,还说我是小广播,我偏要广播,偏要广播偏。 | (audio) |
| 性别: 男性. 音高: 男性沉稳中低音. 语速: 语速舒缓,有自然停顿. 音量: 正常谈话音量. 年龄: 中老年男性. 清晰度: 吐字清晰,发音标准. 流畅度: 言语连贯,表达自然. 口音: 标准普通话. 音色质感: 音质温和,略显沧桑. 情绪: 饱含不舍与怀念,转为平静嘱托. 语调: 前段感叹意味,后段请求意味. 性格: 念旧重情,温和坦诚. | 这就是天望娃娃送给我的我一直舍不得丢掉它,你替我上交了吧。 | (audio) |
| 性别: 男性语音特征. 音高: 男性中低音域,初始疑问时音调上扬. 语速: 整体偏快,表述急切清晰. 音量: 正常交谈音量,偶有强调加重. 年龄: 青年至中年男性. 清晰度: 吐字清晰,发音标准. 流畅度: 叙述流畅,偶有为强调而设的短暂停顿. 口音: 带有北方地区特征的普通话. 音色质感: 声音较为浑厚,略带一丝沙哑质感. 情绪: 从关切疑问过渡到解释性陈述,略显急切. 语调: 初始疑问扬起,后转为肯定叙述语调. 性格: 显得坦率直接,急于说明情况. | 没有欺负这孩子呢,报告团长没人欺负他,不是怎么的,他本来是给他师父小杨上门的,回来,就说鬼鬼的鬼。 | (audio) |
| 性别: 女性. 音高: 女性高音,句末随情绪上扬. 语速: 语速偏缓,充满恳切感. 音量: 音量正常,激动处略有提高. 年龄: 中年女性. 清晰度: 吐字清晰,略带哭腔. 流畅度: 整体流畅,因情绪略显迟缓. 口音: 标准普通话. 音色质感: 音色略显沙哑,蕴含悲伤. 情绪: 悲伤焦虑,带有不解与恳求. 语调: 起伏较大,表达焦急质问. 性格: 情感浓烈,忧心忡忡. | 我们家好容易恢复成这个样子,你明知有危险,为什么还一定要拉着杉杉? | (audio) |
| 用活泼的童声带着喜悦和兴奋不间断地讲述一个有趣的故事。 | 我有个大哥叫小王,能吃饭也能喝汤,别看他手里没武器啊,说话赛过歪白的机关枪。 | (audio) |
| 这是一个粤语地区长辈的声音,是一种带有地域特色的创意风格。他使用粤语(广东话),年长男性声音沉厚,语速较慢。语气在说教时显得严肃,但言语间仍透露出对家人的关心。 | 做人呢,最紧要就係开心。 | (audio) |
| 这是一个粤语地区长辈的声音,是一种带有地域特色的创意风格。他使用粤语(广东话),年长男性声音沉厚,语速较慢。语气在说教时显得严肃,但言语间仍透露出对家人的关心。 | 你睇你,成日挂住玩,书又唔读。 | (audio) |
| 是一个粗犷豪放的东北大哥的声音,是一种极具地域辨识度的创意与特殊风格。他使用带有浓郁东北口音的普通话,中年男性声音洪亮,嗓门大。说话直来直去,语速快,语气中充满了幽默感和不拘小节的豪爽。 | 哎呀我的妈呀,这嘎冷的天儿,你穿这点儿? | (audio) |
| 这是一种ASMR耳语,属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语,声音气音成分重。音量极低,紧贴麦克风,语速极慢,旨在制造触发听者颅内快感的声学刺激。 | 放松……现在……闭上你的眼睛…… | (audio) |
| 这是一种ASMR耳语,属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语,声音气音成分重。音量极低,紧贴麦克风,语速极慢,旨在制造触发听者颅内快感的声学刺激。 | 听……这个声音……是不是……很舒服…… | (audio) |
| 这是一个体育赛事激情解说员的声音,是极具感染力的创意与特殊风格。他使用高亢的普通话,中年男性声音沙哑(因长时间呐喊)。语速快如机枪,在关键时刻会瞬间爆发,语调充满了紧张、激动和不可思议的情绪。 | 球进了!进了进了进了!伟大的胜利! | (audio) |
| 这是一个宫斗剧中的威严皇后的声音,展现了充满张力的戏剧叙事风格。她使用雍容华贵的普通话,中年女性声音沉稳。语速雍容和缓,但每个字都掷地有声,语气表面波澜不惊,实则暗藏锋芒和久居上位的威压。 | 妹这话,是说给本宫听的吗? | (audio) |
| 这是一个宫斗剧中的威严皇后的声音,展现了充满张力的戏剧叙事风格。她使用雍容华贵的普通话,中年女性声音沉稳。语速雍容和缓,但每个字都掷地有声,语气表面波澜不惊,实则暗藏锋芒和久居上位的威压。 | 放肆!在本宫面前,岂容你如此喧哗? | (audio) |
| 这是一个古装剧中的腹黑反派的声音,充满了戏剧性的叙事张力。他使用华丽而阴柔的普通话,青年男性声音说话时语速慢条斯理,语气看似温和,却在句尾带着一丝不易察觉的冷笑和威胁,让人不寒而栗。 | 呵呵,看来,你还是不太明白自己的处境啊。 | (audio) |

Podcast: Multi-person Conversation

Input Speaker1 Prompt | Input Speaker2 Prompt | Target Text | TTS Result
speaker_1: 你可以说一下,就大概说一下,可能虽然我也不知道,我看过那部电影没有。
speaker_2: 就是那个叫什么,变相一节课的嘛。
speaker_1: 嗯。
speaker_2: 一部搞笑的电影。
speaker_1: 一部搞笑的。
speaker_1: 所以你想成功的话,就推荐你看这些书。
speaker_2: 我会有时间去看一看的。
speaker_1: 要是像我看的话,我就会感觉特别的。
speaker_2: 枯燥。
speaker_1: 对枯燥无聊毕竟是古文也看不懂除非那些。
speaker_1: 知道家长在考虑什么让家长也知道孩子们在考虑什么。
speaker_2: 对。
speaker_1: 减少矛盾。
speaker_2: 对,就是感觉其实出这些电影或者电视剧,也是挺好的让彼此更加了解一下,我感觉如果是一个家长和一个小孩儿,去看电视剧的话,收获也是蛮多的。
speaker_1: 那你还有什么比较好的电影介绍给我呢。
speaker_1: 上个厕所,然后那有专门的人给你,就是你上厕所之前,专门有个人给你递纸了。
speaker_2: 对,上个厕所会出来给你递毛巾。
speaker_1: 啊对,让你去擦手这些什么的。
speaker_2: 是的。
speaker_1: 服务,服务非常周到,不过也有少数人就说,这个服务实在太久了,就是,就,就是像那种,就是那个。
speaker_1: 什么东西啊?
speaker_2: 叫那个的哪吒的那个。
speaker_1: 啊,那个哪吒,但是我没有去看一看嘛。
speaker_2: 我也没看过。
speaker_1: 我当时好像是本来是要去看的。
speaker_1: 啊,我吃过。
speaker_2: 是不是。
speaker_1: 因为我之前去过山东一次吃过人家那杂粮煎饼。
speaker_2: 反正跟咱们这儿,不一样是吧,正宗的人家那是正宗的。
speaker_1: 本地的。
speaker_1: 那就之前的妆都毁掉了。
speaker_2: 嗯,是是是。
speaker_1: 然后之后就是睫毛。
speaker_2: 哦,对,那睫毛涂睫毛膏。
speaker_1: 画睫呃涂睫毛的时候,先夹一下睫毛,夹。
speaker_1: 嗯哪三个字。
speaker_2: 足力健。
speaker_1: 哦听说过。
speaker_2: 那你给我讲讲。
speaker_1: 我听说这个足力健对老年人的脚底有好处,而且边走路都能健身是吗。
speaker_1: 就这样子,嗯,一般男生都是看什么电影啊? 推理的吗? 还是什么。
speaker_2: 也不是吧,就是看那种,嗯,具体也说不出哪种类型嘛。
speaker_1: 具体也说不出。
speaker_2: 嗯。
speaker_1: 就是都有看一点。
speaker_1: 是了,只有你,化化起妆了才能充实呢,自信心呃然后才,感觉自己的心情是美美哒的。
speaker_2: 你想化妆是,呃那就从眉毛开始说不是从打底开始说吧。
speaker_1: 嗯说,好想听呢。
speaker_2: 洁面以后就是拍水乳,水乳霜。
speaker_1: 嗯。

Music Generation

| Instruction | TTS Result |
| --- | --- |
| Genre: 迪斯科. Mood: 活力四射 / 精力充沛. Instrument: 电吉他. Theme: 运动. Duration: 30s | (audio) |
| Genre: 当代古典音乐. Mood: 温暖 / 友善. Instrument: 合成拨弦. Theme: 节日. Duration: 60s. | (audio) |
| Genre: 电子舞曲. Mood: 自信 / 坚定. Instrument: 架子鼓. Theme: 节日. Duration: 47s. | (audio) |
| Genre: 独立民谣. Mood: 鼓舞人心 / 充满希望. Instrument: 合成铜管乐器. Theme: 节日. Duration: 63s. | (audio) |
| Genre: 流行摇滚. Mood: 温暖 / 友善. Instrument: 低音鼓. Theme: 旅行. Duration: 76s. | (audio) |
| Genre: 电子舞曲. Mood: 快乐. Instrument: 定音鼓. Theme: 好时光. Duration: 61s. | (audio) |
| Genre: 流行乐. Mood: 温暖 / 友善. Instrument: 合成铜管乐器. Theme: 庆典与喜悦. Duration: 41s. | (audio) |
| Genre: 当代古典音乐. Mood: 鼓舞人心 / 充满希望. Instrument: 合成拨弦. Theme: 庆典与喜悦. Duration: 45s. | (audio) |
| Genre: 电子舞曲. Mood: 鼓舞人心 / 充满希望. Instrument: 电吉他. Theme: 运动. Duration: 94s. | (audio) |

Speech/Music Mono Generation: Single-Channel Generation of Speech and Music

| Instruction | Input Prompt | Target Text | TTS Result |
| --- | --- | --- | --- |
| Genre: 电子舞曲. Mood: 活力四射. Instrument: 合成铜管乐器. Theme: 运动. SNR: 5.0dB. | (audio) | 全神贯注,跟上这强劲的节奏,冲向终点吧! | (audio) |
| Genre: 流行摇滚. Mood: 快乐. Instrument: 电吉他. Theme: 旅行. SNR: 5.0dB. | (audio) | 阳光洒满公路,带上行囊,出发去远方! | (audio) |
| Genre: 迪斯科. Mood: 兴奋. Instrument: 架子鼓. Theme: 生日. SNR: 5.0dB. | (audio) | 派对时刻到!让我们在鼓点中祝你生日快乐! | (audio) |
| Genre: 电子舞曲. Mood: 兴奋. Instrument: 合成铜管乐器. Theme: 运动. SNR: 5.0dB. | (audio) | 汗水在燃烧,感受这股能量,你就是最强的! | (audio) |
| Genre: 流行摇滚. Mood: 活力四射. Instrument: 架子鼓. Theme: 旅行. SNR: 5.0dB. | (audio) | 踏上未知的旅程,每一步都充满未知的惊喜! | (audio) |
| Genre: 迪斯科. Mood: 快乐. Instrument: 电吉他. Theme: 生日. SNR: 5.0dB. | (audio) | 吹灭蜡烛前,先跟着旋律尽情摇摆吧! | (audio) |
| Genre: 电子舞曲. Mood: 快乐. Instrument: 合成铜管乐器. Theme: 生日. SNR: 5.0dB. | (audio) | 这是属于你的闪耀时刻,生日派对正式开始! | (audio) |
| Genre: 流行摇滚. Mood: 兴奋. Instrument: 电吉他. Theme: 运动. SNR: 5.0dB. | (audio) | 超越极限,感受心跳的轰鸣,永不言弃! | (audio) |
| Genre: 迪斯科. Mood: 活力四射. Instrument: 架子鼓. Theme: 旅行. SNR: 5.0dB. | (audio) | 在霓虹闪烁的异国街头,找寻失落的快乐! | (audio) |
| Genre: 流行摇滚. Mood: 快乐. Instrument: 合成铜管乐器. Theme: 运动. SNR: 5.0dB. | (audio) | 运动让生活更有趣,让我们一起快乐出发! | (audio) |

Sound Generation (TTA)

| Instruction | TTS Result |
| --- | --- |
| A motor is revving and changing gears | (audio) |
| Thunder and a gentle rain | (audio) |
| Continuous snoring of a person | (audio) |
| Nature sounds with a frog croaking | (audio) |
| A man talking as a stream of water trickles in the background | (audio) |

Speech/Sound Mono Generation: Single-Channel Generation of Speech and Sound

| Instruction | Input Prompt | Target Text | TTS Result |
| --- | --- | --- | --- |
| Birds chirping | (audio) | 副主任及以上号别就诊人次,为二百零八点二万。 | (audio) |
| Light rain | (audio) | 其中又有大部分百分之四十一点九认为,由该品牌影楼拍摄。 | (audio) |
| Keyboard typing | (audio) | 本次有害昆虫科普展,是一场专门为孩子准备的科普教育活动。 | (audio) |
| Fire engine siren | (audio) | 他陪舅舅到简阳一所学校,考察捐资改建事宜。 | (audio) |
| Rainstorm | (audio) | 请语音留言,告诉电话精灵您没有达到父母的哪些要求。 | (audio) |

Ming-flash-omni-Preview: A Sparse, Unified Architecture for Multimodal Perception and Generation

· 7 min read
inclusionAI
Ant Group

GITHUB | ARXIV | 🤗 Hugging Face | 🤖 ModelScope

Omnimodal Ming-omni series update! Ming-flash-omni-Preview is the first open-source omnimodal large model to reach the hundred-billion-parameter scale. Based on Ling 2.0's sparse MoE architecture, Ming-flash-omni-Preview has 103B total parameters with 9B activated. Compared to the previous version, Ming-lite-omni-1.5, it improves both omnimodal understanding and generation. Its overall performance across modalities is leading among open-source omnimodal models, with particularly strong results in controllable image generation, streaming video understanding, and speech recognition.


Capability Overview

Controllable Image Generation

For image generation, Ming-flash-omni-Preview pioneers the Generative Segmentation Paradigm, reframing "image segmentation" as a semantic-preserving editing task (Generative Segmentation-as-Editing), achieving fine-grained spatial semantic control. Ming-flash-omni-Preview achieved a score of 0.90 on the GenEval benchmark, surpassing all non-reinforcement learning generation methods and demonstrating exceptional controllability.

Streaming Video Understanding

Users often want to hold a continuous dialogue with an AI about real-world scenes and have the AI interpret those scenes for them. Ming-flash-omni-Preview fulfills these needs effectively. As shown in the video below, it achieves fine-grained understanding of streaming video, recognizing objects and interactions in the video and providing relevant explanations in real time to support users in practical scenarios.

Speech and Dialect Understanding

Ming-flash-omni-Preview supports Context-Aware Speech Recognition (ContextASR) and dialect recognition, achieving SOTA across all 12 ContextASR subtasks. Its understanding of 15 Chinese dialects, including Hunanese, Minnanese, and Cantonese, is significantly enhanced, effectively providing translation and real-time comprehension support for users faced with an unfamiliar dialect.

Voice Cloning

Ming-flash-omni-Preview's speech generation has been upgraded from discrete tokenizers to continuous tokenizers, significantly enhancing its voice cloning capabilities. It is highly stable on mixed Chinese-English pronunciation and can effectively clone the voice from an original conversation into newly generated dialogue. Its WER on seed-tts-zh is 0.99, surpassing Qwen3-omni and seed-tts.

Model Architecture and Capability Introduction

Model structure diagram of Ming-flash-omni-Preview:


Compared to Ming-lite-omni-1.5, Ming-flash-omni-Preview primarily features the following technical optimizations:

Omnimodal Training Based on Sparse Expert Architecture

Ming-flash-omni-Preview extends the Ling-Flash-2.0 sparse MoE architecture to the omni-modality. It models the distribution and routing strategy of each modality based on the modality-level routing proposed by Ming-lite-omni, achieving "large capacity, small activation" for each modality. By introducing VideoRoPE in the Attention layer, it enhances spatiotemporal modeling for long videos, improving video interaction capability. Additionally, in terms of training strategy:

  1. Stable Sparse Training: Utilizes a mixed expert-balancing scheme (combining an auxiliary load-balancing loss with router bias updates) to ensure uniform expert activation and stable convergence of omnimodal training under the sparse MoE architecture (a sketch of the balancing step follows this list);
  2. Context-Aware ASR Training Paradigm: For speech training tasks, task/domain information input is used as the decoding condition, significantly improving proper noun recognition and transcription consistency. It also introduces high-quality dialect training corpora, leading to a significant increase in recognition accuracy for 15 Chinese dialects, including Hunanese, Minnanese, and Cantonese.
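As a rough illustration of the mixed expert-balancing scheme named in point 1, the sketch below combines the standard auxiliary load-balancing loss with a per-expert router-bias update in the spirit of aux-loss-free balancing. The constants, shapes, and update rule are illustrative assumptions, not the released training code.

```python
import torch

def load_balance_step(router_logits, bias, top_k=2, bias_lr=1e-3):
    # router_logits: (tokens, n_experts); bias: (n_experts,) selection-only offset
    n_tokens, n_experts = router_logits.shape
    probs = torch.softmax(router_logits, dim=-1)
    # The bias shifts which experts get *selected*, not the mixing weights.
    _, top_idx = (router_logits + bias).topk(top_k, dim=-1)

    # f: fraction of routed tokens per expert; p: mean router probability.
    counts = torch.zeros(n_experts).scatter_add_(
        0, top_idx.flatten(), torch.ones(n_tokens * top_k))
    f = counts / (n_tokens * top_k)
    p = probs.mean(dim=0)
    aux_loss = n_experts * (f * p).sum()  # standard auxiliary balancing loss

    with torch.no_grad():  # nudge routing away from overloaded experts
        bias -= bias_lr * torch.sign(f - 1.0 / n_experts)
    return aux_loss, bias

aux_loss, bias = load_balance_step(torch.randn(128, 8), torch.zeros(8))
print(aux_loss.item())
```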

Unified Generative Segmentation and Editing

The core challenge in building a unified multimodal model lies in how to efficiently integrate image understanding and generation capabilities. Our Ming-lite-omni-1.5 achieved this by freezing the language pathway and injecting hierarchical semantics using multi-scale QueryTokens, thereby allowing the generation objective to better integrate with the understanding task while preserving understanding performance. Although this training strategy improved stability, the fundamental differences between the learning objectives of understanding and generation mean that even with the introduction of hierarchical semantics, fine-grained visual knowledge (such as object attributes and spatial relationships) remains difficult to efficiently transfer to high-precision generation and editing tasks, thus limiting the improvement in model generation quality and controllability.

To overcome this bottleneck, Ming-flash-omni-Preview proposes the "Generative Segmentation-as-Editing" collaborative training paradigm, which reframes image segmentation as a semantic-preserving editing task (e.g., "paint the banana purple"). The key benefit of this design is that it forcibly unifies the understanding and generation objectives: successful editing must rely on a precise understanding of the object's outline, and editing quality directly provides a supervision signal for understanding. This paradigm directly enhances the model's fine-grained spatial semantic control and indirectly addresses the compositionality problem in pure text-to-image generation.

On the GenEval benchmark, Ming-flash-omni-Preview achieved a score of 0.90, surpassing all leading non-reinforcement-learning (non-RL) methods. On the GEdit benchmark, the average score on precise editing tasks such as object deletion and object replacement improved from 6.9 to 7.9. Together, these results show that the fine-grained spatial semantic control gained through "Generative Segmentation-as-Editing" training not only significantly improves precise editing but also generalizes effectively to purely text-driven image generation.

Efficient Omnimodal Training Architecture

Training omnimodal foundation models faces two major challenges: data heterogeneity (varied shapes of multi-modal inputs) and model heterogeneity (difficulty in parallelizing modality-specific encoders). These issues lead to load imbalance, memory fragmentation, and pipeline bubbles, severely slowing down the training speed. To address these problems, we made two key optimizations based on the Megatron-LM framework when training the Ming-flash-omni-Preview model:

  1. Sequence Packing: Solves data heterogeneity. Variable-length sequences are densely packed into fixed-length batches, significantly improving memory utilization and computational density (a toy packing sketch follows this list);
  2. Flexible Encoder Sharding: Solves model heterogeneity. We extend Megatron-LM to support fine-grained sharding of modality encoders across DP/PP/TP, eliminating pipeline bubbles and achieving load balance.

Together, these optimizations doubled the training throughput of Ming-flash-omni-Preview compared to the baseline.
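A toy illustration of the sequence-packing idea from point 1: greedily pack variable-length samples into fixed-length rows so that little of the batch is padding. Real Megatron-LM-style packing also builds per-segment attention masks and position ids, which this sketch omits.

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit-decreasing packing; returns sample indices per row."""
    rows, room = [], []          # room[i] = tokens still free in rows[i]
    for idx, n in sorted(enumerate(lengths), key=lambda x: -x[1]):
        for i, free in enumerate(room):
            if n <= free:        # reuse the first row with enough space
                rows[i].append(idx)
                room[i] -= n
                break
        else:                    # no row fits: open a new one
            rows.append([idx])
            room.append(max_len - n)
    return rows

lengths = [900, 300, 512, 128, 2048, 700]
print(pack_sequences(lengths, max_len=2048))
# [[4], [0, 5, 1, 3], [2]] -> 3 packed rows instead of 6 padded ones
# (~75% of positions carry real tokens vs ~37% with per-sample padding)
```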

Getting Started with Ming-flash-omni-Preview

Our model and code are open source. We welcome everyone to try them out, provide feedback, and exchange ideas.

Future Plan

This release, Ming-flash-omni-Preview, is a preview version and still has some imperfections:

  1. Visual-Text Understanding Capability: Although Ming-flash-omni-Preview's overall performance is leading among omnimodal models, there is still a gap compared to SOTA dedicated VL large models. We will continue to explore the performance upper limit of omnimodal models.
  2. Speech Capability: Overall performance in speech recognition and speech synthesis is leading. Multi-turn speech dialogue and high-quality voice cloning are our next optimization priorities.
  3. Image Generation Capability: The model achieved a score of 0.90 on the GenEval benchmark, demonstrating good controllability, and already possesses text generation and editing capabilities. However, there is still room for improvement in rendering and editing text with complex layouts, as well as generating specific IP characters.

We are continuously optimizing the user experience of Ming-flash-omni-Preview. We welcome you to provide feedback via community discussion or issues. The official version will be released soon.

Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation

· 17 min read
inclusionAI
Ant Group

GITHUB | 🤗 Hugging Face | 🤖 ModelScope

The Introduction Video of Ming-UniAudio

Audio Edit Demo

Editing Tasks Video demos

🚀 Technical Highlights

  1. First unified continuous speech tokenizer for both understanding and generation tasks: MingTok-Audio is a continuous speech tokenizer built on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and its hierarchical feature representations enable a closed loop with LLMs, making it suitable for both understanding and generation tasks (a minimal sketch follows this list).
  2. First Speech LLM with unified continuous tokenizer for both understanding and generation: Ming-UniAudio is an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-fidelity speech synthesis.
  3. First universal free-form speech editing model for semantic and acoustic tasks without explicit temporal regions: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with Ming-Freeform-Audio-Edit, the first open-source evaluation set for such tasks.
  4. First benchmark for free-form speech editing: We propose Audio-Edit-Benchmark, the first open-source free-form evaluation set comprising editing tasks of four semantic and five acoustic types, to evaluate the model's editing performance.
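As a rough sketch of what a VAE-based continuous speech tokenizer with a causal Transformer (highlight 1) can look like: a causally-masked encoder maps mel frames to per-frame Gaussian latents, with no vector-quantization codebook anywhere. All layer sizes and names are illustrative assumptions, not the MingTok-Audio implementation.

```python
import torch
import torch.nn as nn

class ContinuousSpeechTokenizer(nn.Module):
    def __init__(self, n_mels=80, d_model=512, latent_dim=64, n_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)

    def forward(self, mels: torch.Tensor):
        # mels: (batch, T, n_mels); a causal mask keeps the encoder streamable
        t = mels.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(t)
        h = self.encoder(self.in_proj(mels), mask=mask)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return z, mu, logvar  # continuous tokens: no codebook lookup

z, mu, logvar = ContinuousSpeechTokenizer()(torch.randn(1, 50, 80))
print(z.shape)  # torch.Size([1, 50, 64])
```

Training would add a reconstruction decoder and a KL term; here only the encoding path that yields continuous tokens for the LLM is shown.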

Instruction-Guided Free-Form Speech Editing

Semantic Editing - Insert

| Instruction | Transcription | Target Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- | --- |
| insert '简直' after the character or word at index 8. | 真是个浪漫的邂逅可以说是英雄救美了 | 真是个浪漫的邂逅简直可以说是英雄救美了 | (audio) | (audio) |
| insert '真正' before the character or word '好'. | 就有道而正焉可谓好学也已 | 就有道而正焉可谓真正好学也已 | (audio) | (audio) |
| insert 'clearly' before the character or word at index 8. | Its legal status in Trinidad was insufficient to preserve its ecological status. | Its legal status in Trinidad was insufficient clearly to preserve its ecological status. | (audio) | (audio) |
| insert 'successfully' after the character or word 'profession'. | Previously an attorney Korona left the profession to pursue a career in music. | Previously an attorney Korona left the profession successfully to pursue a career in music. | (audio) | (audio) |

Semantic Editing - Substitute

| Instruction | Transcription | Target Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- | --- |
| substitute '妈妈' with '爸爸'. | 我想对于妈妈来说会比任何礼物都要温暖 | 我想对于爸爸来说会比任何礼物都要温暖 | (audio) | (audio) |
| substitute the characters or words from index 8 to index 10 with '五万元'. | 当时我想等筹齐两万元聘礼就送她妈回家 | 当时我想等筹齐五万元聘礼就送她妈回家 | (audio) | (audio) |
| substitute 'get pictures off' with 'transfer photos from'. | I'm trying to explain to my mother how to get pictures off her phone. | I'm trying to explain to my mother how to transfer photos from her phone. | (audio) | (audio) |
| substitute the words from index 8 to index 9 with 'could become'. | Considering the growth of human population insects might be the food of the future. | Considering the growth of human population insects could become the food of the future. | (audio) | (audio) |

Semantic Editing - Delete

| Instruction | Transcription | Target Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- | --- |
| delete '比普通的茶叶要'. | 花草茶的口味一般比普通的茶叶要苦一些 | 花草茶的口味一般苦一些 | (audio) | (audio) |
| delete the characters or words from index 11 to index 15. | 我吃了点燕麦片煎鸡蛋还喝了点橙汁 | 我吃了点燕麦片煎鸡蛋汁 | (audio) | (audio) |
| delete 'times'. | The classification of this gibbon has changed several times in the past few years. | The classification of this gibbon has changed several in the past few years. | (audio) | (audio) |
| delete the characters or words from index 2 to index 6. | On the second day the boy climbed to the top of a cliff near the camp | On climbed to the top of a cliff near the camp | (audio) | (audio) |

Acoustic Editing - Dialect Conversion

| Instruction | Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- |
| Change the accent of the speech to Dongbei. | 之后,他考取导游证,成为拱北口岸中旅的导游。 | (audio) | (audio) |
| Change the accent of the speech to Chengdu. | 只有当科技为本地社群创造价值的时候,才能真正有意义。 | (audio) | (audio) |
| Change the accent of the speech to Chengdu. | 我得用回想与幻想补充我所缺少的饮食,安慰我所得到的痛苦。 | (audio) | (audio) |
| Change the accent of the speech to Guangxi. | 全国恶性肿瘤发病,及死亡第一位的是肺癌。 | (audio) | (audio) |

Acoustic Editing - Speed

| Instruction | Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- |
| adjusts the speed to 0.5. | 我用胸抵住车把,掌握方向,速度一点也不比别人慢。 | (audio) | (audio) |
| adjusts the speed to 0.7. | There is a growing body of case law on Bayh-Dole. | (audio) | (audio) |
| adjusts the speed to 1.3. | Cribb was born near Bristol but moved to London before starting professional fighting. | (audio) | (audio) |
| adjusts the speed to 2. | 切实帮助困难群众解决生产生活中,遇到的困难和问题。 | (audio) | (audio) |

Acoustic Editing - Pitch

| Instruction | Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- |
| shifts the pitch by 3 steps. | 因为外面有战争,家里又有战争带来的悲伤和匮乏。 | (audio) | (audio) |
| shifts the pitch by 5 steps. | 自动驾驶将大幅提升出行安全,效率。 | (audio) | (audio) |
| shifts the pitch by -1 steps. | The heart of the campus has a number of historic buildings. | (audio) | (audio) |
| shifts the pitch by -1 steps. | Stevenson is also the director of music ministries at Angeles Mesa Presbyterian Church. | (audio) | (audio) |

Acoustic Editing - Volume

| Instruction | Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- |
| adjusts the volume to 1.4. | A woman sits as she shows the designs she has made in the floor. | (audio) | (audio) |
| adjusts the volume to 1.6. | For example, they both consist of predominately older, hence redder, stars. | (audio) | (audio) |
| adjusts the volume to 0.9. | 伏羲的儿孙们看见伏羲捉来了鱼,也都欢欢喜喜跑来问长问短。 | (audio) | (audio) |
| adjusts the volume to 0.3. | 他们还告诉巨人,那座城市里群英荟萃。 | (audio) | (audio) |

Acoustic Editing - Denoise

| Instruction | Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- |
| denoise the audio. | Be shape of example,before deriving this formula we explained what we mean by problems of this kind we now generalize these ideas for general binomial experiments. | (audio) | (audio) |
| denoise the audio. | Summoned to himself with firmness no surrender his superiors had also preached this saying it was the way of eternal honor his comrades were old. | (audio) | (audio) |
| denoise the audio. | There are people who travel long distances to assure my continued existence we have also seen the power of faith at work among us it was muscular but it wasn't symmetrical. | (audio) | (audio) |
| denoise the audio. | Theory eventually proved inexact the heavens refused to give up their weeping but what has been happening recently might be described as creeping mannerism clever. | (audio) | (audio) |

Acoustic Editing - Background Music

| Instruction | Before Edit | Speechedit Result |
| --- | --- | --- |
| add rain to audio. | (audio) | (audio) |
| add car sound to audio. | (audio) | (audio) |
| add carefree music to audio. | (audio) | (audio) |
| add groovy music to audio. | (audio) | (audio) |

Acoustic Editing - Emotion Conversion

| Instruction | Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- |
| change the emotion to happy mood. | 比尔想再看小主人一眼然后走进森林安静地死去。 | (audio) | (audio) |
| change the emotion to happy mood. | 世界爱眼日是每年十月的第二个星期四。 | (audio) | (audio) |
| change the emotion to happy mood. | 我会玩很多游戏呢听说多喝水能治百病。 | (audio) | (audio) |
| change the emotion to happy mood. | 建议戴口罩空气质量轻度污染。 | (audio) | (audio) |

Audio Understanding

Chinese and English ASR

| Input | Transcription |
| --- | --- |
| (audio) | 呃很久没有看到看过如此不带价值判断的电影 |
| (audio) | 桃花庄人塔俱乐部是位于杭州市德清县的一个俱乐部 |
| (audio) | he was excited and at the same time uneasy maybe the girl had already forgotten him |
| (audio) | it's true that everything has its destiny but one day that destiny will be realized |

Dialect Understanding

| Input | Transcription |
| --- | --- |
| (audio) | [方言-粤语] 你做乜嘢啊系咪唔想倾偈啊。 |
| (audio) | [方言-上海话] 阿拉考试还没定下来唻。 |
| (audio) | [方言-闽南语] 宝贝较早休困晚安。 |
| (audio) | [方言-川渝方言] 我难受得很别个都睡了。 |

Context ASR

| Input | Prompt | Transcription |
| --- | --- | --- |
| (audio) | Please recognize the language of this speech and transcribe it. Format: oral. This is an audio about Banking. This audio may contains the following words or phrases:Zelle,daily A C H transfer limit,cashier's checks,transaction memos,F D I C regulations,cryptocurrency wallet,K Y C requirements. | Hey Chris, you won't believe what happened when I tried sending rent through Zelle yesterday. I hit some daily ACH transfer limit! My landlord's insisting on cashier's checks now. Remember how Sarah's Venmo payment got flagged last month? The bank's fraud detection system kept asking about transaction memos and 'source of funds' verification. Honestly, these FDIC regulations around peer-to-peer payments are getting ridiculous. I had to provide three months of bank statements just to increase my wire transfer threshold. Oh, and don't even get me started on cryptocurrency wallet KYC requirements. |
| (audio) | Please recognize the language of this speech and transcribe it. Format: oral. This is an audio about Banking. This audio may contains the following words or phrases:Priority Pass lounges,T S A Pre Check,rewards structure,bonus miles,Citibank's Prestige Card,Visa Infinite,E M V chip security protocols,dynamic currency conversion. | So listen, I finally canceled my Chase Sapphire Reserve last week. Remember how they touted those Priority Pass lounges and Luxury Hotel Collection benefits? Turns out I only used the T S A Pre Check credit once this whole year! The annual fee jumped to five hundred fifty dollars, plus they started requiring eighteen thousand points to waive it. My Amex Platinum isn't any better that seven hundred dollar fee just hit, and their new rewards structure requires thirty thousand in annual spending for bonus miles. Oh, and get this Citibank's Prestige Card now charges two hundred bucks for authorized users! Honestly, these Visa Infinite perks like concierge services and purchase protection sound fancy, but when do regular people actually use E M V chip security protocols or dynamic currency conversion? |
| (audio) | Please recognize the language of this speech and transcribe it. Format: oral. This is an audio about 酒店常旅客计划. This audio may contains the following words or phrases:至悦大使,重庆来福士洲际,酒廊待遇,万豪旅享家,钛金会员. | 诶?小李,我最近在研究IHG的会员体系,这个‘至悦大使’的达标条件也太苛刻了吧!‘三百权益’里,洲际的认可房晚才给三十晚。你说,他们家的‘先行者任务’算不算‘里程碑奖励’啊?对了,我之前用积分兑换重庆来福士洲际的行政套房,礼宾部居然没给酒廊待遇,反而现金订房的客人能拿到双早。万豪旅享家的‘钛金会员’都能自动匹配套房升级券,IHG这个动态定价系统真是让人头大! |
| (audio) | Please recognize the language of this speech and transcribe it. Format: oral. This is an audio about 汽车行业. This audio may contains the following words or phrases:汽车之家曹雷,矩阵式 L E D 大灯,四十八伏轻混系统,可变气门升程技术,M B U X 超联屏,Sportback,Allroad. | 嘿,老李,你看到‘汽车之家’曹雷发的文章没?说新款奥迪A3加长到四米六了。昨儿我去4S店试驾,销售说这车配了啥矩阵式LED大灯,还有四十八伏轻混系统。不过,宝马1系那个B48发动机也改了‘可变气门升程技术’,奔驰A级更夸张,直接把MBUX超联屏塞进紧凑车里!要我说啊,现在车企搞细分市场真够拼的!听说奥迪还要出Sportback、Allroad等四个版本呢,连自适应巡航都标配了! |

Audio Generation

Voice Clone

| Input Prompt | Target Text | TTS Result |
| --- | --- | --- |
| (audio) | 全球每年有超过一百三十五万人,因交通事故而死亡。 | (audio) |
| (audio) | The stained glass offered a hypnotic atmosphere. | (audio) |

Multi-lingual Synthesis

| Input Prompt Text | Input Prompt Audio | Target Text | TTS Result |
| --- | --- | --- | --- |
| We asked over twenty different people, and they all said it was his. | (audio) | The stained glass offered a hypnotic atmosphere. | (audio) |
| The wedding was photographed by celebrity wedding photographer Kid Chan. | (audio) | Bender also conducted extensive research on autism. | (audio) |
| 关于不少万达广场的注册资本金更改。 | (audio) | 哎,这些情况在北京这样的大都市,是无法避免的。 | (audio) |
| 长春周二之前晴天多云五月七日是晴天。 | (audio) | 两人一直对婚变封口,使传闻闹得热烘烘。 | (audio) |

Ming-UniVision: Joint Image Understanding and Generation via a Unified Continuous Tokenizer

· 7 min read
inclusionAI
Ant Group

GITHUB | 🤗 Hugging Face | 🤖 ModelScope

🚀 Technical Highlights

  1. First Continuous Unified Tokenizer for Vision: MingTok seamlessly supports both image understanding and generation within a single continuous latent space—eliminating quantization and bridging modalities.
  2. First NTP-style Autoregressive MLLM with Unified Continuous Visual Tokens: By building on MingTok, Ming-UniVision unifies vision and language under a shared next-token prediction framework, enabling end-to-end autoregressive modeling of diverse vision tasks.
  3. Reduced Representational Competition → 3.5× Faster Convergence: The unified continuous representation aligns semantic understanding and generative dynamics, significantly accelerating joint training without performance trade-offs.
  4. Multi-Round In-Context Learning in a Single Feature Space: All operations—understanding, generation, and editing—occur in the same continuous space, eliminating costly cross-space conversions and enabling simpler, more efficient training and inference.

The Challenge: The Inverse Nature of Seeing and Drawing

Autoregression—the powerful paradigm of modeling the world by “predicting the next token”—has already unified diverse modalities like language and audio. The next frontier is to bring visual understanding (seeing) and visual generation (drawing) into this unified sequence‑to‑sequence framework.

However, this ambition encounters a deep challenge: in many respects, understanding and generation are inverse tasks.

  • Understanding: Pixels → high‑dimensional, abstract semantic concepts
  • Generation: Concepts → fine‑grained, high‑fidelity pixels

These tasks have drastically different—and often competing—preferences for their underlying visual representation.

Why Previous Approaches Fell Short

Existing models attempt unification via two limited strategies:

  1. Asymmetric Designs: Use different, heterogeneous feature spaces for each task. During multi‑turn interactions, this forces inefficient “round‑trips” between spaces, causing latency and complexity.
  2. Shared Discrete Tokens: Unify the token space but introduce quantization errors. This hurts image fidelity and degrades understanding capability.

Our Solution: Ming-UniVision and MingTok

To break this impasse, we introduce Ming-UniVision, a new generation of autoregressive vision‑language model built on a foundational innovation: MingTok.

MingTok is the first visual tokenizer based on a continuous latent space. It delivers a truly unified and efficient representation that serves as the bedrock for Ming‑UniVision’s unified NTP (Next‑Token Prediction) framework—harmonizing image understanding, generation, and editing in one in‑context multimodal loop.

The Core Design: A Three-Stage Architecture to Reconcile Competition

At the heart of Ming-UniVision is the MingTok tokenizer, a three-stage sequential architecture elegantly designed to reconcile the competing representational demands of understanding and generation within a single framework.

Figure 1: (a) Existing models use separate visual representations. (b) MingTok, the engine of Ming-UniVision, uses a unified scheme for both semantic and generative representations. (c) This unified approach leads to over 3.5x faster training convergence.

  1. Low-level Encoder: Maps an input image into a sequence of compact, continuous latent codes, optimized for high-quality and efficient autoregressive generation.
  2. Semantic Decoder: Autoregressively "refines" the compact latent codes into high-dimensional, rich semantic features aligned with top-tier understanding models like CLIP.
  3. Pixel Decoder: Serves as a quality-assurance module, ensuring the original image can be reconstructed from the compact latents with high fidelity (a schematic sketch of the three stages follows this list).
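The data flow through the three stages can be sketched schematically as below. Each stage is collapsed to a single linear layer purely to show the plumbing (pixels → compact latents → semantic features, plus a reconstruction path); the real MingTok stages are transformer networks, and every dimension here is an assumption.

```python
import torch
import torch.nn as nn

class MingTokSketch(nn.Module):
    """Schematic only: three-stage tokenizer data flow, not the real model."""

    def __init__(self, patch_dim=768, latent_dim=32, sem_dim=1024):
        super().__init__()
        self.low_level_encoder = nn.Linear(patch_dim, latent_dim)  # stage 1
        self.semantic_decoder = nn.Linear(latent_dim, sem_dim)     # stage 2
        self.pixel_decoder = nn.Linear(latent_dim, patch_dim)      # stage 3

    def forward(self, patches: torch.Tensor):
        z = self.low_level_encoder(patches)  # compact latents for AR generation
        sem = self.semantic_decoder(z)       # rich features for understanding
        recon = self.pixel_decoder(z)        # reconstruction for fidelity
        return z, sem, recon

z, sem, recon = MingTokSketch()(torch.randn(1, 256, 768))  # 256 image patches
print(z.shape, sem.shape, recon.shape)
```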

The Key Innovation: MingTok creates a unified, differentiable interface. The high-level features for understanding can be directly fed as conditional input for the next round of generation or editing. This completely eliminates the costly detour through pixel space.

The Breakthrough: A Fundamental Leap in Efficiency

By integrating MingTok, Ming-UniVision achieves competitive results on both understanding and generation tasks. The shared continuous latent space unlocks two fundamental layers of efficiency, resolving bottlenecks that have plagued previous architectures.

Figure 2: On general recognition tasks, our method approaches the performance of models with separated representations and significantly outperforms other unified-representation models. For generation, our model shows a clear advantage on fine-grained tasks.

1. A Revolution in Training: >3.5x Faster Convergence

Traditional approaches expend massive resources aligning heterogeneous representations, creating an intrinsic "task competition" that slows learning. MingTok solves this at its root.

  • Synergistic Enhancement: Our ablation studies show that using MingTok for both tasks fosters a synergy where understanding and generation capabilities enhance each other, rather than competing.
  • >3.5x Speedup: By avoiding inefficient alignment, the model focuses its energy on learning, reaching the same performance level in a fraction of the time compared to traditional schemes.

Figure 3: The performance drop between generation-only training and joint training is minimal with MingTok, proving the advantage of our unified approach.

2. A Revolution in Interaction: Goodbye to the "Pixel Round-Trip"

The efficiency of multi-turn interactions (e.g., generate → edit → re-generate) depends on the "understanding-generation" loop. This is precisely where traditional architectures falter.

| Architecture Type | Multi-turn Capability | Core Bottleneck | Interaction Path | Efficiency & Fidelity |
| --- | --- | --- | --- | --- |
| DiT-based Models | ❌ Not Natively Supported | Non-autoregressive, stateless | N/A (full process restart) | Low |
| Hybrid Architectures | ⚠️ Supported, but Inefficient | Dual-branch, un-unified spaces | Latent → Pixel → Feature | Low, complex, lossy |
| Unified AR | ⚠️ Supported, but Inefficient | Heterogeneous spaces | Latent → Pixel → Feature | Low, lossy |
| Ming-UniVision | ✅ Native & Highly Efficient | Unified Continuous Space | Feature → Feature | High & High-Fidelity |

As the table shows, any architecture with separated spaces is doomed to the inefficient Latent → Pixel → Feature round-trip. This "pixel detour" introduces massive latency and causes contextual information to decay.

Ming-UniVision achieves a direct Feature → Feature closed loop. High-level features from an understanding task can be directly consumed by the next generation task, unlocking truly coherent multimodal sequence modeling. This enables tasks that once required multiple specialized models to emerge naturally within a single, unified framework:

  • Iterative Image Enhancement: Perform super-resolution, then directly continue with colorization or denoising.
  • Generative Chain-of-Thought: Perform an understanding task (e.g., "segment the car"), then directly apply an editing command to that region.

Figure 4: Multi-turn tasks like "Super-resolution → Colorization" and "Segmentation → Editing" are now part of a seamless flow.

Understanding, generation, and editing are no longer isolated pipelines but are woven into a continuous visual conversation.


Conclusion and The Road Ahead

We believe that a unified and continuous visual representation like MingTok opens up new possibilities for building more flexible and intuitive multimodal interactive systems.

We know this is just one step in a long journey. We have open-sourced our code and initial model weights, hoping to provide a useful foundation for the community and to inspire more discussion around unified representations. We look forward to collaborating with our peers to collectively advance the future of multimodal AI.

Get Involved

Try out our open-source model Ming-UniVision and MingTok-Vision on our GitHub Page / Demo Page. Please star our repo if you like it!

Segmentation-as-Editing for Unified Multimodal AI

· 8 min read
inclusionAI
Ant Group

GITHUB | 🤗 Hugging Face | 🤖 ModelScope

The Hype and the Hidden Question

The multimodal AI world has been thriving.

From the debut of Qwen-Image to the interactive editing hype sparked by Nano Banana, image editing has rapidly become the next battlefield for generative AI.

Editing fundamentally requires two distinct skill sets:

  • Know where, what, and how to change (understanding the image)
  • Produce the change with high visual quality (generating the image)

Its rich gameplay and strong interactivity have pulled in users, developers, and creators alike.

But behind the noise, few are asking:

Beneath this prosperity, how close are we to a truly unified “understanding + generation” AI?

Understanding and Generation: Two Hands, Often Out of Sync

For years, we’ve chased an ambitious goal:

Build a unified multimodal model that understands the world like a scientist (e.g., image segmentation) while creating it like an artist (e.g., image editing).

In theory, these abilities should be mutually reinforcing:

“The deeper the understanding, the better the creation; the more the creation, the deeper the understanding.”

Reality is messier.

In AI today:

  • Understanding = the left hand: precise abstractions, semantic reasoning, boundaries.
  • Generation = the right hand: coherent pixels, style, aesthetics.

But training a model to recognize 10,000 cat photos doesn’t magically make it capable of painting cats, and painting cats repeatedly doesn’t make it understand cats better.

Worse, in multitask training, the two often compete for resources — optimizations for understanding can hurt generation, and vice versa.

We’re missing a catalyst: a task that forces the left and right hands to evolve together.


The Struggle: 16% Segmentation and Out-of-Control Generation

Before finding our solution, our unified model was struggling with generative segmentation:

Given an instruction like “segment the banana in the upper-right corner”, we wanted the model to output a segmentation mask directly.

The results were painful.

Struggling with Segmentation

On RefCOCO-val, our cIoU plateaued at ~16%.

The root cause is the distribution gap.

Generative models thrive on natural, continuous image distributions. Segmentation masks, however, are synthetic, abstract, binary maps — as unnatural as it gets for an image generator.

It was like asking a painter to draw an X-ray: doable, but far from their artistic instincts.

Here, generation wasn’t helping segmentation — it was tripping it up.

We needed a new task that:

  1. Met the precision demands of understanding.
  2. Played to the strengths of generation.

The “Aha” Moment: Dressing Segmentation in Color

Here’s the analogy that unlocked it for us:

If you want a child to mark an object, is it easier to have them draw a tight outline with a pencil, or fill it in with bright colors?

Obviously, the latter.

Instead of forcing our model to output abstract black-and-white masks, we turned the segmentation task into a color-editing task.

Example:

  • Instruction: segment the banana in the upper-right
  • Old way: Output a mask ❌
  • New way: Directly edit the image: “paint the banana purple”, “make the banana red”, etc. ✅

Segmentation as Editing

This brought the task’s data distribution back to the realm of natural images — where generative models shine.

Why This Works: The Hidden Catalyst

That small twist turned out to be exactly the catalyst we’d been searching for.

  • Boosting Understanding: To color the banana without bleeding outside the boundary, the model must internally nail pixel-perfect segmentation. The segmentation step became an implicit prerequisite to editing.

  • Unleashing Generation: No more awkward synthetic masks — the model is doing what it knows best: image-to-image editing. All its strengths in shading, texture, and edge blending go into making the change look natural.

For the first time, the left hand and right hand weren’t fighting — they were helping each other.


The Numbers: From 16% to 72.4% — and Beyond

1. SOTA-level Segmentation

The cIoU score didn’t just improve — it soared from 16% to 72.4% on RefCOCO-val, a relative gain of over 350%.

Qualitatively, the model outperformed competitors in pinpointing and segmenting targets, even in reasoning-heavy cases.

Against Qwen-Image and Nano Banana, our model:

  • Located small or occluded targets more reliably.
  • Produced boundaries that were visually and semantically aligned with instructions.

Segmentation Comparison 1 Our model (right) accurately locates and segments the target subject. Qwen-Image (second from left) fails to locate the correct target, while Nano-banana (third from left) fails to accurately segment the man's head and has loose boundary lines.

Segmentation Comparison 2 For the prompt "please segment the girl with red mask," our model (right) is precise. Qwen-Image (second from left) misses the feet, and Nano-banana (third from left) alters the subject's proportions.

During evaluation, thanks to our model's high consistency in non-edited regions, we can derive the segmentation mask directly by computing the per-pixel difference between the edited result and the original image.
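A minimal sketch of that mask-extraction step, assuming the edit recolors only the target region: threshold the per-pixel difference between the original and edited images. The threshold value is an arbitrary assumption.

```python
import numpy as np

def mask_from_edit(original: np.ndarray, edited: np.ndarray, thresh=25):
    """original, edited: (H, W, 3) uint8 images of identical size.
    Returns a binary mask where 1 marks pixels that changed in the edit."""
    diff = np.abs(original.astype(np.int16) - edited.astype(np.int16))
    return (diff.max(axis=-1) > thresh).astype(np.uint8)

original = np.zeros((64, 64, 3), dtype=np.uint8)
edited = original.copy()
edited[16:48, 16:48] = (128, 0, 128)  # pretend the model painted a region purple
print(mask_from_edit(original, edited).sum())  # 1024 pixels inside the mask
```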

Calculating difference on Ming-Lite-Omni1.5, Qwen-Image-Edit, Nano-banana

The results show that our model's performance on segmentation is now on par with specialized vision models.

| Model Category | Model Name | RefCOCO (val) | RefCOCO+ (val) | RefCOCOg (val) |
| --- | --- | --- | --- | --- |
| Vision Specialist Models | VLT | 67.5 | 56.3 | 55.0 |
| | CRIS | 70.5 | 62.3 | 59.9 |
| | LAVT | 72.7 | 62.1 | 61.2 |
| | PolyFormer-B | 74.8 | 67.6 | 67.8 |
| MLLM + Specialist (SAM) | LISA-7B | 74.1 | 62.4 | 66.4 |
| | PixelLM-7B | 73.0 | 66.3 | 69.3 |
| Generative Models | Nano-banana* | 15.7 | 13.9 | 14.9 |
| | Qwen-Image-Edit* | 30.3 | 28.8 | 34.0 |
| | Ming-Lite-Omni1.5 | 72.4 | 62.8 | 64.3 |

For each test set, Nano-banana and Qwen-Image-Edit were evaluated on a randomly sampled subset of 500 images to reduce computational cost while preserving the key statistical trends. We observed that Nano-banana frequently fails to grasp the image-segmentation intent during inference, which explains its comparatively low scores. This may be attributed to differences in training objectives and data emphasis.

2. Sharper, More Controllable Editing

The beauty of this method is that it not only fixed the segmentation weakness but also dramatically enhanced the model's general editing capabilities.

Because the model has learned an unprecedented "respect for boundaries" through thousands of "precise coloring" exercises, this "muscle memory" for fine-grained control has transferred to all editing tasks. Our edit controllability score saw a significant jump from 7.69 to 8.12 across sub-tasks like background, color, and material changes.

Editing Controllability Comparison Prompt: "remove the bow tie of the man on the far right." Our model (right) precisely removes only the target bow tie while maintaining background consistency. Qwen (second from left) incorrectly removes multiple bow ties and introduces inconsistencies. Nano-banana (third from left) also struggles with consistency.

3. Stronger ID Consistency

A core challenge in portrait editing is maintaining identity. Our model excels here as well. Whether changing a hairstyle or adjusting an expression, the model skillfully preserves the person's core features.

ID Consistency Comparison Top Row (Turn head): Our model (right) maintains ID and background consistency, unlike competitors. Middle Row (Smile): Our model (right) correctly follows the prompt while preserving ID, avoiding distortions seen in others. Bottom Row (Change background): Our model (right) excels at preserving the subject's ID and appearance during a background swap.

See More Editing Consistency in Action:


An Honest Look: Where We Can Still Improve

Despite the leap forward, challenges remain:

  • Large pose changes (e.g., standing → running) need more reliability.
  • Multi-step or compound instructions require better parsing and execution.
  • Instruction diversity support needs expansion.

These are our next milestones.

Takeaway: The Next Catalysts Are Out There

From 16% to 72.4% — this wasn’t driven by a massive architecture overhaul or billion-image datasets.

It came from one change in task design.

The lesson: Instead of gluing capabilities together after the fact, find naturally cooperative tasks — where solving the problem requires multiple abilities to mesh seamlessly.

“Segmentation-as-editing” is just the first example.

We suspect 3D understanding, video generation, and other domains have their own hidden catalysts, waiting to be discovered.

At last, AI’s left and right hands have learned to high-five.

And this is only the overture.

Try out our open-source model Ming-lite-omni 1.5 on our GitHub Page / Demo Page. Please star our repo if you like it!

Introducing Ring-lite-2507

· 7 min read
inclusionAI
Ant Group

📖 Technical Report | 🤗 Hugging Face | 🤖 ModelScope

Overview

We present Ring-lite-2507, an upgraded version of our previously released lightweight reasoning model, Ring-lite (2506). Built upon a 16.8B Mixture-of-Experts (MoE) large language model with 2.75B activated parameters, Ring-lite-2507 further advances its reasoning capabilities while demonstrating superior performance across a comprehensive range of LLM benchmarks, including general text understanding, alignment, coding, logical reasoning, and agentic tasks. Thanks to our innovative and robust reinforcement learning training pipeline, Ring-lite-2507 distinguishes itself from the latest public dense models under 10B parameters by offering competitive performance across various tasks while activating only about one-third as many parameters.

To address the optimization instability of MoE RL training, we propose a novel approach, Constrained Contextual Computation Policy Optimization (C3PO), which enhances training stability and improves computational throughput via algorithm-system co-design. Additionally, we systematically investigate the dynamic relationship between long chain-of-thought SFT and RL training. Rather than relying solely on validation metrics, we explore optimal strategies for selecting the fine-tuned model best suited for RL scaling, yielding superior performance-efficiency trade-offs in our RL training pipeline. Lastly, we develop a two-stage training paradigm to harmonize multi-domain data integration, enhancing reasoning ability while effectively improving performance across various downstream general tasks.

Highlights

  • 🚀 Superior performance across tasks: Ring-lite-2507 demonstrates outstanding performance across both reasoning and general tasks;
  • 🔥 Only 2.75B activated parameters: Ring-lite-2507 is built upon a Mixture-of-Experts (MoE)-based large language model with only 2.75 billion activated parameters;
  • ⛓️‍💥 Algorithm-system co-design: We propose the novel C3PO approach and employ token efficiency to improve training stability and effectiveness;
  • 🔍 Publicly available: We fully release our training recipe and model weights.

Evaluation

We conduct a comprehensive evaluation of our models across two main domains: reasoning and general. We utilize a diverse set of public benchmarks, organized according to the specific aspects they measure.

Knowledge Understanding

| Benchmark | Ring-lite-2507 | Ring-lite-2506 | Qwen3-8B-Thinking |
| --- | --- | --- | --- |
| MMLU-Pro (EM) | 72.50 | 63.44 | 72.56 |
| GPQA-Diamond (Pass@1) | 69.35 | 63.51 | 62.00 |
| SuperGPQA (EM) | 40.05 | 13.97 | 40.36 |
| Phybench (Pass@1) | 28.51 | 29.19 | 22.14 |

Math

| Benchmark | Ring-lite-2507 | Ring-lite-2506 | Qwen3-8B-Thinking |
| --- | --- | --- | --- |
| MATH-500 (Pass@1) | 97.95 | 96.80 | 97.30 |
| CNMO 2024 (Pass@1) | 75.09 | 77.26 | 74.57 |
| AIME 2024 (Pass@1) | 79.79 | 79.00 | 74.90 |
| AIME 2025 (Pass@1) | 72.92 | 69.50 | 67.19 |
| LiveMathBench (Pass@1) | 83.37 | 85.08 | 81.90 |
| TheoremQA (Pass@1) | 70.00 | 70.19 | 68.81 |
| OlympiadBench (math) (Pass@1) | 80.64 | 82.86 | 80.20 |

Coding

| Benchmark | Ring-lite-2507 | Ring-lite-2506 | Qwen3-8B-Thinking |
| --- | --- | --- | --- |
| LiveCodeBench (2408-2505) (Pass@1) | 60.35 | 59.53 | 55.12 |
| Codeforces (Rating) | 1830 | 1673 | 1580 |
| Codeforces (Percentile) | 92.16 | 88.00 | 79.44 |

Reasoning & Agentic

| Benchmark | Ring-lite-2507 | Ring-lite-2506 | Qwen3-8B-Thinking |
| --- | --- | --- | --- |
| DROP (zero-shot F1) | 89.27 | 60.21 | 87.13 |
| BBH (EM) | 88.65 | 50.84 | 87.30 |
| ARCPrize (Pass@1) | 19.00 | 3.12 | 3.88 |
| MuSR (EM) | 77.19 | 66.77 | 76.92 |
| BFCL_Live (Pass@1) | 74.81 | 66.76 | 75.99 |

Alignment

| Benchmark | Ring-lite-2507 | Ring-lite-2506 | Qwen3-8B-Thinking |
| --- | --- | --- | --- |
| IFEval (Prompt Strict) | 84.66 | 54.34 | 85.40 |
| AlignBench v1.1 (gpt-4.1) | 80.90 | 69.60 | 74.70 |
| FoFo (gpt-4-turbo) | 85.02 | 67.81 | 81.93 |
| ArenaHard (gpt-4.1) | 88.85 | 56.12 | 86.14 |

Constrained Contextual Computation Policy Optimization (C3PO)

We introduce Constrained Contextual Computation Policy Optimization (C3PO), an innovative token-level optimization framework designed to mitigate training instability while enhancing throughput consistency. Unlike sample-level filtering, C3PO operates at the token level: it samples tokens to form a token-level global batch, so each training step feeds a consistent number of tokens to the optimizer. This reduces gradient variance and consequently yields stable optimization.
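A toy illustration of the fixed-token-budget idea (not the actual C3PO algorithm, which also involves clipping and advantage computation): sample a constant number of tokens from each step's rollouts so the optimizer always sees the same batch size. The budget and sampling rule here are illustrative assumptions.

```python
import random

TOKEN_BUDGET = 4096  # assumed fixed number of tokens optimized per step

def token_level_batch(rollouts):
    """rollouts: list of (sample_id, response_length) pairs from this step."""
    pool = [(sid, pos) for sid, length in rollouts for pos in range(length)]
    random.shuffle(pool)
    return pool[:TOKEN_BUDGET]  # constant-size batch -> lower gradient variance

step_a = [("r0", 3000), ("r1", 1200)]  # 4200 tokens generated this step
step_b = [("r2", 7000), ("r3", 2500)]  # 9500 tokens generated this step
print(len(token_level_batch(step_a)), len(token_level_batch(step_b)))  # 4096 4096
```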


C3PO

Balancing Token Efficiency between Distillation and RL

While distillation is effective, we find that it requires more training tokens than RL training to achieve comparable performance. Furthermore, we observe that the number of training epochs for the distilled model significantly influences the trend of the entropy loss, and thereby the exploration scope for RL. Our experiments show that increasing the number of SFT epochs leads to a rapid collapse in entropy, whereas insufficient SFT training inevitably results in inferior performance. To systematically quantify the choice of the optimal SFT epoch, we employ token efficiency to determine the checkpoint best suited for RL scaling.

Training Data

To ensure a high-quality training dataset for reinforcement learning, we established a comprehensive and meticulous data curation pipeline. This pipeline encompasses several key stages, such as data cleansing, answer verification, and data annotation, all designed to thoroughly decontaminate the data and ensure it is both suitable and informative for RL training.


Data Pipeline

Training Pipeline


Training Pipeline

Reasoning RL

Compared to our previously released Ring-lite-2506, we expanded our reasoning dataset by incorporating more challenging math, coding, and STEM data. Specifically, we adopted 67K math problems, 32K coding problems, and 9.9K scientific problems for reasoning RL training. In addition, we augmented the dataset with more than 19K logical games, such as ARC-AGI, Countdown, Sudoku, and AlphaMaze. For each type of problem, we designed a suitable reward function to ensure our training examples are verifiable.
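For a flavor of what such a verifiable reward function can look like, here is an assumed example for a Countdown-style puzzle; the post does not publish the actual reward implementations, so this is purely illustrative.

```python
import ast

def countdown_reward(expr: str, numbers: list[int], target: int) -> float:
    """Reward 1.0 only if expr uses exactly the given numbers and hits target."""
    try:
        tree = ast.parse(expr, mode="eval")
        used = [n.value for n in ast.walk(tree) if isinstance(n, ast.Constant)]
        if sorted(used) != sorted(numbers):  # must use each number exactly once
            return 0.0
        return 1.0 if eval(compile(tree, "<expr>", "eval")) == target else 0.0
    except (SyntaxError, ZeroDivisionError):
        return 0.0  # malformed or invalid expressions earn nothing

print(countdown_reward("(100 - 4) * 25 / 10", [100, 4, 25, 10], 240))  # 1.0
print(countdown_reward("100 + 4", [100, 4, 25, 10], 240))              # 0.0
```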

General RL

Apart from reasoning tasks, our Ring-lite-2507 has significantly expanded the collection of general datasets for RL training. Our general RL training does not compromise performance on reasoning tasks; instead, it enhances overall text understanding across a broad range of general benchmarks.

Our general RL training incorporates a variety of tasks, including instruction following, question answering, text summarization, and more. For open-ended questions, we employ a robust reward model to assign appropriate scores. Additionally, we have integrated a rule-based verifier to handle problems that can be easily validated, such as instruction-following tasks.

Citation

@misc{lingteam2025ringlitescalablereasoningc3postabilized,
title={Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs},
author={Ling Team and Bin Hu and Cai Chen and Deng Zhao and Ding Liu and Dingnan Jin and Feng Zhu and Hao Dai and Hongzhi Luan and Jia Guo and Jiaming Liu and Jiewei Wu and Jun Mei and Jun Zhou and Junbo Zhao and Junwu Xiong and Kaihong Zhang and Kuan Xu and Lei Liang and Liang Jiang and Liangcheng Fu and Longfei Zheng and Qiang Gao and Qing Cui and Quan Wan and Shaomian Zheng and Shuaicheng Li and Tongkai Yang and Wang Ren and Xiaodong Yan and Xiaopei Wan and Xiaoyun Feng and Xin Zhao and Xinxing Yang and Xinyu Kong and Xuemin Yang and Yang Li and Yingting Wu and Yongkang Liu and Zhankai Xu and Zhenduo Zhang and Zhenglei Zhou and Zhenyu Huang and Zhiqiang Zhang and Zihao Wang and Zujie Wen},
year={2025},
eprint={2506.14731},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.14731},
}

Introducing Ming-Lite-Omni V1.5

· 6 min read
inclusionAI
Ant Group

GITHUB | 🤗 Hugging Face | 🤖 ModelScope

Overview

Ming-lite-omni v1.5 is a comprehensive upgrade to the full-modal capabilities of Ming-lite-omni (GitHub). It significantly improves performance across tasks including image-text understanding, document understanding, video understanding, speech understanding and synthesis, and image generation and editing. Built upon Ling-lite-1.5, Ming-lite-omni v1.5 has a total of 20.3 billion parameters, with 3 billion active parameters in its MoE (Mixture-of-Experts) section. It demonstrates highly competitive results on various modal benchmarks compared to industry-leading models.


Performance Comparison

Introducing Ming-lite-omni v1.5

Controllable Image Generation: Pixel-Level Control, Infinite Creativity

Ming-lite-omni v1.5 significantly optimizes Scene Consistency and ID Consistency (Character / Style Consistency) in image editing. When editing human figures, it demonstrates a clear advantage in maintaining scene and character ID. Furthermore, it expands support for perceptual tasks such as generative segmentation, depth prediction, object detection, and edge contour generation.

Generative Image Segmentation
User
Given the following instructions: little girl, pink, your monitors colors off friend p pink shirt girl; please perform referring segmentation on this image.
Ming-lite-omni v1.5
User
Please segment different classes in this image.
Ming-lite-omni v1.5
User
Please segment different instances in this image.
Ming-lite-omni v1.5

Depth and Edge Detection

Original Image | Generated Depth Map | Generated Bounding Boxes | Generated Edge Contours

Audio-Video Interactive Understanding

Experience Alignment Optimization: Smarter Interaction That Understands You Better

Visual Dialogue
User
This picture looks very design-oriented. Can you describe its various design elements in detail?
Ming-Lite-Omni
This image shows a high-end gaming mouse with a modern and technological design. The mouse body is black with a smooth metallic finish. There is a blue glowing logo on the top of the mouse, and below the logo is the mouse wheel and left and right button area. The side of the mouse has a blue light decoration, which increases the visual appeal. The bottom of the mouse has a non-slip texture design to ensure stability during use. The overall design is simple and futuristic, suitable for gamers.
Ming-Lite-Omni V1.5
This image showcases a gaming mouse with a striking design, featuring the following design elements:
1. **Color and Material**: - The main body of the mouse is black, with a smooth surface and a metallic finish. - The edges and button areas of the mouse have blue LED light strips, adding a high-tech feel and visual appeal.
2. **Shape and Structure**: - The mouse has an overall streamlined design, with a slightly sloped top that allows for natural hand placement. - The sides of the mouse feature pronounced curves, designed ergonomically to provide a comfortable grip.
3. **Buttons and Scroll Wheel**: - There is a scroll wheel on the left side of the mouse, with a blue LED light strip above it. - There are additional buttons on the right side of the mouse, which also have blue LED light strips, enhancing both functionality and aesthetics.
4. **Brand Logo**: - In the center of the mouse's palm rest, there is an illuminated "G" logo, which is a signature design element of the brand.
5. **Lighting Effects**: - The blue LED light strips on the mouse stand out prominently against the dark background, enhancing its high-tech and futuristic feel.
6. **Texture and Details**: - The top of the mouse has a textured design, likely for an anti-slip effect. - The sides of the mouse feature fine, elongated textured patterns, which add a sense of visual depth.
Overall, the design of this mouse combines a modern, high-tech aesthetic with ergonomic design, making it both beautiful and practical, and well-suited for gamers.
User
Why is the calligraphy in the picture good?
Ming-Lite-Omni
This calligraphy work showcases superb calligraphic skills and profound cultural heritage. The brushwork is fluid and natural, the character structures are rigorous, and the layout is well-proportioned, reflecting the calligrapher's deep skill and artistic cultivation.
Ming-Lite-Omni V1.5
This piece of calligraphy displays several notable characteristics that make it an excellent work of art:
1. **Exquisite Brushwork**: The calligraphy's brushwork is fluid and natural, with rich variations in line thickness, demonstrating the calligrapher's profound skill and proficient technique.
2. **Rigorous Structure**: The structure of each character is meticulously designed, with natural connections between strokes, maintaining a good balance and harmony between characters.
3. **Composition and Layout**: The overall composition of the piece is well-arranged, with even line and character spacing, creating a sense of order and neatness.
4. **Ink Tone Variation**: The variations in ink tone, from dark to light, are natural, adding layers and a three-dimensional feel to the work.
5. **Elegant Content**: The content of the calligraphy is a beautiful classical text, rich in philosophical and literary value, which enhances the cultural depth of the work.
6. **Seal Embellishment**: The red seals on the work not only add a decorative element but also serve to embellish and balance the composition.
Together, these features constitute the high artistic value of this calligraphy piece, making it a masterpiece worthy of appreciation and study.
User
Is the phone in the picture suitable to buy for parents?
Ming-Lite-Omni
This phone has a simple and stylish appearance, suitable for parents who prefer a minimalist style. The camera configuration is also quite advanced, making it suitable for users who enjoy taking photos.
Ming-Lite-Omni V1.5
This phone is suitable for buying for parents for the following reasons:
1. **Appearance Design**: The phone features a simple and stylish design. The white body looks clean and neat, aligning with modern aesthetics.
2. **Camera Configuration**: The rear camera has a unique and neat arrangement, which may provide a good photography experience.
3. **Brand**: The Honor (荣耀) brand has a good reputation in the smartphone market, offering stable and reliable products.
4. **Features**: Phones of this type are typically equipped with a large-capacity battery and long battery life, suitable for daily use.
5. **Ease of Use**: It may offer a simple user interface and easy-to-operate functions, suitable for users who are not familiar with complex operations.
6. **Value for Money**: Honor phones are generally known for their high cost-performance ratio, making them suitable for users on a limited budget who still want a good experience.
Of course, whether it is truly suitable depends on the actual needs and preferences of the parents.
User
How to buy a plane ticket using the software in the picture? Please list the specific steps
Ming-Lite-Omni
The specific steps to buy a plane ticket using the software in the picture are as follows:
1. Open the Ctrip app. 2. Click on the "Flights" option. 3. On the flight page, select the departure city, destination, departure date, and return date. 4. Click the search button to view available flight information. 5. After selecting a suitable flight, click the "Book" button. 6. Fill in passenger information and choose a payment method. 7. Confirm the order details and submit the payment to complete the purchase.
Ming-Lite-Omni V1.5
The specific steps to purchase a plane ticket using the Ctrip app are as follows:
1. **Download and Install the Ctrip App**: - If you haven't installed the Ctrip App, you can search for "Ctrip Travel" (携程旅行) in the app store (like the Apple App Store or Android app markets) and download it.
2. **Open the Ctrip App**: - After installation is complete, open the Ctrip app.
3. **Select the "Flights" Option**: - On the homepage navigation bar, tap the "Flights" option.
4. **Enter Departure and Destination Cities**: - On the flight booking page, enter your departure and destination cities.
5. **Select Departure and Return Dates**: - Choose your departure date and return date (if booking a round trip).
6. **Select Passenger Information**: - Enter the number of passengers and their types (adult, child, infant).
7. **Choose a Flight**: - Browse and select your preferred flight.
8. **Fill in Passenger Information**: - Enter the passenger's name, ID number, contact information, etc.
9. **Select a Payment Method**: - Choose your preferred payment method (e.g., Alipay, WeChat Pay, bank card, etc.).
10. **Confirm Order and Pay**: - After confirming that the order information is correct, tap the "Pay" button to complete the payment.
11. **Receive Your E-ticket**: - After successful payment, you will receive an e-ticket, which you can view or download from the "My Orders" section within the app.
By following these steps, you can easily purchase plane tickets using the Ctrip App.
User
Based on the chart, under which primer is the Δct difference between samples the smallest? (Image source: publicly available online, 舒桐科技公众号)
Ming-Lite-Omni
Based on the chart, the Δct difference between samples is smallest under the cGAS-qF+qR primer.
Ming-Lite-Omni V1.5
The difference in Δct between samples is smallest under the **EGFP-cGAS monoclonal** primer.

**Calculation Process:**
**1. Understand the meaning of Δct:** Δct is the difference in relative expression between the target gene (cGAS-qF+qR or EGFP-cGAS) and the reference gene (cTubulin). A smaller Δct indicates a smaller difference in expression between the target and reference genes.
**2. Compare Δct under different primers:**
- **cGAS-qF+qR primer:** - Wild type: Δct = 8.06 - EGFP-cGAS monoclonal: Δct = -0.15 - Difference: 8.06 - (-0.15) = 8.21
- **EGFP-cGAS monoclonal primer:** - Wild type: Δct = 10.44 - EGFP-cGAS monoclonal: Δct = -2.9 - Difference: 10.44 - (-2.9) = 13.34
- **EGFP-cGAS-qF+qR primer:** - Wild type: Δct = 21.43 - EGFP-cGAS monoclonal: Δct = -13.45 - Difference: 21.43 - (-13.45) = 34.88
**3. Conclusion:** Under the **EGFP-cGAS monoclonal** primer, the difference in Δct between samples is the smallest (Δct difference is 13.34), indicating the smallest difference in expression between samples under this primer.

**Summary:** The difference in Δct between samples is smallest under the **EGFP-cGAS monoclonal** primer.

Model Architecture Upgrade and Capability Evaluation

The Ming-lite-omni v1.5 model architecture is outlined below. The core design references the structure of Ming-lite-omni V1. However, a key distinction is the upgrade of the Vision head to support reference image feature input, specifically to enhance character and scene consistency in image editing.

Diagram of the Ming-lite-omni v1.5 model architecture.

The model's capabilities have been significantly optimized and upgraded across three key areas: enhanced Omni-modal comprehension, precise visual editing control, and improved user experience.

Enhanced Omni-Modal Comprehension

Thanks to optimized data quality, Ming-lite-omni v1.5 shows significant improvements in tasks such as vision-text comprehension (including image-text, document, and video understanding) and speech understanding. It has reached an industry-leading level for models of comparable scale.

Vision-text Comprehension

| Task Type | Dataset | Qwen2.5-VL-7B | Ming-lite-omni | Ming-lite-omni v1.5 |
| --- | --- | --- | --- | --- |
| Image-text Understanding | AI2D | 84.36 | 83.1 | 84.91 |
| | HallusionBench | 55.77 | 55.0 | 54.59 |
| | MMBench_TEST_V11 | 82.75 | 80.8 | 80.73 |
| | MMMU | 56.56 | 56.3 | 54.33 |
| | MMStar | 65.27 | 64.7 | 65.07 |
| | MMVet | 71.61 | 71.3 | 73.99 |
| | MathVista | 68.10 | 71.6 | 72.00 |
| | OCRBench | 87.80 | 88.4 | 88.90 |
| | Average | 71.5 | 71.4 | 71.8 |
| Video Understanding | VideoMME (w/o subs) | 65.10 | 63.4 | 67.07 |
| | VideoMME (w/ subs) | 71.60 | 66.01 | 72.59 |
| | VideoMME (avg) | 68.35 | 67.7 | 69.83 |
| | MVBench | 69.60 | 67.7 | 69.43 |
| | LongVideoBench | 56.00 | 56.6 | 59.54 |
| | OvOBench | 51.10 | 48.48 | 52.17 |
| | Average | 61.26 | 58.89 | 62.74 |
| Document Understanding | ChartQA_test | 87.24 | 85.1 | 88.84 |
| | DocVQA_test | 95.57 | 93 | 93.68 |
| | TextVQA_val | 85.06 | 82.8 | 82.27 |
| | OCRBench | 87.8 | 88.4 | 88.9 |
| | Average | 88.91 | 87.32 | 88.42 |

Speech Understanding

| Model | Average (Open-ended QA) | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ming-lite-omni v1.5 | 4.474 | 4.648 | 4.3 | 61.164 | 45.777 | 65.934 | 55.599 | 98.076 |
| Ming-lite-omni | 4.34 | 4.63 | 4.06 | 58.84 | 47.53 | 61.98 | 58.36 | 99.04 |
| MiniCPM-o | 4.285 | 4.42 | 4.15 | 50.72 | 54.78 | 78.02 | 49.25 | 97.69 |
| Kimi-Audio | 4.215 | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.21 | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| GLM-4-Voice | 3.77 | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |

Precise Visual Editing Control

Ming-lite-omni v1.5 employs the following optimization strategies to address the issues of character ID and scene ID consistency during image editing:

  1. ID and Scene Consistency Loss: This is achieved by increasing the weight of the edited region in the target image and the reference strength of the non-edited region in the reference image, while simultaneously decreasing the reference strength of the edited region in the reference image. This approach enhances image editing consistency (see the sketch after this list).
  2. Incorporating Generative Detection and Segmentation Tasks to Boost Perceptual Capabilities: By supporting generative segmentation and keypoint detection, the model's understanding of image details and spatial relationships is improved. This enhances the structural controllability of the editing and generation processes, leading to significant increases in evaluation metrics related to position, structure, and quantity.
  3. Multi-Task Collaborative Learning Strategy: Through a joint training pipeline, generation and editing mutually reinforce each other. Segmentation tasks are transformed into colorization editing tasks, which significantly improves segmentation metrics and the precision and controllability of local image editing, resulting in smoother edges for edited regions.
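To make point 1 concrete, here is a minimal, hypothetical sketch of such a region-weighted consistency loss in PyTorch. The weight values and the mask convention are illustrative assumptions, not the exact loss used in Ming-lite-omni v1.5.

import torch
import torch.nn.functional as F

def consistency_weighted_loss(pred: torch.Tensor,
                              target: torch.Tensor,
                              reference: torch.Tensor,
                              edit_mask: torch.Tensor,
                              w_edit: float = 2.0,
                              w_ref_keep: float = 1.0,
                              w_ref_edit: float = 0.1) -> torch.Tensor:
    """Hypothetical region-weighted loss in the spirit of point 1.

    pred / target / reference: (B, C, H, W) images or latents.
    edit_mask: (B, 1, H, W); 1 inside the edited region, 0 outside.
    """
    per_pixel_target = F.mse_loss(pred, target, reduction="none")
    per_pixel_ref = F.mse_loss(pred, reference, reduction="none")

    # Up-weight reconstruction of the edited region in the target image.
    edit_term = (w_edit * edit_mask * per_pixel_target).mean()

    # Anchor the non-edited region to the reference image (scene/ID
    # consistency) while down-weighting the reference inside the edit.
    keep_weights = w_ref_keep * (1.0 - edit_mask) + w_ref_edit * edit_mask
    ref_term = (keep_weights * per_pixel_ref).mean()

    return edit_term + ref_term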

Based on these optimizations, Ming-lite-omni v1.5 shows a significant improvement in image editing capabilities, achieving a GenEval score of 0.87.

| Model | 1-Obj | 2-Obj | Counting | Colors | Position | Color Attr | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ming-lite-omni | 0.99 | 0.77 | 0.68 | 0.78 | 0.46 | 0.42 | 0.64 |
| Ming-lite-omni v1.5 | 0.99 | 0.93 | 0.86 | 0.87 | 0.90 | 0.66 | 0.87 |

Optimized User Experience

Thanks to the construction of high-quality alignment preference data, Ming-lite-omni v1.5 demonstrates an advantage over leading models in the correctness, relevance, formatting aesthetics, and fluency of its image-text Q&A. It achieved a win rate of 87.07% against Ming-lite-omni V1 on internal adversarial evaluation sets, indicating a markedly improved user experience.

| Evaluation Dimension | Qwen2.5-VL-7B | Ming-lite-omni V1.5 |
| --- | --- | --- |
| Relevance | 4.308 | 4.5 |
| Fluency | 4.765 | 4.91 |
| Richness of Content | 3.828 | 3.69 |
| Format Aesthetics | 4.727 | 4.8 |
| Correctness | 3.741 | 3.92 |
| Average Score | 4.274 | 4.365 |

Get Started with Ming-lite-omni v1.5

The model and code for Ming-lite-omni v1.5 are now open-source, and we invite everyone to try it out, share feedback, and join the discussion. Looking ahead, we're excited to announce that a quantized and accelerated version is on the way. This future release will not only further enhance omni-modal performance but also make the model even more lightweight, all while strengthening its multimodal reasoning and generation capabilities. Stay tuned for more updates!

M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning

· 6 min read
inclusionAI
Ant Group

📖 Technical Report | 🤗 Hugging Face| 🤖 ModelScope

Introduction

We introduce M2-Reasoning-7B, a model designed to excel in both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality data samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy with step-wise optimization to mitigate conflicts between data sources, and task-specific rewards that deliver tailored incentive signals. This combination of curated data and advanced training allows M2-Reasoning-7B to set a new state-of-the-art (SOTA) across 8 benchmarks, showcasing superior performance in both general and spatial reasoning domains.


Key Features

  • A High-quality Data Construction Pipeline: We design and implement a multi-stage data synthesis and curation pipeline that generates vast amounts of reasoning data.
  • A Dynamic Multi-Task Training Strategy: We propose a sophisticated training strategy that effectively handles data heterogeneity. It features step-wise dynamic optimization to mitigate conflicts between different data sources and a task-specific reward formulation to provide tailored incentive signals (see the sketch after this list).
  • Unified General and Spatial Reasoning Model: We propose M2-Reasoning-7B, an MLLM uniquely engineered for both abstract and spatial reasoning. Extensive evaluations on 8 distinct benchmarks demonstrate that, by leveraging our custom data and training pipelines, M2-Reasoning establishes new state-of-the-art (SOTA) results across both general and spatial reasoning domains.
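As a toy sketch of the second feature, the snippet below routes each rollout to a task-specific reward and shifts data-source sampling weights over training steps. The schedule and the exact-match rewards are placeholder assumptions, far simpler than the paper's actual formulation.

import random
from typing import Callable

# Placeholder task-specific rewards (exact match); the real formulations
# are richer and task-dependent.
TASK_REWARDS: dict[str, Callable[[str, str], float]] = {
    "general": lambda pred, gold: float(pred.strip() == gold.strip()),
    "spatial": lambda pred, gold: float(pred.strip() == gold.strip()),
}

def sampling_weights(step: int, total_steps: int) -> dict[str, float]:
    """Toy step-wise schedule: shift sampling mass between the two data
    sources over training instead of mixing them at a fixed ratio."""
    frac = step / max(total_steps, 1)
    return {"general": 1.0 - 0.4 * frac, "spatial": 0.6 + 0.4 * frac}

def sample_task(step: int, total_steps: int) -> str:
    """Pick which data source the next batch is drawn from."""
    weights = sampling_weights(step, total_steps)
    tasks = list(weights)
    return random.choices(tasks, weights=[weights[t] for t in tasks], k=1)[0]

def reward(task: str, pred: str, gold: str) -> float:
    """Route each rollout to its task-specific reward."""
    return TASK_REWARDS[task](pred, gold)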

Evaluation

We conduct a comprehensive evaluation of our models across two key domains: general and spatial reasoning. Our evaluation utilizes a diverse set of public benchmarks, grouped by the primary capability they measure:

  • General Reasoning (Mathematical & Logical): To evaluate this capability, we employ six benchmarks: MathVista, MathVision, MathVerse, DynaMath, WeMath, and LogicVista.
| Models | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Avg. (Δ) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *Base-Scale General Models* | | | | | | | |
| InternVL3-8B | 70.5 | 30.0 | 38.5 | 25.7 | 39.5 | 44.5 | 41.4 |
| InternVL3-9B | 69.0 | 29.3 | 37.9 | 25.1 | 34.8 | 49.0 | 40.8 |
| Qwen2.5-VL-7B | 68.1 | 25.4 | 41.1 | 21.8 | 36.2 | 47.9 | 40.1 |
| MUG-U-7B | 74.8 | 26.1 | 35.4 | 17.2 | 26.5 | 39.8 | 36.6 |
| SAIL-VL-1.6-8B | 74.2 | 23.2 | 33.4 | 14.0 | 29.6 | 41.4 | 36.0 |
| *Base-Scale Reasoning Models* | | | | | | | |
| WeThink-VL-7B | 71.6 | 26.0 | 44.2 | 24.8 | 48.0 | 51.2 | 44.3 (+4.2) |
| Taichu-VLR-7B | 72.3 | 27.1 | 46.7 | 23.0 | 44.0 | 48.3 | 43.6 |
| VLAA-Thinker-7B | 68.0 | 26.4 | 48.2 | 22.4 | 41.5 | 48.5 | 42.5 (+2.4) |
| URSA-8B-PS-GRPO | 67.8 | 31.8 | 41.5 | 22.4 | 38.3 | 44.7 | 41.1 (+8.2) |
| Ovis2-8B | 71.8 | 25.9 | 42.3 | 20.4 | 27.2 | 39.4 | 37.8 |
| *Our Models* | | | | | | | |
| Base Model | 70.2 | 25.9 | 30.5 | 20.2 | 27.2 | 37.8 | 35.5 |
| M2-Reasoning-CI-7B | 71.7 | 29.2 | 42.1 | 25.0 | 42.8 | 46.8 | 42.9 (+7.4) |
| M2-Reasoning-7B | 75.0 | 31.5 | 44.7 | 26.8 | 41.8 | 50.0 | 45.0 (+9.5) |
  • Spatial Reasoning: We assess this skill using 2 benchmarks: CV-Bench and VSI-Bench.

    • CV-Bench:

| Models | Count | Relation | Depth | Distance | Avg. |
| --- | --- | --- | --- | --- | --- |
| *Large-Scale Models* | | | | | |
| GPT-4O | 65.9 | 85.7 | 87.8 | 78.2 | 78.9 |
| Gemini-1.5-pro | 70.4 | 85.2 | 82.4 | 72.8 | 77.4 |
| *Base-Scale Models* | | | | | |
| InternVL3-8B | 74.0 | 90.6 | 84.3 | 81.0 | 82.0 |
| Qwen2.5-VL-7B-Instruct | 65.2 | 86.6 | 70.6 | 79.8 | 75.0 |
| LLava-NEXT-Video-7B | 59.3 | 77.0 | 71.3 | 54.7 | 65.2 |
| *Our Models* | | | | | |
| M2-Reasoning-7B | 66.6 | 92.8 | 89.3 | 84.3 | 82.3 |

    • VSI-Bench:

| Models | OC | AD | OS | RS | RDs | RDr | RP | AO | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Large-Scale Models* | | | | | | | | | |
| Gemini-1.5-pro | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 | 45.4 |
| GPT-4O | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 | 34.0 |
| *Base-Scale Models* | | | | | | | | | |
| InternVL3-8B | 68.1 | 39.0 | 48.4 | 33.6 | 48.3 | 36.4 | 27.3 | 35.4 | 42.1 |
| Video-R1-7B | - | - | - | - | - | - | - | - | 37.1 |
| Qwen2.5-VL-7B-Instruct | 37.7 | 20.1 | 49.7 | 37.4 | 38.5 | 40.4 | 31.4 | 32.0 | 35.9 |
| LLava-NeXT-Video-7B | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | 34.0 | 30.6 | 35.6 |
| *Our Models* | | | | | | | | | |
| M2-Reasoning-7B | 41.0 | 34.0 | 60.9 | 55.4 | 40.7 | 47.3 | 29.9 | 28.8 | 42.3 |

Model Downloads

You can download the model from both Hugging Face and ModelScope.

If you're in mainland China, we strongly recommend downloading the model from ModelScope.
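For a quick start, the snippet below shows one way to fetch the weights programmatically. It uses the standard huggingface_hub and modelscope download APIs rather than anything specific to this repo; the repo id matches the default used in the example script below.

# Download the M2-Reasoning weights to a local directory.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="inclusionAI/M2-Reasoning")
print(local_dir)

# From mainland China, the ModelScope mirror is usually faster:
# from modelscope import snapshot_download
# local_dir = snapshot_download("inclusionAI/M2-Reasoning")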

Example Usage

The basic environment is python=3.10, torch=2.6.0+cu124, and transformers=4.49.0.

We provide a small example of how to use this repo.

import os
import torch

from transformers import (
    AutoProcessor,
    AutoTokenizer,
)

import warnings
import argparse
from modeling_bailing_qwen2_5 import Bailing_qwen2_5NativeForConditionalGeneration
from processing_bailing_qwen2_5 import Bailing_qwen2_5Processor

warnings.filterwarnings("ignore")


class BailingMMInfer:
    def __init__(self,
                 model_name_or_path,
                 device="cuda",
                 max_pixels=None,
                 min_pixels=None,
                 video_max_pixels=768 * 28 * 28,
                 video_min_pixels=128 * 28 * 28,
                 generation_config=None):
        super().__init__()
        self.model_name_or_path = model_name_or_path
        self.device = device
        self.device_map = device

        self.video_max_pixels = video_max_pixels if video_max_pixels is not None else 768 * 28 * 28
        self.video_min_pixels = video_min_pixels if video_min_pixels is not None else 128 * 28 * 28

        self.model, self.tokenizer, self.processor = self.load_model_processor()
        if max_pixels is not None:
            self.processor.max_pixels = max_pixels
        if min_pixels is not None:
            self.processor.min_pixels = min_pixels
        if generation_config is None:
            generation_config = {
                "num_beams": 1,
                "do_sample": True,
                "temperature": 0.9,
            }
        self.generation_config = generation_config

    def load_model_processor(self):
        # Load the model in bfloat16 with FlashAttention 2 for efficient inference.
        model = Bailing_qwen2_5NativeForConditionalGeneration.from_pretrained(
            self.model_name_or_path,
            torch_dtype=torch.bfloat16,
            device_map=self.device_map,
            _attn_implementation="flash_attention_2",
        ).eval()

        tokenizer = AutoTokenizer.from_pretrained(self.model_name_or_path, add_bos_token=True, trust_remote_code=True)
        processor = Bailing_qwen2_5Processor.from_pretrained(self.model_name_or_path, trust_remote_code=True)

        return model, tokenizer, processor

    def generate(self, messages, max_new_tokens=512):
        # Render the chat template and extract image/video inputs from the messages.
        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True, use_system=True
        )
        image_inputs, video_inputs = self.processor.process_vision_info(messages)

        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
        )
        # print(inputs)
        print(self.tokenizer.decode(inputs['input_ids'][0]))

        inputs = inputs.to(self.device)

        # Vision tensors must match the model's bfloat16 weights.
        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)

        with torch.no_grad():
            generated_ids = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                eos_token_id=self.processor.tokenizer.eos_token_id,
                **self.generation_config,
            )

        # Strip the prompt tokens so only newly generated tokens remain.
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]

        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
        )[0]

        return output_text


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', type=str, default="inclusionAI/M2-Reasoning")
    parser.add_argument('--max_pixels', type=int, default=401408)
    parser.add_argument('--min_pixels', type=int, default=401408)
    parser.add_argument('--max_new_tokens', type=int, default=4096)

    args = parser.parse_args()

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # model_name_or_path = os.path.join(args.input_dir, args.model_name_or_path)
    bailing2 = BailingMMInfer(
        args.model_name_or_path,
        device=device,
        max_pixels=args.max_pixels,
        min_pixels=args.min_pixels,
    )

    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <think>...</think> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "./assets/example1.png"},
                {"type": "text", "text": "\nQuestion:\n\nRhombus $QRST$ has an area of 137.9 square meters. If $RT$ is 12.2 meters, find $QS$.\nA. 11.3\nB. 22.4\nC. 22.6\nD. 25.6"},
            ],
        },
    ]
    output_text = bailing2.generate(messages, max_new_tokens=args.max_new_tokens)
    print(output_text)



'''
[Output]:

<think>
To find the length of \( QS \) in the rhombus \( QRST \), we can use the formula for the area of a rhombus, which is given by:

\[
\text{Area} = \frac{1}{2} \times d_1 \times d_2
\]

where \( d_1 \) and \( d_2 \) are the lengths of the diagonals. In this problem, we are given:
- The area of the rhombus is 137.9 square meters.
- One of the diagonals,

ABench: An Evolving Open-Source Benchmark

· 2 min read
inclusionAI
Ant Group
GITHUB

🌟 Overview

ABench is an evolving open-source benchmark suite designed to rigorously evaluate and enhance Large Language Models (LLMs) on complex cross-domain tasks. By targeting current model weaknesses, ABench provides systematic challenges in high-difficulty specialized domains, including physics, actuarial science, logical reasoning, law, and psychology.

🎯 Core Objectives

  1. Address Evaluation Gaps: Design high-differentiation assessment tasks targeting underperforming question types
  2. Establish Unified Standards: Create reliable, comparable benchmarks for multi-domain LLM evaluation
  3. Expand Capability Boundaries: Drive continuous optimization of knowledge systems and reasoning mechanisms through challenging innovative problems

📊 Dataset Release Status

| Domain | Description | Status |
| --- | --- | --- |
| Physics | 500 university/competition-level physics problems (400 static + 100 dynamic parametric variants) covering 10+ fields from classical mechanics to modern physics | ✅ Released |
| Actuary | Curated actuarial exam problems covering core topics: probability statistics, financial mathematics, life/non-life insurance, actuarial models, and risk management | ✅ Released |
| Logic | High-differentiation logical reasoning problems from authoritative tests (LSAT/GMAT/GRE/SBI/Chinese Civil Service Exam) | 🔄 In Preparation |
| Psychology | Psychological case studies and research questions (objective/subjective) evaluating understanding of human behavior and theories | 🔄 In Preparation |
| Law | Authoritative judicial exam materials covering core legal domains: criminal/civil/administrative/procedural/international law | 🔄 In Preparation |
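To illustrate what a "dynamic parametric variant" can look like in practice, here is a minimal sketch (our assumption, not ABench's actual generator): resample a static problem's numeric parameters, then recompute the ground-truth answer so memorized constants no longer help.

import random

def projectile_variant(seed: int) -> dict:
    """Resample a static problem's parameters and recompute the answer."""
    rng = random.Random(seed)
    v0 = rng.uniform(5.0, 50.0)  # initial speed in m/s
    g = 9.8                      # gravitational acceleration in m/s^2
    question = (f"A ball is thrown straight up at {v0:.1f} m/s. "
                f"Ignoring air resistance, what maximum height does it reach?")
    answer_m = v0 ** 2 / (2 * g)  # from v^2 = v0^2 - 2*g*h at the apex
    return {"question": question, "answer_m": round(answer_m, 2)}

print(projectile_variant(0))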

AWorld: The Agent Runtime for Self-Improvement

· 8 min read
inclusionAI
Ant Group

"Self-awareness: the hardest problem isn't solving within limits, it's discovering the own limitations" Twitter Follow WeChat QR Code Discord License: MIT DeepWiki

Table of Contents

  • News — Latest updates and announcements.
  • Introduction — Overview and purpose of the project.
  • Installation — Step-by-step setup instructions.
  • Quick Start — Get started with usage examples.
  • Architecture — Explore the multi-agent system design.
  • Demo — See the project in action with demonstrations.
  • Contributing — How to get involved and contribute.
  • License — Project licensing details.

News

  • 🦤 [2025/07/07] AWorld, as a runtime, is now ready for agentic training. See the Self-Improvement section for details. We have updated our score on the GAIA test to 77.08. Learn how to construct a GAIA runtime in the Demo section.
  • 🦩 [2025/06/19] We have updated our score to 72.43 on the GAIA test. Additionally, we have introduced a new local running mode. See ./README-local.md for detailed instructions.
  • 🐳 [2025/05/22] For quick GAIA evaluation, MCP tools, AWorld, and models are now available in a single Docker image. See ./README-docker.md for instructions and youtube video for demo.
  • 🥳 [2025/05/13] AWorld has updated its state management for browser use and enhanced the video processing MCP server, achieving a score of 77.58 on GAIA validation (Pass@1 = 61.8) and maintaining its position as the top-ranked open-source framework. Learn more: GAIA leaderboard
  • ✨ [2025/04/23] AWorld ranks 3rd on GAIA benchmark (69.7 avg) with impressive Pass@1 = 58.8, 1st among open-source frameworks. Reproduce with python examples/gaia/run.py

Introduction

AWorld (Agent World) is a multi-agent playground that enables agents to collaborate and self-improve. The framework supports a wide range of applications, including but not limited to product prototype verification, foundation model training, and Multi-Agent System (MAS) design meta-learning.

Runtime Key Features

1. Agent Construction
  • ✅ Support for various model services
  • ✅ Integration with MCP tools
  • ✅ Custom tool support
2. Topology Orchestration
  • ✅ Protocol encapsulation between models and tools
  • ✅ Protocol encapsulation among agents
3. Environments
  • ✅ Runtime state management
  • ✅ State tracing support
  • ✅ Distributed, high-concurrency environments for training

Self-Improvement with Diverse Runtimes

By constructing diverse runtime environments (with tools, agents, or models in them), AWorld aims to discover a model's limitations and push intelligence forward. Here we record some of our work demonstrating the effectiveness of this approach.

| Category | Runtime | Performance | Key Information |
| --- | --- | --- | --- |
| Tool Use | Function call runtime (to be released) | Competitive on the BFCL benchmark | Agent Framework, Dataset, Model, Paper, Blog, Code |
| Deep Search | Search runtime (to be released) | SOTA on the HotpotQA benchmark | Agent Framework, Dataset, Model, Paper, Code |

Demo of GAIA Agent-Runtime

GAIA Agent Runtime Demo

Here we first introduce the GAIA runtime, which can be constructed on your local computer. It can be used for:

  • Product prototype verification
  • Self-improvement training (See training pipeline for details)

Follow the instructions in ./examples/gaia/README.md to initialize the GAIA agent runtime and run the demo shown above.

Want to build your own multi-agent system? Check out the detailed tutorials below to get started! ⬇️⬇️⬇️

Installation

Python>=3.11:

git clone https://github.com/inclusionAI/AWorld
cd AWorld
python setup.py install

Quick Start

Here's a quick start guide to: (1) create your first agent; (2) equip it with a MCP tool; (3) assign a teammate; and (4) answer a user query through teamwork.

from aworld.config.conf import AgentConfig
from aworld.agents.llm_agent import Agent
from aworld.runner import Runners
from aworld.core.agent.swarm import Swarm

if __name__ == '__main__':
    agent_config = AgentConfig(
        llm_provider="openai",
        llm_model_name="gpt-4o",
        # Set via environment variable or direct configuration
        # llm_api_key="YOUR_API_KEY",
        # llm_base_url="https://api.openai.com/v1"
    )

    # Register the MCP tool here, or create a separate configuration file.
    mcp_config = {
        "mcpServers": {
            "amap-amap-sse": {
                "type": "sse",
                "url": "https://mcp.amap.com/sse?key=YOUR_API_KEY",
                "timeout": 5,
                "sse_read_timeout": 300,
            }
        }
    }

    # Create your first agent, equipped with an MCP tool
    search = Agent(
        conf=agent_config,
        name="search_agent",
        system_prompt="You are a helpful agent.",
        mcp_servers=["amap-amap-sse"],  # MCP server name for the agent to use
        mcp_config=mcp_config,
    )

    # Add a new teammate to the agent
    summary = Agent(
        conf=agent_config,
        name="summary_agent",
        system_prompt="You are a helpful summary agent.",
    )

    # Collaborate as a team; the default topology is a static workflow
    swarm = Swarm(search, summary)

    # Run the agent team
    res = Runners.sync_run(
        input="Hotels within 1 kilometer of West Lake in Hangzhou",
        swarm=swarm,
    )
    print(res)

Architecture

AWorld is designed to achieve two primary objectives: (1) provide an efficient forward process, and (2) facilitate diverse backward processes, including but not limited to foundation model training and system design meta-learning.

Forward

An illustration of the runtime, showing the message workflow when Agent1 receives a query from a user.

Backward

An action-state rollout demonstration during training, using AWorld's distributed environments.

Demo

Running Pre-defined Agents (e.g., see demo code). Below are demonstration videos showcasing AWorld's capabilities across various agent configurations and environments.

| Mode | Type | Demo |
| --- | --- | --- |
| Single Agent | Browser use | ▶️ Watch Browser Demo on YouTube |
| Single Agent | Phone use | ▶️ Watch Mobile Demo on YouTube |
| Multi Agent | Cooperative Teams | ▶️ Watch Travel Demo on YouTube |
| Multi Agent | Competitive Teams | ▶️ Watch Debate Arena on YouTube |
| Multi Agent | Mixed of both Teams | Coming Soon 🚀 |

Contributing

We warmly welcome developers to join us in building and improving AWorld! Whether you're interested in enhancing the framework, fixing bugs, or adding new features, your contributions are valuable to us.

For academic citations, or if you wish to contact us, please use the following BibTeX entry:

@software{aworld2025,
  author = {Agent Team at InclusionAI},
  title = {AWorld: Enabling Agent Self-Improvement through Interactive Experience with Dynamic Runtime},
  year = {2025},
  url = {https://github.com/inclusionAI/AWorld},
  version = {0.1.0},
  publisher = {GitHub},
  email = {chenyi.zcy at antgroup.com}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
