This organization hosts a series of open-source projects from Ant Group dedicated to working towards Artificial General Intelligence (AGI).

Introducing Ming-Lite-Omni V1.5

GITHUB | 🤗 Hugging Face | 🤖 ModelScope

Overview: Ming-lite-omni v1.5 is a comprehensive upgrade to the full-modal capabilities of Ming-lite-omni (GitHub). It significantly improves performance across tasks including image-text understanding, document understanding, video understanding, speech understanding and synthesis, and image generation and editing. Built upon Ling-lite-1.5, Ming-lite-omni v1.5 has 20.3 billion total parameters, with 3 billion active parameters in its MoE (Mixture-of-Experts) section. It demonstrates highly competitive results across benchmarks in various modalities compared to industry-leading models. ...

2277 words

M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning

📖 Technical Report | 🤗 Hugging Face | 🤖 ModelScope

Introduction: We introduce M2-Reasoning-7B, a model designed to excel in both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality data samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy with step-wise optimization to mitigate conflicts between data sources, plus task-specific rewards that deliver tailored incentive signals. This combination of curated data and advanced training allows M2-Reasoning-7B to set a new state of the art (SOTA) across 8 benchmarks, showcasing superior performance in both general and spatial reasoning domains. ...
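RLVR (reinforcement learning with verifiable rewards) depends on rewards that can be checked programmatically rather than scored by a judge model. As a toy illustration only (the function name and answer-matching rule below are hypothetical, not M2-Reasoning's actual reward implementation), a binary verifiable reward for numeric answers might look like:

```python
import re

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Binary RLVR-style reward: 1.0 if the last number in the model's
    response matches the ground-truth answer string, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

print(verifiable_reward("The area is 12, so the answer is 24", "24"))  # 1.0
print(verifiable_reward("I think it's 25", "24"))                      # 0.0
```

Because the reward is a deterministic check rather than a learned model, it cannot be gamed by stylistic tricks, which is what makes large-scale RLVR data (126.2K samples here) practical to verify.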

1052 words

ABench: An Evolving Open-Source Benchmark

GITHUB

🌟 Overview: ABench is an evolving open-source benchmark suite designed to rigorously evaluate and enhance Large Language Models (LLMs) on complex cross-domain tasks. By targeting current model weaknesses, ABench provides systematic challenges in high-difficulty specialized domains, including physics, actuarial science, logical reasoning, law, and psychology.

🎯 Core Objectives
- Address Evaluation Gaps: Design high-differentiation assessment tasks targeting underperforming question types
- Establish Unified Standards: Create reliable, comparable benchmarks for multi-domain LLM evaluation
- Expand Capability Boundaries: Drive continuous optimization of knowledge systems and reasoning mechanisms through challenging, innovative problems

📊 Dataset Release Status

| Domain | Description | Status |
| --- | --- | --- |
| Physics | 500 university/competition-level physics problems (400 static + 100 dynamic parametric variants) covering 10+ fields from classical mechanics to modern physics | ✅ Released |
| Actuary | Curated actuarial exam problems covering core topics: probability and statistics, financial mathematics, life/non-life insurance, actuarial models, and risk management | ✅ Released |
| Logic | High-differentiation logical reasoning problems from authoritative tests (LSAT/GMAT/GRE/SBI/Chinese Civil Service Exam) | 🔄 In Preparation |
| Psychology | Psychological case studies and research questions (objective/subjective) evaluating understanding of human behavior and theories | 🔄 In Preparation |
| Law | Authoritative judicial exam materials covering core legal domains: criminal/civil/administrative/procedural/international law | 🔄 In Preparation |

185 words

AWorld: The Agent Runtime for Self-Improvement

“Self-awareness: the hardest problem isn’t solving within limits, it’s discovering one’s own limitations.”

Table of Contents
- News — Latest updates and announcements.
- Introduction — Overview and purpose of the project.
- Installation — Step-by-step setup instructions.
- Quick Start — Get started with usage examples.
- Architecture — Explore the multi-agent system design.
- Demo — See the project in action with demonstrations.
- Contributing — How to get involved and contribute.
- License — Project licensing details.

News
- 🦤 [2025/07/07] AWorld, as a runtime, is now ready for agentic training. See the Self-Improvement section for details. We have updated our score to 77.08 on the GAIA test set. Learn how to construct a GAIA runtime in the Demo section.
- 🦩 [2025/06/19] We have updated our score to 72.43 on the GAIA test set. Additionally, we have introduced a new local running mode. See ./README-local.md for detailed instructions.
- 🐳 [2025/05/22] For quick GAIA evaluation, MCP tools, AWorld, and models are now available in a single Docker image. See ./README-docker.md for instructions and the YouTube video for a demo.
- 🥳 [2025/05/13] AWorld has updated its state management for browser use and enhanced the video-processing MCP server, achieving a score of 77.58 on the GAIA validation set (Pass@1 = 61.8) and maintaining its position as the top-ranked open-source framework. Learn more: GAIA leaderboard
- ✨ [2025/04/23] AWorld ranks 3rd on the GAIA benchmark (69.7 avg) with an impressive Pass@1 = 58.8, 1st among open-source frameworks. Reproduce with `python examples/gaia/run.py`

Introduction
AWorld (Agent World) is a multi-agent playground that enables agents to collaborate and self-improve. The framework supports a wide range of applications, including but not limited to product prototype verification, foundation model training, and Multi-Agent System (MAS) design meta-learning. ...
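The Pass@1 figures above measure first-attempt success on GAIA tasks. When an evaluation runs n attempts per task, the standard unbiased Pass@k estimator (popularized by the HumanEval benchmark, not AWorld-specific code) can be computed as follows; the function name and example numbers are illustrative:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c succeeded,
    is a success."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so a hit is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. a task solved on 2 of 3 attempts
print(f"pass@1 = {pass_at_k(3, 2, 1):.3f}")  # pass@1 = 0.667
```

Averaging this estimator over all tasks gives the benchmark-level Pass@k score.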

895 words

Ming-Omni: A Unified Multimodal Model for Perception and Generation

GITHUB | 📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope

Introduction: Ming-lite-omni is a light version of Ming-omni, derived from Ling-lite and featuring 2.8 billion activated parameters. It is a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-lite-omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-lite-omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to engage in context-aware chat, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results show that Ming-lite-omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-lite-omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community. ...
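The modality-specific routing idea can be sketched as follows: each modality gets its own lightweight router that scores every token against a shared pool of experts, and the token is dispatched to its top-k experts. This is a minimal NumPy sketch under assumed shapes and random weights, not Ming-lite-omni's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# One learned routing matrix per modality (hypothetical shapes).
routers = {m: rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)
           for m in ["text", "image", "audio", "video"]}

def route(tokens: np.ndarray, modality: str) -> np.ndarray:
    """Score tokens with the modality's own router; return each token's
    top-k expert indices."""
    logits = tokens @ routers[modality]             # (n_tokens, n_experts)
    return np.argsort(logits, axis=-1)[:, -top_k:]  # expert ids per token

tokens = rng.standard_normal((5, d_model))  # 5 tokens of one modality
print(route(tokens, "image").shape)         # (5, 2)
```

Because only the small routing matrices differ per modality while the experts are shared, tokens from all modalities can be fused in one forward pass instead of requiring separate per-modality models.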

1379 words