ABench: An Evolving Open-Source Benchmark

July 8, 2025 · 2 min read

Ant Group

🌟 Overview

ABench is an evolving open-source benchmark suite designed to rigorously evaluate and enhance Large Language Models (LLMs) on complex cross-domain tasks. By targeting current model weaknesses, ABench provides systematic challenges in high-difficulty specialized domains, including physics, actuarial science, logical reasoning, law, and psychology.

🎯 Core Objectives

Address Evaluation Gaps: Design high-differentiation assessment tasks targeting underperforming question types
Establish Unified Standards: Create reliable, comparable benchmarks for multi-domain LLM evaluation
Expand Capability Boundaries: Drive continuous optimization of knowledge systems and reasoning mechanisms through challenging, innovative problems

📊 Dataset Release Status

Domain	Description	Status
Physics	500 university/competition-level physics problems (400 static + 100 dynamic parametric variants) covering 10+ fields from classical mechanics to modern physics	✅ Released
Actuary	Curated actuarial exam problems covering core topics: probability and statistics, financial mathematics, life/non-life insurance, actuarial models, and risk management	✅ Released
Logic	High-differentiation logical reasoning problems from authoritative tests (LSAT/GMAT/GRE/SBI/Chinese Civil Service Exam)	🔄 In Preparation
Psychology	Psychological case studies and research questions (objective/subjective) evaluating understanding of human behavior and theories	🔄 In Preparation
Law	Authoritative judicial exam materials covering core legal domains: criminal/civil/administrative/procedural/international law	🔄 In Preparation
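
The Physics row distinguishes static problems from "dynamic parametric variants". The source does not describe how those variants are built, so the following is only a minimal sketch of the general idea, assuming a variant is a problem template whose numeric parameters are resampled and whose reference answer is recomputed from them. The names `ProblemInstance`, `make_free_fall_variant`, and `is_correct`, the free-fall template, and the 1% tolerance are all hypothetical and are not taken from ABench.

```python
import math
import random
from dataclasses import dataclass

G = 9.8  # gravitational acceleration in m/s^2 (assumed constant for this template)


@dataclass
class ProblemInstance:
    height: int      # sampled parameter, in metres
    question: str    # rendered problem text
    answer: float    # reference answer, in seconds


def make_free_fall_variant(rng: random.Random) -> ProblemInstance:
    """Hypothetical template: fall time of an object dropped from rest."""
    height = rng.randint(5, 100)
    # From h = (1/2) * g * t^2, the fall time is t = sqrt(2h / g).
    answer = math.sqrt(2 * height / G)
    question = (
        f"An object is dropped from rest at a height of {height} m. "
        f"Ignoring air resistance and taking g = {G} m/s^2, "
        "how long does it take to reach the ground?"
    )
    return ProblemInstance(height, question, answer)


def is_correct(predicted: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Accept a numeric answer within a relative tolerance (assumed 1%)."""
    return math.isclose(predicted, reference, rel_tol=rel_tol)


if __name__ == "__main__":
    # A fixed seed makes the generated variant reproducible across runs.
    instance = make_free_fall_variant(random.Random(0))
    print(instance.question)
    print(is_correct(instance.answer, instance.answer))
```

Because the numbers change with each seed, a model cannot score well on such a variant by recalling a memorized answer to the static version of the problem, which is the usual motivation for parametric variants.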
