GPQA-Diamond

Emerging

6 papers using it

2025 first seen

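GPQA-Diamond is the hardest subset of the GPQA benchmark: 198 graduate-level, multiple-choice science questions (biology, physics, chemistry) written and validated by domain experts. In the papers indexed here, it serves as a question source for multi-agent debate and collaborative reasoning, for example to study how language models converge on an answer and to distinguish genuine deliberation from social compliance.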


Papers using GPQA-Diamond (4)

Not All Flips Are Conformity: Decomposing Stance Convergence in Multi-Agent LLM Debate (2026)

ATLAS: Agentic Test-time Learning-to-Allocate Scaling (2026)

Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems (2026)

Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning (2025)