Key Points
- Google released Gemini 3.1 Pro in preview for developers and consumers.
- Gemini 3.1 Pro scores 44.4% on Humanity’s Last Exam, beating Gemini 3 and GPT 5.2.
- On ARC‑AGI‑2, the model reaches 77.1%, more than double its predecessor’s score.
- Gemini 3.1 Pro does not top the Arena leaderboard: Claude Opus 4.6 leads in text, while Opus 4.6, Opus 4.5, and GPT 5.2 High lead in code.
- Gemini 3.1 Pro powers the latest Deep Think enhancements.
- Google emphasizes improved reasoning and complex problem‑solving capabilities.
Model Overview
Google introduced Gemini 3.1 Pro as the next iteration of its Gemini series, rolling it out today in preview for developers and consumers. The company says the model delivers stronger problem‑solving and reasoning abilities than its predecessor, Gemini 3.
Benchmark Performance
On Humanity’s Last Exam, which measures advanced domain‑specific knowledge, Gemini 3.1 Pro achieved a record score of 44.4 percent, surpassing Gemini 3’s 37.5 percent and OpenAI’s GPT 5.2 at 34.5 percent. On the ARC‑AGI‑2 test, designed to assess novel logic challenges that cannot be directly trained for, Gemini 3.1 Pro more than doubled Google’s prior score, reaching 77.1 percent compared with Gemini 3’s 31.1 percent.
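To make the size of these gaps concrete, the short sketch below recomputes the percentage-point differences and the ARC‑AGI‑2 ratio from the scores quoted above. The variable and function names are illustrative assumptions, not part of any benchmark tooling; only the five scores come from the article.

```python
# Scores (in percent) exactly as quoted in the article.
HLE = {"Gemini 3.1 Pro": 44.4, "Gemini 3": 37.5, "GPT 5.2": 34.5}
ARC_AGI_2 = {"Gemini 3.1 Pro": 77.1, "Gemini 3": 31.1}


def point_gain(new: float, old: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(new - old, 1)


def ratio(new: float, old: float) -> float:
    """How many times larger the new score is, rounded to two decimals."""
    return round(new / old, 2)


# Humanity's Last Exam: lead over Gemini 3 and over GPT 5.2.
hle_vs_gemini3 = point_gain(HLE["Gemini 3.1 Pro"], HLE["Gemini 3"])
hle_vs_gpt52 = point_gain(HLE["Gemini 3.1 Pro"], HLE["GPT 5.2"])

# ARC-AGI-2: absolute gain and ratio against Gemini 3.
arc_gain = point_gain(ARC_AGI_2["Gemini 3.1 Pro"], ARC_AGI_2["Gemini 3"])
arc_ratio = ratio(ARC_AGI_2["Gemini 3.1 Pro"], ARC_AGI_2["Gemini 3"])

print(f"HLE lead over Gemini 3: {hle_vs_gemini3} points")
print(f"HLE lead over GPT 5.2: {hle_vs_gpt52} points")
print(f"ARC-AGI-2 gain: {arc_gain} points ({arc_ratio}x Gemini 3)")
```

Under these figures, the Humanity’s Last Exam lead is 6.9 points over Gemini 3 and 9.9 points over GPT 5.2, while the ARC‑AGI‑2 score is about 2.48 times Gemini 3’s, which is consistent with the "more than doubled" description.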
Competitive Landscape
Despite the gains, Gemini 3.1 Pro does not lead the public Arena leaderboard, which reflects user preference votes on model outputs. In the text category, Claude Opus 4.6 leads by four points, while for code tasks, Opus 4.6, Opus 4.5, and GPT 5.2 High maintain a modest edge over Gemini 3.1 Pro.
Deep Think Integration
The new model also powers the latest upgrades to Google’s Deep Think tool, indicating that Gemini 3.1 Pro serves as the underlying “core intelligence” for that feature.
Implications
Google’s announcement highlights a continued focus on refining large language models for higher‑order reasoning, even as competitive benchmarks show mixed results. The preview rollout gives developers early access to test the model’s capabilities in real‑world applications while the company gathers feedback ahead of broader deployment.
Source: arstechnica.com