The specifics:
The model reached gold-medal standard by solving five of the six IMO 2025 problems and scored 118/120 on the 2024 Putnam competition, surpassing the highest human score.
It earned 61.9% on IMO ProofBench, far ahead of GPT-5's roughly 20%, and nearly matched Google's customized Gemini Deep Think, which also achieved IMO gold.
Rather than rewarding final answers alone, Math-V2 employs a generator-verifier system in which one model proposes a proof and another evaluates it.
The verifier assigns a confidence score to each step, so the reasoning is debugged step by step and the generator is forced to strengthen weak logic.
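To make the loop concrete, here is a minimal Python sketch of what a generator-verifier cycle like this could look like. It is an illustration of the general pattern, not DeepSeek's actual implementation: the function names `generate_proof` and `verify_step`, the 0.9 confidence threshold, and the round budget are all assumptions made for clarity.

```python
# Illustrative generator-verifier loop (a sketch, not DeepSeek's code).
# generate_proof, verify_step, the 0.9 threshold, and max_rounds are assumptions.

def generate_proof(problem: str, feedback: list[str]) -> list[str]:
    """Hypothetical generator: returns a proof as a list of steps,
    revising anything flagged in earlier feedback."""
    raise NotImplementedError  # stands in for a call to the generator model


def verify_step(problem: str, proof: list[str], step_index: int) -> float:
    """Hypothetical verifier: returns a confidence score in [0, 1]
    for the step at step_index, given the steps before it."""
    raise NotImplementedError  # stands in for a call to the verifier model


def prove_with_verification(problem: str, max_rounds: int = 5,
                            threshold: float = 0.9) -> list[str]:
    """Generate a proof, score each step, and regenerate until every
    step clears the confidence threshold or the round budget runs out."""
    feedback: list[str] = []
    proof: list[str] = []
    for _ in range(max_rounds):
        proof = generate_proof(problem, feedback)
        scores = [verify_step(problem, proof, i) for i in range(len(proof))]
        weak = [i for i, s in enumerate(scores) if s < threshold]
        if not weak:
            return proof  # every step passed verification
        # Feed the low-confidence steps back to the generator for revision.
        feedback = [
            f"Step {i + 1} is weak (confidence {scores[i]:.2f}): {proof[i]}"
            for i in weak
        ]
    return proof  # best effort after exhausting the round budget
```

The key design point this sketch captures is that the reward signal comes from step-level confidence rather than only from whether the final answer matches, which is what pushes the generator to repair weak intermediate logic.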
DeepSeek has cracked the monopoly on frontier mathematical reasoning by open-sourcing a model that competes with Google's internal heavyweight, giving the community a template for building agents that can debug their own reasoning. In fields like engineering, where errors are expensive, that could be revolutionary.