Which AI Model is Best at Solving Technical Problems? My 5-Question Accuracy Test

This experiment evaluated the numerical accuracy of nine AI models on a set of five technical problems. Each question was asked five times, using identical wording, to measure consistency as well as accuracy. The questions are similar to those an environmental engineer would face on the Professional Engineering (PE) exam. The models tested were: Astral (independent developer), Claude Sonnet 4 (Anthropic), Grok 3 (xAI), Gemini 2.5 Pro (Google), Gemini 2.5 Flash (Google), Llama 4 Maverick (Meta), GPT-4o (OpenAI), o3 (OpenAI), and o3-pro (OpenAI).

Here are the five questions I asked each model:

  1. A municipal landfill uses a compacted 1.08-m-thick clay liner that has a hydraulic conductivity of 1 × 10⁻⁷ cm/s. If the depth of the leachate above the clay liner is 30 cm and the porosity of the clay is 55%, what is the time (years) required for the leachate to migrate through the liner?
  2. A 40-ft-thick confined aquifer has a piezometric surface 85 ft above the bottom-confining layer. Groundwater is being extracted from a 4-in.-diameter fully penetrating well. The pumping rate is 35 gpm. The aquifer is relatively sandy with a hydraulic conductivity of 175 gpd/ft². Steady-state drawdown of 5 ft is observed in a monitoring well 10 ft from the pumping well. What is the drawdown (ft) in the pumping well?
  3. A radiation monitor reads 100 mR/hr at a distance of 6 ft from the geometric center of a 2-ft diameter drum of radioactive waste. What is the expected dose rate (mR/hr) at the surface of the drum?
  4. A subsurface remedial treatment technology costs $245,000 to construct initially with annual operation and maintenance costs of $9,000 for a 5-year operational life. Using an annual interest rate of 6% and no equipment salvage value, what is the annualized cost for the remedial treatment technology?
  5. The following information applies to the reaction of methane: CH₄ + 2O₂ → CO₂ + 2H₂O.

| Species | Enthalpy of Formation (J/mol) |
|---------|-------------------------------|
| CH₄     | -74,980                       |
| O₂      | 0                             |
| CO₂     | -394,088                      |
| H₂O     | -242,174                      |

The total heat of reaction (J/mol) is most nearly:
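For readers who want to check the arithmetic, here is a minimal Python sketch of one textbook approach to each question. The choice of method is an assumption in every case, and these are not necessarily the reference solutions used for grading in this test: Darcy seepage velocity with a unit-gradient-plus-ponding head for question 1, the Thiem steady-state equation for question 2, a point source with inverse-square falloff for question 3, the capital recovery factor for question 4, and a products-minus-reactants enthalpy sum for question 5. All function names are hypothetical.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def liner_breakthrough_years(thickness_cm, head_cm, k_cm_s, porosity):
    """Q1 (assumed method): travel time at the Darcy seepage velocity."""
    gradient = (head_cm + thickness_cm) / thickness_cm
    seepage_velocity = k_cm_s * gradient / porosity  # cm/s
    return thickness_cm / seepage_velocity / SECONDS_PER_YEAR


def pumping_well_drawdown_ft(q_gpm, k_gpd_ft2, b_ft, r_obs_ft, s_obs_ft, r_well_ft):
    """Q2 (assumed method): Thiem equation for a confined aquifer."""
    q_gpd = q_gpm * 1440  # gallons per minute -> gallons per day
    transmissivity = k_gpd_ft2 * b_ft  # gpd/ft
    extra = q_gpd / (2 * math.pi * transmissivity) * math.log(r_obs_ft / r_well_ft)
    return s_obs_ft + extra


def surface_dose_rate(reading, distance_ft, surface_radius_ft):
    """Q3 (assumed method): point source at the drum center, inverse-square law."""
    return reading * (distance_ft / surface_radius_ft) ** 2


def annualized_cost(capital, annual_om, rate, years):
    """Q4: capital spread with the capital recovery factor (A/P), plus annual O&M."""
    growth = (1 + rate) ** years
    capital_recovery_factor = rate * growth / (growth - 1)
    return capital * capital_recovery_factor + annual_om


def heat_of_reaction(products, reactants):
    """Q5: sum of n * Hf over products minus the same sum over reactants."""
    return sum(n * hf for n, hf in products) - sum(n * hf for n, hf in reactants)


if __name__ == "__main__":
    print(liner_breakthrough_years(108, 30, 1e-7, 0.55))
    print(pumping_well_drawdown_ft(35, 175, 40, 10, 5, 2 / 12))
    print(surface_dose_rate(100, 6, 1))
    print(annualized_cost(245_000, 9_000, 0.06, 5))
    print(heat_of_reaction([(1, -394_088), (2, -242_174)], [(1, -74_980), (2, 0)]))
```

Each function takes the quantities from the question as arguments, so changing an assumption (for example, modeling the drum as something other than a point source) only requires swapping one function.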

Evaluation Method and Results

Each model’s numerical answers were compared to the correct values. A response was considered correct if it was within ±1% of the true answer. Accuracy was calculated as the percentage of the 25 total responses (5 questions × 5 trials) that met this criterion.
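The scoring rule above can be sketched in a few lines of Python. This is an illustration only, not the script used for the test: the function names are hypothetical, and the sample answers in the usage example are made up to show the mechanics.

```python
def is_correct(answer, true_value, rel_tol=0.01):
    """Return True if the answer is within ±1% (by default) of the true value."""
    return abs(answer - true_value) <= rel_tol * abs(true_value)


def accuracy_percent(responses, true_values, rel_tol=0.01):
    """Percentage of all responses that fall within tolerance.

    responses:   dict mapping question id -> list of numeric answers (one per trial)
    true_values: dict mapping question id -> correct numeric value
    """
    total = sum(len(answers) for answers in responses.values())
    hits = sum(
        is_correct(answer, true_values[question], rel_tol)
        for question, answers in responses.items()
        for answer in answers
    )
    return 100 * hits / total


if __name__ == "__main__":
    # Made-up example: two questions, five trials each (not data from this test).
    true_values = {"q1": 100.0, "q2": -50.0}
    responses = {
        "q1": [100.0, 100.5, 99.6, 103.0, 90.0],
        "q2": [-50.0, -50.2, -49.9, -55.0, 50.0],
    }
    print(accuracy_percent(responses, true_values))
```

Using the absolute value of the true answer in the tolerance keeps the check meaningful for negative quantities such as a heat of reaction, and a sign error is still counted as wrong.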

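| Rank | Model            | Accuracy (%) |
|------|------------------|--------------|
| 1    | Gemini 2.5 Flash | 80           |
| 2    | Claude Sonnet 4  | 64           |
| 3    | Gemini 2.5 Pro   | 60           |
| 4    | o3               | 60           |
| 5    | o3-pro           | 40           |
| 6    | Grok 3           | 32           |
| 7    | Llama 4 Maverick | 20           |
| 8    | GPT-4o           | 4            |
| 9    | Astral           | 0            |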

Key Findings

  • Gemini 2.5 Flash outperformed all other models, providing correct answers 80% of the time. Its strong performance suggests high numerical precision and consistency despite being optimized for speed.
  • Claude Sonnet 4 (64%) and Gemini 2.5 Pro (60%) came next, answering most trials correctly but with less consistency than Gemini 2.5 Flash.
  • OpenAI’s o3 models showed mixed results, with the base version (60%) outperforming the Pro variant (40%) in this test.
  • GPT-4o (4%) and Astral (0%) were notably less accurate, indicating potential weaknesses in numerical computation for these specific technical problems.

Conclusion

This experiment showed that accuracy can vary widely between AI models when tackling engineering and scientific problems. Among the models tested, Gemini 2.5 Flash delivered the most consistent and accurate results, demonstrating that a model built for speed can still excel in technical problem-solving.

As someone who contributes to AI training in math and engineering, I’m fortunate to have free access to many different AI models, which means I can run experiments like this purely out of curiosity. It’s always interesting to see how each model approaches the same problem, and I look forward to exploring even more challenging scenarios in future tests.

Author: Mohamed Hersi, Licensed Environmental Engineer (P.E.)
