TrialQA
10 runs · 5 models · evaluated by HybridEvaluator.
| # | Model | Variant | Mode | Score | Avg. duration | Tokens | Date |
|---|---|---|---|---|---|---|---|
| 1 | gpt-5-2-pro | tools,high | inject | 0.933 | 3.5m | 4.1M | 2026-02-03 |
| 2 | gpt-5-2 | tools,high | inject | 0.908 | 1.3m | 3.8M | 2026-02-03 |
| 3 | gemini-3-pro-preview | tools,high | inject | 0.800 | 1.3m | 39.5k | 2026-02-03 |
| 4 | claude-opus-4-6 | tools,high | inject | 0.492 | 1.6m | 43.8M | 2026-03-23 |
| 5 | claude-opus-4-5 | tools,high | inject | 0.475 | 1.0m | 27.3M | 2026-03-22 |
| 6 | claude-opus-4-6 | — | inject | 0.300 | 7.2s | 33.6k | 2026-03-20 |
| 7 | gpt-5-2-pro | — | inject | 0.267 | 1.1m | 108.1k | 2026-02-03 |
| 8 | gemini-3-pro-preview | — | inject | 0.242 | 1.1m | 34.8k | 2026-02-03 |
| 9 | gpt-5-2 | — | inject | 0.225 | 3.0s | 15.8k | 2026-02-03 |
| 10 | claude-opus-4-5 | — | inject | 0.167 | 6.9s | 34.5k | 2026-03-20 |