Name-extraction scorecard (30 models)

Method

  • Ground truth = per file, a normalized name is "true" if a strict majority of the models that ran that file extracted it. Names only one or two models found (likely hallucinations or OCR-variant duplicates) are excluded.
  • Precision = fraction of a model's extracted names that reached consensus.
  • Recall = fraction of consensus names the model found. F1 is their harmonic mean, 2PR / (P + R).
  • Names are normalized (lowercase, honorifics/titles stripped, punctuation removed) before comparison.
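
The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration, not the harness that produced the table: the honorific list, the function names, and the per-file scoring are all assumptions (the source does not say which titles are stripped or how per-file results are aggregated).

```python
import re
from collections import Counter

# Hypothetical honorific list; the real list is not given in the source.
HONORIFICS = {"mr", "mrs", "ms", "miss", "dr", "rev", "sir", "esq"}


def normalize(name):
    """Lowercase, replace punctuation with spaces, drop honorific tokens."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in HONORIFICS]
    return " ".join(tokens)


def consensus(extractions):
    """Names extracted by a strict majority of the models that ran one file.

    `extractions` maps model name -> list of raw names for that file.
    """
    votes = Counter()
    for names in extractions.values():
        # Each model votes at most once per normalized name.
        votes.update({normalize(n) for n in names} - {""})
    return {name for name, count in votes.items() if 2 * count > len(extractions)}


def score(predicted, truth):
    """Return (precision, recall, F1) for one model against the consensus set."""
    pred = {normalize(n) for n in predicted} - {""}
    hits = len(pred & truth)
    p = hits / len(pred) if pred else 0.0
    r = hits / len(truth) if truth else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f1


# Toy data, invented for illustration only.
extractions = {
    "model_a": ["Mr. John Smith", "Mary Doe"],
    "model_b": ["john smith"],
    "model_c": ["John Smith,", "Bob Roe"],
}
truth = consensus(extractions)
p, r, f1 = score(extractions["model_a"], truth)
```

In the toy data only "john smith" reaches a strict majority (3 of 3 models), so model_a is penalized on precision for "Mary Doe" even though that name may be real. That is the known trade-off of consensus-based ground truth.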

Coverage is near-complete this run: 26 of 30 models finished all 10 files. Four are partial. Three of them are the smallest models (llama3.2:1b at 8/10; llama3.2:3b and qwen2.5:3b at 9/10), which are weak or broken on quality anyway. The fourth, qwen3.6:35b-a3b (7/10), is partial because of latency rather than weakness: it timed out (DNF) on the longest documents. Read its F1 (0.70) with that coverage caveat, since it emitted names on only 3 of the 7 files it reached. For the other 26 models the "DNF inflates F1" caveat no longer bites, so the separate easy-subset table has been dropped.

The corpus has 10 files, but only ~7 carry real person lists: three short docs (NAI_..._07_0036, VRTI_CEN_Report_1871, op1246585) are essentially empty, so a healthy Empty count (files on which a model returned no names) is 3. Empty above 3 means the model is missing people on documents that actually have them; Empty below 3 suggests it is emitting names on the near-empty documents. The two hard documents, where weak models collapse, are the tcd ~70-name rent-roll and NAI_..._06.
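
As a rough sketch of how the Empty column can be read, assuming a model's output is available as a mapping from file name to extracted names (the function names and the wording of the labels are illustrative, not part of the original harness):

```python
EXPECTED_EMPTY = 3  # three of the 10 files are essentially empty, per the text


def empty_count(names_per_file):
    """Number of files on which the model returned no names."""
    return sum(1 for names in names_per_file.values() if not names)


def read_empty(count, expected=EXPECTED_EMPTY):
    """Interpret an Empty count against the expected baseline."""
    if count > expected:
        return "missing people on real documents"
    if count < expected:
        return "emitting names on near-empty documents"
    return "healthy"
```

With the table's values, an Empty of 8 (granite3.3:8b) reads as missing people and an Empty of 0 (llama3.1:8b) as over-emitting. For the four partial-coverage models the baseline of 3 only holds if all three near-empty files were among the ones they finished.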

The comparison was run on a Linux box with an Nvidia RTX 3060 16GB video card.

Full corpus (10 files), sorted by F1

Model                         P     R     F1    Files  Empty  Persons  Median (s)  s/name  Notes
anthropic/claude-sonnet-4-6   0.80  0.99  0.88  10     3      108      2.94        0.50    Best overall; near-perfect recall
deepseek/deepseek-reasoner    0.75  0.99  0.86  10     3      114      22.79       2.70    Excellent quality, but slow (reasoning)
anthropic/claude-opus-4-8     0.75  0.99  0.85  10     3      115      3.11        0.52    Top-tier, but no edge over sonnet at higher cost
deepseek/deepseek-chat        0.76  0.93  0.84  10     3      106      2.06        0.34    Standout value - fast, cheap, accurate
gemini/gemini-2.5-flash-lite  0.74  0.99  0.84  10     3      117      1.21        0.17    Fastest cloud at this quality; best value
gemini/gemini-2.5-flash       0.73  0.98  0.83  10     3      117      9.83        1.22    Good, but flash-lite beats it on speed
gemini/gemini-2.5-pro         0.72  0.98  0.83  10     3      118      16.72       1.75    No quality gain over flash-lite, ~14× slower
mistral/mistral-large-latest  0.73  0.95  0.83  10     3      113      6.26        1.64    Strong; slow, high-variance latency
anthropic/claude-haiku-4-5    0.72  0.94  0.82  10     3      114      2.05        0.30    Current names default; fast and solid
openai/gpt-4.1                0.71  0.92  0.80  10     3      113      1.99        0.35    Reliable quality/speed balance
ollama/qwen3:14b              0.74  0.86  0.80  10     4      101      81.89       14.77   Best local - finished all 10; brutally slow
mistral/mistral-small-latest  0.70  0.90  0.79  10     4      111      1.10        0.22    Fastest median among models with F1 ≥ 0.70
openai/gpt-4.1-mini           0.68  0.87  0.77  10     3      111      2.79        0.45    Solid cheap cloud
mistral/ministral-8b-latest   0.67  0.90  0.76  10     3      117      5.07        0.58    Decent; uncalibrated
openai/gpt-4o-mini            0.74  0.68  0.71  10     4      80       1.48        0.59    Precise but misses ~⅓
ollama/qwen3.6:35b-a3b        0.94  0.56  0.70  7      4      16       266.86      180.16  Highest precision in field, but only ran 7/10; unusably slow MoE
ollama/llama3.1:8b            0.61  0.79  0.69  10     0      114      4.68        1.02    Hallucinates; never returns empty
ollama/gemma2:9b              0.67  0.69  0.68  10     5      90       1.49        0.73    Mediocre but stable
ollama/granite3.3:8b          0.71  0.53  0.61  10     8      65       1.34        0.66    Broken: empty on 8/10, dumps names only on rent-roll
openai/gpt-4.1-nano           0.69  0.55  0.61  10     9      70       0.60        0.34    Too weak: empty on 9/10
ollama/gemma3:12b             0.52  0.67  0.59  10     4      111      7.64        1.14    Full coverage but low precision
ollama/mistral:7b             0.66  0.52  0.58  10     4      68       1.75        0.83    Low recall
ollama/llama3.2:3b            0.59  0.46  0.52  9      1      66       1.09        0.48    Weak; dropped a file
ollama/qwen2.5:14b            0.51  0.51  0.51  10     5      87       2.81        1.37    Surprisingly weak for its size
ollama/phi4                   0.48  0.47  0.47  10     4      86       5.50        1.56    Weak + slow
ollama/qwen3:8b               0.56  0.28  0.37  10     4      43       46.98       25.39   Low recall and extremely slow
ollama/qwen2.5:7b             0.78  0.16  0.27  10     7      18       0.99        0.93    Broken: bails on long input
ollama/qwen2.5:3b             0.75  0.16  0.26  9      5      16       0.71        25.15   Broken; one file timed out (~394 s)
ollama/mistral-nemo:12b       0.40  0.07  0.12  10     7      15       1.41        1.74    Broken: 7/10 empty, 15 names total
ollama/llama3.2:1b            0.29  0.03  0.05  8      1      7        0.66        34.94   Effectively nonfunctional on this task

Takeaways

  • claude-sonnet-4-6 is still the quality leader (F1 0.88), with deepseek-reasoner (0.86) and claude-opus-4-8 (0.85) right behind - all three hit ~0.99 recall. Opus shows no advantage over sonnet here at higher cost, and the reasoner's quality comes at roughly 8× sonnet's median latency (22.8 s vs 2.9 s; s/name 2.70 vs 0.50).
  • The new value champions are gemini-2.5-flash-lite and deepseek-chat (both F1 0.84). flash-lite has the lowest s/name in the whole field (0.17, with a 1.21 s median) yet matches the mid-frontier on quality - it's a strong candidate to become the names default over haiku-4-5 (0.82). Only mistral-small-latest (1.10 s) and the broken gpt-4.1-nano (0.60 s) post a lower cloud median. deepseek-chat is nearly as fast (2.06 s) and similarly cheap.
  • Bigger Gemini is not better Gemini. gemini-2.5-pro (0.83) does not beat flash-lite (0.84) on quality while being ~14× slower (16.7 s vs 1.2 s). flash sits between them on speed with no quality edge. Use flash-lite.
  • qwen3:14b is the best local model and the first to be genuinely cloud-adjacent - F1 0.80, tying gpt-4.1, and it finished all 10 files including the hard ones. But it's unusable at scale: median 82 s/file (max 689 s ≈ 11 min), s/name 14.77 vs 0.17–0.6 for the fast cloud models. The quality is there; the throughput is not.
  • qwen3.6:35b-a3b posts the highest precision in the whole field (0.94) but is the slowest model benchmarked, by a wide margin. Median 267 s/file (max 828 s ≈ 14 min), s/name 180 - ~1000× flash-lite's. That latency is why it only finished 7/10 files, and on 4 of those 7 it returned empty, so its recall (0.56) and 0.70 F1 are coverage-limited rather than a true quality ceiling. When it does emit a name it is almost always right, but at this throughput it is a research curiosity, not a usable extractor.
  • Most other locals are weak or broken. llama3.1:8b (0.69) hallucinates and never returns empty; gemma3:12b (0.59) gets full coverage but poor precision. The genuinely broken set - empty on most real documents or bailing on long input - is granite3.3:8b, gpt-4.1-nano (the one broken cloud model), qwen2.5:7b, qwen2.5:3b, mistral-nemo:12b, and llama3.2:1b (effectively nonfunctional at 0.05). qwen3:8b is the worst combination: low recall (0.28) and 25 s/name.
  • s/name remains the meaningful throughput metric - latency is output-bound (number of names emitted), not input-bound. It's inflated for the broken tiny models (llama3.2:1b 34.9, qwen2.5:3b 25.2) because they emit almost no names while occasionally timing out, so read those rows as "broken," not "slow."
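
The gap between median latency and s/name for the broken models can be reproduced with a small sketch. It assumes s/name is total wall-clock seconds divided by total names emitted, which is consistent with the qwen2.5:3b row but is not stated explicitly in the source. The per-file latencies below are invented for illustration; only the ~394 s timeout and the 16-name total come from the table.

```python
import statistics


def seconds_per_name(latencies_s, total_names):
    """Total wall-clock seconds divided by total names emitted (assumed definition)."""
    if total_names == 0:
        return float("inf")  # no names at all: throughput is undefined
    return sum(latencies_s) / total_names


def summarize(latencies_s, total_names):
    """Median per-file latency next to s/name, to show how they diverge."""
    return {
        "median_s": statistics.median(latencies_s),
        "s_per_name": seconds_per_name(latencies_s, total_names),
    }


# Hypothetical run: eight fast files plus one timeout, 16 names in total.
latencies = [0.7] * 8 + [394.0]
summary = summarize(latencies, total_names=16)
```

Here the median stays at 0.7 s while s/name lands near 25 s, because one timeout is spread over very few names. The median hides the outlier and s/name amplifies it, which is why those rows read as broken rather than slow.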