As I've mentioned in previous posts, one of my hobbies is genealogy - the study of my family history. Two thirds of my family trace their heritage back to England, Ireland, and Scotland. The majority of those, including both of my parents' paternal lines, are from what we in the US call the "Scotch Irish". This refers to families that first relocated from Scotland to Ulster (present day Northern Ireland and County Donegal) and then later left for the United States (or the American Colonies). In my case, both the Youngs and the Houstons came to the colonies shortly before or immeditately after the American Revolution. In both cases they arrived in the American South through Pennsylvania and North Carolina, later migrating through Tennesee, Mississippi, Arkansas, and Oklahoma before arriving in Texas. These were folks who were living on the American frontier, many of them from Appalachia.
The length of time that my family has been in the United States makes tracking them back to their original immigration records difficult. This is made more difficult by two compounding factors. First, the families were very close knit and there was a common practice of reusing names within a family. This means that sometimes its hard to figure out which "Thomas Young" or "John Houston" I'm looking for in a given community. The second is that the US didn't keep very good census records until about 1850, only listing heads of families before then. After digging through many records I'm finally 95% sure I know which Young and Houston men came over from Ireland. However, anyone who has tried to follow the path of their family's immigration from Ireland runs into the same wall eventually: the records you need are in Ireland, they are old, and a great many of them no longer exist.
The largest hurdle to overcome is that in 1922, at the start of the Irish Civil War, the Public Record Office of Ireland at the Four Courts in Dublin was destroyed by fire, and seven centuries of records went with it. The Virtual Record Treasury of Ireland is an extraordinary effort to undo some of that loss. It is a collaborative research project that has been reconstructing the destroyed archive from surviving copies, transcripts, and duplicates scattered across other institutions, and then publishing the results - high-resolution images, metadata, and full transcriptions - online for anyone to use. I am enormously grateful for the work they have done and the documents they make freely available; this post would not exist without them, and I wouldn't be able to research my Irish roots.
Over the past month or so I have been working on a set of tools designed to make it easier for me to work with and search the data the VRTI has published. Included in these tools is vtextract, a command-line tool that downloads resources from the Virtual Record Treasury into a local, resumable archive so I can search and cross-reference them offline. This post is about one particular piece of that project: the work I did to run a large language model over the transcriptions in my local archive to find the names of all the people mentioned on each page. It turned into a much more interesting engineering problem than I expected - a story about benchmarking models, building a cost projection that turned out to be wildly optimistic, and then figuring out why it cost more than I expected and how to fix it going forward.

The Goal: People, Not Just Pages
A transcription is just text. If I want to find every page that mentions a particular family, full-text search gets me part of the way, but it is brittle: spelling was not standardized, the same person appears as "Jno. Houston," "John Huston," and "Houstoun," and a page that lists two hundred names is no easier to skim on a screen than it was on parchment. What I really wanted was a structured list of people attached to each page - a durable record of who appears where, normalized so that all those surface forms collapse to one searchable person.
My solution was to add a names tool to vtextract. This tool reads over the local archive and runs an LLM over each page transcription. It then writes out a small json file (called a 'sidecar' because it sits next to the main transcription file) that lists the people found on that page. Each person is stored compactly as the canonical (normalized) name followed by the verbatim surface forms actually seen on the page, so I can later search by either the clean name or any of its messy variants. My local archive currently holds 174,494 page transcriptions. Running a cloud LLM over all of them is an expensive use of tokens, so I first ran analysed a random sample of files to try and perfect the process before running it at scale.
Picking a Model
I wanted to (a) pick the best model for the job and (b) estimate what it would cost. I built a small benchmarking harness, vtnamebench, that runs the exact production extraction pipeline over a fixed corpus of transcriptions and scores each model. I ran it against thirty models across local Ollama, OpenAI, Anthropic, DeepSeek, Mistral, and the full Gemini 2.5 lineup to compare their cost and performance.
Scoring name extraction is itself a little tricky, because there is no official answer key for "who is mentioned on this page." I used a cross-model consensus as the ground truth: a name counts as real if a strict majority of the models that processed a file found it, which neatly filters out both hallucinations and one-off OCR variants. Each model then gets a precision, recall, and F1 score against that consensus, plus its latency distribution.
The results sorted out cleanly:
| Model | F1 | Median s | Notes |
|---|---|---|---|
| claude-sonnet-4-6 | 0.88 | 2.9 | Best overall; near-perfect recall |
| deepseek-reasoner | 0.86 | 22.8 | Excellent, but slow |
| claude-opus-4-8 | 0.85 | 3.1 | Top-tier; no edge over Sonnet here |
| deepseek-chat | 0.84 | 2.1 | Standout value |
| gemini-2.5-flash-lite | 0.84 | 1.2 | Fastest cloud; cheapest tier |
| gemini-2.5-pro | 0.83 | 16.7 | No gain over flash-lite, ~14× slower |
| qwen3:14b (local) | 0.80 | 81.9 | Best local model; far too slow at scale |
claude-sonnet-4-6 was the quality ceiling at F1 0.88, but the interesting result was further down. gemini-2.5-flash-lite matched the mid-frontier on quality (F1 0.84) while being the fastest cloud model in the entire field - a 1.2-second median - at the cheapest published price tier I could find ($0.10 per million input tokens, $0.40 per million output). A couple of things jumped out along the way: bigger Gemini was not better Gemini (the much slower gemini-2.5-pro did not beat flash-lite on quality), and the local models were either far too slow to use at scale - the best of them, qwen3:14b, took over a minute per page - or simply broke on long input. flash-lite was the obvious pick.
The Projection That Lied
With a model chosen, I extrapolated the cost. The benchmark recorded the tokens used per page, so multiplying that out across 174,494 pages gave a total. The number came back in the low tens of dollars. That was the figure that set my expectations, and it was wrong by something like a factor of five.
The flaw, in hindsight, is embarrassingly simple. The benchmark corpus was chosen to be quality-diverse - a spread of easy and hard documents to separate the good models from the bad ones. It was not chosen to be size-representative. It was ten files averaging about 2,400 bytes, with the largest at 6,582. Nothing in it was big enough to trigger the failure modes that, it turned out, dominates the real bill.
Dense Pages Break Things
The Virtual Record Treasury archive is full of registry and index pages: long, repetitive lists of names, line after line of "Surname, Place, Year." The real distribution looks nothing like the benchmark sample:
| Files | Avg bytes | Max bytes | |
|---|---|---|---|
| Benchmark sample | 10 | 2,448 | 6,582 |
| Real archive | 174,494 | 3,980 | 45,065 |
The average page is over 1.6× the sample, and the largest page in the archive is seven times bigger than the largest thing the benchmark ever saw. On these dense pages, two things went wrong, and they compounded each other.
The first was repetition loops. At temperature 0 (greedy, deterministic decoding), a model fed highly repetitive text can fall into a loop, emitting names - some real, some hallucinated - until it hits its own output ceiling. For flash-lite that ceiling is 65,536 tokens. A page that should have cost a few hundred output tokens instead cost tens of thousands. The second was truncation: a response cut off at the token ceiling is incomplete JSON, which the parser then chokes on with a misleading error - and the wasted tokens are billed regardless.
Safeguards, Built Reactively
As these problems surfaced, I added layers of protection. The single most important one was an output-token cap: a per-call limit (which I set to 12,000 tokens) so that a runaway loop hits my cap instead of the model's 65,536 ceiling. That one change cut the cost of the worst pages by roughly 9×, and 12,000 sits comfortably above what any legitimate page produces, so real extractions were unaffected.
Around that I added truncation detection (recognizing when a response was cut off and raising a typed error instead of feeding broken JSON to the parser), a retry that re-issues a truncated page once at a higher temperature to break the greedy-decoding loop, a repair path for merely malformed JSON, and handling for a Gemini-specific quirk where the model flags dense, list-like text for "recitation" and returns an empty response. That last case originally crashed with a cryptic error and got misclassified as a temporary failure, so the same pages failed on every run; once I gave it a proper error type, those pages were properly marked as errors and were not automatically retired, which reduced the cost of later runs.
The retry temperature was worth tuning carefully. I measured it on the real failing pages: 0.3 still looped to the cap, 0.7 sometimes produced malformed JSON, and 0.5 broke the loop reliably while keeping the output valid. So the first pass stays fully deterministic at temperature 0, and only the retry runs hot.
The Bill Still Came In High
Even with the safeguards in place, by the time about 83% of the archive (144,920 pages) had been processed, it was clear the spend was several times the projection. I reconstructed the actual cost from the per-page token counts recorded in each sidecar for the pages that succeeded:
| Tokens | Cost | |
|---|---|---|
| Input | 286.5M | $28.65 |
| Output | 129.4M | $51.78 |
| Subtotal (144,920 pages) | ~$80 | |
| Parked error pages (untracked) | est. | +$10–20 |
| Total | ≈ $90–100 |
A few things stood out once I had the real numbers. Output dominated the bill. Output tokens cost 4× what input tokens do, and they made up 64% of the total - name extraction emits a JSON entry per person, and dense pages emit enormous lists. The waste was savagely concentrated, too: just 2,636 pages, 1.82% of the archive, produced 30% of all the output tokens. One single page hit 68,886 output tokens. And the error pages were generating real, billable tokens that were recorded in no sidecar at all - invisible to my tracking, fully present on the invoice.
The Prompt-Caching Dead End
One obvious lever for an input-heavy workload is prompt caching: my extraction prompt has a fixed system prompt sent on every call, so why pay for it 145,000 times? I went looking and hit a wall. Gemini's minimum cacheable prefix is 1,024 tokens; my system prompt is only about 530, well below the floor, so any cache control I set would be ignored.
The Two Fixes That Worked
Since the cost was output-bound, the fixes had to attack the output. Two of them did the real work.
The first was a compact output schema. The original sidecar format was a verbose object per person, with a confidence field on the name and on every alias. The confidence values were near-uniformly "high" - noise, really - and that scaffolding ("confidence": "high", repeated for every name and every alias, plus all the JSON braces and keys) was being emitted on every page. I switched to a compact array - the canonical name followed by its surface forms, nothing else - and dropped confidence entirely. On name-bearing pages that is roughly 3× less output, and far more than that on the dense index pages where the per-name overhead dominates: exactly the pages that were driving the bill.
The second was adaptive chunking. I had spent some time trying to predict, from the input text alone, which pages would loop - so I could handle them differently before spending more tokens. It turns out I couldn't cleanly separate loop-prone pages from normal ones: the loop-causers are split between long name-dense registries and short garbled-OCR pages, and no simple feature (page size, capitalized-word count, etc.) draws a clean line without also flagging a quarter of the perfectly normal pages.
The key realization was that I didn't need a precise classifier. A false positive is nearly free if the penalty is "chunk this page into smaller pieces" rather than "skip this page." Splitting a normal large page into a few extra windows cost a handful of cheap calls and lost no quality at all. So instead of chasing a classifier, I made the cost of being wrong negligible: any page over a size threshold gets split into smaller windows, which keeps each individual generation short. Short generations are far less likely to loop, and when one does, the output cap bounds the damage to a single small window instead of a giant one.
Did It Work?
After landing the compact schema, the empty-response fix, and adaptive chunking, I re-ran the parked-error pages and the roughly 28,000 pages that had not yet been processed. (The 144,920 already-successful pages kept their existing results; a one-time migration just rewrote the JSON shape without re-calling the model.)
| Status | Before | After |
|---|---|---|
| Successful | 144,920 | 174,279 (99.88%) |
| Errored | 1,551 | 215 |
| Not yet run | 28,023 | 0 |
The new code recovered 92.6% of the old errors, regressed nothing, and succeeded on 99.6% of the never-run pages. Comparing the pages processed under the new code against the old ones directly from their recorded token counts:
| Old code | New code | |
|---|---|---|
| Output tokens/page | 893 | 329 (−63%) |
| Input tokens/page | 1,977 | 1,319 (−33%) |
| Cost/page | $0.000555 | $0.000264 (−53%) |
The 63% drop in output tokens is the headline, and it was the intended effect of the compact schema and loop elimination. Extrapolated to the whole archive, the new rate works out to roughly $46 against about $97 at the old rate - almost half the cost.
That left 215 stubborn pages that even adaptive chunking could not crack - extreme dense or garbled documents that loop even at small chunk sizes, plus a handful Gemini refused on recitation grounds. These are content-hard, not a bug, and the right tool for them is not more chunking but a stronger model. So I re-ran just those 215 on claude-sonnet-4-6 - the quality leader from the original benchmark - and it cleared all 215 with zero regressions.
The final state of the archive: 174,494 of 174,494 pages successfully extracted - 100%, zero errors. This is exactly the division of labour the benchmark pointed at all along: the cheap, fast model does the overwhelming bulk at a fraction of a cent per page, and the expensive model is held in reserve for the few hundred pathological pages it handles better. Best of both worlds.
Takeaways
A few lessons from this that I think generalize well beyond Irish parish records:
Benchmark on a size- and shape-representative sample, not just a quality-diverse one. Every expensive failure mode in this project - loops, truncation, recitation blocks - lived in the tail of the size distribution, and a sample of small pages hid all of them. The benchmark answered "which model is most accurate?" perfectly and "what will this cost?" not at all, because it wasn't really built to answer the second question.
For extraction tasks, project on output tokens. Output is the expensive side (4× the input price here) and the one that explodes on pathological input. Input scales predictably with document size; output does not.
Track wasted tokens, including failures. My error pages recorded no usage, which made a real and recurring cost completely invisible until I went looking for it on the actual bill.
Cap the blast radius early. The output-token cap was the cheapest, highest-leverage control in the whole project, but I added it reactively after the damage was done. It should have been there from the first run.
Prefer prevention to recovery when recovery is expensive. The temperature-0.5 retry rescued about 63% of truncated pages, but each rescue still paid for the wasted first attempt. Smaller chunks attack the loop at its source, so fewer pages need rescuing at all. A recovery mechanism that works is still worth replacing if it works expensively.
When no clean classifier exists, make false positives cheap instead. I could not reliably flag loop-prone pages in advance - but by choosing an action (chunk smaller) whose false-positive cost is negligible, an imperfect, blunt rule became safe to apply aggressively. Sometimes the right move is not a better predictor but a cheaper mistake.
The end result is that I now have a fully searchable index of every person named across the entire reconstructed archive - which is exactly the tool I need to go hunting for those 18th-century Irish ancestors. I am grateful, again, to the Virtual Record Treasury of Ireland for making the underlying documents available in the first place. Now I just have to find my family in them.