Building a clinical trial document pipeline that survives audits
One of our early customers runs a clinical-data platform for oncology tumor-board support. Every quarter they ingest around 800 ASCO clinical guideline PDFs, converting them to structured JSON that feeds a retrieval system their oncologists query during reviews. The pipeline is not experimental — it runs in production and its outputs inform clinical workflows.
Their engineering team came in with a specific constraint: the pipeline had to be re-runnable, auditable, and able to explain every parse decision. Generic OCR was not a viable starting point.
The pipeline shape
The pipeline runs on a weekly schedule. Source PDFs arrive from an internal document store. Each PDF is submitted to the Docira batch API with output_mode=json and include_trace=true. The JSON output goes into a PostgreSQL table. The routing trace for each page goes into a separate audit table, keyed by document ID and page number.
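A minimal sketch of the per-document submission step helps make this concrete. The output_mode and include_trace flags are the ones the team actually sets; the endpoint URL, response shape, and table names are illustrative assumptions, not Docira's documented API:

import json
import requests

DOCIRA_URL = "https://api.docira.example/v1/batch/parse"  # hypothetical endpoint

def submit_pdf(pdf_path, document_id, api_key, conn):
    # conn: an open psycopg2 connection to the pipeline database.
    # Submit one PDF with structured JSON output plus a per-page routing trace.
    with open(pdf_path, "rb") as f:
        resp = requests.post(
            DOCIRA_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            params={"output_mode": "json", "include_trace": "true"},
            files={"file": f},
            timeout=300,
        )
    resp.raise_for_status()
    result = resp.json()  # assumed shape: {"content": ..., "pages": [...]}

    with conn.cursor() as cur:
        # Parsed content feeds the retrieval layer...
        cur.execute(
            "INSERT INTO parsed_documents (document_id, content) VALUES (%s, %s)",
            (document_id, json.dumps(result["content"])),
        )
        # ...and each page's routing trace is stored verbatim in the
        # audit table, keyed by document ID and page number.
        for page in result["pages"]:
            cur.execute(
                "INSERT INTO routing_audit (document_id, page, processed_at, trace)"
                " VALUES (%s, %s, NOW(), %s)",
                (document_id, page["page"], json.dumps(page["routing"])),
            )
    conn.commit()

Writing the parse result and the trace in the same transaction means a page can never appear in the content table without its audit record.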
Downstream, a retrieval layer indexes the JSON. Oncologists query by treatment line, tumor type, or biomarker. The routing audit table is never queried by clinicians — it exists for engineers and for regulators.
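For a sense of the clinician-facing side, here is an illustrative lookup, assuming the parsed JSON sits in a jsonb column with flat tumor_type and treatment_line keys; the real index layout is not described in this post:

def find_recommendations(conn, tumor_type, treatment_line):
    # Illustrative only: assumes parsed guideline JSON in a jsonb
    # `content` column with flat keys, which this post does not specify.
    sql = """
        SELECT document_id,
               content -> 'recommendations' AS recommendations
        FROM parsed_documents
        WHERE content ->> 'tumor_type' = %s
          AND content ->> 'treatment_line' = %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (tumor_type, treatment_line))
        return cur.fetchall()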
Why generic OCR fails this workflow
ASCO guideline PDFs contain a specific type of difficult content: treatment-algorithm tables with merged cells, footnote columns that span half the page width, and recommendation grades embedded as superscripts within cell text. Character-level OCR reads the characters correctly but loses the structure. A transposed table cell looks like correct text.
The team had previously run a pilot with a traditional OCR service. On a set of 40 tables sampled from five guideline documents, manual review found structural errors in 11 of them — cells associated with the wrong column header, merged-cell content split across two rows, footnote text merged into adjacent cells. None of the errors were detectable by checking character accuracy. They required a human reader to verify table structure against the source PDF.
The failure mode mattered clinically. A transposed ORR value in a first-line treatment table is not a typo — it is a number that an oncologist might read as evidence for a recommendation.
The routing trace as audit log
When Docira processes a page, the response includes a routing trace with the complexity score, tier, provider, model, and confidence for that specific page. This team stores it verbatim. A sample record from their audit table looks like this:
{
  "document_id": "asco-guideline-2024-xyz",
  "page": 14,
  "processed_at": "2026-04-15T03:22:41Z",
  "routing": {
    "complexity_score": 0.87,
    "tier": "expert",
    "provider": "nvidia",
    "model": "nvidia/llama-3.1-nemotron-ultra-253b-v1",
    "confidence": 0.91
  }
}

That record answers “which model processed page 14 of this guideline, and what confidence did it return?” It was not built as a separate audit feature. It is a side effect of the parse response.
The team’s compliance lead described the value in practical terms: when an auditor asks “can you show me the decision process for how this clinical table was parsed?”, the answer is a database query, not a manual reconstruction.
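Concretely, that database query can be as small as the sketch below, using the table and column names assumed earlier in this post:

def decision_trail(conn, document_id, page):
    # Answers "show me the decision process for this page": the stored
    # routing trace (complexity score, tier, provider, model, confidence)
    # exactly as Docira returned it at parse time.
    sql = """
        SELECT processed_at, trace
        FROM routing_audit
        WHERE document_id = %s AND page = %s
        ORDER BY processed_at DESC
    """
    with conn.cursor() as cur:
        cur.execute(sql, (document_id, page))
        return cur.fetchall()

Called with the sample record's keys, decision_trail(conn, "asco-guideline-2024-xyz", 14) returns the JSON shown above.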
What they did not have to build
Two things this team did not build that a custom pipeline would have required: a model-selection layer and a confidence-scoring layer. Both are handled by Docira’s routing stage. On pages where the complexity score is below 0.3 — clean text, no tables — a Fast-tier model handles the parse at $0.003/page. On pages with complex tables and high complexity scores, the Expert tier runs automatically.
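Conceptually, the decision those two layers would have had to make looks like the sketch below. The 0.3 Fast-tier threshold and the $0.003/page figure come from this post; the Pro/Expert boundary shown is an illustrative assumption, and none of this is Docira's actual implementation:

def pick_tier(complexity_score):
    # Conceptual sketch only; the real routing lives inside Docira.
    # Below 0.3 (clean text, no tables) a Fast-tier model parses the
    # page at $0.003/page; complex pages route to the Expert tier.
    if complexity_score < 0.3:
        return "fast"
    if complexity_score < 0.75:  # illustrative Pro/Expert boundary, assumed
        return "pro"
    return "expert"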
They also did not build retry logic for individual pages. The circuit breaker and provider fallback are handled at the infrastructure level. If NVIDIA’s quota is exhausted mid-batch, the request fails over to another Expert-tier provider. The audit table records the actual provider that handled each page, which keeps every run, including fallbacks, traceable.
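Because the trace stores the provider that actually ran, a fallback event is visible after the fact. A sketch of that check, against the same assumed audit schema:

def fallback_pages(conn, primary_provider="nvidia"):
    # Expert-tier pages served by a provider other than the primary,
    # i.e. pages where the circuit breaker or quota fallback fired.
    sql = """
        SELECT document_id, page, trace ->> 'provider' AS provider
        FROM routing_audit
        WHERE trace ->> 'tier' = 'expert'
          AND trace ->> 'provider' <> %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (primary_provider,))
        return cur.fetchall()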
How the pipeline scaled
The initial batch was 800 PDFs. After six months, the corpus grew to 2,400 documents as the platform added two additional guideline sources. The pipeline did not change. Batch size, provider selection, and confidence thresholds all scale with the document count without configuration changes.
The one tuning change they made: after reviewing their audit table, they found 12% of pages routing to the Expert tier with complexity scores between 0.75 and 0.82. Those pages were borderline — guideline body text with a small sidebar table. They adjusted the table-cell-count weighting in their submitted schema to push those pages to the Pro tier, reducing the blended cost by about 8% without measurable impact on output quality.
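That review is itself an audit-table query. A sketch of the analysis that would surface the 12% figure, again against the assumed schema:

def borderline_expert_share(conn):
    # Share of all pages that hit the Expert tier with a complexity
    # score in the borderline 0.75-0.82 band (the 12% in the text).
    sql = """
        SELECT count(*) FILTER (
                   WHERE trace ->> 'tier' = 'expert'
                     AND (trace ->> 'complexity_score')::float
                         BETWEEN 0.75 AND 0.82
               )::numeric / NULLIF(count(*), 0)
        FROM routing_audit
    """
    with conn.cursor() as cur:
        cur.execute(sql)
        return cur.fetchone()[0]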
A note on HIPAA
This pipeline processes published clinical guidelines, not patient data. It does not involve protected health information. Docira is HIPAA-ready — a BAA is available on the Enterprise plan for customers who need to process PHI. The Free, Starter, Pro, and Scale tiers do not cover PHI processing. Customers with workflows that involve patient data should contact us to discuss Enterprise terms before processing.