Why vision-language models beat OCR for tables
Table extraction is the hardest single problem in document parsing. Not because the data is hard to read — a human reads most tables in seconds — but because the standard approach was built on a foundation that cannot represent what makes tables meaningful: visual structure.
How traditional OCR table extraction works
Classic OCR engines like Tesseract 5.x operate at the character level. They identify bounding boxes for individual characters, group characters into words, and group words into lines. To reconstruct a table, a post-processing step groups the character boxes into cells by proximity.
The proximity heuristic works on tables with uniform row heights, solid-line borders, a single row of column headers, and no merged cells. Clean digital invoices and US government forms fit this profile. For everything else, the model is guessing. It does not know what a cell is — it knows where characters are and infers cell membership from spatial proximity.
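A minimal sketch makes the heuristic concrete: bucket word boxes into rows by top coordinate, then into columns by left coordinate. The function and the pixel tolerances here are illustrative assumptions, not any engine's actual implementation:

from collections import defaultdict

def boxes_to_grid(boxes, row_tol=10, col_tol=40):
    # boxes: (text, left, top) tuples from an OCR engine.
    # row_tol and col_tol are assumed pixel tolerances; real pipelines tune them.
    rows = defaultdict(list)
    for text, left, top in boxes:
        rows[top // row_tol].append((left, text))
    grid = []
    for _, words in sorted(rows.items()):
        cols = defaultdict(list)
        for left, text in sorted(words):
            cols[left // col_tol].append(text)
        grid.append([" ".join(cols[c]) for c in sorted(cols)])
    return grid

Everything downstream hinges on those two integer divisions, which is why the failure modes below are structural rather than tunable.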
Where it breaks: four failure modes
Merged cells. When a cell spans two columns, its character boxes sit centered across both columns. The proximity heuristic either assigns them all to one column or splits them across the two; neither is correct. There is no spatial signal for “this cell belongs to both columns simultaneously.”
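A toy example with made-up coordinates shows the dead end. Nearest-column assignment must return one answer:

col_centers = [100, 300]          # x-centers of columns A and B, in px (invented)
merged_word_x = 200               # a word box centered over the merged cell
nearest = min(col_centers, key=lambda c: abs(c - merged_word_x))
# nearest == 100: an exact tie, broken arbitrarily by list order.
# The heuristic can answer "column A" or "column B", never "A and B".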
Rotated column headers. Academic and clinical tables often rotate headers 90 degrees to save horizontal space. Tesseract reads them as a separate text region with no connection to the column below. The stitching step loses the association, so the resulting output has headers that do not align with their data.
Scanned skew. A scan rotated by even 0.5 degrees makes character boxes from adjacent rows overlap vertically. Row grouping fails: characters from row N and row N+1 get merged into the same output line.
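The arithmetic behind that failure is easy to check. Assuming a letter-width page scanned at 300 dpi, about 2550 px across:

import math

page_width_px = 2550                    # assumption: 8.5 in at 300 dpi
drift = page_width_px * math.tan(math.radians(0.5))
print(round(drift, 1))                  # ~22.3 px of vertical drift, edge to edge
# At 300 dpi a 10 pt text line is only ~42 px tall, and inter-row gaps are
# smaller still, so boxes from adjacent rows collide.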
Hand-annotated tables. Margin notes, correction marks, or highlighted cells produce spurious character boxes that the proximity heuristic cannot distinguish from table content. The annotation gets written into the cell output.
What a VLM does instead
A vision-language model receives the page as an image. It does not run character detection first. It looks at the page the way a human reader does — as a visual composition with semantic meaning. A merged cell is visually obvious: the cell boundary spans two columns. The model reads it as merged because that is what it looks like, not because a heuristic inferred it from proximity.
On the DocVQA benchmark (Mathew et al., 2021), frontier VLMs reach ANLS above 0.90 on document-understanding tasks. DocVQA measures answer extraction, not cell-level table reconstruction, but the same visual reasoning that drives those scores is what handles merged cells, rotated headers, and skew.
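ANLS (Average Normalized Levenshtein Similarity) is worth pinning down, since the number is easy to misread as plain accuracy. A minimal sketch, assuming one ground-truth answer per question (the official metric takes a max over several references):

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(preds: list[str], golds: list[str], tau: float = 0.5) -> float:
    scores = []
    for pred, gold in zip(preds, golds):
        d = levenshtein(pred.lower(), gold.lower()) / max(len(pred), len(gold), 1)
        scores.append(1 - d if d < tau else 0.0)  # answers past the threshold score 0
    return sum(scores) / len(scores)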
For rotated headers, the model reads the rotation as a visual feature and associates the header with its column. For scanned skew, it normalizes for orientation before reading. For hand-annotated tables, it identifies annotation as a separate visual layer.
The cost
VLM inference is not free. A frontier model like Llama-3.1-Nemotron-Ultra-253B costs $0.010–$0.015/page at current hosted rates (internal benchmark; public release pending). That is 10–15x the per-page cost of Tesseract at scale.
The cost is justified on complex tables. On plain text pages, it is not. A multi-model router solves this by sending each page to the right tool: a fast, cheap model for clean text; a VLM for pages with complex tables. The blended rate across a real mixed corpus is typically $0.005–$0.009/page.
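The blended figure is a weighted average over the routing mix. A back-of-envelope check with an assumed 60/40 split between clean pages and complex-table pages, and assumed per-tier rates:

mix = {"fast_tier": 0.60, "vlm_tier": 0.40}        # assumed page mix
rate = {"fast_tier": 0.0012, "vlm_tier": 0.0125}   # assumed $/page per tier
blended = sum(mix[t] * rate[t] for t in mix)
print(f"${blended:.4f}/page")                      # $0.0057/page at this mix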
When OCR is still the right choice
Three cases where character-level OCR remains the right choice: documents with clean digital text and no tables; very high-volume pipelines where the table error rate is acceptable; and documents where raw character data, not structured output, is the required deliverable. For all three, Tesseract or a hosted OCR API is faster and cheaper than any VLM approach.
The mistake is applying OCR-class tools to VLM-class problems. Most enterprise document corpora contain both. Route per page, not per corpus.
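A per-page router can start as a cheap structural probe in front of the expensive call. The heuristic below is hypothetical, not how any particular product decides, but it shows the shape of the decision:

import pytesseract
from PIL import Image

def route(page_path: str) -> str:
    # Cheap probe: one fast OCR pass, then look for table-like geometry.
    d = pytesseract.image_to_data(
        Image.open(page_path), output_type=pytesseract.Output.DICT
    )
    lefts = [l for l, t in zip(d["left"], d["text"]) if t.strip()]
    # Many words sharing a few left edges suggests aligned columns.
    col_edges = {l // 25 for l in lefts}           # 25 px bucket: an assumption
    looks_tabular = len(lefts) >= 30 and len(col_edges) <= len(lefts) // 5
    return "vlm" if looks_tabular else "fast_ocr"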
Code recipe: same table, two approaches
Tesseract with LLM stitching, then the Docira path on the same input:
# Path 1: Tesseract + LLM stitching
import pytesseract
from PIL import Image
img = Image.open("table-page.png")
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
# Group bounding boxes into rows by top-coordinate proximity
# Then pass to an LLM to reconstruct table structure as Markdown
# Fails on: merged cells, rotated headers, skewed scans

# Path 2: Docira (VLM routing)
import httpx, base64, os, pathlib

img_b64 = base64.b64encode(
    pathlib.Path("table-page.png").read_bytes()
).decode()
resp = httpx.post(
    "https://api.docira.io/v1/parse?include_trace=true",
    # Read the key from the environment; a literal "$DOCIRA_API_KEY" string
    # would be sent as-is and rejected.
    headers={"Authorization": f"Bearer {os.environ['DOCIRA_API_KEY']}"},
    json={
        "content": img_b64,
        "content_type": "image/png",
        "output_mode": "markdown",
    },
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
# Merged cells: handled. Rotated headers: handled. Skew: handled.
print(data["pages"][0]["content_markdown"])
print("tier:", data["pages"][0]["routing"]["tier"])
print("confidence:", data["pages"][0]["routing"]["confidence"])