What a PDF table extractor actually does
PDFs look like they contain tables, but from the file format's perspective there are no tables at all. A PDF is a stack of text, lines, and images placed at exact coordinates on a page. What your eye reads as a table is really a grid of positioned text runs that happen to line up. Extracting a table means reversing that process: inferring which text belongs to which row and column based on either the visible borders (if there are any) or the whitespace between words.
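To make the "positioned text runs" idea concrete, here is a minimal Python sketch of the first step of that inference: grouping words into rows by their vertical position. The `Word` class, the `group_into_rows` function, the sample coordinates, and the 2-point tolerance are all illustrative assumptions, not this tool's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float   # left edge of the word, in PDF points
    x1: float   # right edge of the word
    top: float  # distance from the top of the page


def group_into_rows(words, tolerance=2.0):
    """Cluster words whose `top` values lie within `tolerance` of the row's first word."""
    rows = []
    for word in sorted(words, key=lambda w: (w.top, w.x0)):
        if rows and abs(word.top - rows[-1][0].top) <= tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Within each row, read left to right.
    return [sorted(row, key=lambda w: w.x0) for row in rows]


# Hypothetical words as they might come out of a PDF text layer, in no useful order.
words = [
    Word("Qty", 200, 220, 100.0),
    Word("Item", 50, 80, 100.4),
    Word("Widget", 50, 90, 115.0),
    Word("3", 200, 206, 115.2),
]
rows = [[w.text for w in row] for row in group_into_rows(words)]
```

With this sample data, `rows` comes out as two rows, "Item, Qty" and "Widget, 3", even though nothing in the input says they belong together. Deciding which column each word belongs to is the harder half of the problem, and that is where the two extraction methods described next differ.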
This tool does that inference on our server because the algorithms involved (lattice detection from line art, column inference from text positions, OCR for scanned pages) are too heavy to run reliably in a browser. Your PDF is uploaded, processed, and the original plus the generated CSV are deleted automatically after you download the result.
Lattice vs stream: which extraction method should you pick?
The tool exposes three modes: Auto, Lattice, and Stream. Auto is almost always the right choice, but understanding the difference helps when Auto picks wrong.
Lattice
Lattice extraction follows visible table borders. The extractor looks at the line art in the PDF, finds horizontal and vertical rules that form a grid, and uses those lines to decide where cells begin and end. This produces the cleanest output when it works, because the grid is unambiguous. Use it for bank statements with ruled tables, engineering drawings with title blocks, regulatory filings, and anything where you can see borders around every cell.
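As a rough sketch of the idea, assume the rules have already been detected and reduced to a list of x positions (vertical rules) and y positions (horizontal rules). Each word can then be dropped into the cell that contains its center point. The function names and sample coordinates below are hypothetical, and real lattice extractors also handle merged cells and broken lines, which this sketch does not.

```python
from bisect import bisect_right


def cell_index(edges, coord):
    """Index of the interval between sorted `edges` that contains `coord`, or None."""
    i = bisect_right(edges, coord) - 1
    if 0 <= i < len(edges) - 1:
        return i
    return None


def fill_grid(v_rules, h_rules, words):
    """Place (text, center_x, center_y) words into the grid formed by the rules."""
    xs, ys = sorted(v_rules), sorted(h_rules)
    grid = [["" for _ in range(len(xs) - 1)] for _ in range(len(ys) - 1)]
    for text, cx, cy in words:
        col, row = cell_index(xs, cx), cell_index(ys, cy)
        if row is None or col is None:
            continue  # the word lies outside the ruled area, e.g. a page header
        # Words sharing a cell are joined in the order they were supplied.
        grid[row][col] = (grid[row][col] + " " + text).strip()
    return grid


# Hypothetical rule positions and word centers, in PDF points.
v_rules = [40, 150, 260]
h_rules = [90, 110, 130]
words = [
    ("Item", 65, 100), ("Qty", 210, 100),
    ("Widget", 70, 120), ("3", 203, 120),
    ("Page 1", 300, 20),
]
grid = fill_grid(v_rules, h_rules, words)
```

Because every cell boundary comes from a drawn line, no threshold has to be guessed, which is why lattice output tends to be clean when the borders are really there.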
Stream
Stream extraction ignores lines entirely and infers columns from whitespace. It looks at the horizontal positions of every word on a page, finds gaps that are wider than normal word spacing, and uses those gaps as column dividers. This is how you extract "tables" from plain-text reports, financial statements without borders, academic papers, and anything where the layout is visually table-like but nothing is actually drawn.
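The gap-finding step can be sketched in a few lines of Python. The approach here projects every word's horizontal extent onto the x-axis and reports a divider wherever no word covers a stretch wider than `min_gap`. The function name, the 10-point threshold, and the sample spans are assumptions for illustration; production extractors use more robust heuristics.

```python
def column_dividers(spans, min_gap=10.0):
    """Return x positions of column dividers given (x0, x1) spans for every word."""
    dividers = []
    covered_until = None  # right edge of the text seen so far, scanning left to right
    for x0, x1 in sorted(spans):
        if covered_until is not None and x0 - covered_until > min_gap:
            # Nothing on the page occupies this stretch: treat its middle as a divider.
            dividers.append((covered_until + x0) / 2)
        covered_until = x1 if covered_until is None else max(covered_until, x1)
    return dividers


# Hypothetical word spans from two rows of a borderless table.
spans = [
    (50, 80), (200, 220), (330, 360),             # Item | Qty | Price
    (50, 72), (76, 110), (200, 206), (330, 352),  # Blue widget | 3 | 9.99
]
dividers = column_dividers(spans)
```

The 4-point gap between "Blue" and "widget" is ordinary word spacing and is ignored, while the wide gaps produce dividers at x = 155 and x = 275. The threshold is the weak point of the method: set it too low and phrases split into separate columns, too high and adjacent columns merge.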
Auto
Auto mode tries lattice first. If lattice returns nothing (no gridlines found), it falls back to stream. If you are not sure which mode to pick, start with Auto. If the output looks wrong, force Lattice or Stream explicitly and compare the results.
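The fallback logic described above is simple enough to sketch directly. In this Python illustration the two extractors are passed in as plain functions, and `fake_lattice`, `fake_stream`, and the page dictionaries are stand-ins invented for the example, not part of this tool.

```python
def extract_auto(page, lattice, stream):
    """Try lattice first; fall back to stream only if lattice finds no tables."""
    tables = lattice(page)
    if tables:
        return "lattice", tables
    return "stream", stream(page)


# Stand-in extractors that just read pre-baked results off the page.
def fake_lattice(page):
    return page["ruled_tables"]


def fake_stream(page):
    return page["whitespace_tables"]


ruled_page = {"ruled_tables": [[["a", "b"]]], "whitespace_tables": []}
borderless_page = {"ruled_tables": [], "whitespace_tables": [[["x", "y"]]]}

ruled_result = extract_auto(ruled_page, fake_lattice, fake_stream)
borderless_result = extract_auto(borderless_page, fake_lattice, fake_stream)
```

Note that the fallback triggers only when lattice returns nothing at all. If lattice finds some table, Auto keeps that result, which is one reason forcing a mode by hand can still help.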
Frequently asked questions
Is this PDF table extractor free to use?
Yes. No signup, no ads, no usage limits. Processing runs on our servers at no charge.
Does it work on scanned PDFs or only digital-born ones?
Both. Digital-born PDFs (generated from Word, Google Docs, LaTeX, or any other source that writes real text) work fastest and most reliably. For scanned PDFs where the pages are actually images, the tool automatically runs OCR first to recover the text, then extracts tables from the result. OCR accuracy depends on the scan quality.
What is the difference between lattice and stream extraction?
Lattice extraction follows visible table borders and gridlines. If your PDF has ruled tables (lines around cells), lattice gives you the cleanest output. Stream extraction infers column boundaries from whitespace gaps between words. Use it for tables that look like tables to a human but have no drawn borders, such as financial statements or plain text reports. Auto mode tries lattice first and falls back to stream if lattice finds nothing.
Are my PDFs uploaded, and how long are they kept?
Yes. Unlike most of the other Forgelit tools, this one processes files server-side. Your PDF is uploaded and extracted, and the original file plus the generated CSV are deleted automatically after you download the result. Nothing is stored long-term and nothing is logged beyond anonymous request counts.
What is the maximum file size?
Around 150 MB, though the practical limit depends on your browser and network. Very large PDFs with many tables can take 30+ seconds to process. If you are hitting a limit, try extracting a specific page range instead of the whole document by using the Page field.
How does this compare to Tabula or Camelot?
Tabula is a desktop Java app and Camelot is a Python library. Both are excellent, and they power much of the table-extraction ecosystem. This tool is for the cases where you do not want to install anything: you have a PDF, you want a CSV, and you would rather not set up a Python environment or download a JAR file. The extraction quality is comparable for most common cases. For batch jobs or programmatic use, Camelot in a script is still the better choice.
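For the scripted route, a minimal Camelot sketch might look like the following. It uses Camelot's `read_pdf` function and the `to_csv` method on each extracted table. The helper names, the output naming scheme, and the default arguments are assumptions made for this example, and Camelot itself must be installed separately (the `camelot-py` package).

```python
from pathlib import Path


def csv_path_for(pdf_path, table_number, out_dir):
    """Build an output name such as out/q1-table-2.csv (naming scheme is arbitrary)."""
    return Path(out_dir) / f"{Path(pdf_path).stem}-table-{table_number}.csv"


def extract_to_csv(pdf_path, out_dir, pages="1", flavor="lattice"):
    """Extract every table Camelot finds on `pages` and write one CSV per table."""
    import camelot  # third-party dependency, imported here so the helper above works without it

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    tables = camelot.read_pdf(str(pdf_path), pages=pages, flavor=flavor)
    written = []
    for i in range(len(tables)):
        target = csv_path_for(pdf_path, i + 1, out_dir)
        tables[i].to_csv(str(target))
        written.append(target)
    return written
```

Calling `extract_to_csv("reports/q1.pdf", "out", pages="1-3", flavor="stream")` would process the first three pages in stream mode. Looping over a folder of PDFs with this function is the kind of batch job where a script is a better fit than a web upload.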