PDF Table Extractor

Extract tables from PDF files to CSV format. Supports lattice/stream detection and large files.

How to Use

Drag & drop a PDF file or click "Browse" to select one
Choose extraction flavor: Auto (recommended), Lattice (for tables with ruled grids), or Stream (for whitespace-separated columns)
Optionally specify a single page number to extract from
Click "Extract Tables" to process the PDF
Download the resulting CSV file or preview the results

Note: Processing happens on our secure servers. Files are automatically deleted after processing.

What a PDF table extractor actually does

PDFs look like they contain tables, but from the file format's perspective there are no tables at all. A PDF is a stack of text, lines, and images placed at exact coordinates on a page. What your eye reads as a table is really a grid of positioned text runs that happen to line up. Extracting a table means reversing that process: inferring which text belongs to which row and column based on either the visible borders (if there are any) or the whitespace between words.
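As a rough illustration of that inference (not this tool's actual code), the sketch below groups positioned text runs into rows by their vertical position and orders each row left to right. The `(x, y, text)` tuple shape, the `y_tolerance` value, and the assumption that y grows down the page are all hypothetical choices made for the example.

```python
def group_into_rows(words, y_tolerance=2.0):
    """Group (x, y, text) runs into rows, then sort each row left to right.

    Assumes y increases down the page; runs whose y differs from the start
    of the current row by at most y_tolerance are treated as the same row.
    """
    rows = []  # each entry: (row_y, [(x, text), ...])
    for x, y, text in sorted(words, key=lambda w: w[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Four text runs as they might be stored in a PDF: unordered, slightly misaligned.
words = [
    (200.0, 100.5, "Amount"), (50.0, 100.0, "Date"),
    (50.0, 120.0, "2024-01-03"), (200.0, 119.6, "42.10"),
]
print(group_into_rows(words))
# [['Date', 'Amount'], ['2024-01-03', '42.10']]
```

Real extractors also have to decide where columns fall, which is where the lattice and stream approaches described next differ.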

This tool does that inference on our server because the algorithms involved (lattice detection from line art, column inference from text positions, OCR for scanned pages) are too heavy to run reliably in a browser. Your PDF is uploaded and processed, and both the original and the generated CSV are deleted automatically once processing completes.

Lattice vs stream: which extraction method should you pick?

The tool exposes three modes: Auto, Lattice, and Stream. Auto is almost always the right choice, but understanding the difference helps when Auto picks wrong.

Lattice

Lattice extraction follows visible table borders. The extractor looks at the line art in the PDF, finds horizontal and vertical rules that form a grid, and uses those lines to decide where cells begin and end. This produces the cleanest output when it works, because the grid is unambiguous. Use it for bank statements with ruled tables, engineering drawings with title blocks, regulatory filings, and anything else where you can see borders around every cell.
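A minimal sketch of the cell-assignment step, assuming the rule positions have already been detected: each text run is dropped into the grid cell whose bounding rules enclose it. The function name, the `(x, y, text)` input shape, and the sample coordinates are illustrative assumptions, not the tool's internals.

```python
from bisect import bisect_right


def lattice_cells(words, v_rules, h_rules):
    """Place (x, y, text) runs into the grid formed by the detected rules.

    v_rules: x positions of vertical rules; h_rules: y positions of
    horizontal rules. Runs that fall outside the outer border are skipped.
    """
    v_rules, h_rules = sorted(v_rules), sorted(h_rules)
    n_rows, n_cols = len(h_rules) - 1, len(v_rules) - 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for x, y, text in words:
        col = bisect_right(v_rules, x) - 1
        row = bisect_right(h_rules, y) - 1
        if 0 <= row < n_rows and 0 <= col < n_cols:
            # Several runs can land in one cell; join them with a space.
            grid[row][col] = (grid[row][col] + " " + text).strip()
    return grid


words = [
    (50, 100, "Date"), (200, 100, "Amount"),
    (50, 120, "2024-01-03"), (200, 120, "42.10"),
    (400, 120, "stray"),  # outside the grid, so it is ignored
]
print(lattice_cells(words, v_rules=[40, 150, 300], h_rules=[90, 110, 130]))
# [['Date', 'Amount'], ['2024-01-03', '42.10']]
```

Because the rules define the cells, text outside the border (page numbers, footers) never leaks into the table, which is why lattice output tends to be clean.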

Stream

Stream extraction ignores lines entirely and infers columns from whitespace. It looks at the horizontal positions of every word on a page, finds gaps that are wider than normal word spacing, and uses those gaps as column dividers. This is how you extract "tables" from plain-text reports, financial statements without borders, academic papers, and anything where the layout is visually table-like but nothing is actually drawn.
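The gap-finding idea can be sketched as follows, under simplifying assumptions: every word on the page is reduced to a horizontal span `(x0, x1)`, the spans are projected onto the x axis, and any uncovered stretch at least `min_gap` wide becomes a column divider. The function name and the threshold are hypothetical; a real extractor would derive the threshold from the page's typical word spacing.

```python
def column_dividers(spans, min_gap=10.0):
    """Return x positions of column dividers from (x0, x1) word spans.

    A divider is placed in the middle of every horizontal stretch that no
    word on the page covers and that is at least min_gap wide.
    """
    dividers = []
    covered_to = None  # right edge of the text seen so far
    for x0, x1 in sorted(spans):
        if covered_to is not None and x0 - covered_to >= min_gap:
            dividers.append((covered_to + x0) / 2)
        covered_to = x1 if covered_to is None else max(covered_to, x1)
    return dividers


# Word spans from several rows of a three-column, borderless table.
spans = [(50, 110), (200, 240), (52, 100), (205, 238), (320, 360)]
print(column_dividers(spans))
# [155.0, 280.0]
```

This also shows why stream mode can misfire: a single long line of prose that crosses a gap covers it and removes the divider.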

Auto

Auto mode tries lattice first. If lattice returns nothing (no gridlines found), it falls back to stream. If you are not sure which mode to pick, start with Auto. If the output looks wrong, try forcing the other mode explicitly.
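The fallback logic amounts to a few lines. In this sketch, `lattice` and `stream` are placeholder callables standing in for the two extractors; their signature is an assumption made for the example.

```python
def extract_auto(pdf_path, lattice, stream, page=None):
    """Try lattice first; fall back to stream when lattice finds no tables.

    lattice and stream are hypothetical callables taking (pdf_path, page)
    and returning a list of tables (empty when nothing was found).
    """
    tables = lattice(pdf_path, page)
    if tables:
        return "lattice", tables
    return "stream", stream(pdf_path, page)
```

Note that the fallback only triggers when lattice returns nothing. If lattice finds a partial grid and returns a poor table, Auto keeps it, which is the case where forcing Stream by hand helps.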

When PDF table extraction works well (and when it does not)

Table extraction is not magic. Some PDFs are a pleasure to process and others will fight you no matter what tool you throw at them. The difference usually comes down to how the PDF was generated.

Works well

Digital-born PDFs generated from Word, Google Docs, LaTeX, pandas, or any tool that writes real text. The text layer is already there and the coordinates are clean.
Bank and credit card statements with consistent column layouts across pages.
Research papers with clearly delimited data tables.
Government data releases (budgets, statistics, regulatory filings) where the tables are the point of the document.
Invoices and quotes with a predictable line-item structure.

Struggles with

Scanned PDFs where every page is an image with no text layer. The tool auto-OCRs these, but OCR accuracy depends heavily on scan quality. Faded, skewed, or low-resolution scans will produce messy output.
Tables that span multiple pages with headers that only appear on the first page. Each page gets extracted independently.
Merged cells, especially when the merge is vertical across rows. Some merges survive, others do not.
Rotated or sideways tables (landscape tables on portrait pages). Pre-rotate the PDF if you can.
Multi-column document layouts where a table shares a page with prose in two columns. The extractor may pick up text from adjacent columns as spurious cells.

When the output is messy, the fastest way to debug is usually to target a single page with the Page field and compare the result against that page of the PDF side by side. In most cases a quick look tells you whether the problem is the wrong mode or the PDF itself.
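For the multi-page case listed above, where each page is extracted independently, the per-page results can be stitched together afterwards. The sketch below assumes you have one CSV text per page and that the first page carries the header row; the function name is made up for the example.

```python
import csv
import io


def stitch_pages(page_csvs):
    """Concatenate per-page CSV texts, keeping only the first header row.

    The header is taken from the first non-empty page; a later page that
    repeats exactly the same header has that row dropped.
    """
    merged, header = [], None
    for text in page_csvs:
        rows = [r for r in csv.reader(io.StringIO(text)) if any(c.strip() for c in r)]
        if not rows:
            continue
        if header is None:
            header = rows[0]
            merged.append(header)
            rows = rows[1:]
        elif rows[0] == header:
            rows = rows[1:]
        merged.extend(rows)
    return merged


pages = [
    "Date,Amount\n2024-01-03,42.10\n",
    "2024-01-04,17.00\n",               # continuation page without a header
    "Date,Amount\n2024-01-05,3.50\n",   # page that repeats the header
]
for row in stitch_pages(pages):
    print(row)
```

This only works when the column count is consistent across pages, so check a continuation page by eye before trusting the merged file.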

Common use cases for extracting PDF tables to CSV

Financial analysis: pulling monthly transactions out of bank or brokerage statements so you can run them through a spreadsheet, ledger tool, or pandas script.
Research data collection: extracting datasets from academic papers that publish results as tables but not as supplementary CSVs.
Product catalogs: turning PDF price sheets or spec sheets into structured data for import into a store or CRM.
Regulatory and legal work: converting schedules, exhibits, and compliance tables from filings into queryable form.
Migrations from legacy systems that only export reports as PDFs. Extract the tables, import the CSV, skip the retyping.
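As an example of the financial-analysis case, a downloaded statement CSV can be summarized with the standard library alone. The column names `Date` and `Amount`, the ISO date format, and the sample rows are assumptions for illustration; adjust them to whatever your statement actually contains.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def monthly_totals(csv_text, date_col="Date", amount_col="Amount"):
    """Sum amounts per month (YYYY-MM), assuming ISO dates like 2024-01-03."""
    totals = defaultdict(Decimal)
    for row in csv.DictReader(io.StringIO(csv_text)):
        amount = row[amount_col].replace(",", "").strip()
        if not amount:
            continue  # skip rows where the amount cell came out empty
        totals[row[date_col][:7]] += Decimal(amount)
    return dict(totals)


sample = 'Date,Amount\n2024-01-03,42.10\n2024-01-17,"1,000.00"\n2024-02-01,-5.25\n'
print(monthly_totals(sample))
```

`Decimal` is used instead of `float` so that currency amounts add up without binary rounding noise.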

Frequently asked questions

Is this PDF table extractor free to use?

Yes. There is no signup, no ads, and no usage limit, even though processing runs on our servers.

Does it work on scanned PDFs or only digital-born ones?

Both. Digital-born PDFs (generated from Word, Google Docs, LaTeX, pandas, or any other source that writes real text) work fastest and most reliably. For scanned PDFs where the pages are actually images, the tool automatically runs OCR first to recover the text, then extracts tables from the result. OCR accuracy depends on the scan quality.

What is the difference between lattice and stream extraction?

Lattice extraction follows visible table borders and gridlines. If your PDF has ruled tables (lines around cells), lattice gives you the cleanest output. Stream extraction infers column boundaries from whitespace gaps between words. Use it for tables that look like tables to a human but have no drawn borders, such as financial statements or plain text reports. Auto mode tries lattice first and falls back to stream if lattice finds nothing.

Are my PDFs uploaded, and how long are they kept?

Yes, this tool processes files server-side, unlike most of the other Forgelit tools. Your PDF is uploaded, extracted, and the original file plus the generated CSV are deleted automatically after processing completes. Nothing is stored long-term and nothing is logged beyond anonymous request counts.

What is the maximum file size?

Around 150 MB, though the practical limit depends on your browser and network. Very large PDFs with many tables can take 30 seconds or more to process. If you are hitting a limit, try extracting a single page instead of the whole document by using the Page field.

How does this compare to Tabula or Camelot?

Tabula is a desktop Java app and Camelot is a Python library. Both are excellent, and they power much of the table-extraction ecosystem. This tool is for the cases where you do not want to install anything: you have a PDF, you want a CSV, and you would rather not set up a Python environment or download a JAR file. The extraction quality is comparable for most common cases. For batch jobs or programmatic use, Camelot in a script is still the better choice.
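For the batch case, a Camelot script might look like the sketch below. It uses Camelot's `read_pdf` function and the `to_csv` method on each returned table; the helper name `batch_extract`, the output naming scheme, and the injectable `read_pdf` parameter (there so the loop can be exercised without Camelot installed) are choices made for this example.

```python
from pathlib import Path


def batch_extract(pdf_paths, out_dir, flavor="lattice", pages="1-end", read_pdf=None):
    """Write every table found in each PDF to its own CSV file.

    read_pdf defaults to camelot.read_pdf, imported lazily so the helper
    can be tested with a stand-in reader.
    """
    if read_pdf is None:
        import camelot  # third-party: pip install camelot-py
        read_pdf = camelot.read_pdf
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in map(Path, pdf_paths):
        tables = read_pdf(str(pdf), pages=pages, flavor=flavor)
        for i, table in enumerate(tables, start=1):
            target = out_dir / f"{pdf.stem}-table-{i}.csv"
            table.to_csv(str(target))
            written.append(target)
    return written


# Example call (requires Camelot and real PDF files):
# batch_extract(["statement-jan.pdf", "statement-feb.pdf"], "csv_out", flavor="stream")
```

Camelot's `flavor` argument takes `"lattice"` or `"stream"`, matching the two explicit modes described earlier. It has no Auto mode of its own, so a fallback has to be scripted by hand.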