How to Organize Scanned Documents by What Is Inside Them

To organize scanned documents by what is inside them, you read the content, because a scanned file has no useful name and no metadata to sort on. Two steps: make sure every scan has a text layer (searchable PDFs already do; image-only scans get one in bulk from a free tool like OCRmyPDF), then let Sortio read that text with an LLM and file each document by what it actually is (a lease, a 1099, an insurance policy, a signed contract) rather than by a meaningless name like scan_0421.pdf. You point it at the folder of mixed scans, run Preview to confirm the proposed names and destinations, then apply.

That is the whole game with scanned paper. The filename is noise, the metadata is empty or wrong, and the only reliable signal is the text on the page. Everything below is the detailed version: why scans defeat ordinary tools, why reading the content is the only approach that survives a real-world drawer of mixed documents, the folder structure that holds up, and a working prompt you can paste in today.

The short version

Read the content. A scanned file has no useful name and no metadata, so sorting by filename or date is useless. Give every scan a text layer (searchable PDFs have one already; OCRmyPDF adds one to image-only scans for free), then Sortio reads that text with an LLM and files each scan by what the document actually is into a clean per-type folder tree. One prompt handles a mixed drawer of leases, tax forms, medical bills, and contracts, and it is robust to OCR noise that breaks regex-based tools.

Why scanned documents are uniquely hard to sort

A typical digital file gives you something to work with. A photo has an EXIF date and GPS coordinates. A downloaded invoice often has a sensible name and a creation date. A scanned document gives you almost nothing. The filename is whatever the scanner assigned, usually scan_0421.pdf or 20260605_0001.pdf or IMG_4471.jpg. The created date is the day you ran the scanner, not the day the document is from. And the file is frequently image-only, so there is not even a text layer to search.

The result is the scanned-document drawer that almost everyone has somewhere: a single folder, hundreds of files, every one named like its neighbor, holding a decade of leases, tax forms, medical bills, warranties, IDs, and signed contracts in no order at all. You cannot sort it by name (the names are interchangeable). You cannot sort it by date (the dates are all the scan date). You cannot search it (no text layer). The information that would let you organize it is trapped inside the pages.

So the only approach that actually works is to read each page and decide what the document is from its content. For image-only scans that means OCR first to recover the text, then classification on the recovered text. The goal is the one research librarians recommend for keeping a collection usable. As Stanford University Libraries puts it, you should be consistent and descriptive in naming and organizing files so that it is obvious where to find a document and what it contains. With scans you cannot get there by hand at any real volume, which is where content-reading automation comes in.

OCR when there is no text layer

The first step for any scanned file is to make the words on the page readable by software. If the scan was saved as a searchable PDF, the OCR was already done at scan time and the text layer is there. If it was saved as an image-only PDF or a JPG (the default on a lot of phone scanning apps and older scanners), there is no text layer, and you have to OCR it before anything else can read it.

Sortio reads the text layer that is already in the file; it does not run its own OCR. So for image-only scans, add the layer first. OCRmyPDF is a well-regarded free tool that adds a searchable text layer to existing PDFs in bulk (a single command handles a whole drawer), and Sortio will happily read the layer it produces. After that one-time pass, an image-only drawer classifies just as well as a searchable one, because both now contain real text that a model can reason about.
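As a rough illustration of that one-time pass, here is a minimal Python sketch that calls the OCRmyPDF command-line tool once per PDF in a folder. It assumes ocrmypdf is installed and on your PATH; the function names and the inbox/outbox folders are hypothetical, chosen for this example. The --skip-text option tells OCRmyPDF to leave pages that already contain text alone, so searchable PDFs mixed into the drawer are not re-OCRed.

```python
import shutil
import subprocess
from pathlib import Path


def ocr_command(src: Path, dst: Path) -> list:
    """Build the OCRmyPDF call for one file.

    --skip-text skips OCR on pages that already have text, so running
    this over already-searchable PDFs does not redo their OCR.
    """
    return ["ocrmypdf", "--skip-text", str(src), str(dst)]


def ocr_folder(inbox: Path, outbox: Path) -> list:
    """OCR every PDF in `inbox` into `outbox`; return the files written."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    outbox.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(inbox.glob("*.pdf")):
        dst = outbox / src.name
        result = subprocess.run(ocr_command(src, dst))
        if result.returncode == 0:
            written.append(dst)
        else:
            # Leave failures visible instead of silently dropping them.
            print(f"OCR failed for {src.name} (exit {result.returncode})")
    return written
```

Calling ocr_folder(Path("scans_raw"), Path("scans_ocr")) would write a searchable copy of each PDF into the second folder and leave the originals untouched, which keeps the pass easy to redo if a scan comes out badly.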

Why reading the content beats matching the name

The reason content reading wins on scans is that OCR output is noisy and inconsistent, and an LLM is robust to that noise in a way a regex never is. The same field comes out differently from one scan to the next. "Date of Birth," "D.O.B.", and "DOB" all mean the same thing. "Acct No 0123-4567", "Account Number 0123 4567", and "A/C 01234567" are the same account. A model reads all of these as what they are. A regex matches one exact spelling and silently fails on the rest.
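To make the regex failure concrete, here is a small Python sketch using the three account-number renderings above. The patterns are illustrative and not taken from any particular tool: the first is written against one scan's exact spelling, and the second shows what it takes to cover the other two by hand.

```python
import re

# Three OCR renderings of the same account field.
samples = [
    "Acct No 0123-4567",
    "Account Number 0123 4567",
    "A/C 01234567",
]

# A rule written against the first scan's exact spelling.
strict = re.compile(r"Acct No (\d{4}-\d{4})")
strict_hits = [bool(strict.search(s)) for s in samples]
print(strict_hits)  # [True, False, False]

# Covering the other two means listing every label and separator yourself.
widened = re.compile(r"(?:Acct No|Account Number|A/C)\s*(\d{4})[-\s]?(\d{4})")
widened_hits = [bool(widened.search(s)) for s in samples]
print(widened_hits)  # [True, True, True]
```

The widened pattern works only for the variants someone thought to list. The next scanner that prints "Account #" or splits the digits differently falls through again, with no error to tell you it happened.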

Reading the content also lets the tool decide what a document is, not just pull a field out of it. A model can look at a page and conclude "this is a residential lease," "this is a 1099-MISC," "this is an explanation of benefits from an insurer," because it understands the structure and language of the document. That classification is the part that name-matching and field-extraction tools cannot do, and it is the part that turns a shapeless drawer into a filed archive.

A folder structure that holds up

File by document type at the top, then by year, with the person or entity in the filename. Type-first keeps related paper together so a tax pull or an insurance claim is a single-folder operation. Year-second keeps each folder small and makes retention easy: you can keep or clear out a whole year at once. This structure scales from a hundred scans to ten thousand:

Documents/Scans/
  Tax/
    2025/
      2025-04-10_1099-NEC_Upwork.pdf
      2025-04-12_W-2_Acme_Corp.pdf
    2026/
      2026-01-31_1099-INT_Chase.pdf
  Medical/
    2026/
      2026-03-02_EOB_BlueCross.pdf
      2026-03-18_Lab_Results_Quest.pdf
  Legal/
    2026/
      2026-02-14_Lease_123_Main_St.pdf
      2026-05-01_Contract_VendorName.pdf
  Property/
    2026/
      2026-02-09_Insurance_Policy_StateFarm.pdf
      2026-04-22_Warranty_Bosch_Dishwasher.pdf
  Personal/
    2026/
      2026-01-05_Passport_JaneDoe.pdf

The naming convention that makes that tree searchable and self-sorting is a single pattern:

{YYYY-MM-DD}_{DocType}_{Party}.pdf

ISO date first so files sort chronologically in any file manager without a custom view. Document type next so you can skim a folder and know what each file is. The party (the issuer, counterparty, or person the document concerns) last so two documents of the same type on the same day stay distinct. The date you want is the date on the document, which the content reading recovers, not the scan date the file system recorded.
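The pattern is simple enough to express in a few lines. The following Python sketch builds a name and a destination path from a document date, type, and party; the helper names (clean, scan_name, scan_path) are made up for this example, and the cleanup rule (anything other than letters, digits, underscores, and hyphens becomes an underscore) is one reasonable choice rather than a fixed standard.

```python
import re
from datetime import date
from pathlib import Path


def clean(part: str) -> str:
    """Turn runs of spaces and punctuation (except '-' and '_') into one '_'."""
    return re.sub(r"[^\w-]+", "_", part.strip()).strip("_")


def scan_name(doc_date: date, doc_type: str, party: str) -> str:
    """Build {YYYY-MM-DD}_{DocType}_{Party}.pdf from the document's own date."""
    return f"{doc_date.isoformat()}_{clean(doc_type)}_{clean(party)}.pdf"


def scan_path(root: Path, category: str, doc_date: date,
              doc_type: str, party: str) -> Path:
    """Place the file under root/category/year/, matching the tree above."""
    return root / category / str(doc_date.year) / scan_name(doc_date, doc_type, party)


print(scan_name(date(2026, 2, 14), "Lease", "123 Main St"))
# 2026-02-14_Lease_123_Main_St.pdf
```

Because the ISO date leads the name, a plain alphabetical sort of the filenames is also a chronological sort, which is why no custom view is needed.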

How the tools compare on scanned documents

There are several reasonable ways to attack a drawer of scans, and they are genuinely different tools for different jobs. Here is the honest comparison.

Approach	Reads content	Classifies by meaning	Best for
Manual sorting	You do	You do	Tiny piles, one-off filing
Batch renamers (A Better Finder Rename, NameChanger)	No	No	Bulk name fixes on already-known files
Hazel (rules + regex on text layer)	Via regex match	Limited, per-pattern	Clean scans from a consistent source
paperless-ngx (self-hosted DMS)	Yes (OCR + tags)	Rule, pattern, and ML "Auto" based	A searchable stored archive with a web UI
Dedicated OCR tools (OCRmyPDF, Tesseract)	Yes (text only)	No	Adding a text layer before something else files them
Sortio (AI by content)	Yes (text layer + LLM; image-only scans need OCR first)	Yes	Mixed drawers, naming and filing into plain folders

paperless-ngx deserves a fair word, because it is excellent at what it does. It OCRs incoming scans, stores them in its own database, and gives you tags and full-text search through a web interface. The classification leans on correspondents, document types, and a trainable "Auto" machine-learning matcher, and the documents live inside its archive rather than your normal folders. If you want a true document management system, that is a strength. If you want your scans named and dropped into ordinary folders you already use, it is more system than you need. Sortio sits at the lighter end: it reads each scan with an LLM, decides what it is, and files it into plain folders with a clean name. Dedicated OCR tools like OCRmyPDF solve the text-layer half of the problem and pair well with either approach.

A working Sortio prompt for a drawer of scans

Drop this into the Sortio prompt box, point it at the folder where your scans pile up, and run Preview before applying. It handles a mixed drawer in one pass.

Read each scanned document's text. Decide which
type each document is: tax, medical, legal, property,
or personal.

Tax documents (W-2, 1099, tax returns, receipts kept for
taxes) go to ~/Documents/Scans/Tax/{year}/.

Medical documents (bills, EOBs, lab results, prescriptions)
go to ~/Documents/Scans/Medical/{year}/.

Legal documents (leases, contracts, agreements, court
papers) go to ~/Documents/Scans/Legal/{year}/.

Property documents (insurance policies, warranties, deeds,
appraisals) go to ~/Documents/Scans/Property/{year}/.

Personal documents (IDs, passports, certificates) go to
~/Documents/Scans/Personal/{year}/.

Anything you cannot confidently classify goes to
~/Documents/Scans/Review/ untouched.

Rename every file to {YYYY-MM-DD}_{DocType}_{Party}.pdf,
where the date is the date printed on the document (not
the scan date), DocType is the specific form or document
name, and Party is the issuer, counterparty, or person
the document concerns. Use short canonical names.

Click Preview. Sortio shows the proposed name and target folder for every scan, plus the fields it pulled off each page. Anything it is unsure about lands in the Review folder untouched, so a low-confidence classification never gets silently misfiled. Correct any individual decision before applying, and Sortio remembers the fix.
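The Review fallback is worth picturing as a rule in its own right. The Python sketch below is purely illustrative and is not Sortio's implementation: the confidence score, the 0.8 threshold, and the destination function are all assumptions made up for this example. It shows the general idea that an unknown category, a missing document year, or a low-confidence guess should route to Review instead of a typed folder.

```python
from pathlib import Path
from typing import Optional

# The five categories named in the prompt above.
CATEGORIES = {"Tax", "Medical", "Legal", "Property", "Personal"}


def destination(root: Path, category: str, year: Optional[int],
                confidence: float, threshold: float = 0.8) -> Path:
    """Return the target folder, falling back to Review when unsure.

    `confidence` and `threshold` are hypothetical values for illustration.
    """
    if category not in CATEGORIES or year is None or confidence < threshold:
        return root / "Review"
    return root / category / str(year)


root = Path("Documents/Scans")
print(destination(root, "Tax", 2025, 0.95))
print(destination(root, "Tax", 2025, 0.40))
```

The first call lands in the Tax folder for 2025 and the second in Review. Erring toward Review costs a few seconds of manual filing; erring the other way buries a document in the wrong folder.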

Apply commits the moves. Nothing is destructive: the preview-before-apply step is the safety net, and the Sortio backup folder keeps the original copy of every renamed and moved file for 30 days in case you want to revert. A several-hundred-file drawer typically finishes in a few minutes on the managed AI tier.

If you want the field-extraction detail behind the renaming (how Sortio pulls vendor, date, and amount cleanly out of OCR text), that is covered in the companion piece on automatically renaming PDFs by content.

Building a paperless workflow that stays clean

Sorting the backlog once is satisfying, but the drawer fills back up unless the new scans get handled too. The fix is to set your scanner or scanning app to save searchable PDFs (most have an OCR option), point the same prompt at the folder it writes to, and make it a watch folder. On Sortio Pro ($14.99/month or $99/year), every new scan that lands gets read, named, and filed on arrival without anyone opening the app.

For the first week, leave the watch in Preview mode so Sortio queues proposed moves and notifies you instead of applying them, which lets you confirm the classification is right on your particular mix of documents. After a week of clean previews, switch to Apply, and the paperless flow runs itself. For a very high volume of a single consistent document type, where the routing is deterministic, you can promote that flow to an AI Rule Builder rule that does not use your AI allowance at all.

Privacy and local processing

Scanned documents are often the most sensitive files a person owns: tax returns, medical records, IDs, signed legal agreements. Sortio supports local-only processing through Ollama, so the LLM runs on your own machine (Llama 3, Mistral) and no scanned content ever leaves it. Setup takes a few minutes. Managed AI and bring-your-own-key are also available when speed matters more and the documents are not sensitive. The honest trade-off between local and cloud processing is laid out in the piece on local AI vs cloud AI for file organization: local is slower and slightly less accurate, but fully functional and the right pick for anything you would not want a third-party provider to read.

FAQ

How do you organize scanned documents by content?

You read the content, because a scanned file has no useful name and no metadata to sort on. The workflow has two steps. First, make sure each scan has a text layer: scans saved as searchable PDFs already do, and image-only scans get one in bulk from a free tool like OCRmyPDF. Second, Sortio reads that text with an LLM and files each document by what it actually is (a lease, a 1099, an invoice, a signed contract) instead of by a filename like scan_0421.pdf. Point it at the folder, run Preview to confirm the proposed names and destinations, then apply, and a drawer of scanned paper becomes a searchable, correctly filed archive in one pass.

What if my scanned PDFs have no text layer at all?

That is the common case for documents scanned to image-only PDF or saved straight from a phone camera. Sortio does not run its own OCR, so those files need a text layer added first. OCRmyPDF is a solid free option that adds a searchable text layer to existing PDFs in bulk (one command over the whole folder), and Sortio reads the layer it produces. Once the layer exists, the document gets classified by what is inside it rather than by its filename. Files with no text layer still sort, but only on filename, dates, and folder context.

How is Sortio different from paperless-ngx?

paperless-ngx is an excellent self-hosted document management system. It OCRs incoming scans, stores them in a database, and lets you tag and full-text search them. The classification combines rule and pattern matching with a machine-learning "Auto" mode that learns correspondents, document types, and tags from the documents you have already tagged, and it keeps documents inside its own archive rather than your normal folder tree. Sortio is lighter. It reads each scan with an LLM, decides what the document is from the meaning of the page, and files it into ordinary folders on your Mac or PC with a clean name. If you want a full DMS with a web UI and a stored archive, paperless-ngx is the right tool. If you want your scans named and filed into plain folders by content, Sortio is the simpler fit.

Can Hazel sort scanned documents by their content?

Hazel can match a regex against a PDF text layer and act on the result, so for clean scans from a single consistent source it can route files. It breaks down on a mixed drawer of scans where OCR text drifts between documents and every source has a different layout, because each pattern needs its own rule and the regex stops matching when the text comes out slightly different. Sortio reads the scan semantically, so OCR noise that breaks a regex does not break the classification.

What folder structure works best for scanned documents?

File by document type at the top level, then by year, with the person or entity in the filename. A structure like Documents/Scans/Tax/2026/, Documents/Scans/Medical/2026/, Documents/Scans/Legal/2026/, Documents/Scans/Property/2026/ keeps related paper together and makes a tax-time or audit pull a single-folder operation. Name each file {YYYY-MM-DD}_{DocType}_{Party}.pdf so it sorts chronologically and tells you what it is at a glance.

Is it safe to run AI on sensitive scanned documents?

It can be, depending on the mode you choose. Sortio supports local-only inference through Ollama (Llama 3, Mistral), so the content of medical, legal, and financial scans never leaves your machine. Managed AI or bring-your-own-key are also available when you want more speed and accuracy and the documents are not sensitive. Nothing is destructive either way: Sortio previews every move before applying and keeps backups of renamed and moved files for 30 days.

How do I keep new scans organized automatically?

Once the backlog is sorted, promote the same prompt to a watch folder on Sortio Pro. Set your scanner or scanning app to save searchable PDFs (most have an OCR setting), point the watch folder at its output directory, and every new scan gets read, named, and filed on arrival without you opening the app. For very high volume of a single consistent document type, turn the flow into an AI Rule Builder rule so it runs deterministically without using your AI allowance at all.
