Why Hazel Content Matching Breaks on Real-World PDFs (and What to Do Instead)

Published 5/14/2026 · 10 min read

If you have used Hazel for more than a year, you have probably hit the wall: a rule that reads an account number on one bank statement reliably, then silently fails on the next statement from the same bank. The rule looks correct. The PDF looks correct. The text is in there. And Hazel just shrugs.

This is not user error. It is not a bad rule. It is the difference between how Hazel reads PDFs and how real-world PDFs actually behave. This post walks through the failure mode, why it happens, what workflows still hold up, and how AI-powered file sorting handles the cases where regex on OCR text falls apart.

The short version

Hazel runs your regex against the PDF's embedded text layer. When that layer comes from OCR and is inconsistent (different scanner, different software, a new version of the same bank's template), the regex misses. Tools that read PDF content semantically, not by regex, are robust to those differences. The Hazel-plus-Sortio workflow keeps the rules that work and hands the brittle cases to AI.

How Hazel content matching actually works

When you write a Hazel rule that says "if the file contents contain Account Number 0123," Hazel does three things behind the scenes. First, it asks the PDF for its embedded text layer. Second, it runs your match pattern against that text. Third, if the match succeeds, it fires the action.

None of those steps include reading the visual page. Hazel does not OCR the PDF itself. It reads whatever text the PDF claims to have. For digitally generated PDFs (statements produced by online banking, invoices generated by accounting software, exports from tax preparation tools) that text layer is usually correct, and Hazel content matching works.

For scanned PDFs, the story is different. The text layer is whatever the OCR engine produced when the scan was processed. If the scan came out of Preview's Quick Action, you get one OCR result. If the same document went through ScanSnap, you get a slightly different result. If your bank changed its statement template between Q2 and Q3, the document looks the same to a human but the OCR layer extracts subtly different text, and your regex misses.

The three real-world failure modes

1. OCR noise (whitespace, ligatures, punctuation)

The text Hazel sees might be "Account Number: 0123-4567" on one statement and "Account Number: 0123 4567" with two spaces on another. Or "Account No." instead of "Account Number." Or "Acct #" because the template changed. A regex written against the first form fails on the second. You can widen the regex with optional whitespace and alternation, but you cannot anticipate every variant a bank's template designer might introduce next quarter.
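
A minimal sketch of that widening, using made-up sample strings:

```python
import re

# A pattern written against the first statement's OCR text...
strict = re.compile(r"Account Number: 0123-4567")
# ...grows alternations and optional whitespace to absorb each new OCR variant.
loose = re.compile(r"Acc(?:oun)?t\.?\s*(?:Number|No\.?|#)\s*:?\s*0123[-\s]*4567")

samples = [
    "Account Number: 0123-4567",   # digitally generated statement
    "Account Number: 0123  4567",  # OCR turned the hyphen into spaces
    "Acct # 0123 4567",            # the template changed
]
print([bool(strict.search(s)) for s in samples])  # [True, False, False]
print([bool(loose.search(s)) for s in samples])   # [True, True, True]
```

Each widening is reactive: the next template change demands another alternation, and the pattern becomes unreadable long before it becomes complete.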

2. Layout-dependent extraction order

PDFs do not store text in reading order. They store text in placement order. Two-column statements, tabular invoices, and side-by-side metadata blocks come back from the text-extraction call in an order that depends on the PDF producer. Sometimes the account number is adjacent to the customer name in the extracted stream; sometimes there is a page header between them. Regex that depends on proximity ("Account Number" within 50 characters of a number pattern) breaks unpredictably.
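
A minimal sketch of how extraction order defeats a proximity pattern (the sample strings are invented):

```python
import re

# Proximity heuristic: the label within 50 characters of an account-style number.
prox = re.compile(r"Account Number.{0,50}\d{4}-\d{4}")

# Two simulated extraction streams for the "same" page, differing only in the
# order the PDF producer emitted its text objects.
adjacent = "Account Number 0123-4567 Statement Period Jan 1 - Jan 31"
interleaved = (
    "Account Number Page 1 of 12 - Prepared for JANE DOE - "
    "123 Main Street, Springfield 0123-4567"
)

print(bool(prox.search(adjacent)))     # True
print(bool(prox.search(interleaved)))  # False: a page header landed between label and value
```

Nothing about the document changed; only the order in which the producer emitted text objects did, and the proximity assumption silently stopped holding.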

3. Mixed-quality scans in a long archive

A real archive of three years of bank statements probably has scans from at least four sources: recent ones from online banking PDFs (clean OCR), older ones scanned at home (medium OCR), even older ones scanned at a copy shop years ago (rough OCR), and a few that were emailed as images and never properly OCR'd. A single Hazel rule cannot handle all four; you end up writing four rules, each with its own quirks, and the failure modes become indistinguishable.

The workarounds that mostly work

Heavy Hazel users have built a small ecosystem of workarounds for this. The most common is to insert an OCR-rebuilding step before the content-matching rule fires. OCRmyPDF, run as a shell command via Hazel's action, will produce a canonical text layer that is more consistent than whatever the original scanner produced. It works. It is also brittle, because OCRmyPDF has its own version differences and edge cases, and the shell-script-in-a-rule pattern is hard to debug when it breaks.
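
For concreteness, here is a sketch of that pre-processing step as a small Python wrapper a Hazel shell action could call (the script name, and the choice of `--redo-ocr` over `--skip-text`, are illustrative; it assumes OCRmyPDF is installed, e.g. via Homebrew):

```python
import shutil
import subprocess
import sys

def rebuild_text_layer(pdf_path: str) -> bool:
    """Replace the PDF's text layer with a fresh OCR pass, in place."""
    if shutil.which("ocrmypdf") is None:
        return False  # ocrmypdf not installed; leave the file for a later rule
    # --redo-ocr rebuilds an existing (possibly low-quality) text layer;
    # ocrmypdf accepts the same path for input and output to modify in place.
    result = subprocess.run(
        ["ocrmypdf", "--redo-ocr", pdf_path, pdf_path],
        capture_output=True,
    )
    return result.returncode == 0

if __name__ == "__main__" and len(sys.argv) > 1:
    rebuild_text_layer(sys.argv[1])
```

Hazel would invoke it from the rule's embedded shell script as `python3 reocr.py "$1"` (Hazel passes the matched file's path as `$1`); a second rule then does the content matching against the rebuilt layer.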

The second workaround is to use Hazel only for routing by filename or extension, and to do the content extraction in a downstream script. AppleScript, JavaScript for Automation, or a Python script can pull out the data, normalize it, and then move the file. This is a real engineering project, and most users do not want to maintain a script library just to keep their bank statements organized.

The third workaround is to give up on content matching and route on filename instead. This works if you can rename PDFs at the point of arrival (for example, if your scanner names files using a fixed template). It does not work for downloaded statements, which usually arrive with the bank's generic name like "statement.pdf."

Why AI sorting handles these cases differently

A language model does not need a regex. When Sortio reads a PDF, it asks an LLM to look at the extracted text and decide where the file should go. The LLM is robust to OCR noise in a way regex is not: "Account Number: 0123-4567" and "Acct # 0123 4567" and "A/C No 01234567" all read as the same thing to a model. The model also understands the document. It knows what a bank statement looks like, what an invoice looks like, what a tax form looks like, and routes accordingly.

The trade-off is cost and speed. Every AI sort consumes a credit, and inference takes longer than regex. For routing 30 bank statements a month, this is invisible: a single sort takes seconds and uses a handful of credits. For routing 30,000 files where the routing logic could be expressed in a regex, AI is overkill and rules are still the right tool.

This is why Sortio has both. AI Sort takes a prompt and runs an LLM per file. AI Rule Builder turns a plain-English description into a deterministic rule that runs without consuming credits. You write the rule once and it runs forever. The mental model: use rules for the clean cases that have stable patterns, and use AI sort for the messy ones where Hazel content matching has been breaking.

What the migration looks like in practice

Say you have a Hazel rule that has been breaking on bank statements from one of your accounts. The rule is supposed to read the account number, identify the bank, and route the PDF to the right account folder. Some statements work, some do not, and you cannot tell why.

With Sortio, you delete that rule and replace it with a sort prompt. You point Sortio at the folder where bank statements land, and write: "Sort these bank statements into folders by account. The folder names are Personal Checking, Personal Savings, Joint Checking, and Business. The account number is on the first page of each statement." Sortio reads each PDF, identifies the account, and routes accordingly. Statements that previously broke the Hazel rule because of OCR drift work because the LLM is reading the document, not pattern-matching on noisy text.

For the rules that already worked in Hazel (move screenshots to a screenshots folder, archive installers, route downloaded receipts by extension), there is no reason to migrate. Leave them in Hazel. The Sortio + Hazel workflow guide explains how to set up a staging folder that catches the files Hazel could not confidently route and hands them to Sortio for AI sorting.

What stays in Hazel

Hazel is still the best tool for a specific set of patterns. Anything that depends purely on metadata (extension, name pattern, date, source app, size, location) is rule-shaped and belongs in Hazel. Its AppleScript and shell-script hooks are deep and mature, and if you have built a workflow around them, do not throw it away. The same applies to color-tag-based workflows, Finder comment workflows, and date-based archival.

The Hazel rules that break, in our experience and in the public threads from MacPowerUsers and Reddit, are almost always the content-matching ones. Those are the ones to migrate. The rest can keep running.

FAQ

Why does Hazel content matching fail on some PDFs but work on others?

It is almost always the underlying OCR text layer. Hazel does not read the visual page; it reads the text the PDF's embedded text layer exposes. When the same kind of document is scanned on a slightly different machine or with different software, the OCR engine introduces tiny differences (extra whitespace, transposed characters, broken ligatures) that the Hazel regex does not anticipate. The fix is either to rebuild the OCR layer with a consistent engine or to move the routing decision out of regex entirely.

Can I use OCRmyPDF to fix the OCR layer before Hazel reads it?

Yes, and many heavy Hazel users do exactly that. The pattern is: a Hazel rule that detects a new PDF and runs OCRmyPDF against it, then a second rule that does the content matching once the canonical OCR layer is in place. It works, but it adds a brittle build step, and it does not help with PDFs that have correct OCR but inconsistent layout.

Does Sortio replace Hazel for PDF content matching?

For the cases where Hazel content matching is brittle, yes. Sortio uses an LLM to read PDF content, which is robust to OCR noise in a way regex is not. For deterministic patterns where Hazel already works, you do not need to replace anything. The pragmatic workflow is to use rules for the clean cases and Sortio for the messy ones.

Does Sortio process PDFs locally or in the cloud?

Both options are supported. The managed AI option (Sortio-hosted or BYOK) is faster and more accurate. The local-only option runs through Ollama with a local LLM (Llama, Mistral) so file content never leaves your machine. For sensitive PDFs (legal, medical, financial), the local option is the right pick.

Can I keep my Hazel rules and add Sortio for the PDF cases that break?

Yes. Many users run both. Hazel handles the patterns that have always worked, and Sortio takes the PDFs that Hazel cannot route reliably. The Sortio + Hazel workflow guide walks through how to set up a staging folder that catches PDFs Hazel cannot match, then hands them to Sortio.

Try Sortio on a problem folder

The free tier includes 10 AI sort credits, enough to run Sortio on a folder of PDFs that Hazel has been failing on and see what changes. No credit card.

Download Sortio