Benchmarks
Head-to-head evaluations of Allyanonimiser against openai/privacy-filter — a 1.5B-parameter MoE token classifier released by OpenAI under Apache 2.0.
Methodology
- Scoring: character-level binary masking. For each category, we compute precision / recall / F1 on which characters should be masked as PII and whether each tool actually masked them.
- Runtime: both tools on CPU. openai/privacy-filter via the int4-quantized ONNX build (onnx/model_q4.onnx) through onnxruntime, so no torch or GPU required.
- Reproducibility: all scripts and the synthetic AU-insurance fixture live under bench/ in the repo.
pip install "allyanonimiser[bench]"
python bench/run_tab_eval.py # 127 ECHR docs
python bench/run_ai4privacy_eval.py # 1000 English rows
python bench/run_au_insurance_eval.py # 160 synth docs
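The character-level scoring described in the methodology can be sketched as follows. This is a minimal illustration, not the code in bench/: the names `span_chars` and `char_prf` are hypothetical, and spans are assumed to be half-open `(start, end)` character offsets.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets (assumed convention)


def span_chars(spans: Iterable[Span]) -> Set[int]:
    """Expand spans into the set of character positions they cover."""
    chars: Set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def char_prf(gold: Iterable[Span], pred: Iterable[Span]) -> Tuple[float, float, float]:
    """Character-level precision, recall and F1 for one category.

    gold: spans that should be masked; pred: spans the tool actually masked.
    """
    gold_chars = span_chars(gold)
    pred_chars = span_chars(pred)
    true_pos = len(gold_chars & pred_chars)
    precision = true_pos / len(pred_chars) if pred_chars else 0.0
    recall = true_pos / len(gold_chars) if gold_chars else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Scoring at the character level gives partial credit for a span that overlaps the gold annotation without matching its boundaries exactly, which span-level exact matching would count as a miss.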
AU insurance (enriched synthetic bench, 160 documents, ~8 PII spans/doc)
The primary use case. Templates cover claim notes, adjuster emails,
underwriting forms, medical reports, plus three new templates that mix
international PII into AU insurance contexts — expat customer claims,
payment records with credit cards on file, and business-travel claims with
US SSN, ISO timestamps, and intl phones. Values come from Faker en_AU
plus published test-valid TFN / Medicare / ABN / ACN, Luhn-valid credit
cards, and SSA-valid SSNs.
| Category | Allyanonimiser F1 | openai/privacy-filter F1 |
|---|---|---|
| PERSON | 0.954 | 0.908 |
| ADDRESS | 0.962 | 0.940 |
| EMAIL | 1.000 | 0.982 |
| PHONE | 1.000 | 0.870 |
| DATE | 0.997 | 0.964 |
| Account-like IDs (TFN, Medicare, ABN, policy, VIN, US_SSN, credit card, etc.) | 0.997 | 0.880 |
| Overall (any PII) | 0.950 | 0.958 |
Allyanonimiser wins 5 of 6 categories. The AU-specific regex patterns
(TFN/Medicare/ABN checksum validation, AU phone formats, state+postcode
address anchoring) give perfect precision on AU formats. The international
patterns (PHONE_INTL, ISO_DATETIME, US_SSN, CREDIT_CARD with Luhn) cover
the expat/business-travel scenarios. openai/privacy-filter still leads
overall by 0.008 because of slightly better recall on multi-token PERSON
spans, but Allyanonimiser is far faster (see throughput below).
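To illustrate why checksum validation keeps precision high on format-driven IDs, here is a minimal sketch of two well-known checks: the Luhn algorithm for card numbers and the commonly published weighted mod-11 check for 9-digit TFNs. The function names are hypothetical and this is not Allyanonimiser's own implementation.

```python
def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Weights of the commonly published 9-digit TFN check (assumption: 9-digit TFNs only).
TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)


def tfn_valid(tfn: str) -> bool:
    """Weighted-sum check: the sum of weight * digit must be divisible by 11."""
    digits = [int(c) for c in tfn if c.isdigit()]
    if len(digits) != 9:
        return False
    return sum(w * d for w, d in zip(TFN_WEIGHTS, digits)) % 11 == 0
```

A regex alone would flag any 9-digit or 16-digit run; rejecting candidates that fail the checksum removes most of those false positives.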
Throughput: on a 2019 Intel Mac (CPU only), Allyanonimiser ran in ~2.3s
and openai/privacy-filter in ~55s for the full 160-doc eval (~24×
difference). Numbers fluctuate ±10% across machines and Python versions;
re-run bench/run_au_insurance_eval.py locally to verify against your
environment.
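For a quick local comparison, a timing harness along these lines is enough. It is a sketch only: `time_run` and `speedup` are hypothetical helpers, and `fn` stands for whatever callable wraps each tool in your environment.

```python
import time
from typing import Callable, Sequence


def time_run(fn: Callable[[str], object], docs: Sequence[str]) -> float:
    """Wall-clock seconds to apply fn to every document once."""
    start = time.perf_counter()
    for doc in docs:
        fn(doc)
    return time.perf_counter() - start


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times faster the fast run is than the slow run."""
    return slow_seconds / fast_seconds


# The figures quoted above: ~55s vs ~2.3s for the 160-doc eval.
ratio = speedup(55.0, 2.3)
```

With the quoted timings, `ratio` rounds to 24, which is where the ~24× figure comes from.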
AI4Privacy open-pii-masking-500k (English validation, 1,000 rows)
General-purpose multilingual PII corpus. The international pattern additions materially close the gap on the format-driven categories.
| Category | Allyanonimiser F1 | openai/privacy-filter F1 |
|---|---|---|
| PERSON | 0.653 | 0.836 |
| EMAIL | 0.990 | 0.915 |
| ADDRESS | 0.217 | 0.464 |
| PHONE | 0.802 | 0.829 |
| DATE | 0.719 | 0.642 |
| ACCOUNT | 0.177 | 0.700 |
| Overall (any PII) | 0.782 | 0.781 |
Allyanonimiser now matches openai/privacy-filter on overall (any PII) F1
(0.782 vs 0.781) and beats it on EMAIL and DATE. The largest remaining gaps
are ADDRESS (no AU-style anchors for international addresses) and ACCOUNT
(most AI4Privacy ACCOUNT false negatives are passport or generic-ID numbers
in non-standard formats, too high-risk to catch with regex without causing
major false positives on AU data).
Text Anonymization Benchmark (TAB, 127 ECHR legal docs)
Real multi-annotator court cases from the European Court of Human Rights. TAB's CODE label covers legal case numbers (e.g. 40593/04) that neither tool is designed for; LOC covers country/city names which openai/privacy-filter doesn't emit as private_address.
| Category | Allyanonimiser F1 | openai/privacy-filter F1 |
|---|---|---|
| PERSON | 0.761 | 0.805 |
| DATE | 0.904 | 0.459 |
| LOCATION | 0.424 | 0.000 |
| CODE | 0.012 | 0.077 |
| Overall (any PII) | 0.560 | 0.378 |
Allyanonimiser wins DATE decisively because TAB's DATETIME label includes bare years (1997), relative phrases (14 days), and month/year combos, all of which the v3.4 validator now accepts. openai/privacy-filter is extremely precise on DATE (P=0.988) but narrow in coverage.
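The three looser date shapes mentioned above can be illustrated with a small regex sketch. This is not the v3.4 validator: the pattern, the name `find_date_spans`, and the restriction to English month names and 19xx/20xx years are all assumptions made for the example.

```python
import re
from typing import List, Tuple

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Alternatives are tried in order at each position, so "March 1997" is taken
# as one month/year match rather than a bare year.
DATE_LIKE = re.compile(
    rf"\b(?:{MONTHS})\s+\d{{4}}\b"            # month/year combo, e.g. "March 1997"
    r"|\b\d+\s+(?:day|week|month|year)s?\b"   # relative phrase, e.g. "14 days"
    r"|\b(?:19|20)\d{2}\b"                    # bare year, e.g. "1997"
)


def find_date_spans(text: str) -> List[Tuple[int, int]]:
    """Return half-open (start, end) offsets of date-like matches."""
    return [m.span() for m in DATE_LIKE.finditer(text)]


sample = "Lodged in March 1997, heard in 1999, appeal within 14 days."
found = [sample[s:e] for s, e in find_date_spans(sample)]
```

Accepting bare years and durations raises recall on TAB-style annotation but would over-mask in corpora that do not label them, so the trade-off depends on the annotation scheme being targeted.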
How to read this
- On Australian insurance data (enriched bench with international-customer scenarios), Allyanonimiser beats a state-of-the-art 1.5B-parameter model on 5 of 6 categories, at ~24× the throughput, with no GPU and no heavy ML dependencies.
- On general multilingual PII (AI4Privacy English split), Allyanonimiser now matches openai/privacy-filter overall (0.782 vs 0.781). It still trails on PERSON (broader name training) and ADDRESS / ACCOUNT (international shapes without anchors), but EMAIL, DATE, and PHONE are competitive.
- See the International PII patterns page for the entity types that drove the AI4Privacy lift.
- Use both if you need broad coverage — feed the same text to each and union the spans.
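Taking the union of spans from two tools, as the last bullet suggests, can be sketched as below. It assumes each tool's output has already been converted to half-open `(start, end)` character offsets; `union_spans` and `mask` are hypothetical helpers, not part of either library.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets (assumed convention)


def union_spans(*span_lists: Iterable[Span]) -> List[Span]:
    """Merge spans from any number of tools into sorted, non-overlapping spans."""
    merged: List[Span] = []
    for start, end in sorted(s for spans in span_lists for s in spans):
        if merged and start <= merged[-1][1]:
            # Overlapping or touching the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mask(text: str, spans: Iterable[Span], fill: str = "*") -> str:
    """Replace every character covered by a span with the fill character."""
    chars = list(text)
    for start, end in spans:
        for i in range(start, min(end, len(chars))):
            chars[i] = fill
    return "".join(chars)
```

A union maximises recall at the cost of precision, since every false positive from either tool is kept.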
All tables reflect the most recent run of bench/run_*.py; re-run locally to verify against your Python / spaCy / onnxruntime versions.