Allyanonimiser
Australian-focused PII detection and anonymization for the insurance industry with support for stream processing of very large files.
Overview
Allyanonimiser detects and anonymizes personally identifiable information (PII) in text, with first-class support for Australian formats (TFN, ABN, Medicare, AU phone, etc.) and insurance-industry identifiers (policy numbers, claim references, vehicle rego, VIN).
What's new in v3.3
- Default spaCy model is now en_core_web_sm (44 MB, fast). Previously it was en_core_web_lg (587 MB). Pattern-based detection is unchanged; NER recall on PERSON/LOCATION/ORG is lower with sm. Switch explicitly with spacy_model=SPACY_MODEL_ACCURATE when accuracy matters.
- SPACY_MODEL_FAST / SPACY_MODEL_ACCURATE constants exported for clarity.
- Full v3.2 improvements: TFN/ABN checksum validation in every code path, pre-release smoke gate on the built sdist, direct unit tests for the conflict resolver.
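For readers unfamiliar with the checksum validation mentioned above, the sketch below shows the publicly documented check-digit rules for 9-digit TFNs and 11-digit ABNs. It is a standalone illustration: the function names are hypothetical and this is not Allyanonimiser's internal code, which may handle more cases (for example, older 8-digit TFNs).

```python
# Standalone sketch of the published TFN and ABN check-digit rules.
# Function names are hypothetical; this is NOT the library's implementation.

TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def _digits(value: str) -> list:
    """Keep only the digits, so '123 456 782' and '123456782' are equivalent."""
    return [int(ch) for ch in value if ch.isdigit()]


def is_valid_tfn(value: str) -> bool:
    """9-digit TFN: the weighted digit sum must be divisible by 11."""
    digits = _digits(value)
    if len(digits) != 9:
        return False
    return sum(d * w for d, w in zip(digits, TFN_WEIGHTS)) % 11 == 0


def is_valid_abn(value: str) -> bool:
    """11-digit ABN: subtract 1 from the first digit, then the weighted
    digit sum must be divisible by 89."""
    digits = _digits(value)
    if len(digits) != 11:
        return False
    digits[0] -= 1
    return sum(d * w for d, w in zip(digits, ABN_WEIGHTS)) % 89 == 0
```

A checksum pass like this is what lets a detector reject nine-digit strings that merely look like a TFN, which reduces false positives on things like order numbers.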
Key Features
- Australian-focused PII: TFN (with checksum), ABN (with checksum), Medicare, AU_PHONE, driver's license, Centrelink CRN, passport, postcode
- Insurance domain: policy numbers, claim references, vehicle registration, VIN
- Flexible anonymization: replace, mask, redact, hash (SHA-256), age-bracket, consistent-replacement
- Stream processing: memory-efficient chunked processing for very large files via Polars
- DataFrame support: pandas with optional PyArrow backing; expand_acronyms wiring for preprocessing
- Reporting: session-level statistics, entity histograms, Jupyter-native rendering
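To illustrate the idea behind the stream-processing feature, here is a minimal sketch of chunked anonymization using only the standard library. It is an assumption-laden stand-in: the library itself is described as using Polars, and the names `iter_chunks`, `anonymize_stream`, the `<AU_PHONE>` placeholder and the simplified mobile-number regex are all hypothetical, chosen only to show that memory use is bounded by the chunk size rather than the file size.

```python
import io
import re
from typing import Iterator, List, TextIO

# Deliberately simple stand-in pattern (AU mobiles such as 0412 345 678).
# The library's real AU_PHONE pattern is not reproduced here.
AU_MOBILE = re.compile(r"\b04\d{2} ?\d{3} ?\d{3}\b")


def iter_chunks(handle: TextIO, lines_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of at most `lines_per_chunk` lines from an open file."""
    chunk: List[str] = []
    for line in handle:
        chunk.append(line)
        if len(chunk) >= lines_per_chunk:
            yield chunk
            chunk = []
    if chunk:  # final partial chunk
        yield chunk


def anonymize_stream(src: TextIO, dst: TextIO, lines_per_chunk: int = 10_000) -> int:
    """Copy src to dst chunk by chunk, replacing matches; return the match count."""
    replaced = 0
    for chunk in iter_chunks(src, lines_per_chunk):
        for line in chunk:
            new_line, n = AU_MOBILE.subn("<AU_PHONE>", line)
            replaced += n
            dst.write(new_line)
    return replaced


if __name__ == "__main__":
    source = io.StringIO("Call 0412 345 678 today\nNo PII here\n")
    sink = io.StringIO()
    anonymize_stream(source, sink, lines_per_chunk=1)
    print(sink.getvalue(), end="")
```

Only one chunk is held in memory at a time, so the same loop works for an in-memory buffer or a multi-gigabyte file opened with `open(...)`.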
Quick example
from allyanonimiser import create_allyanonimiser

ally = create_allyanonimiser()  # defaults to en_core_web_sm

results = ally.analyze(
    "Customer John Smith (TFN: 123 456 782) called about policy POL-987654."
)
for r in results:
    print(f"{r.entity_type}: {r.text!r} (score={r.score:.2f})")

out = ally.anonymize(
    "Customer John Smith (TFN: 123 456 782) called about policy POL-987654.",
    operators={
        "PERSON": "replace",
        "AU_TFN": "mask",
        "INSURANCE_POLICY_NUMBER": "hash",
    },
)
print(out["text"])
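To make the operator names in the example above more concrete, the following standalone sketch shows what replace, redact, mask and hash could look like when applied to detected spans. Everything here is an assumption for illustration: the helper names, the `<ENTITY>` placeholder format and the "keep the last three characters" masking rule are hypothetical and may differ from what Allyanonimiser actually produces.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type); hypothetical shape


def mask(value: str, keep_last: int = 3, mask_char: str = "*") -> str:
    """Hide everything except the last `keep_last` characters."""
    hidden = max(len(value) - keep_last, 0)
    return mask_char * hidden + value[hidden:]


def sha256_hex(value: str) -> str:
    """Deterministic SHA-256 digest, so equal inputs map to equal outputs."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Each operator receives the matched text and its entity type.
OPERATORS: Dict[str, Callable[[str, str], str]] = {
    "replace": lambda value, entity: f"<{entity}>",
    "redact": lambda value, entity: "",
    "mask": lambda value, entity: mask(value),
    "hash": lambda value, entity: sha256_hex(value),
}


def apply_operators(text: str, spans: List[Span], operators: Dict[str, str]) -> str:
    """Rewrite spans right to left so earlier offsets stay valid."""
    for start, end, entity in sorted(spans, reverse=True):
        name = operators.get(entity, "replace")  # assumed default
        text = text[:start] + OPERATORS[name](text[start:end], entity) + text[end:]
    return text


if __name__ == "__main__":
    sample = "John Smith (TFN: 123 456 782)"
    found = [(0, 10, "PERSON"), (17, 28, "AU_TFN")]
    print(apply_operators(sample, found, {"PERSON": "replace", "AU_TFN": "mask"}))
```

Processing spans from the end of the string backwards is the design choice worth noting: a replacement that changes the text length never shifts the offsets of spans that have not been rewritten yet.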
Choosing a spaCy model
| | SPACY_MODEL_FAST (en_core_web_sm) | SPACY_MODEL_ACCURATE (en_core_web_lg) |
|---|---|---|
| Default in v3.3+? | yes | no |
| Download size | 44 MB | 587 MB |
| Cold start | ~0.5s | 2–5s |
| Pattern detection (TFN, ABN, MEDICARE, AU_PHONE, EMAIL, dates) | identical | identical |
| PERSON / LOCATION / ORG recall | medium | high |
| Serverless friendliness (Azure Functions, Lambda) | good | poor |
Opt into the accurate model when a missed name is expensive in your downstream workflow:
from allyanonimiser import create_allyanonimiser, SPACY_MODEL_ACCURATE
ally = create_allyanonimiser(spacy_model=SPACY_MODEL_ACCURATE)
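If the choice needs to be made at deploy time rather than hard-coded, one possible policy is sketched below. It is only an illustration of the trade-offs in the table: `choose_spacy_model` and its two flags are hypothetical, and the sketch assumes the exported constants are plain spaCy model-name strings, which the source does not state explicitly.

```python
# Model names as given in the comparison table. The library exports
# SPACY_MODEL_FAST / SPACY_MODEL_ACCURATE; plain strings are used here
# so the sketch runs without the package installed.
FAST = "en_core_web_sm"
ACCURATE = "en_core_web_lg"


def choose_spacy_model(missed_names_are_costly: bool, serverless: bool) -> str:
    """Hypothetical policy: prefer the small model on serverless platforms,
    where download size and cold start dominate; otherwise pay for recall
    only when a missed PERSON/LOCATION/ORG is expensive downstream."""
    if serverless:
        return FAST
    return ACCURATE if missed_names_are_costly else FAST


if __name__ == "__main__":
    print(choose_spacy_model(missed_names_are_costly=True, serverless=False))
```

The returned name would then be passed as the spacy_model argument shown above. Teams that need high name recall on a serverless platform would have to adjust this policy, since it always picks the small model there.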
Next steps
- Installation — prerequisites and install options
- Quick Start — 5-minute walkthrough
- Analyzing Text — detection deep-dive
- Patterns Overview — the full entity catalogue
- Anonymization Operators — how each operator works
- Main API — the full class + function reference
License
MIT — see LICENSE.