Allyanonimiser

Australian-focused PII detection and anonymization for the insurance industry with support for stream processing of very large files.

Overview

Allyanonimiser detects and anonymizes personally identifiable information (PII) in text, with first-class support for Australian formats (TFN, ABN, Medicare, AU phone, etc.) and insurance-industry identifiers (policy numbers, claim references, vehicle rego, VIN).

What's new in v3.5

International PII coverage plus a precision overhaul. Beats openai/privacy-filter on 5 of 6 categories of the enriched AU bench — see Benchmarks.

5 new entity types loaded by default — PHONE_INTL (with +CC, 00 IDD prefix, and parenthesised area-code variants), US_SSN (with SSA reservation rules), CREDIT_CARD (Luhn-validated 13-19 digits), ISO_DATETIME (2024-05-22T14:32:00), TIME (12/24h, with/without seconds). All anchored on structural features that don't collide with AU patterns — see International Patterns.
Validate-then-pick conflict resolution — when multiple patterns match the same span, the resolver now walks candidates from highest priority down and returns the first that passes per-type validation. Previously a permissive pattern (e.g. CREDIT_CARD on a 13-digit phone number) could win on priority and then fail its checksum, silently discarding the valid runner-up.
PERSON precision overhauled — candidates that look like a city, a state-postcode pair, a date, an acronym, or a label word are now rejected; trailing label tokens are trimmed iteratively (Joe Smith\nDOB → Joe Smith); and the false-positive check now also runs on single-candidate spans. AU bench PERSON F1 0.836 → 0.954.
VEHICLE_REGISTRATION tightened — SSN/TIN/NIN added to the label deny-list, plus an SSN-shape negative lookahead, so the input 'bad SSN 999-04-7100' no longer yields 'SSN 999-04' as a plate.
DATE_OF_BIRTH / INCIDENT_DATE spans no longer eat the prefix — capture-group rewrite so spans equal just the date (was 'DOB: 04/01/1959', now '04/01/1959').
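
The validate-then-pick rule from the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's implementation: the `Candidate` class, its `priority` field, the `VALIDATORS` table, and the `resolve` function are hypothetical names, and the only validator shown is a standard Luhn check for 13-19 digit card numbers.

```python
from dataclasses import dataclass
from typing import Callable, Optional


def luhn_valid(text: str) -> bool:
    """Return True if the 13-19 digits found in `text` pass the Luhn checksum."""
    digits = [int(ch) for ch in text if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


@dataclass
class Candidate:
    """Hypothetical stand-in for one pattern match on a span."""
    entity_type: str
    text: str
    priority: int  # assumption: a higher number means higher priority


# Hypothetical per-type validators; types without an entry are accepted as-is.
VALIDATORS: dict[str, Callable[[str], bool]] = {"CREDIT_CARD": luhn_valid}


def resolve(candidates: list[Candidate]) -> Optional[Candidate]:
    """Return the highest-priority candidate that passes its validator, if any."""
    for cand in sorted(candidates, key=lambda c: c.priority, reverse=True):
        validator = VALIDATORS.get(cand.entity_type)
        if validator is None or validator(cand.text):
            return cand
    return None


span = "0061412345678"  # 13 digits with a 00 IDD prefix; fails the Luhn check
winner = resolve([
    Candidate("CREDIT_CARD", span, priority=10),
    Candidate("PHONE_INTL", span, priority=5),
])
print(winner.entity_type)  # PHONE_INTL
```

Because validation happens before a winner is chosen, the failed CREDIT_CARD candidate falls through to PHONE_INTL instead of taking the valid runner-up down with it.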

v3.4 (prior)

Tightened AU_ADDRESS / AU_POSTCODE — dropped loose fallbacks; bare 4-digit numbers (years, amounts) no longer match AU_POSTCODE.
Expanded DATE validator — accepts spaCy's natural-language DATE outputs (March 2024, next Monday, Q1 2024, yesterday).
Widened INSURANCE_CLAIM_NUMBER — accepts CLM prefix alongside CL/C.
New [bench] optional extra — pip install "allyanonimiser[bench]" for the benchmark suite.

Key Features

Australian-focused PII: TFN (with checksum), ABN (with checksum), Medicare, AU_PHONE, driver's license, Centrelink CRN, passport, postcode
Insurance domain: policy numbers, claim references, vehicle registration, VIN
Flexible anonymization: replace, mask, redact, hash (SHA-256), age-bracket, consistent-replacement
Stream processing: memory-efficient chunked processing for very large files via Polars
DataFrame support: pandas DataFrames with optional PyArrow backing; expand_acronyms can be wired in as a preprocessing step
Reporting: session-level statistics, entity histograms, Jupyter-native rendering
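
For readers unfamiliar with the checksums mentioned for TFN and ABN, the sketch below shows the publicly documented weighted-sum algorithms. It is not taken from the Allyanonimiser source; the function names are hypothetical, and the TFN check assumes the 9-digit form only.

```python
TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def _digits(text: str) -> list[int]:
    """Extract the digits from `text`, ignoring spaces and other separators."""
    return [int(ch) for ch in text if ch.isdigit()]


def tfn_checksum_ok(text: str) -> bool:
    """9-digit TFN: the weighted digit sum must be divisible by 11."""
    digits = _digits(text)
    if len(digits) != 9:
        return False
    return sum(d * w for d, w in zip(digits, TFN_WEIGHTS)) % 11 == 0


def abn_checksum_ok(text: str) -> bool:
    """11-digit ABN: subtract 1 from the first digit, then the weighted sum must be divisible by 89."""
    digits = _digits(text)
    if len(digits) != 11:
        return False
    digits[0] -= 1
    return sum(d * w for d, w in zip(digits, ABN_WEIGHTS)) % 89 == 0


print(tfn_checksum_ok("123 456 782"))     # True
print(tfn_checksum_ok("123 456 789"))     # False
print(abn_checksum_ok("51 824 753 556"))  # True
```

A checksum of this kind is what lets a detector reject arbitrary 9- or 11-digit numbers that merely have the right shape.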

Quick example

from allyanonimiser import create_allyanonimiser

ally = create_allyanonimiser()  # defaults to en_core_web_sm

text = "Customer John Smith (TFN: 123 456 782) called about policy POL-987654."

# Detect entities and print each one with its confidence score.
results = ally.analyze(text)
for r in results:
    print(f"{r.entity_type}: {r.text!r} (score={r.score:.2f})")

# Anonymize the same text, choosing an operator per entity type.
out = ally.anonymize(
    text,
    operators={
        "PERSON": "replace",
        "AU_TFN": "mask",
        "INSURANCE_POLICY_NUMBER": "hash",
    },
)
print(out["text"])
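
To make the operator names in the example above more concrete, here is a rough standard-library sketch of what masking and SHA-256 hashing do to a value. The helper names `mask_value` and `hash_value` are hypothetical, and the library's actual output (mask character, whether separators are preserved, any truncation or salting of the digest) may differ.

```python
import hashlib


def mask_value(value: str, mask_char: str = "*") -> str:
    """Replace each letter or digit with `mask_char`, leaving separators in place."""
    return "".join(mask_char if ch.isalnum() else ch for ch in value)


def hash_value(value: str) -> str:
    """Return the hex SHA-256 digest of `value` (unsalted, so equal inputs give equal outputs)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


print(mask_value("123 456 782"))  # *** *** ***
print(hash_value("POL-987654"))   # 64 hex characters
```

Masking hides the value but keeps its shape, while hashing keeps values linkable across records without revealing them; an unsalted hash of a short identifier can still be brute-forced, so treat it as pseudonymization rather than irreversible anonymization.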

Choosing a spaCy model

Property	`SPACY_MODEL_FAST` (`en_core_web_sm`)	`SPACY_MODEL_ACCURATE` (`en_core_web_lg`)
Default in v3.3+?	yes	no
Download size	44 MB	587 MB
Cold start	~0.5s	2–5s
Pattern detection (TFN, ABN, MEDICARE, AU_PHONE, EMAIL, dates)	identical	identical
`PERSON` / `LOCATION` / `ORG` recall	medium	high
Serverless friendliness (Azure Functions, Lambda)	good	poor

Opt into the accurate model when a missed name is expensive in your downstream workflow:

from allyanonimiser import create_allyanonimiser, SPACY_MODEL_ACCURATE

ally = create_allyanonimiser(spacy_model=SPACY_MODEL_ACCURATE)
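
If the same code runs both in serverless functions and in batch jobs, one option is to choose the model per deployment. The sketch below is only an illustration: the environment variable `ALLY_ACCURATE_NER` and the helper `pick_spacy_model` are hypothetical, and the two constants are local stand-ins for the names exported by the package.

```python
import os
from typing import Mapping

# Local stand-ins mirroring the model names in the comparison table.
SPACY_MODEL_FAST = "en_core_web_sm"
SPACY_MODEL_ACCURATE = "en_core_web_lg"


def pick_spacy_model(env: Mapping[str, str] = os.environ) -> str:
    """Return the accurate model only when the (hypothetical) flag is switched on."""
    flag = env.get("ALLY_ACCURATE_NER", "").strip().lower()
    return SPACY_MODEL_ACCURATE if flag in {"1", "true", "yes"} else SPACY_MODEL_FAST


# Intended use, assuming the package is installed:
#   ally = create_allyanonimiser(spacy_model=pick_spacy_model())
print(pick_spacy_model({"ALLY_ACCURATE_NER": "true"}))  # en_core_web_lg
```

Defaulting to the small model keeps cold starts short, and the larger download is only paid for where missed names are costly.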

Next steps

Installation — prerequisites and install options
Quick Start — 5-minute walkthrough
Analyzing Text — detection deep-dive
Patterns Overview — the full entity catalogue
Anonymization Operators — how each operator works
Main API — the full class + function reference

License

MIT — see LICENSE.