Allyanonimiser

Australian-focused PII detection and anonymization for the insurance industry with support for stream processing of very large files.

Overview

Allyanonimiser detects and anonymizes personally identifiable information (PII) in text, with first-class support for Australian formats (TFN, ABN, Medicare, AU phone, etc.) and insurance-industry identifiers (policy numbers, claim references, vehicle registration, VIN).

What's new in v3.3

  • The default spaCy model is now en_core_web_sm (44 MB, fast); previously it was en_core_web_lg (587 MB). Pattern-based detection is unchanged, but NER recall on PERSON/LOCATION/ORG is lower with sm. Pass spacy_model=SPACY_MODEL_ACCURATE explicitly when accuracy matters.
  • The SPACY_MODEL_FAST and SPACY_MODEL_ACCURATE constants are exported for clarity.
  • Includes all v3.2 improvements: TFN/ABN checksum validation in every code path, a pre-release smoke gate on the built sdist, and direct unit tests for the conflict resolver.
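
For readers unfamiliar with the checksums mentioned above, here is a minimal standalone sketch of the publicly documented TFN and ABN check-digit rules. It is not Allyanonimiser's own code: the function names `is_valid_tfn` and `is_valid_abn` are made up for illustration, and the sketch covers only 9-digit TFNs.

```python
import re

# Weights from the publicly documented check-digit algorithms
TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def _digits(value: str) -> list[int]:
    """Drop spaces, hyphens, and other separators; return the digits."""
    return [int(ch) for ch in re.sub(r"\D", "", value)]


def is_valid_tfn(value: str) -> bool:
    """9-digit TFN: the weighted digit sum must be divisible by 11."""
    digits = _digits(value)
    if len(digits) != 9:
        return False
    return sum(d * w for d, w in zip(digits, TFN_WEIGHTS)) % 11 == 0


def is_valid_abn(value: str) -> bool:
    """11-digit ABN: subtract 1 from the first digit, then the
    weighted digit sum must be divisible by 89."""
    digits = _digits(value)
    if len(digits) != 11:
        return False
    digits[0] -= 1
    return sum(d * w for d, w in zip(digits, ABN_WEIGHTS)) % 89 == 0
```

With these definitions, is_valid_tfn("123 456 782") returns True because the weighted sum is 253 = 11 × 23. Changing the last digit to 9 gives 323, which is not divisible by 11, so the check fails.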

Key Features

  • Australian-focused PII: TFN (with checksum), ABN (with checksum), Medicare, AU_PHONE, driver's license, Centrelink CRN, passport, postcode
  • Insurance domain: policy numbers, claim references, vehicle registration, VIN
  • Flexible anonymization: replace, mask, redact, hash (SHA-256), age-bracket, consistent-replacement
  • Stream processing: memory-efficient chunked processing for very large files via Polars
  • DataFrame support: pandas with optional PyArrow backing; expand_acronyms wiring for preprocessing
  • Reporting: session-level statistics, entity histograms, Jupyter-native rendering
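
To make the anonymization strategies listed above concrete, the following sketch shows what masking, SHA-256 hashing, age bracketing, and consistent replacement could look like as plain Python. Every name here (`mask`, `sha256_hash`, `age_bracket`, `ConsistentReplacer`) and every output format is an assumption for illustration; the library's operators may behave and format their output differently.

```python
import hashlib


def mask(value: str, keep_last: int = 3, char: str = "*") -> str:
    """Mask every letter/digit except the last `keep_last`; keep separators."""
    total = sum(ch.isalnum() for ch in value)
    seen = 0
    out = []
    for ch in value:
        if ch.isalnum():
            seen += 1
            out.append(ch if seen > total - keep_last else char)
        else:
            out.append(ch)
    return "".join(out)


def sha256_hash(value: str, salt: str = "") -> str:
    """Return the hex SHA-256 digest of the (optionally salted) value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def age_bracket(age: int, size: int = 10) -> str:
    """Map an exact age to a bracket label such as '40-49'."""
    low = (age // size) * size
    return f"{low}-{low + size - 1}"


class ConsistentReplacer:
    """Give each distinct value of an entity type one stable placeholder."""

    def __init__(self) -> None:
        self._seen: dict[tuple[str, str], str] = {}

    def replace(self, entity_type: str, value: str) -> str:
        key = (entity_type, value)
        if key not in self._seen:
            # Number placeholders per entity type, in order of first appearance
            n = sum(1 for t, _ in self._seen if t == entity_type) + 1
            self._seen[key] = f"<{entity_type}_{n}>"
        return self._seen[key]
```

In this sketch, mask("123 456 782") yields "*** *** 782" and age_bracket(42) yields "40-49". Consistent replacement is the choice when the same person must stay recognisable as the same placeholder across a document, while hashing gives a stable token without keeping a lookup table.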

Quick example

from allyanonimiser import create_allyanonimiser

# Build an analyzer/anonymizer with the default model (en_core_web_sm)
ally = create_allyanonimiser()

text = "Customer John Smith (TFN: 123 456 782) called about policy POL-987654."

# Detect PII and print each entity's type, matched text, and score
results = ally.analyze(text)
for r in results:
    print(f"{r.entity_type}: {r.text!r} (score={r.score:.2f})")

# Anonymize, choosing a different operator per entity type
out = ally.anonymize(
    text,
    operators={
        "PERSON": "replace",
        "AU_TFN": "mask",
        "INSURANCE_POLICY_NUMBER": "hash",
    },
)
print(out["text"])
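
A common follow-up is to count detections per entity type, optionally ignoring low-confidence hits. The sketch below relies only on the `entity_type` and `score` attributes used in the example above. The helper name `summarize` is hypothetical, and the sample objects are stand-ins with made-up scores so the snippet runs without the library installed.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(results, min_score: float = 0.0) -> Counter:
    """Count detections per entity type, skipping scores below min_score."""
    return Counter(r.entity_type for r in results if r.score >= min_score)


# Stand-in results with made-up scores (not real analyzer output)
sample = [
    SimpleNamespace(entity_type="PERSON", text="John Smith", score=0.85),
    SimpleNamespace(entity_type="AU_TFN", text="123 456 782", score=1.0),
    SimpleNamespace(entity_type="INSURANCE_POLICY_NUMBER", text="POL-987654", score=0.4),
]

for entity_type, count in summarize(sample, min_score=0.5).items():
    print(f"{entity_type}: {count}")
```

In real use, pass the list returned by ally.analyze(...) in place of `sample`.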

Choosing a spaCy model

|  | SPACY_MODEL_FAST (en_core_web_sm) | SPACY_MODEL_ACCURATE (en_core_web_lg) |
|---|---|---|
| Default in v3.3+? | yes | no |
| Download size | 44 MB | 587 MB |
| Cold start | ~0.5 s | 2–5 s |
| Pattern detection (TFN, ABN, MEDICARE, AU_PHONE, EMAIL, dates) | identical | identical |
| PERSON / LOCATION / ORG recall | medium | high |
| Serverless friendliness (Azure Functions, Lambda) | good | poor |

Opt into the accurate model when a missed name is expensive in your downstream workflow:

from allyanonimiser import create_allyanonimiser, SPACY_MODEL_ACCURATE

ally = create_allyanonimiser(spacy_model=SPACY_MODEL_ACCURATE)

License

MIT — see LICENSE.