Anonymizing Text

Detection tells you where the PII is. Anonymization rewrites the text so you can ship it downstream (to an LLM, into a dataset, to another team) without the sensitive parts. This guide walks through the built-in operators with real examples.

We'll use the same running claim note as in Analyzing Text:

CLAIM_NOTE = """
Customer John Smith (TFN: 123 456 782, DOB: 15/04/1985) reported a collision
on 03/06/2023. He can be reached at 0412 345 678 or john.smith@example.com.
Policy POL-987654 covers his vehicle (rego ABC123) garaged at 42 Main St,
Sydney NSW 2000. Claim reference CL-98765432 has been opened.
"""

The minimal call

from allyanonimiser import create_allyanonimiser

ally = create_allyanonimiser()
result = ally.anonymize(CLAIM_NOTE)

print(result["text"])

With no operators specified, Allyanonimiser defaults to replace — every detected entity becomes <ENTITY_TYPE>:

Customer <PERSON> (TFN: <AU_TFN>, <DATE_OF_BIRTH>) reported a collision
on <DATE>. He can be reached at <AU_PHONE> or <EMAIL_ADDRESS>.
Policy <INSURANCE_POLICY_NUMBER> covers his vehicle (rego <VEHICLE_REGISTRATION>) garaged at 42 Main St,
Sydney NSW <AU_POSTCODE>. Claim reference <INSURANCE_CLAIM_NUMBER> has been opened.

The return value is a dict:

result["text"] — the anonymized string.
result["items"] — per-replacement records (entity_type, original, replacement, operator) you can log or audit.

Per-entity operators

Pass operators={entity_type: operator} to pick what happens per type:

result = ally.anonymize(
    CLAIM_NOTE,
    operators={
        "PERSON":                    "replace",       # <PERSON>
        "AU_TFN":                    "mask",          # ***********
        "EMAIL_ADDRESS":             "hash",          # HASH-a1b2c3d4e5
        "DATE_OF_BIRTH":             "age_bracket",   # 40-45
        "INSURANCE_POLICY_NUMBER":   "redact",        # [REDACTED]
    },
)

Operator catalogue

Operator	What it does	Output example
`replace`	Substitute with `<ENTITY_TYPE>`	`John Smith` → `<PERSON>`
`mask`	Replace every character with `*`	`0412 345 678` → `************`
`redact`	Replace with `[REDACTED]`	`POL-987654` → `[REDACTED]`
`hash`	SHA-256 prefix (stable across a run)	`john@example.com` → `HASH-a1b2c3d4e5`
`age_bracket`	Convert a birthdate to an age band	`DOB: 15/04/1985` → `40-45`

Hashing is deterministic within a single run — the same input always hashes to the same output — so relationships in the text survive (e.g. two mentions of the same email get the same hash).

Age bracketing

When DATE_OF_BIRTH entities use the age_bracket operator, the anonymizer computes an age from today and emits a bracket. Adjust the bracket width via age_bracket_size:

result = ally.anonymize(
    CLAIM_NOTE,
    operators={"DATE_OF_BIRTH": "age_bracket"},
    age_bracket_size=5,   # default; "40-45", "45-50", etc.
)

# Wider brackets for stronger k-anonymity:
result = ally.anonymize(
    CLAIM_NOTE,
    operators={"DATE_OF_BIRTH": "age_bracket"},
    age_bracket_size=10,  # "40-50"
)

Keeping postcodes in addresses

Postcodes often carry aggregated signal (metro/regional, premium zones) that's useful to retain even after scrubbing the street address. The default behavior keeps them:

result = ally.anonymize(
    "42 Main St, Sydney NSW 2000",
    operators={"AU_ADDRESS": "replace"},
    keep_postcode=True,   # default
)
# "42 Main St, Sydney NSW 2000" → "<AU_ADDRESS> 2000"

Set keep_postcode=False to scrub the full address including the postcode.

Restricting which entities get anonymized

If you only want to scrub a subset, pass active_entity_types. Other detected entities pass through untouched:

result = ally.anonymize(
    CLAIM_NOTE,
    active_entity_types=["PERSON", "EMAIL_ADDRESS"],
    operators={
        "PERSON":        "replace",
        "EMAIL_ADDRESS": "hash",
    },
)

TFNs, phone numbers, policy numbers, etc. survive unchanged. Useful when you want to keep internal identifiers visible to analysts but remove customer names and contact info.

Expanding acronyms before anonymization

Same mechanism as in the analyzer — expand internal shorthand before detection runs so anonymization catches what was hiding:

ally.set_acronyms({"TL": "Team Leader", "MVA": "Motor Vehicle Accident"})

result = ally.anonymize(
    "TL Jane Doe reviewed the MVA claim.",
    operators={"PERSON": "replace"},
    expand_acronyms=True,
)
# Before: "TL Jane Doe reviewed the MVA claim."
# After:  "Team Leader <PERSON> reviewed the Motor Vehicle Accident claim."

Reusable settings via AnonymizationConfig

If you're making the same call repeatedly, build a config once:

from allyanonimiser import create_allyanonimiser, AnonymizationConfig

config = AnonymizationConfig(
    operators={
        "PERSON":        "replace",
        "AU_TFN":        "mask",
        "AU_PHONE":      "mask",
        "EMAIL_ADDRESS": "hash",
        "DATE_OF_BIRTH": "age_bracket",
    },
    age_bracket_size=5,
    keep_postcode=True,
)

ally = create_allyanonimiser()
for note in claim_notes:
    result = ally.anonymize(note, config=config)
    ...

Auditing what was replaced

result["items"] gives you a per-replacement log — useful for debugging, reporting, or compliance evidence:

result = ally.anonymize(CLAIM_NOTE, operators={"PERSON": "replace", "AU_TFN": "mask"})

for item in result["items"]:
    print(f"{item['entity_type']:30} {item['operator']:15} "
          f"{item['original']!r:30} -> {item['replacement']!r}")

Sample output:

PERSON                         replace         'John Smith'                  -> '<PERSON>'
AU_TFN                         mask            '123 456 782'                 -> '***********'

What's next

Working with DataFrames — scale this to pandas
Anonymization Operators (Advanced) — deep dive per operator
Custom Operators — register your own rewriter
Custom Patterns — detect entity types the library doesn't ship