Quick Start
This guide will walk you through the basic usage of Allyanonimiser to detect and anonymize personally identifiable information (PII) in text.
Creating an Allyanonimiser Instance
The first step is to create an instance of the Allyanonimiser class:
from allyanonimiser import create_allyanonimiser
# Default: uses en_core_web_sm (44 MB, fast)
ally = create_allyanonimiser()
This instance comes pre-configured with all built-in patterns for Australian, general, and insurance-specific PII.
Choosing a spaCy model
In v3.3+ the default spaCy model is en_core_web_sm — small (44 MB) and fast to load. Pattern-based detection (TFN, ABN, Medicare, AU phone, email, dates) is identical regardless of the model. If you need higher recall on PERSON, LOCATION, or ORG entities, opt into the larger model:
from allyanonimiser import create_allyanonimiser, SPACY_MODEL_ACCURATE
ally = create_allyanonimiser(spacy_model=SPACY_MODEL_ACCURATE) # en_core_web_lg, 587 MB
Pass spacy_model=None to disable spaCy entirely — pattern detection keeps working. See the Installation guide for the full tradeoff table.
Analyzing Text for PII
To detect PII entities in a text:
# Text to analyze
text = "Please reference your policy AU-12345678 for claims related to your vehicle registration XYZ123."
# Analyze the text
results = ally.analyze(text)
# Print the results
for result in results:
print(f"Entity: {result.entity_type}, Text: {result.text}, Score: {result.score}")
Output:
Entity: POLICY_NUMBER, Text: AU-12345678, Score: 0.85
Entity: VEHICLE_REGISTRATION, Text: XYZ123, Score: 0.7
Anonymizing Text
To anonymize the detected PII:
# Anonymize the text with specific operators for each entity type
anonymized = ally.anonymize(
text="Please reference your policy AU-12345678 for claims related to your vehicle registration XYZ123.",
operators={
"POLICY_NUMBER": "mask", # Replace with asterisks
"VEHICLE_REGISTRATION": "replace" # Replace with entity type
}
)
# Print the anonymized text
print(anonymized["text"])
Output:
Please reference your policy ********** for claims related to your vehicle registration <VEHICLE_REGISTRATION>.
Adding Custom Patterns
You can add your own patterns to detect additional entity types:
# Add a custom pattern with regex
ally.add_pattern({
"entity_type": "REFERENCE_CODE",
"patterns": [r"REF-\d{6}-[A-Z]{2}", r"Reference\s+#\d{6}"],
"context": ["reference", "code", "ref"],
"name": "Reference Code"
})
# Test the custom pattern
text = "Your reference code is REF-123456-AB for this inquiry."
results = ally.analyze(text)
for result in results:
print(f"Found {result.entity_type}: {result.text}")
Output:
Found REFERENCE_CODE: REF-123456-AB
Generating Patterns from Examples
Allyanonimiser can also generate patterns from example strings:
# Generate a pattern from examples
ally.create_pattern_from_examples(
entity_type="EMPLOYEE_ID",
examples=["EMP00123", "EMP45678", "EMP98765"],
context=["employee", "staff", "id"],
generalization_level="medium" # Options: none, low, medium, high
)
# Test the generated pattern
text = "Employee EMP12345 submitted the request."
results = ally.analyze(text)
for result in results:
print(f"Found {result.entity_type}: {result.text}")
Output:
Found EMPLOYEE_ID: EMP12345
Next Steps
Now that you understand the basics, explore the following topics to learn more:
- Analyzing Text - Learn about the analysis capabilities in depth
- Anonymizing Text - Explore the various anonymization operators
- Working with DataFrames - Process tabular data efficiently
- Pattern Reference - See all the built-in patterns
- Creating Custom Patterns - Learn how to create and manage custom patterns