TL;DR: Apply unicodedata.normalize('NFC', text), strip invisible control characters, expand Latin ligatures, and collapse whitespace before submitting addresses to any geocoding API. This four-step sequence — part of the Unicode and Character Normalization in Python workflow — eliminates the silent mismatches that cause duplicate geocoded records and failed coordinate lookups.
Why Special Characters Break Geocoding Pipelines
Global address datasets rarely arrive clean. A single street name like café du Marché may appear in your pipeline as six structurally different byte sequences depending on its source system: CRM export, municipal open data, e-commerce checkout, or legacy CSV dump. Geocoding APIs and spatial databases expect strict UTF-8 in NFC form. When a pipeline passes NFD strings, BOM markers, or zero-width spaces, parsers either reject the payload, return null coordinates, or silently create duplicate records for the same physical location.
Four failure modes account for the majority of production incidents:
| Failure Mode | Example | Effect |
|---|---|---|
| NFD vs NFC divergence | café as e + ́ vs é |
Hash-based dedup treats identical strings as distinct |
| Invisible formatting chars | Zero-width non-joiner U+200C |
Survives CSV export; breaks string equality |
| Latin ligature fragmentation | fl (U+FB02) vs fl |
Tokenizer splits misalign street numbers and names |
| Encoding drift | Windows-1252 bytes in UTF-8 stream | Replacement character <?> corrupts spatial joins |
The diagram below shows where normalization must sit in the pipeline — before field extraction, not after.
The Normalization Spec
The four operations must run in this exact order. Changing the sequence produces different results: NFC after ligature expansion leaves expanded pairs in decomposed state; whitespace collapse before control stripping misses spaces introduced by removed invisible chars.
| Step | Operation | Python call | Why this order |
|---|---|---|---|
| 1 | Input coercion | raw.decode('utf-8', errors='replace') |
Ensures all downstream steps work on str, not bytes |
| 2 | NFC composition | unicodedata.normalize('NFC', text) |
Composes combining marks before pattern matching |
| 3 | Control char strip | CONTROL_RE.sub('', text) |
Removes invisibles that NFC does not affect |
| 4 | Ligature expansion | LIGATURE_RE.sub(lambda m: normalize('NFKD', m.group(0)), text) |
Converts fi→fi without touching valid diacritics |
| 5 | Whitespace collapse | re.sub(r'\s+', ' ', text).strip() |
Collapses gaps left by removed characters |
| 6 | ASCII fallback (opt.) | text.encode('ascii', 'ignore').decode('ascii') |
Only when the downstream API rejects UTF-8 |
Minimal Runnable Implementation
import unicodedata
import re
from typing import Union
import pandas as pd
# Compile once at module level — never inside a loop or function body.
# Targets: C0/C1 control codes, zero-width chars U+200B–U+200F,
# directional marks U+202A–U+202E, BOM U+FEFF, and soft hyphen U+00AD.
_CONTROL_RE: re.Pattern[str] = re.compile(
r'[\x00-\x1F\x7F-\x9F--]'
)
# Latin compatibility ligatures U+FB00–U+FB06 (ff, fi, fl, ffi, ffl, st, st)
_LIGATURE_RE: re.Pattern[str] = re.compile(r'[ff-st]')
def normalize_address(raw: Union[str, bytes, None], ascii_fallback: bool = False) -> str:
"""
Normalize a global address string for geocoding API submission.
Steps applied in order:
1. Input coercion (bytes → str, None → '')
2. NFC canonical composition
3. Control and invisible character removal
4. Latin ligature expansion (NFKD on matched ligatures only)
5. Whitespace collapse
6. ASCII fallback (optional, lossy — log when triggered)
Args:
raw: Raw address value; accepts str, bytes, or None.
ascii_fallback: When True, strips all non-ASCII characters.
Use only for legacy APIs that reject UTF-8.
Returns:
A normalized, NFC-encoded str safe for geocoder submission.
"""
if raw is None:
return ''
if isinstance(raw, bytes):
raw = raw.decode('utf-8', errors='replace')
elif not isinstance(raw, str):
raw = str(raw)
# Step 2: canonical composition
text: str = unicodedata.normalize('NFC', raw)
# Step 3: strip control and invisible characters
text = _CONTROL_RE.sub('', text)
# Step 4: expand ligatures to base character pairs
text = _LIGATURE_RE.sub(
lambda m: unicodedata.normalize('NFKD', m.group(0)),
text,
)
# Step 5: collapse whitespace
text = re.sub(r'\s+', ' ', text).strip()
# Step 6 (optional, lossy): ASCII-only fallback
if ascii_fallback:
text = unicodedata.normalize('NFKD', text)
text = text.encode('ascii', 'ignore').decode('ascii')
return text
# Vectorized variant for pandas — avoid Python-level loops on large frames
def normalize_address_series(
series: pd.Series,
ascii_fallback: bool = False,
) -> pd.Series:
"""
Apply normalize_address to a pandas Series of raw address values.
Args:
series: Series of raw address strings (may contain NaN).
ascii_fallback: Passed through to normalize_address.
Returns:
Series of normalized strings with NaN replaced by empty string.
"""
return series.fillna('').apply(
lambda v: normalize_address(v, ascii_fallback=ascii_fallback)
)
Pattern Breakdown
Control character regex
[\x00-\x1F\x7F-\x9F--]
| Component | Code points | Why it is necessary |
|---|---|---|
\x00-\x1F |
C0 control block | Null bytes, carriage returns, and tab characters from legacy exports |
\x7F-\x9F |
C1 control block | Windows-1252 artifacts that survive UTF-8 round-trips |
|
Soft hyphen | Invisible in display but changes string hash and breaks street-name matching |
- |
Zero-width spaces and marks | Common in CJK and Arabic address exports; invisible but alter tokenizer boundaries |
- |
Bidirectional controls | Injected by Arabic/Hebrew editors; cause regex anchors to mis-fire |
|
Byte order mark | BOM prepended by Excel/Windows UTF-8 exports; corrupts first-field values |
Ligature regex
[ff-st]
This range covers the seven Latin compatibility ligatures (ff ff, fi fi, fl fl, ffi ffi, ffl ffl, ſt st, st st). Applying NFKD inside the substitution lambda decomposes each ligature to its constituent ASCII letters without affecting any other character in the string — in particular, without touching valid diacritics like ü or ñ.
Edge Cases and Failure Modes
1. CJK full-width digits in Japanese addresses
Japanese addresses frequently contain full-width ASCII digits (123 U+FF11–U+FF19) for house numbers. The normalize_address function above preserves these because NFC does not decompose compatibility forms. If a downstream parser expects ASCII digits, apply unicodedata.normalize('NFKC', text) as an additional step only to the house-number field extracted after splitting — not to the whole address string. Global NFKC collapses superscripts and fractions, which breaks addresses in some European locales.
import unicodedata
def normalize_digits(field: str) -> str:
"""Normalize full-width digits to ASCII in a single address field."""
return unicodedata.normalize('NFKC', field)
# Apply only after field extraction, not to the raw string:
house_number = normalize_digits(parsed["house_number"])
2. Arabic presentation forms in Gulf address data
Gulf-region datasets from municipal portals often contain Arabic presentation forms (U+FE70–U+FEFF) — isolated, initial, medial, and final glyphs rather than the base Unicode code points. NFC does not normalize these. Use NFKC on Arabic-script fields specifically, or apply the arabic-reshaper + python-bidi libraries when visual shaping is also required.
def normalize_arabic_field(field: str) -> str:
"""Normalize Arabic presentation forms to base code points."""
return unicodedata.normalize('NFKC', field)
3. Mixed Windows-1252 / UTF-8 streams
Legacy CRM exports often inject Windows-1252 bytes into nominally UTF-8 streams. The bytes.decode('utf-8', errors='replace') call in step 1 converts undecodable bytes to U+FFFD (replacement character <?>) rather than crashing. Log these occurrences: a high replacement-character rate signals that the source encoding declaration is wrong and the entire batch needs re-ingestion with chardet-detected encoding.
import chardet
def safe_decode(raw: bytes) -> str:
"""Detect encoding and decode; fall back to UTF-8 with replacement."""
detected = chardet.detect(raw)
encoding = detected.get('encoding') or 'utf-8'
return raw.decode(encoding, errors='replace')
Integration Note
normalize_address belongs at the ingestion boundary of your Unicode and Character Normalization in Python pipeline — immediately after CSV/JSON parsing and before any field extraction, tokenization, or API call. Applying it later (inside the geocoding client, for example) allows malformed strings to corrupt intermediate data structures and makes debugging harder.
For distributed workloads on Airflow, Spark, or Kafka Streams, broadcast the two compiled regex patterns as module-level constants — never call re.compile() inside a task function or a UDF, as it adds unnecessary compilation overhead per row. When building async geocoding requests in Python, normalize the entire batch synchronously before fanning out to the async layer; mixing normalization into async tasks complicates error attribution.
### Why does NFC miss soft hyphens and zero-width spaces?
NFC is a canonical equivalence operation: it only touches characters that represent the same abstract character through different code point sequences. Soft hyphens (U+00AD) and zero-width spaces (U+200B) are distinct characters in their own right — they are not alternate representations of anything else. The control-character regex is what removes them.
### Is it safe to run normalize_address on non-Latin scripts?
Yes for NFC, with one caution: do not enable ascii_fallback=True on Cyrillic, Greek, CJK, or Arabic scripts. The fallback drops every non-ASCII character, turning a valid address into an empty string or meaningless fragment. For transliteration of non-Latin scripts, use unidecode or the transliterate library, and treat the transliterated form as a search key only — never overwrite the canonical stored value.
### Should normalization run before or after deduplication hashing?
Before. Two addresses that are canonically identical must produce the same hash for deduplication to work. If you hash before NFC + invisible-char stripping, NFD and NFC versions of the same address produce different hashes and survive as duplicates. Normalize first, then hash.
Related
- Unicode and Character Normalization in Python — the parent reference covering all four normalization forms, CJK full-width variants, Arabic presentation forms, and locale-specific casing rules.
- Core Address Parsing & Standardization — the top-level guide to the full address pipeline from raw ingestion through validated, geocodable output.
- Normalizing International Addresses with libpostal — how to parse and standardize global address components after character normalization is complete.
- How to Parse Street Numbers and Suffixes with Regex — regex patterns that consume the normalized string to extract house numbers, directionals, and street type suffixes.