Should I use NFC or NFKC for address strings?

Use NFC for storage and geocoding submission. NFKC collapses compatibility variants (superscripts, fractions, full-width digits) which can destroy meaningful address tokens in some locales. Reserve NFKC only for building search keys or deduplication hashes, not for the canonical stored form.

Why does my deduplication still produce duplicates after NFC?

Invisible characters — zero-width non-joiners (U+200C), soft hyphens (U+00AD), and BOM remnants (U+FEFF) — survive NFC unchanged. Run the control-character stripping step after NFC, then case-fold and whitespace-collapse before hashing.

When is ASCII fallback safe to use?

Only when a downstream system explicitly rejects UTF-8 bytes. ASCII fallback silently drops characters that cannot be represented in ASCII; always log the original string when fallback triggers so data stewards can audit transliteration loss.

Handling Special Characters in Global Address Data

TL;DR: Apply unicodedata.normalize('NFC', text), strip invisible control characters, expand Latin ligatures, and collapse whitespace before submitting addresses to any geocoding API. This four-step sequence — part of the Unicode and Character Normalization in Python workflow — eliminates the silent mismatches that cause duplicate geocoded records and failed coordinate lookups.

Why Special Characters Break Geocoding Pipelines

Global address datasets rarely arrive clean. A single street name like café du Marché may appear in your pipeline as six structurally different byte sequences depending on its source system: CRM export, municipal open data, e-commerce checkout, or legacy CSV dump. Geocoding APIs and spatial databases expect strict UTF-8 in NFC form. When a pipeline passes NFD strings, BOM markers, or zero-width spaces, parsers either reject the payload, return null coordinates, or silently create duplicate records for the same physical location.

Four failure modes account for the majority of production incidents:

Failure Mode	Example	Effect
NFD vs NFC divergence	`café` as `e + ́` vs `é`	Hash-based dedup treats identical strings as distinct
Invisible formatting chars	Zero-width non-joiner `U+200C`	Survives CSV export; breaks string equality
Latin ligature fragmentation	`ﬂ` (U+FB02) vs `fl`	Tokenizer splits misalign street numbers and names
Encoding drift	Windows-1252 bytes in UTF-8 stream	Replacement character `<?>` corrupts spatial joins

The diagram below shows where normalization must sit in the pipeline — before field extraction, not after.

The Normalization Spec

The four operations must run in this exact order. Changing the sequence produces different results: NFC after ligature expansion leaves expanded pairs in decomposed state; whitespace collapse before control stripping misses spaces introduced by removed invisible chars.

Step	Operation	Python call	Why this order
1	Input coercion	`raw.decode('utf-8', errors='replace')`	Ensures all downstream steps work on `str`, not `bytes`
2	NFC composition	`unicodedata.normalize('NFC', text)`	Composes combining marks before pattern matching
3	Control char strip	`CONTROL_RE.sub('', text)`	Removes invisibles that NFC does not affect
4	Ligature expansion	`LIGATURE_RE.sub(lambda m: normalize('NFKD', m.group(0)), text)`	Converts `ﬁ`→`fi` without touching valid diacritics
5	Whitespace collapse	`re.sub(r'\s+', ' ', text).strip()`	Collapses gaps left by removed characters
6	ASCII fallback (opt.)	`text.encode('ascii', 'ignore').decode('ascii')`	Only when the downstream API rejects UTF-8

Minimal Runnable Implementation

import unicodedata
import re
from typing import Union
import pandas as pd

# Compile once at module level — never inside a loop or function body.
# Targets: C0/C1 control codes, zero-width chars U+200B–U+200F,
# directional marks U+202A–U+202E, BOM U+FEFF, and soft hyphen U+00AD.
_CONTROL_RE: re.Pattern[str] = re.compile(
    r'[\x00-\x1F\x7F-\x9F-‏‪-‮]'
)

# Latin compatibility ligatures U+FB00–U+FB06 (ff, fi, fl, ffi, ffl, st, st)
_LIGATURE_RE: re.Pattern[str] = re.compile(r'[ﬀ-ﬆ]')


def normalize_address(raw: Union[str, bytes, None], ascii_fallback: bool = False) -> str:
    """
    Normalize a global address string for geocoding API submission.

    Steps applied in order:
      1. Input coercion (bytes → str, None → '')
      2. NFC canonical composition
      3. Control and invisible character removal
      4. Latin ligature expansion (NFKD on matched ligatures only)
      5. Whitespace collapse
      6. ASCII fallback (optional, lossy — log when triggered)

    Args:
        raw: Raw address value; accepts str, bytes, or None.
        ascii_fallback: When True, strips all non-ASCII characters.
                        Use only for legacy APIs that reject UTF-8.

    Returns:
        A normalized, NFC-encoded str safe for geocoder submission.
    """
    if raw is None:
        return ''
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8', errors='replace')
    elif not isinstance(raw, str):
        raw = str(raw)

    # Step 2: canonical composition
    text: str = unicodedata.normalize('NFC', raw)

    # Step 3: strip control and invisible characters
    text = _CONTROL_RE.sub('', text)

    # Step 4: expand ligatures to base character pairs
    text = _LIGATURE_RE.sub(
        lambda m: unicodedata.normalize('NFKD', m.group(0)),
        text,
    )

    # Step 5: collapse whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Step 6 (optional, lossy): ASCII-only fallback
    if ascii_fallback:
        text = unicodedata.normalize('NFKD', text)
        text = text.encode('ascii', 'ignore').decode('ascii')

    return text


# Vectorized variant for pandas — avoid Python-level loops on large frames
def normalize_address_series(
    series: pd.Series,
    ascii_fallback: bool = False,
) -> pd.Series:
    """
    Apply normalize_address to a pandas Series of raw address values.

    Args:
        series: Series of raw address strings (may contain NaN).
        ascii_fallback: Passed through to normalize_address.

    Returns:
        Series of normalized strings with NaN replaced by empty string.
    """
    return series.fillna('').apply(
        lambda v: normalize_address(v, ascii_fallback=ascii_fallback)
    )

Pattern Breakdown

Control character regex

[\x00-\x1F\x7F-\x9F-‏‪-‮]

Component	Code points	Why it is necessary
`\x00-\x1F`	C0 control block	Null bytes, carriage returns, and tab characters from legacy exports
`\x7F-\x9F`	C1 control block	Windows-1252 artifacts that survive UTF-8 round-trips
	Soft hyphen	Invisible in display but changes string hash and breaks street-name matching
`-‏`	Zero-width spaces and marks	Common in CJK and Arabic address exports; invisible but alter tokenizer boundaries
`‪-‮`	Bidirectional controls	Injected by Arabic/Hebrew editors; cause regex anchors to mis-fire
	Byte order mark	BOM prepended by Excel/Windows UTF-8 exports; corrupts first-field values

Ligature regex

[ﬀ-ﬆ]

This range covers the seven Latin compatibility ligatures (ﬀ ff, ﬁ fi, ﬂ fl, ﬃ ffi, ﬄ ffl, ﬅ st, ﬆ st). Applying NFKD inside the substitution lambda decomposes each ligature to its constituent ASCII letters without affecting any other character in the string — in particular, without touching valid diacritics like ü or ñ.

Edge Cases and Failure Modes

1. CJK full-width digits in Japanese addresses

Japanese addresses frequently contain full-width ASCII digits (１２３ U+FF11–U+FF19) for house numbers. The normalize_address function above preserves these because NFC does not decompose compatibility forms. If a downstream parser expects ASCII digits, apply unicodedata.normalize('NFKC', text) as an additional step only to the house-number field extracted after splitting — not to the whole address string. Global NFKC collapses superscripts and fractions, which breaks addresses in some European locales.

import unicodedata

def normalize_digits(field: str) -> str:
    """Normalize full-width digits to ASCII in a single address field."""
    return unicodedata.normalize('NFKC', field)

# Apply only after field extraction, not to the raw string:
house_number = normalize_digits(parsed["house_number"])

2. Arabic presentation forms in Gulf address data

Gulf-region datasets from municipal portals often contain Arabic presentation forms (U+FE70–U+FEFF) — isolated, initial, medial, and final glyphs rather than the base Unicode code points. NFC does not normalize these. Use NFKC on Arabic-script fields specifically, or apply the arabic-reshaper + python-bidi libraries when visual shaping is also required.

def normalize_arabic_field(field: str) -> str:
    """Normalize Arabic presentation forms to base code points."""
    return unicodedata.normalize('NFKC', field)

3. Mixed Windows-1252 / UTF-8 streams

Legacy CRM exports often inject Windows-1252 bytes into nominally UTF-8 streams. The bytes.decode('utf-8', errors='replace') call in step 1 converts undecodable bytes to U+FFFD (replacement character <?>) rather than crashing. Log these occurrences: a high replacement-character rate signals that the source encoding declaration is wrong and the entire batch needs re-ingestion with chardet-detected encoding.

import chardet

def safe_decode(raw: bytes) -> str:
    """Detect encoding and decode; fall back to UTF-8 with replacement."""
    detected = chardet.detect(raw)
    encoding = detected.get('encoding') or 'utf-8'
    return raw.decode(encoding, errors='replace')

Integration Note

normalize_address belongs at the ingestion boundary of your Unicode and Character Normalization in Python pipeline — immediately after CSV/JSON parsing and before any field extraction, tokenization, or API call. Applying it later (inside the geocoding client, for example) allows malformed strings to corrupt intermediate data structures and makes debugging harder.

For distributed workloads on Airflow, Spark, or Kafka Streams, broadcast the two compiled regex patterns as module-level constants — never call re.compile() inside a task function or a UDF, as it adds unnecessary compilation overhead per row. When building async geocoding requests in Python, normalize the entire batch synchronously before fanning out to the async layer; mixing normalization into async tasks complicates error attribution.

### Why does NFC miss soft hyphens and zero-width spaces?

NFC is a canonical equivalence operation: it only touches characters that represent the same abstract character through different code point sequences. Soft hyphens (U+00AD) and zero-width spaces (U+200B) are distinct characters in their own right — they are not alternate representations of anything else. The control-character regex is what removes them.

### Is it safe to run normalize_address on non-Latin scripts?

Yes for NFC, with one caution: do not enable ascii_fallback=True on Cyrillic, Greek, CJK, or Arabic scripts. The fallback drops every non-ASCII character, turning a valid address into an empty string or meaningless fragment. For transliteration of non-Latin scripts, use unidecode or the transliterate library, and treat the transliterated form as a search key only — never overwrite the canonical stored value.

### Should normalization run before or after deduplication hashing?

Before. Two addresses that are canonically identical must produce the same hash for deduplication to work. If you hash before NFC + invisible-char stripping, NFD and NFC versions of the same address produce different hashes and survive as duplicates. Normalize first, then hash.

Unicode and Character Normalization in Python — the parent reference covering all four normalization forms, CJK full-width variants, Arabic presentation forms, and locale-specific casing rules.
Core Address Parsing & Standardization — the top-level guide to the full address pipeline from raw ingestion through validated, geocodable output.
Normalizing International Addresses with libpostal — how to parse and standardize global address components after character normalization is complete.
How to Parse Street Numbers and Suffixes with Regex — regex patterns that consume the normalized string to extract house numbers, directionals, and street type suffixes.