Unicode and Character Normalization in Python

Within the Core Address Parsing & Standardization pipeline, character encoding inconsistencies are among the most common, and least visible, causes of address matching failures. This page covers how to normalize raw address strings to a consistent Unicode representation before they reach a tokenizer, regex extractor, or geocoding API.

Why Character Encoding Breaks Address Matching

Raw address data flows from municipal databases, e-commerce checkouts, OCR jobs, and legacy CRM exports. Each source system encodes characters differently. The character é can exist as a single precomposed code point (U+00E9) or as a base e (U+0065) followed by a combining acute accent (U+0301). These two representations are visually identical but byte-for-byte distinct — so a string equality check treats them as different values, a geocoder may hash them to different cache keys, and a database UNIQUE constraint may accept both as separate records for the same address.
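
The mismatch is easy to reproduce with the standard library alone. The sketch below is illustrative (the sample string and variable names are not from any real dataset): it builds both spellings of the same word, compares them, and compares them again after normalizing to a common form.

```python
import unicodedata

# Two spellings of "Café": precomposed é versus e + combining acute accent.
precomposed = "Caf\u00e9"
decomposed = "Cafe\u0301"

# They render identically but are different code point sequences.
same_before = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))

# After normalizing both to the same form, equality holds.
same_after = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(same_before, lengths, same_after)
```

Any of the four forms would make the comparison succeed here; what matters is that both sides use the same one.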

The diagram below shows how a single raw address travels through the normalization stages before it is safe to parse.

[Figure: Normalization-first address pipeline. Five sequential stages, left to right: Decode (UTF-8 / chardet), NFKC normalize(), Strip Controls, Collapse Whitespace, Parse / Validate. All stages execute before tokenization, regex extraction, or database insertion.]

Unicode Normalization Forms: Which One to Use

The Python unicodedata module exposes all four forms defined in Unicode Standard Annex #15. The choice matters for address pipelines.

Form | Behaviour | When to use in address work
NFC | Canonical decomposition, then composition | Safe default for Western European display; does not resolve full-width or ligature variants
NFD | Canonical decomposition only | Useful as an intermediate step before stripping combining marks (category Mn)
NFKC | Compatibility decomposition, then composition | Recommended default: maps full-width Latin, ligatures (ﬁ), superscripts, and fractions to standard equivalents
NFKD | Compatibility decomposition only | Use before aggressive ASCII transliteration; leaves combining marks separate for explicit removal

For global address pipelines, NFKC is the right default. It handles the compatibility variants that appear most often in scraped web forms, OCR outputs, and mainframe EBCDIC exports: full-width characters common in East Asian address entry, typographic ligatures from Western print systems, and superscript ordinals (1ª) in Spanish or Portuguese addresses.
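
To see the difference between the canonical and compatibility forms on exactly these variants, a small comparison like the following can help. It is a sketch: the sample strings and the compare_forms helper are illustrative assumptions, written with escapes so the special characters survive copy and paste.

```python
import unicodedata

# Illustrative inputs, written as escapes so the variants are unambiguous.
samples = {
    "full-width Latin": "\uff21\uff56\uff45",  # full-width "Ave"
    "ligature": "\ufb01rst",                    # "first" with the fi ligature
    "ordinal indicator": "1\u00aa",             # 1 + feminine ordinal indicator
}


def compare_forms(text: str) -> tuple[str, str]:
    """Return the NFC and NFKC results for one input string."""
    return (
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFKC", text),
    )


for label, text in samples.items():
    nfc, nfkc = compare_forms(text)
    print(f"{label}: NFC={nfc!r} NFKC={nfkc!r}")
```

NFC returns each sample unchanged, while NFKC folds all three to plain ASCII letters and digits.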

Production-Ready Workflow

Step 1 — Decode with an explicit encoding

Always declare or detect the source encoding before any string operation. unicodedata.normalize accepts only str and raises TypeError on bytes, and text decoded with the wrong codec turns into mojibake that normalization cannot repair.

import chardet

def safe_decode(raw_bytes: bytes) -> str:
    """Decode bytes to str, detecting encoding when UTF-8 fails."""
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        detected = chardet.detect(raw_bytes)
        enc = detected.get("encoding") or "latin-1"
        return raw_bytes.decode(enc, errors="replace")

Step 2 — Apply NFKC normalization

import unicodedata

normalized = unicodedata.normalize("NFKC", decoded_str)

Run this before any other string operation. Normalization is a no-op on already-normalized strings, so it is safe to call unconditionally.

Step 3 — Strip invisible and control characters

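Zero-width characters (U+200B to U+200D, which includes the zero-width joiner), the byte order mark (U+FEFF), and C0/C1 control characters pass through every normalization form unchanged and silently break downstream comparisons. Non-breaking spaces (U+00A0) are mapped to an ordinary space by NFKC and NFKD but survive NFC and NFD; they are left to the whitespace-collapse step, because deleting them outright would join the words on either side.

import re

# Precompile at module level, never inside a hot loop.
# Escapes keep the invisible characters readable in source control.
# Tab, LF, VT, FF and CR (\x09-\x0D) are deliberately excluded: deleting them
# would join adjacent tokens, so the whitespace-collapse step handles them.
_CONTROL_RE = re.compile(
    r"[\u200B-\u200D\uFEFF\x00-\x08\x0E-\x1F\x7F-\x9F]+"
)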

cleaned = _CONTROL_RE.sub("", normalized)

Step 4 — Collapse whitespace

After stripping control characters, multiple consecutive whitespace characters may remain. Collapse them to a single ASCII space and strip the result.

_WS_RE = re.compile(r"\s+")

cleaned = _WS_RE.sub(" ", cleaned).strip()

Step 5 — Validate and route

Return None for strings that reduce to empty after normalization. Log the raw input so upstream ingestion issues surface rather than being silently discarded.

return cleaned if cleaned else None
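
One way to wire in the logging this step calls for is sketched below. The helper name route_cleaned and the logger name are hypothetical, chosen only for illustration.

```python
import logging
from typing import Optional

logger = logging.getLogger("address_normalization")  # hypothetical logger name


def route_cleaned(raw: str, cleaned: str) -> Optional[str]:
    """Return the cleaned string, or None after logging the raw input."""
    if cleaned:
        return cleaned
    # %r keeps invisible characters visible in the log as escape sequences.
    logger.warning("Address reduced to empty after normalization: %r", raw)
    return None
```

Logging the raw value with %r rather than %s matters here, since the inputs that reduce to empty are usually made of characters that would otherwise print as nothing.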

Primary Code Implementation

The function below combines all five steps into a single, cacheable, production-safe unit.

import unicodedata
import re
import logging
from functools import lru_cache
from typing import Optional

logger = logging.getLogger(__name__)

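# Precompile all patterns at module level.
# Tab, LF, VT, FF and CR (\x09-\x0D) are left for _WS_RE so tokens stay separated.
_CONTROL_RE = re.compile(
    r"[\u200B-\u200D\uFEFF\x00-\x08\x0E-\x1F\x7F-\x9F]+"
)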
_WS_RE = re.compile(r"\s+")


@lru_cache(maxsize=4096)
def normalize_address_string(
    raw: str,
    form: str = "NFKC",
) -> Optional[str]:
    """
    Normalize a single address string for pipeline consumption.

    Applies Unicode normalization (NFKC by default), strips invisible
    and control characters, then collapses whitespace. Returns None
    if the result is empty so callers can quarantine unparseable inputs.

    Args:
        raw:  The raw address string from any source system.
        form: Unicode normalization form — NFKC for most pipelines.

    Returns:
        Cleaned string, or None if input reduces to empty.
    """
    if not isinstance(raw, str):
        logger.warning("Non-string input received: %s (type=%s)", raw, type(raw))
        return None

    try:
        normalized = unicodedata.normalize(form, raw)
        cleaned = _CONTROL_RE.sub("", normalized)
        cleaned = _WS_RE.sub(" ", cleaned).strip()
        return cleaned or None
    except Exception as exc:  # noqa: BLE001
        logger.error("Normalization failed for input %r: %s", raw, exc)
        return None

Vectorized variant for pandas

unicodedata.normalize has no native pandas vectorization, so use Series.apply. For datasets under ~500 k rows this is fast enough; for larger volumes see the Polars path below.

import pandas as pd


def normalize_series_pandas(series: pd.Series) -> pd.Series:
    """Apply normalize_address_string to every element of a Series."""
    return series.apply(normalize_address_string)


# Usage
df["address_clean"] = normalize_series_pandas(df["address_raw"])

Polars variant for large datasets

import polars as pl


def normalize_frame_polars(
    df: pl.DataFrame,
    col: str,
    out_col: str = "",
) -> pl.DataFrame:
    """
    Return df with a new column containing normalized addresses.

    Uses map_elements because unicodedata has no native Polars expression.
    """
    target = out_col or f"{col}_normalized"
    return df.with_columns(
        pl.col(col)
        .map_elements(normalize_address_string, return_dtype=pl.Utf8)
        .alias(target)
    )

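Note: Older Polars releases have no built-in expression for Unicode normalization forms, so map_elements is the portable approach and still benefits from Polars' Arrow-backed memory layout for the surrounding operations. Newer 1.x releases document a native Expr.str.normalize(form); check whether your installed version provides it before relying on it, since a native expression avoids the per-row Python call.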

Normalization Form Reference Table

Character type | Example input | After NFKC | Why it matters
Full-width Latin | Ａｖｅ (U+FF21 U+FF56 U+FF45) | Ave | East Asian IME input; breaks ASCII tokenizers
Precomposed diacritic | é (U+00E9) | é (U+00E9) | Already canonical; NFC/NFKC leave it unchanged
Decomposed diacritic | é (U+0065 + U+0301) | é (U+00E9) | Recomposed to precomposed form; equality now works
Ligature | ﬁrst (U+FB01 + rst) | first | Typographic ligature common in print-to-digital OCR
Non-breaking space | Rue[U+00A0]de | Rue de (U+00A0 mapped to U+0020) | Breaks .split(" ") and ASCII-only whitespace tokenizers
Ordinal indicator | 1ª (U+00AA) | 1a | Spanish/Portuguese address ordinals
Zero-width joiner | Main[U+200D]Street | MainStreet (after strip) | Invisible; causes false equality failures
Byte order mark | [U+FEFF]Main St | Main St (after strip) | Left at file head by some Windows editors

Edge Cases

Cyrillic homoglyphs in transliterated addresses

OCR from Russian address documents sometimes produces Cyrillic lookalikes embedded in otherwise Latin strings: С (Cyrillic) instead of C (Latin), А instead of A. NFKC cannot resolve these because both representations are canonically correct in their own scripts. Build an explicit homoglyph map for affected character pairs and apply it as a post-normalization pass.

_HOMOGLYPHS: dict[str, str] = {
    "А": "A",   # Cyrillic А → Latin A
    "С": "C",   # Cyrillic С → Latin C
    "Е": "E",   # Cyrillic Е → Latin E
    # ... extend for your document corpus
}
_HOMOGLYPH_TABLE = str.maketrans(_HOMOGLYPHS)


def remove_cyrillic_homoglyphs(s: str) -> str:
    return s.translate(_HOMOGLYPH_TABLE)
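
Before extending the map, it helps to find which tokens actually mix scripts. The sketch below is a heuristic, not part of the pipeline above: it assumes the first word of a character's Unicode name identifies its script (true for Latin, Cyrillic, and Greek letters, but not for every character), and both helper names are hypothetical.

```python
import unicodedata


def scripts_in_token(token: str) -> set[str]:
    """Return the script prefixes (e.g. LATIN, CYRILLIC) of the letters in token."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "CYRILLIC CAPITAL LETTER ES".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


def mixed_script_tokens(address: str) -> list[str]:
    """List whitespace-separated tokens whose letters come from more than one script."""
    return [t for t in address.split() if len(scripts_in_token(t)) > 1]
```

Running this over a sample of OCR output, after NFKC so that full-width letters do not show up as a separate prefix, gives a concrete list of character pairs worth adding to the homoglyph map.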

Aggressive ASCII fallback (diacritic stripping)

When the downstream consumer is strictly ASCII — for example, a legacy batch file expected by USPS CASS Certification tooling — strip combining marks after NFD decomposition rather than NFKC.

import unicodedata


def to_ascii_fallback(s: str) -> str:
    """
    Strip diacritics by decomposing to NFD then removing combining marks.
    Use only when the downstream system is strictly ASCII.
    """
    nfd = unicodedata.normalize("NFD", s)
    return "".join(
        ch for ch in nfd
        if unicodedata.category(ch) != "Mn"
    ).encode("ascii", errors="replace").decode("ascii")

Only use this path as a last resort. Modern geocoders and international address format standardization workflows accept diacritics natively; stripping them permanently destroys information.
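
The information loss is easy to demonstrate. The sketch below uses a stand-in named strip_marks (a hypothetical name) that follows the same NFD, drop Mn, force-ASCII approach, applied to a few illustrative place names.

```python
import unicodedata


def strip_marks(s: str) -> str:
    """Same approach as the fallback above: NFD, drop Mn, force ASCII."""
    nfd = unicodedata.normalize("NFD", s)
    kept = "".join(ch for ch in nfd if unicodedata.category(ch) != "Mn")
    return kept.encode("ascii", errors="replace").decode("ascii")


# Letters with a canonical decomposition lose only their accent;
# letters without one (such as Ł or ß) are replaced by "?".
for name in ("München", "Łódź", "Straße"):
    print(name, "->", strip_marks(name))
```

Characters that have no canonical decomposition cannot be reduced to a base letter this way, so a transliteration table is needed if question marks are unacceptable downstream.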

Replacement character U+FFFD from lossily decoded bytes

When a source file is decoded with errors="replace", corrupted byte sequences become U+FFFD (the replacement character). NFKC leaves U+FFFD intact. If your normalization output contains replacement characters, the upstream decoding is lossy and the address component may be irrecoverable. Count and alert on U+FFFD frequency as a pipeline health metric.

def count_replacement_chars(s: str) -> int:
    return s.count("\ufffd")  # U+FFFD REPLACEMENT CHARACTER
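
For the health metric itself, a batch-level rate is usually more useful than a per-string count. A minimal sketch, assuming the batch is any iterable of decoded strings; the function name and any alert threshold are hypothetical choices left to the pipeline owner.

```python
from typing import Iterable

REPLACEMENT_CHAR = "\ufffd"


def replacement_char_rate(addresses: Iterable[str]) -> float:
    """Return the fraction of strings containing at least one U+FFFD."""
    total = 0
    affected = 0
    for s in addresses:
        total += 1
        if REPLACEMENT_CHAR in s:
            affected += 1
    # An empty batch reports 0.0 rather than dividing by zero.
    return affected / total if total else 0.0
```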

Mixed-script addresses (Arabic, Hebrew, Thai)

Right-to-left scripts and scripts without word-boundary spaces require additional handling beyond Unicode normalization. After NFKC, pass mixed-script addresses through a script-detection step (for example, unicodedata.bidirectional() for right-to-left characters, or a language detector such as langdetect) before applying any script-specific handling for special characters in global address data. Normalization alone is not sufficient for these inputs.
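
A minimal sketch of the right-to-left check using only the standard library follows. The helper name contains_rtl is hypothetical, and the check covers only strong right-to-left characters; Thai is written left to right, so it would need a separate word-segmentation step that this sketch does not attempt.

```python
import unicodedata

# Bidirectional classes for strong right-to-left characters:
# "R" (for example Hebrew letters) and "AL" (Arabic letters).
_RTL_CLASSES = {"R", "AL"}


def contains_rtl(text: str) -> bool:
    """Return True if any character has a strong right-to-left bidi class."""
    return any(unicodedata.bidirectional(ch) in _RTL_CLASSES for ch in text)
```

Records for which this returns True can be routed to a script-aware parser instead of the default Latin-oriented one.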

Performance and Vectorization

Caching behaviour of lru_cache

Address datasets have high repetition in city names, street type suffixes, and country names. With a cache of 4 096 entries, a typical city-state column will achieve a hit rate above 90 % after the first few thousand rows, cutting normalization CPU time proportionally. The cache is per-process; for multi-process Airflow tasks, consider a shared Redis cache keyed on sha256(raw).
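
Rather than assuming a hit rate, it can be measured with cache_info(). The sketch below uses a stand-in function (_nfkc, a hypothetical name) in place of the full normalizer, and a made-up list of repeated city names.

```python
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=4096)
def _nfkc(value: str) -> str:
    # Stand-in for the full normalizer; only the caching behaviour matters here.
    return unicodedata.normalize("NFKC", value)


def hit_rate(values: list[str]) -> float:
    """Normalize every value and return the lru_cache hit rate for the run."""
    _nfkc.cache_clear()
    for v in values:
        _nfkc(v)
    info = _nfkc.cache_info()
    calls = info.hits + info.misses
    return info.hits / calls if calls else 0.0


# Illustrative low-cardinality column: four distinct values, 1 000 rows.
cities = ["Springfield", "Portland", "Austin", "Salem"] * 250
print(hit_rate(cities))
```

Running the same measurement on a full street-address column, where most values are unique, shows quickly whether the cache is worth keeping for that column.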

Benchmarks at scale

Volume | Approach | Approx. throughput
< 100 k rows | Series.apply (pandas) | 80–120 k rows/s
100 k – 2 M rows | Series.apply with lru_cache | 200–400 k rows/s (cache warm)
> 2 M rows | Polars map_elements | 400–900 k rows/s
> 10 M rows | Multiprocessing pool + Polars | 1–3 M rows/s (8 cores)

These figures assume short strings (< 200 characters). Longer strings or high cardinality (many unique raw values) reduce cache effectiveness and push you toward Polars or parallel execution.
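
Throughput depends heavily on hardware and string length, so it is worth measuring on a sample of your own data. The harness below is a rough sketch: it times only unicodedata.normalize in a plain loop, the function name is hypothetical, and the result will not match the table above.

```python
import time
import unicodedata


def rows_per_second(values: list[str], form: str = "NFKC") -> float:
    """Time a plain normalization loop and return the observed rows per second."""
    start = time.perf_counter()
    for v in values:
        unicodedata.normalize(form, v)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small inputs or coarse timers.
    return len(values) / elapsed if elapsed > 0 else float("inf")
```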

Chunked streaming for memory-constrained environments

import pandas as pd
from pathlib import Path


def normalize_csv_chunked(
    path: Path,
    col: str,
    chunk_size: int = 50_000,
) -> None:
    """Stream-normalize a large CSV without loading it entirely into memory."""
    for chunk in pd.read_csv(path, chunksize=chunk_size, encoding="utf-8"):
        chunk[f"{col}_normalized"] = normalize_series_pandas(chunk[col])
        # Write or append chunk to output here

Troubleshooting

Normalization succeeds but string equality still fails

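Root cause: Two strings that look identical in a terminal or IDE may still differ in code points that render the same way: cross-script homoglyphs, invisible format characters, characters outside the BMP (code points above U+FFFF) that a font renders with the same fallback glyph, or a different normalization form. Use ascii() rather than repr() to inspect the code point sequence, because repr() prints printable non-ASCII characters as-is and so hides a Cyrillic С standing in for a Latin C. Confirm both strings are in the same normalization form before comparing.

import unicodedata

assert unicodedata.is_normalized("NFKC", s1)  # is_normalized requires Python 3.8+
assert unicodedata.is_normalized("NFKC", s2)
print(ascii(s1), ascii(s2))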
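
When the escaped output is hard to compare by eye, naming each code point pinpoints the mismatch. This is a sketch with hypothetical helper names, using only unicodedata.name from the standard library.

```python
import unicodedata


def describe_code_points(s: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per code point in s."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in s
    ]


def first_difference(a: str, b: str):
    """Return (index, entry_a, entry_b) for the first differing position, or None."""
    da, db = describe_code_points(a), describe_code_points(b)
    for i, (x, y) in enumerate(zip(da, db)):
        if x != y:
            return i, x, y
    if len(da) != len(db):
        # One string is a prefix of the other; report the first extra code point.
        i = min(len(da), len(db))
        return i, (da[i] if i < len(da) else None), (db[i] if i < len(db) else None)
    return None
```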

lru_cache grows without bound in long-running processes

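Root cause: lru_cache is bounded by maxsize and evicts the least recently used entry once the limit is reached, so the cache itself cannot grow without limit. Memory climbs until the cache fills, and every entry holds both the raw and the cleaned string, so long inputs raise the plateau. If the plateau is too high, or string cardinality is so high (many unique raw addresses) that the cache rarely hits, reduce maxsize to 1 024 or remove the cache decorator and rely on Polars batching instead.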

Non-breaking spaces survive into the parsed output

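Root cause: U+00A0 is not a control character. NFKC and NFKD map it to an ordinary space, but NFC and NFD leave it unchanged, so it survives unicodedata.normalize whenever a canonical form is passed as form. Python 3's \s matches U+00A0 in str patterns (though not in bytes patterns), so the whitespace-collapse step replaces it with an ASCII space as long as that step runs on str. If non-breaking spaces still reach the parser, check that the value went through the whitespace collapse and was not cleaned as bytes.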
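
The behaviour of U+00A0 under each form can be checked directly. A small sketch with an illustrative French street name:

```python
import re
import unicodedata

text = "Rue\u00a0de la Paix"  # non-breaking space after "Rue"

nfc = unicodedata.normalize("NFC", text)
nfkc = unicodedata.normalize("NFKC", text)

# NFC keeps U+00A0; NFKC maps it to an ordinary space.
nbsp_survives_nfc = "\u00a0" in nfc
nbsp_survives_nfkc = "\u00a0" in nfkc

# On str patterns, \s matches U+00A0, so collapsing whitespace also handles it.
collapsed = re.sub(r"\s+", " ", nfc)

print(nbsp_survives_nfc, nbsp_survives_nfkc, collapsed == nfkc)
```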

chardet detection returns None encoding

Root cause: The byte sequence is too short or too uniform for heuristic detection to resolve. Fall back to latin-1 (ISO 8859-1), which decodes every byte sequence without raising an exception, then flag the record for manual review.
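
A sketch of that fallback with an explicit review flag is shown below. The function name and the detected_encoding parameter are hypothetical; the parameter stands for whatever the detector reported, which may be None.

```python
from typing import Optional


def decode_with_review_flag(
    raw_bytes: bytes,
    detected_encoding: Optional[str],
) -> tuple[str, bool]:
    """Decode raw_bytes; the bool is True when the latin-1 fallback was used."""
    if detected_encoding:
        try:
            return raw_bytes.decode(detected_encoding), False
        except (UnicodeDecodeError, LookupError):
            pass  # Wrong guess or unknown codec name: fall through to latin-1.
    # latin-1 maps every byte to a code point, so this never raises.
    return raw_bytes.decode("latin-1"), True
```

Carrying the flag alongside the decoded text lets the record continue through the pipeline while still being queued for manual review.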

Polars map_elements returns all nulls

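Root cause: normalize_address_string returns None whenever it receives a non-string value or catches an exception, and map_elements stores each None as null. A column of nulls therefore usually means every row took one of those paths, for example because the source column is not a string dtype. Check the warning and error messages logged by the function (as shown above) so the failure surfaces, and compare df["col_normalized"].null_count() with the null count of the source column.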

FAQ

Why does NFKC beat NFC for address pipelines?

NFC only handles canonical composition — it leaves full-width Latin characters (common in East Asian input forms), typographic ligatures, and superscripts intact. NFKC also applies compatibility decomposition, mapping those variants to their ASCII or standard Unicode equivalents before recomposing. That makes tokenization and exact-match lookups deterministic across source systems.

Is it safe to run NFKC normalization twice on the same string?

Yes. All four Unicode normalization forms are idempotent: normalize(form, normalize(form, s)) == normalize(form, s). This property makes NFKC safe for distributed retries in Airflow or Dagster without risking data drift.
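
The property is cheap to assert in a test suite. A minimal sketch, with a hypothetical helper name, that checks all four forms against one input:

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def is_idempotent(text: str) -> bool:
    """Check that normalizing twice equals normalizing once, for every form."""
    for form in FORMS:
        once = unicodedata.normalize(form, text)
        twice = unicodedata.normalize(form, once)
        if once != twice:
            return False
    return True
```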

When should I strip diacritics entirely vs. keeping them?

Strip diacritics only when your downstream matcher is ASCII-only (e.g. a legacy USPS batch file). For modern geocoders that accept UTF-8, retaining diacritics via NFKC is more accurate — München and Munchen may resolve to different spatial records. Use NFD to decompose, then filter Mn category characters, only when an ASCII fallback is explicitly required.

How do I handle addresses from OCR that contain garbled encoding?

OCR output often mixes Latin and Cyrillic lookalikes, introduces replacement characters (U+FFFD), and drops combining marks unpredictably. Apply NFKC first to consolidate compatibility variants, then use a homoglyph map to replace visually ambiguous characters before parsing. Log any remaining U+FFFD occurrences as candidates for manual review.

Does lru_cache interact badly with None returns?

No — lru_cache caches None returns like any other value. If a malformed string produces None on first call, subsequent calls with the same string return None immediately without re-executing the function. This is correct behaviour: the string is definitively unprocessable, and caching the failure prevents repeated log noise.
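
This can be confirmed with a call counter. The sketch below uses a simplified stand-in function (clean, a hypothetical name) rather than the full normalizer.

```python
from functools import lru_cache
from typing import Optional

calls = 0


@lru_cache(maxsize=128)
def clean(raw: str) -> Optional[str]:
    """Simplified stand-in: strip whitespace and return None when nothing is left."""
    global calls
    calls += 1
    stripped = raw.strip()
    return stripped or None


first = clean("   ")   # executes the function body and caches None
second = clean("   ")  # served from the cache; the body does not run again

print(first, second, calls, clean.cache_info().hits)
```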