Normalizing International Addresses with libpostal

TL;DR: Call parse_address(raw_string) to decompose an address into labelled components and expand_address(raw_string) to generate canonical, abbreviation-resolved variants — both from the pypostal bindings to the libpostal C library. This page is part of the International Address Format Standardization workflow.


How libpostal Splits Parsing from Expansion

libpostal separates normalization into two operations that serve distinct pipeline roles:

  • parse_address: tokenizes a raw string and assigns semantic labels (house_number, road, city, postcode, country, state, suburb, and more). Returns a list of (value, label) tuples.
  • expand_address: resolves abbreviations, normalizes casing, and applies regional formatting rules (St → Street, Apt → Apartment, Pl → Plaza). Returns a list of normalized strings suitable for deduplication or geocoding.

The model is trained on OpenStreetMap and OpenAddresses data, covering addressing conventions from German compound street names to Japanese prefecture ordering to Brazilian neighborhood hierarchies. Because the engine is statistical rather than rule-based, it handles typos, missing fields, and mixed-language inputs gracefully — situations where rigid regex patterns for US address parsing would fail silently.
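To make the (value, label) ordering concrete, the sketch below groups parser output by label. The sample list is hand-written to mirror the shape parse_address returns; it is not captured from a real run, and group_by_label is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-written sample in the shape parse_address returns:
# a list of (value, label) tuples, value first.
SAMPLE_PARSE = [
    ("123", "house_number"),
    ("main st", "road"),
    ("apt 4b", "unit"),
    ("springfield", "city"),
]


def group_by_label(parsed):
    """Collect values per label, keeping repeated labels in input order."""
    grouped = defaultdict(list)
    for value, label in parsed:
        grouped[label].append(value)
    return dict(grouped)


components = group_by_label(SAMPLE_PARSE)
print(components["road"])  # ['main st']
```

Keeping a list per label avoids silently discarding a value when the parser assigns the same label to more than one token.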

[Figure: libpostal parse_address and expand_address data flow. A raw address string such as "123 Main St, Apt 4B" enters libpostal. The parse_address branch outputs labelled components (house_number, road, unit, city, postcode) that form a structured record for the geocoding payload. The expand_address branch outputs canonical string variants such as "123 main street apartment 4b" that serve as deduplication keys.]

Installation Spec

libpostal is a C library. pypostal provides Python bindings but does not bundle compiled binaries or training data, so you must build the core library first.

System requirements

Resource | Minimum
RAM | 1.8 GB (model stays resident after first import)
Disk | 10 GB for training data at the --datadir path
Storage type | SSD recommended for I/O-heavy batch loads
Python | 3.9 – 3.13

Linux (Ubuntu / Debian)

sudo apt-get install -y curl build-essential autoconf automake libtool pkg-config
git clone https://github.com/openvenues/libpostal.git
cd libpostal
./bootstrap.sh
./configure --datadir=/opt/libpostal_data
make -j$(nproc)
sudo make install
sudo ldconfig
pip install postal  # pypostal is published on PyPI under the name "postal"

macOS (Intel and Apple Silicon)

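brew install curl autoconf automake libtool pkg-config
git clone https://github.com/openvenues/libpostal.git
cd libpostal
./bootstrap.sh
# Intel:
./configure --datadir=/opt/libpostal_data
# Apple Silicon (SSE2 is x86-only, so disable it if compilation fails):
# ./configure --datadir=/opt/libpostal_data --disable-sse2
# nproc is not available on stock macOS; sysctl reports the core count.
make -j"$(sysctl -n hw.ncpu)"
sudo make install
pip install postal  # pypostal is published on PyPI under the name "postal"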

Docker

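Build a custom image from an Ubuntu base and run the Linux steps above; no official pre-built image exists. Install the Python bindings in the same image as the compiled library, and keep the ldconfig step (or set LD_LIBRARY_PATH) so the bindings can locate the libpostal shared library at import time. Copy or mount application code on top of that image.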
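A container start-up check can confirm that the loader actually sees the library before the bindings are imported. The sketch below uses only the standard library; find_libpostal is a hypothetical helper name, and the check says nothing about whether the training data under --datadir is present.

```python
import ctypes.util
from typing import Optional


def find_libpostal() -> Optional[str]:
    """Return the library name the loader resolves for libpostal, or None."""
    # find_library searches the platform's standard locations
    # (the ldconfig cache on Linux, falling back to LD_LIBRARY_PATH).
    return ctypes.util.find_library("postal")


if __name__ == "__main__":
    resolved = find_libpostal()
    if resolved is None:
        print("libpostal not found: run ldconfig or set LD_LIBRARY_PATH")
    else:
        print(f"libpostal resolved as {resolved}")
```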

Component Label Reference

parse_address assigns labels from libpostal’s taxonomy. The table below covers the labels most relevant to production pipelines:

Label | What it captures | Example value
house_number | Street number | "123"
road | Street name including type | "main street"
unit | Apartment, suite, or floor | "apt 4b"
city | Municipality | "berlin"
state | Region, province, or Bundesland | "berlin"
postcode | Postal or ZIP code | "10115"
country | Country name or ISO code | "germany"
suburb | Neighbourhood or borough | "mitte"
house | Named building or POI | "empire state building"
po_box | PO box number | "po box 44"

All output values are lowercased by default. Reconstruct display casing at the application layer with str.title() or locale-aware routines.
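As one possible application-layer routine, the sketch below recases a components dict for display. It capitalizes only the first character of each word, which avoids the str.title() habit of capitalizing letters after apostrophes and digits ("king's" becoming "King'S"). The choice to upper-case postcodes is an assumption that suits some countries and not others, and display_case is a hypothetical helper name.

```python
# Labels whose values read better fully upper-cased (an assumption;
# adjust per country).
UPPERCASE_LABELS = {"postcode"}


def display_case(components: dict) -> dict:
    """Return a copy of lowercased components recased for display."""
    recased = {}
    for label, value in components.items():
        if label in UPPERCASE_LABELS:
            recased[label] = value.upper()
        else:
            # Capitalize the first character of each space-separated word
            # and leave the rest of the word untouched.
            recased[label] = " ".join(
                word[:1].upper() + word[1:] for word in value.split(" ")
            )
    return recased


print(display_case({"road": "king's road", "postcode": "sw1a 1aa"}))
# {'road': "King's Road", 'postcode': 'SW1A 1AA'}
```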

Minimal Runnable Implementation

from postal.parser import parse_address
from postal.expand import expand_address
import logging

logger = logging.getLogger(__name__)

REQUIRED_LABELS: frozenset[str] = frozenset({"house_number", "road", "city", "postcode"})


def normalize_address(raw: str) -> dict:
    """
    Parse and expand a raw international address string.

    Returns a dict with:
      - status: 'success' | 'low_confidence' | 'error'
      - confidence: float 0.0–1.0 based on required-label coverage
      - components: dict[label, value] from parse_address
      - canonical_variants: list[str] from expand_address
      - missing_labels: list[str] of required labels absent from the parse
    """
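    if not isinstance(raw, str):
        # Guards against None/NaN cells when called through pandas.
        return {"status": "error", "reason": "Input is not a string"}
    raw = raw.strip()
    if len(raw) < 5:
        return {"status": "error", "reason": "Input too short"}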

    try:
        parsed = parse_address(raw)
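        # The parser can emit the same label more than once; join repeats
        # instead of letting a later value overwrite an earlier one.
        components: dict[str, str] = {}
        for value, label in parsed:
            components[label] = f"{components[label]} {value}" if label in components else value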

        missing = sorted(REQUIRED_LABELS - components.keys())
        confidence = 1.0 - len(missing) / len(REQUIRED_LABELS)

        canonical_variants: list[str] = expand_address(raw)

        return {
            "status": "success" if confidence >= 0.75 else "low_confidence",
            "confidence": round(confidence, 2),
            "components": components,
            "canonical_variants": canonical_variants,
            "missing_labels": missing,
        }
    except Exception as exc:
        logger.error("libpostal parse failed for %r: %s", raw, exc)
        return {"status": "error", "reason": str(exc)}

Vectorized pandas variant

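Load libpostal once per process: the bindings read the model into RAM when postal.parser and postal.expand are first imported and reuse it for every later call in that process. Series.apply still invokes normalize_address row by row, so this variant is a convenience wrapper rather than true vectorization.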

import pandas as pd
from typing import Any


def normalize_series(series: pd.Series) -> pd.DataFrame:
    """
    Apply normalize_address to every row of a Series.
    Returns a DataFrame with columns: status, confidence, components,
    canonical_variants, missing_labels.
    """
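    results: list[dict[str, Any]] = series.apply(normalize_address).tolist()
    # Reuse the input index so pd.concat(axis=1) aligns rows correctly
    # even when the source DataFrame has a non-default index.
    return pd.DataFrame(results, index=series.index)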


# Usage:
# df = pd.read_csv("addresses.csv")
# normalized = normalize_series(df["raw_address"])
# df = pd.concat([df, normalized], axis=1)

Apply normalize_series after any upstream Unicode and character normalization step so that libpostal receives clean UTF-8 rather than mixed-encoding fragments.
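A minimal sketch of such an upstream cleaning step is shown below, using only the standard library. The choice of NFC (rather than NFKC) and the decision to drop all control and format characters are assumptions, and clean_input is a hypothetical helper name.

```python
import re
import unicodedata

_WHITESPACE_RE = re.compile(r"\s+")


def clean_input(raw: str) -> str:
    """NFC-normalize, collapse whitespace, and drop control characters."""
    text = unicodedata.normalize("NFC", raw)
    # Collapse whitespace first so newlines and tabs become single spaces
    # before the control-character filter below would remove them.
    text = _WHITESPACE_RE.sub(" ", text)
    # Drop remaining characters in the Unicode "C" categories
    # (control, format, etc.), e.g. zero-width spaces.
    text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")
    return text.strip()


print(clean_input("Cafe\u0301  Str.\n12\u200b"))  # Café Str. 12
```

Calling series.map(clean_input) before normalize_series keeps the cleaning step separate from parsing, so each can be tested on its own.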

Edge Cases and Failure Modes

1. PO boxes and military addresses

libpostal recognises po_box as a label, but it parses US-style PO BOX 44 inputs less reliably than street addresses, and APO/FPO/DPO military designators often parse with city and postcode missing. Pre-screen inputs with a lightweight pattern before calling libpostal:

import re

_PO_BOX_RE = re.compile(
    r"\b(P\.?\s?O\.?\s?BOX|POST\s+OFFICE\s+BOX)\s+\d+\b",
    re.IGNORECASE,
)


def route_before_parse(raw: str) -> str:
    """Return 'po_box', or 'standard' to signal the downstream handler."""
    if _PO_BOX_RE.search(raw):
        return "po_box"
    return "standard"

See the dedicated handling PO boxes and rural routes page for the complete extraction workflow.
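Military addresses can be pre-screened in the same way. The pattern below is a rough sketch that looks for APO, FPO, or DPO followed by one of the armed-forces region codes (AA, AE, AP); it is an assumption about typical formatting, not an exhaustive validator, and is_military_address is a hypothetical helper name.

```python
import re

# APO/FPO/DPO followed by an armed-forces region code (AA, AE, AP).
_MILITARY_RE = re.compile(
    r"\b(?:APO|FPO|DPO)\b[\s,]+(?:AA|AE|AP)\b",
    re.IGNORECASE,
)


def is_military_address(raw: str) -> bool:
    """Return True when the string looks like a US military address."""
    return _MILITARY_RE.search(raw) is not None


print(is_military_address("PSC 1234 Box 5678, APO AE 09021"))  # True
print(is_military_address("12 Apollo Ave, Springfield"))       # False
```

A router could check this before the PO box pattern, since military addresses often contain a box number as well.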

2. Mixed-language inputs

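An address like "Potsdamer Platz 1, 10785 Berlin, Germany" may parse correctly, but a hybrid like "Potsdamer Platz 1 Berlim Alemanha" (Portuguese city/country spellings) can come back with city or country mislabelled. Use expand_address output as the deduplication key rather than re-assembling raw components, because expansion collapses abbreviation and formatting variants into shared canonical strings. Expansion does not translate place names, so "Berlim" and "Berlin" still need an alias table or a geocoder to reconcile.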
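One way to use expansions as deduplication keys is to treat two records as candidate duplicates when their variant lists share at least one string. The sketch below builds an inverted index to find such pairs; the sample variant lists are hand-written for illustration rather than real expand_address output, and candidate_duplicates is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import combinations


def candidate_duplicates(expansions: dict) -> list:
    """Return sorted id pairs whose variant lists share at least one string.

    `expansions` maps a record id to the list of expanded strings for it.
    """
    index = defaultdict(set)
    for record_id, variants in expansions.items():
        for variant in variants:
            index[variant].add(record_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


# Hand-written variant lists, not real library output.
sample = {
    "a": ["123 main street", "123 main saint"],
    "b": ["123 main street"],
    "c": ["9 elm road"],
}
print(candidate_duplicates(sample))  # [('a', 'b')]
```

Matching on any shared variant is more tolerant than hashing the whole list, because an abbreviated input usually expands to more variants than its fully spelled-out twin.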

3. Addresses without house numbers

Venue-only inputs ("Eiffel Tower, Paris") are labelled house and city but have no house_number, road, or postcode. Only one of the four required labels is present, so the confidence is 0.25 and the record is marked low_confidence. Route these inputs to a geocoding provider directly rather than attempting component assembly; see implementing fallback chains for failed lookups for the routing pattern.
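A routing rule for this case might look like the sketch below. It reuses the required-label coverage formula from normalize_address; the handler names ("geocode_direct", "assemble_components", "review") are placeholders, not part of any library.

```python
REQUIRED = frozenset({"house_number", "road", "city", "postcode"})


def coverage(components: dict) -> float:
    """Fraction of required labels present in the parsed components."""
    return len(components.keys() & REQUIRED) / len(REQUIRED)


def route(components: dict) -> str:
    """Pick a handler name for a parsed record (names are placeholders)."""
    has_street = bool(components.keys() & {"house_number", "road"})
    if "house" in components and not has_street:
        # Venue-only input: skip component assembly.
        return "geocode_direct"
    if coverage(components) >= 0.75:
        return "assemble_components"
    return "review"


print(route({"house": "eiffel tower", "city": "paris"}))  # geocode_direct
```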

Integration Note

In a full International Address Format Standardization pipeline, libpostal sits between raw input sanitization and geocoding API dispatch. The components dict it returns maps directly onto the structured address fields expected by most geocoding providers, while the canonical_variants list provides a deterministic key for deduplication before records reach the geocoder. When confidence falls below threshold, routing ambiguous records to a secondary provider prevents silent data loss without blocking the main pipeline.

For batch workloads, run normalization in a background queue rather than synchronously: calls into libpostal are blocking, CPU-bound C calls that would stall an event loop, so the model should live in a dedicated worker process. Pairing the worker with a Redis LRU cache for expand_address results (which are deterministic for identical inputs) can cut redundant C-library calls by roughly 30–60% on address datasets with many repeated values; the actual saving depends on the duplicate rate.
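The caching idea can be sketched without Redis by swapping in an in-process functools.lru_cache. This is a stand-in for the Redis setup described above, not an equivalent: it is not shared across workers and does not survive restarts. The expand function is injected so the sketch runs without libpostal; make_cached_expander and fake_expand are hypothetical names.

```python
from functools import lru_cache
from typing import Callable, List, Tuple


def make_cached_expander(expand: Callable[[str], List[str]], maxsize: int = 100_000):
    """Wrap an expand function in an in-process LRU cache."""

    @lru_cache(maxsize=maxsize)
    def _cached(raw: str) -> Tuple[str, ...]:
        # Store tuples so callers cannot mutate cached values.
        return tuple(expand(raw))

    return _cached


# Stand-in for expand_address that records how often it is called.
calls: List[str] = []


def fake_expand(raw: str) -> List[str]:
    calls.append(raw)
    return [raw.lower()]


cached_expand = make_cached_expander(fake_expand)
for address in ["1 Main St", "1 Main St", "2 Elm Rd"]:
    cached_expand(address)

print(len(calls))                       # 2
print(cached_expand.cache_info().hits)  # 1
```

In a real worker, passing expand_address in place of fake_expand would give the same behaviour, with cache_info() providing the hit rate needed to check the saving on a given dataset.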