TL;DR: Call parse_address(raw_string) to decompose an address into labelled components and expand_address(raw_string) to generate canonical, abbreviation-resolved variants — both from the pypostal bindings to the libpostal C library. This page is part of the International Address Format Standardization workflow.
How libpostal Splits Parsing from Expansion
libpostal separates normalization into two operations that serve distinct pipeline roles:
- parse_address: tokenizes a raw string and assigns semantic labels (house_number, road, city, postcode, country, state, suburb, and more). Returns a list of (value, label) tuples.
- expand_address: resolves abbreviations, normalizes casing, and applies regional formatting rules (St→Street, Apt→Apartment, Pl→Plaza). Returns a list of normalized strings suitable for deduplication or geocoding.
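Because parse_address returns (value, label) tuples rather than a mapping, most pipelines reshape the output before using it. The sketch below shows one way to do that while joining repeated labels instead of overwriting them. The sample list is hand-written to match the shape described above, not captured libpostal output, and `pairs_to_components` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-written sample in the (value, label) shape described above.
# Illustrative only; this is not captured libpostal output.
SAMPLE_PAIRS = [
    ("123", "house_number"),
    ("main street", "road"),
    ("apt 4b", "unit"),
    ("springfield", "city"),
    ("12345", "postcode"),
]


def pairs_to_components(pairs):
    """Collect (value, label) pairs into {label: value}, joining repeated labels."""
    grouped = defaultdict(list)
    for value, label in pairs:
        grouped[label].append(value)
    # Join in input order so a label that appears twice keeps both tokens.
    return {label: " ".join(values) for label, values in grouped.items()}


components = pairs_to_components(SAMPLE_PAIRS)
```

Joining repeated labels is a design choice; a pipeline that prefers the first or last occurrence can swap the join for an index lookup.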
The model is trained on OpenStreetMap and OpenAddresses data, covering addressing conventions from German compound street names to Japanese prefecture ordering to Brazilian neighborhood hierarchies. Because the engine is statistical rather than rule-based, it handles typos, missing fields, and mixed-language inputs gracefully — situations where rigid regex patterns for US address parsing would fail silently.
Installation Spec
libpostal is a C library. pypostal provides Python bindings but does not bundle compiled binaries or training data, so you must build the core library first.
System requirements
| Resource | Minimum |
|---|---|
| RAM | 1.8 GB (model stays resident after first import) |
| Disk | 10 GB for training data at --datadir path |
| Storage type | SSD recommended for I/O-heavy batch loads |
| Python | 3.9 – 3.13 |
Linux (Ubuntu / Debian)
sudo apt-get install -y curl autoconf automake libtool pkg-config
git clone https://github.com/openvenues/libpostal.git
cd libpostal
./bootstrap.sh
./configure --datadir=/opt/libpostal_data
make -j$(nproc)
sudo make install
sudo ldconfig
pip install pypostal
macOS (Intel and Apple Silicon)
brew install autoconf automake libtool pkg-config
# Clone and build same as Linux.
# On Apple Silicon add --disable-sse2 if compilation fails:
./configure --datadir=/opt/libpostal_data --disable-sse2
# nproc is not available on stock macOS; sysctl reports the core count instead.
make -j$(sysctl -n hw.ncpu)
sudo make install
pip install pypostal
Docker
No official pre-built image exists. Build a custom image from Ubuntu and run the Linux steps above, installing pypostal in the same image after ldconfig has run so the bindings can locate the shared library. Copy application code into the image, or mount it into the container at run time.
Component Label Reference
parse_address assigns labels from libpostal’s taxonomy. The table below covers the labels most relevant to production pipelines:
| Label | What it captures | Example value |
|---|---|---|
| house_number | Street number | "123" |
| road | Street name including type | "main street" |
| unit | Apartment, suite, or floor | "apt 4b" |
| city | Municipality | "berlin" |
| state | Region, province, or Bundesland | "berlin" |
| postcode | Postal or ZIP code | "10115" |
| country | Country name or ISO code | "germany" |
| suburb | Neighbourhood or borough | "mitte" |
| house | Named building or POI | "empire state building" |
| po_box | PO box number | "po box 44" |
All output values are lowercased by default. Reconstruct display casing at the application layer with str.title() or locale-aware routines.
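As a rough illustration of restoring display casing, the sketch below title-cases most components and upper-cases postcodes, since str.title() turns "sw1a 1aa" into "Sw1A 1Aa". The helper name `display_case` and the choice of which labels to upper-case are assumptions made for this example.

```python
# Labels whose values read better fully upper-cased (an assumption for this sketch).
UPPERCASE_LABELS = frozenset({"postcode"})


def display_case(components: dict[str, str]) -> dict[str, str]:
    """Return a copy of components with presentation casing applied."""
    return {
        label: value.upper() if label in UPPERCASE_LABELS else value.title()
        for label, value in components.items()
    }


shown = display_case({"road": "main street", "city": "berlin", "postcode": "sw1a 1aa"})
```

str.title() also capitalizes the letter after an apostrophe ("king's road" becomes "King'S Road"), so names with apostrophes or particles need a locale-aware routine.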
Minimal Runnable Implementation
import logging

from postal.expand import expand_address
from postal.parser import parse_address

logger = logging.getLogger(__name__)

# Labels a record must carry to count as a complete street address.
REQUIRED_LABELS: frozenset[str] = frozenset({"house_number", "road", "city", "postcode"})


def normalize_address(raw: str) -> dict:
    """
    Parse and expand a raw international address string.

    Returns a dict with:
    - status: 'success' | 'low_confidence' | 'error'
    - confidence: float 0.0–1.0 based on required-label coverage
    - components: dict[label, value] from parse_address
    - canonical_variants: list[str] from expand_address
    - missing_labels: list[str] of required labels absent from the parse

    Error results carry only status and reason.
    """
    raw = raw.strip()
    if len(raw) < 5:
        return {"status": "error", "reason": "Input too short"}
    try:
        parsed = parse_address(raw)
        # parse_address yields (value, label) pairs; a repeated label keeps its last value.
        components: dict[str, str] = {label: value for value, label in parsed}
        missing = sorted(REQUIRED_LABELS - components.keys())
        # Coverage heuristic, not a model probability: share of required labels found.
        confidence = 1.0 - len(missing) / len(REQUIRED_LABELS)
        canonical_variants: list[str] = expand_address(raw)
        return {
            "status": "success" if confidence >= 0.75 else "low_confidence",
            "confidence": round(confidence, 2),
            "components": components,
            "canonical_variants": canonical_variants,
            "missing_labels": missing,
        }
    except Exception as exc:
        logger.error("libpostal parse failed for %r: %s", raw, exc)
        return {"status": "error", "reason": str(exc)}
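To make the scoring rule easier to reason about, here is the coverage arithmetic isolated from libpostal so it runs anywhere. It is a sketch of the same heuristic (share of required labels present, with a 0.75 threshold); the `coverage` helper name is hypothetical.

```python
REQUIRED = frozenset({"house_number", "road", "city", "postcode"})


def coverage(labels):
    """Return (confidence, status, missing) for an iterable of parsed labels."""
    missing = sorted(REQUIRED - set(labels))
    # Each missing required label costs 1/4 of the score.
    confidence = 1.0 - len(missing) / len(REQUIRED)
    status = "success" if confidence >= 0.75 else "low_confidence"
    return round(confidence, 2), status, missing


full = coverage(["house_number", "road", "city", "postcode"])
one_missing = coverage(["road", "city", "postcode"])
two_missing = coverage(["house", "city", "postcode"])
```

With four required labels, a record can lose exactly one of them and still be marked success; losing two drops it to low_confidence.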
Vectorized pandas variant
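The pypostal bindings load the model into RAM when postal.parser and postal.expand are first imported, then reuse it for every call in the process, so import them once at module level rather than inside a per-row function. Series.apply still calls normalize_address row by row; this wrapper is a convenience, not a true vectorized operation.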
from typing import Any

import pandas as pd


def normalize_series(series: pd.Series) -> pd.DataFrame:
    """
    Apply normalize_address to every row of a Series.

    Returns a DataFrame with columns: status, confidence, components,
    canonical_variants, missing_labels (plus reason for error rows).
    """
    results: list[dict[str, Any]] = series.apply(normalize_address).tolist()
    # Reuse the input index so pd.concat(axis=1) aligns rows after filtering or sorting.
    return pd.DataFrame(results, index=series.index)
# Usage:
# df = pd.read_csv("addresses.csv")
# normalized = normalize_series(df["raw_address"])
# df = pd.concat([df, normalized], axis=1)
Apply normalize_series after any upstream Unicode and character normalization step so that libpostal receives clean UTF-8 rather than mixed-encoding fragments.
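A minimal sketch of such a cleanup step, assuming NFKC normalization followed by whitespace collapsing is sufficient for the input at hand. The function name `clean_for_parsing` is hypothetical; real pipelines may also need to repair mis-decoded bytes before this point.

```python
import unicodedata


def clean_for_parsing(raw: str) -> str:
    """Fold compatibility characters with NFKC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    # str.split() with no argument splits on any Unicode whitespace.
    return " ".join(text.split())


cleaned = clean_for_parsing("１２３\u3000Ｍａｉｎ  St ")
```

NFKC folds full-width digits and letters to their ASCII forms but leaves characters such as ß untouched, so legitimate non-ASCII street names survive.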
Edge Cases and Failure Modes
1. PO boxes and military addresses
libpostal recognises po_box as a label but its confidence on US-style PO BOX 44 is lower than on street addresses, and APO/FPO/DPO military designators often parse with missing city and postcode. Pre-screen inputs with a lightweight pattern before calling libpostal:
import re

_PO_BOX_RE = re.compile(
    r"\b(P\.?\s?O\.?\s?BOX|POST\s+OFFICE\s+BOX)\s+\d+\b",
    re.IGNORECASE,
)


def route_before_parse(raw: str) -> str:
    """Return 'po_box' or 'standard' to signal the downstream handler."""
    if _PO_BOX_RE.search(raw):
        return "po_box"
    return "standard"
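The pattern above covers PO boxes only. One possible extension for the military designators mentioned earlier is sketched below; it assumes the usual US form of APO, FPO, or DPO followed by one of the region codes AA, AE, or AP. The `military` route name and the `route_with_military` helper are hypothetical.

```python
import re

_PO_BOX = re.compile(
    r"\b(P\.?\s?O\.?\s?BOX|POST\s+OFFICE\s+BOX)\s+\d+\b",
    re.IGNORECASE,
)
# APO/FPO/DPO followed by a region code (AA, AE, AP); an assumption about input shape.
_MILITARY = re.compile(r"\b(APO|FPO|DPO)\s+(AA|AE|AP)\b", re.IGNORECASE)


def route_with_military(raw: str) -> str:
    """Return 'military', 'po_box', or 'standard' for a raw address string."""
    # Check military first: these addresses often contain a box number too.
    if _MILITARY.search(raw):
        return "military"
    if _PO_BOX.search(raw):
        return "po_box"
    return "standard"


route = route_with_military("PSC 1234 Box 5678, APO AE 09012")
```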
See the dedicated handling PO boxes and rural routes page for the complete extraction workflow.
2. Mixed-language inputs
An address like "Potsdamer Platz 1, 10785 Berlin, Germany" usually parses cleanly, but a hybrid like "Potsdamer Platz 1 Berlim Alemanha" (Portuguese city/country spellings) can mislabel or drop components. Use expand_address output as the deduplication key rather than re-assembling raw components, because expansion applies the same abbreviation and casing rules to every input regardless of how the parser labelled it.
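One way to use expansion output as a deduplication key is to treat two records as duplicates when their variant lists share at least one string. The sketch below does this in a single pass. The variant lists are hand-written for illustration, not real expand_address output, and `group_by_shared_variant` is a hypothetical helper.

```python
def group_by_shared_variant(records: dict[str, list[str]]) -> dict[str, str]:
    """Map each record id to the first earlier record that shares a variant with it."""
    owner: dict[str, str] = {}           # variant string -> representative record id
    representative: dict[str, str] = {}  # record id -> representative record id
    for record_id, variants in records.items():
        # Reuse an existing representative if any variant has been seen before.
        rep = next((owner[v] for v in variants if v in owner), record_id)
        representative[record_id] = rep
        for v in variants:
            owner.setdefault(v, rep)
    return representative


# Hand-written variant lists; illustrative only.
groups = group_by_shared_variant({
    "a": ["123 main street", "123 main saint"],
    "b": ["123 main street"],
    "c": ["9 elm avenue"],
})
```

This single pass does not merge two existing groups when a later record bridges them; a union-find structure would be needed for full transitive grouping.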
3. Addresses without house numbers
Venue-only inputs ("Eiffel Tower, Paris") label house but omit house_number and road, so they score at most 0.5 against the four-field requirement; this example also lacks a postcode and scores 0.25. Route them to a geocoding provider directly rather than attempting component assembly. See implementing fallback chains for failed lookups for the routing pattern.
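A minimal sketch of that routing decision, assuming a venue-only record is one that carries a house label but neither house_number nor road. The route names `geocode_direct` and `assemble_components` are placeholders, not part of any library.

```python
def is_venue_only(components: dict[str, str]) -> bool:
    """True when a named place is present but no street-level fields are."""
    return (
        "house" in components
        and "house_number" not in components
        and "road" not in components
    )


def choose_route(components: dict[str, str]) -> str:
    """Pick a hypothetical downstream route for a parsed record."""
    return "geocode_direct" if is_venue_only(components) else "assemble_components"


venue_route = choose_route({"house": "eiffel tower", "city": "paris"})
```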
Integration Note
In a full International Address Format Standardization pipeline, libpostal sits between raw input sanitization and geocoding API dispatch. The components dict it returns maps directly onto the structured address fields expected by most geocoding providers, while the canonical_variants list provides a deterministic key for deduplication before records reach the geocoder. When confidence falls below threshold, routing ambiguous records to a secondary provider prevents silent data loss without blocking the main pipeline.
For batch workloads, run normalization in a background queue rather than synchronously — libpostal’s model is not async-safe and should live in a dedicated worker process. Pairing the worker with a Redis LRU cache for expand_address results (which are deterministic for identical inputs) cuts redundant C-library calls by 30–60% on typical address datasets with repeated values.
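The caching idea can be prototyped in-process before introducing Redis. The sketch below swaps Redis for functools.lru_cache and wraps whatever expansion callable is passed in, so it can be exercised with a stand-in function. `make_cached_expander` is a hypothetical helper, and the default cache size is an arbitrary assumption.

```python
from functools import lru_cache
from typing import Callable, Iterable


def make_cached_expander(
    expand_fn: Callable[[str], Iterable[str]],
    maxsize: int = 100_000,
) -> Callable[[str], tuple[str, ...]]:
    """Wrap an expansion function in an in-process LRU cache keyed on the raw string."""

    @lru_cache(maxsize=maxsize)
    def cached(raw: str) -> tuple[str, ...]:
        # Store a tuple so callers cannot mutate the cached value.
        return tuple(expand_fn(raw))

    return cached


# Intended usage inside the worker process (requires libpostal to be installed):
# cached_expand = make_cached_expander(expand_address)
```

An in-process cache is lost when the worker restarts and is not shared between workers, which is the gap a Redis-backed cache fills.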
Related
- International Address Format Standardization — the parent workflow covering ISO country schemas, component ordering, and provider field mappings that contextualize libpostal’s output.
- Handling PO Boxes and Rural Routes — pre-screening patterns for address types that libpostal handles with lower confidence.
- Unicode and Character Normalization in Python — NFKC normalization and diacritic handling to apply before passing strings to libpostal.
- Implementing Fallback Chains for Failed Lookups — routing low-confidence libpostal outputs to a secondary geocoding provider without blocking the main pipeline.