Automating Address Component Extraction with spaCy

Q: Should I use a blank spaCy model or a pre-trained one for address parsing?

For deterministic rule-based extraction, a blank model (spacy.blank('en')) is preferred: it skips the tagger, parser, and pre-trained NER components, giving faster throughput and predictable output. Use a pre-trained model (en_core_web_sm or better) only when you need the statistical NER layer to recover components the EntityRuler misses — and even then, set overwrite_ents=True so your address rules take priority.

Q: Does the EntityRuler handle multi-token street names correctly?

Yes. Patterns are sequences of token matchers, so 'Rue de la Paix' can be captured with a four-token pattern that anchors on the leading street-type token ('Rue') and allows any subsequent tokens before a postal code boundary. Avoid single-token LOWER/IN patterns for street names — they fragment compound names.

Q: How do I handle German Straße vs Strasse in patterns?

Apply NFKC Unicode normalization before the spaCy pipeline runs, which converts 'ß' to 'ss' only in compatibility-equivalent forms. For matching, include both 'straße' and 'strasse' in your LOWER IN list, or use a REGEX token matcher: {"TEXT": {"REGEX": "(?i)stra[sß]e"}}.

Q: What is the throughput of nlp.pipe() compared to calling nlp() in a loop?

nlp.pipe() can be 5–15x faster than a plain loop on large datasets because it batches tokenization and avoids repeated Python object allocation per document. For 10 million addresses, set batch_size between 5,000–10,000 and n_process=-1 to use all available CPU cores.

TL;DR: Configure spaCy’s EntityRuler with postal-code, house-number, and street-suffix patterns, apply regex fallbacks for components the ruler misses, then stream large datasets through nlp.pipe(). This page is part of Parsing European Address Conventions, which covers the full country-first normalization workflow that feeds into this extraction step.

When spaCy fits address parsing

Address parsing is a structured information-extraction problem, not a generative language task. Postal codes follow strict patterns, street suffixes are a finite set, and administrative hierarchies map cleanly to lookup tables. spaCy’s architecture provides deterministic rule injection via EntityRuler, dependency parsing, and a pluggable NER pipeline — all of which scale horizontally on standard CPUs without GPU overhead.

The key design choice is determinism: unlike fine-tuned neural sequence-taggers, a rule-based EntityRuler guarantees reproducibility, executes at predictable latency, and avoids the hallucination risk that makes statistical models unreliable for regulated address data. For teams building geocoding normalization pipelines that feed CRM systems or logistics platforms, this reproducibility is a production requirement, not a preference.

Before running any spaCy pipeline, apply NFKC Unicode normalization to raw input strings. Without it, Straße and Strasse tokenize differently and your patterns will miss half your German records.

Pipeline architecture

The diagram below shows how a raw address string moves through the spaCy pipeline to produce structured component fields.

Pattern block: EntityRuler configuration

The production pattern set below targets postal codes, house numbers, street suffixes (English and common European forms), and country names. Patterns are token-level, so compound values like 12A/3 or SW1A 1AA match as single logical entities.

ADDRESS_PATTERNS = [
    # Postal codes: matches DE (12345), NL (1234 AB), UK (SW1A 1AA), FR/ES (12345)
    {
        "label": "POSTAL_CODE",
        "pattern": [
            {"TEXT": {"REGEX": r"^[A-Z]{1,2}\d[A-Z\d]?$"}},   # UK outward code
            {"TEXT": {"REGEX": r"^\d[A-Z]{2}$"}}                # UK inward code
        ]
    },
    {
        "label": "POSTAL_CODE",
        "pattern": [{"TEXT": {"REGEX": r"^\d{4,5}(?:\s?[A-Z]{2})?$"}}]
    },
    # House numbers: 12, 12A, 12/3, 12-14
    {
        "label": "HOUSE_NUMBER",
        "pattern": [{"TEXT": {"REGEX": r"^\d{1,4}[a-zA-Z]?(?:[/\-]\d{1,4}[a-zA-Z]?)?$"}}]
    },
    # Street suffixes — English and major European forms
    {
        "label": "STREET_SUFFIX",
        "pattern": [{"LOWER": {"IN": [
            "street", "st", "avenue", "ave", "road", "rd", "boulevard", "blvd",
            "lane", "ln", "drive", "dr", "close", "court", "ct",
            "platz", "straße", "strasse", "gasse", "weg", "allee",
            "via", "viale", "piazza", "corso",
            "calle", "avenida", "paseo", "carrer",
            "rue", "avenue", "impasse", "allée",
            "straat", "laan", "weg", "plein"
        ]}}]
    },
    # Country names — extend as needed
    {
        "label": "COUNTRY",
        "pattern": [{"LOWER": {"IN": [
            "germany", "deutschland", "france", "spain", "españa",
            "italy", "italia", "netherlands", "nederland",
            "uk", "united kingdom", "ireland", "belgium", "belgique",
            "switzerland", "schweiz", "austria", "österreich",
            "poland", "polska", "sweden", "sverige", "norway", "norge"
        ]}}]
    }
]

Pattern component table

Pattern label	What it matches	Why it is necessary
`POSTAL_CODE` (two-token)	UK format: `SW1A` + `1AA`	UK codes split across two tokens at the space; a single-token regex misses them
`POSTAL_CODE` (one-token)	DE `12345`, NL `1234AB`, FR `75001`	Catches most continental formats in one pass
`HOUSE_NUMBER`	`12`, `12A`, `12/3`, `12-14`	Numeric-anchor extraction before alphabetic components prevents mislabelling
`STREET_SUFFIX`	30+ Latin/EU forms	Finite vocabulary means LOWER IN is faster and more predictable than a regex
`COUNTRY`	Common names and native forms	Enables country-first routing downstream without a separate classifier

Minimal runnable implementation

import re
import unicodedata
import spacy
from typing import Dict, List, Generator, Iterable

# Compile regex fallbacks once at module level — never inside a loop
_POSTAL_RE = re.compile(
    r"\b([A-Z]{1,2}\d[A-Z\d]?\s\d[A-Z]{2}|\d{4,5}(?:\s[A-Z]{2})?)\b"
)
_HOUSE_RE = re.compile(r"\b(\d{1,4}[a-zA-Z]?(?:[/\-]\d{1,4}[a-zA-Z]?)?)\b")

# ---- Pipeline setup (run once at import time) ----------------------------

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})
ruler.add_patterns(ADDRESS_PATTERNS)  # defined in the pattern block above


# ---- Component extraction ------------------------------------------------

def _extract_from_doc(doc) -> Dict[str, List[str]]:
    """Map doc.ents to address component fields; apply regex fallbacks."""
    fields: Dict[str, List[str]] = {
        "street_suffix": [],
        "house_number": [],
        "postal_code": [],
        "country": [],
    }
    for ent in doc.ents:
        key = ent.label_.lower()
        if key in fields:
            fields[key].append(ent.text)

    # Regex fallbacks fire only when the EntityRuler returned nothing
    if not fields["postal_code"]:
        m = _POSTAL_RE.search(doc.text)
        if m:
            fields["postal_code"].append(m.group(1))

    if not fields["house_number"]:
        m = _HOUSE_RE.search(doc.text)
        if m:
            fields["house_number"].append(m.group(1))

    return fields


def extract_address_components(raw: str) -> Dict[str, List[str]]:
    """Extract components from a single raw address string.

    Applies NFKC normalization before processing so that German ß,
    accented characters, and ligatures tokenize consistently.
    """
    normalized = unicodedata.normalize("NFKC", raw.strip())
    doc = nlp(normalized)
    return _extract_from_doc(doc)


def batch_extract(
    texts: Iterable[str],
    batch_size: int = 2000,
    n_process: int = 1,
) -> Generator[Dict[str, List[str]], None, None]:
    """Stream-process a large address list with minimal memory overhead.

    Args:
        texts: Any iterable of raw address strings.
        batch_size: Documents per chunk. Increase to 5000–10000 for
            datasets over 1 M rows to reduce Python overhead per batch.
        n_process: Set to -1 to use all available CPU cores.
    """
    normalized = (unicodedata.normalize("NFKC", t.strip()) for t in texts)
    # Disable unused components to avoid processing overhead
    with nlp.select_pipes(disable=["parser", "tagger", "lemmatizer"]):
        for doc in nlp.pipe(normalized, batch_size=batch_size, n_process=n_process):
            yield _extract_from_doc(doc)


# ---- Vectorized pandas variant -------------------------------------------

def extract_components_series(series) -> "pd.DataFrame":
    """Apply batch_extract to a pandas Series; return a component DataFrame.

    Usage:
        import pandas as pd
        df = pd.DataFrame({"address": ["12 Rue de Rivoli 75001 Paris France", ...]})
        components = extract_components_series(df["address"])
        df = pd.concat([df, components], axis=1)
    """
    import pandas as pd
    records = list(batch_extract(series.tolist(), batch_size=2000, n_process=-1))
    return pd.DataFrame(records, index=series.index)

Edge cases and failure modes

1. UK postcodes split across two tokens

spaCy tokenizes SW1A 1AA as two tokens: SW1A and 1AA. A single-token postal code pattern misses both halves. The two-token pattern in the configuration above addresses this, but only if the input has exactly one space at the sector boundary. Inputs with no space (SW1A1AA) need a pre-processing normalization step to insert a space after the outward code before the pipeline runs:

import re

_UK_COMPACT = re.compile(r"^([A-Z]{1,2}\d[A-Z\d]?)(\d[A-Z]{2})$")

def normalize_uk_postcode(code: str) -> str:
    m = _UK_COMPACT.match(code.upper().replace(" ", ""))
    return f"{m.group(1)} {m.group(2)}" if m else code

2. German compound house number suffixes

German addresses use suffixes like 12A, 12/14, and 12a-b as a single logical token, but the / and - characters force spaCy’s tokenizer to split them. Configure a custom tokenizer rule or use merge_entities=True on the ruler so adjacent matched tokens are merged into one span:

ruler = nlp.add_pipe(
    "entity_ruler",
    config={"overwrite_ents": True, "phrase_matcher_attr": "LOWER"}
)

Alternatively, pre-process the address string to replace / inside numeric sequences with a placeholder character before tokenization, then restore it in the output.

3. Postal codes misidentified as house numbers

A four-digit Dutch postal code (1234) and a four-digit house number share the same numeric-only surface form. The HOUSE_NUMBER regex fires first alphabetically if both patterns match. Resolve this by running postal-code extraction before house-number extraction in the _extract_from_doc function and checking whether a match is already claimed:

# Inside _extract_from_doc — check claimed spans before house number fallback
claimed_spans = {(ent.start_char, ent.end_char) for ent in doc.ents if ent.label_ == "POSTAL_CODE"}
if not fields["house_number"]:
    for m in _HOUSE_RE.finditer(doc.text):
        if (m.start(), m.end()) not in claimed_spans:
            fields["house_number"].append(m.group(1))
            break

Integration note

This extraction step sits inside the broader Parsing European Address Conventions workflow at the tokenization stage — after Unicode normalization and country detection, and before postal-code validation against national registries. The EntityRuler output feeds a standardization layer that maps components to a unified schema (street name, building number, postal code, locality, country ISO code) before geocoding API calls are made.

For datasets where spaCy pattern coverage is insufficient — for example, highly irregular rural addresses or addresses in non-Latin scripts — route records with empty postal_code or house_number fields to a secondary parser such as libpostal, as described in Parsing European Address Conventions. For pipelines that call external geocoding APIs at scale, implementing fallback chains for failed lookups covers provider routing and retry logic that wraps this extraction layer.

When addresses contain special characters — diacritics, ligatures, or non-ASCII digits — apply the normalization patterns described in handling special characters in global address data before the spaCy pipeline sees the string.

FAQ

Should I use a blank spaCy model or a pre-trained one for address parsing?

For deterministic rule-based extraction, a blank model (spacy.blank("en")) is preferred: it skips the tagger, parser, and pre-trained NER components, giving faster throughput and predictable output. Use a pre-trained model (en_core_web_sm or better) only when you need the statistical NER layer to recover components the EntityRuler misses — and even then, set overwrite_ents=True so your address rules take priority.

Does the EntityRuler handle multi-token street names correctly?

Yes. Patterns are sequences of token matchers, so Rue de la Paix can be captured with a four-token pattern that anchors on the leading street-type token and allows any subsequent tokens before a postal code boundary. Avoid single-token LOWER/IN patterns for street names — they fragment compound names.

How do I handle German Straße vs Strasse in patterns?

Apply NFKC normalization before the spaCy pipeline runs, then include both straße and strasse in your LOWER IN list. For mixed-case inputs, add a REGEX token matcher: {"TEXT": {"REGEX": "(?i)stra[sß]e"}}.

What throughput does nlp.pipe() give compared to a plain loop?

nlp.pipe() is typically 5–15x faster on large datasets because it batches tokenization and avoids repeated Python object allocation per document. For 10 million addresses, set batch_size between 5,000 and 10,000 and n_process=-1 to use all available CPU cores.

Parsing European Address Conventions — the parent workflow covering country detection, postal-code validation, and the full normalization pipeline this extraction step feeds into.
Handling Special Characters in Global Address Data — normalization patterns for diacritics, ligatures, and non-ASCII characters that must run before this spaCy pipeline.
How to Parse Street Numbers and Suffixes with Regex — regex-only approach to house-number and suffix extraction, useful as a performance baseline or fallback when spaCy is not available.
Implementing Fallback Chains for Failed Lookups — how to route addresses that spaCy extraction cannot fully resolve to secondary geocoding providers.