Why does the pattern miss directional prefixes like '123 N MAIN ST'?

The core pattern expects the suffix directly after the number and its trailing whitespace. When a directional ('N', 'S', 'NE', 'SW') sits between them, the suffix alternation fails on it and the match is rejected. Add an optional directional group, (?:[NSEW]{1,2}\s+)?, before the suffix alternation to handle these cases. A street name between the directional and the suffix still needs its own segment.

Should I use re.match() or re.search()?

With the pattern as written, the two behave the same: the leading ^ pins the match to the start of the string, so re.search() cannot skip a leading apartment label or building name any more than re.match() can, and records like 'Suite 4B, 123 Main St' return no match from either call. To let re.search() find the number+suffix block later in the string, replace the leading ^ with \b.

How do I add more USPS suffix abbreviations without breaking the pattern?

Extend the non-capturing alternation group inside the 'suffix' capture group. List longer abbreviations before shorter ones (e.g., PKWY before PKW) to prevent the engine from matching a prefix and ignoring the remaining characters. Always add \b after the group to enforce word-boundary termination.

Is re.compile() faster than passing the raw pattern string each time?

Python's re module caches up to 512 compiled patterns internally, so repeated calls with the same string are fast — but explicit re.compile() at module level removes any cache-eviction risk in long-running workers and makes the intent visible to code reviewers and linters.

How to Parse Street Numbers and Suffixes with Regex

Compile a single named-capture-group regex to extract both the street number and the USPS suffix abbreviation from a raw US address string in one pass — part of the Regex Patterns for US Address Parsing workflow inside Core Address Parsing & Standardization.

The Production Regex

^(?P<number>\d+(?:[\s\-/]*\d+)?[A-Z]?)\s+(?P<suffix>(?:AVE|BLVD|CIR|CT|DR|EXPY|FWY|HWY|LN|PKWY|PL|RD|ST|TER|TRL|WAY)\.?)\b

Compile with re.IGNORECASE. The pattern captures the civic number and street type only. It contains no segment for the street name, so the suffix must follow the number after whitespace: 123 ST matches, but 123 Main St does not. This keeps the pattern narrow and limits backtracking on malformed strings. Extend it with a full-address pattern when you also need the street name, pre-directional, or secondary unit.

The suffix list is a small subset of the USPS Publication 28 standard abbreviations; Appendix C of Publication 28 runs to over 200 entries. EXPY and FWY are included because they appear frequently in logistics datasets. Add others as your data demands.

The diagram below shows how the two named groups map onto a typical address token sequence:

Pattern Breakdown

Segment	What it matches	Why it is necessary
`^`	Start anchor	Requires the civic number to be the first token; inputs with a leading building name or label do not match.
`(?P<number>\d+(?:[\s\-/]*\d+)?[A-Z]?)`	Integers (`123`), fractions (`1/2`), hyphenated ranges (`12-14`), alphanumeric codes (`45B`)	Covers the common civic number formats described in USPS Publication 28.
`\s+`	One or more whitespace characters	Separates the numeric block from the suffix; handles tabs as well as spaces.
`(?P<suffix>(?:AVE|BLVD|…)\.?)`	One listed suffix abbreviation, optionally followed by a period	Captures the street type under a named group.
`\b`	Word boundary	Prevents `ST` from matching the prefix of `STREET` or `STATION`.

The pattern intentionally has no street-name segment, so the alternation list stays short and backtracking stays minimal. The cost is that inputs with a name between the number and the suffix do not match. Adding a greedy .+ for the name between the number and suffix is possible but increases the backtracking surface; benchmark before deploying on dirty data.
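One way to add a name segment without a greedy .+ is a lazy run of whitespace-delimited tokens. The sketch below is an assumption, not part of the pattern above: STREET_REGEX_WITH_NAME and parse_with_name are illustrative names, and the token character class is a guess at what street names contain.

```python
import re

SUFFIXES = "AVE|BLVD|CIR|CT|DR|EXPY|FWY|HWY|LN|PKWY|PL|RD|ST|TER|TRL|WAY"

# Hypothetical variant: same number and suffix groups, plus a lazy,
# non-capturing run of name tokens between them.
STREET_REGEX_WITH_NAME = re.compile(
    r"^(?P<number>\d+(?:[\s\-/]*\d+)?[A-Z]?)\s+"
    r"(?:[A-Z0-9'.\-]+\s+)*?"  # skip as few name tokens as possible
    rf"(?P<suffix>(?:{SUFFIXES})\.?)\b",
    re.IGNORECASE,
)


def parse_with_name(address_line):
    """Return (number, suffix) in uppercase, or (None, None) on non-match."""
    m = STREET_REGEX_WITH_NAME.search(address_line.strip())
    if not m:
        return None, None
    return m.group("number").upper(), m.group("suffix").upper().rstrip(".")


print(parse_with_name("123 Main St"))      # ('123', 'ST')
print(parse_with_name("123 N Main St"))    # ('123', 'ST')
print(parse_with_name("123 Main Street"))  # (None, None)
print(parse_with_name("12 St Marks Pl"))   # ('12', 'ST')
```

The last call shows the trade-off of the lazy quantifier: the first suffix-like token wins, so St in 12 St Marks Pl is taken as the suffix. A greedy quantifier would prefer the last suffix-like token instead, at the cost of more backtracking.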

Minimal Runnable Python Implementation

import re
import pandas as pd
from typing import Optional, Tuple

# Compile once at module level — thread-safe, no per-call overhead
STREET_REGEX = re.compile(
    r"^(?P<number>\d+(?:[\s\-/]*\d+)?[A-Z]?)\s+"
    r"(?P<suffix>(?:AVE|BLVD|CIR|CT|DR|EXPY|FWY|HWY|LN|PKWY|PL|RD|ST|TER|TRL|WAY)\.?)\b",
    re.IGNORECASE,
)


def parse_street_components(
    address_line: str,
) -> Tuple[Optional[str], Optional[str]]:
    """Extract the civic number and USPS suffix from a raw address string.

    Args:
        address_line: A single-line US street address, e.g. '1204B Main St'.

    Returns:
        A (number, suffix) tuple, both normalized to uppercase with internal
        whitespace collapsed. Returns (None, None) on non-match.
    """
    if not address_line:
        return None, None

    match = STREET_REGEX.search(address_line.strip())
    if not match:
        return None, None

    number = re.sub(r"\s+", "", match.group("number")).upper()
    suffix = match.group("suffix").upper().rstrip(".")
    return number, suffix


def vectorized_parse(
    df: pd.DataFrame, col: str = "street_line"
) -> pd.DataFrame:
    """Vectorized extraction across a DataFrame column using str.extract.

    str.extract applies the regex to the whole column in one call and is
    usually faster than .apply() with a Python function on large datasets.
    Named groups in STREET_REGEX become column names automatically;
    non-matching rows yield NaN in both columns.
    """
    parsed = df[col].str.extract(STREET_REGEX)
    parsed["number"] = parsed["number"].str.replace(r"\s+", "", regex=True).str.upper()
    parsed["suffix"] = parsed["suffix"].str.upper().str.rstrip(".")
    return pd.concat([df, parsed], axis=1)

str.extract applies the compiled pattern across the whole column in a single call, so you get precompilation and column-wise iteration together. Avoid .apply(parse_street_components) unless you need the (None, None) fallback behavior row by row. The vectorized path is often several times faster on DataFrames with more than a few thousand rows, but the margin depends on the pandas version and the column dtype, so benchmark on your own data.

Edge Cases and Failure Modes

Directional prefixes between number and suffix

Input 123 N MAIN ST fails because the pattern expects the suffix to follow the whitespace directly after the number. If your dataset consistently includes pre-directionals, the variant below adds an optional directional group. It accepts a directional sitting directly before the suffix (123 N ST); a street name between the directional and the suffix still needs its own segment:

STREET_REGEX_WITH_DIR = re.compile(
    r"^(?P<number>\d+(?:[\s\-/]*\d+)?[A-Z]?)\s+"
    r"(?:(?P<pre_dir>[NSEW]{1,2})\s+)?"  # optional: N, S, NE, SW, etc.
    r"(?P<suffix>(?:AVE|BLVD|CIR|CT|DR|EXPY|FWY|HWY|LN|PKWY|PL|RD|ST|TER|TRL|WAY)\.?)\b",
    re.IGNORECASE,
)

For handling PO Boxes and rural routes, use a separate upstream filter — those records lack a civic number entirely and should never reach this pattern.
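A minimal sketch of such an upstream filter follows. The prefixes it recognizes (PO Box, Post Office Box, RR, Rural Route, HC) and the names NON_STREET and is_non_street_record are assumptions for illustration; check them against the formats that actually occur in your data.

```python
import re

# Hypothetical filter: flags lines that start with a PO Box or a
# rural/highway-contract route designator instead of a civic number.
NON_STREET = re.compile(
    r"^\s*(?:"
    r"P\.?\s*O\.?\s*BOX"           # PO Box, P.O. Box, P O Box
    r"|POST\s+OFFICE\s+BOX"
    r"|RR\s*\d+"                   # RR 2
    r"|RURAL\s+ROUTE\s*\d+"
    r"|HC\s*\d+"                   # HC 68
    r")\b",
    re.IGNORECASE,
)


def is_non_street_record(address_line: str) -> bool:
    """True when the line should bypass the street-number pattern."""
    return bool(NON_STREET.match(address_line))


print(is_non_street_record("P.O. Box 9"))    # True
print(is_non_street_record("RR 2 Box 152"))  # True
print(is_non_street_record("123 Main St"))   # False
```

Records flagged here can be routed to a dedicated PO Box or rural route parser rather than being counted as extraction failures.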

Missing whitespace between number and suffix

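OCR pipelines and legacy batch exports sometimes strip the space between the number and the street name: 123MAINST. The \s+ requirement in the pattern correctly rejects these, surfacing them as (None, None) for routing to a repair step. The preprocessing regex below inserts a space after a digit when the letters that follow end in a known suffix, turning 123MAINST into 123 MAINST. It separates the number only; the street name and suffix stay fused and need their own split: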

INJECT_SPACE = re.compile(
    r"(\d)(?=[A-Z]{2,}(?:AVE|BLVD|ST|RD|DR|LN|CT|PL|WAY|TRL|TER|CIR|HWY|PKWY)\b)",
    re.IGNORECASE,
)

def repair_missing_space(raw: str) -> str:
    return INJECT_SPACE.sub(r"\1 ", raw)

Apply repair_missing_space before calling parse_street_components.
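The following check restates the repair function so it runs on its own and shows what it changes and what it leaves alone. The sample inputs are made up for illustration.

```python
import re

INJECT_SPACE = re.compile(
    r"(\d)(?=[A-Z]{2,}(?:AVE|BLVD|ST|RD|DR|LN|CT|PL|WAY|TRL|TER|CIR|HWY|PKWY)\b)",
    re.IGNORECASE,
)


def repair_missing_space(raw: str) -> str:
    # Insert one space after a digit that is followed by at least two
    # letters and then a known suffix at a word boundary.
    return INJECT_SPACE.sub(r"\1 ", raw)


print(repair_missing_space("123MAINST"))    # 123 MAINST
print(repair_missing_space("123 MAIN ST"))  # 123 MAIN ST (unchanged)
print(repair_missing_space("45BPINEAVE"))   # 45 BPINEAVE
print(repair_missing_space("123ST"))        # 123ST (unchanged)
```

Two limits are visible here. A unit letter fused to the number (45B) ends up on the name side of the inserted space, and an input with no name at all (123ST) is untouched because the lookahead requires at least two letters before the suffix.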

Alphanumeric unit codes that look like directionals

The [A-Z]? at the end of the number group absorbs a trailing letter, so 45B is captured whole as the number. However, 45 B (with a space before B) causes the number group to stop at 45, leaving B as a separate token. Collapse space-separated unit suffixes (45 B → 45B) in the normalization step before extraction:

# Letters allowed as unit codes: all except E, I, N, O, S, W
UNIT_SUFFIX_SPACE = re.compile(r"(\b\d+)\s+([A-DF-HJ-MP-RT-VX-Z])\b")

def collapse_unit_suffix(raw: str) -> str:
    """Join '45 B' → '45B', avoiding false collapse of directionals like '45 N'."""
    return UNIT_SUFFIX_SPACE.sub(r"\1\2", raw)

The character class excludes the single-letter directionals (N, S, E, W) to reduce false collapses, and also omits I and O. The match is case-sensitive, so uppercase the input first.
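As a cross-check, the same idea can be written with the excluded letters spelled out, which makes the intent easier to review than hand-written ranges. This is a sketch under assumptions: EXCLUDED, UNIT_LETTER_SPACE, and collapse_unit_letter are illustrative names, the I and O omission mirrors the class above, and re.IGNORECASE is added so lowercase input also collapses.

```python
import re
import string

# Letters that must not be glued onto the number: the single-letter
# directionals, plus I and O.
EXCLUDED = "NSEWIO"
UNIT_LETTERS = "".join(c for c in string.ascii_uppercase if c not in EXCLUDED)

UNIT_LETTER_SPACE = re.compile(
    rf"\b(\d+)\s+([{UNIT_LETTERS}])\b",
    re.IGNORECASE,
)


def collapse_unit_letter(raw: str) -> str:
    """Join '45 B' into '45B' but leave '45 N' alone."""
    return UNIT_LETTER_SPACE.sub(r"\1\2", raw)


print(collapse_unit_letter("45 B PINE AVE"))  # 45B PINE AVE
print(collapse_unit_letter("45 N PINE AVE"))  # 45 N PINE AVE
print(collapse_unit_letter("45 b pine ave"))  # 45b pine ave
```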

Integration Note

This pattern is the first extraction step in a layered Regex Patterns for US Address Parsing pipeline: sanitize input → extract number and suffix → extract street name → extract secondary unit → validate. The civic number and suffix you extract here are passed directly to downstream validation — either a USPS-certified engine as described in Step-by-Step Guide to CASS Address Validation or a geocoding API that expects pre-parsed components. Keeping the extraction regex narrow and the normalization logic in a separate function means you can swap in an extended pattern (to add the street name group) without touching the validation layer.

If your pipeline processes international records alongside US addresses, apply a locale detection step first — this pattern is strictly US-centric and will produce incorrect results on UK, Canadian, or Australian civic numbers where the format differs fundamentally from USPS conventions.

FAQ

Does `re.search()` produce better results than `re.match()` here?

re.match() requires the pattern to match at the very start of the string. Because STREET_REGEX begins with ^, re.search() is subject to the same restriction: without re.MULTILINE, ^ matches only at position 0, so a record with a building name or suite label before the civic number (for example Suite 4B, 123 Main St) returns None from both calls. To let re.search() scan forward to the number+suffix block, replace the leading ^ with \b; re.search() then handles those prefixes while re.match() still fails on them.
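A quick way to see how the leading ^ interacts with the two calls is to run both against a prefixed record. The sketch uses a shortened suffix list and a simplified number group; anchored and unanchored are illustrative names.

```python
import re

SUFFIXES = r"(?:AVE|BLVD|RD|ST)"  # shortened list for the demo

anchored = re.compile(
    rf"^(?P<number>\d+[A-Z]?)\s+(?P<suffix>{SUFFIXES}\.?)\b", re.IGNORECASE
)
unanchored = re.compile(
    rf"\b(?P<number>\d+[A-Z]?)\s+(?P<suffix>{SUFFIXES}\.?)\b", re.IGNORECASE
)

line = "Suite 4B, 123 ST"

print(anchored.match(line))     # None
print(anchored.search(line))    # None: ^ still pins the match to position 0
print(unanchored.match(line))   # None: match() never scans forward

found = unanchored.search(line)
print(found.group("number"), found.group("suffix"))  # 123 ST
```

Only the combination of an unanchored pattern and search() skips the suite label.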

How should I handle suffix abbreviations not in the list?

Maintain a lookup dictionary that maps full street type words to their USPS abbreviations, run it over the address before calling the regex, and add the abbreviated form to the pattern’s alternation group. For example, map STREET → ST, AVENUE → AVE, BOULEVARD → BLVD. This keeps the regex list manageable and puts full-word expansion in a testable, version-controlled dict.
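The dictionary step might look like the sketch below. SUFFIX_MAP holds only a handful of illustrative entries, and abbreviate_suffixes is a hypothetical helper name; a real table would be built from the Publication 28 suffix list.

```python
import re

# Illustrative subset of full street-type words and their abbreviations.
SUFFIX_MAP = {
    "STREET": "ST",
    "AVENUE": "AVE",
    "BOULEVARD": "BLVD",
    "DRIVE": "DR",
    "ROAD": "RD",
    "LANE": "LN",
}

# One alternation built from the dict keys, longest words first.
FULL_WORD = re.compile(
    r"\b(" + "|".join(sorted(SUFFIX_MAP, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def abbreviate_suffixes(raw: str) -> str:
    """Replace each full street-type word with its abbreviation."""
    return FULL_WORD.sub(lambda m: SUFFIX_MAP[m.group(1).upper()], raw)


print(abbreviate_suffixes("123 Main Street"))     # 123 Main ST
print(abbreviate_suffixes("45 Ocean Boulevard"))  # 45 Ocean BLVD
print(abbreviate_suffixes("9 Lane Avenue"))       # 9 LN AVE
```

The last call shows a limitation: every occurrence is replaced, so a street whose name is itself a street-type word (Lane) gets abbreviated too. Restricting the replacement to the final token is one possible refinement.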

Does `re.IGNORECASE` hurt performance?

For a short, ASCII-only pattern like this one, re.IGNORECASE typically adds little overhead. Case-insensitive matching can cost more when the pattern or input contains non-ASCII characters, but US address data is overwhelmingly ASCII. Keep the flag: removing it to “optimize” forces uppercase normalization on the caller and is error-prone in practice. If throughput matters, time both variants on your own data.

When should I use `.str.extract()` versus `.apply()`?

Use .str.extract() whenever you want both captured groups as new DataFrame columns — it is the fastest path. Switch to .apply(parse_street_components) only if you need explicit (None, None) sentinel values that differ from pandas’ default NaN, or when you need to apply multi-step preprocessing logic (repair + extract) atomically per row.

Regex Patterns for US Address Parsing — the parent reference covering the full layered extraction workflow, prerequisite checklist, and suffix-table spec.
Handling PO Boxes and Rural Routes — filter these record types out upstream; they carry no civic number and will always produce (None, None) with this pattern.
Python Script to Extract PO Box Numbers — the companion extraction approach for non-street address records, using the same compile-once pattern style.
Step-by-Step Guide to CASS Address Validation — where the extracted number and suffix feed next: pre-validation normalization and certified API submission.
Handling Special Characters in Global Address Data — apply NFKC normalization before this regex if your source data contains accented characters or Unicode lookalikes that can corrupt byte boundaries.