How to Parse Street Numbers and Suffixes with Regex

Compile a single named-capture-group regex to extract both the street number and the USPS suffix abbreviation from a raw US address string in one pass — part of the Regex Patterns for US Address Parsing workflow inside Core Address Parsing & Standardization.

The Production Regex

^(?P<number>\d+(?:[\s\-/]*\d+)?[A-Z]?)\s+(?P<suffix>(?:AVE|BLVD|CIR|CT|DR|EXPY|FWY|HWY|LN|PKWY|PL|RD|ST|TER|TRL|WAY)\.?)\b

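Compile with re.IGNORECASE. The pattern captures the civic number and street type only. It deliberately contains no street-name segment, which keeps it narrow and avoids catastrophic backtracking on malformed strings. The trade-off is that, as written, the suffix must follow the number directly after the whitespace: 12-14 RD matches, while 1204B MAIN ST does not until a street-name segment is placed between the two groups. Extend it with a full-address pattern when you also need the street name, pre-directional, or secondary unit.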

The suffix list is a subset of the USPS Publication 28 standard abbreviations. Appendix C of Publication 28 runs to over 200 entries; this pattern keeps a short working set and includes EXPY and FWY because they appear frequently in logistics datasets. Add others as your data demands.
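One way to keep the list easy to extend is to build the alternation from a tuple of abbreviations. The sketch below is an assumption layered on the pattern above, not part of it: BASE_SUFFIXES and build_street_regex are hypothetical names, and XING (the Publication 28 abbreviation for CROSSING) is only an example of an added entry.

```python
import re
from typing import Iterable

# Hypothetical helper names; the suffixes mirror the alternation shown above.
BASE_SUFFIXES = (
    "AVE", "BLVD", "CIR", "CT", "DR", "EXPY", "FWY", "HWY",
    "LN", "PKWY", "PL", "RD", "ST", "TER", "TRL", "WAY",
)


def build_street_regex(extra_suffixes: Iterable[str] = ()) -> re.Pattern:
    """Compile the number+suffix pattern with optional extra abbreviations."""
    merged = set(BASE_SUFFIXES) | {s.strip().upper() for s in extra_suffixes}
    # Longest first, so a short entry is never tried before a longer one
    # that shares its prefix.
    ordered = sorted(merged, key=lambda s: (-len(s), s))
    alternation = "|".join(re.escape(s) for s in ordered)
    return re.compile(
        r"^(?P<number>\d+(?:[\s\-/]*\d+)?[A-Z]?)\s+"
        rf"(?P<suffix>(?:{alternation})\.?)\b",
        re.IGNORECASE,
    )


# Example: add the abbreviation for CROSSING to the working set.
street_regex = build_street_regex(["XING"])
```

Building the pattern once at import time keeps the compile cost out of the per-record path, in the same way as the module-level constant used in the implementation.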


The diagram below shows how the two named groups map onto a typical address token sequence:

[Figure: Regex capture group mapping for a US street address. The string '1204B MAIN ST' is split into three labeled token regions: the 'number' group covering '1204B', an unlabeled street name region covering 'MAIN', and the 'suffix' group covering 'ST'. Named groups are highlighted; the street name passes through uncaptured.]

Pattern Breakdown

| Segment | What it matches | Why it is necessary |
| --- | --- | --- |
| `^` | Start anchor | Prevents false positives when the input contains a leading building name or label. |
| `(?P<number>\d+(?:[\s\-/]*\d+)?[A-Z]?)` | Integers (123), fractions (1/2), hyphenated ranges (12-14), alphanumeric codes (45B) | Covers the most common civic number formats described in USPS Publication 28. |
| `\s+` | One or more whitespace characters | Separates the numeric block from the token that follows; handles tabs as well as spaces. |
| `(?P<suffix>(?:AVE\|BLVD\|…)\.?)` | One of the listed suffix abbreviations, optionally followed by a period | Captures the street type and tolerates a trailing period, as in ST. |
| `\b` | Word boundary | Prevents ST from matching the prefix of STREET or STATION. |

The pattern intentionally leaves out the street name, so the alternation list stays short and the engine has very little to backtrack over. Adding a greedy .+ for the name between the number and suffix is possible, but it increases the backtracking surface, so benchmark before deploying on dirty data.
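If the street name needs to be skipped rather than captured, one option is a lazy, non-capturing run of whitespace-delimited tokens between the two groups instead of a greedy .+. The sketch below is an assumption, not the pattern above: EXTENDED_STREET_REGEX and parse_with_name are hypothetical names. Because the run is lazy, the suffix group binds to the first suffix-like token, so 123 St Charles Ave yields ST rather than AVE.

```python
import re
from typing import Optional, Tuple

SUFFIXES = "AVE|BLVD|CIR|CT|DR|EXPY|FWY|HWY|LN|PKWY|PL|RD|ST|TER|TRL|WAY"

# Hypothetical extension: same two named groups, plus an uncaptured
# street-name segment between them.
EXTENDED_STREET_REGEX = re.compile(
    r"^(?P<number>\d+(?:[\s\-/]*\d+)?[A-Z]?)\s+"
    r"(?:\S+\s+)*?"  # zero or more name tokens, skipped lazily
    rf"(?P<suffix>(?:{SUFFIXES})\.?)\b",
    re.IGNORECASE,
)


def parse_with_name(address_line: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (number, suffix) while tolerating a street name in between."""
    match = EXTENDED_STREET_REGEX.search(address_line.strip())
    if not match:
        return None, None
    number = re.sub(r"\s+", "", match.group("number")).upper()
    suffix = match.group("suffix").upper().rstrip(".")
    return number, suffix


print(parse_with_name("1204B Main St"))
print(parse_with_name("123 N Main St"))
```

The token run only ever consumes non-whitespace followed by whitespace, which keeps the two halves of each repetition disjoint; it is still worth timing against a sample of malformed records before relying on it.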

Minimal Runnable Python Implementation

import re
import pandas as pd
from typing import Optional, Tuple

# Compile once at module level — thread-safe, no per-call overhead
STREET_REGEX = re.compile(
    r"^(?P<number>\d+(?:[\s\-/]*\d+)?[A-Z]?)\s+"
    r"(?P<suffix>(?:AVE|BLVD|CIR|CT|DR|EXPY|FWY|HWY|LN|PKWY|PL|RD|ST|TER|TRL|WAY)\.?)\b",
    re.IGNORECASE,
)


def parse_street_components(
    address_line: str,
) -> Tuple[Optional[str], Optional[str]]:
    """Extract the civic number and USPS suffix from a raw address string.

    Args:
        address_line: A single-line US street address, e.g. '1204B Main St'.

    Returns:
        A (number, suffix) tuple, both normalized to uppercase with internal
        whitespace collapsed. Returns (None, None) on non-match.
    """
    if not address_line:
        return None, None

    match = STREET_REGEX.search(address_line.strip())
    if not match:
        return None, None

    number = re.sub(r"\s+", "", match.group("number")).upper()
    suffix = match.group("suffix").upper().rstrip(".")
    return number, suffix


def vectorized_parse(
    df: pd.DataFrame, col: str = "street_line"
) -> pd.DataFrame:
    """Vectorized extraction across a DataFrame column using str.extract.

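    str.extract applies the pattern to the whole column in one call, which
    avoids a per-row Python function call and is usually faster than .apply()
    on large datasets. Named groups in STREET_REGEX become column names
    automatically; non-matching rows yield NaN in both columns.
    """
    # str.extract documents `pat` as a string, so pass the pattern source and
    # the flag explicitly instead of the compiled object.
    parsed = df[col].str.extract(STREET_REGEX.pattern, flags=re.IGNORECASE)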
    parsed["number"] = parsed["number"].str.replace(r"\s+", "", regex=True).str.upper()
    parsed["suffix"] = parsed["suffix"].str.upper().str.rstrip(".")
    return pd.concat([df, parsed], axis=1)

str.extract applies the pattern to the whole column in a single call and returns the named groups as columns, so no per-row Python function is needed. Avoid .apply(parse_street_components) unless you need the (None, None) fallback behavior row by row. The vectorized path is usually faster on DataFrames with more than a few thousand rows, but the margin depends on the pandas version and the column's string dtype, so benchmark on a representative sample rather than assuming a fixed factor.

Edge Cases and Failure Modes

Directional prefixes between number and suffix

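Input 123 N MAIN ST fails because the pattern expects the suffix to follow the whitespace directly after the number, and N sits in that position instead. If your dataset consistently includes pre-directionals, add an optional directional segment after the number. Two caveats apply to the variant below: [NSEW]{1,2} also admits non-directional pairs such as NN or EW, and the pattern still has no street-name segment, so it matches 123 N ST but not 123 N MAIN ST until a name segment is added as well: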

STREET_REGEX_WITH_DIR = re.compile(
    r"^(?P<number>\d+(?:[\s\-/]*\d+)?[A-Z]?)\s+"
    r"(?:(?P<pre_dir>[NSEW]{1,2})\s+)?"  # optional: N, S, NE, SW, etc.
    r"(?P<suffix>(?:AVE|BLVD|CIR|CT|DR|EXPY|FWY|HWY|LN|PKWY|PL|RD|ST|TER|TRL|WAY)\.?)\b",
    re.IGNORECASE,
)

For handling PO Boxes and rural routes, use a separate upstream filter — those records lack a civic number entirely and should never reach this pattern.
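A minimal sketch of such an upstream filter follows. The name NON_CIVIC, the helper is_non_civic, and the set of prefixes it recognizes (PO Box variants, RR, Rural Route, HC) are assumptions chosen for illustration; real data will likely need more spellings.

```python
import re

# Hypothetical pre-filter: flags lines that start with a PO Box,
# rural route, or highway contract designator.
NON_CIVIC = re.compile(
    r"^\s*(?:P\.?\s*O\.?\s*BOX|POST\s+OFFICE\s+BOX|RR|RURAL\s+ROUTE|HC)\b",
    re.IGNORECASE,
)


def is_non_civic(address_line: str) -> bool:
    """True when the line should bypass the street number+suffix pattern."""
    return bool(NON_CIVIC.match(address_line))


lines = ["PO Box 123", "RR 2 Box 152", "123 Main St"]
civic_only = [line for line in lines if not is_non_civic(line)]
```

Running the filter first keeps the (None, None) result from the street pattern meaningful: a non-match then signals a malformed street address rather than a record type the pattern was never meant to handle.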

Missing whitespace between number and suffix

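OCR pipelines and legacy batch exports sometimes strip the spaces between the number, the street name, and the suffix: 123MAINST. The \s+ requirement in the pattern correctly rejects these, surfacing them as (None, None) for routing to a repair step. The preprocessing regex below inserts a space after a digit that is immediately followed by a run of letters ending in a known suffix, turning 123MAINST into 123 MAINST. The suffix stays attached to the street name and must still be split off before extraction: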

INJECT_SPACE = re.compile(
    r"(\d)(?=[A-Z]{2,}(?:AVE|BLVD|ST|RD|DR|LN|CT|PL|WAY|TRL|TER|CIR|HWY|PKWY)\b)",
    re.IGNORECASE,
)

def repair_missing_space(raw: str) -> str:
    return INJECT_SPACE.sub(r"\1 ", raw)

Apply repair_missing_space before calling parse_street_components.
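For fully glued input, a repair step can also separate the trailing suffix from the name. The following is a hedged sketch rather than a drop-in replacement: SPLIT_GLUED and split_glued_address are hypothetical names, and the approach assumes the number, name, and suffix form one unbroken run at the start of the line. It will mis-split names that merely end in suffix letters (12WEST becomes 12 WE ST), so route its output through review or a whitelist where that matters.

```python
import re

# Hypothetical repair pattern: digits, then a lazy run of at least two
# letters, then a known suffix that ends the letter run.
SPLIT_GLUED = re.compile(
    r"^(\d+)([A-Z]{2,}?)"
    r"(AVE|BLVD|CIR|CT|DR|EXPY|FWY|HWY|LN|PKWY|PL|RD|ST|TER|TRL|WAY)\b",
    re.IGNORECASE,
)


def split_glued_address(raw: str) -> str:
    """Insert spaces around the street name in input like '123MAINST'."""
    return SPLIT_GLUED.sub(r"\1 \2 \3", raw.strip())


print(split_glued_address("123MAINST"))
print(split_glued_address("123 MAIN ST"))  # already spaced: left unchanged
```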

Alphanumeric unit codes that look like directionals

In 45B PINE AVE, [A-Z]? in the number group absorbs the trailing letter, so the number group covers 45B. In 45 B PINE AVE, with a space before B, the number group stops at 45 and B is left outside it, where it is neither part of the number nor a recognized suffix. Collapse space-separated unit suffixes (45 B → 45B) in the normalization step before extraction:

# Case-sensitive: assumes the input has already been uppercased.
UNIT_SUFFIX_SPACE = re.compile(r"(\b\d+)\s+([A-DF-HJ-MP-RT-VX-Z])\b")

def collapse_unit_suffix(raw: str) -> str:
    """Join '45 B' → '45B', avoiding false collapse of directionals like '45 N'."""
    return UNIT_SUFFIX_SPACE.sub(r"\1\2", raw)

The character class excludes the single-letter directionals N, S, E, and W to reduce false collapses. It also omits I and O, which are easily mistaken for the digits 1 and 0.

Integration Note

This pattern is the first extraction step in a layered Regex Patterns for US Address Parsing pipeline: sanitize input → extract number and suffix → extract street name → extract secondary unit → validate. The civic number and suffix you extract here are passed directly to downstream validation — either a USPS-certified engine as described in Step-by-Step Guide to CASS Address Validation or a geocoding API that expects pre-parsed components. Keeping the extraction regex narrow and the normalization logic in a separate function means you can swap in an extended pattern (to add the street name group) without touching the validation layer.

If your pipeline processes international records alongside US addresses, apply a locale detection step first — this pattern is strictly US-centric and will produce incorrect results on UK, Canadian, or Australian civic numbers where the format differs fundamentally from USPS conventions.

FAQ

Does re.search() produce better results than re.match() here?

re.match() requires the pattern to match at the very start of the string. Because this pattern begins with ^, re.search() is bound by the same constraint: for an input with a building name or suite label before the civic number, for example Suite 4B, 123 Main St, both calls return None. To tolerate such prefixes, strip them upstream or use a variant of the pattern that replaces ^ with \b. With that variant, re.search() can scan forward to the number+suffix block, while re.match() still only tries position 0.
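Whether the two calls differ depends on the ^ anchor. The check below uses a simplified stand-in for the pattern (digits-only number group, two-entry suffix list) and the hypothetical names ANCHORED and UNANCHORED, purely to isolate the anchoring behavior:

```python
import re

# Simplified stand-ins for the production pattern, for illustration only.
ANCHORED = re.compile(r"^(?P<number>\d+)\s+(?P<suffix>ST|AVE)\b", re.IGNORECASE)
UNANCHORED = re.compile(r"\b(?P<number>\d+)\s+(?P<suffix>ST|AVE)\b", re.IGNORECASE)

LINE = "Suite 4B, 123 St"

# With ^, search() cannot start anywhere but position 0, so it behaves
# like match() on a single-line string.
anchored_match = ANCHORED.match(LINE)
anchored_search = ANCHORED.search(LINE)

# Without ^, only search() is free to scan past the leading label.
unanchored_match = UNANCHORED.match(LINE)
unanchored_search = UNANCHORED.search(LINE)

print(anchored_match, anchored_search, unanchored_match)
print(unanchored_search.group("number"), unanchored_search.group("suffix"))
```

Dropping the anchor trades the false-positive protection described in the pattern breakdown for prefix tolerance, so it is a choice to make per dataset rather than a default.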

How should I handle suffix abbreviations not in the list?

Maintain a lookup dictionary that maps full street type words to their USPS abbreviations, run it over the address before calling the regex, and make sure each abbreviated form appears in the pattern’s alternation group. For example, map STREET → ST, AVENUE → AVE, BOULEVARD → BLVD. This keeps the regex list manageable and puts full-word expansion in a testable, version-controlled dict.
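A minimal sketch of that lookup step is shown below. SUFFIX_MAP, FULL_WORD, and abbreviate_suffix are hypothetical names, the map holds only a handful of entries, and the choice to rewrite only the last street-type word in the line (so that 45 Court Street keeps Court as the name) is an assumption about typical data, not a rule from the article.

```python
import re

# Hypothetical, deliberately small map of full words to USPS abbreviations.
SUFFIX_MAP = {
    "STREET": "ST",
    "AVENUE": "AVE",
    "BOULEVARD": "BLVD",
    "DRIVE": "DR",
    "ROAD": "RD",
    "LANE": "LN",
    "COURT": "CT",
}

FULL_WORD = re.compile(r"\b(" + "|".join(SUFFIX_MAP) + r")\b", re.IGNORECASE)


def abbreviate_suffix(raw: str) -> str:
    """Replace the last full street-type word with its abbreviation."""
    matches = list(FULL_WORD.finditer(raw))
    if not matches:
        return raw
    last = matches[-1]
    abbreviation = SUFFIX_MAP[last.group(1).upper()]
    return raw[: last.start()] + abbreviation + raw[last.end():]


print(abbreviate_suffix("123 Main Street"))
```

Keeping the map as plain data means new entries can be added and unit-tested without touching the regex itself.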

Does re.IGNORECASE hurt performance?

For a short ASCII pattern like this one, re.IGNORECASE typically adds little overhead in CPython, and US address data is overwhelmingly ASCII. The flag can cost more on patterns that contain non-ASCII characters. Keep the flag: removing it to “optimize” forces uppercase normalization on the caller and is error-prone in practice.
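If the cost matters for your workload, it is easy to measure rather than assume. The sketch below times the pattern with and without the flag using the standard timeit module; the sample strings, the helper name time_pattern, and the repeat counts are arbitrary assumptions, and the numbers it prints will vary by machine and Python version.

```python
import re
import timeit

PATTERN = (
    r"^(?P<number>\d+(?:[\s\-/]*\d+)?[A-Z]?)\s+"
    r"(?P<suffix>(?:AVE|BLVD|CIR|CT|DR|EXPY|FWY|HWY|LN|PKWY|PL|RD|ST|TER|TRL|WAY)\.?)\b"
)
WITH_FLAG = re.compile(PATTERN, re.IGNORECASE)
WITHOUT_FLAG = re.compile(PATTERN)

# Uppercase samples so both variants see identical, matchable input.
SAMPLES = ["12-14 RD", "45B AVE", "1204B MAIN ST", "PO BOX 9"] * 250


def time_pattern(compiled: re.Pattern, repeat: int = 5, number: int = 20) -> float:
    """Best wall-clock time over `repeat` runs of `number` passes over SAMPLES."""
    def run() -> None:
        for sample in SAMPLES:
            compiled.search(sample)
    return min(timeit.repeat(run, repeat=repeat, number=number))


print(f"IGNORECASE: {time_pattern(WITH_FLAG):.4f}s")
print(f"no flag:    {time_pattern(WITHOUT_FLAG):.4f}s")
```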

When should I use .str.extract() versus .apply()?

Use .str.extract() whenever you want both captured groups as new DataFrame columns; it is the most direct path and usually the faster one. Switch to .apply(parse_street_components) only if you need explicit (None, None) sentinel values that differ from pandas’ default NaN, or when you need to apply multi-step preprocessing logic (repair + extract) atomically per row.