Python Script to Extract PO Box Numbers

Compile a case-insensitive regex for the common PO Box spellings once at module level, repair OCR artifacts in the prefix, capture the box identifier in a typed function, and apply it across a pandas Series; full details are under Handling PO Boxes and Rural Routes.

Production Regex Pattern

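The pattern below matches the common written variants of the post office box designator (PO Box, P.O. Box, Post Office Box, or a bare Box) and captures the identifier that follows. USPS Publication 28 standardizes the output form as PO BOX followed by the number; the regex exists to tolerate the messier forms found in raw input. Compile it once at module level and reuse the compiled object across calls.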

import re
from typing import Optional

# Compile once at module level and reuse across all calls in the process
PO_BOX_PATTERN = re.compile(
    r"""
    \b                              # Word boundary: prevent partial matches
    (?:                             # Optional PO prefix in front of BOX
        p(?:ost)?\.?\s*             #   P, P., Post
        o(?:ffice)?\.?              #   O, O., Office
        \s*-?\s*                    #   Space or hyphen before BOX (PO-Box)
    )?
    box                             # Required BOX keyword (guarded by \b above)
    \s*                             # Optional whitespace between keyword and id
    (?:[.\-]\s*)?                   # Optional delimiter: dash or period + trailing space
    ([A-Z0-9]+(?:-[A-Z0-9]+)*)      # Capture group: alphanumeric id, internal hyphens kept
    \b                              # Trailing word boundary
    """,
    re.IGNORECASE | re.VERBOSE,
)

Pattern Breakdown

Component | What it matches | Why it is necessary
\b (leading) | Word boundary before the prefix | Prevents matching BOX inside MAILBOX or INBOX
(?:p(?:ost)?\.?\s*o(?:ffice)?\.?\s*-?\s*)? | Optional prefix: PO, P.O., P O, Post Office, PostOffice, PO- | Accepts the common spellings of the prefix, with or without periods
box | Required BOX keyword | Anchors the match and handles entries where the PO prefix is entirely absent
\s*(?:[.\-]\s*)? | Optional whitespace and delimiter | Tolerates Box 123, Box-123, Box.123, and similar inconsistencies
([A-Z0-9]+(?:-[A-Z0-9]+)*) | Capturing group: the box identifier | Preserves ids like 12A or 4B-2; a hyphen is kept only between alphanumeric runs
\b (trailing) | Word boundary after the identifier | Ensures the identifier ends on a complete alphanumeric token

The leading \b protects against matching BOX inside longer words. Making the prefix optional but the BOX keyword required means PO Box 12A yields 12A rather than the word Box itself. The identifier is matched greedily as alphanumeric runs joined by single hyphens, so 4B-2 is captured whole while commas, semicolons, or line terminators that follow a box number are left out.
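
A short, self-contained experiment makes the two boundary decisions concrete. The miniature patterns below are simplified stand-ins for illustration only, not the production pattern: one pair contrasts a guarded and an unguarded BOX keyword, the other contrasts a lazy and a greedy identifier group in front of a trailing \b. The helper name first_group is hypothetical.

```python
import re
from typing import Optional

# Simplified stand-in patterns for illustration; not the production regex.
UNGUARDED = re.compile(r"box\s*(\d+)", re.IGNORECASE)
GUARDED = re.compile(r"\bbox\s*(\d+)", re.IGNORECASE)

LAZY_ID = re.compile(r"\bbox\s*([A-Z0-9][A-Z0-9\-]*?)\b", re.IGNORECASE)
GREEDY_ID = re.compile(r"\bbox\s*([A-Z0-9]+(?:-[A-Z0-9]+)*)\b", re.IGNORECASE)


def first_group(pattern: re.Pattern, text: str) -> Optional[str]:
    """Return capture group 1 of the first match, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Without the leading \b the keyword is found inside MAILBOX.
print(first_group(UNGUARDED, "MAILBOX 12"))   # 12
print(first_group(GUARDED, "MAILBOX 12"))     # None

# A lazy group stops at the first word boundary, which here is the hyphen.
print(first_group(LAZY_ID, "Box 4B-2, Springfield"))    # 4B
print(first_group(GREEDY_ID, "Box 4B-2, Springfield"))  # 4B-2
```

The lazy form truncates hyphenated identifiers because a word boundary exists between a letter and a hyphen; the greedy form only accepts a hyphen when another alphanumeric run follows it.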

Minimal Runnable Implementation

The following function wraps the compiled pattern with type safety, OCR pre-cleaning, and deterministic normalization. It returns None for non-matching inputs so it chains safely in data pipelines without raising exceptions.

# OCR repairs for digit zero read in place of letter O. Each rule is confined
# to a prefix token, so box ids such as B07 or P05 are never rewritten.
_OCR_FIXES = (
    # B-zero-X → BOX
    (re.compile(r"\bB0X\b", re.IGNORECASE), "BOX"),
    # P-zero → PO, only when the BOX keyword follows
    (re.compile(r"\bP(\.?\s*)0(?=\.?\s*-?\s*BOX\b)", re.IGNORECASE), r"P\g<1>O"),
)


def extract_po_box(address: str) -> Optional[str]:
    """Extract a PO Box identifier from a raw address string.

    Args:
        address: Raw address line that may contain PO Box notation.

    Returns:
        Normalized identifier (uppercase, no trailing punctuation)
        or None if no PO Box token is found.
    """
    if not isinstance(address, str) or not address.strip():
        return None

    # Repair B0X first so the P0 rule can look ahead to a clean BOX keyword
    cleaned = address
    for pattern, replacement in _OCR_FIXES:
        cleaned = pattern.sub(replacement, cleaned)

    match = PO_BOX_PATTERN.search(cleaned)
    if match:
        return match.group(1).upper().rstrip(".,;")
    return None

Vectorized usage with pandas

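For ETL workloads, prefer a Series-level call over a hand-written loop. Use Series.apply() when the OCR pre-cleaning step is required; use Series.str.extract() on clean data, where it keeps the whole operation in one pandas call and propagates missing values automatically. With the default object dtype, both paths still run Python's re engine once per element, so benchmark on your own data before assuming a large speedup.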

import pandas as pd

# Option A: preserves OCR pre-cleaning; suitable for up to ~1 M rows
def extract_po_box_series(series: pd.Series) -> pd.Series:
    """Apply extract_po_box across a Series, preserving index alignment."""
    return series.apply(extract_po_box)


# Option B: single pandas call for clean data; misses come back as NaN
_NATIVE_PATTERN = (
    r"(?i)\b(?:p(?:ost)?\.?\s*o(?:ffice)?\.?\s*-?\s*)?box"
    r"\s*(?:[.\-]\s*)?([A-Z0-9]+(?:-[A-Z0-9]+)*)\b"
)

def extract_po_box_native(series: pd.Series) -> pd.Series:
    """Extract PO Box ids with Series.str.extract; no OCR pre-cleaning."""
    return (
        series
        .str.extract(_NATIVE_PATTERN, expand=False)
        .str.upper()
    )
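
One practical difference between the two options is how a miss is reported: apply with a function that returns None yields None, while Series.str.extract yields a missing value. The sketch below uses a deliberately simplified pattern and made-up sample rows to show that behaviour, along with one way to convert missing values to None for code that tests `is None`.

```python
import pandas as pd

# Simplified pattern for illustration; sample addresses are made up.
SIMPLE_BOX = r"(?i)\bbox\s*([A-Z0-9]+(?:-[A-Z0-9]+)*)\b"

addresses = pd.Series(
    ["PO Box 12a", "742 Evergreen Terrace", "Box 4B-2"],
    index=[10, 11, 12],
)

extracted = addresses.str.extract(SIMPLE_BOX, expand=False).str.upper()

# Misses are missing values; the original index is preserved.
print(extracted.isna().tolist())   # [False, True, False]

# Convert missing values to None for code that tests `is None`.
as_optional = [None if pd.isna(value) else value for value in extracted]
print(as_optional)                 # ['12A', None, '4B-2']
```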

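Load CSV sources with dtype=str (for example pd.read_csv(path, dtype=str)) so that a column of bare box numbers such as 00123 is not coerced to integers and stripped of its leading zeros before extraction runs. Parquet files carry their own schema and pd.read_parquet has no dtype parameter, so check that the stored column type is a string and cast with astype(str) if it is not.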
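
The effect of type inference is easy to reproduce with an in-memory CSV. The column names and values below are made up for illustration.

```python
import io

import pandas as pd

# Made-up CSV with a column of bare box numbers that have leading zeros.
csv_text = "customer_id,box_number\n1,00123\n2,00456\n"

inferred = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred["box_number"].tolist())  # [123, 456]
print(as_text["box_number"].tolist())   # ['00123', '00456']
```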

Data-Flow Diagram

The diagram below shows where PO Box extraction sits inside a typical address normalization pipeline, from raw ingest through to route-ready output.

[Figure: PO Box extraction pipeline. A horizontal flow: Raw Input → OCR Pre-clean → Regex Extract → Normalize → Validate & Route → Route-ready Output. When Regex Extract finds no match, the flow returns None.]
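
The flow in the diagram can be sketched as a chain of small functions. Everything below is an illustrative assumption rather than the pipeline's actual code: the stage function names, the simplified pattern, and the validation rule (a length cap of 10 characters) are hypothetical placeholders.

```python
import re
from typing import Optional

# Hypothetical, simplified stand-ins for each stage in the diagram.
_BOX = re.compile(r"\bbox\s*([A-Z0-9]+(?:-[A-Z0-9]+)*)\b", re.IGNORECASE)
_B0X = re.compile(r"\bB0X\b", re.IGNORECASE)


def ocr_preclean(raw: str) -> str:
    """Repair the zero-for-O error in the BOX keyword only."""
    return _B0X.sub("BOX", raw)


def regex_extract(cleaned: str) -> Optional[str]:
    """Return the raw identifier, or None when nothing matches."""
    match = _BOX.search(cleaned)
    return match.group(1) if match else None


def normalize(box_id: str) -> str:
    """Produce the canonical uppercase form."""
    return box_id.upper()


def validate(box_id: str) -> bool:
    """Placeholder rule (assumption): reject ids longer than 10 characters."""
    return len(box_id) <= 10


def run_pipeline(raw: str) -> Optional[str]:
    """Run all stages; a miss or a failed validation short-circuits to None."""
    extracted = regex_extract(ocr_preclean(raw))
    if extracted is None:
        return None  # the "no match" branch in the diagram
    normalized = normalize(extracted)
    return normalized if validate(normalized) else None


print(run_pipeline("P.O. B0X 12a"))       # 12A
print(run_pipeline("742 Evergreen Ter"))  # None
```

Keeping each stage a separate function makes it straightforward to test stages in isolation and to swap the placeholder validation for a real deliverability check.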

Edge Cases and Failure Modes

Zero/Letter O confusion in scanned documents

Optical character recognition frequently renders the letter O as the digit 0. The pre-cleaning step handles the most common forms (P0 BOX, B0X), but heavily degraded scans may produce P0ST 0FFICE B0X. Extend the pre-cleaning if your OCR source produces this pattern:

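# More aggressive OCR recovery for heavily degraded input
import re

_DEGRADED_TOKENS = {"P0ST": "POST", "0FFICE": "OFFICE", "B0X": "BOX", "P0": "PO"}
_DEGRADED_RE = re.compile(r"\b(?:P0ST|0FFICE|B0X|P0)\b", re.IGNORECASE)

def aggressive_ocr_clean(address: str) -> str:
    """Repair zero-for-O errors in whole prefix tokens only, never inside box ids."""
    return _DEGRADED_RE.sub(
        lambda m: _DEGRADED_TOKENS[m.group(0).upper()], address
    )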
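
When extending OCR repair, it helps to check each rule against identifiers that legitimately contain a zero. The comparison below is a minimal sketch with a made-up input: a blanket str.replace rewrites the box id B07, while a word-bounded regex substitution touches only the keyword. The function names blanket_repair and repair_box_keyword are hypothetical.

```python
import re


def blanket_repair(address: str) -> str:
    """Naive repair: replaces every occurrence of B0, wherever it appears."""
    return address.replace("B0", "BO")


_B0X_TOKEN = re.compile(r"\bB0X\b", re.IGNORECASE)


def repair_box_keyword(address: str) -> str:
    """Bounded repair: rewrites B0X only when it stands alone as a token."""
    return _B0X_TOKEN.sub("BOX", address)


sample = "PO B0X B07"
print(blanket_repair(sample))      # PO BOX BO7
print(repair_box_keyword(sample))  # PO BOX B07
```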

Dual-format lines containing both street and box data

Some address lines carry both a street component and a PO Box — for example, 123 Main St, PO Box 456. The regex extracts only the box identifier and ignores the street portion, which is the correct behaviour for a delivery-routing pipeline. If your downstream system needs to flag these dual-format lines (perhaps to warn that mail delivery will follow box rules, not street rules), add a secondary classifier:

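import re

# Heuristic: a house number, one to four words, then a street-type word.
# Requiring the number reduces false hits on "St. Louis" or the state code CT.
_STREET_SIGNAL = re.compile(
    r"\b\d+[A-Z]?\s+(?:[A-Z0-9.']+\s+){1,4}?"
    r"(?:st(?:reet)?|ave(?:nue)?|blvd|dr(?:ive)?|rd|ln|ct|way)\b",
    re.IGNORECASE,
)

def is_dual_format(address: str) -> bool:
    """True when the address contains both a street indicator and a PO Box."""
    return bool(
        extract_po_box(address) and _STREET_SIGNAL.search(address)
    )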

Alphanumeric and hyphenated box identifiers

Rural and commercial routes regularly use formats such as PO Box 12A, BOX-991, or P.O. Box 4B-2. The capture group preserves trailing letters and internal hyphens, while a hyphen or period that merely separates the BOX keyword from the identifier is consumed as a delimiter, so BOX-991 yields 991. Punctuation that follows the identifier is excluded by the trailing \b, with the rstrip step as a defensive backstop. Verify your test corpus includes these forms:

assert extract_po_box("PO Box 12A")    == "12A"
assert extract_po_box("BOX-991")       == "991"
assert extract_po_box("P.O. Box 4B-2") == "4B-2"
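
A table-driven check keeps such cases together and reports every failure in one pass instead of stopping at the first failed assert. The helper below is a sketch: find_failures and the stand-in extractor toy_extract are hypothetical names, and the stand-in uses a simplified pattern so the block runs on its own. In practice you would pass extract_po_box as the first argument.

```python
import re
from typing import Callable, List, Optional, Tuple

Case = Tuple[str, Optional[str]]


def find_failures(
    extract: Callable[[str], Optional[str]],
    cases: List[Case],
) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Return (input, expected, actual) for every case the extractor gets wrong."""
    failures = []
    for text, expected in cases:
        actual = extract(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


# Stand-in extractor with a simplified pattern (assumption, for a runnable demo).
_TOY = re.compile(
    r"\bbox\s*(?:[.\-]\s*)?([A-Z0-9]+(?:-[A-Z0-9]+)*)\b", re.IGNORECASE
)


def toy_extract(text: str) -> Optional[str]:
    match = _TOY.search(text)
    return match.group(1).upper() if match else None


CASES: List[Case] = [
    ("PO Box 12A", "12A"),
    ("BOX-991", "991"),
    ("P.O. Box 4B-2", "4B-2"),
    ("MAILBOX 7", None),      # keyword embedded in a longer word
    ("123 Main St", None),    # no box at all
]

print(find_failures(toy_extract, CASES))  # []
```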

Integration Note

This extraction step belongs immediately after the OCR pre-cleaning stage and before any CASS-certified address validation or deliverability check. Within the broader Handling PO Boxes and Rural Routes workflow, the normalized box identifier returned here is the key passed to carrier-route lookup tables; feeding a raw, unparsed string to those lookups is a common source of failed delivery assignments. If your pipeline must also handle non-US addresses that encode postal box information differently, see International Address Format Standardization before applying this US-specific pattern to global datasets.