Implementing Fallback Chains for Failed Geocoding Lookups

As part of the Multi-API Routing & Fallback Chains resilience strategy, this page covers one specific problem: what to do when a geocoding lookup fails. A fallback chain is a prioritised sequence of geocoding providers that the system advances through until a valid coordinate pair is returned or a definitive failure state is reached.

Network partitions, provider outages, malformed input, and rate-limit exhaustion all interrupt address resolution workflows. Relying on a single geocoding API introduces a single point of failure that cascades into downstream logistics, routing, and analytics systems. A well-designed fallback chain transforms that fragility into a self-healing pipeline.

Prerequisites

Fallback chains are not a substitute for clean input. Running un-normalised free-text through a chain wastes quota on every tier and degrades aggregate resolution rates.

Architecture: How the Chain Flows

The diagram below shows the decision path for a single address through a three-tier chain. Each provider node either resolves the address (exits right) or hands off to the next tier. A circuit breaker sits in front of each node; if the provider has exceeded its consecutive-failure threshold it is skipped entirely.

[Figure: Geocoding fallback chain, three-tier state diagram. An address enters from the left and passes through a circuit-breaker gate before each provider tier (Tier 1: Google / HERE; Tier 2: Mapbox / TomTom; Tier 3: Nominatim). On success, the resolved coordinates (lat, lng) exit to the right. An open circuit bypasses its tier. When all tiers are exhausted, the address is routed to a dead-letter queue.]

Step-by-Step Implementation Workflow

1. Define Provider Priority and Cost Tiers

Rank providers by accuracy, regional coverage, latency, and operational cost. Commercial APIs with high match rates and global coverage occupy Tier 1. Open-source or regional providers serve as Tier 2 or Tier 3. Document this matrix explicitly, because it becomes the configuration contract for your executor. When pairing commercial precision with open coverage, configuring Google Maps fallback to OpenStreetMap is a common starting point.

API quota tracking and cost management must integrate with your routing context so you can attribute spend per fallback tier and dynamically adjust priorities when a budget threshold is breached.
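One way to make the matrix explicit and attribute spend per tier is a small typed table. The sketch below is illustrative only: the `TierSpec` name, the `spend_by_tier` helper, and every cost figure are assumptions, not real provider pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierSpec:
    """One row of the provider matrix (hypothetical structure)."""

    name: str
    tier: int
    cost_per_request_usd: float  # placeholder figure, not real pricing


# Illustrative values only; substitute your own contract rates.
PROVIDER_MATRIX = [
    TierSpec("google", tier=1, cost_per_request_usd=0.005),
    TierSpec("mapbox", tier=2, cost_per_request_usd=0.00075),
    TierSpec("nominatim", tier=3, cost_per_request_usd=0.0),
]


def spend_by_tier(matrix: list[TierSpec], attempts: list[str]) -> dict[int, float]:
    """Sum the cost of each attempted provider, grouped by tier number."""
    by_name = {spec.name: spec for spec in matrix}
    totals: dict[int, float] = {}
    for provider in attempts:
        spec = by_name[provider]
        totals[spec.tier] = totals.get(spec.tier, 0.0) + spec.cost_per_request_usd
    return totals
```

If `attempts` holds one entry per provider rather than one per HTTP call, the totals are a lower bound, since retries against the same provider are not counted separately.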

2. Map Failure Conditions and State Transitions

Identify which HTTP responses and payload states trigger a fallback vs. halt the chain:

Signal | Action
429 Too Many Requests | Advance to next tier; note provider as rate-limited
5xx server error | Advance to next tier; increment circuit-breaker counter
Connection timeout | Advance to next tier; apply backoff before retry
200 OK + empty results / ZERO_RESULTS | Advance to next tier
Schema validation failure | Advance to next tier; log payload for debugging
400 Bad Request | Halt chain; bad input will fail every provider
401 / 403 | Halt chain; auth failure requires operator action

Client errors (4xx except 429) indicate bad input or configuration; forwarding the same query to subsequent providers wastes quota. Progression is justified only by server and transient errors (5xx, 429, timeouts) and by responses that arrive intact but carry no usable result (empty result sets, schema validation failures).
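The mapping above can be sketched as a single classification function. The `Action` enum and the `classify` signature are hypothetical names introduced for illustration; `status_code=None` is an assumed convention for a request that timed out or never connected.

```python
from enum import Enum
from typing import Optional


class Action(str, Enum):
    ACCEPT = "accept"
    ADVANCE = "advance"
    HALT = "halt"


def classify(
    status_code: Optional[int],
    has_results: bool = False,
    schema_ok: bool = True,
) -> Action:
    """Map one provider response onto a chain action."""
    if status_code is None:
        return Action.ADVANCE  # timeout or network error
    if status_code == 429 or status_code >= 500:
        return Action.ADVANCE  # rate limit or server error
    if 400 <= status_code < 500:
        return Action.HALT  # bad input or auth failure
    if not schema_ok or not has_results:
        return Action.ADVANCE  # unparseable payload or empty result set
    return Action.ACCEPT
```

Keeping the decision in one pure function makes the state transitions easy to unit-test separately from any HTTP code.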

3. Implement Stateful Request Context

Maintain a request context object that records which providers have been attempted, elapsed time, and accumulated cost. This prevents circular routing and enables accurate billing attribution. The context must capture the original query, the normalised input form, and the final resolution state for downstream analytics and SLA reporting.

from __future__ import annotations

import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class FallbackState(str, Enum):
    SUCCESS = "success"
    EXHAUSTED = "exhausted"
    INVALID_INPUT = "invalid_input"


@dataclass
class RequestContext:
    """Mutable audit trail for a single geocoding attempt."""

    query: str
    normalised_query: str = ""
    attempts: list[str] = field(default_factory=list)
    total_latency_ms: float = 0.0
    state: FallbackState = FallbackState.EXHAUSTED
    coordinates: Optional[tuple[float, float]] = None

    def record_attempt(self, provider: str, latency_ms: float) -> None:
        self.attempts.append(provider)
        self.total_latency_ms += latency_ms

4. Configure Exponential Backoff with Jitter

Immediate retries after a transient error amplify load and often trigger stricter rate limits. Implement exponential backoff with randomised jitter to spread retry attempts across the provider’s recovery window:

import asyncio
import random


async def backoff_sleep(attempt: int, base: float = 0.5, cap: float = 10.0) -> None:
    """Exponential backoff with full jitter (capped at `cap` seconds)."""
    delay = min(cap, base * (2 ** attempt))
    jitter = random.uniform(0, delay)
    await asyncio.sleep(jitter)

Tier 1 providers may warrant shorter delays (base 0.25 s) since their outages tend to be brief. Tier 3 providers, often community-run, can tolerate longer waits.
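A per-tier backoff table might look like the following sketch. Only the 0.25 s Tier 1 base comes from the text; the `TIER_BACKOFF` name and all other values are placeholders to tune against your own providers.

```python
import random

# Hypothetical per-tier settings; only the Tier 1 base is taken from the text.
TIER_BACKOFF = {
    1: {"base": 0.25, "cap": 5.0},
    2: {"base": 0.5, "cap": 10.0},
    3: {"base": 1.0, "cap": 30.0},
}


def backoff_ceiling(tier: int, attempt: int) -> float:
    """Upper bound of the jittered delay for a tier and zero-based attempt."""
    cfg = TIER_BACKOFF[tier]
    return min(cfg["cap"], cfg["base"] * (2 ** attempt))


def backoff_delay(tier: int, attempt: int) -> float:
    """Full-jitter delay: uniform between 0 and the ceiling."""
    return random.uniform(0, backoff_ceiling(tier, attempt))
```

Separating the deterministic ceiling from the random draw keeps the growth curve testable without seeding the random number generator.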

5. Build the Async Fallback Executor

The full executor below wires together the context, backoff, and per-provider parsing. It uses pydantic for configuration validation and httpx for non-blocking HTTP. Pair it with the async geocoding request patterns to optimise throughput when running the chain over large address batches.

import asyncio
import logging
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
from urllib.parse import urljoin

import httpx
from pydantic import BaseModel, field_validator

logger = logging.getLogger(__name__)


class FallbackState(str, Enum):
    SUCCESS = "success"
    EXHAUSTED = "exhausted"
    INVALID_INPUT = "invalid_input"


@dataclass
class RequestContext:
    """Audit trail for a single geocoding resolution attempt."""

    query: str
    normalised_query: str = ""
    attempts: list[str] = field(default_factory=list)
    total_latency_ms: float = 0.0
    state: FallbackState = FallbackState.EXHAUSTED
    coordinates: Optional[tuple[float, float]] = None


class ProviderConfig(BaseModel):
    """Configuration for one geocoding provider tier."""

    name: str
    base_url: str
    api_key: str
    timeout: float = 5.0
    max_retries: int = 2

    @field_validator("base_url")
    @classmethod
    def require_https(cls, v: str) -> str:
        if not v.startswith("https://"):
            raise ValueError("Provider base_url must use HTTPS")
        return v


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; resets after `cooldown` seconds."""

    def __init__(self, name: str, threshold: int = 5, cooldown: float = 60.0) -> None:
        self.name = name
        self.threshold = threshold
        self.cooldown = cooldown
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if time.monotonic() - self._opened_at >= self.cooldown:
            self._reset()
            return False
        return True

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = time.monotonic()
            logger.warning("Circuit opened for %s after %d failures", self.name, self._failures)

    def record_success(self) -> None:
        self._reset()

    def _reset(self) -> None:
        self._failures = 0
        self._opened_at = None


class GeocodingFallbackChain:
    """
    Resolves an address by walking a prioritised list of geocoding providers.

    Each provider is guarded by a circuit breaker. On transient failure the
    chain advances to the next tier after an exponential-backoff wait. Hard
    client errors (400, 401, 403) halt the chain immediately.
    """

    def __init__(self, providers: list[ProviderConfig]) -> None:
        self.providers = providers
        self._breakers: dict[str, CircuitBreaker] = {
            p.name: CircuitBreaker(p.name) for p in providers
        }
        self._client = httpx.AsyncClient(timeout=30.0)

    async def resolve(self, address: str) -> RequestContext:
        """Return a RequestContext with coordinates on success or EXHAUSTED/INVALID_INPUT."""
        ctx = RequestContext(query=address)

        for provider in self.providers:
            breaker = self._breakers[provider.name]
            if breaker.is_open():
                logger.info("Skipping %s — circuit open", provider.name)
                continue

            start = time.monotonic()
            try:
                coordinates = await self._call_with_retry(provider, address)
                ctx.total_latency_ms += (time.monotonic() - start) * 1000
                ctx.attempts.append(provider.name)

                if coordinates is not None:
                    ctx.coordinates = coordinates
                    ctx.state = FallbackState.SUCCESS
                    breaker.record_success()
                    logger.info(
                        "Resolved %r via %s in %.1f ms",
                        address,
                        provider.name,
                        ctx.total_latency_ms,
                    )
                    return ctx

                # ZERO_RESULTS — try next provider
                logger.debug("No results from %s for %r", provider.name, address)

            except _HaltChainError as exc:
                ctx.total_latency_ms += (time.monotonic() - start) * 1000
                ctx.attempts.append(provider.name)
                ctx.state = FallbackState.INVALID_INPUT
                logger.error("Chain halted: %s", exc)
                return ctx

            except Exception as exc:
                ctx.total_latency_ms += (time.monotonic() - start) * 1000
                ctx.attempts.append(provider.name)
                breaker.record_failure()
                logger.warning("Provider %s failed: %s", provider.name, exc)

        logger.info(
            "Fallback chain exhausted for %r after %d provider(s)", address, len(ctx.attempts)
        )
        return ctx

    async def _call_with_retry(
        self, config: ProviderConfig, address: str
    ) -> Optional[tuple[float, float]]:
        """Retry up to `config.max_retries` times with exponential backoff."""
        last_exc: Optional[Exception] = None
        for attempt in range(config.max_retries + 1):
            if attempt:
                await _backoff_sleep(attempt - 1)
            try:
                return await self._call_provider(config, address)
            except _HaltChainError:
                raise
            except Exception as exc:
                last_exc = exc
        raise last_exc  # type: ignore[misc]

    async def _call_provider(
        self, config: ProviderConfig, address: str
    ) -> Optional[tuple[float, float]]:
        """
        Call one provider and return (lat, lng) on match, None on ZERO_RESULTS.

        Raises _HaltChainError for 400/401/403; re-raises httpx errors for
        transient conditions (5xx, timeout, network).
        """
        # Illustrates Google Maps Geocoding API structure.
        # Adapt params/response parsing for each provider.
        url = urljoin(config.base_url, "json")
        params = {"address": address, "key": config.api_key}

        response = await self._client.get(
            url, params=params, timeout=config.timeout
        )

        if response.status_code in (400, 401, 403):
            raise _HaltChainError(
                f"HTTP {response.status_code} from {config.name} — halting chain"
            )
        response.raise_for_status()

        data: dict = response.json()
        status = data.get("status", "")

        if status == "OK" and data.get("results"):
            loc = data["results"][0]["geometry"]["location"]
            lat, lng = float(loc["lat"]), float(loc["lng"])
            _validate_coordinates(lat, lng, config.name)
            return lat, lng

        if status in ("ZERO_RESULTS", "NOT_FOUND"):
            return None

        raise ValueError(f"Unexpected status '{status}' from {config.name}")

    async def close(self) -> None:
        await self._client.aclose()

    async def __aenter__(self) -> "GeocodingFallbackChain":
        return self

    async def __aexit__(self, *_: object) -> None:
        await self.close()


class _HaltChainError(Exception):
    """Signals a hard failure that must stop chain progression."""


async def _backoff_sleep(attempt: int, base: float = 0.5, cap: float = 10.0) -> None:
    import random
    delay = min(cap, base * (2 ** attempt))
    await asyncio.sleep(random.uniform(0, delay))


def _validate_coordinates(lat: float, lng: float, provider: str) -> None:
    """Reject coordinates outside valid ranges or known geocoding artefacts."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lng <= 180.0):
        raise ValueError(f"{provider} returned out-of-range coordinates ({lat}, {lng})")
    # Null Island guard — (0, 0) is a common geocoding artefact
    if lat == 0.0 and lng == 0.0:
        raise ValueError(f"{provider} returned Null Island coordinates")

6. Vectorised Pandas Variant

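For bulk address files, resolve the whole address column in a single asyncio.run call so that lookups run concurrently; calling asyncio.run once per row from DataFrame.apply would serialise them. The work here is concurrent I/O rather than vectorised computation. asyncio.run raises a RuntimeError when an event loop is already running (for example in a notebook), in which case await resolve_batch directly. The rate-limiting strategies for batch processing page covers semaphore-based concurrency controls that prevent quota exhaustion when running many chains simultaneously.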

import asyncio

import pandas as pd


async def resolve_batch(
    addresses: list[str], providers: list[ProviderConfig]
) -> list[RequestContext]:
    """Resolve a list of addresses concurrently with a shared semaphore."""
    sem = asyncio.Semaphore(10)  # max 10 in-flight requests

    async with GeocodingFallbackChain(providers) as chain:

        async def bounded_resolve(addr: str) -> RequestContext:
            async with sem:
                return await chain.resolve(addr)

        return await asyncio.gather(*[bounded_resolve(a) for a in addresses])


def geocode_dataframe(df: pd.DataFrame, providers: list[ProviderConfig]) -> pd.DataFrame:
    """
    Add 'lat', 'lng', 'provider_chain', and 'resolution_state' columns to df.

    Expects a 'normalised_address' column produced upstream by the parsing pipeline.
    """
    contexts = asyncio.run(resolve_batch(df["normalised_address"].tolist(), providers))
    df = df.copy()
    df["lat"] = [c.coordinates[0] if c.coordinates else None for c in contexts]
    df["lng"] = [c.coordinates[1] if c.coordinates else None for c in contexts]
    df["provider_chain"] = [" → ".join(c.attempts) for c in contexts]
    df["resolution_state"] = [c.state.value for c in contexts]
    return df

Provider Parameter Reference

Provider | Base URL | Key parameter | ZERO_RESULTS signal | Notes
Google Maps Geocoding | https://maps.googleapis.com/maps/api/geocode/ | key | status == "ZERO_RESULTS" | Richest component detail; charges per request
HERE Geocode | https://geocode.search.hereapi.com/v1/geocode | apiKey | Empty items array | Strong European coverage
Mapbox Geocoding | https://api.mapbox.com/geocoding/v5/mapbox.places/ | access_token | Empty features array | GeoJSON response; good for US addresses
TomTom Search | https://api.tomtom.com/search/2/geocode/ | key | Empty results array | Good fallback for logistics routes
Nominatim (OSM) | https://nominatim.openstreetmap.org/search | None (user-agent required) | Empty JSON array | Free; strict rate limit (1 req/s); last resort
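Because each provider signals an empty result differently, the response parsing needs one small adapter per provider. The sketch below shows the idea for four of them; the function names are hypothetical, and the field names reflect my understanding of each provider's documented response shape, so verify them against the current API reference before relying on them.

```python
from typing import Any, Optional

Coordinates = tuple[float, float]  # always (lat, lng)


def parse_here(data: dict[str, Any]) -> Optional[Coordinates]:
    """HERE: matches live in `items`; an empty list means no result."""
    items = data.get("items") or []
    if not items:
        return None
    pos = items[0]["position"]
    return float(pos["lat"]), float(pos["lng"])


def parse_mapbox(data: dict[str, Any]) -> Optional[Coordinates]:
    """Mapbox: GeoJSON features; `center` is ordered [lng, lat]."""
    features = data.get("features") or []
    if not features:
        return None
    lng, lat = features[0]["center"]
    return float(lat), float(lng)


def parse_tomtom(data: dict[str, Any]) -> Optional[Coordinates]:
    """TomTom: matches live in `results`; position uses `lon`, not `lng`."""
    results = data.get("results") or []
    if not results:
        return None
    pos = results[0]["position"]
    return float(pos["lat"]), float(pos["lon"])


def parse_nominatim(data: list[dict[str, Any]]) -> Optional[Coordinates]:
    """Nominatim: top-level JSON array; lat/lon arrive as strings."""
    if not data:
        return None
    return float(data[0]["lat"]), float(data[0]["lon"])
```

Every adapter returns `(lat, lng)` or `None`, so the chain can treat all providers uniformly regardless of their native coordinate order.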

Edge Cases

Partial Address Resolution Produces a Wrong Centroid

Some providers return a result for a truncated address, for example resolving “123 Main St Springfield” to the city centroid rather than the street. A precision field in the response (geometry.location_type in Google's payload, or a provider-specific equivalent such as result_type) distinguishes rooftop precision from city or postcode centroids. Reject low-precision results and advance the chain rather than accepting a coarse match.

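# Google reports location_type in upper case (ROOFTOP, RANGE_INTERPOLATED,
# GEOMETRIC_CENTER, APPROXIMATE), so the comparison lower-cases the value.
ACCEPTABLE_TYPES = {"rooftop", "range_interpolated", "geometric_center"}

def is_precise_enough(result: dict) -> bool:
    """Return True for rooftop, interpolated, or geometric-centre matches."""
    location_type: str = (
        result.get("geometry", {}).get("location_type", "")
        or result.get("result_type", "")
    )
    return location_type.lower() in ACCEPTABLE_TYPES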

International Addresses and Character Encoding

Providers differ in their handling of diacritics and non-Latin scripts. Always apply NFKC normalisation before sending an address to any provider tier. A provider that rejects a raw Unicode query might accept the normalised form. Send both forms if the first fails, rather than immediately advancing the chain.
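A minimal sketch of that normalisation step, using only the standard library. The helper names are hypothetical, and trying the normalised form before the raw form is one reading of the advice above, not a fixed rule.

```python
import unicodedata


def normalise_address(raw: str) -> str:
    """Apply NFKC normalisation, then collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", raw)
    return " ".join(folded.split())


def query_forms(raw: str) -> list[str]:
    """Forms to try against one provider: normalised first, then raw."""
    normalised = normalise_address(raw)
    if normalised == raw:
        return [normalised]
    return [normalised, raw]
```

NFKC folds compatibility characters such as full-width digits into their ASCII equivalents and composes base letters with combining accents, which is why it can change whether a provider matches the query.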

Unstructured Free-Text Input

If an address arrives as a single unstructured string, run it through the core address parsing pipeline to separate street number, street name, city, postcode, and country before geocoding. Structured components substantially improve first-pass match rates on all tiers.

Provider Returns a Result in the Wrong Country

When a query omits the country component, some providers silently resolve it to a city in a different country with a similar name. Validate the returned country code against your expected country before accepting the result.

def country_matches(result: dict, expected_iso2: str) -> bool:
    """Check the address component for the country short name."""
    for component in result.get("address_components", []):
        if "country" in component.get("types", []):
            return component.get("short_name", "").upper() == expected_iso2.upper()
    return True  # No country component — cannot validate, pass through

Rate Limit Spikes During Batch Runs

If you are tracking API spend with Redis, integrate the quota counters with the fallback selector so the chain automatically routes around providers that have consumed their daily budget before hitting a live 429.
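The selection logic can be sketched independently of the storage layer. Here a plain in-memory counter stands in for Redis; `QuotaTracker` and `eligible_providers` are hypothetical names, and a real deployment would replace the counter with atomic increments in the shared store.

```python
from collections import Counter


class QuotaTracker:
    """In-memory stand-in for a shared counter store such as Redis."""

    def __init__(self, daily_limits: dict[str, int]) -> None:
        self._limits = daily_limits
        self._used = Counter()

    def record(self, provider: str, requests: int = 1) -> None:
        self._used[provider] += requests

    def has_headroom(self, provider: str) -> bool:
        # Providers without a configured limit are treated as unlimited.
        limit = self._limits.get(provider)
        return limit is None or self._used[provider] < limit


def eligible_providers(ordered: list[str], quota: QuotaTracker) -> list[str]:
    """Keep tier order while dropping providers with no quota left."""
    return [name for name in ordered if quota.has_headroom(name)]
```

Filtering before the chain runs means a provider that has spent its budget is never called, so the pipeline avoids paying the latency of a predictable 429.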

Performance and Scaling

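Per-provider timeout discipline is the single biggest lever. A 5-second timeout per tier means a three-tier chain can block for 15 seconds on a pathological input even with no retries. With the executor's default of max_retries=2, each tier makes up to three attempts plus up to 1.5 seconds of backoff, so the worst case approaches 50 seconds. Set aggressive timeouts (2–3 s for commercial APIs, 5 s for community APIs), keep retry counts low, and rely on the chain to advance rather than waiting for the full window.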
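The worst-case blocking time depends on retries and backoff as well as the per-tier timeout. The helper below is a back-of-envelope sketch with a hypothetical name; it assumes every attempt runs to its full timeout and every backoff draws its maximum, using the same base and cap defaults as the backoff function in this page.

```python
def worst_case_seconds(
    tiers: int,
    timeout: float,
    max_retries: int,
    base: float = 0.5,
    cap: float = 10.0,
) -> float:
    """Upper bound on chain latency when every attempt times out."""
    attempts = max_retries + 1
    # One backoff wait precedes each retry; attempt indices start at 0.
    backoff = sum(min(cap, base * (2 ** n)) for n in range(max_retries))
    return tiers * (attempts * timeout + backoff)
```

For three tiers with a 5-second timeout this gives 15.0 seconds with no retries and 49.5 seconds with two retries per tier. Treat the figure as approximate: httpx applies a single float timeout to the connect, read, write, and pool phases separately, so one request can take longer than the nominal value.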

Concurrency with semaphores (shown in the pandas variant above) lets you saturate your quota without exceeding it. Start with a concurrency limit of 10 and benchmark against your Tier 1 provider’s documented rate limit. For sustained throughput above 1 000 records per minute, offload the chain to a worker pool and decouple input ingestion from resolution using a message queue.

Caching resolved coordinates by normalised address string eliminates redundant chain traversals for repeated inputs. A Redis TTL of 7–30 days is appropriate for most address data. See the dynamic provider selection based on region page for how to partition the cache by geography when your providers have asymmetric regional accuracy.
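The caching behaviour can be sketched with an in-process TTL cache. This is a stand-in for Redis rather than a replacement: `TTLCache` is a hypothetical name, and the injectable `clock` parameter exists only to make expiry testable.

```python
import time
from typing import Callable, Optional

Coordinates = tuple[float, float]


class TTLCache:
    """Minimal TTL cache keyed by normalised address string."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, Coordinates]] = {}

    def get(self, normalised_address: str) -> Optional[Coordinates]:
        entry = self._store.get(normalised_address)
        if entry is None:
            return None
        expires_at, coords = entry
        if self._clock() >= expires_at:
            del self._store[normalised_address]  # lazy expiry on read
            return None
        return coords

    def put(self, normalised_address: str, coords: Coordinates) -> None:
        self._store[normalised_address] = (self._clock() + self._ttl, coords)
```

Check the cache before entering the chain and write to it only on a successful resolution, so that failures are retried on the next run instead of being cached.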

Dead-letter queue throughput. Unresolvable addresses should flow into a DLQ for human review or periodic reprocessing. Instrument the DLQ depth as a key metric — a sustained DLQ backlog signals either input quality problems upstream or systemic provider degradation.
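A minimal sketch of the dead-letter flow, with an in-memory deque standing in for a durable queue. `DeadLetter` and `DeadLetterQueue` are hypothetical names; in production the `depth` value would be exported to your metrics system.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class DeadLetter:
    """One unresolved address plus the context needed to triage it."""

    query: str
    attempts: tuple[str, ...]
    reason: str


class DeadLetterQueue:
    """In-memory stand-in for a durable queue."""

    def __init__(self) -> None:
        self._items: deque[DeadLetter] = deque()

    def push(self, letter: DeadLetter) -> None:
        self._items.append(letter)

    def depth(self) -> int:
        """Current backlog size; the metric to alert on."""
        return len(self._items)

    def drain(self, limit: int) -> list[DeadLetter]:
        """Pop up to `limit` letters, oldest first, for reprocessing."""
        batch: list[DeadLetter] = []
        while self._items and len(batch) < limit:
            batch.append(self._items.popleft())
        return batch
```

Storing the attempted providers and the failure reason alongside the query lets reviewers separate input-quality problems from provider degradation.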

Troubleshooting

httpx.ReadTimeout on Every Provider

The timeout is set too aggressively for your network environment, or the provider endpoint is unreachable. Verify connectivity with a direct curl call. Increase timeout in ProviderConfig incrementally and monitor p99 latency before settling on a value.

Chain Returns INVALID_INPUT for Valid Addresses

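In the executor above, INVALID_INPUT is set only when _call_provider raises _HaltChainError, which happens for HTTP 400, 401, and 403. For valid addresses this usually points to an auth or request-construction problem (an expired or restricted API key, or malformed parameters) rather than the address itself, because 401 and 403 are reported under the same state as 400. A ValueError from provider-specific parsing does not produce INVALID_INPUT; it is caught as a provider failure and the chain advances. Add structured logging for the status code and raw response body in _call_provider to diagnose.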

Circuit Breaker Stays Open After Provider Recovers

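Either the cooldown window has not elapsed, or successful calls are not reaching the same CircuitBreaker instance that recorded the failures. Under concurrency, requests already in flight when the circuit opened can still fail afterwards, and each such failure resets the opening timestamp and so extends the cooldown. Breaker state lives on the GeocodingFallbackChain instance, so keep that instance long-lived (shared across requests in an async application) rather than re-instantiating it per call; a fresh instance starts with every circuit closed and loses the failure history. Use an async context manager (async with GeocodingFallbackChain(providers) as chain) to manage the client lifecycle.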

ZERO_RESULTS Rate Increases After Switching to a New Tier Configuration

The new Tier 1 provider has lower regional coverage for your address corpus. Compare match rates by country or postcode prefix across provider tiers. Use the geocoding accuracy comparison patterns to benchmark before promoting a provider.

Coordinate Validation Raises on Legitimate Remote Locations

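The Null Island guard in _validate_coordinates rejects only the exact point (0.0, 0.0), which lies in open water in the Gulf of Guinea, so it should not fire for real addresses. If validation fails for legitimate locations, check the range test first: providers that return GeoJSON (such as Mapbox) order coordinates as [lng, lat], and reading them as (lat, lng) pushes many values outside the valid latitude range. Replacing exact equality with a bounding box (e.g. abs(lat) < 0.5 and abs(lng) < 0.5) rejects more results, not fewer; use it only to catch near-zero artefacts, and only if your data has no addresses in that region.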

FAQ

Should I retry on HTTP 400 or halt the chain?

Halt. A 400 Bad Request signals malformed input the provider rejected — passing the same query to the next provider will produce the same result. Fix the input upstream before re-entering the chain.

How many providers should a production fallback chain have?

Three tiers is the practical ceiling for most pipelines: a high-accuracy commercial API (Tier 1), an open or regional alternative (Tier 2), and a last-resort provider with broad but lower-precision coverage (Tier 3). Beyond three, latency accumulates and marginal resolution gains drop steeply.

What is the difference between a fallback chain and a retry loop?

A retry loop re-issues the same request to the same provider; a fallback chain advances to a different provider on failure. Good implementations combine both: retry transiently-failing providers with backoff before falling back to the next tier.

How do I handle ZERO_RESULTS vs. a network timeout differently?

ZERO_RESULTS means the provider received and understood the query but found no match — advance to the next provider immediately. A timeout means the provider may be degraded — apply backoff before retrying or advancing. Never conflate these two signals in your state machine.

Can I run fallback providers in parallel instead of sequentially?

Yes, but only as an optimisation for latency-critical paths. Parallel fanout burns quota on every provider simultaneously. The safer pattern is sequential with aggressive per-provider timeouts (2–5 s), falling back immediately on timeout rather than waiting for the full window.
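For the latency-critical case, a parallel fanout can be sketched with asyncio as below. `first_match` and the `Lookup` type are hypothetical names; each lookup is assumed to be an async callable that returns `(lat, lng)`, returns None for no match, or raises on failure.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Coordinates = tuple[float, float]
Lookup = Callable[[str], Awaitable[Optional[Coordinates]]]


async def first_match(
    address: str, lookups: list[Lookup], timeout: float = 3.0
) -> Optional[Coordinates]:
    """Query every provider at once; return the first non-empty result."""
    tasks = [asyncio.ensure_future(lookup(address)) for lookup in lookups]

    async def race() -> Optional[Coordinates]:
        for finished in asyncio.as_completed(tasks):
            try:
                result = await finished
            except Exception:
                continue  # one provider failed; keep waiting for the others
            if result is not None:
                return result
        return None  # every provider answered with no match or an error

    try:
        return await asyncio.wait_for(race(), timeout)
    except asyncio.TimeoutError:
        return None  # overall deadline reached
    finally:
        for task in tasks:
            task.cancel()  # stop waiting on answers that are no longer needed
```

Cancelling the losing tasks stops the client from waiting on them, but requests that have already reached a provider are typically still billed, which is the quota cost described above.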