As part of the Multi-API Routing & Fallback Chains resilience strategy, this page covers one specific problem: what to do when a geocoding lookup fails. A fallback chain is a prioritised sequence of geocoding providers that the system advances through until a valid coordinate pair is returned or a definitive failure state is reached.
Network partitions, provider outages, malformed input, and rate-limit exhaustion all interrupt address resolution workflows. Relying on a single geocoding API introduces a single point of failure that cascades into downstream logistics, routing, and analytics systems. A well-designed fallback chain transforms that fragility into a self-healing pipeline.
Prerequisites
Fallback chains are not a substitute for clean input. Running un-normalised free-text through a chain wastes quota on every tier and degrades aggregate resolution rates.
Architecture: How the Chain Flows
The diagram below shows the decision path for a single address through a three-tier chain. Each provider node either resolves the address (exits right) or hands off to the next tier. A circuit breaker sits in front of each node; if the provider has exceeded its consecutive-failure threshold it is skipped entirely.
Step-by-Step Implementation Workflow
1. Define Provider Priority and Cost Tiers
Rank providers by accuracy, regional coverage, latency, and operational cost. Commercial APIs with high match rates and global coverage occupy Tier 1. Open-source or regional providers serve as Tier 2 or Tier 3. Document this matrix explicitly — it becomes the configuration contract for your executor. When pairing commercial precision with open coverage, configuring Google Maps fallback to OpenStreetMap is the most common starting point.
API quota tracking and cost management must integrate with your routing context so you can attribute spend per fallback tier and dynamically adjust priorities when a budget threshold is breached.
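One way to write the provider matrix down as a configuration contract is sketched below. The `ProviderTier` fields, the `chain_order` helper, and every price and region value are hypothetical placeholders, not figures from any provider; the sketch only shows how a documented matrix can be turned into a per-country chain order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderTier:
    """One row of the provider matrix (all values here are illustrative)."""

    name: str
    tier: int                 # 1 = tried first
    cost_per_1k_usd: float    # hypothetical price per 1,000 requests
    regions: frozenset[str]   # ISO 3166-1 alpha-2 codes, or {"*"} for global


# Hypothetical matrix; substitute your own providers, prices, and coverage.
PROVIDER_MATRIX = [
    ProviderTier("nominatim", 3, 0.0, frozenset({"*"})),
    ProviderTier("google", 1, 5.0, frozenset({"*"})),
    ProviderTier("here", 2, 1.0, frozenset({"DE", "FR", "GB"})),
]


def chain_order(matrix: list[ProviderTier], country: str) -> list[str]:
    """Return provider names that cover `country`, lowest tier number first."""
    eligible = [p for p in matrix if "*" in p.regions or country in p.regions]
    ordered = sorted(eligible, key=lambda p: (p.tier, p.cost_per_1k_usd))
    return [p.name for p in ordered]


print(chain_order(PROVIDER_MATRIX, "DE"))
print(chain_order(PROVIDER_MATRIX, "US"))
```

Keeping cost in the same record as the tier makes it straightforward to re-sort or filter the chain when a budget threshold is breached.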
2. Map Failure Conditions and State Transitions
Identify which HTTP responses and payload states trigger a fallback vs. halt the chain:
| Signal | Action |
|---|---|
| 429 Too Many Requests | Advance to next tier; note provider as rate-limited |
| 5xx server error | Advance to next tier; increment circuit-breaker counter |
| Connection timeout | Advance to next tier; apply backoff before retry |
| 200 OK + empty results / ZERO_RESULTS | Advance to next tier |
| Schema validation failure | Advance to next tier; log payload for debugging |
| 400 Bad Request | Halt chain: bad input will fail every provider |
| 401 / 403 | Halt chain: auth failure requires operator action |
Client errors (4xx except 429) indicate bad input or configuration; forwarding the same query to subsequent providers wastes quota. Server and transient errors (5xx, 429, timeouts), empty result sets, and schema validation failures are the only signals that justify progression.
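The transition rules can be sketched as a single classification function. The `Action` enum and `classify_response` are hypothetical names introduced for illustration, and the sketch follows the prose rule that every 4xx other than 429 halts the chain.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    ACCEPT = "accept"
    ADVANCE = "advance"
    HALT = "halt"


def classify_response(status_code: Optional[int], has_results: bool = False) -> Action:
    """Map one provider outcome to a chain action.

    `status_code` is None when no HTTP response arrived (timeout, network error).
    """
    if status_code is None:
        return Action.ADVANCE    # timeout or connection failure
    if status_code == 429 or 500 <= status_code < 600:
        return Action.ADVANCE    # rate-limited or server-side error
    if 400 <= status_code < 500:
        return Action.HALT       # bad input or configuration
    if status_code == 200 and has_results:
        return Action.ACCEPT
    return Action.ADVANCE        # 200 with empty results, or anything unexpected


print(classify_response(429), classify_response(400), classify_response(200, True))
```

Schema validation is not modelled here; a caller would typically treat a payload that fails validation the same way as an empty result and advance.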
3. Implement Stateful Request Context
Maintain a request context object that records which providers have been attempted, elapsed time, and accumulated cost. This prevents circular routing and enables accurate billing attribution. The context must capture the original query, the normalised input form, and the final resolution state for downstream analytics and SLA reporting.
from __future__ import annotations

import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class FallbackState(str, Enum):
    SUCCESS = "success"
    EXHAUSTED = "exhausted"
    INVALID_INPUT = "invalid_input"


@dataclass
class RequestContext:
    """Audit trail for a single geocoding attempt (updated as the chain runs)."""

    query: str
    normalised_query: str = ""
    attempts: list[str] = field(default_factory=list)
    total_latency_ms: float = 0.0
    state: FallbackState = FallbackState.EXHAUSTED
    coordinates: Optional[tuple[float, float]] = None

    def record_attempt(self, provider: str, latency_ms: float) -> None:
        # Append the provider name and add its latency to the running total.
        self.attempts.append(provider)
        self.total_latency_ms += latency_ms
4. Configure Exponential Backoff with Jitter
Immediate retries after a transient error amplify load and often trigger stricter rate limits. Implement exponential backoff with randomised jitter to spread retry attempts across the provider’s recovery window:
import asyncio
import random


async def backoff_sleep(attempt: int, base: float = 0.5, cap: float = 10.0) -> None:
    """Exponential backoff with full jitter (capped at `cap` seconds)."""
    delay = min(cap, base * (2 ** attempt))
    jitter = random.uniform(0, delay)
    await asyncio.sleep(jitter)
Tier 1 providers may warrant shorter delays (base 0.25 s) since their outages tend to be brief. Tier 3 providers, often community-run, can tolerate longer waits.
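A minimal sketch of per-tier base delays follows. Only the 0.25 s Tier 1 base comes from the text; the Tier 2 and Tier 3 values, the `TIER_BASE_DELAY` mapping, and both function names are assumptions to tune against your own providers.

```python
import random

# Hypothetical per-tier base delays in seconds. Only the Tier 1 figure is
# taken from the text; the other two are placeholders.
TIER_BASE_DELAY = {1: 0.25, 2: 0.5, 3: 1.0}


def backoff_ceiling(attempt: int, tier: int, cap: float = 10.0) -> float:
    """Upper bound of the full-jitter window for a zero-based retry attempt."""
    return min(cap, TIER_BASE_DELAY[tier] * (2 ** attempt))


def jittered_delay(attempt: int, tier: int, cap: float = 10.0) -> float:
    """Pick a delay uniformly in [0, ceiling], i.e. full jitter."""
    return random.uniform(0, backoff_ceiling(attempt, tier, cap))


for tier in (1, 2, 3):
    print(tier, [backoff_ceiling(a, tier) for a in range(6)])
```

Printing the ceilings per tier makes it easy to check how quickly each tier reaches the cap before committing to a retry budget.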
5. Build the Async Fallback Executor
The full executor below wires together the context, backoff, and per-provider parsing. It uses pydantic for configuration validation and httpx for non-blocking HTTP. Pair it with the async geocoding request patterns to optimise throughput when running the chain over large address batches.
import asyncio
import logging
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
from urllib.parse import urljoin

import httpx
from pydantic import BaseModel, field_validator

logger = logging.getLogger(__name__)


class FallbackState(str, Enum):
    SUCCESS = "success"
    EXHAUSTED = "exhausted"
    INVALID_INPUT = "invalid_input"


@dataclass
class RequestContext:
    """Audit trail for a single geocoding resolution attempt."""

    query: str
    normalised_query: str = ""
    attempts: list[str] = field(default_factory=list)
    total_latency_ms: float = 0.0
    state: FallbackState = FallbackState.EXHAUSTED
    coordinates: Optional[tuple[float, float]] = None


class ProviderConfig(BaseModel):
    """Configuration for one geocoding provider tier."""

    name: str
    base_url: str
    api_key: str
    timeout: float = 5.0
    max_retries: int = 2

    @field_validator("base_url")
    @classmethod
    def require_https(cls, v: str) -> str:
        if not v.startswith("https://"):
            raise ValueError("Provider base_url must use HTTPS")
        return v


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; resets after `cooldown` seconds."""

    def __init__(self, name: str, threshold: int = 5, cooldown: float = 60.0) -> None:
        self.name = name
        self.threshold = threshold
        self.cooldown = cooldown
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if time.monotonic() - self._opened_at >= self.cooldown:
            self._reset()
            return False
        return True

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = time.monotonic()
            logger.warning("Circuit opened for %s after %d failures", self.name, self._failures)

    def record_success(self) -> None:
        self._reset()

    def _reset(self) -> None:
        self._failures = 0
        self._opened_at = None
class GeocodingFallbackChain:
    """
    Resolves an address by walking a prioritised list of geocoding providers.

    Each provider is guarded by a circuit breaker. On transient failure the
    chain advances to the next tier after an exponential-backoff wait. Hard
    client errors (400, 401, 403) halt the chain immediately.
    """

    def __init__(self, providers: list[ProviderConfig]) -> None:
        self.providers = providers
        self._breakers: dict[str, CircuitBreaker] = {
            p.name: CircuitBreaker(p.name) for p in providers
        }
        self._client = httpx.AsyncClient(timeout=30.0)

    async def resolve(self, address: str) -> RequestContext:
        """Return a RequestContext with coordinates on success or EXHAUSTED/INVALID_INPUT."""
        ctx = RequestContext(query=address)
        for provider in self.providers:
            breaker = self._breakers[provider.name]
            if breaker.is_open():
                logger.info("Skipping %s - circuit open", provider.name)
                continue
            start = time.monotonic()
            try:
                coordinates = await self._call_with_retry(provider, address)
                ctx.total_latency_ms += (time.monotonic() - start) * 1000
                ctx.attempts.append(provider.name)
                if coordinates is not None:
                    ctx.coordinates = coordinates
                    ctx.state = FallbackState.SUCCESS
                    breaker.record_success()
                    logger.info(
                        "Resolved %r via %s in %.1f ms",
                        address,
                        provider.name,
                        ctx.total_latency_ms,
                    )
                    return ctx
                # ZERO_RESULTS: try next provider
                logger.debug("No results from %s for %r", provider.name, address)
            except _HaltChainError as exc:
                ctx.total_latency_ms += (time.monotonic() - start) * 1000
                ctx.attempts.append(provider.name)
                ctx.state = FallbackState.INVALID_INPUT
                logger.error("Chain halted: %s", exc)
                return ctx
            except Exception as exc:
                ctx.total_latency_ms += (time.monotonic() - start) * 1000
                ctx.attempts.append(provider.name)
                breaker.record_failure()
                logger.warning("Provider %s failed: %s", provider.name, exc)
        logger.info(
            "Fallback chain exhausted for %r after %d provider(s)", address, len(ctx.attempts)
        )
        return ctx

    async def _call_with_retry(
        self, config: ProviderConfig, address: str
    ) -> Optional[tuple[float, float]]:
        """Retry up to `config.max_retries` times with exponential backoff."""
        last_exc: Optional[Exception] = None
        for attempt in range(config.max_retries + 1):
            if attempt:
                await _backoff_sleep(attempt - 1)
            try:
                return await self._call_provider(config, address)
            except _HaltChainError:
                raise
            except Exception as exc:
                last_exc = exc
        raise last_exc  # type: ignore[misc]

    async def _call_provider(
        self, config: ProviderConfig, address: str
    ) -> Optional[tuple[float, float]]:
        """
        Call one provider and return (lat, lng) on match, None on ZERO_RESULTS.

        Raises _HaltChainError for 400/401/403; re-raises httpx errors for
        transient conditions (5xx, timeout, network).
        """
        # Illustrates Google Maps Geocoding API structure.
        # Adapt params/response parsing for each provider.
        url = urljoin(config.base_url, "json")
        params = {"address": address, "key": config.api_key}
        response = await self._client.get(
            url, params=params, timeout=config.timeout
        )
        if response.status_code in (400, 401, 403):
            raise _HaltChainError(
                f"HTTP {response.status_code} from {config.name} - halting chain"
            )
        response.raise_for_status()
        data: dict = response.json()
        status = data.get("status", "")
        if status == "OK" and data.get("results"):
            loc = data["results"][0]["geometry"]["location"]
            lat, lng = float(loc["lat"]), float(loc["lng"])
            _validate_coordinates(lat, lng, config.name)
            return lat, lng
        if status in ("ZERO_RESULTS", "NOT_FOUND"):
            return None
        raise ValueError(f"Unexpected status '{status}' from {config.name}")

    async def close(self) -> None:
        await self._client.aclose()

    async def __aenter__(self) -> "GeocodingFallbackChain":
        return self

    async def __aexit__(self, *_: object) -> None:
        await self.close()
class _HaltChainError(Exception):
    """Signals a hard failure that must stop chain progression."""


async def _backoff_sleep(attempt: int, base: float = 0.5, cap: float = 10.0) -> None:
    import random

    delay = min(cap, base * (2 ** attempt))
    await asyncio.sleep(random.uniform(0, delay))


def _validate_coordinates(lat: float, lng: float, provider: str) -> None:
    """Reject coordinates outside valid ranges or known geocoding artefacts."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lng <= 180.0):
        raise ValueError(f"{provider} returned out-of-range coordinates ({lat}, {lng})")
    # Null Island guard: (0, 0) is a common geocoding artefact
    if lat == 0.0 and lng == 0.0:
        raise ValueError(f"{provider} returned Null Island coordinates")
6. Vectorised Pandas Variant
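For bulk address files, resolve the whole address column concurrently rather than calling the chain row by row with pandas apply: collect the column into a list, run it through asyncio.gather inside a single asyncio.run call, and write the results back as new columns. Note that asyncio.run raises a RuntimeError when an event loop is already running (as in a Jupyter notebook); in that case, await resolve_batch directly. The rate-limiting strategies for batch processing page covers semaphore-based concurrency controls that prevent quota exhaustion when running many chains simultaneously.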
import asyncio
from typing import Any

import pandas as pd


async def resolve_batch(
    addresses: list[str], providers: list[ProviderConfig]
) -> list[RequestContext]:
    """Resolve a list of addresses concurrently with a shared semaphore."""
    sem = asyncio.Semaphore(10)  # max 10 in-flight requests
    async with GeocodingFallbackChain(providers) as chain:

        async def bounded_resolve(addr: str) -> RequestContext:
            async with sem:
                return await chain.resolve(addr)

        return await asyncio.gather(*[bounded_resolve(a) for a in addresses])


def geocode_dataframe(df: pd.DataFrame, providers: list[ProviderConfig]) -> pd.DataFrame:
    """
    Add 'lat', 'lng', 'provider_chain', and 'resolution_state' columns to df.

    Expects a 'normalised_address' column produced upstream by the parsing pipeline.
    """
    contexts = asyncio.run(resolve_batch(df["normalised_address"].tolist(), providers))
    df = df.copy()
    df["lat"] = [c.coordinates[0] if c.coordinates else None for c in contexts]
    df["lng"] = [c.coordinates[1] if c.coordinates else None for c in contexts]
    df["provider_chain"] = [" → ".join(c.attempts) for c in contexts]
    df["resolution_state"] = [c.state.value for c in contexts]
    return df
Provider Parameter Reference
| Provider | Base URL | Key parameter | ZERO_RESULTS signal | Notes |
|---|---|---|---|---|
| Google Maps Geocoding | https://maps.googleapis.com/maps/api/geocode/ | key | status == "ZERO_RESULTS" | Richest component detail; charges per request |
| HERE Geocode | https://geocode.search.hereapi.com/v1/geocode | apiKey | Empty items array | Strong European coverage |
| Mapbox Geocoding | https://api.mapbox.com/geocoding/v5/mapbox.places/ | access_token | Empty features array | GeoJSON response; good for US addresses |
| TomTom Search | https://api.tomtom.com/search/2/geocode/ | key | Empty results array | Good fallback for logistics routes |
| Nominatim (OSM) | https://nominatim.openstreetmap.org/search | None (user-agent required) | Empty JSON array | Free; strict rate limit (1 req/s); last resort |
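Because each provider signals "no match" differently, response parsing is usually isolated per provider. The sketch below shows one possible shape. The `extract_coordinates` function and the short provider labels are hypothetical, and the coordinate field names (`position`, `center`, `lat`/`lon`) are assumptions based on each provider's public response format that should be checked against current documentation before use.

```python
from typing import Any, Optional

Coordinates = tuple[float, float]  # (lat, lng)


def extract_coordinates(provider: str, payload: Any) -> Optional[Coordinates]:
    """Return (lat, lng) from a decoded JSON payload, or None when there is no match."""
    if provider == "google":
        status = payload.get("status")
        if status == "ZERO_RESULTS":
            return None
        if status != "OK" or not payload.get("results"):
            # Keep quota and error statuses distinct from "no match".
            raise ValueError(f"Unexpected Google status {status!r}")
        loc = payload["results"][0]["geometry"]["location"]
        return float(loc["lat"]), float(loc["lng"])
    if provider == "here":
        items = payload.get("items") or []
        if not items:
            return None
        pos = items[0]["position"]
        return float(pos["lat"]), float(pos["lng"])
    if provider == "mapbox":
        features = payload.get("features") or []
        if not features:
            return None
        lng, lat = features[0]["center"]  # GeoJSON order is [lng, lat]
        return float(lat), float(lng)
    if provider == "tomtom":
        results = payload.get("results") or []
        if not results:
            return None
        pos = results[0]["position"]
        return float(pos["lat"]), float(pos["lon"])
    if provider == "nominatim":
        if not payload:  # top-level JSON array
            return None
        return float(payload[0]["lat"]), float(payload[0]["lon"])
    raise ValueError(f"Unknown provider {provider!r}")


print(extract_coordinates("mapbox", {"features": [{"center": [-77.03, 38.9]}]}))
```

Returning None only for a genuine empty result, and raising for everything else, keeps the "no match" and "provider degraded" signals separate for the chain's state machine.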
Edge Cases
Partial Address Resolution Produces a Wrong Centroid
Some providers return a result for a truncated address, for example resolving “123 Main St Springfield” to the city centroid rather than the street. A provider-specific precision field (such as location_type or result_type) distinguishes rooftop precision from city or postcode centroids. Reject low-precision results and advance the chain rather than accepting a coarse match.
ACCEPTABLE_TYPES = {"rooftop", "range_interpolated", "geometric_center"}


def is_precise_enough(result: dict) -> bool:
    """Return True for rooftop, interpolated, or geometric-centre matches."""
    location_type: str = (
        result.get("geometry", {}).get("location_type", "")
        or result.get("result_type", "")
    )
    # Casing differs between providers (Google returns e.g. "ROOFTOP"),
    # so compare case-insensitively.
    return location_type.lower() in ACCEPTABLE_TYPES
International Addresses and Character Encoding
Providers differ in their handling of diacritics and non-Latin scripts. Always apply NFKC normalisation before sending an address to any provider tier. A provider that rejects a raw Unicode query might accept the normalised form. Send both forms if the first fails, rather than immediately advancing the chain.
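A minimal sketch of the two-form approach, using the standard library's `unicodedata.normalize`. The `query_forms` helper is a hypothetical name; it returns the NFKC form first and the raw form second, and only when the two differ.

```python
import unicodedata


def query_forms(address: str) -> list[str]:
    """Return the NFKC form first, then the raw form if it differs."""
    raw = address.strip()
    normalised = unicodedata.normalize("NFKC", raw)
    return [normalised] if normalised == raw else [normalised, raw]


# Full-width characters fold to ASCII under NFKC.
print(query_forms("１２３ Ｍａｉｎ Ｓｔ"))
# A decomposed accent (e + combining acute) composes to a single code point.
print(query_forms("Cafe\u0301 de Flore"))
```

Trying the second form against the same provider costs one extra request, so it is worth doing only when the first form returned an error or no match.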
Unstructured Free-Text Input
If an address arrives as a single unstructured string, run it through the core address parsing pipeline to separate street number, street name, city, postcode, and country before geocoding. Structured components substantially improve first-pass match rates on all tiers.
Provider Returns a Result in the Wrong Country
When a query omits the country component, some providers silently resolve it to a city in a different country with a similar name. Validate the returned country code against your expected country before accepting the result.
def country_matches(result: dict, expected_iso2: str) -> bool:
    """Check the address component for the country short name."""
    for component in result.get("address_components", []):
        if "country" in component.get("types", []):
            return component.get("short_name", "").upper() == expected_iso2.upper()
    return True  # No country component: cannot validate, pass through
Rate Limit Spikes During Batch Runs
If you are tracking API spend with Redis, integrate the quota counters with the fallback selector so the chain automatically routes around providers that have consumed their daily budget before hitting a live 429.
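The selection step can be sketched independently of the storage backend. In the example below, plain dictionaries stand in for Redis counters, and `providers_within_budget` and its `headroom` parameter are hypothetical names introduced for illustration.

```python
def providers_within_budget(
    chain: list[str],
    used_today: dict[str, int],
    daily_budget: dict[str, int],
    headroom: float = 0.05,
) -> list[str]:
    """Drop providers whose usage is within `headroom` of their daily budget."""
    kept = []
    for name in chain:
        budget = daily_budget.get(name)
        if budget is None:
            kept.append(name)  # no budget configured: never filtered out
            continue
        if used_today.get(name, 0) < budget * (1 - headroom):
            kept.append(name)
    return kept


chain = ["google", "here", "nominatim"]
used = {"google": 9600, "here": 100}
budget = {"google": 10000, "here": 5000}
print(providers_within_budget(chain, used, budget))
```

Leaving a small headroom means the chain stops routing to a provider slightly before the hard limit, which absorbs requests that are already in flight when the counter is read.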
Performance and Scaling
Per-provider timeout discipline is the single biggest lever. A 5-second timeout per tier means a three-tier chain can block for 15 seconds on a pathological input even with no retries; with max_retries set to 2, each tier makes up to three attempts plus backoff, so the worst case roughly triples. Set aggressive timeouts (2–3 s for commercial APIs, 5 s for community APIs) and rely on the chain to advance rather than waiting for the full window.
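The worst-case arithmetic can be sketched as a small function. `worst_case_seconds` is a hypothetical helper, and it treats the timeout as a bound on one whole attempt, which is an approximation because httpx applies a single timeout value to the connect, read, write, and pool phases separately.

```python
def worst_case_seconds(
    tiers: int,
    timeout: float,
    max_retries: int,
    base: float = 0.5,
    cap: float = 10.0,
) -> float:
    """Approximate upper bound on blocking time when every attempt times out."""
    attempts = max_retries + 1
    # A backoff sleep precedes each retry; full jitter sleeps at most the ceiling.
    backoff = sum(min(cap, base * 2 ** i) for i in range(max_retries))
    return tiers * (attempts * timeout + backoff)


print(worst_case_seconds(tiers=3, timeout=5.0, max_retries=0))  # 15.0
print(worst_case_seconds(tiers=3, timeout=5.0, max_retries=2))  # 49.5
print(worst_case_seconds(tiers=3, timeout=2.0, max_retries=1))  # 13.5
```

With the defaults used on this page (5 s timeout, two retries), the bound is close to 50 seconds, which is why shortening the timeout or the retry count matters more than any other single setting.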
Concurrency with semaphores (shown in the pandas variant above) lets you saturate your quota without exceeding it. Start with a concurrency limit of 10 and benchmark against your Tier 1 provider’s documented rate limit. For sustained throughput above 1 000 records per minute, offload the chain to a worker pool and decouple input ingestion from resolution using a message queue.
Caching resolved coordinates by normalised address string eliminates redundant chain traversals for repeated inputs. A Redis TTL of 7–30 days is appropriate for most address data. See the dynamic provider selection based on region page for how to partition the cache by geography when your providers have asymmetric regional accuracy.
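A minimal sketch of keying a cache by normalised address follows. The `TTLCache` class is an in-memory stand-in for Redis, and `cache_key` with its `geocode:` prefix is a hypothetical naming choice; the canonicalisation shown (case folding plus whitespace collapsing) is an assumption about what "normalised" means in your pipeline.

```python
import hashlib
import time
from typing import Optional

Coordinates = tuple[float, float]


def cache_key(normalised_address: str) -> str:
    """Stable key: hash of the case-folded, whitespace-collapsed address."""
    canonical = " ".join(normalised_address.casefold().split())
    return "geocode:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a Redis cache with per-key expiry."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, Coordinates]] = {}

    def get(self, address: str) -> Optional[Coordinates]:
        key = cache_key(address)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, coords = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return coords

    def set(self, address: str, coords: Coordinates) -> None:
        self._store[cache_key(address)] = (time.monotonic() + self.ttl, coords)


cache = TTLCache(ttl_seconds=7 * 24 * 3600)
cache.set("10 Downing St, London", (51.5034, -0.1276))
print(cache.get("10  downing st,  LONDON"))
```

Checking the cache before entering the chain means a repeated address costs no provider quota at all until its entry expires.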
Dead-letter queue throughput. Unresolvable addresses should flow into a DLQ for human review or periodic reprocessing. Instrument the DLQ depth as a key metric — a sustained DLQ backlog signals either input quality problems upstream or systemic provider degradation.
Troubleshooting
httpx.ReadTimeout on Every Provider
The timeout is set too aggressively for your network environment, or the provider endpoint is unreachable. Verify connectivity with a direct curl call. Increase timeout in ProviderConfig incrementally and monitor p99 latency before settling on a value.
Chain Returns INVALID_INPUT for Valid Addresses
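In the executor above, INVALID_INPUT is set only when _call_provider raises _HaltChainError, which happens on HTTP 400, 401, or 403. A valid address that ends in this state therefore usually points to an authentication or configuration problem (an expired key, or request parameters that do not match the provider's API) rather than to the address itself. Parsing failures raise ValueError, which is retried and then surfaces as EXHAUSTED, so they do not produce this state. Add structured logging for the status code and raw response body in _call_provider to diagnose.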
Circuit Breaker Stays Open After Provider Recovers
The cooldown window has not elapsed, or the successful health probe is not being routed through the same CircuitBreaker instance. Ensure the GeocodingFallbackChain instance is long-lived (shared across requests in an async application) rather than re-instantiated per call. Use an async context manager (async with GeocodingFallbackChain(providers) as chain) to manage the client lifecycle.
ZERO_RESULTS Rate Increases After Switching to a New Tier Configuration
The new Tier 1 provider has lower regional coverage for your address corpus. Compare match rates by country or postcode prefix across provider tiers. Use the geocoding accuracy comparison patterns to benchmark before promoting a provider.
Coordinate Validation Raises on Legitimate Remote Locations
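The Null Island guard in _validate_coordinates rejects only the exact pair (0.0, 0.0), so it should not fire on legitimate coordinates that merely lie near the equator and prime meridian. If it does, check whether the provider is returning 0.0 as a placeholder for missing fields, or whether the guard has been widened to a bounding box (e.g. abs(lat) < 0.5 and abs(lng) < 0.5) that is too large for your data. A bounding box catches more near-zero artefacts than exact equality but also rejects any genuine location inside it, so size it to match the regions your addresses cover.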
FAQ
Should I retry on HTTP 400 or halt the chain?
Halt. A 400 Bad Request signals malformed input the provider rejected — passing the same query to the next provider will produce the same result. Fix the input upstream before re-entering the chain.
How many providers should a production fallback chain have?
Three tiers is the practical ceiling for most pipelines: a high-accuracy commercial API (Tier 1), an open or regional alternative (Tier 2), and a last-resort provider with broad but lower-precision coverage (Tier 3). Beyond three, latency accumulates and marginal resolution gains drop steeply.
What is the difference between a fallback chain and a retry loop?
A retry loop re-issues the same request to the same provider; a fallback chain advances to a different provider on failure. Good implementations combine both: retry transiently-failing providers with backoff before falling back to the next tier.
How do I handle ZERO_RESULTS vs. a network timeout differently?
ZERO_RESULTS means the provider received and understood the query but found no match — advance to the next provider immediately. A timeout means the provider may be degraded — apply backoff before retrying or advancing. Never conflate these two signals in your state machine.
Can I run fallback providers in parallel instead of sequentially?
Yes, but only as an optimisation for latency-critical paths. Parallel fanout burns quota on every provider simultaneously. The safer pattern is sequential with aggressive per-provider timeouts (2–5 s), falling back immediately on timeout rather than waiting for the full window.
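For the latency-critical case, a parallel fanout can be sketched with `asyncio.as_completed`. The `race_providers` function and the `Lookup` type alias are hypothetical; each lookup is assumed to be an async callable that returns (lat, lng) or None.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Coordinates = tuple[float, float]
Lookup = Callable[[str], Awaitable[Optional[Coordinates]]]


async def race_providers(address: str, lookups: list[Lookup]) -> Optional[Coordinates]:
    """Query all providers at once and return the first non-None result."""
    tasks = [asyncio.create_task(lookup(address)) for lookup in lookups]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                result = await finished
            except Exception:
                continue  # a failed provider does not end the race
            if result is not None:
                return result
        return None
    finally:
        # Abandon slower requests; quota for them may already be spent.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def _demo() -> None:
    async def slow(_: str) -> Optional[Coordinates]:
        await asyncio.sleep(0.2)
        return (1.0, 1.0)

    async def fast(_: str) -> Optional[Coordinates]:
        await asyncio.sleep(0.01)
        return (2.0, 2.0)

    print(await race_providers("10 Downing St", [slow, fast]))


asyncio.run(_demo())
```

The first answer wins regardless of tier, so this pattern trades the accuracy ordering of the sequential chain for speed; that trade-off is another reason to reserve it for paths where latency matters more than cost.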
Related
- Multi-API Routing & Fallback Chains: the parent section covering the full range of provider routing strategies for geocoding pipelines.
- Configuring Google Maps Fallback to OpenStreetMap: step-by-step guide to the most common two-provider fallback setup, including DLQ routing for unresolvable inputs.
- API Quota Tracking and Cost Management: how to monitor spend per fallback tier and integrate budget thresholds into provider selection.
- Building Async Geocoding Requests in Python: asyncio and httpx patterns for non-blocking bulk resolution, directly complementing the executor built on this page.
- Rate-Limiting Strategies for Batch Processing: semaphore and token-bucket patterns that prevent quota exhaustion when the fallback chain runs over large address datasets.