ADR-008: Network Resilience Patterns
Status: Accepted (promoted from Proposed 2026-04-20; map_reqwest_error / extract_retry_after / circuit breaker / backoff all shipped and in use across maekon-network/src/{http_client,resilience,sync/remote_transport,integration/http_transport/mod,ai_llm_client/request}.rs + gRPC error mapping)
Date: 2026-03-09
Scope: maekon-network crate, all network-facing adapters
Note on CoreError syntax in example snippets: Examples below use the pre-ADR-019 syntax
CoreError::RateLimit { retry_after_secs }. After ADR-019 these must be written as struct variants with a typedcodefield:CoreError::RateLimit {code: maekon_core::error_codes::NetworkCode::RateLimit,retry_after_secs,}The resilience patterns themselves (retry/backoff/circuit-breaker/retry-after parsing) are unchanged by ADR-019.
Context
The desktop agent communicates with the connected server via HTTP REST, SSE, WebSocket, and gRPC. Desktop environments produce network failures a server process never sees: WiFi drops, VPN reconnects, sleep/wake cycles, and rolling server deployments. The agent must handle these without losing buffered data or overwhelming a recovering server.
Three incremental fixes surfaced the gaps this ADR closes:
| Pivot commit | Date | Path | Finding |
|---|---|---|---|
b13a46b | 2026-02-28 | http_client.rs | RequestTimeout + is_retryable: backoff present, no jitter |
ffa2478 | 2026-03-01 | batch_uploader.rs | Queue OOM fixed; flush retry added, no circuit breaker |
50ac66b | 2026-03-08 | sse_client.rs | SSE reconnect loop added, no jitter |
Decisions
1. Exponential Backoff with Jitter
Rule: All retry loops MUST use exponential backoff with jitter. Cap at a configurable maximum.
// 지수 백오프 + 지터 계산
fn backoff_delay(attempt: u32, base_ms: u64, max_ms: u64) -> Duration {
let exp = base_ms.saturating_mul(2u64.saturating_pow(attempt.min(10)));
let jitter = rand::thread_rng().gen_range(0..=(exp / 4));
Duration::from_millis((exp + jitter).min(max_ms))
}
Current status:
| Location | State | Action |
|---|---|---|
HttpApiClient::execute_with_retry() | Backoff, no jitter | Use backoff_delay() |
SseStreamClient::connect() | Backoff (retry_delay * 2), no jitter | Use backoff_delay() |
BatchUploader::flush() | Backoff, no jitter | Use backoff_delay() |
Default caps: 30 s for SSE/HTTP, 60 s for batch flush. Without jitter, all clients that dropped simultaneously reconnect at identical timestamps, spiking server load during recovery.
2. Token Refresh De-duplication
Rule: Only one refresh request may be in-flight at any time. Concurrent
callers that see needs_refresh = true MUST wait for the in-progress refresh.
Current problem in auth.rs: every caller releases the RwLock guard and
individually calls refresh(), firing N parallel POST requests.
Required pattern — AtomicBool + Notify:
pub struct TokenManager {
state: Arc<RwLock<Option<TokenState>>>,
refreshing: AtomicBool, // 리프레시 진행 중 여부
refresh_notify: Arc<Notify>, // 완료 시 대기 태스크 일괄 깨움
client: reqwest::Client,
base_url: String,
}
pub async fn get_token(&self) -> Result<String, CoreError> {
if self.refreshing.load(Ordering::Acquire) {
self.refresh_notify.notified().await;
}
let needs_refresh = { /* expiry check via RwLock */ };
if needs_refresh {
if self.refreshing
.compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
.is_ok()
{
let result = self.do_refresh().await;
self.refreshing.store(false, Ordering::Release);
self.refresh_notify.notify_waiters();
result?;
} else {
self.refresh_notify.notified().await; // 다른 태스크가 리프레시 중
}
}
// state RwLock에서 토큰 반환
}
refresh_notify is Arc<Notify> — shared across all TokenManager clones.
3. Circuit Breaker
Rule: Network clients that experience repeated failures MUST implement a circuit breaker to prevent overwhelming a recovering server.
States: Closed (normal) → Open (block requests) → Half-Open (probe).
/// 서킷 브레이커 — 연속 장애 시 요청 차단
pub struct CircuitBreaker {
state: AtomicU8, // 0=Closed, 1=Open, 2=HalfOpen
failure_count: AtomicU32,
failure_threshold: u32, // 기본값: 5
recovery_timeout: Duration, // 기본값: 30 s
last_failure_ms: AtomicU64, // Unix ms 타임스탬프
}
Scope (2026-03-09 original): Apply to BatchUploader. The flush path currently
retries max_retries times per call with no memory across scheduler ticks,
making it possible to hammer a permanently-down server on every 5-second cycle.
HttpApiClient::execute_with_retry() is already bounded per-call and is exempt.
Scope update 2026-04-20 (D7 broadening): The breaker now also guards
RemoteEmbeddingProvider, AnalysisClient, RemoteOcrProvider,
RemoteLlmProvider, and HttpApiSession. All 5 adapters resolve their
per-endpoint breaker through a shared CircuitBreakerRegistry keyed by
scheme://host:port so multiple adapters targeting the same endpoint
(e.g., two OpenAI clients on different models) converge on one breaker.
The original circuit-breaker broadening design is archived as an internal planning artifact and is not part of the public-minimal export.
Classification is centralized in resilience::classify_for_breaker:
- 5xx / transport / 401 / 429 →
Failure(endpoint health) - 2xx →
Success - Other 4xx (400, 404, 422) →
Neutral— caller bug; must not trip the shared breaker for every other caller against the same endpoint
Streaming sessions (HttpApiSession) use three-tier semantics: initial
HTTP status drives the breaker; mid-stream disconnects do NOT record.
This matches the BatchUploader pattern where "server acknowledged" = success.
The Ollama model-capability probe in ai_ocr_client::ensure_runtime_ocr_model_ready
is intentionally NOT wrapped — sidecar calls that fire once per request
are out of scope; the main OCR send drives breaker state.
Integration transports (sync/remote_transport, integration/http_transport)
remain deferred — the breaker-placement decision (adapter layer vs port-trait
layer) is its own follow-up pending the port-trait round.
4. Rate Limit Header Parsing
Rule: HTTP 429 responses MUST parse the Retry-After header. A hardcoded
fallback is only acceptable when the header is absent.
Current problem in http_client.rs:
// 현재: Retry-After 헤더 무시, 60초 하드코딩
429 => Err(CoreError::RateLimit { retry_after_secs: 60 }),
Required replacement:
/// 429 응답의 Retry-After 헤더를 파싱한다. 부재/파싱 실패 시 60초 기본값 반환.
fn extract_retry_after(response: &reqwest::Response) -> u64 {
response.headers()
.get("retry-after")
.and_then(|v| v.to_str().ok())
.and_then(|s| s.parse::<u64>().ok())
.unwrap_or(60)
}
429 => Err(CoreError::RateLimit { retry_after_secs: extract_retry_after(&resp) }),
execute_with_retry() already overrides the delay with retry_after_secs; no
further change is needed there.
Consequences
Must do (gates new network code merges):
backoff_delay()lands inmaekon-network/src/resilience.rsand replaces all inline delay calculations.extract_retry_after()replaces the hardcoded60incheck_response.TokenManagergainsAtomicBool+Arc<Notify>to de-duplicate refreshes.
Should do (next sprint):
CircuitBreakerimplemented inresilience.rsand wired intoBatchUploader.- Unit tests for each pattern: jitter range, single-refresh assertion, circuit state transitions, and header fallback.
Constraints: No new workspace dependencies are required. rand is already
present via maekon-vision. All changes are contained within maekon-network
— consistent with the crate dependency rules in ADR-001 §6.