[Detail Bug] Excessive backoff and wrong reset selection cause prolonged provider lockouts after 429s #35

@detail-app

Description


Summary

  • Context: The RateLimitService coordinates provider availability using persistent state in RateLimitState and in-memory state in ProviderCircuitState to handle API rate limits (HTTP 429).
  • Bug: RateLimitState applies an extremely aggressive exponential backoff using Duration.ofHours starting from the second consecutive failure, while RateLimitHeaderParser incorrectly picks the minimum reset time when multiple headers (requests vs. tokens) are provided.
  • Actual vs. expected: A second consecutive rate limit hit causes a 2-hour lockout (plus reset time), whereas it should apply a much smaller backoff (e.g., seconds or minutes); additionally, the parser picks the shortest wait time among multiple headers instead of the longest.
  • Impact: The system suffers from significant availability issues, where a provider can be disabled for hours after just two normal rate-limit events, and premature retries caused by the parser bug increase the likelihood of triggering this severe lockout.

Code with bug

1. Excessive Backoff in RateLimitState.java

// src/main/java/com/williamcallahan/javachat/service/RateLimitState.java

if (failures > 1) {
    Duration additionalBackoff = Duration.ofHours((long) Math.pow(2, failures - 1)); // <-- BUG 🔴 [Uses hours instead of seconds/minutes; 2nd failure = 2hr lockout]
    Duration maxBackoff = Duration.ofDays(7); // Never back off more than a week

    if (additionalBackoff.compareTo(maxBackoff) > 0) {
        additionalBackoff = maxBackoff;
    }

    state.rateLimitedUntil = state.rateLimitedUntil.plus(additionalBackoff);
}
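
To illustrate the severity, here is a standalone sketch (not project code) that evaluates the buggy formula for the first few consecutive failures:

```java
import java.time.Duration;

// Standalone illustration of the additional backoff the buggy
// Duration.ofHours formula produces per consecutive failure.
public class BackoffDemo {
    public static void main(String[] args) {
        Duration maxBackoff = Duration.ofDays(7);
        for (int failures = 2; failures <= 6; failures++) {
            Duration backoff = Duration.ofHours((long) Math.pow(2, failures - 1));
            if (backoff.compareTo(maxBackoff) > 0) {
                backoff = maxBackoff;
            }
            System.out.println(failures + " consecutive failures -> +"
                    + backoff.toHours() + "h additional backoff");
        }
    }
}
```

This prints 2h, 4h, 8h, 16h, 32h for failures 2 through 6: two ordinary 429s in a row already cost a 2-hour lockout on top of the provider's own reset time.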

2. Incorrect Minimum Reset Selection in RateLimitHeaderParser.java

// src/main/java/com/williamcallahan/javachat/service/RateLimitHeaderParser.java

long candidateSeconds = minPositive( // <-- BUG 🔴 [Should use max() to ensure all limits have reset before retrying]
        firstHeaderValue(headers, "x-ratelimit-reset-requests")
                .map(this::parseDurationSeconds)
                .orElse(0L),
        firstHeaderValue(headers, "x-ratelimit-reset-tokens")
                .map(this::parseDurationSeconds)
                .orElse(0L),
        // ...
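
For context, a minimal sketch (variable names are hypothetical) contrasting min- vs. max-selection over the two reset values from the reproduction below:

```java
import java.util.stream.LongStream;

// Demonstrates why picking the minimum reset time retries too early
// when the request and token limits reset at different times.
public class ResetSelectionDemo {
    public static void main(String[] args) {
        long requestsResetSeconds = 1;  // x-ratelimit-reset-requests: "1s"
        long tokensResetSeconds = 60;   // x-ratelimit-reset-tokens: "60s"

        // Current behavior: smallest positive value wins -> retry after 1s,
        // while the token bucket stays exhausted for another 59s.
        long minPositive = LongStream.of(requestsResetSeconds, tokensResetSeconds)
                .filter(v -> v > 0).min().orElse(0L);

        // Safe behavior: wait until ALL limits have reset.
        long maxPositive = LongStream.of(requestsResetSeconds, tokensResetSeconds)
                .filter(v -> v > 0).max().orElse(0L);

        System.out.println("min=" + minPositive + "s, max=" + maxPositive + "s");
    }
}
```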

Evidence

Reproduction: Excessive Backoff

I created a test ExcessiveBackoffTest.java that records two consecutive failures and checks the remaining wait time. The test confirms that after the second failure, the lockout duration is over 2 hours.

@Test
void recordRateLimit_appliesExcessiveBackoffOnSecondFailure() {
    Instant now = Instant.now();
    // First failure sets state.consecutiveFailures to 1
    rateLimitState.recordRateLimit(PROVIDER_NAME, now.plusSeconds(1), "1m");
    
    // Second failure sets state.consecutiveFailures to 2, triggering backoff
    rateLimitState.recordRateLimit(PROVIDER_NAME, now.plusSeconds(1), "1m");
    
    Duration waitTime2 = rateLimitState.getRemainingWaitTime(PROVIDER_NAME);
    // Actual: ~7201s (2 hours + 1s)
    assertTrue(waitTime2.toHours() >= 2); 
}

Reproduction: Header Parser Bug

I created RateLimitHeaderParserBugTest.java to verify the reset time selection. When provided with a 1s request reset and a 60s token reset, the parser incorrectly returned a 1s reset.

@Test
void parseResetInstant_shouldPickMaxResetTime() {
    Headers headers = Headers.builder()
            .put("x-ratelimit-reset-requests", "1s")
            .put("x-ratelimit-reset-tokens", "60s")
            .build();
    
    Optional<Instant> resetInstant = parser.parseResetInstant(headers);
    long secondsUntilReset = Duration.between(Instant.now(), resetInstant.get()).getSeconds();
    
    // Actual: 1s (due to minPositive)
    // Expected: 60s
    assertTrue(secondsUntilReset < 10); 
}

Why has this bug gone undetected?

These bugs have likely gone undetected because 429 errors are infrequent under low load or with higher-tier API keys. Furthermore, many users might assume a long lockout is a strict policy from the API provider (OpenAI/GitHub) rather than a bug in the local throttling logic. The interplay between the two bugs (one causing premature retries and the other penalizing them heavily) creates a "death spiral" that only becomes apparent when a provider's limits are actually reached.

Recommended fix

  1. In RateLimitState.java, change Duration.ofHours to a more appropriate unit such as Duration.ofSeconds or Duration.ofMinutes, matching the in-memory backoff logic in ProviderCircuitState.
  2. In RateLimitHeaderParser.java, replace the minPositive selection with maxPositive logic so the system waits until ALL relevant rate limits have reset before attempting a new request.
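
One possible shape for fix 1, as a hedged sketch (the class and method names are hypothetical, and the base unit and cap are design choices to be matched against ProviderCircuitState):

```java
import java.time.Duration;

// Hypothetical standalone version of fix 1: seconds-based exponential
// backoff with a modest cap instead of the hours-based 7-day ceiling.
public class BackoffFixSketch {
    static Duration additionalBackoff(int failures) {
        if (failures <= 1) {
            return Duration.ZERO;
        }
        Duration backoff = Duration.ofSeconds((long) Math.pow(2, failures - 1));
        Duration maxBackoff = Duration.ofMinutes(30); // cap far below the old 7-day ceiling
        return backoff.compareTo(maxBackoff) > 0 ? maxBackoff : backoff;
    }

    public static void main(String[] args) {
        System.out.println(additionalBackoff(2).getSeconds() + "s");  // 2s instead of 2h
        System.out.println(additionalBackoff(20).getSeconds() + "s"); // capped at 1800s
    }
}
```

With this shape, a second consecutive 429 adds seconds rather than hours, and the cap bounds the worst case regardless of how many failures accumulate.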

Metadata

Labels: bug (Something isn't working), detail