[Detail Bug] Excessive backoff and wrong reset selection cause prolonged provider lockouts after 429s #35

@detail-app

Description


Summary

  • Context: The RateLimitService coordinates provider availability using persistent state in RateLimitState and in-memory state in ProviderCircuitState to handle API rate limits (HTTP 429).
  • Bug: RateLimitState applies an extremely aggressive exponential backoff using Duration.ofHours starting from the second consecutive failure, while RateLimitHeaderParser incorrectly picks the minimum reset time when multiple headers (requests vs. tokens) are provided.
  • Actual vs. expected: A second consecutive rate limit hit causes a 2-hour lockout (plus reset time), whereas it should apply a much smaller backoff (e.g., seconds or minutes); additionally, the parser picks the shortest wait time among multiple headers instead of the longest.
  • Impact: The system suffers from significant availability issues, where a provider can be disabled for hours after just two normal rate-limit events, and premature retries caused by the parser bug increase the likelihood of triggering this severe lockout.

Code with bug

1. Excessive Backoff in RateLimitState.java

// src/main/java/com/williamcallahan/javachat/service/RateLimitState.java

if (failures > 1) {
    Duration additionalBackoff = Duration.ofHours((long) Math.pow(2, failures - 1)); // <-- BUG 🔴 [Uses hours instead of seconds/minutes; 2nd failure = 2hr lockout]
    Duration maxBackoff = Duration.ofDays(7); // Never back off more than a week

    if (additionalBackoff.compareTo(maxBackoff) > 0) {
        additionalBackoff = maxBackoff;
    }

    state.rateLimitedUntil = state.rateLimitedUntil.plus(additionalBackoff);
}
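
To illustrate the severity, here is a standalone sketch (not project code) that evaluates the buggy formula for the first few consecutive failures:

```java
import java.time.Duration;

// Standalone illustration of the additional backoff the buggy
// Duration.ofHours formula produces per consecutive failure.
public class BackoffDemo {
    public static void main(String[] args) {
        Duration maxBackoff = Duration.ofDays(7);
        for (int failures = 2; failures <= 6; failures++) {
            Duration backoff = Duration.ofHours((long) Math.pow(2, failures - 1));
            if (backoff.compareTo(maxBackoff) > 0) {
                backoff = maxBackoff;
            }
            System.out.println(failures + " consecutive failures -> +"
                    + backoff.toHours() + "h additional backoff");
        }
    }
}
```

This prints 2h, 4h, 8h, 16h, 32h for failures 2 through 6: two ordinary 429s in a row already cost a 2-hour lockout on top of the provider's own reset time.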

2. Incorrect Minimum Reset Selection in RateLimitHeaderParser.java

// src/main/java/com/williamcallahan/javachat/service/RateLimitHeaderParser.java

long candidateSeconds = minPositive( // <-- BUG 🔴 [Should use max() to ensure all limits have reset before retrying]
        firstHeaderValue(headers, "x-ratelimit-reset-requests")
                .map(this::parseDurationSeconds)
                .orElse(0L),
        firstHeaderValue(headers, "x-ratelimit-reset-tokens")
                .map(this::parseDurationSeconds)
                .orElse(0L),
        // ...
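
For context, a minimal sketch (variable names are hypothetical) contrasting min- vs. max-selection over the two reset values from the reproduction below:

```java
import java.util.stream.LongStream;

// Demonstrates why picking the minimum reset time retries too early
// when the request and token limits reset at different times.
public class ResetSelectionDemo {
    public static void main(String[] args) {
        long requestsResetSeconds = 1;  // x-ratelimit-reset-requests: "1s"
        long tokensResetSeconds = 60;   // x-ratelimit-reset-tokens: "60s"

        // Current behavior: smallest positive value wins -> retry after 1s,
        // while the token bucket stays exhausted for another 59s.
        long minPositive = LongStream.of(requestsResetSeconds, tokensResetSeconds)
                .filter(v -> v > 0).min().orElse(0L);

        // Safe behavior: wait until ALL limits have reset.
        long maxPositive = LongStream.of(requestsResetSeconds, tokensResetSeconds)
                .filter(v -> v > 0).max().orElse(0L);

        System.out.println("min=" + minPositive + "s, max=" + maxPositive + "s");
    }
}
```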

Evidence

Reproduction: Excessive Backoff

I created a test ExcessiveBackoffTest.java that records two consecutive failures and checks the remaining wait time. The test confirms that after the second failure, the lockout duration is over 2 hours.

@Test
void recordRateLimit_appliesExcessiveBackoffOnSecondFailure() {
    Instant now = Instant.now();
    // First failure sets state.consecutiveFailures to 1
    rateLimitState.recordRateLimit(PROVIDER_NAME, now.plusSeconds(1), "1m");
    
    // Second failure sets state.consecutiveFailures to 2, triggering backoff
    rateLimitState.recordRateLimit(PROVIDER_NAME, now.plusSeconds(1), "1m");
    
    Duration waitTime2 = rateLimitState.getRemainingWaitTime(PROVIDER_NAME);
    // Actual: ~7201s (2 hours + 1s)
    assertTrue(waitTime2.toHours() >= 2); 
}

Reproduction: Header Parser Bug

I created RateLimitHeaderParserBugTest.java to verify the reset time selection. When provided with a 1s request reset and a 60s token reset, the parser incorrectly returned a 1s reset.

@Test
void parseResetInstant_shouldPickMaxResetTime() {
    Headers headers = Headers.builder()
            .put("x-ratelimit-reset-requests", "1s")
            .put("x-ratelimit-reset-tokens", "60s")
            .build();
    
    Optional<Instant> resetInstant = parser.parseResetInstant(headers);
    long secondsUntilReset = Duration.between(Instant.now(), resetInstant.get()).getSeconds();
    
    // Actual: 1s (due to minPositive)
    // Expected: 60s
    assertTrue(secondsUntilReset < 10); 
}

Why has this bug gone undetected?

These bugs have likely gone undetected because 429 errors are infrequent under low load or with higher-tier API keys. Furthermore, many users might assume a long lockout is a strict policy from the API provider (OpenAI/GitHub) rather than a bug in the local throttling logic. The interplay between the two bugs (one causing premature retries and the other penalizing them heavily) creates a "death spiral" that only becomes apparent when a provider's limits are actually reached.

Recommended fix

  1. In RateLimitState.java, change Duration.ofHours to a more appropriate unit such as Duration.ofSeconds or Duration.ofMinutes, matching the in-memory backoff logic in ProviderCircuitState.
  2. In RateLimitHeaderParser.java, replace the minPositive selection with maxPositive logic so the system waits until ALL relevant rate limits have reset before attempting a new request.
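
One possible shape for fix 1, as a hedged sketch (the class and method names are hypothetical, and the base unit and cap are design choices to be matched against ProviderCircuitState):

```java
import java.time.Duration;

// Hypothetical standalone version of fix 1: seconds-based exponential
// backoff with a modest cap instead of the hours-based 7-day ceiling.
public class BackoffFixSketch {
    static Duration additionalBackoff(int failures) {
        if (failures <= 1) {
            return Duration.ZERO;
        }
        Duration backoff = Duration.ofSeconds((long) Math.pow(2, failures - 1));
        Duration maxBackoff = Duration.ofMinutes(30); // cap far below the old 7-day ceiling
        return backoff.compareTo(maxBackoff) > 0 ? maxBackoff : backoff;
    }

    public static void main(String[] args) {
        System.out.println(additionalBackoff(2).getSeconds() + "s");  // 2s instead of 2h
        System.out.println(additionalBackoff(20).getSeconds() + "s"); // capped at 1800s
    }
}
```

With this shape, a second consecutive 429 adds seconds rather than hours, and the cap bounds the worst case regardless of how many failures accumulate.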

Metadata

Labels: bug (Something isn't working), detail