fix: poll(timeout_ms=0) returns buffered records when coordinator times out #2718
HKanaparthi wants to merge 2 commits into dpkp:master
|
I'm still confused about this. The behavior of dropping buffered fetch requests when the coordinator check fails is expected. It may mean, for example, that the group is in rebalance and the consumer is either no longer in the group or needs to refresh its partition assignment. In either case, I believe it is wrong to continue processing fetched records. But assume that coordinator.poll() returns False when timeout_ms=0; shouldn't the correct behavior be to retry until the coordinator is ready and the group is stable? |
poll(timeout_ms=0)
Per maintainer feedback on dpkp#2718: returning buffered records after coordinator.poll() fails is unsafe because False can mean a rebalance is in progress and the consumer may no longer own those partitions.
Fix by checking fetched_records() *before* coordinator.poll(). Records already in the buffer were fetched during a valid partition assignment (prior to this poll cycle), so returning them is always safe. If the buffer is empty we proceed with coordinator.poll() as before.
This also means coordinator.poll() is never called when buffered records are present, which is the correct non-blocking semantic for timeout_ms=0.
Fixes dpkp#2692
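The ordering described in this commit message can be sketched roughly as follows. This is a toy model, not the kafka-python implementation: CountingCoordinator, BufferedFetcher, and poll_once_buffer_first are hypothetical names invented for illustration, and only the call order (buffer first, coordinator second) is taken from the message above. Whether that order is safe during a rebalance is the open question in the review.

```python
class CountingCoordinator:
    """Hypothetical stand-in for the group coordinator (not a kafka-python class)."""

    def __init__(self, ready):
        self.ready = ready
        self.calls = 0

    def poll(self, timeout_ms=None):
        # Record every call so the caller can see whether poll() was reached.
        self.calls += 1
        return self.ready


class BufferedFetcher:
    """Hypothetical stand-in holding records fetched by earlier poll cycles."""

    def __init__(self, buffered):
        self.buffered = dict(buffered)

    def fetched_records(self):
        # Hand out everything buffered so far and leave the buffer empty.
        records, self.buffered = self.buffered, {}
        return records


def poll_once_buffer_first(coordinator, fetcher, timeout_ms):
    """Assumed shape of the revised fix: drain the buffer before the coordinator check."""
    records = fetcher.fetched_records()
    if records:
        return records  # coordinator.poll() is never reached on this path
    if not coordinator.poll(timeout_ms=timeout_ms):
        return {}
    return fetcher.fetched_records()
```

With a coordinator that is not ready and one buffered record, the first call returns the record without touching the coordinator; the second call finds an empty buffer, calls coordinator.poll() once, and returns an empty dict.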
|
Hi @dpkp, thanks for the feedback! You're right: returning records after coordinator.poll() fails is unsafe when the failure indicates a rebalance or lost partition assignment. I've updated the fix with a different approach: fetched_records() is now checked before coordinator.poll(), and coordinator.poll() runs as before only when the buffer is empty. This also means coordinator.poll() is never called when buffered records exist, which is the correct non-blocking semantic for timeout_ms=0. |
|
I'm still confused about the root cause, so I am not confident that these changes make sense. You said that when timeout_ms=0, coordinator.poll() returns False immediately. But that's only true if the coordinator is still unknown (we are waiting for FindCoordinatorResponse) or the group needs to rejoin. In both cases, isn't a simple retry the solution? Eventually the coordinator is found and the group is joined, and at that point we would consume messages. Did you find something else? |
|
Good point — I think I over-assumed the root cause here. |
Fixes #2692
Root cause:
When timeout_ms=0, coordinator.poll() returns False immediately, causing
an early return {} at line 722 that skipped the fetched_records() buffer
check — so buffered messages from previous poll calls were silently dropped.
Fix:
Added a buffer check inside the early-exit branch so already-fetched
records are still returned even when the coordinator times out.
Non-blocking behavior is fully preserved.
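A minimal sketch of the behavior described under "Root cause" and "Fix" above, using stand-in objects rather than kafka-python internals. StubCoordinator, StubFetcher, and both poll_once_* functions are hypothetical names; the real code path at line 722 is not reproduced here.

```python
class StubCoordinator:
    """Hypothetical stand-in: poll() reports whether the group is ready."""

    def __init__(self, ready):
        self.ready = ready

    def poll(self, timeout_ms=None):
        return self.ready


class StubFetcher:
    """Hypothetical stand-in holding records fetched by previous poll calls."""

    def __init__(self, buffered):
        self.buffered = dict(buffered)

    def fetched_records(self):
        # Return the buffered records and clear the buffer.
        records, self.buffered = self.buffered, {}
        return records


def poll_once_before_fix(coordinator, fetcher, timeout_ms):
    """Assumed pre-fix shape: the early return skips the buffer check."""
    if not coordinator.poll(timeout_ms=timeout_ms):
        return {}
    return fetcher.fetched_records()


def poll_once_with_buffer_check(coordinator, fetcher, timeout_ms):
    """Assumed shape of this fix: check the buffer inside the early-exit branch."""
    if not coordinator.poll(timeout_ms=timeout_ms):
        records = fetcher.fetched_records()
        return records if records else {}
    return fetcher.fetched_records()
```

In this model, a coordinator that is not ready makes poll_once_before_fix return an empty dict even though the fetcher still holds records, while poll_once_with_buffer_check returns them. Neither function blocks, which matches the non-blocking claim for timeout_ms=0.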
Tests:
Added 2 regression tests covering:
Note: Tests use mocks. Happy to add integration tests if needed.
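For readers unfamiliar with mock-based regression tests, here is one way such tests might look. The two cases are illustrative guesses, not the PR's actual tests, and poll_once and make_mocks are hypothetical helpers written for this sketch; only unittest.mock.MagicMock is a real library API.

```python
from unittest.mock import MagicMock


def poll_once(coordinator, fetcher, timeout_ms):
    """Hypothetical function under test: early-exit branch with a buffer check."""
    if not coordinator.poll(timeout_ms=timeout_ms):
        return fetcher.fetched_records() or {}
    return fetcher.fetched_records()


def make_mocks(coordinator_ready, buffered):
    # Both collaborators are mocks, so no broker or network is involved.
    coordinator = MagicMock()
    coordinator.poll.return_value = coordinator_ready
    fetcher = MagicMock()
    fetcher.fetched_records.return_value = buffered
    return coordinator, fetcher


def test_returns_buffered_records_when_coordinator_not_ready():
    coordinator, fetcher = make_mocks(False, {"tp0": ["m1"]})
    assert poll_once(coordinator, fetcher, timeout_ms=0) == {"tp0": ["m1"]}
    coordinator.poll.assert_called_once_with(timeout_ms=0)


def test_returns_empty_dict_when_nothing_is_buffered():
    coordinator, fetcher = make_mocks(False, {})
    assert poll_once(coordinator, fetcher, timeout_ms=0) == {}
```

Mocks like these only pin down the call order and return values; they cannot show how a real group behaves during a rebalance, which is where integration tests would add value.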