fix: handle fast-completion race in batch streaming seal check #3427

matt-aitken wants to merge 1 commit into `main`
Conversation
When all batch runs complete before `getBatchEnqueuedCount()` is called, `cleanup()` has already deleted the `enqueuedItemsKey` in Redis, causing it to return 0. The existing Postgres fallback only checked `sealed`, but the BatchQueue completion path sets `status=COMPLETED` without setting `sealed=true`. Add the status check so the endpoint returns `sealed: true` instead of triggering SDK retries into a dead BatchQueue. Also switch `findUnique` to `findFirst` per webapp convention.
```diff
- if (currentBatch?.sealed) {
+ if (currentBatch?.sealed || currentBatch?.status === "COMPLETED") {
```
🔴 Incomplete race condition fix: second seal-race check at line 296 doesn't handle COMPLETED status
The PR adds `|| currentBatch?.status === "COMPLETED"` to the first race check (line 226) to handle batches that complete before the count is read. However, the same fix is not applied to the analogous second race check at `streamBatchItems.server.ts:296`, which only accepts `currentBatch?.sealed && currentBatch.status === "PROCESSING"`.

This creates a race window: between `getEnqueuedCount` (line 210) returning the correct count (Redis keys not yet cleaned) and the seal `updateMany` (line 269), `tryCompleteBatch` (`batchSystem.ts:103-110`) can set the batch status to `COMPLETED` without setting `sealed = true`. When the seal attempt fails (status is no longer `PENDING`), the second check at line 296 rejects the `COMPLETED` state and throws a `ServiceValidationError` — even though the batch completed successfully.
Race timeline

1. Streaming endpoint finishes enqueuing all items
2. `getEnqueuedCount` returns the correct count (Redis keys exist) → passes first check
3. Meanwhile, BatchQueue's `completionCallback` calls `tryCompleteBatch`, which sets `status: "COMPLETED"` (but not `sealed: true`) — see `batchSystem.ts:103-110`
4. `cleanup()` hasn't run yet (it happens after `completionCallback` — see `batch-queue/index.ts:1088-1091`)
5. Seal `updateMany` with `status: "PENDING"` fails → `sealResult.count === 0`
6. Re-query finds `sealed: false, status: "COMPLETED"`
7. Check `sealed && status === "PROCESSING"` → false → throws `ServiceValidationError`
Prompt for agents
The second race condition check at line 296 in `streamBatchItems.server.ts` needs the same treatment as the first check at line 226. Currently line 296 only handles the case where a concurrent request sealed the batch (`sealed=true`, `status=PROCESSING`). It should also handle the case where the batch has already completed via the BatchQueue completion path (`status=COMPLETED`, regardless of the `sealed` value). The fix should mirror what was done at line 226: add an OR condition for `status === "COMPLETED"` and return a success response with `sealed: true`. Also, per the repo's CLAUDE.md Prisma rules, the `findUnique` at line 286 should be changed to `findFirst`.
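As a hedged sketch, the condition the review is asking for could look like the following. The `Batch` shape and the `isAlreadySealed` helper are illustrative stand-ins — in the actual service this is an inline check, not a named function:

```typescript
// Illustrative predicate mirroring the fix suggested for the second check.
// Type and function names are assumptions, not the repo's actual code.
type Batch = {
  sealed: boolean;
  status: "PENDING" | "PROCESSING" | "COMPLETED";
};

function isAlreadySealed(currentBatch: Batch | null): boolean {
  if (!currentBatch) return false;
  // Sealed by a concurrent request to this endpoint...
  if (currentBatch.sealed && currentBatch.status === "PROCESSING") return true;
  // ...or completed by the BatchQueue path, which never sets sealed=true.
  return currentBatch.status === "COMPLETED";
}
```

Either branch returning true would let the endpoint respond with `sealed: true` instead of throwing a `ServiceValidationError`.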
```ts
// cleaned up the Redis keys before we got here. This happens when all
// runs complete fast enough that cleanup() deletes the enqueuedItemsKey
// before we read it — typically when the last item executes in the
// milliseconds between the loop ending and getBatchEnqueuedCount() being called.
// Check both sealed (sealed by this endpoint on a concurrent request) and
// COMPLETED (sealed by the BatchQueue completion path before we got here).
```
🚩 Race window between completionCallback and cleanup creates a subtle timing dependency
The PR's comments (lines 215-220) describe the race where `cleanup()` deletes Redis keys before `getBatchEnqueuedCount()` is called. However, there is actually a second relevant timing state: between `completionCallback` (which calls `tryCompleteBatch`, setting `status=COMPLETED`) and `cleanup` (which deletes the keys), as seen in `batch-queue/index.ts:1082-1091`. In this window, `getEnqueuedCount` returns the correct count but the status is already `COMPLETED`. This window is what makes the incomplete fix at line 296 (the second race check) exploitable. The PR's comments could be more precise about these two distinct windows.
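A toy model of the two windows, using an in-memory map in place of Redis — every name here is an illustrative stand-in, not the repo's implementation:

```typescript
// Toy timeline: after tryCompleteBatch() the status is COMPLETED but the
// Redis keys still exist (window 1); after cleanup() the keys are gone and
// count reads return 0 (window 2). Names are illustrative stand-ins.
const redis = new Map<string, number>([["enqueuedItemsKey", 3]]);
let batch: { status: string; sealed: boolean } = { status: "PENDING", sealed: false };

function tryCompleteBatch(): void {
  // Window 1 opens: status flips to COMPLETED, sealed stays false.
  batch = { ...batch, status: "COMPLETED" };
}

function cleanup(): void {
  // Window 1 closes, window 2 opens: keys deleted, counts now read 0.
  redis.delete("enqueuedItemsKey");
}

function getEnqueuedCount(): number {
  return redis.get("enqueuedItemsKey") ?? 0;
}

tryCompleteBatch();
const midWindowCount = getEnqueuedCount(); // correct count, status already COMPLETED
cleanup();
const postCleanupCount = getEnqueuedCount(); // the window the original fix targeted
```

A caller landing between `tryCompleteBatch()` and `cleanup()` sees the correct count with a `COMPLETED` status, which is exactly the state the second race check rejects.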
Problem

When `batchTrigger()` is called with large payloads, each item's payload is uploaded to R2 server-side during the streaming loop before being enqueued. This makes the loop slow — around 3 seconds per item. Workers pick up and execute each item as it's enqueued, running concurrently with the ongoing stream.

For the last item in the batch, a race exists between the streaming loop finishing and the batch completion cleanup:

1. The streaming loop enqueues the last item via `enqueueBatchItem()`
2. `recordSuccess()` fires, `processedCount` hits the expected total, `finalizeBatch()` runs
3. `cleanup()` deletes all Redis keys for the batch, including `enqueuedItemsKey`
4. `getBatchEnqueuedCount()` — reading the now-deleted key — returns 0

The count check finds `enqueuedCount (0) !== batch.runCount` and falls through to a Postgres fallback, but the fallback only checked `sealed`. The BatchQueue completion path sets `status = COMPLETED` in Postgres without setting `sealed = true` (that's the streaming endpoint's job), so the fallback misses it too.

This causes the endpoint to return `sealed: false`. The SDK treats this as retryable and retries up to 5 times with exponential backoff. Each retry calls `enqueueBatchItem()`, which reads the batch meta key from Redis — also deleted by `cleanup()` — and throws "Batch not found or not initialized" (500). The final retry gets a 422 because the batch is already COMPLETED, which the SDK does not retry, causing an `ApiError` to be thrown from `await batchTrigger()` in the parent run — even though all child runs completed successfully.

Fix
In the Postgres fallback inside `StreamBatchItemsService`, also check `status === "COMPLETED"` alongside `sealed`. This covers the fast-completion path where the BatchQueue finishes all runs before the streaming endpoint gets to seal the batch normally.

Also switches `findUnique` to `findFirst` per webapp convention.
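A minimal sketch of the resulting fallback logic. `db` here is a stub standing in for the Prisma client, and the model and field names are assumptions based on the PR text, not the webapp's actual schema:

```typescript
// Stubbed version of the Postgres fallback: accept either sealed=true (a
// concurrent request sealed the batch) or status=COMPLETED (the BatchQueue
// completion path finished first). `db` is a stand-in, not a real Prisma client.
type BatchRow = { id: string; sealed: boolean; status: string };

const db = {
  rows: [{ id: "batch_1", sealed: false, status: "COMPLETED" }] as BatchRow[],
  // findFirst rather than findUnique, per the webapp convention the PR cites
  findFirst(where: Partial<BatchRow>): BatchRow | null {
    return (
      this.rows.find((row) =>
        Object.entries(where).every(
          ([key, value]) => row[key as keyof BatchRow] === value
        )
      ) ?? null
    );
  },
};

function fallbackSealCheck(batchId: string): { sealed: boolean } {
  const currentBatch = db.findFirst({ id: batchId });
  // Either condition means the batch is done from the caller's perspective.
  if (currentBatch?.sealed || currentBatch?.status === "COMPLETED") {
    return { sealed: true };
  }
  return { sealed: false };
}
```

With the added `COMPLETED` branch, a fast-completed batch reports `sealed: true` to the SDK instead of triggering the retry loop described above.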