fix: handle fast-completion race in batch streaming seal check#3427
fix: handle fast-completion race in batch streaming seal check#3427matt-aitken wants to merge 2 commits intomainfrom
Conversation
When all batch runs complete before getBatchEnqueuedCount() is called, cleanup() has already deleted the enqueuedItemsKey in Redis, causing it to return 0. The existing Postgres fallback only checked sealed, but the BatchQueue completion path sets status=COMPLETED without setting sealed=true. Add the status check so the endpoint returns sealed:true instead of triggering SDK retries into a dead BatchQueue. Also switch findUnique to findFirst per webapp convention.
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📜 Recent review details⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
🧰 Additional context used📓 Path-based instructions (12)**/*.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
{packages/core,apps/webapp}/**/*.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.{ts,tsx,js,jsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.{test,spec}.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.ts📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
Files:
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}📄 CodeRabbit inference engine (AGENTS.md)
Files:
**/*.test.{ts,tsx,js,jsx}📄 CodeRabbit inference engine (AGENTS.md)
Files:
**/*.test.{ts,tsx}📄 CodeRabbit inference engine (CLAUDE.md)
Files:
**/*.ts{,x}📄 CodeRabbit inference engine (CLAUDE.md)
Files:
apps/webapp/**/*.{ts,tsx}📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
Files:
apps/webapp/**/*.test.{ts,tsx}📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
Files:
apps/webapp/**/*.server.ts📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)
Files:
🧠 Learnings (23)📓 Common learnings📚 Learning: 2026-03-03T13:07:33.177ZApplied to files:
📚 Learning: 2026-04-13T21:44:00.032ZApplied to files:
📚 Learning: 2026-04-07T14:12:18.946ZApplied to files:
📚 Learning: 2026-04-16T13:45:22.317ZApplied to files:
📚 Learning: 2026-03-02T12:43:25.254ZApplied to files:
📚 Learning: 2026-04-16T14:19:16.330ZApplied to files:
📚 Learning: 2026-04-17T13:20:14.259ZApplied to files:
📚 Learning: 2026-04-07T14:12:59.018ZApplied to files:
📚 Learning: 2025-07-12T18:06:04.133ZApplied to files:
📚 Learning: 2026-03-03T13:08:03.862ZApplied to files:
📚 Learning: 2026-03-22T13:26:12.060ZApplied to files:
📚 Learning: 2026-03-22T19:24:14.403ZApplied to files:
📚 Learning: 2026-04-16T14:19:16.330ZApplied to files:
📚 Learning: 2026-04-20T15:08:55.358ZApplied to files:
📚 Learning: 2026-04-20T15:06:11.054ZApplied to files:
📚 Learning: 2026-04-16T14:19:16.330ZApplied to files:
📚 Learning: 2026-04-16T13:24:09.546ZApplied to files:
📚 Learning: 2026-04-20T15:08:59.789ZApplied to files:
📚 Learning: 2026-04-16T14:07:46.808ZApplied to files:
📚 Learning: 2026-03-02T12:43:43.173ZApplied to files:
📚 Learning: 2026-04-20T14:50:21.818ZApplied to files:
📚 Learning: 2026-04-16T14:19:16.330ZApplied to files:
🔇 Additional comments (3)
WalkthroughWhen the observed Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
The second race check at the seal updateMany only accepted sealed=true && status=PROCESSING. But the BatchQueue completion path can set status=COMPLETED (without sealed=true) between getEnqueuedCount and the seal updateMany, causing the check to reject a legitimate success state and throw ServiceValidationError. Also switch the post-seal re-query from findUnique to findFirst per webapp convention. Co-Authored-By: Matt Aitken <matt@mattaitken.com>
Problem
When
batchTrigger()is called with large payloads, each item's payload is uploaded to R2 server-side during the streaming loop before being enqueued. This makes the loop slow — around 3 seconds per item. Workers pick up and execute each item as it's enqueued, running concurrently with the ongoing stream.For the last item in the batch, a race exists between the streaming loop finishing and the batch completion cleanup:
enqueueBatchItem()recordSuccess()fires,processedCounthits the expected total,finalizeBatch()runscleanup()deletes all Redis keys for the batch, includingenqueuedItemsKeygetBatchEnqueuedCount()— reads the now-deleted key — returns 0The count check finds
enqueuedCount (0) !== batch.runCount, falls through to a Postgres fallback, but the fallback only checkedsealed. The BatchQueue completion path setsstatus = COMPLETEDin Postgres without settingsealed = true(that's the streaming endpoint's job), so the fallback misses it too.This causes the endpoint to return
sealed: false. The SDK treats this as retryable and retries up to 5 times with exponential backoff. Each retry callsenqueueBatchItem(), which reads the batch meta key from Redis — also deleted bycleanup()— and throws "Batch not found or not initialized" (500). The final retry gets a 422 because the batch is already COMPLETED, which the SDK does not retry, causing anApiErrorto be thrown fromawait batchTrigger()in the parent run — even though all child runs completed successfully.Fix
In the Postgres fallback inside
StreamBatchItemsService, also checkstatus === "COMPLETED"alongsidesealed. This covers the fast-completion path where the BatchQueue finishes all runs before the streaming endpoint gets to seal the batch normally.Also switches
findUniquetofindFirstper webapp convention.