
backup: SQS encoder for queue meta and messages.jsonl (Phase 0a)#714

Open
bootjp wants to merge 20 commits into feat/backup-phase0a-redis-simple from feat/backup-phase0a-sqs

Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 30, 2026

Summary

Stacked on top of #713. Adds the SQS encoder for the Phase 0 logical-backup decoder.

Snapshot prefixes handled:

  • !sqs|queue|meta|<base64url(queue)> → sqs/<queue>/_queue.json (dump-format projection of sqsQueueMeta with AWS-style field names)
  • !sqs|msg|data|<base64url(queue)><gen 8B BE><base64url(msgID)> → sqs/<queue>/messages.jsonl (one record per line, sorted by (SendTimestampMillis, SequenceNumber, MessageID))
  • !sqs|msg|{vis,byage,dedup,group}, !sqs|queue|tombstone: excluded by default; --include-sqs-side-records routes them to sqs/<queue>/_internals/side_records.jsonl
  • !sqs|queue|{gen,seq}: ignored (operational counters, not user state)

Why buffer + sort at Finalize

Lex-order in the snapshot is m < q, so msg|data arrives BEFORE queue|meta. The encoder buffers per encoded-queue-prefix and resolves the human-readable queue name at Finalize via the meta records that arrive later. Per-queue memory is O(messages-per-queue); documented as a known limit for >100M-message queues.
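The Finalize-time ordering can be sketched with sort.Slice; the struct below is a hypothetical stand-in for the buffered record, with field names taken from the sort tuple above.

```go
package main

import (
	"fmt"
	"sort"
)

// msgRecord is a hypothetical stand-in for a buffered message record.
type msgRecord struct {
	SendTimestampMillis int64
	SequenceNumber      uint64
	MessageID           string
}

// sortMessages orders by (SendTimestampMillis, SequenceNumber,
// MessageID); MessageID breaks any remaining ties deterministically.
func sortMessages(msgs []msgRecord) {
	sort.Slice(msgs, func(i, j int) bool {
		a, b := msgs[i], msgs[j]
		if a.SendTimestampMillis != b.SendTimestampMillis {
			return a.SendTimestampMillis < b.SendTimestampMillis
		}
		if a.SequenceNumber != b.SequenceNumber {
			return a.SequenceNumber < b.SequenceNumber
		}
		return a.MessageID < b.MessageID
	})
}

func main() {
	msgs := []msgRecord{{10, 2, "b"}, {10, 2, "a"}, {5, 9, "z"}}
	sortMessages(msgs)
	fmt.Println(msgs[0].MessageID, msgs[1].MessageID, msgs[2].MessageID) // z a b
}
```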

Boundary detection

The msg|data key shape is <base64url(queue)><gen 8B BE><base64url(msgID)>. base64url alphabet is [A-Za-z0-9-_]; the first byte of an 8-byte BE gen is 0x00 for any production gen (< 2^56), so the first non-alphabet byte cleanly separates queue from gen. The msgID segment is additionally validated by attempting a base64url decode — a failed decode surfaces as ErrSQSMalformedKey rather than routing to the wrong queue.
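The boundary rule can be sketched as follows. This is a simplified illustration, not the production parser: function names are hypothetical, error values are plain errors rather than ErrSQSMalformedKey, and the `>=` bound follows the fixed behavior that rejects an empty msg-id segment.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
)

const genBytes = 8 // big-endian u64 generation

func isBase64URLByte(b byte) bool {
	return b >= 'A' && b <= 'Z' || b >= 'a' && b <= 'z' ||
		b >= '0' && b <= '9' || b == '-' || b == '_'
}

// splitMsgDataKey illustrates splitting the
// <base64url(queue)><gen 8B BE><base64url(msgID)> tail of a msg|data key.
func splitMsgDataKey(rest []byte) (encQueue string, gen []byte, encMsgID string, err error) {
	idx := 0
	for idx < len(rest) && isBase64URLByte(rest[idx]) {
		idx++
	}
	// The first byte of a production gen (< 2^56) is 0x00, which is not
	// in the base64url alphabet, so idx marks the queue/gen boundary.
	// >= also rejects an empty msg-id segment after the gen.
	if idx == 0 || idx+genBytes >= len(rest) {
		return "", nil, "", errors.New("malformed key")
	}
	encQueue = string(rest[:idx])
	gen = rest[idx : idx+genBytes]
	encMsgID = string(rest[idx+genBytes:])
	// Validate the msg-id segment actually decodes as base64url.
	if _, err := base64.RawURLEncoding.DecodeString(encMsgID); err != nil {
		return "", nil, "", errors.New("malformed msg-id segment")
	}
	return encQueue, gen, encMsgID, nil
}

func main() {
	key := append([]byte("cXVldWU"), 0, 0, 0, 0, 0, 0, 0, 1) // queue + gen=1
	key = append(key, []byte("bXNn")...)                     // encMsgID
	q, _, id, err := splitMsgDataKey(key)
	fmt.Println(q, id, err) // cXVldWU bXNn <nil>
}
```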

Defaults

  • Visibility-state fields (visible_at_millis, current_receipt_token, receive_count, first_receive_millis) are zeroed on output. WithPreserveVisibility(true) passes them through.
  • Orphan msg|data records (no matching queue meta) emit a structured warning at Finalize and are dropped from the dump. Restoring orphans without a queue config would silently create a default-attribute queue.
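The default zeroing can be sketched like this; the struct and option flag are hypothetical stand-ins mirroring the field names and WithPreserveVisibility option above.

```go
package main

import "fmt"

// dumpMessage is a hypothetical projection of an emitted message record.
type dumpMessage struct {
	Body                string
	VisibleAtMillis     int64
	CurrentReceiptToken string
	ReceiveCount        int
	FirstReceiveMillis  int64
}

// projectVisibility zeroes the visibility-state fields unless the
// preserve option (WithPreserveVisibility) is enabled.
func projectVisibility(m dumpMessage, preserveVisibility bool) dumpMessage {
	if preserveVisibility {
		return m // pass live values through
	}
	m.VisibleAtMillis = 0
	m.CurrentReceiptToken = ""
	m.ReceiveCount = 0
	m.FirstReceiveMillis = 0
	return m
}

func main() {
	live := dumpMessage{Body: "hi", VisibleAtMillis: 42,
		CurrentReceiptToken: "tok", ReceiveCount: 3, FirstReceiveMillis: 7}
	out := projectVisibility(live, false)
	fmt.Println(out == dumpMessage{Body: "hi"}) // true
}
```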

Test plan

  • go test -race ./internal/backup/... — pass.
  • golangci-lint run ./internal/backup/... — clean.
  • 11 tests covering queue-meta round-trip, message ordering with ties, visibility-state zeroing/preservation, orphan warning, magic-prefix rejection, malformed-JSON rejection, wrong-prefix rejection, key-component peek, side-records include/exclude.

Stacking

Base: feat/backup-phase0a-redis-simple (PR #713).

Builds on PR #713. Adds the SQS encoder for the Phase 0 logical-
backup decoder.

Snapshot prefixes handled:
- !sqs|queue|meta|<base64url(queue)> -> sqs/<queue>/_queue.json
  (dump-format projection of the live sqsQueueMeta with AWS-style
  field names; FormatVersion stamped, throttle / partition / dedup-
  scope fields elided -- cluster-internal state, not user-visible
  config)
- !sqs|msg|data|<base64url(queue)><gen 8B BE><base64url(msgID)> ->
  sqs/<queue>/messages.jsonl (one record per line, sorted at
  Finalize-time by (SendTimestampMillis, SequenceNumber, MessageID)).
- !sqs|msg|vis | byage | dedup | group, !sqs|queue|tombstone:
  excluded by default; --include-sqs-side-records routes them to
  sqs/<queue>/_internals/side_records.jsonl as a structured bag.
- !sqs|queue|gen, !sqs|queue|seq: not handled by Phase 0 (operational
  counters, not user-visible state).

Implementation choices:
- Lex-order for the snapshot is m < q < ... so msg|data records
  arrive BEFORE queue|meta. Encoder buffers per encoded-queue-prefix
  and resolves the human-readable queue name at Finalize via the
  queue|meta records that arrive later.
- Boundary detection: the encoded queue segment is base64url-no-
  padding (alphabet [A-Za-z0-9-_]); the first byte of the 8-byte
  BE gen is 0x00 for any production gen value (< 2^56), so the
  first non-alphabet byte is the queue/gen boundary. The msgID
  segment is also validated by attempting a base64url decode -- a
  failed decode is surfaced as ErrSQSMalformedKey rather than
  routed to the wrong queue.
- Visibility-state on emitted messages is zeroed by default;
  --preserve-visibility passes the live values through.
- Orphan messages (data records with no matching queue meta) emit
  a structured warning at Finalize and are dropped from the dump.
  Restoring orphans without a queue config would silently create a
  default-attribute queue, which is rarely what the operator wants.
- Memory: O(messages-per-queue) buffer at Finalize for the sort
  pass. Queues with hundreds of millions of messages will need a
  future stream-and-merge variant; documented as a known limit.

Tests: queue meta round-trip, message ordering by (ts, seq, id) with
ties resolved by message_id, default visibility-state zeroing,
--preserve-visibility round-trip, orphan-message warning,
magic-prefix rejection, JSON-decode rejection, wrong-prefix
rejection, peekMsgDataKey component round-trip, side-records
include/exclude.
@coderabbitai

coderabbitai Bot commented Apr 30, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the SQSEncoder, which processes SQS snapshot records into a structured per-queue layout consisting of metadata and ordered message logs. The implementation includes logic for parsing SQS keys, validating magic prefixes in stored values, and sorting messages by timestamp and sequence number. Review feedback identified critical boundary check issues in the key parsing functions parseSQSMessageDataKey and parseSQSGenericKey, where missing or empty segments could lead to malformed data or incorrect routing.

Comment thread internal/backup/sqs.go Outdated
Comment on lines +369 to +372
if idx == 0 || idx+genBytes > len(rest) {
return "", errors.Wrapf(ErrSQSMalformedKey,
"queue segment boundary not found in %q", key)
}
Contributor

medium

The boundary check should ensure that there is at least some data remaining for the message ID segment. If idx + genBytes == len(rest), the encMsgID will be empty, which is invalid for an SQS message data key. Using >= instead of > ensures that a non-empty message ID segment is present.

Suggested change
if idx == 0 || idx+genBytes > len(rest) {
return "", errors.Wrapf(ErrSQSMalformedKey,
"queue segment boundary not found in %q", key)
}
if idx == 0 || idx+genBytes >= len(rest) {
return "", errors.Wrapf(ErrSQSMalformedKey,
"queue segment boundary or message ID not found in %q", key)
}

Comment thread internal/backup/sqs.go
Comment on lines +397 to +401
idx := scanBase64URLBoundary(rest)
if idx == 0 {
return "", errors.Wrapf(ErrSQSMalformedKey,
"queue segment not found after prefix %q", prefix)
}
Contributor

medium

The parseSQSGenericKey function lacks a check to ensure that a boundary was actually found. Since all SQS side-record keys (visibility, by-age, deduplication, group, and tombstone) are expected to have a binary suffix (at least the 8-byte generation), scanBase64URLBoundary should return an index strictly less than the length of the string. If idx == len(rest), it means no non-base64url byte was found, and the function will incorrectly return the entire remaining key as the queue name, leading to incorrect routing and directory structure in the backup.

Suggested change
idx := scanBase64URLBoundary(rest)
if idx == 0 {
return "", errors.Wrapf(ErrSQSMalformedKey,
"queue segment not found after prefix %q", prefix)
}
idx := scanBase64URLBoundary(rest)
if idx == 0 || idx == len(rest) {
return "", errors.Wrapf(ErrSQSMalformedKey,
"queue segment boundary not found after prefix %q", prefix)
}

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 796a42f5d3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/backup/sqs.go Outdated
Comment on lines +284 to +285
if len(st.messages) == 0 {
return nil

P2 Badge Emit side records even when a queue has no messages

When WithIncludeSideRecords(true) is enabled, flushQueue returns immediately for queues with zero messages, so _internals/side_records.jsonl is never written even if internalBuf contains tombstone/vis/dedup/group records. This silently drops opted-in SQS side-record data for queues that currently have no message rows (a realistic case after purge/reap or for metadata-only snapshots), which makes --include-sqs-side-records incomplete.


bootjp added 2 commits April 30, 2026 19:30
Three issues, all valid.

#372 Gemini -- empty msg-id rejected.
parseSQSMessageDataKey changed `idx+genBytes > len(rest)` to
`idx+genBytes >= len(rest)`. AWS SQS message IDs are non-empty
by construction, so an empty trailer cannot be a legitimate
snapshot record. Test
TestSQS_ParseMessageDataKey_RejectsEmptyMsgIDSegment locks it in.

#401 Gemini -- trailerless side-record key rejected.
parseSQSGenericKey now also rejects `idx == len(rest)`. All
side-record key shapes (vis / byage / dedup / group / tombstone)
terminate the encoded queue segment with at least one binary
trailer (the gen u64); a key whose entire suffix is base64url-clean
is malformed. Test
TestSQS_ParseGenericKey_RejectsTrailerlessKey covers it.

#285 Codex P2 -- side records flushed even with zero messages.
flushQueue used to `return` after writing _queue.json when the
queue had no messages, which silently dropped any internalBuf
content (purged or metadata-only queues with --include-sqs-side-
records). Now the message-flush block is gated on len > 0 but
the side-record flush runs unconditionally. Test
TestSQS_SideRecordsFlushedEvenWhenZeroMessages.
@bootjp
Owner Author

bootjp commented Apr 30, 2026

3 issues addressed in 3455ba7f:

Gemini #372 -- parseSQSMessageDataKey now rejects keys with an empty msg-id segment (>= instead of >). AWS message IDs are non-empty by construction.

Gemini #401 -- parseSQSGenericKey now also rejects idx == len(rest). Side-record keys always terminate the queue segment with a binary trailer (gen u64); a key whose entire suffix is base64url-clean is malformed.

Codex P2 #285 -- flushQueue no longer returns early when the queue has zero messages; the side-record flush runs unconditionally. Purged or metadata-only queues now correctly emit _internals/side_records.jsonl under --include-sqs-side-records.

New tests: TestSQS_ParseMessageDataKey_RejectsEmptyMsgIDSegment, TestSQS_ParseGenericKey_RejectsTrailerlessKey, TestSQS_SideRecordsFlushedEvenWhenZeroMessages.

@claude review

@claude

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 2m 48s


Review of commit 3455ba7

  • Read current code in internal/backup/sqs.go
  • Read current tests in internal/backup/sqs_test.go
  • Review the 3 fixes
  • Review broader code for remaining issues
  • Post review feedback

All three fixes are correct and the new tests properly lock them down. Two minor issues remain.

Fix 1 — parseSQSMessageDataKey boundary (>=) ✅

// sqs.go:378
if idx == 0 || idx+genBytes >= len(rest) {

>= correctly rejects the case where idx+genBytes == len(rest) (empty msg-id segment). The inline comment at lines 375–377 explains the AWS invariant. The test TestSQS_ParseMessageDataKey_RejectsEmptyMsgIDSegment constructs exactly that key shape (prefix + cQ + 8-byte gen, no msg-id) and asserts ErrSQSMalformedKey. Solid.

Fix 2 — parseSQSGenericKey trailerless rejection ✅

// sqs.go:412
if idx == 0 || idx == len(rest) {

idx == len(rest) correctly rejects a key whose entire suffix after the prefix is base64url-clean. Side-record keys always have a binary gen trailer starting with 0x00, so a legitimate key will always have idx < len(rest). TestSQS_ParseGenericKey_RejectsTrailerlessKey uses a pure base64url suffix (cQQQ) to trigger this path. Correct.

Fix 3 — flushQueue side records flush independently ✅

// sqs.go:284–310
if len(st.messages) > 0 {
    // ... write messages.jsonl ...
}
if len(st.internalBuf) > 0 {
    if err := s.flushInternals(dir, st.internalBuf); err != nil { ... }
}

The side-records flush is now unconditionally outside the messages block. TestSQS_SideRecordsFlushedEvenWhenZeroMessages registers a meta record + one vis side-record (no message-data records) and asserts the _internals/side_records.jsonl file exists after Finalize. Covers the exact purged-queue scenario. Correct.


Two minor nits (non-blocking)

1. peekMsgDataKey uses > not >= (sqs.go:582)

// debug helper — sqs.go:582
if idx == 0 || idx+genBytes > len(rest) {

The production parseSQSMessageDataKey uses >= (rejects empty msg-id), but this debug helper still uses >, which accepts an empty msg-id. Since peekMsgDataKey is only called from tests and is explicitly documented as a non-production path, it's not a safety issue — but a caller probing the edge case through both functions would get different results.

2. Stale comment in HandleSideRecord (sqs.go:241–243)

if err != nil {
    // Tombstones include a fixed-width gen but no msg ID; the
    // generic parser tolerates the empty trailer.
    return err
}

After Fix 2, parseSQSGenericKey no longer tolerates an empty trailer — it now returns ErrSQSMalformedKey for idx == len(rest). The comment was presumably written to explain why tombstone keys don't fail the generic parser (they don't because the gen bytes start with 0x00, a non-base64url byte that terminates the scan before len(rest)). The comment now says the opposite of the truth.


Overall: The three fixes are correct and well-tested. The nits above are cosmetic and don't affect correctness. Ready to merge once the stacking dependency (#713) lands.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0f14fefc32


Comment thread internal/backup/sqs.go Outdated
Comment on lines +262 to +266
if st.meta == nil {
s.emitWarn("sqs_orphan_messages",
"encoded_queue", st.encoded,
"buffered_messages", len(st.messages),
"hint", "no !sqs|queue|meta record matched this encoded prefix; messages dropped from the dump")

P2 Badge Preserve side records when queue metadata is missing

When Finalize encounters a queue state without meta, it warns and skips flushing the queue entirely, which also drops any buffered side records (vis/byage/dedup/group/tombstone) for that queue. This breaks --include-sqs-side-records for a common case where DeleteQueue leaves tombstones but no !sqs|queue|meta|... row, so operators requesting side records lose exactly the deletion-era records they opted in to export.


Codex P2 round 8 (commit 86fbf3a):

When `Finalize` encountered a queue state without a !sqs|queue|meta
record (most commonly post-DeleteQueue, where the meta row was
removed but tombstones remain), it warned and skipped the queue
entirely — including any buffered side records. With
--include-sqs-side-records on, that silently dropped exactly the
deletion-era state operators opted in to capture, contradicting
the opt-in contract.

Now the orphan branch flushes side records to a `<encoded>.orphan`
sub-directory while still warning that messages were dropped. The
`.orphan` suffix prevents a restore tool from mistaking it for a
real queue dir produced by a successful meta flush; the encoded
prefix is the only stable identifier available without meta.
Orphan messages still drop because restoring them without queue
config would silently create a default-settings queue.

The branch is gated on `s.includeSideRecords` so the
default-off contract is preserved: without the flag, no `.orphan`
directory is ever created.

Tests:
  - TestSQS_OrphanQueueSideRecordsPreserved — drives a vis-record
    without a meta row, asserts the .orphan/_internals path is
    populated and the orphan-messages warning fires.
  - TestSQS_OrphanQueueSideRecordsSuppressedWhenOptOut — same
    scenario without --include-sqs-side-records; asserts no
    .orphan dir is created.
@bootjp
Owner Author

bootjp commented Apr 30, 2026

@claude review

Round 2 fix for Codex P2 round 8 on commit 86fbf3a (commit 1621ede4):

When Finalize saw a queue state without a !sqs|queue|meta row (most commonly post-DeleteQueue, where the meta row was removed but tombstones remain), it warned and skipped the queue entirely — including any buffered side records. With --include-sqs-side-records on that silently dropped exactly the deletion-era state operators opted in to capture, contradicting the opt-in contract.

Now the orphan branch flushes side records to a <encoded>.orphan sub-directory while still warning that messages were dropped. The .orphan suffix prevents restore tools from mistaking it for a real queue dir; the encoded prefix is the only stable identifier without meta. Orphan messages still drop (restoring them without queue config would silently create a default-settings queue). The branch is gated on includeSideRecords so the default-off contract is preserved.

Tests: TestSQS_OrphanQueueSideRecordsPreserved and TestSQS_OrphanQueueSideRecordsSuppressedWhenOptOut. go test -race and golangci-lint clean. Fix merged forward into #716 / #718.

@claude

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 2m 11s


Review of commit 1621ede4

  • Read current internal/backup/sqs.go
  • Read current internal/backup/sqs_test.go
  • Review the orphan-queue side-records fix
  • Check the two previous nits from last round
  • Broader review pass
  • Post feedback

The orphan-queue fix is correct and the two new tests are well-constructed. Two nits from last round remain unaddressed.

Fix: flushOrphanQueueSideRecords

// sqs.go:288-303
func (s *SQSEncoder) flushOrphanQueueSideRecords(st *sqsQueueState) error {
    s.emitWarn("sqs_orphan_messages", ...)
    if !s.includeSideRecords || len(st.internalBuf) == 0 {
        return nil
    }
    dir := filepath.Join(s.outRoot, "sqs", st.encoded+".orphan")
    return s.flushInternals(dir, st.internalBuf)
}

The warning unconditionally fires for all orphan queues (regardless of opt-in status) so operators always know about the missing meta. The flush is gated on includeSideRecords && len(internalBuf) > 0, which is correct. The .orphan directory suffix, the flushInternals delegation, and os.MkdirAll handling are all consistent with the normal flush path.

One observation: the !s.includeSideRecords guard at line 294 is technically redundant — HandleSideRecord already drops records at intake when the flag is off (line 236–237), so internalBuf will always be empty when includeSideRecords == false. The double-guard is harmless and could be read as defence-in-depth, but it slightly obscures the real invariant. Non-blocking.

Finalize routing ✅

// sqs.go:265-279
for _, st := range s.queues {
    if st.meta == nil {
        if err := s.flushOrphanQueueSideRecords(st); err != nil && firstErr == nil {
            firstErr = err
        }
        continue
    }
    ...
}

Clean separation. All queues are iterated even on error (first-error-wins accumulation is the pre-existing pattern).

TestSQS_OrphanQueueSideRecordsPreserved

Creates a vis side-record with no matching meta, asserts sqs_orphan_messages warning fires and sqs/<encoded>.orphan/_internals/side_records.jsonl exists. Covers the exact scenario in the bug report.

TestSQS_OrphanQueueSideRecordsSuppressedWhenOptOut

With includeSideRecords off, HandleSideRecord drops the record at intake so no sqsQueueState is ever created. Finalize has nothing to iterate over, no .orphan dir is written. The test comment correctly explains this ("HandleSideRecord drops the record at intake") and the os.IsNotExist assertion is correct.


Two nits still open from last round (non-blocking)

1. peekMsgDataKey boundary inconsistency (sqs.go:611)

// still uses > instead of >= — inconsistent with parseSQSMessageDataKey
if idx == 0 || idx+genBytes > len(rest) {

The production parseSQSMessageDataKey correctly uses >= to reject empty msg-id segments (Fix 1 from last round), but this debug helper still uses >. It's test-only and documented as non-production, so it's not a safety issue — but a caller testing both functions against the same edge-case key gets different results.

2. Stale comment in HandleSideRecord (sqs.go:241–243)

if err != nil {
    // Tombstones include a fixed-width gen but no msg ID; the
    // generic parser tolerates the empty trailer.
    return err
}

After Fix 2 (last round), parseSQSGenericKey now rejects idx == len(rest) — the comment says the opposite of the truth. Tombstone keys work because their gen starts with 0x00 (a non-base64url byte), which terminates the scan before len(rest) — not because the parser tolerates an empty trailer. Remove or replace the comment.


Overall: The orphan-queue side-records fix is correct and complete. The .orphan directory convention is well-documented in the Finalize docstring, the two tests cleanly lock down the opt-in and opt-out paths, and the flushInternals delegation keeps the flush logic DRY. The two lingering nits are cosmetic. Ready to merge once the stacking dependency (#713) lands.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 18a8ae3f74


Comment thread internal/backup/sqs.go
if err != nil {
return "", err
}
idx := scanBase64URLBoundary(rest)

P1 Badge Accept partitioned SQS msg-data keys

The parser only handles legacy !sqs|msg|data|<queue><gen><msgID> layout and infers queue/gen boundary from the first non-base64 byte, so keys in the partitioned format (!sqs|msg|data|p|..., defined in adapter/sqs_keys.go as SqsPartitionedMsgDataPrefix) are treated as malformed (the queue segment becomes "p" and base64 decode fails). In environments where HT-FIFO partitioned keys are present, HandleMessageData will return an error and abort backup decoding instead of exporting messages.


Comment thread internal/backup/sqs.go
if err != nil {
return "", err
}
idx := scanBase64URLBoundary(rest)

P2 Badge Parse partitioned side-record keys before routing

parseSQSGenericKey also assumes the legacy key shape and returns bytes up to the first non-base64 character as the queue segment. For partitioned side-record keys like !sqs|msg|vis|p|<queue>|..., this extracts "p" as the queue key, so side records from different queues collapse into one state bucket and flush under the wrong/orphan path when --include-sqs-side-records is enabled.


bootjp added 2 commits April 30, 2026 21:35
Codex round 9 (commit 8d08827):

`parseSQSMessageDataKey` and `parseSQSGenericKey` only handled the
legacy SQS key shape:

	!sqs|msg|<family>|<encQueue><gen 8B><rest>

HT-FIFO partitioned queues use the discriminator-prefixed shape
(adapter/sqs_keys.go SqsPartitionedMsg*Prefix):

	!sqs|msg|<family>|p|<encQueue>|<part 4B><gen 8B><rest>

The legacy heuristic read "p" as the queue segment and aborted with
ErrSQSMalformedKey. In environments running partitioned FIFO queues
HandleMessageData would refuse every record and the dump pass
would abort entirely (P1); side records would similarly collapse
into one mis-routed bucket under --include-sqs-side-records (P2).

Recognise the partitioned shape via the literal "p|" prefix on the
post-family suffix and route into a new
parseSQSPartitionedQueueAndTrailer helper that:

  - peels the "p|" discriminator
  - reads the queue segment up to the literal '|' terminator
  - validates that the trailer carries at least partition u32 +
    gen u64 bytes
  - if the caller is parsing a msg-data key, validates the
    msg-id segment decodes as base64url

Tests:
  - TestSQS_ParsePartitionedMessageDataKey
  - TestSQS_ParsePartitionedSideRecordKey (vis/byage/dedup/group)
  - TestSQS_ParsePartitionedMessageDataKey_RejectsTruncatedTrailer

The legacy parser path is unchanged for non-partitioned keys.
@bootjp
Owner Author

bootjp commented Apr 30, 2026

@claude review

Round 3 fixes for Codex round 9 on commit 8d08827 (commit f923d99f):

The legacy SQS key parsers handled only !sqs|msg|<family>|<encQueue><gen 8B><rest>. HT-FIFO partitioned queues use a different shape (adapter/sqs_keys.go SqsPartitionedMsg*Prefix):

!sqs|msg|<family>|p|<encQueue>|<partition u32, 4B><gen u64, 8B><rest>

The legacy heuristic read "p" as the queue segment and surfaced ErrSQSMalformedKey. In any cluster running partitioned FIFO queues, HandleMessageData would refuse every record and the dump pass would abort (P1); side records would similarly collapse into one mis-routed bucket under --include-sqs-side-records (P2).

Both parsers now branch on the literal p| discriminator and route into a new parseSQSPartitionedQueueAndTrailer helper that peels the discriminator, reads the queue segment up to the | terminator, validates the partition+gen trailer, and (for msg-data keys) validates the trailing msg-id segment.

Tests: TestSQS_ParsePartitionedMessageDataKey, TestSQS_ParsePartitionedSideRecordKey (vis/byage/dedup/group), TestSQS_ParsePartitionedMessageDataKey_RejectsTruncatedTrailer. Legacy parser path is unchanged for non-partitioned keys.

go test -race and golangci-lint clean. Fix merged forward into #716 / #718.

@claude

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 3m 28s


Review of commit f923d99f (Round 3)

  • Read internal/backup/sqs.go — partitioned key parsing changes
  • Read adapter/sqs_keys.go — verify key shape assumptions
  • Read internal/backup/sqs_test.go — new tests
  • Review the partitioned parsing logic
  • Check the two previous nits from last round
  • Post review feedback

Cross-checked the parsed format against the actual constructors in adapter/sqs_keys.go (sqsPartitionedMsgDataKey, sqsPartitionedMsgVisKey, sqsPartitionedMsgDedupKey, etc.). The backup encoder's understanding of the wire format matches the adapter exactly:

!sqs|msg|<family>|p|<base64url(queue)>|<partition u32 4B><gen u64 8B>[<base64url(msgID)>]

All three new pieces of code are correct.

isPartitionedRest + dispatch ✅

// sqs.go:423-425 (parseSQSMessageDataKey)
if isPartitionedRest(rest) {
    return parseSQSPartitionedQueueAndTrailer(rest, true /*hasMsgID*/, key)
}

// sqs.go:460-462 (parseSQSGenericKey)
if isPartitionedRest(rest) {
    return parseSQSPartitionedQueueAndTrailer(rest, false /*hasMsgID*/, key)
}

isPartitionedRest uses strings.HasPrefix(rest, "p|"). A legacy queue whose base64url encoding begins with p would need the next byte to also be |, which is not in the base64url alphabet — so the discriminator is unambiguous. The comment at lines 33–46 documents this invariant clearly.

parseSQSPartitionedQueueAndTrailer

// sqs.go:495-526
func parseSQSPartitionedQueueAndTrailer(rest string, hasMsgID bool, originalKey []byte) (string, error) {
    body := rest[len(sqsPartitionedDiscriminator):]         // strip "p|"
    terminator := strings.IndexByte(body, '|')
    if terminator <= 0 { ... }                              // no queue segment or empty
    encQueue := body[:terminator]
    if _, err := base64.RawURLEncoding.DecodeString(encQueue); err != nil { ... }
    trailer := body[terminator+1:]
    const fixedTrailerBytes = sqsPartitionBytes + genBytes  // 4 + 8 = 12
    if hasMsgID {
        if len(trailer) <= fixedTrailerBytes { ... }        // must have ≥1 byte msg-id
        encMsgID := trailer[fixedTrailerBytes:]
        if _, err := base64.RawURLEncoding.DecodeString(encMsgID); err != nil { ... }
        return encQueue, nil
    }
    if len(trailer) < fixedTrailerBytes { ... }             // side records: exactly 12 min
    return encQueue, nil
}

Edge cases verified:

  • rest = "p|" (no queue body): body = "", terminator = -1 ≤ 0 → rejected ✓
  • rest = "p||..." (empty queue segment): terminator = 0 ≤ 0 → rejected ✓
  • invalid base64url in queue segment → rejected ✓
  • truncated trailer (4 bytes, no gen): len(trailer) = 4 < 12ErrSQSMalformedKey

fixedTrailerBytes = 12 matches exactly what the adapter's vis key puts before the msgID (appendU32 + appendU64). Correct.

New tests ✅

TestSQS_ParsePartitionedMessageDataKey — constructs a key by hand matching the adapter's sqsPartitionedMsgDataKey output layout and asserts parseSQSMessageDataKey returns the right encoded queue. Solid.

TestSQS_ParsePartitionedSideRecordKey — exercises all four side-record families (vis, byage, dedup, group) via parseSQSGenericKey. Trailer is exactly 12 bytes (partition u32 + gen u64), which is the minimum valid for len(trailer) >= fixedTrailerBytes. Correct.

TestSQS_ParsePartitionedMessageDataKey_RejectsTruncatedTrailer — 4-byte trailer only; len(trailer) = 4 <= 12 triggers the hasMsgID rejection. Correct.


Three nits (non-blocking)

1. peekMsgDataKey boundary inconsistency still open — sqs.go:691

if idx == 0 || idx+genBytes > len(rest) {  // should be >=

Flagged in rounds 1 and 2. parseSQSMessageDataKey uses >= to reject empty msg-id segments; this debug helper still uses >. Test-only and explicitly documented as non-production, but a test that probes both functions against an edge-case key gets different results. Fix this →

2. Stale comment in HandleSideRecord still open — sqs.go:257–260

if err != nil {
    // Tombstones include a fixed-width gen but no msg ID; the
    // generic parser tolerates the empty trailer.
    return err
}

Flagged in rounds 1 and 2. After Fix 2 (round 1), parseSQSGenericKey rejects idx == len(rest) — the comment says the opposite. Tombstone keys succeed because their gen starts with 0x00 (non-base64url), which terminates the scan before len(rest), not because the parser tolerates an empty trailer. Fix this →

3. peekMsgDataKey doesn't handle partitioned keys (new nit) — sqs.go:685–700

The helper falls through to the legacy heuristic for partitioned keys (it has no isPartitionedRest branch). A test that calls peekMsgDataKey on a sqsPartitionedMsgDataKey-shaped key would get Encoded = "p" and a wrong GenRaw. Not a production concern, but if new tests for the partitioned path ever call peekMsgDataKey they'll silently produce wrong components. Low priority since peekMsgDataKey is test-only. Fix this →


Overall: The P1 fix (partitioned msg-data keys aborting the dump) and P2 fix (side records collapsing into one mis-routed bucket) are both correct. The parseSQSPartitionedQueueAndTrailer helper is structurally sound and validated against the actual adapter key constructors. The three new tests lock down the main partitioned-key paths. The three nits above are cosmetic. Ready to merge once the stacking dependency (#713) lands.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f923d99f1a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/backup/sqs.go Outdated
// round-trip; the encoder zeroes the visibility-state fields by default.
type sqsMessageRecord struct {
    MessageID string `json:"message_id"`
    Body      []byte `json:"body"`

P1 Badge Emit message body as plain text in dump records

sqsMessageRecord.Body is declared as []byte, so json.Encoder writes it as base64 in messages.jsonl rather than the original message text. In normal restores that replay each JSONL row into SendMessage, this changes payloads (e.g., hello becomes aGVsbG8=), corrupting application-visible message contents even though the backup/restore appears successful. The projection used for dump output should serialize body as a text string (or otherwise decode before emit) to preserve original SQS semantics.


Codex P1 round 9 (commit f923d99):

`sqsMessageRecord.Body` was declared `[]byte`, so json.Encoder
serialised every body as base64 in messages.jsonl. A restore that
replays each row through SendMessage would push the base64 string
itself as the body (e.g., `hello` → `aGVsbG8=`), corrupting
application-visible payloads. AWS SQS body is XML-text by contract,
so the dump format must emit it as a string.

Introduce a typed `sqsMessageBody` ([]byte) with a custom
MarshalJSON:

  - utf8.Valid(b) true  -> emit as a plain JSON string
    (the SQS contract-conformant case; no information lost).
  - utf8.Valid(b) false -> emit a typed envelope
    `{"base64":"<base64url>"}` so binary bytes still round-trip
    unambiguously without a lossy replacement-character rewrite.

Tests:
  - TestSQS_MessageBodyEmittedAsTextForUTF8 — asserts the
    JSONL contains `"body":"hello"` and NOT a base64 form.
  - TestSQS_MessageBodyFallsBackToBase64ForBinary — drives a
    body of bytes that are not valid UTF-8 and asserts the
    `{"base64":...}` envelope path.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review


Reviewed commit: 42908204bc


Comment thread internal/backup/sqs.go
Comment on lines +276 to +277
st := s.queueState(encQueue)
st.messages = append(st.messages, rec)

P1 Badge Filter out stale generations before emitting messages

HandleMessageData appends every !sqs|msg|data|... row to the queue bucket without checking whether the record’s generation matches the queue’s current generation, so backups can include messages that are intentionally unreachable after PurgeQueue/DeleteQueue generation bumps (these old keys persist until reaper cleanup). In clusters where a snapshot is taken before reaping finishes, this will write stale rows into messages.jsonl and a restore will resurrect purged/deleted messages as live queue contents.


@bootjp
Owner Author

bootjp commented Apr 30, 2026

@claude review

Round 4 fix for Codex P1 round 9 on commit f923d99 (commit 42908204):

sqsMessageRecord.Body was a []byte, so json.Encoder rendered every body as base64 in messages.jsonl. A restore replaying each row through SendMessage would push the base64 string itself as the body (`hello` → `aGVsbG8=`), corrupting application payloads. AWS SQS body is XML-text by contract.

Introduced a typed sqsMessageBody ([]byte) with a custom MarshalJSON: valid UTF-8 → plain JSON string; non-UTF-8 → typed {"base64":"…"} envelope so binary bytes still round-trip without lossy replacement-character rewrites.

Tests: TestSQS_MessageBodyEmittedAsTextForUTF8 and TestSQS_MessageBodyFallsBackToBase64ForBinary. go test -race and golangci-lint clean. Fix merged forward into #716 / #718.

@claude

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 1m 18s


Review of commit 42908204 (Round 4)

  • Read internal/backup/sqs.go
  • Read internal/backup/sqs_test.go
  • Review sqsMessageBody type and MarshalJSON
  • Check previous nits status
  • Post review

The P1 fix is correct. One new nit; three previous nits still open.

Fix: sqsMessageBody custom MarshalJSON

// sqs.go:146–169
type sqsMessageBody []byte

func (b sqsMessageBody) MarshalJSON() ([]byte, error) {
    if utf8.Valid(b) {
        out, err := json.Marshal(string(b))    // plain JSON string
        ...
    }
    envelope := struct {
        Base64 string `json:"base64"`
    }{Base64: base64.RawURLEncoding.EncodeToString(b)}
    ...
}

utf8.Valid correctly gates the two paths. json.Marshal(string(b)) handles all JSON string escaping (quotes, control chars, surrogates) without risk of double-encoding. The decodeSQSMessageValue local struct still uses Body []byte (line 639), which Go silently converts to sqsMessageBody at the assignment Body: live.Body (line 660) — that's a legal same-underlying-type conversion. ✅

TestSQS_MessageBodyEmittedAsTextForUTF8 asserts "body":"hello" appears in the raw JSONL bytes and aGVsbG8 does NOT. TestSQS_MessageBodyFallsBackToBase64ForBinary asserts "body":{"base64": appears for a non-UTF-8 payload. Both tests lock down the exact bytes that hit the wire — correct. ✅

New nit: RawURLEncoding in binary fallback (sqs.go:163)

Base64: base64.RawURLEncoding.EncodeToString(b)

The envelope key is "base64" — which conventionally signals RFC 4648 standard base64. RawURLEncoding uses the -_ alphabet (no +/, no padding) rather than the standard +/= form. A restore-tool author reading the schema and seeing "base64" will instinctively reach for base64.StdDecoding and get garbage. Either switch to base64.StdEncoding or add an explicit note in the type comment naming the actual encoding used. Fix this →


Three nits still open from earlier rounds (non-blocking)

1. peekMsgDataKey boundary inconsistency (sqs.go:723)

if idx == 0 || idx+genBytes > len(rest) {  // should be >=

Flagged in rounds 1, 2, 3. Fix this →

2. Stale comment in HandleSideRecord (sqs.go:290–292)

// Tombstones include a fixed-width gen but no msg ID; the
// generic parser tolerates the empty trailer.

parseSQSGenericKey now rejects idx == len(rest). The comment says the opposite. Flagged in rounds 1, 2, 3. Fix this →

3. peekMsgDataKey silently mishandles partitioned keys (sqs.go:717)

No isPartitionedRest branch — returns Encoded = "p" for any partitioned key. Flagged in round 3. Fix this →


Overall: The Round 4 fix is correct and complete — sqsMessageBody.MarshalJSON solves the base64-in-JSONL P1 without any edge-case risk, and both tests lock down the exact bytes on the wire. The new nit about RawURLEncoding vs standard base64 is the only substantive item. Ready to merge once the stacking dependency (#713) lands.

bootjp added 2 commits April 30, 2026 22:04
Codex P1 round 10 (commit 4290820):

HandleMessageData was appending every !sqs|msg|data|... row to the
queue's buffer without checking whether the message's
QueueGeneration matched the queue's current generation. PurgeQueue
and DeleteQueue bump the generation but the affected rows are
removed asynchronously by the reaper; a snapshot taken mid-cleanup
still carries those stale rows and the encoder would emit them to
messages.jsonl, resurrecting purged messages on restore.

  - Add HandleQueueGen to capture activeGen from
    !sqs|queue|gen|<encoded> records (decimal-string value, mirrors
    adapter/sqs_catalog.go's CreateQueue Put).
  - Track activeGen on sqsQueueState.
  - flushQueue calls filterStaleGenMessages: drops messages whose
    QueueGeneration != activeGen and emits an
    sqs_stale_generation_messages_dropped warning so operators can
    correlate dump shape with mid-reap snapshot timing.
  - Safe fallback: activeGen == 0 (no gen record observed) keeps
    the legacy behaviour, so a backup that lacks the gen record
    does not silently lose every message.

Tests:
  - TestSQS_StaleGenerationMessagesDropped: live + stale message,
    asserts only live survives and the warning fires.
  - TestSQS_StaleGenerationFilterDisabledWithoutGenRecord: no gen
    record => no filter; preserves the legacy semantics.

flushQueue's body is also extracted into writeMessagesJSONL to
keep cyclomatic complexity under the project linter's ceiling.
@bootjp
Owner Author

bootjp commented Apr 30, 2026

@claude review

Round 5 fix for Codex P1 round 10 on commit 4290820 (commit a59320d7):

HandleMessageData was appending every !sqs|msg|data|... row without checking generation. PurgeQueue and DeleteQueue bump the queue's generation but the affected rows are removed asynchronously by the reaper; a snapshot taken mid-cleanup still carries those stale rows and the encoder would emit them, resurrecting purged messages on restore.

Added HandleQueueGen to capture activeGen from !sqs|queue|gen|<encoded> records (decimal-string value, mirrors adapter/sqs_catalog.go's CreateQueue Put). flushQueue now calls filterStaleGenMessages which drops messages whose QueueGeneration != activeGen and emits an sqs_stale_generation_messages_dropped warning. Safe fallback: activeGen == 0 (no gen record observed) keeps the legacy behaviour so a backup lacking the gen record does not silently lose every message.

Tests: TestSQS_StaleGenerationMessagesDropped, TestSQS_StaleGenerationFilterDisabledWithoutGenRecord. go test -race and golangci-lint clean. Fix merged forward into #716 / #718.

@claude

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 3m 14s


Review of commit a59320d7 (Round 5)

  • Read internal/backup/sqs.go — stale-gen filtering changes
  • Read internal/backup/sqs_test.go — new tests
  • Read adapter/sqs_catalog.go + adapter/sqs_keys.go — verify gen record format
  • Check status of previous nits
  • Post review feedback

Cross-checked the gen-record format against adapter/sqs_catalog.go. loadQueueGenerationAt (line 864) returns 0 for ErrKeyNotFound; tryCreateQueueOnce (line 968) writes lastGen + 1, so the minimum live gen is 1. The sentinel activeGen == 0 is safe — it cannot collide with a real gen value. The value encoding matches exactly: live writes strconv.FormatUint(gen, 10) (line 1000); HandleQueueGen reads with strconv.ParseUint(..., 10, 64) (line 284). ✅

HandleQueueGen

// sqs.go:279–290
func (s *SQSEncoder) HandleQueueGen(key, value []byte) error {
    encoded, err := stripPrefixSegment(key, []byte(SQSQueueGenPrefix))
    if err != nil { return err }
    gen, err := strconv.ParseUint(strings.TrimSpace(string(value)), 10, 64)
    if err != nil { return errors.Wrap(ErrSQSMalformedKey, err.Error()) }
    s.queueState(encoded).activeGen = gen
    return nil
}

strings.TrimSpace is a nice defensive touch — trailing newlines from naive KV stores won't cause spurious errors. The decimal-string contract matches the live adapter exactly.

filterStaleGenMessages

// sqs.go:748–762
func filterStaleGenMessages(in []sqsMessageRecord, activeGen uint64) ([]sqsMessageRecord, int) {
    if activeGen == 0 { return in, 0 }
    out := make([]sqsMessageRecord, 0, len(in))
    dropped := 0
    for _, m := range in {
        if m.QueueGeneration == activeGen { out = append(out, m); continue }
        dropped++
    }
    return out, dropped
}

Clean and correct. The make([]sqsMessageRecord, 0, len(in)) avoids reallocations in the common case where all messages survive. Warning fields (queue, active_gen, dropped, hint) give operators everything they need to diagnose mid-reap snapshots.

TestSQS_StaleGenerationMessagesDropped

Registers gen record "7", inserts live message (queue_generation: 7) and stale message (queue_generation: 6), asserts only the live message appears in messages.jsonl and the sqs_stale_generation_messages_dropped warning fires. Exact reproduction of the P1 scenario.

TestSQS_StaleGenerationFilterDisabledWithoutGenRecord

No gen record registered; message with queue_generation: 99 survives Finalize. Verifies the activeGen == 0 fallback keeps legacy behavior intact. Correct.


One new nit (non-blocking)

HandleQueueGen skips base64url validation of the encoded segment (sqs.go:279–290)

HandleQueueMeta validates the encoded segment by calling base64.RawURLEncoding.DecodeString(encoded) and returning ErrSQSMalformedKey on failure (line 251–253). HandleQueueGen doesn't — it calls queueState(encoded) with whatever the suffix is. A malformed gen key (e.g., containing |) would create an sqsQueueState entry keyed by a string that will never match any message or meta record, silently orphaning the gen observation. In practice, the live cluster always writes valid base64url segments, so this won't fire on real data; but a corrupt snapshot key would produce a silent orphan rather than an error. Fix this →


Four nits still open from earlier rounds (non-blocking)

1. peekMsgDataKey boundary inconsistency — sqs.go:807

if idx == 0 || idx+genBytes > len(rest) {  // should be >=

Production parseSQSMessageDataKey uses >=; this debug helper still uses >. Flagged in rounds 1, 2, 3, 4. Fix this →

2. Stale comment in HandleSideRecord — sqs.go:325–327

// Tombstones include a fixed-width gen but no msg ID; the
// generic parser tolerates the empty trailer.

parseSQSGenericKey now rejects idx == len(rest); the comment says the opposite. Flagged in rounds 1, 2, 3, 4. Fix this →

3. peekMsgDataKey silently mishandles partitioned keys — sqs.go:801
No isPartitionedRest branch; returns Encoded = "p" for any partitioned key. Flagged in rounds 3, 4. Fix this →

4. RawURLEncoding vs conventional base64 in binary body fallback — sqs.go:176
Field is named "base64" but encoded with base64.RawURLEncoding (-_ alphabet, no padding). A restore-tool author will reach for base64.StdDecoding and get garbage. Flagged in round 4. Fix this →


Overall: The Round 5 fix is correct and complete. The gen-record format is verified against the live adapter, the activeGen == 0 sentinel is provably safe (live cluster always writes gen ≥ 1), the filter logic is clean, and both tests precisely lock down the main behavior and fallback path. The one new nit and the four carry-over nits are all cosmetic or minor. Ready to merge once the stacking dependency (#713) lands.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review


Reviewed commit: 5c6a68b363


Comment thread internal/backup/sqs.go
Comment on lines +650 to +654
return &sqsQueueMetaPublic{
    FormatVersion:             1,
    Name:                      live.Name,
    FIFO:                      live.IsFIFO,
    ContentBasedDeduplication: live.ContentBasedDedup,

P1 Badge Preserve HT-FIFO attributes in exported queue metadata

decodeSQSQueueMetaValue only projects legacy fields into _queue.json, so persisted HT-FIFO attributes (PartitionCount, FifoThroughputLimit, DeduplicationScope) are dropped during export. For queues created with partitioned FIFO settings, a restore from this dump will silently recreate different queue behavior (single-partition/default routing/dedup scope), and these attributes are immutable in the adapter (SetQueueAttributes cannot fix them afterward). This makes the backup non-fidelity-preserving for partitioned FIFO queues.


Codex P1 round 12 (commit 5c6a68b):

`decodeSQSQueueMetaValue` only projected legacy SQS fields into
_queue.json. The HT-FIFO immutable attributes (PartitionCount,
FifoThroughputLimit, DeduplicationScope) were captured by the live
catalog at CreateQueue but silently dropped by the backup decoder.
The adapter rejects mutating these via SetQueueAttributes (they
are immutable per AWS contract), so a backup→restore roundtrip
would silently recreate single-partition / default-routing /
queue-scoped-dedup queues — non-fidelity preserving for any
partitioned FIFO workload.

Add the three fields to sqsQueueMetaPublic and the live decode
struct so they round-trip through the dump.

Test: TestSQS_QueueMetaPreservesHTFIFOAttributes drives a
4-partition perMessageGroupId / messageGroup-dedup queue and
asserts each field is read back from _queue.json.
@bootjp
Owner Author

bootjp commented Apr 30, 2026

@claude review

Round 6 fix for Codex P1 round 12 on commit 5c6a68b (commit 65ea67c7):

decodeSQSQueueMetaValue only projected legacy SQS fields into _queue.json. The HT-FIFO immutable attributes (PartitionCount, FifoThroughputLimit, DeduplicationScope) were captured by the live catalog at CreateQueue but silently dropped by the backup decoder. The adapter rejects mutating these via SetQueueAttributes (immutable per AWS contract), so a backup→restore roundtrip would silently recreate single-partition / default-routing / queue-scoped-dedup queues — non-fidelity-preserving for any partitioned FIFO workload.

Added all three fields to sqsQueueMetaPublic and the live decode struct so they round-trip through the dump. Test: TestSQS_QueueMetaPreservesHTFIFOAttributes drives a 4-partition perMessageGroupId / messageGroup-dedup queue and asserts each field is read back from _queue.json.

go test -race and golangci-lint clean. Fix merged forward into #716 / #718.

@claude

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 2m 25s


Review of commit 65ea67c7 (Round 6)

  • Read internal/backup/sqs.go — HT-FIFO field additions
  • Read internal/backup/sqs_test.go — new test
  • Cross-check against adapter/sqs_catalog.go struct definitions
  • Check status of previous nits
  • Post review feedback

Cross-checked the three new fields against adapter/sqs_catalog.go:130–149. JSON tag names match exactly; the adapter's sqsQueueMeta persists partition_count, fifo_throughput_limit, and deduplication_scope — the live decode struct in decodeSQSQueueMetaValue reads exactly those tags.

Fix: sqsQueueMetaPublic + decodeSQSQueueMetaValue

// sqs.go:159–161
PartitionCount            uint32 `json:"partition_count,omitempty"`
FifoThroughputLimit       string `json:"fifo_throughput_limit,omitempty"`
DeduplicationScope        string `json:"deduplication_scope,omitempty"`

The omitempty tags are correct: non-FIFO queues and single-partition queues will have zero/empty values for these fields and they should be omitted from _queue.json. A queue created with explicit PartitionCount=1 serializes as "partition_count":1 (uint32(1) is non-zero), while a default single-partition queue stored as PartitionCount=0 emits no field — both are semantically correct for restore tools.

The live decode struct (lines 648–660) omits omitempty on the three fields, which is correct — omitempty is only relevant at marshal time, not during json.Unmarshal.

The projection at lines 674–676 maps the three fields straight through with no transformation. Correct.

TestSQS_QueueMetaPreservesHTFIFOAttributes

Drives a 4-partition perMessageGroupId / messageGroup queue through HandleQueueMetaFinalize and reads back _queue.json. The floatField helper correctly handles JSON's default float64 numeric type; uint32(4) is well within float64's exact integer range. All three assertions fire on the correct keys. The test name and comment both call out the immutability constraint (SetQueueAttributes rejects changes) so the regression reason is self-documenting.

One observation: the test uses a dot in the queue name (ht-fifo.fifo) — this is a legal SQS FIFO queue name and exercises EncodeSegment on a name with a period. No issue, just noting it's a realistic test vector.


Five nits still open from earlier rounds (non-blocking)

1. peekMsgDataKey boundary inconsistency — sqs.go:825

if idx == 0 || idx+genBytes > len(rest) {  // should be >=

Production parseSQSMessageDataKey uses >= to reject empty msg-id segments; this debug helper still uses >. Flagged rounds 1–5. Fix this →

2. Stale comment in HandleSideRecord — sqs.go:337

// Tombstones include a fixed-width gen but no msg ID; the
// generic parser tolerates the empty trailer.

parseSQSGenericKey now rejects idx == len(rest); the comment says the opposite. Tombstone keys succeed because their gen starts with 0x00 (non-base64url), which terminates the scan before len(rest). Flagged rounds 1–5. Fix this →

3. peekMsgDataKey silently mishandles partitioned keys — sqs.go:824
No isPartitionedRest branch; returns Encoded = "p" for any partitioned key. Flagged rounds 3–5. Fix this →

4. RawURLEncoding vs conventional base64 in binary body fallback — sqs.go:187
Field is named "base64" but encoded with base64.RawURLEncoding (-_ alphabet, no padding). A restore-tool author will reach for base64.StdDecoding and get garbage. Flagged rounds 4–5. Fix this →

5. HandleQueueGen missing base64url validation — sqs.go:299
HandleQueueMeta validates the encoded segment with base64.RawURLEncoding.DecodeString before calling queueState; HandleQueueGen does not, so a corrupt snapshot key creates a dangling sqsQueueState entry. Flagged round 5. Fix this →


Overall: The Round 6 fix is correct and complete. The three HT-FIFO fields are verified against the live adapter struct, the omitempty placement is right, and the test locks down all three values through a realistic FIFO queue name. No new issues introduced. The five carry-over nits are all cosmetic or minor. Ready to merge once the stacking dependency (#713) lands.
