parquet: optimize CachedArrayReader byte-array coalescing #9743
ClSlaid wants to merge 1 commit into apache:main from issue-9060-cached-array-reader-byte-coalescer
Conversation
When CachedArrayReader builds output from multiple cached batches, the old path materialized filtered byte arrays and then concatenated them. Replace that path for Utf8/Binary arrays with a direct coalescer that builds offsets, values, and validity in one output array, while keeping the existing generic MutableArrayData path for other types. Add a dedicated CachedArrayReader benchmark and a sparse string regression test so this path is measured directly and covered independently of broader parquet reader benchmarks.

Benchmark vs main:

- cached_array_reader/utf8_sparse_cross_batch_4m_rows/consume_batch: 11.949 ms -> 4.153 ms (-65.2%)
- arrow_reader_clickbench/sync/Q24 (same filter/projection as ClickBench Q26): 28.377 ms -> 28.443 ms (+0.2%, no measurable change)

Signed-off-by: cl <cailue@apache.org>
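The single-pass build the description refers to can be illustrated with a minimal sketch in plain Rust (no arrow dependency; the batch representation, the `coalesce_utf8` name, and the selection-mask shape are all hypothetical stand-ins, not the PR's actual types): selected values from each cached batch extend one shared offsets/values/validity triple, instead of materializing a filtered array per batch and concatenating afterwards.

```rust
// Sketch of single-pass byte-array coalescing. Each batch pairs optional
// string values with a row-selection mask; selected rows append directly
// to arrow-style offsets, a contiguous byte buffer, and a validity list.
fn coalesce_utf8(batches: &[(Vec<Option<&str>>, Vec<bool>)]) -> (Vec<i32>, Vec<u8>, Vec<bool>) {
    let mut offsets: Vec<i32> = vec![0]; // one more entry than output rows
    let mut values: Vec<u8> = Vec::new(); // concatenated UTF-8 bytes
    let mut validity: Vec<bool> = Vec::new(); // true = non-null

    for (vals, selected) in batches {
        for (v, keep) in vals.iter().zip(selected) {
            if !keep {
                continue; // filtered row: contributes nothing to the output
            }
            match v {
                Some(s) => {
                    values.extend_from_slice(s.as_bytes());
                    validity.push(true);
                }
                None => validity.push(false), // null: offset simply repeats
            }
            offsets.push(values.len() as i32);
        }
    }
    (offsets, values, validity)
}

fn main() {
    let b1 = (vec![Some("ab"), None, Some("cde")], vec![true, true, false]);
    let b2 = (vec![Some("f")], vec![true]);
    let (offsets, values, validity) = coalesce_utf8(&[b1, b2]);
    assert_eq!(offsets, vec![0, 2, 2, 3]);
    assert_eq!(values, b"abf".to_vec());
    assert_eq!(validity, vec![true, false, true]);
    println!("{offsets:?} {values:?} {validity:?}");
}
```

The point of the design is that the output buffers grow exactly once per selected row; no intermediate filtered arrays exist to be concatenated.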
@alamb I've tried to optimize with GPT 5.4, but the improvement was not that obvious on the original test case you gave, so I let it write a new benchmark and optimized against that. However, I'm still not really confident in the current implementation, so please have a look.
@XiangpengHao can you help review this PR?
run benchmarks arrow_reader_clickbench |
🤖 Arrow criterion benchmark running (GKE): comparing issue-9060-cached-array-reader-byte-coalescer (5e78671) to d7d9ad3 (merge-base)
🤖 Arrow criterion benchmark completed (GKE)
```rust
    selected_row_count: usize,
) -> ArrayRef {
    match selected_batches[0].array.data_type() {
        ArrowType::Utf8 => {
```
Wouldn't it need Utf8View support to show up in the ClickBench benchmarks?
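The reviewer's question points at a real layout difference. Arrow's Utf8View (StringViewArray, the type the ClickBench string columns typically use) does not store one contiguous offsets/values pair: each element is a fixed 16-byte "view" that either inlines short strings (up to 12 bytes) or carries a 4-byte prefix plus a (buffer index, offset) reference into a shared data buffer. A minimal illustrative sketch of that layout, with hypothetical names (`View`, `make_view`) that are not arrow-rs's actual definitions:

```rust
// Hypothetical simplified model of a Utf8View element. An offsets-based
// Utf8/Binary coalescer does not apply here: coalescing views means
// copying fixed-size view structs (and possibly sharing data buffers),
// not rebuilding an offsets/values pair.
#[derive(Debug, Clone, Copy)]
enum View {
    // Strings of <= 12 bytes live entirely inside the 16-byte view.
    Inline { len: u32, data: [u8; 12] },
    // Longer strings keep a 4-byte prefix for fast comparisons plus a
    // reference into a shared variadic data buffer.
    Ref { len: u32, prefix: [u8; 4], buffer: u32, offset: u32 },
}

fn make_view(s: &str, buffer: u32, offset: u32) -> View {
    let bytes = s.as_bytes();
    if bytes.len() <= 12 {
        let mut data = [0u8; 12];
        data[..bytes.len()].copy_from_slice(bytes);
        View::Inline { len: bytes.len() as u32, data }
    } else {
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&bytes[..4]);
        View::Ref { len: bytes.len() as u32, prefix, buffer, offset }
    }
}

fn main() {
    assert!(matches!(make_view("hi", 0, 0), View::Inline { len: 2, .. }));
    assert!(matches!(
        make_view("a string longer than twelve", 3, 128),
        View::Ref { buffer: 3, offset: 128, .. }
    ));
    println!("view layout sketch ok");
}
```

So a Utf8/Binary-only coalescer would fall back to the generic path for view arrays, which is consistent with the ClickBench numbers in the description showing no measurable change.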
Which issue does this PR close?

Closes #9060.

Rationale for this change

When CachedArrayReader builds output from multiple cached batches, the old path materialized filtered byte arrays and then concatenated them, paying for intermediate arrays on every build.

What changes are included in this PR?

Replace that path for Utf8/Binary arrays with a direct coalescer that builds offsets, values, and validity in one output array, while keeping the existing generic MutableArrayData path for other types.

Are these changes tested?

Yes. A dedicated CachedArrayReader benchmark and a sparse string regression test measure this path directly and cover it independently of the broader parquet reader benchmarks.

Are there any user-facing changes?

No; this is an internal optimization of an existing read path.