GH-3511: Add JMH encoding benchmarks and fix parquet-benchmarks shaded jar #3512
iemejia wants to merge 3 commits into apache:master
Conversation
… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor configuration and the AppendingTransformer entries for BenchmarkList / CompilerHints. As a result, the shaded jar built from master fails at runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:
- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds jmh-generator-annprocess to maven-compiler-plugin's annotation processor paths, and adds AppendingTransformer entries for META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.
- Adds 11 JMH benchmarks covering the encode/decode paths used by the pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504, apache#3506, apache#3510), so reviewers can reproduce the reported numbers and detect regressions: IntEncodingBenchmark, BinaryEncodingBenchmark, ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark, FixedLenByteArrayEncodingBenchmark, FileReadBenchmark, FileWriteBenchmark, RowGroupFlushBenchmark, ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks (was 0 from a working build, or unrunnable at all from a default build).
Force-pushed from 668caf7 to 2404a29
Pre-generate deterministic rows for the file and concurrent benchmarks so row construction does not skew the timed section, and make the encoding benchmarks include real dictionary-page and dictionary-decode work instead of only value buffers. Split synthetic RLE dictionary-index decoding into its own benchmark and encode generated binary payloads as UTF-8 explicitly so benchmark inputs stay consistent across runs and platforms.
- Make the dictionary encode/decode benchmarks symmetric by routing both sides through a shared EncodedDictionary helper, guard against the dictionary writer falling back to plain encoding (which previously NPE'd in BinaryEncodingBenchmark setup for high-cardinality long strings), and drop redundant close() calls after toDictPageAndClose().
- Share the pre-generated row array across threads in ConcurrentReadWriteBenchmark via Scope.Benchmark, eliminating 4x heap duplication and a now-unnecessary ThreadData inner class.
- Centralize the RNG seed as TestDataFactory.DEFAULT_SEED and add seed-overload variants for the int and binary generators so generators in the same setup no longer share a Random and silently depend on call order.
- Wrap the RLE encoder in try-with-resources and validate that LOW_CARDINALITY_DISTINCT fits within the configured bit width.
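The seed-centralization point above can be sketched in plain Java. Names and the seed value here are illustrative, not the actual TestDataFactory code: the idea is that every generator builds its own `Random` from an explicit seed, so two generators created in the same setup method no longer share state or depend on call order.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of the seed-overload pattern described above. DEFAULT_SEED's value
// and the method names are assumptions for illustration only.
public final class SeededGenerators {
    public static final long DEFAULT_SEED = 42L; // assumed value

    // Each call gets its own Random, so generators never share RNG state.
    public static int[] randomInts(int count, long seed) {
        Random rng = new Random(seed);
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = rng.nextInt();
        }
        return out;
    }

    // Seed-overload variant: deterministic default for reproducible runs.
    public static int[] randomInts(int count) {
        return randomInts(count, DEFAULT_SEED);
    }

    public static void main(String[] args) {
        // Same seed, same data, regardless of what else ran before.
        int[] a = randomInts(5);
        int[] b = randomInts(5);
        System.out.println(Arrays.equals(a, b)); // prints "true"
    }
}
```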
Replace fastutil's *2IntLinkedOpenHashMap with the plain *2IntOpenHashMap plus a separate primitive-typed list to track insertion order in the five dictionary writers (binary, long, double, float, int).

The Linked variant was used because the dictionary page must be emitted in insertion order, but it pays an avoidable cost on every put: two extra long fields per slot (prev, next), 3-4 scattered writes per insert to fix up the doubly-linked list, and re-stitching on rehash. None of this is vectorizable. With the plain map plus an append-only list, the hash map is a pure id lookup with the smallest possible slot, and the list is contiguous and cache-friendly to iterate at flush time.

Both candidates are fastutil primitive-keyed maps, so this is not a boxing change. The win is structural: an ordering guarantee that was being paid for on every insert is replaced with an explicit append-only list that provides it more cheaply.

Benchmark results (BinaryEncodingBenchmark.encodeDictionary, IntEncodingBenchmark.encodeDictionary - added in apache#3512):
- encodeDictionary (binary, high cardinality, short strings): +23-42%
- encodeDictionary (int, high cardinality): ~+2x
- low-cardinality cases: flat (linked-list overhead doesn't matter when there are few inserts)

No public API change. No file format change. Behavior is identical: dictionary pages emit values in the same order.

Validation: parquet-column 573 tests pass. Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.
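The map-plus-ordered-list shape described above can be sketched with JDK collections. The real writers use fastutil primitive-keyed maps (e.g. a `Long2IntOpenHashMap`) to avoid boxing; this stdlib version shows only the structural idea, with hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Structural sketch: the map is a pure value -> id lookup, and a separate
// append-only list carries the insertion-order guarantee the Linked map
// used to provide on every put.
public final class OrderedDictionary {
    private final Map<Long, Integer> ids = new HashMap<>();
    private final List<Long> insertionOrder = new ArrayList<>(); // iterated at flush time

    /** Returns the dictionary id for the value, assigning a new one on first sight. */
    public int idFor(long value) {
        Integer id = ids.get(value);
        if (id == null) {
            id = ids.size();
            ids.put(value, id);
            insertionOrder.add(value); // ordering lives here, not in the map
        }
        return id;
    }

    /** Values in first-seen order, as the dictionary page must emit them. */
    public List<Long> orderedValues() {
        return insertionOrder;
    }

    public static void main(String[] args) {
        OrderedDictionary dict = new OrderedDictionary();
        long[] column = {7, 3, 7, 9, 3};
        StringBuilder assignedIds = new StringBuilder();
        for (long v : column) {
            assignedIds.append(dict.idFor(v));
        }
        System.out.println(assignedIds);          // prints "01021"
        System.out.println(dict.orderedValues()); // prints "[7, 3, 9]"
    }
}
```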
steveloughran
left a comment
I'm only just learning effective JMH myself; these all LGTM. The one request is that temp files are put under target/ so that a `./run.sh all` command puts everything in the existing temporary directory tree.
```java
    }

    @Benchmark
    @OperationsPerInvocation(VALUE_COUNT)
```
Didn't know of this trick until I saw your PR; adopted in #3452 earlier today.
```java
/**
 * A no-op {@link OutputFile} that discards all written data.
 * Useful for isolating CPU/encoding cost from filesystem I/O in write benchmarks.
 */
public final class BlackHoleOutputFile implements OutputFile {
```
Does this act as a black hole for the benchmarks? Or would passing the Blackhole into the constructor and having it used on L62 and L67 be best?
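For context on the class under review: the core of a black-hole output is a stream that accepts every byte, remembers how many it has seen, and keeps nothing. The real `BlackHoleOutputFile` implements Parquet's `OutputFile`/`PositionOutputStream` interfaces; this stdlib-only sketch shows just that idea, under assumed names.

```java
import java.io.IOException;
import java.io.OutputStream;

// Minimal sketch of a discarding, position-tracking stream, the kind of
// stream a black-hole OutputFile would hand out. Writers still need a
// consistent byte offset even when the data itself is thrown away.
public final class DiscardingStream extends OutputStream {
    private long position;

    @Override
    public void write(int b) {
        position++; // count the byte, drop the data
    }

    @Override
    public void write(byte[] buf, int off, int len) {
        position += len; // bulk writes advance the offset without copying
    }

    public long position() {
        return position;
    }

    public static void main(String[] args) throws IOException {
        try (DiscardingStream out = new DiscardingStream()) {
            out.write(new byte[1024], 0, 1024);
            out.write(0xFF);
            System.out.println(out.position()); // prints "1025"
        }
    }
}
```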
```java
    @Setup(Level.Trial)
    public void setup() throws IOException {
        tempFile = File.createTempFile("parquet-read-bench-", ".parquet");
```
There's a constant BenchmarkFiles.TARGET_DIR which defines the dir for benchmark files; it puts them under target/ so Maven will clean them up. I used that in my PR so killing a test run in my IDE didn't leave cruft around. I'd recommend the same.
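The reviewer's suggestion amounts to using the three-argument `File.createTempFile` overload, which pins the file under a given directory instead of the system temp dir. The `target/tests` path below is an assumption for illustration; in parquet-benchmarks the actual location comes from `BenchmarkFiles.TARGET_DIR`.

```java
import java.io.File;
import java.io.IOException;

// Sketch: create benchmark temp files under a project-local target/
// directory so `mvn clean` (or deleting target/) removes them, instead
// of leaving cruft in the system temp dir after a killed run.
public final class TargetDirTempFiles {
    public static void main(String[] args) throws IOException {
        File targetDir = new File("target", "tests"); // assumed layout
        if (!targetDir.mkdirs() && !targetDir.isDirectory()) {
            throw new IOException("cannot create " + targetDir);
        }
        // The directory-taking overload keeps the file under targetDir.
        File tempFile = File.createTempFile("parquet-read-bench-", ".parquet", targetDir);
        tempFile.deleteOnExit();
        System.out.println(tempFile.getParentFile().getName()); // prints "tests"
    }
}
```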
Summary
Resolves #3511.
The `parquet-benchmarks` shaded jar built from current master is non-functional: it fails at runtime with `RuntimeException: Unable to find the resource: /META-INF/BenchmarkList`. This PR fixes that and adds 11 JMH benchmarks covering the encode/decode paths exercised by the open performance PRs, so reviewers can reproduce the reported numbers.

What's broken on master

`parquet-benchmarks/pom.xml` is missing two pieces of configuration:

- `maven-compiler-plugin` lacks the `annotationProcessorPaths`/`annotationProcessors` config for `jmh-generator-annprocess`, so the JMH annotation processor never runs and `META-INF/BenchmarkList` / `META-INF/CompilerHints` are never generated.
- `maven-shade-plugin` lacks `AppendingTransformer` entries for those two resources, so even if generated they would be dropped during shading.

Both problems are fixed in this PR.
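The two missing pieces look roughly like this (a sketch: version properties and surrounding plugin sections are assumed; the PR diff itself is authoritative):

```xml
<!-- maven-compiler-plugin: run the JMH annotation processor -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <annotationProcessorPaths>
      <path>
        <groupId>org.openjdk.jmh</groupId>
        <artifactId>jmh-generator-annprocess</artifactId>
        <version>${jmh.version}</version>
      </path>
    </annotationProcessorPaths>
  </configuration>
</plugin>

<!-- maven-shade-plugin: keep the generated JMH resources in the shaded jar -->
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
  <resource>META-INF/BenchmarkList</resource>
</transformer>
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
  <resource>META-INF/CompilerHints</resource>
</transformer>
```

Without the transformers, the shade plugin's default resource handling drops the generated `BenchmarkList`, which is exactly the resource JMH's runner looks up at startup.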
Benchmarks added
11 new files in `parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/`:

- `IntEncodingBenchmark`
- `BinaryEncodingBenchmark`
- `ByteStreamSplitEncodingBenchmark` / `ByteStreamSplitDecodingBenchmark`
- `FixedLenByteArrayEncodingBenchmark`
- `FileReadBenchmark` / `FileWriteBenchmark`
- `RowGroupFlushBenchmark`
- `ConcurrentReadWriteBenchmark`
- `BlackHoleOutputFile`: a no-op `OutputFile` that discards bytes; isolates CPU from I/O
- `TestDataFactory`

Validation
After this PR, the shaded jar is runnable and registers 87 benchmarks.
Sanity check: `IntEncodingBenchmark.decodePlain` reproduces the master baseline cited in #3493/#3494 (~91M ops/s on JDK 21, JMH 1.37, 3 warmup + 5 measurement iterations).

Out of scope (deferred)
Modernization of the existing `ReadBenchmarks` / `WriteBenchmarks` / `NestedNullWritingBenchmarks` (Hadoop-free `LocalInputFile`, parameterization, JMH-idiomatic state setup) is a separate concern and will be proposed in a follow-up PR.
Once this lands, each open perf PR (#3494, #3496, #3500, #3504, #3506, #3510) will be updated with a one-line "How to reproduce" snippet referencing the relevant `*Benchmark` class.