
CVE-2026-25087 // signed overflow in arrow's ipc reader

tldr: the “prebuffered” read path had zero fuzz coverage, and the CoalesceReadRanges() function didn’t check whether offset + length overflows int64. fixed in PR #48925.

glossary

  • Apache Arrow // columnar in-memory data format used by Spark, Pandas, and most of the data analytics ecosystem
  • IPC // Arrow’s binary file format for serializing/deserializing RecordBatches between processes
  • FlatBuffers // zero-copy serialization lib (by Google) used by Arrow for its file metadata
  • signed integer overflow // when arithmetic on signed ints wraps past the type’s max value, undefined behavior in C/C++
  • UBSAN // undefined behavior sanitizer, compiler instrumentation that catches UB at runtime
  • fuzz harness // the wrapper code that feeds mutated input to a target function during fuzzing

apache products keep finding me! one of my first gigs was doing live upgrades on apache servers for US colleges as a sysadmin, my first client finding involved apache tomcat, and now here i am inside Arrow’s C++ IPC internals. i picked it because the IPC reader pulls int64 offsets and lengths straight out of flatbuffer metadata and does arithmetic on them. big adoption, direct trust of file-controlled integers.

Arrow’s IPC reader has two read paths: a normal one where buffers get read individually, and a cached/prebuffered path that kicks in when you call PreBufferMetadata() before reading batches. the prebuffered path collects all buffer ranges upfront and runs them through CoalesceReadRanges() to merge nearby reads for performance. i checked the fuzz harness and it only exercised the normal path. the entire prebuffered codepath had zero coverage. here we go…

the bug

CoalesceReadRanges() at interfaces.cc:475:

const int64_t current_range_end = current_range_start + itr->length;

both values come from the IPC file’s flatbuffer metadata. the reader validates them individually: checks they’re non-negative, checks alignment, etc., but it never considers whether their sum overflows.

the data flow is direct: GetBuffer() in reader.cc pulls raw offset and length from the flatbuffer -> ReadBuffer() feeds them into RequestRange() -> accumulates in a vector -> cached read path hands the vector to CoalesceReadRanges() through the range cache.

poc

first, a small C++ program (gen_control.cc) uses Arrow’s own writer API to produce a completely valid IPC file. then a Python script (corrupt_ipc.py) parses the IPC framing, locates the RecordBatch message in the flatbuffer, and patches only the buffer offset and length fields. the corruption is as follows: offset near INT64_MAX, length positive, sum overflows. everything else untouched.

i added a binary diff sanity check to the run script. this ended up helping quite a bit during disclosure because the maintainers could verify the scope of the corruption quickly and painlessly.

auto reader = RecordBatchFileReader::Open(file, options);
reader->PreBufferMetadata({0});
reader->ReadRecordBatch(0);

standard public API. PreBufferMetadata forces the cached path. i built with UBSAN and ran the patched file:

interfaces.cc:475:68: runtime error: signed integer overflow:
9223372036837998864 + 536870912 cannot be represented in type 'long int'

stack trace confirmed: ReadRangeCache::Cache -> CoalesceReadRanges -> overflow!!

Arrow team fixed it in PR #48925 with overflow-safe arithmetic and bounds-checking against actual file length. they also updated the fuzz harness to cover the prebuffered path - that part matters as much as the code fix.

overall, i’d call this a textbook signed integer overflow from untrusted input. i’d been reading a couple of posts on exactly this bug class at the time, and this was about as 1:1 a translation as it gets. the sole reason it survived in a well-fuzzed project is that the harness only exercised one of the two read paths. and i just happened to catch it :)

emi.