Basic serialization and deserialization of diff filters #71
Conversation
The changes extract a shared test helper for verifying serialization of geo diff counts with different bucket types in the MSB sparse bit field representation.
The changes split out serialization-related code from diff_count.rs and bitvec.rs into a new serde.rs module. This improves code organization by separating concerns.
Pull Request Overview
This PR implements basic serialization and deserialization functionality for diff filters, allowing geofilters to be converted to/from compact byte representations for storage and future comparison.
- Adds `from_bytes` and `write` methods to convert GeoDiffCount filters to/from byte arrays (see the sketch after this list)
- Implements serialization for both MSB sparse arrays and LSB bitmaps with endianness considerations
- Includes comprehensive tests for round-trip serialization with random data
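A rough sketch of what the round trip could look like from a caller's perspective. Only the `write`/`from_bytes` names come from this summary; the buffer handling, signatures, and the `PartialEq` comparison are assumptions, not the crate's actual API:

```rust
// Hypothetical round trip; everything except `write`/`from_bytes`
// is a placeholder, not the crate's real interface.
let mut buf: Vec<u8> = Vec::new();
filter.write(&mut buf); // serialize the GeoDiffCount into bytes
let restored = GeoDiffCount::from_bytes(&buf); // reconstruct it from the bytes
assert_eq!(filter, restored); // the round trip should be lossless
```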
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| crates/geo_filters/src/diff_count/serde.rs | New module implementing serialization/deserialization methods for GeoDiffCount with little-endian optimization |
| crates/geo_filters/src/diff_count/bitvec.rs | Adds BitVec serialization methods and helper functions for bitmap persistence |
| crates/geo_filters/src/diff_count.rs | Integrates serde module, adds explanatory comments for MSB/LSB operations, and includes comprehensive serialization tests |
```rust
        buf.as_ptr(),
        buf.len() / BYTES_PER_BLOCK,
    ))
};
```
Using `std::mem::transmute` with raw pointers can be unsafe and may cause undefined behavior if the memory layout assumptions are incorrect. Consider using safer alternatives like `slice::align_to()` or explicit casting with proper alignment checks.
Suggested change:

```rust
let (prefix, blocks, suffix) = unsafe { buf.align_to::<u64>() };
assert!(
    prefix.is_empty() && suffix.is_empty(),
    "Buffer is not properly aligned for u64"
);
```
Clippy wants explicit types on transmute calls (reasonable). Fixed an issue where `from_raw_parts` was upset about misaligned slices. I think for all platforms we're targeting, the alignment of the buckets doesn't matter. This was fixed by not casting the byte buffer pointer to the correct type and instead transmuting after the fact.

Fix alignment and unsafe transmutes: the changes focus on making alignment handling more safe and explicit in unsafe code blocks, while also adding test coverage for alignment issues.

Fix unsafe transmutes and alignment in bitvec module: the changes improve type safety in unsafe transmutes and add alignment tests for raw pointer conversions. This fixes potential undefined behavior when converting between slice types.
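For illustration, a minimal sketch of the `align_to` pattern this fix moves toward, assuming a plain byte buffer holding `u64` blocks (the function name is illustrative, not the crate's actual code):

```rust
/// Reinterpret a byte buffer as u64 blocks without a raw pointer cast.
/// `align_to` splits the slice into an unaligned prefix, an aligned middle,
/// and an unaligned suffix; asserting that the outer parts are empty makes
/// the alignment assumption explicit instead of silently risking UB.
fn bytes_as_blocks(buf: &[u8]) -> &[u64] {
    // SAFETY: every bit pattern is a valid u64, so reinterpreting the
    // aligned middle of the byte slice is sound.
    let (prefix, blocks, suffix) = unsafe { buf.align_to::<u64>() };
    assert!(
        prefix.is_empty() && suffix.is_empty(),
        "buffer is not u64-aligned or not a multiple of 8 bytes"
    );
    blocks
}
```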
crates/geo_filters/src/diff_count.rs (outdated)
```rust
// We have removed a value from our MSB, move a value in the
// LSB into the MSB
```
Maybe reference here the invariant that the MSB vector must always have the first n bits represented. By removing one, we have to move another bit into the MSB vector if there is still one available in the LSB bitmap.
There is a second invariant: the LSB bitmap is sized such that all bits smaller than the ones in the MSB vector fit into the LSB bitmap.
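To make the first invariant concrete, here is a toy model (not the crate's actual types) of what a deletion from the MSB vector has to do to preserve it:

```rust
/// Toy model: `msb` holds the positions of the highest set bits in
/// descending order; `lsb` is a dense bitmap for all positions below them.
struct Split {
    msb: Vec<usize>,
    lsb: Vec<bool>,
}

impl Split {
    /// Removing a position from the MSB vector must pull the highest set bit
    /// out of the LSB bitmap (if one exists); otherwise the invariant that
    /// the MSB vector represents the first n bits would be violated.
    fn remove_from_msb(&mut self, pos: usize) {
        self.msb.retain(|&p| p != pos);
        if let Some(top) = self.lsb.iter().rposition(|&b| b) {
            self.lsb[top] = false;
            self.msb.push(top);
            self.msb.sort_unstable_by(|a, b| b.cmp(a));
        }
    }
}
```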
Oh. Here is the bug:
We resize the LSB vector here if there are no LSB bits set. (at least I think that is what's happening here).
But in the case above where we toggle bits in the LSB vector directly, we are not checking for this case.
So, the correct solution is to remove this case here so that we always get the same representation independent of the order of operations being performed.
Can you write a test for this? (It seems that we are unlucky with the randomized tests :) )
Opened a PR to address this: #72
I didn't add a new test, but I added a way of deterministically replaying fuzz tests. I'm not sure how we'd reliably get this to fail other than to keep fuzzing until we find a failure and then hardcode the seed. Maybe by using a no-op hash function (the identity function)?
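One common way to make such fuzz tests replayable, sketched with the `rand` crate. This is an illustration of the idea only, not the actual change in #72; the environment variable name and test body are assumptions:

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

/// Use a seed from the environment when replaying a failure, otherwise pick
/// a fresh random one and print it so a failing run can be reproduced.
fn fuzz_rng() -> StdRng {
    let seed = std::env::var("FUZZ_SEED")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or_else(rand::random::<u64>);
    eprintln!("replay with FUZZ_SEED={seed}");
    StdRng::seed_from_u64(seed)
}

#[test]
fn serialization_roundtrip_fuzz() {
    let mut rng = fuzz_rng();
    // Deterministic given the seed: feed random values into the filter,
    // serialize, deserialize, and compare (filter details elided here).
    let values: Vec<u64> = (0..1_000).map(|_| rng.gen()).collect();
    let _ = values;
}
```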
- Remove serde module; put code into diff_count.rs.
- Remove bonus comments.
- Remove panic-y big-endian serialization stuff.
- Squish together lines.
What does this change?
It is useful to be able to serialize the geofilters into a compact byte representation so that they can be stored and compared
against other filters in the future.
This PR adds a few functions and tests to support writing the geofilters to, and reading geofilters from byte buffers.
Implemented only for `Diff` filters for now.

How does this work?
The implementation is pretty straightforward if you're familiar with the way the most and least significant bits are split (I wasn't at the beginning of today). Notes on that below...
We serialize the MSB array as numbers and the LSB bitmap as a bitmap (with one byte to represent the unoccupied bits in the most significant `u64`).

I used some unsafe Rust to convert between the various representations. This means this library is not safe if you're sharing the serialized representations across machines with different endiannesses. I've left this in a comment for now. I partially implemented support for big-endian platforms, but given that I won't be able to easily test this, I stashed it for now. Maybe someone will one day want a geofilter on an IBM mainframe (a tier 2 Rust target!).
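For concreteness, a minimal sketch of that kind of layout using explicit little-endian conversions (the PR itself transmutes in native byte order; the framing and field order below are illustrative, not the actual wire format):

```rust
/// Illustrative encoder: MSB entries as little-endian numbers, then the LSB
/// bitmap blocks, prefixed by one byte counting the unoccupied bits in the
/// most significant LSB block.
fn encode(msb: &[u64], lsb_blocks: &[u64], unused_bits: u8) -> Vec<u8> {
    let mut out = Vec::new();
    out.push(unused_bits);
    out.extend_from_slice(&(msb.len() as u32).to_le_bytes());
    for n in msb {
        out.extend_from_slice(&n.to_le_bytes());
    }
    for block in lsb_blocks {
        out.extend_from_slice(&block.to_le_bytes());
    }
    out
}
```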
I've added a test that throws random values into the filter, writes it into a vec, and then reads it back out again to check that nothing has changed.
Bitmap representation
The bitfield that represents the spread over the geometric distribution is encoded in two parts: the most significant bits and the least significant bits are held separately in two different representations.
But why?
As items are inserted into the filter they are much more likely to land in the buckets represented by the least significant bits due to the geometric distribution. As you walk up the bit array the sparsity increases, because it's less likely that an item will be hashed to a bucket there. As a result, it doesn't make sense to store a dense representation for the entire filter, as you'd end up with a lot of zeros in the more significant end of the bitmap.
So this crate represents the sparser part of the bitmap efficiently using a vector of numbers representing the occupied indices, and represents the denser part of the bitmap with a regular bit array.
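As a toy illustration of that split (not the crate's actual layout): given the set bit positions and a budget of n sparse entries, the top n go into the index vector, and everything below the smallest of them fits into the dense bitmap:

```rust
/// Toy split: the top `msb_count` positions become the sparse index vector,
/// everything below the smallest of them goes into a dense bitmap.
fn split(positions: &[usize], msb_count: usize) -> (Vec<usize>, Vec<bool>) {
    let mut sorted = positions.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // descending
    sorted.dedup();
    let msb: Vec<usize> = sorted.iter().take(msb_count).copied().collect();
    // The bitmap only needs to cover positions below the smallest MSB entry
    // (the second invariant mentioned in the review above).
    let limit = msb.last().copied().unwrap_or(0);
    let mut lsb = vec![false; limit];
    for &p in sorted.iter().skip(msb_count) {
        lsb[p] = true; // p < limit, since `sorted` is descending and deduped
    }
    (msb, lsb)
}
```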
Review notes
I also added some comments for some other bits as I was figuring this data structure out. I left them in this PR because they seem generally useful for someone who is not super familiar with the implementation of the MSB/LSB split. It would be nice to get some confirmation that these comments are correct, and let me know if you think they are not needed. I have been taking notes as I go along understanding this data structure, and I plan on writing a doc for non-academic CS people to be able to get a grasp on its workings, which might make these kinds of comments a bit irrelevant.
Please also let me know if there are any mistakes in my summary of the data structure in this PR description.