Basic serialization and deserialization of diff filters #71
Conversation
The changes extract a shared test helper for verifying serialization of geo diff counts with different bucket types in the MSB sparse bit field representation.
The changes split out serialization-related code from diff_count.rs and bitvec.rs into a new serde.rs module. This improves code organization by separating concerns.
Pull Request Overview
This PR implements basic serialization and deserialization functionality for diff filters, allowing geofilters to be converted to/from compact byte representations for storage and future comparison.
- Adds `from_bytes` and `write` methods to convert GeoDiffCount filters to/from byte arrays (see the sketch after this list)
- Implements serialization for both MSB sparse arrays and LSB bitmaps with endianness considerations
- Includes comprehensive tests for round-trip serialization with random data
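A rough sketch of what the round trip could look like from a caller's perspective. Only the `write`/`from_bytes` names come from this summary; the buffer handling, signatures, and the `PartialEq` comparison are assumptions, not the crate's actual API:

```rust
// Hypothetical round trip; everything except `write`/`from_bytes`
// is a placeholder, not the crate's real interface.
let mut buf: Vec<u8> = Vec::new();
filter.write(&mut buf); // serialize the GeoDiffCount into bytes
let restored = GeoDiffCount::from_bytes(&buf); // reconstruct it from the bytes
assert_eq!(filter, restored); // the round trip should be lossless
```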
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| crates/geo_filters/src/diff_count/serde.rs | New module implementing serialization/deserialization methods for GeoDiffCount with little-endian optimization |
| crates/geo_filters/src/diff_count/bitvec.rs | Adds BitVec serialization methods and helper functions for bitmap persistence |
| crates/geo_filters/src/diff_count.rs | Integrates serde module, adds explanatory comments for MSB/LSB operations, and includes comprehensive serialization tests |
```rust
        buf.as_ptr(),
        buf.len() / BYTES_PER_BLOCK,
    ))
};
```
Using `std::mem::transmute` with raw pointers can be unsafe and may cause undefined behavior if the memory layout assumptions are incorrect. Consider using safer alternatives like `slice::align_to()` or explicit casting with proper alignment checks.
Suggested change:

```rust
let (prefix, blocks, suffix) = unsafe { buf.align_to::<u64>() };
assert!(
    prefix.is_empty() && suffix.is_empty(),
    "Buffer is not properly aligned for u64"
);
```
Clippy wants explicit types on transmute calls (reasonable). Fixed an issue where `from_raw_parts` was upset about misaligned slices. I think for all platforms we're targeting, the alignment of the buckets doesn't matter. This was fixed by not casting the byte buffer pointer to the correct type and instead transmuting after the fact.

Fix alignment and unsafe transmutes: the changes focus on making alignment handling more safe and explicit in unsafe code blocks, while also adding test coverage for alignment issues.

Fix unsafe transmutes and alignment in bitvec module: the changes improve type safety in unsafe transmutes and add alignment tests for raw pointer conversions. This fixes potential undefined behavior when converting between slice types.
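For illustration, a minimal sketch of the `align_to` pattern this fix moves toward, assuming a plain byte buffer holding `u64` blocks (the function name is illustrative, not the crate's actual code):

```rust
/// Reinterpret a byte buffer as u64 blocks without a raw pointer cast.
/// `align_to` splits the slice into an unaligned prefix, an aligned middle,
/// and an unaligned suffix; asserting that the outer parts are empty makes
/// the alignment assumption explicit instead of silently risking UB.
fn bytes_as_blocks(buf: &[u8]) -> &[u64] {
    // SAFETY: every bit pattern is a valid u64, so reinterpreting the
    // aligned middle of the byte slice is sound.
    let (prefix, blocks, suffix) = unsafe { buf.align_to::<u64>() };
    assert!(
        prefix.is_empty() && suffix.is_empty(),
        "buffer is not u64-aligned or not a multiple of 8 bytes"
    );
    blocks
}
```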
crates/geo_filters/src/diff_count.rs (outdated)
```rust
// We have removed a value from our MSB, move a value in the
// LSB into the MSB
```
Maybe reference here the invariant that the MSB vector must always have the first n bits represented. By removing one, we have to move another bit into the MSB vector if there is still one available in the LSB bitmap.
There is a second invariant: the LSB bitmap is sized such that all bits smaller than the ones in the MSB vector fit into the LSB bitmap.
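To make the first invariant concrete, here is a toy model (not the crate's actual types) of what a deletion from the MSB vector has to do to preserve it:

```rust
/// Toy model: `msb` holds the positions of the highest set bits in
/// descending order; `lsb` is a dense bitmap for all positions below them.
struct Split {
    msb: Vec<usize>,
    lsb: Vec<bool>,
}

impl Split {
    /// Removing a position from the MSB vector must pull the highest set bit
    /// out of the LSB bitmap (if one exists); otherwise the invariant that
    /// the MSB vector represents the first n bits would be violated.
    fn remove_from_msb(&mut self, pos: usize) {
        self.msb.retain(|&p| p != pos);
        if let Some(top) = self.lsb.iter().rposition(|&b| b) {
            self.lsb[top] = false;
            self.msb.push(top);
            self.msb.sort_unstable_by(|a, b| b.cmp(a));
        }
    }
}
```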
Oh. Here is the bug:
We resize the LSB vector here if there are no LSB bits set. (at least I think that is what's happening here).
But in the case above where we toggle bits in the LSB vector directly, we are not checking for this case.
So, the correct solution is to remove this case here so that we always get the same representation independent of the order of operations being performed.
Can you write a test for this? (It seems that we are unlucky with the randomized tests :) )
Opened a PR to address this: #72
I didn't add a new test, but I added a way of deterministically replaying fuzz tests. I'm not sure how we'd reliably get this to fail other than to keep fuzzing until we find a failure and then hardcode the seed. Maybe by using a no-op hash function (the identity function)?
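One common way to make such fuzz tests replayable, sketched with the `rand` crate. This is an illustration of the idea only, not the actual change in #72; the environment variable name and test body are assumptions:

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

/// Use a seed from the environment when replaying a failure, otherwise pick
/// a fresh random one and print it so a failing run can be reproduced.
fn fuzz_rng() -> StdRng {
    let seed = std::env::var("FUZZ_SEED")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or_else(rand::random::<u64>);
    eprintln!("replay with FUZZ_SEED={seed}");
    StdRng::seed_from_u64(seed)
}

#[test]
fn serialization_roundtrip_fuzz() {
    let mut rng = fuzz_rng();
    // Deterministic given the seed: feed random values into the filter,
    // serialize, deserialize, and compare (filter details elided here).
    let values: Vec<u64> = (0..1_000).map(|_| rng.gen()).collect();
    let _ = values;
}
```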
- Remove serde module; put code into diff_count.rs.
- Remove bonus comments.
- Remove panic-y big-endian serialization stuff.
- Squish together lines.
What does this change?
It is useful to be able to serialize the geofilters into a compact byte representation so that they can be stored and compared
against other filters in the future.
This PR adds a few functions and tests to support writing the geofilters to, and reading geofilters from byte buffers.
Implemented only for `Diff` filters for now.

How does this work?
The implementation is pretty straightforward if you're familiar with the way the most and least significant bits are split (I wasn't at the beginning of today). Notes on that below...
We serialize the MSB array as numbers and the LSB bitmap as a bitmap (with one byte to represent the unoccupied bits in the most significant `u64`).

I used some unsafe Rust to convert between the various representations. This means this library is not safe if you're sharing the serialized representations across machines with different endiannesses. I've left this in a comment for now. I partially implemented support for big-endian platforms, but given that I won't be able to easily test this, I stashed it for now. Maybe someone will one day want a geofilter on an IBM mainframe (a tier 2 Rust target!).
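For concreteness, a minimal sketch of that kind of layout using explicit little-endian conversions (the PR itself transmutes in native byte order; the framing and field order below are illustrative, not the actual wire format):

```rust
/// Illustrative encoder: MSB entries as little-endian numbers, then the LSB
/// bitmap blocks, prefixed by one byte counting the unoccupied bits in the
/// most significant LSB block.
fn encode(msb: &[u64], lsb_blocks: &[u64], unused_bits: u8) -> Vec<u8> {
    let mut out = Vec::new();
    out.push(unused_bits);
    out.extend_from_slice(&(msb.len() as u32).to_le_bytes());
    for n in msb {
        out.extend_from_slice(&n.to_le_bytes());
    }
    for block in lsb_blocks {
        out.extend_from_slice(&block.to_le_bytes());
    }
    out
}
```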
I've added a test that throws random values into the filter, writes it into a vec, and then reads it back out again to check that nothing has changed.
Bitmap representation
The bitfield that represents the spread over the geometric distribution is encoded in two parts: the most significant bits and the least significant bits are held separately in two different representations.
But why?
As items are inserted into the filter they are much more likely to land in the buckets represented by the least significant bits due to the geometric distribution. As you walk up the bit array the sparsity increases, because it's less likely that an item will be hashed to a bucket there. As a result, it doesn't make sense to store a dense representation for the entire filter, as you'd end up with a lot of zeros in the more significant end of the bitmap.
So this crate represents the sparser part of the bitmap efficiently using a vector of numbers representing the occupied indices, and represents the denser part of the bitmap with a regular bit array.
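As a toy illustration of that split (not the crate's actual layout): given the set bit positions and a budget of n sparse entries, the top n go into the index vector, and everything below the smallest of them fits into the dense bitmap:

```rust
/// Toy split: the top `msb_count` positions become the sparse index vector,
/// everything below the smallest of them goes into a dense bitmap.
fn split(positions: &[usize], msb_count: usize) -> (Vec<usize>, Vec<bool>) {
    let mut sorted = positions.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // descending
    sorted.dedup();
    let msb: Vec<usize> = sorted.iter().take(msb_count).copied().collect();
    // The bitmap only needs to cover positions below the smallest MSB entry
    // (the second invariant mentioned in the review above).
    let limit = msb.last().copied().unwrap_or(0);
    let mut lsb = vec![false; limit];
    for &p in sorted.iter().skip(msb_count) {
        lsb[p] = true; // p < limit, since `sorted` is descending and deduped
    }
    (msb, lsb)
}
```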
Review notes
I also added some comments for some other bits as I was figuring this data structure out. I left them in this PR because they seem generally useful for someone who is not super familiar with the implementation of the MSB/LSB split. It would be nice to get some confirmation that these comments are correct, and let me know if you think they are not needed. I have been taking notes as I go along understanding this data structure, and I plan on writing a doc for non-academic CS people to be able to get a grasp on its workings, which might make these kinds of comments a bit irrelevant.
Please also let me know if there are any mistakes in my summary of the data structure in this PR description.