[Draft, Not for review] Sync changes from CDB_DiskANN repo #748
Conversation
- Refactored recall utilities in diskann-benchmark
- Updated tokio utilities
- Added attribute and format parser improvements in label-filter
- Updated ground_truth utilities in diskann-tools
Pull request overview
Syncs changes aimed at integrating label-filtered search into the benchmarking/tooling flow, primarily by enhancing ground-truth label evaluation and aligning benchmark recall logic with diskann-benchmark-core.
Changes:
- Extend `diskann-tools` ground-truth label filtering to handle array-valued label structures and add diagnostics/logging.
- Refactor `diskann-benchmark` recall utilities to delegate to `diskann-benchmark-core`.
- Adjust `diskann-label-filter` JSON flattening and minor formatting/import changes; update `diskann-tools` dependencies.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| diskann-tools/src/utils/ground_truth.rs | Adds array-expansion support for label evaluation and extra tracing diagnostics during bitmap + ground-truth generation. |
| diskann-tools/Cargo.toml | Updates dependencies (adds serde_json, removes some unused deps) and adjusts dev-deps/features. |
| diskann-label-filter/src/utils/flatten_utils.rs | Modifies how arrays are flattened into pointer paths. |
| diskann-label-filter/src/parser/format.rs | Minor formatting-only change. |
| diskann-label-filter/src/attribute.rs | Adds an import (currently unused). |
| diskann-benchmark/src/utils/tokio.rs | Tweaks doc comment and removes unit tests. |
| diskann-benchmark/src/utils/recall.rs | Replaces local recall/AP implementations with diskann-benchmark-core wrappers/types. |
[package]
name = "diskann-tools"
edition = "2021"
version.workspace = true
authors.workspace = true
description.workspace = true
documentation.workspace = true
license.workspace = true
This crate no longer declares license.workspace = true, while the rest of the workspace crates do. This makes the crate metadata inconsistent and can break workspace-wide tooling (e.g., publish checks) that expects license metadata to be inherited from the workspace. Re-add license.workspace = true under [package] for consistency.
    results: &dyn benchmark_core::recall::Rows<T>,
    groundtruth: &dyn benchmark_core::recall::Rows<T>,
    recall_k: &[usize],
    recall_n: &[usize],
compute_multiple_recalls changed from taking StridedView + an enhanced_metrics flag to taking &dyn benchmark_core::recall::Rows<T> and no longer accepts enhanced_metrics. There are existing call sites in diskann-benchmark/src/backend/exhaustive/* that still call compute_multiple_recalls(..., false) and pass StridedView conversions, which will not compile. Either update those call sites in this PR, or provide a backwards-compatible wrapper/overload that preserves the old signature and adapts into benchmark_core::recall::Rows.
Suggested change:
    recall_n: &[usize],
    _enhanced_metrics: bool,
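If a compatibility shim is preferred over touching the call sites, a minimal sketch is below; the `StridedView` type, the `RecallMetrics` return type, and the assumption that `StridedView` implements `benchmark_core::recall::Rows<T>` are all unverified placeholders for whatever the crate actually exposes.

```rust
// Hypothetical compatibility wrapper, not the crate's actual API. It keeps the
// old StridedView-based signature and flag, then delegates to the new
// core-backed compute_multiple_recalls.
pub(crate) fn compute_multiple_recalls_compat<T>(
    results: &StridedView<T>,
    groundtruth: &StridedView<T>,
    recall_k: &[usize],
    recall_n: &[usize],
    _enhanced_metrics: bool, // retained for source compatibility, currently ignored
) -> RecallMetrics
where
    StridedView<T>: benchmark_core::recall::Rows<T>,
{
    // &StridedView<T> coerces to &dyn Rows<T> thanks to the bound above.
    compute_multiple_recalls(results, groundtruth, recall_k, recall_n)
}
```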
            average_precision: m.average_precision,
        }
    }
}
This module removed APMetrics and compute_average_precision, but there are existing call sites (e.g. diskann-benchmark/src/backend/index/range_search.rs and .../result.rs) that still reference utils::recall::compute_average_precision and utils::recall::APMetrics. As-is, the benchmark crate will not compile. Either restore these items as thin wrappers/type aliases around benchmark_core::recall equivalents, or update the call sites in this PR to use the new AveragePrecisionMetrics API.
Suggested change:
}

/// Backwards-compatible alias for the core average precision metrics.
/// This keeps existing `utils::recall::APMetrics` call sites compiling.
pub(crate) type APMetrics = benchmark_core::recall::AveragePrecisionMetrics;

/// Re-export the core `compute_average_precision` function so existing
/// `utils::recall::compute_average_precision` call sites continue to work.
pub(crate) use benchmark_core::recall::compute_average_precision;
// Print per-query match counts
for (i, bitmap) in query_bitmaps.iter().enumerate() {
    if i < 10 || bitmap.is_empty() {
        tracing::debug!(
            "Query {}: {} base vectors matched the filter",
            i,
            bitmap.len()
        );
    }
}
The per-query debug logging loop will emit a debug! line for every query with an empty bitmap (if i < 10 || bitmap.is_empty()), which can explode log volume when filters are selective or misconfigured (especially if debug logging is enabled). Consider limiting this to the first N queries only, or logging aggregate statistics plus a small sample of empty/non-empty queries.
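One way to bound the output is to log aggregate statistics once and cap the per-query lines at a fixed count; a sketch based on the diff above (assuming `query_bitmaps` is a slice of bitmaps exposing `is_empty()` and `len()`):

```rust
// Sketch: summarize empty bitmaps in one line, then log only the first 10 queries.
let empty_count = query_bitmaps.iter().filter(|b| b.is_empty()).count();
tracing::debug!(
    "{} of {} queries matched no base vectors",
    empty_count,
    query_bitmaps.len()
);
for (i, bitmap) in query_bitmaps.iter().take(10).enumerate() {
    tracing::debug!("Query {}: {} base vectors matched the filter", i, bitmap.len());
}
```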
/// Expands a JSON object with array-valued fields into multiple objects with scalar values.
/// For example: {"country": ["AU", "NZ"], "year": 2007}
/// becomes: [{"country": "AU", "year": 2007}, {"country": "NZ", "year": 2007}]
///
/// If multiple fields have arrays, all combinations are generated.
fn expand_array_fields(value: &Value) -> Vec<Value> {
    match value {
        Value::Object(map) => {
            // Start with a single empty object
            let mut results: Vec<Map<String, Value>> = vec![Map::new()];

            for (key, val) in map.iter() {
                if let Value::Array(arr) = val {
                    // Expand: for each existing result, create copies for each array element
                    let mut new_results: Vec<Map<String, Value>> = Vec::new();
                    for existing in results.iter() {
                        for item in arr.iter() {
                            let mut new_map: Map<String, Value> = existing.clone();
                            new_map.insert(key.clone(), item.clone());
                            new_results.push(new_map);
                        }
                    }
                    // If array is empty, keep existing results without this key
                    if !arr.is_empty() {
                        results = new_results;
                    }
                } else {
                    // Non-array field: add to all existing results
                    for existing in results.iter_mut() {
                        existing.insert(key.clone(), val.clone());
                    }
                }
            }

            results.into_iter().map(Value::Object).collect()
        }
        // If not an object, return as-is
        _ => vec![value.clone()],
    }
}

/// Evaluates a query expression against a label, expanding array fields first.
/// Returns true if any expanded variant matches the query.
fn eval_query_with_array_expansion(query_expr: &ASTExpr, label: &Value) -> bool {
    let expanded = expand_array_fields(label);
    expanded.iter().any(|item| eval_query_expr(query_expr, item))
}
expand_array_fields clones the full JSON object and generates a Cartesian product of array-valued fields, and it's invoked inside the inner loop over base_labels for every query. This can become extremely expensive (exponential in the number/size of array fields, multiplied by num_queries * num_base_labels). Consider avoiding full expansion/allocation by evaluating the predicate directly against arrays (e.g., treat equality as "any element matches" for array-valued fields), or precomputing any needed expansions once per base label rather than per query evaluation.
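For the "any element matches" alternative, a minimal self-contained sketch with serde_json is shown below; the helper name and the idea of calling it per field inside the evaluator are assumptions, not the crate's existing API.

```rust
use serde_json::Value;

// Hypothetical helper: equality against an array-valued field succeeds when any
// element matches, avoiding the Cartesian-product expansion and per-query cloning.
fn field_matches(field_value: &Value, expected: &Value) -> bool {
    match field_value {
        Value::Array(items) => items.iter().any(|item| item == expected),
        other => other == expected,
    }
}
```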
use std::fmt::Display;
use std::hash::{Hash, Hasher};
use std::io::Write;
use std::io::Write; is unused in this module (no Write methods are called; the write_u8/write_i64/etc calls are on std::hash::Hasher). This will trigger unused_imports warnings under typical lint settings; remove the import.
Suggested change: remove the `use std::io::Write;` line.
@@ -18,21 +18,3 @@ pub(crate) fn block_on<F: std::future::Future>(future: F) -> F::Output {
        .expect("current thread runtime initialization failed")
        .block_on(future)
}
The runtime helper previously had a unit test validating runtime(num_threads) creates the expected number of workers, but the entire test module was removed. If the old test was removed due to tokio runtime metrics being unstable/unsupported in this build, consider replacing it with a stable assertion (or moving coverage to an integration test) so runtime()/block_on() behavior remains exercised by CI.
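If the removal was motivated by unstable runtime metrics, a stable smoke test could still exercise block_on; a sketch assuming the signature shown in the hunk above:

```rust
#[cfg(test)]
mod tests {
    use super::block_on;

    // Sketch: assert that block_on drives a simple future to completion,
    // without touching tokio runtime metrics.
    #[test]
    fn block_on_runs_future_to_completion() {
        let value = block_on(async { 21 * 2 });
        assert_eq!(value, 42);
    }
}
```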
for (query_idx, npq) in ground_truth.iter().enumerate() {
    let neighbors: Vec<_> = npq.iter().collect();
    let neighbor_count = neighbors.len();

    if query_idx < 10 {
        // Print top K IDs and distances for first 10 queries
        let top_ids: Vec<u32> = neighbors.iter().take(10).map(|n| n.id).collect();
        let top_dists: Vec<f32> = neighbors.iter().take(10).map(|n| n.distance).collect();
        tracing::debug!(
The added ground-truth debug section allocates and clones neighbor data for every query (let neighbors: Vec<_> = npq.iter().collect();) even though detailed logging is only done for the first 10 queries. For large benchmarks this adds avoidable overhead. Prefer using npq.iter().count()/npq.len() for counts and only collecting the top-N neighbors for the small set of queries you actually log.
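A possible shape for that change, assuming `npq.iter()` yields neighbor records with `id` and `distance` fields as in the diff (the exact types are assumptions):

```rust
for (query_idx, npq) in ground_truth.iter().enumerate() {
    // Count without materializing the whole neighbor list.
    let neighbor_count = npq.iter().count();

    if query_idx < 10 {
        // Only collect the top 10 neighbors for the queries that are actually logged.
        let top: Vec<(u32, f32)> = npq.iter().take(10).map(|n| (n.id, n.distance)).collect();
        tracing::debug!(
            "Query {}: {} neighbors, top (id, distance): {:?}",
            query_idx,
            neighbor_count,
            top
        );
    }
}
```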
Value::Array(arr) => {
    for (i, item) in arr.iter().enumerate() {
        flatten_recursive(item, stack.push(&i, separator), out, separator);
        flatten_recursive(item, stack.push(&String::from(""), separator), out, separator);
Array flattening is now pushing an empty string segment for every array element, which drops the index from the generated path (e.g., "/tags/0" becomes something like "/tags/") and can also create duplicate keys. This breaks the JSON Pointer semantics documented in this file and will fail the existing tests that assert "/tags/0" and "/tags/1". Use the numeric index (i) when building JSON Pointer paths, or implement the existing include_array_indices config flag (currently unused) to conditionally include the index.
Suggested change:
        let index = i.to_string();
        flatten_recursive(item, stack.push(&index, separator), out, separator);
Integrating filter search with diskann-benchmark.