Skip to content

Comments

perf(reader): Optionally pass data file size to reader through FileScanTasks to avoid stat() call#2175

Open
mbutrovich wants to merge 4 commits intoapache:mainfrom
mbutrovich:file_size_passthrough
Open

perf(reader): Optionally pass data file size to reader through FileScanTasks to avoid stat() call#2175
mbutrovich wants to merge 4 commits intoapache:mainfrom
mbutrovich:file_size_passthrough

Conversation

@mbutrovich
Copy link
Collaborator

@mbutrovich mbutrovich commented Feb 24, 2026

Which issue does this PR close?

Screenshot 2026-02-24 at 2 03 07 PM

What changes are included in this PR?

  • Pass through data file size to FileScanTask. Iceberg Java does this by wrapping a reference to a DataFile in its FileScanTask. In this case we're just cherry-picking some fields until we decide we need more.
  • Remove redundant data file creation code in tests.

Are these changes tested?

Existing tests.

@mbutrovich mbutrovich self-assigned this Feb 24, 2026
@mbutrovich mbutrovich changed the title perf(reader): Optionally pass data file size to reader through FileScanTasks to avoid stat() call perf(reader): Optionally pass data file size to reader through FileScanTasks to avoid stat() call Feb 24, 2026
Copy link
Contributor

@blackmwk blackmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mbutrovich for this pr!

let delete_filter_rx =
delete_file_loader.load_deletes(&task.deletes, Arc::clone(&task.schema));

let file_size = if task.file_size_in_bytes > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what case the file_size_in_bytes is not larger than 0?

) -> Vec<Option<String>> {
let tasks = Box::pin(futures::stream::iter(
vec![Ok(FileScanTask {
file_size_in_bytes: 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though they are tests, I still think it's better to have a positive number .

self.file_io.clone(),
false,
None,
None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should still fetch file size here.


let file_scan_tasks = vec![
FileScanTask {
file_size_in_bytes: 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's odd to see 0 size file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants