feat: (perf) allow spawning multiple tasks per read #2156
tafia wants to merge 2 commits into apache:main
Conversation
Scanning all files is both CPU and IO intensive. While we can control the IO parallelism via the concurrency_limit* arguments, all the work is effectively done on the same tokio task, and thus on the same CPU. This situation is one of the main reasons why iceberg-rust is much slower than pyiceberg when reading large files (my test involved a 10G file). This PR proposes splitting scans into chunks that can be spawned independently to allow CPU parallelism. In my tests (I have yet to find how to benchmark this directly in this project), reading a 10G file:
- before: 38s
- after: 16s
- pyiceberg: 15s
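The idea of chunking the scan can be sketched with plain threads. This is a simplified, hypothetical stand-in for the PR's approach (which spawns tokio tasks over a stream of FileScanTasks, not OS threads); `process_task` and `scan_parallel` are illustrative names, not part of the iceberg-rust API.

```rust
use std::thread;

// Hypothetical stand-in for executing one FileScanTask: CPU-heavy work such
// as decoding a slice of a Parquet file into Arrow record batches.
fn process_task(task_id: usize) -> usize {
    // Simulate CPU-bound decode work.
    (0..1_000usize).fold(task_id, |acc, x| acc.wrapping_add(x))
}

// Split the task list into roughly `n_workers` chunks and process each chunk
// on its own worker, mirroring the PR's idea of spawning one task per chunk
// instead of driving every scan on a single tokio task (and thus one CPU).
fn scan_parallel(tasks: Vec<usize>, n_workers: usize) -> Vec<usize> {
    let chunk_size = ((tasks.len() + n_workers - 1) / n_workers).max(1);
    let handles: Vec<_> = tasks
        .chunks(chunk_size)
        .map(|chunk| {
            let chunk = chunk.to_vec();
            thread::spawn(move || {
                chunk.into_iter().map(process_task).collect::<Vec<_>>()
            })
        })
        .collect();
    // Joining the handles in spawn order preserves the original task order,
    // so the output is identical to a sequential scan.
    handles
        .into_iter()
        .flat_map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

Because results are collected in spawn order, the output matches a sequential scan, which is the property the PR's added test checks (same output, more CPUs used).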
The CI error seems either unrelated to the PR (Python) or wrong (we need to take ownership of the iterator, not just its elements).
I'll take a look. In theory there's nothing stopping us from generating FileScanTasks that span pieces of the files now (this is what Comet does with the existing reader). I've still been running into mismatches in parallelism, though, and have been trying to get more CPU utilization out of the Iceberg scan stages of Comet jobs, even when we've properly dispatched a bunch of I/O requests. I suspect you could be onto something here. Thanks! I'll take a pass this week.
Yes, this is far from an optimal solution, but at least it is a simple move in the right direction. FYI, I've also added some row-group and column parallelism, but those changes are more complex and not ready to be merged.
blackmwk
left a comment
Thanks @tafia for this PR. Would you mind trying the DataFusion integration rather than using the Arrow reader directly? I'm reluctant to make the `to_arrow` method more complicated. If you want a high-performance local query engine, using DataFusion is the right direction.
Which issue does this PR close?
I haven't found a specific issue, but several comments refer to CPU-bound processing.
What changes are included in this PR?
This PR proposes to split scans into chunks which can be spawned independently to allow cpu parallelism.
Are these changes tested?
I have added a test to show that the change doesn't affect the output. I have yet to find a good benchmark to back my claim about the performance; any tips on how I could do that would be welcome!