
pydask implementation#67

Open
sophiemiddleton wants to merge 14 commits into Mu2e:main from sophiemiddleton:dask-dev-v2

Conversation

@sophiemiddleton
Collaborator

This pull request introduces Dask-based parallel file processing to the pyutils analysis framework, enabling scalable multi-file analysis and improved performance. The main changes are the addition of a new DaskProcessor class for parallel data processing, a comprehensive example script demonstrating its usage, and a version bump to reflect these new capabilities.

Dask integration and parallel processing:

  • Added new pyutils/pydask.py module providing the DaskProcessor class, which mirrors the API of Processor but uses Dask for parallel file processing. This allows users to process multiple files concurrently, either locally or on a distributed Dask cluster, with progress tracking and error resilience. ([pyutils/pydask.py R1-R172](https://github.com/Mu2e/pyutils/pull/67/files#diff-b250e6c6661378cbc729a2da04b46f2d294e70508e4a275b7e0fd9cbcae9a15fR1-R172))
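The "process many files concurrently, with error resilience" pattern described above can be sketched roughly as follows. This is not the PR's DaskProcessor code: it uses the standard library's concurrent.futures as a local stand-in for Dask workers, and process_file and its return shape are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_file(file_name):
    """Hypothetical per-file worker.

    In the real module this would open the file and extract the selected
    branches; here we just echo the file name to illustrate the pattern.
    """
    return {"file": file_name, "n_events": 0}

def process_files(file_list, max_workers=4):
    """Process files concurrently, skipping files that fail (error resilience)."""
    results, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per file; Dask would instead scatter these
        # tasks across local or remote workers.
        futures = {pool.submit(process_file, f): f for f in file_list}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception:
                failed.append(futures[future])  # record the bad file, keep going
    return results, failed
```

With a real Dask cluster the same shape applies, but tasks can run on remote workers and a progress bar can track the futures as they complete.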

Documentation and examples:

  • Added examples/scripts/pyutils_basics_dask.py script demonstrating how to use DaskProcessor for multi-file analysis, including selection cuts, data inspection, vector operations, and plotting. The script highlights the advantages of Dask-based processing and provides a step-by-step guide for users. ([examples/scripts/pyutils_basics_dask.py R1-R300](https://github.com/Mu2e/pyutils/pull/67/files#diff-d8098ac33b1267668f9bff146f303a655c199c355841082bd7130c69ee4a3131R1-R300))
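The "selection cuts" and "vector operations" steps the example walks through amount to computing a derived quantity per event and filtering on it. A minimal stand-alone sketch, with made-up field names and a made-up threshold (the real script works on branches read from the input files):

```python
import math

# Hypothetical per-event records; the field names are illustrative only.
events = [
    {"px": 30.0, "py": 40.0, "pz": 0.0},
    {"px": 3.0, "py": 4.0, "pz": 0.0},
]

def momentum(evt):
    """Vector operation: magnitude of the 3-momentum."""
    return math.sqrt(evt["px"] ** 2 + evt["py"] ** 2 + evt["pz"] ** 2)

# Selection cut: keep events above an illustrative momentum threshold.
selected = [evt for evt in events if momentum(evt) > 10.0]
```

In the Dask version this cut runs independently on each file's data, so the selection parallelizes for free across the file list.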

Version update:

  • Updated setup.py to bump the package version from 1.4.0 to 1.8.0, reflecting the addition of Dask support and new features. ([setup.py L5-R5](https://github.com/Mu2e/pyutils/pull/67/files#diff-60f61ab7a8d1910d86d9fda2261620314edcae5894d5aaa236b821c7256badd7L5-R5))

@sophiemiddleton
Collaborator Author

solves #66 and #65

@sophiemiddleton
Collaborator Author

Just realized I haven't updated the README.md - I will do this now


@sam-grant sam-grant left a comment

This is really great. I'm excited to see it working and I'm glad you were able to reuse pieces of the existing code.

# Create a sample file list for demonstration
logger.log("\nCreating sample file list for demonstration...", "info")

# Use the MDS3a.txt file list provided in the repository

Collaborator

Oh right this is a demo

Collaborator Author

Sure, I actually realized that I hadn't got that part complete. I'm working on it now. Should be done in a few minutes.

Collaborator Author

I want to make notebooks, but I couldn't get the notebook to see my test pyenv, so I had to settle for this. Once it's merged I will make some nicer interactive notebooks.

setup.py Outdated

  setup(
      name="pyutils",
-     version="1.4.0",
+     version="1.8.0",
Collaborator

Thanks

Collaborator

Ah, I guess we should actually bump to 1.9.0 or even 2.0.0 (major change), since we're on 1.8.0 with the current release.

if sample_files and file_list_path:
    logger.log("Using DaskProcessor with multi-file processing", "info")

    data = processor.process_data(
Collaborator

Cool that we can use a similar interface

Collaborator Author

Yes, I know you wanted to keep things as is. I was able to benchmark the two relative to one another; I'll show some stats on Wednesday.

client: Optional[Client] = None
created_client = False
try:
    if scheduler_address:
Collaborator

Nice, so we just need to know the address and can connect to any scheduler on the network?

Collaborator Author

Yes, exactly. I think we need to wait for the EAF team on the centralized scheduler. For now we can work with a local scheduler/cluster. In that respect it doesn't really have too much advantage over the current pyprocess.py. But I think that pydask will be more future-proof, and once we have the resources at the EAF it will help us a lot!
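The address-or-local fallback under discussion can be sketched like this. This is a hedged sketch rather than the PR's actual code, using the public dask.distributed API; the import is deferred so the function can be defined even where Dask is not installed, and the example address is hypothetical.

```python
def make_client(scheduler_address=None, n_workers=2):
    """Return (client, created_cluster).

    Connect to a remote scheduler if an address is given, otherwise spin up
    a LocalCluster on this machine and connect to that.
    """
    # Deferred import: the sketch loads without Dask installed.
    from dask.distributed import Client, LocalCluster

    if scheduler_address:
        # e.g. "tcp://scheduler.example:8786" for a shared facility scheduler
        return Client(scheduler_address), False
    cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1)
    return Client(cluster), True
```

The boolean flag records whether we created the cluster ourselves, so cleanup code knows whether it should shut the cluster down or leave a shared scheduler running.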

Collaborator

Awesome

@sophiemiddleton
Collaborator Author

I have implemented your suggestions



3 participants