
pydask implementation#67

Open
sophiemiddleton wants to merge 14 commits into Mu2e:main from sophiemiddleton:dask-dev-v2

Conversation

@sophiemiddleton
Collaborator

This pull request introduces Dask-based parallel file processing to the pyutils analysis framework, enabling scalable multi-file analysis and improved performance. The main changes are the addition of a new DaskProcessor class for parallel data processing, a comprehensive example script demonstrating its usage, and a version bump to reflect these new capabilities.

Dask integration and parallel processing:

  • Added new pyutils/pydask.py module providing the DaskProcessor class, which mirrors the API of Processor but uses Dask for parallel file processing. This allows users to process multiple files concurrently, either locally or on a distributed Dask cluster, with progress tracking and error resilience. ([pyutils/pydask.py R1-R172](https://github.com/Mu2e/pyutils/pull/67/files#diff-b250e6c6661378cbc729a2da04b46f2d294e70508e4a275b7e0fd9cbcae9a15fR1-R172))
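The "process many files concurrently, with error resilience" pattern described above can be sketched roughly as follows. This is not the PR's DaskProcessor code: it uses the standard library's concurrent.futures as a local stand-in for Dask workers, and process_file and its return shape are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_file(file_name):
    """Hypothetical per-file worker.

    In the real module this would open the file and extract the selected
    branches; here we just echo the file name to illustrate the pattern.
    """
    return {"file": file_name, "n_events": 0}

def process_files(file_list, max_workers=4):
    """Process files concurrently, skipping files that fail (error resilience)."""
    results, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per file; Dask would instead scatter these
        # tasks across local or remote workers.
        futures = {pool.submit(process_file, f): f for f in file_list}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception:
                failed.append(futures[future])  # record the bad file, keep going
    return results, failed
```

With a real Dask cluster the same shape applies, but tasks can run on remote workers and a progress bar can track the futures as they complete.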

Documentation and examples:

  • Added examples/scripts/pyutils_basics_dask.py script demonstrating how to use DaskProcessor for multi-file analysis, including selection cuts, data inspection, vector operations, and plotting. The script highlights the advantages of Dask-based processing and provides a step-by-step guide for users. ([examples/scripts/pyutils_basics_dask.py R1-R300](https://github.com/Mu2e/pyutils/pull/67/files#diff-d8098ac33b1267668f9bff146f303a655c199c355841082bd7130c69ee4a3131R1-R300))
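The "selection cuts" and "vector operations" steps the example walks through amount to computing a derived quantity per event and filtering on it. A minimal stand-alone sketch, with made-up field names and a made-up threshold (the real script works on branches read from the input files):

```python
import math

# Hypothetical per-event records; the field names are illustrative only.
events = [
    {"px": 30.0, "py": 40.0, "pz": 0.0},
    {"px": 3.0, "py": 4.0, "pz": 0.0},
]

def momentum(evt):
    """Vector operation: magnitude of the 3-momentum."""
    return math.sqrt(evt["px"] ** 2 + evt["py"] ** 2 + evt["pz"] ** 2)

# Selection cut: keep events above an illustrative momentum threshold.
selected = [evt for evt in events if momentum(evt) > 10.0]
```

In the Dask version this cut runs independently on each file's data, so the selection parallelizes for free across the file list.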

Version update:

  • Updated setup.py to bump the package version from 1.4.0 to 1.8.0, reflecting the addition of Dask support and new features. ([setup.py L5-R5](https://github.com/Mu2e/pyutils/pull/67/files#diff-60f61ab7a8d1910d86d9fda2261620314edcae5894d5aaa236b821c7256badd7L5-R5))

@sophiemiddleton
Collaborator Author

solves #66 and #65

@sophiemiddleton
Collaborator Author

Just realized I haven't updated the README.md - I will do this now


@sam-grant sam-grant left a comment

This is really great. I'm excited to see it working and I'm glad you were able to reuse pieces of the existing code.

# Create a sample file list for demonstration
logger.log("\nCreating sample file list for demonstration...", "info")

# Use the MDS3a.txt file list provided in the repository

Collaborator

Oh right this is a demo

Collaborator Author

Sure, I actually realized that I hadn't got that part complete. I'm working on it now. Should be done in a few minutes.

Collaborator Author

I want to make notebooks, but I couldn't get the notebook to see my test pyenv, so I had to settle for this. Once it's merged I will make some nicer interactive notebooks.

setup.py Outdated

  setup(
      name="pyutils",
-     version="1.4.0",
+     version="1.8.0",
Collaborator

Thanks

Collaborator

Ah, I guess we should actually bump to 1.9.0 or even 2.0.0 (major change), since we're on 1.8.0 with the current release.

if sample_files and file_list_path:
    logger.log("Using DaskProcessor with multi-file processing", "info")

    data = processor.process_data(
Collaborator

Cool that we can use a similar interface

Collaborator Author

Yes, I know you wanted to keep things as is. I was able to benchmark the two relative to one another; I'll show some stats on Wednesday.

client: Optional[Client] = None
created_client = False
try:
    if scheduler_address:
Collaborator

Nice, so we just need to know the address and can connect to any scheduler on the network?

Collaborator Author

Yes, exactly. I think we need to wait for the EAF team on the centralized scheduler. For now we can work with a local scheduler/cluster. In that respect it doesn't really have too much advantage over the current pyprocess.py. But I think that pydask will be more future-proof, and once we have the resources at the EAF it will help us a lot!
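The address-or-local fallback under discussion can be sketched like this. This is a hedged sketch rather than the PR's actual code, using the public dask.distributed API; the import is deferred so the function can be defined even where Dask is not installed, and the example address is hypothetical.

```python
def make_client(scheduler_address=None, n_workers=2):
    """Return (client, created_cluster).

    Connect to a remote scheduler if an address is given, otherwise spin up
    a LocalCluster on this machine and connect to that.
    """
    # Deferred import: the sketch loads without Dask installed.
    from dask.distributed import Client, LocalCluster

    if scheduler_address:
        # e.g. "tcp://scheduler.example:8786" for a shared facility scheduler
        return Client(scheduler_address), False
    cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1)
    return Client(cluster), True
```

The boolean flag records whether we created the cluster ourselves, so cleanup code knows whether it should shut the cluster down or leave a shared scheduler running.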

Collaborator

Awesome

@sophiemiddleton
Collaborator Author

I have implemented your suggestions



3 participants