Add scheduler instrumentation (timing & system resource usage) #101

Merged
AlexJones0 merged 8 commits into lowRISC:master from AlexJones0:instrument_scheduler on Feb 18, 2026

@AlexJones0 AlexJones0 commented Feb 16, 2026

This PR introduces instrumentation to DVSim's scheduler, allowing us to collect metrics about the scheduler's operation in each run if desired. This can be useful for:

  1. Profiling and performance analysis of DVSim,
  2. Optimization, and
  3. Catching DVSim performance regressions via regular instrumented runs.

This PR introduces two types of instrumentation: timing (start, end and duration) and system resource usage (RSS/VMS memory, swap, CPU utilization and user/system time). This makes the PR quite large, but the goal is to motivate the designed abstractions. If it makes it easier to review I can split out the last 2 commits with the resource instrumentation into a separate PR.

You can now give the --instrument option on the command line:

# No instrumentation
dvsim ...
# Just timing metrics
dvsim --instrument timing ...
# Just resource usage metrics
dvsim --instrument resources ...
# Both types of metrics
dvsim --instrument timing resources ...

Currently this just defaults to generating the instrumentation report in scratch/<branch>/reports/metrics.json, right next to the generated HTML reports for sim flows. In the future it might be nice to make this kind of thing more customizable.

A few more thoughts to consider (maybe for future PRs?):

  • The intention is probably to refactor the scheduler in the future, so this PR keeps the instrumentation as modularised as possible. A better pattern when refactoring might be to encapsulate the job status and spec within some scheduler job execution context, and then use the observer pattern to have the instrumentation, status printer and even the scheduler itself watch the status of dispatched jobs.
  • Instrumentation is currently managed by the scheduler (you pass it the instrumentation which it then starts and stops). There might be utility in measuring the rest of DVSim (e.g. building and deploying jobs), so maybe this should be moved slightly and more hooks added?
  • Resource usage instrumentation defaults to polling every 0.5 seconds to capture short spikes but remain relatively lightweight. It would be nice to add a way to configure this, either via a command-line argument and/or some config file loaded by DVSim.
  • Per-process resource usage is currently based on averages of system-wide samples during the job time frame, because we don't have access to per-job subprocesses. It might be possible to do this if we adequately refactored the launchers in the future, but I would expect some difficulty there because e.g. launchers with remote dispatch wouldn't affect system resource utilization too much.
  • There may be value in trying to modify the current abstractions to better support capturing time-series resource sampling data as well?
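
The observer idea from the first bullet could look roughly like the sketch below. Every name in it (`JobContext`, `JobObserver`, `StatusLog`, `set_status`) is hypothetical and not taken from DVSim; it only illustrates a job context that owns its status and notifies whoever subscribed.

```python
from dataclasses import dataclass, field
from typing import Protocol


class JobObserver(Protocol):
    """Anything that wants to hear about job status changes (hypothetical)."""

    def on_status_change(self, job_name: str, old: str, new: str) -> None: ...


@dataclass
class JobContext:
    """Hypothetical execution context bundling a job's name and status."""

    name: str
    status: str = "queued"
    _observers: list[JobObserver] = field(default_factory=list, repr=False)

    def subscribe(self, observer: JobObserver) -> None:
        self._observers.append(observer)

    def set_status(self, new_status: str) -> None:
        # Update the status first, then tell every observer about the change.
        old, self.status = self.status, new_status
        for observer in self._observers:
            observer.on_status_change(self.name, old, new_status)


class StatusLog:
    """Example observer: records every transition it is told about."""

    def __init__(self) -> None:
        self.events: list[tuple[str, str, str]] = []

    def on_status_change(self, job_name: str, old: str, new: str) -> None:
        self.events.append((job_name, old, new))


log = StatusLog()
job = JobContext("build_default")
job.subscribe(log)
job.set_status("dispatched")
job.set_status("passed")
```

With this shape, instrumentation, a status printer and the scheduler would each be just another subscriber, so none of them needs to know about the others.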

@rswarbrick rswarbrick left a comment

Some nitty comments, but I really like this!

@AlexJones0 AlexJones0 force-pushed the instrument_scheduler branch 3 times, most recently from 78938a2 to 3d6269c Compare February 16, 2026 18:48

@rswarbrick rswarbrick left a comment

Nice!

@AlexJones0 AlexJones0 force-pushed the instrument_scheduler branch 2 times, most recently from 06513e3 to 63f15ca Compare February 17, 2026 12:39

@machshev machshev left a comment

Thanks @AlexJones0! This is looking really good as a concept.

I'm not so keen on the way the instrumentation object is passed through the flow objects. It's another thing that has to be passed through all the main DVSim objects. If there is a way of avoiding this then that would be better.

I'd suggest using a global singleton for the instrumentation object which can be configured in the cli, and then a reference retrieved via a getter.
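
A minimal sketch of that suggestion, assuming a module-level instance with a setter and a getter. The names `Instrumentation`, `set_instrumentation` and `get_instrumentation` are placeholders for illustration, not DVSim's actual API.

```python
from typing import Optional


class Instrumentation:
    """Stand-in for whatever instrumentation object the CLI builds."""

    def __init__(self, kinds: list[str]) -> None:
        self.kinds = kinds


# Module-level singleton; None means instrumentation is disabled.
_instrumentation: Optional[Instrumentation] = None


def set_instrumentation(instance: Optional[Instrumentation]) -> None:
    """Configure the singleton, e.g. once from the CLI entry point."""
    global _instrumentation
    _instrumentation = instance


def get_instrumentation() -> Optional[Instrumentation]:
    """Fetch the configured instance from anywhere, without threading it
    through intermediate objects."""
    return _instrumentation


# In the CLI entry point (once):
set_instrumentation(Instrumentation(["timing"]))

# Anywhere else, for example inside the scheduler:
active = get_instrumentation()
```

The trade-off is the usual one for global state: fewer constructor parameters, at the cost of an implicit dependency that tests need to reset.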

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
Implement the base class functionality for scheduler instrumentation,
where new instrumentation options can be added by subclassing the
`SchedulerInstrumentation` and registering the class with the
`InstrumentationFactory` registry. This allows DVSim to potentially be
extended to add custom scheduler instrumentation logic.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
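
As a rough illustration of the subclass-and-register pattern this commit describes: the class names `SchedulerInstrumentation` and `InstrumentationFactory` come from the commit message, but every method name and signature below (`start`, `stop`, `build_report`, `register`, `create`) is an assumption made for the sketch, not the merged API.

```python
class SchedulerInstrumentation:
    """Base class. Method names here are assumptions, not DVSim's real API."""

    def start(self) -> None:
        pass

    def stop(self) -> None:
        pass

    def build_report(self) -> dict:
        return {}


class InstrumentationFactory:
    """Maps a short name to an instrumentation class (assumed interface)."""

    _registry: dict[str, type[SchedulerInstrumentation]] = {}

    @classmethod
    def register(cls, name: str):
        # Used as a class decorator so registration sits next to the class.
        def decorator(klass: type[SchedulerInstrumentation]):
            cls._registry[name] = klass
            return klass

        return decorator

    @classmethod
    def create(cls, names: list[str]) -> list[SchedulerInstrumentation]:
        unknown = [n for n in names if n not in cls._registry]
        if unknown:
            raise ValueError(f"Unknown instrumentation: {unknown}")
        return [cls._registry[n]() for n in names]


@InstrumentationFactory.register("noop")
class NoopInstrumentation(SchedulerInstrumentation):
    """Custom instrumentation that inherits the do-nothing defaults."""


created = InstrumentationFactory.create(["noop"])
```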
Implement timing instrumentation for the scheduler and register it with
the instrumentation registry/factory. This enables instrumentation for
reporting when the scheduler itself started/ended, as well as when each
job that was dispatched started/ended.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
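
One way to picture the timing data (start, end and duration for the scheduler and for each job) is the sketch below. `Span`, `TimingRecorder` and the `on_*` hook names are hypothetical; the clock is injectable only so that the example is deterministic.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Span:
    """Start/end timestamps for one thing being timed."""

    start: Optional[float] = None
    end: Optional[float] = None

    @property
    def duration(self) -> Optional[float]:
        # Duration is only defined once both ends have been recorded.
        if self.start is None or self.end is None:
            return None
        return self.end - self.start


class TimingRecorder:
    """Hypothetical recorder; hook names are assumptions, not DVSim's API."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self.scheduler = Span()
        self.jobs: dict[str, Span] = {}

    def on_scheduler_start(self) -> None:
        self.scheduler.start = self._clock()

    def on_scheduler_end(self) -> None:
        self.scheduler.end = self._clock()

    def on_job_start(self, job: str) -> None:
        self.jobs.setdefault(job, Span()).start = self._clock()

    def on_job_end(self, job: str) -> None:
        self.jobs.setdefault(job, Span()).end = self._clock()


# Deterministic fake clock so the example does not depend on real time.
ticks = iter([0.0, 1.0, 3.5, 4.0])
recorder = TimingRecorder(clock=lambda: next(ticks))
recorder.on_scheduler_start()   # t = 0.0
recorder.on_job_start("sim_a")  # t = 1.0
recorder.on_job_end("sim_a")    # t = 3.5
recorder.on_scheduler_end()     # t = 4.0
```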
Hook the instrumentation implementation into the scheduler. The
scheduler now takes some instrumentation object as input, starts and
stops it around the scheduler's own lifetime, and notifies it of
certain events (scheduler start, stop, job status change).

On stopping instrumentation, the scheduler will also dump the
generated metrics as a JSON file at a report path, if specified. In the
future we may wish to modify this abstraction so that the scheduler
itself does not handle the report writing (and instead either have some
parent abstraction handle it, or the instrumentation itself), but the
current scheduler architecture (without any significant refactoring)
means that the status is not encapsulated with the job, and thus must be
injected separately into the report by the scheduler.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
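
The flow described in this commit (start, notify on status changes, stop, then have the scheduler inject job status and write JSON) might be sketched as follows. `EventCounter`, `run_jobs`, the hook names and the report layout are all invented for illustration and do not reflect the real scheduler or the real `metrics.json` schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


class EventCounter:
    """Hypothetical instrumentation: counts status changes per job."""

    def __init__(self) -> None:
        self.running = False
        self.changes: dict[str, int] = {}

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False

    def on_job_status_change(self, job: str, status: str) -> None:
        self.changes[job] = self.changes.get(job, 0) + 1

    def report(self) -> dict:
        return {"jobs": {j: {"status_changes": n} for j, n in self.changes.items()}}


def run_jobs(
    jobs: dict[str, Callable[[], bool]],
    instrumentation: EventCounter,
    report_path: Optional[Path] = None,
) -> dict:
    """Toy scheduler: runs jobs in order and wraps the run in instrumentation."""
    statuses: dict[str, str] = {}
    instrumentation.start()
    try:
        for name, fn in jobs.items():
            statuses[name] = "dispatched"
            instrumentation.on_job_status_change(name, "dispatched")
            statuses[name] = "passed" if fn() else "failed"
            instrumentation.on_job_status_change(name, statuses[name])
    finally:
        instrumentation.stop()

    report = instrumentation.report()
    # Status is tracked by the scheduler rather than the job, so the
    # scheduler injects it into the report before writing.
    for name, status in statuses.items():
        report["jobs"].setdefault(name, {})["status"] = status
    if report_path is not None:
        report_path.parent.mkdir(parents=True, exist_ok=True)
        report_path.write_text(json.dumps(report, indent=2))
    return report


out = Path(tempfile.mkdtemp()) / "reports" / "metrics.json"
result = run_jobs({"a": lambda: True, "b": lambda: False}, EventCounter(), out)
```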
Adds the `--instrument` option to the main DVSim CLI which can be used
to specify anywhere from zero (default) to multiple instrumentations to
use with DVSim's scheduler. This lets users of DVSim customize the level
of instrumentation that they wish to use and select which information
is important to measure, allowing those who need data to capture more
data and those who need performance to disable all instrumentation by
default.

This instrumentation is constructed on the command line and passed
through the flow objects to the scheduler, where the instrumentation
report (where one exists) is currently written as a single metrics.json
file in the `reports` directory, next to the existing generated HTML
reports (where these exist).

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
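
An option accepting zero or more instrumentation names can be expressed with `argparse` roughly as below. This is a sketch under assumptions: the `timing` and `resources` choices are taken from the PR description, while the parser name, help text and `metavar` are invented, and the real DVSim CLI has many other arguments that are not modelled here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "dvsim-sketch" is a placeholder program name, not the real CLI.
    parser = argparse.ArgumentParser(prog="dvsim-sketch")
    parser.add_argument(
        "--instrument",
        nargs="*",                       # zero or more values after the flag
        default=[],                      # flag omitted: no instrumentation
        choices=["timing", "resources"],
        metavar="KIND",
        help="Scheduler instrumentation to enable (default: none).",
    )
    return parser


args_none = build_parser().parse_args([])
args_timing = build_parser().parse_args(["--instrument", "timing"])
args_both = build_parser().parse_args(["--instrument", "timing", "resources"])
```

Because `nargs="*"` consumes every following value, a parser that also takes positional arguments generally needs them placed before the flag or separated from it.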
For instrumentation of the scheduler, this allows us to get more
detailed information about system compute resource usage, including e.g.
memory (RSS, VMS, swap) and per-core CPU utilization.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
Adds an additional instrumentation mechanism to DVSim's scheduler and
registers it with the instrumentation factory/registry to allow
users to optionally enable instrumentation of system resources.

This captures a variety of useful system resources including memory
utilization (RSS, VMS, swap) and CPU utilisation (percentage and time,
plus per-core percentage) for both the system as a whole and
specifically for the DVSim / scheduler process overhead.

Since there is no "per-dispatched-job" process that is transparently
available to the scheduler, the resource metrics currently reported for
each job are instead an aggregate of the system resource metric samples
taken over the time period for which that job was running.

Currently, based on existing tools and polling frequencies, this is set
to poll for system resources every 0.5 seconds (2x a second). In the
future it would be nice to consider how best to make this customizable
(perhaps through an additional CLI argument, or through a config file
option?)

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
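
The per-job aggregation described above (system-wide samples taken at a fixed interval, then summarised over each job's time window) can be sketched like this. The `Sample` fields, the `aggregate_for_job` helper and the choice of mean CPU and peak memory are assumptions made for illustration; the merged code may aggregate differently.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Sample:
    """One system-wide reading (hypothetical fields)."""

    timestamp: float    # seconds since the scheduler started
    cpu_percent: float  # system-wide CPU utilization
    memory_bytes: int   # system-wide memory in use


def aggregate_for_job(
    samples: list[Sample], start: float, end: float
) -> Optional[dict]:
    """Summarise the samples that fall inside a job's [start, end] window."""
    window = [s for s in samples if start <= s.timestamp <= end]
    if not window:
        # A job shorter than the polling interval may contain no samples.
        return None
    return {
        "num_samples": len(window),
        "avg_cpu_percent": mean(s.cpu_percent for s in window),
        "max_memory_bytes": max(s.memory_bytes for s in window),
    }


# Samples every 0.5 s, matching the default polling interval in this PR.
samples = [
    Sample(0.0, 10.0, 100),
    Sample(0.5, 20.0, 150),
    Sample(1.0, 30.0, 300),
    Sample(1.5, 40.0, 250),
    Sample(2.0, 50.0, 120),
]
job_metrics = aggregate_for_job(samples, start=0.5, end=1.5)
short_job_metrics = aggregate_for_job(samples, start=0.6, end=0.9)
```

Jobs that overlap in time share the same samples, so in a sketch like this the figures describe the system while the job ran rather than the job's own consumption.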
@AlexJones0 AlexJones0 added this pull request to the merge queue Feb 18, 2026
Merged via the queue into lowRISC:master with commit e12f7c9 Feb 18, 2026
6 checks passed

3 participants