Skip to content

fix: Harden metrics and event subsystem under stress#720

Open
nikooo777 wants to merge 31 commits intoimpr/gunfrom
fix/neo-logging
Open

fix: Harden metrics and event subsystem under stress#720
nikooo777 wants to merge 31 commits intoimpr/gunfrom
fix/neo-logging

Conversation

@nikooo777
Copy link
Collaborator

@nikooo777 nikooo777 commented Mar 2, 2026

Actions

  • Update production config to use "only-ids": true option in remote stores.

Tests

rebar3 eunit --module dev_node_process still fails on 2 tests, but this is happening in edge as well.

Summary

  • Fix stale PID caching in event server causing frozen metrics in long-lived processes
  • Add timeout to cpu_sup:avg5 scrape to prevent indefinite blocking
  • Use selective ETS query for /events endpoint instead of full table scan
  • Remove unnecessary spawn-per-request in metric recording paths
  • Guard all unprotected Prometheus writes that could crash critical server processes
  • Cache Prometheus startup failure with 60s retry backoff to avoid hot-path blocking
  • Bound event server Prometheus wait to ~30s with lazy counter redeclaration

Commits

  • 3fc9e78 fix: Harden event server against crash and stale PID caching
  • e872f1f fix: Add timeout to cpu_sup:avg5 in metrics collector
  • 6e1a978 fix: Use selective ETS query for event counters
  • da2c207 fix: Remove unnecessary spawn in metric recording paths
  • 1c25e27 fix: Guard unprotected Prometheus writes in hot paths
  • c0b8005 fix: Cache Prometheus startup failure with retry backoff
  • 535c2c6 fix: Bound event server Prometheus wait and handle late startup

@speeddragon speeddragon changed the base branch from neo/edge to impr/gun March 5, 2026 14:12
@speeddragon
Copy link
Collaborator

hb_store_gateway: remote_hyperbeam_node_ans104_test test is fixed in #736

@speeddragon
Copy link
Collaborator

Only tests now failing are:

  • hb_client (5 failing)
  • dev_arweave (2 failing)

Due to how the upload is done to up.arweave.net

samcamwilliams and others added 15 commits March 9, 2026 18:26
…st-hook"

This reverts commit de6262b, reversing
changes made to 37c508f.
- Wrap prometheus_counter:inc/3 in try-catch in handle_events/0 to prevent server crash on Prometheus errors
- Add is_process_alive/1 check in find_event_server/0 so long-lived processes recover from a dead cached PID
- Wrap cpu_sup:avg5/0 in spawn_monitor with 2s timeout
- Kill worker on timeout to prevent process leaks
- Replace ets:tab2list with ets:match_object filtering for event rows
- Avoids materializing the entire prometheus_counter_table
- Inline Prometheus ETS writes in hb_http and hb_store_arweave
- Guard observe in hb_http_client record_duration spawn
- Eliminates per-request process overhead under load
- Add try-catch to dec_prometheus_gauge and inc_prometheus_counter
- Guard retry path in inc_prometheus_gauge
- Wrap hb_store_lmdb:name_hit_metrics to prevent LMDB read crashes
- Check ensure_all_started result instead of ignoring it
- Cache failure timestamp in persistent_term, retry after 60s
- Callers degrade gracefully when Prometheus is unavailable
- Cap await_prometheus_started to ~30s to prevent unbounded mailbox growth
- Messages stay in mailbox during wait instead of being consumed and lost
- Wrap declare in try-catch, redeclare lazily on inc failure via ensure_event_counter
- Keep metric writes off the HTTP reply path to isolate request tail latency
- Retain try-catch inside the spawn to guard against metric-path hiccups
Comment on lines +153 to +156
ets:match_object(
prometheus_counter_table,
{{default, <<"event">>, '_', '_'}, '_', '_'}
).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of returning all items from the list, we only return events.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this materially effect the results on ~hyperbuddy@1.0/events/format~hyperbuddy@1.0?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. This filters from the ETS side, before counters, which also filters by <<"event">> only counters. This just reduces the amount of data ETS sends to this process, which would be discarded further down.

}
end,
WithPrivIP = hb_private:set(NormalBody, <<"ip">>, RealIP, Opts),
WithPrivIP = hb_private:set(WithPeer, <<"ip">>, RealIP, Opts),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WithPeer wasn't being used.

@speeddragon speeddragon marked this pull request as ready for review March 10, 2026 18:44
NodeOpts2 = maps:merge(NodeOpts, #{ bundler_max_items => 3 }),
Node = hb_http_server:start_node(NodeOpts2#{
priv_wallet => hb:wallet(),
priv_wallet => ar_wallet:new(),
Copy link
Collaborator

@speeddragon speeddragon Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we now start hb for every single test, we cannot use hb:wallet because the ServerID address will conflict when starting a new node inside the test. By using ar_wallet:new(), a new ServerID will be generated based in the private wallet, and it will not conflict with existing one.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because hb:wallet() only creates a wallet if one hasn't been cached in the persistent_term store?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it uses the priv_key_location, which defaults to <<"hyperbeam-key.json">>.

% There is a race condition where we write to the store and a
% reset happens making not read the written value.
% Lower this number can still produce flaky test
timer:sleep(100),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general it is better to use hb_util:until for this type of thing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find something to wait for. Maybe I can use hb_store:list and make sure it is empty, but when I tried that it got stuck there (not sure why).

src/hb_http.erl Outdated
hb_http_client:response_status_to_atom(Status),
http_response_to_httpsig(Status, Headers, Body, Opts)
};
outbound_result_to_message(<<"json@1.0">>, Status, Headers, Body, Opts) ->
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need a custom handler? This could just match outbound_result_to_message(Codec, Status, Headers, Body, Opts) and convert to the given codec.

[prometheus, prometheus_cowboy, prometheus_ranch]
),
ok;
case persistent_term:get(hb_prometheus_start_failed, undefined) of
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely don't want this route. I don't think it works and it may also return non-ok, when it is called ensure_. Why not 38ccd088ff4cdd1c79bfeb09fd88fde0bb7e0d6e? (Also, we should make sure 14827ad3ee2a3f53803f795cc6ce250794e1240a or equivalent makes it in from this dead branch).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied both commits.


%% @doc Reading an unsupported signature type transaction should fail
failure_to_process_message_test() ->
%% TODO: Enable when we find a TX that we don't support
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low-key flex 😂

<<"host">> => <<"draft-17_whitepaper">>
},
Opts
Opts#{http_client => httpc}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporary fix until we fix this in gun.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants