
doc: Article on high volume log loss. #3166

Open
alanconway wants to merge 1 commit into openshift:master from alanconway:high-volume-article

Conversation

@alanconway
Contributor

Description

"This guide explains how to handle scenarios where high-volume logging can cause log loss in
OpenShift clusters, and how to configure your cluster to minimize this risk."

/cc @xperimental
/cc @cahartma
/assign @jcantrill


@openshift-ci
Contributor

openshift-ci bot commented Dec 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alanconway

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Dec 4, 2025

@jcantrill left a comment


Overall I believe this needs changes: the focus should be on changing the container log settings, not on expanding the capacity of /var/log.

@jcantrill
Contributor

/hold

@openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Dec 11, 2025
@alanconway
Contributor Author

@jcantrill good feedback. I'm doing a rewrite to clarify the focus on rotation parameters rather than /var/log size as you suggested. Will have new version shortly.

@alanconway force-pushed the high-volume-article branch 4 times, most recently from f4fc421 to 124cb75, on December 19, 2025 at 16:19
@alanconway
Contributor Author

/lgtm
/unhold

@openshift-ci
Contributor

openshift-ci bot commented Dec 19, 2025

@alanconway: you cannot LGTM your own PR.

Details

In response to this:

/lgtm
/unhold

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci bot removed the do-not-merge/hold label Dec 19, 2025
@alanconway
Contributor Author

@jcantrill please re-review, I think I've addressed your comments.

@alanconway force-pushed the high-volume-article branch from 124cb75 to 9e5e077, on January 7, 2026 at 14:56
@alanconway
Contributor Author

@jcantrill made changes to "Recommendations" to clarify your points - definite improvement.
Cosmetic changes to other sections.
Let me know how it reads to you

@jcantrill
Contributor

/hold

@openshift-ci bot added the do-not-merge/hold label Jan 14, 2026
During _normal operation_ sendRate keeps up with writeRate (on average).
The number of unread logs is small, and does not grow over time.

Logging is _overloaded_ when writeRate exceeds sendRate (on average) for some period of time.
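The write/send relationship above can be sketched as a toy backlog model (hypothetical rates, not from the article): unread bytes accumulate while writeRate exceeds sendRate, and drain once the write rate drops back below the send rate.

```python
# Toy backlog model for the overload definition above. All rates are
# hypothetical illustration values in bytes per second.
def backlog_over_time(write_rates, send_rate):
    """Return the unread-log backlog (bytes) after each 1-second interval."""
    backlog = 0.0
    history = []
    for write_rate in write_rates:
        # Backlog grows by (write - send) each second, never below zero.
        backlog = max(0.0, backlog + write_rate - send_rate)
        history.append(backlog)
    return history

# Normal operation: writeRate (1 MB/s) stays below sendRate (2 MB/s),
# so the backlog stays small (here, zero).
normal = backlog_over_time([1e6] * 5, send_rate=2e6)

# Overload: a 3-second spike at 5 MB/s builds a backlog that only
# starts draining after the spike ends.
overload = backlog_over_time([5e6, 5e6, 5e6, 1e6, 1e6], send_rate=2e6)
```

The drain phase is why the article later requires the long-term average send rate to exceed the write rate: otherwise the backlog never clears.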
Contributor

This phrase is only true if logs are not dropped or filtered. We could perhaps extend the phrase or be more explicit.

The logging stack is also "overloaded" if the collector is using all of its CPU or memory and hitting its limits; then it is (very) slow at reading and sending the logs.

Contributor Author

Good point, needs to be addressed.


[source,yaml]
----
apiVersion: logging.openshift.io/v1alpha1
Contributor

We could probably link to the Red Hat documentation, or share an example here with:

  • limits.{cpu,memory}
  • requests.{cpu,memory}
  • nodeSelector/tolerations
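A hedged sketch of what such an example might look like. The API group, version, and field names here are assumptions (this thread only shows `logging.openshift.io/v1alpha1`); verify the exact `ClusterLogForwarder` schema for your release against the Red Hat documentation before using it.

```yaml
# Hypothetical sketch only: field names vary across ClusterLogForwarder
# API versions. Check the Red Hat docs for your OpenShift Logging release.
apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  collector:
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        memory: 2Gi
    nodeSelector:
      node-role.kubernetes.io/worker: ""
    tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
```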

The long-term average send rate *must* exceed the write rate (including spikes) to allow recovery after overloads.

----
TotalWriteRateBytes < TotalSendRateLogs × LogSizeBytes
Contributor

@r2d2rnd commented Jan 28, 2026

I feel that this formula is a simplification of a more complex reality:

  1. It assumes that all the nodes produce the same quantity/size of logs, and this is not the reality.
  2. It assumes that no filtering (drop/prune) is configured.
  3. TotalWriteRateBytes comes from "log_logged_bytes_total", a metric measured from pod logs. The formula is hard to follow when "TotalSendRateLogs" can include logs collected from the audit logs (which can be very verbose) and from journald.

Assumption that all nodes produce the same number/size of logs

The reality shows that usually only some of the collectors carry the majority of the load: a few nodes produce the highest percentage of logs because they host the most verbose applications. The nodes running those applications, and the collectors running on them, are the most impacted.

Later, the guide shares the formula:
DiskTotalSize = MaxOutageTime × TotalWriteRateBytes × SafetyFactor

This DiskTotalSize could be "good", but since the formula TotalWriteRateBytes < TotalSendRateLogs × LogSizeBytes is a simplification, the sizing will be wrong and will fail.

An example:

  • Node A: 300GB/day
  • Node B: 50GB/day
  • Node C: 30GB/day

The disk needed on node A is not the same as on node B. I could end up without enough disk on node A, while on nodes B and C most of the storage sits free.

Another example:

  • Node A: a single pod generating 200GB/day (this comes from real cases)
  • Node B: 250 pods generating only a few MB per day each

So I feel this metric is not a good basis for calculating how much pressure a node will be under, or for use later in the other formulas.

A better metric, though still imperfect, would be to take measurements per application/collector/node and use the highest of those values as the baseline.

Contributor Author

+1 - need more realistic examples or at least explanation of the issues.


.Recovery time to clear the backlog from a max outage:
----
RecoveryTime = (MaxOutageTime × TotalWriteRateBytes) / (TotalSendRateLogs × LogSizeBytes)
Contributor

As this metric takes TotalWriteRateBytes, which assumes that all the nodes produce the same load, the calculated time may not be correct.

Taking the example again:
Node A: 300GB/day
Node B: 50GB/day
Node C: 30GB/day

The recovery time will be different for the collectors on node A, node B, and node C.

Also, when recovering on node A, where the load is bigger, recovery can be slower not only because of the backlog from the past: once the service has recovered, the node keeps producing more and more logs that also need to be processed.

In a production environment, I'd expect the collector on node A to suffer more than the other collectors, as the receiver could be rejecting logs with "Too many" messages that the collector then needs to retry.

Contributor Author

+1 need to clarify the calculation should be for the worst-case node. Can we use metrics to identify that?

Contributor

+1 need to clarify the calculation should be for the worst-case node. Can we use metrics to identify that?

We can get metrics per collector/node. In fact, I usually need per-collector/node metrics to identify the one with the worst behaviour, and whether that behaviour correlates with the number/size of logs produced on the same node or has a different cause, such as node health.

TotalWriteRateBytes = 2MB/s
SafetyFactor = 1.5

DiskTotalSize = 3600s × 2MB/s × 1.5 = 10GB
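As an illustration of the two sizing formulas in the article, a small sketch in Python. The write rate and safety factor come from the worked example above; the send rate and average log size are hypothetical, and per the review discussion the rates should be taken from the worst-case node, not a cluster-wide average.

```python
# Sketch of the article's sizing formulas. Write rate and safety factor
# match the worked example; send rate and log size are hypothetical.
def disk_total_size(max_outage_s, write_rate_bps, safety_factor):
    # DiskTotalSize = MaxOutageTime × TotalWriteRateBytes × SafetyFactor
    return max_outage_s * write_rate_bps * safety_factor

def recovery_time(max_outage_s, write_rate_bps, send_rate_logs, log_size_bytes):
    # RecoveryTime = (MaxOutageTime × TotalWriteRateBytes)
    #              / (TotalSendRateLogs × LogSizeBytes)
    return (max_outage_s * write_rate_bps) / (send_rate_logs * log_size_bytes)

MB = 1_000_000

# Worked example above: 1-hour outage at 2 MB/s with a 1.5 safety factor.
size = disk_total_size(3600, 2 * MB, 1.5)    # 10,800,000,000 bytes, ~10 GB

# Hypothetical send capacity: 5000 logs/s at 1 KB average log size.
t = recovery_time(3600, 2 * MB, 5000, 1000)  # 1440 s (24 min) to drain
```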
Contributor

This DiskTotalSize assumes that when the logs are rotated, they are not compressed. Example of compression:

/host/var/log/pods/openshift-ovn-kubernetes_ovnkube-node-mrdgt_218324ca-0add-4885-9c30-80f7576d4f85/ovnkube-controller:
total 182032
-rw-------. 1 root root 37269952 Jan  7 14:21 0.log
-rw-------. 1 root root  4627032 Dec 26 00:22 0.log.20251219-234609.gz
-rw-------. 1 root root  4503073 Dec 30 07:30 0.log.20251226-002205.gz
-rw-------. 1 root root  4760160 Jan  4 04:30 0.log.20251230-073005.gz
-rw-------. 1 root root 52429256 Jan  4 04:30 0.log.20260104-043015
-rw-------. 1 root root 15676015 Jan 28 17:39 1.log
-rw-------. 1 root root  4794543 Jan 15 05:49 1.log.20260111-045408.gz
-rw-------. 1 root root  4766578 Jan 19 07:00 1.log.20260115-054909.gz
-rw-------. 1 root root  4789273 Jan 26 12:13 1.log.20260119-070010.gz
-rw-------. 1 root root 52431455 Jan 26 12:13 1.log.20260126-121325

In fact, if the logs are compressed, Vector will not read them even though they remain on the nodes, because the "source" rule excludes the compressed files:

exclude_paths_glob_patterns = ["/var/log/pods/*/*/*.gz", "/var/log/pods/*/*/*.log.*", "/var/log/pods/*/*/*.tmp", "/var/log/pods/openshift-logging_*/gateway/*.log", "/var/log/pods/openshift-logging_*/loki*/*.log", "/var/log/pods/openshift-logging_*/opa/*.log", "/var/log/pods/openshift-logging_elasticsearch-*/*/*.log", "/var/log/pods/openshift-logging_kibana-*/*/*.log", "/var/log/pods/openshift-logging_logfilesmetricexporter-*/*/*.log"]
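A small sketch of why those exclusions matter. Python's `fnmatch` approximates the glob matching: a live `0.log` is collected, while a rotated `.gz` file is skipped. The paths are made up for illustration, and Vector's actual glob engine may differ in edge cases.

```python
# Approximation of the collector's exclusion rules with fnmatch.
# Only two of the patterns from the config above are needed here.
from fnmatch import fnmatch

EXCLUDES = ["/var/log/pods/*/*/*.gz", "/var/log/pods/*/*/*.log.*"]

def is_collected(path):
    """True if no exclusion pattern matches, i.e. Vector would read it."""
    return not any(fnmatch(path, pat) for pat in EXCLUDES)

# Hypothetical paths, shaped like the directory listing above.
live = "/var/log/pods/ns_pod_uid/ovnkube-controller/0.log"
rotated = "/var/log/pods/ns_pod_uid/ovnkube-controller/0.log.20251219-234609.gz"
# is_collected(live) is True; is_collected(rotated) is False.
```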

Contributor Author

Never even thought about compression. Hmmm....

=== Configure Kubelet log limits

Here is an example `KubeletConfig` resource (OpenShift 4.6+). +
It provides `50MB × 10 files = 500MB` per container.
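For reference, a sketch of such a resource matching the `50MB × 10 files = 500MB` figure. The PR diff itself is not shown here, so this is an assumed reconstruction: the resource name and pool-selector label are placeholders, and the field names should be verified against your OpenShift release.

```yaml
# Hypothetical sketch of the KubeletConfig described above.
# 50Mi per file × 10 files = 500Mi of log retained per container.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: container-log-limits   # placeholder name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    containerLogMaxSize: 50Mi
    containerLogMaxFiles: 10
```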
Contributor

Caveats on this.

Making larger files will also impact node performance in different ways:

1. If the logs are compressed, the node pays a bigger memory and CPU cost for the compression.
2. When the collector is able to forward logs again, if it has 100GB pending to read on the node, the disk I/O impact will be big, as the collector will try to read those logs as quickly as possible. These logs are in "/var/log/pods", usually on the node's first disk. Workloads running on this node that require low latency, such as "etcd", will suffer the impact.

Contributor

@r2d2rnd left a comment

Adding more feedback. Thank you for writing it

@alanconway force-pushed the high-volume-article branch 3 times, most recently from 9f2f3ce to b51a954, on January 28, 2026 at 21:02
@alanconway
Contributor Author

alanconway commented Jan 28, 2026

@r2d2rnd this is fantastic feedback, damn you. I have some work to do.
I've pushed an incomplete update covering some of your earlier comments (which I resolved), but there's more to do:

  • A slow forwarder (collector CPU/mem) is just as important as a slow sender; the system will be bottlenecked by whichever is worse.
  • You've got a lot of valuable extra detail and realism to work into the examples.
  • Other log types: not sure the current section is enough - there's nothing on how to measure other log types, using metrics or otherwise.
  • Drop/filter - definitely needs to be included. E.g. the api-audit filter drastically reduces log size; need to take that into account.

If the forwarder is too slow, in some cases adjusting its CPU/Memory may resolve the problem.

There are always some _unread logs_, written but not yet read by the forwarder.

Contributor

Log loss can also be produced by:

Contributor Author

More generally: when the pod log files are deleted (not just rotated), we could lose tail logs that we haven't read yet. If the pod was short-lived, that could be all of its logs. Do you know the schedule for deleting log files after a pod is deleted? I know they can persist for a while, but I have no idea how long.
Probably there's nothing we can do about this except note in the document that it can happen...

Contributor

@r2d2rnd left a comment

Other causes for log loss

This guide explains how to handle scenarios where high-volume logging can cause log loss in
OpenShift clusters, and how to configure your cluster to minimize this risk.
@openshift-ci
Contributor

openshift-ci bot commented Feb 25, 2026

@alanconway: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. release/6.4
