doc: Article on high volume log loss. #3166
alanconway wants to merge 1 commit into openshift:master
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alanconway. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Force-pushed: 6b08884 to 9a559bd
jcantrill left a comment:
Overall, I believe this needs changes focused on changing the container log settings rather than on expanding the capacity of /var/log.
/hold
@jcantrill good feedback. I'm doing a rewrite to clarify the focus on rotation parameters rather than /var/log size, as you suggested. Will have a new version shortly.
Force-pushed: f4fc421 to 124cb75
/lgtm
@alanconway: you cannot LGTM your own PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@jcantrill please re-review, I think I've addressed your comments.
Force-pushed: 124cb75 to 9e5e077
@jcantrill made changes to "Recommendations" to clarify your points - definite improvement.

/hold
Force-pushed: 9e5e077 to 3a8241e
Force-pushed: 3a8241e to a1328c9
> During _normal operation_ sendRate keeps up with writeRate (on average).
> The number of unread logs is small, and does not grow over time.
> Logging is _overloaded_ when writeRate exceeds sendRate (on average) for some period of time.
This phrase is only true if logs are not dropped or filtered. Perhaps we could extend this phrase, or be more explicit? The logging stack is also "overloaded" if the collector is using all of its CPU or its memory is hitting the limits; then it is (very) slow at reading and sending the logs.
Good point, needs to be addressed.
> [,yaml]
> ----
> apiVersion: logging.openshift.io/v1alpha1
We could probably link to the Red Hat documentation, or share an example here with:
- limits.{cpu,memory}
- requests.{cpu,memory}
- nodeSelector/tolerations
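For illustration, a sketch of the kind of example this comment suggests, using the `ClusterLogging` resource. The field paths follow the logging.openshift.io/v1 API as I understand it, and the resource values and selectors are assumptions; check them against the Red Hat documentation for your version before use:

```yaml
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  collection:
    logs:
      type: vector
      # Assumed values: size these from observed collector usage.
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          memory: 2Gi
      # Assumed placement rules; adjust to your topology.
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
```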
> The long-term average send rate *must* exceed the write rate (including spikes) to allow recovery after overloads.
>
> ----
> TotalWriteRateBytes < TotalSendRateLogs × LogSizeBytes
I feel that this formula is a simplification of a different reality:
- It does not consider that the formula assumes all nodes produce the same quantity/size of logs, which is not the reality.
- It assumes that no filtering (drop/pruning) is set.
- TotalWriteRateBytes comes from log_logged_bytes_total, a metric measured from pod logs. It's not easy to understand this formula when it uses TotalSendRateLogs, which can include logs collected from the audit logs (they can be much more verbose) and journald logs.

On the assumption that all nodes produce the same number/size of logs: the reality shows that usually only some of the collectors carry the majority of the load, because some nodes run the most verbose applications and therefore produce the highest percentage of logs. So the nodes running those applications, and the collectors running on them, are the most impacted.

Later, the document shares the formula:

DiskTotalSize = MaxOutageTime × TotalWriteRateBytes × SafetyFactor

This DiskTotalSize metric could be "good", but since the formula TotalWriteRateBytes < TotalSendRateLogs × LogSizeBytes is a simplification, the result will be wrong and it will fail. Let's take an example:
- Node A: 300GB/day
- Node B: 50GB/day
- Node C: 30GB/day
The disk needed on Node A won't be the same as on Node B. So I could fail by not having enough disk on Node A, while most of the storage on Nodes B and C sits free.

Another example:
- Node A: a single pod generating 200GB/day (this comes from reality)
- Node B: 250 pods, each generating only a few MB per day

So I feel this metric is not good for calculating how much pressure will exist on a node, or for use later in other formulas. A better metric, though still not perfect, would be to take the measurements per application/collector/node and use the highest value of them as the baseline.
+1 - need more realistic examples, or at least an explanation of the issues.
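To make the reviewer's point concrete, here is a small sketch (in Python, not from the document) that sizes disk per worst-case node instead of from a cluster-wide average. It uses the 300/50/30 GB/day example above; the 4-hour outage window and the 1.5 safety factor are assumptions:

```python
# Per-node daily write rates from the example in this thread (GB/day).
write_gb_per_day = {"node-a": 300, "node-b": 50, "node-c": 30}

MAX_OUTAGE_HOURS = 4.0   # assumed outage window
SAFETY_FACTOR = 1.5      # assumed safety margin

def disk_needed_gb(gb_per_day: float) -> float:
    """Disk needed on one node to buffer logs through a max outage,
    following the DiskTotalSize formula but applied per node."""
    write_rate_gb_per_hour = gb_per_day / 24.0
    return MAX_OUTAGE_HOURS * write_rate_gb_per_hour * SAFETY_FACTOR

# Sizing from the cluster-wide average under-provisions the busiest node
# and over-provisions the quiet ones.
avg = sum(write_gb_per_day.values()) / len(write_gb_per_day)
print(f"average-based size: {disk_needed_gb(avg):.1f} GB on every node")
for node, rate in write_gb_per_day.items():
    print(f"{node}: needs {disk_needed_gb(rate):.1f} GB")
```

Node A needs several times the average-based figure, which is the failure mode the comment describes.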
> .Recovery time to clear the backlog from a max outage:
> ----
> RecoveryTime = (MaxOutageTime × TotalWriteRateBytes) / (TotalSendRateLogs × LogSizeBytes)
Since this metric uses TotalWriteRateBytes, which assumes all nodes produce the same load, the time may not be correct. Let's take the example again:
- Node A: 300GB/day
- Node B: 50GB/day
- Node C: 30GB/day
The recovery time will be different for the collectors on Node A, Node B, and Node C. Also, when recovering, the collector on Node A could be slower not only because it needs to catch up on the logs from the past, but also because the node keeps producing more and more logs while the service recovers, and those need to be processed too. In a production environment, I'd expect the collector on Node A to suffer more than the other collectors, since the receiver could be rejecting logs with "Too many requests" messages that the collector then needs to retry.
+1 need to clarify the calculation should be for the worst-case node. Can we use metrics to identify that?

We can get metrics by collector/node. In fact, I usually need to look at per-collector/node metrics to find the one with the worst behaviour, and to see whether that behaviour is related to the number/size of logs produced on that node or has a different cause, such as node health.
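Applying the document's RecoveryTime formula per node rather than cluster-wide would look something like this sketch (Python, not from the document). The send rate and average log size are assumptions; the per-node write rates come from the 300/50/30 GB/day example in this thread:

```python
# Apply the doc's RecoveryTime formula per node instead of cluster-wide.

def recovery_time_s(outage_s: float, write_bytes_s: float,
                    send_logs_s: float, log_size_bytes: float) -> float:
    """Time to send the backlog accumulated during an outage, per the
    doc's formula (it ignores logs written during recovery itself)."""
    backlog_bytes = outage_s * write_bytes_s
    send_bytes_s = send_logs_s * log_size_bytes
    return backlog_bytes / send_bytes_s

OUTAGE_S = 3600          # one-hour outage, as in the doc's worked example
LOG_SIZE_BYTES = 1024    # assumed average log record size
SEND_LOGS_S = 10_000     # assumed per-collector send rate (logs/s)

GB_PER_DAY = 1e9 / 86400  # GB/day -> bytes/s
for node, gb_day in {"node-a": 300, "node-b": 50, "node-c": 30}.items():
    t = recovery_time_s(OUTAGE_S, gb_day * GB_PER_DAY, SEND_LOGS_S, LOG_SIZE_BYTES)
    print(f"{node}: {t / 60:.1f} min to clear the backlog")
```

The busiest node recovers much more slowly than the others, so the worst-case node, not the average, should drive the calculation.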
> TotalWriteRateBytes = 2MB/s
> SafetyFactor = 1.5
>
> DiskTotalSize = 3600s × 2MB/s × 1.5 = 10GB
This DiskTotalSize assumes that when the logs are rotated, they are not compressed. Example of compression:

/host/var/log/pods/openshift-ovn-kubernetes_ovnkube-node-mrdgt_218324ca-0add-4885-9c30-80f7576d4f85/ovnkube-controller:
total 182032
-rw-------. 1 root root 37269952 Jan 7 14:21 0.log
-rw-------. 1 root root 4627032 Dec 26 00:22 0.log.20251219-234609.gz
-rw-------. 1 root root 4503073 Dec 30 07:30 0.log.20251226-002205.gz
-rw-------. 1 root root 4760160 Jan 4 04:30 0.log.20251230-073005.gz
-rw-------. 1 root root 52429256 Jan 4 04:30 0.log.20260104-043015
-rw-------. 1 root root 15676015 Jan 28 17:39 1.log
-rw-------. 1 root root 4794543 Jan 15 05:49 1.log.20260111-045408.gz
-rw-------. 1 root root 4766578 Jan 19 07:00 1.log.20260115-054909.gz
-rw-------. 1 root root 4789273 Jan 26 12:13 1.log.20260119-070010.gz
-rw-------. 1 root root 52431455 Jan 26 12:13 1.log.20260126-121325

In fact, if the logs are compressed, Vector will not read them even when they remain on the nodes, because of the "source" rule where the compressed files are excluded:

exclude_paths_glob_patterns = ["/var/log/pods/*/*/*.gz", "/var/log/pods/*/*/*.log.*", "/var/log/pods/*/*/*.tmp", "/var/log/pods/openshift-logging_*/gateway/*.log", "/var/log/pods/openshift-logging_*/loki*/*.log", "/var/log/pods/openshift-logging_*/opa/*.log", "/var/log/pods/openshift-logging_elasticsearch-*/*/*.log", "/var/log/pods/openshift-logging_kibana-*/*/*.log", "/var/log/pods/openshift-logging_logfilesmetricexporter-*/*/*.log"]
Never even thought about compression. Hmmm....
> === Configure Kubelet log limits
>
> Here is an example `KubeletConfig` resource (OpenShift 4.6+). +
> It provides `50MB × 10 files = 500MB` per container.
Caveats on this. Making larger files will also impact node performance in different ways:
1. If the logs are compressed, the memory and CPU impact on the node from compressing the larger files will be bigger.
2. When the collector is able to forward logs again, if it has 100GB pending to read on the node, the impact on disk I/O will be big, because the collector will try to read those logs as quickly as possible. These logs are in /var/log/pods, usually on the node's first disk; workloads on this node that require low latency, such as etcd, will suffer the impact.
r2d2rnd left a comment:
Adding more feedback. Thank you for writing it.
Force-pushed: 9f2f3ce to b51a954
@r2d2rnd this is fantastic feedback, damn you. I have some work to do.
Force-pushed: b51a954 to 7adc87e
> If the forwarder is too slow, in some cases adjusting its CPU/Memory may resolve the problem.
>
> There are always some _unread logs_, written but not yet read by the forwarder.
Log loss can also be caused by:
- https://issues.redhat.com/browse/OBSDA-1206
- short-lived pods, such as jobs, that are created and destroyed so quickly that their logs are never read
More generally: when the pod log files are deleted (not just rotated) we could lose tail logs that we haven't read yet. If the pod was short-lived, that could be all the logs. Do you know the schedule for deleting log files after a pod is deleted? I know they can persist for a while but no idea how long.
Probably there's nothing we can do about this except note that it can happen in the document...
r2d2rnd left a comment:
Other causes for log loss:
> This guide explains how to handle scenarios where high-volume logging can cause log loss in OpenShift clusters, and how to configure your cluster to minimize this risk.
Force-pushed: 7adc87e to f9106bb
@alanconway: all tests passed! Full PR test history. Your PR dashboard.
Description
"This guide explains how to handle scenarios where high-volume logging can cause log loss in
OpenShift clusters, and how to configure your cluster to minimize this risk."
/cc @xperimental
/cc @cahartma
/assign @jcantrill