ReadonlyFilesystem Condition with Auto-Recovery Detection#1234
ReadonlyFilesystem Condition with Auto-Recovery Detection#1234CharudathGopal wants to merge 1 commit intokubernetes:masterfrom
Conversation
|
Welcome @CharudathGopal! |
|
Hi @CharudathGopal. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: CharudathGopal The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@Random-Liu @sjenning Can you please take a look at this PR |
|
/ok-to-test |
|
/cc @Random-Liu @wangzhen127 @hakman @andyxning Tests have passed, appreciate any help in reviewing this change.. |
|
Hello @hakman, This feature is important for Portworx Enterprise Product which provides a comprehensive, software-defined, persistent container storage and data management platform, specifically designed for Kubernetes-based applications across on-premises, hybrid, and multi-cloud environments. The changes we are making as part of this PR can help multiple platform like Anthos, Gardener to operate seamlessly with Portworx. Appreciate any help with reviewing this change. |
|
Hi @CharudathGopal — thanks for putting this together, and for the detailed work on it. I went through the PR, and I had to spend some time understanding the problem statement and the operational impact of the change. Here’s my feedback:
That said, NPD is modular. If this approach is working well for you and your users/customers as-is, I don't see any reason to not use it. I do agree it would be valuable to have this capability upstream, but I think the implementation details need a bit more discussion before we merge something like this. @wangzhen127 @SergeyKanzhelev — curious if you have thoughts on the right approach here (especially around the signal/source of truth and condition ownership). |
|
Hello @hakman, Thanks for the detailed review! Before answering your questions/concerns I would like to mention I have tried to keep the design as similar to existing SystemLogMonitor (readonly-monitor.json). 1. Condition Ownership 2. /dev/kmsg Ring Buffer Limitation Even in case of SystemLogMonitor (readonly-monitor.json), when the POD is restarted and ring-buffer do not have the error message then it resets the node condition to false. 3. Unreadable /dev/kmsg = OK 4. Complexity Initial approach: Scan all /proc/mounts for read-only filesystems. Current approach: Only check devices that kernel explicitly flagged in error messages. More targeted, fewer false positives, but requires robust path resolution. I 100% agree that this needs a upstream solution, but getting this change in helps a lot of customers and multiple K8s Platform Providers to use this solution. |
|
@hakman FYI, forgot to mention earlier. The doc I have added already talks about disabling readonly-monitor.json before using the customPluginMonitor we have here in this PR. @SergeyKanzhelev @wangzhen127 Appreciate if you review the change. |
ISSUE: #474
Problem Statement
The existing NPD SystemLogMonitor for ReadonlyFilesystem is a one-way trigger: it sets the condition to True when detecting
Remounting filesystem read-onlyin kernel logs, but never clears it. The condition remains stuck until NPD pod restart or node reboot, causing some Kubernetes platform to delete the node.Solution: CustomPluginMonitor with Active Recovery Detection
This PR introduces a CustomPluginMonitor (check_ro_filesystem.sh) that actively monitors both detection and recovery of read-only filesystems.
Key Features
NPD Design Adherence & Detection (5-minute lookback)
/dev/kmsgfor "Remounting filesystem read-only" messages/proc/uptime/host/proc/1/mountsRecovery Check (all-time lookback)
/dev/kmsgwith no time limitTargeted Approach (Avoids False Positives)
/proc/mounts(which would flag legitimate read-only mounts like /boot, CD-ROMs, ConfigMaps)Deployment Changes
ReadonlyFilesystem CustomPluginMonitor is not enabled by default, users can enable this feature by following steps documented in
docs/readonly-recovery-plugin-monitor.md--config.system-log-monitor=/config/readonly-monitor.jsonwith--config.custom-plugin-monitor=/config/readonly-recovery-plugin-monitor.json