Implement graceful shutdown procedure #8851
Open
whitslack wants to merge 2 commits into ElementsProject:master
When "snub-idle-channels" is set to true, lightningd will no longer spawn channeld subdaemons for channels that have no outstanding HTLCs, and it will cease trying to auto-reconnect to peers with whom we have no outstanding HTLCs. Incoming channel_reestablish messages for these idle channels will cause lightningd to reply to the peer with a warning explaining that we are temporarily declining to reestablish the channel. Since we do not send our own channel_reestablish, the peer is unable to add any HTLCs to the channel (or make any other updates to the channel).

The reason we might want to do this is so we can halt a node gracefully by progressively snubbing more and more channels as they become idle until eventually we have no outstanding HTLCs whatsoever and also no possibility of any new HTLCs being added. At that point, we can safely take our node offline for an extended duration with no possibility that any of our channels will be unilaterally closed due to HTLC deadlines while we are offline.

Changelog-Added: New `snub-idle-channels` dynamic config variable makes CLN temporarily stop spawning channeld subdaemons for channels with no HTLCs, as a means to achieve a safe node shutdown.

Issue: ElementsProject#4842
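The decision rule described in this commit message can be sketched as follows. This is a minimal illustration in Python, not the actual lightningd C code: the `Channel` and `Action` types and both function names are hypothetical, and the sketch assumes that with the flag unset lightningd reconnects and reestablishes as usual.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Action(Enum):
    SPAWN_CHANNELD = "spawn_channeld"          # normal path: reestablish the channel
    REPLY_WITH_WARNING = "reply_with_warning"  # snub: temporarily decline to reestablish


@dataclass
class Channel:
    # Hypothetical stand-in for lightningd's per-channel state.
    outstanding_htlcs: int


def on_channel_reestablish(channel: Channel, snub_idle_channels: bool) -> Action:
    """Decide how to react to a peer's incoming channel_reestablish."""
    if snub_idle_channels and channel.outstanding_htlcs == 0:
        # Idle channel while snubbing: no channeld, and no channel_reestablish
        # of our own, so the peer cannot add HTLCs to this channel.
        return Action.REPLY_WITH_WARNING
    return Action.SPAWN_CHANNELD


def should_auto_reconnect(channels: List[Channel], snub_idle_channels: bool) -> bool:
    """While snubbing, keep reconnecting only to peers that still have HTLCs with us."""
    if not snub_idle_channels:
        return True  # assumption: normal auto-reconnect behaviour applies
    return any(c.outstanding_htlcs > 0 for c in channels)
```

Because a channel that still carries HTLCs is never snubbed, those HTLCs can keep resolving; the channel only becomes eligible for snubbing once it has drained.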
This script utilizes the new "snub-idle-channels" knob to attempt to stop a CLN node gracefully. The script sets the snub flag and then starts forcibly disconnecting peers that have one or more reestablished channels but no outstanding HTLCs. When both the number of reestablished channels and the number of outstanding HTLCs reach zero, the script stops the node. If this does not occur before a user-specified timeout, then the script exits with an error and reports the block height and approximate time until the next outstanding HTLC expires.

Changelog-Added: `contrib/lightning-graceful-stop.sh` attempts to stop a node without leaving any outstanding HTLCs.

Closes: ElementsProject#4842
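The bookkeeping the script performs on each pass can be sketched in Python as below. This is an illustration only, not a translation of `contrib/lightning-graceful-stop.sh`: the `ChannelView` snapshot type and all function names are hypothetical, "no outstanding HTLCs" is assumed to be evaluated per peer across all of its channels, and the time estimate assumes an average of roughly 10 minutes per block.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

APPROX_SECONDS_PER_BLOCK = 600  # assumption: ~10-minute average block interval


@dataclass
class ChannelView:
    # Hypothetical snapshot of one channel; not the script's real data model.
    peer_id: str
    reestablished: bool
    htlc_expiries: List[int] = field(default_factory=list)  # expiry block heights


def peers_to_disconnect(channels: List[ChannelView]) -> List[str]:
    """Peers with at least one reestablished channel and no outstanding HTLCs."""
    by_peer: Dict[str, List[ChannelView]] = {}
    for ch in channels:
        by_peer.setdefault(ch.peer_id, []).append(ch)
    return sorted(
        peer
        for peer, chans in by_peer.items()
        if any(c.reestablished for c in chans)
        and not any(c.htlc_expiries for c in chans)
    )


def safe_to_stop(channels: List[ChannelView]) -> bool:
    """True once there are zero reestablished channels and zero outstanding HTLCs."""
    return not any(c.reestablished or c.htlc_expiries for c in channels)


def next_expiry_report(
    channels: List[ChannelView], current_height: int
) -> Optional[Tuple[int, int]]:
    """On timeout: (expiry height, approximate seconds until it) of the next HTLC."""
    expiries = [e for c in channels for e in c.htlc_expiries]
    if not expiries:
        return None
    next_height = min(expiries)
    blocks_left = next_height - current_height
    return next_height, blocks_left * APPROX_SECONDS_PER_BLOCK
```

For example, with one idle reestablished peer and one peer holding an HTLC that expires 140 blocks from now, only the idle peer is disconnected, the node is not yet safe to stop, and the timeout report gives 140 × 600 = 84,000 seconds, or a little over 23 hours.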
BOLT 2 says:
We can abuse this requirement to implement a graceful shutdown procedure:
`channel_reestablish` messages for any channels that have exactly zero outstanding HTLCs.

This PR has two objectives:
- A `snub-idle-channels` dynamic config variable that, when set to `true`, makes lightningd:
  - stop spawning `channeld` subdaemons for channels that have no outstanding HTLCs;
  - reply with a warning to incoming `channel_reestablish` messages for channels that have no outstanding HTLCs.
- A `contrib/lightning-graceful-stop.sh` script that utilizes `snub-idle-channels` to implement the graceful shutdown procedure outlined above.

I have tested this graceful shutdown procedure on my own production node with great success. In under a minute my node dropped from over 30 outstanding HTLCs to 14, all of which were "stuck." The shutdown script reported that the next expiration was 140 blocks away, giving me plenty of time to power off my node and perform a hardware upgrade. If I had been willing to wait for all of my outstanding HTLCs to be resolved, then I could have stopped my node indefinitely with no danger of any forced unilateral closures. (Of course, my peers could still voluntarily choose to unilaterally close my channels with them if they grew tired of waiting for my node to reappear in the network, but that's not the concern that graceful shutdown is attempting to address.)
Note that there is still one edge case that this graceful shutdown strategy doesn't solve. If a peer has transmitted a new commitment containing a new HTLC, but we never transmitted our own new commitment containing that same new HTLC (either because we never received the peer's new commitment or because we restarted before we could send our own new commitment), then we will not know about (or will have forgotten) the new HTLC, and we will believe that the channel is safe to snub even though the peer would retransmit their new commitment containing the new HTLC if we allowed them to reestablish the channel. I am not certain, but it may be possible to use the fields in the `channel_reestablish` message received from the peer to ascertain whether the peer has new HTLCs that they need to retransmit to us, and if they do, then we shouldn't snub the channel even if we are currently aware of no outstanding HTLCs in it.

Checklist
Before submitting the PR, ensure the following tasks are completed. If an item is not applicable to your PR, please mark it as checked:
tools/lightning-downgrade