Add in option for btcpay_backup to backup full lnd state for migration #928

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

johnheenan

The deleted line is what causes the backup to skip the lnd database files channel.db and sphinxreplay.db; removing it fixes that.

Please see lightningnetwork/lnd#9070 for issue and discussion.

I suspect the next line should also be deleted for CLN:

    --exclude="volumes/generated_clightning_bitcoin_datadir/_data/lightning-rpc" \

@ndeet
Contributor

ndeet commented Sep 11, 2024

Thanks @johnheenan for all the debugging and figuring out. 💚

I think the CLN RPC data is not needed; other CLN data volumes are backed up and that seems to work. For LND, though, it seems it either never worked or the graph data has only been needed for the last few versions.

@dennisreimann
Member

The problem with including the graph data is that it ends up being rather large. For instance, I have a node where, even though LND compacts the data, this folder alone is > 20GB:

12G	generated_lnd_bitcoin_datadir/_data/data/graph/mainnet/channel.db
4.0K	generated_lnd_bitcoin_datadir/_data/data/graph/mainnet/channel.db.last-compacted
5.3G	generated_lnd_bitcoin_datadir/_data/data/graph/mainnet/sphinxreplay.db
4.0K	generated_lnd_bitcoin_datadir/_data/data/graph/mainnet/sphinxreplay.db.last-compacted
4.1G	generated_lnd_bitcoin_datadir/_data/data/graph/mainnet/wtclient.db
4.0K	generated_lnd_bitcoin_datadir/_data/data/graph/mainnet/wtclient.db.last-compacted

This makes it very impractical for backups, especially if you are doing daily snapshots.
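To gauge how much the graph directory would add to a backup on a given node, something like the following could be run first. It is only a sketch: the function name `dir_size_kb` is hypothetical, and the path in the usage note is the one from the listing above, assumed to be relative to the Docker volumes directory.

```shell
# Hypothetical helper: print the size of a directory in KiB (0 if it is missing).
dir_size_kb() {
    local dir="$1"
    if [ -d "$dir" ]; then
        # du -sk prints "<KiB><tab><path>"; keep only the number.
        du -sk "$dir" | awk '{print $1}'
    else
        echo 0
    fi
}
```

For example, `dir_size_kb generated_lnd_bitcoin_datadir/_data/data/graph` would report the total for the folder shown above.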

I haven't read all of the debugging conversations, but we'd need to establish what exactly the problem was, and whether this is data that needs to be backed up and cannot be replicated via sync.

@johnheenan Can you provide us with a TL;DR version of the findings from debugging this together with @damanic?

@dennisreimann
Member

I reached out to Oliver Gugger from Lightning Labs and he confirmed that the graph data is needed when restoring the backup on a new device. The whole process is described here: https://github.com/lightningnetwork/lnd/blob/master/docs/safety.md#migrating-a-node-to-a-new-device

@johnheenan
Author

johnheenan commented Sep 12, 2024

The TL;DR version: an out-of-date channel.db file makes an lnd node toxic. It is excluded from the backup (as are all other files in the graph directory). An up to date channels.backup file is useless for node restoration. It is backed up correctly. Its only use is to recover channel funds through co-operative close.

I would suggest this as a solution to the large size of the folder issue:

  1. As a default, skip backup of the graph directory with a warning
  2. Offer an option to post process all db files in graph to a compacted state before adding to the backup. It will add substantial time to the backup. Warn that the backup is useless as soon as Lightning is started up again.
  3. Offer to backup 'as is' to speed up a migration. Again, warn that the backup is useless as soon as lnd is started up again.
  4. Apply the same warnings for CLN and eclair, if used

Note just because lightning is disabled does not mean it is not in use. Lightning must be kept disabled to keep the backup of the graph directory (for lnd) non toxic.

@johnheenan
Author

I reached out to Oliver Gugger from Lightning Labs and he confirmed that the graph data is needed when restoring the backup on a new device. The whole process is described here: https://github.com/lightningnetwork/lnd/blob/master/docs/safety.md#migrating-a-node-to-a-new-device

This is specific to migration of a node, not backup of a node where the backup is immediately useless when the node starts up again.

For this PR I am not specifically addressing migration.

Currently, if someone wants to migrate easily they need to:

  1. Delete line 113 from btcpay_backup.sh
  2. Disable lightning before backup

I agree that just backing up the graph directory as a backup and continuing to use lightning is pointless. At least we now know that backups cannot currently be used for migration and that the essential graph directory is missing from backups.

I don't know if the backup size issue is also shared with CLN and eclair.

Maybe the easiest option is to offer a single option to also fully backup whatever lightning node data is in use and to leave the lightning node disabled after this backup. Perhaps also include an option to compact data.

I understand compaction takes a long time but the effect is dramatic for lnd, such as 60GB to below 1GB. I tested compressing uncompacted channel.db on au8b with gz. Size went down from 145MB to 46MB, still too little when scaled up.
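The gz test above can be repeated on any file without writing a compressed copy to disk. This is a minimal sketch; `gzip_size_report` is a hypothetical name, and it only measures gzip compression, not lnd's own compaction.

```shell
# Hypothetical helper: print "<original bytes> <gzip-compressed bytes>" for a file.
# The file is read only; the compressed stream is counted and discarded.
gzip_size_report() {
    local file="$1"
    local orig comp
    orig=$(wc -c < "$file")
    comp=$(gzip -c "$file" | wc -c)
    # Arithmetic expansion strips the padding some wc implementations emit.
    echo "$((orig)) $((comp))"
}
```

Running it against a copy of channel.db taken while lnd is stopped avoids reading a file that is being written to.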

@damanic

damanic commented Sep 12, 2024

An up to date channels.backup file is useless for node restoration. It is backed up correctly. Its only use is to recover channel funds through co-operative close.

Perhaps the backup/restore policy for LND on btcpay should be for recovery of funds only and the restore documentation should present steps to recover funds (close channels). Perhaps a tool could be offered to assist with the process. In most cases this would be the safest option to instruct and advise on, rather than risking toxic channel state on an out of date backup.

Then create a separate btcpay-migrate.sh script with migration documentation that warns of the issues and increased size of backup if using LND. The migrate script would be similar to backup except that it would:

  1. Include the additional volumes required to migrate LND ( volumes/generated_lnd_bitcoin_datadir/_data/data/graph )
  2. Not restart the server after backup complete

@ndeet
Contributor

ndeet commented Sep 12, 2024

Instead of copying all of btcpay-backup.sh to btcpay-migrate.sh and just changing a few lines, we could also consider adding flags: one to run a "full LND backup" and another for "do not start containers after backup". That would be non-breaking for the current backup but add the option to do a full backup.

As for CLN, as I said there is afaik no such thing as graph data in CLN and I migrated 2 servers successfully with the current backup script. This seems to be an LND only issue.

@johnheenan
Author

johnheenan commented Sep 13, 2024

I have just added a second commit to exclude backing up lnd database files by default.

To backup lnd database files the option --lnd has to be included. There is also a help option with -h or --help.
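As an illustration of excluding the graph directory by default and including it only on request, here is a sketch that is not taken from the PR. The function name `archive_volumes` and its arguments are hypothetical; the graph path is the one quoted elsewhere in this thread, and `--exclude` is the standard tar option.

```shell
# Relative path of the LND graph directory, as quoted in this thread.
LND_GRAPH_DIR="volumes/generated_lnd_bitcoin_datadir/_data/data/graph"

# Hypothetical helper: archive the "volumes" directory under $1 into tar file $2.
# The graph directory is skipped unless $3 is "true".
archive_volumes() {
    local root="$1" out="$2" include_graph="${3:-false}"
    if [ "$include_graph" = "true" ]; then
        tar -cf "$out" -C "$root" volumes
    else
        # --exclude is placed before the path it should apply to.
        tar -cf "$out" -C "$root" --exclude="$LND_GRAPH_DIR" volumes
    fi
}
```

Keeping the default call unchanged means existing backups behave as before, and only an explicit third argument pulls in the large database files.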

As far as I can tell, these are the minimum amount of code changes required to keep everyone happy.

This commit is untested as I am not sure if I have done enough. I would appreciate comments in this regard, particularly on whether the following lines are sufficient to stop lnd from coming back up just before btcpay_down is run. I will test anyway.

if [ -z "$EXCLUDE_LND_GRAPH" ] && [[ "$BTCPAYGEN_LIGHTNING" == "lnd" ]]; then
    # Switch the lightning implementation off so lnd does not come back up
    export BTCPAYGEN_LIGHTNING="none"
    btcpay_update_docker_env
fi

I realise that if it is important to propagate a changed environment value for BTCPAYGEN_LIGHTNING back up to the caller for scripting purposes with btcpay_backup.sh, then this script should be forced to be sourced whenever BTCPAYGEN_LIGHTNING is to be changed by the script.

@johnheenan johnheenan changed the title One line fix for failing to backup two lnd database files Add in option for btcpay_backup to backup full lnd state for migration Sep 13, 2024
@johnheenan
Author

johnheenan commented Sep 13, 2024

Tested version of commit above. Backup tested only, not migration

The --lnd option successfully includes lnd database files and disables lnd but not synchronously.

I am having trouble with the lines below, which are meant to disable lnd fully before a backup is taken in the btcpay-backup.sh script. The docker stats command shows that the lnd container has still not shut down before file backups are taken.

Maybe the easiest solution is to include a sleep?

Also the btcpay-setup.sh script does a check with github for updates, which is not appropriate.

Is there a more suitable solution to above issues?

if [[ "$EXCLUDE_LND_GRAPH" == *false ]] && [[ "$BTCPAYGEN_LIGHTNING" == lnd ]]; then
    echo "Disabling lnd from starting up."
    export BTCPAYGEN_LIGHTNING="none"
    # btcpay_update_docker_env # does not work to disable lnd container
    # source ./btcpay-setup.sh --install-only # using docker stats, lnd container not stopped before backup taken
    source ./btcpay-setup.sh -i  # using docker stats, lnd container not stopped before backup taken
fi
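Rather than a fixed sleep, one option is to poll until the container reports that it is no longer running, with a timeout. The sketch below is an assumption about how that could look, not what the PR ended up doing; `wait_until_stopped` and `container_running` are hypothetical names, and the container name has to be supplied by the caller.

```shell
# Hypothetical helper: poll until a "still running?" check fails or a timeout expires.
# Usage: wait_until_stopped <timeout_seconds> <check command...>
# The check command must exit 0 while the service is still running.
wait_until_stopped() {
    local timeout="$1"
    shift
    local waited=0
    while "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Timed out after ${timeout}s waiting for shutdown" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Assumed check for a Docker container, using docker inspect's State.Running field.
container_running() {
    [ "$(docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null)" = "true" ]
}
```

A call such as `wait_until_stopped 60 container_running <lnd container name>` would then block the backup until lnd is down, or fail loudly after a minute.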

@johnheenan
Author

The latest commit above has fixed the synchronisation problem.

The only issue remaining, before further testing, is whether there is a better high level script command than the one below that does not pull updates.

    export BTCPAYGEN_LIGHTNING="none"
    source ./btcpay-setup.sh --install-only # is there a better way to disable lnd using high level script commands?

@damanic

damanic commented Sep 14, 2024

Instead of copying all of btcpay-backup.sh to btcpay-migrate.sh and just changing a few lines, we could also consider adding flags: one to run a "full LND backup" and another for "do not start containers after backup". That would be non-breaking for the current backup but add the option to do a full backup.

Would it not be simpler to just have a 'migrate' option on the backup script that applies all the considerations when running backup in migration context? Otherwise we may end up with a bunch of flags and confusing documentation as more considerations arise.

When the migrate flag is enabled on backup:

  • Include additional volumes required for migrate
  • Keep the btcpay stack DOWN after backup complete.
  • Prevent btcpay stack from being restarted automatically on system reboot
  • Add a file to the btcpay dir that indicates the server was shut down for migration

When attempting to bring btcpay UP on a system that was brought down for migration

  • Check for file in the btcpay dir that indicates the server was shut down for migration
    • If present block btcpay UP and display reason.
    • Provide instruction on how to bypass the block/restriction.
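The marker-file idea above could be sketched roughly as follows. Everything here is hypothetical: the marker name `.down-for-migration` and both function names are illustrations only, and nothing in btcpayserver-docker defines them.

```shell
# Hypothetical marker file name, used only for this illustration.
MIGRATION_MARKER=".down-for-migration"

# Record that the stack in directory $1 was shut down for migration.
mark_down_for_migration() {
    date -u +"%Y-%m-%dT%H:%M:%SZ" > "$1/$MIGRATION_MARKER"
}

# Return 1 (and explain why) if directory $1 carries the marker; return 0 otherwise.
check_startup_allowed() {
    local marker="$1/$MIGRATION_MARKER"
    if [ -f "$marker" ]; then
        echo "Refusing to start: this server was shut down for migration on $(cat "$marker")." >&2
        echo "Remove $marker to bypass this check." >&2
        return 1
    fi
    return 0
}
```

The bypass is deliberately a manual file removal, so an operator has to read the message before the stack can come up again.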

@johnheenan
Author

Would it not be simpler to just have a 'migrate' option on the backup script that applies all the considerations when running backup in migration context? Otherwise we may end up with a bunch of flags and confusing documentation as more considerations arise.

Completely agree. From the admin perspective it is a great long term goal that requires a lot of testing and commitment. Why not request they be added in as future enhancements?

At the moment there is no commitment to be able to migrate, just to do backups. The requirement is fully on admins to test they work for the server they were taken on and for any other appropriate purposes for which a backup is taken, which you have done.

From where I am now at a development level, the only goal is to provide a workable solution for lnd migration, with warnings, that does not alter default behaviour.

@damanic

damanic commented Sep 14, 2024

@johnheenan

Looking at your backup script modification the flag --migration would be a better fit because --lnd does not indicate intention.

You have to look at the note on the flag to see it is intended for migration.

--lnd : For migration

And the flag should only be used in migration context because you are modifying the install to remove LND and prevent it from starting which is a migration specific step:

# If doing full lnd backup and lnd is enabled then disable lnd from restarting
if [[ "$EXCLUDE_LND_GRAPH" == *false ]] && [[ "$BTCPAYGEN_LIGHTNING" == lnd ]]; then
    echo "Disabling lnd from starting up."
    echo
    export BTCPAYGEN_LIGHTNING="none"
    # btcpay_update_docker_env # does not work to disable lnd container
    source ./btcpay-setup.sh --install-only # is there a better way to disable lnd using high level script commands?
fi

IMO it would be better to not alter the install with export BTCPAYGEN_LIGHTNING="none" with a btcpay-setup.sh run, and instead simply keep the server down at end of script execution. This fits with all reasons to --migrate because no one would want their BTCPAY server to continue to run after a migrate backup regardless of whether they are running LND or not.

@damanic

damanic commented Sep 14, 2024

Alternatively a separate flag for lnd, btcpay restart.

Eg.
--include-lnd-graph (default no)
--restart (default yes)

@johnheenan
Author

Looking at your backup script modification the flag --migration would be a better fit because --lnd does not indicate intention.

I will alter the flags to make them clearer.

However, the intention is for 'migration purposes', not for 'migration'. It is a backup script, not a migration script.

@johnheenan
Author

Alternatively a separate flag for lnd, btcpay restart.

Eg. --include-lnd-graph (default no) --restart (default yes)

Everything restarts except LND if full LND backup is taken. Once LND restarts the backup is useless.

What about the following flags?

For backup:
--include-lnd-graph Takes a full backup of LND state and disables LND from restarting to preserve LND backup integrity

For restore:
--include-lnd-graph Restores full LND state backup, if available. Must only be used once and never with LND currently enabled

The restore should also refuse to restore LND if LND is currently enabled.
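A guard of this kind in a restore script might look like the sketch below. It is an assumption, not code from the PR: `lnd_restore_allowed` is a hypothetical name, and the caller is expected to pass in whether the graph restore was requested and the current value of BTCPAYGEN_LIGHTNING.

```shell
# Hypothetical guard: succeed only when restoring the LND graph is acceptable,
# i.e. the graph restore was not requested or LND is not currently enabled.
# $1 = "true" when the graph restore was requested
# $2 = current BTCPAYGEN_LIGHTNING value
lnd_restore_allowed() {
    local restore_graph="$1" lightning="$2"
    if [ "$restore_graph" = "true" ] && [ "$lightning" = "lnd" ]; then
        echo "LND is currently enabled; refusing to restore the LND graph data." >&2
        return 1
    fi
    return 0
}
```

For example, `lnd_restore_allowed true "$BTCPAYGEN_LIGHTNING" || exit 1` near the top of the restore would stop before any files are touched.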

@johnheenan
Author

johnheenan commented Sep 14, 2024

Also what about this additional flag for restore?

--reenable-lnd Reenable LND if it was disabled

With this flag for backup?
--disable-lnd Disable lnd if enabled and exit without further action

@johnheenan
Author

Also what about this additional flag for restore?

--reenable-lnd Reenable LND if it was disabled

With this flag for backup? --disable-lnd Disable lnd if enabled and exit without further action

I would really prefer to see both of these in a new separate command

@damanic

damanic commented Sep 14, 2024

@johnheenan

Everything restarts except LND if full LND backup is taken. Once LND restarts the backup is useless.

I think we need the option to include the extra LND data for migration purposes, and the option to not restart BTCPAY services when the backup context is migration. I don't see how anyone would want any of the btcpay services to restart immediately after a backup when the reason for backup is migration.

I think flags should be

--include-lnd-graph
[includes lnd graph data in volume backup]

--no-restart
[prevents btcpay-up from running after backup complete]

Then in a migration context I would run btcpay-backup.sh --include-lnd-graph=1 --no-restart=1 and know that my backup archive is ready to restore on another server without risk of missing data from services being restarted.
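Parsing those two flags could be sketched as below. This is an illustration rather than the code in the PR: the function name and the variables `INCLUDE_LND_GRAPH` and `NO_RESTART` are hypothetical, and accepting both the bare form and the `=1` form is an assumption.

```shell
# Sketch: parse the two proposed flags, accepting both bare and "=1" forms.
# Sets INCLUDE_LND_GRAPH and NO_RESTART to "true" or "false".
parse_backup_flags() {
    INCLUDE_LND_GRAPH=false
    NO_RESTART=false
    local arg
    for arg in "$@"; do
        case "$arg" in
            --include-lnd-graph|--include-lnd-graph=1) INCLUDE_LND_GRAPH=true ;;
            --no-restart|--no-restart=1) NO_RESTART=true ;;
            *) echo "Unknown option: $arg" >&2; return 1 ;;
        esac
    done
}
```

With no arguments both variables stay false, so the default backup behaviour is unchanged.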

As you rightly said before, this is not a migration script, it is on the admin to know what they are doing.

If we want to protect the user from start up of a migrated install then a migrate script that employs a method to indicate the server was shut down for migration and warns about restart would be an appropriate way to prevent btcpay from being restarted without proper consideration.

@johnheenan
Author

--no-restart
[prevents btcpay-up from running after backup complete]

I will put that in, but I will still disable LND, meaning that if there is a subsequent restart LND won't come up without additional action.

@johnheenan
Author

Changes made as per request.

Easiest way to view entire changes is from https://github.com/btcpayserver/btcpayserver-docker/pull/928/files

@ndeet
Contributor

ndeet commented Sep 17, 2024

Looks already very good 👏 I agree with @damanic that export BTCPAYGEN_LIGHTNING="none" probably is a bad idea. If we keep it, then we also need to explain how they can re-enable it, and be sure that setting the env to "none" won't delete any data when running btcpay-setup.sh again (it seems that nothing gets deleted, but I am not 100% confident yet).

So imho, not restarting all the services is fine. If people restart the server or manually start the docker containers again then we have to assume that they know what they are doing?

@dennisreimann would be interested in your opinion on that

@johnheenan
Author

johnheenan commented Sep 17, 2024

Looks already very good 👏 I agree with @damanic that export BTCPAYGEN_LIGHTNING="none" probably is a bad idea.

I think if export BTCPAYGEN_LIGHTNING="none" is removed then there should only be one option, such as --include-lnd-graph, and it must include not allowing btcpay-up to run at the end. If btcpay-up is run without disabling lnd then the backup is useless for migration.

@johnheenan
Author

johnheenan commented Sep 17, 2024

If someone is going to migrate lnd, shouldn't they know the consequences of messing it up by restarting without lnd disabled? However, who reads command output and warnings?

So which is better?

  1. Ensuring a backup for migration purposes will succeed by operators who are not aware of issues at the time and may become confused (when lnd is disabled)?
  2. Letting them risk losing funds, aggravating themselves and blaming others (when lnd is not disabled)?

I don't have strong views either way. Ultimately it is the operator's responsibility to know what they are doing.

Also, why can't we have a simple command to just enable or disable features without affecting files? Re-enabling lnd with export BTCPAYGEN_LIGHTNING="lnd" followed by . ./btcpay-setup.sh -i does alter files because it makes an unwanted request for updates. I have not noticed problems though.

@johnheenan
Author

If someone leaves btcpay down on old server then there are no problems.

It is common practice with web migration to restart old servers with a maintenance page showing. The equivalent practice with lnd means potential loss of funds. There isn't even a maintenance page option with BTCPay server. Maybe there should be.

If someone is not going to bother reading warnings then I would regard leaving lnd disabled as good practice, as it forces them to read something to fix the problem. If someone has read the warnings then they know what to expect and so know what to do or what to look up.

So, in summary, rather than a bad idea, I see disabling lnd as a good idea. But that is just an opinion and I really don't care. Whatever choice is taken it can be changed later on.

@damanic

damanic commented Sep 18, 2024

IMO warning and no restart should suffice for the backup script.

To add additional protections btcpay could choose not to document the --include-lnd-graph flag and make use of it in a documented btcpay-migrate.sh script that runs btcpay-backup.sh --include-lnd-graph=1 --no-restart=1 with additional steps to bring up a container that displays a holding/maintenance page with all other service containers disabled and prevented from restart unless manual intervention at filesystem level.

@johnheenan
Author

IMO warning and no restart should suffice for the backup script.

OK, changes made. Dire warnings show up if

  1. LND is currently enabled
  2. --include-lnd-graph is included
  3. --no-restart has NOT been included with above

Warnings given twice. Warning at end.

🚨🚨🚨 LND is currently enabled and has been restarted 🚨🚨🚨
🚨🚨🚨 You cannot restore from this backup anywhere as is!!!  🚨🚨🚨
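The three conditions listed above reduce to a single test. The sketch below is illustrative only: `backup_needs_warning` and its positional parameters are hypothetical, not names from the PR.

```shell
# Sketch of the three warning conditions; all names are illustrative.
# $1 = lightning implementation in use (e.g. "lnd")
# $2 = "true" if the LND graph was included in the backup
# $3 = "true" if --no-restart was given
backup_needs_warning() {
    [ "$1" = "lnd" ] && [ "$2" = "true" ] && [ "$3" != "true" ]
}
```

A backup script could then print the warning only when `backup_needs_warning "$BTCPAYGEN_LIGHTNING" "$include_graph" "$no_restart"` succeeds, where the last two variables are whatever the script uses to track its flags.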

4 participants