Node stability issues

@Support, I have noticed that if a node is healthy, online, and on the latest code, it generally stays that way barring an event that would cause an issue. However, if the Docker container is stopped, the node doesn’t always come back healthy when it comes back online, for instance after a reboot, a code upgrade, or anything of the sort. 5-10% of my nodes will not properly come back up when they are started again. Sometimes they just stall, and a stop and start fixes the issue, which is not as big a problem. Sometimes, however, they will not come back online at all. When checking docker ps -a I can see that the container is restarting endlessly.
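
For reference, this is how the restart loop shows up for me (the status filter assumes a reasonably recent Docker CLI):

    # list containers that Docker is currently trying to restart
    docker ps --filter "status=restarting"

    # show the last output from a stuck container
    # (replace <container-name> with your node container's actual name)
    docker logs --tail 50 <container-name>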

The only fix for this in the end is to delete the shards (sometimes the beacon too, but usually not) and then bootstrap the node.

This has become quite frustrating. I confirmed with @Jared recently that I am not alone in this issue.

Is anyone else having this problem?

Is there something that can be done to get more stability?

It seems strange to me that nodes have to be bootstrapped just because they have been stopped and started.

Would love a solution to this problem. I posted this on the forum instead of sending support a private message in case others are having this problem or know of a solution.

To clarify, the only time I see an issue personally is when I use the hard-link script.

I believe this happens because some data is still being written to the drive, possibly from RAM (depending on system configuration).

The most important thing is to make sure your node shuts down/restarts safely and that docker is given enough time to stop the nodes and finish writing data to disk.
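
For example, a minimal version of that, assuming you stop containers with docker stop (the 120-second grace period is just an illustrative value, not an official recommendation):

    # give the node up to 120 seconds to exit cleanly before Docker
    # falls back to SIGKILL (the default grace period is only 10 seconds)
    docker stop -t 120 <container-name>

    # flush any remaining filesystem buffers to disk before rebooting
    sync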

I have been looking into a way to delete the X newest LevelDB (.ldb) files to quickly fix a stall, but I’m not sure whether that is even possible. If it is, it would be drastically faster than a full bootstrap.
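
Purely as an untested sketch of what I mean (the shard path below is a placeholder, and I would only list the files before deleting anything):

    # placeholder path: point this at the affected shard's LevelDB directory
    SHARD_DIR=/path/to/node-data/block/shard0

    # show the 5 newest .ldb files (most recently modified first) without touching them
    ls -t "$SHARD_DIR"/*.ldb | head -n 5

    # if deleting them turns out to be safe, something like this would remove them (at your own risk):
    # ls -t "$SHARD_DIR"/*.ldb | head -n 5 | xargs rm --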

Well, I definitely saw this before I started using the hard-link script; I just see it more often now because my nodes get stopped and started on a daily basis. I was thinking it might only appear tied to the hard-link script because of all that stopping and starting; I am not sure it is related to the hard-link process itself. Or maybe it is… not sure.

But I definitely saw this problem occasionally before I started using the hard-link script, for example when I had to reboot the server or when code upgrades were done.

It sounds like we need a command to tell the node to …

  1. stop writing to the disk and…
  2. THEN call the docker container stop to stop the container

Basically… a safer shutdown procedure than just using the container stop command.

I’ve seen this issue. The best way to prevent it is to docker container stop all of the nodes, wait a bit, and then restart them. I haven’t had an issue with the hard-link script, but perhaps the script writer could implement a delay if it’s happening to you. My issues were much more frequent when I’d have an Internet outage, even one lasting a few seconds. The nodes cannot recover if they are writing an ldb file and ANYTHING interrupts it.
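
Something along these lines, where the inc_mainnet name filter is just an assumption about how your node containers are named (adjust it to match yours):

    # stop every node container at once (adjust the name filter to your setup)
    docker stop $(docker ps -q --filter "name=inc_mainnet")

    # give the disk a minute to settle before bringing the nodes back
    sleep 60

    # start them again
    docker start $(docker ps -aq --filter "name=inc_mainnet")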

A checksum that can delete corrupted files and redownload would fix this issue, but the devs have so far turned a blind eye to this feature request.

It sounds like there should be a function call to shut down a node other than the docker container stop command. A command like “shutdown node XXXX” that would first wait until the node is no longer writing to an ldb file (or somehow safely stop it from doing so) and then in turn run the docker stop command for that node. This would be a SAFE way to shut node containers down.

@Jared what do you think of that idea?

This functionality already exists.

@Jared, what’s an example of the command?

docker stop <container-name> will gracefully shut down a node container; you have to wait for the command to complete.
If your server is running multiple containers, you should stop them one by one.
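
For example, assuming the node containers share a common name prefix (inc_mainnet here is only a placeholder):

    # stop node containers one at a time; docker stop blocks until each one
    # has exited or the grace period (120 seconds here) runs out
    for c in $(docker ps --format '{{.Names}}' --filter "name=inc_mainnet"); do
        docker stop -t 120 "$c"
    done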

In my experience, to prevent disk I/O issues, each server should run at most 4 nodes.

It depends on the disk you are using. I’ve got 7 running on a SATA SSD, and I’m about to spin up 3 more for a total of 10. Hoping it works out. If you have a fast NVMe SSD, I imagine you’d run out of disk space before you run out of I/O.

This is correct. You can run many, many more vNodes on one disk as long as it’s a good NVMe SSD.

If you run out of disk space, use this community-made script: https://github.com/J053Fabi0/Duplicated-files-cleaner-Incognito

Just an update: it turns out the SATA SSD can only reliably run around 5 nodes. That disk is also running the host OS and several other VMs, so you might be able to squeeze in a few more if you are running bare metal. I bought the fastest PCIe 4.0 NVMe consumer SSD you can get, and it is running the 10 nodes with close to 100% vote count. In other words, I’m looking forward to spinning up many more nodes.
