vNode becoming unstable and continuously rebooting

I have been running several vNodes for several months, but lately one of them stopped syncing the chain, and on closer inspection its docker container was just rebooting continuously. For simplicity, I deleted the data folder and restarted the vNode, and it seemed to sync back up fine.
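For anyone else hitting this, the resync I did was roughly the following. This is only a sketch: the container name inc_mainnet, the run.sh start script, and the ./data path are assumptions from the common vNode setup, so substitute whatever your own install uses.

```shell
# Stop and remove the stuck container
# (container name is an assumption -- check `sudo docker ps -a`)
sudo docker stop inc_mainnet
sudo docker rm inc_mainnet

# Delete the chain data so the node resyncs from scratch
# (path is an assumption -- use whatever your run script mounts as the data dir)
sudo rm -rf ./data

# Bring the node back up with the usual start script
sudo ./run.sh
```

After this the node re-downloads the chain from height 0, which can take a while depending on the shard.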

I have another vNode on the same machine that is starting to do the same thing. I opened the log file and saw this sequence being repeated over and over:

NumCPU 4
2021-01-09 01:31:02.296 incognito.go:95 [INF] Server log: Version 1.19.0-beta
2021-01-09 01:31:03.919 engine.go:173 [INF] Consensus log: CONSENSUS: NewConsensusEngine
encoded libp2p key: XXXX (I removed this key for the forum)
2021-01-09 01:31:03.935 host.go:88 [INF] Peerv2 log: selfPeer: XXXX (removed this key for the forum) 0.0.0.0 9436
2021-01-09 01:31:06.616 blockchain.go:122 [INF] BlockChain log: Init Beacon View height 936798
2021-01-09 01:31:07.967 blockchain.go:143 [INF] BlockChain log: Init Shard View shardID 0, height 925168
2021-01-09 01:31:09.178 blockchain.go:143 [INF] BlockChain log: Init Shard View shardID 1, height 841303
2021-01-09 01:31:10.281 blockchain.go:143 [INF] BlockChain log: Init Shard View shardID 2, height 290958
2021-01-09 01:31:11.454 blockchain.go:143 [INF] BlockChain log: Init Shard View shardID 3, height 826041
2021-01-09 01:31:12.756 blockchain.go:143 [INF] BlockChain log: Init Shard View shardID 4, height 907671
block  <nil>
2021-01-09 01:31:12.757 incognito.go:154 [WRN] Server log: Gracefully shutting down the btc database...
2021-01-09 01:31:12.769 panic.go:679 [WRN] Server log: Shutdown complete

Any idea what would be causing the “block <nil>” and/or how to solve this?

I was experiencing constant restarts when trying to run 2 vNodes on an underpowered system (4 vCPU + 8 GB RAM). The docker container restarted more or less regularly, every 8 hours or so. After I stopped the 2nd node and moved it to its own VPS, everything was back to normal…

Not sure if it’s the same issue, as I wasn’t able to capture the logs for some reason. But did you change anything in your setup, like running additional services? Can you monitor your node through Grafana or something similar to look for spikes?

Thanks for the reply @scooter …

I have not changed my setup, and these nodes had been running for about 20 weeks without issue until last week. They have been earning, syncing, etc. with no problems; I have even had both nodes earning at the same time on the same server without issue.

I think they both failed this past week when they got selected and had to sync to a different height/shard than the one they were currently on. Both nodes froze on the “Incognito Tool” website in the “Pending” role; that is when I noticed they had stopped syncing the chain, and when I looked at the server I could see the docker instance rebooting, with the log output shown above…

The first node was fixed by deleting the data folder and rebooting the server, and I am about to do the same with this other instance. Another piece of info: this particular vNode had been selected very rarely, actually earning on average about the same as my pNode (which has the Incognito stake)… but over the past 2 weeks it got selected 4 times, and it was this very last time that it got borked and started restarting.

I don’t disagree it could be a hardware issue, but I would lean towards some sort of hardware failure rather than a lack of resources, since they have both been working fine for so long, and I have another server with the exact same setup that has had no issues (and has had both vNodes selected and syncing at the same time, multiple times).

If the data folder deletion and reboot doesn’t fix it, then I will post more, but it also might be something the Incognito team would be interested in keeping tabs on, in case they see/hear of others having the issue too.

Hey @doc, in case you were syncing data, please make sure the node is running with the latest docker image (tag: 20210106_1). Thanks.

Hey @duc, it looks like my vNodes on this server are running version 20201225_1…

Of note, my other vNode server is running version 20210106_1.

Not sure why this server did not pull the update. Is there a command for me to force the update?

I can do the process laid out here:

Unless there are any tests you would like me to run first to see why it didn’t pull the update.

Also, after deleting the data folder and restarting the server, both vNodes are syncing the chain, and the one that failed last time was actually selected for committee and went through all phases without issue this time.

I went ahead and deleted both data folders and restarted using the process outlined above by @Peter. It looks like both vNodes have pulled the new image and are syncing the chain. I am still not sure why this specific server didn’t get the update while the other server did. If anyone is aware of something I might have misconfigured, or something to check to ensure this doesn’t happen again, I am open to any suggestions.
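For anyone following along, forcing the image update looked roughly like this. This is a sketch only: the container name inc_mainnet and the image name incognitochain/incognito-mainnet are assumptions based on the common vNode setup (the tags are the versions mentioned in this thread), so check your own run script for the exact names.

```shell
# Stop and remove the container running the stale image
# (container name is an assumption -- check `sudo docker ps -a`)
sudo docker stop inc_mainnet && sudo docker rm inc_mainnet

# Remove the old image so the start script cannot reuse it
# (image name and tag are assumptions from the versions in this thread)
sudo docker rmi incognitochain/incognito-mainnet:20201225_1

# Explicitly pull the tag duc mentioned, then start the node again
sudo docker pull incognitochain/incognito-mainnet:20210106_1
sudo ./run.sh
```

If the run script pins a specific tag, also make sure that tag matches the one you pulled, otherwise docker will happily fetch the old version again.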
