vNodes High CPU, Disk and Network Traffic

Has anything changed lately with the validators? My vNodes seem to be running 20210122_5, and I re-created this instance about 5 weeks ago (according to docker ps). I had to recreate because for some reason they stopped syncing the chain after months of use.

Over the last few days, I began seeing alerts from my vNode Slack bot about IP address issues. I am still able to connect to the servers and no IP address has changed, but the vNodes are now using far more CPU, disk I/O and network traffic than I have seen previously. Normally while idle they would use <5% CPU. I am currently seeing:

https://gyazo.com/0e8140cc7efb3a1f6b16a915c9abbf41

After rebooting, I am also no longer able to reach the RPC interface (connection refused). I can connect to the server running the vNodes and can see they are using CPU and disk I/O, but I cannot query them via RPC, so I am not sure whether they are syncing the chain right now or something else has gone wrong.
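A quick way to tell "connection refused" apart from a node that is listening but slow is to test the RPC port at the TCP level. This is only a minimal sketch: the host and port values in the usage part are placeholders (assumptions, not taken from this thread), so substitute whatever RPC ports your own vNodes were started with.

```python
import socket


def rpc_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical host/port pairs; replace with your own vNode RPC ports.
    for host, port in [("127.0.0.1", 9334), ("127.0.0.1", 9335)]:
        state = "open" if rpc_port_open(host, port) else "refused/unreachable"
        print(f"{host}:{port} {state}")
```

If the port is closed while the container is clearly busy, the process has probably not (re)opened its RPC listener, which is a different situation from an RPC call that merely times out.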

Again, these have been working normally for 5 weeks, and for months before that. I wanted to touch base and see if anyone else is having issues and/or could point me to some troubleshooting tips, as at this point I am stuck and will probably take them offline soon, since they seem to be in some sort of runaway state on CPU, disk I/O and network traffic.

This is old. You should be running 20210320_2. Try updating and see if you still have issues with high utilization.
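To check whether a running image is behind, the tags can be compared directly. This sketch assumes the tag format is a release date followed by a build counter (YYYYMMDD_N), which is inferred only from the two tags mentioned in this thread; the function names are made up for illustration.

```python
from datetime import datetime


def parse_tag(image_or_tag: str) -> tuple:
    """Split 'repo/name:YYYYMMDD_N' (or just 'YYYYMMDD_N') into (date, build)."""
    tag = image_or_tag.rsplit(":", 1)[-1]
    date_part, _, build_part = tag.partition("_")
    # strptime raises ValueError if the date part is not a valid YYYYMMDD date.
    release_date = datetime.strptime(date_part, "%Y%m%d").date()
    return (release_date, int(build_part))


def is_outdated(running: str, latest: str) -> bool:
    """Return True if the running tag is older than the latest tag."""
    return parse_tag(running) < parse_tag(latest)


if __name__ == "__main__":
    print(is_outdated("20210122_5", "incognitochain/incognito-mainnet:20210320_2"))
```

The running tag itself is visible in the IMAGE column of docker ps.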

Have you also been updating the server instance itself?

Thanks @Jared … Didn’t the vNode automatically update the validator code? I am not sure why it is no longer pulling these updates.

I am forcing updates now, and will see how this plays out.

I think it is supposed to periodically pull new updates on its own. Whether it is or not I can’t say because I routinely force check mine just to be sure.

I wanted to report back that after deleting the data folder and forcing a resync of the chain, the vNodes are back to the normal usage levels I have seen in the past. I am not sure why my vNodes are not pulling updated images when they are released, or why, after a few weeks (months?) of operation, they hit some sort of syncing error and stop working. This last error was the worst, as the CPU and disk I/O usage was abnormally high.

I am not sure who on the core team might find this information useful, so I will just ping @Peter to make sure they are at least aware of the issue.

As a side note, my pNode seems to be having a similar issue. I had to force restart a few times, and it eventually started allowing RPC calls again, and it now looks like it is syncing the chain again.

Hopefully the new tools being released will help identify these issues and let me validate that my setup is working properly and that the issue isn't in one of my personal configurations.

Same here. Well except for allowing RPC calls again and syncing the chain. Guess more power cycles are on the agenda. :hammer_and_wrench: :electric_plug: :hammer_and_wrench: :electric_plug:

I just want to make sure there is sufficient time between the tools being released and slashing going live. I also hope they release a script that spins up multiple vNodes on one machine, preconfigured with all the settings that need to be supported, so there is no more confusion about whether ETH clients are needed, how to set up the virtual network for the docker instances, and which ports need to be changed and/or forwarded. This information is not exactly clear for the different use cases that exist:

multiple vNodes on same machine
multiple servers on 1 network
multiple vNodes on multiple servers on 1 network

Hi @doc

Recently, some users encountered a vNode syncing problem after we deployed the BFT v2 round-robin block producer.
If the log shows that your node is one of them, please stop the container, delete the data, and then start the container again to reset it.

Sorry for the inconvenience.
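For anyone scripting the stop / delete data / start sequence described above, here is a rough sketch. It is not an official tool: the container name and data directory are whatever your own setup uses, and the function defaults to a dry run that only prints what it would do. It empties the data directory rather than removing it, on the assumption that the directory is mounted into the container.

```python
import shutil
import subprocess
from pathlib import Path


def reset_vnode(container: str, data_dir: str, dry_run: bool = True) -> list:
    """Stop the container, empty its data directory, then start it again."""

    def clear_data():
        # Remove the directory's contents but keep the directory itself,
        # so an existing volume mount still points at a valid path.
        for child in Path(data_dir).iterdir():
            if child.is_dir() and not child.is_symlink():
                shutil.rmtree(child)
            else:
                child.unlink()

    steps = [
        (f"docker stop {container}",
         lambda: subprocess.run(["docker", "stop", container], check=True)),
        (f"clear {data_dir}", clear_data),
        (f"docker start {container}",
         lambda: subprocess.run(["docker", "start", container], check=True)),
    ]
    performed = []
    for label, action in steps:
        print(("would run: " if dry_run else "running: ") + label)
        if not dry_run:
            action()
        performed.append(label)
    return performed
```

Calling reset_vnode(name, path) with your own container name and data path prints the three steps; pass dry_run=False only once the printed paths look right, since the data is deleted for good and the node has to resync the chain from scratch.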

@khanhj one server (with 2 vNodes running) resolved itself on the resync.

The other has not. On the server causing issues I have two vNodes running: one is working as expected, while the other is repeatedly writing errors (see below) to the error.log

2021-03-30T17:52:15.158Z        ERROR        basichost       failed to resolve local interface addresses     {"error": "route ip+net: netlinkrib: too many open files"}
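"too many open files" is the operating system's message for a process that has hit its open file descriptor limit. To see how often this (or any other error) is being logged, the log can be tallied with a short script. This sketch assumes each line looks like the one above (timestamp, level, logger, message, then a JSON object, separated by tabs or runs of spaces) and strips the terminal color codes around the level; the function names are illustrative only.

```python
import json
import re
from collections import Counter

# Matches color codes both as a real ESC byte and as the literal "^[" text.
ANSI = re.compile(r"(?:\x1b|\^\[)\[[0-9;]*m")
FIELD_SEP = re.compile(r"\t+| {2,}")


def parse_line(line: str):
    """Parse one log line into a dict, or return None if it does not fit."""
    clean = ANSI.sub("", line).strip()
    parts = FIELD_SEP.split(clean, maxsplit=4)
    if len(parts) < 4:
        return None
    fields = {}
    if len(parts) == 5:
        try:
            fields = json.loads(parts[4])
        except json.JSONDecodeError:
            fields = {"raw": parts[4]}
        if not isinstance(fields, dict):
            fields = {"raw": parts[4]}
    return {"time": parts[0], "level": parts[1], "logger": parts[2],
            "message": parts[3], "fields": fields}


def count_errors(lines) -> Counter:
    """Count ERROR lines, grouped by (logger, error text)."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec["level"] == "ERROR":
            key = (rec["logger"], rec["fields"].get("error", rec["message"]))
            counts[key] += 1
    return counts
```

Usage would be something like count_errors(open(path, errors="replace")).most_common(5), with path pointing at the problem vNode's error.log.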

I am seeing the high utilization again, but not as bad (since it's only 1 vNode on the machine that is having issues).

So in summary:
I have 2 servers, each running 2 vNodes
1 server is running as expected
1 server has 1 vNode running as expected, and another that is throwing errors, showing high utilization and no longer responding to RPC requests.

Log files in the data folder for the problem vNode are in the 1-2 GB range.
Log files on the properly functioning vNode were in the 1-2 GB range (I assume this was while syncing the chain) but are now in the 5 MB range.
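Since log size seems to track the problem, a small script can flag a vNode whose logs have ballooned. This is a sketch only: it assumes the logs are *.log files directly inside a folder you pass in, and the 500 MB threshold is an arbitrary value picked for illustration.

```python
from pathlib import Path


def log_sizes_mb(data_dir: str) -> dict:
    """Map each *.log file in data_dir to its size in MB, largest first."""
    sizes = {p.name: p.stat().st_size / 1_000_000
             for p in Path(data_dir).glob("*.log")}
    return dict(sorted(sizes.items(), key=lambda kv: kv[1], reverse=True))


def oversized_logs(data_dir: str, threshold_mb: float = 500.0) -> list:
    """Return the names of log files larger than threshold_mb."""
    return [name for name, mb in log_sizes_mb(data_dir).items()
            if mb > threshold_mb]
```

Running oversized_logs against each vNode's log folder from cron would give an early warning before CPU and disk I/O climb again.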

Currently running image:
incognitochain/incognito-mainnet:20210320_2

Here is a GIF of the sudo docker stats output:

You can see the change in utilization after I kill the misbehaving vNode:

Hey @doc, wow, thank you for the visuals you included in your last post. I can definitely see what you are describing: mainet_0 looks like it's in overdrive. That is really weird; hopefully the dev team will figure this out. By the way, have you had any similar issue with a pNode? I believe @Mike_Wagner said he had the same or a similar issue with one. :sunglasses: :thinking:

I believe it is a similar issue, but I don't have the same visibility into the pNode to see this level of stats and resource usage. At this point only the dev team could look at it to know for sure, but since both use the same validator code, I would assume it's the same issue.

Hey @doc, good to hear from you, hope all is well bro. :grinning: As for the pNode issue, I hear you; perhaps @Mike_Wagner has more info or observations from his own pNodes. Thank you, bro, for making us aware of this issue, by the way. :sunglasses:

Hi @doc, it would be great if you could give us the latest log file of the problem vNode. The file can be retrieved from /data/inc_mainnet_1/<datetime>.log

@0xkumi and @khanhj here is a link where you can download the latest log file, which is smaller because I killed the vNode.
