I appreciate all of your investigation!!!
@Support do you have any additional information or troubleshooting suggestions?
Right, I might add that running 20210302_1 up to block 1M and then switching over to the latest release works fine. The shard synced all the way up to the current block.
I am now running a new test with a full node to check all shards. I made a fresh install and am running the latest recommended script (How to setup your own node in a blink of an eye). It’s not done yet, but I can tell you it’s not looking good so far: I have blocks with errors on multiple shards. I’ll post again when all shards are done or stalled.
For those keeping score at home, I am still unable to get shards 0, 2 or 6 to fully sync on a validator. I see there’s a new tag 20210622_1, which I’m trying now.
Well … at least whatever has been updated fixed the slow sync issue I’ve been (casually) observing for about a week. After an update last week (20210617_1?), one of my pNodes slowed to a crawl on beacon/shard syncing. The pNode was literally in the middle of a sync when sync speed instantly dropped by ~75%. It was only syncing ~250,000 blocks per day, if that.
Then whatever change was pushed yesterday broke all my other pNodes, similar to what Devenus observed.
The update today (20210622_1) has restored syncing at a reasonable rate again. The pNode that suddenly couldn’t sync more than ~250,000 blocks in a day is already up to blockheight ~450,000 after a few hours. Last week that took nearly two full days.
Hopefully the beacon chain syncs will be caught up by tomorrow and I’ll be syncing assigned shard chains thereafter.
Or not.
So far today – one pNode has started resyncing from scratch … again. Another one has been stalled near the current blockheight for nearly an hour, and is now reporting offline in the Node Monitor. I expect it too will start resyncing from scratch – again – shortly. <SIGH>
update: Yep, the stalled one started over AGAIN.
So at least two nodes started a resync from 0 yesterday, synced up to the current blockheight, then inexplicably stalled near the current blockheight and have now started yet another resync from 0 in a ~24-hour period. RIP monthly ISP bandwidth cap.
On the new image and shard 0, I stalled earlier than normal at block 63902
VERY glad that they didn’t implement slashing yet. Any thoughts @support?
You are not alone) My two vNodes also hung at block 63902
Hey all,
I want to share my experience here. I stopped all of my Incognito dockers and followed the 3rd (infura account) and 4th (run.sh script) steps here (How to host a Virtual Node). My vNodes have run flawlessly (no stalls, no going offline) for at least 3 days.
Btw, run.sh may be wrong. Please fix it as written here: How to host a Virtual Node
@Josh_Hamon @zes333 My pNodes finally resynced the beacon chain (third time’s the charm, I guess) and have started syncing Shard 0. The Shard 0 blockheight for each is currently above 900,000.
These two are each on 20210622_1:
@abduraman Didn’t need to make changes to scripts or config parameters (not that I could even if I wanted to – these are pNodes).
¯\_(ツ)_/¯
@Mike_Wagner I agree with you. My experience report was not meant as an answer to the pNode concerns above; I posted here because the topic title says “… Multiple vNodes”.
Unfortunately, I don’t use such a script. I’ve created another script for each node.
If only I knew why
Have you been able to fully sync shards 0, 2 & 6?
I am using an infura account, but only 3 calls have been made to it.
I have no node syncing shard 0, but 2 and 6 are OK.
Yeah, I run multiple nodes on the same infura. I recently changed to a node.js script managing my validators, but I still have an old shell script that runs two nodes on the same machine. It looks a lot like your script, but yours is cleaner, because I didn’t think of looping through array keys instead of values.
```shell
# Excerpt from inside the enclosing start function (the opening brace
# and the variable definitions were not quoted in the original post).
i=0
for validator_key in "${validators[@]}"; do
  rpc_port=$(($first_rpc_port + i))
  node_port=$(($first_node_port + i))
  i=$((i+1))
  data_dir=${DATA_DIR}/node_$i

  echo "Starting inc_validator_$i container on $node_port (RPC $rpc_port)"
  set -x
  docker run --restart=always --net inc_net -p $node_port:$node_port -p $rpc_port:$rpc_port \
    -e NODE_PORT=$node_port -e RPC_PORT=$rpc_port -e BOOTNODE_IP=$bootnode \
    -e GETH_NAME=$geth_name -e GETH_PROTOCOL= -e GETH_PORT= -e FULLNODE= \
    -e MININGKEY=${validator_key} -e TESTNET=false -e LIMIT_FEE=1 \
    -v ${data_dir}:/data -d --name inc_validator_$i incognitochain/incognito-mainnet:${latest_tag}
  set +x
done
}
```
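Since looping over array keys instead of values came up, here is a minimal sketch of that pattern using bash's `"${!arr[@]}"` index expansion, which makes the manual counter unnecessary. The keys and port numbers below are made-up placeholders, not real values:

```shell
#!/usr/bin/env bash
# All values here are illustrative placeholders, not real keys or ports.
validators=("key_aaa" "key_bbb" "key_ccc")
first_rpc_port=9334
first_node_port=9433

# "${!validators[@]}" expands to the indices (0 1 2), so the port
# offsets fall out of the index directly, with no separate counter.
for i in "${!validators[@]}"; do
  rpc_port=$((first_rpc_port + i))
  node_port=$((first_node_port + i))
  echo "validator ${validators[$i]} -> node port $node_port, RPC port $rpc_port"
done
```

The index-based loop also keeps the data directory and port numbering consistent (the counter-based version numbers `data_dir` from 1 but ports from 0).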
Don’t forget to set the empty -e GETH_PROTOCOL= -e GETH_PORT=, because the image will append the default values if they’re not set at all, and you end up with http://https://mainnet.infura.io/v3/a1b2c3...:8545. It’s quite an ugly piece of code with no checks.
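Presumably the image assembles the endpoint by naive concatenation, applying defaults only to variables that are unset. This is a guessed reconstruction based on the mangled URL above; the variable handling is an assumption, not the actual entrypoint code:

```shell
#!/usr/bin/env bash
# Guessed reconstruction of the endpoint assembly; the logic is an
# assumption inferred from the doubled-protocol URL, not the real script.
unset GETH_PROTOCOL GETH_PORT                    # simulate "not set at all"
GETH_NAME="https://mainnet.infura.io/v3/a1b2c3"  # placeholder Infura URL

GETH_PROTOCOL="${GETH_PROTOCOL-http://}"  # default applied only when unset;
GETH_PORT="${GETH_PORT-8545}"             # a set-but-empty value is kept

endpoint="${GETH_PROTOCOL}${GETH_NAME}"
[ -n "$GETH_PORT" ] && endpoint="${endpoint}:${GETH_PORT}"
echo "$endpoint"
```

Passing -e GETH_PROTOCOL= -e GETH_PORT= makes both variables set-but-empty, so the ${VAR-default} expansions leave them empty and the full Infura URL goes through untouched.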
https://we.incognito.org/t/solved-vnode-sync-errors-since-version-20210313-3/12390/11?u=fredlee
(What’s your -itd for?)
I forked this from @mesquka, so I’m not 100% sure, but I think it might have been intended as -it -d. Per a quick search:
“docker run -it -d --name container_name image_name bash
The above command will create a new container with the specified name from the specified docker image. The container name is optional.”
Though I think combining all three options into one flag isn’t an issue here.
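For what it’s worth, -itd is just the bundled short options -i (keep STDIN open), -t (allocate a pseudo-TTY) and -d (detach), which docker parses like any standard bundled flags. A toy getopts sketch (not docker itself, just an illustration of how short-option bundling works):

```shell
#!/usr/bin/env bash
# Toy parser showing that -itd and -i -t -d parse identically under
# standard getopts short-option bundling. Illustrative only.
parse() {
  local OPTIND opt flags=""
  while getopts "itd" opt; do
    flags="${flags}${opt}"
  done
  echo "$flags"
}
parse -itd      # same result as below
parse -i -t -d
```

For a container that only needs to run detached, -d alone is typically enough; -i and -t matter when you later attach an interactive shell.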
I don’t have -e GETH_PROTOCOL= -e GETH_PORT= set, but will give that a try.
UPDATE: Still seeing if shard syncing will make it past the roadblock but with @fredlee’s change I’m already seeing calls to infura, so I’m hopeful.
UPDATE: On Shard 0 I got past the roadblock by adding the code suggested above. This is using the forked script I mentioned above and the image from 06/26. Later today I’ll try it with other nodes.
Nah -itd is probably no issue at all. I just wondered what it was for.
I found the cause of getting stuck at block 63902 using your script) If you are interested, ping me in a PM)
Now I’m stalling in different places:
But not every time