Shard stall

My pNode was stalling as well, and it’s very likely because it picked up the new docker tag which doesn’t like the old beacon/shard data (pre staking v2). Visiting:

http://[pNode IP]:5000/restart-node?delete-data=1&qrcode=[code from bottom of pNode]

cleared the old data and it’s now re-syncing with the new v2 flow version.

The same thing happened yesterday for vNodes (see “[Solved] Shards sync stalling for multiple nodes”): updating to the new code broke shard syncing, so I had to clear everything and start the beacon/shard sync again. All good now.

Thanks for answering this @adrian, I guess all the problems may be solved by your instructions :+1:

For those who don’t know yet, please have a look at the topic, especially @Devenus’s comment. The new firmware for pNode supports some functions that may help node operators manage their nodes more easily, thanks!

@duc my vNodes are run by jservers. They updated their code this weekend, so I’m not sure where the shard stall is coming from.

If you are on a vNode, it is even easier. Whether you run it with Docker or build from source, please make sure your node is running the latest code as described in the topic (I guess you’ve looked into it already).
Secondly, make sure you’ve cleared the node’s data and re-synced from scratch.

The shard stalls because the node ran the old code and accidentally produced data that is not compatible with the Incognito chain’s data. That’s why I recommended that node owners with sync issues clear their data and then re-sync with the latest code.

This was totally my fault for not informing the community about such a big upgrade in a timely manner. That’s a lesson learned for me, sorry.


I’m optimistic. Even though my node with the shard 0 stall just finished a fresh sync this weekend, I went ahead and deleted the shard data (again) and it’s currently syncing from scratch (again). Not quite halfway through the beacon block data, after ~4.5 hours. So still have some time before I’ll know if it can sync past block 169,583 on shard 0.

In the meantime, this did give me the opportunity to do a fun write-up about the sync behavior of Nodes.

Just read your topic, this is really awesome. To be honest, the core dev team has been lacking such a neat explanation for the community. Our posts are probably quite technical, aren’t they?

As a pNode owner I have no ability to check code; I plug and play. What are my options for dealing with the stalled status? It has stalled the last two times it made committee. I am running the latest version on both pNodes, so I don’t know why I am stalled.

@duc
So in viewing thriftinkid’s nodes we do not see any external issue from the docker commands. If we don’t have the public validator key (BLS) then we are not able to see this stall issue. How can the stall issue be identified without going into the monitoring tool?

Next question: if the node is in pending but stalled, will the node fix the stall issue and go into committee, or just stay stalled?

Which then brings up another question: if it’s stalled in the pending state, and the vNode is stopped, the data cleared, and the node restarted, will that cause it to fall out of pending?

Thanks

You could try deleting the shard data on the pNode. You’ll need to know the ID of your pNode (it’s the text of the QR code sticker on the bottom of the pNode) and the IP address of the pNode on your home network.

You would then edit the following URL where noted:

http://<YOUR NODE IP>:5000/restart-node?delete-data=1&qrcode=<QR CODE OF NODE>

This will remove all synced data and your pNode will sync from scratch. How long that takes depends heavily on the speed of your internet link, but expect the pNode to need a while to fully sync the beacon chain. My pNode that is syncing from scratch right now looks to complete the beacon sync in probably another 6 hours or so. Then it will need another 8? 9? 10? hours to (hopefully) fully sync the chain for shard 0 (that’s the shard it’s currently Pending for).
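If it helps, here is a minimal Python sketch of filling in that URL. It only builds the string; it does not contact the pNode. The function name `build_reset_url` and the sample IP and QR code values are made up for illustration. The port, path, and query parameters are copied from the URL above.

```python
from urllib.parse import urlencode


def build_reset_url(node_ip: str, qr_code: str, port: int = 5000) -> str:
    """Return the restart-and-delete-data URL for a pNode.

    node_ip: the pNode's IP address on your home network.
    qr_code: the text of the QR code sticker on the bottom of the pNode.
    """
    # urlencode escapes any characters that are not safe in a query string.
    query = urlencode({"delete-data": 1, "qrcode": qr_code})
    return f"http://{node_ip}:{port}/restart-node?{query}"


# Placeholder values only; substitute your own IP and QR code text.
print(build_reset_url("192.168.1.50", "EXAMPLE-QR-CODE"))
```

Opening the printed URL in a browser on the same network is what actually triggers the reset, so double-check the values first.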

As I noted above, your pNode will be active in committee during Epochs 3476 and 3477. The current epoch is 3469. That’s a difference of 6 epochs or ~24 hours. That should be just enough time for your pNode to fully sync both the beacon chain and assigned shard chain if you delete the chain info right now. Even if it doesn’t fully sync by the time it enters committee, you’ll still be awarded block rewards. And if it completes the sync during the first epoch, you should see successful votes tallied in the second epoch.
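For anyone who wants to redo that estimate for their own node, here is a rough Python sketch of the arithmetic. It counts only the full epochs between the current epoch and the first committee epoch (3470 through 3475 in the example), and it assumes about 4 hours per epoch, which is simply what "6 epochs or ~24 hours" implies. The function name is hypothetical.

```python
# Assumption: roughly 4 hours per epoch, inferred from "6 epochs or ~24 hours".
HOURS_PER_EPOCH = 4


def hours_until_committee(current_epoch: int, first_committee_epoch: int,
                          hours_per_epoch: float = HOURS_PER_EPOCH) -> float:
    """Estimate the hours left before the node enters committee.

    Only full epochs strictly between the two epochs are counted; whatever
    remains of the current epoch is ignored, so this is a slight underestimate.
    """
    full_epochs = max(first_committee_epoch - current_epoch - 1, 0)
    return full_epochs * hours_per_epoch


print(hours_until_committee(3469, 3476))  # 6 full epochs * 4 hours = 24
```

Comparing that number with your own expected beacon plus shard sync time gives a quick idea of whether a fresh sync can finish before committee.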

Right now, stalled nodes still enter committee. They just do not tally votes in the Node Monitor tool. When slashing is enabled, this will be important because Nodes that don’t maintain a vote percentage of 50% during committee will be slashed. But for now, whether a Node is stalled does not affect its ability to enter committee. I’ve personally had several stalled Nodes in committee without issue.
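To make the 50% figure concrete, here is a small Python sketch of the check. This is only an illustration of the rule as stated above, not how the chain actually implements slashing; the function names and the idea of counting "votes tallied" against "votes expected" are assumptions.

```python
# Threshold taken from the post above; everything else is illustrative.
SLASHING_THRESHOLD_PERCENT = 50.0


def vote_percentage(votes_tallied: int, votes_expected: int) -> float:
    """Share of expected votes the node actually cast, as a percentage."""
    if votes_expected <= 0:
        raise ValueError("votes_expected must be positive")
    return 100.0 * votes_tallied / votes_expected


def would_be_slashed(votes_tallied: int, votes_expected: int,
                     threshold: float = SLASHING_THRESHOLD_PERCENT) -> bool:
    """True if the vote percentage falls below the threshold."""
    return vote_percentage(votes_tallied, votes_expected) < threshold


# A stalled node that tallies no votes would fall below the threshold.
print(would_be_slashed(0, 350))    # True
print(would_be_slashed(200, 350))  # False
```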

The node wouldn’t fall out of pending. The pNode I used for this post has been in Pending since early this morning. I deleted the shard chain info this afternoon and started a new sync from scratch. It’s still syncing and still in Pending.


Hey @duc as @fitz_fiat said, I now have 4 nodes pending that are all stalling. I’m still a bit confused as to why resetting them would fix the stalling issue if he reset them all after you guys pushed the update. Did you push another one after that? They should all be running on the newest code, and there should be no issue then.

Shard 6 909062 16 hours ago (stalling)

Shard 6 909062 11 hours ago (stalling)

Shard 6 1003289 stalling

Shard 2 433077 a day ago (stalling)


Well … some 20 hours later and my pNode has (again) fully synced the beacon chain and has (again) stalled on block 169583 in Shard 0.

This shard stall is not due to a mix of old shard data and the new flow code, at least in my case. Something else is preventing the sync or block insert at that height for Shard 0’s chain.

I have multiple nodes stalled on block 169583 in Shard 0. All data was recently refreshed.

Also, on the Infura dashboard, it still says zero requests but I have the key attached to multiple nodes. Any ideas?


Check your Node. My stall … unstalled for Shard 2 sometime in the last few hours. I did NOT delete chain info on this Node either.

I want to see if I can get clarification, just to make sure I am understanding this correctly and didn’t miss a comment in one of the many posts about the updates for slashing and pNodes.

My current understanding is that slashing goes live in June, and if your node (p or v) doesn’t meaningfully contribute to committee it will get slashed and unstaked.

Does this mean that pNodes will get unstaked and not have the ability to get the funded stake back?

I definitely understand the need for slashing, but I am a bit concerned that with the random selection process, there really isn’t enough time to test (and retest) if a stall/sync issue is fixed/resolved/still exists on pNodes… especially if you have one of those 30+ day dry spells waiting for selection…

This isn’t a huge deal for vNodes, as you can just restake… and probably not a huge deal for pNodes if you can get the funded stake back, but I just wanted to get clarity on this… as I haven’t found the specific answer… which I could have definitely overlooked in the multiple posts I have read about getting pNodes set up with the new firmware, node monitor, etc.


Great question. I have been wondering this myself.


Thank you @doc for asking the question… I have 3 pNodes myself, all with funded stake at this time, and I wonder what will happen to the funded stake on them if they get slashed… will it be reinstated, since I took no action to unstake them to begin with? :sunglasses:
