Sync Breaking Updates

It is becoming increasingly common for updates pushed to the network to break synced data. Some of these updates arrive as Docker updates; others come from other changes, such as the one in the past 12 hours. Either way, a validator is forced to resync the beacon data first and then its assigned shard chain.

Neither of these forced resyncs can be completed within a single epoch. An update in the last 12 hours has again forced all validators to resync their Nodes. While the resync happens automatically, the very real negative effect is the slashing of the community Nodes next in line for committee at the time of the update, plus those entering over the following 6-12 epochs. With their beacon and shard blockheights reset to 0, these Nodes are unable to participate in voting consensus and, per the network slashing protocol, are slashed at the end of their committee term.

With slashing active and the release of fixed nodes (decentralization) finally nearing, the network runs a very real risk of consensus failure. A breaking update (forced resync) will no longer affect only 10 Nodes per shard; it would affect all 32 Nodes per shard. Consensus requires at least 22 of the 32 votes, so with fewer than 22 Nodes per shard at the latest blockheight (1,530,000+), vote consensus will not be reached. Furthermore, the next 6-12 epochs' worth of incoming Nodes will all be in various states of resync and therefore unable to achieve consensus. And those of us around in early 2020 can recall what happens when not enough Nodes are available to reach vote consensus: the network stops.
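To make the numbers concrete, here is a minimal sketch of the vote-threshold math above, assuming the usual BFT-style rule that more than 2/3 of the committee must be able to sign; the exact rule the network uses may differ.

```go
package main

import "fmt"

// committeeSize and syncedNodes are illustrative figures from the scenario
// above; the 2/3+ threshold is an assumption, not the network's published rule.
func consensusReachable(committeeSize, syncedNodes int) bool {
	// A block vote passes only if more than 2/3 of the committee can sign it.
	required := committeeSize*2/3 + 1 // 32 -> 22
	return syncedNodes >= required
}

func main() {
	fmt.Println(consensusReachable(32, 32)) // true: everyone fully synced
	fmt.Println(consensusReachable(32, 21)) // false: fewer than 22 at the latest blockheight
	fmt.Println(consensusReachable(32, 0))  // false: every Node reset to blockheight 0
}
```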

While the network outage then was caused by a hosting failure taking some of the fixed nodes offline, the effect would be the same if every Node is forced to resync from 0.

Unless there are other mitigations I'm unaware of, it seems we are headed for a situation where a pushed update unintentionally shuts down the network until the ever-increasing average resync time for Validator Nodes has passed and a committee of fully synced Nodes can once again be formed.

The Nodes scheduled for committee within the next 6-12 epochs at the time of a breaking update should perhaps be insulated from these forced resyncs until they have left committee. A 6-12 epoch window (continually adjusted for the growing blockchain size) gives the Nodes that are not immediately next in line enough time to resync from 0 and enter committee with fully synced blockchains.
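A rough sketch of that deferral rule, under stated assumptions: the function names and the 12-epoch window below are hypothetical, intended only to illustrate the proposal, not how the node software actually works.

```go
package main

import "fmt"

// Hypothetical deferral rule sketched from the proposal above: a Node that is
// in committee, or close to entering it, postpones a sync-breaking update
// until it has left committee. The 12-epoch window is an assumption.
const deferralWindowEpochs = 12

func shouldDeferUpdate(inCommittee bool, epochsUntilCommittee int) bool {
	if inCommittee {
		return true // never wipe chain data mid-committee
	}
	// Too close to committee for a resync from 0 to finish in time.
	return epochsUntilCommittee <= deferralWindowEpochs
}

func main() {
	fmt.Println(shouldDeferUpdate(false, 3))  // true: next in line, defer the update
	fmt.Println(shouldDeferUpdate(false, 40)) // false: plenty of time to resync from 0
}
```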

If it's not already a priority, I suggest this issue receive immediate attention and resources. Besides the potential for a network stoppage, leaving it unaddressed means the wave of newly slashed Nodes will spike support requests after every one of these updates.


This only affected my pNodes, but it sounds like you are saying this affects vNodes too?

Other updates have affected both pNodes and vNodes. I'm grouping any kind of sync-breaking update here, not just the one that affected pNodes last night.

Hello @Mike_Wagner and community validators,

Regarding a new docker tag release for a chain code update, we fully understand your concern and the harm that could result if we get it wrong. We handle these releases very carefully, especially the recent ones, so let me describe our release process. A change that is to be released has to pass all of QC's tests on at least 2 environments, devnet and testnet. Once it's ready for release, we build a "candidate" docker tag and deploy it on our vNodes and a beta fullnode; that's why you usually see 2 additional docker tags for each release. If these nodes keep syncing and validating data from the network without errors, we publish a new release on GitHub along with an official docker tag on Docker Hub, which community nodes can pull to get themselves updated. We then wait a couple of days for community nodes to update before deploying to all fixed nodes.

Sometimes community nodes hit issues and stall; that's usually because of a code bug, not the release process. We believe the release process for a new docker tag is good enough for now, while the core team still holds the majority of slots in a committee. Unless it's a hotfix, we always follow the process above. Of course, each update has to be backward compatible with the current code via versioning and checkpoints.
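For readers unfamiliar with the versioning-and-checkpoint idea, here is a minimal sketch of the general pattern: new validation rules only activate from a chosen block height, so older blocks remain valid under the old rules. The checkpoint height and rule names are made up for this example and are not the chain's actual values.

```go
package main

import "fmt"

// Illustrative checkpoint-based compatibility: blocks below the checkpoint are
// validated with the old rules, blocks at or above it with the new rules.
// The height below is a placeholder, not a real network checkpoint.
const upgradeCheckpointHeight = 1_530_000

func rulesVersionFor(blockHeight uint64) string {
	if blockHeight < upgradeCheckpointHeight {
		return "v1" // historical blocks stay valid under the old rules
	}
	return "v2" // new rules apply only from the checkpoint onward
}

func main() {
	fmt.Println(rulesVersionFor(1_000_000)) // v1
	fmt.Println(rulesVersionFor(1_600_000)) // v2
}
```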

When we increase the committee size and the core team no longer holds the majority of slots in a committee (which also increases the decentralization of the network), the process has to change a bit. At that point, we need a mechanism for nodes in the network to know when a new update should be applied. For example, a node would check what percentage of nodes in a committee, or in the network, are running the new docker tag; if a majority of them are (say 90%), it's safe to apply the new update. Details will be published on the forum once we implement it.
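A rough sketch of that adoption check, under stated assumptions: the 90% threshold is the figure mentioned above, and the function and tag names are placeholders, since the mechanism has not been published yet.

```go
package main

import "fmt"

// Sketch of the adoption check described above: count how many peers report
// the new docker tag and only apply the update once adoption crosses the
// threshold. 90% is used here as an assumption taken from the post.
const adoptionThreshold = 0.90

func safeToApply(peerTags []string, newTag string) bool {
	if len(peerTags) == 0 {
		return false
	}
	running := 0
	for _, tag := range peerTags {
		if tag == newTag {
			running++
		}
	}
	return float64(running)/float64(len(peerTags)) >= adoptionThreshold
}

func main() {
	peers := []string{"v2", "v2", "v2", "v2", "v2", "v2", "v2", "v2", "v2", "v1"}
	fmt.Println(safeToApply(peers, "v2")) // true: 90% of peers run the new tag
}
```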

About the pNode firmware update (v2.0.0): we believe it helps fix the unexpected data-deletion issue that caused a pNode to sync from scratch on version 1.9.x. To be clear, the v2.0.0 update did not cause the issue; it helps fix it.

Lastly, agreed that it's painful for a node to have to wipe its data and resync from scratch. @jared proposed a great idea: a bootstrap that speeds things up by downloading chain data from a trusted source set up by the core team, instead of waiting the 6-12 epochs you mention for a resync. He has been working on this and will release it along with a guide very soon. Thanks again @jared!
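To illustrate the bootstrap idea only, here is a minimal sketch of downloading a chain-data snapshot from a trusted source before syncing the remaining blocks. The URL and file path are placeholders; the actual tool and guide have not been released yet.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// Hypothetical bootstrap: instead of resyncing from block 0, fetch a snapshot
// archive published by a trusted source, then sync only the remaining blocks.
// The URL below is a placeholder, not a real endpoint.
const snapshotURL = "https://example.org/incognito/chain-snapshot.tar.gz"

func downloadSnapshot(destPath string) error {
	resp, err := http.Get(snapshotURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("snapshot download failed: %s", resp.Status)
	}
	out, err := os.Create(destPath)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body) // stream the archive to disk
	return err
}

func main() {
	if err := downloadSnapshot("chain-snapshot.tar.gz"); err != nil {
		fmt.Println("bootstrap failed, falling back to full resync:", err)
	}
}
```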

I hope this explanation gives you some insight into and clarification of the release process, as well as our understanding of community validators' pain and how we plan to solve it.

Thank you!
