[Shipped] Network Monitor

To increase the stability of our decentralized network and to inform further decisions, we need to understand in depth how our consensus protocol works in the mainnet environment. With the help of Highway, we plan to investigate the current BFT flow, including:

  • BFT timing (propose, vote, commit) analytics
  • Block Producing & Block Fork analytics

In addition, we will provide stakers (who should be aware of the upcoming slashing feature) with information on how their nodes participate in the consensus flow:

  • Node Joining Committee Analytics
  • Node Voting Message Analytics

Length: 2 months

  • First month: collect data and pre-process messages into the analytics DB (supporting BFT timing)
  • Second month: API + basic Web App + support for other analytics

Resources: @0xkumi, @Corncob

10 Likes

Update progress:

  • Developed the tool to collect and process data; we can now view BFT timing for analysis.
  • We also found that Highway is not stable during high-traffic periods, and have tried to solve this problem.
  • Currently, we are visualizing the data in Grafana.
  • As a next step, we will build APIs for Node Consensus Analytics, including Joining History and Voting History.
5 Likes

Update:

  • Last month, we continued developing APIs that show Node Consensus Activity.
  • As a next step, we will integrate them into the explorer page in order to help users monitor their physical Nodes.
6 Likes

Update:

  • Users can now check their node status at https://monitor.mainnet.incognito.org/. To get the validator public key (miningpubkey), call the getmininginfo RPC on your running node and look for the “MiningPubkey” field.
  • The mobile version is currently being tested and will be released this month.
  • We will continue to add more components to the web page, such as a vote histogram, a get-mining-key tool, …
6 Likes

Works well and I noticed that my nodes seem offline probably due to the firewall :slight_smile:

Mine show offline too…

The “offline/online” status only works if you run our docker script, which sets up the monitor endpoint. The most important figure is the Vote Stat (a percentage). If you see Vote Stat = 0, you should check your Node. This information will be used in the Slashing mechanism.

How do pNode operators rectify this? 2 of my nodes are powered off because support hasn’t responded to requests to fix them, going back seven months now. A third is online, is fully synced per the beaconblockheight, and has a stable network connection.

Yet all three of those Nodes display Vote Stats = 0.

image

The third pNode above was definitely powered and online during the last committee period (epochs 3145-3147 on 26/3/2021).

3 Likes

For the third node, my guess is that at that time the Node hadn’t synced enough shard blocks to validate. To find the cause, can you click on the third node’s row and check which Chain ID it had at epochs 3146 and 3147? And what is the current sync state (block height) of that chain?

image

image

Looking at the history of the other community nodes in the current epoch (3269), I found only two that have a VoteCount value above 0. 2 out of 80. [EDIT: Nope, not out of the full 80. See post below.] That many community pNodes and vNodes can’t be misconfigured, sitting on poor internet connections, etc. That’s an abysmal rate.

image

Comparing one of those Nodes against mine posted above, I can see that more of the shards have been synced. Yet this Node looks to have finished up Epoch 3227, then stopped syncing a day later around Epoch 3232. As a result, it too is now reporting a VoteCount of 0.

I’m just trying to understand the sync status of Nodes, as the blockchain actually sees them since this will form the base factor for (the upcoming) slashing. As @doc and others have noted the past few months, as a community we’re concerned that our Nodes may not be performing correctly. These Nodes have been online and earning for 18 months without the proper technical feedback to diagnose the actual consensus performance.

For vNode operators, they will be able to resolve issues once definitively identified and instructions posted, since they have the necessary access to make configuration changes.

For pNode operators, they are at the mercy of the devs and/or support team. Other than supplying power and an internet connection, there isn’t much more a pNode operator can do to resolve technical and/or configuration problems. pNode operators don’t have ready access to the terminal without partial physical disassembly of the pNode. And even if they did – many operate pNodes so they don’t have to (re)configure docker or edit run.sh.

6 Likes

The results you are showing here are concerning, this was/is one of my biggest concerns… we are not actually sure how well the vNodes/pNodes are operating… there may be much more work to get done than expected if the reliability of vNode/pNodes isn’t as high as we were hoping.

Looking forward to seeing what comes from this, if anything this will provide the opportunity needed to make a much more robust blockchain once rectified… we just need to make sure the process of getting everything rectified is done with transparency, community input and that there are not any surprise launches/releases that unduly cause issues for validators/community/team.

5 Likes

I assumed the list of Nodes on the new monitor page displayed all community Nodes. I now believe it is showing just those Nodes which could potentially be slashed when those rules go live. So that initial assessment I posted is faulty.

The next (and current) epoch, 3270, only lists 46 nodes as part of the current committee. That is far fewer than the 80 community slots. Therefore some community Nodes are contributing to consensus via votes. It is still less than half, though. So that’s troubling.

Regardless – the crux of your concern and my post are still very much valid. ~1490 out of ~2800 Nodes are displayed on the Red List page. That would mean just over half of the current community Nodes would potentially be slashed.

The community needs time and, more importantly, clear and concise information to remedy this. This will be all the more important for pNode operators who do not have the same kind of access vNode operators have.

7 Likes

OK, WTF are the wheels about to pop off this mother#$#@$%…what the hell is going on…first we have an abandonment by some key players from the project, then we have a stoppage to the production of pNodes, and now we have an issue with the existing pNodes out in the public not running correctly and possibly facing slashing due to upcoming policy changes at Incognito…this is at the DEV team…guys this shit is getting out of control and you are going to be facing an implosion of this project if you do not get your shit together…apology to the rest of the community for my language but this crap is out of control… :exploding_head: :exploding_head: :rage: :rage: :face_with_symbols_over_mouth: :face_with_symbols_over_mouth:

4 Likes

I agree, what is going on… none of this sounds good…

3 Likes

Hey @0xkumi,

I’ve checked my nodes. Both ports are open, so there is no firewall issue. I even sent RPC commands and got the correct responses. However, my nodes are offline and their vote stats are 0. How can I find the problem? Before slashing is enabled, I think you should provide us with some tools/methods to find such problems.
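
For anyone wanting to repeat the port check independently, here is a minimal sketch (not an official Incognito tool) that tests whether a TCP port accepts connections. It assumes bash with /dev/tcp support and the coreutils timeout command; check_port is a hypothetical helper name, and NODE_IP and RPC_PORT are placeholders. It only tells you something about a firewall if it is run from a machine outside your own network.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "open" if a TCP connection to host:port
# succeeds within 3 seconds, otherwise prints "closed".
check_port() {
    local host="$1" port="$2"
    # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT;
    # the child shell exits right away, which closes the socket again.
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
    fi
}

# Usage (placeholders, replace with your node's address and ports):
# check_port NODE_IP RPC_PORT
```

An "open" result only shows that something is listening on that port; it does not show that the node is reporting to the monitor or voting.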

4 Likes

Mine appear as ‘Not staked’, running 20210406_1 and firewalls open.

getblockchaininfo RPC returns latest epoch.
getincognitopublickeyrole RPC returns the following nodes as staked, some pending.
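
These checks can be scripted. The following is a minimal sketch, assuming the node accepts the same JSON-RPC request shape as the getmininginfo curl call posted further down this thread. rpc_payload and rpc_call are hypothetical helper names, IP and RPC_PORT are placeholders, and any parameters a method needs are passed through as raw JSON rather than guessed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a JSON-RPC request body.
# $1 = method name, $2 = params as a raw JSON array (defaults to []).
rpc_payload() {
    local method="$1" params="${2:-[]}"
    printf '{"jsonrpc":"1.0","method":"%s","params":%s,"id":1}' "$method" "$params"
}

# Hypothetical helper: POST the request to a node and print the raw response.
# $1 = node URL, $2 = method name, $3 = optional params (raw JSON array).
rpc_call() {
    local url="$1" method="$2" params="${3:-[]}"
    curl -Ss --header "Content-Type: application/json" \
        --request POST \
        --data "$(rpc_payload "$method" "$params")" \
        "$url"
}

# Usage (placeholders, replace with your node's address and RPC port):
# rpc_call http://IP:RPC_PORT getblockchaininfo
```

Printing the raw response avoids assuming any field names; pipe it through jq to inspect the structure your node version actually returns.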

@0xkumi do we need to configure anything else on vNodes to send metrics to the monitor? The MONITOR env is set https://github.com/incognitochain/incognito-chain/blob/mainnet_20210406_1/bin/run_incognito.sh#L22

Also! Please can we use DNS and TLS for the monitoring endpoint? This is a privacy project so let’s not send unencrypted node metrics around the internet. What if the endpoint gets DDoSed and you need to switch IPs? It will break monitoring globally.

Screenshot 2021-04-17 at 11.05.16 am

Mine runs 20210122_5. Why is mine so outdated? Does auto-update have problems? How can I force my nodes to update themselves? @0xkumi

1 Like

I have had to manually update mine for the past two months… I don’t know what changed, but I would notice large spikes in my vNode operations, and logs showed error on some block syncs, so I would have to manually delete and restart block sync, which would eventually pull the new image. All of last year, the auto updater worked fine on my vNodes, but something has definitely changed that is causing issues with this on some vNodes (some of mine pulled update, some didn’t… it’s unclear on what the cause was)

here is my post on the topic:

1 Like

In relation to my previous post, I’ve managed to get the monitor working now. I was mistakenly using the keychain public key instead of the mining public key.
Screenshot 2021-04-20 at 9.31.42 am

curl -Ss --header "Content-Type: application/json" \
    --request POST \
    --data '{"jsonrpc":"1.0","method":"getmininginfo","params":[],"id":1}' \
    http://IP:RPC_PORT | jq .Result.MiningPublickey

The MiningPublickey only seems to be filled on vNodes running 20210406_1+.

My point on DNS and TLS for the monitoring endpoint still stands and would be much appreciated by the community :slight_smile:

1 Like