(Kind of solved) Problem starting multiple nodes at the same time

@Support

Not sure what happened, but all my nodes that should have been in committee stopped working. Checking the logs, I only see the following information.

2022-04-13 14:13:19.269 incognito.go:107 [INF] Server log: Version 1.19.0-beta
2022-04-13 14:13:22.107 engine.go:182 [INF] Consensus log: CONSENSUS: NewConsensusEngine
encoded libp2p key: CAMSeTB3...
2022-04-13 14:13:22.128 host.go:88 [INF] Peerv2 log: selfPeer: QmWD5nss... 0.0.0.0 9500
2022-04-13 14:13:22.129 server.go:297 [INF] Server log: [newpeerv2] backup discovery peer [mainnet-bootnode.incognito.org:9330]
2022-04-13 14:13:22.213 beaconbeststate.go:800 [INF] BlockChain log: Init Beacon Committee State 1926843
2022-04-13 14:13:36.025 consensus_assign_rule.go:22 [INF] Committee State log : Beacon Height 1926843, using Assign Rule V3
2022-04-13 14:13:36.025 consensus_swap_rule.go:25 [INF] Committee State log : Beacon Height 1926843, using Swap Rule V3

This is also true for my fullnode. CPU keeps spinning but nothing is happening.

Hey @fredlee, can you show us the version of the code running on your server (I assume you are running a Docker tag)?
After that, please try to stop all containers, clear the logs, start only one container, and send us its log and public validator key.

If you need quick support, you can ping me on Telegram: https://t.me/khanhj

Hey @fredlee,

To help you with diagnostics, please ping me on Telegram: https://t.me/lukemax

I will need you to check your server’s disk I/O, or we can do a remote session so I can check it for you.

It looks like starting the nodes takes a very, very long time. Here’s what my log looks like from the first line to the node actually joining the mainnet.

2022-04-14 07:52:50.040 incognito.go:107 [INF] Server log: Version 1.19.0-beta

...

2022-04-14 08:00:38.675 server.go:704 [CRT] Server log: ************************* Node is running in mainnet network *************************

That’s almost 8 minutes. It might be isolated to this physical server; not sure yet. Can anyone else check their logs to see how long it takes between the first line and the node actually spinning up?

My actual problem, though, is that when I restart the physical machine, all the nodes go into some kind of timeout spiral and keep restarting themselves with an unhandled panic error.

panic: interface conversion: multiview.View is nil, not *blockchain.BeaconBestState

goroutine 197 [running]:
github.com/incognitochain/incognito-chain/blockchain.(*BlockChain).SendFeatureStat(0x45d964b800)
	/Users/autonomous/projects/incognito-chain/blockchain/featurestat.go:105 +0x6ac
github.com/incognitochain/incognito-chain/blockchain.(*BlockChain).InitFeatureStat.func2()
	/Users/autonomous/projects/incognito-chain/blockchain/featurestat.go:99 +0x36
created by github.com/incognitochain/incognito-chain/blockchain.(*BlockChain).InitFeatureStat
	/Users/autonomous/projects/incognito-chain/blockchain/featurestat.go:95 +0x15d

Still not sure why this happened in the middle of an epoch, but something else (like a network timeout?) might have triggered them all to restart at the same time.

@khanhj helped me figure out that I can gently start one node at a time, waiting 10-15 minutes between each start. I also think they’re looking into the panic error.

@Jared, I pinged you on Telegram.

What method do you use to restart your server? This sounds like an issue with the Docker containers not stopping properly.

Thank you for your help, @Jared and @khanhj. If I understand correctly, when I start multiple nodes at the same time, they all start running block verification and I get timeouts because of I/O issues.

I have probably had this problem for a while, because my hardware has not changed (unless my SSDs have gone bad). What might have changed is that Incognito is throwing a panic and does not shut down correctly. Maybe that affects my nodes’ ability to stagger through the block verification?

I believe the panic issue may be fixed by the following commit

For now, I have solved it by limiting myself to starting one node every 15 minutes. Once the fix is in place, I can test whether the panic error was the reason they got stuck in a timeout spiral.
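
For anyone who wants to script this, here is a rough sketch of a staggered start in plain Node.js. The inc_mainnet_<index> container names follow the naming used by the community tool quoted further down in this thread, and the node indexes and the 15-minute delay are just what works for my setup, so treat it as an illustration rather than a finished tool.

    import { execFileSync } from "node:child_process";
    import { setTimeout as sleep } from "node:timers/promises";

    // Container indexes to start; adjust to your own setup.
    const allNodesIndex = [0, 1, 2, 3];
    // Delay between starts: 15 minutes, in milliseconds.
    const staggerMs = 15 * 60 * 1000;

    async function startStaggered() {
      for (const nodeIndex of allNodesIndex) {
        // `docker container start` prints the container name on success.
        const name = execFileSync(
          "docker",
          ["container", "start", `inc_mainnet_${nodeIndex}`],
          { encoding: "utf8" },
        ).trim();
        console.log(`started ${name}, waiting before the next one`);
        await sleep(staggerMs);
      }
    }

    startStaggered();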

For anyone else reading this: bear in mind that I’m really pushing my hardware with a lot of nodes, so I’m not the typical use case.

Hey @fredlee,

I faced this problem (same log messages, stuck at the same point) when my VPS provider moved my VPS to a bad host system. CPU, disk, and network performance were all really bad. Then, at my insistence, they moved my VPS to another host system and all my nodes were fixed. FYI.

I had thought the same thing but @fredlee has informed me he’s hosting on his own hardware.

That’s cool, @fredlee, that you’re hosting on your own hardware. I’m thinking of doing the same. Care to share your rig specs and the limits you have found for the number of nodes it can run?

Then he may have some enemies :smiling_imp: Unless he has his own exclusive ISP, the network is the only parameter he cannot control. I think I have found the problem: his network has gone bad :rofl:

Well, that’s the thing. Up until recently, I was hosting 33 nodes on a single 2011 Mac mini (16GB RAM, 1TB SSD). But now, with the new version and committee changes, I am facing performance issues and my nodes are getting bulk slashed.

I’ve been trying to find something stable, but I have problems starting even 2 nodes at the same time. If I stagger the starts by 15 minutes, I manage to run 4 nodes, but when the 5th starts, everything stops and they all go offline.

I’m giving up on my Mac mini Incognito project for now. It had a good run: for over a year it ran a full node and validated on all shards.

So now 33 of my nodes have no home. I can retire them, move them to another server, or build a new, glorious multi-node project. I have not decided what to do with them yet. :grin:

Go for glory, my friend. An epic multi-node project. 33 nodes on a single 2011 Mac mini is pretty cool. :smiley:

You want to allocate 2GB RAM per node. Could you upgrade the RAM?

When you start them up, check the node monitor (monitor.incognito.org) first and make sure it says Sync Status: Latest before you start up the next node.
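
If you want to automate that wait, a rough alternative (sketch only, and weaker than checking the monitor, since it only confirms the node has finished starting, not that it is fully synced) is to watch each container’s log for the "Node is running in mainnet network" line shown earlier in this thread before starting the next one. The container names are assumed to follow the inc_mainnet_<index> pattern used later in the thread.

    import { spawnSync, execFileSync } from "node:child_process";
    import { setTimeout as sleep } from "node:timers/promises";

    // Poll a container's recent log output until the startup line appears.
    async function waitUntilRunning(container) {
      for (;;) {
        const res = spawnSync("docker", ["logs", "--tail", "200", container], {
          encoding: "utf8",
        });
        // docker logs may write to stdout or stderr depending on the container.
        const tail = (res.stdout || "") + (res.stderr || "");
        if (tail.includes("Node is running in mainnet network")) return;
        await sleep(60_000); // check again in a minute
      }
    }

    async function main() {
      execFileSync("docker", ["container", "start", "inc_mainnet_0"]);
      await waitUntilRunning("inc_mainnet_0");
      execFileSync("docker", ["container", "start", "inc_mainnet_1"]);
    }

    main();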

How do you manage so many on such a small storage space? Do you use hard links?

With the resource improvements the devs recently released, there is still hope for your setup. Let’s wait and give it a shot again. :slightly_smiling_face:

I know it may depend on how many are in committee, but the part I’m always uncertain about is the number of cores needed for X nodes on a system. The fact that Fredlee was previously running 33 nodes on a 2011 Mac mini suggests to me that CPU/cores aren’t really the issue, and that the SSD and RAM are the things to focus on.

The new node database mode being worked on should drastically help lower resource usage.

I already spoke to the dev in charge of this and we will try to offer a bootstrap version so node operators can quickly change over.

A bootstrap would be greatly appreciated

It’s unfortunately a hardware limitation on the 2011 model. Like you were hinting when we talked, it looks like the memory is the biggest issue, but it also goes hand in hand with storage reaching its limit. So basically, with the beacon and shard growth over the past year, I finally hit an upper limit on what the hardware could handle.

Kind of: I use ZFS and snapshot clones. I also experimented with dedupe between nodes, but that does not work because each node’s database files end up with unique data.
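
For anyone curious what that looks like in practice, the idea is to snapshot a fully synced node’s dataset once and then give each additional node a copy-on-write clone instead of a full copy of the chain data. A minimal sketch, with made-up dataset names, shelling out to the standard zfs commands:

    import { execFileSync } from "node:child_process";

    // Hypothetical pool/dataset names; adjust to your own layout.
    const zfs = (...args) => execFileSync("zfs", args, { encoding: "utf8" });

    // Snapshot the synced node's data once...
    zfs("snapshot", "tank/incognito/node0@synced");
    // ...then clone it for a new node. The clone shares blocks with the
    // snapshot and only stores what later diverges.
    zfs("clone", "tank/incognito/node0@synced", "tank/incognito/node1");
    // By default the clone mounts at /tank/incognito/node1; point the new
    // container's data volume there before starting it.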

Looking forward to the new improvements. What I have noticed with the current database design is that if I bootstrap from one node to another, when I then boot the new node it rewrites about 30GB of data during the integrity check at startup. I imagine it updates some data and therefore rewrites the database pages? (Just guessing here.) Do you know if this is different in the new storage design? I imagine that if it can avoid updating or rewriting the big, immutable block data, cloning or hard linking should work even better.

Yes, that is correct. During normal operation the CPU load is very light because of the relatively slow rate of blocks. You really only strain the CPU during shard resyncing or node restarts.

Have you tried using this method for hard links? It was designed and coded by a community member:

I use it to manage a large number of nodes and it works well for me.

Nope, but I saw that one, checked the source and it looked really good.

Unfortunately, I would not be able to use it, because it would kill my nodes by doing this: :wink:

    console.group("\nStarting containers.");
    for (const nodeIndex of allNodesIndex)
      console.log(await docker(["container", "start", `inc_mainnet_${nodeIndex}`], (v) => v.split("\n")[0]));

I wonder, though: are the ldb files immutable? Is it 100% certain that old ldb files are never written to? Or is it more of a “seems to work fine for me”? :slightly_smiling_face:

I believe the old files are not written to. Just checked and confirmed they are there.
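
Given that, here is a minimal sketch of what the hard-link approach boils down to (the paths are made up, and the community tool linked above handles this properly; this is just to illustrate the idea): share the immutable .ldb files of an already synced node with a new node via hard links, so the bulk of the block data exists only once on disk, while the files that do get rewritten (manifests, write-ahead logs, and so on) are copied or recreated per node.

    import { readdirSync, mkdirSync, linkSync } from "node:fs";
    import { join } from "node:path";

    // Hypothetical source and destination data directories.
    const src = "/data/node0/mainnet/block";
    const dst = "/data/node1/mainnet/block";

    mkdirSync(dst, { recursive: true });
    for (const file of readdirSync(src)) {
      // Only hard-link the .ldb table files, which are not rewritten;
      // everything else is handled separately.
      if (file.endsWith(".ldb")) {
        linkSync(join(src, file), join(dst, file));
      }
    }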