[Node Guide] Automatically Restart BEACON STALL Containers

Jared · 6 September 2023 22:52

From time to time, I’ve observed that certain nodes encounter the “BEACON STALL” status in their Sync State after running the duplicate files script by @J053. These nodes become stuck in this state until a restart. It’s likely that this issue stems from containers not shutting down properly, and I don’t believe it’s a problem with @J053’s script.

To automate the restart, you can use the script below. Replace each KEY_X placeholder (KEY_1, KEY_2, and so on) with the corresponding node’s Validator Public Key:

#!/bin/bash

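# Map each container index to that node's Validator Public Key.
# The index is appended to container_name, so ["1"] refers to inc_mainnet_1.
declare -A node_key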
node_key=(
    ["1"]="KEY_1"
    ["2"]="KEY_2"
    ["3"]="KEY_3"
    # Add more nodes and keys as needed
)

base_url="https://monitor.incognito.org/pubkeystat/stat"
container_name="inc_mainnet_"

for node_name in "${!node_key[@]}"; do
    key="${node_key[$node_name]}"
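    # Ask the monitor for this key's status and extract the SyncState field.
    response=$(curl -s "$base_url" -H 'accept: application/json' -H 'content-type: application/json' --data "{\"mpk\":\"$key\"}" | jq -r '.[].SyncState')

    # Restart the container only when the node reports BEACON STALL.
    if [ "$response" == "BEACON STALL" ]; then
        # docker restart prints the container name, which completes this line.
        echo -n "Restarting Container "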
        sudo docker restart "${container_name}${node_name}"
    fi
done

sleep 2
exit 0

Save the above code into a file (e.g., stalling.sh). Replace each KEY_X with your nodes’ keys and add as many entries as required. The script calls curl and jq, so make sure both are installed. Make the script executable with: sudo chmod +x stalling.sh

Set up a cron job to run the script at specified intervals. To edit the cron jobs, use the command: sudo crontab -e. Add a new line at the bottom as follows:

0,30 * * * * /path/to/stalling.sh

This will run the script every 30 minutes. Adjust the frequency as needed.

Congratulations! You’ve successfully established an automated script to monitor and handle “BEACON STALL” containers.

For added convenience, you can follow the instructions on @J053’s GitHub repository:

Run crontab -e.

Add 0 0 * * * deno task --cwd /root/Duplicated-files-cleaner-Incognito run at the end of the file, adjusting the path if necessary.

Set my script to run approximately 30-60 minutes after the cleaner so it can address any “BEACON STALL” containers.
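As a rough illustration of that ordering, the two crontab entries could look like the sketch below. It assumes both jobs live in the same crontab, that the cleaner runs at midnight as in the line above, and that a 45-minute offset suits your nodes; /path/to/stalling.sh is the same placeholder path used earlier.

```
# m  h  dom mon dow  command
0    0  *   *   *    deno task --cwd /root/Duplicated-files-cleaner-Incognito run
45   0  *   *   *    /path/to/stalling.sh
```

If you already run stalling.sh every 30 minutes with the earlier cron line, the 00:30 run falls inside that window and a separate entry may not be needed.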

If you have any questions or get stuck on anything, feel free to leave a comment below or send me a PM.

If copying and pasting gives you issues, feel free to curl or use git from here:
https://github.com/lukemax47/Node_Stall

brico84 · 10 August 2023 12:18

AMAZING!!!

This is just what I need!

brico84 · 10 August 2023 15:06

@Jared I would suggest also adding a trigger for when response == "SHARD STALL" as well as "OFFLINE". Basically, any of the states that require a restart of the Docker container.

brico84 · 10 August 2023 15:07

For the “OFFLINE” suggestion: if you do implement it, maybe do it in a way that can be easily commented out. Some users may not want “OFFLINE” nodes touched.
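One possible way to sketch this suggestion is to keep the restart-worthy states in an array, so a state such as “OFFLINE” can be switched on or off by commenting a single line. This is only an illustration: restart_states and needs_restart are hypothetical names that are not in the original script, and "SOME OTHER STATE" is a made-up placeholder used to show the skip path.

```shell
#!/bin/bash

# States that should trigger a restart. Comment out a line to leave
# nodes in that state untouched.
restart_states=(
    "BEACON STALL"
    "SHARD STALL"
    # "OFFLINE"   # uncomment to also restart offline nodes
)

# Return 0 if the given sync state is listed in restart_states, 1 otherwise.
needs_restart() {
    local state="$1" candidate
    for candidate in "${restart_states[@]}"; do
        if [ "$state" == "$candidate" ]; then
            return 0
        fi
    done
    return 1
}

# Dry run: report the decision for a few states without touching any container.
for state in "BEACON STALL" "SHARD STALL" "OFFLINE" "SOME OTHER STATE"; do
    if needs_restart "$state"; then
        echo "$state: restart"
    else
        echo "$state: skip"
    fi
done
```

In the original script, the check would then become: if needs_restart "$response"; then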

Jared · 10 August 2023 16:19

I’m considering making a script that automates everything. It will take nodes offline and bring them online when X epochs are pending.

Currently, the script will only touch BEACON STALL.

Jared · 27 August 2023 15:59

Hey @brico84,

I just released another script you might be interested in: [User Guide] Automatic Docker Container Controller! 👨🏼‍💻

This script will automatically bring your nodes online when they are close to COMMITTEE and take them offline after they exit committee.

Ah, I had forgotten about Shard Stall since I only bootstrap my beacon data. Do you bootstrap both beacon and shard data?

brico84 · 31 August 2023 16:20

I do. I modified the script on my end to include SHARD STALL and NODE STALL, which also sometimes come up.

I am seeing a problem, though: you didn’t include [0] in node_key.

So inc_mainnet_0 is not getting monitored.

Jared · 1 September 2023 01:54

The script pulls what you provide. Check to make sure you have included your inc_mainnet_0 information.
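To double-check which containers the script will look at, a small sketch like the one below prints the container names derived from node_key. It assumes a node at index 0 with a placeholder key KEY_0; list_containers is a hypothetical helper that is not part of the original script.

```shell
#!/bin/bash

declare -A node_key
node_key=(
    ["0"]="KEY_0"   # placeholder key for inc_mainnet_0
    ["1"]="KEY_1"
    ["2"]="KEY_2"
)

container_name="inc_mainnet_"

# Print the container name for every index in node_key, one per line.
list_containers() {
    local node_name
    for node_name in "${!node_key[@]}"; do
        echo "${container_name}${node_name}"
    done
}

# Associative arrays have no guaranteed order, so sort for readability.
list_containers | sort
```

If inc_mainnet_0 does not appear in the output, the ["0"] entry is missing from node_key.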

brico84 · 6 December 2023 15:06

Some suggested changes I made that work well. @Jared up to you if you want to implement them, but in case anyone else wants to make these edits…

I changed the if statement to the following, to cover additional cases that came up for me and that the script was not handling:

if [ "$response" == "BEACON STALL" ] || [ "$response" == "SHARD STALL" ] || [ "$response" == "NODE STALL" ]; then

This will cover shard stall and node stall cases as well.

I also changed the echo output from "Restarting Container " to "Starting Container ".

This is just for my own purposes, as “restarting” usually signals containers that are constantly restarting because of a problem. So I prefer to see “Starting” instead of “Restarting”. This has no bearing on the function of the script itself.

Jared · 27 February 2024 03:40

@brico84,

Thank you for the change suggestion. I’ve pushed it to the repo on main.

Sorry for the long delay in replying. I’ve been busy working on a toolkit of sorts that will mainly benefit node operators but also help power users.