For far too long, many users have run into a recurring Sentinel issue.
It has been reported many times, across many sentinel versions. Some sentinels do fine; others need very frequent restarts and resets. The issue shows some dependence on the sentinel version and on the proportion of in-cycle verifiers but, as some tests showed, no version is immune to the “freeze” bug.
For my part, I began developing workarounds: bash scripts that monitor the sentinels and resync or restart them when this happens. This is not a real solution, however, and it is not practical if you don’t have several sentinels or in-cycle nodes you can safely sync from. It also requires shell, rsync, and ssh knowledge, which makes it hard for newcomers to handle.
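To give an idea of what such a workaround looks like, here is a minimal sketch of the detect-and-restart part only (no resync). It is written in Python rather than bash for readability and is not my actual script. Everything named here is an assumption for illustration: the "no progress for N seconds" criterion, the 600-second default, and the restart command, which you would have to supply for your own setup.

```python
import os
import subprocess
from typing import Callable, Sequence


def is_stalled(last_progress: float, now: float, max_idle_seconds: float) -> bool:
    """Return True when no progress has been seen for longer than the threshold."""
    return (now - last_progress) > max_idle_seconds


def watchdog_step(
    read_last_progress: Callable[[], float],
    restart: Callable[[], None],
    now: float,
    max_idle_seconds: float = 600.0,  # assumed threshold, tune for your node
) -> bool:
    """Run one check; call restart() and return True if the sentinel looks stuck."""
    if is_stalled(read_last_progress(), now, max_idle_seconds):
        restart()
        return True
    return False


def log_mtime_reader(log_path: str) -> Callable[[], float]:
    """Hypothetical progress probe: use the log file's last-modified time."""
    return lambda: os.path.getmtime(log_path)


def command_restarter(command: Sequence[str]) -> Callable[[], None]:
    """Build a restart action from a command you provide (assumption: your
    sentinel runs under some service manager that can restart it)."""
    return lambda: subprocess.run(list(command), check=False) and None
```

You would call `watchdog_step(log_mtime_reader(path), command_restarter(cmd), time.time())` from cron or a loop, with `path` and `cmd` matching your own install. Even this stripped-down version shows the problem: it only treats the symptom, and it still needs per-node setup.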
The sentinel had to be fixed.
So I “froze” several days to debug, add logs, and tweak, in order to understand what was going on and maybe fix it.
Good news: I think I did it.
I identified a sentinel bug and found a clean way to avoid it, and the ever-crashing sentinel has been running smoothly since.
Now it is time for a proper release, an explanation, and hopefully a merge into the “official” codebase.
The release will include:
- Some tweaks to existing logging, to make further debugging easier (no impact on performance or verifier operations).
- New log entries reporting data that is otherwise only visible via the web interface: number of managed verifiers, protection state, maybe key/IP mismatches.
- Slightly more detail in the web interface.
- Of course, the main fix that will end our stuck sentinel issues.
- A synthetic overview of what I understood of the bug and how it was fixed.
- A review of all code changes, for external scrutiny.
I believe this version will make it into the main codebase since, unlike the CE version, it does not change the behaviour of the verifier: it is “just” logging adjustments and a bugfix.
Although the resulting changes, once cleaned up, are quite small, this sentinel issue has been a huge pain for every Nyzo operator for a long time, and fixing it required significant work and a deep understanding of the current codebase and Nyzo operations.
I’m confident the fix is effective; I just need to clean it up and write the related notes.
I know from experience I won’t get significant spontaneous tips, so I’m opening this NCFP to let the community suggest a proper reward for the fix.
I’ll update this post with an address and an amount once there are some suggestions.