NCFP-28: Sentinel Fix

For far too long, many users have run into the same Sentinel issue: the sentinel gets stuck (“freezes”).

This was reported many times, with many sentinel versions. Some sentinels do fine, others need very frequent restarts and resets. The issue showed some dependence on the sentinel version and on the proportion of in-cycle verifiers but, as some tests showed, no version is immune to the “freeze” bug.

As for myself, I began to develop workarounds: bash scripts that monitor the sentinels and resync/restart them when this happens. This is not a real solution, however, and it is not practical if you don’t have several sentinels or in-cycle nodes you can safely sync from. Plus, it requires shell, rsync and ssh knowledge, which makes it hard to handle for newcomers.
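To give an idea of what such a watchdog boils down to, here is a minimal sketch in Python. These are not the actual scripts, and everything in it is an assumption made for illustration: the `Watchdog` class, the 300-second threshold, and the idea of using a monotonically increasing "height" value as the progress signal. How that value is read from a sentinel, and how a restart or resync is triggered, is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Watchdog:
    """Flags a sentinel as frozen when its progress value stops increasing.

    Hypothetical helper: the names and the threshold are illustrative only.
    """
    stall_seconds: float = 300.0          # assumed threshold, tune to taste
    last_height: Optional[int] = None     # highest progress value seen so far
    last_change: Optional[float] = None   # timestamp of the last increase

    def observe(self, height: int, now: float) -> bool:
        """Record one reading; return True if the sentinel looks frozen."""
        if self.last_height is None or height > self.last_height:
            # First reading, or progress was made: reset the stall timer.
            self.last_height = height
            self.last_change = now
            return False
        # No progress: frozen once the stall has lasted long enough.
        return (now - self.last_change) >= self.stall_seconds
```

A monitoring loop would call `observe()` at regular intervals with the current reading and a timestamp, and only trigger the restart/resync step when it returns `True`. Even then, the resync part still needs a healthy node to copy from, which is exactly the limitation mentioned above.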

The sentinel had to be fixed.
So I “froze” several days to debug, read logs and tweak things, in order to understand what was going on and maybe fix it.
Good news: I think I made it.
I identified a sentinel bug, found a clean way to avoid it, and the ever-crashing sentinel has been running smoothly ever since.

Now it is time for a proper release, an explanation and, hopefully, a merge into the “official” codebase.

The changes will include:

  • some tweaks to existing logging, in order to make further debugging easier (no impact on performance or on verifier operations)
  • addition of new log entries, to report data that is otherwise only visible via the web interface: number of managed verifiers, protection state, maybe key/ip mismatches.
  • Slightly more details in the web interface
  • Of course, the main fix that will end our stuck sentinel issues.
  • A concise overview of what I understood of the bug, and how it was fixed
  • A review of all code changes, for external scrutiny.
    (WIP)

I believe this version will make it to the main codebase, since - unlike the CE version - it does not change the behaviour of the verifier, and is “just” logging adjustments and bugfix.

Although the resulting changes - once cleaned up - are quite small, this sentinel issue has been a huge pain for every Nyzo operator for a long time, and fixing it required significant work and a deep understanding of the current code base and Nyzo operations.

I’m confident the fix is effective; I just need to clean it up and write the related notes.

I know from experience I won’t get significant spontaneous tips, so I’m opening this NCFP to let the community suggest a proper reward for the fix.

I’ll update with an address and an amount once there are some suggestions.


I propose a reward of at least 20k nyzo for this fix as it has been a MAJOR issue since forever. Many people lost verifiers because of it. This shouldn’t be taken lightly and should be rewarded bigly. Doing the debugging work, troubleshooting and even fixing the issue is just epic work.


I think 50k is a reasonable amount given the magnitude and scope of this problem.

This is at least a Medium severity problem on the OWASP framework and a fix was offered.

@NyzoSy, please provide a nyzo address so that we can send the transaction once we have consensus. A transaction will be sent as soon as a trusted community member confirms that the changes fix their sentinel problem.

Great job.


I am proposing 65k because of the undoubtedly great importance of this solution.


Even though v608 supposedly addresses part of this problem, I still propose giving Sy a reward for the detailed work of looking into this problem. I will wait for the core dev team to formally address this issue.

Sy’s work undoubtedly has a lot of value.

Please, 100k for the one who is able to fix the sentinel!