
bin/background-worker: Simplify restart behavior #7459

Merged: 2 commits merged into rust-lang:main from the worker-restart branch on Nov 7, 2023

Conversation

@Turbo87 (Member) commented on Nov 6, 2023

All instances of our code hitting the `panic!()` in production so far have been caused by a read-only database during database maintenance events (see Sentry).

The r2d2 connection pool can easily recover from these kinds of issues, so a restart of the dyno or re-initialization of the connection pool seems unnecessary.

This is emphasized by the fact that the background worker currently takes multiple minutes to start in production because of the index repository clone. Restarting it after just five unsuccessful connection attempts means waiting multiple minutes again before the next attempt.

The `Runner::new()` fn itself doesn't do much, so re-running it should not change the outcome compared to re-running `runner.run_all_pending_jobs()` directly.

The only difference is the re-initialization of the r2d2 database connection pool. The pool should, however, be relatively resilient against connection issues and will retry if it fails to connect on startup.

In other words: there is no benefit in panicking after five failures when, in all observed instances, waiting for the database to recover would have been sufficient.
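
For context, here is a minimal sketch of the kind of retry loop this change moves towards, assuming a `Runner` type with a `run_all_pending_jobs()` method as used by the background worker. The error type, logging, and poll interval below are illustrative placeholders, not the actual crates.io code:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for the real background worker runner, which wraps
/// the r2d2 connection pool and the job queue.
struct Runner;

#[derive(Debug)]
struct RunnerError(String);

impl Runner {
    /// Placeholder: the real method pulls pending jobs from the database and
    /// runs them, returning an error if the database is unreachable.
    fn run_all_pending_jobs(&self) -> Result<(), RunnerError> {
        Ok(())
    }
}

fn main() {
    // Build the runner (and its connection pool) once at startup. The pool
    // can reconnect on its own, so there is no need to re-create it when the
    // database is temporarily read-only or unreachable.
    let runner = Runner;

    loop {
        if let Err(err) = runner.run_all_pending_jobs() {
            // Instead of counting failures and panicking after five attempts,
            // log the error and try again shortly. The process keeps running,
            // which avoids the multi-minute index re-clone on dyno restart.
            eprintln!("background worker error: {err:?}");
        }

        // Poll interval between runs (illustrative value).
        sleep(Duration::from_secs(1));
    }
}
```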
@Turbo87 Turbo87 added C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear A-backend ⚙️ labels Nov 6, 2023
@Turbo87 Turbo87 requested a review from a team November 6, 2023 21:13
@LawnGnome (Contributor) left a comment

I think I'm OK with this, but let's think through the possible failure mode here:

If a panic occurs that isn't a read-only database panic, and it's something that isn't transient, we might end up in a situation where the background worker just keeps restarting the same jobs indefinitely, right? Presumably we'd eventually detect that through the job queue size getting large enough that a page occurs?

Of course, the first step in response to a page would probably be to restart the background worker, which would not really be any different to the code that exists today, except for the part where someone might have been rolled out of bed.

So, at worst, we might be slowing down the restart of the background worker. (Of course, for a non-database error, that might not resolve the problem anyway, so a human would have been woken up regardless.)

So, yeah, I think this is fine — it won't really slow down response to other panics, and not restarting in the read-only case is definitely an improvement.

@Turbo87 Turbo87 merged commit fb264d0 into rust-lang:main Nov 7, 2023
@Turbo87 Turbo87 deleted the worker-restart branch November 7, 2023 07:30