
bin/background-worker: Simplify restart behavior #7459

Merged: 2 commits merged into rust-lang:main from the worker-restart branch on Nov 7, 2023

Conversation

@Turbo87 (Member) commented on Nov 6, 2023

All instances of our code hitting the `panic!()` in production so far have been caused by a read-only database during database maintenance events (see Sentry).

The r2d2 connection pool can easily recover from these kinds of issues, so a restart of the dyno or re-initialization of the connection pool seems unnecessary.

This is emphasized by the fact that the background worker currently takes multiple minutes to start in production because of the index repository clone. Restarting it after just five unsuccessful connection attempts means waiting multiple minutes again before the next attempt.

The `Runner::new()` fn itself doesn't do much, so re-running it should not change the outcome compared to re-running `runner.run_all_pending_jobs()` directly.

The only difference is the re-initialization of the r2d2 database connection pool. The pool should, however, be relatively resilient against connection issues and will retry if it fails to connect on startup.

In other words: there is no benefit in panicking after five failures when, in all observed instances, waiting for the database to recover would have been sufficient.
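
For context, here is a minimal sketch of the kind of retry loop this change moves towards, assuming a `Runner` type with a `run_all_pending_jobs()` method as used by the background worker. The error type, logging, and poll interval below are illustrative placeholders, not the actual crates.io code:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for the real background worker runner, which wraps
/// the r2d2 connection pool and the job queue.
struct Runner;

#[derive(Debug)]
struct RunnerError(String);

impl Runner {
    /// Placeholder: the real method pulls pending jobs from the database and
    /// runs them, returning an error if the database is unreachable.
    fn run_all_pending_jobs(&self) -> Result<(), RunnerError> {
        Ok(())
    }
}

fn main() {
    // Build the runner (and its connection pool) once at startup. The pool
    // can reconnect on its own, so there is no need to re-create it when the
    // database is temporarily read-only or unreachable.
    let runner = Runner;

    loop {
        if let Err(err) = runner.run_all_pending_jobs() {
            // Instead of counting failures and panicking after five attempts,
            // log the error and try again shortly. The process keeps running,
            // which avoids the multi-minute index re-clone on dyno restart.
            eprintln!("background worker error: {err:?}");
        }

        // Poll interval between runs (illustrative value).
        sleep(Duration::from_secs(1));
    }
}
```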
@Turbo87 Turbo87 added C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear A-backend ⚙️ labels Nov 6, 2023
@Turbo87 Turbo87 requested a review from a team November 6, 2023 21:13
@LawnGnome (Contributor) left a comment

I think I'm OK with this, but let's think through the possible failure mode here:

If a panic occurs that isn't a read-only database panic, and it's something that isn't transient, we might end up in a situation where the background worker just keeps restarting the same jobs indefinitely, right? Presumably we'd eventually detect that through the job queue size getting large enough that a page occurs?

Of course, the first step in response to a page would probably be to restart the background worker, which would not really be any different to the code that exists today, except for the part where someone might have been rolled out of bed.

So, at worst, we might be slowing down the restart of the background worker. (Of course, for a non-database error, that might not resolve the problem anyway, so a human would have been woken up regardless.)

So, yeah, I think this is fine — it won't really slow down response to other panics, and not restarting in the read-only case is definitely an improvement.

@Turbo87 Turbo87 merged commit fb264d0 into rust-lang:main Nov 7, 2023
@Turbo87 Turbo87 deleted the worker-restart branch November 7, 2023 07:30