Skip to content

Fix controller-manager failures causing CNS to remain incorrectly healthy #3688

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

Copilot
Copy link
Contributor

@Copilot Copilot AI commented May 27, 2025

Problem

Currently, CNS will repeatedly try to start controller-manager even if it encounters terminal errors, keeping CNS in an incorrectly healthy state. This affects both controller-manager in CRD mode and multiTenantController in MultiTenantCRD mode.

Solution

This PR adds timeouts for controller startup and proper error handling to ensure CNS fails appropriately when controllers can't be started. Key changes:

  1. Added timeout contexts for both controller-manager and multiTenantController with a 5-minute default
  2. Added new metrics to track timeout occurrences:
    • cns_ctrlmanager_start_timeouts_total
    • cns_multitenantcontroller_start_timeouts_total
  3. Made CNS handle terminal errors more gracefully by propagating errors through channels instead of calling os.Exit directly from goroutines
  4. Simplified goroutine structure for starting controllers and improved error reporting
  5. Extracted common functionality to reusable functions for better maintainability

Before

  • CNS would indefinitely retry controller-manager.Start on failures
  • No timeout was set for controller startup attempts
  • Terminal errors wouldn't cause CNS to fail appropriately

After

  • Controller startup attempts have a reasonable timeout
  • Terminal errors are properly propagated and handled
  • CNS will report unhealthy status when controllers fail to start

These changes ensure CNS will not remain in an incorrectly healthy state when controllers fail to initialize properly.

Fixes #3687.


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copilot AI and others added 2 commits May 27, 2025 19:57
Co-authored-by: timraymond <288061+timraymond@users.noreply.github.com>
Co-authored-by: timraymond <288061+timraymond@users.noreply.github.com>
@Copilot Copilot AI changed the title [WIP] Failures in controller-manager leave CNS in an incorrectly healthy state Fix controller-manager failures causing CNS to remain incorrectly healthy May 27, 2025
@Copilot Copilot AI requested a review from timraymond May 27, 2025 20:07
Copilot finished work on behalf of timraymond May 27, 2025 20:07
@github-actions github-actions bot added the stale Stale due to inactivity. label Jun 11, 2025
@Azure Azure deleted a comment from github-actions bot Jun 12, 2025
@timraymond timraymond removed the stale Stale due to inactivity. label Jun 12, 2025
@timraymond
Copy link
Member

@copilot Please make only the minimum changes necessary to guard manager.Start on L1615 in cns/service/main.go. This is the method call in question that must be fixed. Change it so the context passed into this has a reasonable timeout. If manager.Start on this line produces an error, or the context expires, the entire process should terminate.

Co-authored-by: timraymond <288061+timraymond@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Jun 12, 2025

@copilot Please make only the minimum changes necessary to guard manager.Start on L1615 in cns/service/main.go. This is the method call in question that must be fixed. Change it so the context p...

Made minimal changes as requested in commit c9b5c00. Added 5-minute timeout context to manager.Start() and process terminates with os.Exit(1) if it fails or times out. Removed the retry loop to keep changes surgical.

Copilot finished work on behalf of timraymond June 12, 2025 14:38
Copy link
Contributor Author

Copilot AI commented Jun 12, 2025

@timraymond Unfortunately I hit an unexpected error while processing your comment. I've automatically reported this to GitHub.

You can ask me to try again later by mentioning me in a new comment.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: 054e6f75-7569-4569-ad2a-71faf0af21ad

Sorry for the inconvenience!

…remove synchost.go

Co-authored-by: timraymond <288061+timraymond@users.noreply.github.com>
Copilot finished work on behalf of timraymond June 12, 2025 15:07
…remove unnecessary metric

Co-authored-by: timraymond <288061+timraymond@users.noreply.github.com>
Copilot finished work on behalf of timraymond June 12, 2025 15:33
Co-authored-by: timraymond <288061+timraymond@users.noreply.github.com>
Copilot finished work on behalf of timraymond June 12, 2025 16:00
…ested

Co-authored-by: timraymond <288061+timraymond@users.noreply.github.com>
Copilot finished work on behalf of timraymond June 13, 2025 15:46
Co-authored-by: timraymond <288061+timraymond@users.noreply.github.com>
Copilot finished work on behalf of timraymond June 13, 2025 15:57
Co-authored-by: timraymond <288061+timraymond@users.noreply.github.com>
Copilot finished work on behalf of timraymond June 13, 2025 17:21
@timraymond timraymond marked this pull request as ready for review June 13, 2025 17:32
@timraymond timraymond requested a review from a team as a code owner June 13, 2025 17:32
@timraymond timraymond requested review from ashvindeodhar and rbtr June 13, 2025 17:32
@rbtr
Copy link
Contributor

rbtr commented Jun 13, 2025

@copilot fix lints

  cns/service/main.go:1619: File is not `gci`-ed with --skip-generated -s standard -s default (gci)
  		
  cns/service/main.go:1629: File is not `gci`-ed with --skip-generated -s standard -s default (gci)
  	
  cns/service/main.go:1642: File is not `gci`-ed with --skip-generated -s standard -s default (gci)

Co-authored-by: rbtr <2940321+rbtr@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Jun 13, 2025

@copilot fix lints

  cns/service/main.go:1619: File is not `gci`-ed with --skip-generated -s standard -s default (gci)

...

Fixed the gci linting issues by removing trailing whitespace from lines 1619, 1629, and 1642 in commit 9ee0904.

Copilot finished work on behalf of rbtr June 13, 2025 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failures in controller-manager leave CNS in an incorrectly healthy state
3 participants