/work/platform-tooling
// automation that makes upgrades boring
The hardest part of running restricted platforms isn't writing YAML. It's keeping the thing alive: upgrades, validation, and recovery when the environment won't let you lean on easy external fixes.
This toolchain turns day-2 ops into a workflow: wave-based upgrades, specialist diagnostics, safe actions, and structured proof.
> One entry point. Explicit safety. Machine-readable results.
// research summary
The line count is secondary. What matters is that the platform has an operable interface: upgrade, validate, diagnose, remediate.
| Family | Lines | State |
|---|---|---|
| Upgrade orchestrator | 3,900 | Mature (Python waves + Argo CD) |
| Doctor diagnostics | 13,700 | Mature (10 specialists, auto-fix, metrics) |
| Vault management | 2,225 | Stable |
| Standalone ops scripts | 3,700 | ~1,900 lines overlap doctor |
| Bootstrap / deploy | 360 | Minimal |
| Unified CLI prototype | 1,500 | Partial unification (upgrade + doctor) |
// patterns that made it work
Upgrades aren't one big bang. They're a sequence with gates. Waves encode dependency order, health checks, and rollback points.
One command fans out into targeted checks by domain (storage, identity, operators, networking). Consistent output; fixes applied deliberately.
Automation is only useful if it's safe. Explicit safety levels (safe/confirm/block) and approval tokens keep high-impact actions controlled.
Human-readable output is nice. Machine-readable output is operability. Every action returns structured results for dashboards, automation, and AI-assisted review.
// the insight
A unified CLI already existed in prototype form. The right move was to expand the unification: route upgrades, diagnostics, secrets, and ops tools through one entry point.
The winning split was bash for specialists (fast, natural for CLI plumbing) and Python for orchestration (state, envelopes, safety, UX).
// the bridge
Reliability gets better when operators have one safe interface for inspect, act, and recover. See /work/openshift-paas.
> Reliability is a user interface.