Automation that makes upgrades boring.

Writing the YAML is a week of the job. Keeping the platform alive is the rest of it: upgrades, validation, and recovery in an environment that won't let you lean on easy external fixes.

This toolchain turns day-2 ops into a workflow: wave-based upgrades, specialist diagnostics, safe actions, and structured proof.

> One CLI entry point, explicit safety levels, JSON envelopes on every action.

~80%

rework reduction

weeks → hours

provisioning cycles

100+

clusters served

// the tool families

What matters is that the platform has an operable interface: upgrade, validate, diagnose, remediate.

Family	State
Upgrade orchestrator	Mature (Python waves + Argo CD)
Doctor diagnostics	Mature (specialist checks, auto-fix, metrics)
Vault management	Stable
Standalone ops scripts	Partially overlaps doctor
Bootstrap / deploy	Minimal
Unified CLI prototype	Partial unification (upgrade + doctor)

// patterns that made it work

Wave-Based Orchestration

Upgrades run as a gated sequence. Waves encode dependency order, health checks, and rollback points.

Specialist Diagnostics

One command fans out into targeted checks by domain (storage, identity, operators, networking). Consistent output; fixes applied deliberately.

Safety Levels

Explicit safety levels (safe/confirm/block) and approval tokens keep high-impact actions controlled.

JSON Envelopes

Every action returns a structured envelope so dashboards and agents can consume the same result humans read.

// the insight

A unified CLI already existed in prototype form. The right move was to expand the unification: route upgrades, diagnostics, secrets, and ops tools through one entry point.

The winning split was bash for specialists (fast, natural for CLI plumbing) and Python for orchestration (state, envelopes, safety, UX).

// the bridge

Operators need one safe interface for inspect, act, and recover. See /work/openshift-paas.