/builds/platform-tooling

// automation that makes upgrades boring

The hardest part of running air-gapped platforms isn't writing YAML. It's keeping the thing alive: upgrades, validation, and recovery when the environment is hostile and you can't Google your way out.

I built a toolchain that turns day-2 ops into a workflow: wave-based upgrades, specialist-driven diagnostics, and safe automation with structured output.

> One entry point. Explicit safety. Machine-readable results.

25.4k LOC automation

tool families

diagnostic specialists

JSON envelopes

// research summary (interview-safe)

Family	Lines	State
Upgrade orchestrator	3,900	Mature (Python waves + Argo CD)
Doctor diagnostics	13,700	Mature (10 specialists, auto-fix, metrics)
Vault management	2,225	Stable
Standalone ops scripts	3,700	~1,900 lines overlap with doctor
Bootstrap / deploy	360	Minimal
jrenctl (unmerged)	1,500	Partial unification (upgrade + doctor only)

The key point isn't the line count. It's that the platform has an operable interface: upgrade, validate, diagnose, remediate.

// patterns that made it work

Wave-Based Orchestration

Upgrades aren't one big bang. They're a sequence with gates. Waves encode dependency order, health checks, and rollback points.
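To make the gate idea concrete, here is a minimal sketch in Python. It is not the actual orchestrator: the `Wave` fields, the `run_waves` function, and the shape of the returned dict are all hypothetical names chosen for illustration. It assumes each wave supplies its own apply step, health check, and rollback.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Wave:
    # Hypothetical structure, assumed for this sketch.
    name: str
    apply: Callable[[], None]     # performs this wave's upgrade step
    healthy: Callable[[], bool]   # gate that must pass before the next wave
    rollback: Callable[[], None]  # undoes this wave


def run_waves(waves: List[Wave]) -> Dict[str, object]:
    """Run waves in order; on a failed gate, roll back in reverse order."""
    done: List[Wave] = []
    for wave in waves:
        try:
            wave.apply()
            ok = wave.healthy()
        except Exception:
            # A crash in apply or in the health check counts as a failed gate.
            ok = False
        if not ok:
            # Undo the failing wave first, then earlier waves, newest to oldest.
            to_undo = list(reversed(done + [wave]))
            for w in to_undo:
                w.rollback()
            return {
                "ok": False,
                "failed": wave.name,
                "completed": [w.name for w in done],
                "rolled_back": [w.name for w in to_undo],
            }
        done.append(wave)
    return {
        "ok": True,
        "failed": None,
        "completed": [w.name for w in done],
        "rolled_back": [],
    }
```

List order doubles as dependency order here, which keeps the sketch short. A real orchestrator would likely persist progress between waves so an interrupted run can resume.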

Specialist Diagnostics

One command fans out into targeted checks by domain (storage, identity, operators, networking). The output is consistent, and fixes can be applied deliberately.
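The fan-out can be sketched roughly like this. The real specialists are separate tools; here they are stand-in Python callables, and the `diagnose` function and the `domain`/`status`/`detail` fields are assumptions made for the example. Every specialist, including one that crashes, is normalized into the same finding shape.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Finding = Dict[str, str]
Specialist = Callable[[], Dict[str, str]]


def diagnose(specialists: Dict[str, Specialist]) -> List[Finding]:
    """Run every specialist concurrently and return uniform findings."""

    def run(item: Tuple[str, Specialist]) -> Finding:
        domain, check = item
        try:
            result = check()
            return {
                "domain": domain,
                "status": result["status"],
                "detail": result.get("detail", ""),
            }
        except Exception as exc:
            # A specialist that fails becomes a finding instead of aborting the run.
            return {"domain": domain, "status": "error", "detail": str(exc)}

    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map yields results in input order, so output order is stable.
        return list(pool.map(run, specialists.items()))
```

Because a failing specialist is reported as a finding, one unreachable subsystem does not hide the results from the others.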

Safety Levels

Automation is only useful if it's safe. I used explicit safety levels (safe/confirm/block) and approval tokens to keep high-impact actions controlled.
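One way to read the three levels, sketched in Python. The semantics are my assumption for this example: `safe` runs without approval, `confirm` runs only with a matching approval token, and `block` never runs through automation. The `Safety` enum and `authorize` function are hypothetical, and how tokens are issued is out of scope.

```python
import hmac
from enum import Enum
from typing import Optional


class Safety(Enum):
    SAFE = "safe"
    CONFIRM = "confirm"
    BLOCK = "block"


def authorize(
    level: Safety,
    token: Optional[str] = None,
    expected: Optional[str] = None,
) -> bool:
    """Return True only if an action at this safety level may proceed."""
    if level is Safety.SAFE:
        return True
    if level is Safety.BLOCK:
        return False
    # CONFIRM: both the presented and the expected token must exist and match.
    if not token or not expected:
        return False
    # Constant-time comparison avoids leaking token contents via timing.
    return hmac.compare_digest(token.encode(), expected.encode())
```

The default is denial: a missing token, a missing expectation, or a blocked level all return `False`.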

JSON Envelopes

Human-readable output is nice. Machine-readable output is operability. Every action returns structured results suitable for dashboards, automation, and LLM tooling.
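A sketch of what such an envelope might look like. The field names (`schema`, `action`, `ok`, `data`, `errors`) are illustrative assumptions, not the toolchain's actual schema.

```python
import json
from typing import Any, Dict, List, Optional


def envelope(
    action: str,
    ok: bool,
    data: Optional[Dict[str, Any]] = None,
    errors: Optional[List[str]] = None,
) -> str:
    """Wrap an action's result in one predictable JSON shape."""
    payload = {
        "schema": 1,             # version field so consumers can detect changes
        "action": action,
        "ok": ok,
        "data": data or {},      # always an object, never null
        "errors": errors or [],  # always a list, never null
    }
    # sort_keys gives stable output, which makes diffs and tests easier.
    return json.dumps(payload, sort_keys=True)
```

Keeping `data` and `errors` present even when empty means consumers never need to branch on missing keys.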

// the insight

A unified CLI already existed in prototype form. The right move wasn't to start over. It was to expand the unification: route upgrades, diagnostics, secrets, and ops tools through one entry point.
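As a rough illustration of one entry point, here is an `argparse` sketch. It does not reproduce the prototype's interface: the program name `platformctl`, the subcommand names, the `--dry-run` flag, and the stub handlers are all hypothetical.

```python
import argparse
import json
from typing import Callable, Dict, List

Handler = Callable[[argparse.Namespace], Dict[str, object]]


def _stub(family: str) -> Handler:
    # Placeholder handler; a real one would call into that tool family.
    def handler(args: argparse.Namespace) -> Dict[str, object]:
        return {"family": family, "dry_run": args.dry_run}
    return handler


# One routing table: adding a tool family means adding one entry here.
HANDLERS: Dict[str, Handler] = {
    name: _stub(name) for name in ("upgrade", "doctor", "secrets", "ops")
}


def main(argv: List[str]) -> str:
    parser = argparse.ArgumentParser(prog="platformctl")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in HANDLERS:
        cmd = sub.add_parser(name)
        cmd.add_argument("--dry-run", action="store_true")
    args = parser.parse_args(argv)
    result = HANDLERS[args.command](args)
    # Every subcommand returns the same structured shape.
    return json.dumps(
        {"command": args.command, "ok": True, "data": result}, sort_keys=True
    )
```

Calling `main(["doctor", "--dry-run"])` routes to the `doctor` handler and returns a JSON string, so every family shares the same flags and output format.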

The winning split was bash for specialists (fast, natural for CLI plumbing) and Python for orchestration (state, envelopes, safety, UX).

This tooling is part of the broader platform work described in /builds/openshift-paas.

> Reliability is a user interface.