/work/dify-platform

// operable AI applications under constraint

A multi-tenant AI application platform adapted for restricted OpenShift, complete with two-layer observability that lets you debug AI failures the same way you'd debug infrastructure failures. Traces. Metrics. Logs.

It demonstrates how AI agents can be operated with the same review, observability, and deployment discipline expected from infrastructure in high-consequence environments.

> Fast delivery. Restricted environment. Observable by default.

1 day to production
5 core services
4 sync-waves
2 observability layers

// what it does

A multi-tenant marketplace where teams build chatbots, agents, and RAG pipelines without writing code: drag and drop components, connect to your data, and deploy.

A visual workflow builder plus chat interface, running in a self-managed environment with enterprise auth, observability, and self-hosted models.

// what users can build

Chat Apps

Conversational AI with custom system prompts and personas

Workflows

Visual multi-step pipelines with code execution nodes

RAG Apps

Knowledge base Q&A with pgvector semantic search

Agent Apps

Autonomous decision-making with tool orchestration

// fork enhancements

The reliability work around Dify: hardened deployment, identity, persistence, and observability for restricted OpenShift.

Standard Dify | This Fork
Basic API keys | Keycloak SSO + OAuth2
Optional PostgreSQL | EDB + pgvector (HA)
No LLM observability | Langfuse (built-in)
Docker Compose | Helm + ArgoCD + sync-waves
Public SaaS | Restricted OpenShift

// request path

FIG.D1: identity-proxied, observable by default.

Route → OAuth2-Proxy → Keycloak (SSO)
Dify API + Web ──trace──▶ Langfuse
↓ ──metric──▶ Prometheus / Grafana
EDB Postgres + pgvector · Redis · ClickHouse

// the key to AI reliability

AI needs two observability layers. Infrastructure metrics tell you the system is healthy. LLM traces tell you the AI is working correctly. Most platforms give you one; you need both.

Layer 1: Langfuse (LLM-specific observability)
  Token usage per request
  Cost estimation
  Full request/response traces
  User feedback tracking

Layer 2: Prometheus (infrastructure observability)
  CPU / memory usage
  Pod health metrics
  HTTP metrics (via OTEL)
  Database connections

> This pattern applies to any LLM app. Without both layers, you cannot tell whether a failure came from the infrastructure or from the model.
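
A minimal sketch of how the two layers combine during triage. The field names (`pods_ready`, `http_error_rate`, `trace_present`, `user_feedback`) and the 5% error threshold are hypothetical, chosen for illustration; in practice the first group would be read from Prometheus and the second from Langfuse.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InfraSignals:
    """Layer 2: what infrastructure metrics can tell you (hypothetical fields)."""
    pods_ready: bool
    http_error_rate: float  # fraction of 5xx responses, 0.0 to 1.0


@dataclass
class LlmSignals:
    """Layer 1: what an LLM trace can tell you (hypothetical fields)."""
    trace_present: bool
    user_feedback: Optional[int]  # +1, -1, or None when no feedback was given


def triage(infra: InfraSignals, llm: LlmSignals, max_error_rate: float = 0.05) -> str:
    """Attribute a reported failure to a layer, checking infrastructure first."""
    if not infra.pods_ready or infra.http_error_rate > max_error_rate:
        return "infrastructure"
    if not llm.trace_present:
        return "instrumentation"  # system is healthy, but no trace was recorded
    if llm.user_feedback is not None and llm.user_feedback < 0:
        return "model"  # system healthy, trace exists, answer was judged bad
    return "no fault found"
```

With only one layer, two of these four outcomes collapse into "unknown": infrastructure metrics alone cannot flag a bad answer, and traces alone cannot explain a request that never reached the model.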

// sync-wave deployment

ArgoCD sync-waves orchestrate startup order so dependencies are ready before applications arrive.

Wave -1: Secrets, ConfigMaps, internal CA
Wave 0: Keycloak OpenIDClient
Wave 1: EDB PostgreSQL Cluster (2-5 min)
Wave 2: Redis StatefulSet
Wave 3: All Application Pods + Routes
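
The ordering above can be modelled in a few lines. Argo CD reads the wave from the `argocd.argoproj.io/sync-wave` annotation, treats a missing annotation as wave 0, and applies waves in ascending order. The sketch below only models that grouping; the resource names are hypothetical, and the health checks Argo CD performs between waves are not represented.

```python
from collections import defaultdict

SYNC_WAVE = "argocd.argoproj.io/sync-wave"  # the annotation Argo CD reads


def manifest(kind: str, name: str, wave: int) -> dict:
    """Build a skeletal manifest carrying a sync-wave annotation."""
    return {
        "kind": kind,
        # Annotation values must be strings, so the wave number is quoted.
        "metadata": {"name": name, "annotations": {SYNC_WAVE: str(wave)}},
    }


def apply_order(manifests: list[dict]) -> list[tuple[int, list[str]]]:
    """Group manifests by wave and return the waves in ascending order."""
    waves: dict[int, list[str]] = defaultdict(list)
    for m in manifests:
        # Resources without the annotation fall into wave 0.
        wave = int(m["metadata"].get("annotations", {}).get(SYNC_WAVE, "0"))
        waves[wave].append(f'{m["kind"]}/{m["metadata"]["name"]}')
    return sorted(waves.items())


# Hypothetical resource names, listed out of order on purpose.
resources = [
    manifest("Deployment", "dify-api", 3),
    manifest("Secret", "dify-secrets", -1),
    manifest("StatefulSet", "redis", 2),
    manifest("Cluster", "edb-postgres", 1),
    manifest("Route", "dify-web", 3),
]
order = apply_order(resources)
```

Here `order` starts with the Secret in wave -1 and ends with the Deployment and Route in wave 3, regardless of the order in which the manifests were declared.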

// the dogfood

The MVP is a self-hosted assistant connected to a curated engineering corpus. Documentation, patterns, examples, and decisions become searchable through conversation.

The goal is practical: make platform and AI-delivery knowledge easier to inspect, reuse, and validate.

> The platform is useful only when the knowledge becomes reachable.

// why this is hard

No public registries. Every container image, chart, and dependency has to be mirrored, vendored, and reviewed.

FIPS cryptography. Standard TLS libraries don't work; the OpenSSL FIPS module is required. Most open-source projects assume non-FIPS crypto, and Dify was no exception.

OpenShift security context. No root. Arbitrary UIDs. Read-only filesystems. SecurityContextConstraints that block 90% of Docker Hub images out of the box.

Enterprise SSO. Identity, OAuth2 proxying, and certificate chains have to be treated as first-class platform work.
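
The security-context constraints above (no root, arbitrary UIDs, read-only filesystems) can be checked before an image ever reaches the cluster. The function below is a hypothetical pre-flight lint over a container spec, not the SCC admission logic itself; it uses the standard Kubernetes `securityContext` field names, and which findings actually block a deployment depends on the SCC in force.

```python
def restricted_violations(container: dict) -> list[str]:
    """List the ways a container spec conflicts with a restricted, non-root policy."""
    sc = container.get("securityContext", {})
    problems = []
    if sc.get("runAsUser") == 0:
        problems.append("runs as root (runAsUser: 0)")
    elif "runAsUser" in sc:
        # OpenShift assigns a UID from the namespace range, so a pinned UID
        # is typically rejected unless it happens to fall inside that range.
        problems.append("pins a fixed UID instead of accepting an arbitrary one")
    if sc.get("privileged", False):
        problems.append("requests privileged mode")
    # Kubernetes allows privilege escalation unless it is explicitly disabled.
    if sc.get("allowPrivilegeEscalation", True):
        problems.append("allows privilege escalation")
    if not sc.get("readOnlyRootFilesystem", False):
        problems.append("root filesystem is writable")
    return problems


# A spec written for plain Docker usually sets none of the hardening fields.
typical = {"image": "docker.io/library/nginx", "securityContext": {"runAsUser": 0}}
hardened = {
    "securityContext": {
        "allowPrivilegeEscalation": False,
        "readOnlyRootFilesystem": True,
    }
}
```

Calling `restricted_violations(typical)` reports three problems, while `restricted_violations(hardened)` returns an empty list.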

// stack

Dify · OpenShift · Langfuse · EDB PostgreSQL · pgvector · Keycloak · OAuth2-Proxy · Helm · ArgoCD · OpenTelemetry · Grafana · ClickHouse · Redis · Harbor

// the point

Observable AI infrastructure: the foundation reliability requires.

Two-layer observability gives you visibility into what AI is doing. Making it do the right thing consistently is the next problem.

> Step one: make it observable. Step two: make it reliable.