Confluent on OpenShift - GitOps automation, monitoring, and deployment documentation 2026-04-06 18:40:11 +01:00

Confluent Platform on OpenShift — GitOps Deployment

Fully automated multi-cluster deployment of Confluent Platform on OpenShift using ArgoCD, Helm charts, and the Confluent for Kubernetes (CFK) operator.

Architecture Overview

                    West Region                              East Region
              (3-5ms intra-region)                     (3-5ms intra-region)
        ┌──────────────┬──────────────┐          ┌──────────────┬──────────────┐
        │  dc-west-1   │  dc-west-2   │          │  dc-east-1   │  dc-east-2   │
        │   (rack1)    │   (rack2)    │          │   (rack1)    │   (rack2)    │
        │              │              │          │              │              │
        │ KRaft (1)    │ KRaft (1)    │          │ KRaft (1)    │ KRaft (1)    │
        │ Kafka 0,1,2  │ Kafka        │          │ Kafka 0,1,2  │ Kafka        │
        │              │  100,101,102 │          │              │  100,101,102 │
        │ Schema Reg   │ Schema Reg   │          │ Schema Reg   │ Schema Reg   │
        │ REST Proxy   │ REST Proxy   │          │ REST Proxy   │ REST Proxy   │
        │ Connect      │ Connect      │          │ Connect      │ Connect      │
        │ ControlCenter│ ControlCenter│          │ ControlCenter│ ControlCenter│
        │              │              │          │              │              │
        │  CFK Operator│  CFK Operator│          │  CFK Operator│  CFK Operator│
        └──────┬───────┴──────┬───────┘          └──────┬───────┴──────┬───────┘
               │  Cluster     │                         │  Cluster     │
               │  Linking     │                         │  Linking     │
               └──────────────┘                         └──────────────┘
                      │              Cross-Region              │
                      │           Cluster Linking              │
                      │          (30-40ms async)               │
                      └────────────────────────────────────────┘

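Each OCP cluster runs a complete Confluent Platform stack. The two clusters within a region are connected by Cluster Linking over the low-latency intra-region network (3-5 ms), and clusters in different regions are linked for DR over the 30-40 ms cross-region path. Cluster Linking replicates asynchronously in both cases; the intra-region links simply see much lower replication lag.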

Repository Structure

.
├── README.md                          # This file
├── PREREQUISITES.md                   # What must be in place before GitOps
├── RUNBOOK.md                         # Operational findings and lessons learned
├── DNS_ENTRIES.md                     # Required DNS entries per cluster
│
├── argocd/                            # ArgoCD Application definitions
│   ├── applicationset-cfk-operator.yaml
│   ├── applicationset-infra.yaml
│   └── applicationset-kafka.yaml
│
├── base/                              # Per-cluster Helm values
│   ├── garland-infra.yaml             # dc-west-1 infrastructure values
│   ├── garland-kafka.yaml             # dc-west-1 Kafka stack values
│   ├── louisville-infra.yaml          # dc-west-2 infrastructure values
│   ├── louisville-kafka.yaml          # dc-west-2 Kafka stack values
│   ├── sterling-infra.yaml            # dc-east-1 infrastructure values
│   ├── sterling-kafka.yaml            # dc-east-1 Kafka stack values
│   ├── manassas-infra.yaml            # dc-east-2 infrastructure values
│   └── manassas-kafka.yaml            # dc-east-2 Kafka stack values
│
└── charts/
    ├── cluster-infra/                 # Helm chart: cluster prerequisites
    │   ├── Chart.yaml
    │   ├── values.yaml
    │   └── templates/
    │       ├── namespace.yaml         # confluent namespace
    │       ├── scc.yaml               # Custom SCC (UID 1001)
    │       ├── storageclass.yaml      # NFS CSI StorageClass
    │       ├── metallb.yaml           # MetalLB IPAddressPool + L2Advertisement
    │       └── pull-secret.yaml       # Docker Hub pull secret
    │
    └── confluent-kafka/               # Helm chart: Confluent Platform stack
        ├── Chart.yaml
        ├── values.yaml
        └── templates/
            ├── kraftcontroller.yaml   # KRaft controller with deterministic clusterID
            ├── kafka.yaml             # Kafka brokers with ID offset + rack awareness
            ├── schemaregistry.yaml    # Schema Registry
            ├── restproxy.yaml         # REST Proxy
            ├── connect.yaml           # Kafka Connect
            ├── controlcenter.yaml     # Control Center (300s probe delay)
            ├── kafkarestclass.yaml    # KafkaRestClass for ClusterLink REST API
            ├── clusterlink.yaml       # ClusterLink CRs (intra + cross-region)
            └── static-lb-ips.yaml     # PostSync Job for static MetalLB IPs

Deployment Layers

The deployment is structured in 3 layers, each managed by separate ArgoCD Applications:

Layer 1: Infrastructure (cluster-infra chart)

Creates cluster-level prerequisites that Confluent depends on:

| Resource | Description |
| --- | --- |
| Namespace | confluent namespace |
| SecurityContextConstraints | Custom confluent-scc: allows UID 1001, scoped to 8 Confluent SAs |
| StorageClass | nfs-csi: NFS CSI provisioner pointing to shared NFS server |
| MetalLB IPAddressPool | Dedicated IP range per cluster for Kafka LoadBalancer services |
| L2Advertisement | MetalLB L2 mode for the IP pool |
| Docker Pull Secret | confluent-registry: credentials for pulling CFK images from Docker Hub |

Layer 2: CFK Operator (upstream Helm chart)

Installs the Confluent for Kubernetes operator from https://packages.confluent.io/helm:

  • Chart: confluent-for-kubernetes version 0.1514.19
  • CFK operator version: 3.2.1
  • podSecurity.enabled=false (OpenShift uses SCCs, not PodSecurity)
  • Must use ServerSideApply=true in ArgoCD syncOptions: CFK ships 22 large CRDs, and applying them client-side drives the ArgoCD application controller out of memory (OOM)
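
The settings above could be expressed as a single ArgoCD Application along the following lines. This is a minimal sketch, not the contents of applicationset-cfk-operator.yaml: the Application name, the openshift-gitops namespace, the default project, and the automated sync policy are assumptions. The chart coordinates, podSecurity.enabled, and ServerSideApply=true come from the list above.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cfk-operator                 # hypothetical name
  namespace: openshift-gitops        # assumption: OpenShift GitOps default namespace
spec:
  project: default                   # assumption
  source:
    repoURL: https://packages.confluent.io/helm
    chart: confluent-for-kubernetes
    targetRevision: 0.1514.19
    helm:
      parameters:
        # OpenShift enforces SCCs, so the chart's pod security settings are disabled
        - name: podSecurity.enabled
          value: "false"
  destination:
    server: https://kubernetes.default.svc
    namespace: confluent
  syncPolicy:
    automated:                       # assumption: auto-sync with pruning and self-heal
      prune: true
      selfHeal: true
    syncOptions:
      # Server-side apply keeps the large CFK CRDs out of client-side apply
      - ServerSideApply=true
```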

Layer 3: Confluent Platform (confluent-kafka chart)

Deploys all Confluent Platform components:

| Component | Replicas | Notes |
| --- | --- | --- |
| KRaftController | 1 | Pre-defined clusterID (exactly 22 chars) for deterministic cluster links |
| Kafka | 3 | Broker ID offset via annotation (0 for rack1, 100 for rack2) |
| SchemaRegistry | 1 | |
| KafkaRestProxy | 1 | |
| Connect | 1 | |
| ControlCenter | 1 | 300s liveness probe delay (slow startup with Kafka Streams) |
| KafkaRestClass | 1 | Required for ClusterLink CR REST API access |
| ClusterLink | 1-2 | Intra-region + cross-region links per cluster |
| Static LB IP Job | 1 | ArgoCD PostSync hook: patches MetalLB IPs onto Kafka LB services |

Key Design Decisions

Deterministic Cluster IDs

KRaft controllers use pre-defined clusterID values (exactly 22 characters) set in Helm values. This enables:

  • Single-pass GitOps: ClusterLink CRs reference known IDs at deploy time — no manual discovery step
  • Stable across redeployments: IDs don't change when tearing down and redeploying
  • Pattern for scaling: {dc}-{region}-{az}-{deployment}, padded to exactly 22 characters (e.g., garland-west-az1--c001)

Broker ID Offset

CFK annotation platform.confluent.io/broker-id-offset assigns non-overlapping broker IDs:

  • rack1 clusters (AZ1): offset 0 → brokers 0, 1, 2
  • rack2 clusters (AZ2): offset 100 → brokers 100, 101, 102
  • Gap of 100 allows scaling to 100 brokers per AZ without conflicts

Static MetalLB IPs

CFK creates LoadBalancer services but doesn't support per-broker IP annotations. An ArgoCD PostSync hook (Kubernetes Job) waits for CFK to create the services, then patches each with metallb.universe.tf/loadBalancerIPs to assign deterministic IPs matching pre-configured DNS entries.
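
A PostSync hook of this kind might look roughly like the Job below. It is a sketch under several assumptions and is not the contents of static-lb-ips.yaml: the Job and service account names, the CLI image, the deadline, and the hook delete policy are illustrative. The service names and IPs are taken from the dc-west-1 values shown under Per-Cluster Configuration.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: static-lb-ips                        # hypothetical name
  namespace: confluent
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation   # assumption
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 600                 # assumption: give up after 10 minutes
  template:
    spec:
      serviceAccountName: static-lb-ips      # hypothetical SA; needs get/patch on services
      restartPolicy: Never
      containers:
        - name: patch
          # assumption: any image that ships oc or kubectl would do
          image: registry.redhat.io/openshift4/ose-cli:latest
          command: ["/bin/bash", "-c"]
          args:
            - |
              set -euo pipefail
              patch_svc() {
                local svc="$1" ip="$2"
                # Wait until CFK has created the LoadBalancer service
                until oc get svc "$svc" -n confluent >/dev/null 2>&1; do
                  sleep 5
                done
                # Ask MetalLB for a specific address on this service
                oc annotate svc "$svc" -n confluent --overwrite \
                  "metallb.universe.tf/loadBalancerIPs=$ip"
              }
              patch_svc kafka-0-lb 172.16.2.90
              patch_svc kafka-1-lb 172.16.2.91
              patch_svc kafka-2-lb 172.16.2.92
              patch_svc kafka-bootstrap-lb 172.16.2.93
```

In the chart itself, the IP list would presumably be rendered from kafka.staticIPs rather than hard-coded.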

Custom SCC (Not anyuid)

A dedicated confluent-scc SecurityContextConstraints object pins pods to UID 1001 (the Confluent default) using the MustRunAs strategy. It is granted to 8 specific service accounts rather than relying on the broad anyuid SCC, and it is deployed as part of the infra chart.
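
As an illustration, an SCC with these properties could be sketched as follows. Only the name, the UID, and the MustRunAs strategy come from the text above. The group strategies, the volume list, and the placeholder service account are assumptions, and the real scc.yaml may differ.

```yaml
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: confluent-scc
allowPrivilegedContainer: false
allowPrivilegeEscalation: false
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
readOnlyRootFilesystem: false
runAsUser:
  type: MustRunAs
  uid: 1001                      # Confluent images run as UID 1001
seLinuxContext:
  type: MustRunAs
fsGroup:
  type: RunAsAny                 # assumption: could be tightened to a fixed range
supplementalGroups:
  type: RunAsAny                 # assumption
volumes:                         # assumption: volume types a typical CFK pod needs
  - configMap
  - downwardAPI
  - emptyDir
  - persistentVolumeClaim
  - projected
  - secret
users:
  # One entry per Confluent service account (8 in this deployment);
  # the placeholder below is not a real account name.
  - system:serviceaccount:confluent:<service-account>
```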

Rack Awareness

Brokers are tagged with rack labels matching their AZ:

  • AZ1 clusters: broker.rack=rack1
  • AZ2 clusters: broker.rack=rack2

Kafka uses these labels for replica placement, spreading replicas across racks (AZs) for fault tolerance.
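
Putting the broker ID offset and the rack label together, the Kafka CR rendered for a rack1 cluster might resemble the sketch below. The annotation name and the values (3 replicas, 50Gi, nfs-csi, rack1) come from this README. Setting broker.rack through configOverrides is an assumption about how the chart applies the label, and the resource names and image tags are placeholders.

```yaml
apiVersion: platform.confluent.io/v1beta1
kind: Kafka
metadata:
  name: kafka                               # assumed resource name
  namespace: confluent
  annotations:
    # rack1 clusters use "0" (brokers 0,1,2); rack2 clusters use "100" (brokers 100,101,102)
    platform.confluent.io/broker-id-offset: "0"
spec:
  replicas: 3
  image:
    application: confluentinc/cp-server:<tag>            # tag not stated in this README
    init: confluentinc/confluent-init-container:<tag>
  dataVolumeCapacity: 50Gi
  storageClass:
    name: nfs-csi
  configOverrides:
    server:
      # assumption: rack label injected as a plain broker property
      - broker.rack=rack1
  dependencies:
    kRaftController:
      clusterRef:
        name: kraftcontroller               # assumed KRaftController resource name
```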

Per-Cluster Configuration

Configuration is driven entirely by per-cluster values files in base/. Each cluster has two files:

{cluster}-infra.yaml

nfs:
  server: "172.16.2.201"             # NFS server IP
  share: "/mnt/samsung-1tbs/csi/vols" # NFS export path
metallb:
  addressPool: "172.16.2.90-172.16.2.99" # Dedicated MetalLB IP range
confluent:
  namespace: confluent
  dockerRegistry:
    server: docker.io
    username: <username>
    password: <password>
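
For orientation, the nfs and metallb values above would plausibly render into resources like these. This is a sketch, not the chart's actual templates: the nfs.csi.k8s.io provisioner assumes the upstream csi-driver-nfs, and the pool name, the metallb-system namespace, and the reclaim and binding settings are assumptions.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io            # assumption: upstream csi-driver-nfs
parameters:
  server: "172.16.2.201"               # nfs.server
  share: /mnt/samsung-1tbs/csi/vols    # nfs.share
reclaimPolicy: Delete                  # assumption
volumeBindingMode: Immediate           # assumption
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: confluent-pool                 # hypothetical name
  namespace: metallb-system            # assumption: MetalLB's usual namespace
spec:
  addresses:
    - 172.16.2.90-172.16.2.99          # metallb.addressPool
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: confluent-l2                   # hypothetical name
  namespace: metallb-system
spec:
  ipAddressPools:
    - confluent-pool
```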

{cluster}-kafka.yaml

namespace: confluent
kraft:
  replicas: 1
  clusterID: "garland-west-az1--c001"  # Exactly 22 chars
  storageClass: nfs-csi
kafka:
  replicas: 3
  brokerIdOffset: "0"                   # 0 for rack1, 100 for rack2
  rack: rack1                           # rack1 or rack2
  externalDomain: garland.arsalan.io    # Domain for external listener
  dataVolumeCapacity: 50Gi
  storageClass: nfs-csi
  staticIPs:
    brokers:
      - 172.16.2.90                     # kafka-0-lb
      - 172.16.2.91                     # kafka-1-lb
      - 172.16.2.92                     # kafka-2-lb
    bootstrap: 172.16.2.93              # kafka-bootstrap-lb
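clusterLinks:
  - name: garland-to-louisville
    sourceBootstrap: kafka-bootstrap.louisville.arsalan.io:9092  # source cluster's external bootstrap listener
    sourceClusterId: louisville-west-az2-c001      # must equal the source cluster's kraft.clusterID (22 chars)
    destinationClusterId: garland-west-az1--c001   # must equal this cluster's kraft.clusterID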
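
Each clusterLinks entry is presumably templated into a CFK ClusterLink resource. The sketch below shows one possible rendering. The field layout follows the CFK ClusterLink CRD as commonly documented and should be checked against the installed CRD version. The KafkaRestClass name, the source cluster ID, and the mirrored topic are placeholders rather than values from this repository.

```yaml
apiVersion: platform.confluent.io/v1beta1
kind: ClusterLink
metadata:
  name: garland-to-louisville
  namespace: confluent
spec:
  destinationKafkaCluster:
    clusterID: garland-west-az1--c001        # destinationClusterId
    kafkaRestClassRef:
      name: <kafkarestclass-name>            # the KafkaRestClass deployed by this chart
      namespace: confluent
  sourceKafkaCluster:
    clusterID: <source-cluster-id>           # sourceClusterId: the source's 22-char kraft.clusterID
    bootstrapEndpoint: kafka-bootstrap.louisville.arsalan.io:9092
  mirrorTopics:
    - name: <topic-name>                     # topics to mirror; none are listed in this README
```

Because both cluster IDs are known in advance, a manifest like this can be applied in the same sync as the brokers, without first querying either cluster for its ID.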

Adding a New Cluster or Deployment

  1. Create values files: Copy an existing pair ({cluster}-infra.yaml + {cluster}-kafka.yaml) and update:

    • clusterID: unique, exactly 22 characters
    • brokerIdOffset: 0 for rack1, 100 for rack2
    • rack: rack1 or rack2
    • externalDomain: cluster's DNS domain
    • metallb.addressPool: unique IP range
    • staticIPs: IPs from the MetalLB pool
    • clusterLinks: source/destination cluster IDs and bootstrap endpoints
  2. Add DNS entries: Per DNS_ENTRIES.md — broker and bootstrap records pointing to MetalLB IPs

  3. Create ArgoCD Applications: Add entries to the ApplicationSets or create individual Applications pointing to the new values files

  4. Push to git: ArgoCD auto-syncs and deploys the full stack
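
For step 3, if the ApplicationSets use a list generator, adding a cluster could be as small as one new element. The following is a hedged sketch rather than the contents of applicationset-kafka.yaml: the generator type, the API server URLs, the repository URL, the branch, and the openshift-gitops namespace are all assumptions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: confluent-kafka                # hypothetical name
  namespace: openshift-gitops          # assumption
spec:
  generators:
    - list:
        elements:
          - cluster: garland
            server: https://api.garland.example.com:6443      # hypothetical URL
          # New cluster: one more element, matching base/<cluster>-kafka.yaml
          - cluster: newcluster
            server: https://api.newcluster.example.com:6443   # hypothetical URL
  template:
    metadata:
      name: '{{cluster}}-kafka'
    spec:
      project: default                 # assumption
      source:
        repoURL: <git-repo-url>        # this repository's URL
        targetRevision: main           # assumption
        path: charts/confluent-kafka
        helm:
          valueFiles:
            # Values file resolved relative to the chart path
            - '../../base/{{cluster}}-kafka.yaml'
      destination:
        server: '{{server}}'
        namespace: confluent
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```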

Validated Capabilities

| Capability | Status | Details |
| --- | --- | --- |
| Multi-cluster deployment | Proven | 4 clusters, 36 pods total via ArgoCD |
| Broker ID offset | Proven | 0,1,2 / 100,101,102 per cluster pair |
| Rack awareness | Proven | rack1/rack2 per AZ |
| Intra-region cluster linking | Proven | 10/10 messages replicated, Lag: 0 |
| Cross-region cluster linking | Proven | 10/10 messages replicated both directions |
| Failover (broker kill) | Proven | 150/150 messages, zero data loss with acks=all |
| RF=3 with min.insync.replicas=2 | Proven | acks=all production during broker outage |
| GitOps tear-down + redeploy | Proven | Full stack from git in ~30 minutes |
| Deterministic cluster IDs | Proven | ClusterLink CRs work on first deploy |
| Static MetalLB IPs | Proven | PostSync hook assigns deterministic IPs |
| Custom SCC (UID 1001) | Proven | confluent-scc, not anyuid |
| OCI pull-through proxy | Proven | IDMS routes all pulls via oci.arsalan.io |

Known Limitations

| Limitation | Impact | Workaround |
| --- | --- | --- |
| CFK enforces required pod anti-affinity | Brokers per cluster ≤ nodes per cluster | Add worker nodes for more brokers |
| KRaft replicas cannot be scaled after creation | Must set correct quorum size at deploy | Deploy with target replica count from start |
| ControlCenter requires 3+ brokers | Won't start with RF < 3 | Ensure 3+ brokers before deploying CC |
| Connect defaults RF=3 for internal topics | Fails with < 3 brokers | Override *.storage.replication.factor if needed |
| CFK storageClass only on Kafka/KRaft | Schema validation error on SR/Connect | Don't add storageClass to SR/Connect CRs |
| clusterID must be exactly 22 bytes | KRaft won't deploy with wrong length | Use fixed pattern: {name}-{region}-{az}-{id} |
| ArgoCD OOMs on CFK CRDs | Controller CrashLoopBackOff | Use ServerSideApply=true + increase memory to 6Gi |
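
As an example of the Connect workaround, the three internal-topic replication factors can be lowered through configOverrides on the Connect CR. This fragment is a sketch for a cluster with fewer than 3 brokers. The value 1 is illustrative, and the image, dependencies, and other required fields are omitted.

```yaml
apiVersion: platform.confluent.io/v1beta1
kind: Connect
metadata:
  name: connect
  namespace: confluent
spec:
  replicas: 1
  configOverrides:
    server:
      # Kafka Connect worker properties for its three internal topics
      - config.storage.replication.factor=1
      - offset.storage.replication.factor=1
      - status.storage.replication.factor=1
  # image, dependencies, and other required fields omitted from this fragment
```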

Prerequisites

See PREREQUISITES.md for the full list of what must be configured before deploying.

Operational Runbook

See RUNBOOK.md for all 23 findings, solutions, and operational procedures discovered during the PoC.