Document Version: 2.0 Date: 2026-03-05 Authors: Platform Engineering Status: Architecture Decision Complete -- Implementation Ready

OpenShift Cloud Bursting on GCP - Architecture Decision

Executive Summary

This repository contains the architecture evaluation and implementation documentation for Wells Fargo's OpenShift cloud bursting capability on Google Cloud Platform (GCP). The primary use case is bursting workloads from on-premises OpenShift clusters to GCP, including GPU workloads for Arise AI.

Recommendation: Architecture 2 - Hot Standby Cluster is the selected approach based on weighted evaluation across time-to-burst, supportability, operational risk, implementation complexity, and cost.


Architectures Evaluated

| # | Architecture | Description | Key Sections | Verdict |
|---|--------------|-------------|--------------|---------|
| 1 | Dynamic Scratch Environment | Full cluster provisioned on demand, torn down after use | Timing, Cost, Ansible | Viable fallback |
| 2 | Hot Standby Cluster | Pre-provisioned cluster with scale-to-zero workers | Timing, Cost, Ansible | **RECOMMENDED** |
| 3 | Connected/Stretched Cluster | Single control plane spanning on-prem and GCP | Risks, Why Not | Not selected |

Weighted Decision Matrix

Scoring scale: 1 (poor) to 5 (best).

| Criterion | Weight | Dynamic Scratch | Hot Standby | Connected/Stretched |
|-----------|--------|-----------------|-------------|---------------------|
| Time to burst | 25% | 2 | 5 | 4 |
| Supportability / vendor posture | 20% | 4 | 4 | 1 |
| Operational risk / blast radius | 15% | 4 | 3 | 1 |
| Implementation complexity | 15% | 2 | 4 | 1 |
| Idle cost efficiency | 10% | 5 | 2 | 2 |
| Data locality fit | 10% | 2 | 2 | 4 |
| Automation fit (Ansible) | 5% | 3 | 4 | 2 |
| **Weighted total** | 100% | **3.05** | **3.70** | **2.20** |

Weight Rationale

  • Time to burst (25%): Arise AI requires burst readiness within 10 minutes. This is the dominant business SLO and the primary differentiator between architectures.
  • Supportability / vendor posture (20%): Wells Fargo's regulated environment requires vendor-backed support for production infrastructure. Unsupported topologies create unacceptable risk exposure during audits and incidents.
  • Operational risk / blast radius (15%): A failure in the burst infrastructure must not cascade into on-prem production workloads. Isolation boundaries matter.
  • Implementation complexity (15%): The 10-week delivery timeline requires an architecture that can be implemented with available team capacity and skills.
  • Idle cost efficiency (10%): Important but secondary -- the cost difference between architectures is measured in hundreds of dollars per month, not the dominant decision factor.
  • Data locality fit (10%): Relevant for workloads with high data gravity, but addressed by the data lane strategy rather than architecture choice.
  • Automation fit (5%): All three architectures can be automated with Ansible; the difference is in surface area and fragility, not feasibility.
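As a sanity check, the weighted totals can be recomputed directly from the per-criterion scores and weights above:

```python
# Weights and scores from the decision matrix above.
weights = {
    "time_to_burst": 0.25, "supportability": 0.20, "operational_risk": 0.15,
    "complexity": 0.15, "idle_cost": 0.10, "data_locality": 0.10, "automation_fit": 0.05,
}
scores = {
    "dynamic_scratch":   {"time_to_burst": 2, "supportability": 4, "operational_risk": 4,
                          "complexity": 2, "idle_cost": 5, "data_locality": 2, "automation_fit": 3},
    "hot_standby":       {"time_to_burst": 5, "supportability": 4, "operational_risk": 3,
                          "complexity": 4, "idle_cost": 2, "data_locality": 2, "automation_fit": 4},
    "connected_cluster": {"time_to_burst": 4, "supportability": 1, "operational_risk": 1,
                          "complexity": 1, "idle_cost": 2, "data_locality": 4, "automation_fit": 2},
}

# Weighted total = sum of (weight x score) per criterion.
totals = {arch: round(sum(weights[c] * s for c, s in crit.items()), 2)
          for arch, crit in scores.items()}
# -> dynamic_scratch: 3.05, hot_standby: 3.70, connected_cluster: 2.20
```

Hot Standby leads regardless of small perturbations in the weights, which is why the recommendation is robust to reasonable disagreement about individual weightings.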

Architecture Comparison Summary

The three architectures represent fundamentally different trade-offs along the speed-cost-risk spectrum. Hot Standby keeps a minimal control plane running continuously (~$527/month idle cost) to achieve the fastest burst response (5-8 minutes for standard workers). It accepts ongoing infrastructure cost in exchange for operational simplicity -- a single oc scale command triggers the entire scale-up, and pre-installed operators eliminate cold-start configuration delays.
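To illustrate that single-command scale-up as an Ansible task (the MachineSet name, replica count, and timeout here are illustrative, not the actual Wells Fargo values), a `kubernetes.core.k8s_scale` invocation might look like:

```yaml
# Sketch: scale a GCP burst-worker MachineSet up from zero.
# "burst-workers-a" and replicas: 6 are hypothetical placeholders.
- name: Scale burst worker MachineSet
  kubernetes.core.k8s_scale:
    api_version: machine.openshift.io/v1beta1
    kind: MachineSet
    name: burst-workers-a
    namespace: openshift-machine-api
    replicas: 6
    wait: true
    wait_timeout: 480   # leave headroom within the 10-minute burst SLO
```

Scale-down to zero is the same task with `replicas: 0`, which is what keeps idle cost limited to the control plane.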

Dynamic Scratch eliminates all idle cost by provisioning clusters on demand, but the fastest practical variant (Hive ClusterPool with hibernation) still requires 10-15 minutes for cluster resume plus worker provisioning. Full IPI install takes 45+ minutes, which is incompatible with the 10-minute SLO. The automation surface area is significantly larger, with more failure points during the critical burst-initiation window.

Connected/Stretched Cluster was rejected primarily on supportability grounds. Red Hat does not support mixed-provider topologies where the control plane and workers span different infrastructure platforms. Beyond the support gap, the architecture creates the largest blast radius: a network partition between on-prem (control plane) and GCP (workers) renders all GCP workloads unmanageable, and latency sensitivity of the etcd consensus protocol means even minor network degradation can cascade into cluster-wide instability.


Key Constraints

  • Burst SLO: Application-ready within 10 minutes of trigger
  • GPU Support: NVIDIA T4/A100 for Arise AI workloads
  • Connectivity: GCP Cloud Interconnect (preferred) with optional HA VPN overlay for encryption
  • Automation: Ansible-based orchestration integrated with Wells operational controls
  • Data Governance: Restricted VPC, no public endpoints, data locality controls
  • CIDR: No overlap allowed between on-prem and GCP networks
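The CIDR no-overlap constraint is cheap to verify in automation before any provisioning step; a minimal pre-flight check (the address ranges below are hypothetical examples, not the real address plan) could be:

```python
import ipaddress

# Hypothetical address plan -- substitute the real on-prem and GCP ranges.
on_prem = ipaddress.ip_network("10.0.0.0/12")    # example on-prem supernet
gcp_vpc = ipaddress.ip_network("10.128.0.0/16")  # example GCP VPC range

# Fail fast if the ranges overlap, before any cluster is provisioned.
assert not on_prem.overlaps(gcp_vpc), "CIDR overlap between on-prem and GCP"
```

Running this as an Ansible `assert`-style pre-flight task catches address-plan mistakes before they become routing incidents on the Interconnect.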

Assumptions and Prerequisites

Infrastructure:

  • GCP project exists with billing enabled and organizational policy constraints reviewed
  • Cloud Interconnect or HA VPN is pre-established and tested (not on the critical path for this plan)
  • DNS zones delegated and accessible from GCP VPC
  • Container registries (quay.io, registry.redhat.io) reachable via proxy or mirror
  • GCP quotas pre-requested and approved (vCPU, GPU, SSD, IP addresses per architecture requirements)
  • Ansible Automation Platform (AAP) available with required collections versioned

Organizational:

  • Wells Fargo CAB (Change Advisory Board) has pre-approved the burst automation pattern
  • Break-glass runbook approved for emergency manual operations
  • On-call rotation established for burst events
  • Cost center and chargeback model agreed for GCP spend
  • Data classification for burst workloads completed (no PCI/PII without additional controls)

Technical:

  • OpenShift version: 4.15+ (minimum supported for all referenced features)
  • RHACM version: 2.8+ (if using ClusterPools)
  • Ansible collections: kubernetes.core >= 3.0, community.okd >= 3.0, google.cloud >= 1.3
  • oc CLI and jq available on Ansible controller
  • kubeconfig provisioned and rotated via enterprise credential management
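The collection version floors above can be pinned in a standard AAP `collections/requirements.yml` (a sketch based on the minimums listed; exact patch versions would be set by the enterprise automation hub):

```yaml
# collections/requirements.yml -- version floors from the prerequisites above
collections:
  - name: kubernetes.core
    version: ">=3.0.0"
  - name: community.okd
    version: ">=3.0.0"
  - name: google.cloud
    version: ">=1.3.0"
```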

Regulatory and Compliance Considerations

Wells Fargo is subject to multiple regulatory frameworks that constrain cloud infrastructure decisions:

| Regulation | Relevance | Key Requirement |
|------------|-----------|-----------------|
| OCC/FDIC Guidance | Cloud computing risk management for national banks | Third-party risk assessment, data governance, business continuity |
| GLBA (Gramm-Leach-Bliley) | Customer data protection | If any customer data touches burst workloads, full data protection controls apply |
| SOX (Sarbanes-Oxley) | Financial reporting integrity | Audit trail for all infrastructure changes; segregation of duties |
| BCBS 239 | Risk data aggregation | Implications if burst workloads process risk reporting data |
| FFIEC | IT examination standards | Examination criteria for cloud computing and outsourcing |

Cross-cutting controls:

  • Encryption: TLS 1.2+ minimum for all in-transit data; FIPS 140-2 validated modules if required by policy
  • Data residency: US-only GCP regions (us-central1, us-east1, us-east4, us-west1, etc.)
  • CMEK: Customer-Managed Encryption Keys required for persistent volumes and GCS buckets
  • Audit logging: GCP Cloud Audit Logs + OpenShift audit logs forwarded to Wells SIEM
  • Segregation of duties: Separate roles for burst trigger, approval, cluster admin, and audit review
  • Evidence collection: Automated compliance evidence generation for audit cycles

See arch2-hot-standby.md Section 15 for detailed implementation of compliance controls.


Data Strategy

Three-pattern model per workload:

| Pattern | Best For | Trade-off |
|---------|----------|-----------|
| Remote Access | Moderate throughput, data must stay on-prem | Latency-sensitive apps may suffer |
| Pre-staged Copy | Latency-sensitive, high read volume | Data freshness / RPO management |
| Hybrid Cache | Predictable hot subset needed in cloud | Cache invalidation complexity |

Rule: No workload is promoted to burst-ready until a data lane is explicitly chosen and benchmarked.


Delivery Timeline (10 Weeks)

| Phase | Weeks | Focus | Exit Gate |
|-------|-------|-------|-----------|
| 0 - Discovery | 0-1 | SLO confirmation, compliance stance, data lane selection | Signed architecture decision record |
| 1 - Platform Baseline | 1-4 | OCP on GCP, operators, worker pools, GPU config | Platform readiness checklist |
| 2 - Automation | 4-7 | Ansible burst lifecycle, change controls, idempotency | Push-button dry run |
| 3 - Validation | 7-9 | Load tests, failure scenarios, security audit, 10-min SLO | All acceptance criteria passed |
| 4 - Pilot | 10 | Controlled pilot, hypercare, handoff | Operational ownership transferred |

Acceptance Tests

  • Burst SLO: Trigger to application-ready within 10 minutes
  • Scale: Workers and GPU nodes under load
  • Data: Lane-specific performance and correctness
  • Resilience: Fail one hybrid path, validate recovery
  • Security: IAM, network policy, log/audit controls
  • Rollback: Failure injection and automated recovery

Security Controls (Minimum)

  • IAM least privilege for installer, automation, and runtime service accounts
  • Secret handling via enterprise-approved mechanism (no plaintext in playbooks)
  • Network segmentation and firewall policy per restricted VPC standards
  • Egress control for required endpoints only
  • Encryption-in-transit decision documented (Interconnect only vs. Interconnect + HA VPN)
  • Audit and log forwarding to Wells SIEM
  • Image provenance and vulnerability policy gate

Operational Model

| Team | Responsibility |
|------|----------------|
| SRE / Platform | Cluster lifecycle, upgrades, policy, observability |
| Network | Interconnect / VPN availability and routing |
| Security | Control enforcement and audit evidence |
| App Team | Workload burst readiness and data lane ownership |

Monthly cadence: burst rehearsal, control evidence review, cost/performance optimization.


Change Control Integration

  • Routine bursts: Pre-approved standard change template in ServiceNow, auto-created by Ansible
  • Emergency bursts: Break-glass procedure with post-incident review within 24 hours
  • Scale-down: Automated with notification to NOC; no separate change ticket required
  • Approval chain: App Team request -> Platform Team validate -> Auto-execute
  • Audit trail: All burst actions logged to SIEM with correlation IDs linking ServiceNow ticket to cluster events
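A correlated audit event might take a shape like the following (every field name and value here is illustrative; the actual schema would be defined by the SIEM onboarding standard):

```json
{
  "event": "burst.scale_up",
  "burst_event_id": "burst-2026-03-05-001",
  "servicenow_change": "CHG0000000",
  "cluster": "ocp-burst-gcp",
  "initiated_by": "aap-svc-burst",
  "timestamp": "2026-03-05T14:44:51Z",
  "result": "success"
}
```

Carrying the same `burst_event_id` on the ServiceNow ticket, the Ansible job, and the cluster events is what makes end-to-end reconstruction possible during an audit.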

Cost Governance

  • GCP billing labels: Required on all burst resources (team, workload, burst-event-id, cost-center)
  • Budget alerts: Configured at 80% and 100% of monthly envelope via GCP Billing Budgets
  • CUD recommendation: 1-year Committed Use Discount for control plane nodes (~37% savings on idle cost)
  • Spot VMs: Allowed for non-GPU fault-tolerant burst workers (with preemption handling in Ansible)
  • Monthly review: Idle vs. burst cost breakdown reviewed by Platform and Finance teams
  • Chargeback model: Control plane cost to Platform Team budget; burst compute charged to requesting team
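As a rough worked example of the CUD figure (assuming, for simplicity, that the ~$527/month idle cost is entirely CUD-eligible, which in practice only the vCPU/RAM portion is):

```python
# Approximate effect of a 1-year Committed Use Discount on idle cost.
# Both figures are the approximations quoted in this document, not billing data.
idle_monthly = 527.00   # Hot Standby idle cost, USD/month
cud_discount = 0.37     # approximate 1-year CUD rate

with_cud = round(idle_monthly * (1 - cud_discount), 2)
# -> roughly $332/month after the discount
```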

Information Needed from Wells

  1. Encryption requirement for private hybrid links (Interconnect-only or mandatory VPN/MACsec)
  2. Priority workload list with data volume and RPO/RTO expectations
  3. Preferred ACM hub placement (GCP or on-prem) based on governance boundary

Repository Contents

```
gcp-osd/
  README.md                      # This file - overview and decision summary
  arch1-dynamic-scratch.md       # Architecture 1: Dynamic Scratch Environment
  arch2-hot-standby.md           # Architecture 2: Hot Standby Cluster (RECOMMENDED)
  arch3-connected-cluster.md     # Architecture 3: Connected/Stretched Cluster (NOT SELECTED)
```

Glossary

| Abbreviation | Definition |
|--------------|------------|
| AAP | Ansible Automation Platform |
| ACM / RHACM | Red Hat Advanced Cluster Management for Kubernetes |
| CAB | Change Advisory Board |
| CIDR | Classless Inter-Domain Routing |
| CMEK | Customer-Managed Encryption Keys |
| CUD | Committed Use Discount |
| FMEA | Failure Mode and Effects Analysis |
| GLBA | Gramm-Leach-Bliley Act |
| IPI | Installer-Provisioned Infrastructure |
| MCS | Machine Config Server |
| MTTR | Mean Time to Repair |
| OCC | Office of the Comptroller of the Currency |
| OCP | OpenShift Container Platform |
| RPO | Recovery Point Objective |
| RPN | Risk Priority Number |
| RTO | Recovery Time Objective |
| RWN | Remote Worker Nodes |
| SIEM | Security Information and Event Management |
| SLO | Service Level Objective |
| SOX | Sarbanes-Oxley Act |