# OpenShift Cloud Bursting on GCP - Architecture Decision

Document Version: 2.0 | Date: 2026-03-05 | Authors: Platform Engineering | Status: Architecture Decision Complete -- Implementation Ready

## Executive Summary
This repository contains the architecture evaluation and implementation documentation for Wells Fargo's OpenShift cloud bursting capability on Google Cloud Platform (GCP). The primary use case is bursting workloads from on-premises OpenShift clusters to GCP, including GPU workloads for Arise AI.
Recommendation: Architecture 2 - Hot Standby Cluster is the selected approach based on weighted evaluation across time-to-burst, supportability, operational risk, implementation complexity, and cost.
## Architectures Evaluated
| # | Architecture | Description | Key Sections | Verdict |
|---|---|---|---|---|
| 1 | Dynamic Scratch Environment | Full cluster provisioned on-demand, torn down after use | Timing, Cost, Ansible | Viable fallback |
| 2 | Hot Standby Cluster | Pre-provisioned cluster with scale-to-zero workers | Timing, Cost, Ansible | RECOMMENDED |
| 3 | Connected/Stretched Cluster | Single control plane spanning on-prem and GCP | Risks, Why Not | Not selected |
## Weighted Decision Matrix
Scoring scale: 1 (poor) to 5 (best).
| Criterion | Weight | Dynamic Scratch | Hot Standby | Connected/Stretched |
|---|---|---|---|---|
| Time to burst | 25% | 2 | 5 | 4 |
| Supportability / vendor posture | 20% | 4 | 4 | 1 |
| Operational risk / blast radius | 15% | 4 | 3 | 1 |
| Implementation complexity | 15% | 2 | 4 | 1 |
| Idle cost efficiency | 10% | 5 | 2 | 2 |
| Data locality fit | 10% | 2 | 2 | 4 |
| Automation fit (Ansible) | 5% | 3 | 4 | 2 |
| Weighted total | 100% | 3.05 | 3.70 | 2.20 |
### Weight Rationale
- Time to burst (25%): Arise AI requires burst readiness within 10 minutes. This is the dominant business SLO and the primary differentiator between architectures.
- Supportability / vendor posture (20%): Wells Fargo's regulated environment requires vendor-backed support for production infrastructure. Unsupported topologies create unacceptable risk exposure during audits and incidents.
- Operational risk / blast radius (15%): A failure in the burst infrastructure must not cascade into on-prem production workloads. Isolation boundaries matter.
- Implementation complexity (15%): The 10-week delivery timeline requires an architecture that can be implemented with available team capacity and skills.
- Idle cost efficiency (10%): Important but secondary -- the cost difference between architectures is measured in hundreds of dollars per month, not the dominant decision factor.
- Data locality fit (10%): Relevant for workloads with high data gravity, but addressed by the data lane strategy rather than architecture choice.
- Automation fit (5%): All three architectures can be automated with Ansible; the difference is in surface area and fragility, not feasibility.
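The weighted totals follow from straightforward arithmetic over the matrix; a minimal script to recompute them (criterion names abbreviated for readability):

```python
# Recompute the weighted decision matrix totals from the tables above.
weights = {
    "time_to_burst": 0.25,
    "supportability": 0.20,
    "operational_risk": 0.15,
    "complexity": 0.15,
    "idle_cost": 0.10,
    "data_locality": 0.10,
    "automation_fit": 0.05,
}
# Scores per architecture, in the same criterion order as `weights`.
scores = {
    "Dynamic Scratch": [2, 4, 4, 2, 5, 2, 3],
    "Hot Standby": [5, 4, 3, 4, 2, 2, 4],
    "Connected/Stretched": [4, 1, 1, 1, 2, 4, 2],
}
for arch, vals in scores.items():
    total = sum(w * s for w, s in zip(weights.values(), vals))
    print(f"{arch}: {total:.2f}")
```

Keeping the computation in a script makes the matrix auditable when weights or scores are revisited.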
## Architecture Comparison Summary
The three architectures represent fundamentally different trade-offs along the speed-cost-risk spectrum. Hot Standby keeps a minimal control plane running continuously (~$527/month idle cost) to achieve the fastest burst response (5-8 minutes for standard workers). It accepts ongoing infrastructure cost in exchange for operational simplicity -- a single `oc scale` command triggers the entire scale-up, and pre-installed operators eliminate cold-start configuration delays.
Dynamic Scratch eliminates all idle cost by provisioning clusters on demand, but the fastest practical variant (Hive ClusterPool with hibernation) still requires 10-15 minutes for cluster resume plus worker provisioning. Full IPI install takes 45+ minutes, which is incompatible with the 10-minute SLO. The automation surface area is significantly larger, with more failure points during the critical burst-initiation window.
Connected/Stretched Cluster was rejected primarily on supportability grounds. Red Hat does not support mixed-provider topologies where the control plane and workers span different infrastructure platforms. Beyond the support gap, the architecture creates the largest blast radius: a network partition between on-prem (control plane) and GCP (workers) renders all GCP workloads unmanageable, and latency sensitivity of the etcd consensus protocol means even minor network degradation can cascade into cluster-wide instability.
## Key Constraints
- Burst SLO: Application-ready within 10 minutes of trigger
- GPU Support: NVIDIA T4/A100 for Arise AI workloads
- Connectivity: GCP Cloud Interconnect (preferred) with optional HA VPN overlay for encryption
- Automation: Ansible-based orchestration integrated with Wells operational controls
- Data Governance: Restricted VPC, no public endpoints, data locality controls
- CIDR: No overlap allowed between on-prem and GCP networks
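The CIDR no-overlap constraint is mechanically checkable before any burst environment is provisioned. A minimal sketch using Python's standard `ipaddress` module -- the ranges shown are placeholders, not the actual Wells allocations:

```python
import ipaddress

# Placeholder CIDR allocations -- substitute the real on-prem and GCP ranges.
on_prem_cidrs = ["10.0.0.0/16", "10.1.0.0/16"]
gcp_cidrs = ["10.128.0.0/17", "172.16.0.0/20"]

def overlapping_pairs(a_cidrs, b_cidrs):
    """Return every (a, b) pair of networks that overlap."""
    pairs = []
    for a in map(ipaddress.ip_network, a_cidrs):
        for b in map(ipaddress.ip_network, b_cidrs):
            if a.overlaps(b):
                pairs.append((str(a), str(b)))
    return pairs

conflicts = overlapping_pairs(on_prem_cidrs, gcp_cidrs)
if conflicts:
    raise SystemExit(f"CIDR overlap detected: {conflicts}")
print("No CIDR overlap between on-prem and GCP ranges")
```

A check like this belongs in the burst automation's pre-flight stage so an overlap is caught before any GCP resources are created.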
## Assumptions and Prerequisites
Infrastructure:
- GCP project exists with billing enabled and organizational policy constraints reviewed
- Cloud Interconnect or HA VPN is pre-established and tested (not on the critical path for this plan)
- DNS zones delegated and accessible from GCP VPC
- Container registries (quay.io, registry.redhat.io) reachable via proxy or mirror
- GCP quotas pre-requested and approved (vCPU, GPU, SSD, IP addresses per architecture requirements)
- Ansible Automation Platform (AAP) available with required collections versioned
Organizational:
- Wells Fargo CAB (Change Advisory Board) has pre-approved the burst automation pattern
- Break-glass runbook approved for emergency manual operations
- On-call rotation established for burst events
- Cost center and chargeback model agreed for GCP spend
- Data classification for burst workloads completed (no PCI/PII without additional controls)
Technical:
- OpenShift version: 4.15+ (minimum supported for all referenced features)
- RHACM version: 2.8+ (if using ClusterPools)
- Ansible collections: `kubernetes.core` >= 3.0, `community.okd` >= 3.0, `google.cloud` >= 1.3
- `oc` CLI and `jq` available on the Ansible controller
- kubeconfig provisioned and rotated via enterprise credential management
## Regulatory and Compliance Considerations
Wells Fargo is subject to multiple regulatory frameworks that constrain cloud infrastructure decisions:
| Regulation | Relevance | Key Requirement |
|---|---|---|
| OCC/FDIC Guidance | Cloud computing risk management for national banks | Third-party risk assessment, data governance, business continuity |
| GLBA (Gramm-Leach-Bliley) | Customer data protection | If any customer data touches burst workloads, full data protection controls apply |
| SOX (Sarbanes-Oxley) | Financial reporting integrity | Audit trail for all infrastructure changes; segregation of duties |
| BCBS 239 | Risk data aggregation | Implications if burst workloads process risk reporting data |
| FFIEC | IT examination standards | Examination criteria for cloud computing and outsourcing |
Cross-cutting controls:
- Encryption: TLS 1.2+ minimum for all in-transit data; FIPS 140-2 validated modules if required by policy
- Data residency: US-only GCP regions (us-central1, us-east1, us-east4, us-west1, etc.)
- CMEK: Customer-Managed Encryption Keys required for persistent volumes and GCS buckets
- Audit logging: GCP Cloud Audit Logs + OpenShift audit logs forwarded to Wells SIEM
- Segregation of duties: Separate roles for burst trigger, approval, cluster admin, and audit review
- Evidence collection: Automated compliance evidence generation for audit cycles
See arch2-hot-standby.md Section 15 for detailed implementation of compliance controls.
## Data Strategy
Three-pattern model per workload:
| Pattern | Best For | Trade-off |
|---|---|---|
| Remote Access | Moderate throughput, data must stay on-prem | Latency-sensitive apps may suffer |
| Pre-staged Copy | Latency-sensitive, high read volume | Data freshness / RPO management |
| Hybrid Cache | Predictable hot subset needed in cloud | Cache invalidation complexity |
Rule: No workload is promoted to burst-ready until a data lane is explicitly chosen and benchmarked.
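The three-pattern model above can be encoded as a simple decision function. This is an illustrative sketch only -- the function name and boolean inputs are hypothetical, and real promotion decisions come from the per-workload benchmarks the rule requires:

```python
# Illustrative encoding of the three-pattern data lane model above.
# The function and its inputs are hypothetical placeholders.

def select_data_lane(latency_sensitive: bool,
                     hot_subset_predictable: bool,
                     data_must_stay_on_prem: bool) -> str:
    """Map workload characteristics to one of the three data lanes."""
    if data_must_stay_on_prem:
        # Data gravity wins; latency-sensitive apps may suffer (see table).
        return "remote-access"
    if hot_subset_predictable:
        return "hybrid-cache"
    if latency_sensitive:
        return "pre-staged-copy"
    return "remote-access"

print(select_data_lane(latency_sensitive=True,
                       hot_subset_predictable=False,
                       data_must_stay_on_prem=False))  # pre-staged-copy
```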
## Delivery Timeline (10 Weeks)
| Phase | Weeks | Focus | Exit Gate |
|---|---|---|---|
| 0 - Discovery | 0-1 | SLO confirmation, compliance stance, data lane selection | Signed architecture decision record |
| 1 - Platform Baseline | 1-4 | OCP on GCP, operators, worker pools, GPU config | Platform readiness checklist |
| 2 - Automation | 4-7 | Ansible burst lifecycle, change controls, idempotency | Push-button dry run |
| 3 - Validation | 7-9 | Load tests, failure scenarios, security audit, 10-min SLO | All acceptance criteria passed |
| 4 - Pilot | 10 | Controlled pilot, hypercare, handoff | Operational ownership transferred |
## Acceptance Tests
- Burst SLO: Trigger to application-ready within 10 minutes
- Scale: Workers and GPU nodes under load
- Data: Lane-specific performance and correctness
- Resilience: Fail one hybrid path, validate recovery
- Security: IAM, network policy, log/audit controls
- Rollback: Failure injection and automated recovery
## Security Controls (Minimum)
- IAM least privilege for installer, automation, and runtime service accounts
- Secret handling via enterprise-approved mechanism (no plaintext in playbooks)
- Network segmentation and firewall policy per restricted VPC standards
- Egress control for required endpoints only
- Encryption-in-transit decision documented (Interconnect only vs. Interconnect + HA VPN)
- Audit and log forwarding to Wells SIEM
- Image provenance and vulnerability policy gate
## Operational Model
| Team | Responsibility |
|---|---|
| SRE / Platform | Cluster lifecycle, upgrades, policy, observability |
| Network | Interconnect / VPN availability and routing |
| Security | Control enforcement and audit evidence |
| App Team | Workload burst readiness and data lane ownership |
Monthly cadence: burst rehearsal, control evidence review, cost/performance optimization.
## Change Control Integration
- Routine bursts: Pre-approved standard change template in ServiceNow, auto-created by Ansible
- Emergency bursts: Break-glass procedure with post-incident review within 24 hours
- Scale-down: Automated with notification to NOC; no separate change ticket required
- Approval chain: App Team request -> Platform Team validate -> Auto-execute
- Audit trail: All burst actions logged to SIEM with correlation IDs linking ServiceNow ticket to cluster events
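The correlation-ID pattern can be sketched as follows; the event schema and field names are illustrative assumptions, not the actual SIEM or ServiceNow schema:

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical sketch of the audit-trail pattern described above: every
# burst action carries a correlation ID linking the ServiceNow ticket to
# the cluster events forwarded to the SIEM. Field names are illustrative.

def burst_audit_event(action: str, snow_ticket: str, cluster: str) -> str:
    event = {
        "correlation_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,              # e.g. "burst-scale-up"
        "servicenow_ticket": snow_ticket,
        "cluster": cluster,
    }
    return json.dumps(event)

print(burst_audit_event("burst-scale-up", "CHG0012345", "gcp-hot-standby"))
```

Emitting one such event per lifecycle step lets an auditor reconstruct a burst end-to-end from the SIEM alone.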
## Cost Governance
- GCP billing labels: Required on all burst resources (`team`, `workload`, `burst-event-id`, `cost-center`)
- Budget alerts: Configured at 80% and 100% of monthly envelope via GCP Billing Budgets
- CUD recommendation: 1-year Committed Use Discount for control plane nodes (~37% savings on idle cost)
- Spot VMs: Allowed for non-GPU fault-tolerant burst workers (with preemption handling in Ansible)
- Monthly review: Idle vs. burst cost breakdown reviewed by Platform and Finance teams
- Chargeback model: Control plane cost to Platform Team budget; burst compute charged to requesting team
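The billing-label requirement lends itself to a pre-flight check; a minimal sketch, where the resource label dict shape is an assumption for illustration:

```python
# Sketch of a pre-flight check for the billing-label policy above: every
# burst resource must carry the four required labels. The label values
# shown are hypothetical examples.

REQUIRED_LABELS = {"team", "workload", "burst-event-id", "cost-center"}

def missing_labels(resource_labels: dict) -> set:
    """Return the required labels absent from a resource."""
    return REQUIRED_LABELS - resource_labels.keys()

labels = {"team": "platform", "workload": "arise-ai",
          "burst-event-id": "be-2026-001", "cost-center": "cc-4411"}
print(missing_labels(labels))  # set() -> resource is compliant
```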
## Information Needed from Wells
- Encryption requirement for private hybrid links (Interconnect-only or mandatory VPN/MACsec)
- Priority workload list with data volume and RPO/RTO expectations
- Preferred ACM hub placement (GCP or on-prem) based on governance boundary
## Repository Contents

    gcp-osd/
      README.md                   # This file - overview and decision summary
      arch1-dynamic-scratch.md    # Architecture 1: Dynamic Scratch Environment
      arch1-dynamic-scratch.png   # Architecture 1 diagram
      arch2-hot-standby.md        # Architecture 2: Hot Standby Cluster (RECOMMENDED)
      arch2-hot-standby.png       # Architecture 2 diagram
      arch3-connected-cluster.md  # Architecture 3: Connected/Stretched Cluster (NOT SELECTED)
      arch3-connected-cluster.png # Architecture 3 diagram
## Glossary
| Abbreviation | Definition |
|---|---|
| AAP | Ansible Automation Platform |
| ACM / RHACM | Red Hat Advanced Cluster Management for Kubernetes |
| CAB | Change Advisory Board |
| CIDR | Classless Inter-Domain Routing |
| CMEK | Customer-Managed Encryption Keys |
| CUD | Committed Use Discount |
| FMEA | Failure Mode and Effects Analysis |
| GLBA | Gramm-Leach-Bliley Act |
| IPI | Installer-Provisioned Infrastructure |
| MCS | Machine Config Server |
| MTTR | Mean Time to Repair |
| NOC | Network Operations Center |
| OCC | Office of the Comptroller of the Currency |
| OCP | OpenShift Container Platform |
| RPO | Recovery Point Objective |
| RPN | Risk Priority Number |
| RTO | Recovery Time Objective |
| RWN | Remote Worker Nodes |
| SIEM | Security Information and Event Management |
| SLO | Service Level Objective |
| SOX | Sarbanes-Oxley Act |
| SRE | Site Reliability Engineering |