Document Version: 2.0 Date: 2026-03-05 Authors: Platform Engineering Status: Architecture Decision Complete -- Implementation Ready

OpenShift Cloud Bursting on GCP - Architecture Decision

Executive Summary

This repository contains the architecture evaluation and implementation documentation for Wells Fargo's OpenShift cloud bursting capability on Google Cloud Platform (GCP). The primary use case is bursting workloads from on-premises OpenShift clusters to GCP, including GPU workloads for Arise AI.

Recommendation: Architecture 2 - Hot Standby Cluster is the selected approach based on weighted evaluation across time-to-burst, supportability, operational risk, implementation complexity, and cost.


Architectures Evaluated

| # | Architecture | Description | Key Sections | Verdict |
|---|--------------|-------------|--------------|---------|
| 1 | Dynamic Scratch Environment | Full cluster provisioned on demand, torn down after use | Timing, Cost, Ansible | Viable fallback |
| 2 | Hot Standby Cluster | Pre-provisioned cluster with scale-to-zero workers | Timing, Cost, Ansible | **RECOMMENDED** |
| 3 | Connected/Stretched Cluster | Single control plane spanning on-prem and GCP | Risks, Why Not | Not selected |

Weighted Decision Matrix

Scoring scale: 1 (poor) to 5 (best).

| Criterion | Weight | Dynamic Scratch | Hot Standby | Connected/Stretched |
|-----------|--------|-----------------|-------------|---------------------|
| Time to burst | 25% | 2 | 5 | 4 |
| Supportability / vendor posture | 20% | 4 | 4 | 1 |
| Operational risk / blast radius | 15% | 4 | 3 | 1 |
| Implementation complexity | 15% | 2 | 4 | 1 |
| Idle cost efficiency | 10% | 5 | 2 | 2 |
| Data locality fit | 10% | 2 | 2 | 4 |
| Automation fit (Ansible) | 5% | 3 | 4 | 2 |
| **Weighted total** | 100% | **3.05** | **3.70** | **2.20** |

Weight Rationale

  • Time to burst (25%): Arise AI requires burst readiness within 10 minutes. This is the dominant business SLO and the primary differentiator between architectures.
  • Supportability / vendor posture (20%): Wells Fargo's regulated environment requires vendor-backed support for production infrastructure. Unsupported topologies create unacceptable risk exposure during audits and incidents.
  • Operational risk / blast radius (15%): A failure in the burst infrastructure must not cascade into on-prem production workloads. Isolation boundaries matter.
  • Implementation complexity (15%): The 10-week delivery timeline requires an architecture that can be implemented with available team capacity and skills.
  • Idle cost efficiency (10%): Important but secondary -- the cost difference between architectures is measured in hundreds of dollars per month, not the dominant decision factor.
  • Data locality fit (10%): Relevant for workloads with high data gravity, but addressed by the data lane strategy rather than architecture choice.
  • Automation fit (5%): All three architectures can be automated with Ansible; the difference is in surface area and fragility, not feasibility.
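As a sanity check, the weighted totals can be recomputed directly from the per-criterion scores and weights above:

```python
# Weights and scores from the decision matrix above.
weights = {
    "time_to_burst": 0.25, "supportability": 0.20, "operational_risk": 0.15,
    "complexity": 0.15, "idle_cost": 0.10, "data_locality": 0.10, "automation_fit": 0.05,
}
scores = {
    "dynamic_scratch":   {"time_to_burst": 2, "supportability": 4, "operational_risk": 4,
                          "complexity": 2, "idle_cost": 5, "data_locality": 2, "automation_fit": 3},
    "hot_standby":       {"time_to_burst": 5, "supportability": 4, "operational_risk": 3,
                          "complexity": 4, "idle_cost": 2, "data_locality": 2, "automation_fit": 4},
    "connected_cluster": {"time_to_burst": 4, "supportability": 1, "operational_risk": 1,
                          "complexity": 1, "idle_cost": 2, "data_locality": 4, "automation_fit": 2},
}

# Weighted total = sum of (weight x score) per criterion.
totals = {arch: round(sum(weights[c] * s for c, s in crit.items()), 2)
          for arch, crit in scores.items()}
# -> dynamic_scratch: 3.05, hot_standby: 3.70, connected_cluster: 2.20
```

Hot Standby leads regardless of small perturbations in the weights, which is why the recommendation is robust to reasonable disagreement about individual weightings.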

Architecture Comparison Summary

The three architectures represent fundamentally different trade-offs along the speed-cost-risk spectrum. Hot Standby keeps a minimal control plane running continuously (~$527/month idle cost) to achieve the fastest burst response (5-8 minutes for standard workers). It accepts ongoing infrastructure cost in exchange for operational simplicity -- a single oc scale command triggers the entire scale-up, and pre-installed operators eliminate cold-start configuration delays.
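To illustrate that single-command scale-up as an Ansible task (the MachineSet name, replica count, and timeout here are illustrative, not the actual Wells Fargo values), a `kubernetes.core.k8s_scale` invocation might look like:

```yaml
# Sketch: scale a GCP burst-worker MachineSet up from zero.
# "burst-workers-a" and replicas: 6 are hypothetical placeholders.
- name: Scale burst worker MachineSet
  kubernetes.core.k8s_scale:
    api_version: machine.openshift.io/v1beta1
    kind: MachineSet
    name: burst-workers-a
    namespace: openshift-machine-api
    replicas: 6
    wait: true
    wait_timeout: 480   # leave headroom within the 10-minute burst SLO
```

Scale-down to zero is the same task with `replicas: 0`, which is what keeps idle cost limited to the control plane.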

Dynamic Scratch eliminates all idle cost by provisioning clusters on demand, but the fastest practical variant (Hive ClusterPool with hibernation) still requires 10-15 minutes for cluster resume plus worker provisioning. Full IPI install takes 45+ minutes, which is incompatible with the 10-minute SLO. The automation surface area is significantly larger, with more failure points during the critical burst-initiation window.

Connected/Stretched Cluster was rejected primarily on supportability grounds. Red Hat does not support mixed-provider topologies where the control plane and workers span different infrastructure platforms. Beyond the support gap, the architecture creates the largest blast radius: a network partition between on-prem (control plane) and GCP (workers) renders all GCP workloads unmanageable, and latency sensitivity of the etcd consensus protocol means even minor network degradation can cascade into cluster-wide instability.


Key Constraints

  • Burst SLO: Application-ready within 10 minutes of trigger
  • GPU Support: NVIDIA T4/A100 for Arise AI workloads
  • Connectivity: GCP Cloud Interconnect (preferred) with optional HA VPN overlay for encryption
  • Automation: Ansible-based orchestration integrated with Wells operational controls
  • Data Governance: Restricted VPC, no public endpoints, data locality controls
  • CIDR: No overlap allowed between on-prem and GCP networks
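The CIDR no-overlap constraint is cheap to verify in automation before any provisioning step; a minimal pre-flight check (the address ranges below are hypothetical examples, not the real address plan) could be:

```python
import ipaddress

# Hypothetical address plan -- substitute the real on-prem and GCP ranges.
on_prem = ipaddress.ip_network("10.0.0.0/12")    # example on-prem supernet
gcp_vpc = ipaddress.ip_network("10.128.0.0/16")  # example GCP VPC range

# Fail fast if the ranges overlap, before any cluster is provisioned.
assert not on_prem.overlaps(gcp_vpc), "CIDR overlap between on-prem and GCP"
```

Running this as an Ansible `assert`-style pre-flight task catches address-plan mistakes before they become routing incidents on the Interconnect.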

Assumptions and Prerequisites

Infrastructure:

  • GCP project exists with billing enabled and organizational policy constraints reviewed
  • Cloud Interconnect or HA VPN is pre-established and tested (not on the critical path for this plan)
  • DNS zones delegated and accessible from GCP VPC
  • Container registries (quay.io, registry.redhat.io) reachable via proxy or mirror
  • GCP quotas pre-requested and approved (vCPU, GPU, SSD, IP addresses per architecture requirements)
  • Ansible Automation Platform (AAP) available with required collections versioned

Organizational:

  • Wells Fargo CAB (Change Advisory Board) has pre-approved the burst automation pattern
  • Break-glass runbook approved for emergency manual operations
  • On-call rotation established for burst events
  • Cost center and chargeback model agreed for GCP spend
  • Data classification for burst workloads completed (no PCI/PII without additional controls)

Technical:

  • OpenShift version: 4.15+ (minimum supported for all referenced features)
  • RHACM version: 2.8+ (if using ClusterPools)
  • Ansible collections: kubernetes.core >= 3.0, community.okd >= 3.0, google.cloud >= 1.3
  • oc CLI and jq available on Ansible controller
  • kubeconfig provisioned and rotated via enterprise credential management
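The collection version floors above can be pinned in a standard AAP `collections/requirements.yml` (a sketch based on the minimums listed; exact patch versions would be set by the enterprise automation hub):

```yaml
# collections/requirements.yml -- version floors from the prerequisites above
collections:
  - name: kubernetes.core
    version: ">=3.0.0"
  - name: community.okd
    version: ">=3.0.0"
  - name: google.cloud
    version: ">=1.3.0"
```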

Regulatory and Compliance Considerations

Wells Fargo is subject to multiple regulatory frameworks that constrain cloud infrastructure decisions:

| Regulation | Relevance | Key Requirement |
|------------|-----------|-----------------|
| OCC/FDIC Guidance | Cloud computing risk management for national banks | Third-party risk assessment, data governance, business continuity |
| GLBA (Gramm-Leach-Bliley) | Customer data protection | If any customer data touches burst workloads, full data protection controls apply |
| SOX (Sarbanes-Oxley) | Financial reporting integrity | Audit trail for all infrastructure changes; segregation of duties |
| BCBS 239 | Risk data aggregation | Implications if burst workloads process risk reporting data |
| FFIEC | IT examination standards | Examination criteria for cloud computing and outsourcing |

Cross-cutting controls:

  • Encryption: TLS 1.2+ minimum for all in-transit data; FIPS 140-2 validated modules if required by policy
  • Data residency: US-only GCP regions (us-central1, us-east1, us-east4, us-west1, etc.)
  • CMEK: Customer-Managed Encryption Keys required for persistent volumes and GCS buckets
  • Audit logging: GCP Cloud Audit Logs + OpenShift audit logs forwarded to Wells SIEM
  • Segregation of duties: Separate roles for burst trigger, approval, cluster admin, and audit review
  • Evidence collection: Automated compliance evidence generation for audit cycles

See arch2-hot-standby.md Section 15 for detailed implementation of compliance controls.


Data Strategy

Three-pattern model per workload:

| Pattern | Best For | Trade-off |
|---------|----------|-----------|
| Remote Access | Moderate throughput, data must stay on-prem | Latency-sensitive apps may suffer |
| Pre-staged Copy | Latency-sensitive, high read volume | Data freshness / RPO management |
| Hybrid Cache | Predictable hot subset needed in cloud | Cache invalidation complexity |

Rule: No workload is promoted to burst-ready until a data lane is explicitly chosen and benchmarked.


Delivery Timeline (10 Weeks)

| Phase | Weeks | Focus | Exit Gate |
|-------|-------|-------|-----------|
| 0 - Discovery | 0-1 | SLO confirmation, compliance stance, data lane selection | Signed architecture decision record |
| 1 - Platform Baseline | 1-4 | OCP on GCP, operators, worker pools, GPU config | Platform readiness checklist |
| 2 - Automation | 4-7 | Ansible burst lifecycle, change controls, idempotency | Push-button dry run |
| 3 - Validation | 7-9 | Load tests, failure scenarios, security audit, 10-min SLO | All acceptance criteria passed |
| 4 - Pilot | 10 | Controlled pilot, hypercare, handoff | Operational ownership transferred |

Acceptance Tests

  • Burst SLO: Trigger to application-ready within 10 minutes
  • Scale: Workers and GPU nodes under load
  • Data: Lane-specific performance and correctness
  • Resilience: Fail one hybrid path, validate recovery
  • Security: IAM, network policy, log/audit controls
  • Rollback: Failure injection and automated recovery

Security Controls (Minimum)

  • IAM least privilege for installer, automation, and runtime service accounts
  • Secret handling via enterprise-approved mechanism (no plaintext in playbooks)
  • Network segmentation and firewall policy per restricted VPC standards
  • Egress control for required endpoints only
  • Encryption-in-transit decision documented (Interconnect only vs. Interconnect + HA VPN)
  • Audit and log forwarding to Wells SIEM
  • Image provenance and vulnerability policy gate

Operational Model

| Team | Responsibility |
|------|----------------|
| SRE / Platform | Cluster lifecycle, upgrades, policy, observability |
| Network | Interconnect / VPN availability and routing |
| Security | Control enforcement and audit evidence |
| App Team | Workload burst readiness and data lane ownership |

Monthly cadence: burst rehearsal, control evidence review, cost/performance optimization.


Change Control Integration

  • Routine bursts: Pre-approved standard change template in ServiceNow, auto-created by Ansible
  • Emergency bursts: Break-glass procedure with post-incident review within 24 hours
  • Scale-down: Automated with notification to NOC; no separate change ticket required
  • Approval chain: App Team request -> Platform Team validate -> Auto-execute
  • Audit trail: All burst actions logged to SIEM with correlation IDs linking ServiceNow ticket to cluster events
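A correlated audit event might take a shape like the following (every field name and value here is illustrative; the actual schema would be defined by the SIEM onboarding standard):

```json
{
  "event": "burst.scale_up",
  "burst_event_id": "burst-2026-03-05-001",
  "servicenow_change": "CHG0000000",
  "cluster": "ocp-burst-gcp",
  "initiated_by": "aap-svc-burst",
  "timestamp": "2026-03-05T14:44:51Z",
  "result": "success"
}
```

Carrying the same `burst_event_id` on the ServiceNow ticket, the Ansible job, and the cluster events is what makes end-to-end reconstruction possible during an audit.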

Cost Governance

  • GCP billing labels: Required on all burst resources (team, workload, burst-event-id, cost-center)
  • Budget alerts: Configured at 80% and 100% of monthly envelope via GCP Billing Budgets
  • CUD recommendation: 1-year Committed Use Discount for control plane nodes (~37% savings on idle cost)
  • Spot VMs: Allowed for non-GPU fault-tolerant burst workers (with preemption handling in Ansible)
  • Monthly review: Idle vs. burst cost breakdown reviewed by Platform and Finance teams
  • Chargeback model: Control plane cost to Platform Team budget; burst compute charged to requesting team
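As a rough worked example of the CUD figure (assuming, for simplicity, that the ~$527/month idle cost is entirely CUD-eligible, which in practice only the vCPU/RAM portion is):

```python
# Approximate effect of a 1-year Committed Use Discount on idle cost.
# Both figures are the approximations quoted in this document, not billing data.
idle_monthly = 527.00   # Hot Standby idle cost, USD/month
cud_discount = 0.37     # approximate 1-year CUD rate

with_cud = round(idle_monthly * (1 - cud_discount), 2)
# -> roughly $332/month after the discount
```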

Information Needed from Wells

  1. Encryption requirement for private hybrid links (Interconnect-only or mandatory VPN/MACsec)
  2. Priority workload list with data volume and RPO/RTO expectations
  3. Preferred ACM hub placement (GCP or on-prem) based on governance boundary

Repository Contents

```
gcp-osd/
  README.md                      # This file - overview and decision summary
  arch1-dynamic-scratch.md       # Architecture 1: Dynamic Scratch Environment
  arch2-hot-standby.md           # Architecture 2: Hot Standby Cluster (RECOMMENDED)
  arch3-connected-cluster.md     # Architecture 3: Connected/Stretched Cluster (NOT SELECTED)
```

Glossary

| Abbreviation | Definition |
|--------------|------------|
| AAP | Ansible Automation Platform |
| ACM / RHACM | Red Hat Advanced Cluster Management for Kubernetes |
| CAB | Change Advisory Board |
| CIDR | Classless Inter-Domain Routing |
| CMEK | Customer-Managed Encryption Keys |
| CUD | Committed Use Discount |
| FMEA | Failure Mode and Effects Analysis |
| GLBA | Gramm-Leach-Bliley Act |
| IPI | Installer-Provisioned Infrastructure |
| MCS | Machine Config Server |
| MTTR | Mean Time to Repair |
| OCC | Office of the Comptroller of the Currency |
| OCP | OpenShift Container Platform |
| RPO | Recovery Point Objective |
| RPN | Risk Priority Number |
| RTO | Recovery Time Objective |
| RWN | Remote Worker Nodes |
| SIEM | Security Information and Event Management |
| SLO | Service Level Objective |
| SOX | Sarbanes-Oxley Act |