Platform Engineering as a Product (2/3): Architecture, Operating Model, and Governance
Part 2 of 3 — Platform Engineering as a Product
- Part 1: Why Platform Engineering Matters (and Why Most Get It Wrong)
- Part 2: Inside the Platform: Architecture, Operating Model, and Governance (you are here)
- Part 3: Building It — The Maturity Roadmap and Tech Stack
From Vision to Internals
Most platform engineering efforts die in the architecture phase. Not because the technology is wrong — because no one defines who owns what, who’s on-call, or how governance actually works without becoming a bottleneck.
In Part 1, we covered the why. Now let’s open the hood. This article covers three things:
- The 5-layer architecture — what each layer does and how they compose
- The operating model — who owns what, who’s on-call, and how the platform team functions as a product team
- Governance-as-code — how to embed compliance without creating bottlenecks
1. The 5-Layer Platform Architecture
The platform is modular, extensible, and interface-consistent. Every capability flows through the same API — whether your developers use the CLI, UI, or API directly.
```mermaid
flowchart TB
    subgraph UI["1. Developer Interface Layer"]
        A1["CLI"]
        A2["UI Portal"]
        A3["API Gateway"]
    end
    subgraph Core["2. Platform Core Services"]
        B1["Service Blueprint Engine"]
        B2["Infra Blueprint Engine"]
        B3["Pipeline Orchestrator"]
    end
    subgraph Policy["3. Policy & Compliance Layer"]
        C1["OPA Policy Engine"]
        C2["Input Validators"]
        C3["Security Scanners"]
    end
    subgraph Runtime["4. Provisioning & Runtime Layer"]
        D1["Infra Provisioning"]
        D2["Service Deployment"]
        D3["Environment Orchestrator"]
    end
    subgraph Obs["5. Observability & Audit Layer"]
        E1["Telemetry Hooks"]
        E2["Audit Trail"]
        E3["Cost Monitor"]
    end
    A1 --> A3
    A2 --> A3
    A3 --> B1
    A3 --> B2
    A3 --> B3
    B1 --> C1
    B2 --> C1
    B3 --> C2
    B3 --> C3
    C1 --> D1
    C2 --> D1
    C3 --> D2
    B1 --> D2
    D1 --> E3
    D2 --> E1
    D2 --> E2
```
Layer 1: Developer Interface
This is the surface your developers interact with. Three interfaces, one underlying API.
- API Gateway — the primary surface. All capabilities are exposed via versioned APIs. This is the backbone that ensures consistency across all interfaces.
- CLI Tool — for power users and CI/CD integrations. Think `dip create`, `dip deploy`, `dip monitor`. Uses the same API as everything else.
- UI Portal — for visual workflows, service discovery, scaffolding, and dashboards. Ideal for onboarding and service browsing.
The guiding principle: every capability is API-accessible, CLI-controllable, and UI-visible.
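To make the principle concrete, here is a minimal sketch of what "one underlying API" looks like in practice: every interface builds the same versioned request, so behavior can never diverge between CLI and UI. The endpoint path and payload shape below are invented for illustration, not the platform's real contract.

```python
# Hypothetical sketch: CLI and UI both resolve to the same versioned API call.
# The /api/v1/services endpoint and payload shape are assumptions.
import json

API_VERSION = "v1"

def build_request(capability: str, params: dict) -> dict:
    """Translate any interface action into the single platform API call."""
    return {
        "method": "POST",
        "path": f"/api/{API_VERSION}/{capability}",
        "body": json.dumps(params, sort_keys=True),
    }

def cli_create_service(name: str, template: str) -> dict:
    # `dip create my-api --template rest-api` resolves to this request.
    return build_request("services", {"name": name, "template": template})

def ui_create_service(form_data: dict) -> dict:
    # The UI portal's form submit produces the identical request.
    return build_request("services", form_data)

cli = cli_create_service("my-api", "rest-api")
ui = ui_create_service({"name": "my-api", "template": "rest-api"})
assert cli == ui  # byte-identical: no interface-specific behavior to drift
```

Because both paths funnel through `build_request`, adding a capability once makes it available everywhere, which is exactly the consistency guarantee the API Gateway provides.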
Layer 2: Platform Core Services
This is where the intelligence lives.
- Service Blueprint Engine — provides golden templates for common service types. A developer says “I need a REST API service” and gets a fully scaffolded project with CI/CD, logging, tracing, and compliance hooks pre-wired.
- Infrastructure Blueprint Engine — manages reusable, versioned modules for infrastructure provisioning. Databases, queues, caches, storage — all defined as curated blueprints that abstract away cloud-specific details.
- Pipeline Orchestrator — standardizes CI/CD pipelines across teams. Shared templates with hooks for security scanning, policy checks, and automated testing.
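A toy sketch of the Service Blueprint Engine's core move: a golden template expands into a scaffolded project with CI/CD and observability pre-wired. The template names and file contents here are invented stand-ins, not the real blueprint format.

```python
# Illustrative golden-template expansion. Template registry contents are
# assumptions; a real engine would render many more files.
GOLDEN_TEMPLATES = {
    "rest-api": {
        "src/main.py": "# service entrypoint with tracing pre-wired\n",
        ".ci/pipeline.yaml": "extends: platform/pipelines/standard@v2\n",
        "observability.yaml": "logging: enabled\ntracing: otel\n",
    },
}

def scaffold(service_name: str, template: str) -> dict:
    """Return the file tree a developer gets for 'I need a REST API service'."""
    if template not in GOLDEN_TEMPLATES:
        raise ValueError(f"unknown template: {template}")
    files = dict(GOLDEN_TEMPLATES[template])  # copy, never mutate the template
    files["service.yaml"] = f"name: {service_name}\ntemplate: {template}\n"
    return files

project = scaffold("payments-api", "rest-api")
assert ".ci/pipeline.yaml" in project  # CI/CD comes for free, not bolted on
```

The point of the sketch: compliance hooks live in the template, so every new service starts compliant instead of retrofitting later.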
Layer 3: Policy & Compliance
Governance is built into the platform, not bolted on top.
- Policy-as-Code Engine — uses OPA (Open Policy Agent) or similar tools for runtime policy evaluation. Rules like “all storage must be encrypted at rest” or “no public-facing resources without TLS” are evaluated automatically during provisioning and deployment.
- Input Validators — catch bad requests early. Before a blueprint even executes, parameters are validated against schemas, allowed values, and organizational constraints.
- Security Scanners — embedded into every pipeline. SAST, container scanning, IaC scanning — all running automatically as part of the golden path.
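The input-validator idea can be sketched in a few lines: check parameters against schemas and organizational constraints before any blueprint executes. The specific rules below (name pattern, allowed regions, mandatory encryption) are invented examples of such constraints.

```python
# Minimal input-validator sketch: fail fast, before anything provisions.
# The SCHEMA rules are illustrative organizational constraints.
import re

SCHEMA = {
    "name": {"pattern": r"^[a-z][a-z0-9-]{2,30}$"},
    "region": {"allowed": {"eu-west-1", "eu-central-1"}},
    "encrypted": {"allowed": {True}},  # encryption at rest is non-negotiable
}

def validate(params: dict) -> list:
    """Return a list of violations; empty means the request may proceed."""
    errors = []
    for field, rule in SCHEMA.items():
        if field not in params:
            errors.append(f"missing required field: {field}")
            continue
        value = params[field]
        if "pattern" in rule and not re.match(rule["pattern"], str(value)):
            errors.append(f"{field}: invalid format")
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{field}: value not permitted")
    return errors

assert validate({"name": "orders-db", "region": "eu-west-1", "encrypted": True}) == []
assert validate({"name": "Orders", "region": "us-east-1", "encrypted": False})
```

Cheap checks like these catch the majority of bad requests before the heavier policy engine or scanners ever run.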
Layer 4: Provisioning & Runtime
This layer executes the actual infrastructure and service deployments.
- Infra Provisioning Controller — manages infrastructure lifecycle via Terraform, Crossplane, or Pulumi. Handles creation, updates, drift detection, and teardown.
- Service Deployment Controller — integrates with GitOps tools (ArgoCD, Flux) or CI/CD runners for application deployment across environments.
- Environment Orchestrator — handles multi-environment support, cloud-region mapping, and on-prem orchestration. Developers spin up environments through the platform; this layer handles the complexity underneath.
Layer 5: Observability & Audit
Every service deployed through the platform comes with observability pre-configured.
- Telemetry Hooks — logging, metrics, and tracing are baked into blueprints via OpenTelemetry, Prometheus, and Grafana. Developers don’t configure monitoring — it’s already there.
- Audit Trail Collector — every action through the platform is logged. Who provisioned what, when, from where. Critical for compliance and debugging.
- Cost & Usage Monitor — auto-tags resources and links them to cost dashboards. Teams see their spend in real time, not at the end of the quarter.
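A hedged sketch of what sits behind the Cost & Usage Monitor: the platform stamps every resource with ownership tags at provisioning time, then rolls spend up by cost-center for the dashboards. The tag keys are illustrative, not a fixed contract.

```python
# Auto-tagging + cost attribution sketch. Tag names are assumptions.
from datetime import datetime, timezone

def auto_tags(team: str, service: str, env: str) -> dict:
    """Tags injected by the platform — developers never set these by hand."""
    return {
        "cost-center": team,
        "service": service,
        "env": env,
        "managed-by": "platform",
        "provisioned-at": datetime.now(timezone.utc).strftime("%Y-%m-%d"),
    }

def attribute_costs(resources: list) -> dict:
    """Roll per-resource spend up to cost-centers for real-time dashboards."""
    totals = {}
    for r in resources:
        cc = r["tags"]["cost-center"]
        totals[cc] = totals.get(cc, 0.0) + r["monthly_cost"]
    return totals

resources = [
    {"tags": auto_tags("payments", "orders-api", "prod"), "monthly_cost": 120.0},
    {"tags": auto_tags("payments", "orders-db", "prod"), "monthly_cost": 300.0},
    {"tags": auto_tags("search", "index-svc", "prod"), "monthly_cost": 90.0},
]
assert attribute_costs(resources) == {"payments": 420.0, "search": 90.0}
```

Because tagging is automatic and mandatory, attribution never depends on developer discipline — which is what makes "spend in real time" possible.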
2. The Operating Model
Architecture is one thing. How people operate within it is another. This is where “you build it, you run it” either works or falls apart.
Role-Based Ownership
| Role | Responsibilities | Owns |
|---|---|---|
| Developer Teams | Build, deploy, and operate services end-to-end. Consume platform APIs, templates, and pipelines. Define SLOs, monitor performance, manage incidents | Service code, infra blueprints used by their service, alerts, dashboards |
| Platform Engineering Team | Build and maintain the internal developer platform. Curate and version blueprints. Own API, CLI, and UI layers. Implement policy-as-code. Maintain developer docs and onboarding | Platform core, shared templates, policy bundles, governance tooling |
| Security & Risk | Define compliance, security, and data protection policies. Review and audit policy bundles with platform team | Regulatory inputs, escalation triggers, risk scoring logic |
| SRE (Optional / Embedded) | Guide dev teams in adopting SLOs and resilience patterns. Help establish observability standards. Coach teams through incident reviews | Reliability practices, SLO/SLI models, observability standards |
The platform team is not an operations team. They don’t operate your services. They don’t provision infrastructure for your teams. They build and maintain the platform that enables everyone else to do those things autonomously. Get this distinction wrong, and you’ve just renamed your ops team.
On-Call & Incident Ownership
| Layer | On-Call Owner | Notes |
|---|---|---|
| App / Service Runtime | Developer team | Alerts configured via platform, tied to service-level SLOs |
| Blueprint / Platform Issues | Platform team | Bugs in templates, deploy flow, policy logic, API failures |
| Infra Failures | Platform team (initial), cloud escalation if needed | Provisioning failures, limits, misconfiguration |
| Security Incidents | Security team, with dev team coordination | Alerts from scanning, policies, or external events |
When something breaks at 2am, the team that built the service owns the first response. The platform team handles platform-level issues — blueprint bugs, provisioning failures, policy engine outages.
Proactive Support Model
“You build it, you run it” doesn’t mean “you’re on your own.” The platform team provides:
- Office hours — regular sessions for onboarding and problem-solving
- Support channels — #ask-platform, #policy-help, #blueprint-feedback
- Embedded champions — platform team members temporarily embedded with key product squads during early adoption phases
Platform as Product: Ways of Working
The platform team operates as a product team, not a service desk.
| Practice | How It’s Done |
|---|---|
| Customer Discovery | Biweekly feedback sessions with dev teams, friction log reviews |
| Backlog Management | Kanban or dual-track agile — prioritized by adoption, internal demand, and risk |
| Release Management | Versioned blueprint releases, changelogs, backwards compatibility guarantees |
| Adoption Metrics | Track services onboarded, time-to-deploy, pipeline usage, developer satisfaction |
| Documentation | First-class artifact — built into CLI/UI flows and auto-generated from blueprints |
```mermaid
flowchart TD
    A["Developer Teams"] -->|"Uses"| B["Platform APIs & CLI/UI"]
    A -->|"Owns"| C["Services & Pipelines"]
    B -->|"Built by"| D["Platform Team"]
    C -->|"Integrated with"| E["Observability Layer"]
    D -->|"Collaborates with"| F["Security Team"]
    F -->|"Defines"| G["Policy-as-Code Bundles"]
    D -->|"Maintains"| G
    A -->|"Consumes"| G
    A -->|"Alerts to"| H["On-call Rotations"]
    D -->|"Supports"| H
```
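The adoption metrics in the table above are only useful if they come from data the platform already has. A minimal sketch, assuming audit events with a type and timestamp (the event schema is invented for illustration):

```python
# Derive time-to-deploy from platform audit events. Event types and shape
# are assumptions; a real platform would read these from the audit trail.
from datetime import datetime

def time_to_deploy(events: list) -> float:
    """Hours from service creation to its first successful deployment."""
    created = min(e["ts"] for e in events if e["type"] == "service.created")
    deployed = min(e["ts"] for e in events if e["type"] == "deploy.succeeded")
    return (deployed - created).total_seconds() / 3600

events = [
    {"type": "service.created", "ts": datetime(2024, 5, 1, 9, 0)},
    {"type": "deploy.succeeded", "ts": datetime(2024, 5, 1, 13, 30)},
]
assert time_to_deploy(events) == 4.5  # hours from scaffold to first deploy
```

Metrics computed from the audit trail cost nothing to collect and cannot be gamed by self-reporting, which is why the audit layer doubles as the product-analytics layer.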
3. Governance-as-Code
If you’re in a regulated industry — fintech, banking, insurance — governance isn’t optional. But it also shouldn’t be a bottleneck. I’ve seen organizations where a compliance review adds two weeks to every deployment. That’s not governance. That’s a queue.
The platform adopts a governance-as-code model that enforces security, compliance, and best practices at multiple layers, automatically. Developers move fast within safe guardrails. Security teams get assurance. Auditors get logs. No manual gates.
Three Enforcement Layers
| Layer | Description | Tools / Methods |
|---|---|---|
| Input Validation | Parameters validated before infra or service creation | JSON schema, regex, CLI validators, UI constraints |
| Policy-as-Code | Dynamic rules evaluated during blueprint execution and deployment | OPA / Gatekeeper, Conftest |
| Pipeline Hooks | All CI/CD pipelines include required scanning and logging | Snyk, Trivy, Checkov, custom admission controllers |
Policy Bundle Design
Policies are organized as modular bundles and managed like application code — versioned, tested, and deployed with changelogs.
```
policy-bundle/
├── terraform/
│   ├── enforce-encryption.rego
│   └── restrict-instance-types.rego
├── kubernetes/
│   ├── restrict-host-paths.rego
│   └── enforce-resource-limits.rego
├── cicd/
│   ├── block-default-branch-push.yaml
│   └── enforce-security-scan.yaml
└── metadata/
    ├── version.json
    └── policy-owners.yaml
```
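"Managed like application code" means releases are gated mechanically. A small sketch of one such gate, reading the bundle's `version.json`: a new bundle may only roll out if its version moves forward. The gating rule itself is an assumption for illustration.

```python
# Release gate for policy bundles: compare version.json between the current
# and candidate bundle. The forward-only rule is an illustrative convention.
import json

def parse_version(v: str) -> tuple:
    """'1.4.0' -> (1, 4, 0), so tuples compare in semver order."""
    return tuple(int(part) for part in v.split("."))

def can_release(current_meta: str, new_meta: str) -> bool:
    """A candidate bundle may only roll out if its version moves forward."""
    cur = parse_version(json.loads(current_meta)["version"])
    new = parse_version(json.loads(new_meta)["version"])
    return new > cur

assert can_release('{"version": "1.4.0"}', '{"version": "1.5.0"}')
assert not can_release('{"version": "1.4.0"}', '{"version": "1.4.0"}')
```

The same pipeline would also run the `.rego` files against sandbox blueprints before release — the Testing step in the lifecycle below.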
Governance Lifecycle
| Step | Description |
|---|---|
| Authoring | Platform team and security team collaborate to define policy logic |
| Testing | Policies tested against sandbox blueprints and sample inputs |
| Releasing | Versioned and rolled out with changelogs and feature flags |
| Monitoring | Violations logged with dashboards showing policy hits/failures |
| Reviewing | Periodically audited and reviewed with stakeholders |
Policy Enforcement Flow
```mermaid
flowchart TD
    A["Dev initiates service / infra request"] --> B["Input Validator"]
    B -->|"Valid"| C["Policy Engine (OPA)"]
    B -->|"Invalid"| X["Error: Missing or Invalid Input"]
    C -->|"Compliant"| D["Provision Infra / Deploy App"]
    C -->|"Violation"| Y["Reject + Policy Violation Log"]
    D --> E["Trigger CI/CD Pipeline"]
    E --> F["Run Security Scanners"]
    F -->|"Passed"| G["Deploy to Environment"]
    F -->|"Failed"| Z["Block + Report to Dev + Audit Log"]
    D --> H["Auto-tagging + Audit Trail"]
    style X fill:#7f1d1d,color:#e7e5e4
    style Y fill:#7f1d1d,color:#e7e5e4
    style Z fill:#7f1d1d,color:#e7e5e4
    style G fill:#14532d,color:#e7e5e4
```
This flow enforces policies without blocking workflows. Developers get fast feedback. Security teams get visibility. Auditors get logs. No one waits in a queue.
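The enforcement flow can be sketched as three fail-fast stages, each appending to the audit trail. Stage internals here are stand-ins for the real validator, OPA engine, and scanners; the checks and field names are invented for illustration.

```python
# End-to-end enforcement sketch: input validation -> policy -> scanners.
# Each stage rejects with a logged reason; internals are illustrative.
def enforce(request: dict) -> dict:
    audit = []

    # Stage 1: input validation — cheap checks before anything executes.
    if "name" not in request or "region" not in request:
        return {"status": "rejected", "stage": "input", "audit": audit}
    audit.append("input: valid")

    # Stage 2: policy evaluation (OPA in the real platform).
    if not request.get("encrypted", False):
        audit.append("policy: storage must be encrypted at rest")
        return {"status": "rejected", "stage": "policy", "audit": audit}
    audit.append("policy: compliant")

    # Stage 3: security scanners in the CI/CD pipeline.
    if request.get("scan_findings", 0) > 0:
        audit.append("scan: blocking findings reported to dev team")
        return {"status": "blocked", "stage": "scan", "audit": audit}
    audit.append("scan: passed")

    return {"status": "deployed", "audit": audit}

ok = enforce({"name": "orders-db", "region": "eu-west-1", "encrypted": True})
assert ok["status"] == "deployed"
bad = enforce({"name": "orders-db", "region": "eu-west-1", "encrypted": False})
assert bad["stage"] == "policy"
```

Note the ordering: the cheapest checks run first, so most bad requests are rejected in milliseconds, and every outcome — pass or fail — leaves an audit record.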
Common Policies
| Category | Policy Example |
|---|---|
| Security | All storage must be encrypted at rest; no public-facing resources without TLS |
| Cost Management | All resources must be tagged with cost-center and env |
| Resilience | All services must define liveness and readiness probes |
| Deployment | No pushes directly to default branch; PR checks must pass before deploy |
| Infra Provisioning | Only approved regions, instance types, and services may be used |
What’s Next
We’ve covered the architecture (how it’s built), the operating model (how it’s run), and governance (how compliance is embedded). The platform is designed. Now it needs to be built.
In Part 3, we cover the capability evolution roadmap — how to go from a bootstrap MVP to an autonomous, AI-assisted platform — and the tech stack recommendations for each layer.
/ Unni