Technical documentation
AI Platform Engineering on AWS.
How we design, secure and run GenAI services in production on AWS, from the cloud account to the agent platform.
For technical teams (architecture, IT, cloud, security). Choices are tailored to your context; the following describes our reference architecture.
AI as a product, on a reusable foundation.
The goal isn't to run a model but to deliver a reliable GenAI service that is integrated with your information systems, secured and measured, and then to replicate it. That requires a cloud foundation and engineering practices, not a series of proofs of concept.
Security & governance from the start
Isolated accounts, least privilege, encryption, traceability: compliance is a starting condition, not a patch.
Data where it lives
We connect existing sources (S3, databases, ERP/CRM/DMS) without needless copying, with EU data residency.
Everything as Infrastructure as Code
Landing zone, networks, models, agents: provisioned and versioned (Terraform / CDK), reproducible across environments.
Observability & FinOps
Every model call is traced, measured and budgeted; cost per use and per token is tracked and actively managed.
Human in the loop
Sensitive actions require explicit validation; guardrails constrain what the model can receive and return.
Reference architecture.
A useful GenAI use case relies on layers that work together, from cloud foundation to adoption. Security, observability and FinOps are cross-cutting.
Setting up the cloud environment (Landing Zone).
Before any GenAI use case, we lay a sound cloud foundation: isolated environments, centralised security and automatic guardrails. That's what makes the rest reproducible and auditable.
- Multi-account by default
  Separating production, non-production, security and tooling limits blast radius and clarifies responsibilities.
- Automatic guardrails (SCPs)
  Organization rules prevent dangerous configurations: unauthorised regions, unencrypted resources.
- Centralised identity (SSO)
  Federated access via IAM Identity Center, temporary credentials and least privilege.
- Centralised security & logs
  CloudTrail, Config, GuardDuty and Security Hub aggregated in a dedicated account.
- Controlled network
  Private VPCs, PrivateLink to Bedrock and services, no needless Internet exposure.
AWS Organizations
Management
- Organizations
- Billing
- Control Tower
Security
- Log Archive
- Audit
- GuardDuty · Security Hub
Infrastructure
- Network (Transit Gateway)
- Shared VPCs
- Shared services
Workloads
- GenAI, Dev
- GenAI, Prod
Foundation deployed via AWS Control Tower + Infrastructure as Code, adaptable to an existing landing zone.
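As an illustration of the "unauthorised regions" guardrail, here is a minimal Terraform sketch of a service control policy that denies requests outside an approved region list. The variable names, the region list and the set of exempted global services are assumptions made for the example; they are not taken from the reference architecture and would need adjusting to your organization (Control Tower also offers its own region deny control, which may be preferable).

```hcl
# Hypothetical inputs: adapt to your organization.
variable "workloads_ou_id" {
  type        = string
  description = "ID of the organizational unit holding the GenAI workload accounts."
}

variable "allowed_regions" {
  type    = list(string)
  default = ["eu-west-1", "eu-west-3", "eu-central-1"] # assumed EU-only list
}

resource "aws_organizations_policy" "deny_unapproved_regions" {
  name = "deny-unapproved-regions"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid    = "DenyOutsideAllowedRegions"
      Effect = "Deny"
      # Global services are exempted because their API calls are not
      # tied to a workload region (illustrative list, not exhaustive).
      NotAction = ["iam:*", "organizations:*", "sts:*", "support:*", "route53:*", "cloudfront:*", "budgets:*", "ce:*"]
      Resource  = "*"
      Condition = {
        StringNotEquals = { "aws:RequestedRegion" = var.allowed_regions }
      }
    }]
  })
}

# Attach the policy to the workloads OU so every account below inherits it.
resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.deny_unapproved_regions.id
  target_id = var.workloads_ou_id
}
```

Attaching at the OU level rather than per account means new accounts pick up the rule as soon as they are enrolled.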
The AWS services we use.
Chosen according to need and your context, with no unnecessary complexity.
GenAI
- Amazon Bedrock
  Managed models (incl. Claude), no servers to run.
- Bedrock Knowledge Bases
  Managed RAG: ingestion, embeddings, retrieval.
- Bedrock Guardrails
  Content/topic filtering, PII masking.
- Bedrock AgentCore
  Agents: orchestration, memory, identity, tools.
Data & vectors
- Amazon S3
  Storage for documents and source data.
- OpenSearch Serverless
  Vector search for RAG.
- Aurora PostgreSQL / pgvector
  Vectors and relational data.
- DynamoDB · Glue
  Agent state/memory, data ingestion.
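To show how the storage and vector services above can be wired into managed RAG, here is a hedged Terraform sketch of a Bedrock knowledge base backed by an OpenSearch Serverless collection, with an S3 bucket as its data source. All variables, the index name and the field names are hypothetical. The sketch assumes the vector index already exists in the collection and that the service role can read the bucket and access the collection.

```hcl
# Hypothetical inputs, assumed to be provisioned elsewhere.
variable "kb_role_arn" { type = string }          # service role assumed by Bedrock
variable "collection_arn" { type = string }       # OpenSearch Serverless collection
variable "embedding_model_arn" { type = string }  # embedding foundation model
variable "documents_bucket_arn" { type = string } # S3 bucket holding source documents

resource "aws_bedrockagent_knowledge_base" "docs" {
  name     = "docs-kb"
  role_arn = var.kb_role_arn

  knowledge_base_configuration {
    type = "VECTOR"
    vector_knowledge_base_configuration {
      embedding_model_arn = var.embedding_model_arn
    }
  }

  storage_configuration {
    type = "OPENSEARCH_SERVERLESS"
    opensearch_serverless_configuration {
      collection_arn    = var.collection_arn
      vector_index_name = "docs-index" # must match an existing index
      field_mapping {
        vector_field   = "embedding"
        text_field     = "text"
        metadata_field = "metadata"
      }
    }
  }
}

# Documents stay in S3; the knowledge base ingests and embeds them from there.
resource "aws_bedrockagent_data_source" "s3_docs" {
  knowledge_base_id = aws_bedrockagent_knowledge_base.docs.id
  name              = "s3-docs"

  data_source_configuration {
    type = "S3"
    s3_configuration {
      bucket_arn = var.documents_bucket_arn
    }
  }
}
```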
Integration & runtime
- API Gateway · Lambda
  Private APIs and serverless functions.
- ECS / Fargate
  Containers for long-running workloads.
- Step Functions · EventBridge
  Orchestration and events.
- MCP connectors
  Agent tools to ERP, CRM, DMS, APIs.
Security & identity
- IAM · IAM Identity Center
  Least privilege, federated access (SSO).
- KMS · Secrets Manager
  Encryption and secrets management.
- PrivateLink
  Private access to Bedrock and services.
- GuardDuty · Security Hub · WAF
  Detection, posture, app protection.
Governance & landing zone
- AWS Organizations · Control Tower
  Multi-account and guardrails.
- AWS Config
  Continuous resource compliance.
- CloudTrail
  Audit trail of every action.
- Service Catalog
  Approved self-service templates.
Observability & FinOps
- CloudWatch · X-Ray
  Metrics, logs, request traces.
- Cost Explorer · Budgets
  Cost tracking and alerts.
- Bedrock logging
  Model invocation tracing.
- FinOps tags
  Cost per use, team and environment.
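The observability and FinOps items above could look roughly like this in Terraform: Bedrock model invocation logging delivered to S3, plus a monthly budget filtered on a cost allocation tag. This is a sketch under assumptions. The bucket, e-mail address, tag key and value, and budget amount are placeholders. The bucket policy must allow Bedrock to write to it, and the tag must be activated as a cost allocation tag in the billing settings.

```hcl
# Hypothetical inputs.
variable "log_bucket_name" { type = string }
variable "alert_email" { type = string }

# Invocation logs contain prompts and completions: treat the bucket as sensitive.
resource "aws_bedrock_model_invocation_logging_configuration" "this" {
  logging_config {
    text_data_delivery_enabled      = true
    image_data_delivery_enabled     = false
    embedding_data_delivery_enabled = false

    s3_config {
      bucket_name = var.log_bucket_name
      key_prefix  = "bedrock-invocations"
    }
  }
}

# Monthly cost budget scoped to resources tagged Project=genai-assistant.
resource "aws_budgets_budget" "genai_prod" {
  name         = "genai-prod-monthly"
  budget_type  = "COST"
  limit_amount = "500" # placeholder amount
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "TagKeyValue"
    values = ["user:Project$genai-assistant"]
  }

  # Alert at 80% of actual spend.
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.alert_email]
  }
}
```

Cost per token is not produced by these resources directly; it would typically be derived from the token counts in the invocation logs or in CloudWatch metrics.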
Agentic GenAI patterns.
The building blocks we assemble per use case, always governed.
RAG (retrieval-augmented)
Bedrock Knowledge Bases + vector store: answers grounded in your documents, with source citations.
Agents & tools (MCP)
An agent (AgentCore) calls tools (search, business APIs, actions) via MCP, with scoped permissions.
Guardrails
Content and topic filtering, sensitive data (PII) masking on input and output.
Human in the loop
Risky actions (write, send, decide) require explicit, traced validation.
Evaluation & quality
Test sets, relevance measurement and regression checks before and after go-live.
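One possible way to implement the human-in-the-loop pattern is a Step Functions state machine that pauses on a task token until a reviewer responds. The sketch below assumes two hypothetical Lambda functions (one that notifies the reviewer, one that performs the action) and assumes the approval tool calls `SendTaskSuccess` with an output such as `{"approved": true}`. None of these names come from the reference architecture.

```hcl
# Hypothetical inputs.
variable "sfn_role_arn" { type = string }
variable "notify_reviewer_lambda_arn" { type = string }
variable "execute_action_lambda_arn" { type = string }

resource "aws_sfn_state_machine" "approval" {
  name     = "agent-action-approval"
  role_arn = var.sfn_role_arn

  definition = jsonencode({
    StartAt = "RequestApproval"
    States = {
      # Pauses until the reviewer's tool returns the task token.
      RequestApproval = {
        Type     = "Task"
        Resource = "arn:aws:states:::lambda:invoke.waitForTaskToken"
        Parameters = {
          FunctionName = var.notify_reviewer_lambda_arn
          Payload = {
            "taskToken.$" = "$$.Task.Token"
            "action.$"    = "$.action"
          }
        }
        TimeoutSeconds = 86400 # no answer within 24 h fails the execution
        ResultPath     = "$.review"
        Next           = "IsApproved"
      }
      IsApproved = {
        Type = "Choice"
        Choices = [{
          Variable      = "$.review.approved"
          BooleanEquals = true
          Next          = "ExecuteAction"
        }]
        Default = "Rejected"
      }
      # Runs only after an explicit approval.
      ExecuteAction = {
        Type     = "Task"
        Resource = "arn:aws:states:::lambda:invoke"
        Parameters = {
          FunctionName = var.execute_action_lambda_arn
          "Payload.$"  = "$"
        }
        End = true
      }
      Rejected = {
        Type  = "Fail"
        Error = "ActionRejected"
        Cause = "The reviewer declined the action."
      }
    }
  })
}
```

The execution history records who was asked, what was decided and when, which is one way to obtain the traced validation described above. The default branch rejects, so anything other than an explicit approval blocks the action.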
Aligned with the AWS Well-Architected Framework.
Our choices follow the six pillars of the Well-Architected Framework, and its Generative AI Lens. Applied to a GenAI use case:
Operational excellence
Automated deployments (IaC, CI/CD), runbooks, end-to-end observability.
Security
Least privilege, encryption, network isolation, traceability, guardrails.
Reliability
Managed quotas and limits, error recovery, graceful degradation of model calls.
Performance efficiency
Model chosen per need, caching, targeted RAG, serverless sizing.
Cost optimization
Cost per token tracked, right-sized models, budgets and alerts (FinOps).
Sustainability
Suitable regions and models, serverless resources, no over-provisioning.
We also apply the Generative AI Lens: response evaluation, guardrails, human oversight and inference cost control.
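For the reliability pillar, quota pressure can be surfaced before users notice it. A minimal sketch, assuming Bedrock publishes an `InvocationThrottles` metric in the `AWS/Bedrock` CloudWatch namespace with a `ModelId` dimension (check the metric names available in your region). The threshold, model ID and SNS topic are placeholders.

```hcl
# Hypothetical inputs.
variable "model_id" { type = string }         # model identifier to watch
variable "alerts_topic_arn" { type = string } # SNS topic for on-call alerts

resource "aws_cloudwatch_metric_alarm" "bedrock_throttles" {
  alarm_name          = "bedrock-invocation-throttles"
  namespace           = "AWS/Bedrock"
  metric_name         = "InvocationThrottles"
  dimensions          = { ModelId = var.model_id }
  statistic           = "Sum"
  period              = 300 # seconds
  evaluation_periods  = 1
  threshold           = 10 # placeholder: throttled calls per 5 minutes
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching" # no traffic is not an incident
  alarm_actions       = [var.alerts_topic_arn]
}
```

An alarm like this only detects the problem; graceful degradation (retries with backoff, fallback to a smaller model) still has to be implemented in the calling code.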
Industrialisation: Infrastructure as Code & CI/CD.
Landing zone, networks, guardrails, MCP servers and agents: everything is versioned, tested and deployed automatically. Terraform (or AWS CDK) for infrastructure, a dedicated CI/CD pipeline for MCP tools and agents.
# Bedrock guardrail: filter content + mask PII
resource "aws_bedrock_guardrail" "assistant" {
  name                      = "assistant-guardrail"
  blocked_input_messaging   = "Request blocked."
  blocked_outputs_messaging = "Response filtered."

  # Content filter (category and strengths are illustrative assumptions)
  content_policy_config {
    filters_config {
      type            = "HATE"
      input_strength  = "HIGH"
      output_strength = "HIGH"
    }
  }

  # PII masking: e-mail addresses are replaced by a placeholder
  sensitive_information_policy_config {
    pii_entities_config {
      type   = "EMAIL"
      action = "ANONYMIZE"
    }
  }
}

# Least-privilege execution role for the agent / MCP server
# (the trust policy document "assume" is defined elsewhere)
resource "aws_iam_role" "agent" {
  name               = "genai-agent-exec"
  assume_role_policy = data.aws_iam_policy_document.assume.json
}

resource "aws_iam_role_policy" "agent" {
  role = aws_iam_role.agent.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["bedrock:InvokeModel"]
        Resource = [var.model_arn]
      },
      {
        # bedrock:Retrieve is authorised on the knowledge base ARN,
        # not on the OpenSearch Serverless collection behind it.
        Effect   = "Allow"
        Action   = ["bedrock:Retrieve"]
        Resource = [var.knowledge_base_arn] # assumed variable
      }
    ]
  })
}

# MCP server container on ECS Fargate, behind a private API
module "mcp_server" {
  source        = "./modules/fargate-service"
  name          = "mcp-crm"
  image         = "${aws_ecr_repository.mcp.repository_url}:${var.git_sha}"
  task_role_arn = aws_iam_role.agent.arn
  private       = true
}

CI/CD pipeline, MCP servers & agents
- 01 Commit
  Agent / MCP server code, prompts, Terraform config.
- 02 Lint & tests
  Unit tests, MCP tool schema validation, IaC scan.
- 03 Build image
  MCP / agent container pushed to ECR, vulnerability scan.
- 04 Evaluation
  Agent test sets: relevance, guardrails, regression.
- 05 Deploy
  Terraform apply, ECS/Fargate or AgentCore, dev → prod.
- 06 Run & observability
  Traces, costs, alerts; rollback on regression.
Promotion from dev to prod requires approval; secrets are injected via Secrets Manager, never stored in plaintext. Tools: GitLab CI / GitHub Actions / CodePipeline.
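Secret injection on ECS/Fargate can be sketched as follows: the task definition references the secret by ARN, and ECS resolves it into an environment variable at container start. The secret name, environment variable, sizing and role variables are hypothetical. The sketch assumes the execution role is allowed to call `secretsmanager:GetSecretValue` on this secret, and that the secret value is set outside Terraform so that it never lands in the state file.

```hcl
# Hypothetical inputs.
variable "execution_role_arn" { type = string } # pulls the image, reads the secret
variable "task_role_arn" { type = string }      # permissions of the running container
variable "image" { type = string }              # e.g. an ECR image tagged with the git SHA

# Only the secret container is declared here; its value is set out of band.
resource "aws_secretsmanager_secret" "crm_api_key" {
  name = "genai/mcp-crm/api-key"
}

resource "aws_ecs_task_definition" "mcp_crm" {
  family                   = "mcp-crm"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"
  execution_role_arn       = var.execution_role_arn
  task_role_arn            = var.task_role_arn

  container_definitions = jsonencode([{
    name      = "mcp-crm"
    image     = var.image
    essential = true
    # Resolved by ECS at start-up; the value never appears in the task definition.
    secrets = [{
      name      = "CRM_API_KEY"
      valueFrom = aws_secretsmanager_secret.crm_api_key.arn
    }]
  }])
}
```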
Security, compliance & control.
Data residency (EU)
Inference and storage in Europe; your data isn't used to train the models.
End-to-end encryption
KMS (managed keys), TLS everywhere, secrets in Secrets Manager.
Least privilege
Dedicated IAM roles per use case, temporary access, separated environments.
Full traceability
CloudTrail, Bedrock invocation logging, source and action tracing.
Private network
Access to Bedrock and data via PrivateLink, no Internet transit.
Costs under control (FinOps)
Budgets, cost per use and per token, alerts on overruns.
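The private network control can be illustrated with an interface VPC endpoint for the Bedrock runtime API, so that model calls stay on the AWS network. This is a minimal sketch: the VPC, subnets, CIDR and region are assumed inputs, and a real setup would likely add endpoints for the other services in use (for example `bedrock-agent-runtime` for knowledge base retrieval) and an endpoint policy.

```hcl
# Hypothetical inputs.
variable "region" {
  type    = string
  default = "eu-west-1" # assumed EU region
}
variable "vpc_id" { type = string }
variable "vpc_cidr" { type = string }
variable "private_subnet_ids" { type = list(string) }

# Only HTTPS traffic originating inside the VPC may reach the endpoint.
resource "aws_security_group" "bedrock_endpoint" {
  name_prefix = "bedrock-endpoint-"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS from inside the VPC only"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }
}

resource "aws_vpc_endpoint" "bedrock_runtime" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.bedrock-runtime"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.bedrock_endpoint.id]
  private_dns_enabled = true # SDK calls resolve to the endpoint without code changes
}
```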
From the AWS account to production.
Foundation
Landing zone, accounts, security and network (Control Tower + IaC).
Data connection
Information-system sources, ingestion, vector store, access rights.
Service build
RAG and/or agent, MCP tools, guardrails, API.
Industrialisation
CI/CD, IaC, dev → prod environments, tests and evaluation.
Run
Observability, FinOps, continuous improvement, scale-out.
Audit your foundation or scope a GenAI architecture on AWS?
Let's discuss your existing environment, your security constraints and the simplest path to a production use case.
Talk to an architect