Technical documentation

AI Platform Engineering on AWS.

How we design, secure and run GenAI services in production on AWS, from the cloud account to the agent platform.

Written for technical teams (architecture, IT, cloud, security). Choices are tailored to your context; what follows describes our reference architecture.

AI as a product, on a reusable foundation.

The goal isn't to run a model but to deliver a reliable GenAI service that is integrated with your information system (IS), secured and measured, and then to replicate it. That requires a cloud foundation and engineering practices, not a series of POCs.

Security & governance from the start

Isolated accounts, least privilege, encryption, traceability: compliance is a starting condition, not a patch.

Data where it lives

We connect existing sources (S3, databases, ERP/CRM/DMS) without needless copying, with EU data residency.

Everything as Infrastructure as Code

Landing zone, networks, models, agents: provisioned and versioned (Terraform / CDK), reproducible across environments.

Observability & FinOps

Every model call is traced, measured and budgeted; cost per use and per token is tracked and controlled.

Human in the loop

Sensitive actions require explicit validation; guardrails constrain responses.

Reference architecture.

A useful GenAI use case relies on layers that work together, from cloud foundation to adoption. Security, observability and FinOps are cross-cutting.

06 Adoption & value

Business assistants

Usage metrics

KPIs

Scale-out

05 Agents & applications

API Gateway

Lambda

ECS / Fargate

Step Functions

EventBridge

04 GenAI core, Amazon Bedrock

Bedrock (Claude)

Knowledge Bases (RAG)

Guardrails

AgentCore

EU inference

03 Data & IS integration

OpenSearch Serverless

Aurora pgvector

DynamoDB

Glue

MCP connectors

02 Network & security

VPC

PrivateLink

KMS

Secrets Manager

GuardDuty

WAF

01 Landing Zone & accounts

AWS Organizations

Control Tower

SCPs

IAM Identity Center

Cross-cutting across all layers

Security

Compliance

Observability (CloudWatch · X-Ray)

Traceability (CloudTrail)

FinOps (Budgets · Cost Explorer)

Setting up the cloud environment (Landing Zone).

Before any GenAI use case, we lay a sound cloud foundation: isolated environments, centralised security and automatic guardrails. That's what makes the rest reproducible and auditable.

Multi-account by default

Separating production, non-production, security and tooling limits blast radius and clarifies responsibilities.

Automatic guardrails (SCPs)

Organization-level policies block dangerous configurations such as unauthorised regions or unencrypted resources.

Centralised identity (SSO)

Federated access via IAM Identity Center, temporary credentials and least privilege.

Centralised security & logs

CloudTrail, Config, GuardDuty and Security Hub aggregated in a dedicated account.

Controlled network

Private VPCs, PrivateLink to Bedrock and services, no needless Internet exposure.
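
As an illustration of the controlled-network point, here is a minimal Terraform sketch of an interface endpoint (PrivateLink) for the Bedrock runtime API. The variable names, the default region and the security group rule are assumptions for the example, not part of our reference code.

```hcl
# Assumed inputs: all names and defaults below are illustrative.
variable "region" {
  type    = string
  default = "eu-central-1"
}

variable "vpc_id" { type = string }
variable "vpc_cidr" { type = string }
variable "private_subnet_ids" { type = list(string) }

# Only HTTPS traffic originating inside the VPC may reach the endpoint.
resource "aws_security_group" "bedrock_endpoint" {
  name_prefix = "bedrock-endpoint-"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS from inside the VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }
}

# Interface endpoint: Bedrock runtime calls stay on the AWS private network.
resource "aws_vpc_endpoint" "bedrock_runtime" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.bedrock-runtime"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.bedrock_endpoint.id]
  private_dns_enabled = true
}
```

With private DNS enabled, SDK clients inside the VPC resolve the regular Bedrock runtime hostname to the endpoint, so application code does not need to change.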

AWS Organizations

Management

Organizations
Billing
Control Tower

Security

Log Archive
Audit
GuardDuty · Security Hub

Infrastructure

Network (Transit Gateway)
Shared VPCs
Shared services

Workloads

GenAI, Dev
GenAI, Prod

Foundation deployed via AWS Control Tower + Infrastructure as Code, adaptable to an existing landing zone.
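
To make the guardrail idea concrete, the sketch below creates a Workloads organizational unit and attaches an SCP that denies requests outside an allowed list of regions. It assumes it runs from the management account; the region list and the exempted global services are illustrative and deliberately incomplete. With Control Tower, the built-in region deny control may be preferable to a hand-written policy.

```hcl
# Assumption: applied from the organization's management account.
data "aws_organizations_organization" "this" {}

variable "allowed_regions" {
  type    = list(string)
  default = ["eu-west-1", "eu-central-1", "eu-west-3"] # illustrative
}

resource "aws_organizations_organizational_unit" "workloads" {
  name      = "Workloads"
  parent_id = data.aws_organizations_organization.this.roots[0].id
}

# SCP: deny any regional API call made outside the allowed regions.
resource "aws_organizations_policy" "allowed_regions_only" {
  name = "deny-regions-outside-allow-list"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid    = "DenyOutsideAllowedRegions"
      Effect = "Deny"
      # Global services are exempted; this list is an incomplete example.
      NotAction = ["iam:*", "organizations:*", "sts:*", "support:*", "route53:*", "cloudfront:*"]
      Resource  = "*"
      Condition = {
        StringNotEquals = { "aws:RequestedRegion" = var.allowed_regions }
      }
    }]
  })
}

resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.allowed_regions_only.id
  target_id = aws_organizations_organizational_unit.workloads.id
}
```

If cross-region inference profiles are used, the allowed list has to cover every region the profile can route to, otherwise model calls may be denied.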

The AWS services we use.

Chosen according to need and your context, with no gratuitous complexity.

GenAI

Amazon Bedrock

Managed models (incl. Claude), no servers to run.

Bedrock Knowledge Bases

Managed RAG: ingestion, embeddings, retrieval.

Bedrock Guardrails

Content/topic filtering, PII masking.

Bedrock AgentCore

Agents: orchestration, memory, identity, tools.

Data & vectors

Amazon S3

Storage for documents and source data.

OpenSearch Serverless

Vector search for RAG.

Aurora PostgreSQL / pgvector

Vectors and relational data.

DynamoDB · Glue

Agent state/memory, data ingestion.

Integration & runtime

API Gateway · Lambda

Private APIs and serverless functions.

ECS / Fargate

Containers for long-running workloads.

Step Functions · EventBridge

Orchestration and events.

MCP connectors

Agent tools to ERP, CRM, DMS, APIs.

Security & identity

IAM · IAM Identity Center

Least privilege, federated access (SSO).

KMS · Secrets Manager

Encryption and secrets management.

PrivateLink

Private access to Bedrock and services.

GuardDuty · Security Hub · WAF

Detection, posture, app protection.

Governance & landing zone

AWS Organizations · Control Tower

Multi-account and guardrails.

AWS Config

Continuous resource compliance.

CloudTrail

Audit trail of every action.

Service Catalog

Approved self-service templates.

Observability & FinOps

CloudWatch · X-Ray

Metrics, logs, request traces.

CloudWatch · X-Ray is complemented by:

Cost Explorer · Budgets

Cost tracking and alerts.

Bedrock logging

Model invocation tracing.

FinOps tags

Cost per use, team and environment.
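
The last two items can be sketched in Terraform as follows: provider-level default tags so every resource carries cost-allocation keys, and Bedrock model invocation logging delivered to CloudWatch Logs. The tag keys and values, the log group name, the retention period and the `bedrock_logging_role_arn` variable are assumptions for the example.

```hcl
provider "aws" {
  region = "eu-central-1" # illustrative EU region

  # Applied to every taggable resource created through this provider.
  default_tags {
    tags = {
      Project     = "genai-assistant"
      Team        = "platform"
      Environment = "dev"
    }
  }
}

# Assumed input: a role that Bedrock can assume to write to the log group.
variable "bedrock_logging_role_arn" { type = string }

resource "aws_cloudwatch_log_group" "bedrock_invocations" {
  name              = "/genai/bedrock/invocations"
  retention_in_days = 90
}

# Account-level and region-level setting: logs model invocations.
resource "aws_bedrock_model_invocation_logging_configuration" "this" {
  logging_config {
    text_data_delivery_enabled      = true
    image_data_delivery_enabled     = true
    embedding_data_delivery_enabled = true

    cloudwatch_config {
      log_group_name = aws_cloudwatch_log_group.bedrock_invocations.name
      role_arn       = var.bedrock_logging_role_arn
    }
  }
}
```

Text delivery records prompts and completions, so the log group should be encrypted and access-restricted like any other store of sensitive data. Tags only appear in cost reports once they are activated as cost allocation tags in the billing console.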

Agentic GenAI patterns.

The building blocks we assemble per use case, always governed.

RAG (retrieval-augmented)

Bedrock Knowledge Bases + vector store: answers grounded in your documents, with source citations.
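
A minimal sketch of this building block in Terraform, assuming the IAM role, the OpenSearch Serverless collection and its vector index already exist. Every variable, resource name, index name and field name below is hypothetical.

```hcl
# Assumed inputs, provisioned elsewhere.
variable "kb_role_arn" { type = string }
variable "embedding_model_arn" { type = string }
variable "collection_arn" { type = string }
variable "documents_bucket_arn" { type = string }

resource "aws_bedrockagent_knowledge_base" "docs" {
  name     = "docs-kb"
  role_arn = var.kb_role_arn

  knowledge_base_configuration {
    type = "VECTOR"
    vector_knowledge_base_configuration {
      embedding_model_arn = var.embedding_model_arn
    }
  }

  storage_configuration {
    type = "OPENSEARCH_SERVERLESS"
    opensearch_serverless_configuration {
      collection_arn    = var.collection_arn
      vector_index_name = "docs-index" # must already exist in the collection
      field_mapping {
        vector_field   = "embedding"
        text_field     = "text"
        metadata_field = "metadata"
      }
    }
  }
}

# Documents are ingested from S3; source citations point back to these objects.
resource "aws_bedrockagent_data_source" "docs_s3" {
  name              = "docs-s3"
  knowledge_base_id = aws_bedrockagent_knowledge_base.docs.id

  data_source_configuration {
    type = "S3"
    s3_configuration {
      bucket_arn = var.documents_bucket_arn
    }
  }
}
```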

Agents & tools (MCP)

An agent (AgentCore) calls tools (search, business APIs, actions) via MCP, with scoped permissions.

Guardrails

Content and topic filtering, sensitive data (PII) masking on input and output.

Human in the loop

Risky actions (write, send, decide) require explicit, traced validation.
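
One possible way to implement this validation step is a Step Functions task that pauses on a task token until a reviewer responds. The sketch below is an assumption about the mechanism, not a description of a fixed implementation; the queue, the Lambda function, the role and all names are hypothetical.

```hcl
# Assumed inputs, provisioned elsewhere.
variable "sfn_role_arn" { type = string }
variable "approval_queue_url" { type = string }
variable "action_lambda_arn" { type = string }

resource "aws_sfn_state_machine" "approved_action" {
  name     = "genai-approved-action"
  role_arn = var.sfn_role_arn

  definition = jsonencode({
    StartAt = "RequestApproval"
    States = {
      # Publishes the proposed action and waits until a reviewer calls
      # SendTaskSuccess or SendTaskFailure with the task token.
      RequestApproval = {
        Type           = "Task"
        Resource       = "arn:aws:states:::sqs:sendMessage.waitForTaskToken"
        TimeoutSeconds = 86400
        Parameters = {
          QueueUrl = var.approval_queue_url
          MessageBody = {
            "taskToken.$" = "$$.Task.Token"
            "proposal.$"  = "$.proposal"
          }
        }
        ResultPath = "$.approval"
        Next       = "ExecuteAction"
      }
      # Reached only after an explicit approval.
      ExecuteAction = {
        Type     = "Task"
        Resource = "arn:aws:states:::lambda:invoke"
        Parameters = {
          FunctionName = var.action_lambda_arn
          "Payload.$"  = "$"
        }
        End = true
      }
    }
  })
}
```

A rejection (SendTaskFailure) or a timeout fails the execution, so the action never runs, and the execution history keeps a record of who approved what.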

Evaluation & quality

Test sets, relevance measurement and regression checks before and after go-live.

Aligned with the AWS Well-Architected Framework.

Our choices follow the six pillars of the Well-Architected Framework, and its Generative AI Lens. Applied to a GenAI use case:

Operational excellence

Automated deployments (IaC, CI/CD), runbooks, end-to-end observability.

Security

Least privilege, encryption, network isolation, traceability, guardrails.

Reliability

Managed quotas and limits, error recovery, graceful degradation of model calls.

Performance efficiency

Model chosen per need, caching, targeted RAG, serverless sizing.

Cost optimization

Cost per token tracked, right-sized models, budgets and alerts (FinOps).

Sustainability

Suitable regions and models, serverless resources, no over-provisioning.

The Generative AI Lens adds response evaluation, guardrails, human oversight and inference cost control.
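
Two of the points above, quota management and inference cost control, can be partly expressed as CloudWatch alarms on Bedrock runtime metrics. This is a sketch under assumptions: the model ID, the SNS topic and the token threshold are hypothetical inputs to adapt.

```hcl
# Assumed inputs.
variable "model_id" { type = string }
variable "alerts_topic_arn" { type = string }

variable "hourly_output_token_limit" {
  type    = number
  default = 1000000 # illustrative threshold
}

# Reliability: any throttled invocation in a 5-minute window raises an alert.
resource "aws_cloudwatch_metric_alarm" "bedrock_throttles" {
  alarm_name          = "bedrock-invocation-throttles"
  namespace           = "AWS/Bedrock"
  metric_name         = "InvocationThrottles"
  dimensions          = { ModelId = var.model_id }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.alerts_topic_arn]
}

# Cost: output tokens per hour above the agreed limit.
resource "aws_cloudwatch_metric_alarm" "bedrock_output_tokens" {
  alarm_name          = "bedrock-output-tokens-hourly"
  namespace           = "AWS/Bedrock"
  metric_name         = "OutputTokenCount"
  dimensions          = { ModelId = var.model_id }
  statistic           = "Sum"
  period              = 3600
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = var.hourly_output_token_limit
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.alerts_topic_arn]
}
```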

Industrialisation: Infrastructure as Code & CI/CD.

Landing zone, networks, guardrails, MCP servers and agents: everything is versioned, tested and deployed automatically. Terraform (or AWS CDK) for infrastructure, a dedicated CI/CD pipeline for MCP tools and agents.

Terraform, illustrative excerpt

# Bedrock guardrail: masks PII (content and topic filters omitted from this excerpt)
resource "aws_bedrock_guardrail" "assistant" {
  name                      = "assistant-guardrail"
  blocked_input_messaging   = "Request blocked."
  blocked_outputs_messaging = "Response filtered."

  sensitive_information_policy_config {
    # One argument per line: HCL does not allow comma-separated arguments in a block
    pii_entities_config {
      type   = "EMAIL"
      action = "ANONYMIZE"
    }
  }
}

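# Trust policy referenced by the role below; the ECS tasks principal is an
# assumption that matches the Fargate deployment further down
data "aws_iam_policy_document" "assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Least-privilege execution role for the agent / MCP server
resource "aws_iam_role" "agent" {
  name               = "genai-agent-exec"
  assume_role_policy = data.aws_iam_policy_document.assume.json
}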

resource "aws_iam_role_policy" "agent" {
  role = aws_iam_role.agent.id
  policy = jsonencode({
    Version   = "2012-10-17"
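    Statement = [
      {
        # Model invocation, scoped to the approved model only
        Effect   = "Allow"
        Action   = ["bedrock:InvokeModel"]
        Resource = [var.model_arn]
      },
      {
        # bedrock:Retrieve is authorised on the knowledge base, not on the
        # vector store behind it; var.knowledge_base_arn is an assumed input
        Effect   = "Allow"
        Action   = ["bedrock:Retrieve"]
        Resource = [var.knowledge_base_arn]
      }
    ]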
  })
}

# MCP server container on ECS Fargate, behind a private API
module "mcp_server" {
  source        = "./modules/fargate-service"
  name          = "mcp-crm"
  image         = "${aws_ecr_repository.mcp.repository_url}:${var.git_sha}"
  task_role_arn = aws_iam_role.agent.arn
  private       = true
}

CI/CD pipeline, MCP servers & agents

01 Commit

Agent / MCP server code, prompts, Terraform config.

02 Lint & tests

Unit tests, MCP tool schema validation, IaC scan.

03 Build image

MCP / agent container pushed to ECR, vulnerability scan.

04 Evaluation

Agent test sets: relevance, guardrails, regression.

05 Deploy

Terraform apply, ECS/Fargate or AgentCore, dev → prod.

06 Run & observability

Traces, costs, alerts; rollback on regression.

Promotion from dev to prod requires approval; secrets are injected via Secrets Manager, never stored in plaintext. Tools: GitLab CI / GitHub Actions / CodePipeline.
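
As a sketch of the secrets rule, the task definition below lets ECS resolve a Secrets Manager secret at container start, so the value never appears in the image, the pipeline variables or the Terraform code. The secret name, the environment variable name and the role variables are hypothetical.

```hcl
# Assumed inputs.
variable "image" { type = string }
variable "execution_role_arn" { type = string }
variable "task_role_arn" { type = string }

# Only the secret container is declared here; its value is set out of band
# so that it does not end up in the Terraform state.
resource "aws_secretsmanager_secret" "crm_token" {
  name = "genai/mcp-crm/api-token"
}

resource "aws_ecs_task_definition" "mcp_crm" {
  family                   = "mcp-crm"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"
  execution_role_arn       = var.execution_role_arn
  task_role_arn            = var.task_role_arn

  container_definitions = jsonencode([{
    name      = "mcp-crm"
    image     = var.image
    essential = true
    # ECS fetches the value at start-up and exposes it as an environment variable.
    secrets = [{
      name      = "CRM_API_TOKEN"
      valueFrom = aws_secretsmanager_secret.crm_token.arn
    }]
  }])
}
```

The execution role needs `secretsmanager:GetSecretValue` on this secret, plus `kms:Decrypt` if a customer-managed key encrypts it.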

Security, compliance & control.

Data residency (EU)

Inference and storage in Europe; your data isn't used to train the models.

End-to-end encryption

Encryption at rest with KMS-managed keys, TLS in transit everywhere, secrets in Secrets Manager.

Least privilege

Dedicated IAM roles per use case, temporary access, separated environments.

Full traceability

CloudTrail, Bedrock invocation logging, source and action tracing.

Private network

Access to Bedrock and data via PrivateLink, no Internet transit.

Costs under control (FinOps)

Budgets, cost per use and per token, drift alerts.
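
For the FinOps point, a minimal sketch of a monthly budget scoped to one use case by tag, with an alert when the forecast crosses 80% of the limit. The amount, the `Project` tag and the recipients are assumptions; the tag must be activated as a cost allocation tag for the filter to match.

```hcl
# Assumed input.
variable "alert_emails" { type = list(string) }

resource "aws_budgets_budget" "genai_monthly" {
  name         = "genai-assistant-monthly"
  budget_type  = "COST"
  limit_amount = "500" # illustrative
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Filter format is "user:<tag key>$<tag value>".
  cost_filter {
    name   = "TagKeyValue"
    values = ["user:Project$genai-assistant"]
  }

  # Alert on the forecast rather than on actual spend, to catch drift early.
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = var.alert_emails
  }
}
```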

From the AWS account to production.

Foundation

Landing zone, accounts, security and network (Control Tower + IaC).

Data connection

IS sources, ingestion, vector store, access rights.

Service build

RAG and/or agent, MCP tools, guardrails, API.

Industrialisation

CI/CD, IaC, dev → prod environments, tests and evaluation.

Run

Observability, FinOps, continuous improvement, scale-out.

Want to audit your foundation or scope a GenAI architecture on AWS?

Let's discuss your existing environment, your security constraints and the simplest path to a production use case.
