Flex Gateway Part 3 — Observability in Production
MuleSoft Series · Part 3 · Final

Flex Gateway in Production:
Observability That Actually Works

Real incidents, real dashboards, real monitoring stacks — built differently for Connected Mode vs Local Mode. No generic theory. Just what you need the day things break at 2 AM.

This post is grounded in three real-world architectures: a retail e-commerce platform, a regulated banking system, and a factory edge network. Each has different monitoring needs, different tools, and different failure modes. We'll build the observability stack for all three.
Real-World Context

Three Systems, Three Realities

Abstract monitoring advice fails in production because every system has different traffic patterns, compliance requirements, and failure modes. Let's establish our three reference architectures before touching a single config file.

Use Case A · Connected Mode
RetailCo — High-Volume E-Commerce Platform
350k API requests/hour at peak (Black Friday spikes to 1.4M/hr). Flex Gateway in Connected Mode feeds Anypoint Analytics. They need real-time throughput dashboards, P99 latency tracking, and instant rate-limit alerting when flash sales hammer the product catalog API.
Connected Mode · Anypoint Analytics · Prometheus + Grafana · PagerDuty Alerts
Use Case B · Connected Mode + Strict Compliance
SecureBank — PCI-DSS Regulated Financial API
15k daily transactions. Every failed authentication must be logged and retained for 7 years. mTLS on all connections. They use Connected Mode for policy management but route all telemetry through an on-premise Elastic Stack — no sensitive data ever leaves the datacenter.
Connected Mode · On-Prem Elastic Stack · Jaeger Tracing · PCI-DSS Audit Logs
Use Case C · Local Mode — Air-Gapped
FactoryEdge — Industrial IoT at the Plant Floor
8 factory sites across 3 countries. No internet. Flex Gateway in Local Mode manages APIs between PLCs, MES systems, and SCADA platforms. Each site runs a self-contained Prometheus + Grafana stack. A central VictoriaMetrics aggregates across sites over a dedicated VPN.
Local Mode · Air-Gapped Network · VictoriaMetrics · On-Site Grafana
Section 01

Which Monitoring Stack For Each Mode?

Connected Mode and Local Mode expose telemetry differently. The choice isn't just about tools — it's about where data flows, who owns the pipeline, and what you're allowed to send off-site.

☁ Connected Mode
Anypoint-Integrated Stack
Anypoint Analytics
Built-in. API-level dashboards, SLA tracking, consumer analytics — zero config. Best for business-level visibility.
Prometheus + Grafana
Scrape the /metrics endpoint. Full control over dashboards, alerting rules, and data retention. Works alongside Anypoint.
Jaeger / Tempo
OpenTelemetry traces exported via OTLP. Visualise request flows across microservices in Grafana Tempo or Jaeger UI.
Elastic Stack (optional)
For compliance use cases where structured JSON logs must be retained on-prem. Beats agent ships logs to on-site Elasticsearch.
⚙ Local Mode
Self-Contained Stack
Prometheus (per-site)
The primary tool. Scrapes gateway metrics every 15s. No internet required. Runs as a Docker container or systemd service.
Grafana (per-site)
Local dashboard server. Pre-load the Flex Gateway community dashboard. Alerts via local Alertmanager or email SMTP relay.
VictoriaMetrics (central)
Lightweight Prometheus-compatible long-term storage. Each site remote-writes metrics over VPN. Single pane of glass across all sites.
Loki (log aggregation)
Grafana's log aggregation tool. Pairs with Promtail to ship structured JSON gateway logs. No Elasticsearch license needed.
Capability | ☁ Connected Mode | ⚙ Local Mode
Business-level analytics | Anypoint built-in | Not available
Infrastructure metrics (Prometheus) | ✓ scrape /metrics | ✓ scrape /metrics
Distributed tracing | OTLP → Jaeger/Tempo | OTLP → local Jaeger
Structured log retention | Elastic / CloudWatch | Loki / Elasticsearch
Internet required | ✓ for Anypoint | ✗ fully offline
Multi-site aggregation | Anypoint hub | VictoriaMetrics + VPN
Compliance log custody | On-prem Elastic possible | ✓ fully on-prem
Section 02

Installing the Monitoring Stack

 Connected Mode — RetailCo & SecureBank Setup

In Connected Mode, Anypoint Platform already collects API-level metrics. Our job is to add the infrastructure observability layer on top — Prometheus for metrics, Jaeger for traces, and optionally Elastic for compliance logs.

Connected Mode · Full Observability Architecture
⚡ Flex Gateway → ☁ Anypoint Platform (API analytics, policy mgmt)
⚡ Flex Gateway :9090 → scrape → Prometheus → query → Grafana
⚡ Flex Gateway OTLP → export → Jaeger → visualise → Grafana
⚡ Flex Gateway JSON logs → ship → Elasticsearch (compliance retention only)
01
Enable the metrics endpoint in Flex Gateway
First, expose Prometheus metrics. Add the telemetry block to your gateway config; the metrics endpoint it enables is what Prometheus will scrape.
gateway-config.yaml — telemetry block
apiVersion: gateway.mulesoft.com/v1alpha1
kind: Gateway
metadata:
  name: retailco-gateway
spec:
  telemetry:
    # ── Prometheus metrics ──
    metrics:
      enabled: true
      port: 9090
      path: "/metrics"
    # ── OpenTelemetry tracing ──
    tracing:
      enabled: true
      exporter: otlp
      endpoint: "http://jaeger:4317"
      sampleRate: 0.05          # 5% (RetailCo has 350k req/hr)
    # ── Structured JSON logging ──
    logging:
      format: json
      level: info
      fields:
        - request.id            # unique per request, essential for tracing
        - request.path
        - request.method
        - response.status
        - response.duration_ms
        - client.id
        - jwt.sub               # SecureBank: required for PCI audit trail
        - jwt.org_id
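To get a feel for what a 5% sampleRate means at RetailCo's volumes, the arithmetic can be sketched in Python. The helper names (expected_traces, keep_trace) are illustrative and not part of Flex Gateway. The hash-based decision is one common way to make ratio sampling deterministic per request; it is an assumption here, and the gateway's own sampler may work differently.

```python
import hashlib


def expected_traces(requests_per_hour: int, sample_rate: float) -> int:
    """Expected number of sampled traces per hour at a given sampling ratio."""
    return round(requests_per_hour * sample_rate)


def keep_trace(request_id: str, sample_rate: float) -> bool:
    """Deterministic ratio sampling: map the request id to a value in [0, 1)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # First 4 bytes as an unsigned int, scaled into [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate


print(expected_traces(350_000, 0.05))    # typical peak hour from the RetailCo profile
print(expected_traces(1_400_000, 0.05))  # Black Friday spike
```

At 5%, a normal peak hour yields about 17,500 traces and a Black Friday hour about 70,000, which is the number to size Jaeger storage against.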
02
Deploy the full monitoring stack with Docker Compose
Spin up Prometheus, Grafana, and Jaeger together. This is the standard Connected Mode companion stack — runs alongside your gateway in Kubernetes or as a standalone compose stack.
docker-compose.monitoring.yml — Connected Mode stack
services:
  # ── Prometheus: metrics collection ──
  prometheus:
    image: prom/prometheus:v2.51.0
    ports:
      - "9091:9090"             # 9091 externally (9090 used by gateway)
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"   # 30-day retention
      - "--web.enable-lifecycle"              # hot-reload config
  # ── Grafana: dashboards & alerting ──
  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana_data:/var/lib/grafana
    depends_on: [prometheus, jaeger]
  # ── Jaeger: distributed tracing ──
  jaeger:
    image: jaegertracing/all-in-one:1.56
    ports:
      - "16686:16686"           # Jaeger UI
      - "4317:4317"             # OTLP gRPC receiver
volumes:
  prometheus_data:
  grafana_data:
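The --web.enable-lifecycle flag in this stack is what makes hot reloads possible: Prometheus then accepts a POST to its /-/reload endpoint. A minimal Python sketch of that call follows, assuming Prometheus is reachable on localhost:9091 as mapped in the compose file. The function names are illustrative.

```python
import urllib.request


def reload_request(base_url: str) -> urllib.request.Request:
    """Build the POST request that asks Prometheus to reload its config."""
    return urllib.request.Request(f"{base_url.rstrip('/')}/-/reload", method="POST")


def reload_prometheus(base_url: str = "http://localhost:9091") -> int:
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(reload_request(base_url), timeout=5) as resp:
        return resp.status
```

Calling reload_prometheus() after editing prometheus.yml or alerts.yml should return 200 if the new config parses. A broken config makes the call raise an HTTPError and leaves the old config running.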
03
Configure Prometheus scrape targets
Tell Prometheus where to find your gateway, and add labels so you can filter dashboards by cluster, environment, gateway, and mode.
prometheus.yml
global:
  scrape_interval: 15s
  external_labels:
    cluster: "retailco-prod"
    environment: "production"
rule_files:
  - "/etc/prometheus/alerts.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
scrape_configs:
  - job_name: "flex-gateway"
    static_configs:
      - targets: ["flex-gateway:9090"]
        labels:
          gateway_id: "prod-eu-west-1"
          mode: "connected"
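Before building dashboards, it helps to check what a scrape actually returns. The sketch below parses a small payload in the Prometheus text exposition format. The sample values are invented for illustration, and the metric names are the ones this post uses in its alert rules, so verify them against your gateway's real /metrics output. The parser is deliberately simple and assumes lines carry no trailing timestamp.

```python
SAMPLE = """\
# HELP flex_requests_total Total requests handled by the gateway.
# TYPE flex_requests_total counter
flex_requests_total{route="product-catalog"} 1200
flex_requests_errors_total{route="product-catalog"} 60
"""


def parse_metrics(text: str) -> dict[str, float]:
    """Parse simple Prometheus text-format lines into {series: value}."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


metrics = parse_metrics(SAMPLE)
errors = metrics['flex_requests_errors_total{route="product-catalog"}']
total = metrics['flex_requests_total{route="product-catalog"}']
print(f"error ratio: {errors / total:.2%}")
```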

⚙️ Local Mode — FactoryEdge Air-Gapped Setup

No internet. No Anypoint. Eight sites across three countries. Here's how FactoryEdge built a fully autonomous monitoring stack that gives both per-site and cross-site visibility without a single cloud call.

Local Mode · Air-Gapped Multi-Site Architecture
⚡ Flex GW (Site A) → remote_write → VictoriaMetrics (Central) → query → Central Grafana
⚡ Flex GW (Site B–H) → local scrape → Local Prometheus → local dash → Site Grafana
⚡ Flex GW JSON logs → tail → Promtail → Loki (all sites, no Elastic license needed)
01
Per-site stack: Prometheus + Grafana + Loki
Every factory site runs this identical compose file. Images are pre-loaded onto a USB drive at HQ and air-gap installed.
docker-compose.local-monitoring.yml — per-site (air-gapped)
services:
  prometheus:
    image: prom/prometheus:v2.51.0   # pre-loaded image
    ports: ["9091:9090"]
    volumes:
      - ./prometheus-local.yml:/etc/prometheus/prometheus.yml
      - prom_data:/prometheus
    command:
      # Overriding command drops the image defaults, so restate the config path
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=90d"   # 90d local, no cloud backup
  grafana:
    image: grafana/grafana:10.4.0
    ports: ["3000:3000"]
  # ── Loki: lightweight log aggregation (no Elastic licence) ──
  loki:
    image: grafana/loki:2.9.0
    ports: ["3100:3100"]
  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - /var/log/flex-gateway:/var/log/flex-gateway:ro
      - ./promtail.yml:/etc/promtail/config.yml
# Named volumes must be declared, or Compose rejects the file
volumes:
  prom_data:
02
Prometheus with remote_write to central VictoriaMetrics
Each site scrapes locally and also forwards metrics to a central VictoriaMetrics instance over a site-to-site VPN. That gives every site local autonomy while still feeding cross-site comparison dashboards.
prometheus-local.yml — air-gapped site with remote_write
global:
  scrape_interval: 15s
  external_labels:
    site: "factory-site-a"      # identifies site in central dashboards
    region: "de-munich"
    mode: "local"
# ── Forward metrics to central VictoriaMetrics (over VPN) ──
remote_write:
  - url: "http://10.100.0.10:8428/api/v1/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "flex_.*"        # only forward flex metrics
        action: keep
scrape_configs:
  - job_name: "flex-gateway-local"
    static_configs:
      # localhost only reaches the gateway if Prometheus shares its network
      # namespace (e.g. host networking); otherwise use the gateway's host name
      - targets: ["localhost:9090"]
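One detail of the write_relabel_configs filter is easy to miss: Prometheus anchors relabel regexes at both ends, so "flex_.*" keeps only series whose name starts with flex_. The Python sketch below mimics that behaviour with re.fullmatch. The metric names passed in are examples, not a list of what the gateway exports.

```python
import re

KEEP_REGEX = re.compile(r"flex_.*")


def forwarded(metric_name: str) -> bool:
    """Mimic `action: keep`: the whole metric name must match the regex."""
    return KEEP_REGEX.fullmatch(metric_name) is not None


for name in ("flex_requests_total", "go_goroutines", "up"):
    print(name, forwarded(name))
```

Note that the synthetic up series does not match, so with this filter the central VictoriaMetrics cannot see whether a site's gateway target is down. If central alerting on gateway availability matters, widening the regex (for example to "flex_.*|up") is one option to consider.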
Section 03

The Dashboard You Actually Need at 2 AM

This is what RetailCo's Grafana dashboard looked like during the Black Friday 2026 flash sale at 09:14 AM — when their product catalog API started returning 503s and their rate limit counter spiked to 3× normal. Real numbers. Real incident.

Grafana · Flex Gateway Overview · retailco-prod · eu-west-1 · ● LIVE
Req / sec: 3,847 (↑ 2.7× baseline · flash sale triggered)
Error Rate: 8.3% (↑ from 0.1% · ⚠ alert firing)
P99 Latency: 642ms (↑ from 84ms · SLA threshold: 500ms)
Rate Limit Hits: 14.2k (↑ 12k/min · top offender: mobile-app)
[Chart] Request Rate, last 30 min (req/sec), annotated "flash sale"
[Chart] Error Rate %, last 30 min, alert threshold at 5%
[Chart] Latency Percentiles P50 / P95 / P99 (ms)
[Chart] Rate Limit Hits by Client, last 5 min. Top: mobile-app v3.2 · 9,847 hits · client_id: mob-4f7a
[Chart] Error Rate Heatmap, last 24 hours (each column = 1 hour)
Route Performance Breakdown · Last 5 min
Endpoint | RPS | P99 | Err% | Status
GET /api/products | 2,840 | 642ms | 11.2% | DEGRADED
GET /api/products/:id | 780 | 310ms | 0.8% | SLOW
POST /api/cart | 145 | 88ms | 0.2% | OK
POST /api/checkout | 82 | 124ms | 0.0% | OK
GET /health | 12 | 4ms | 0.0% | OK
Live Error Log Stream
09:14:22 ERROR rate_limit_exceeded · client_id=mob-4f7a · path=/api/products · limit=100/min · current=847/min
09:14:23 ERROR upstream_timeout · path=/api/products · upstream=product-service:3000 · duration_ms=5003 · request_id=req-9f2b
09:14:24 WARN circuit_breaker_open · route=product-catalog · consecutive_failures=5 · tripping in 500ms
09:14:31 INFO rate_limit_config_updated · new_limit=50/min · applied to route=product-catalog · hotfix deployed via GitOps
09:14:45 INFO error_rate_recovering · current=3.2% · target=<1% · upstream latency recovering · p99=280ms
What Happened Next
The team identified client mob-4f7a (mobile app v3.2) as the culprit — a polling loop bug was firing every 70ms instead of 5s. They pushed a rate limit reduction from 100 to 50 req/min via GitOps in under 90 seconds. Error rate recovered to 0.4% within 2 minutes, without touching the product catalog service or mobile app. That's the gateway layer working as intended.
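The polling bug is worth putting into numbers. A quick Python sketch of the arithmetic, using only the intervals and limits quoted above (the function names are illustrative):

```python
def requests_per_minute(interval_ms: float) -> float:
    """Requests one client sends per minute when polling at a fixed interval."""
    return 60_000 / interval_ms


def rejected_per_minute(interval_ms: float, limit_per_min: float) -> float:
    """Requests per minute the rate limiter turns away for that client."""
    return max(0.0, requests_per_minute(interval_ms) - limit_per_min)


print(round(requests_per_minute(70)))        # buggy loop, every 70 ms
print(round(requests_per_minute(5_000)))     # intended loop, every 5 s
print(round(rejected_per_minute(70, 100)))   # rejected at the original limit
print(round(rejected_per_minute(70, 50)))    # rejected after the hotfix
```

A 70 ms loop works out to roughly 857 requests per minute per client, in line with the 847/min the gateway logged, against 12 per minute for the intended 5 s interval. Halving the limit does not stop the client from polling; it roughly halves how much of that traffic reaches the product service.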
Section 04

Alert Rules That Don't Lie

Most alert configs fire too late or too often — either they wake you for noise, or they miss the real incident. These rules are calibrated from the RetailCo and SecureBank production incidents, not invented in a vacuum.

CRITICAL · HighErrorRate · for: 2m
rate(flex_requests_errors_total[5m]) / rate(flex_requests_total[5m]) > 0.05
CRITICAL · LatencySLABreach · for: 3m
histogram_quantile(0.99, rate(flex_request_duration_ms_bucket[5m])) > 500
WARNING · RateLimitStorm · for: 5m
rate(flex_rate_limit_exceeded[5m]) / rate(flex_requests_total[5m]) > 0.20
WARNING · AuthFailureSpike · for: 1m
rate(flex_auth_failures_total[5m]) > 50
INFO · GatewayPodDown · for: 30s
up{job="flex-gateway"} == 0
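The for: clause is what keeps these rules from paging on a single bad scrape: the expression has to stay true across consecutive evaluations for the whole duration before the alert moves from pending to firing. Here is a minimal Python sketch of that behaviour for HighErrorRate. The Sample type and the 15-second evaluation cadence are assumptions for illustration, not Prometheus internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    t: int                    # evaluation time in seconds
    errors_per_sec: float
    requests_per_sec: float


def firing_at(samples: list[Sample], threshold: float = 0.05,
              for_seconds: int = 120) -> Optional[int]:
    """Return the evaluation time at which the alert starts firing, or None."""
    pending_since: Optional[int] = None
    for s in samples:
        ratio = s.errors_per_sec / s.requests_per_sec if s.requests_per_sec else 0.0
        if ratio > threshold:
            if pending_since is None:
                pending_since = s.t          # alert enters the pending state
            if s.t - pending_since >= for_seconds:
                return s.t                   # held long enough: firing
        else:
            pending_since = None             # condition cleared: reset the timer
    return None


# Error ratio jumps to 8.3% at t=60 and stays there.
incident = [Sample(t, 8.3 if t >= 60 else 0.1, 100.0) for t in range(0, 301, 15)]
print(firing_at(incident))
```

With the error ratio crossing 5% at t=60, the alert fires at t=180. A spike that clears before two minutes are up never fires at all.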
alerts.yml — production-calibrated Prometheus alert rules
groups:
  - name: flex-gateway-production
    rules:
      - alert: HighErrorRate
        expr: |
          (rate(flex_requests_errors_total[5m])
            / rate(flex_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.internal/runbooks/flex-gateway-errors"
      - alert: LatencySLABreach
        expr: |
          histogram_quantile(0.99,
            rate(flex_request_duration_ms_bucket[5m])
          ) > 500
        for: 3m
        labels:
          severity: critical
      - alert: RateLimitStorm
        expr: |
          rate(flex_rate_limit_exceeded[5m])
            / rate(flex_requests_total[5m]) > 0.20
        for: 5m
        labels:
          severity: warning
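The LatencySLABreach rule leans on histogram_quantile, which estimates a percentile by interpolating linearly inside the bucket that contains the target rank. The Python sketch below follows that documented approach in simplified form. The bucket boundaries and counts are hypothetical, not real gateway output.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound, with +Inf last.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                return prev_bound  # rank falls in +Inf bucket: highest finite bound
            # Linear interpolation inside the bucket that holds the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in milliseconds.
BUCKETS = [(50, 600), (100, 900), (250, 960), (500, 985), (1000, 1000), (math.inf, 1000)]
print(round(histogram_quantile(0.99, BUCKETS), 1))
```

With these buckets the P99 estimate lands around 667 ms, above the 500 ms threshold. The estimate is only as fine as the bucket layout: with nothing between 500 and 1000 ms, any P99 in that range is an interpolation, so it pays to place bucket boundaries near the SLA threshold.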
Section 05 · SecureBank Use Case

PCI-DSS Compliance Logging

SecureBank has a hard requirement: every API call to their payments endpoint must be logged with who made it, when, from where, and what the outcome was — and that log must be retained for 7 years and be tamper-evident. Here's how they achieve that entirely through Flex Gateway's telemetry stack, without touching application code.

gateway-config.yaml — SecureBank PCI audit logging
apiVersion: gateway.mulesoft.com/v1alpha1
kind: Gateway
metadata:
  name: securebank-gateway
spec:
  telemetry:
    logging:
      format: json
      level: info
      fields:
        - request.id
        - request.timestamp       # ISO-8601 UTC, non-negotiable for PCI
        - request.path
        - request.method
        - request.remoteAddress   # source IP, required by PCI DSS 10.3
        - response.status
        - response.duration_ms
        - jwt.sub                 # cardholder / user identity
        - jwt.org_id              # merchant / tenant ID
        - tls.clientCertSubject   # mTLS client cert DN
        - policy.outcomes         # which policies ran and what they decided
  routes:
    - id: payments-api
      path: ["/v1/payments/*"]
      upstream:
        url: "http://payments-service:8080"
      policies:
        - audit-log:
            always: true
            includeRequestBody: false   # never log card numbers
            destination: "syslog://audit-relay:514"
Tamper-Evident Log Storage
SecureBank ships these logs via syslog to an Elasticsearch cluster with ILM rollover policies and index-level write-once settings. After 24 hours, indices are frozen and marked read-only. Deletion requires dual-authorisation through Vault. This satisfies PCI DSS Requirement 10 without any custom application code.
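The field list in the audit config only helps if every log line actually carries those fields. A small check like the following can run in a pipeline test or against a sample of shipped logs. It is a sketch that assumes each log line is a flat JSON object keyed by the dotted field names from the config; the real log shape may be nested, so adjust the lookup accordingly.

```python
import json

# Fields the audit trail cannot do without (who, when, from where, outcome).
REQUIRED_FIELDS = (
    "request.id",
    "request.timestamp",
    "request.path",
    "request.method",
    "request.remoteAddress",
    "response.status",
    "jwt.sub",
    "tls.clientCertSubject",
)


def missing_fields(log_line: str) -> list[str]:
    """Return the required audit fields that are absent or empty in a log line."""
    record = json.loads(log_line)
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]
```

An empty list means the line is complete. Anything else names the gaps to fix before an auditor finds them.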
Section 06 · Cloud-Native Option

Azure Monitoring Stack — App Insights & Azure Monitor

If your team already runs on Azure, standing up Prometheus, Grafana, and Jaeger yourself is unnecessary overhead. Azure offers a fully managed observability stack that maps directly to Flex Gateway's telemetry outputs — Application Insights for distributed tracing and APM, Azure Monitor for metrics and alerting, Log Analytics Workspace for structured log retention, and Azure Managed Grafana as the dashboard layer.

RetailCo migrated from their self-hosted Prometheus + Grafana stack to the Azure-native equivalent after moving their Kubernetes cluster to AKS. The migration took one afternoon — not because the tools are simpler, but because the integration path with OpenTelemetry Collector is identical. The config changes were minimal.
Metrics & Alerts · Azure Monitor · Replaces Prometheus
Receives gateway metrics via Prometheus remote_write from the OTel Collector. Hosts alert rules, action groups (PagerDuty, Teams, email), and metric dashboards.
APM & Tracing · Application Insights · Replaces Jaeger
Receives OTLP traces via OpenTelemetry Collector. Provides end-to-end transaction maps, failure analysis, dependency tracking, and live metrics stream.
Log Retention · Log Analytics Workspace · Replaces Elasticsearch
Ingests structured JSON logs from Flex Gateway. KQL queries replace Kibana. Retention of up to 7 years (Archive tier) is a good fit for PCI DSS.
Dashboards · Azure Managed Grafana · Replaces Self-Hosted Grafana
Fully managed Grafana backed by Azure Monitor and Log Analytics as native datasources. No servers to manage.
Telemetry Router · OTel Collector · Central Hub
The bridge between Flex Gateway and Azure services. Receives OTLP, fans out to App Insights (traces), Azure Monitor (metrics), and Log Analytics (logs) simultaneously.
Log Ingestion · Data Collection Rule · Cost Control
Azure DCR defines the transformation pipeline from gateway JSON logs to Log Analytics tables. KQL-based filtering keeps ingestion costs down.
Azure Native · Full Observability Architecture
⚡ Flex Gateway → OTLP gRPC → OTel Collector → exporters → App Insights (traces + APM)
OTel Collector → remote_write → Azure Monitor (metrics + alerts)
OTel Collector → DCR pipeline → Log Analytics (logs + KQL queries)
Azure Monitor + Log Analytics → datasource → Azure Managed Grafana (unified dashboards)

Self-Hosted vs Azure — When to Choose What

Criterion | Self-Hosted (Prometheus + Grafana + Jaeger) | ☁ Azure (Monitor + App Insights + Log Analytics)
Setup time | 2–4 hours | 30–60 min
Ongoing ops | You own it | Managed by Azure
Cloud dependency | None | Azure subscription
Air-gap support | ✓ Full | ✗ Needs internet
Cost (1M series) | ~$0 (infra only) | ~$200–400/mo
Log retention | Manage yourself | Up to 7 years
Best for | Air-gap / Local Mode | AKS / Connected Mode
Trace → Log Correlation in App Insights
Application Insights automatically correlates traces with logs using the operation_Id field — the same value as Flex Gateway's request.id. From any failed request in the Failures blade, click "View in Transaction Search" to see every log line, dependency call, and span for that specific request end-to-end.
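The same correlation idea works outside the portal: any exported batch of gateway logs can be grouped by request id to rebuild a per-request timeline. This is a minimal Python sketch, assuming flat JSON records keyed by the dotted field names from the logging config (request.id, request.timestamp) and ISO-8601 UTC timestamps in a single format, so plain string ordering is chronological. The event field is made up for the example.

```python
from collections import defaultdict


def group_by_request(records: list[dict]) -> dict[str, list[dict]]:
    """Group log records by request id, each group ordered by timestamp."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        grouped[record["request.id"]].append(record)
    return {
        request_id: sorted(recs, key=lambda r: r["request.timestamp"])
        for request_id, recs in grouped.items()
    }
```

Looking up one failing request id in the result then gives every log line for that request in order, which is the same view the Transaction Search blade provides.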
Series Complete

From Zero to Production-Grade

Across three posts, we've taken Flex Gateway from concept to fully operational — deployed, secured, and observable. Every decision is grounded in real architectures: a retail platform surviving Black Friday, a bank meeting PCI-DSS, a factory floor with no internet, and a cloud team running on Azure AKS.

What You've Built | Part | Outcome
Architecture, modes, first Docker deployment | Part 1 | Gateway running in minutes
Kubernetes, Helm, policy chains, GitOps CI/CD | Part 2 | Production-grade deployment
Self-hosted Prometheus + Grafana + Jaeger | Part 3 | Full observability, any infra
Azure Monitor + App Insights + Log Analytics | Part 3 | Full observability, Azure native
PCI-DSS audit logs, KQL queries, tamper-evident retention | Part 3 | Compliance without code changes
 MuleSoft Flex Gateway — Series Complete
One gateway. Every cloud.
Zero blind spots.
Self-hosted or Azure-native, Connected Mode or air-gapped Local Mode — the OTel Collector, Prometheus metrics, and structured JSON logs are the same. The destination changes; the visibility doesn't. Your team will know about problems before your users do, whether you're debugging in Grafana, Application Insights, or a KQL query in Log Analytics at 2 AM.