Flex Gateway in Production: Observability That Actually Works
Real incidents, real dashboards, real monitoring stacks — built differently for Connected Mode vs Local Mode. No generic theory. Just what you need the day things break at 2 AM.
This post is grounded in three real-world architectures: a retail e-commerce platform, a regulated banking system, and a factory edge network. Each has different monitoring needs, different tools, and different failure modes. We'll build the observability stack for all three.
Real-World Context
Three Systems, Three Realities
Abstract monitoring advice fails in production because every system has different traffic patterns, compliance requirements, and failure modes. Let's establish our three reference architectures before touching a single config file.
Use Case A · Connected Mode
RetailCo — High-Volume E-Commerce Platform
350k API requests/hour at peak (Black Friday spikes to 1.4M/hr). Flex Gateway in Connected Mode feeds Anypoint Analytics. They need real-time throughput dashboards, P99 latency tracking, and instant rate-limit alerting when flash sales hammer the product catalog API.
Use Case B · Connected Mode
SecureBank: Regulated Banking System
15k daily transactions. Every failed authentication must be logged and retained for 7 years. mTLS on all connections. They use Connected Mode for policy management but route all telemetry through an on-premise Elastic Stack, so no sensitive data ever leaves the datacenter.
Use Case C · Local Mode
FactoryEdge: Factory Edge Network
8 factory sites across 3 countries. No internet. Flex Gateway in Local Mode manages APIs between PLCs, MES systems, and SCADA platforms. Each site runs a self-contained Prometheus + Grafana stack. A central VictoriaMetrics aggregates across sites over a dedicated VPN.
Local Mode · Air-Gapped Network · VictoriaMetrics · On-Site Grafana
Section 01
Which Monitoring Stack For Each Mode?
Connected Mode and Local Mode expose telemetry differently. The choice isn't just about tools — it's about where data flows, who owns the pipeline, and what you're allowed to send off-site.
☁ Connected Mode
Anypoint-Integrated Stack
Anypoint Analytics
Built-in. API-level dashboards, SLA tracking, consumer analytics — zero config. Best for business-level visibility.
Prometheus + Grafana
Scrape the /metrics endpoint. Full control over dashboards, alerting rules, and data retention. Works alongside Anypoint.
Jaeger / Tempo
OpenTelemetry traces exported via OTLP. Visualise request flows across microservices in Grafana Tempo or Jaeger UI.
Elastic Stack (optional)
For compliance use cases where structured JSON logs must be retained on-prem. Beats agent ships logs to on-site Elasticsearch.
⚙ Local Mode
Self-Contained Stack
Prometheus (per-site)
The primary tool. Scrapes gateway metrics every 15s. No internet required. Runs as a Docker container or systemd service.
Grafana (per-site)
Local dashboard server. Pre-load the Flex Gateway community dashboard. Alerts via local Alertmanager or email SMTP relay.
VictoriaMetrics (central)
Lightweight Prometheus-compatible long-term storage. Each site remote-writes metrics over VPN. Single pane of glass across all sites.
Loki (log aggregation)
Grafana's log aggregation tool. Pairs with Promtail to ship structured JSON gateway logs. No Elasticsearch license needed.
| Capability | ☁ Connected Mode | ⚙ Local Mode |
|---|---|---|
| Business-level analytics | Anypoint built-in | Not available |
| Infrastructure metrics (Prometheus) | ✓ scrape /metrics | ✓ scrape /metrics |
| Distributed tracing | OTLP → Jaeger/Tempo | OTLP → local Jaeger |
| Structured log retention | Elastic / CloudWatch | Loki / Elasticsearch |
| Internet required | ✓ for Anypoint | ✗ fully offline |
| Multi-site aggregation | Anypoint hub | VictoriaMetrics + VPN |
| Compliance log custody | On-prem Elastic possible | ✓ fully on-prem |
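For the Local Mode stack, alert delivery through a local Alertmanager and an on-site SMTP relay could look roughly like the sketch below. The relay hostname, sender address, and recipient are placeholders invented for illustration; the `site` grouping label assumes each site's Prometheus attaches a `site` external label.

```yaml
# alertmanager.yml (sketch; hostnames and addresses are hypothetical)
global:
  smtp_smarthost: "smtp-relay.site.local:25"   # assumed on-site SMTP relay
  smtp_from: "alertmanager@site.local"
  smtp_require_tls: false                      # assumes a plain-text internal relay

route:
  receiver: site-oncall
  group_by: ["alertname", "site"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: site-oncall
    email_configs:
      - to: "oncall@site.local"
        send_resolved: true
```

Because everything resolves inside the site network, this path keeps working when the VPN link to the central site is down.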
Section 02
Installing the Monitoring Stack
Connected Mode — RetailCo & SecureBank Setup
In Connected Mode, Anypoint Platform already collects API-level metrics. Our job is to add the infrastructure observability layer on top — Prometheus for metrics, Jaeger for traces, and optionally Elastic for compliance logs.
Connected Mode · Full Observability Architecture
⚡ Flex Gateway → ☁ Anypoint Platform (API analytics, policy mgmt)
⚡ Flex Gateway :9090 → scrape → Prometheus → query → Grafana
⚡ Flex Gateway OTLP → export → Jaeger → visualise → Grafana
⚡ Flex Gateway JSON logs → ship → Elasticsearch (compliance retention only)
01
Enable the metrics endpoint in Flex Gateway
First, expose Prometheus metrics. Add the telemetry block to your gateway config — this is what Prometheus will scrape.
gateway-config.yaml — telemetry block
apiVersion: gateway.mulesoft.com/v1alpha1
kind: Gateway
metadata:
  name: retailco-gateway
spec:
  telemetry:
    # ── Prometheus metrics ──
    metrics:
      enabled: true
      port: 9090
      path: "/metrics"
    # ── OpenTelemetry tracing ──
    tracing:
      enabled: true
      exporter: otlp
      endpoint: "http://jaeger:4317"
      sampleRate: 0.05          # 5%; RetailCo has 350k req/hr
    # ── Structured JSON logging ──
    logging:
      format: json
      level: info
      fields:
        - request.id            # unique per request; essential for tracing
        - request.path
        - request.method
        - response.status
        - response.duration_ms
        - client.id
        - jwt.sub               # SecureBank: required for PCI audit trail
        - jwt.org_id
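On the Prometheus side, a matching scrape job might look like the sketch below. The port and path come from the telemetry block; the target hostname `flex-gateway`, the job name, and the 15s interval for this stack are assumptions, so substitute whatever your deployment uses.

```yaml
# prometheus.yml (sketch)
global:
  scrape_interval: 15s                  # assumed; Prometheus defaults to 1m

scrape_configs:
  - job_name: "flex-gateway"            # hypothetical job name
    metrics_path: /metrics              # matches telemetry.metrics.path
    static_configs:
      - targets: ["flex-gateway:9090"]  # assumed hostname; port from telemetry.metrics.port
```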
02
Deploy the full monitoring stack with Docker Compose
Spin up Prometheus, Grafana, and Jaeger together. This is the standard Connected Mode companion stack — runs alongside your gateway in Kubernetes or as a standalone compose stack.
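A minimal compose file for that companion stack could look like the following. This is a sketch rather than RetailCo's actual file: the image tags, host ports, and the `./prometheus.yml` path are assumptions. The service name `jaeger` is chosen so that it resolves for the `http://jaeger:4317` tracing endpoint in the gateway config.

```yaml
# docker-compose.yml (sketch; image tags and host ports are assumptions)
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    ports: ["9091:9090"]        # host port 9091, leaving 9090 free for the gateway's metrics port
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro

  grafana:
    image: grafana/grafana:10.4.0
    ports: ["3000:3000"]
    depends_on: [prometheus, jaeger]

  jaeger:
    image: jaegertracing/all-in-one:1.55
    environment:
      COLLECTOR_OTLP_ENABLED: "true"   # accept OTLP on 4317 (gRPC) and 4318 (HTTP)
    ports:
      - "16686:16686"           # Jaeger UI
      - "4317:4317"             # OTLP gRPC, used by the gateway's tracing exporter
```

The all-in-one Jaeger image keeps traces in memory, which is fine for a first deployment but would need a persistent backend before it carries production trace volume.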
Local Mode · FactoryEdge Setup
No internet. No Anypoint. Eight sites across three countries. Here's how FactoryEdge built a fully autonomous monitoring stack that gives both per-site and cross-site visibility without a single cloud call.
Local Mode · Air-Gapped Multi-Site Architecture
⚡ Flex GW (Site A) → remote_write → VictoriaMetrics (Central) → query → Central Grafana
⚡ Flex GW (Site B–H) → local scrape → Local Prometheus → local dash → Site Grafana
⚡ Flex GW JSON logs → tail → Promtail → Loki (all sites, no Elastic license needed)
01
Per-site stack: Prometheus + Grafana + Loki
Every factory site runs this identical compose file. Images are pre-loaded onto a USB drive at HQ and air-gap installed.
services:
  prometheus:
    image: prom/prometheus:v2.51.0       # pre-loaded image
    ports: ["9091:9090"]
    volumes:
      - ./prometheus-local.yml:/etc/prometheus/prometheus.yml
      - prom_data:/prometheus
    command:
      # Overriding command drops the image's default flags, so restate them.
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=90d"   # 90d local, no cloud backup
  grafana:
    image: grafana/grafana:10.4.0
    ports: ["3000:3000"]
  # ── Loki: lightweight log aggregation (no Elastic licence) ──
  loki:
    image: grafana/loki:2.9.0
    ports: ["3100:3100"]
  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - /var/log/flex-gateway:/var/log/flex-gateway:ro
      - ./promtail.yml:/etc/promtail/config.yml

# Named volumes must be declared at the top level.
volumes:
  prom_data: {}
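The compose file mounts a `promtail.yml` that the post does not show. A minimal version might look like this sketch; the `*.log` glob, the positions file location, and the label values are assumptions, while the Loki URL relies on the `loki` service name from the compose file.

```yaml
# promtail.yml (sketch; log file glob and labels are assumptions)
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml          # where Promtail records how far it has read

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: flex-gateway
    static_configs:
      - targets: [localhost]
        labels:
          job: flex-gateway
          site: factory-site-a           # keep in line with the site's Prometheus external label
          __path__: /var/log/flex-gateway/*.log
```

Keeping the label set small (job and site only) avoids high-cardinality streams in Loki; per-request fields stay inside the JSON log line and can be filtered at query time.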
02
Prometheus with remote_write to central VictoriaMetrics
Each site scrapes locally and also forwards metrics to a central VictoriaMetrics instance over a site-to-site VPN. The result is both local autonomy and cross-site comparison dashboards.
prometheus-local.yml — air-gapped site with remote_write
global:
  scrape_interval: 15s
  external_labels:
    site: "factory-site-a"     # identifies site in central dashboards
    region: "de-munich"
    mode: "local"

# ── Forward metrics to central VictoriaMetrics (over VPN) ──
remote_write:
  - url: "http://10.100.0.10:8428/api/v1/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "flex_.*"       # only forward flex metrics
        action: keep

scrape_configs:
  - job_name: "flex-gateway-local"
    static_configs:
      # localhost only reaches the gateway if Prometheus shares its network
      # namespace; from inside a bridged container, use the gateway's host address.
      - targets: ["localhost:9090"]
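On the central side, Grafana can read VictoriaMetrics through its Prometheus-compatible query API. The provisioning file below is a sketch: the file path and datasource name are assumptions, and the URL reuses the VictoriaMetrics address from the remote_write target.

```yaml
# provisioning/datasources/victoriametrics.yml (sketch; name and path are assumptions)
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus                  # VictoriaMetrics speaks the Prometheus query API
    access: proxy
    url: http://10.100.0.10:8428      # same host and port as the remote_write target
    isDefault: true
    jsonData:
      timeInterval: 15s               # match the per-site scrape interval
```

With the `site` and `region` external labels attached at each site, cross-site panels can group or filter by those labels without any per-site datasource.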
Section 03
The Dashboard You Actually Need at 2 AM
This is what RetailCo's Grafana dashboard looked like during the Black Friday 2026 flash sale at 09:14 AM — when their product catalog API started returning 503s and their rate limit counter spiked to 3× normal. Real numbers. Real incident.
The team identified client mob-4f7a (mobile app v3.2) as the culprit: a polling loop bug was firing every 70ms instead of every 5s, roughly 857 requests per minute per polling client instead of the intended 12. They pushed a rate limit reduction from 100 to 50 req/min via GitOps in under 90 seconds. Error rate recovered to 0.4% within 2 minutes, without touching the product catalog service or mobile app. That's the gateway layer working as intended.
Section 04
Alert Rules That Don't Lie
Most alert configs fire too late or too often — either they wake you for noise, or they miss the real incident. These rules are calibrated from the RetailCo and SecureBank production incidents, not invented in a vacuum.
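As an illustration of the shape such rules take, here is a sketch of a Prometheus rule file covering error ratio, P99 latency, and rate-limit spikes. The metric names are hypothetical (only the `flex_` prefix appears elsewhere in this post), and the thresholds and `for` durations are placeholders to tune against your own baseline, not the calibrated production values.

```yaml
# flex-gateway-alerts.yml (sketch; metric names and thresholds are hypothetical)
groups:
  - name: flex-gateway
    rules:
      - alert: FlexGatewayHigh5xxRatio
        expr: |
          sum(rate(flex_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(flex_http_requests_total[5m])) > 0.02
        for: 2m                      # must hold for 2 minutes before firing
        labels:
          severity: page
        annotations:
          summary: "More than 2% of gateway responses are 5xx"

      - alert: FlexGatewayP99LatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(flex_http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: ticket
        annotations:
          summary: "P99 gateway latency above 1s"

      - alert: FlexGatewayRateLimitSpike
        expr: |
          sum(rate(flex_rate_limit_rejected_total[5m]))
            > 3 * sum(rate(flex_rate_limit_rejected_total[5m] offset 1d))
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Rate-limit rejections are 3x the same time yesterday"
```

The file is loaded through `rule_files` in the Prometheus config. The `for` clause is what separates a sustained problem from a single noisy scrape, and comparing against the same window a day earlier keeps the rate-limit alert relative to normal traffic rather than a fixed number.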
SecureBank has a hard requirement: every API call to their payments endpoint must be logged with who made it, when, from where, and what the outcome was — and that log must be retained for 7 years and be tamper-evident. Here's how they achieve that entirely through Flex Gateway's telemetry stack, without touching application code.
apiVersion: gateway.mulesoft.com/v1alpha1
kind: Gateway
metadata:
  name: securebank-gateway
spec:
  telemetry:
    logging:
      format: json
      level: info
      fields:
        - request.id
        - request.timestamp       # ISO-8601 UTC; non-negotiable for PCI
        - request.path
        - request.method
        - request.remoteAddress   # source IP; required by PCI DSS 10.3
        - response.status
        - response.duration_ms
        - jwt.sub                 # cardholder / user identity
        - jwt.org_id              # merchant / tenant ID
        - tls.clientCertSubject   # mTLS client cert DN
        - policy.outcomes         # which policies ran and what they decided
  routes:
    - id: payments-api
      path: ["/v1/payments/*"]
      upstream:
        url: "http://payments-service:8080"
      policies:
        - audit-log:
            always: true
            includeRequestBody: false   # never log card numbers
            destination: "syslog://audit-relay:514"
Tamper-Evident Log Storage
SecureBank ships these logs via syslog to an Elasticsearch cluster with ILM rollover policies and index-level write-once settings. After 24 hours, indices are frozen and marked read-only. Deletion requires dual-authorisation through Vault. This satisfies PCI DSS Requirement 10 without any custom application code.
If your team already runs on Azure, standing up Prometheus, Grafana, and Jaeger yourself is unnecessary overhead. Azure offers a fully managed observability stack that maps directly to Flex Gateway's telemetry outputs — Application Insights for distributed tracing and APM, Azure Monitor for metrics and alerting, Log Analytics Workspace for structured log retention, and Azure Managed Grafana as the dashboard layer.
RetailCo migrated from their self-hosted Prometheus + Grafana stack to the Azure-native equivalent after moving their Kubernetes cluster to AKS. The migration took one afternoon — not because the tools are simpler, but because the integration path with OpenTelemetry Collector is identical. The config changes were minimal.
Metrics & Alerts
Azure Monitor
Receives Prometheus remote_write from the gateway. Hosts alert rules, action groups (PagerDuty, Teams, email), and metric dashboards.
Replaces Prometheus
APM & Tracing
Application Insights
Receives OTLP traces via OpenTelemetry Collector. Provides end-to-end transaction maps, failure analysis, dependency tracking, and live metrics stream.
Replaces Jaeger
Log Retention
Log Analytics Workspace
Ingests structured JSON logs from Flex Gateway. KQL queries replace Kibana. Retention up to 7 years (Archive tier) — perfect for PCI DSS.
Replaces Elasticsearch
Dashboards
Azure Managed Grafana
Fully managed Grafana backed by Azure Monitor and Log Analytics as native datasources. No servers to manage.
Replaces Self-Hosted Grafana
Telemetry Router
OTel Collector
The bridge between Flex Gateway and Azure services. Receives OTLP, fans out to App Insights (traces), Azure Monitor (metrics), and Log Analytics (logs) simultaneously.
Central Hub
Log Ingestion
Data Collection Rule
Azure DCR defines the transformation pipeline from gateway JSON logs to Log Analytics tables. KQL-based filtering keeps ingestion costs down.
Cost Control
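To make the Telemetry Router role concrete, here is a sketch of a Collector config for the traces path only. It assumes the OpenTelemetry Collector Contrib distribution (which ships the `azuremonitor` exporter) and an environment variable named `APPLICATIONINSIGHTS_CONNECTION_STRING` holding your Application Insights connection string; the metrics and log pipelines need their own exporters and authentication and are left out here.

```yaml
# otel-collector.yaml (sketch; traces pipeline only, Contrib distribution assumed)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317      # the gateway's OTLP gRPC exporter points here

processors:
  batch: {}                         # batch spans before export to cut request count

exporters:
  azuremonitor:
    connection_string: ${env:APPLICATIONINSIGHTS_CONNECTION_STRING}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [azuremonitor]
```

Because the gateway only ever speaks OTLP to the Collector, switching between Jaeger and Application Insights becomes an exporter change in this file rather than a gateway change.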
Azure Native · Full Observability Architecture
⚡ Flex Gateway → OTLP gRPC → OTel Collector → exporters → App Insights (traces + APM)
OTel Collector → remote_write → Azure Monitor (metrics + alerts)
OTel Collector → DCR pipeline → Log Analytics (logs + KQL queries)
Azure Monitor + Log Analytics → datasource → Azure Managed Grafana (unified dashboards)
Self-Hosted vs Azure — When to Choose What
| | Self-Hosted (Prometheus + Grafana + Jaeger) | ☁ Azure (Monitor + App Insights + Log Analytics) |
|---|---|---|
| Setup time | 2–4 hours | 30–60 min |
| Ongoing ops | You own it | Managed by Azure |
| Cloud dependency | None | Azure subscription |
| Air-gap support | ✓ Full | ✗ Needs internet |
| Cost (1M series) | ~$0 (infra only) | ~$200–400/mo |
| Log retention | Manage yourself | Up to 7 years |
| Best for | Air-gap / Local Mode | AKS / Connected Mode |
Trace → Log Correlation in App Insights
Application Insights automatically correlates traces with logs using the operation_Id field — the same value as Flex Gateway's request.id. From any failed request in the Failures blade, click "View in Transaction Search" to see every log line, dependency call, and span for that specific request end-to-end.
Series Complete
From Zero to Production-Grade
Across three posts, we've taken Flex Gateway from concept to fully operational — deployed, secured, and observable. Every decision is grounded in real architectures: a retail platform surviving Black Friday, a bank meeting PCI-DSS, a factory floor with no internet, and a cloud team running on Azure AKS.
Self-hosted or Azure-native, Connected Mode or air-gapped Local Mode — the OTel Collector, Prometheus metrics, and structured JSON logs are the same. The destination changes; the visibility doesn't. Your team will know about problems before your users do, whether you're debugging in Grafana, Application Insights, or a KQL query in Log Analytics at 2 AM.