Flex Gateway Part 3 — Observability in Production

MuleSoft Series · Part 3 · Final

Flex Gateway in Production:
Observability That Actually Works

Real incidents, real dashboards, real monitoring stacks — built differently for Connected Mode vs Local Mode. No generic theory. Just what you need the day things break at 2 AM.

This post is grounded in three real-world architectures: a retail e-commerce platform, a regulated banking system, and a factory edge network. Each has different monitoring needs, different tools, and different failure modes. We'll build the observability stack for all three.

Real-World Context

Three Systems, Three Realities

Abstract monitoring advice fails in production because every system has different traffic patterns, compliance requirements, and failure modes. Let's establish our three reference architectures before touching a single config file.



Use Case B · Connected Mode + Strict Compliance

SecureBank — PCI-DSS Regulated Financial API

15k daily transactions. Every failed authentication must be logged and retained for 7 years. mTLS on all connections. They use Connected Mode for policy management but route all telemetry through an on-premise Elastic Stack — no sensitive data ever leaves the datacenter.

Connected Mode · On-Prem Elastic Stack · Jaeger Tracing · PCI-DSS Audit Logs

Section 01

Which Monitoring Stack For Each Mode?

Connected Mode and Local Mode expose telemetry differently. The choice isn't just about tools — it's about where data flows, who owns the pipeline, and what you're allowed to send off-site.

☁ Connected Mode

Anypoint-Integrated Stack

Anypoint Analytics

Built-in. API-level dashboards, SLA tracking, consumer analytics — zero config. Best for business-level visibility.



Prometheus + Grafana

Scrape the /metrics endpoint. Full control over dashboards, alerting rules, and data retention. Works alongside Anypoint.



Jaeger / Tempo

OpenTelemetry traces exported via OTLP. Visualise request flows across microservices in Grafana Tempo or Jaeger UI.



Elastic Stack (optional)

For compliance use cases where structured JSON logs must be retained on-prem. Beats agent ships logs to on-site Elasticsearch.

⚙ Local Mode

Self-Contained Stack



Prometheus (per-site)

The primary tool. Scrapes gateway metrics every 15s. No internet required. Runs as a Docker container or systemd service.



Grafana (per-site)

Local dashboard server. Pre-load the Flex Gateway community dashboard. Alerts via local Alertmanager or email SMTP relay.

️

VictoriaMetrics (central)

Lightweight Prometheus-compatible long-term storage. Each site remote-writes metrics over VPN. Single pane of glass across all sites.



Loki (log aggregation)

Grafana's log aggregation tool. Pairs with Promtail to ship structured JSON gateway logs. No Elasticsearch license needed.

| Capability | ☁ Connected Mode | ⚙ Local Mode |
| --- | --- | --- |
| Business-level analytics | Anypoint built-in | Not available |
| Infrastructure metrics (Prometheus) | ✓ scrape /metrics | ✓ scrape /metrics |
| Distributed tracing | OTLP → Jaeger/Tempo | OTLP → local Jaeger |
| Structured log retention | Elastic / CloudWatch | Loki / Elasticsearch |
| Internet required | ✓ for Anypoint | ✗ fully offline |
| Multi-site aggregation | Anypoint hub | VictoriaMetrics + VPN |
| Compliance log custody | On-prem Elastic possible | ✓ fully on-prem |

Section 02

Installing the Monitoring Stack

 Connected Mode — RetailCo & SecureBank Setup

In Connected Mode, Anypoint Platform already collects API-level metrics. Our job is to add the infrastructure observability layer on top — Prometheus for metrics, Jaeger for traces, and optionally Elastic for compliance logs.

Connected Mode · Full Observability Architecture

⚡ Flex Gateway → ☁ Anypoint Platform (API analytics, policy mgmt)

⚡ Flex Gateway :9090 → scrape → Prometheus → query → Grafana

⚡ Flex Gateway OTLP → export → Jaeger → visualise → Grafana

⚡ Flex Gateway JSON logs → ship → Elasticsearch (compliance retention only)

Enable the metrics endpoint in Flex Gateway

First, expose Prometheus metrics. Add the telemetry block to your gateway config — this is what Prometheus will scrape.

gateway-config.yaml — telemetry block

apiVersion: gateway.mulesoft.com/v1alpha1
kind: Gateway
metadata:
  name: retailco-gateway
spec:
  telemetry:

    # ── Prometheus metrics ──
    metrics:
      enabled:  true
      port:     9090
      path:     "/metrics"

    # ── OpenTelemetry tracing ──
    tracing:
      enabled:     true
      exporter:    otlp
      endpoint:    "http://jaeger:4317"
      sampleRate:  0.05   # 5% — RetailCo has 350k req/hr

    # ── Structured JSON logging ──
    logging:
      format: json
      level:  info
      fields:
        - request.id           # unique per request — essential for tracing
        - request.path
        - request.method
        - response.status
        - response.duration_ms
        - client.id
        - jwt.sub               # SecureBank: required for PCI audit trail
        - jwt.org_id
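
To get a feel for what a 5% sample rate means at RetailCo's stated volume, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this post and is not part of any Flex Gateway API, and it assumes sampling is applied uniformly to every request.

```python
def expected_traces_per_hour(requests_per_hour: int, sample_rate: float) -> int:
    """Expected number of sampled traces per hour under uniform sampling."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    return round(requests_per_hour * sample_rate)


if __name__ == "__main__":
    # Figures taken from the telemetry block: 350k req/hr at sampleRate 0.05.
    per_hour = expected_traces_per_hour(350_000, 0.05)
    print(per_hour)          # traces exported per hour
    print(per_hour / 3600)   # average traces exported per second
```

At these figures that works out to about 17,500 traces an hour, a little under five per second. Rerun the numbers with your own traffic before copying the 0.05.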

Deploy the full monitoring stack with Docker Compose

Spin up Prometheus, Grafana, and Jaeger together. This is the standard Connected Mode companion stack — runs alongside your gateway in Kubernetes or as a standalone compose stack.

docker-compose.monitoring.yml — Connected Mode stack

services:

  # ── Prometheus — metrics collection ──
  prometheus:
    image: prom/prometheus:v2.51.0
    ports:
      - "9091:9090"           # 9091 externally (9090 used by gateway)
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"   # 30-day retention
      - "--web.enable-lifecycle"               # hot-reload config

  # ── Grafana — dashboards & alerting ──
  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana_data:/var/lib/grafana
    depends_on: [prometheus, jaeger]

  # ── Jaeger — distributed tracing ──
  jaeger:
    image: jaegertracing/all-in-one:1.56
    ports:
      - "16686:16686"         # Jaeger UI
      - "4317:4317"           # OTLP gRPC receiver

volumes:
  prometheus_data:
  grafana_data:
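
Once the stack is up, a quick smoke test saves guessing which container failed to start. The sketch below probes the readiness endpoints that Prometheus (`/-/ready`) and Grafana (`/api/health`) expose, plus the Jaeger UI root. The host ports come from the compose file above; `localhost` and the helper names are assumptions for illustration.

```python
import urllib.error
import urllib.request

# Host ports come from the compose file above; "localhost" is an assumption.
HEALTH_URLS = {
    "prometheus": "http://localhost:9091/-/ready",
    "grafana": "http://localhost:3000/api/health",
    "jaeger-ui": "http://localhost:16686/",
}


def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


def check_stack(urls: dict[str, str] = HEALTH_URLS) -> dict[str, bool]:
    """Probe every service once and report which ones responded."""
    return {name: is_up(url) for name, url in urls.items()}


if __name__ == "__main__":
    for name, ok in check_stack().items():
        print(f"{name}: {'up' if ok else 'DOWN'}")
```

Because the compose file starts Prometheus with `--web.enable-lifecycle`, a config change can afterwards be applied with an HTTP POST to `/-/reload` instead of a container restart.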

Configure Prometheus scrape targets

Tell Prometheus where to find your gateway, and add labels so you can filter dashboards by environment and route.

prometheus.yml

global:
  scrape_interval:     15s
  external_labels:
    cluster:     "retailco-prod"
    environment: "production"

rule_files:
  - "/etc/prometheus/alerts.yml"

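alerting:
  alertmanagers:
    - static_configs:
        # Assumes an Alertmanager reachable as "alertmanager"; the compose
        # file above does not define one, so add it or adjust this target.
        - targets: ["alertmanager:9093"]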

scrape_configs:
  - job_name: "flex-gateway"
    static_configs:
      - targets: ["flex-gateway:9090"]
        labels:
          gateway_id:   "prod-eu-west-1"
          mode:         "connected"
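
To confirm the scrape is actually working, you can ask Prometheus itself through its HTTP API. The following is a minimal sketch: it builds an instant-query URL for `/api/v1/query` and picks out any target whose `up` value is 0. The base URL assumes the 9091 host port from the compose file, the helper names are illustrative, and the sample payload is hand-written in the documented response shape rather than captured from a real server.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for Prometheus's /api/v1/query instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def down_targets(response_body: str) -> list[str]:
    """Return the instance label of every series whose sample value is 0."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus returned a non-success status")
    return [
        series["metric"].get("instance", "<unknown>")
        for series in payload["data"]["result"]
        if float(series["value"][1]) == 0.0
    ]


if __name__ == "__main__":
    print(instant_query_url("http://localhost:9091", 'up{job="flex-gateway"}'))

    # Hand-written sample in the documented response shape, not real output.
    sample = json.dumps({
        "status": "success",
        "data": {"resultType": "vector", "result": [
            {"metric": {"job": "flex-gateway", "instance": "flex-gateway:9090"},
             "value": [1700000000, "1"]},
        ]},
    })
    print(down_targets(sample))
```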

⚙️ Local Mode — FactoryEdge Air-Gapped Setup

No internet. No Anypoint. Eight sites across three countries. Here's how FactoryEdge built a fully autonomous monitoring stack that gives both per-site and cross-site visibility without a single cloud call.

Local Mode · Air-Gapped Multi-Site Architecture

⚡ Flex GW (Site A) → remote_write → VictoriaMetrics (Central) → query → Central Grafana

⚡ Flex GW (Site B–H) → local scrape → Local Prometheus → local dash → Site Grafana

⚡ Flex GW JSON logs → tail → Promtail → Loki (all sites, no Elastic license needed)

Per-site stack: Prometheus + Grafana + Loki

Every factory site runs this identical compose file. Images are pre-loaded onto a USB drive at HQ and air-gap installed.

docker-compose.local-monitoring.yml — per-site (air-gapped)

services:

  prometheus:
    image: prom/prometheus:v2.51.0   # pre-loaded image
    ports:   ["9091:9090"]
    volumes:
      - ./prometheus-local.yml:/etc/prometheus/prometheus.yml
      - prom_data:/prometheus
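    command:
      # Setting command replaces the image's default flags, so the config path
      # must be passed explicitly or Prometheus will not find the mounted file.
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=90d"   # 90d local, no cloud backup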

  grafana:
    image: grafana/grafana:10.4.0
    ports:   ["3000:3000"]

  # ── Loki — lightweight log aggregation (no Elastic licence) ──
  loki:
    image: grafana/loki:2.9.0
    ports:   ["3100:3100"]

  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - /var/log/flex-gateway:/var/log/flex-gateway:ro
      - ./promtail.yml:/etc/promtail/config.yml

Prometheus with remote_write to central VictoriaMetrics

Each site scrapes locally and also forwards metrics to a central VictoriaMetrics instance over a site-to-site VPN. That gives every site local autonomy while still allowing cross-site comparison dashboards.

prometheus-local.yml — air-gapped site with remote_write

global:
  scrape_interval: 15s
  external_labels:
    site:    "factory-site-a"     # identifies site in central dashboards
    region: "de-munich"
    mode:   "local"

# ── Forward metrics to central VictoriaMetrics (over VPN) ──
remote_write:
  - url: "http://10.100.0.10:8428/api/v1/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex:  "flex_.*"            # only forward flex metrics
        action: keep

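scrape_configs:
  - job_name: "flex-gateway-local"
    static_configs:
      # Caution: when Prometheus runs in a container (as in the compose file
      # above), "localhost" is the Prometheus container itself. Use an address
      # from which the container can reach the gateway's metrics port.
      - targets: ["localhost:9090"]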
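
It is worth checking what the `keep` rule really lets through before relying on the central dashboards. Prometheus anchors relabel regexes at both ends, so `flex_.*` must match the whole metric name. The sketch below mimics that with Python's `re.fullmatch`; the metric names in the example are illustrative, not a list of what Flex Gateway exports.

```python
import re

# Same pattern as the write_relabel_configs rule; Prometheus anchors it fully.
KEEP_REGEX = re.compile(r"flex_.*")


def forwarded(metric_names: list[str]) -> list[str]:
    """Return the metric names a fully anchored 'keep' rule would let through."""
    return [name for name in metric_names if KEEP_REGEX.fullmatch(name)]


if __name__ == "__main__":
    # Illustrative names only.
    names = ["flex_requests_total", "go_goroutines", "up", "reflex_total"]
    print(forwarded(names))
```

One consequence: `up` does not match, so with this filter it would not reach VictoriaMetrics. An `up == 0` alert would need to run on the site-local Prometheus, or the regex would need widening (for example `flex_.*|up`).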

Section 03

The Dashboard You Actually Need at 2 AM

This is what RetailCo's Grafana dashboard looked like during the Black Friday 2026 flash sale at 09:14, when their product catalog API started returning 503s and their rate limit counter spiked to 3× normal. Real numbers. Real incident.

Grafana · Flex Gateway Overview · retailco-prod · eu-west-1 · ● LIVE

| Panel | Value | Change | Note |
| --- | --- | --- | --- |
| Req / sec | 3,847 | ↑ 2.7× baseline | Flash sale triggered |
| Error Rate | 8.3% | ↑ from 0.1% | ⚠ Alert firing |
| P99 Latency | 642ms | ↑ from 84ms | SLA threshold: 500ms |
| Rate Limit Hits | 14.2k | ↑ 12k/min | Top offender: mobile-app |

[Chart] Request Rate, last 30 min (req/sec)

[Chart] Error Rate %, last 30 min

[Chart] Latency Percentiles P50 / P95 / P99 (ms)

[Chart] Rate Limit Hits by Client, last 5 min. Top: mobile-app v3.2 · 9,847 hits · client_id: mob-4f7a

[Chart] Error Rate Heatmap, last 24 hours (each column = 1 hour)

Route Performance Breakdown, last 5 min

| Endpoint | RPS | P99 | Err% | Status |
| --- | --- | --- | --- | --- |
| GET /api/products | 2,840 | 642ms | 11.2% | DEGRADED |
| GET /api/products/:id | 780 | 310ms | 0.8% | SLOW |
| POST /api/cart | 145 | 88ms | 0.2% | |
| POST /api/checkout | | 124ms | 0.0% | |
| GET /health | | 4ms | 0.0% | |

Live Error Log Stream

09:14:22 ERROR rate_limit_exceeded · client_id=mob-4f7a · path=/api/products · limit=100/min · current=847/min

09:14:23 ERROR upstream_timeout · path=/api/products · upstream=product-service:3000 · duration_ms=5003 · request_id=req-9f2b

09:14:24 WARN circuit_breaker_open · route=product-catalog · consecutive_failures=5 · tripping in 500ms

09:14:31 INFO rate_limit_config_updated · new_limit=50/min · applied to route=product-catalog · hotfix deployed via GitOps

09:14:45 INFO error_rate_recovering · current=3.2% · target=<1% · upstream latency recovering · p99=280ms



What Happened Next

The team identified client mob-4f7a (mobile app v3.2) as the culprit — a polling loop bug was firing every 70ms instead of 5s. They pushed a rate limit reduction from 100 to 50 req/min via GitOps in under 90 seconds. Error rate recovered to 0.4% within 2 minutes, without touching the product catalog service or mobile app. That's the gateway layer working as intended.
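
The numbers in the log line up with the diagnosis, and the arithmetic is easy to check. Here is a small sketch converting a polling interval into requests per minute; the function name is illustrative, while the 70ms, 5s, and 100/min figures come from the incident above.

```python
def requests_per_minute(poll_interval_ms: float) -> float:
    """Requests one client sends per minute when polling at a fixed interval."""
    if poll_interval_ms <= 0:
        raise ValueError("poll_interval_ms must be positive")
    return 60_000 / poll_interval_ms


if __name__ == "__main__":
    limit_per_min = 100                      # original rate limit on the route
    intended = requests_per_minute(5_000)    # the 5s interval the app should use
    buggy = requests_per_minute(70)          # the 70ms interval caused by the bug
    print(intended, intended <= limit_per_min)
    print(round(buggy), buggy > limit_per_min)
```

A 5s poll is 12 requests a minute, comfortably under the limit. A 70ms poll is roughly 857 a minute, which is in line with the `current=847/min` the gateway logged for mob-4f7a.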

Section 04

Alert Rules That Don't Lie

Most alert configs fire too late or too often — either they wake you for noise, or they miss the real incident. These rules are calibrated from the RetailCo and SecureBank production incidents, not invented in a vacuum.

| Severity | Alert | Expression | For |
| --- | --- | --- | --- |
| CRITICAL | HighErrorRate | rate(flex_requests_errors_total[5m]) / rate(flex_requests_total[5m]) > 0.05 | 2m |
| CRITICAL | LatencySLABreach | histogram_quantile(0.99, rate(flex_request_duration_ms_bucket[5m])) > 500 | 3m |
| WARNING | RateLimitStorm | rate(flex_rate_limit_exceeded[5m]) / rate(flex_requests_total[5m]) > 0.20 | 5m |
| WARNING | AuthFailureSpike | rate(flex_auth_failures_total[5m]) > 50 | 1m |
| INFO | GatewayPodDown | up{job="flex-gateway"} == 0 | 30s |
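
The LatencySLABreach rule leans on `histogram_quantile`, which estimates a percentile from bucket counts rather than from raw latencies. A simplified Python sketch of the linear interpolation it is documented to use is shown below. It works on plain cumulative counts instead of per-second rates, and the bucket boundaries and counts in the example are invented for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            span = count - prev_count
            fraction = (rank - prev_count) / span if span else 0.0
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


if __name__ == "__main__":
    # Invented example: cumulative request counts per latency bucket (ms).
    sample = [(100.0, 900.0), (250.0, 960.0), (500.0, 980.0),
              (1000.0, 1000.0), (math.inf, 1000.0)]
    p99 = histogram_quantile(0.99, sample)
    print(p99, p99 > 500)   # compare against the 500ms SLA threshold
```

The takeaway is that the reported P99 is only as precise as the bucket layout. If the SLA threshold is 500ms, it helps to have a bucket boundary at or near 500.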

alerts.yml — production-calibrated Prometheus alert rules

groups:
  - name: flex-gateway-production
    rules:

    - alert: HighErrorRate
      expr: |
        (rate(flex_requests_errors_total[5m])
        / rate(flex_requests_total[5m])) > 0.05
      for: 2m
      labels:
        severity: critical
        team:     platform
      annotations:
        summary: "Error rate {{ $value | humanizePercentage }}"
        runbook_url: "https://wiki.internal/runbooks/flex-gateway-errors"

    - alert: LatencySLABreach
      expr: |
        histogram_quantile(0.99,
          rate(flex_request_duration_ms_bucket[5m])
        ) > 500
      for: 3m
      labels:
        severity: critical

    - alert: RateLimitStorm
      expr: |
        rate(flex_rate_limit_exceeded[5m])
        / rate(flex_requests_total[5m]) > 0.20
      for: 5m
      labels:
        severity: warning

Section 05 · SecureBank Use Case

PCI-DSS Compliance Logging

SecureBank has a hard requirement: every API call to their payments endpoint must be logged with who made it, when, from where, and what the outcome was — and that log must be retained for 7 years and be tamper-evident. Here's how they achieve that entirely through Flex Gateway's telemetry stack, without touching application code.

gateway-config.yaml — SecureBank PCI audit logging

apiVersion: gateway.mulesoft.com/v1alpha1
kind: Gateway
metadata:
  name: securebank-gateway
spec:
  telemetry:
    logging:
      format:  json
      level:   info
      fields:
        - request.id
        - request.timestamp    # ISO-8601 UTC — non-negotiable for PCI
        - request.path
        - request.method
        - request.remoteAddress # source IP — required by PCI DSS 10.3
        - response.status
        - response.duration_ms
        - jwt.sub               # cardholder / user identity
        - jwt.org_id            # merchant / tenant ID
        - tls.clientCertSubject # mTLS client cert DN
        - policy.outcomes       # which policies ran and what they decided

  routes:
    - id: payments-api
      path: ["/v1/payments/*"]
      upstream:
        url: "http://payments-service:8080"
      policies:
        - audit-log:
            always: true
            includeRequestBody: false   # never log card numbers
            destination: "syslog://audit-relay:514"
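
Before trusting an audit pipeline, it helps to verify that the records arriving at the relay really carry the fields an auditor will ask for. The check below is a sketch under one big assumption: that each log line is a flat JSON object keyed by the dotted field names from the config above. The actual record layout may differ, and the sample line is hand-written.

```python
import json

# Field names copied from the logging config; the flat layout is an assumption.
REQUIRED_FIELDS = (
    "request.id",
    "request.timestamp",
    "request.path",
    "request.method",
    "request.remoteAddress",
    "response.status",
    "jwt.sub",
    "tls.clientCertSubject",
)


def missing_fields(log_line: str) -> list[str]:
    """Return the required audit fields that are absent or empty in one record."""
    record = json.loads(log_line)
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


if __name__ == "__main__":
    # Hand-written sample record, not real gateway output.
    line = json.dumps({
        "request.id": "req-0001",
        "request.timestamp": "2026-01-01T00:00:00Z",
        "request.path": "/v1/payments/123",
        "request.method": "POST",
        "response.status": 401,
        "jwt.sub": "user-42",
    })
    print(missing_fields(line))
```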



Tamper-Evident Log Storage

SecureBank ships these logs via syslog to an Elasticsearch cluster with ILM rollover policies and index-level write-once settings. After 24 hours, indices are frozen and marked read-only. Deletion requires dual-authorisation through Vault. This covers the audit-log capture, retention, and protection parts of PCI DSS Requirement 10 without any custom application code.

Section 06 · Cloud-Native Option

Azure Monitoring Stack — App Insights & Azure Monitor

If your team already runs on Azure, standing up Prometheus, Grafana, and Jaeger yourself is unnecessary overhead. Azure offers a fully managed observability stack that maps directly to Flex Gateway's telemetry outputs — Application Insights for distributed tracing and APM, Azure Monitor for metrics and alerting, Log Analytics Workspace for structured log retention, and Azure Managed Grafana as the dashboard layer.

RetailCo migrated from their self-hosted Prometheus + Grafana stack to the Azure-native equivalent after moving their Kubernetes cluster to AKS. The migration took one afternoon — not because the tools are simpler, but because the integration path with OpenTelemetry Collector is identical. The config changes were minimal.



Azure Monitor (Metrics & Alerts · Replaces Prometheus)

Receives gateway metrics via Prometheus remote_write from the OTel Collector. Hosts alert rules, action groups (PagerDuty, Teams, email), and metric dashboards.

Application Insights (APM & Tracing · Replaces Jaeger)

Receives OTLP traces via OpenTelemetry Collector. Provides end-to-end transaction maps, failure analysis, dependency tracking, and live metrics stream.

Log Analytics Workspace (Log Retention · Replaces Elasticsearch)

Ingests structured JSON logs from Flex Gateway. KQL queries replace Kibana. Retention up to 7 years (Archive tier), a good fit for PCI DSS retention.

Azure Managed Grafana (Dashboards · Replaces Self-Hosted Grafana)

Fully managed Grafana backed by Azure Monitor and Log Analytics as native datasources. No servers to manage.

OTel Collector (Telemetry Router · Central Hub)

The bridge between Flex Gateway and Azure services. Receives OTLP, fans out to App Insights (traces), Azure Monitor (metrics), and Log Analytics (logs) simultaneously.

Data Collection Rule (Log Ingestion · Cost Control)

Azure DCR defines the transformation pipeline from gateway JSON logs to Log Analytics tables. KQL-based filtering keeps ingestion costs down.

Azure Native · Full Observability Architecture

⚡ Flex Gateway → OTLP gRPC → OTel Collector

OTel Collector → exporters → App Insights (traces + APM)

OTel Collector → remote_write → Azure Monitor (metrics + alerts)

OTel Collector → DCR pipeline → Log Analytics (logs + KQL queries)

Azure Monitor, Log Analytics → datasource → Azure Managed Grafana (unified dashboards)

Self-Hosted vs Azure — When to Choose What

| | Self-Hosted (Prometheus + Grafana + Jaeger) | ☁ Azure (Monitor + App Insights + Log Analytics) |
| --- | --- | --- |
| Setup time | 2–4 hours | 30–60 min |
| Ongoing ops | You own it | Managed by Azure |
| Cloud dependency | None | Azure subscription |
| Air-gap support | ✓ Full | ✗ Needs internet |
| Cost (1M series) | ~$0 (infra only) | ~$200–400/mo |
| Log retention | Manage yourself | Up to 7 years |
| Best for | Air-gap / Local Mode | AKS / Connected Mode |



Trace → Log Correlation in App Insights

Application Insights automatically correlates traces with logs using the operation_Id field — the same value as Flex Gateway's request.id. From any failed request in the Failures blade, click "View in Transaction Search" to see every log line, dependency call, and span for that specific request end-to-end.
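
The underlying idea is nothing more than joining records on a shared identifier. As a rough illustration of what the portal does for you, the sketch below groups JSON log lines by request ID. It assumes a flat record with a `request.id` key, as configured in the logging block earlier; the helper name and sample lines are hypothetical.

```python
import json
from collections import defaultdict


def group_by_request(log_lines: list[str],
                     id_field: str = "request.id") -> dict[str, list[dict]]:
    """Group parsed log records by their request identifier."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for line in log_lines:
        record = json.loads(line)
        request_id = record.get(id_field)
        if request_id is not None:
            grouped[request_id].append(record)
    return dict(grouped)


if __name__ == "__main__":
    # Hand-written sample lines, not real gateway output.
    lines = [
        json.dumps({"request.id": "req-9f2b", "event": "upstream_timeout"}),
        json.dumps({"request.id": "req-1a2c", "event": "ok"}),
        json.dumps({"request.id": "req-9f2b", "event": "circuit_breaker_open"}),
    ]
    for record in group_by_request(lines)["req-9f2b"]:
        print(record["event"])
```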

Series Complete

From Zero to Production-Grade

Across three posts, we've taken Flex Gateway from concept to fully operational — deployed, secured, and observable. Every decision is grounded in real architectures: a retail platform surviving Black Friday, a bank meeting PCI-DSS, a factory floor with no internet, and a cloud team running on Azure AKS.

What You've Built

| Part | What it covers | Outcome |
| --- | --- | --- |
| Part 1 | Architecture, modes, first Docker deployment | Gateway running in minutes |
| Part 2 | Kubernetes, Helm, policy chains, GitOps CI/CD | Production-grade deployment |
| Part 3 | Self-hosted Prometheus + Grafana + Jaeger | Full observability — any infra |
| Part 3 | Azure Monitor + App Insights + Log Analytics | Full observability — Azure native |
| Part 3 | PCI-DSS audit logs, KQL queries, tamper-evident retention | Compliance without code changes |

 MuleSoft Flex Gateway — Series Complete

One gateway. Every cloud.
Zero blind spots.

Self-hosted or Azure-native, Connected Mode or air-gapped Local Mode — the OTel Collector, Prometheus metrics, and structured JSON logs are the same. The destination changes; the visibility doesn't. Your team will know about problems before your users do, whether you're debugging in Grafana, Application Insights, or a KQL query in Log Analytics at 2 AM.
