Multi-Tenant Observability Architecture for Carrier Integration: Preventing Data Leakage Without Sacrificing Performance

When you build carrier integration middleware that serves multiple shippers, observability becomes a minefield. Each tenant's data needs complete isolation, costs need precise attribution, and different retention policies can't interfere with performance. Yet most platform teams treat observability as an afterthought until data leaks between tenants or billing becomes impossible to reconcile.

Multi-tenant observability architectures face three critical challenges: data isolation violations where tenants can see each other's data, impossible cost attribution where you can't determine which tenant generates how much telemetry, and conflicting retention policies where some tenants need 30-day retention while others need 90. For carrier middleware platforms handling thousands of shipments across hundreds of customers, these aren't theoretical concerns.

The Foundation: Tenant Context Propagation

The architecture starts with OpenTelemetry resource attributes, which give you a clean way to tag telemetry with tenant identity at the source: a resource attribute applies to every signal emitted by a service instance. This isn't just about adding a tenant ID to individual spans, though. You need systematic context propagation that works across async message queues, webhook callbacks, and long-running batch jobs.

Here's how tenant context flows through the system:

// Tenant middleware injects context at request ingress
import (
    "net/http"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/baggage"
    "go.opentelemetry.io/otel/trace"
)

func TenantMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        tenantID := extractTenantID(r) // from JWT, header, or subdomain

        // Tag the active span so this request's telemetry carries tenant identity
        span := trace.SpanFromContext(r.Context())
        span.SetAttributes(attribute.String("tenant.id", tenantID))

        // Propagate via baggage for downstream services
        // (errors ignored for brevity; handle them in production code)
        idMember, _ := baggage.NewMember("tenant.id", tenantID)
        tierMember, _ := baggage.NewMember("tenant.tier", getTenantTier(tenantID))
        bag, _ := baggage.New(idMember, tierMember)
        ctx := baggage.ContextWithBaggage(r.Context(), bag)

        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

The baggage mechanism ensures that when your API server makes HTTP calls to downstream services, the tenant ID automatically propagates through the W3C Baggage header; downstream services read it and apply the same tenant tagging to their own telemetry. This creates an unbroken chain of tenant context from the initial API request through background jobs and webhook deliveries.
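To make the wire format concrete, here is a stdlib-only sketch of what that header carries. The helpers `buildBaggageHeader` and `tenantFromBaggage` are illustrative; in practice the OpenTelemetry SDK's propagators encode and parse this header for you:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// buildBaggageHeader renders tenant context as a W3C Baggage header value
// (comma-separated key=value members, values percent-encoded).
func buildBaggageHeader(members map[string]string) string {
	parts := make([]string, 0, len(members))
	for k, v := range members {
		parts = append(parts, k+"="+url.QueryEscape(v))
	}
	return strings.Join(parts, ",")
}

// tenantFromBaggage extracts one member's value from a Baggage header.
func tenantFromBaggage(header, key string) string {
	for _, member := range strings.Split(header, ",") {
		kv := strings.SplitN(strings.TrimSpace(member), "=", 2)
		if len(kv) == 2 && kv[0] == key {
			v, _ := url.QueryUnescape(kv[1])
			return v
		}
	}
	return ""
}

func main() {
	h := buildBaggageHeader(map[string]string{"tenant.id": "customer-123"})
	fmt.Println(h)                                 // tenant.id=customer-123
	fmt.Println(tenantFromBaggage(h, "tenant.id")) // customer-123
}
```

Because the value travels as a plain HTTP header, any service in the chain, including ones outside your tracing setup, can recover the tenant ID.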

Data Segregation at Collection

Once telemetry carries tenant identity, the routing processor becomes the core piece: it evaluates routing conditions against resource attributes and sends data to matching sub-pipelines. The OpenTelemetry Collector routing processor handles this elegantly:

processors:
  routing:
    # tenant.id must exist as a resource attribute at this point; if your
    # services set it as a span attribute, promote it upstream (for
    # example with the groupbyattrs processor).
    from_attribute: tenant.id
    attribute_source: resource
    default_exporters: [otlp/default]
    table:
      # Table matching is exact (no wildcard support); tenants without
      # an entry fall through to default_exporters.
      - value: "enterprise-customer-1"
        exporters: [otlp/enterprise]
      - value: "enterprise-customer-2"
        exporters: [otlp/enterprise]
      - value: "startup-customer-1"
        exporters: [otlp/standard]

exporters:
  otlp/enterprise:
    endpoint: enterprise-observability.internal:4317
    headers:
      X-Tenant-Isolation: "strict"
  otlp/standard:
    endpoint: shared-observability.internal:4317
    headers:
      X-Tenant-Pool: "standard"

Notice how enterprise customers are routed to dedicated backends while smaller customers share infrastructure. You may need to route high-value tenants to dedicated observability backends for longer retention, and compliance requirements may demand that certain tenants' data stay in specific regions.

The routing processor also enables sophisticated tenant-aware sampling. Enterprise customers paying for premium observability get higher sampling rates, while free-tier tenants can be sampled more aggressively: for example, enterprise tenants keep 50% of traces versus 10% for standard tenants. This prevents high-volume customers from overwhelming your observability budget while preserving debugging capability where it matters most.
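Assuming the routing table fans out into per-tier sub-pipelines, that split could be expressed with one probabilistic_sampler processor per tier (percentages and pipeline names are illustrative):

```yaml
processors:
  probabilistic_sampler/enterprise:
    sampling_percentage: 50
  probabilistic_sampler/standard:
    sampling_percentage: 10

service:
  pipelines:
    traces/enterprise:
      receivers: [otlp]
      processors: [probabilistic_sampler/enterprise, batch]
      exporters: [otlp/enterprise]
    traces/standard:
      receivers: [otlp]
      processors: [probabilistic_sampler/standard, batch]
      exporters: [otlp/standard]
```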

Storage and Backend Isolation

For storage backends, you have several isolation strategies depending on tenant requirements and scale. Note that a single Prometheus server applies one global retention period to everything it stores, so per-tenant retention calls for a multi-tenant TSDB such as Cortex or Grafana Mimir, which keep each tenant's blocks separate and accept per-tenant limits. In Mimir, for example, retention can be set per tenant through runtime overrides (tenant IDs and periods here are illustrative):

# Mimir runtime overrides: per-tenant block retention
overrides:
  enterprise-customer-1:
    compactor_blocks_retention_period: 180d
  enterprise-customer-2:
    compactor_blocks_retention_period: 180d
  startup-customer-1:
    compactor_blocks_retention_period: 30d

For distributed tracing, Grafana Tempo's metrics-generator supports multi-tenancy, so each tenant's traces produce a separate stream of span metrics. This allows per-tenant SLO monitoring without cross-contamination.

Larger platforms often run completely separate Grafana instances per tenant tier. Enterprise customers get dedicated instances with custom dashboards, while standard customers share a multi-tenant instance with namespace-based isolation. In account-per-tenant models, tenant-specific cost reporting becomes trivial: billing aligns directly with usage, and monthly per-tenant usage reports can be generated automatically.

Cost Attribution and Metering

Cost attribution requires tracking telemetry volume and processing costs at the tenant level. This is not a technical nicety but a business requirement for SaaS operators: without accurate attribution you cannot bill for observability costs correctly.

The collector's internal telemetry tracks data volume per pipeline component rather than per tenant, so the tenant dimension comes from pipeline topology: once the routing processor gives each tier (or each high-value tenant) a dedicated exporter, volumes fall out of the component labels:

# Collector metrics, segmented by per-route exporter
otelcol_exporter_sent_spans{exporter="otlp/enterprise"}
otelcol_receiver_accepted_metric_points{receiver="otlp"}
otelcol_processor_batch_batch_send_size_bytes_sum{processor="batch"}

For granularity finer than your routing table provides, you need a component that counts by the tenant.id attribute itself, such as the count connector.

You can aggregate these into billing-ready metrics:

-- Daily telemetry volume per tenant for billing
SELECT 
    tenant_id,
    date_trunc('day', timestamp) as billing_day,
    sum(span_count) as daily_spans,
    sum(metric_points) as daily_metrics,
    sum(log_lines) as daily_logs,
    sum(bytes_processed) as daily_bytes
FROM observability_usage 
WHERE timestamp >= current_date - interval '30 days'
GROUP BY tenant_id, billing_day
ORDER BY daily_bytes DESC;
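Feeding those aggregates into a rating step is then mechanical. A minimal sketch, with entirely hypothetical unit prices:

```go
package main

import "fmt"

// Hypothetical unit prices; real rates come from your pricing model.
const (
	pricePerMillionSpans  = 2.50 // USD per 1M spans
	pricePerMillionPoints = 0.30 // USD per 1M metric points
	pricePerGiB           = 0.10 // USD per GiB processed
)

// telemetryCost turns one tenant-day of volumes into a billable amount.
func telemetryCost(spans, metricPoints, bytes int64) float64 {
	return float64(spans)/1e6*pricePerMillionSpans +
		float64(metricPoints)/1e6*pricePerMillionPoints +
		float64(bytes)/float64(1<<30)*pricePerGiB
}

func main() {
	// 2M spans, 10M metric points, 5 GiB for one tenant-day.
	fmt.Printf("%.2f\n", telemetryCost(2_000_000, 10_000_000, 5*(1<<30)))
}
```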

For multi-tenant SaaS applications, knowing how much each tenant actually costs is critical, as without this data you're flying blind on pricing and can't identify unprofitable tenants, with GCP providing resource labels and BigQuery billing export that let you break down cloud bills by tenant with surprising accuracy.

Performance and Sampling at Scale

Multi-tenant observability generates massive data volumes. A carrier integration platform handling 100,000+ shipments daily across 500 customers can produce millions of spans. Smart sampling prevents costs from spiraling while preserving debugging capability.

Implement head sampling at the application level based on tenant tier:

// Tenant-aware sampling in the tracer; ParentBased and
// TraceIDRatioBased live in go.opentelemetry.io/otel/sdk/trace
import sdktrace "go.opentelemetry.io/otel/sdk/trace"

func createTenantSampler(tenantID string) sdktrace.Sampler {
    switch getTenantTier(tenantID) {
    case "enterprise":
        return sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10)) // 10%
    case "professional":
        return sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.05)) // 5%
    default:
        return sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.01)) // 1%
    }
}

Combine this with tail sampling at the collector level that always keeps error traces and high-latency requests regardless of tenant. This ensures critical debugging information survives even aggressive sampling.
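At the collector, that policy can be expressed with the tail_sampling processor; a sketch with illustrative thresholds:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always keep traces containing errors, regardless of tenant
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep slow requests
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000
```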

For high-cardinality metrics like per-shipment tracking numbers, implement tenant-aware cardinality limits. Enterprise customers might get 10,000 unique labels while standard customers get 1,000. This prevents any single tenant from exploding your metrics storage costs.
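One way to enforce such a cap in-process, before metrics are recorded, is a small limiter that collapses excess label values into an overflow bucket. The limits and the __overflow__ sentinel below are illustrative:

```go
package main

import "fmt"

// labelLimiter caps the number of distinct label values recorded per
// tenant, collapsing everything past the limit into one overflow bucket.
type labelLimiter struct {
	limits map[string]int             // tenant tier -> max distinct values
	seen   map[string]map[string]bool // tenant ID -> values already admitted
}

func newLabelLimiter() *labelLimiter {
	return &labelLimiter{
		limits: map[string]int{"enterprise": 10000, "standard": 1000},
		seen:   make(map[string]map[string]bool),
	}
}

// admit returns the label value to record: the original value while the
// tenant is under its limit, otherwise a shared overflow value.
func (l *labelLimiter) admit(tenantID, tier, value string) string {
	values, ok := l.seen[tenantID]
	if !ok {
		values = make(map[string]bool)
		l.seen[tenantID] = values
	}
	if values[value] {
		return value // already counted; adds no new cardinality
	}
	if len(values) >= l.limits[tier] {
		return "__overflow__"
	}
	values[value] = true
	return value
}

func main() {
	lim := newLabelLimiter()
	lim.limits["standard"] = 2 // tiny limit to demonstrate overflow
	fmt.Println(lim.admit("customer-9", "standard", "SHIP-001")) // SHIP-001
	fmt.Println(lim.admit("customer-9", "standard", "SHIP-002")) // SHIP-002
	fmt.Println(lim.admit("customer-9", "standard", "SHIP-003")) // __overflow__
}
```

Because the limiter is keyed by tenant ID, one tenant exhausting its budget never affects another tenant's labels.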

Production Implementation Checklist

When deploying multi-tenant observability architecture for carrier integration middleware, validate these patterns:

Tenant Context Validation: Implement synthetic monitoring that injects known tenant contexts and verifies they appear correctly in traces, metrics, and logs. Alert if tenant context is missing or incorrect.

Data Isolation Testing: Run integration tests that create telemetry for multiple tenants simultaneously and verify no cross-contamination. Query backends to ensure tenant A's data never appears in tenant B's results.

Cost Attribution Accuracy: Deploy cost tracking dashboards that reconcile collector-reported volumes with backend billing. Track per-tenant data volumes daily and alert on unexpected spikes.

Sampling Effectiveness: Monitor sampling ratios per tenant and ensure error traces always survive. Implement circuit breakers that temporarily increase sampling when tenant-specific error rates spike.
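The circuit-breaker idea can be sketched as a simple override that boosts a tenant's sampling rate while its error rate is elevated (rates and threshold are illustrative):

```go
package main

import "fmt"

// samplingOverride raises a tenant's sampling rate while its error rate
// is elevated, then falls back to the tier default.
type samplingOverride struct {
	baseRate       float64
	boostedRate    float64
	errorThreshold float64 // error ratio that trips the breaker
}

// rateFor picks the sampling rate given a recent error/request window.
func (s samplingOverride) rateFor(errors, requests int64) float64 {
	if requests > 0 && float64(errors)/float64(requests) > s.errorThreshold {
		return s.boostedRate
	}
	return s.baseRate
}

func main() {
	o := samplingOverride{baseRate: 0.01, boostedRate: 0.5, errorThreshold: 0.05}
	fmt.Println(o.rateFor(2, 100))  // below threshold: base rate 0.01
	fmt.Println(o.rateFor(10, 100)) // 10% errors: boosted rate 0.5
}
```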

Backend Health Monitoring: Track collector internal metrics to ensure the routing pipeline itself remains reliable, as a multi-tenant routing pipeline is only as reliable as your ability to observe it. Monitor routing processor throughput, dropped spans, and backend connectivity per tenant route.

Major carrier integration platforms like nShift and MercuryGate handle these challenges through different approaches. Some use completely separate observability stacks per customer tier, while others rely on sophisticated collector routing. Cargoson's approach emphasizes OpenTelemetry-native tenant isolation with automatic cost attribution, allowing precise billing based on actual observability resource consumption.

The investment in proper multi-tenant observability architecture pays dividends beyond compliance and billing. When a carrier API starts returning errors for specific customers, you can immediately scope the problem to affected tenants rather than investigating across your entire platform. When capacity planning for peak shipping seasons, per-tenant metrics show exactly which customers drive resource consumption. And when SLAs are questioned, tenant-specific observability data provides concrete evidence of performance delivered.

Read more

Unified API Gateway Patterns for Multi-Carrier Integration: Abstracting Provider Complexity Without Breaking Tenant Isolation

By Koen M. Vermeulen