Monitoring

Monitoring, metrics, and health checks for RecordPlatform.

Health Endpoints

| Endpoint | Description |
| --- | --- |
| /actuator/health | Overall health status |
| /actuator/health/liveness | Kubernetes liveness probe |
| /actuator/health/readiness | Kubernetes readiness probe |
| /actuator/prometheus | Prometheus metrics |
| /actuator/circuitbreakers | Circuit breaker status |

Health Check Components

The /actuator/health endpoint includes:

| Component | Checks |
| --- | --- |
| db | MySQL connectivity |
| redis | Redis connectivity |
| rabbit | RabbitMQ connectivity |
| s3Storage | S3 node availability |
| saga | Saga transaction health |
| outbox | Outbox event health |
| encryption | Encryption strategy status |

Sample Response

```json
{
  "status": "UP",
  "components": {
    "db": { "status": "UP" },
    "redis": { "status": "UP" },
    "s3Storage": {
      "status": "UP",
      "details": {
        "healthyNodes": 3,
        "totalNodes": 3
      }
    },
    "encryption": {
      "status": "UP",
      "details": {
        "algorithm": "ChaCha20-Poly1305",
        "likelyHasAesNi": true
      }
    }
  }
}
```
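
A monitoring script can consume this response by listing every component whose status is not UP. The following is a minimal Python sketch, not part of RecordPlatform: the function name `unhealthy_components` is hypothetical, and the payload is an illustrative one that follows the shape of the sample response.

```python
import json


def unhealthy_components(health_json: str) -> dict:
    """Return {component: status} for every component whose status is not UP."""
    body = json.loads(health_json)
    components = body.get("components", {})
    return {
        name: info.get("status", "UNKNOWN")
        for name, info in components.items()
        if info.get("status") != "UP"
    }


# Illustrative payload only; real responses come from /actuator/health.
sample = json.dumps({
    "status": "DOWN",
    "components": {
        "db": {"status": "UP"},
        "redis": {"status": "UP"},
        "outbox": {"status": "DOWN"},
    },
})

print(unhealthy_components(sample))  # prints {'outbox': 'DOWN'}
```

An empty result means every reported component is UP.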

Key Metrics

Saga Metrics

| Metric | Type | Description |
| --- | --- | --- |
| saga_total | Counter | Total Saga count by status |
| saga_duration | Timer | Execution/compensation duration |
| saga_running | Gauge | Currently running Sagas |
| saga_pending_compensation | Gauge | Sagas awaiting compensation |

Outbox Metrics

| Metric | Type | Description |
| --- | --- | --- |
| outbox_events_total | Counter | Events by status (published/failed) |
| outbox_publish_latency | Timer | Event publish latency |
| outbox_pending | Gauge | Pending events |
| outbox_exhausted | Gauge | Events exceeding max retries |

Storage Metrics

| Metric | Type | Description |
| --- | --- | --- |
| s3_node_online_status | Gauge | Node online status (0/1) |
| s3_node_load_score | Gauge | Node load score |
| s3_node_operations_total | Counter | Operations per node |
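
To spot-check these metrics without a Prometheus server, the text served at /actuator/prometheus can be parsed directly. The sketch below is a deliberately small Python parser for the Prometheus text format. It skips lines with timestamps and does not handle escaped quotes in label values. The `node` label name and the sample values are assumptions for illustration; only the metric names come from the tables above.

```python
import re

# name, optional {labels}, then a single numeric value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> list:
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = LINE_RE.match(line)
        if not match:  # unsupported line shape, e.g. one with a timestamp
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def offline_nodes(samples: list) -> list:
    """Return the node labels whose s3_node_online_status is 0."""
    return sorted(
        labels.get("node", "?")  # "node" is an assumed label name
        for name, labels, value in samples
        if name == "s3_node_online_status" and value == 0
    )


# Hypothetical scrape output for illustration.
SAMPLE = """\
# TYPE s3_node_online_status gauge
s3_node_online_status{node="node-a"} 1.0
s3_node_online_status{node="node-b"} 0.0
outbox_pending 12.0
"""

print(offline_nodes(parse_metrics(SAMPLE)))  # prints ['node-b']
```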

Health Thresholds

Configure the thresholds at which the outbox and saga health checks change status:

```yaml
# Outbox thresholds
outbox:
  health:
    pending-threshold: 500    # >500 pending → DEGRADED
    failed-threshold: 20      # >20 failed → DOWN

# Saga thresholds
saga:
  health:
    running-threshold: 100    # >100 running → DEGRADED
    failed-threshold: 10      # >10 failed → DOWN
    pending-compensation-threshold: 50  # >50 pending → DEGRADED
```
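
The comments in this configuration imply a simple mapping from counts to a health status. The Python sketch below models that mapping under two assumptions that the configuration does not state: comparisons are strictly greater-than, as the comments suggest, and DOWN takes precedence over DEGRADED when both conditions hold. The class and function names are hypothetical and do not reflect the actual health indicator code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutboxThresholds:
    pending: int = 500  # outbox.health.pending-threshold
    failed: int = 20    # outbox.health.failed-threshold


@dataclass(frozen=True)
class SagaThresholds:
    running: int = 100              # saga.health.running-threshold
    failed: int = 10                # saga.health.failed-threshold
    pending_compensation: int = 50  # saga.health.pending-compensation-threshold


def outbox_status(pending: int, failed: int,
                  t: OutboxThresholds = OutboxThresholds()) -> str:
    # Assumption: DOWN wins over DEGRADED when both thresholds are exceeded.
    if failed > t.failed:
        return "DOWN"
    if pending > t.pending:
        return "DEGRADED"
    return "UP"


def saga_status(running: int, failed: int, pending_compensation: int,
                t: SagaThresholds = SagaThresholds()) -> str:
    if failed > t.failed:
        return "DOWN"
    if running > t.running or pending_compensation > t.pending_compensation:
        return "DEGRADED"
    return "UP"


print(outbox_status(pending=500, failed=0))   # prints UP (500 is not > 500)
print(outbox_status(pending=501, failed=0))   # prints DEGRADED
print(saga_status(running=10, failed=11, pending_compensation=0))  # prints DOWN
```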

Prometheus Configuration

Scrape Config

```yaml
scrape_configs:
  - job_name: 'recordplatform-backend'
    metrics_path: '/record-platform/actuator/prometheus'
    static_configs:
      - targets: ['backend:8000']

  - job_name: 'recordplatform-storage'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['storage:8092']

  - job_name: 'recordplatform-fisco'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['fisco:8091']
```

Alert Rules

```yaml
groups:
  - name: recordplatform
    rules:
      - alert: SagaFailureHigh
        # saga_total is a counter, so alert on recent growth rather than the lifetime total
        expr: increase(saga_total{status="failed"}[5m]) > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High Saga failure rate"

      - alert: OutboxBacklog
        expr: outbox_pending > 500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Outbox event backlog"

      - alert: S3NodeDown
        expr: s3_node_online_status == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "S3 storage node offline"
```
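
The `for` clause in each rule means the alert does not fire on the first breach: it stays pending until the expression has been true continuously for the given duration. The Python sketch below simulates that behavior for the OutboxBacklog rule (threshold 500, `for: 10m`). The sample values and the 5-minute evaluation interval are invented for illustration, and the function name `alert_states` is hypothetical.

```python
def alert_states(samples, threshold, for_seconds):
    """Simulate alert states for a 'value > threshold' rule with a 'for' duration.

    samples: list of (timestamp_seconds, value), one per rule evaluation.
    Returns one state per sample: 'inactive', 'pending', or 'firing'.
    """
    states = []
    active_since = None  # time the condition first became true
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


# outbox_pending evaluated every 5 minutes (hypothetical values)
outbox_pending = [(0, 400), (300, 600), (600, 650), (900, 700), (1200, 300)]

print(alert_states(outbox_pending, threshold=500, for_seconds=600))
# prints ['inactive', 'pending', 'pending', 'firing', 'inactive']
```

A short spike that clears before the duration elapses therefore never leaves the pending state.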

Grafana Dashboards

  1. System Overview
    • Request rate and latency
    • Error rate
    • Active connections

  2. Saga Status
    • Running vs completed vs failed
    • Compensation queue depth
    • Average duration

  3. Storage Health
    • Node status per domain
    • Replication lag
    • Disk usage

  4. Blockchain
    • Transaction rate
    • Circuit breaker status
    • Retry counts

SkyWalking Integration

Configuration

The deployment scripts automatically configure the SkyWalking agent:

```bash
SKYWALKING_OPTS="-javaagent:/path/to/skywalking-agent.jar \
  -Dskywalking.agent.service_name=platform-backend \
  -Dskywalking.collector.backend_service=skywalking-oap:11800"
```

Distributed Tracing

SkyWalking provides:

  • Request tracing across services
  • Slow query detection
  • Service dependency mapping
  • Error tracking

Log Aggregation

ELK Stack Configuration

```conf
# Logstash pipeline
input {
  beats {
    port => 5044
  }
}

filter {
  if [fields][service] == "recordplatform" {
    grok {
      match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "recordplatform-%{+YYYY.MM.dd}"
  }
}
```
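
To see which fields the grok pattern extracts, it helps to try an equivalent expression on a few log lines. The Python sketch below uses a simplified regular expression that only approximates `TIMESTAMP_ISO8601` and `LOGLEVEL` (the real grok patterns accept more variants), and the log lines are invented for illustration.

```python
import re

# Simplified stand-in for:
#   %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}
LOG_LINE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"
    r"(?:[.,]\d+)?(?:Z|[+-]\d{2}:?\d{2})?)"
    r" (?P<level>TRACE|DEBUG|INFO|WARN|ERROR|FATAL)"
    r" (?P<msg>.*)"
)


def parse_line(line: str):
    """Return the extracted fields, or None when the line does not match."""
    match = LOG_LINE.search(line)
    return match.groupdict() if match else None


print(parse_line("2024-05-01 12:00:00.123 ERROR Saga 42 failed"))
# prints {'timestamp': '2024-05-01 12:00:00.123', 'level': 'ERROR', 'msg': 'Saga 42 failed'}

# Two spaces before the level: the single literal space in the pattern does not match.
print(parse_line("2024-05-01 12:00:00.123  INFO started"))  # prints None
```

The second case is worth checking against the application's actual log layout: if the level is padded with extra spaces, the pattern as written would not match those lines, and Logstash would tag them with `_grokparsefailure`.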

Released under the Apache 2.0 License.