Distributed Storage

RecordPlatform uses a fault-domain-aware distributed storage architecture for high availability.

Fault Domain Architecture

Domain Types

The system supports N active domains + 1 optional standby domain with flexible configuration:

Domain Type | Description | Accepts Writes | Purpose
Active | Configured via active-domains; supports 1 to N domains | Yes | Data write targets
Standby | Configured via standby-domain (optional) | No | Hot standby pool

Deployment Modes

Mode | Active Domains | Replication Factor | Use Case
Single-domain | 1 | 1 | Development
Dual-domain | 2 | 2 | Standard production
Multi-domain | 3+ | Configurable | High-availability production

How It Works

  • Each file chunk is written to N active domains, one copy per domain (see the sketch below)
  • The replication factor defaults to the active domain count and is adjustable via storage.replication.factor
  • N-1 domain tolerance: because every chunk has a copy in each domain, the system remains available even if up to N-1 of the N active domains fail
  • Standby nodes are automatically promoted when an active domain is degraded
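
As a rough illustration of the write path, the sketch below fans one copy of a chunk out to every active domain. The class and method names (ReplicatedChunkWriter, ChunkStore) are hypothetical, not the actual RecordPlatform API, and it assumes replication.factor equals the active domain count.

java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one copy of each chunk per active domain.
public class ReplicatedChunkWriter {

    private final List<String> activeDomains;              // from storage.active-domains
    private final Map<String, ChunkStore> storesByDomain;  // one S3-backed store per domain

    public ReplicatedChunkWriter(List<String> activeDomains,
                                 Map<String, ChunkStore> storesByDomain) {
        this.activeDomains = activeDomains;
        this.storesByDomain = storesByDomain;
    }

    // Writes the chunk to every active domain (replication factor = domain count).
    public void write(String chunkId, byte[] data) {
        for (String domain : activeDomains) {
            storesByDomain.get(domain).put(chunkId, data);  // one copy per domain
        }
    }

    // Minimal abstraction over a per-domain object store.
    public interface ChunkStore {
        void put(String chunkId, byte[] data);
    }
}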

Consistent Hashing

Virtual Node Mechanism

Configuration

Property | Default | Description
virtualNodesPerNode | 150 | Virtual nodes per physical node
replication.factor | Active domain count | Total replicas (capped at the active domain count)
weight | 100 | Node weight (affects vnode count)

Benefits

  • Even distribution: Virtual nodes spread data uniformly across physical nodes
  • Minimal migration: Adding or removing a node only remaps ~1/N of the keys
  • Weighted allocation: Higher weight = more virtual nodes = more data
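
The listing below is a minimal, self-contained sketch of the virtual-node ring under these defaults. It is illustrative only (the class name and hash function are not the actual implementation), but it shows why a node with weight 100 gets 150 positions on the ring and why removing a node only remaps the keys it owned.

java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Illustrative consistent-hash ring with weighted virtual nodes.
public class HashRing {

    private static final int VIRTUAL_NODES_PER_NODE = 150;   // virtualNodesPerNode default
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(String nodeName, int weight) {
        int vnodes = VIRTUAL_NODES_PER_NODE * weight / 100;   // weight 100 => 150 vnodes
        for (int i = 0; i < vnodes; i++) {
            ring.put(hash(nodeName + "#" + i), nodeName);
        }
    }

    // Returns the node owning the key: first vnode clockwise from the key's hash.
    public String nodeFor(String key) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return entry != null ? entry.getValue() : ring.firstEntry().getValue();
    }

    private long hash(String s) {
        // Any well-distributed hash works; FNV-1a keeps the sketch dependency-free.
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= b & 0xff;
            h *= 0x100000001b3L;
        }
        return h;
    }
}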

Automatic Failover

Failover Flow

Event Types

Event | Trigger | Action
NODE_ONLINE | Health check passes | Update status
NODE_OFFLINE | Health check fails | Trigger failover
NODE_DOMAIN_CHANGED | Standby promotion | Data rebalancing
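
The listing that follows is a hedged sketch of how a scheduled health check could emit these events. StorageNode, NodeHealthEvent, probe, and the publisher wiring are hypothetical; only the event names and the 30-second interval come from this page.

java
// Hypothetical sketch of the failover trigger path: a periodic health check
// publishes the event types from the table above.
@Scheduled(fixedDelay = 30_000)   // node health check interval: 30 seconds
public void checkNodeHealth() {
    for (StorageNode node : topology.getNodes()) {
        boolean healthy = probe(node);   // e.g. scrape the node's metricsPath
        if (healthy && !node.isOnline()) {
            events.publishEvent(new NodeHealthEvent(node.getName(), EventType.NODE_ONLINE));
        } else if (!healthy && node.isOnline()) {
            // NODE_OFFLINE triggers failover and, later, MIGRATE_FROM_FAILED rebalancing
            events.publishEvent(new NodeHealthEvent(node.getName(), EventType.NODE_OFFLINE));
        }
    }
}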

Data Rebalancing

Rebalance Types

Type | Trigger | Operation
MIGRATE_FROM_FAILED | Node goes offline | Copy from other active domains to new target
COPY_TO_PROMOTED | Standby promoted | Copy relevant data to new node
REBALANCE_TO_NEW | New node added | Redistribute data

Rebalance Process

java
// Triggered by NodeTopologyChangeEvent when the cluster topology changes
@EventListener
public void onNodeTopologyChange(NodeTopologyChangeEvent event) {
    switch (event.getChangeType()) {
        case NODE_OFFLINE:
            // MIGRATE_FROM_FAILED: copy affected objects from the remaining active domains
            migrateFromFailedNode(event.getNodeName());
            break;
        case NODE_DOMAIN_CHANGED:
            // COPY_TO_PROMOTED: copy relevant data to the newly promoted standby node
            copyToPromotedNode(event.getNodeName());
            break;
    }
}

Rate Limiting

  • Default: 10 objects/second
  • Prevents I/O saturation during rebalancing
  • Configurable via Nacos
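
One way to enforce this cap is a token-bucket limiter around the copy loop. The sketch below uses Guava's RateLimiter as an illustration; the copier class and helper names are hypothetical, and the actual implementation may throttle differently.

java
import java.util.List;
import com.google.common.util.concurrent.RateLimiter;

// Hypothetical sketch: throttle rebalance copies to 10 objects/second.
public class RebalanceCopier {

    // Matches storage.rebalance.rate-limit-per-second: 10
    private final RateLimiter limiter = RateLimiter.create(10.0);

    public void copyAll(List<String> objectKeys) {
        for (String key : objectKeys) {
            limiter.acquire();   // blocks until a permit is available
            copyObject(key);     // hypothetical: copy one object to the target node
        }
    }

    private void copyObject(String key) {
        // S3 copy of a single object (omitted)
    }
}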

Rebalance Configuration

yaml
storage:
  rebalance:
    enabled: true # Enable automatic rebalancing
    rate-limit-per-second: 10 # Max objects copied per second (rate limiting)
    cleanup-source: false # Delete source data after rebalance (default: false)

State Persistence

Rebalance status is persisted to Redis with a 24-hour TTL to ensure:

  • Status synchronization across multiple instances
  • State recovery after service restart
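
A sketch of that persistence step is shown below, assuming Spring Data Redis; the key prefix, payload format, and class name are illustrative, and only the 24-hour TTL comes from this page.

java
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

// Hypothetical sketch: rebalance status shared via Redis with a 24-hour TTL.
@Component
public class RebalanceStateStore {

    private final StringRedisTemplate redis;

    public RebalanceStateStore(StringRedisTemplate redis) {
        this.redis = redis;
    }

    public void saveStatus(String taskId, String statusJson) {
        // TTL keeps stale entries from accumulating after completed rebalances
        redis.opsForValue().set("storage:rebalance:" + taskId, statusJson, Duration.ofHours(24));
    }

    public String loadStatus(String taskId) {
        // Readable by any instance and survives service restarts
        return redis.opsForValue().get("storage:rebalance:" + taskId);
    }
}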

Replica Consistency Repair

A scheduled job periodically scans all active domains to detect and repair cross-domain replica inconsistencies.

Repair Process

  1. Acquire distributed lock (prevents concurrent execution)
  2. Iterate all active domains, collect object lists
  3. Compare objects between domain pairs, identify missing replicas
  4. Copy from healthy replicas to missing domains
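
The sketch below walks the same four steps in a single pass. The lock helper, listObjects, and copyObject are hypothetical, and batching (batch-size: 100) is omitted for brevity; only the overall flow and the 600-second lock timeout come from this page.

java
// Hypothetical sketch of one consistency-repair pass across all active domains.
public void repairReplicas(List<String> activeDomains) {
    // Step 1: distributed lock so only one instance runs the repair
    if (!repairLock.tryLock(Duration.ofSeconds(600))) {
        return;
    }
    try {
        // Step 2: collect the object keys present in each active domain
        Map<String, Set<String>> objectsByDomain = new HashMap<>();
        for (String domain : activeDomains) {
            objectsByDomain.put(domain, listObjects(domain));
        }
        // Steps 3-4: any object present in one domain but missing in another is re-copied
        for (String source : activeDomains) {
            for (String target : activeDomains) {
                if (source.equals(target)) continue;
                for (String key : objectsByDomain.get(source)) {
                    if (!objectsByDomain.get(target).contains(key)) {
                        copyObject(source, target, key);   // copy from the healthy replica
                    }
                }
            }
        }
    } finally {
        repairLock.unlock();
    }
}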

Consistency Configuration

yaml
storage:
  consistency:
    repair:
      enabled: true # Enable scheduled repair
      cron: "0 */15 * * * ?" # Cron expression (default: every 15 minutes)
      batch-size: 100 # Objects per batch
      lock-timeout-seconds: 600 # Distributed lock timeout (seconds)

Node Configuration

Required Configuration

yaml
storage:
  # Required: List of active domains (in priority order)
  active-domains:
    - domain-a
    - domain-b

  # Optional: External endpoint (v3.2.0)
  # Used to replace internal endpoints when generating presigned URLs
  # Solves cross-network access issues (e.g., VPN)
  external-endpoint: http://10.1.0.2:9000

  # Optional: Standby domain (for failover)
  standby-domain: standby

  # Replication policy (v3.1.0)
  replication:
    factor: 2                     # Replica count, defaults to active domain count
    quorum: auto                  # Quorum strategy: auto|majority|all|number

  # Degraded write configuration (v3.1.0)
  degraded-write:
    enabled: true                 # Allow degraded writes
    min-replicas: 1               # Minimum replicas in degraded mode
    track-for-sync: true          # Track degraded writes for later sync

NodeConfig Properties

yaml
storage:
  nodes:
    - name: node-a1 # Unique name (also bucket name)
      endpoint: http://s3-a1:9000
      accessKey: ${S3_ACCESS_KEY}
      secretKey: ${S3_SECRET_KEY}
      faultDomain: domain-a # Must match active-domains or standby-domain
      weight: 100 # Affects data distribution
      enabled: true # Include in hash ring
      metricsPath: /minio/v2/metrics/node # Health metrics

Domain Configuration (Optional)

yaml
storage:
  domains:
    - name: domain-a
      minNodes: 1 # Min healthy nodes before promotion
      replicaCount: 1 # Replicas per chunk in this domain
      acceptsWrites: true
    - name: domain-b
      minNodes: 1
      replicaCount: 1
      acceptsWrites: true
    - name: standby
      minNodes: 0
      replicaCount: 0
      acceptsWrites: false

Complete Configuration Examples

Development Environment (Single-domain Mode)

yaml
storage:
  active-domains:
    - local
  nodes:
    - name: local-minio
      endpoint: http://localhost:9000
      accessKey: minioadmin
      secretKey: minioadmin
      faultDomain: local

Production Environment (Three Domains + Standby)

yaml
storage:
  active-domains:
    - dc-us-east
    - dc-us-west
    - dc-eu-west
  standby-domain: standby
  replication:
    factor: 3

  nodes:
    - name: us-east-node1
      endpoint: http://s3-use:9000
      faultDomain: dc-us-east
    - name: us-west-node1
      endpoint: http://s3-usw:9000
      faultDomain: dc-us-west
    - name: eu-west-node1
      endpoint: http://s3-euw:9000
      faultDomain: dc-eu-west
    - name: standby-node1
      endpoint: http://s3-standby:9000
      faultDomain: standby

Load-Aware Read Selection

When reading, the system selects the optimal node based on load score:

txt
Load Score = (disk_usage × 0.4) + (inflight_requests × 0.35) + (waiting_requests × 0.25)

Lower score = preferred for reads.
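
The formula translates directly into code. In the sketch below, only the weights and the "lowest score wins" rule come from this page; the NodeStats type and selector class are illustrative.

java
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of load-aware read selection.
public class LoadAwareSelector {

    // Load Score = (disk_usage * 0.4) + (inflight_requests * 0.35) + (waiting_requests * 0.25)
    static double loadScore(double diskUsage, double inflightRequests, double waitingRequests) {
        return diskUsage * 0.4 + inflightRequests * 0.35 + waitingRequests * 0.25;
    }

    // Lower score = preferred for reads, so pick the replica with the minimum score.
    static NodeStats pickReadNode(List<NodeStats> replicas) {
        return replicas.stream()
                .min(Comparator.comparingDouble(n ->
                        loadScore(n.diskUsage(), n.inflightRequests(), n.waitingRequests())))
                .orElseThrow();
    }

    // Hypothetical per-node stats snapshot
    record NodeStats(String name, double diskUsage, double inflightRequests, double waitingRequests) {}
}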

Monitoring

Key Metrics

Metric | Description
s3_node_online_status | Node online/offline status
s3_node_load_score | Calculated load score
s3_node_operations_total | Operation counter

Health Check Intervals

  • Node health check: 30 seconds
  • Standby promotion check: 30 seconds
  • Metrics cache TTL: 60 seconds
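
If the metrics are exported via Micrometer (an assumption, not stated on this page), s3_node_load_score could be registered roughly as below; node and meterRegistry are placeholders, and only the metric name and its meaning come from the table above.

java
// Assumption: Micrometer-based export; StorageNode::getLoadScore is a placeholder accessor.
Gauge.builder("s3_node_load_score", node, StorageNode::getLoadScore)
     .tag("node", node.getName())
     .description("Calculated load score (lower = preferred for reads)")
     .register(meterRegistry);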
