Persistence Thread Design

Overview

The persistence thread receives commit batches from the main processing pipeline and uploads them to S3. It uses a single-threaded design with connection pooling and batching to amortize per-request overhead while keeping multiple uploads in flight.

Architecture

  • Input: Commits arrive via the ThreadPipeline interface from upstream processing
  • Output: Batched commits uploaded to the S3 persistence backend
  • Transport: Single-threaded TCP client with connection pooling
  • Protocol: Higher layers handle HTTP, authentication, and S3-specific details

Batching Strategy

The persistence thread collects commits into batches using two trigger conditions:

  1. Time Trigger: batch_timeout_ms elapsed since batch collection started
  2. Size Trigger: batch_size_threshold bytes of commit data collected (the final commit may push the batch past the threshold)

Flow Control: When max_in_flight_requests is reached, block until responses are received.
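
The two triggers reduce to a small predicate evaluated during collection. A minimal sketch, using a hypothetical BatchState type (names are illustrative, not the actual implementation):

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical batch bookkeeping; field names are illustrative.
struct BatchState {
  std::size_t bytes_collected = 0;                   // commit payload bytes
  std::chrono::steady_clock::time_point started_at;  // when collection began
};

// A batch is flushed when either trigger fires. The size trigger may be
// exceeded by the final commit appended to the batch.
inline bool should_flush(const BatchState& batch,
                         std::chrono::milliseconds batch_timeout,
                         std::size_t batch_size_threshold,
                         std::chrono::steady_clock::time_point now) {
  const bool time_trigger = (now - batch.started_at) >= batch_timeout;
  const bool size_trigger = batch.bytes_collected >= batch_size_threshold;
  return time_trigger || size_trigger;
}
```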

Main Processing Loop

1. Batch Collection

No In-Flight Requests:

  • Use blocking acquire to get first commit batch
  • Process immediately (no batching delay)

With In-Flight Requests:

  • Check flow control: if at max_in_flight_requests, block for responses
  • Collect commits using non-blocking acquire until trigger condition:
    • Check for available commits (non-blocking)
    • If batch_size_threshold reached → process batch immediately
    • If below threshold → use epoll_wait(batch_timeout_ms) for I/O and timeout
    • On timeout → process collected commits
  • If no commits available and no in-flight requests → switch to blocking acquire
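
A minimal sketch of this collection logic, assuming a hypothetical CommitQueue standing in for the ThreadPipeline input and an epoll fd that watches the in-flight connections; servicing of the I/O events themselves and error handling are elided:

```cpp
#include <sys/epoll.h>

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>
#include <vector>

struct Commit {
  std::vector<char> data;  // serialized commit payload
};

// Stand-in for the ThreadPipeline input: a queue with blocking and
// non-blocking acquire. The real interface may differ.
class CommitQueue {
 public:
  void push(Commit c) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      items_.push_back(std::move(c));
    }
    cv_.notify_one();
  }

  std::optional<Commit> try_acquire() {  // non-blocking
    std::lock_guard<std::mutex> lock(mu_);
    if (items_.empty()) return std::nullopt;
    Commit c = std::move(items_.front());
    items_.pop_front();
    return c;
  }

  Commit acquire() {  // blocking
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [&] { return !items_.empty(); });
    Commit c = std::move(items_.front());
    items_.pop_front();
    return c;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<Commit> items_;
};

// Collect one batch. `in_flight` is the number of outstanding S3 requests;
// `epoll_fd` watches their connections.
std::vector<Commit> collect_batch(CommitQueue& queue, int epoll_fd,
                                  std::size_t in_flight, int batch_timeout_ms,
                                  std::size_t batch_size_threshold) {
  std::vector<Commit> batch;

  // No in-flight requests: block for the first commit and send it with no
  // batching delay.
  if (in_flight == 0) {
    batch.push_back(queue.acquire());
    return batch;
  }

  // Drain whatever is immediately available (non-blocking acquire).
  std::size_t bytes = 0;
  while (auto commit = queue.try_acquire()) {
    bytes += commit->data.size();
    batch.push_back(std::move(*commit));
    if (bytes >= batch_size_threshold) return batch;  // size trigger
  }

  // Below the size threshold: wait up to batch_timeout_ms for responses on
  // the in-flight connections.
  epoll_event events[16];
  int ready = epoll_wait(epoll_fd, events, 16, batch_timeout_ms);
  // ready == 0: time trigger, flush what was collected so far.
  // ready  > 0: the caller services the I/O events and re-enters collection.
  (void)ready;
  return batch;
}
```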

2. Connection Management

  • Acquire healthy connection from pool
  • Create new connections if pool below target_pool_size
  • If no healthy connections available, block until one becomes available
  • Maintain automatic pool replenishment
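
A minimal sketch of this pool policy, with hypothetical ConnectionPool and S3Connection types; blocking until a connection becomes healthy and the actual TCP dialing are elided:

```cpp
#include <cstddef>
#include <deque>
#include <memory>

// Hypothetical connection types; names are illustrative.
struct S3Connection {
  bool healthy = true;  // cleared by heartbeat failures or I/O errors
};

class ConnectionPool {
 public:
  explicit ConnectionPool(std::size_t target_pool_size)
      : target_(target_pool_size) {}

  // Drop unhealthy connections, replenish up to the target, and hand out a
  // healthy connection. Returns nullptr when none is ready; the real pool
  // blocks here until one becomes available.
  std::unique_ptr<S3Connection> acquire() {
    replenish();
    while (!idle_.empty()) {
      auto conn = std::move(idle_.front());
      idle_.pop_front();
      if (conn->healthy) {
        ++checked_out_;
        return conn;
      }
      // Unhealthy connection is destroyed here; replenishment replaces it.
    }
    return nullptr;
  }

  void release(std::unique_ptr<S3Connection> conn) {
    if (!conn) return;
    --checked_out_;
    if (conn->healthy) idle_.push_back(std::move(conn));
    replenish();  // replaces connections that failed while checked out
  }

 private:
  void replenish() {
    // Both idle and checked-out connections count toward the target.
    while (idle_.size() + checked_out_ < target_) {
      idle_.push_back(std::make_unique<S3Connection>());  // real code dials TCP
    }
  }

  std::size_t target_;
  std::size_t checked_out_ = 0;
  std::deque<std::unique_ptr<S3Connection>> idle_;
};
```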

3. Data Transmission

  • Write batch data to S3 connection using appropriate protocol
  • Publish accepted transactions to subscriber system
  • Track request as in-flight for flow control

4. I/O Event Processing

  • Handle epoll events for all in-flight connections
  • Monitor connection health via heartbeats
  • Process incoming responses and detect connection failures

5. Response Handling

  • Ordered Acknowledgment: Only acknowledge batch after all prior batches are durable
  • Release batch via StageGuard destructor (publishes to next pipeline stage)
  • Publish durability events to subscriber system
  • Return healthy connection to pool
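
The ordered-acknowledgment rule amounts to buffering out-of-order completions and releasing batches only once every earlier batch is durable. A minimal sketch with hypothetical names, using the logical batch ids assigned when batches are created:

```cpp
#include <cstdint>
#include <functional>
#include <set>
#include <utility>

// Acknowledges batches to the next pipeline stage strictly in order, even
// though uploads may complete (or be retried) out of order.
class OrderedAcker {
 public:
  explicit OrderedAcker(std::function<void(uint64_t)> on_durable)
      : on_durable_(std::move(on_durable)) {}

  // Called when the S3 upload for `batch_id` succeeds. Batch ids here are
  // the logical sequence 0, 1, 2, ... assigned at batch creation.
  void mark_durable(uint64_t batch_id) {
    completed_.insert(batch_id);
    // Advance over every consecutive completed batch and acknowledge each
    // one (e.g. release its StageGuard and publish a durability event).
    while (!completed_.empty() && *completed_.begin() == next_to_ack_) {
      completed_.erase(completed_.begin());
      on_durable_(next_to_ack_);
      ++next_to_ack_;
    }
  }

  // Every batch id below this value has been acknowledged as durable.
  uint64_t durable_watermark() const { return next_to_ack_; }

 private:
  std::function<void(uint64_t)> on_durable_;
  std::set<uint64_t> completed_;  // completed but not yet acknowledged
  uint64_t next_to_ack_ = 0;
};
```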

6. Failure Handling

  • Remove failed connection from pool
  • Retry batch with exponential backoff (up to max_retry_attempts)
  • Backoff delays only affect the specific failing batch
  • If retries exhausted, abort process or escalate error
  • Initiate pool replenishment if below target
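
A minimal sketch of the backoff computation, assuming the delay doubles on each attempt; the 30-second cap is an illustrative assumption, not part of the documented configuration:

```cpp
#include <algorithm>
#include <chrono>

// Computes the delay before retrying a failed batch. Only the failing batch
// waits; other batches keep flowing until the in-flight limit applies
// backpressure.
inline std::chrono::milliseconds retry_delay(unsigned attempt,  // 1-based
                                             std::chrono::milliseconds base) {
  const unsigned exponent = attempt > 0 ? attempt - 1 : 0;
  const auto delay = base * (1u << std::min(exponent, 10u));
  return std::min(delay, std::chrono::milliseconds(30'000));  // assumed cap
}

// With retry_base_delay_ms = 100ms the waits are 100ms, 200ms, 400ms, ...;
// after max_retry_attempts (default 3) the thread aborts or escalates.
```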

Connection Pool

  • Target Size: target_pool_size connections (recommended: 2x max_in_flight_requests)
  • Replenishment: Automatic creation of new connections when the pool falls below target
  • Health Monitoring: Heartbeat-based connection validation
  • Sizing Rationale: The 2x multiplier keeps connections available during peak load and while failed connections are being replaced

Key Design Properties

Batch Ordering: Batches may be retried out-of-order for performance, but acknowledgment to next pipeline stage maintains strict ordering.

Backpressure: Retry delays for failing batches create natural backpressure that eventually blocks the persistence thread when in-flight limits are reached.

Graceful Shutdown: On shutdown signal, drain all in-flight batches to completion before terminating.

Configuration Parameters

Parameter                 Default        Description
batch_timeout_ms          1ms            Maximum time to spend collecting commits for a batch
batch_size_threshold      1MB            Bytes of commit data that trigger batch processing
max_in_flight_requests    50             Maximum concurrent requests to the persistence backend
target_pool_size          2x in-flight   Target number of connections to maintain
max_retry_attempts        3              Maximum retries for a failed batch before aborting
retry_base_delay_ms       100ms          Base delay for exponential backoff retries

Configuration Validation

Required Constraints:

  • batch_size_threshold > 0 (must process at least one commit per batch)
  • max_in_flight_requests > 0 (must allow at least one concurrent request)
  • max_in_flight_requests < 1000 (required for single-call recovery guarantee)
  • target_pool_size >= max_in_flight_requests (pool must accommodate all in-flight requests)
  • batch_timeout_ms > 0 (timeout must be positive)
  • max_retry_attempts >= 0 (zero disables retries)
  • retry_base_delay_ms > 0 (delay must be positive if retries enabled)

Performance Recommendations:

  • target_pool_size <= 2x max_in_flight_requests (optimal for performance)
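
A minimal sketch of these checks as a validation function over a hypothetical PersistenceConfig struct mirroring the table above:

```cpp
#include <chrono>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical configuration struct; field names are illustrative.
struct PersistenceConfig {
  std::chrono::milliseconds batch_timeout{1};
  std::size_t batch_size_threshold = 1 << 20;  // 1 MiB
  std::size_t max_in_flight_requests = 50;
  std::size_t target_pool_size = 100;          // 2x in-flight
  std::size_t max_retry_attempts = 3;          // unsigned, so always >= 0
  std::chrono::milliseconds retry_base_delay{100};
};

// Enforces the required constraints listed above; throws on invalid input.
inline void validate(const PersistenceConfig& cfg) {
  auto require = [](bool ok, const std::string& msg) {
    if (!ok) throw std::invalid_argument(msg);
  };
  require(cfg.batch_size_threshold > 0, "batch_size_threshold must be > 0");
  require(cfg.max_in_flight_requests > 0, "max_in_flight_requests must be > 0");
  require(cfg.max_in_flight_requests < 1000,
          "max_in_flight_requests must be < 1000 for single-call recovery");
  require(cfg.target_pool_size >= cfg.max_in_flight_requests,
          "target_pool_size must cover all in-flight requests");
  require(cfg.batch_timeout.count() > 0, "batch_timeout_ms must be > 0");
  require(cfg.max_retry_attempts == 0 || cfg.retry_base_delay.count() > 0,
          "retry_base_delay_ms must be > 0 when retries are enabled");
}
```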

Recovery and Consistency

Recovery Model

WeaselDB's batched persistence design enables efficient recovery while maintaining strict serializable consistency guarantees.

Batch Ordering and Durability

Ordered Acknowledgment Property: Batches may be retried out-of-order for performance, but acknowledgment to the next pipeline stage maintains strict ordering. This ensures that if batch N is acknowledged as durable, all batches 0 through N-1 are also guaranteed durable.

Durability Watermark: The system maintains a durable watermark indicating the highest consecutively durable batch ID. This watermark advances only when all preceding batches are confirmed persistent; for example, if batch 8 becomes durable before batch 7, the watermark holds until batch 7 is confirmed and then advances past both.

Recovery Protocol

WeaselDB uses a sequential batch numbering scheme with S3 atomic operations to provide efficient crash recovery and split-brain prevention without external coordination services.

Batch Numbering Scheme:

  • Batch numbers start at 2^64 - 1 and count downward: 18446744073709551615, 18446744073709551614, 18446744073709551613, ...
  • Each batch is stored as S3 object batches/{batch_number:020d} with zero-padding
  • Zero-padded keys sort lexicographically in numeric order, so the most recent batches (lowest numbers) appear first in S3 LIST operations
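
A minimal sketch of the key mapping, assuming printf-style zero-padding to 20 digits:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Maps a batch number to its S3 object key, zero-padded to 20 digits so
// that lexicographic order equals numeric order.
inline std::string batch_key(uint64_t batch_number) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "batches/%020llu",
                static_cast<unsigned long long>(batch_number));
  return buf;
}

// Numbers count downward from 2^64 - 1, so a newer batch sorts before an
// older one:
//   batch_key(18446744073709551614ull) < batch_key(18446744073709551615ull)
// A LIST with prefix "batches/" therefore returns the most recent batches
// first.
```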

Leadership and Split-Brain Prevention:

  • New persistence thread instances scan S3 to find the next available batch number
  • Each batch write is a conditional PUT with If-None-Match: * that atomically claims the sequential batch number
  • Only one instance can successfully claim each batch number, preventing split-brain scenarios
  • Batch object content includes leader_id to identify which leader wrote each batch
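
A minimal sketch of the conditional write, using libcurl purely for illustration; the real persistence thread reuses its pooled TCP connections, and S3 authentication (request signing) and retries are handled by higher layers:

```cpp
#include <curl/curl.h>

#include <string>

// Attempts to claim one batch object. Returns true only if this instance
// created the object; S3 rejects the write with 412 Precondition Failed if
// the object already exists, so only one leader can claim a batch number.
bool claim_batch(const std::string& object_url, const std::string& body) {
  CURL* curl = curl_easy_init();
  if (!curl) return false;

  curl_slist* headers = curl_slist_append(nullptr, "If-None-Match: *");

  curl_easy_setopt(curl, CURLOPT_URL, object_url.c_str());
  curl_easy_setopt(curl, CURLOPT_CUSTOMREQUEST, "PUT");
  curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
  curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.data());
  curl_easy_setopt(curl, CURLOPT_POSTFIELDSIZE, static_cast<long>(body.size()));

  long status = 0;
  CURLcode rc = curl_easy_perform(curl);
  if (rc == CURLE_OK) curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &status);

  curl_slist_free_all(headers);
  curl_easy_cleanup(curl);
  return rc == CURLE_OK && status == 200;  // 412 => another leader won
}
```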

Recovery Scenarios:

Clean Shutdown:

  • All in-flight batches are drained to completion before termination
  • Durability watermark accurately reflects all durable state
  • No recovery required on restart

Crash Recovery:

  1. S3 Scan with Bounded Cost: List S3 objects with prefix batches/ and limit of 1000 objects
  2. Gap Detection: Check for missing batch numbers within the listed range. Because max_in_flight_requests is kept below 1000, WeaselDB never has 1000 batches in flight concurrently, so a LIST limit of 1000 objects is sufficient.
  3. Watermark Reconstruction: Set the durability watermark to the most recent batch that, together with every older batch, forms a gap-free sequence (see the sketch below)
  4. Leadership Transition: Begin writing batches from the next available batch number, skipping past any batch numbers already claimed by objects observed in the scan.
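
A minimal sketch of gap detection and watermark reconstruction over the batch numbers parsed from the single LIST response; key parsing, pagination, and the empty-bucket case are simplified:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// `listed` holds the batch numbers parsed from the single LIST page (at most
// 1000 keys). Returns the batch number of the most recent batch that, with
// every older batch, forms a gap-free sequence.
inline uint64_t reconstruct_watermark(std::vector<uint64_t> listed) {
  assert(!listed.empty());
  std::sort(listed.begin(), listed.end());  // ascending = newest batch first
  // Walk from the oldest listed batch (largest number) toward newer batches
  // and stop at the first gap; batches newer than the gap were in flight but
  // not durable at crash time.
  uint64_t watermark = listed.back();
  for (auto it = listed.rbegin(); std::next(it) != listed.rend(); ++it) {
    if (*std::next(it) != *it - 1) break;  // a newer batch number is missing
    watermark = *std::next(it);
  }
  return watermark;
}

// The new leader then starts writing at the lowest listed number minus one,
// which skips past every batch number already claimed in the scan.
```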

Bounded Recovery Guarantee: Since at most 999 batches can be in-flight during a crash, the durability watermark is guaranteed to be found within the first 1000 objects in S3. This ensures O(1) recovery time regardless of database size, with at most one S3 LIST operation required.

Recovery Performance Limits: To preserve the single-call recovery guarantee, max_in_flight_requests must stay below 1000, S3's maximum number of objects returned per LIST operation. This ensures that a single S3 API call is sufficient for recovery.