# Persistence Thread Design

## Overview

The persistence thread receives commit batches from the main processing pipeline and uploads them to S3. It uses a single-threaded design with connection pooling and batching to sustain high throughput.

## Architecture

**Input**: Commits arrive via `ThreadPipeline` interface from upstream processing

**Output**: Batched commits uploaded to S3 persistence backend

**Transport**: Single-threaded TCP client with connection pooling

**Protocol**: Higher layers handle HTTP, authentication, and S3-specific details

## Batching Strategy

The persistence thread collects commits into batches using two trigger conditions:

1. **Time Trigger**: `batch_timeout_ms` elapsed since batch collection started
2. **Size Trigger**: the batch reaches `batch_size_threshold` bytes (the final commit may push the batch past the threshold)

**Flow Control**: When `max_in_flight_requests` is reached, block until responses are received.
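
To make the two triggers concrete, here is a minimal sketch. The type and function names are illustrative stand-ins, not the actual WeaselDB types, and `batch_size_threshold` is interpreted as a byte count:

```cpp
#include <chrono>
#include <cstddef>

// Illustrative stand-ins for the real batch-collection state and config.
struct BatchState {
    std::chrono::steady_clock::time_point started;  // when collection began
    std::size_t bytes = 0;                          // payload collected so far
};

struct BatchConfig {
    std::chrono::milliseconds batch_timeout{1};     // batch_timeout_ms default
    std::size_t batch_size_threshold = 1 << 20;     // assuming 1 MB == 1 MiB
};

// A batch is dispatched as soon as either trigger fires.
bool batch_ready(const BatchState& b, const BatchConfig& cfg) {
    const auto now = std::chrono::steady_clock::now();
    const bool time_trigger = (now - b.started) >= cfg.batch_timeout;
    const bool size_trigger = b.bytes >= cfg.batch_size_threshold;
    return time_trigger || size_trigger;
}
```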

## Main Processing Loop

### 1. Batch Collection

**No In-Flight Requests**:

- Use blocking acquire to get the first commit batch
- Process immediately (no batching delay)

**With In-Flight Requests**:

- Check flow control: if at `max_in_flight_requests`, block for responses
- Collect commits using non-blocking acquire until a trigger condition fires (see the sketch after this list):
  - Check for available commits (non-blocking)
  - If `batch_size_threshold` reached → process batch immediately
  - If below threshold → use `epoll_wait(batch_timeout_ms)` to wait for I/O or timeout
  - On timeout → process collected commits
  - If no commits available and no in-flight requests → switch to blocking acquire
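
A condensed sketch of this loop follows. The declaration-only stubs (`Batch`, `try_acquire`, `blocking_acquire`, `process_batch`, `handle_io`, `remaining_timeout_ms`) stand in for the surrounding machinery and are assumptions, not the actual WeaselDB API:

```cpp
#include <sys/epoll.h>
#include <cstddef>

struct Commit {};                        // stand-in for the real commit type
struct Batch {                           // declaration-only stubs
    void add(const Commit&);
    bool empty() const;
    std::size_t bytes() const;
};
bool try_acquire(Commit& out);           // non-blocking acquire from upstream
Commit blocking_acquire();               // blocking acquire
void process_batch(Batch&);              // hand off to the transmission step
void handle_io(epoll_event* ev, int n);  // drive in-flight responses
int remaining_timeout_ms();              // time left before batch_timeout_ms

void collect_and_dispatch(int epfd, std::size_t in_flight, std::size_t threshold) {
    Batch batch;
    if (in_flight == 0) {                // blocking path: no batching delay
        Commit first = blocking_acquire();
        batch.add(first);
        process_batch(batch);
        return;
    }
    epoll_event events[64];
    for (;;) {
        Commit c;
        while (try_acquire(c)) {         // drain whatever is already available
            batch.add(c);
            if (batch.bytes() >= threshold) {  // size trigger
                process_batch(batch);
                return;
            }
        }
        int n = epoll_wait(epfd, events, 64, remaining_timeout_ms());
        if (n > 0) { handle_io(events, n); continue; }  // I/O before timeout
        // Timeout: flush what was collected. If the batch is empty and
        // nothing is in flight anymore, the caller switches back to the
        // blocking path above.
        if (!batch.empty()) process_batch(batch);
        return;
    }
}
```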

### 2. Connection Management

- Acquire healthy connection from pool
- Create new connections if pool below `target_pool_size`
- If no healthy connections available, block until one becomes available
- Maintain automatic pool replenishment
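
A minimal sketch of this pool policy, assuming an illustrative `Connection` type with a heartbeat-driven `healthy()` check and a synchronous `make_connection()` helper (the real single-threaded design would drive connects through epoll instead of blocking):

```cpp
#include <cstddef>
#include <deque>
#include <memory>

struct Connection {
    bool healthy() const;                       // heartbeat validation (stub)
};
std::unique_ptr<Connection> make_connection();  // TCP connect to S3 endpoint (stub)

class ConnectionPool {
public:
    explicit ConnectionPool(std::size_t target) : target_(target) {}

    std::unique_ptr<Connection> acquire() {
        replenish();                         // keep the pool at target size
        while (!idle_.empty()) {
            auto c = std::move(idle_.front());
            idle_.pop_front();
            if (c->healthy()) return c;      // drop unhealthy connections
        }
        return make_connection();            // none healthy: block for a fresh one
    }

    void release(std::unique_ptr<Connection> c) {
        idle_.push_back(std::move(c));
        replenish();                         // automatic replenishment
    }

private:
    void replenish() {
        while (idle_.size() < target_) idle_.push_back(make_connection());
    }

    std::size_t target_;                     // target_pool_size
    std::deque<std::unique_ptr<Connection>> idle_;
};
```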

### 3. Data Transmission

- Write batch data to S3 connection using appropriate protocol
- Publish accepted transactions to subscriber system
- Track request as in-flight for flow control

### 4. I/O Event Processing

- Handle epoll events for all in-flight connections
- Monitor connection health via heartbeats
- Process incoming responses and detect connection failures
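
Sketched below with the same illustrative `Connection` stand-in; the two callbacks are assumptions that map onto sections 5 and 6:

```cpp
#include <sys/epoll.h>

struct Connection;                        // illustrative, as in earlier sketches
void on_readable(Connection*);            // parse S3 responses / heartbeats (stub)
void on_connection_failed(Connection*);   // remove from pool, schedule retry (stub)

void process_io_events(int epfd) {
    epoll_event events[64];
    int n = epoll_wait(epfd, events, 64, 0);      // poll without blocking
    for (int i = 0; i < n; ++i) {
        auto* conn = static_cast<Connection*>(events[i].data.ptr);
        if (events[i].events & (EPOLLERR | EPOLLHUP)) {
            on_connection_failed(conn);           // failure path (section 6)
        } else if (events[i].events & EPOLLIN) {
            on_readable(conn);                    // response path (section 5)
        }
    }
}
```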

### 5. Response Handling

- **Ordered Acknowledgment**: Only acknowledge a batch after all prior batches are durable
- Release batch via `StageGuard` destructor (publishes to next pipeline stage)
- Publish durability events to subscriber system
- Return healthy connection to pool
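
One common way to implement ordered acknowledgment is a small reorder buffer keyed by submission sequence. The sketch below assumes a `release` callback standing in for the `StageGuard` destructor:

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Responses may arrive in any order; releases happen strictly in order.
class OrderedAcker {
public:
    // Call when the batch submitted as number `seq` (0, 1, 2, ...) is durable.
    void on_durable(std::uint64_t seq, std::function<void()> release) {
        done_.push({seq, std::move(release)});
        // Release every batch whose predecessors are all durable.
        while (!done_.empty() && done_.top().seq == next_) {
            done_.top().release();      // publish to the next pipeline stage
            done_.pop();
            ++next_;
        }
    }

private:
    struct Pending {
        std::uint64_t seq;
        std::function<void()> release;
        bool operator>(const Pending& o) const { return seq > o.seq; }
    };
    std::uint64_t next_ = 0;            // next submission number to acknowledge
    std::priority_queue<Pending, std::vector<Pending>, std::greater<>> done_;
};
```

A batch that completes early simply waits in the buffer until every earlier batch has completed, which is what keeps the durability watermark well defined.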

### 6. Failure Handling

- Remove failed connection from pool
- Retry batch with exponential backoff (up to `max_retry_attempts`)
- Backoff delays only affect the specific failing batch
- If retries exhausted, abort process or escalate error
- Initiate pool replenishment if below target
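
The backoff schedule in sketch form; the overflow guard and the 30-second cap are assumptions for safety, not documented parameters:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// delay(attempt) = retry_base_delay_ms * 2^attempt, capped.
// With the 100 ms default: attempt 0 -> 100 ms, 1 -> 200 ms, 2 -> 400 ms.
std::chrono::milliseconds retry_delay(std::uint32_t attempt,
                                      std::chrono::milliseconds base) {
    const std::uint32_t shift = std::min(attempt, 20u);        // overflow guard
    const auto delay = base * (1u << shift);
    return std::min(delay, std::chrono::milliseconds(30'000)); // assumed cap
}
```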

## Connection Pool

**Target Size**: `target_pool_size` connections (recommended: 2x `max_in_flight_requests`)

**Replenishment**: Automatic creation when below target

**Health Monitoring**: Heartbeat-based connection validation

**Sizing Rationale**: The 2x multiplier keeps healthy connections available during peak load and while failed connections are being replaced

## Key Design Properties

**Batch Ordering**: Batches may be retried out-of-order for performance, but acknowledgment to the next pipeline stage maintains strict ordering.

**Backpressure**: Retry delays for failing batches create natural backpressure that eventually blocks the persistence thread when in-flight limits are reached.

**Graceful Shutdown**: On shutdown signal, drain all in-flight batches to completion before terminating.

## Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `batch_timeout_ms` | 1 ms | Maximum time to wait collecting commits for batching |
| `batch_size_threshold` | 1 MB | Byte-size threshold for triggering batch processing |
| `max_in_flight_requests` | 50 | Maximum concurrent requests to persistence backend |
| `target_pool_size` | 2x in-flight | Target number of connections to maintain |
| `max_retry_attempts` | 3 | Maximum retries for failed batches before aborting |
| `retry_base_delay_ms` | 100 ms | Base delay for exponential backoff retries |
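
For reference, the same parameters as a plain struct with these defaults (an illustrative mirror, not the actual WeaselDB type; 1 MB is interpreted as `1 << 20` bytes here):

```cpp
#include <cstddef>
#include <cstdint>

struct PersistenceConfig {
    std::uint64_t batch_timeout_ms       = 1;        // time trigger
    std::size_t   batch_size_threshold   = 1 << 20;  // size trigger (bytes)
    std::uint32_t max_in_flight_requests = 50;       // flow-control limit
    std::uint32_t target_pool_size       = 100;      // 2x max_in_flight_requests
    std::uint32_t max_retry_attempts     = 3;        // then abort or escalate
    std::uint64_t retry_base_delay_ms    = 100;      // exponential backoff base
};
```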

## Configuration Validation

**Required Constraints**:

- `batch_size_threshold` > 0 (must process at least one commit per batch)
- `max_in_flight_requests` > 0 (must allow at least one concurrent request)
- `max_in_flight_requests` <= 1000 (required for single-call recovery guarantee)
- `target_pool_size` >= `max_in_flight_requests` (pool must accommodate all in-flight requests)
- `batch_timeout_ms` > 0 (timeout must be positive)
- `max_retry_attempts` >= 0 (zero disables retries)
- `retry_base_delay_ms` > 0 (delay must be positive if retries enabled)

**Performance Recommendations**:

- `target_pool_size` <= 2x `max_in_flight_requests` (beyond 2x, extra connections sit idle without improving throughput)
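
A sketch of these checks against the `PersistenceConfig` struct shown earlier (illustrative; with unsigned fields, `max_retry_attempts >= 0` holds trivially):

```cpp
#include <stdexcept>

void validate(const PersistenceConfig& cfg) {
    auto require = [](bool ok, const char* msg) {
        if (!ok) throw std::invalid_argument(msg);
    };
    require(cfg.batch_size_threshold > 0, "batch_size_threshold must be > 0");
    require(cfg.max_in_flight_requests > 0, "max_in_flight_requests must be > 0");
    require(cfg.max_in_flight_requests <= 1000,
            "max_in_flight_requests > 1000 breaks single-LIST recovery");
    require(cfg.target_pool_size >= cfg.max_in_flight_requests,
            "target_pool_size must cover all in-flight requests");
    require(cfg.batch_timeout_ms > 0, "batch_timeout_ms must be positive");
    require(cfg.max_retry_attempts == 0 || cfg.retry_base_delay_ms > 0,
            "retry_base_delay_ms must be positive when retries are enabled");
}
```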

## Recovery and Consistency

### Recovery Model

WeaselDB's batched persistence design enables efficient recovery while maintaining strict serializable consistency guarantees.

#### **Batch Ordering and Durability**

**Ordered Acknowledgment Property**: Batches may be retried out-of-order for performance, but acknowledgment to the next pipeline stage maintains strict ordering. Because batch numbers count downward (see the numbering scheme below), this ensures that if batch N is acknowledged as durable, all earlier batches (the higher numbers N+1, N+2, etc.) are also guaranteed durable.

**Durability Watermark**: The system maintains a durable watermark indicating the latest consecutively durable batch (the lowest batch number in the consecutive sequence). The watermark advances only when all preceding batches (higher numbers) are confirmed persistent.

#### **Recovery Protocol**

WeaselDB uses a **sequential batch numbering** scheme with **S3 atomic operations** to provide efficient crash recovery and split-brain prevention without external coordination services.

**Batch Numbering Scheme**:

- Batch numbers start at `2^64 - 1` and count downward: `18446744073709551615, 18446744073709551614, 18446744073709551613, ...`
- Each batch is stored as S3 object `batches/{batch_number:020d}` with zero-padding
- S3 lexicographic ordering ensures recent batches (lower numbers) appear first in LIST operations
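
The key scheme in code form: `2^64 - 1` has exactly 20 decimal digits, so zero-padding to width 20 makes lexicographic order coincide with numeric order (`batch_key` is an assumed helper name):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

std::string batch_key(std::uint64_t batch_number) {
    char buf[32];                         // "batches/" + 20 digits + NUL
    std::snprintf(buf, sizeof(buf), "batches/%020llu",
                  static_cast<unsigned long long>(batch_number));
    return buf;
}

// batch_key(UINT64_MAX)     == "batches/18446744073709551615"  (first batch)
// batch_key(UINT64_MAX - 1) == "batches/18446744073709551614"  (next batch;
//   it sorts before the first, so newer batches lead LIST results)
```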

**Leadership and Split-Brain Prevention**:

- New persistence thread instances scan S3 to find the next available batch number
- Each batch write uses the `If-None-Match: *` header to atomically claim the sequential batch number
- Only one instance can successfully claim each batch number, preventing split-brain scenarios
- Batch object content includes a `leader_id` identifying which leader wrote each batch
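
A sketch of the atomic claim, reusing `batch_key()` from the previous sketch; `http_put()` is an assumed stand-in for whatever HTTP layer the higher protocol layers provide:

```cpp
#include <cstdint>
#include <string>

struct HttpResponse { int status; };
// Assumed signature: issues a PUT with one extra request header.
HttpResponse http_put(const std::string& key, const std::string& body,
                      const std::string& header);

// Returns true if this instance won `batch_number`. S3 answers
// 412 Precondition Failed when the object already exists, so exactly
// one writer can win each batch number.
bool claim_batch(std::uint64_t batch_number, const std::string& payload) {
    HttpResponse r = http_put(batch_key(batch_number), payload,
                              "If-None-Match: *");
    return r.status == 200;   // 412 -> another leader claimed this number
}
```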

**Recovery Scenarios**:

**Clean Shutdown**:

- All in-flight batches are drained to completion before termination
- Durability watermark accurately reflects all durable state
- No recovery required on restart

**Crash Recovery**:

1. **S3 Scan with Bounded Cost**: List S3 objects with prefix `batches/` and a limit of 1000 objects
2. **Gap Detection**: Check for missing sequential batch numbers. WeaselDB never puts more than 1000 batches in flight concurrently, so a limit of 1000 is sufficient.
3. **Watermark Reconstruction**: Set the durability watermark to the latest consecutive batch (the lowest batch number in the consecutive sequence)
4. **Leadership Transition**: Begin writing batches from the next available batch number, skipping past any batch numbers claimed in the durability watermark scan
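
Steps 2 and 3 in sketch form: given the batch numbers parsed from a single LIST page (sorted ascending, i.e. newest first), walk from the oldest listed batch toward newer ones and stop at the first gap. Illustrative code; it assumes a non-empty listing:

```cpp
#include <cstdint>
#include <vector>

// `listed` holds batch numbers from one LIST page, sorted ascending
// (lexicographic order of the zero-padded keys == numeric order).
std::uint64_t reconstruct_watermark(const std::vector<std::uint64_t>& listed) {
    std::uint64_t watermark = listed.back();   // oldest listed batch
    for (auto it = listed.rbegin() + 1; it != listed.rend(); ++it) {
        if (*it != watermark - 1) break;       // gap: that batch never landed
        watermark = *it;                       // extend the durable run downward
    }
    return watermark;  // lowest number of the consecutive durable run
}
```

Everything numbered at or above the returned watermark is durable; the new leader resumes writing below it, skipping any numbers already claimed.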

**Bounded Recovery Guarantee**: Since at most 1000 batches can be in-flight during a crash, the durability watermark is guaranteed to be found within the first 1000 objects in S3. This ensures **O(1) recovery time** regardless of database size, with at most **one S3 LIST operation** required.

**Recovery Protocol Detail**: Even with exactly 1000 batches in-flight, recovery works correctly:

- Worst case: the earliest batch (highest number) fails while the remaining 999 batches succeed
- The S3 LIST returns 1000 objects: the 999 successful batches plus one previously written batch
- Gap detection identifies the missing batch and sets the watermark to the latest consecutive batch (the lowest number in the sequence)
- Since batches count downward with zero-padded names, lexicographic ordering ensures proper sequence detection

**Recovery Performance Limits**: To maintain single-call recovery guarantees, `max_in_flight_requests` is limited to **1000**, matching S3's maximum objects per LIST operation. This ensures a single S3 API call is sufficient for recovery.