Add mdformat pre-commit hook

The persistence thread collects commits into batches using two trigger conditions:

1. **Time Trigger**: `batch_timeout_ms` elapsed since batch collection started
1. **Size Trigger**: `batch_size_threshold` commits collected (can be exceeded by the final commit)

**Flow Control**: When `max_in_flight_requests` is reached, block until responses are received. Batches in retry backoff count toward the in-flight limit, creating natural backpressure during failures.
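
As a rough sketch, the two triggers can combine in a single collection loop. The `Queue` source and single-commit granularity are simplifications (the real thread receives commit batches, which is why the size trigger can overshoot); only the `batch_timeout_ms` and `batch_size_threshold` names come from the text:

```python
import time
from queue import Queue, Empty

def collect_batch(commits: Queue, batch_size_threshold: int,
                  batch_timeout_ms: int) -> list:
    """Collect commits until the size or time trigger fires."""
    batch = [commits.get()]  # block for the first commit; the clock starts here
    deadline = time.monotonic() + batch_timeout_ms / 1000.0
    while len(batch) < batch_size_threshold:  # size trigger
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time trigger: batch_timeout_ms elapsed
        try:
            batch.append(commits.get(timeout=remaining))
        except Empty:
            break  # time trigger fired while waiting
    return batch
```
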
### 1. Batch Collection

**No In-Flight Requests** (no I/O to pump):

- Use blocking acquire to get the first commit batch (can afford to wait)
- Process immediately (no batching delay)

**With In-Flight Requests** (I/O to pump in the event loop):

- Check flow control: if at `max_in_flight_requests`, block for responses
- Collect commits using non-blocking acquire until a trigger condition is met:
  - Check for available commits (non-blocking)

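
The mode selection above might look like the following sketch; `try_acquire`, `blocking_acquire`, and `wait_for_response` are hypothetical stand-ins for the real event-loop primitives:

```python
def next_work(in_flight: int, max_in_flight_requests: int,
              try_acquire, blocking_acquire, wait_for_response):
    """Pick the acquisition mode based on outstanding I/O (callbacks are illustrative)."""
    if in_flight == 0:
        # No I/O to pump: block until the first commit batch arrives.
        return blocking_acquire()
    if in_flight >= max_in_flight_requests:
        # Flow control: at the in-flight cap, wait for responses first.
        wait_for_response()
    # I/O pending: poll without blocking so the event loop keeps pumping.
    return try_acquire()
```
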
## Configuration Validation

**Required Constraints**:

- `batch_size_threshold` > 0 (must process at least one commit per batch)
- `max_in_flight_requests` > 0 (must allow at least one concurrent request)
- `max_in_flight_requests` \<= 1000 (required for the single-call recovery guarantee)
- `batch_timeout_ms` > 0 (timeout must be positive)
- `max_retry_attempts` >= 0 (zero disables retries)
- `retry_base_delay_ms` > 0 (delay must be positive if retries are enabled)

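
These constraints are mechanical to check. A minimal validation sketch (the function name and dict-based config shape are assumptions, not WeaselDB's API):

```python
def validate_config(cfg: dict) -> None:
    """Check the required constraints above; raise ValueError on the first violation."""
    rules = [
        (cfg["batch_size_threshold"] > 0, "batch_size_threshold must be > 0"),
        (cfg["max_in_flight_requests"] > 0, "max_in_flight_requests must be > 0"),
        (cfg["max_in_flight_requests"] <= 1000,
         "max_in_flight_requests must be <= 1000 (single-call recovery guarantee)"),
        (cfg["batch_timeout_ms"] > 0, "batch_timeout_ms must be > 0"),
        (cfg["max_retry_attempts"] >= 0, "max_retry_attempts must be >= 0"),
        (cfg["retry_base_delay_ms"] > 0, "retry_base_delay_ms must be > 0"),
    ]
    for ok, message in rules:
        if not ok:
            raise ValueError(message)
```
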
WeaselDB uses a **sequential batch numbering** scheme with **S3 atomic operations** to provide efficient crash recovery and split-brain prevention without external coordination services.

**Batch Numbering Scheme**:

- Batch numbers start at `2^64 - 1` and count downward: `18446744073709551615, 18446744073709551614, 18446744073709551613, ...`
- Each batch is stored as S3 object `batches/{batch_number:020d}` with zero-padding
- S3 lexicographic ordering on zero-padded numbers returns batches in ascending numerical order (latest batches first)
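
A small sketch of the key scheme, showing that zero-padding to 20 digits (enough for `2^64 - 1`) makes lexicographic order agree with numerical order:

```python
FIRST_BATCH = 2**64 - 1  # batch numbers count downward from here

def batch_key(batch_number: int) -> str:
    """Zero-padded S3 key, so string sort order == numerical order."""
    return f"batches/{batch_number:020d}"

# Lexicographic sort (what S3 LIST does) returns the lowest,
# i.e. most recent, batch numbers first.
keys = sorted(batch_key(n) for n in (100, 99, 98, 97))
```
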

**Terminology**: Since batch numbers decrease over time, we use numerical ordering:

- "Older" batches = higher numbers (written first in time)
- "Newer" batches = lower numbers (written more recently)
- "Most recent" batches = lowest numbers (most recently written)

**Example**: If batches 100, 99, 98, 97 are written, S3 LIST returns them as:

```
batches/00000000000000000097 (newest, lowest batch number)
batches/00000000000000000098
batches/00000000000000000099
batches/00000000000000000100 (oldest, highest batch number)
```

**Leadership and Split-Brain Prevention**:

- New persistence thread instances scan S3 to find the highest (oldest) available batch number
- Each batch write uses `If-None-Match="*"` to atomically claim its sequential batch number
- Only one instance can successfully claim each batch number, preventing split-brain scenarios
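
The claim semantics can be illustrated with an in-memory stand-in for S3's conditional PUT; `FakeS3` is purely illustrative, not WeaselDB code:

```python
class FakeS3:
    """In-memory stand-in for an S3 bucket with conditional PUT."""

    def __init__(self):
        self.objects: dict[str, bytes] = {}

    def put_if_none_match(self, key: str, body: bytes) -> bool:
        # Models If-None-Match="*": real S3 answers 412 Precondition Failed
        # when the key already exists, so only one writer can claim it.
        if key in self.objects:
            return False
        self.objects[key] = body
        return True
```

Because both instances target the same sequential batch number, the loser learns immediately that another leader exists.
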
**Recovery Scenarios**:

**Clean Shutdown**:

- All in-flight batches are drained to completion before termination
- The durability watermark accurately reflects all durable state
- No recovery is required on restart

**Crash Recovery**:

1. **S3 Scan with Bounded Cost**: List S3 objects with prefix `batches/` and a limit of 1000 objects
1. **Gap Detection**: Check for missing sequential batch numbers. WeaselDB never puts more than 1000 batches in flight concurrently, so a limit of 1000 is sufficient.
1. **Watermark Reconstruction**: Set the durability watermark to the latest consecutive batch, scanning from the highest numbers downward until a gap is found
1. **Leadership Transition**: Begin writing batches from the next available batch number, skipping past any batch numbers already claimed in the durability watermark scan.
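
Gap detection and watermark reconstruction can be sketched over the batch numbers returned by the LIST; `recover` is a hypothetical helper, and treating "next available" as one below the lowest claimed number is this sketch's reading of the leadership-transition step:

```python
def recover(present: list[int]) -> tuple[int, int]:
    """Return (durability_watermark, next_batch_to_claim) from listed batch numbers."""
    nums = sorted(present, reverse=True)  # highest (oldest) first
    watermark = nums[0]
    for n in nums[1:]:
        if n != watermark - 1:
            break  # gap: batches below this point may be incomplete
        watermark = n  # still consecutive, so still durable
    # Skip past every batch number already claimed in the scan.
    next_batch = min(present) - 1
    return watermark, next_batch
```
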

**Bounded Recovery Guarantee**: Since at most 1000 batches can be in-flight during a crash, any gap in the sequential numbering (indicating the durability watermark) must appear within the first 1000 S3 objects. This is because:

1. At most 1000 batches can be incomplete when the crash occurs
1. S3 LIST returns objects in ascending numerical order (most recent batches first due to countdown numbering)
1. The first gap found represents the boundary between durable and potentially incomplete batches
1. S3 LIST operations return at most 1000 objects per request
1. Therefore, scanning 1000 objects (the maximum S3 returns in one request) is sufficient to find this boundary

This ensures **O(1) recovery time** regardless of database size, with at most **one S3 LIST operation** required.

**Recovery Protocol Detail**: Even with exactly 1000 batches in flight, recovery works correctly.

**Example Scenario**: Batches 2000 down to 1001 (1000 batches) are in-flight when the crash occurs:

- The previous successful run had written through batch 2001
- Worst case: batch 2000 (the oldest in-flight batch) fails, while batches 1999 down to 1001 (all newer) succeed
- S3 LIST (limit = 1000) returns: 1001, 1002, ..., 1998, 1999, 2001 (ascending numerical order)