# Persistence Thread Design

## Overview

The persistence thread receives commit batches from the main processing pipeline and uploads them to S3. It uses a single-threaded design with connection pooling and batching.
## Architecture

**Input**: Commits arrive via `ThreadPipeline` interface from upstream processing

**Output**: Batched commits uploaded to S3 persistence backend

**Transport**: Single-threaded TCP client with connection pooling

**Protocol**: Higher layers handle HTTP, authentication, and S3-specific details
## Batching Strategy

The persistence thread collects commits into batches using two trigger conditions:

1. **Time Trigger**: `batch_timeout_ms` has elapsed since batch collection started
1. **Size Trigger**: the collected batch reaches `batch_size_threshold` (the final commit may push the batch past the threshold)

**Flow Control**: When `max_in_flight_requests` is reached, block until responses are received. Batches in retry backoff count toward the in-flight limit, creating natural backpressure during failures.
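
A minimal sketch of the trigger and flow-control predicates, assuming a hypothetical `Config` struct whose fields mirror the configuration parameters defined later in this document (names are illustrative, not the actual implementation):

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative config struct; field names mirror the configuration parameters.
struct Config {
    uint64_t batch_timeout_ms;        // time trigger
    size_t   batch_size_threshold;    // size trigger (bytes)
    size_t   max_in_flight_requests;  // flow-control limit
};

// Size trigger: fires once the collected batch reaches the threshold
// (the final commit may push the batch past it).
bool size_trigger(size_t batch_bytes, const Config& cfg) {
    return batch_bytes >= cfg.batch_size_threshold;
}

// Time trigger: fires once batch_timeout_ms has elapsed since collection began.
bool time_trigger(uint64_t now_ms, uint64_t batch_start_ms, const Config& cfg) {
    return now_ms - batch_start_ms >= cfg.batch_timeout_ms;
}

// Flow control: at the limit (batches in retry backoff included), the
// thread must block for responses before dispatching another batch.
bool must_block_for_responses(size_t in_flight, const Config& cfg) {
    return in_flight >= cfg.max_in_flight_requests;
}
```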
## Main Processing Loop

### 1. Batch Collection

**No In-Flight Requests** (no I/O to pump):

- Use a blocking acquire to get the first commit batch (can afford to wait)
- Process immediately (no batching delay)

**With In-Flight Requests** (I/O to pump in the event loop; see the sketch after this list):

- Check flow control: if at `max_in_flight_requests`, block until responses arrive
- Collect commits using non-blocking acquire until a trigger condition fires:
  - Check for available commits (non-blocking)
  - If `batch_size_threshold` is reached → process the batch immediately
  - If below the threshold → use `epoll_wait(batch_timeout_ms)` to simultaneously collect more commits and process I/O responses
  - On timeout → process the collected commits
- If no commits are available and no requests are in flight → switch to blocking acquire
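
A condensed sketch of this state machine, reusing the `Config` and predicates from the Batching Strategy sketch. Every other identifier here (`Commit`, `Batch`, the stub functions) is an illustrative stand-in for the real components, not actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct Commit { std::string data; };  // placeholder commit payload

struct Batch {
    std::vector<Commit> commits;
    size_t bytes = 0;
    void add(Commit c) { bytes += c.data.size(); commits.push_back(std::move(c)); }
    bool empty() const { return commits.empty(); }
};

// Stub interfaces (the real ones are ThreadPipeline and the epoll loop).
Commit acquire_blocking();            // blocks for the next commit
std::optional<Commit> try_acquire();  // non-blocking acquire
size_t in_flight_count();             // requests awaiting responses
void pump_io(uint64_t timeout_ms);    // epoll_wait: collect I/O responses
uint64_t now_ms();                    // monotonic clock
void send_to_s3(Batch batch);         // steps 2-3: connection + transmission

void run_collection_loop(const Config& cfg) {
    for (;;) {
        Batch batch;
        if (in_flight_count() == 0) {
            // No I/O to pump: block for the first commit, process immediately.
            batch.add(acquire_blocking());
        } else {
            // Flow control: at the limit, block until responses free a slot.
            while (must_block_for_responses(in_flight_count(), cfg))
                pump_io(cfg.batch_timeout_ms);
            const uint64_t start = now_ms();
            // Collect until a trigger fires, pumping I/O in the meantime.
            while (!size_trigger(batch.bytes, cfg) &&
                   !time_trigger(now_ms(), start, cfg)) {
                if (auto c = try_acquire()) {
                    batch.add(std::move(*c));
                } else if (batch.empty() && in_flight_count() == 0) {
                    batch.add(acquire_blocking());  // nothing to wait on
                } else {
                    pump_io(cfg.batch_timeout_ms);  // epoll doubles as delay
                }
            }
        }
        send_to_s3(std::move(batch));
    }
}
```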
### 2. Connection Management

- Acquire healthy connection from pool
- Create new connections if pool below `target_pool_size`
- If no healthy connections available, block until one becomes available
- Maintain automatic pool replenishment
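
A small sketch of pool acquisition and replenishment under these rules; `Connection` and the pool layout are illustrative assumptions:

```cpp
#include <cstddef>
#include <deque>
#include <memory>

struct Connection { bool healthy = true; };  // placeholder connection type

struct ConnectionPool {
    std::deque<std::unique_ptr<Connection>> idle;
    std::size_t target_pool_size = 100;
    std::size_t total = 0;  // idle + in-use connections

    std::unique_ptr<Connection> acquire() {
        // Never hand out an unhealthy connection; drop it instead.
        while (!idle.empty() && !idle.front()->healthy) {
            idle.pop_front();
            --total;
        }
        if (!idle.empty()) {
            auto conn = std::move(idle.front());
            idle.pop_front();
            return conn;
        }
        // No idle connection: create a new one. (The real pool also
        // replenishes in the background toward target_pool_size, and
        // blocks here when creation is impossible.)
        ++total;
        return std::make_unique<Connection>();
    }

    void release(std::unique_ptr<Connection> conn) {
        if (conn->healthy) idle.push_back(std::move(conn));
        else --total;  // failed connection removed; replenished on demand
    }
};
```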
### 3. Data Transmission

- Write batch data to S3 connection using appropriate protocol
- Publish accepted transactions to subscriber system
- Track request as in-flight for flow control
### 4. I/O Event Processing

- Handle epoll events for all in-flight connections
- Process incoming responses and detect connection failures
### 5. Response Handling

- **Ordered Acknowledgment**: Only acknowledge batch after all prior batches are durable
- Release batch via `StageGuard` destructor (publishes to next pipeline stage)
- Publish durability events to subscriber system
- Return healthy connection to pool
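
Because batch numbers count downward (see Recovery and Consistency below), "all prior batches durable" means all higher-numbered batches are durable. A minimal sketch of that bookkeeping, with illustrative names:

```cpp
#include <cstdint>
#include <iterator>
#include <set>

// Tracks out-of-order completions and releases acknowledgments in order.
// Names are illustrative; the real code releases via StageGuard destructors.
struct AckTracker {
    uint64_t watermark;          // newest consecutively durable batch number
    std::set<uint64_t> pending;  // durable batches still awaiting older ones

    explicit AckTracker(uint64_t recovered_watermark)
        : watermark(recovered_watermark) {}

    // Called when a batch's S3 write succeeds (possibly out of order).
    // Returns the number of batches that may now be acknowledged.
    uint64_t on_durable(uint64_t batch_number) {
        pending.insert(batch_number);
        uint64_t acked = 0;
        // Advance while the next-newer batch (watermark - 1) is durable.
        while (!pending.empty() && *pending.rbegin() == watermark - 1) {
            pending.erase(std::prev(pending.end()));
            --watermark;
            ++acked;  // release StageGuard, publish durability event
        }
        return acked;
    }
};
```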
### 6. Failure Handling

- Remove failed connection from pool
- Retry batch with exponential backoff (up to `max_retry_attempts`)
- Backoff delays only affect the specific failing batch
- If retries exhausted, abort process or escalate error
- Initiate pool replenishment if below target
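
The backoff schedule itself is simple doubling from `retry_base_delay_ms`; a sketch (the overflow cap is an illustrative assumption):

```cpp
#include <algorithm>
#include <cstdint>

// Exponential backoff: attempt 0 waits base, attempt 1 waits 2x base, ...
// The shift cap merely guards against overflow in this sketch.
uint64_t retry_delay_ms(uint64_t retry_base_delay_ms, uint32_t attempt) {
    return retry_base_delay_ms << std::min(attempt, 20u);
}
// With the defaults (base = 100 ms, max_retry_attempts = 3), a failing
// batch waits 100 ms, 200 ms, then 400 ms before the error escalates.
```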
## Connection Pool

**Target Size**: `target_pool_size` connections

**Replenishment**: Automatic creation when below target
## Key Design Properties

**Batch Ordering**: Batches may be retried out-of-order for performance, but acknowledgment to next pipeline stage maintains strict ordering.

**Backpressure**: Retry delays for failing batches create natural backpressure that eventually blocks the persistence thread when in-flight limits are reached.

**Graceful Shutdown**: On shutdown signal, drain all in-flight batches to completion before terminating.
## Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `batch_timeout_ms` | 1 ms | Maximum time to wait collecting commits for a batch |
| `batch_size_threshold` | 1 MB | Batch size that triggers immediate processing |
| `max_in_flight_requests` | 50 | Maximum concurrent requests to the persistence backend |
| `target_pool_size` | 100 | Target number of connections to maintain |
| `max_retry_attempts` | 3 | Maximum retries for a failed batch before aborting |
| `retry_base_delay_ms` | 100 ms | Base delay for exponential backoff retries |
## Configuration Validation

**Required Constraints**:

- `batch_size_threshold` > 0 (must process at least one commit per batch)
- `max_in_flight_requests` > 0 (must allow at least one concurrent request)
- `max_in_flight_requests` <= 1000 (required for single-call recovery guarantee)
- `batch_timeout_ms` > 0 (timeout must be positive)
- `max_retry_attempts` >= 0 (zero disables retries)
- `retry_base_delay_ms` > 0 (delay must be positive if retries enabled)
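
These checks are cheap to enforce at startup. A sketch, assuming a hypothetical `PersistenceConfig` struct mirroring the parameter table (names are illustrative):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Illustrative config struct; defaults mirror the parameter table above.
struct PersistenceConfig {
    uint64_t batch_timeout_ms       = 1;
    uint64_t batch_size_threshold   = 1 << 20;  // 1 MB
    uint64_t max_in_flight_requests = 50;
    uint64_t target_pool_size       = 100;
    uint32_t max_retry_attempts     = 3;
    uint64_t retry_base_delay_ms    = 100;
};

void validate(const PersistenceConfig& c) {
    auto require = [](bool ok, const std::string& msg) {
        if (!ok) throw std::invalid_argument(msg);
    };
    require(c.batch_size_threshold > 0, "batch_size_threshold must be > 0");
    require(c.max_in_flight_requests > 0, "max_in_flight_requests must be > 0");
    // <= 1000 keeps recovery to one S3 LIST call (1000-object maximum).
    require(c.max_in_flight_requests <= 1000, "max_in_flight_requests must be <= 1000");
    require(c.batch_timeout_ms > 0, "batch_timeout_ms must be > 0");
    // max_retry_attempts >= 0 holds trivially for an unsigned field;
    // zero simply disables retries.
    require(c.max_retry_attempts == 0 || c.retry_base_delay_ms > 0,
            "retry_base_delay_ms must be > 0 when retries are enabled");
}
```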
## Recovery and Consistency

### Recovery Model

WeaselDB's batched persistence design enables efficient recovery while maintaining strict serializable consistency guarantees.

#### **Batch Ordering and Durability**

**Ordered Acknowledgment Property**: Batches may be retried out-of-order for performance, but acknowledgment to the next pipeline stage maintains strict ordering. This ensures that if batch N is acknowledged as durable, all older batches (higher numbers N+1, N+2, etc.) are also guaranteed durable.

**Durability Watermark**: The system maintains a durable watermark indicating the latest consecutively durable batch. Starting from the highest batch number (the oldest batch) and scanning downward, the watermark is the lowest batch number (the newest) reached before any gap in the sequence. The watermark advances only when all older batches (higher numbers) are confirmed persistent.

**Example**: If batches 100, 99, and 97 exist but 98 is missing, the watermark is 99 (the newest consecutive batch before the gap at 98).
#### **Recovery Protocol**

WeaselDB uses a **sequential batch numbering** scheme with **S3 atomic operations** to provide efficient crash recovery and split-brain prevention without external coordination services.

**Batch Numbering Scheme**:

- Batch numbers start at `2^64 - 1` and count downward: `18446744073709551615, 18446744073709551614, 18446744073709551613, ...`
- Each batch is stored as S3 object `batches/{batch_number:020d}` with zero-padding
- S3 lexicographic ordering on zero-padded numbers returns batches in ascending numerical order (latest batches first)
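
Key generation is a one-line formatting step; a sketch (the helper name is illustrative):

```cpp
#include <cinttypes>
#include <cstdio>
#include <string>

// Format a batch number as its S3 object key. Zero-padding to 20 digits
// (enough for 2^64 - 1) makes lexicographic order match numerical order.
std::string batch_key(uint64_t batch_number) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "batches/%020" PRIu64, batch_number);
    return buf;
}
// batch_key(UINT64_MAX) == "batches/18446744073709551615"
```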
**Terminology**: Since batch numbers decrease over time, we use numerical ordering:

- "Older" batches = higher numbers (written first in time)
- "Newer" batches = lower numbers (written more recently)
- "Most recent" batches = lowest numbers (most recently written)

**Example**: If batches 100, 99, 98, 97 are written, S3 LIST returns them as:
```
batches/00000000000000000097 (newest, lowest batch number)
batches/00000000000000000098
batches/00000000000000000099
batches/00000000000000000100 (oldest, highest batch number)
...
```
**Leadership and Split-Brain Prevention**:

- New persistence thread instances scan S3 to find the highest (oldest) available batch number
- Each batch write uses `If-None-Match: *` to atomically claim the sequential batch number
- Only one instance can successfully claim each batch number, preventing split-brain scenarios
- Batch object content includes a `leader_id` to identify which leader wrote each batch
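
The claim itself is an ordinary PUT carrying the conditional header, as sketched below with a hypothetical `HttpRequest` type (not a real library API; the actual protocol handling lives in higher layers):

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical request type for illustration only.
struct HttpRequest {
    std::string method;
    std::string path;
    std::map<std::string, std::string> headers;
    std::string body;
};

// "If-None-Match: *" turns the PUT into an atomic create-if-absent, so
// exactly one instance can claim a given batch number.
HttpRequest make_claim_request(const std::string& key, std::string payload) {
    HttpRequest req;
    req.method = "PUT";
    req.path = "/" + key;                // e.g. "/batches/00000000000000002000"
    req.headers["If-None-Match"] = "*";  // fail if the object already exists
    req.body = std::move(payload);       // batch content, includes leader_id
    return req;
}
// A 412 Precondition Failed response means another leader already claimed
// this batch number.
```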
**Recovery Scenarios**:

**Clean Shutdown**:

- All in-flight batches are drained to completion before termination
- Durability watermark accurately reflects all durable state
- No recovery required on restart
**Crash Recovery**:

1. **S3 Scan with Bounded Cost**: List S3 objects with prefix `batches/` and limit of 1000 objects
1. **Gap Detection**: Check for missing sequential batch numbers. WeaselDB never puts more than 1000 batches in flight concurrently, so a limit of 1000 is sufficient.
1. **Watermark Reconstruction**: Set durability watermark to the latest consecutive batch (scanning from highest numbers downward, until a gap)
1. **Leadership Transition**: Begin writing batches starting from next available batch number. Skip past any batch numbers already claimed in the durability watermark scan.
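
Steps 2 and 3 reduce to a single pass over the LIST results; a sketch, assuming `listed` holds the batch numbers parsed from one LIST response (names are illustrative):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Reconstruct the durability watermark from one S3 LIST (prefix "batches/",
// limit 1000). `listed` is in ascending numerical order, i.e. newest first.
std::optional<uint64_t> reconstruct_watermark(const std::vector<uint64_t>& listed) {
    if (listed.empty()) return std::nullopt;  // fresh store: no batches yet
    // Scan from the oldest listed batch (highest number, last element) downward.
    uint64_t watermark = listed.back();
    for (auto it = listed.rbegin() + 1; it != listed.rend(); ++it) {
        if (*it != watermark - 1) break;  // gap: newer batches not yet consecutive
        watermark = *it;
    }
    return watermark;
}
// Example (from the scenario below): listed = {1001, ..., 1999, 2001} yields
// 2001, since batch 2000 is missing; the next batch is then written as 2000.
```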
**Bounded Recovery Guarantee**: Since at most 1000 batches can be in flight during a crash, any gap in the sequential numbering (which marks the durability watermark) must appear within the first 1000 S3 objects. This is because:

1. At most 1000 batches can be incomplete when the crash occurs
1. S3 LIST returns objects in ascending numerical order (most recent batches first due to countdown numbering)
1. The first gap found represents the boundary between durable and potentially incomplete batches
1. S3 LIST operations return at most 1000 objects per request
1. Therefore, scanning 1000 objects (the maximum S3 allows in one request) is sufficient to find this boundary

This ensures **O(1) recovery time** regardless of database size, with at most **one S3 LIST operation** required.
**Recovery Protocol Detail**: Even with exactly 1000 batches in flight, recovery works correctly.

**Example Scenario**: Batches 2000 down to 1001 (1000 batches) are in flight when the crash occurs:

- The previous successful run had written through batch 2001
- Worst case: batch 2000 (the oldest in-flight batch) fails, while batches 1999 down to 1001 (newer) all succeed
- S3 LIST (limit=1000) returns: 1001, 1002, ..., 1998, 1999, 2001 (ascending numerical order)
- Gap detection: the missing batch 2000 creates a gap between 1999 and 2001 in the consecutive sequence
- The watermark is set to 2001 (the newest consecutive batch before the gap at 2000)
- Recovery scans exactly 1000 objects, finds the gap, and sets the correct watermark
- The next batch is written as 2000, filling the gap

**Recovery Performance Limits**: To maintain the single-call recovery guarantee, `max_in_flight_requests` is limited to **1000**, matching S3's maximum number of objects per LIST operation. This ensures a single S3 API call is sufficient for recovery.