# Persistence Thread Design

## Overview
The persistence thread receives commit batches from the main processing pipeline and uploads them to S3. It uses a single-threaded design with connection pooling and batching for optimal performance.
## Architecture

- Input: Commits arrive via the `ThreadPipeline` interface from upstream processing
- Output: Batched commits uploaded to the S3 persistence backend
- Transport: Single-threaded TCP client with connection pooling
- Protocol: Higher layers handle HTTP, authentication, and S3-specific details
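As a rough orientation, a minimal sketch of the thread's external shape follows. Only `ThreadPipeline` is taken from this document; the other type and member names are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

struct Commit { std::vector<char> payload; };               // placeholder commit record
struct S3Connection { int fd = -1; bool healthy = true; };   // pooled TCP connection (illustrative)

// Hypothetical shape of the persistence thread: commits in via the pipeline,
// batches out over pooled TCP connections; HTTP/auth/S3 framing lives in higher layers.
class PersistenceThread {
public:
    void run();  // main processing loop, steps 1-6 below
private:
    std::vector<Commit>       pending_;        // commits collected for the next batch
    std::vector<S3Connection> pool_;           // connection pool (see "Connection Pool")
    std::size_t               in_flight_ = 0;  // requests awaiting responses
};
```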
## Batching Strategy

The persistence thread collects commits into batches using two trigger conditions:

- Time Trigger: `batch_timeout_ms` elapsed since batch collection started
- Size Trigger: `batch_size_threshold` commits collected (can be exceeded by the final commit)

Flow Control: When `max_in_flight_requests` is reached, block until responses are received.
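A minimal sketch of the two triggers and the flow-control gate, assuming a small config struct whose fields mirror the parameter table below; the function names are illustrative:

```cpp
#include <chrono>
#include <cstddef>

struct BatchingConfig {
    std::chrono::milliseconds batch_timeout_ms{5};
    std::size_t batch_size_threshold = 0;
    std::size_t max_in_flight_requests = 0;
};

// Size trigger: enough commits collected (the final commit may push past the threshold).
bool size_triggered(std::size_t collected, const BatchingConfig& cfg) {
    return collected >= cfg.batch_size_threshold;
}

// Time trigger: batch_timeout_ms elapsed since collection of this batch began.
bool time_triggered(std::chrono::steady_clock::time_point batch_start,
                    const BatchingConfig& cfg) {
    return std::chrono::steady_clock::now() - batch_start >= cfg.batch_timeout_ms;
}

// Flow control: block for responses before sending once the in-flight cap is reached.
bool must_wait_for_responses(std::size_t in_flight, const BatchingConfig& cfg) {
    return in_flight >= cfg.max_in_flight_requests;
}
```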
## Main Processing Loop

### 1. Batch Collection

No In-Flight Requests:

- Use blocking acquire to get the first commit batch
- Process immediately (no batching delay)

With In-Flight Requests:

- Check flow control: if at `max_in_flight_requests`, block for responses
- Collect commits using non-blocking acquire until a trigger condition is met (sketched after this list):
  - Check for available commits (non-blocking)
  - If `batch_size_threshold` is reached → process the batch immediately
  - If below the threshold → use `epoll_wait(batch_timeout_ms)` for I/O and the timeout
  - On timeout → process the collected commits
  - If no commits are available and there are no in-flight requests → switch to blocking acquire
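A sketch of the collection step, reusing the `Commit`, `BatchingConfig`, and trigger helpers from the earlier sketches; `Pipeline` and `wait_for_io` are hypothetical stand-ins for the real `ThreadPipeline` and epoll wrappers:

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

struct Pipeline {
    void acquire_blocking(std::vector<Commit>& out);  // blocks until commits arrive
    bool try_acquire(std::vector<Commit>& out);       // non-blocking; true if any were added
};
bool wait_for_io(std::chrono::milliseconds timeout);   // epoll_wait wrapper; true on timeout

std::vector<Commit> collect_batch(Pipeline& pipeline, std::size_t in_flight,
                                  const BatchingConfig& cfg) {
    std::vector<Commit> batch;
    if (in_flight == 0) {
        // No in-flight requests: blocking acquire, then process immediately (no batching delay).
        pipeline.acquire_blocking(batch);
        return batch;
    }
    const auto start = std::chrono::steady_clock::now();
    while (!size_triggered(batch.size(), cfg)) {
        if (!pipeline.try_acquire(batch)) {
            // Nothing available: wait on epoll for responses or until the batch timeout fires.
            if (wait_for_io(cfg.batch_timeout_ms)) break;  // time trigger
        }
        if (time_triggered(start, cfg)) break;             // time trigger
    }
    return batch;  // process whatever was collected when a trigger fired
}
```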
### 2. Connection Management

- Acquire a healthy connection from the pool
- Create new connections if the pool is below `target_pool_size`
- If no healthy connections are available, block until one becomes available
- Maintain automatic pool replenishment (see the sketch below)
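A sketch of pool acquisition and replenishment under the same assumptions; `ConnectionPool` and `open_s3_connection` are illustrative, and the real thread would block on pool events rather than open a connection synchronously when none are idle:

```cpp
#include <cstddef>
#include <deque>

struct ConnectionPool {
    std::deque<S3Connection> idle;
    std::size_t target_pool_size = 0;
};

S3Connection open_s3_connection();  // hypothetical: establishes a new TCP connection

// Top up the pool whenever it holds fewer connections than the target.
void replenish(ConnectionPool& pool) {
    while (pool.idle.size() < pool.target_pool_size)
        pool.idle.push_back(open_s3_connection());
}

// Acquire a healthy connection, discarding unhealthy ones along the way.
S3Connection acquire_connection(ConnectionPool& pool) {
    while (!pool.idle.empty()) {
        S3Connection conn = pool.idle.front();
        pool.idle.pop_front();
        if (conn.healthy) return conn;
    }
    return open_s3_connection();  // placeholder for "block until one becomes available"
}
```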
### 3. Data Transmission
- Write batch data to S3 connection using appropriate protocol
- Publish accepted transactions to subscriber system
- Track request as in-flight for flow control
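A sketch of the transmit step, keeping the batch around for possible retry; `encode_batch`, `write_all`, and `publish_accepted` are hypothetical helpers:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

std::vector<char> encode_batch(const std::vector<Commit>& batch);    // hypothetical framing
void write_all(S3Connection& conn, const std::vector<char>& bytes);  // hypothetical blocking write
void publish_accepted(const std::vector<Commit>& batch);             // hypothetical subscriber hook

struct InFlightRequest {
    std::uint64_t batch_seq;     // position in commit order, used for ordered acknowledgment
    S3Connection conn;           // connection carrying this request
    std::vector<Commit> batch;   // retained so the batch can be retried on failure
};

void transmit(std::unordered_map<int, InFlightRequest>& in_flight,
              S3Connection conn, std::uint64_t batch_seq, std::vector<Commit> batch) {
    write_all(conn, encode_batch(batch));  // protocol details handled by higher layers
    publish_accepted(batch);               // announce accepted transactions to subscribers
    in_flight.emplace(conn.fd, InFlightRequest{batch_seq, conn, std::move(batch)});
}
```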
### 4. I/O Event Processing
- Handle epoll events for all in-flight connections
- Monitor connection health via heartbeats
- Process incoming responses and detect connection failures
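A sketch of the epoll dispatch; `on_response_readable` and `on_connection_failure` are hypothetical callbacks into the response- and failure-handling steps below:

```cpp
#include <sys/epoll.h>

void on_response_readable(int fd);   // parse responses, check heartbeats (hypothetical)
void on_connection_failure(int fd);  // see "6. Failure Handling" (hypothetical)

void process_io_events(int epoll_fd, int timeout_ms) {
    epoll_event events[64];
    int n = epoll_wait(epoll_fd, events, 64, timeout_ms);
    for (int i = 0; i < n; ++i) {
        int fd = events[i].data.fd;
        if (events[i].events & (EPOLLERR | EPOLLHUP)) {
            on_connection_failure(fd);   // connection failed or was closed
        } else if (events[i].events & EPOLLIN) {
            on_response_readable(fd);    // response or heartbeat data available
        }
    }
}
```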
### 5. Response Handling

- Ordered Acknowledgment: Only acknowledge a batch after all prior batches are durable
- Release batch via the `StageGuard` destructor (publishes to the next pipeline stage)
- Publish durability events to the subscriber system
- Return the healthy connection to the pool
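A sketch of ordered acknowledgment: responses may complete out of order, but batches are only released in sequence. `release_stage_guard` and `publish_durable` are hypothetical stand-ins for the `StageGuard` destructor and the subscriber notification:

```cpp
#include <cstdint>
#include <map>
#include <vector>

void release_stage_guard(std::vector<Commit>& batch);    // hypothetical: publish to next stage
void publish_durable(const std::vector<Commit>& batch);  // hypothetical: durability event

struct Acknowledger {
    std::uint64_t next_to_ack = 0;                         // lowest sequence not yet acknowledged
    std::map<std::uint64_t, std::vector<Commit>> durable;  // completed but not yet acknowledged

    void on_durable(std::uint64_t seq, std::vector<Commit> batch) {
        durable.emplace(seq, std::move(batch));
        // Release strictly in order: a batch is acknowledged only once every prior batch is durable.
        while (!durable.empty() && durable.begin()->first == next_to_ack) {
            release_stage_guard(durable.begin()->second);
            publish_durable(durable.begin()->second);
            durable.erase(durable.begin());
            ++next_to_ack;
        }
    }
};
```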
### 6. Failure Handling

- Remove the failed connection from the pool
- Retry the batch with exponential backoff (up to `max_retry_attempts`)
- Backoff delays only affect the specific failing batch
- If retries are exhausted, abort the process or escalate the error
- Initiate pool replenishment if below target
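A minimal sketch of the per-batch backoff computation, using the parameter names from the configuration table; with the defaults (100 ms base, 3 attempts) this yields delays of 100 ms, 200 ms, and 400 ms before the error is escalated:

```cpp
#include <chrono>
#include <cstdint>

// Delay before retry attempt `attempt` (1-based): retry_base_delay_ms * 2^(attempt - 1).
// Only the failing batch waits; other in-flight batches are unaffected.
std::chrono::milliseconds backoff_delay(std::uint32_t attempt,
                                        std::chrono::milliseconds retry_base_delay_ms) {
    return retry_base_delay_ms * (1u << (attempt - 1));
}
```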
## Connection Pool

- Target Size: `target_pool_size` connections (recommended: 2x `max_in_flight_requests`)
- Replenishment: Automatic creation when below target
- Health Monitoring: Heartbeat-based connection validation
- Sizing Rationale: The 2x multiplier ensures availability during peak load and connection replacement
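For example, with `max_in_flight_requests = 4` the recommended sizing gives a pool of 8 connections, leaving spares available while failed connections are replaced (example values only):

```cpp
#include <cstddef>

constexpr std::size_t max_in_flight_requests = 4;                     // example value
constexpr std::size_t target_pool_size = 2 * max_in_flight_requests;  // recommended 2x => 8
static_assert(target_pool_size >= max_in_flight_requests,
              "pool must accommodate all in-flight requests");
```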
## Key Design Properties

- Batch Ordering: Batches may be retried out of order for performance, but acknowledgment to the next pipeline stage maintains strict ordering.
- Backpressure: Retry delays for failing batches create natural backpressure that eventually blocks the persistence thread when in-flight limits are reached.
- Graceful Shutdown: On a shutdown signal, drain all in-flight batches to completion before terminating.
## Configuration Parameters

| Parameter | Default | Description |
|---|---|---|
| `batch_timeout_ms` | 5 ms | Maximum time to wait collecting commits for batching |
| `batch_size_threshold` | - | Threshold for triggering batch processing |
| `max_in_flight_requests` | - | Maximum concurrent requests to the persistence backend |
| `target_pool_size` | 2x in-flight | Target number of connections to maintain |
| `max_retry_attempts` | 3 | Maximum retries for failed batches before aborting |
| `retry_base_delay_ms` | 100 ms | Base delay for exponential backoff retries |
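One possible C++ mirror of this table (field names follow the parameters; the struct itself and the type choices are assumptions). Parameters with a `-` default have no built-in value and must be supplied:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

struct PersistenceConfig {
    std::chrono::milliseconds batch_timeout_ms{5};       // default: 5 ms
    std::size_t batch_size_threshold = 0;                 // no default; must be configured
    std::size_t max_in_flight_requests = 0;               // no default; must be configured
    std::size_t target_pool_size = 0;                     // recommended: 2x max_in_flight_requests
    std::uint32_t max_retry_attempts = 3;                 // default: 3
    std::chrono::milliseconds retry_base_delay_ms{100};   // default: 100 ms
};
```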
## Configuration Validation

Required Constraints:

- `batch_size_threshold` > 0 (must process at least one commit per batch)
- `max_in_flight_requests` > 0 (must allow at least one concurrent request)
- `target_pool_size` >= `max_in_flight_requests` (pool must accommodate all in-flight requests)
- `batch_timeout_ms` > 0 (timeout must be positive)
- `max_retry_attempts` >= 0 (zero disables retries)
- `retry_base_delay_ms` > 0 (delay must be positive if retries are enabled)
Performance Recommendations:

- `target_pool_size` <= 2x `max_in_flight_requests` (optimal for performance)
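A minimal validation sketch over the assumed `PersistenceConfig` mirror above; it enforces the required constraints and leaves the performance recommendation as guidance only:

```cpp
#include <chrono>

bool validate(const PersistenceConfig& cfg) {
    using std::chrono::milliseconds;
    if (cfg.batch_size_threshold == 0) return false;        // must process at least one commit per batch
    if (cfg.max_in_flight_requests == 0) return false;      // must allow at least one concurrent request
    if (cfg.target_pool_size < cfg.max_in_flight_requests)  // pool must cover all in-flight requests
        return false;
    if (cfg.batch_timeout_ms <= milliseconds{0}) return false;      // timeout must be positive
    if (cfg.max_retry_attempts > 0 &&
        cfg.retry_base_delay_ms <= milliseconds{0}) return false;   // positive delay when retries enabled
    // Recommendation only (not enforced): target_pool_size <= 2 * max_in_flight_requests.
    return true;
}
```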