Andrew Noyes 5bfa20643a Update design.md

2025-08-19 13:33:31 -04:00

WeaselDB Development Guide

Project Summary

WeaselDB is a high-performance write-side database component designed for systems where reading and writing are decoupled. The system focuses exclusively on handling transactional commits with optimistic concurrency control, while readers are expected to maintain their own queryable representations by subscribing to change streams.

Quick Start

Build System

Use CMake with the C++20 standard
Primary build commands:
- mkdir -p build && cd build
- cmake .. -DCMAKE_BUILD_TYPE=Release (add -G Ninja to build with ninja)
- ninja or make -j$(nproc), matching the generator chosen above

Testing and Development Workflow

Run all tests: ninja test or ctest
Individual test targets:
- ./test_arena_allocator - Arena allocator unit tests
- ./test_commit_request - JSON parsing and validation tests
Benchmarking:
- ./bench_arena_allocator - Memory allocation performance
- ./bench_commit_request - JSON parsing performance
- ./bench_parser_comparison - Compare against nlohmann::json and RapidJSON
Debug tools: ./debug_arena - Analyze arena allocator behavior

Code Style and Conventions

C++ Style: Modern C++20 with RAII and move semantics
Memory Management: Prefer arena allocation over standard allocators
String Handling: Use std::string_view for zero-copy operations
Error Handling: Return error codes or use exceptions appropriately
Naming: snake_case for variables/functions, PascalCase for classes
Performance: Always consider allocation patterns and cache locality

Dependencies and External Libraries

weaseljson: Must be installed system-wide (high-performance JSON parser)
simdutf: Fetched automatically (SIMD base64 encoding/decoding)
toml11: Fetched automatically (TOML configuration parsing)
doctest: Fetched automatically (testing framework)
nanobench: Fetched automatically (benchmarking library)
gperf: System requirement for perfect hash generation

Architecture Overview

Core Components

1. Arena Allocator (`src/arena_allocator.hpp`)

Ultra-fast memory allocator (~1ns per allocation vs ~20-270ns for malloc)
Lazy initialization with geometric block growth (doubling strategy)
Intrusive linked list design for minimal memory overhead
Memory-efficient reset that keeps the first block and frees others
STL-compatible interface via ArenaStlAllocator

Key features:

O(1) amortized allocation
Proper alignment handling for all types
Move semantics for efficient transfers
Requires trivially destructible types only
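
The properties above (lazy initialization, doubling block growth, an intrusive block list, and a reset that keeps the first block) can be illustrated with a minimal sketch. `SketchArena` and all of its members are hypothetical names chosen for this example; the real `ArenaAllocator` in `src/arena_allocator.hpp` may differ in layout, API, and growth policy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical sketch, not the ArenaAllocator from src/arena_allocator.hpp.
class SketchArena {
  struct Block {           // intrusive header placed at the start of each block
    Block* prev;           // previously allocated block, or nullptr
    std::size_t capacity;  // usable bytes that follow the header
    std::size_t used;      // bytes already handed out
  };

  Block* current_ = nullptr;  // stays null until the first allocation (lazy init)
  std::size_t initial_;

  static char* payload(Block* b) { return reinterpret_cast<char*>(b + 1); }

  void add_block(std::size_t min_bytes) {
    // Geometric growth: double the current block's capacity.
    std::size_t cap = current_ ? current_->capacity * 2 : initial_;
    if (cap < min_bytes) cap = min_bytes;
    void* raw = std::malloc(sizeof(Block) + cap);
    if (raw == nullptr) throw std::bad_alloc();
    current_ = new (raw) Block{current_, cap, 0};
  }

public:
  explicit SketchArena(std::size_t initial_bytes = 1024) : initial_(initial_bytes) {}
  SketchArena(const SketchArena&) = delete;
  SketchArena& operator=(const SketchArena&) = delete;

  ~SketchArena() {
    while (current_ != nullptr) {
      Block* older = current_->prev;
      std::free(current_);
      current_ = older;
    }
  }

  // Bump allocation. No destructors are ever run, hence the
  // "trivially destructible types only" requirement.
  void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
    assert(align != 0 && (align & (align - 1)) == 0);  // must be a power of two
    if (current_ != nullptr) {
      const auto base = reinterpret_cast<std::uintptr_t>(payload(current_));
      const std::uintptr_t aligned =
          (base + current_->used + align - 1) & ~(static_cast<std::uintptr_t>(align) - 1);
      const std::size_t end = static_cast<std::size_t>(aligned - base) + size;
      if (end <= current_->capacity) {
        current_->used = end;
        return reinterpret_cast<void*>(aligned);
      }
    }
    add_block(size + align);  // large enough even after alignment padding
    return allocate(size, align);
  }

  // Keeps the first block for reuse and frees every later one.
  void reset() {
    if (current_ == nullptr) return;
    while (current_->prev != nullptr) {
      Block* older = current_->prev;
      std::free(current_);
      current_ = older;
    }
    current_->used = 0;
  }

  std::size_t block_count() const {
    std::size_t n = 0;
    for (const Block* b = current_; b != nullptr; b = b->prev) ++n;
    return n;
  }
};
```

With a 64-byte initial block, a 16-byte allocation followed by a 100-byte allocation forces a second block; `reset()` then returns the arena to a single empty block.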

2. Commit Request Data Model (`src/commit_request.hpp`)

Format-agnostic data structure for representing transactional commits
Arena-backed string storage with efficient memory management
Move-only semantics for optimal performance
Builder pattern for constructing commit requests
Zero-copy string views pointing to arena-allocated memory

3. JSON Commit Request Parser (`src/json_commit_request_parser.{hpp,cpp}`)

High-performance JSON parser using weaseljson library
Streaming parser support for incremental parsing of network data
gperf-optimized token recognition for fast JSON key parsing
Base64 decoding using SIMD-accelerated simdutf
Comprehensive validation of transaction structure

Parser capabilities:

One-shot parsing for complete JSON
Streaming parsing for network protocols
Parse state management with error recovery
Memory-efficient string views backed by arena storage
Perfect hash table lookup for JSON keys using gperf

4. Parser Interface (`src/commit_request_parser.hpp`)

Abstract base class for commit request parsers
Format-agnostic parsing interface supporting multiple serialization formats
Streaming and one-shot parsing modes
Standardized error handling across parser implementations

5. Configuration System (`src/config.{hpp,cpp}`)

TOML-based configuration using toml11 library
Structured configuration with server, commit, and subscription sections
Default fallback values for all configuration options
Type-safe parsing with validation and bounds checking
Comprehensive validation with meaningful error messages

See config.md for complete configuration documentation.

6. JSON Token Optimization (`src/json_tokens.gperf`, `src/json_token_enum.hpp`)

Perfect hash table generated by gperf for O(1) JSON key lookup
Compile-time token enumeration for type-safe key identification
Minimal perfect hash reduces memory overhead and improves cache locality
Build-time code generation ensures optimal performance
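
To show what a perfect hash lookup looks like, here is a hand-written sketch. It is not gperf output: the hash function was picked by hand for this key set, the table is perfect but not minimal (16 slots for 5 keys), and the key names are assumed from the field names under Transaction Structure. `SketchToken`, `sketch_hash`, and `lookup` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical token set; the real enumeration lives in src/json_token_enum.hpp.
enum class SketchToken { Unknown, RequestId, LeaderId, ReadVersion, Preconditions, Operations };

struct Slot {
  std::string_view key;
  SketchToken token;
};

// Hand-picked hash that is collision-free for the five keys below only:
// (length + first byte) mod 16.
constexpr std::size_t sketch_hash(std::string_view s) {
  return s.empty() ? 0 : (s.size() + static_cast<unsigned char>(s[0])) % 16;
}

constexpr std::array<Slot, 16> build_table() {
  std::array<Slot, 16> table{};  // empty slots hold an empty key and Unknown
  constexpr Slot keys[] = {
      {"request_id", SketchToken::RequestId},
      {"leader_id", SketchToken::LeaderId},
      {"read_version", SketchToken::ReadVersion},
      {"preconditions", SketchToken::Preconditions},
      {"operations", SketchToken::Operations},
  };
  for (const Slot& k : keys) table[sketch_hash(k.key)] = k;
  return table;
}

inline constexpr std::array<Slot, 16> kTable = build_table();

// One hash, one string comparison, no probing loop.
constexpr SketchToken lookup(std::string_view key) {
  const Slot& slot = kTable[sketch_hash(key)];
  return slot.key == key ? slot.token : SketchToken::Unknown;
}

// Compile-time check that no two keys landed in the same slot.
static_assert(lookup("request_id") == SketchToken::RequestId);
static_assert(lookup("leader_id") == SketchToken::LeaderId);
static_assert(lookup("read_version") == SketchToken::ReadVersion);
static_assert(lookup("preconditions") == SketchToken::Preconditions);
static_assert(lookup("operations") == SketchToken::Operations);
```

The final string comparison is still needed so that unknown keys hashing into an occupied slot are rejected.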

7. Server (`src/server.{hpp,cpp}`)

High-performance multi-threaded networking using epoll with thread pools
Factory pattern construction via Server::create() ensures proper shared_ptr semantics
Safe shutdown mechanism with async-signal-safe shutdown() method
Connection ownership management with automatic cleanup on server destruction
Pluggable protocol handlers via ConnectionHandler interface

Key features:

Multi-threaded architecture: separate accept and network thread pools
EPOLLEXCLUSIVE load balancing across accept threads
Connection lifecycle safety with weak_ptr references
Graceful shutdown with proper resource cleanup
RAII-based connection management with unique_ptr ownership
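
As a rough illustration of exclusive wakeups, the Linux-only sketch below registers one descriptor on two epoll instances with the `EPOLLEXCLUSIVE` flag (the spelling used by the Linux headers). It assumes one epoll instance per accept thread, which may not match how `src/server.cpp` is organized, and it uses an eventfd as a stand-in for a listening socket. `add_exclusive` and `demo_exclusive_registration` are hypothetical names.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

// Registers target_fd for read readiness. With EPOLLEXCLUSIVE, when several
// epoll instances watch the same descriptor, the kernel wakes one or more of
// them instead of all of them. The flag is only valid with EPOLL_CTL_ADD.
bool add_exclusive(int epoll_fd, int target_fd) {
  epoll_event ev{};
  ev.events = EPOLLIN | EPOLLEXCLUSIVE;
  ev.data.fd = target_fd;
  return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, target_fd, &ev) == 0;
}

// Assumption for this sketch: each accept thread owns its own epoll instance.
bool demo_exclusive_registration() {
  const int shared_fd = eventfd(0, EFD_NONBLOCK);  // stand-in for a listening socket
  const int ep1 = epoll_create1(0);
  const int ep2 = epoll_create1(0);
  const bool ok = shared_fd >= 0 && ep1 >= 0 && ep2 >= 0 &&
                  add_exclusive(ep1, shared_fd) && add_exclusive(ep2, shared_fd);
  if (ep1 >= 0) close(ep1);
  if (ep2 >= 0) close(ep2);
  if (shared_fd >= 0) close(shared_fd);
  return ok;
}
```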

8. Connection (`src/connection.{hpp,cpp}`)

Efficient per-connection state management with arena-based memory allocation
Safe ownership transfer between server threads and protocol handlers
Automatic cleanup on connection closure or server shutdown
Handler interface isolation - only exposes necessary methods to protocol handlers

Key features:

Arena allocator per connection for efficient memory management
Request/Response arena lifecycle: Arena resets after each complete request/response cycle
Weak reference to server for safe cleanup after server destruction
Private networking details accessible only to Server via friend relationship
Public handler interface: appendMessage(), closeAfterSend(), getArena(), getId()
Thread-safe ownership transfer with Server::releaseBackToServer()

9. ConnectionHandler Interface (`src/connection_handler.hpp`)

Abstract protocol interface decoupling networking from application logic
Ownership transfer support allowing handlers to take connections for async processing
Streaming data processing with partial message handling
Connection lifecycle hooks for initialization and cleanup

Key features:

process_data() with unique_ptr& for ownership transfer
ProcessResult enum for connection lifecycle control (Continue/CloseAfterSend/CloseNow)
on_connection_established/closed() hooks for protocol state management
Zero-copy data processing with arena allocator integration
Thread-safe ownership transfer via Server::releaseBackToServer()

Data Model

Transaction Structure

CommitRequest {
  - request_id: Optional unique identifier
  - leader_id: Expected leader for consistency
  - read_version: Snapshot version for preconditions
  - preconditions[]: Optimistic concurrency checks
    - point_read: Single key existence/content validation
    - range_read: Range-based consistency validation
  - operations[]: Ordered mutation operations
    - write: Set key-value pair
    - delete: Remove single key
    - range_delete: Remove key range
}
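
One way to picture this structure in C++ is sketched below. Every type and field name here is hypothetical and simply mirrors the outline above; the real `CommitRequest` in `src/commit_request.hpp` uses arena-backed storage rather than `std::vector`, and its field types may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <variant>
#include <vector>

// Hypothetical mirror of the outline above, not the type in src/commit_request.hpp.
struct PointRead { std::string_view key; };
struct RangeRead { std::string_view begin, end; };
using Precondition = std::variant<PointRead, RangeRead>;

struct Write { std::string_view key, value; };
struct Delete { std::string_view key; };
struct RangeDelete { std::string_view begin, end; };
using Operation = std::variant<Write, Delete, RangeDelete>;

struct SketchCommitRequest {
  std::optional<std::string_view> request_id;  // optional unique identifier
  std::string_view leader_id;                  // expected leader
  std::int64_t read_version = 0;               // snapshot version for preconditions
  std::vector<Precondition> preconditions;     // optimistic concurrency checks
  std::vector<Operation> operations;           // applied in order
};

// Builds a small request. The views point at string literals, which have
// static storage duration, so they stay valid after this function returns.
SketchCommitRequest make_example() {
  SketchCommitRequest req;
  req.request_id = "example-id";
  req.leader_id = "leader-123";
  req.read_version = 42;
  req.preconditions.push_back(PointRead{"user/1"});
  req.operations.push_back(Write{"user/1", "alice"});
  req.operations.push_back(Delete{"user/2"});
  req.operations.push_back(RangeDelete{"tmp/", "tmp0"});
  return req;
}

std::size_t count_writes(const SketchCommitRequest& req) {
  std::size_t n = 0;
  for (const Operation& op : req.operations) {
    if (std::holds_alternative<Write>(op)) ++n;
  }
  return n;
}
```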

Memory Management

Arena-based allocation ensures efficient bulk memory management per connection
String views eliminate unnecessary copying of JSON data
Zero-copy design for binary data handling
RAII-based connection lifecycle with automatic cleanup on destruction
Safe ownership transfer between server threads and protocol handlers
Weak reference safety prevents crashes when connections outlive server

Connection Ownership Model:

Creation: Accept threads create connections, transfer to epoll as raw pointers
Processing: Network threads claim ownership by wrapping in unique_ptr
Handler Transfer: Handlers can take ownership for async processing via unique_ptr.release()
Return Path: Handlers use Server::releaseBackToServer() to return connections
Safety: All transfers use weak_ptr to server for safe cleanup
Cleanup: RAII ensures proper resource cleanup in all scenarios
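
The return path and the weak_ptr safety rule can be sketched in isolation. `MiniServer`, `MiniConnection`, and `release_back` are hypothetical stand-ins for `Server`, `Connection`, and `Server::releaseBackToServer()`; the sketch only shows the ownership logic and leaves out epoll entirely.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

struct MiniServer;

struct MiniConnection {
  std::weak_ptr<MiniServer> server;  // does not keep the server alive
  int id;
};

struct MiniServer {
  std::mutex mu;
  std::vector<std::unique_ptr<MiniConnection>> returned;

  // Static so that a handler can call it without holding a server pointer.
  static void release_back(std::unique_ptr<MiniConnection> conn) {
    if (std::shared_ptr<MiniServer> alive = conn->server.lock()) {
      std::lock_guard<std::mutex> lock(alive->mu);
      alive->returned.push_back(std::move(conn));
      return;
    }
    // Server already destroyed: conn goes out of scope here and is freed.
  }
};

// Handler takes ownership as a raw pointer, then hands it back.
std::size_t round_trip() {
  auto server = std::make_shared<MiniServer>();
  auto conn = std::make_unique<MiniConnection>(MiniConnection{server, 1});
  MiniConnection* raw = conn.release();
  MiniServer::release_back(std::unique_ptr<MiniConnection>(raw));
  return server->returned.size();
}

// The connection outlives the server; returning it must not touch freed memory.
bool survives_server_destruction() {
  auto server = std::make_shared<MiniServer>();
  auto conn = std::make_unique<MiniConnection>(MiniConnection{server, 2});
  server.reset();
  MiniServer::release_back(std::move(conn));
  return true;
}
```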

Arena Memory Lifecycle:

Request Processing: Handler uses conn->getArena() to allocate memory for parsing request data
Response Generation: Handler uses arena for temporary response construction (headers, JSON, etc.)
Response Queuing: Handler calls conn->appendMessage() which copies data to arena-backed message queue
Response Writing: Server writes all queued messages to socket via writeBytes()
Arena Reset: After successful write completion, arena resets to reclaim all memory from the request/response cycle

This design assumes request/response pairs (HTTP-like protocols) but works for any protocol where there's a clear completion point for memory reclamation.

API Design

The system implements a RESTful API with four endpoints (the last two are implied by the design rather than specified):

GET /v1/version: Retrieve current committed version and leader
POST /v1/commit: Submit transactional operations
GET /v1/subscribe: Stream committed transactions (implied)
GET /v1/status: Check commit status by request_id (implied)
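
A minimal routing sketch for these endpoints follows. `Endpoint` and `route` are hypothetical names, and dropping everything after `?` before matching is an assumption about how a `request_id` might be passed to the status endpoint; the source does not specify that detail.

```cpp
#include <string_view>

// Hypothetical dispatcher for the endpoints listed above.
enum class Endpoint { Version, Commit, Subscribe, Status, NotFound };

constexpr Endpoint route(std::string_view method, std::string_view target) {
  // Assumption: any query string is ignored for routing purposes.
  const std::string_view path = target.substr(0, target.find('?'));
  if (method == "GET" && path == "/v1/version") return Endpoint::Version;
  if (method == "POST" && path == "/v1/commit") return Endpoint::Commit;
  if (method == "GET" && path == "/v1/subscribe") return Endpoint::Subscribe;
  if (method == "GET" && path == "/v1/status") return Endpoint::Status;
  return Endpoint::NotFound;
}

static_assert(route("GET", "/v1/version") == Endpoint::Version);
static_assert(route("GET", "/v1/commit") == Endpoint::NotFound);  // wrong method
```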

Performance Characteristics

Memory Allocation

~1ns per allocation, versus ~20-270ns for malloc
Bulk deallocation eliminates individual free() calls
Geometric growth: each new block doubles the size of the current block
Alignment-aware allocation prevents performance penalties

JSON Parsing

Streaming parser handles large payloads efficiently
Incremental processing suitable for network protocols
Arena storage eliminates string allocation overhead
SIMD-accelerated base64 decoding using simdutf for maximum performance
Perfect hash table provides O(1) JSON key lookup via gperf
Zero hash collisions for known JSON tokens, so key lookup needs no collision handling

Design Principles

Performance-first: Every component optimized for high throughput
Memory efficiency: Arena allocation eliminates fragmentation
Zero-copy: Minimize data copying throughout pipeline
Streaming-ready: Support incremental processing
Type safety: Compile-time validation where possible
Resource management: RAII and move semantics throughout

Testing & Benchmarking

The project includes comprehensive testing infrastructure:

Unit tests using doctest framework
Performance benchmarks using nanobench
Memory allocation benchmarks for arena performance
JSON parsing validation for correctness

Build targets:

test_arena_allocator: Arena allocator functionality tests
test_commit_request: JSON parsing and validation tests
Main server executable (compiled from src/main.cpp)
bench_arena_allocator: Arena allocator performance benchmarks
bench_commit_request: JSON parsing performance benchmarks
bench_parser_comparison: Comparison benchmarks vs nlohmann::json and RapidJSON
debug_arena: Debug tool for arena allocator analysis

Dependencies

weaseljson: High-performance streaming JSON parser
simdutf: SIMD-accelerated UTF-8 validation and base64 encoding/decoding
toml11: TOML configuration file parsing
doctest: Lightweight testing framework
nanobench: Micro-benchmarking library
gperf: Perfect hash function generator for JSON token optimization
nlohmann::json: Reference JSON parser for benchmarking comparisons
RapidJSON: High-performance JSON parser for benchmarking comparisons

Future Considerations

This write-side component is designed to integrate with:

Leader election systems for distributed consensus
Replication mechanisms for fault tolerance
Read-side systems that consume the transaction stream
Monitoring systems for operational visibility

The modular design allows each component to be optimized independently while maintaining clear interfaces for system integration.

Development Guidelines

Important Implementation Details

Server Creation: Always use the Server::create() factory method; the constructor is private, so direct construction does not compile
Connection Ownership: Use unique_ptr semantics for safe ownership transfer between components
Arena Allocator Pattern: Always use ArenaAllocator for temporary allocations within request processing
String View Usage: Prefer std::string_view over std::string when pointing to arena-allocated memory
Ownership Transfer: Use Server::releaseBackToServer() for returning connections to server from handlers
JSON Token Lookup: Use the gperf-generated perfect hash table in json_tokens.hpp for O(1) key recognition
Base64 Handling: Always use simdutf for base64 encoding/decoding for performance
Error Propagation: Use structured error types that can be efficiently returned up the call stack
Thread Safety: Connection ownership transfers are designed to be thread-safe with proper RAII cleanup

File Organization

Core Headers: src/ contains all primary implementation files
Tests: tests/ contains doctest-based unit tests
Benchmarks: benchmarks/ contains nanobench performance tests
Tools: tools/ contains debugging and analysis utilities
Build-Generated: build/ contains CMake-generated files including json_tokens.cpp

Adding New Protocol Handlers

Inherit from ConnectionHandler in src/connection_handler.hpp
Implement process_data() with proper ownership semantics
Use connection's arena allocator for temporary allocations: conn->getArena()
Handle partial messages and streaming protocols appropriately
Return appropriate ProcessResult for connection lifecycle management
Use Server::releaseBackToServer() if taking ownership for async processing
Add corresponding test cases and integration tests
Consider performance implications of ownership transfers

Adding New Parsers

Inherit from CommitRequestParser in src/commit_request_parser.hpp
Implement both streaming and one-shot parsing modes
Use arena allocation for all temporary string storage
Add corresponding test cases in tests/
Add benchmark comparisons in benchmarks/
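
The relationship between the streaming and one-shot modes can be sketched with a toy format. `SketchParser`, `Status`, `feed`, `finish`, and `parse_all` are hypothetical names; the actual `CommitRequestParser` interface in `src/commit_request_parser.hpp` may use different signatures. The toy parser also buffers into a `std::string` for brevity, where a real parser would use arena storage.

```cpp
#include <string>
#include <string_view>

// Hypothetical interface shape, not the one in src/commit_request_parser.hpp.
class SketchParser {
public:
  enum class Status { Complete, NeedMoreData, Error };

  virtual ~SketchParser() = default;

  // Streaming mode: call once per chunk received from the network.
  virtual Status feed(std::string_view chunk) = 0;
  // Signals end of input; incomplete input becomes an error here.
  virtual Status finish() = 0;

  // One-shot mode expressed in terms of the streaming calls.
  Status parse_all(std::string_view data) {
    const Status s = feed(data);
    return s == Status::Error ? s : finish();
  }
};

// Toy format for illustration: a request is a single line ending in '\n'.
class LineParser final : public SketchParser {
  std::string buffer_;
  bool done_ = false;

public:
  Status feed(std::string_view chunk) override {
    if (done_) return Status::Error;  // data after a complete request
    const auto newline = chunk.find('\n');
    if (newline == std::string_view::npos) {
      buffer_.append(chunk);  // partial message: keep it and wait for more
      return Status::NeedMoreData;
    }
    if (newline + 1 != chunk.size()) return Status::Error;  // trailing bytes
    buffer_.append(chunk.substr(0, newline));
    done_ = true;
    return Status::Complete;
  }

  Status finish() override { return done_ ? Status::Complete : Status::Error; }

  const std::string& line() const { return buffer_; }
};
```

Building the one-shot path on top of the streaming path keeps a single code path to test and benchmark.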

Performance Considerations

Memory: Arena allocation eliminates fragmentation - use it for all request-scoped data
CPU: Perfect hashing and SIMD operations are critical paths - avoid alternatives
I/O: Streaming parser design supports incremental network data processing
Cache: String views avoid copying, keeping data cache-friendly

Configuration Management

All configuration is TOML-based using config.toml
Comprehensive documentation available in config.md
Type-safe parsing with validation and bounds checking
Always validate configuration values and provide meaningful errors

Testing Strategy

Unit tests validate individual component correctness
Benchmarks ensure performance characteristics are maintained
Debug tools help analyze memory usage patterns
Always run both tests and benchmarks before submitting changes

Build System Details

CMake generates gperf hash tables at build time
Ninja is preferred over make for faster incremental builds
Release builds include debug symbols for profiling
All external dependencies except weaseljson are auto-fetched

Common Patterns

Server Creation Pattern

// Server must be created via factory method
auto server = Server::create(config, handler);

// Never create on stack or with make_shared (won't compile):
// Server server(config, handler);  // Compiler error - constructor private
// auto server = std::make_shared<Server>(config, handler);  // Compiler error
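
For readers unfamiliar with this pattern, the sketch below shows one common way to get the same compile-time guarantees. `SketchServer` is a hypothetical class; the source does not say how `Server::create()` is implemented internally, so treat this as an illustration of the technique only.

```cpp
#include <memory>
#include <type_traits>

// Hypothetical class showing the private-constructor factory technique.
class SketchServer : public std::enable_shared_from_this<SketchServer> {
public:
  // The only way to obtain an instance: always owned by a shared_ptr.
  static std::shared_ptr<SketchServer> create(int port) {
    // std::make_shared cannot reach the private constructor, so use new here.
    return std::shared_ptr<SketchServer>(new SketchServer(port));
  }

  SketchServer(const SketchServer&) = delete;
  SketchServer& operator=(const SketchServer&) = delete;

  // Connections can hold this without extending the server's lifetime.
  std::weak_ptr<SketchServer> weak_ref() { return weak_from_this(); }

  int port() const { return port_; }

private:
  explicit SketchServer(int port) : port_(port) {}
  int port_;
};

// Stack construction and make_shared are rejected at compile time.
static_assert(!std::is_constructible_v<SketchServer, int>);
static_assert(!std::is_copy_constructible_v<SketchServer>);
```

Because every instance is created through `create()`, `weak_from_this()` always has a control block to refer to, which is what makes weak references from connections safe.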

ConnectionHandler Implementation Patterns

Simple Synchronous Handler

class HttpHandler : public ConnectionHandler {
public:
  ProcessResult process_data(std::string_view data, std::unique_ptr<Connection>& conn_ptr) override {
    // Parse HTTP request using connection's arena
    ArenaAllocator& arena = conn_ptr->getArena();

    // Generate response
    conn_ptr->appendMessage("HTTP/1.1 200 OK\r\n\r\nHello World");

    // Server retains ownership
    return ProcessResult::CloseAfterSend;
  }
};

Async Handler with Ownership Transfer

class AsyncHandler : public ConnectionHandler {
public:
  ProcessResult process_data(std::string_view data, std::unique_ptr<Connection>& conn_ptr) override {
    // Take ownership for async processing
    auto connection = std::move(conn_ptr); // conn_ptr is now null

    // Copy the bytes: data is only a view and may not outlive this call
    std::string payload(data);

    // The task is move-only (it captures a unique_ptr), so work_queue must
    // accept move-only callables; std::function cannot hold one
    work_queue.push([connection = std::move(connection), payload = std::move(payload)]() mutable {
      // Process payload asynchronously
      connection->appendMessage("Async response");

      // Return ownership to server when done
      Server::releaseBackToServer(std::move(connection));
    });

    return ProcessResult::Continue; // Server stops processing this connection because conn_ptr is null
  }
};

Arena-Based String Handling

// Preferred: Zero-copy string view with arena allocation
std::string_view process_json_key(const char* data, ArenaAllocator& arena);

// Avoid: Unnecessary string copies
std::string process_json_key(const char* data);
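
A small self-contained sketch of the preferred form follows. It uses `std::pmr::monotonic_buffer_resource` from the standard library as a stand-in arena, since the `ArenaAllocator` API is not shown in this guide; `intern` and `view_outlives_source` are hypothetical names.

```cpp
#include <cstring>
#include <memory_resource>
#include <string>
#include <string_view>

// Copies src into the arena once and returns a view of that copy. The view
// stays valid for as long as the arena is neither released nor destroyed.
std::string_view intern(std::string_view src, std::pmr::monotonic_buffer_resource& arena) {
  if (src.empty()) return {};
  char* dst = static_cast<char*>(arena.allocate(src.size(), alignof(char)));
  std::memcpy(dst, src.data(), src.size());
  return {dst, src.size()};
}

// Shows that the view no longer depends on the original buffer.
bool view_outlives_source() {
  std::pmr::monotonic_buffer_resource arena;  // stand-in for a per-connection arena
  std::string network_buffer = "leader-123";
  const std::string_view key = intern(network_buffer, arena);
  network_buffer.assign("XXXXXXXXXX");  // buffer gets reused for the next read
  return key == "leader-123";
}
```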

Error Handling Pattern

enum class ParseResult { Success, InvalidJson, MissingField };
ParseResult parse_commit_request(const char* json, CommitRequest& out);
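
A brief sketch of how such a result enum might be consumed. `describe` and `check_leader_id` are hypothetical helpers added for illustration, and the message strings are invented; only the `ParseResult` enumerators come from the declaration above.

```cpp
#include <string_view>

enum class ParseResult { Success, InvalidJson, MissingField };

// Hypothetical helper: text suitable for logs or error responses.
constexpr std::string_view describe(ParseResult r) {
  switch (r) {
    case ParseResult::Success: return "success";
    case ParseResult::InvalidJson: return "invalid JSON";
    case ParseResult::MissingField: return "missing required field";
  }
  return "unknown";
}

// Hypothetical validation step. [[nodiscard]] makes the compiler warn when a
// caller ignores the returned error code.
[[nodiscard]] constexpr ParseResult check_leader_id(std::string_view leader_id) {
  return leader_id.empty() ? ParseResult::MissingField : ParseResult::Success;
}
```

Returning a small enum by value keeps the error path allocation-free, which fits the arena-oriented design described elsewhere in this guide.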

Builder Pattern Usage

CommitRequest request = CommitRequestBuilder(arena)
    .request_id("example-id")
    .leader_id("leader-123")
    .read_version(42)
    .build();

Safe Connection Ownership Transfer

// In handler - take ownership for background processing.
// A raw pointer keeps the lambda copyable, but the connection leaks
// if the task is dropped without ever running.
Connection* raw_conn = conn_ptr.release();

// Process on worker thread
background_processor.submit([raw_conn]() {
  // Do work...
  raw_conn->appendMessage("Background result");

  // Re-wrap and return to server safely (handles server destruction)
  Server::releaseBackToServer(std::unique_ptr<Connection>(raw_conn));
});
