Files

Andrew Noyes bf90b8856a Add mdformat pre-commit hook

2025-09-12 11:24:16 -04:00

25 KiB

Raw Blame History

WeaselDB C++ Style Guide

This document describes the C++ coding style used in the WeaselDB project. These conventions ensure consistency, readability, and maintainability across the codebase.

General Principles
Naming Conventions
File Organization
Code Structure
Memory Management
Error Handling
Documentation
Testing

General Principles

Language Standard

C++20 is the target standard
Use modern C++ features: RAII, move semantics, constexpr, concepts where appropriate
Prefer standard library containers and algorithms over custom implementations

C Library Functions and Headers

Always use std:: prefixed versions of C library functions for consistency and clarity
Use C++ style headers (<cstring>, <cstdlib>, etc.) instead of C style headers (<string.h>, <stdlib.h>, etc.)
This applies to all standard libc functions: std::abort(), std::fprintf(), std::free(), std::memcpy(), std::strlen(), std::strncpy(), std::memset(), std::signal(), etc.
Exception: Functions with no std:: equivalent (e.g., perror(), gai_strerror()) and system-specific headers (e.g., <unistd.h>, <fcntl.h>)

// Preferred - C++ style
#include <cstring>
#include <cstdlib>
#include <csignal>
std::abort();
std::fprintf(stderr, "Error message\n");
std::free(ptr);
std::memcpy(dest, src, size);
std::strlen(str);
std::strncpy(dest, src, n);
std::memset(ptr, value, size);
std::signal(SIGTERM, handler);

// Avoid - C style
#include <string.h>
#include <stdlib.h>
#include <signal.h>
abort();
fprintf(stderr, "Error message\n");
free(ptr);
memcpy(dest, src, size);
strlen(str);
strncpy(dest, src, n);
memset(ptr, value, size);
signal(SIGTERM, handler);

Data Types

Almost always signed - prefer int, int64_t, ssize_t over unsigned types except for:
- Bit manipulation operations
- Interfacing with APIs that require unsigned types
- Where defined unsigned overflow behavior (wraparound) is intentional and desired
Almost always auto - let the compiler deduce types except when:
- The type is not obvious from context (prefer explicit for clarity)
- Specific type requirements matter (numeric conversions, template parameters)
- Interface contracts need explicit types (public APIs, function signatures)
Prefer uninitialized memory to default initialization when using before initializing would be an error
- Valgrind will catch uninitialized memory usage bugs
- Avoid hiding logic errors with unnecessary zero-initialization
- Default initialization can mask bugs and hurt performance
Floating point is for metrics only - avoid float/double in core data structures and algorithms
- Use for performance measurements, statistics, and monitoring data
- Never use for counts, sizes, or business logic

Type Casting

Never use C-style casts - they're unsafe and can hide bugs by performing dangerous conversions
Use C++ cast operators for explicit type conversions with clear intent and safety checks
Avoid reinterpret_cast - almost always indicates poor design; redesign APIs instead
Prefer no casts - design APIs and use types that avoid casting entirely when possible

// Dangerous - C-style casts (NEVER DO THIS)
// int* ptr = (int*)malloc(sizeof(int));     // Unsafe
// int64_t id = (int64_t)some_pointer;       // Dangerous pointer conversion
// BaseClass* base = (BaseClass*)derived;    // Loses type safety

// Acceptable C++ cast operators (use sparingly)
auto ptr = static_cast<int*>(malloc(sizeof(int)));    // Explicit conversion
auto base = static_cast<BaseClass*>(derived_ptr);     // Safe upcast
auto derived = dynamic_cast<DerivedClass*>(base_ptr); // Runtime type checking
auto mutable_ptr = const_cast<int*>(const_ptr);       // Remove const (rare)
// reinterpret_cast can be appropriate for low-level operations (very rare)
auto addr = reinterpret_cast<uintptr_t>(ptr);         // Pointer to integer conversion

Performance Focus

Performance-first design - optimize for the hot path
Simple is fast - find exactly what's necessary, strip away everything else
Complexity must be justified with benchmarks - measure performance impact before adding complexity
Strive for 0% CPU usage when idle - avoid polling, busy waiting, or unnecessary background activity
Use inline functions for performance-critical code (e.g., allocate_raw)
String views with std::string_view to minimize unnecessary copying
Arena allocation for efficient memory management (~1ns vs ~20-270ns for malloc)

String Formatting

Always use format.hpp functions - formats directly into arena-allocated memory
Use static_format() for performance-sensitive code - faster but less flexible than format()
Use format() function with arena allocator for printf-style formatting

// Most performance-sensitive - compile-time optimized concatenation
std::string_view response = static_format(arena,
    "HTTP/1.1 ", status_code, " OK\r\n",
    "Content-Length: ", body.size(), "\r\n",
    "\r\n", body);

// Printf-style formatting - runtime flexible
Arena& arena = conn.get_arena();
std::string_view response = format(arena,
    "HTTP/1.1 %d OK\r\n"
    "Content-Length: %zu\r\n"
    "\r\n%.*s",
    status_code, body.size(),
    static_cast<int>(body.size()), body.data());

Complexity Control

Encapsulation is the main tool for controlling complexity
Header files define the interface - they are the contract with users of your code
Headers should be complete - include everything needed to use the interface effectively:
- Usage examples in comments
- Preconditions and postconditions
- Thread safety guarantees
- Performance characteristics
- Ownership and lifetime semantics
Do not rely on undocumented interface properties - if it's not in the header, don't depend on it

Naming Conventions

Variables and Functions

snake_case for all variables, functions, and member functions
Legacy camelCase exists - the codebase currently contains mixed naming due to historical development. New code should use snake_case. Existing camelCase should be converted to snake_case during natural refactoring (not mass renaming).

int64_t used_bytes() const;
void add_block(int64_t size);
int32_t initial_block_size_;

Classes and Structs

PascalCase for class/struct names
Always use struct keyword - eliminates debates about complexity and maintains consistency
Public members first, private after - puts the interface users care about at the top, implementation details below
Full encapsulation still applies - use private: sections to hide implementation details and maintain deep, capable structs
The struct keyword doesn't mean shallow design - it means interface-first organization for human readers

struct Arena {
  // Public interface first
  explicit Arena(int64_t initial_size = 1024);
  void* allocate_raw(int64_t size);

private:
  // Private members after
  int32_t initial_block_size_;
  Block* current_block_;
};

Enums

PascalCase for enum class names
PascalCase for enum values (not SCREAMING_SNAKE_CASE)

enum class Type {
  PointRead,
  RangeRead
};

enum class ParseState {
  Root,
  PreconditionsArray,
  OperationObject
};

Constants and Macros

snake_case for constants
Avoid macros when possible; prefer constexpr variables

static const WeaselJsonCallbacks json_callbacks;

Member Variables

Trailing underscore for private member variables

private:
  int32_t initial_block_size_;
  Block *current_block_;

Template Parameters

PascalCase for template type parameters

template <typename T, typename... Args>
template <typename T> struct rebind { using type = T*; };

File Organization

Include Organization

Use #pragma once instead of include guards
Never using namespace std - always use fully qualified names for clarity and safety
Include order (applies to both headers and source files):
1. Corresponding header file (for .cpp files only)
2. Standard library headers (alphabetical)
3. Third-party library headers
4. Project headers

#pragma once

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string_view>

#include <simdutf.h>
#include <weaseljson/weaseljson.h>

#include "arena.hpp"
#include "commit_request.hpp"

// Never this:
// using namespace std;

// Always this:
std::vector<int> data;
std::unique_ptr<Parser> parser;

Code Structure

Class Design

Move-only semantics for resource-owning types
Explicit constructors to prevent implicit conversions
Delete copy operations when inappropriate

struct Arena {
  explicit Arena(int64_t initial_size = 1024);

  // Copy construction is not allowed
  Arena(const Arena &source) = delete;
  Arena &operator=(const Arena &source) = delete;

  // Move semantics
  Arena(Arena &&source) noexcept;
  Arena &operator=(Arena &&source) noexcept;

private:
  int32_t initial_block_size_;
  Block *current_block_;
};

Function Design

Const correctness - mark methods const when appropriate
Parameter passing:
- Pass by value for types ≤ 16 bytes (int, pointers, string_view, small structs)
- Pass by const reference for types > 16 bytes (containers, large objects)
Return by value for small types (≤ 16 bytes), string_view to avoid copying strings
noexcept specification for move operations and non-throwing functions

std::span<const Operation> operations() const { return operations_; }
void process_data(std::string_view request_data);  // ≤ 16 bytes, pass by value
void process_request(const CommitRequest& commit_request);  // > 16 bytes, pass by reference
Arena(Arena &&source) noexcept;

Template Usage

Template constraints using static_assert for better error messages
SFINAE or concepts for template specialization

Factory Patterns & Ownership

Static factory methods for complex construction requiring shared ownership
Friend-based factories for access control when constructor should be private
Ownership guidelines:
- unique_ptr for exclusive ownership (most common case)
- shared_ptr only when multiple owners need concurrent access to same object
- Factory methods return appropriate smart pointer type based on ownership needs

// Shared ownership - multiple components need concurrent access
auto server = Server::create(config, handler);  // Returns shared_ptr

// Exclusive ownership - single owner, transfer via move
auto connection = Connection::createForServer(addr, fd, connection_id, handler, server_ref);

// Friend-based factory for access control
struct Connection {
  void append_message(std::string_view message_data);
private:
  Connection(struct sockaddr_storage client_addr, int file_descriptor,
             int64_t connection_id, ConnectionHandler* request_handler,
             std::weak_ptr<Server> server_ref);
  friend struct Server;   // Only Server can construct
};

Control Flow

Early returns to reduce nesting
Range-based for loops when possible

if (size == 0) {
  return nullptr;
}

for (auto &precondition : preconditions_) {
  // ...
}

Atomic Operations

Never use assignment operators with std::atomic - always use explicit store() and load()
Always specify memory ordering explicitly for atomic operations
Use the least restrictive correct memory ordering - choose the weakest ordering that maintains correctness

// Preferred - explicit store/load with precise memory ordering
std::atomic<uint64_t> counter;
counter.store(42, std::memory_order_relaxed);      // Single-writer metric updates
auto value = counter.load(std::memory_order_relaxed); // Reading metrics for display

counter.store(1, std::memory_order_release);       // Publishing initialization
auto ready = counter.load(std::memory_order_acquire); // Synchronizing with publisher

counter.store(42, std::memory_order_seq_cst);      // When sequential consistency needed

// Avoid - assignment operators (implicit memory ordering)
std::atomic<uint64_t> counter;
counter = 42;        // Implicit - memory ordering not explicit
auto value = counter; // Implicit - memory ordering not explicit

Memory Management

Ownership & Allocation

Arena allocators for request-scoped memory with STL allocator adapters (see Performance Focus section for characteristics)
String views pointing to arena-allocated memory to avoid unnecessary copying
STL containers with arena allocators require default construction after arena reset - clear() is not sufficient

// STL containers with arena allocators - correct reset pattern
std::vector<Operation, ArenaStlAllocator<Operation>> operations(arena);
// ... use container ...
operations = {};  // Default construct - clear() won't work correctly
arena.reset();  // Reset arena memory

Resource Management

RAII everywhere - constructors acquire, destructors release
Move semantics for efficient resource transfer
Explicit cleanup methods where appropriate

~Arena() {
  while (current_block_) {
    Block *prev = current_block_->prev;
    std::free(current_block_);
    current_block_ = prev;
  }
}

Error Handling

Error Classification & Response

Expected errors (invalid input, timeouts): Return error codes for programmatic handling
System failures (malloc fail, socket fail): Abort immediately with error message
Programming errors (precondition violations, assertions): Abort immediately

Error Contract Design

Error codes are the API contract - use enums for programmatic decisions
Error messages are human-readable only - never parse message strings
Consistent error boundaries - each component defines what it can/cannot recover from
Interface precondition violations are undefined behavior - acceptable to skip checks for performance in hot paths
Error code types must be nodiscard - mark error code enums with [[nodiscard]] to prevent silent failures

enum class [[nodiscard]] ParseResult { Success, InvalidJson, MissingField };

// System failure - abort immediately
void* memory = std::malloc(size);
if (!memory) {
    std::fprintf(stderr, "Arena: Memory allocation failed\n");
    std::abort();
}
// ... use memory, eventually std::free(memory)

// Programming error - precondition violation (may be omitted for performance)
assert(ptr != nullptr && "Precondition violated: pointer must be non-null");

Assertions

Programming error detection using standard assert() macro
Assertion behavior follows C++ standards:
- Debug builds: Assertions active (undefined NDEBUG)
- Release builds: Assertions removed (defined NDEBUG)
Test targets override: Use -UNDEBUG to force assertions active in all builds
Static assertions for compile-time validation (always active)

Usage guidelines:

Use for programming errors: null checks, precondition validation, invariants
Don't use for expected runtime errors: use return codes instead

// Good: Programming error checks
assert(current_block_ && "realloc called with non-null ptr but no current block");
assert(size > 0 && "Cannot allocate zero bytes");

// Good: Compile-time validation (always enabled)
static_assert(std::is_trivially_destructible_v<T>, "Arena requires trivially destructible types");

// Bad: Don't use assert for expected runtime errors
// assert(file_exists(path));  // File might legitimately not exist - use return code instead

System Call Error Handling

When a system call is interrupted by a signal (EINTR), it is usually necessary to retry the call. This is especially true for "slow" system calls that can block for a long time, such as read, write, accept, connect, sem_wait, and epoll_wait.

Rule: Always wrap potentially interruptible system calls in a do-while loop that checks for EINTR.

Example:

int fd;
do {
  fd = accept(listen_fd, nullptr, nullptr);
} while (fd == -1 && errno == EINTR);

if (fd == -1) {
  // Handle other errors (perror has no std:: equivalent)
  perror("accept");
  std::abort();
}

Special case - close():

The close() system call is a special case on Linux. According to man 2 close, when close() returns EINTR on Linux, the file descriptor is still guaranteed to be closed. Therefore, close() should never be retried.

// Correct: Do not retry close() on EINTR
int result = close(fd);
if (result == -1 && errno != EINTR) {
  // Handle non-EINTR errors only (perror has no std:: equivalent)
  perror("close");
  std::abort();
}
// Note: fd is guaranteed closed even on EINTR

Non-interruptible calls:

Most system calls are not interruptible in practice. For these, it is not necessary to add a retry loop. This includes:

fcntl (with F_GETFL, F_SETFL, F_GETFD, F_SETFD - note: F_SETLKW and F_OFD_SETLKW CAN return EINTR)
epoll_ctl
socketpair
pipe
setsockopt
epoll_create1
close (special case: guaranteed closed even on EINTR on Linux)

When in doubt, consult the man page for the specific system call to see if it can return EINTR.

Documentation

Doxygen Style

/ for struct and public method documentation
@brief for short descriptions
@param and @return for function parameters
@note for important implementation notes
@warning for critical usage warnings

/**
 * @brief Type-safe version of realloc_raw for arrays of type T.
 * @param existing_ptr Pointer to the existing allocation
 * @param current_size Size in number of T objects
 * @param requested_size Desired new size in number of T objects
 * @return Pointer to reallocated memory
 * @note Prints error to stderr and calls std::abort() if allocation fails
 */
template <typename T>
T *realloc(T *existing_ptr, int32_t current_size, int32_t requested_size);

Code Comments

Explain why, not what - code should be self-documenting
Performance notes for optimization decisions
Thread safety and ownership semantics

// Uses O(1) accumulated counters for fast retrieval
int64_t total_allocated() const;

// Only Server can create connections - no public constructor
Connection(struct sockaddr_storage addr, int fd, int64_t id,
           ConnectionHandler *handler, std::weak_ptr<Server> server);

Testing

Test Framework

doctest for unit testing
TEST_CASE and SUBCASE for test organization
CHECK for assertions (non-terminating)
REQUIRE for critical assertions (terminating)

Test Structure

Descriptive test names explaining the scenario
SUBCASE for related test variations
Fresh instances for each test to avoid state contamination

TEST_CASE("Arena basic allocation") {
  Arena arena;

  SUBCASE("allocate zero bytes returns nullptr") {
    void *ptr = arena.allocate_raw(0);
    CHECK(ptr == nullptr);
  }

  SUBCASE("allocate single byte") {
    void *ptr = arena.allocate_raw(1);
    CHECK(ptr != nullptr);
    CHECK(arena.used_bytes() >= 1);
  }
}

Test Design Principles

Test the contract, not the implementation - validate what the API promises to deliver, not implementation details
Both integration and unit tests - test components in isolation and working together
Prefer fakes to mocks - use real implementations for internal components, fake external dependencies
Always enable assertions in tests - use -UNDEBUG pattern to ensure assertions are checked (see Build Integration section)

// Good: Testing through public API
TEST_CASE("Server accepts connections") {
  auto config = Config::defaultConfig();
  auto handler = std::make_unique<TestHandler>();
  auto server = Server::create(config, std::move(handler));

  // Test observable behavior - server can accept connections
  auto result = connectToServer(server->getPort());
  CHECK(result.connected);
}

// Avoid: Testing internal implementation details
// TEST_CASE("Server creates epoll instance") { /* implementation detail */ }

What NOT to Test

Avoid testing language features and plumbing:

Don't test that virtual functions dispatch correctly
Don't test that standard library types work (unique_ptr, containers, etc.)
Don't test basic constructor/destructor calls

Test business logic instead:

When does your code call hooks/callbacks and why?
What state transitions trigger behavior changes?
How does your code handle error conditions?
What promises does your API make to users?

Ask: "Am I testing the C++ compiler or my application logic?"

Test Synchronization (Authoritative Rules)

ABSOLUTELY NEVER use timeouts (sleep_for, wait_for, etc.)
Deterministic synchronization only:
- Blocking I/O (naturally waits for completion)
- condition_variable.wait() without timeout
- std::latch, std::barrier, futures/promises
Force concurrent execution using std::latch to synchronize thread startup

Threading Checklist for Tests/Benchmarks

Common threading principles (all concurrent code):

Count total threads - Include main/benchmark thread in count
Always assume concurrent execution needed - Tests/benchmarks require real concurrency
Add synchronization primitive - std::latch start_latch{N} (most common), std::barrier, or similar where N = total concurrent threads
Each thread synchronizes before doing work - e.g., start_latch.arrive_and_wait() or barrier.arrive_and_wait()
Main thread synchronizes before measurement/execution - ensures all threads start simultaneously

Test-specific:

Perform many operations per thread creation - amortize thread creation cost and increase chances of hitting race conditions
Pattern: Create test that spawns threads and runs many operations, then run that test many times - amortizes thread creation cost while providing fresh test instances
Run 100-10000 operations per test, and 100-10000 test iterations - maximizes chances of hitting race conditions
Always run with ThreadSanitizer - compile with -fsanitize=thread

Benchmark-specific:

NEVER create threads inside the benchmark measurement - creates thread creation/destruction overhead, not contention
Create background threads OUTSIDE the benchmark that run continuously during measurement
Use std::atomic<bool> keep_running to cleanly shut down background threads after benchmark
Measure only the foreground operation under real contention from background threads

Red flags to catch immediately:

❌ Creating threads in a loop without std::latch
❌ Background threads starting work immediately
❌ Benchmark measuring before all threads synchronized
❌ Any use of sleep_for, wait_for, or timeouts

Simple rule: Multiple threads = std::latch synchronization. No exceptions, even for "simple" background threads.

// BAD: Race likely over before threads start
int counter = 0;
for (int i = 0; i < 4; ++i) {
  threads.emplace_back([&]() { counter++; }); // Probably sequential
}

// GOOD: Force threads to race simultaneously
int counter = 0;
std::latch start_latch{4};
for (int i = 0; i < 4; ++i) {
  threads.emplace_back([&]() {
    start_latch.count_down_and_wait(); // All threads start together
    counter++; // Now they actually race (data race on non-atomic)
  });
}

Build Integration

Build Configuration

# Debug: assertions on, optimizations off
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Debug -DCMAKE_EXPORT_COMPILE_COMMANDS=ON

# Release: assertions off, optimizations on
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_EXPORT_COMPILE_COMMANDS=ON

Test Target Pattern:

Production targets follow build type (assertions off in Release)
Test targets use -UNDEBUG to force assertions on in all builds
Ensures consistent test validation regardless of build type

# Test target with assertions always enabled
add_executable(test_example tests/test_example.cpp src/example.cpp)
target_link_libraries(test_example doctest::doctest)
target_compile_options(test_example PRIVATE -UNDEBUG)  # Always enable assertions

# Production target follows build type
add_executable(example src/example.cpp src/main.cpp)
# No -UNDEBUG → assertions disabled in Release, enabled in Debug

Code Generation

Generated files go in build directory, not source

25 KiB Raw Blame History