\documentclass[twocolumn]{article}
\usepackage{hyperref}
\usepackage[utf8]{inputenc}
\usepackage[angle=90, hpos=\leftmargin]{draftwatermark}
\usepackage{tikz}
\usepackage{tikzscale}
\usepackage[edges]{forest}
\usepackage{amsmath}
\input{version.txt}
\title{ARTful Concurrency Control à la FoundationDB}
\author{Andrew Noyes \thanks{\href{mailto:andrew@weaselab.dev}{andrew@weaselab.dev}}}
\date{Version \versionnumber}
\usepackage{biblatex}
\bibliography{bibliography}
\begin{document}
\maketitle
\section*{Abstract}
FoundationDB \cite{DBLP:conf/sigmod/ZhouXSNMTABSLRD21} provides serializability using a specialized data structure called \emph{lastCommit}\footnote{See Algorithm 1 referenced in \cite{DBLP:conf/sigmod/ZhouXSNMTABSLRD21}.} to implement optimistic concurrency control \cite{kung1981optimistic}.
This data structure encodes the write sets for recent transactions as a map from key ranges (represented as bitwise-lexicographically-ordered half-open intervals) to most recent write versions.
Before a transaction is allowed to commit, its read set is checked for writes that happened after its read version.
FoundationDB implements \emph{lastCommit} as a version-augmented probabilistic skip list \cite{10.1145/78973.78977}.
In this paper, we propose an alternative implementation as a version-augmented Adaptive Radix Tree (ART) \cite{DBLP:conf/icde/LeisK013}, and evaluate its performance.
This implementation is available at \url{https://git.weaselab.dev/weaselab/conflict-set/releases} as a C or C++ library.
\emph{lastCommit} operates on logical keys and is agnostic to the physical structure of the data store, so it should be straightforward to use outside of FoundationDB.
\section{Introduction}
Let's begin by considering design options for \emph{lastCommit}.
In order to manage half-open intervals we need an ordered data structure, so hash tables are out of consideration.
For any ordered data structure, we can implement \emph{lastCommit} with a representation in which logical key ranges (figure \ref{fig:logicalrangemap}) are stored as physical keys (figure \ref{fig:physicalrangemap}): the value associated with a logical key is the value of the last physical key less than or equal to it.
This is a standard technique used throughout FoundationDB called a \emph{range map}.
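As a minimal sketch of this technique (using \texttt{std::map}; illustrative only, not the library's actual interface), the range map of the figures and its ``last key less than or equal to'' lookup could look like this:
\begin{verbatim}
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

using Version = int64_t;

// Physical keys as in the figures. The empty
// key guarantees every lookup has a
// predecessor.
std::map<std::string, Version> rangeMap{
    {"", 0}, {"a", 1}, {"b", 2}};

// Value of k = value of the last physical
// key less than or equal to k.
Version lookup(const std::string &k) {
  auto it = rangeMap.upper_bound(k);
  return std::prev(it)->second;
}
\end{verbatim}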
\begin{figure}
\caption{Physical structure of range map}
\label{fig:physicalrangemap}
\centering
\begin{tikzpicture}
\draw[-latex] (-3.5,0) -- (3.5,0);
\foreach \x [count=\xi from 0] in {\epsilon, a, b}
{
\draw[shift={(\xi * 2.333 - 3.5,0)},color=black] (0pt,3pt) -- (0pt,-3pt);
\node[] at (\xi * 2.333 - 3.5,0.5) {$\x$};
\node[anchor=west] at (\xi * 2.333 - 3.5,-0.5) {$\x \mapsto \xi$};
};
\end{tikzpicture}
\end{figure}
\begin{figure}
\caption{Logical structure of range map}
\label{fig:logicalrangemap}
\centering
\begin{tikzpicture}
\draw[-latex] (-3.5,0) -- (3.5,0);
\foreach \x [count=\xi from 0] in {\epsilon, a, b}
{
\draw[shift={(\xi * 2.333 - 3.5,0)},color=black] (0pt,3pt) -- (0pt,-3pt);
\node[] at (\xi * 2.333 - 3.5,0.5) {$\x$};
};
\foreach \x [count=\xi from 0] in {{$[\epsilon, a) \mapsto \xi$}, {$[a, b) \mapsto \xi$}, {$[b, \infty) \mapsto \xi$}}
{
\node[anchor=west] at (\xi * 2.333 - 3.5,-0.5) {\x};
};
\end{tikzpicture}
\end{figure}
The problem with applying this to an off-the-shelf ordered data structure is that checking a read range is linear in the number of intersecting physical keys.
Scanning through every recent point write intersecting a large range read would make conflict checking unacceptably slow for high-write-throughput workloads.
This suggests we consider augmenting \cite{cormen2022introduction} an ordered data structure to make checking the max version of a range sublinear.
Since finding the maximum of a set of elements is a decomposable search problem \cite{bentley1979decomposable}, we could apply the general technique using \texttt{std::max} as our binary operation and \texttt{INT\_MIN} as our identity.
Algorithmically, this describes FoundationDB's skip list.
We can also consider any other ordered data structure to augment, such as any variant of a balanced search tree \cite{adelson1962algorithm,guibas1978dichromatic,seidel1996randomized,comer1979ubiquitous}, or a radix tree \cite{DBLP:conf/icde/LeisK013,binna2018hot}.
Let's compare the relevant properties of our candidate data structures for insertion/update and read operations.
After insertion, the max version along the search path must reflect the update.
For self-balancing comparison-based trees, updating the max version along the search path cannot be done during the top-down search, because \emph{insertion may change the search path} (rebalancing rotations move nodes), and we do not know whether the operation is an insert or an update until the top-down search completes.
We have no choice but to do a second, bottom-up pass to propagate max version changes.
Furthermore, the change will always propagate all the way to the root, since inserts always use the highest-yet version.
For a radix tree, insertion does not affect the search path, and so max version can be updated on the top-down pass.
This adds minimal overhead compared to the unaugmented radix tree.
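A minimal sketch of this top-down update, using a simplified 256-ary node without ART's adaptive node types or path compression (illustrative only, not the library's layout):
\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <cstdint>

using Version = int64_t;

struct Node {
  Version max = INT64_MIN;
  Node *child[256] = {};  // simplified layout
};

// Insertion never changes the search path of
// the key being inserted, so max can be
// maintained during the single descent.
void insert(Node *n, const uint8_t *key,
            size_t len, Version v) {
  for (size_t i = 0; i < len; ++i) {
    n->max = std::max(n->max, v);
    Node *&c = n->child[key[i]];
    if (c == nullptr) c = new Node;
    n = c;
  }
  n->max = std::max(n->max, v);
}
\end{verbatim}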
For ``last less than or equal to'' queries (which make up the core of our read workload), skip lists have the convenient property that no backtracking is necessary, since the bottommost level is a sorted linked list.
Binary search trees and radix trees both require backtracking up the search path when an equal element is not found.
It's possible to trade off the backtracking for the increased overhead of maintaining the elements in an auxiliary sorted linked list during insertion.
Our options also have various tradeoffs inherited from their unaugmented versions such as different worst-case and expected bounds on the length of search paths and the number of rotations performed upon insert.
ART has been shown \cite{DBLP:conf/icde/LeisK013} to offer superior performance to comparison-based data structures on modern hardware, which is on its own a compelling reason to consider it.
The Height Optimized Trie (HOT) \cite{binna2018hot} outperforms ART, but has a few practical disadvantages\footnote{Implementing HOT is more complex than the already-daunting ART, and requires AVX2 and BMI2 instructions. HOT also involves rebalancing operations during insertion. Even so, it's likely that a HOT-based \emph{lastCommit} implementation would be superior.} and will not be considered in this paper.
\section{Augmented radix tree}
We now propose our design for an augmented radix tree implementation of \emph{lastCommit}.
The design is the same as the Adaptive Radix Tree \cite{DBLP:conf/icde/LeisK013}, but each node in the tree is annotated with either one field \emph{max}, or three fields: \emph{max}, \emph{point}, and \emph{range}.
\emph{max} represents the maximum version among all keys starting with the prefix associated with the node's place in the tree (i.e. the search path from the root to this node).
\emph{point} represents the version of the exact key matching this node's prefix.
\emph{range} represents the version of all keys $k$ such that there is no node matching $k$ and this is the first node greater than $k$ with all three fields set.
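As a minimal sketch of this annotation (the field names follow the definitions above; the layout is illustrative, not the library's):
\begin{verbatim}
#include <cstdint>

using Version = int64_t;

struct Annotation {
  // Max version among all keys starting with
  // this node's prefix.
  Version max;
  // True only for three-field nodes.
  bool hasAll3;
  // Version of the exact key matching this
  // node's prefix.
  Version point;
  // Version of keys with no node of their
  // own for which this is the first greater
  // node with range set.
  Version range;
};
\end{verbatim}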
See figure \ref{fig:tree} for an example tree after inserting
$[AND, ANT) \rightarrow 1$,
$\{ANY\} \rightarrow 2$,
$\{ARE\} \rightarrow 3$, and
$\{ART\} \rightarrow 4$.
Each node shows its partial prefix annotated with $max$ or $max,point,range$.
\begin{figure}
\caption{Example tree after the inserts described in the text}
\label{fig:tree}
\centering
\includegraphics{tree.tikz}
\end{figure}
\subsection{Checking point reads} \label{Checking point reads}
The algorithm for checking point reads follows directly from the definitions of the \emph{point} and \emph{range} fields.
Our input is a key $k$ and a read version $r$, and we must report whether or not the write version $v_{k}$ of $k$ is less than or equal to $r$.
In order to find $v_{k}$, we search for the node whose prefix matches $k$.
If such a node exists and has \emph{point} set, then $v_{k}$ is its \emph{point} field.
Otherwise, we scan forward to find the first node greater than $k$ with \emph{range} set.
If such a node exists, then $v_{k}$ is its \emph{range} field.
Otherwise $k$ logically has no write version, and so the read does not conflict.
As an optimization, during the search phase for a point read we can inspect the \emph{max} at each node that's a prefix of $k$.
If the max version among all keys starting with a prefix of $k$ is less than or equal to $r$, then $v_{k} \leq r$.
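A minimal sketch of this check, reusing the hypothetical \texttt{Annotation} layout sketched above (\texttt{findNode} and \texttt{nextWithRange} are stand-ins for the tree's search routines, which can apply the \emph{max} early exit internally):
\begin{verbatim}
#include <string>

// Hypothetical stand-ins for the tree's
// search routines.
const Annotation *findNode(
    const std::string &k);
const Annotation *nextWithRange(
    const std::string &k);

// True iff a read of k at read version r
// does not conflict.
bool checkPointRead(const std::string &k,
                    Version r) {
  if (const Annotation *n = findNode(k);
      n && n->hasAll3)      // point is set
    return n->point <= r;
  if (const Annotation *f = nextWithRange(k))
    return f->range <= r;
  return true;  // k has no write version
}
\end{verbatim}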
\subsection{Checking range reads}
Checking range reads is more involved. Logically the idea is to partition the range read so that each subrange in the partition is a single point or coincides with the set of keys beginning with a prefix (a \emph{prefix range}).
The max version of a single point $k$ is $v_{k}$, as described in subsection \ref{Checking point reads}.
The max version of a prefix range is the $max$ of the node associated with the prefix if such a node exists, and $range$ of the next node with a $range$ field otherwise.
If there is no next node with a range field, then we ignore that subrange in our max version calculation.
The max version among all max versions of subranges in this partition is the max version of the whole range, which we compare to $r$.
Let's start with partitioning the range in the case where the beginning of the range is a prefix of the end of the range.
We'll be able to use this as a subroutine in the general case.
Suppose our range is $[a_{0}\dots a_{k}, a_{0}\dots a_{n})$ where $k < n$, and $a_{i} \in [0, 256)$.
The partition starts with the singleton set containing the first key in the range.
\[
\{a_{0}\dots a_{k}\}
\]
or equivalently
\[
[a_{0}\dots a_{k}, a_{0}\dots a_{k} 0)
\]
and continues with a sequence of prefix ranges, one for each digit from $0$ up to (but not including) $a_{k+1}$.
\begin{align*}
\dots \quad \cup \quad & [a_{0}\dots a_{k} 0, a_{0}\dots a_{k} 1) \quad \cup \\
& [a_{0}\dots a_{k} 1, a_{0}\dots a_{k} 2) \quad \cup \\
& \dots \quad \cup \\
& [a_{0}\dots a_{k} (a_{k+1}-1), a_{0}\dots a_{k+1})
\end{align*}
Recall that the range $[a_{0}\dots a_{k} 0, a_{0}\dots a_{k} 1)$ is the set of keys starting with $a_{0}\dots a_{k} 0$.
The remainder of the partition begins with the singleton set
\[
\dots \quad \cup \quad [a_{0}\dots a_{k + 1}, a_{0}\dots a_{k + 1} 0) \quad \cup \quad \dots
\]
and proceeds as above until a range ending at $a_{0}\dots a_{n}$.
Let's now consider a range where begin is not a prefix of end.
\[
[a_{0}\dots a_{m}, b_{0}\dots b_{n})
\]
Let $i$ be the lowest index such that $a_{i} \neq b_{i}$.
For brevity we will elide the common prefix up until $i$ in the following discussion so that our range is denoted as $[a_{i}\dots a_{m}, b_{i}\dots b_{n})$.
We'll start with partitioning this range coarsely:
\begin{align*}
& [a_{i}\dots a_{m}, a_{i} + 1) \quad \cup \\
& [a_{i} + 1, a_{i} + 2) \quad \cup \\
& \dots \\
& [b_{i} - 1, b_{i}) \quad \cup \\
& [b_{i}, b_{i}\dots b_{n})
\end{align*}
The last range has a begin that's a prefix of end, and so we'll partition that as before.
The inner ranges are already prefix ranges.
This leaves only $[a_{i}\dots a_{m}, a_{i} + 1)$.
If $m = i$, then this range is itself a prefix range adjacent to the first inner range above, and we're done.
Otherwise we'll partition this into
\begin{align*}
& [a_{i}\dots a_{m}, a_{i}\dots (a_{m} + 1)) \quad \cup \\
& [a_{i}\dots (a_{m} + 1), a_{i}\dots (a_{m} + 2)) \quad \cup \\
& \dots \\
& [a_{i}\dots 254, a_{i}\dots 255) \quad \cup \\
& [a_{i}\dots 255, a_{i}\dots (a_{m-1} + 1) )
\end{align*}
and repeat\footnote{This doesn't explicitly describe how to handle the case where $a_{m-1} = 255$. In this case we would skip to the largest $j < m$ such that $a_{j} \neq 255$. We know $j \geq i$ since if $a_{i} = 255$ then the range is inverted.} starting at
\[
\dots \quad \cup \quad [a_{i}\dots (a_{m-1} + 1), a_{i}\dots (a_{m-1} + 2))
\]
until we end at $a_{i} + 1$, adjacent to the first inner range.
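To make the construction concrete, here is a worked example of our own, using the keys of figure \ref{fig:tree} with bytes written as ASCII characters: partition the range $[AND, ART)$. The common prefix is $A$, so $i = 1$, and the partition is
\begin{align*}
& [AND, ANE) \quad \cup \quad [ANE, ANF) \quad \cup \quad \dots \quad \cup \\
& [AN\,255, AO) \quad \cup \\
& [AO, AP) \quad \cup \quad [AP, AQ) \quad \cup \quad [AQ, AR) \quad \cup \\
& \{AR\} \quad \cup \quad [AR\,0, AR\,1) \quad \cup \quad \dots \quad \cup \\
& [AR\,S, AR\,T)
\end{align*}
in which every subrange is either a singleton or a prefix range.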
A few notes on implementation:
\begin{itemize}
\item{For clarity, the above algorithm decouples the logical partitioning from the physical structure of the tree.
An optimized implementation would merge adjacent prefix ranges that don't correspond to nodes in the tree as it scans, so that it only calculates the version of such merged ranges once.
Additionally, our implementation uses SIMD instructions and instruction-level parallelism to compare many prefix ranges to the read version $r$ in parallel.}
\item{In order to avoid many costly pointer indirections, and to take advantage of SIMD, we can store the max versions of child nodes as a dense array directly in the parent node (see the sketch after this list).
Without this, range read performance is not competitive with the skip list.}
\item{An optimized implementation would visit the partition of $[a_{i}\dots a_{m}, a_{i} + 1)$ in reverse order, as it descends along the search path to $a_{i}\dots a_{m}$.}
\item{An optimized implementation would search for the common prefix first, and return early if any prefix of the common prefix has a $max \leq r$.}
\end{itemize}
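The following sketch illustrates the dense child-max layout mentioned above, here for a hypothetical 16-child node (not the library's actual layout):
\begin{verbatim}
#include <cstdint>

using Version = int64_t;
struct Node;

// A Node16-style node keeping the max
// version of each child inline, so a range
// check can scan childMax without chasing
// child pointers, and SIMD can compare many
// entries against r at once.
struct Node16 {
  uint8_t numChildren;
  uint8_t keys[16];      // sorted key bytes
  Version childMax[16];  // max of child i
  Node *children[16];
};
\end{verbatim}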
\subsection{Reclaiming old entries}
In order to bound memory usage, we track an \emph{oldest version}, reject transactions with read versions before \emph{oldest version}, and reclaim nodes made redundant by \emph{oldest version}.
We track the rate at which new nodes are inserted and ensure that our incremental reclamation of nodes made redundant by \emph{oldest version} outpaces those insertions.
\subsection{Adding point writes}
A point write of $k$ at version $v$ sets $max \gets v$\footnote{Write versions are non-decreasing.} for every node along $k$'s search path, sets $point \gets v$ for $k$'s node, and sets $range$ for $k$'s node to the $range$ of the first node greater than $k$, or to \emph{oldest version} if no such node exists.
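A minimal sketch of this operation, in the same hypothetical vocabulary as the point-read sketch (\texttt{insertNode} stands in for the descent, which creates $k$'s node if necessary and sets $max \gets v$ on every node it visits):
\begin{verbatim}
extern Version oldestVersion;

// Hypothetical: descends to k's node,
// creating it if needed, and sets
// max <- max(max, v) along the search path.
Annotation *insertNode(const std::string &k,
                       Version v);

void addPointWrite(const std::string &k,
                   Version v) {
  Annotation *n = insertNode(k, v);
  n->point = v;
  const Annotation *f = nextWithRange(k);
  n->range = f ? f->range : oldestVersion;
  n->hasAll3 = true;
}
\end{verbatim}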
\subsection{Adding range writes}
A range write of $[b, e)$ at version $v$ performs a point write of $b$ at $v$, and then inserts a node at $e$ with $range$ set to $v$, and $point$ set such that the result of checking a read of $e$ is unaffected.
Nodes along the search path to $e$ that are a strict prefix of $e$ get $max$ set to $v$, and all nodes between $b$ and $e$ are removed.
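A sketch of this operation under the same assumptions (\texttt{eraseNodesBetween} and \texttt{insertRangeEnd} are hypothetical stand-ins, and \emph{oldest version} stands in for ``no version'', which any admissible read trivially satisfies):
\begin{verbatim}
// Hypothetical: removes all nodes strictly
// between b and e.
void eraseNodesBetween(const std::string &b,
                       const std::string &e);
// Hypothetical: descends to e, creating its
// node if needed, and sets max <- max(max, v)
// on nodes that are strict prefixes of e.
Annotation *insertRangeEnd(
    const std::string &e, Version v);

void addRangeWrite(const std::string &b,
                   const std::string &e,
                   Version v) {
  addPointWrite(b, v);
  // Version a point read of e sees today, so
  // that reads of e remain unaffected.
  Version pe = oldestVersion;
  if (const Annotation *n0 = findNode(e);
      n0 && n0->hasAll3)
    pe = n0->point;
  else if (const Annotation *f =
               nextWithRange(e))
    pe = f->range;
  eraseNodesBetween(b, e);
  Annotation *n = insertRangeEnd(e, v);
  n->point = pe;
  n->range = v;
  n->hasAll3 = true;
}
\end{verbatim}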
\section{Evaluation}
\section{Testing}
The correctness of \emph{lastCommit} is critically important, as a bug would likely result in data corruption, and so we use a variety of testing techniques.
The main technique is to let libfuzzer \cite{libfuzzer} generate sequences of arbitrary operations, and apply each sequence to both the optimized radix tree and a naive implementation based on an unaugmented ordered map that serves as the specification of the intended behavior.
After libfuzzer generates inputs with broad code coverage, we use libfuzzer's ``corpus minimization'' feature to pare down the test inputs without losing coverage (as measured by libfuzzer) into a fixed set of tests short enough that it's feasible to run interactively during development.
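A minimal sketch of such a differential fuzz target (the \texttt{ConflictSet}, \texttt{NaiveConflictSet}, \texttt{FuzzInput}, and \texttt{Op} names here are hypothetical; the actual harness lives in the repository):
\begin{verbatim}
#include <cassert>
#include <cstddef>
#include <cstdint>

extern "C" int LLVMFuzzerTestOneInput(
    const uint8_t *data, size_t size) {
  ConflictSet optimized;       // radix tree
  NaiveConflictSet reference;  // ordered map
  FuzzInput in(data, size);
  while (!in.empty()) {
    // Decode one operation: add a point or
    // range write, check a read, or advance
    // the oldest version.
    Op op = in.next();
    assert(optimized.apply(op) ==
           reference.apply(op));
    // Internal invariants between ops.
    optimized.checkInvariants();
  }
  return 0;
}
\end{verbatim}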
In order to keep these test inputs short, we constrain the size of keys at the cost of some generality.
We believe there isn't anything in the implementation particularly sensitive to the exact length of keys.\footnote{\texttt{longestCommonPrefix} (a routine in the implementation) is a possible exception, but its length sensitivity is well encapsulated.}
Libfuzzer's minimized corpus achieves 98\% line coverage on its own.
We regenerate the corpus on an ad hoc basis by running libfuzzer for a few CPU-hours, during which it tests millions of unique inputs.
In addition to asserting correct externally-visible behavior, in each of these tests we assert that internal invariants hold between operations.
We also use AddressSanitizer \cite{10.5555/2342821.2342849} to detect memory errors, UndefinedBehaviorSanitizer \cite{ubsan} to detect invocations of undefined behavior, and ThreadSanitizer \cite{10.1145/1791194.1791203} (while exercising concurrent access as allowed by the contract documented in the C++ header file) to detect data-race-related undefined behavior.
Each of these sanitizers is implemented using compiler instrumentation, which means that they are not testing the final binary artifact that will be run in production.
Therefore we also run the test inputs linking directly to the final release artifact, both standalone and under valgrind \cite{10.5555/1247360.1247362}.
When testing the final artifacts, we do not assert internal invariants as we lack convenient access to the internals.
As a defense against possible bugs in compilers' sanitizer and optimizer passes \cite{10.1145/3591257}, we also test with sanitizers enabled and optimizations disabled, and test with both clang and gcc.
We audited the 2\% of lines that were not covered by libfuzzer\footnote{In order to see the uncovered lines for yourself, exclude all tests containing the word ``script'' with \texttt{ctest -E script}. Look in \texttt{Jenkinsfile} in the root of the source tree for an example of how to measure coverage.} and found the following:
\begin{itemize}
\item Three occurrences which can be reached from an input that libfuzzer could theoretically generate. In each case the uncovered code is straightforward, and is exercised from an entry point by a manually written test.
\item One occurrence which requires a large number of operations, and cannot be reached from an input satisfying the size constraints we impose on libfuzzer. This code is also straightforward, and is exercised from an entry point by a manually written test. The purpose of this code is to keep memory usage in check, and so it's expected that it cannot be reached without a large number of operations.
\item One occurrence which is not reachable from any entry point. This line is now suppressed with an explanatory comment.
\end{itemize}
We assert 100\% line coverage in continuous integration, which is achieved with a few caveats.
2\% of the code is only covered by a few manually written tests.
We suppress lines manually checked to be unreachable from an entry point.
There is also a significant amount of test-only code which is suppressed from coverage measurements.
There's a small behavioral difference between debug and release builds: the code which scans for old entries runs more frequently when assertions are enabled.
This code is not straightforward, so exercising it only from a manually written test seems insufficient.
\section{Conclusion}
% https://www.reddit.com/r/LaTeX/comments/9d9u97/some_reference_get_weird_space_between_them_when/
\raggedright
\printbibliography
\end{document}