
\documentclass[twocolumn]{article}

\usepackage{hyperref}
\usepackage[utf8]{inputenc}
\usepackage[angle=90, hpos=\leftmargin]{draftwatermark}
\usepackage{tikz}
\usepackage{tikzscale}
\usepackage[edges]{forest}

\title{ARTful Conflict Checking for FoundationDB}
\author{Andrew Noyes \thanks{\href{mailto:andrew@weaselab.dev}{andrew@weaselab.dev}}}
\date{}

\usepackage{biblatex}
\bibliography{bibliography}

\begin{document}

\maketitle

\section*{Abstract}

FoundationDB \cite{DBLP:conf/sigmod/ZhouXSNMTABSLRD21} provides serializability using a specialized data structure called \textit{lastCommit}\footnote{See Algorithm 1 in \cite{DBLP:conf/sigmod/ZhouXSNMTABSLRD21}.} to implement optimistic concurrency control \cite{kung1981optimistic}.
This data structure encodes the write sets for recent transactions as a map from key ranges (represented as bitwise-lexicographically-ordered half-open intervals) to most recent write versions.
FoundationDB implements \textit{lastCommit} as a version-augmented probabilistic skip list \cite{10.1145/78973.78977}.
In this paper, we propose an alternative implementation of \textit{lastCommit} as a version-augmented Adaptive Radix Tree (ART) \cite{DBLP:conf/icde/LeisK013}, and evaluate its performance.

\section{Introduction}

Let's begin by considering design options for \textit{lastCommit}.
In order to manage half-open intervals we need an ordered data structure, so hash tables are out of consideration.
For any ordered data structure we can implement \textit{lastCommit} using a representation where a logical key is mapped to the value of the last physical key less than or equal to the logical key.
This is a standard technique used throughout FoundationDB.
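
As a minimal sketch of this representation (not FoundationDB's actual code), \texttt{std::map} can stand in for the ordered data structure.
The names \texttt{StepMap}, \texttt{versionOf}, and \texttt{kNoVersion} are hypothetical and introduced only for illustration.

{\footnotesize
\begin{verbatim}
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

using Version = int64_t;
// Hypothetical sentinel: "never written".
constexpr Version kNoVersion = INT64_MIN;

// Physical keys mark where the version changes.
using StepMap = std::map<std::string, Version>;

// Version of logical key k: the value of the
// last physical key <= k.
Version versionOf(const StepMap &m,
                  const std::string &k) {
  auto it = m.upper_bound(k); // first key > k
  if (it == m.begin()) return kNoVersion;
  return std::prev(it)->second;
}
\end{verbatim}
}

For instance, after a write of $[b, d)$ at version 7 the map would hold $b \mapsto 7$ and $d \mapsto$ \texttt{kNoVersion}, so \texttt{versionOf} returns 7 for $c$ and \texttt{kNoVersion} for $d$.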

The problem with applying this to an off-the-shelf ordered data structure is that checking a read range is linear in the number of intersecting physical keys.
Scanning through every recent point write intersecting a large range read would make conflict checking unacceptably slow for high-write-throughput workloads.
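
The linear cost is visible in a sketch of a range check over an unaugmented \texttt{std::map}.
This is an illustration under assumed names (\texttt{StepMap}, \texttt{maxVersion}, \texttt{conflicts}, \texttt{kNoVersion}), not code from any existing implementation.

{\footnotesize
\begin{verbatim}
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

using Version = int64_t;
// Hypothetical sentinel: "never written".
constexpr Version kNoVersion = INT64_MIN;
using StepMap = std::map<std::string, Version>;

// Max version over [begin, end), begin < end.
// Touches every physical key in the range.
Version maxVersion(const StepMap &m,
                   const std::string &begin,
                   const std::string &end) {
  Version result = kNoVersion;
  auto it = m.upper_bound(begin);
  // Physical key covering `begin`, if any.
  if (it != m.begin())
    result = std::prev(it)->second;
  for (; it != m.end() && it->first < end; ++it)
    result = std::max(result, it->second);
  return result;
}

// A read of [begin, end) at read version r
// conflicts if some write in range is newer.
bool conflicts(const StepMap &m,
               const std::string &begin,
               const std::string &end,
               Version r) {
  return maxVersion(m, begin, end) > r;
}
\end{verbatim}
}

The loop body runs once per physical key inside the read range, which is the cost the rest of this section aims to avoid.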

This suggests we consider augmenting \cite{cormen2022introduction} an ordered data structure to make checking the max version of a range sublinear.
Since finding the maximum of a set of elements is a decomposable search problem \cite{bentley1979decomposable}, we could apply the general technique using \texttt{std::max} as our binary operation, and the minimum representable version (e.g. \texttt{INT64\_MIN}) as our identity.
Algorithmically, this describes FoundationDB's skip list.
We can also consider any other ordered data structure to augment, such as any variant of a balanced binary search tree \cite{adelson1962algorithm,guibas1978dichromatic,seidel1996randomized}, a B-tree \cite{comer1979ubiquitous}, or a radix tree \cite{DBLP:conf/icde/LeisK013,binna2018hot}.
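
To make the decomposition concrete, suppose (as an illustrative assumption) that the structure can cover a read range $[a, b)$ with $m$ disjoint pieces $S_1, \dots, S_m$, each storing the maximum version of its keys.
Writing $v_k$ for the write version of key $k$, the range maximum is then
\[
    \max_{k \in [a, b)} v_k = \max_{1 \leq i \leq m} \, \max_{k \in S_i} v_k ,
\]
and a read of $[a, b)$ at read version $r$ is conflict-free exactly when this value is at most $r$.
The check is sublinear whenever $m$ grows more slowly than the number of physical keys in the range, for example $m = O(\log n)$ in a balanced tree over $n$ physical keys.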

Let's compare the relevant properties of our candidate data structures for insertion/update and read operations.
After insertion, the max version along the search path must reflect the update.
For self-balancing comparison-based trees, updating the max version along the search path cannot be done during top-down search, because \emph{insertion will change the search path}, and we do not know whether this is an insert or an update until we complete the top-down search.
We have no choice but to do a second, bottom-up pass to propagate max version changes.
Furthermore, the change will always propagate all the way to the root, since inserts always use the highest-yet version.
For a radix tree, insertion does not affect the search path, and so max version can be updated on the top-down pass.
This adds minimal overhead compared to the unaugmented radix tree.
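
The single-pass update can be sketched on a deliberately simplified trie with one byte per level.
This is an illustration under stated assumptions rather than the design proposed in this paper: \texttt{Node}, \texttt{addPointWrite}, and \texttt{kNoVersion} are hypothetical names, a real ART uses adaptive node sizes and path compression, and the handling of range writes is omitted.

{\footnotesize
\begin{verbatim}
#include <algorithm>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

using Version = int64_t;
constexpr Version kNoVersion = INT64_MIN;

// Hypothetical one-byte-per-level trie node.
// A real ART adds adaptive node sizes and
// path compression.
struct Node {
  Version max = kNoVersion;   // whole subtree
  Version point = kNoVersion; // exact key
  std::map<unsigned char,
           std::unique_ptr<Node>> children;
};

// One top-down pass. Every node visited is a
// prefix of key, so its max is raised on the
// way down; no bottom-up fix-up is needed.
void addPointWrite(Node &root,
                   const std::string &key,
                   Version v) {
  Node *n = &root;
  n->max = std::max(n->max, v);
  for (unsigned char c : key) {
    auto &child = n->children[c];
    if (!child) child = std::make_unique<Node>();
    n = child.get();
    n->max = std::max(n->max, v);
  }
  n->point = v;
}
\end{verbatim}
}

Creating a missing child never relocates the nodes already visited, which is why the \emph{max} updates made earlier in the descent remain valid.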

For ``last less than or equal to'' queries (which make up the core of our read workload), skip lists have the convenient property that no backtracking is necessary, since the bottommost level is a sorted linked list.
Binary search trees and radix trees both require backtracking up the search path when an equal element is not found.
It's possible to trade off the backtracking for the increased overhead of maintaining the elements in an auxiliary sorted linked list during insertion.

Our options also have various tradeoffs inherited from their unaugmented versions such as different worst-case and expected bounds on the length of search paths and the number of rotations performed upon insert.
ART has been shown \cite{DBLP:conf/icde/LeisK013} to offer superior performance to comparison-based data structures on modern hardware, which is on its own a compelling reason to consider it.
The Height Optimized Trie (HOT) \cite{binna2018hot} outperforms ART, but has a few practical disadvantages\footnote{Implementing HOT is more complex than the already-daunting ART, and requires AVX2 and BMI2 instructions. HOT also involves rebalancing operations during insertion. Even so, it's likely that a HOT-based \emph{lastCommit} implementation would be superior.} and will not be considered in this paper.

\section{Augmented radix tree}

We now propose our design for an augmented radix tree implementation of \emph{lastCommit}.
The design is the same as the Adaptive Radix Tree \cite{DBLP:conf/icde/LeisK013}, but each node in the tree is annotated with either one field \emph{max}, or three fields: \emph{max}, \emph{point}, and \emph{range}.
\emph{max} represents the maximum version among all keys starting with the prefix associated with the node's place in the tree (i.e. the search path from the root to this node).
\emph{point} represents the version of the exact key matching this node's prefix.
\emph{range} represents the version of all keys $k$ such that this is the first node greater than $k$ with all three fields set.
See Figure~\ref{fig:tree} for an example tree after inserting
     $[AND, ANT) \rightarrow 1$,
     $\{ANY\} \rightarrow 2$,
     $\{ARE\} \rightarrow 3$, and
     $\{ART\} \rightarrow 4$.
Each node shows its partial prefix annotated with $(max,point,range)$.

\subsection{Checking point reads}

The algorithm for checking point reads follows directly from the definitions of the \emph{point} and \emph{range} fields.
Our input is a key $k$ and a read version $r$, and we must report whether or not the write version $v_{k}$ of $k$ is less than or equal to $r$.
In order to find $v_{k}$, we search for the node whose prefix matches $k$.
If such a node exists and has \emph{point} set, then $v_{k}$ is its \emph{point} field.
Otherwise, we scan forward to find the first node greater than $k$ with \emph{range} set.
If such a node exists, then $v_{k}$ is its \emph{range} field.
Otherwise $k$ logically has no write version, and so the read does not conflict.
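
The steps above can be sketched on a flat stand-in for the tree: a sorted map holding one entry for each node that has \emph{point} and \emph{range} set.
This model, the sentinel \texttt{kNoVersion} for ``never written'', and the names \texttt{Entry}, \texttt{Model}, \texttt{writeVersion}, and \texttt{pointReadOk} are assumptions made for illustration; the search and forward scan in the actual tree are more involved.

{\footnotesize
\begin{verbatim}
#include <cstdint>
#include <map>
#include <string>

using Version = int64_t;
// Hypothetical sentinel: "never written".
constexpr Version kNoVersion = INT64_MIN;

// Flat stand-in for the tree: one entry per
// node that has point and range set.
struct Entry {
  Version point; // version of exactly this key
  Version range; // version of keys between the
                 // previous entry and this one
};
using Model = std::map<std::string, Entry>;

Version writeVersion(const Model &m,
                     const std::string &k) {
  auto it = m.lower_bound(k); // first key >= k
  if (it == m.end()) return kNoVersion;
  if (it->first == k) return it->second.point;
  return it->second.range;
}

// True if reading k at read version r is
// conflict-free, i.e. v_k <= r.
bool pointReadOk(const Model &m,
                 const std::string &k,
                 Version r) {
  return writeVersion(m, k) <= r;
}
\end{verbatim}
}

If the write $[AND, ANT) \rightarrow 1$ is encoded in this model as an $AND$ entry with \emph{point} 1 and an $ANT$ entry with \emph{range} 1 (an assumption about the encoding, not a reading of the figure), a point read of $ANE$ resolves through the $ANT$ entry to version 1.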

As an optimization, during the search phase for a point read we can inspect the \emph{max} at each node that's a prefix of $k$.
If the max version among all keys starting with some prefix of $k$ is less than or equal to $r$, then $v_{k} \leq r$, so the search can stop early and report that the read does not conflict.

\subsection{Checking range reads}
\subsection{Adding point writes}
\subsection{Adding range writes}
\subsection{Reclaiming old entries}

\begin{figure}
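    \caption{Example tree after inserting $[AND, ANT) \rightarrow 1$, $\{ANY\} \rightarrow 2$, $\{ARE\} \rightarrow 3$, and $\{ART\} \rightarrow 4$. Each node shows its partial prefix annotated with $(max,point,range)$.}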
    \label{fig:tree}
    \centering
    \includegraphics{tree.tikz}
\end{figure}

\printbibliography

\end{document}