Second pass at "Checking range reads"
@@ -21,7 +21,7 @@
\section*{Abstract}
FoundationDB \cite{DBLP:conf/sigmod/ZhouXSNMTABSLRD21} provides serializability using a specialized data structure called \textit{lastCommit} \footnote{See Algorithm 1 referenced in \cite{DBLP:conf/sigmod/ZhouXSNMTABSLRD21}} to implement optimistic concurrency control \cite{kung1981optimistic}.
FoundationDB \cite{DBLP:conf/sigmod/ZhouXSNMTABSLRD21} provides serializability using a specialized data structure called \textit{lastCommit} \footnote{See Algorithm 1 referenced in \cite{DBLP:conf/sigmod/ZhouXSNMTABSLRD21}.} to implement optimistic concurrency control \cite{kung1981optimistic}.
This data structure encodes the write sets for recent transactions as a map from key ranges (represented as bitwise-lexicographically-ordered half-open intervals) to most recent write versions.
FoundationDB implements \textit{lastCommit} as a version-augmented probabilistic skip list \cite{10.1145/78973.78977}.
In this paper, we propose an alternative implementation of \textit{lastCommit} as a version-augmented Adaptive Radix Tree (ART) \cite{DBLP:conf/icde/LeisK013}, and evaluate its performance.
@@ -71,7 +71,7 @@ See figure \ref{fig:tree} for an example tree after inserting
$\{ART\} \rightarrow 4$.
Each node shows its partial prefix annotated with $(max,point,range)$.
\subsection{Checking point reads}
\subsection{Checking point reads} \label{Checking point reads}
The algorithm for checking point reads follows directly from the definitions of the \emph{point} and \emph{range} fields.
Our input is a key $k$ and a read version $r$, and we must report whether or not the write version $v_{k}$ of $k$ is less than or equal to $r$.
@@ -86,12 +86,15 @@ If the max version among all keys starting with a prefix of $k$ is less than or
\subsection{Checking range reads}
Checking range reads is more involved. Logically the idea is to partition the range read so that each partition is a single point or coincides with the set of keys beginning with a prefix.
The max version of the set of keys starting with a prefix is then $max$ of the node associated with the prefix if such a node exists, and $range$ of the next node with a $range$ field otherwise.
Checking range reads is more involved. Logically, the idea is to partition the range read so that each subrange in the partition is either a single point or coincides with the set of keys beginning with a prefix (a \emph{prefix range}).
The max version of a single point is $v$ as described in Section~\ref{Checking point reads}.
The max version of a prefix range is the $max$ of the node associated with the prefix if such a node exists, and the $range$ of the next node with a $range$ field otherwise.
If there is no next node with a range field, then we ignore that subrange in our max version calculation.
The max version of the whole range is the maximum over the versions of the single points and the max versions of the prefix ranges in this partition, and we compare it to $r$.
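As a compact restatement of the check described above, the following uses notation introduced here only for illustration ($P$, $S$, $v_{p}$ and $m_{s}$ do not appear elsewhere in the text). Let $P$ be the set of single points in the partition, with $v_{p}$ the version of point $p$, and let $S$ be the set of prefix ranges for which a max version $m_{s}$ can be determined. Under these assumptions the range read passes the check when
\[
\max\Bigl(\max_{p \in P} v_{p},\ \max_{s \in S} m_{s}\Bigr) \leq r .
\]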
Let's start with partitioning the range in the case where the beginning of the range is a prefix of the end of the range.
We'll be able to use this as a subroutine in the general case.
Suppose our range is $[a_{0}\dots a_{k}, a_{0}\dots a_{n})$ where $k < n$.
Suppose our range is $[a_{0}\dots a_{k}, a_{0}\dots a_{n})$ where $k < n$, and $a_{i} \in [0, 256)$.
The partition starts with the singleton set containing the first key in the range.
\[
\{a_{0}\dots a_{k}\}
@@ -112,7 +115,7 @@ Recall that the range $[a_{0}\dots a_{k} 0, a_{0}\dots a_{k} 1)$ is equivalent t
The remainder of the partition begins with the singleton set
\[
\dots \quad \cup \quad [a_{0}\dots a_{k + 1}, a_{0}\dots a_{k + 1} 0)
\dots \quad \cup \quad [a_{0}\dots a_{k + 1}, a_{0}\dots a_{k + 1} 0) \quad \cup \quad \dots
\]
and proceeds as above until a range ending at $a_{0}\dots a_{n}$.
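As a small worked instance of this construction (the concrete byte values are chosen here purely for illustration), take $k = 0$, $n = 2$ and $a_{0} a_{1} a_{2} = 1\,2\,1$, so that the range is $[1, 1\,2\,1)$. Following the steps above, the partition would be
\begin{align*}
& \{1\} \quad \cup \\
& [1\,0, 1\,1) \quad \cup \quad [1\,1, 1\,2) \quad \cup \\
& \{1\,2\} \quad \cup \\
& [1\,2\,0, 1\,2\,1)
\end{align*}
where $\{1\,2\}$ is the singleton $[1\,2, 1\,2\,0)$. This gives two single points and three prefix ranges, for the prefixes $1\,0$, $1\,1$ and $1\,2\,0$.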
@@ -143,17 +146,23 @@ Otherwise we'll partition this into
\begin{align*}
& [a_{i}\dots a_{m}, a_{i}\dots (a_{m} + 1)) \quad \cup \\
& [a_{i}\dots (a_{m} + 1), a_{i}\dots (a_{m} + 2)) \quad \cup \\
& \dots \\
& [a_{i}\dots 254, a_{i}\dots 255)
& [a_{i}\dots 254, a_{i}\dots 255) \quad \cup \\
& [a_{i}\dots 255, a_{i}\dots (a_{m-1} + 1) )
\end{align*}
and repeat with $m \gets m - 1$ until we are adjacent to the first inner range.
and repeat starting at \footnote{This doesn't explicitly describe how to handle the case where $a_{m-1} = 255$. In this case we would skip to the largest $j < m$ such that $a_{j} \neq 255$. We know $j \geq i$ since if $a_{i} = 255$ then the range is inverted.}
\[
\dots \quad \cup \quad [a_{i}\dots (a_{m-1} + 1), a_{i}\dots (a_{m-1} + 2))
\]
until we end at $a_{i} + 1$, adjacent to the first inner range.
A few notes on implementation:
\begin{itemize}
\item{For clarity, the above algorithm decouples the logical partitioning from the physical structure of the tree. An optimized implementation would merge adjacent prefix ranges that don't correspond to nodes in the tree as it scans, so that it only calculates the version of merged ranges once. Additionally, our implementation stores an index of which child pointers are valid as a bitset for Node48 and Node256, using techniques inspired by \cite{Lemire_2018}.}
\item{For clarity, the above algorithm decouples the logical partitioning from the physical structure of the tree. An optimized implementation would merge adjacent prefix ranges that don't correspond to nodes in the tree as it scans, so that it only calculates the version of such merged ranges once. Additionally, our implementation stores an index of which child pointers are valid as a bitset for Node48 and Node256 to speed up this scan using techniques inspired by \cite{Lemire_2018}.}
\item{In order to avoid many costly pointer indirections, we can store the max version not in each node itself but next to each node's parent pointer. Without this, the range read performance is not competitive with the skip list.}
\item{An optimized implementation would construct the partition of $[a_{i}\dots a_{m}, a_{i} + 1)$ in reverse order, as it descends along the search path to $[a_{i}\dots a_{m})$}
\item{An optimized implementation would construct the partition of $[a_{i}\dots a_{m}, a_{i} + 1)$ in reverse order, as it descends along the search path to $a_{i}\dots a_{m}$.}
\end{itemize}
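The logical partition for the case where the beginning of the range is a prefix of the end can be sketched in a few lines of Python. This is only an illustration of the partitioning described above, not the implementation evaluated in this paper: the names \texttt{partition\_prefix\_case} and \texttt{covers} are hypothetical, keys are assumed to be Python \texttt{bytes}, and the sketch ignores the tree entirely.

```python
from itertools import product


def partition_prefix_case(begin: bytes, end: bytes):
    """Partition [begin, end), assuming begin is a proper prefix of end.

    Returns a list of ("point", key) and ("prefix", p) items, where
    ("prefix", p) stands for the set of all keys that start with p.
    """
    if not (len(begin) < len(end) and end.startswith(begin)):
        raise ValueError("begin must be a proper prefix of end")
    parts = []
    for depth in range(len(begin), len(end)):
        prefix = end[:depth]
        # The key equal to the current prefix is a single point.
        parts.append(("point", prefix))
        # Every child byte below the next byte of `end` is a whole prefix range.
        for b in range(end[depth]):
            parts.append(("prefix", prefix + bytes([b])))
    return parts


def covers(part, key: bytes) -> bool:
    """Report whether `key` belongs to one element of the partition."""
    kind, p = part
    return key == p if kind == "point" else key.startswith(p)


if __name__ == "__main__":
    begin, end = b"\x01", b"\x01\x02\x01"
    parts = partition_prefix_case(begin, end)
    # Brute-force check over all keys of length <= 3 with bytes in [0, 4):
    # a key is covered exactly once if it lies in [begin, end), else never.
    for n in range(4):
        for t in product(range(4), repeat=n):
            key = bytes(t)
            hits = sum(covers(part, key) for part in parts)
            assert hits == (1 if begin <= key < end else 0)
```

The brute-force loop at the end checks, on a small alphabet, that the elements of the partition are disjoint and that together they cover exactly the keys in the range.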