From 442755d0a6d2e3d4698552675acf24c1aef47b76 Mon Sep 17 00:00:00 2001
From: Andrew Noyes <andrew@weaselab.dev>
Date: Wed, 21 Aug 2024 13:09:24 -0700
Subject: [PATCH] Update implementation notes

This doesn't really capture the complexity, but at least it's more
accurate.
---
 paper/paper.tex | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/paper/paper.tex b/paper/paper.tex
index 2fca47d..c3ac619 100644
--- a/paper/paper.tex
+++ b/paper/paper.tex
@@ -206,8 +206,11 @@ until we end at $a_{i} + 1$, adjacent to the first inner range.
 
 A few notes on implementation:
 \begin{itemize}
-    \item{For clarity, the above algorithm decouples the logical partitioning from the physical structure of the tree. An optimized implementation would merge adjacent prefix ranges that don't correspond to nodes in the tree as it scans, so that it only calculates the version of such merged ranges once. Additionally, our implementation stores an index of which child pointers are valid as a bitset for Node48 and Node256 to speed up this scan using techniques inspired by \cite{Lemire_2018}.}
-    \item{In order to avoid many costly pointer indirections, we can store the max version not in each node itself but next to each node's parent pointer. Without this, the range read performance is not competetive with the skip list.}
+    \item{For clarity, the above algorithm decouples the logical partitioning from the physical structure of the tree.
+          An optimized implementation would merge adjacent prefix ranges that don't correspond to nodes in the tree as it scans, so that it only calculates the version of such merged ranges once.
+          Additionally, our implementation uses SIMD instructions and instruction-level parallelism to compare many prefix ranges to the read version $r$ in parallel.}
+    \item{In order to avoid many costly pointer indirections, and to take advantage of SIMD, we can store the max version of child nodes as a dense array directly in the parent node.
+          Without this, the range read performance is not competitive with the skip list.}
     \item{An optimized implementation would visit the partition of $[a_{i}\dots a_{m}, a_{i} + 1)$ in reverse order, as it descends along the search path to $a_{i}\dots a_{m}$}
     \item{An optimized implementation would search for the common prefix first, and return early if any prefix of the common prefix has a $max \leq r$.}
 \end{itemize}