Initial commit

2025-08-14 10:27:52 -04:00
commit 5fe2127a49
2 changed files with 273 additions and 0 deletions

api.md Normal file

@@ -0,0 +1,272 @@
# API Specification
> **Note:** This is a design for the API of the write-side of a database system where writing and reading are decoupled. The read-side of the system is expected to use the `/v1/subscribe` endpoint to maintain a queryable representation of the key-value data. In other words, reading from this "database" is left as an exercise for the reader. Authentication and authorization are out of scope for this design.
-----
## `GET /v1/version`
Retrieves the latest known committed version and the current leader.
### Response
```json
{
  // A committed version that is guaranteed to be greater than or equal to the latest transaction version known to be committed at the time of the request.
  // Suitable for externally-consistent reads.
  "version": 123456,
  // The unique ID of the leader at this version.
  "leader_id": "abcdefg"
}
```
-----
## `POST /v1/commit`
Submits a transaction to be committed. The transaction consists of read preconditions, writes, and deletes.
* Clients may receive a **`413 Content Too Large`** response if the request exceeds a configurable limit.
* A malformed request will result in a **`400 Bad Request`** response.
* Keys are sorted by a lexicographical comparison of their raw byte values.
* All binary data for keys and values must be encoded using the standard base64 scheme defined in [RFC 4648](https://datatracker.ietf.org/doc/html/rfc4648#section-4), with padding included.
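
The last two rules interact in a way that is easy to miss: ordering is defined on the raw bytes, not on the base64 text that travels in the request. A minimal Python sketch of both rules follows; the helper names are illustrative and not part of the API.

```python
import base64


def encode_bytes(raw: bytes) -> str:
    """Encode raw key or value bytes as standard base64 (RFC 4648 section 4), padded."""
    return base64.b64encode(raw).decode("ascii")


def decode_bytes(text: str) -> bytes:
    """Decode standard base64, rejecting characters outside the alphabet."""
    return base64.b64decode(text, validate=True)


def key_less_than(a: bytes, b: bytes) -> bool:
    # Python compares bytes objects lexicographically by unsigned byte value,
    # which is the ordering the specification describes for keys.
    return a < b


# Raw order and encoded order can disagree, so always compare decoded bytes.
low, high = b"\x00", b"\xfb"
raw_order_ok = key_less_than(low, high)                   # raw bytes: low sorts first
encoded_order = encode_bytes(low) < encode_bytes(high)    # base64 text: the opposite
```

Comparing the encoded strings would sort these two keys the wrong way round, which is why range bounds should be validated before encoding.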
### Request
```json
{
  // Optional unique ID that identifies this commit request. If omitted, a UUID will be generated. See Note 1.
  "request_id": "abcdefg",
  // The expected leader_id. The request is rejected if the leader has changed.
  "leader_id": "abcdefg",
  // The default version for read preconditions
  "read_version": 123456,
  // A list of optimistic concurrency conditions that must be satisfied. See Note 2.
  "preconditions": [
    // Verifies that the existence and content of "key" has not changed since "version"
    {
      "type": "point_read",
      // The known-committed version the precondition is based on. See Note 3.
      // Takes the value of "read_version" if omitted.
      "version": 123456,
      "key": "base64=="
    },
    {
      "type": "range_read",
      "version": 123456,
      // Inclusive
      "begin": "base64==",
      // Exclusive. Must be > "begin"
      "end": "base64=="
    }
  ],
  // Applied in order. Operations may overwrite the effects of earlier operations.
  "operations": [
    {
      "type": "write",
      "key": "base64==",
      "value": "base64=="
    },
    {
      "type": "delete",
      "key": "base64=="
    },
    {
      "type": "range_delete",
      "begin": "base64==",
      "end": "base64=="
    }
  ]
}
```
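
As an illustration, a client might assemble this request body with a few small helpers. The sketch below is written in Python under the assumption that the wire format is plain JSON without the explanatory comments shown above; every function name here is hypothetical.

```python
import base64
import json
import uuid
from typing import Optional


def b64(raw: bytes) -> str:
    """Standard base64 with padding, as required for keys and values."""
    return base64.b64encode(raw).decode("ascii")


def point_read(key: bytes, version: Optional[int] = None) -> dict:
    cond = {"type": "point_read", "key": b64(key)}
    if version is not None:  # when omitted, "read_version" applies
        cond["version"] = version
    return cond


def range_read(begin: bytes, end: bytes, version: Optional[int] = None) -> dict:
    if not begin < end:  # "end" is exclusive and must sort after "begin"
        raise ValueError("end must be greater than begin")
    cond = {"type": "range_read", "begin": b64(begin), "end": b64(end)}
    if version is not None:
        cond["version"] = version
    return cond


def write(key: bytes, value: bytes) -> dict:
    return {"type": "write", "key": b64(key), "value": b64(value)}


def delete(key: bytes) -> dict:
    return {"type": "delete", "key": b64(key)}


def build_commit(leader_id: str, read_version: int,
                 preconditions: list, operations: list) -> str:
    body = {
        "request_id": uuid.uuid4().hex,  # 32 characters, above the default minimum of 20
        "leader_id": leader_id,
        "read_version": read_version,
        "preconditions": preconditions,
        "operations": operations,  # applied by the server in list order
    }
    return json.dumps(body)


payload = build_commit(
    "abcdefg", 123456,
    [point_read(b"user/1")],
    [write(b"user/1", b"v2"), delete(b"user/1/tmp")],
)
```

Validating `begin < end` on the raw bytes before encoding keeps an avoidable `400 Bad Request` on the client side.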
### Response
```json
{
  "status": "committed|not_committed",
  // If not committed, a list of preconditions that were not satisfied
  "conflicts": [/* Same as "preconditions" in /v1/commit */],
  // If committed, the version at which the transaction was applied.
  // If not committed, a more recent version that the client can use to retry.
  "version": 123456,
  // The unique ID of the leader at this version.
  "leader_id": "abcdefg"
}
```
### Detailed Notes for `/v1/commit`
1. **`request_id`**: Optional field that can be used with `/v1/status` to determine the outcome if no reply is received. If omitted, the server generates a UUID, and the client cannot determine the commit status if no response arrives. When provided, the `request_id` must meet the minimum length requirement (configurable, default 20 characters) to ensure sufficient entropy for collision avoidance. A `request_id` must never be reused across commit requests. If a response is not received, the client must call `/v1/status` to determine the request's outcome, and any retry must be sent with a new `request_id`. The alternative design, in which the server deduplicates reused IDs, would require the leader to store every request ID in memory.
2. **`preconditions` (Guarantees and Usage)**: A precondition is satisfied if the server verifies that the key or range has not changed since the specified version. Clients can achieve serializable isolation by including all reads that influenced their writes. By default, clients should assume that any read they perform influences their writes. Omitting reads is an expert-level optimization and should generally be avoided.
3. **`preconditions` (False Positives & Leader Changes)**: Precondition checks are conservative and best-effort; the server may reject a transaction even though the key or range has not actually changed. In all such cases, clients should retry with a more recent read version. The versions used by different preconditions in one request need not be the same. Two examples of false positives are:
    * **Implementation Detail:** The leader may use partitioned conflict history for performance. A conflict in one partition (even from a transaction that later aborts) can cause a rejection.
    * **Leader Changes:** A version is only valid within the term of the leader that issued it. Since conflict history is stored in memory, a leadership change invalidates all previously issued read versions. Any transaction using such a version will be rejected.
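
The retry rules in these notes can be sketched as a small client-side helper. This is a hedged Python illustration, not a prescribed client: `new_request_id` and `plan_retry` are hypothetical names, and the caller is assumed to redo its reads at the returned version before building the next request.

```python
import uuid
from typing import Optional

# Documented default minimum length; the real limit is configurable on the server.
MIN_REQUEST_ID_LENGTH = 20


def new_request_id() -> str:
    """Return a fresh ID for every attempt; IDs are never reused."""
    rid = uuid.uuid4().hex  # 32 hexadecimal characters
    if len(rid) < MIN_REQUEST_ID_LENGTH:
        raise ValueError("request_id is shorter than the required minimum")
    return rid


def plan_retry(response: dict) -> Optional[dict]:
    """Given a /v1/commit response, return parameters for the next attempt.

    Returns None when the transaction committed. Otherwise returns the newer
    version and leader reported by the server plus a fresh request_id.
    """
    if response["status"] == "committed":
        return None
    return {
        "request_id": new_request_id(),
        "read_version": response["version"],
        "leader_id": response["leader_id"],
    }
```

Because false positives and leader changes both surface as `not_committed`, the same path handles them: re-read at the newer version, rebuild the preconditions, and resubmit.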
-----
## `GET /v1/status`
`GET /v1/status?request_id=<ID>&min_version=<VERSION>`
Gets the status of a previous commit request by its `request_id`.
> This is an expensive operation and should only be used when the original request did not receive a response.
### Query Parameters
| Parameter | Type | Required | Description |
| :--- | :--- | :--- | :--- |
| `request_id` | string | Yes | The `request_id` from the original `/v1/commit` request. |
| `min_version` | integer | Yes | An optimization that constrains the log scan. This value should be the latest version the client knew to be committed *before* sending the original request. |
> **Warning!** If the provided `min_version` is later than the transaction's actual commit version, the server might not find the record in the scanned portion of the log. This can result in an `id_not_found` status, even if the transaction actually committed.
### Response
A response from this endpoint guarantees the original request is no longer in flight.
```json
{
  // The final status of the original request.
  "status": "committed|id_not_found|log_truncated",
  // If committed, the version at which the original request committed.
  "version": 123456,
  // If committed, the unique ID of the leader as of the commit version.
  "leader_id": "abcdefg"
}
```
```
> **Note on `log_truncated` status:** This indicates the `request_id` log has been truncated after `min_version`, making it impossible to determine the original request's outcome. There is no way to avoid this without storing an arbitrarily large number of request IDs. Clients must treat this as an indeterminate outcome. Retrying the transaction is unsafe unless the client has an external method to verify the original transaction's status. This error should be propagated to the caller. `request_id`s are retained for a configurable minimum time and number of versions so this should be extremely rare.
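
One way a client might act on the three statuses is sketched below in Python. The `Outcome` enum and both function names are hypothetical. Treating `id_not_found` as "did not commit" is an assumption that only holds when `min_version` was chosen as described above.

```python
from enum import Enum
from urllib.parse import urlencode


class Outcome(Enum):
    COMMITTED = "committed"          # nothing more to do
    NOT_COMMITTED = "not_committed"  # retry with a new request_id
    INDETERMINATE = "indeterminate"  # propagate to the caller; do not retry blindly


def status_path(request_id: str, min_version: int) -> str:
    """Build the path and query string for GET /v1/status."""
    query = urlencode({"request_id": request_id, "min_version": min_version})
    return "/v1/status?" + query


def interpret_status(response: dict) -> Outcome:
    status = response["status"]
    if status == "committed":
        return Outcome.COMMITTED
    if status == "id_not_found":
        # Assumes min_version was a version known to be committed *before*
        # the original request was sent; otherwise the record may have been missed.
        return Outcome.NOT_COMMITTED
    if status == "log_truncated":
        return Outcome.INDETERMINATE
    raise ValueError(f"unexpected status: {status!r}")
```

Keeping `INDETERMINATE` as a distinct outcome makes it harder for calling code to fold it into the ordinary retry path by accident.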
-----
## `GET /v1/subscribe`
`GET /v1/subscribe?after=<VERSION>&durable=<bool>`
Streams a suffix of transactions using the **Server-Sent Events** (SSE) [protocol](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). The server will respond with a `Content-Type: text/event-stream` header.
Clients should rely on the `version` field within the `transaction` and `checkpoint` event data to track their position in the stream for handling reconnections, as the `id` field (`Last-Event-ID`) is not sent.
### Query Parameters
| Parameter | Type | Required | Description |
| :-------- | :------ | :------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `after` | integer | No | The version after which to start streaming transactions. Defaults to streaming from the latest committed version. On reconnect, clients should set this to the last version they successfully processed. |
| `durable` | boolean | No | If `true` (the default), the stream sends `transaction` events only after they are durably committed. This increases latency but simplifies client logic. When `durable=true`, `checkpoint` events are not sent. |
### Server-Sent Events Stream
The response is a stream of events compliant with the SSE protocol.
*A **`transaction`** event containing the operations that took effect at a version:*
```
event: transaction
data: {"request_id":"abcdefg","version":123456,"timestamp":"2025-08-07T20:27:42.555Z","leader_id":"abcdefg","operations":[...]}
```
*A **`checkpoint`** event indicating the latest durable version. Only sent when `durable=false`:*
```
event: checkpoint
data: {"committed_version":123456,"leader_id":"abcdefg"}
```
*A **`keepalive`** comment sent periodically to prevent idle timeouts. This is an SSE comment and does not have an `event` or `data` field:*
```
: keepalive
```
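
A reader of this stream needs only a small subset of SSE parsing. The Python sketch below handles the three shapes shown above, assuming each `data` payload is a complete JSON document; `parse_sse` is an illustrative name, and the function takes any iterable of text lines rather than a specific HTTP client.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_type, decoded_data) pairs from SSE lines.

    Comment lines (starting with ':'), such as keepalives, are skipped.
    A blank line dispatches the event accumulated so far.
    """
    event_type, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data_lines:
                yield event_type, json.loads("\n".join(data_lines))
            event_type, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # SSE comment, e.g. ": keepalive"
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "event":
            event_type = value
        elif field == "data":
            data_lines.append(value)


stream = [
    "event: transaction\n",
    'data: {"request_id":"r1","version":7,"leader_id":"abcdefg","operations":[]}\n',
    "\n",
    ": keepalive\n",
    "\n",
    "event: checkpoint\n",
    'data: {"committed_version":7,"leader_id":"abcdefg"}\n',
    "\n",
]
events = list(parse_sse(stream))
```

An event that is still incomplete when the connection drops is discarded, so the client's tracked `version` only advances on fully received events.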
### Detailed Notes for `/v1/subscribe`
1. **Data Guarantees**: When `durable=false`, this endpoint streams *accepted*, but not necessarily *durable/committed*, transactions. *Accepted* transactions will eventually commit unless the current leader changes.
2. **Leader Changes & Reconnection**: When `durable=false`, if the leader changes, clients **must** discard all of that leader's `transaction` events received after their last-seen `checkpoint` event. They must then manually reconnect (as the server connection will likely be terminated) and restart the subscription by setting the `after` query parameter to the version specified in that last-known checkpoint. Clients should implement a randomized exponential backoff strategy (backoff with jitter) when reconnecting.
3. **Connection Handling & Errors**: The server may periodically send `keepalive` comments to prevent idle timeouts on network proxies. The server will buffer unconsumed data up to a configurable limit; if the client falls too far behind, the connection will be closed. If the `after` version has been truncated from the log, this endpoint will return a standard `410 Gone` HTTP error instead of an event stream.
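
For `durable=false` consumers, the discard-and-resume rule in Note 2 can be sketched as a small buffer. This Python example is one interpretation, not the reference behavior: it releases transactions by comparing their `version` to the checkpoint's `committed_version`, and the class name, method names, and backoff parameters are all assumptions.

```python
import random
from typing import List, Optional


class TentativeBuffer:
    """Holds accepted transactions until a checkpoint confirms them as durable."""

    def __init__(self) -> None:
        self.pending: List[dict] = []
        self.checkpoint_version: Optional[int] = None

    def on_transaction(self, txn: dict) -> None:
        self.pending.append(txn)

    def on_checkpoint(self, checkpoint: dict) -> List[dict]:
        """Record the checkpoint and return the transactions it makes durable."""
        self.checkpoint_version = checkpoint["committed_version"]
        durable = [t for t in self.pending if t["version"] <= self.checkpoint_version]
        self.pending = [t for t in self.pending if t["version"] > self.checkpoint_version]
        return durable

    def on_leader_change(self) -> Optional[int]:
        """Drop unconfirmed transactions; return the version to pass as `after`."""
        self.pending.clear()
        return self.checkpoint_version


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Randomized exponential backoff: a delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

If no checkpoint has been seen yet, `on_leader_change` returns `None`, and the client would fall back to whatever version it last processed durably.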
-----
## `PUT /v1/retention/<policy_id>`
Creates or updates a retention policy.
### Request
```json
{
  // Prevents truncating this or any higher version of the log
  "prevent_truncate": 123400
}
```
### Response
* `201 Created` if the policy was created.
* `200 OK` if the policy was updated.
-----
## `GET /v1/retention/<policy_id>`
Retrieves a retention policy by ID.
### Response
```json
{
  "prevent_truncate": 123400
}
```
-----
## `GET /v1/retention/`
Retrieves all retention policies.
### Response
```json
[
  {
    "policy_id": "<policy_id>",
    "prevent_truncate": 123400
  }
]
```
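
The specification does not state how multiple policies combine. Assuming each policy independently protects its own version and everything above it, the lowest `prevent_truncate` value bounds what may be truncated. The Python sketch below computes that bound from the list response; `truncation_floor` is a hypothetical helper.

```python
from typing import List, Optional


def truncation_floor(policies: List[dict]) -> Optional[int]:
    """Return the lowest protected version, or None when no policy exists.

    Under the stated assumption, only versions strictly below this value
    are eligible for truncation.
    """
    versions = [p["prevent_truncate"] for p in policies]
    return min(versions) if versions else None


policies = [
    {"policy_id": "reader-a", "prevent_truncate": 123400},
    {"policy_id": "reader-b", "prevent_truncate": 120000},
]
floor = truncation_floor(policies)
```

With these two policies, the slower reader's policy is the one holding the log in place.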
-----
## `DELETE /v1/retention/<policy_id>`
Removes a retention policy, which may allow the log to be truncated.
### Response
`204 No Content`
-----
## `GET /metrics`
Retrieves server metrics for monitoring.
### Response
The response body uses the Prometheus text-based format.
```
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 157.4
```