DOI : 10.5281/zenodo.21023673
- Open Access

- Authors : Chaitanya Paraskar, Shreyash Dhas, Shreyash Gajbhiye, Shreyas Gopnarayan, Prof. Madhuri Mane
- Paper ID : IJERTV15IS061133
- Volume & Issue : Volume 15, Issue 06 , June – 2026
- Published (First Online): 29-06-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
KiwiDB: An Async-Native Log-Structured Merge Tree Key-Value Store
Chaitanya Paraskar
Department of Computer Engineering Pune Institute of Computer Technology, Pune, India
Shreyash Dhas
Department of Computer Engineering Pune Institute of Computer Technology, Pune, India
Shreyash Gajbhiye
Department of Computer Engineering Pune Institute of Computer Technology, Pune, India
Shreyas Gopnarayan
Department of Computer Engineering Pune Institute of Computer Technology, Pune, India
Prof. Madhuri Mane
Department of Computer Engineering Pune Institute of Computer Technology, Pune, India
Abstract – Log-Structured Merge Trees (LSM trees) are the dominant write-optimized storage engine architecture, underpin- ning systems such as LevelDB, RocksDB, and Apache Cassandra. However, existing implementations are written in C/C++ with synchronous APIs, making them incompatible with modern asyn- chronous application frameworks. This paper presents KiwiDB, an LSM tree key-value store implemented from scratch in Python 3.12+. KiwiDB introduces three design innovations: (1) an async-native API where the event loop never blocks on disk I/O;
(2) a parallel ush pipeline with event-chain ordered commits that
achieves concurrent SSTable writes while preserving strict FIFO ordering; and (3) subprocess-based compaction that bypasses CPythons Global Interpreter Lock (GIL) for true write-path isolation. The engine employs a hybrid L0-tiering / L1-leveling compaction strategy, a concurrent skip list with lock-free reads, data-derived bloom lter sizing, and a three-tier block cache. Compared with the stock LevelDB design, KiwiDB reduces ush-time write amplication by 10× and provides built-in observability through a FastAPI REST interface and a React web dashboard. The system comprises approximately 6,600 lines of type-checked Python with 211 tests across 25 modules.
Index TermsLSM tree, key-value store, async I/O, com- paction, skip list, bloom lter, write-ahead log, Python, storage engine
-
Introduction
With the rapid growth of web applications and cloud com- puting, key-value (KV) stores have become a critical compo- nent of modern data infrastructure. A KV store maps a set of keys to associated values and can be considered a distributed hash table. Without having to provide features usually required for a relational database system, KV stores offer higher per- formance, better scalability, and more availability [15]. They play a central role in data centers supporting Internet-scale services, including BigTable [5] at Google, Cassandra [4] at Facebook, and Dynamo [6] at Amazon.
The B+ tree is a common structure used in traditional databases, because its high fanout reduces the number of I/O operations for a query. However, it is highly inefcient
for random insertions and updates. When there are intensive mutations, B+ trees lead to a large number of costly disk seeks and signicantly degraded performance. The Log-Structured Merge Tree (LSM tree) [1] addresses this by transforming random writes into sequential writes: incoming data is sorted in an in-memory buffer and ushed to storage as a whole when the buffer is full. Many popular KV stores adopt this design, including BigTable [5], Cassandra [4], HBase [7], and
LevelDB [2].
Despite the maturity of these systems, we observe two limitations that motivate this work:
-
Synchronous APIs. LevelDB and RocksDB [3] ex- pose blocking C++ interfaces. Modern Python applications (FastAPI, aiohttp) run on asyncio event loops. Integrating a synchronous storage engine requires thread-pool wrappers that add latency, complexity, and concurrency bugs. To the best of our knowledge, no existing LSM implementation offers a natively asynchronous API.
-
Opaque internals. Production engines prioritize throughput over observability. Inspecting memtable state, bloom lter hit rates, compaction progress, or SSTable lay- outs requires external tooling. Internal design decisions are documented sparsely, making these systems difcult to learn from.
To address these limitations, we present KiwiDB, an LSM tree key-value store implemented from scratch in Python 3.12+. The contributions of this work are:
-
KiwiDB provides an async-native API where all public operations (put, get, delete, flush, close) are async coroutines. WAL fsync, SSTable ush, and compaction execute off the event loop via asyncio.to_thread() and ProcessPoolExecutor.
-
We design a parallel ush pipeline that writes multiple
SSTables concurrently while preserving strict FIFO commit ordering via an asyncio.Event chain. We prove its crash-safety properties and analyze failure propagation.
Write
WAL
MemTable
freeze
Immutable MemTable
Memory Disk
Level 0
Level 1
Level 2
Merged SSTable (larger)
-
-
Merged SSTable
SST
SST
SST
SST
TABLE I
Impact of CPythons GIL on engine components and the concurrency strategy adopted by KiwiDB.
|
Component |
Bound |
GIL Impact |
KiwiDB Strategy |
|
MemTable insert |
CPU |
Serialized |
Sync (fast, s) |
|
WAL fsync |
I/O |
GIL released |
to_thread() |
|
SSTable read |
I/O |
GIL released |
mmap (zero-copy) |
|
SSTable write |
I/O |
GIL released |
Async parallel |
|
Compaction |
CPU |
Serialized |
Subprocess |
Fig. 1. Architecture of LevelDB. Writes go to the WAL and MemTable. When the MemTable is full, it becomes an Immutable MemTable and is ushed to Level 0. Compaction merges SSTables across levels.
-
Compaction merges run in subprocess workers via ProcessPoolExecutor, bypassing CPythons GIL to achieve true CPU parallelism with zero impact on the write path.
-
We adopt a hybrid L0-tiering / L1-leveling compaction strategy that reduces write amplication by 10× compared to LevelDBs pure leveling, and implement a concurrent skip list with lock-free reads and per-node write locking.
-
We provide a built-in observability layer with a FastAPI REST interface, a React web dashboard, and live log stream- ing via WebSocket making the engines internal state fully inspectable in real time.
The remainder of this paper is organized as follows. Section 2 provides background on LevelDBs architecture and CPythons GIL. Section 3 describes how we extend the LSM design with our proposed techniques. Section 4 discusses implementation-specic details and Python-specic challenges. We describe related work in Section 5, followed by conclusions in Section 6.
-
Background
In order to present the details of our design, we rst review the architecture of LevelDB, the canonical LSM tree imple- mentation. Then we describe CPythons Global Interpreter Lock, which motivates several of our design decisions.
-
LevelDB Architecture
LevelDB [2] is an open-source KV store that originated from Googles BigTable [5]. It s a widely studied imple- mentation of the LSM tree [2], [14]. Figure 1 illustrates its architecture, which consists of two MemTables in main memory and a set of SSTables on disk, along with auxiliary les such as the Manifest and the write-ahead log.
Write path. When the user inserts a KV pair, it is rst appended to the write-ahead log (WAL) for durability, then in- serted into the MemTable, a sorted in-memory structure. When the MemTable reaches its size limit, it is converted to a read- only Immutable MemTable. A new MemTable is created for incoming writes. A background thread ushes the Immutable
MemTable to disk as a Sorted String Table (SSTable). Deletes are a special case: a deletion marker (tombstone) is stored.
compaction
Level organization. SSTables are organized into levels. Level 0 is produced directly from MemTable ushes and may contain overlapping key ranges. Levels 1 and above have non- overlapping key ranges within each level. Each level has a size limit that grows exponentially: Level 1 holds up to 10 MB, Level 2 up to 100 MB, and so on.
Compaction. When the total size of Level L exceeds its limit, the compaction thread picks one SSTable from Level L and all overlapping SSTables from Level L+1, merges them, and produces new Level L+1 les. This eliminates overwritten values, drops tombstones, and ensures that fresher data resides in lower levels.
Read path. A get(key) request searches the MemTable, then the Immutable MemTable, then SSTables from Level 0 upward until a match is found. Since lower levels contain fresher data, the rst match is authoritative. A Bloom lter [9] in each SSTable reduces unnecessary I/O by quickly ruling out les that do not contain the requested key.
Limitations. Two aspects of LevelDBs design are relevant to this work. First, there is only one Immutable MemTable and one background thread for ush and compaction. If the MemTable lls while the Immutable MemTable is still being ushed, writes are blocked. Second, the API is entirely synchronous every operation blocks the calling thread until completion.
-
CPythons Global Interpreter Lock
CPython, the reference Python implementation, uses a Global Interpreter Lock (GIL) that allows only one thread to execute Python bytecode at a time [16]. This has two consequences for a storage engine:
-
I/O-bound work is unaffected: Threads release the GIL during system calls (read, write, fsync), so I/O-bound operations like WAL append and SSTable reads achieve true concurrency.
-
CPU-bound work is serialized: A compaction merge that iterates millions of records in Python is CPU-bound and cannot benet from multi-threading. True parallelism requires a separate process with its own GIL.
Table I summarizes the impact on each engine component and the concurrency strategy KiwiDB adopts.
Tier 4: memtable_mgr sstable_mgr compaction/
Tier 0: types crc uuid7 errors
-
-
Tier 1: bloom/ cache/ index/ rwlock
Tier 2: memtable/ wal/
Tier 3: sstable/ config seq_generator
Tier 5: flush_pipeline compaction_mgr
Tier 6: LSMEngine (public async API)
all tiers
1. SeqGenerator.next()
-
WAL.sync_append()
-
MemTable.put()
4. maybe_freeze()
threshold reached
Signal ush pipeline
monotonic seq
fsync to disk
held under
write_lock
skip list insert
freeze if threshold
Fig. 3. KiwiDB write path. All four steps execute atomically under
stdlib only write_lock. The WAL is fsynced (Step 2) before the memtable is updated (Step 3), ensuring durability before visibility.
Fig. 2. KiwiDB layered architecture. Each tier depends only on tiers below it. The engine at Tier 6 is a thin coordinator.
TABLE II
KiwiDB software specification.
Property Value
Language Python 3.12+
Lines of code 6,600 (engine)
Test suite 211 tests across 25 modules Type checking mypy strict + basedpyright strict Linting ruff + bandit (security)
Build system uv + setuptools
Web framework FastAPI + Vite/React
Key dependencies msgpack, mmp, structlog, cachetools
-
Design and Implementation
In this section we rst introduce the architectural overview of KiwiDB. Then we describe each subsystem extension over the classical LevelDB design: the async-native write and read paths, the concurrent skip list, the parallel ush pipeline, the hybrid compaction strategy, and the observability layer.
-
Overall Architecture
Figure 2 shows the overall architecture of KiwiDB, orga- nized into a strict layered dependency model. The system consists of seven tiers, from leaf modules (Tier 0: types, CRC, UUID) up to the top-level engine coordinator (Tier 6: LSMEngine). No module imports from a tier above it, preventing circular dependencies and enabling each tier to be tested in isolation.
As shown in Figure 2, the software is composed of six subsystems: the Write-Ahead Log (WAL), MemTable with concurrent skip list, SSTable storage with bloom lter and sparse index, ush pipeline, compaction engine, and observ- ability layer. The engine coordinates these subsystems through managers at Tiers 45.
Table II lists the software specication of KiwiDB.
TABLE III
MemTable freeze thresholds by environment.
Mode Condition Default Cong Key
Dev entry count limit 10 entries max_memtable_entries
Prod byte size limit 64 MB max_memtable_size_mb
-
Async-Native Write Path
All writes in KiwiDB follow the same atomic path under a single lock. The key invariant is durability before visibility: the WAL is fsynced before the memtable is updated. Figure 3 illustrates this four-step sequence.
Step 1: Sequence generation. SeqGenerator.next() atomically increments a monotonic counter under threading.Lock. This sequence number serves as the MVCC version higher values indicate newer data.
Step 2: WAL durability. The entry is serialized using msgpack and framed as [4B length][payload][4B CRC32]. Two operation types are supported: PUT (op=1) and DELETE (op=2). The le descriptor is fsynced before returning. This is the most expensive step, but it guarantees that even if the process crashes immediately after put() returns, the entry is recoverable.
Step 3: MemTable insert. The KV pair is inserted into the active skip list (described in Section III-D).
Step 4: Conditional freeze. If the active memtable exceeds its threshold, it is frozen into an ImmutableMemTable, pushed to the immutable queue, and the ush pipeline is signaled. Table III shows the freeze conditions.
Lock ordering. KiwiDB enforces a strict ordering
write_lock wal_lock to prevent dead- lock. The write path holds write_lock while calling sync_append(), which internally acquires wal_lock. Reversing this order would deadlock.
Delete vs. Put. A delete(key) follows the identical path but writes OpType.DELETE with a tombstone sentinel value b”\x00 tomb \x00″. The tombstone propagates through ush and compaction until garbage-collected.
f
-
H
engine.get(key) L3
lock-free
O(1) dict lookup
-
Active MemTable
miss
-
Immutable Queue (04)
miss
hit return
L2
k
f
c
a
H
f
c
H
per-node lock
L1
(writers only)
bloom index bl3oc. kL0 SSTables (all les)
miss
4. L1 SSTable
5. L2, L3 SSTables
miss
highest seq wins
L0
k
f
c
a
H
Fig. 5. Concurrent skip list with four keys. Each node has per-node threading.Lock for writers. Readers traverse lock-free by checking GIL- atomic fully_linked and marked boolean ags.
Fig. 4. KiwiDB read path. Data sources are checked newest-rst. MemTable lookup is lock-free. Each SSTable uses a bloom index block pipeline.
TABLE IV
MemTable concurrency model comparison.
|
Engine |
Structure |
Write |
Read |
|
LevelDB |
Skip list |
Global mutex |
Serialized |
|
RocksDB |
Skip list |
Lock-free (CAS) |
Lock-free |
|
KiwiDB |
Skip list |
Per-node locks |
Lock-free (GIL) |
The rst match with the highest sequence number wins.
-
Read Path
Figure 4 illustrates the read path. Reads scan data sources from newest to oldest, returning the rst match with the highest sequence number. Crucially, no write locks are acquired reads are non-blocking.
Step 1: The active memtable is searched via O(log n) lock- free skip list traversal.
Step 2: The immutable queue (up to 4 frozen snapshots) is scanned newest-rst. Each lookup is O(1) via an internal dictionary index.
Step 3: All L0 SSTables are checked because their key ranges overlap. For each le, the lookup pipeline is: bloom lter (rule out absent keys) sparse index bisect (nd candidate block) block scan (sequential scan within block). The result with the highest sequence number across all L0 les is kept.
Steps 45: L1L3 each have at most one SSTable. The same bloom index block pipeline applies, but only one le per level needs checking.
Tombstone resolution: After nding a result from any source, if the value equals the tombstone sentinel, the key has been deleted the engine returns None.
Example. Suppose a user writes put(“x”, “v1”) with seq=5, then put(“x”, “v2”) with seq=8, then the memtable ushes to L0. A get(“x”) rst checks the (now empty) active memtable, then the immutable queue, then L0. In L0, the SSTable contains both records; the one with seq=8 wins. The engine returns “v2”.
-
Concurrent Skip List
The memtable is backed by a concurrent skip list with 16 levels and promotion probability p = 0.5. Figure 5 shows the node structure and concurrency model.
Each node stores: key, seq (MVCC version), timestamp_ms, value, forward[] (one pointer per level), a per-node threading.Lock, and two boolean ags: marked (logical deletion) and fully_linked (visibility).
Writer protocol. Writers lock only the predecessor nodes at the insertion boundary. Two concurrent inserts with disjoint keys do not contend. The protocol is: (1) top-down traversal to nd predecessors and successors, (2) lock predecessors,
(3) validate they are still correct, (4) create the node and link at level 0, (5) link at higher levels and set fully_linked
= True, (6) release locks.
Lock-free readers. Readers perform a top-down traversal checking the fully_linked and marked ags. Under CPython, single-attribute reads on Python objects are GIL- atomic no torn reads occur. This makes reader traversal effectively lock-free without CAS instructions.
Table IV compares the concurrency models of the three engines.
-
SSTable Format
Each SSTable is a directory containing four les, as shown in Table V. The presence of meta.json is the completeness signal: if it is missing, the SSTable is considered incomplete (e.g., crash during write) and is ignored on recovery.
Record encoding. Each record uses the binary format shown in Figure 6. The CRC32 covers the entire record for integrity verication.
Writer state machine. The SSTable writer follows a strict lifecycle: OPEN put()×N finish() DONE. Keys must be in ascending order (enforced). finish() is async for L0 ushes (writing bloom and index concurrently via asyncio.gather), while finish_sync() is syn- chronous for compaction subprocesses.
Reader with mmap. The reader uses memory-mapped I/O for zero-copy access to data.bin. Bloom lter and sparse index are loaded lazily on rst get() and cached in the shared block cache.
TABLE V
SSTable directory layout.
File Purpose
data.bin Sorted KV records grouped into 4 KB blocks index.bin Sparse index: rst key per block byte offset filter.bin Bloom lter bit array + header
meta.json Metadata; written last as completeness signal
en
2B 2B 8B 8B var var 4B
keyl
vallen
seq
tsms
key bytes
val bytes
CRC32
Fig. 6. SSTable record binary format. Key and value lengths are 2-byte big-endian unsigned integers. Sequence and timestamp are 8-byte big-endian signed integers.
-
Bloom Filter with Data-Derived Sizing
Unlike LevelDB and RocksDB, which use a xed bits-per- key ratio, KiwiDB sizes each bloom lter using the actual record count n at write time:
(ln 2)2
n
m = n · ln p , k = m · ln 2 (1)
where p is the target false positive rate, m is the bit array size, and k is the number of hash functions (MurmurHasp with distinct seeds). Table VI shows the difference.
Example. If LevelDB uses 10 bits/key for an SSTable with 100 entries, it allocates 1,000 bits the same ratio it would use for 1M entries. KiwiDB computes the optimal bit count for exactly 100 entries at 1% FPR: m = I100 · ln(0.01)/(ln 2)2l = 959 bits with k = 7 hash functions. Every SSTable gets a right-sized lter.
-
Three-Tier Block Cache
KiwiDB uses a three-tier LRU cache that serves data blocks, sparse indexes, and bloom lters with independent eviction policies. The tiered design reects access patterns: bloom lters should remain resident (checked on every lookup), indexes are checked on positive bloom results, and data blocks are accessed only when the index indicates a candidate.
Each tier is backed by an independent cachetools.LRUCache with threading.RLock for thread safety. Cache entries persist across engine restarts: a new SSTableReader instance reuses cached bloom lters and indexes from a previous session.
-
Parallel Flush Pipeline
LevelDB uses a single background thread for ushing, which means that if the working MemTable lls while the Immutable MemTable is still being written, user writes are blocked. Wang et al. [14] addressed this by increasing the number of Immutable MemTables and ushing them concur- rently to separate SSD channels. KiwiDB adopts a similar insight at the software level multiple SSTables are written concurrently but must solve an additional problem: commit ordering.
TABLE VI
Bloom filter sizing comparison.
Engine Size Parameter FPR Cong
LevelDB Fixed bits/key (default 10) Fixed at creation RocksDB Fixed bits/key (congurable) Per column family KiwiDB Actual record count Per-env: 5% dev, 1% prod
TABLE VII
Block cache tier configuration.
Tier Key Capacity Eviction
Data blocks offset 0 256 entries First (lowest priority) Sparse indexes offset = 2 64 entries Medium Bloom lters offset = 1 64 entries Last (highest priority)
On a multi-channel SSD, each SSTable lands on a separate channel and ordering is implicit. On a standard lesystem, KiwiDB must explicitly ensure that L0 les appear in oldest- rst ordr, because the read path depends on this ordering. We solve this with an event-chain commit mechanism.
-
Event-Chain Mechanism: Figure 7 illustrates the mech-
anism with two workers. Each ush slot carries two
asyncio.Event objects: prev_committed (set by pre- decessor) and my_committed (set by this slot).
Phase 1 (parallel): Up to flush_max_workers (de- fault: 2) SSTable writes execute concurrently, bounded by asyncio.Semaphore. Each write targets a separate direc- tory.
Phase 2 (ordered): After its write completes, each worker blocks on prev_committed.wait() before committing. The commit atomically: (1) registers the new SSTable reader,
(2) pops the snapshot from the immutable queue, (3) truncates the WAL. Then it sets my_committed to unblock the next slot.
-
Correctness Invariant: The core invariant is: at every point in time, every durable key must be ndable by get()
either in the immutable queue or in the SSTable registry,
never in neither.
-
Crash Safety: Table VIII shows that data is recoverable at every crash point.
-
Failure Propagation: If any slot fails during Phase 1:
(1) a batch_abort event is set, causing downstream slots to skip commits; (2) my_committed is set in the finally block, preventing chain deadlock; (3) the failed snapshot stays in the queue for retry; (4) the WAL is not truncated.
-
-
Hybrid Compaction
KiwiDB uses a hybrid strategy: L0 tiering with L1+ level- ing. Figure 8 illustrates the level model.
Trigger mechanism. Compaction is ush-triggered, not daemon-driven. After every ush commit, CompactionManager.check_and_compact() runs. Table IX shows the thresholds.
K-way merge. The KWayMergeIterator merges N
sorted inputs via a min-heap keyed on (key, -seq). For
Slot 0
parallel writes ordered commits
prev = pre-set (no wait)
L0
compaction
L1
overlapping
az (merged, deduplicated)
az (larger, older data)
az (oldest data)
dg
ac
bf
ad
1 le
Slot 1
write snap0 to disk commit
prev = Event A Event A
write snap1 wait commit
Event B
time
L2
L3
1 le
1 le
Fig. 7. Parallel ush with event-chain ordering. Writes execute concurrently (bounded by semaphore). Commits are serialized: Slot 1 waits on Event_A from Slot 0 before committing. Slot 1 nishes writing 200 ms early but waits
total time is max(write times) + commit overhead.
TABLE VIII
Crash safety analysis of the flush pipeline.
Fig. 8. KiwiDB level model. L0 uses tiering (multiple overlapping les). L1L3 use leveling (one merged le per level). Compaction merges all L0 les into L1 when le count threshold.
Transition
Trigger
Default
L0 L1
File count threshold
10 les
L1 L2
Size 102× base
Cascading
L2 L3
Size 103× base
Cascading
TABLE IX Compaction trigger thresholds.
Crash Point Recovery
During write Snapshot in queue. Incomplete SSTable (no
meta.json) ignored. WAL replays all.
After register, before pop SSTable readable; snapshot redundant but
safe. WAL deduplicates by seq.
After pop, before WAL truncate SSTable has data. WAL replayed and dedu-
plicated. 3
After WAL truncate Clean state. 4
“levels”: {“1”: “file_c”, “2”: “file_d”}
}
Listing 1. Manifest structure.
equal keys, the highest seq (newest) wins. This enables single- pass deduplication. Tombstones with seq < seq_cutoff are garbage-collected; those above the cutoff are preserved to shadow values in deeper levels.
Subprocess execution. Each compaction job is a CompactionTask a frozen dataclass with only prim- itive elds (strings, integers) that can cross the subprocess boundary. The merge runs in a ProcessPoolExecutor subprocess with its own GIL. As shown in Table I, this is the only way to achieve true parallelism for CPU-bound work in CPython.
Atomic commit. After the subprocess produces the new SSTable, the commit sequence is: (1) register new reader,
(2) write manifest, (3) mark old les for deletion (deferred by ref-count), (4) update in-memory state, (5) evict stale cache. Write locks on source and destination levels prevent reads from seeing inconsistent state.
Write amplication analysis. Table X compares the write amplication of the three strategies.
KiwiDBs L0 tiering eliminates the merge at ush a pure sequential append. The trade-off is checking up to T bloom lters at L0 per read. With 1% FPR: 10 × 0.01 + 3 × 0.01 =
0.13 expected false positives per lookup, versus LevelDBs 3 × 0.01 = 0.03. This 34× read cost increase is acceptable given the 10× write improvement.
-
SSTable Manager and Manifest
{
“l0_order”: [“file_a”, “file_b”],
The SSTableManager coordinates all on-disk state via a persistent JSON manifest:
1
2
The l0_order list stores le IDs in newest-rst order. Per-level AsyncRWLock instances serialize compaction com- mits while allowing concurrent readers. A reference-counted registry prevents reader closure while in-ight reads are ac- tive. The manifest is written atomically via tempfile + os.replace().
-
Recovery Model
On startup, KiwiDB reconstructs state in six steps:
-
Load cong and open WAL le.
-
Load all SSTables from disk via manifest. Directories without meta.json are discarded (incomplete writes).
-
Compute max_seq across all loaded SSTables.
-
Replay WAL entries where seq > max_seq into a fresh memtable.
-
Restore SeqGenerator to max_seq.
-
Start ush pipeline and compaction manager.
Entries already ushed to SSTables are skipped. The WAL is the source of truth for unushed data.
-
-
Observability Layer
Figure 9 shows the observability architecture.
Structured logging. All engine events are logged via structlog with key-value elds (seq, leid, level, size). Logs go to both a rotating le and a TCP broadcast server.
REST API. FastAPI exposes 20+ endpoints: KV operations (GET/POST/DELETE /kv/{key}), memtable inspection, SSTable browsing by level, engine statistics with history, conguration management, and a terminal interface.
Web dashboard. A React SPA (Vite + TypeScript) provides real-time views of memtable ll level, SSTable layout, com- paction progress, live log stream, and an in-browser REPL.
TABLE X
Write amplification comparison at flush time. T = size ratio
TABLE XI
Key runtime-mutable configuration parameters.
(DEFAULT 10), L = RECORD SIZE, B = BLOCK SIZE, = BLOOM FPR.
Parameter Dev Prod Purpose
Strategy Write amp. (ush) Read amp. (bloom checks)
LevelDB (leveling) O(T × L/B) L × = 0.04
KiwiDB (hybrid) O(L/B) T + L = 0.13
max_memtable_entries 10 Freeze (count) max_memtable_size_mb 64 Freeze (bytes) l0_compact_threshold 10 10 L0 le limit
|
Improvement |
10× |
3.25× worse |
bloom_fpr flush_max_workers |
2 |
2 |
Parallel ushes |
|
|
block_size |
4096 |
4096 |
SSTable block (B) |
||||
|
max_levels |
3 |
3 |
Compaction depth |
||||
|
LSMEngine |
FastAPI |
0.05 0.01 Bloom FPR
D. Backpressure
structlog
REST API
When the immutable queue reaches capacity (default: 4),
Log le TCP server
WebSocket React Dashboard
writes block via threading.Event with a 60-second time- out. If ush is stuck, a FreezeBackpressureTimeout exception is raised rather than allowing unbounded memory growth.
Fig. 9. Observability architecture. The engine emits structured logs to le and a TCP broadcast server. FastAPI exposes a REST API. The React dashboard consumes both REST endpoints and live WebSocket log streams.
Live streaming. The TCP log server relays events to WebSocket clients. Updates arrive without polling.
-
Implementation-Specific Details
In this section we discuss techniques specic to implement- ing an LSM engine in Python 3.12+.
-
Async/Sync Bridge
The engine uses three bridge patterns:
-
Direct sync: Microsecond operations (seq generation, memtable insert) execute synchronously inside an async def method.
-
asyncio.to_thread(): Blocking I/O (WAL fsync, mmap reads) is ofoaded to the thread pool.
-
ProcessPoolExecutor: CPU-bound compaction runs in a subprocess.
-
-
Sync-to-Async Signal Bridging
When a sync write triggers a memtable freeze, the async ush pipeline must wake immediately. KiwiDB bridges this via loop.call_soon_threadsafe(): the sync code schedules an asyncio.Event.set() on the event loop, waking the pipeline on the next iteration. No polling; sub- millisecond latency.
-
Writer-Preferred AsyncRWLock
KiwiDB implements a custom AsyncRWLock with writer preference: once a writer is waiting, new readers block. This prevents writer starvation critical for compaction commits. The lock tracks three counters: active readers, active writers (0 or 1), and waiting writers.
E. Runtime-Mutable Conguration
All parameters are stored in a thread-safe Config object backed by a JSON le. Values are read at point of use (not cached), so changes take effect on the next operation. Writes use atomic os.replace(). Table XI lists key parameters.
F. Type Safety and Formal Contracts
Two independent type checkers run in strict mode on every CI check: mypy (disallows implicit Any) and basedpyright (additional ow analysis). Formal contracts are dened via ABCs (StorageEngine, Serializable, MemTable), Protocols (structural subtyping), and .pyi stub les for the public API.
-
-
Related Work
LSM tree foundations. The LSM tree was proposed by ONeil et al. [1] as a write-optimized alternative to B-trees. Luo and Carey [13] formalized the design space across tiering, leveling, size ratio, and merge policy.
Production implementations. LevelDB [2] is the canonical LSM implementation with pure leveling, a single compaction thread, and a global-mutex skip list. RocksDB [3] extends it with concurrent compaction, column families, and cong- urable strategies. WiredTiger [8] combines B-tree and LSM approaches. All expose synchronous APIs.
LSM on ash storage. Wang et al. [14] proposed LOCS, which extends LevelDB to exploit the channel-level paral- lelism of open-channel SSDs by increasing the number of Immutable MemTables and ushing them concurrently to sep- arate channels. Their work demonstrates up to 4× throughput improvement with optimized I/O scheduling. KiwiDB shares the insight that concurrent ush improves throughput, but achieves it at the software level with an event-chain commit ordering mechanism on commodity storage.
Write amplication. Dostoevsky [11] introduces lazy lev- eling as a hybrid between tiering and leveling. KiwiDBs L0- tiering / L1-leveling achieves a similar reduction through a simpler design.
Bloom lter optimization. Monkey [10] optimizes bloom lter memory across levels. KiwiDB instead sizes each lter from actual data, producing optimal lters without cross-level tuning.
Concurrent data structures. KiwiDBs skip list draws on Herlihy et al. [12], adapted for CPythons GIL: per-node locks for writers, GIL-atomic ag reads for lock-free readers.
-
Conclusion
We presented KiwiDB, a complete LSM tree key-value store built from scratch in Python 3.12+. By implementing the engine with an async-native API, parallel ush pipeline with event-chain commit ordering, and subprocess compaction for GIL bypass, we demonstrate that a modern storage engine can be built from rst principles in a high-level language while incorporating design innovations that address real limitations in LevelDB and RocksDB.
The hybrid L0-tiering / L1-leveling strategy reduces write amplication by 10× at the cost of 3.25× more bloom lter checks an acceptable trade-off. The concurrent skip list achieves lock-free reads via GIL-atomic ag checks, and the three-tier block cache keeps hot bloom lters and indexes resident.
Limitations. KiwiDB trades feature breadth for depth: no range queries (though the internal merge iterator could support them), no block compression, no multi-key transactions, and 10100× lower raw throughput than C++ engines.
Future work. Natural extensions include: (1) range query API exposing the merge iterator; (2) LZ4/zstd block compres- sion; (3) WAL group commit for batched fsync; (4) cross-level bloom optimization following Monkey [10]; (5) migration to GIL-free Python (PEP 703) to replace subprocess compaction with true multi-threaded parallelism.
References
-
P. ONeil, E. Cheng, D. Gawlick, and E. ONeil, The log-structured merge-tree (LSM-tree), Acta Informatica, vol. 33, no. 4, pp. 351385, 1996.
-
Google, LevelDB: A fast and lightweight key/value database library, 2011. [Online]. Available: https://github.com/google/leveldb
-
Facebook, RocksDB: A persistent key-value store for fast storage environments, 2013. [Online]. Available: https://rocksdb.org
-
A. Lakshman and P. Malik, Cassandra: A decentralized structured storage system, ACM SIGOPS Oper. Syst. Rev., vol. 44, no. 2, pp. 35 40, 2010.
-
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Bur- rows, T. Chandra, A. Fikes, and R. E. Gruber, Bigtable: A distributed storage system for structured data, ACM Trans. Comput. Syst., vol. 26, no. 2, pp. 4:14:26, 2008.
-
G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman,
A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, Dynamo: Amazons highly available key-value store, in Proc. ACM SOSP, 2007,
pp. 205220.
-
Apache, HBase, [Online]. Available: https://hbase.apache.org
-
MongoDB, WiredTiger storage engine, [Online]. Available: https:// source.wiredtiger.com
-
B. H. Bloom, Space/time trade-offs in hash coding with allowable errors, Commun. ACM, vol. 13, no. 7, pp. 422426, 1970.
-
N. Dayan, M. Athanassoulis, and S. Idreos, Monkey: Optimal navigable key-value store, in Proc. ACM SIGMOD, 2017, pp. 7994.
-
N. Dayan and S. Idreos, Dostoevsky: Better space-time trade-offs for LSM-tree based key-value stores via adaptive removal of superuous merging in Proc. ACM SIGMOD, 2018, pp. 505520.
-
M. Herlihy, Y. Lev, V. Luchangco, and N. Shavit, A provably correct scalable concurrent skip list, in Proc. OPODIS, 2006.
-
C. Luo and M. J. Carey, LSM-based storage techniques: A survey,
The VLDB Journal, vol. 29, pp. 393418, 2020.
-
P. Wang, G. Sun, S. Jiang, J. Ouyang, S. Lin, C. Zhang, and J. Cong, An efcient design and implementation of LSM-tree based key-value store on open-channel SSD, in Proc. ACM EuroSys, 2014, pp. 16:116:14.
-
N. Leavitt, Will NoSQL databases live up to their promise? Computer, vol. 43, no. 2, pp. 1214, 2010.
-
Python Software Foundation, GlobalInterpreterLock, [Online]. Avail- able: https://wiki.python.org/moin/GlobalInterpreterLock
