Journal of Systems Architecture 160 (2025) 103342
Contents lists available at ScienceDirect
Journal of Systems Architecture
journal homepage: www.elsevier.com/locate/sysarc

ProckStore: An NDP-empowered key-value store with asynchronous and multi-threaded compaction scheme for optimized performance✩

Hui Sun a,∗, Chao Zhao a, Yinliang Yue b, Xiao Qin c
a Anhui University, Jiulong Road 111, Hefei, 230601, Anhui, China
b Zhongguancun Laboratory, Cuihu North Road 2, Beijing, 100094, China
c Auburn University, The Quad Center Auburn, Auburn, 36849, AL, USA
ARTICLE INFO

Keywords:
Near-data processing (NDP)
LSM-tree
Asynchronous multi-threaded compaction
Write amplification
Key-value separation

ABSTRACT

With the exponential growth of large-scale unstructured data, LSM-tree-based key-value (KV) stores have become increasingly prevalent in storage systems. However, KV stores face challenges during compaction, particularly when merging and reorganizing SSTables, which leads to high I/O bandwidth consumption and performance degradation due to frequent data migration. Near-data processing (NDP) techniques, which integrate computational units within storage devices, alleviate the data movement bottleneck to the CPU. The NDP framework is a promising solution to address the compaction challenges in KV stores. In this paper, we propose ProckStore, an NDP-enhanced KV store that employs an asynchronous and multi-threaded compaction scheme. ProckStore incorporates a multi-threaded model with a four-level priority scheduling mechanism covering the compaction stages of triggering, selection, execution, and distribution, thereby minimizing task interference and optimizing scheduling efficiency. To reduce write amplification, ProckStore utilizes a triple-level filtering compaction strategy that minimizes unnecessary writes. Additionally, ProckStore adopts a key-value separation approach to reduce data transmission overhead during host-side compaction. Implemented as an extension of RocksDB on an NDP platform, ProckStore demonstrates significant performance improvements in practical applications. Experimental results indicate a 1.6× throughput increase over the single-threaded and asynchronous model and a 4.2× improvement compared with synchronous schemes.
✩ This work is supported in part by the National Natural Science Foundation of China under Grants 62472002 and 62072001. Xiao Qin's work is supported by the U.S. National Science Foundation (Grants IIS-1618669 and OAC-1642133), the National Aeronautics and Space Administration, United States (Grant 80NSSC20M0044), the National Highway Traffic Safety Administration, United States (Grant 451861-19158), and Wright Media, LLC (Grants 240250 and 240311).
∗ Corresponding author.
E-mail addresses: sunhui@ahu.edu.cn (H. Sun), chaozh@stu.ahu.edu.cn (C. Zhao), yylhust@qq.com (Y. Yue), xqin@auburn.edu (X. Qin).
https://doi.org/10.1016/j.sysarc.2025.103342
Received 31 October 2024; Received in revised form 30 December 2024; Accepted 11 January 2025
Available online 24 January 2025
1383-7621/© 2025 Published by Elsevier B.V.

1. Introduction

The rapid development of large language models [1], graph databases [2], and social networks [3] has led to the real-time generation of large amounts of data, contributing to a global surge in large-scale data. This data is growing exponentially and is increasingly manifested in semi-structured and unstructured formats, in addition to traditional structured data. For example, semi-structured and unstructured data have grown in recent years according to IDC [4], and they now account for over 85% of total data volume. To cope with the large amount of unstructured data, LSM-tree-based key-value stores (KV stores) [5] have become widely adopted in large-scale storage systems.

LSM-tree structures are popularly used in modern database engines (e.g., LevelDB [6], RocksDB [7]). In the LSM-tree structure, key-value pairs are first written to an immutable MemTable in memory and then persisted to disk as Sorted String Tables (SSTables) once a preset threshold is reached. On disk, the LSM-tree is organized hierarchically, with each level having a capacity threshold that increases at a fixed rate as the level number grows. When the amount of data in a level exceeds its threshold, some data migrates to lower levels, potentially causing overlapping key ranges between SSTables in different levels. To maintain data organization and prevent duplication, SSTables with overlapping key ranges must be loaded into memory and merged. The sorted and de-duplicated key-value pairs are then rewritten as new SSTables at a lower level. This process, known as compaction, involves frequent read and write operations that consume a lot of I/O bandwidth between the host and storage devices, thereby delaying foreground requests and degrading system performance.

GPUs, DPUs, and FPGAs. General-purpose graphics processing unit (GPGPU), data processing unit (DPU), and field-programmable gate array (FPGA) offer additional computational resources to address compaction performance challenges. Near-Data Processing (NDP), introduced in the late 1990s as the smart disk [8], has regained attention as an emerging computational paradigm. The enhanced computational power within storage devices has fueled interest in NDP. NDP mitigates the overhead of data movement by reducing transfers to the CPU. The NDP paradigm advocates for computation close to data as an alternative to the computation-centered approach in large-scale systems. This model enables storage devices to use their internal bus for data processing rather than transfer data to the host, where the results would otherwise be computed. Most existing NDP-empowered KV stores, such as Co-KV [9] and TStore [10], tackle compaction tasks using a synchronization-based approach. These works split the compaction task, leveraging either an averaging or a dynamic time-aware strategy. In the synchronization model, the host and the device cannot complete tasks simultaneously, leading to long waiting times and inefficient resource usage. PStore [11] addresses waiting time by using an asynchronous model but fails to fully exploit the benefits of this approach due to its single-threaded processing.
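As a concrete illustration of the compaction just described, the following sketch models a level's capacity check and the merge of overlapping sorted runs. This is a minimal Python illustration, not code from ProckStore or RocksDB; the base capacity and growth multiple are assumed RocksDB-style defaults, and all names are hypothetical.

```python
import heapq

# Assumed illustrative parameters (not values from the paper).
BASE_CAPACITY = 10 * 1024 * 1024   # capacity threshold of level 1
GROWTH = 10                        # fixed per-level growth multiple

def level_capacity(level):
    """Capacity threshold grows at a fixed rate with the level number."""
    return BASE_CAPACITY * GROWTH ** (level - 1)

def needs_compaction(level, level_bytes):
    """When a level's data exceeds its threshold, data must migrate down."""
    return level_bytes > level_capacity(level)

def compact(sstables):
    """Merge sorted SSTables (ordered newest-first) into one sorted,
    de-duplicated run; on duplicate keys the newest entry wins."""
    heap = [(key, age, value)
            for age, table in enumerate(sstables)
            for key, value in table]
    heapq.heapify(heap)
    merged, last_key = [], None
    while heap:
        key, _, value = heapq.heappop(heap)
        if key != last_key:  # first pop per key comes from the newest table
            merged.append((key, value))
            last_key = key
    return merged

print(needs_compaction(2, 200 * 1024 * 1024))      # True: 200 MB > 100 MB
print(compact([[("a", 1), ("b", 2)], [("b", 9), ("c", 3)]]))
# [('a', 1), ('b', 2), ('c', 3)]
```

The de-duplicated output is exactly the rewrite that costs host I/O bandwidth when performed across the storage interface, which is what motivates offloading it to the device.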
To address these issues, we designed an asynchronous NDP scheme, ProckStore, which utilizes a multi-threaded strategy to perform compaction tasks concurrently. All compaction tasks are managed in a thread pool and scheduled using multiple threads, exploiting the benefits of asynchronous processing, where tasks do not interfere with one another. The tasks are executed independently by individual threads. A four-level priority scheduling mechanism is implemented to ensure efficient scheduling of compaction tasks within the thread pool, following the four stages of the compaction process. To address the write amplification issue, a triple-level filtering compaction method is employed, reducing unnecessary writes and alleviating write amplification during compaction on the host side. Furthermore, the transmission process in the NDP architecture and its compaction module is optimized by utilizing a key-value separation technique, minimizing transmission time by sending only the keys to the host. The contributions of this work are summarized as follows:

Fig. 1. The structure of LSM-tree and RocksDB. The LSM-tree is composed of components 𝐶0, 𝐶1, ..., and 𝐶𝑛.
▴ We designed ProckStore using an asynchronous and multi-threaded architecture. Compaction tasks are executed independently within the thread pool, fully leveraging the asynchronous model. This approach significantly enhances write performance compared to the synchronous model and the single-threaded asynchronous scheme.
▴ ProckStore employs four-level priority scheduling to manage the compaction process, which consists of four stages: compaction trigger, compaction picking, compaction execution, and compaction distribution. This scheduling prioritizes tasks at different stages, ensuring optimal efficiency during asynchronous and multi-threaded compaction.
▴ To optimize performance in the NDP transmission architecture, we implemented key-value separation in the host-side compaction, reducing data transmission overhead. The device-side compaction module employs a cross-level compaction technique to alleviate computational load, thereby improving transmission efficiency and overall system throughput.
▴ ProckStore, an extension of RocksDB on the NDP platform, was evaluated using DB_Bench and YCSB-C. Experimental results demonstrate that ProckStore increases throughput by a factor of 1.6× compared to the single-threaded asynchronous PStore, and achieves a 4.2× throughput improvement over the synchronous TStore.

The paper is organized as follows. Section 2 presents the background and motivation of ProckStore. Section 3 presents a system overview of ProckStore and information on each module. Section 4 lists the hardware and software configurations used in the experiments. Section 5 demonstrates the performance of ProckStore through extensive experiments. Section 6 elaborates on the extended experiments. Section 7 summarizes related work. Finally, we conclude our work in Section 8.

2. Background and motivation

2.1. Background

RocksDB is an LSM-tree-based key-value store developed by Facebook, and it is widely used in Facebook's storage systems to achieve high throughput. In RocksDB, the MemTable and Immutable MemTable are stored in memory, while the Sorted String Table (SSTable) is stored on disk. As shown in Fig. 1, key-value pairs from the application are first written to the commit log and then cached in a sorted data structure called the MemTable, which has a limited size (e.g., 4 MB) in memory. Once the MemTable reaches its predefined capacity, it is converted into an Immutable MemTable. A background thread then writes the Immutable MemTable to disk as a sorted string table (SSTable). On disk, SSTables are organized in levels, with each level growing by a fixed multiple.

In Fig. 1(a), the hierarchy of the LSM-tree represents different components, such as 𝐶0, 𝐶1, ..., and 𝐶𝑛. Component 𝐶0 resides in memory. New write data is first written into the sequential log file and then inserted as an entry placed in 𝐶0. However, the high cost of the memory capacity that accommodates 𝐶0 imposes a limit on the size of 𝐶0. To migrate entries to a component on the disk, the LSM-tree performs a merge operation when the size of 𝐶0 reaches the threshold, taking some contiguous segment of entries from 𝐶0 and merging it into a component on the disk. Component 𝐶𝑛 (𝑛 ≥ 1) resides on the disk in the LSM-tree. Although 𝐶1 is disk-resident, the frequently accessed page nodes in 𝐶1 remain in the memory buffer. 𝐶1 has a directory structure like a B-tree but is optimized for sequential access on the disk. The in-memory 𝐶0 serves high-speed writes, and 𝐶𝑛 (𝑛 ≥ 1) on the disk is responsible for persistence and batch-sequential writes. Through the hierarchical and merging strategies, the LSM-tree achieves a balance between write optimization and high-efficiency queries.

In Fig. 1(b), the most recently generated SSTable is placed in the lowest level, 𝐿0. SSTables in level 𝐿0 can have overlapping key ranges, while higher levels are organized by key ranges. Each level has a size threshold for its total SSTables. When this threshold is exceeded, the KV store migrates SSTables from level 𝐿𝑘 to level 𝐿𝑘+1 during compaction. The compaction process selects SSTables from level 𝐿𝑘 and searches for overlapping key ranges in level 𝐿𝑘+1. A merge operation is then performed on the SSTables with overlapping key ranges to produce new SSTables, which are stored in level 𝐿𝑘+1. Obsolete SSTables in level 𝐿𝑘+1 are deleted from the disk. This compaction process incurs computational and storage overhead, which negatively impacts response time and throughput, a significant drawback of the LSM-tree.

Graphical computing [2], machine learning [12], and large language models [1] demand substantial data for model training and inference. As data volumes increase, the overhead associated with transferring data from storage devices to the CPU for computation rises, leading to resource consumption and performance bottlenecks between storage and memory in high-performance systems. Traditional storage architectures struggle to meet the demands of data-intensive applications under these conditions. NDP mitigates this challenge by fully utilizing the device's internal bandwidth. By incorporating embedded computing units, storage devices can perform computational tasks, offloading these operations from the host and eliminating the overhead of moving large data volumes. The results can then be retrieved from the storage device, reducing the need for additional data movement. Furthermore, the KV store can leverage NDP to perform compaction tasks internally, improving compaction efficiency.

Fig. 2. The results of PStore with different numbers of threads (1, 4, 8, and 12) under the Fillrandom DB_Bench workload with various data volumes.

Fig. 3. The results of PStore with different numbers of threads (1, 4, 8, and 12) under the Fillrandom DB_Bench workload with various value sizes.

2.2. Motivation

Most existing studies focus on compaction processing in a single-threaded context. For instance, Co-KV and TStore process compaction tasks synchronously and in a single-threaded mode. PStore, on the other hand, demonstrates its effectiveness in an asynchronous and single-threaded setting. Notably, the asynchronous approach allows compaction tasks to be performed independently; however, it is difficult to fully leverage the benefits of asynchronous processing in a single-threaded environment. Therefore, we investigate the performance of PStore using different thread configurations. Fig. 2(a) presents the throughput of PStore under workloads with a 4-KB value and various data volumes. We can draw two key observations.

▵ As the number of threads increases, the throughput of PStore does not grow exponentially as expected. The throughput improvement is minimal during multi-threaded writes, particularly when the number of threads is 12.

▵ With a large number of threads, the throughput of PStore increases slowly. Under 20-GB workloads, when the thread count increases from 8 to 12, the throughput only increases by 0.12 MB/s.

These findings indicate that the asynchronous compaction advantages of PStore in single-threaded mode are insufficient to handle the large volume of multi-threaded writes. As a result, increasing the number of threads does not enhance throughput, particularly as the thread count becomes large. While the asynchronous approach in PStore takes into account the computational imbalance between the host and the NDP device, it fails to implement an appropriate asynchronous compaction method. The limitations of the single-threaded mode hinder the full potential of the asynchronous compaction mechanism in the KV store.

As shown in Fig. 2(b), the average latency decreases under workloads with various data volumes, but this reduction is most pronounced when using a small number of threads. Specifically, the most significant decrease occurs between 1 and 4 threads, where the average latency is reduced by 27.8%. Additionally, the CPU utilization on the host supports these observations (see Fig. 2(c)), with a 34% increment at 12 threads over 1 thread under 10-GB workloads. The CPU utilization exhibits a 19% increment as the number of threads grows from 1 to 4. The result reveals that PStore is suitable for single- or few-threaded workloads, and it is challenging for it to adapt to multi-threaded applications.
Fig. 4. System overview of ProckStore. 𝑄1 is the first compaction-task queue. Sub 𝑖 (0 < 𝑖 < 𝑛 + 1) represents the 𝑖th sub-compaction task of task 1. 𝐾𝑖 and 𝑉𝑖 denote the 𝑖th key and value, respectively.
We conducted an experiment to study the impact of multiple threads on performance under workloads with 4- and 64-KB value sizes. As shown in Fig. 3, we observe findings similar to those under workloads with various data volumes. The throughput of PStore improves under large value sizes. When increasing the number of threads to 4 under workloads with a 64-KB value, the throughput of PStore is 1.65 MB/s. Furthermore, this metric increases to 1.83 MB/s in the case of 12 threads, an increment of only 11% (see Fig. 3(a)). In Fig. 3(b), the average latency decreases when the number of threads increases to 4. The average latency of 4 threads decreases by 24.8% compared to one thread. The degree of decrease becomes small as the number of threads increases further. The host-side CPU utilization becomes smoother in Fig. 3(c), but the improvement is still most pronounced when there are a limited number of threads.

Thus, we plan to use a multi-threaded approach to implement the asynchronous compaction mechanism in the KV store. We have redesigned the asynchronous compaction mechanism extended from RocksDB, and fully leveraged its internal multi-threaded capability to develop the asynchronous compaction solution ProckStore.

3. Design of ProckStore

3.1. System overview

In this paper, we propose ProckStore, an NDP-empowered KV store that incorporates an asynchronous and multi-threaded compaction scheme. ProckStore consists of a host-side subsystem, an NDP device, and a communication module that connects the two sides, as shown in Fig. 4. The host-side subsystem manages I/O requests, while the NDP device, which serves as a storage unit, extends computational resources to process tasks offloaded from the host. The NDP device stores persistent data, with read and write operations akin to those of standard storage devices. We implement various modules on both the host and the NDP device to accommodate task-offloading requirements. Data is stored as SSTables in a leveled structure on the NDP device. SSTables are transferred either to the host for compaction or to the NDP device for compaction via the communication channel. During transmission, the SSTables pass through a key-value separation module, ensuring that only the keys of the KV pairs are sent to the host for the merge operation. The data flow occurs between the NDP device, the compaction module on the host side, and the compaction module on the device.

As shown in Fig. 4, we illustrate the data flow between the NDP device, the host-side compaction module, and the device-side compaction module. Initially, data is written from the host side (see the host data stream in Fig. 4), and multiple compaction tasks accumulate in the thread pool. These tasks are allocated to the host and device by the asynchronous manager (see the host task stream and device task stream in Fig. 4). After completing the compaction tasks, the data is written to the flash memory inside the NDP device through the transmission module (see the device data stream in Fig. 4). The dark blue line represents the host-side data flow, where ProckStore transfers data from flash memory to the host for compaction tasks. The four-level priority scheduling module manages the thread pool and the host-side compaction module, and controls the asynchronous manager (see the control stream in Fig. 4). When a compaction task is generated, the four-level priority scheduling module places it in the compaction queue of the thread pool. It then determines whether the task should be executed on the host or the device and notifies the asynchronous manager to allocate the task. When the host executes a compaction task, the scheduling module issues instructions to the host-side compaction module to execute it.

We present the asynchronous compaction mechanism with the multi-level task queue module in Section 3.2, where compaction tasks are kept in the thread pool. Section 3.3 presents the four-level priority scheduling module, which controls the priority scheduling in the compaction process. The triple-level filtering compaction module is described in Section 3.4. We describe the transport mechanism on the NDP device and the cross-level compaction module in Section 3.5.

The host-side asynchronous compaction management module allocates compaction tasks to the device side. A multi-level queue stores the tasks awaiting execution and calculates the computational capabilities of both the host and the device. These tasks are then scheduled to the compaction modules on the host and device sides. The host-side compaction module executes the tasks, while the device side must transmit compaction information via the semantic management module, which facilitates communication between the host and the device. The processed information is sent to the device-side compaction module for task execution. The four-level priority scheduling module manages the entire process, from task triggering to execution. Data and commands are transmitted between the host and the device through the semantic management module. The NDP device encodes (decodes) interacting data, storing the SSTable based on key-range granularity, performing garbage collection, maintaining information, and executing compaction tasks on the files.

Fig. 5. Multilevel Task Queue in ProckStore.

3.2. Asynchronous mechanisms

To implement the asynchronous strategy, we decouple the two phases, compaction triggering and execution, to establish conditions for asynchronous compaction. In contrast, the synchronous approach treats the task from compaction trigger to completion as a single process. In the asynchronous mechanism, compaction tasks are continuously generated when the conditions for triggering compaction are met. These tasks, generated during the trigger phase, must be executed. To manage them efficiently, we propose a multi-level task queue that stores compaction tasks uniformly and waits for the asynchronous manager to schedule them. To align with the structure of the LSM-tree, a compaction task queue is assigned to each level, with tasks generated during the trigger phase placed into the task queue of that level, awaiting scheduling. As shown in Fig. 5, ProckStore employs a multi-task queue for each column family. The multi-task queue selects compaction tasks at each level based on a score value.

We implement multi-level task queues in a thread pool. Tasks in each queue are sorted in ascending order by the number of SSTables. A heap sorting algorithm is used in each task queue to ensure sorting occurs in time complexity 𝑂(𝑛 log 𝑛). The task queue is a double-ended queue, allowing compaction tasks to be allocated from both ends to the host and device sides. Since multiple tasks are scheduled in the queue, a thread pool is used to manage the pending tasks in the compaction queue.

3.3. Four-level priority scheduling

An asynchronous mechanism-based compaction procedure separates the two phases: compaction triggering and execution. To achieve this, we propose a four-level priority scheduling strategy that assigns priority levels to the four steps involved: triggering compaction tasks, generating tasks, allocating tasks, and executing tasks. This strategy ensures efficient execution of the asynchronous compaction process.

In the case of a multi-level queue, it is important to prevent the queue from becoming starved of compaction tasks. On both the host and device sides, one side may pause compaction while waiting for new task allocations upon completion of the previous allocation. Consequently, the triggering conditions must be adjusted to trigger more frequently, ensuring a sufficient number of compaction tasks are available in the task queue. Additionally, different priorities must be set for scheduling various tasks. ProckStore assigns a score to each priority level, as shown in Fig. 6.

First-level priority: the priority of the compaction trigger. In the stage of triggering a compaction task, the goal of ProckStore is to select the level most urgently in need of compaction. ProckStore sets the first_score to realize the prioritization of the compaction triggering stage. Due to the structure of the LSM-tree, the score in level 𝐿0 is calculated as the ratio of the number of SSTables to the threshold value of level 𝐿0, whereas the score in the other levels is calculated as the ratio of the total size of SSTables to the threshold value of the level. Thus, the calculation of first_score is also divided into level 𝐿0 and the other levels, as in the following equation:

first_score = (N_sst − N_no_comp − N_being_comp) / N_max,  level 𝑖, 𝑖 = 0
first_score = (S_sst − S_no_comp − S_being_comp) / S_max,  level 𝑖, 𝑖 > 0    (1)

where N_sst and N_max denote the number of SSTables in level 𝐿0 and the threshold on the number of SSTables in level 𝐿0, respectively. N_no_comp and N_being_comp denote the number of SSTables in level 𝐿0 contained in compaction tasks that have been picked into the task queue to be executed, and the number of SSTables in level 𝐿0 contained in compaction tasks currently being executed, respectively. S_sst and S_max denote the total data volume of SSTables in level 𝑖 and the threshold on the data volume of SSTables in level 𝑖, respectively. Correspondingly, S_no_comp and S_being_comp denote the data volume of SSTables included in the compaction tasks picked into the task queue of level 𝑖 to be executed, and the data volume of SSTables included in the compaction task currently being executed. Different from the RocksDB score, the SSTables that have been picked into the compaction queue and the SSTables involved in running compaction tasks are subtracted, excluding SSTables that will soon leave the level, which makes the calculation of the first_score more accurate.

When a level triggers a compaction, the generated compaction task is put into the corresponding task queue of the level, where it waits for the compaction module to process it (see Fig. 6). In the asynchronous trigger mode, the compaction task is not executed immediately; the asynchronous compaction manager waits for it to be scheduled. The device side triggers the compaction task according to the first_score of each level (selecting the maximum value) and puts it into the task queue. This provides the basis for prioritizing compaction triggering and execution between levels, which is the first-level prioritization.

A second level of prioritization is the prioritization of SSTable picking. In the compaction task generation phase, we select some SSTables in the level and all the overlapping SSTables from the following level. These SSTables are merged in the compaction operation. ProckStore puts the SSTables to be compacted into the compaction_queue and then reads the first file's information from the queue. The SSTables in the queue undergo compaction operations sequentially, without considering hot and cold data or the size of the compaction task. Thus, the number of SSTables overlapping with the lower level is added to the FileMetaData of each SSTable. The second_score is set to the number of overlapping SSTables, and the meta-information is sorted in ascending order within each level by the size of the second_score, as in Eq. (2):

second_score = Overlap_sst, for each SSTable    (2)
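Under the definitions above, the trigger-stage and picking-stage scores can be sketched as follows. This is a simplified Python rendering of Eqs. (1) and (2), not ProckStore's implementation; the variable names mirror the equations and the example thresholds are illustrative.

```python
def first_score(amount, no_comp, being_comp, threshold):
    """Eq. (1): for L0 the arguments are SSTable counts; for deeper
    levels they are data volumes. SSTables already claimed by queued
    (no_comp) or running (being_comp) compactions are subtracted so
    the score reflects only data still resident in the level."""
    return (amount - no_comp - being_comp) / threshold

def pick_order(sstables):
    """Eq. (2): second_score is the number of SSTables a file overlaps
    in the next level; picking in ascending order keeps each merge
    as small as possible."""
    return sorted(sstables, key=lambda f: f["overlap_sst"])

# L0 holds 6 SSTables, 2 already queued and 1 compacting, threshold
# of 4 files: score (6 - 2 - 1) / 4 = 0.75.
print(first_score(6, 2, 1, 4))  # 0.75

files = [{"name": "A", "overlap_sst": 3},
         {"name": "B", "overlap_sst": 1},
         {"name": "C", "overlap_sst": 2}]
print([f["name"] for f in pick_order(files)])  # ['B', 'C', 'A']
```

The level with the maximum first_score is triggered first; within a level, the file with the smallest second_score is picked first.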
Fig. 6. Four-level priority scheduling in ProckStore. The light gold circles represent SSTables selected for compaction at the current level, while the light blue circles denote
SSTables selected for compaction at the next level. The dark blue circles indicate SSTables that are not selected for compaction. The yellow rectangles represent newly generated
compaction tasks, the light red rectangles signify compaction tasks assigned to the device side, and the dark orange rectangles represent compaction tasks assigned to the host
side.
where Overlap_sst denotes the number of overlapping SSTables. The SSTable with the smallest number of overlapping SSTables at the lower level is prioritized when selecting the SSTables to be compacted. The metadata information of the SSTables is maintained using a linked list to facilitate insertion and deletion. However, querying the overlap of SSTables with the lower level costs O(n) time complexity to maintain the order of the linked list. We can ensure that the minimum number of SSTables is selected in each compaction, reducing compaction time.

A third level of priority is the priority of allocating compaction tasks. In the stage of compaction task allocation, we consider the different computational capabilities of the host and the device sides for compaction tasks. Meanwhile, the compaction processing efficiency varies with the configurations of the host and the device and the data paths of read, write, and transmission. Therefore, it is necessary to select appropriate compaction tasks for both the host and the device. When all the SSTables involved in a compaction are selected, the compaction task information is generated and inserted into the compaction task queue. The compaction task queue of each level is a double-ended priority task queue, sorted in ascending order according to the number of SSTables in a compaction task. The queue is heap sorted with time complexity 𝑂(𝑛 log 𝑛). Initially, the host obtains tasks from the left side of the queue, with fewer SSTables, and the device side gets tasks from the right side, with more SSTables. The host and the device sides record the compaction time.

During the compaction process, the host and the device record the time cost of five consecutive compaction tasks and the data volume of the compacted SSTables. The third_score is the ratio of the data volume of the compacted SSTables to the time taken for five consecutive compaction tasks, which is given as

third_score = S_host_sst / T_host_comp,  for the host
third_score = S_device_sst / T_device_comp,  for the device    (3)

where S_host_sst and T_host_comp denote the total data volume of compacted SSTables on the host and the time cost, respectively. S_device_sst and T_device_comp denote the corresponding quantities on the device.

host side; (3) the score of the host side is equal. In case (1), the default rules of acquiring tasks in the queue remain unchanged. In case (2), the situation is changed so that the device side fetches tasks from the left side of the queue and the host obtains tasks from the right side of the queue. In case (3), the default rules for taking tasks are still maintained, and the host side and the device side re-record and calculate the compaction time at one end, then make judgments according to the comparison results. In a running process, the configuration of the host and device sides cannot change, so the rules for taking tasks from the queue, once decided by the numerical comparison, are not revisited; the decision on the task-taking rules cannot be changed during this process.

A fourth level of prioritization is the priority of executing sub-tasks. After the SSTables are selected, they are integrated into a complete compaction task that reaches the stage of compaction execution on the NDP device or the host. The compaction task is decomposed into multiple sub-tasks, which can be executed in parallel on the device. Notably, a sub-task refers to a sub-compaction performed in the multi-threaded compaction mechanism of RocksDB. In a compaction process, the primary thread first executes a sub-task; the first sub-task is designated for execution by the main thread by default. The remaining sub-tasks spawn sub-threads to be executed concurrently. Then, the primary thread merges the results and writes them back in a unified manner.

The amounts of data and the execution times differ among the sub-tasks, so computational resources are underutilized by default. To address this issue, we prioritize the concurrent execution of sub-tasks. Let fourth_score = S_SST, where S_SST denotes the total data volume of SSTables in each sub-task. When dividing the sub-tasks, we compare the data size of each sub-task. The sub-task containing the smallest data is set to the highest priority. That is, the smaller the fourth_score, the higher the priority, and the highest-priority sub-task is placed onto the primary thread for compaction. The compaction execution time can be illustrated as

T_exe = T_pthread + T_subthread,    (4)

where T_exe, T_pthread, and T_subthread represent the overall execution time,
𝑇𝑑 𝑒𝑣𝑖𝑐 𝑒_ 𝑐 𝑜𝑚𝑝 represent the total data volume of compacted SSTables on primary thread execution time, and sub-thread execution time. The sub-
the device and the time cost, respectively. We use the third_score to task with the least execution time is placed into the primary thread
evaluate the compaction processing capability of both the host and for execution to reduce the execution time. Notably, the sub-thread
device sides. The side with a higher compaction processing capability execution time is determined by the sub-task with the longest execution
handles tasks containing a large number of SSTables from the right end time. This procedure cannot affect the execution time of sub-tasks,
of the queue, while the other side handles tasks from the left end. thereby reducing the overall execution time and improving the system
The larger the value of third_score1 is, the less efficient the com- performance.
paction is. Compared 𝑡𝑖𝑟𝑑_𝑠𝑐 𝑜𝑟𝑒𝑜𝑠𝑡 with 𝑡𝑖𝑟𝑑_𝑠𝑐 𝑜𝑟𝑒𝑑 𝑒𝑣𝑖𝑐 𝑒 , there are
three cases: (1) the score of the host side is greater than that of the 3.4. Triple-level filter compaction
device side; (2) the score of the device side is greater than that of the
The asynchronous compaction method of ProckStore improves the
compaction performance; however, this method brings the write am-
1
It indicates that the compaction operation spends more time processing plification problem. Therefore, we propose the mechanism of triple-
an SSTable. level filtering compaction (see Fig. 7) . In a compaction procedure,
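The third- and fourth-level rules of Section 3.3 can be sketched in Python. This is a minimal sketch under stated assumptions: the task dictionaries, function names, and the serialized fetch loop are illustrative stand-ins, not ProckStore's implementation.

```python
from collections import deque

def third_score(comp_time, data_volume):
    """Eq. (3): time spent per unit of compacted data over the last five
    compaction tasks; a larger score means a less efficient side."""
    return comp_time / data_volume

def make_task_queue(tasks):
    """Double-ended priority queue: tasks sorted ascending by SSTable count,
    so small tasks sit at the left end and large tasks at the right end."""
    return deque(sorted(tasks, key=lambda t: t["num_sstables"]))

def assign_tasks(queue, host_score, device_score):
    """Default rule: the host pops small tasks from the left and the device
    pops large tasks from the right. If the device is the less efficient side
    (higher third_score), the ends are swapped so the more capable side
    handles the tasks with more SSTables. Fetching is serialized here for
    clarity; in practice both sides fetch concurrently."""
    host_from_left = device_score <= host_score
    host_tasks, device_tasks = [], []
    while queue:
        host_tasks.append(queue.popleft() if host_from_left else queue.pop())
        if queue:
            device_tasks.append(queue.pop() if host_from_left else queue.popleft())
    return host_tasks, device_tasks

def pick_primary(subtasks):
    """Fourth level: fourth_score is the total SSTable volume of a sub-task;
    the smallest sub-task runs on the primary thread, which must also merge
    the results of all sub-threads afterwards."""
    return min(subtasks, key=lambda s: s["bytes"])
```

With tasks of 2, 3, 4, 5, and 8 SSTables and equal scores, the host ends up with the 2-, 3-, and 4-SSTable tasks and the device with the 8- and 5-SSTable tasks; raising the device's score swaps the ends.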
H. Sun et al. Journal of Systems Architecture 160 (2025) 103342
Fig. 7. The triple-level filter compaction in ProckStore.
Fig. 8. The transmission module between the host and the device in ProckStore.
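The triple-level filtering described in this section can be sketched as follows. As an illustrative assumption, levels are modeled as in-memory key-to-value dicts rather than on-disk SSTables; the point is that dropping keys duplicated across all three levels from the middle level leaves the merged result unchanged while shrinking the compaction input.

```python
def triple_level_filter(level_i, level_i1, level_i2):
    """Drop from the middle level L(i+1) every key that also appears in both
    L(i) and L(i+2): the newest version in L(i) supersedes it and the oldest
    version in L(i+2) is merged anyway, so filtering shrinks the compaction
    input without changing the merged output."""
    dup = set(level_i) & set(level_i1) & set(level_i2)
    filtered_mid = {k: v for k, v in level_i1.items() if k not in dup}
    # Merge priority: L(i) (newest) overrides L(i+1), which overrides L(i+2).
    merged = {**level_i2, **filtered_mid, **level_i}
    return merged, dup
```

Filtering reduces the number of entries that must be read, transferred, and rewritten, which is where the write-amplification saving comes from.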
In a compaction procedure, triple-level filtering compaction involves SSTables from three levels in order to remove duplicate data. The triggering of the triple-level filtering mechanism, however, requires certain conditions to be met. When performing a compaction involving SSTables in levels 𝐿𝑖 and 𝐿𝑖+1, ProckStore examines the first_score value of level 𝐿𝑖+1. If the value is greater than 1, the triple-level filtering compaction is triggered, additionally involving the SSTables from level 𝐿𝑖+2 that overlap with those from level 𝐿𝑖+1. This mechanism helps reduce duplicate writes and alleviates write amplification.

As triple-level filtering compaction contains overlapping-key-range SSTables of levels 𝐿𝑖, 𝐿𝑖+1, and 𝐿𝑖+2, it raises the problem of excessive compaction data. When performing the three-level compaction, some key ranges can exist in all of levels 𝐿𝑖, 𝐿𝑖+1, and 𝐿𝑖+2. These key ranges can be deleted and filtered out at the intermediate level (i.e., level 𝐿𝑖+1), which affects neither the update of new keys from level 𝐿𝑖 to 𝐿𝑖+2 nor the merging of old keys in level 𝐿𝑖+2. At the stage of generating compaction tasks, we mark the duplicate key ranges when picking the overlapping SSTables from the three levels and filter these duplicate key ranges out of level 𝐿𝑖+1 in advance. Then, the newest keys in level 𝐿𝑖 and the oldest keys in level 𝐿𝑖+2 are retained. This approach reduces the data volume involved in compaction by removing redundancy across the three levels, thereby alleviating the issue of excessive data in compaction operations.

As shown in Fig. 7, when level 𝐿1 performs compaction, the score of level 𝐿2 is greater than 1, satisfying the condition of triple-level filter compaction. Comparing the key ranges of levels 𝐿𝑖, 𝐿𝑖+1, and 𝐿𝑖+2, ProckStore filters out and deletes the keys that exist across all three levels. In Fig. 7, the keys 7, 8, 9, 10, 11, and 13 are filtered out from level 𝐿2 before the compaction operation is performed. These keys are placed in the compaction queue, awaiting the asynchronous manager to initiate the compaction. According to Eq. (1), these keys are marked as 𝑆𝑛𝑜_𝑐𝑜𝑚𝑝, causing the first_score values of levels 𝐿1 and 𝐿2 to drop below 1 once these keys are subtracted. This reduces the excess data in the level.

The default compaction method in ProckStore merges the SSTables with overlapping key ranges in levels 𝐿𝑖 and 𝐿𝑖+1 and then writes new SSTables into level 𝐿𝑖+1. This process introduces write amplification. The SSTables newly written to level 𝐿𝑖+1 may immediately need to be combined with the SSTables with overlapping key ranges in level 𝐿𝑖+2 to form a new compaction task. These SSTables are merged and the new data are written to level 𝐿𝑖+2, resulting in additional write amplification for data previously written to level 𝐿𝑖+1. Consequently, this procedure incurs two instances of write amplification. The triple-level filtering compaction instead combines all the overlapping-key-range SSTables in the three levels into one compaction. The data in level 𝐿𝑖 is written directly to level 𝐿𝑖+2, which eliminates one compaction process and the write amplification from level 𝐿𝑖+1 to 𝐿𝑖+2.

3.5. Transmission in ProckStore

In ProckStore, data is transferred between the host and device sides, as shown in Fig. 8. During compaction, a large amount of data is read from the NDP device to the host, which involves transferring many KV pairs and results in significant data migration overhead. To address this issue, we employ key-value separation for compaction, which minimizes data migration between the host and the NDP device, reduces write amplification, and improves compaction performance. In the compaction process, only the keys of the KV pairs are read, sorted, and written back to the NDP device. The key size is less than 1 KB, while the value size exceeds 1 KB. During compaction on the host side, the NDP device transmits only the keys to the host, while the device keeps the values locally. Afterward, the host processes the keys, sends them back to the device, and the device integrates them into SSTables. This approach significantly reduces data migration between the CPU and memory on the host side and minimizes the overhead of data transfers between the device and the host.

The key-value separation mechanism is implemented during host-side compaction, with the entire KV pair stored on the NDP device. Only the key is processed during host-side compaction, reducing both device-to-host data transfers and host-side compaction operations. Based
on the compaction information, the device separates the keys from the values in the SSTables. The key array stores the address of each value, which is used for subsequent reorganization. The keys are then sent to the host for a sort-merge operation. Afterward, the compacted keys are sent back to the device, where they are reorganized into new SSTables based on the value addresses. Three threads are used, one for each step: (1) the separation thread on the device, (2) the merge-operation thread on the host, and (3) the key-value reorganization thread on the device. The host-side compaction task is divided into the following three steps:

▵ Step 1: The key-value separation thread on the NDP device retrieves the KV pairs based on the SSTable data format. The keys and values are stored in corresponding arrays in the NDP device's memory. In the key array, each key records the subscript of its corresponding value, so looking up a value in the array takes O(1) time. The value array remains in the NDP device's shared memory and waits for the sorted key array to be fetched back from the host. The key array is transferred to the host via the host-NDP interface.

▵ Step 2: The host fetches the key array, sorts the individual keys, and sends the sorted key array to the NDP device for restructuring. All of these steps are organized within a single thread.

▵ Step 3: Upon receiving the new key array, the NDP device finds each key's corresponding value based on its subscript. Simultaneously, the device reconstructs the new SSTables according to the order of the keys. To minimize data transfer time between the host and device, the data volume is reduced, and a separate transfer thread ensures that the communication between the host and device remains unaffected, minimizing transmission latency.

As shown in Fig. 8, only the keys (which are reconstructed on the host side) are passed between the host and the device. When the host performs compaction, a compaction request is sent to the device, which then provides the necessary data from the NDP device. SSTables 1 and 2 from level 𝐿0 and SSTables 3 and 4 from level 𝐿1 are separated. The duplicate keys and offset addresses are passed to the host, which executes the compaction procedure. After deduplication, the keys are re-transmitted to the NDP device, where they are reorganized into a new SSTable (SSTable 7) in level 𝐿1.

By reducing the transmission overhead on the host side, the device reduces the time cost of compaction tasks, aligning with the requirements of the NDP architecture. At the fourth priority level, the host handles most of the compaction tasks, which contain more SSTables, thereby relieving the device's computational load. However, the NDP device not only processes the values for the host but also handles the KV pairs in its own compaction tasks, which increases the device's processing pressure. To alleviate this, cross-level compaction is employed to reduce the computational strain on the NDP device.

When a compaction process is triggered in level 𝐿𝑖 and the first_score of level 𝐿𝑖+1 exceeds one, cross-level compaction is initiated. This process searches for SSTables with overlapping key ranges in the subsequent level 𝐿𝑖+2. Unlike traditional compaction, cross-level compaction in ProckStore continuously searches for overlapping key-range SSTables in level 𝐿𝑖+2. Subsequently, the SSTables from levels 𝐿𝑖, 𝐿𝑖+1, and 𝐿𝑖+2 undergo compaction, and the newly generated SSTables are written to level 𝐿𝑖+2.

The trigger selection in level 𝐿𝑖 follows the priority rules, while the selection of SSTables in levels 𝐿𝑖+1 and 𝐿𝑖+2 is based on their second_score (see Eq. (2)). SSTables that traditional compaction would write to level 𝐿𝑖+1 may be written to level 𝐿𝑖+2 through cross-level compaction. This cross-level approach helps balance the SSTable distribution across levels, reducing the number of compaction operations. However, it introduces a drawback: a compaction involving many SSTables takes longer to execute. For the NDP device, the data transmission time can be ignored, thereby reducing the overall compaction time. As illustrated in Fig. 8, SSTables 1 and 2 in level 𝐿0, SSTables 3 and 4 in level 𝐿1, and SSTables 8 and 9 in level 𝐿2 are involved in compaction on the NDP device, and the new data is written into SSTable 13 in level 𝐿2. With an asynchronous mechanism, priority scheduling, and optimized data transmission under the NDP-empowered KV store, ProckStore efficiently optimizes the compaction process.

4. Experimental settings

Platform. We implemented ProckStore based on RocksDB and conducted experiments to assess its performance. To evaluate ProckStore, we constructed a test platform simulating the NDP architecture, where data transfer between the host and the NDP device occurs over Ethernet. Although this platform was used for validation, ProckStore is scalable to real NDP platforms. SocketRocksDB, a version of RocksDB deployed on the NDP collaborative framework, was used as the baseline. TStore, PStore, and ProckStore all share the NDP-empowered storage framework. The experimental platform comprises two subsystems: a host-side subsystem and a device-side NDP subsystem. The host is equipped with an Intel(R) Core(TM) i3-10100 CPU (8 cores) and 16 GB of DRAM, while the NDP device runs on an ARM-based development board with four Cortex-A76 cores, four Cortex-A55 cores, 16 GB of DRAM, and a 256-GB Western Digital SSD. A network cable with a bandwidth of 1000 Mbps connects the host to the NDP device.

The host system runs Ubuntu 16.04, and RocksDB version 6.10.2 is employed. The NDP platform uses a lightweight embedded operating system. Data transfer between the host and the NDP device is handled through the SOCKET interface, replacing the standard POSIX interface to ensure efficient data transmission. In RocksDB, the buffer and SSTable sizes are set to 4 MB, the block size is 4 KB, and the level settings remain at their default values. The number of sub-tasks on the host is limited to 4, and all other configuration parameters in RocksDB are set to their default values.

Workloads. We evaluate the performance of ProckStore under realistic workloads. The details of the DB_Bench and YCSB-C workloads used in the experiments are presented in Table 1; the Type column lists the different workloads. The DB_Bench workloads are used to assess random-write performance: db_bench_1 is configured in random-write mode with a fixed value size of 1 KB and varying data sizes (10 GB, 20 GB, 30 GB, and 40 GB), and db_bench_2 is configured in random-write mode with multiple value sizes (1 KB, 4 KB, 16 KB, and 64 KB) and two data volumes (10 GB and 40 GB). We also employ YCSB-C to measure ProckStore's performance under mixed read-write workloads.

5. Performance evaluation

We conduct experiments to evaluate the performance of ProckStore in terms of throughput, latency, and write amplification (WA).

5.1. Performance under DB_Bench with various data volumes

In this section, we evaluate the performance of ProckStore using DB_Bench with various data volumes and a 4-KB value. Fig. 9 illustrates the impact of data volume on performance, focusing on throughput, WA, CPU utilization, and bandwidth. ProckStore delivers peak performance with 10-GB workloads, achieving up to 48% higher throughput compared to PStore, and an average improvement of 40%. Under 40-GB workloads, the WA of TStore and SocketRocksDB reaches its maximum, while ProckStore's WA remains constant at 1.4 across all cases. Under 30-GB workloads, ProckStore's throughput decreases by an average of 67% and 61% compared to TStore and SocketRocksDB, respectively. This performance decrement is attributed to the frequency of compaction operations, which consume bandwidth and degrade overall performance. PStore exhibits lower CPU utilization than ProckStore across all workloads; the multi-threaded approach in ProckStore better utilizes the computing resources. In contrast, SocketRocksDB prioritizes data storage over compaction, leading to lower CPU utilization than PStore and ProckStore.
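The WA metric used throughout this evaluation follows the usual LSM-tree definition; a minimal sketch (this is the standard accounting, not ProckStore's exact instrumentation):

```python
def write_amplification(user_bytes, storage_bytes):
    """WA = total bytes written to storage (memtable flushes plus every byte
    rewritten by compaction) divided by the bytes the application wrote."""
    return storage_bytes / user_bytes
```

For example, writing 10 GB of KV pairs while flushes and compaction write 14 GB in total to storage yields WA = 14 / 10 = 1.4, the constant value reported for ProckStore above.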
Table 1
Workload characteristics used in the experiment.

Workloads in DB_Bench:
    Type        Feature       Fillrandom   Value Size (× 1 KB)   Data Size (× 10 GB)
    db_bench_1  100% writes   ✓            4×                    1×, 2×, 3×, 4×
    db_bench_2  100% writes   ✓            1×, 4×, 16×, 64×      1×, 4×

Workloads in YCSB-C (Load data size: 1×, 2× of 10 GB; Run data size: 1×, 2× of 10 GB; record size: 1× of 1 KB):
    Type   Feature                              Distribution
    A      50% Reads, 50% Updates               Zipfian
    B      95% Reads, 5% Updates                Zipfian
    C      100% Reads                           Zipfian
    D      95% Reads, 5% Inserts                Latest
    E      95% Range Queries, 5% Inserts        Uniform
    F      50% Reads, 50% Read-Modify-Writes    Zipfian
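The operation counts implied by the DB_Bench configurations in Table 1 follow from the data size and value size; a sketch (key bytes and metadata overhead are ignored for simplicity):

```python
def num_operations(data_size_gb, value_size_kb):
    """KV pairs needed to write data_size_gb of values of value_size_kb each."""
    return (data_size_gb * 1024 * 1024) // value_size_kb

# db_bench_1: 4-KB values over 10-40 GB of data
db_bench_1_ops = [num_operations(g, 4) for g in (10, 20, 30, 40)]
# db_bench_2: 1- to 64-KB values over a 10-GB data volume
db_bench_2_ops = {v: num_operations(10, v) for v in (1, 4, 16, 64)}
```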
Fig. 9. The results of TStore, PStore, SocketRocksDB, and ProckStore under Fillrandom DB_Bench with various data volumes.
5.1.1. Write amplification (WA)
A large WA indicates significant duplication of write operations, which degrades system performance. In SocketRocksDB, WA is primarily caused by the write-ahead log and by compaction on the host. WA increases with the amount of data, as the number of compaction operations is proportional to the data size. As shown in Fig. 9(b), under a 10-GB workload, the WA of TStore and PStore is reduced by 39% and 62%, respectively, compared to SocketRocksDB, which performs all compaction tasks on the host. By offloading a portion of the compaction tasks to the NDP device, TStore and PStore reduce WA. Notably, ProckStore exhibits a 55% reduction in WA. A similar trend is observed under 20- and 30-GB workloads. For a 40-GB workload, the WA of ProckStore is reduced by 36.4% and 72.0% compared to TStore and SocketRocksDB, respectively.

5.1.2. Throughput
In Fig. 9(c), the operation time of ProckStore ranges from 828.35 micros/op to 1819.18 micros/op. The operation time of ProckStore is lower than that of SocketRocksDB because ProckStore takes less time to execute write and read operations. ProckStore reduces operation time by 72.8% compared to TStore under a 20-GB workload. Meanwhile, under a 40-GB workload, ProckStore reduces the operation time by 24.0% and 61.5% compared to PStore and SocketRocksDB, respectively. In Fig. 9(a), with a 40-GB dataset, the throughput of ProckStore is 4.15× and 1.47× higher than that of TStore and PStore, respectively, while ProckStore achieves 2.75× the throughput of SocketRocksDB. Under a 10-GB workload, ProckStore achieves 2.81× the write throughput of SocketRocksDB through the multi-threaded asynchronous approach. In addition, with a 10-GB dataset, the throughput of ProckStore is 45.2% higher than that of PStore.

Other KV stores (excluding SocketRocksDB) leverage collaborative strategies between the host and the NDP device to accelerate compaction,
thereby enhancing throughput. ProckStore optimizes resource allocation with a multi-threaded asynchronous approach, which improves performance. Its throughput exceeds 4.57 MB/s, achieving a 48% improvement over PStore.

5.1.3. CPU utilization
CPU utilization refers to the proportion of CPU resources consumed by the KV stores under different workloads. TStore utilizes a single-threaded approach on both the host and the device, leading to low CPU utilization (see Figs. 9(d) and 9(e)). As a result, TStore's CPU utilization is lower than that of SocketRocksDB. Despite leveraging multi-threaded concurrency, SocketRocksDB faces a transmission bottleneck between the host and the device. During task processing, the host quickly performs merge operations; however, there is significant latency during read and write operations. By offloading a portion of the tasks to the NDP device, ProckStore reduces CPU idle time and improves CPU utilization on the host. Compared to SocketRocksDB, ProckStore achieves improvements of 97% and 89% in CPU utilization under 10-GB and 40-GB workloads, respectively. ProckStore demonstrates the highest host-side CPU utilization, peaking at 7.03% under a 10-GB workload.

ProckStore's multi-threaded method on the host further enhances CPU utilization. As shown in Fig. 9(e), PStore employs a single-threaded, asynchronous method, offering greater flexibility than traditional scheduling models. Furthermore, reduced compaction time increases the device-side CPU utilization of PStore by over 20.49%, a 73% improvement compared to TStore under a 10-GB workload. In ProckStore, device-side CPU utilization is further enhanced through cross-level compaction: this metric increases by 27%, 33%, 40%, and 35% compared to PStore under 10-, 20-, 30-, and 40-GB workloads, respectively.

5.1.4. Compaction bandwidth
The compaction bandwidth reveals the compaction performance of a KV store. In this paper, the term compaction bandwidth refers to the host-side compaction bandwidth, as ProckStore primarily focuses on optimizing host-side performance. For instance, the four-level priority scheduling in Section 3.3 prioritizes four steps (triggering, task generation, task allocation, and task execution on the host) to perform asynchronous compaction efficiently, and the triple-level filter compaction in Section 3.4 combines two compaction procedures into one, thereby improving host-side compaction performance. We therefore define compaction bandwidth as the ratio of the amount of compacted data to the compaction time on the host side. SocketRocksDB performs all compaction tasks on the host, while the other KV stores provide compaction bandwidth on both the host and the NDP device. In Fig. 9(f), the single-threaded TStore fails to fully leverage the multi-core computational capabilities of the host, resulting in an average bandwidth of 2.35 MB/s.

In contrast, SocketRocksDB uses the multi-threaded method to raise its bandwidth to 2.86 MB/s, which outperforms the other baseline KV stores. This is because the host handles all the tasks and therefore processes a large total amount of data, and the collaborative solution improves processing efficiency on the host. Under 40-GB workloads, ProckStore's bandwidth improves by 3.56× and 1.51× over SocketRocksDB and PStore, respectively.

5.2. Performance under DB_Bench with various value sizes

We configured the workloads with various value sizes and two data volumes (10 GB and 40 GB). A large value size increases the compaction overhead while improving the throughput under workloads with a fixed data volume. ProckStore maintains optimal performance under workloads with different value sizes and both data volumes (see Figs. 10 and 11). ProckStore's throughput increases on average by 63.1% and 77.7% compared to PStore and SocketRocksDB, respectively, in the case of a 1-KB value (see Fig. 10(a)). The performance increases at 64 KB because large-value workloads trigger more frequent compaction and shorter running times. ProckStore has the best performance in terms of bandwidth, with an average improvement of 1.67× compared to PStore and 2.32× compared to SocketRocksDB across all value sizes. Meanwhile, ProckStore achieves the highest host- and device-side CPU utilization. In Fig. 11(e), the device-side CPU utilization of PStore and ProckStore is similar due to task stacking on the device under large data volumes.

5.2.1. Write amplification (WA)
With increasing value sizes, the amount of data on the host grows, exacerbating WA in TStore and SocketRocksDB. In Fig. 11(b), the WA of TStore and SocketRocksDB is the lowest (2.18 and 5.2) with a 16-KB value. Under 1-KB value workloads, WA increases to 2.39 and 6.1, respectively. ProckStore's WA is unaffected by host-side compaction. Under 1-KB and 64-KB workloads, ProckStore reduces WA by 76.2% and 75.1%, respectively, compared to SocketRocksDB, and the reduction increases to 76.4% and 77.6% under 10-GB workloads. This improvement is due to ProckStore's triple-level filter compaction on the host, which reduces the number of compaction operations and the volume of compacted data.

5.2.2. Throughput
In Figs. 10(a) and 11(a), ProckStore's average throughput ranges from 3.8 MB/s to 5.1 MB/s and from 2.7 MB/s to 4.0 MB/s under 10-GB and 40-GB workloads, respectively. It is worth noting that ProckStore's throughput increases compared with PStore, indicating lower response times to foreground requests. Compared with SocketRocksDB, ProckStore improves throughput by 2.04× and 2.1× under 40-GB workloads with 1-KB and 16-KB values, respectively. Compared with PStore, ProckStore improves throughput by 1.51× and 1.58× (see Fig. 11(a)). In particular, compared with TStore, ProckStore achieves 4.1× and 2.68× improvements under workloads with 4-KB and 64-KB values, respectively.

5.2.3. CPU utilization
Large values increase compaction overhead and host-side CPU utilization, which peaks under workloads with a 64-KB value. ProckStore's host-side and device-side CPU utilization reach 10.83% and 29.11%, respectively (see Figs. 10(e) and 11(d)), while SocketRocksDB's values are 8.34% and 18.28%. Additionally, ProckStore's CPU utilization on the two sides is 8.27% and 25.39% under 40-GB workloads with 1-KB values. On average, ProckStore's CPU utilization is 3.35× and 4.1× higher than that of TStore and SocketRocksDB, respectively, and it outperforms PStore in both host- and device-side CPU utilization under all workloads.

5.2.4. Compaction bandwidth
In Figs. 10(f) and 11(f), the compaction bandwidth of the KV stores varies. TStore's device-side bandwidth peaks at 3.14 MB/s, while ProckStore shows an average improvement of 4.29× and 1.61× over TStore and PStore, respectively. SocketRocksDB leverages multi-threaded parallelism to enhance computation and reduce processing time, leading to superior bandwidth performance under 40-GB workloads across all value sizes. However, PStore achieves higher bandwidth than SocketRocksDB under 10-GB workloads. ProckStore outperforms all other stores in terms of bandwidth across all workloads, achieving a 3.54× improvement over SocketRocksDB under workloads with a 64-KB value.

5.3. Performance under YCSB-C

YCSB-C provides realistic workloads, which we use to evaluate the compaction performance of TStore, PStore, SocketRocksDB, and ProckStore. We configure this workload with two data volumes, 10 GB and 20 GB, in the Load and Run phases. We define the configuration with a 10-GB Load and a 10-GB Run as small data volumes, and a 20-GB Load and a 20-GB Run as large data volumes. We use six types of workloads in the experiment.
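The compaction bandwidth figures quoted in Sections 5.1.4 and 5.2.4 are the amount of compacted data divided by the host-side compaction time; as a sketch with illustrative numbers (not measured data):

```python
def compaction_bandwidth_mb_s(compacted_bytes, compaction_seconds):
    """Host-side compaction bandwidth in MB/s: compacted data volume
    divided by the time spent compacting it (Section 5.1.4)."""
    return compacted_bytes / (1024 * 1024) / compaction_seconds
```

For example, 235 MB compacted in 100 s gives 2.35 MB/s, matching the order of magnitude of the averages reported for TStore.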
Fig. 10. The results of TStore, PStore, SocketRocksDB, and ProckStore under Fillrandom DB_Bench with 10-GB data volume and various value sizes.
Fig. 11. The results of TStore, PStore, SocketRocksDB, and ProckStore under Fillrandom DB_Bench with 40-GB data volume and various value sizes.
Fig. 12. The results of TStore, PStore, SocketRocksDB, and ProckStore under YCSB-C with load 10 GB and run 10 GB data volume.
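The six YCSB-C mixes of Table 1 can be captured as configuration data; a sketch (the dictionary keys are illustrative, not YCSB's property names):

```python
YCSB_WORKLOADS = {
    "A": {"read": 0.50, "update": 0.50, "dist": "zipfian"},
    "B": {"read": 0.95, "update": 0.05, "dist": "zipfian"},
    "C": {"read": 1.00, "dist": "zipfian"},
    "D": {"read": 0.95, "insert": 0.05, "dist": "latest"},
    "E": {"scan": 0.95, "insert": 0.05, "dist": "uniform"},
    "F": {"read": 0.50, "read_modify_write": 0.50, "dist": "zipfian"},
}

def write_fraction(mix):
    """Fraction of operations that write (updates, inserts, and RMWs)."""
    return sum(v for k, v in mix.items()
               if k in ("update", "insert", "read_modify_write"))
```

Workloads A and F are the most write-intensive mixes, which is why the Load/Run discussion below focuses on them.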
5.3.1. Case 1: Load 10 GB and Run 10 GB
Load. In YCSB-C, the Load workload is write-intensive, resulting in frequent compaction. ProckStore optimizes compaction under various workloads (Fig. 12). Its throughput outperforms that of SocketRocksDB by a factor of 2.3×. TStore benefits from time-aware dynamic task scheduling, which narrows its performance gap with ProckStore. PStore's asynchronous compaction improves performance, and ProckStore's multi-threaded execution further enhances the asynchronous compaction strategy. Consequently, ProckStore's throughput is 4.24× and 1.80× higher than that of TStore and PStore, respectively. During the Run phase, ProckStore's collaborative mode improves performance under write-intensive workloads. Workloads A and F exhibit the highest write ratios, at 29.0%. Under workload A, ProckStore's throughput is 28.2% and 29.0% higher than that of TStore and PStore, respectively. Under workload F, ProckStore's throughput surpasses that of TStore and PStore by 36.4% and 71.3%, respectively. However, when the write percentage is low, ProckStore's throughput shows minimal variation compared to the other KV stores. Additionally, under read-intensive workloads, ProckStore achieves maximum throughput improvements of 60.7%, 59.8%, 122.4%, and 9.2% under workloads B, C, D, and E, respectively. In contrast, TStore's read performance suffers due to an excessive number of SSTables, which increases the query operation overhead.

Throughput and Latency. Throughput and latency are critical metrics for KV stores. As KV stores are widely deployed in real-world applications, these metrics significantly affect response time. ProckStore maintains its performance advantage in the Load phase when the data size increases from 10 GB to 20 GB. Under a workload with the same amount of data, its throughput is 4.24× that of TStore (see Fig. 12(a)), and compared with SocketRocksDB and PStore, ProckStore's throughput improves by 2.33× and 1.8×, respectively. The advantage of ProckStore becomes even more pronounced under workloads A and F, which involve a higher percentage of writes. The latency results further demonstrate the flexibility of ProckStore's scheduling method. Under workloads D and F, ProckStore has 55.1% and 42.5% lower latency than PStore (see Fig. 12(c)). Compared with TStore and PStore, ProckStore has 20.8% and 22.1% lower latency under workload A, respectively (see Fig. 12(c)). Under the read-intensive workload C, ProckStore's average latency is 13.4% and 37.5% lower than that of TStore and SocketRocksDB, respectively. ProckStore exhibits similar trends under workloads B and D. Under workload E, the average latency of ProckStore is 5.72% and 8.45% lower than that of TStore and PStore, respectively. Moreover, the throughput of ProckStore is never lower than that of the other KV stores.

Write Amplification (WA). ProckStore achieves lower WA than both TStore and SocketRocksDB (see Fig. 12(b)): WA is reduced by approximately 1.2 compared to SocketRocksDB, and ProckStore's host-side multi-threaded method further decreases WA by an average of 62.3% compared to TStore. The minimum WA of ProckStore is 1.20, under workload C. WA in ProckStore is influenced by the write-ahead log and host-side compaction; because its compaction frequency is higher, its WA is greater than that of PStore. Under workloads C and D, WA is 1.20 and 1.36, respectively. However, ProckStore's triple-level filter compaction mechanism mitigates WA compared to SocketRocksDB.

CPU Utilization and Compaction Bandwidth. Figs. 12(d) and 12(e) show the CPU utilization of the KV stores. Notably, TStore runs on a single thread, and the network bandwidth limits the data transfer between the host and the device. Overall, the CPU utilization patterns of SocketRocksDB and ProckStore are similar on both sides, which can be attributed to the reduction in total processing time accompanied by a reduction in compaction time. Fig. 12(f) shows the compaction bandwidth in the Load and Run phases. ProckStore achieves the highest bandwidth on the host: its bandwidth is 8.64× and 3.32× higher than that of TStore and SocketRocksDB, respectively, under workload A, and the improvement reaches 7.97× and 2.99× under workload F. ProckStore improves the bandwidth by exploiting multi-threaded parallelism. The average bandwidth of ProckStore is 56.8% higher than that of PStore due to its efficient task scheduling, which leverages the computational capabilities of both the host and the device.

5.3.2. Case 2: Load 20 GB and Run 20 GB
In the Load phase, the throughput of ProckStore surpasses that of SocketRocksDB and TStore by 3.44× and 3.73×, respectively (see Fig. 13(a)). Although the asynchronous approach of PStore enhances performance, the multi-threaded method of ProckStore integrates with the asynchronous compaction mechanism. Consequently, the throughput of
12
H. Sun et al. Journal of Systems Architecture 160 (2025) 103342
Fig. 13. The results of TStore, PStore, SocketRocksDB, and ProckStore under YCSB-C with load 20 GB and run 20-GB data volume.
ProckStore reaches 1.59× that of PStore. In the Run phase, the multi-threaded asynchronous mode improves the performance of ProckStore under write-intensive workloads A and F, where half of the operations are writes. Specifically, under workload A, ProckStore's throughput exceeds that of TStore and PStore by 21.0% and 23.1%, respectively. Similarly, under workload F, ProckStore achieves 19.1% and 21.3% higher throughput than TStore and PStore, respectively. Workloads A and F involve large data volumes. Under these workloads, when the data volume increased from 10 GB to 20 GB, the throughput of ProckStore decreased by 32.2% and 42.7%, respectively. In addition, the throughput of ProckStore is optimized under read-intensive workloads. In ProckStore, we focus on optimizing compaction, so the read performance improvement is small. For read-intensive workloads, ProckStore achieves 20.4%, 16.3%, and 28.5% improvement under B, C, and D, compared with PStore, respectively. For workloads with small data volumes, ProckStore decreases by 27.8%, 8.4%, and 40.6% under workloads B, C, and D, respectively.

Throughput and Latency. With a data size of 20 GB, ProckStore maintains its performance advantage in the Load phase. Compared with SocketRocksDB and PStore, ProckStore's throughput is improved by 3.44× and 1.58×, respectively, in the Load phase. Both the average latency and the throughput of ProckStore show the highest performance in Figs. 13(a) and 13(c). Under read-intensive workloads such as B and C, ProckStore outperforms SocketRocksDB by about 9.1% and 21.4%, respectively. This improvement is attributed to the triple-filtering compaction, which reduces execution time in the Run phase, thereby increasing throughput.

As shown in Fig. 13(c), under write-intensive workload A, the average latency of ProckStore is 17.6% and 18.9% lower than that of TStore and PStore, respectively. ProckStore shows a similar trend under workload F. In addition, ProckStore's latency is also reduced by 16.9%, 14.1%, and 22.2% under read-intensive workloads B, C, and D, compared with SocketRocksDB, respectively. However, compared with the 10 GB data volume, the latency increases due to the additional compaction operations and the associated lookup costs.

CPU Utilization and Compaction Bandwidth. When the data volume increases from 10 GB to 20 GB, Figs. 13(d) and 13(e) illustrate the changes in CPU utilization for ProckStore under various workloads. ProckStore increases host- and device-side CPU utilization by 18.1% and 32.6%, respectively, compared with PStore under workload C. Under mixed read-write workloads, such as A and F, ProckStore increases host-side CPU utilization by 6.7% and 20.0% and device-side CPU utilization by 12.2% and 13.2%, respectively. Fig. 13(f) shows the compaction bandwidth of the Run phase. ProckStore achieves the highest bandwidth under all workloads. In workload C, ProckStore's bandwidth is 45.4% and 35.1% higher than that of PStore and SocketRocksDB, respectively. This improvement is attributed to ProckStore's utilization of multi-threaded parallelism. However, with large data volumes, ProckStore's bandwidth decreases by 17.7% compared to the 10-GB data volume. Under workload D, ProckStore's average bandwidth is 6.47× and 1.38× higher than that of TStore and SocketRocksDB, respectively. In comparison to the 10-GB data volume, the CPU utilization decreases by 38.1%, 32.3%, 17.7%, 33.7%, 30.9%, and 33.1% under workloads A, B, C, D, E, and F, respectively.

5.3.3. Tail latency

We analyzed the tail latency of ProckStore, including P90, P99, and P999 latencies, and compared it with TStore, SocketRocksDB, and PStore under workloads of different data volumes (10 GB, 20 GB) and a 1-KB value size. The experimental results are shown in Figs. 14 and 15.

The results demonstrate that ProckStore outperforms the other key-value stores, exhibiting lower tail latency. SocketRocksDB's P90 and P99 tail latencies are notably lower than those of TStore and PStore, owing to the multi-version management mechanism in RocksDB. ProckStore's P90 and P99 tail latencies are lower still, thanks to its asynchronous allocation method, which reduces tail latency. Under a 10 GB workload, the most significant reduction in latency occurs when ProckStore lowers P90 latency by 94.07% and 93.89% compared to TStore and PStore under workload E. This improvement is attributed to ProckStore's superior range query performance, while TStore and PStore are not optimized for range queries. Similarly, ProckStore achieves the lowest P99 latency. Fig. 14(b) shows that ProckStore achieves the most significant P99 latency reduction under workload E, lowering P99 by 79.4% and 79.2% compared to TStore and PStore, respectively. It also shows substantial improvements under workload B,
with ProckStore reducing P99 by 75.6% and 76.2% compared to TStore and PStore.

Fig. 14. The tail latency of ProckStore under YCSB-C with load 10 GB and run 10 GB data volume.

Fig. 15. The tail latency of ProckStore under YCSB-C with load 20 GB and run 20 GB data volume.

In Fig. 15, the differences in tail latency become more pronounced under a 20 GB workload. ProckStore reduces P90 latency by 9.32% and 31.06% under workloads A and F, respectively, compared to PStore. Under workload E, ProckStore reduces P90 latency by 93.46% and 93.68% compared to PStore and TStore, respectively. ProckStore's four-level priority scheduling mechanism prevents low-priority requests from blocking high-priority writes, reducing the extreme write latency often caused by flush or compaction blocking in TStore and SocketRocksDB. Similarly, ProckStore reduces P99 tail latency under workloads A and F by 28.9% and 17.9%, respectively, compared to SocketRocksDB and TStore. Under workload C, ProckStore reduces P99 tail latency by 54.9% and 9.45% compared to PStore and SocketRocksDB, respectively. Under workload D, ProckStore reduces P99 tail latency by 23.5% and 6.0% compared to the same alternative KV stores. ProckStore performs best under workload E, reducing P99 tail latency by 79.22%, 78.69%, and 6.23% compared to TStore, PStore, and SocketRocksDB, respectively.

The FIFO scheduling used by traditional KV stores like SocketRocksDB can cause high-priority requests to be blocked, leading to increased tail latency. In contrast, ProckStore's multi-level queue scheduling mechanism enables compaction tasks to be executed in priority order, with high-priority compaction tasks executed first, thereby reducing tail latency.

6. Extended experiment

In this section, we study the impact of multi-threading and the number of subtasks on ProckStore's performance. The results demonstrate the effectiveness of ProckStore under multi-threaded execution and verify its performance under various numbers of subtasks. The environment of the extended experiment is the same as the experimental configuration in Section 4.

Fig. 16. Write performance of ProckStore under DB_Bench with different numbers of subtasks.

6.1. Impact of number of subtasks

To validate the fourth-level prioritization, we conducted experiments to evaluate the impact of various subtasks on the write performance of ProckStore. The extended experiments replicate the configuration from Section 4. We configured DB_Bench with a 10 GB dataset and a 1-KB value. Specifically, we examine the impact of the number of subtasks on the fourth-level prioritization in ProckStore by configuring four types of subtasks on the host. The experimental results are shown in Fig. 16.

As shown in Fig. 16(a), the throughput of ProckStore increases significantly with the number of subtasks. The throughput is 2.33 MB/s, 2.81 MB/s, and 3.13 MB/s for one, two, and three subtasks, respectively, and subsequently stabilizes. With four subtasks, ProckStore achieves a peak throughput of 3.8 MB/s. The average latency shows a similar trend, where ProckStore achieves the lowest latency (0.22 ms) with four subtasks, showing a 17.1% improvement from three to four subtasks. The host-side CPU utilization also reflects ProckStore's performance with different numbers of subtasks, as multi-core CPUs enable parallel execution of multiple threads.

As shown in Fig. 16(c), CPU utilization increases with the number of subtasks, allowing the CPU to utilize its computational resources fully. CPU utilization is 4.78% (the lowest) with one subtask, improving by 10.3% with two subtasks. The highest CPU utilization (6.84%) occurs with four subtasks. However, as the number of subtasks increases, the performance improvements in CPU utilization, throughput, and average latency become less pronounced. This is because, while parallel execution of multiple threads reduces compaction execution time, the overhead from thread creation and synchronization increases. As the number of threads grows, this additional CPU overhead impacts ProckStore's CPU utilization.

6.2. Impact of number of threads

In Section 2.2, we studied the performance of PStore with different numbers of threads. For the multi-threaded comparison experiment of ProckStore, we extended the analysis by comparing its throughput with that of PStore under varying thread counts. The experimental results are shown in Fig. 17.

Fig. 17(a) shows the throughput of ProckStore and PStore under workloads with 4 KB values and a 10 GB data volume. As the number of threads increases, the throughput of PStore does not increase proportionally, and its performance is poor during multi-threaded writes. Specifically, the throughput increases by only 1.58% when the number of threads rises from 8 to 12. In contrast, the throughput of ProckStore increases significantly with the number of threads. At 12 threads, the throughput reaches 7.86 MB/s, which is 10.9% higher than that at 8 threads. ProckStore's throughput is 144.1% higher than PStore's, as its multi-threaded execution efficiently processes the large data volume written by multiple threads, avoiding the computational limitations of single-thread execution in PStore.

As shown in Fig. 17(b), the average latency of PStore decreases with the increase in threads under a 10 GB data volume workload.
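The decomposition behind these experiments, splitting one compaction into range-partitioned subtasks that are merged in parallel worker threads, can be sketched in a few lines. The toy model below is illustrative only: the function names, in-memory data layout, and partitioning policy are our assumptions, not ProckStore's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def merge_runs(runs):
    """Merge sorted (key, value) runs; later runs win on duplicate keys."""
    out = {}
    for run in runs:
        for k, v in run:
            out[k] = v
    return sorted(out.items())

def compact(runs, num_subtasks=4):
    """Toy compaction: split the key space into disjoint ranges and merge
    each range in its own worker thread, mimicking how one compaction
    task is decomposed into parallel subtasks."""
    keys = sorted({k for run in runs for k, _ in run})
    if not keys:
        return []
    step = max(1, len(keys) // num_subtasks)
    bounds = [keys[i] for i in range(0, len(keys), step)][:num_subtasks]

    def subtask(i):
        lo = bounds[i]
        hi = bounds[i + 1] if i + 1 < len(bounds) else None
        sliced = [[(k, v) for k, v in run
                   if k >= lo and (hi is None or k < hi)] for run in runs]
        return merge_runs(sliced)

    with ThreadPoolExecutor(max_workers=num_subtasks) as pool:
        parts = list(pool.map(subtask, range(len(bounds))))
    # Ranges are disjoint and ordered, so concatenation is already sorted.
    return [kv for part in parts for kv in part]
```

Because the key ranges are disjoint, the per-range merges are independent; the pool creation and synchronization in `compact` is exactly the kind of overhead that makes the gains taper off as the subtask and thread counts grow.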
Fig. 17. Write performance of ProckStore under DB_Bench with different number of threads.
However, the decrease in latency is more significant when the number of threads is low, such as from 1 to 4 threads, where the latency drops by 27.8%. For ProckStore, as the number of threads increases, performance improves steadily. At 4 threads, the average latency of ProckStore is 0.154 ms, and at 12 threads, it is reduced to 0.104 ms, a 32.3% reduction. Fig. 17(c) shows that when the number of threads reaches 12, the host-side CPU utilization for PStore and ProckStore is highest, at 8.74% and 11.62%, respectively, with ProckStore showing a 41.92% increase over PStore. Additionally, the CPU utilization of the two systems increases by 7.28% and 10.7%, respectively, when the number of threads increases from 8 to 12. As the number of threads decreases, CPU utilization also drops. With 1 thread, the CPU utilization is at its lowest: 6.48% for PStore and 9.37% for ProckStore.

7. Related work

LSM-tree has become a popular data structure in key-value storage systems, offering an alternative to traditional structures by efficiently handling write-intensive workloads and large-scale datasets. Although KV stores manage data through compaction operations, these processes come at the cost of performance. Consequently, several studies have sought to mitigate the performance impact of compaction in KV stores.

LSM-tree structure. PebblesDB [13] introduces the FLSM data structure, which relaxes the restriction of non-overlapping key ranges within a level, thereby delaying the compaction process and reducing WA. WiscKey [14] separates keys and values to minimize WA during compaction but increases garbage collection overhead. To address this issue, HashKV [15] employs hash partitioning and a hot/cold partitioning strategy, while DiffKV [16] separates keys based on the size of key-value pairs to balance performance. FenceKV [17] enhances HashKV by incorporating a fence-value-based partitioning strategy and key-range-based garbage collection, optimizing range queries. FGKV [18] and Spooky [19] reduce WA by adjusting the data granularity in compaction. FGKV introduces a fine-grained compaction mechanism based on the LSpM-tree structure, minimizing redundant writes of irrelevant data. Spooky partitions the data at the largest level into equal-sized files and partitions the smaller levels according to file boundaries for fine-grained compaction.

For compaction strategies, TRIAD [20] improves LSM-tree performance by optimizing logs, memory, and storage. The works in [21,22] optimize the traditional top-level-driven compaction of LSM-trees by shifting to a lower-level-driven approach, decomposing large compaction tasks into smaller ones to reduce granularity. WipDB [23] utilizes a bucket-sort-like algorithm that minimizes merge operations by writing KV pairs in an approximately sorted list. Although these studies enhance compaction efficiency, they primarily focus on a single device and fail to address the competition for CPU and I/O resources between foreground requests and background tasks. In contrast, NDP devices expand computational resources to process tasks internally, reducing data transfer and resource contention.

Storage architecture. ListDB [24] employs a skip-list as the core data structure at all levels within non-volatile memory (NVM) or persistent memory (PM) to mitigate the WA problem by leveraging byte-addressable in-place merge ordering. This approach reduces the gap between DRAM and NVM write latency and addresses the write stall issue. HiKV [25] utilizes the benefits of hash and B+Tree indexes to design a KV store on hybrid DRAM-NVM storage systems, where hash indexes in NVM are used to enhance indexing performance. In a hybrid NVM-SSD system, WaLSM [26] tackles the WA problem through virtual partitioning, dividing the key space during compaction. Additionally, a reinforcement-learning method is applied to balance the merging strategy of different partitions under various workloads, optimizing read and write performance. TrieKV [27] integrates DRAM, PM, and disk into a unified storage system, utilizing a trie-structured index for all KV pairs in memory, enabling dynamic determination of KV pair locations across storage hierarchies and persistence requirements. Moreover, ROCKSMASH [28] utilizes local storage for frequently accessed data and metadata, while cloud storage is employed for less frequently accessed data.

Computing architecture. Heterogeneous computing [29] (e.g., GPUs, DPUs, and FPGAs) alleviates the computational burden on the CPU. Sun et al. [30] propose an accelerated solution for key-value stores by offloading the compaction task to an FPGA. Similarly, the FPGA-accelerated KV store [31] offloads the compaction task to the FPGA, minimizing competition for CPU resources and accelerating compaction while reducing CPU bottlenecks. LUDA [32] employs GPUs to process SSTables using a co-ordering mechanism that minimizes data movement, thereby reducing CPU pressure. gLSM [33] separates keys and values to minimize data transfer between the CPU and GPU, thereby accelerating compaction. dCompaction [34] leverages DPUs to accelerate the compaction and decompaction of SSTables, offloading compaction tasks to the DPU according to a hierarchical structure, relieving CPU overload. Despite these advances, heterogeneous computing still requires data transfer from host-side memory to the computing units, which can impact overall system performance.

Near-data processing (NDP), which offloads computational tasks from the CPU to the data location, is an emerging computing paradigm. Previous studies [35] investigated storage computing and proposed frameworks for storage- and memory-level processing. Biscuit [36] introduces a generalized framework for NDP. RFNS [37] examines the advantages of reconfigurable NDP-driven servers based on ARM and FPGA architectures for data- and compute-intensive applications. λ-IO [38] designs a unified computational storage stack to manage storage and computing resources through interfaces, runtime systems, and scheduling. HuFu [39] is an I/O scheduling architecture for computable SSDs that allows the system to manage background I/O tasks, offload computational tasks to SSDs, and exploit the parallelism and idle time of flash memory for improved task scheduling. Li et al. [40] address the resource contention problem between user I/O and NDP requests, using the critical path to maximize the parallelism of multiple requests, thereby improving the performance of hybrid NDP-user I/O workflows.
ABNDP [41] leverages a novel hardware-software collaborative optimization approach to solve the challenges of remote data access and computational load balancing without requiring trade-offs.

In addition, hosts and NDP devices employ distinct task scheduling policies to collaborate on compaction tasks [9,10,42]. nKV [43] defines data formats and layouts for computable storage devices and designs both hardware and software architectures to optimize data placement and computation. KV-CSD [44] builds NDP architectures using NVMe SSDs and system-on-chip designs to reduce data movement during queries by offloading tasks. Research such as OI-RAID [45] introduces an additional fault tolerance mechanism by adding an extra level on top of the RAID levels, enabling fast recovery and enhanced reliability. KVRAID [46] utilizes logical-to-physical key conversion to pack similar-sized KV pairs into a single physical object, thereby reducing WA, and applies off-site update techniques to mitigate I/O amplification. Distributed storage systems, such as EdgeKV [47], have also been explored. A sharding strategy is used to distribute data across multiple edge nodes, while consistent hashing ensures balanced data distribution and high availability. ER-KV [48] integrates a hybrid fault-tolerant design combining erasure coding and PBR, providing fault tolerance to ensure system reliability and high availability. Additionally, Song et al. [49] coupled each SSD with a dedicated NDP engine in an NDP server to fully leverage the data transfer bandwidth of SSD arrays. MStore [50] extends an NDP device to multiple devices, utilizing them to perform compaction tasks.

Although NDP devices can handle host-side computational tasks, their resources remain limited. Consequently, it is critical to optimize the use of these resources on the NDP device. The multi-threaded asynchronous method in ProckStore addresses this challenge by fully utilizing computation on both the host and device sides, avoiding resource wastage while ensuring sufficient computational capacity on the NDP device.

8. Conclusions

In this paper, we present ProckStore, an NDP-empowered KV store, to improve compaction performance for large-scale unstructured data storage. In ProckStore, the multi-threaded and asynchronous mechanism leverages computational resources within storage devices, reducing data movement and enhancing compaction efficiency. ProckStore optimally schedules compaction tasks across the host and NDP device by implementing a four-level priority scheduling mechanism. This separation of compaction stages provides parallel processing without interference, achieving efficient resource utilization. In addition, ProckStore uses key-value separation to reduce data transfer between the host and NDP device, minimizing transmission time. Experimental results show that ProckStore outperforms existing synchronous and single-threaded asynchronous NDP-empowered KV stores, achieving up to 4.2× higher throughput than the baseline KV store. ProckStore also reduces WA, compaction time, and CPU utilization.

CRediT authorship contribution statement

Hui Sun: Writing - review & editing, Writing - original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Chao Zhao: Writing - review & editing, Writing - original draft, Visualization, Validation, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Yinliang Yue: Validation, Supervision, Software. Xiao Qin: Supervision, Resources, Methodology, Formal analysis, Data curation.

Declaration of competing interest

The authors declare that there is no conflict of interest regarding the publication of this article.

References

[1] Z. Zhang, Y. Sheng, T. Zhou, et al., H2O: Heavy-hitter oracle for efficient generative inference of large language models, in: Advances in Neural Information Processing Systems, vol. 36, 2024.
[2] H. Lin, Z. Wang, S. Qi, et al., Building a high-performance graph storage on top of tree-structured key-value stores, Big Data Min. Anal. 7 (1) (2023) 156-170.
[3] S. Pei, J. Yang, Q. Yang, REGISTOR: A platform for unstructured data processing inside SSD storage, ACM Trans. Storage (TOS) 15 (1) (2019) 1-24.
[4] IDC, IDC innovators: Privacy-preserving computation, 2023, [EB/OL]. (2023-09-20). https://www.idc.com/getdoc.jsp?containerId=prCHC51469323.
[5] P. O'Neil, E. Cheng, D. Gawlick, E. O'Neil, The log-structured merge-tree (LSM-tree), Acta Inform. 33 (4) (1996) 351-385.
[6] Google, LevelDB, 2025, https://leveldb.org/.
[7] Facebook, RocksDB: a persistent key-value store for fast storage environments, 2016, http://rocksdb.org/.
[8] A. Acharya, M. Uysal, J. Saltz, Active disks: Programming model, algorithms and evaluation, Oper. Syst. Rev. 32 (5) (1998) 81-91.
[9] H. Sun, W. Liu, J. Huang, et al., Collaborative compaction optimization system using near-data processing for LSM-tree-based key-value stores, J. Parallel Distrib. Comput. 131 (2019) 29-43.
[10] H. Sun, W. Liu, Z. Qiao, et al., DStore: A holistic key-value store exploring near-data processing and on-demand scheduling for compaction optimization, IEEE Access 6 (2018) 61233-61253.
[11] H. Sun, et al., Asynchronous compaction acceleration scheme for near-data processing-enabled LSM-tree-based KV stores, ACM Trans. Embed. Comput. Syst. 23 (6) (2024) 1-33.
[12] I.K. Nti, et al., A mini-review of machine learning in big data analytics: Applications, challenges, and prospects, Big Data Min. Anal. 5 (2) (2022) 81-97.
[13] P. Raju, R. Kadekodi, V. Chidambaram, et al., PebblesDB: Building key-value stores using fragmented log-structured merge trees, in: Proceedings of the 26th Symposium on Operating Systems Principles, 2017, pp. 497-514.
[14] L. Lu, T.S. Pillai, H. Gopalakrishnan, et al., WiscKey: Separating keys from values in SSD-conscious storage, ACM Trans. Storage (TOS) 13 (1) (2017) 1-28.
[15] H.H.W. Chan, C.J.M. Liang, Y. Li, et al., HashKV: Enabling efficient updates in KV storage via hashing, in: 2018 USENIX Annual Technical Conference, USENIX ATC 18, 2018, pp. 1007-1019.
[16] Y. Li, Z. Liu, P.P.C. Lee, et al., Differentiated key-value storage management for balanced I/O performance, in: 2021 USENIX Annual Technical Conference, USENIX ATC 21, 2021, pp. 673-687.
[17] C. Tang, J. Wan, C. Xie, FenceKV: Enabling efficient range query for key-value separation, IEEE Trans. Parallel Distrib. Syst. 33 (12) (2022) 3375-3386.
[18] H. Sun, G. Chen, Y. Yue, et al., Improving LSM-tree based key-value stores with fine-grained compaction mechanism, IEEE Trans. Cloud Comput. (2023).
[19] N. Dayan, T. Weiss, S. Dashevsky, et al., Spooky: granulating LSM-tree compactions correctly, in: Proceedings of the VLDB Endowment, Vol. 15, (11), 2022, pp. 3071-3084.
[20] O. Balmau, D. Didona, R. Guerraoui, et al., TRIAD: Creating synergies between memory, disk and log in log structured key-value stores, in: 2017 USENIX Annual Technical Conference, USENIX ATC 17, 2017, pp. 363-375.
[21] Y. Chai, Y. Chai, X. Wang, et al., LDC: a lower-level driven compaction method to optimize SSD-oriented key-value stores, in: 2019 IEEE 35th International Conference on Data Engineering, ICDE, 2019, pp. 722-733.
[22] Y. Chai, Y. Chai, X. Wang, et al., Adaptive lower-level driven compaction to optimize LSM-tree key-value stores, IEEE Trans. Knowl. Data Eng. 34 (6) (2020) 2595-2609.
[23] X. Zhao, S. Jiang, X. Wu, WipDB: A write-in-place key-value store that mimics bucket sort, in: 2021 IEEE 37th International Conference on Data Engineering, ICDE, 2021, pp. 1404-1415.
[24] W. Kim, C. Park, D. Kim, et al., ListDB: Union of write-ahead logs and persistent SkipLists for incremental checkpointing on persistent memory, in: 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 161-177.
[25] F. Xia, D. Jiang, J. Xiong, et al., HiKV: a hybrid index key-value store for DRAM-NVM memory systems, in: 2017 USENIX Annual Technical Conference, USENIX ATC 17, 2017, pp. 349-362.
[26] L. Chen, R. Chen, C. Yang, et al., Workload-aware log-structured merge key-value store for NVM-SSD hybrid storage, in: 2023 IEEE 39th International Conference on Data Engineering, ICDE, 2023, pp. 2207-2219.
[27] H. Sun, et al., TrieKV: A high-performance key-value store design with memory as its first-class citizen, IEEE Trans. Parallel Distrib. Syst. (2024).
[28] P. Xu, N. Zhao, J. Wan, et al., Building a fast and efficient LSM-tree store by integrating local storage with cloud storage, ACM Trans. Archit. Code Optim. (TACO) 19 (3) (2022) 1-26.
[29] H. Zhou, Y. Chen, L. Cui, G. Wang, X. Liu, A GPU-accelerated compaction strategy for LSM-based key-value store system, in: The 38th International Conference on Massive Storage Systems and Technology, 2024, pp. 1-11.
[30] X. Sun, J. Yu, Z. Zhou, et al., FPGA-based compaction engine for accelerating LSM-tree key-value stores, in: 2020 IEEE 36th International Conference on Data Engineering, ICDE, 2020, pp. 1261-1272.
[31] T. Zhang, J. Wang, X. Cheng, et al., FPGA-accelerated compactions for LSM-based key-value store, in: 18th USENIX Conference on File and Storage Technologies, FAST 20, 2020, pp. 225-237.
[32] P. Xu, J. Wan, P. Huang, et al., LUDA: Boost LSM key value store compactions with GPUs, 2020, arXiv preprint arXiv:2004.03054.
[33] H. Sun, J. Xu, X. Jiang, et al., gLSM: Using GPGPU to accelerate compactions in LSM-tree-based key-value stores, ACM Trans. Storage (2023).
[34] C. Ding, J. Zhou, J. Wan, et al., DComp: Efficient offload of LSM-tree compaction with data processing units, in: Proceedings of the 52nd International Conference on Parallel Processing, 2023, pp. 233-243.
[35] E. Riedel, G. Gibson, C. Faloutsos, Active storage for large-scale data mining and multimedia applications, in: Proceedings of 24th Conference on Very Large Databases, 1998, pp. 62-73.
[36] B. Gu, A.S. Yoon, D.H. Bae, et al., Biscuit: A framework for near-data processing of big data workloads, ACM SIGARCH Comput. Archit. News 44 (3) (2016) 153-165.
[37] X. Song, T. Xie, S. Fischer, Two reconfigurable NDP servers: Understanding the impact of near-data processing on data center applications, ACM Trans. Storage (TOS) 17 (4) (2021) 1-27.
[38] Z. Yang, Y. Lu, X. Liao, et al., λ-IO: A unified IO stack for computational storage, in: 21st USENIX Conference on File and Storage Technologies, FAST 23, 2023, pp. 347-362.
[39] Y. Wang, Y. Zhou, F. Wu, et al., Holistic and opportunistic scheduling of background I/Os in flash-based SSDs, IEEE Trans. Comput. (2023).
[40] J. Li, X. Chen, D. Liu, et al., Horae: A hybrid I/O request scheduling technique for near-data processing-based SSD, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 41 (11) (2022) 3803-3813.
[41] B. Tian, Q. Chen, M. Gao, ABNDP: Co-optimizing data access and load balance in near-data processing, in: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol. 3, 2023, pp. 3-17.
[42] H. Sun, W. Liu, J. Huang, et al., Near-data processing-enabled and time-aware compaction optimization for LSM-tree-based key-value stores, in: Proceedings of the 48th International Conference on Parallel Processing, 2019, pp. 1-11.
[43] T. Vincon, A. Bernhardt, I. Petrov, et al., nKV: near-data processing with KV-stores on native computational storage, in: Proceedings of the 16th International Workshop on Data Management on New Hardware, 2020, pp. 1-11.
[44] I. Park, Q. Zheng, D. Manno, et al., KV-CSD: A hardware-accelerated key-value store for data-intensive applications, in: 2023 IEEE International Conference on Cluster Computing, CLUSTER, 2023, pp. 132-144.
[45] N. Wang, Y. Xu, Y. Li, et al., OI-RAID: a two-layer RAID architecture towards fast recovery and high reliability, in: 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN, 2016, pp. 61-72.
[46] M. Qin, A.L.N. Reddy, P.V. Gratz, et al., KVRAID: high performance, write efficient, update friendly erasure coding scheme for KV-SSDs, in: Proceedings of the 14th ACM International Conference on Systems and Storage, 2021, pp. 1-12.
[47] K. Sonbol, Ö. Özkasap, I. Al-Oqily, et al., EdgeKV: Decentralized, scalable, and consistent storage for the edge, J. Parallel Distrib. Comput. 144 (2020) 28-40.
[48] Y. Geng, J. Luo, G. Wang, et al., ER-KV: High performance hybrid fault-tolerant key-value store, in: 2021 IEEE 23rd International Conference on High Performance Computing & Communications; 7th International Conference on Data Science & Systems; 19th International Conference on Smart City; 7th International Conference on Dependability in Sensor, Cloud & Big Data Systems & Application, HPCC/DSS/SmartCity/DependSys, 2021, pp. 179-188.
[49] X. Song, T. Xie, S. Fischer, A near-data processing server architecture and its impact on data center applications, in: High Performance Computing: 34th International Conference, ISC High Performance 2019, Frankfurt/Main, Germany, June 16-20, 2019, Proceedings 34, Springer International Publishing, 2019, pp. 81-98.
[50] H. Sun, Q. Wang, Y.L. Yue, et al., A storage computing architecture with multiple NDP devices for accelerating compaction performance in LSM-tree based KV stores, J. Syst. Archit. 130 (2022) 102681.