Journal of Systems Architecture 160 (2025) 103342
Contents lists available at ScienceDirect
Journal of Systems Architecture
journal homepage: www.elsevier.com/locate/sysarc

ProckStore: An NDP-empowered key-value store with asynchronous and multi-threaded compaction scheme for optimized performance✩

Hui Sun a,∗, Chao Zhao a, Yinliang Yue b, Xiao Qin c
a Anhui University, Jiulong Road 111, Hefei, 230601, Anhui, China
b Zhongguancun Laboratory, Cuihu North Road 2, Beijing, 100094, China
c Auburn University, The Quad Center Auburn, Auburn, 36849, AL, USA
ARTICLE INFO

Keywords:
Near-data processing (NDP)
LSM-tree
Asynchronous multi-threaded compaction
Write amplification
Key-value separation

ABSTRACT

With the exponential growth of large-scale unstructured data, LSM-tree-based key-value (KV) stores have become increasingly prevalent in storage systems. However, KV stores face challenges during compaction, particularly when merging and reorganizing SSTables, which leads to high I/O bandwidth consumption and performance degradation due to frequent data migration. Near-data processing (NDP) techniques, which integrate computational units within storage devices, alleviate the data movement bottleneck to the CPU. The NDP framework is a promising solution to address the compaction challenges in KV stores. In this paper, we propose ProckStore, an NDP-enhanced KV store that employs an asynchronous and multi-threaded compaction scheme. ProckStore incorporates a multi-threaded model with a four-level priority scheduling mechanism covering the compaction stages of triggering, selection, execution, and distribution, thereby minimizing task interference and optimizing scheduling efficiency. To reduce write amplification, ProckStore utilizes a triple-level filtering compaction strategy that minimizes unnecessary writes. Additionally, ProckStore adopts a key-value separation approach to reduce data transmission overhead during host-side compaction. Implemented as an extension of RocksDB on an NDP platform, ProckStore demonstrates significant performance improvements in practical applications. Experimental results indicate a 1.6× throughput increase over the single-threaded and asynchronous model and a 4.2× improvement compared with synchronous schemes.
✩ This work is supported in part by the National Natural Science Foundation of China under Grants 62472002 and 62072001. Xiao Qin's work is supported by the U.S. National Science Foundation (Grants IIS-1618669 and OAC-1642133), the National Aeronautics and Space Administration, United States (Grant 80NSSC20M0044), the National Highway Traffic Safety Administration, United States (Grant 451861-19158), and Wright Media, LLC (Grants 240250 and 240311).
∗ Corresponding author.
E-mail addresses: sunhui@ahu.edu.cn (H. Sun), chaozh@stu.ahu.edu.cn (C. Zhao), yylhust@qq.com (Y. Yue), xqin@auburn.edu (X. Qin).
https://doi.org/10.1016/j.sysarc.2025.103342
Received 31 October 2024; Received in revised form 30 December 2024; Accepted 11 January 2025
Available online 24 January 2025
1383-7621/© 2025 Published by Elsevier B.V.

1. Introduction

The rapid development of large language models [1], graph databases [2], and social networks [3] has led to the real-time generation of large amounts of data, contributing to a global surge in large-scale data. This data is growing exponentially and is increasingly manifested in semi-structured and unstructured formats, in addition to traditional structured data. For example, semi-structured and unstructured data have grown in recent years according to IDC [4], and they now account for over 85% of total data volume. To cope with the large amount of unstructured data, LSM-tree-based key-value stores (KV stores) [5] have become widely adopted in large-scale storage systems.

LSM-tree structures are popularly used in modern database engines (e.g., LevelDB [6], RocksDB [7]). In the LSM-tree structure, key-value pairs are first written to an immutable MemTable in memory and then persisted to disk as Sorted String Tables (SSTables) once a preset threshold is reached. On disk, the LSM-tree is organized hierarchically, with each level having a capacity threshold that increases at a fixed rate as the level number grows. When the amount of data in a level exceeds its threshold, some data migrates to lower levels, potentially causing overlapping key ranges between SSTables in different levels. To maintain data organization and prevent duplication, SSTables with overlapping key ranges must be loaded into memory and merged. The sorted and de-duplicated key-value pairs are then rewritten as new SSTables at a lower level. This process, known as compaction, involves frequent read and write operations that consume a lot of I/O bandwidth between the host and storage devices, thereby delaying foreground requests and degrading system performance.

GPUs, DPUs, and FPGAs. General-purpose graphics processing unit (GPGPU), data processing unit (DPU), and field-programmable gate array (FPGA) offer additional computational resources to address compaction performance challenges. Near-Data Processing (NDP), introduced in the late 1990s as the smart disk [8], has regained attention as an emerging computational paradigm. The enhanced computational power within storage devices has fueled interest in NDP. NDP mitigates the overhead of data movement by reducing transfers to the CPU. The NDP paradigm advocates for computation close to data as an alternative to the computation-centered approach in large-scale systems. This model enables storage devices to use their internal bus for data processing rather than transfer data to the host, where the results would otherwise be computed. Most existing NDP-empowered KV stores, such as Co-KV [9] and TStore [10], tackle compaction tasks using a synchronization-based approach. These works split the compaction task, leveraging either an averaging or a dynamic time-aware strategy. In the synchronization model, the host and the device cannot complete tasks simultaneously, leading to long waiting times and inefficient resource usage. PStore [11] addresses waiting time by using an asynchronous model but fails to fully exploit the benefits of this approach due to its single-threaded processing.
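As a concrete illustration of the compaction just described, the following sketch models a level's capacity check and the merge of overlapping sorted runs. This is a minimal Python illustration, not code from ProckStore or RocksDB; the base capacity and growth multiple are assumed RocksDB-style defaults, and all names are hypothetical.

```python
import heapq

# Assumed illustrative parameters (not values from the paper).
BASE_CAPACITY = 10 * 1024 * 1024   # capacity threshold of level 1
GROWTH = 10                        # fixed per-level growth multiple

def level_capacity(level):
    """Capacity threshold grows at a fixed rate with the level number."""
    return BASE_CAPACITY * GROWTH ** (level - 1)

def needs_compaction(level, level_bytes):
    """When a level's data exceeds its threshold, data must migrate down."""
    return level_bytes > level_capacity(level)

def compact(sstables):
    """Merge sorted SSTables (ordered newest-first) into one sorted,
    de-duplicated run; on duplicate keys the newest entry wins."""
    heap = [(key, age, value)
            for age, table in enumerate(sstables)
            for key, value in table]
    heapq.heapify(heap)
    merged, last_key = [], None
    while heap:
        key, _, value = heapq.heappop(heap)
        if key != last_key:  # first pop per key comes from the newest table
            merged.append((key, value))
            last_key = key
    return merged

print(needs_compaction(2, 200 * 1024 * 1024))      # True: 200 MB > 100 MB
print(compact([[("a", 1), ("b", 2)], [("b", 9), ("c", 3)]]))
# [('a', 1), ('b', 2), ('c', 3)]
```

The de-duplicated output is exactly the rewrite that costs host I/O bandwidth when performed across the storage interface, which is what motivates offloading it to the device.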
To address these issues, we designed an asynchronous NDP scheme, ProckStore, which utilizes a multi-threaded strategy to perform compaction tasks concurrently. All compaction tasks are managed in a thread pool and scheduled using multiple threads, exploiting the benefits of asynchronous processing, where tasks do not interfere with one another. The tasks are executed independently by individual threads. A four-level priority scheduling mechanism is implemented to ensure efficient scheduling of compaction tasks within the thread pool, following the four stages of the compaction process. To address the write amplification issue, a triple-level filtering compaction method is employed, reducing unnecessary writes and alleviating write amplification during compaction on the host side. Furthermore, the transmission process in the NDP architecture and its compaction module is optimized by utilizing a key-value separation technique, minimizing transmission time by sending only the keys to the host. The contributions of this work are summarized as follows:

Fig. 1. The structure of LSM-tree and RocksDB. The LSM-tree is composed of components 𝐶0, 𝐶1, ..., and 𝐶𝑛.
▴ We designed ProckStore using an asynchronous and multi-threaded architecture. Compaction tasks are executed independently within the thread pool, fully leveraging the asynchronous model. This approach significantly enhances write performance compared to the synchronous model and the single-threaded asynchronous scheme.
▴ ProckStore employs four-level priority scheduling to manage the compaction process, which consists of four stages: compaction trigger, compaction picking, compaction execution, and compaction distribution. This scheduling prioritizes tasks at different stages, ensuring optimal efficiency during asynchronous and multi-threaded compaction.
▴ To optimize performance in the NDP transmission architecture, we implemented key-value separation in the host-side compaction, reducing data transmission overhead. The device-side compaction module employs a cross-level compaction technique to alleviate computational load, thereby improving transmission efficiency and overall system throughput.
▴ ProckStore, an extension of RocksDB on the NDP platform, was evaluated using DB_Bench and YCSB-C. Experimental results demonstrate that ProckStore increases throughput by a factor of 1.6× compared to the single-threaded asynchronous PStore, and achieves a 4.2× throughput improvement over the synchronous TStore.

The paper is organized as follows. Section 2 presents the background and motivation of ProckStore. Section 3 presents a system overview of ProckStore and information on each module. Section 4 lists the hardware and software configurations used in the experiments. Section 5 demonstrates the performance of ProckStore through extensive experiments. Section 6 elaborates on the extended experiments. Section 7 summarizes related work. Finally, we conclude our work in Section 8.

2. Background and motivation

2.1. Background

RocksDB is an LSM-tree-based key-value store developed by Facebook, and it is widely used in Facebook's storage systems to achieve high throughput. In RocksDB, the MemTable and Immutable MemTable are stored in memory, while the Sorted String Table (SSTable) is stored on disk. As shown in Fig. 1, key-value pairs from the application are first written to the commit log and then cached in a sorted data structure called the MemTable, which has a limited size (e.g., 4 MB) in memory. Once the MemTable reaches its predefined capacity, it is converted into an Immutable MemTable. A background thread then writes the Immutable MemTable to disk as a sorted string table (SSTable). On disk, SSTables are organized in levels, with each level growing by a fixed multiple.

In Fig. 1(a), the hierarchy of the LSM-tree represents different components, such as 𝐶0, 𝐶1, ..., and 𝐶𝑛. Component 𝐶0 resides in memory. New write data is first written into the sequential log file and then inserted as an entry placed in 𝐶0. However, the high cost of the memory capacity that accommodates 𝐶0 imposes a limit on the size of 𝐶0. To migrate entries to a component on the disk, the LSM-tree performs a merge operation when the size of 𝐶0 reaches the threshold, taking some contiguous segment of entries from 𝐶0 and merging it into a component on the disk. Component 𝐶𝑛 (𝑛 ≥ 1) resides on the disk in the LSM-tree. Although 𝐶1 is disk-resident, the frequently accessed page nodes in 𝐶1 remain in the memory buffer. 𝐶1 has a directory structure like a B-tree but is optimized for sequential access on the disk. The in-memory 𝐶0 serves high-speed writes, and 𝐶𝑛 (𝑛 ≥ 1) on the disk is responsible for persistence and batch-sequential writes. Through the hierarchical and merging strategies, the LSM-tree achieves a balance between write optimization and high-efficiency queries.

In Fig. 1(b), the most recently generated SSTable is placed in the lowest level, 𝐿0. SSTables in level 𝐿0 can have overlapping key ranges, while higher levels are organized by key ranges. Each level has a size threshold for its total SSTables. When this threshold is exceeded, the KV store migrates SSTables from level 𝐿𝑘 to level 𝐿𝑘+1 during compaction. The compaction process selects SSTables from level 𝐿𝑘 and searches for overlapping key ranges in level 𝐿𝑘+1. A merge operation is then performed on the SSTables with overlapping key ranges to produce new SSTables, which are stored in level 𝐿𝑘+1. Obsolete SSTables in level 𝐿𝑘+1 are deleted from the disk. This compaction process incurs computational and storage overhead, which negatively impacts response time and throughput, a significant drawback of the LSM-tree.

Graphical computing [2], machine learning [12], and large language models [1] demand substantial data for model training and inference. As data volumes increase, the overhead associated with transferring data from storage devices to the CPU for computation rises, leading to resource consumption and performance bottlenecks between storage and memory in high-performance systems. Traditional storage architectures struggle to meet the demands of data-intensive applications under these conditions. NDP mitigates this challenge by fully utilizing the device's internal bandwidth. By incorporating embedded computing units, storage devices can perform computational tasks, offloading these operations from the host and eliminating the overhead of moving large data volumes. The results can then be retrieved from the storage device, reducing the need for additional data movement. Furthermore, the KV store can leverage NDP to perform compaction tasks internally, improving compaction efficiency.

Fig. 2. The results of PStore with different numbers of threads (1, 4, 8, and 12) under the Fillrandom DB_Bench workload with various data volumes.

Fig. 3. The results of PStore with different numbers of threads (1, 4, 8, and 12) under the Fillrandom DB_Bench workload with various value sizes.

2.2. Motivation

Most existing studies focus on compaction processing in a single-threaded context. For instance, Co-KV and TStore process compaction tasks synchronously and in a single-threaded mode. PStore, on the other hand, demonstrates its effectiveness in an asynchronous and single-threaded setting. Notably, the asynchronous approach allows compaction tasks to be performed independently; however, it is difficult to fully leverage the benefits of asynchronous processing in a single-threaded environment. Therefore, we investigate the performance of PStore using different thread configurations. Fig. 2(a) presents the throughput of PStore under workloads with a 4-KB value and various data volumes. We can draw two key observations.

▵ As the number of threads increases, the throughput of PStore does not grow exponentially as expected. The throughput improvement is minimal during multi-threaded writes, particularly when the number of threads is 12.

▵ With a large number of threads, the throughput of PStore increases slowly. Under 20-GB workloads, when the thread count increases from 8 to 12, the throughput only increases by 0.12 MB/s.

These findings indicate that the asynchronous compaction advantages of PStore in single-threaded mode are insufficient to handle the large volume of multi-threaded writes. As a result, increasing the number of threads does not enhance throughput, particularly as the thread count becomes large. While the asynchronous approach in PStore takes into account the computational imbalance between the host and the NDP device, it fails to implement an appropriate asynchronous compaction method. The limitations of the single-threaded mode hinder the full potential of the asynchronous compaction mechanism in the KV store.

As shown in Fig. 2(b), the average latency decreases under workloads with various data volumes, but this reduction is most pronounced when using a small number of threads. Specifically, the most significant decrease occurs between 1 and 4 threads, where the average latency is reduced by 27.8%. Additionally, the CPU utilization on the host supports these observations (see Fig. 2(c)), with a 34% increment at 12 threads over 1 thread under 10-GB workloads. The CPU utilization exhibits a 19% increment as the number of threads grows from 1 to 4. The result reveals that PStore is suitable for single- or few-threaded workloads, and it is challenging for it to adapt to multi-threaded applications.
Fig. 4. System overview of ProckStore. 𝑄1 is the first compaction-task queue. Sub 𝑖 (0 < 𝑖 < 𝑛 + 1) represents the 𝑖th sub-compaction task of task 1. 𝐾𝑖 and 𝑉𝑖 denote the 𝑖th key and value, respectively.
We conducted an experiment to study the impact of multiple threads on performance under workloads with 4- and 64-KB value sizes. As shown in Fig. 3, we observe findings similar to those under workloads with various data volumes. The throughput of PStore improves under large value sizes. When increasing the number of threads to 4 under workloads with a 64-KB value, the throughput of PStore is 1.65 MB/s. Furthermore, this metric increases to 1.83 MB/s in the case of 12 threads, an increment of only 11% (see Fig. 3(a)). In Fig. 3(b), the average latency decreases when the number of threads increases to 4. The average latency of 4 threads decreases by 24.8% compared to one thread. The degree of decrease becomes small as the number of threads increases further. The host-side CPU utilization becomes smoother in Fig. 3(c), but the improvement is still most pronounced when there are a limited number of threads.

Thus, we plan to use a multi-threaded approach to implement the asynchronous compaction mechanism in the KV store. We have redesigned the asynchronous compaction mechanism extended from RocksDB, and fully leveraged its internal multi-threaded capability to develop the asynchronous compaction solution ProckStore.

3. Design of ProckStore

3.1. System overview

In this paper, we propose ProckStore, an NDP-empowered KV store that incorporates an asynchronous and multi-threaded compaction scheme. ProckStore consists of a host-side subsystem, an NDP device, and a communication module that connects the two sides, as shown in Fig. 4. The host-side subsystem manages I/O requests, while the NDP device, which serves as a storage unit, extends computational resources to process tasks offloaded from the host. The NDP device stores persistent data, with read and write operations akin to those of standard storage devices. We implement various modules on both the host and the NDP device to accommodate task-offloading requirements. Data is stored as SSTables in a leveled structure on the NDP device. SSTables are transferred either to the host for compaction or to the NDP device for compaction via the communication channel. During transmission, the SSTables pass through a key-value separation module, ensuring that only the keys of the KV pairs are sent to the host for the merge operation. The data flow occurs between the NDP device, the compaction module on the host side, and the compaction module on the device.

As shown in Fig. 4, we illustrate the data flow between the NDP device, the host-side compaction module, and the device-side compaction module. Initially, data is written from the host side (see the host data stream in Fig. 4), and multiple compaction tasks accumulate in the thread pool. These tasks are allocated to the host and device by the asynchronous manager (see the host task stream and device task stream in Fig. 4). After completing the compaction tasks, the data is written to the flash memory inside the NDP device through the transmission module (see the device data stream in Fig. 4). The dark blue line represents the host-side data flow, where ProckStore transfers data from flash memory to the host for compaction tasks. The four-level priority scheduling module manages the thread pool and the host-side compaction module, and controls the asynchronous manager (see the control stream in Fig. 4). When a compaction task is generated, the four-level priority scheduling module places it in the compaction queue of the thread pool. It then determines whether the task should be executed on the host or the device and notifies the asynchronous manager to allocate the task. When the host executes a compaction task, the scheduling module issues instructions to the host-side compaction module to execute it.

We present the asynchronous compaction mechanism with the multi-level task queue module in Section 3.2, where compaction tasks are kept in the thread pool. Section 3.3 presents the four-level priority scheduling module, which controls the priority scheduling in the compaction process. The triple-level filtering compaction module is described in Section 3.4. We describe the transport mechanism on the NDP device and the cross-level compaction module in Section 3.5.

The host-side asynchronous compaction management module allocates compaction tasks to the device side. A multi-level queue stores the tasks awaiting execution and calculates the computational capabilities of both the host and the device. These tasks are then scheduled to the compaction modules on the host and device sides. The host-side compaction module executes the tasks, while the device side must transmit compaction information via the semantic management module, which facilitates communication between the host and the device. The processed information is sent to the device-side compaction module for task execution. The four-level priority scheduling module manages the entire process, from task triggering to execution. Data and commands are transmitted between the host and the device through the semantic management module. The NDP device encodes (decodes) interacting data, storing the SSTable based on key-range granularity, performing garbage collection, maintaining information, and executing compaction tasks on the files.

Fig. 5. Multilevel Task Queue in ProckStore.

3.2. Asynchronous mechanisms

To implement the asynchronous strategy, we decouple the two phases, compaction triggering and execution, to establish conditions for asynchronous compaction. In contrast, the synchronous approach treats the task from compaction trigger to completion as a single process. In the asynchronous mechanism, compaction tasks are continuously generated when the conditions for triggering compaction are met. These tasks, generated during the trigger phase, must be executed. To manage them efficiently, we propose a multi-level task queue that stores compaction tasks uniformly and waits for the asynchronous manager to schedule them. To align with the structure of the LSM-tree, a compaction task queue is assigned to each level, with tasks generated during the trigger phase placed into the task queue of that level, awaiting scheduling. As shown in Fig. 5, ProckStore employs a multi-task queue for each column family. The multi-task queue selects compaction tasks at each level based on a score value.

We implement multi-level task queues in a thread pool. Tasks in each queue are sorted in ascending order by the number of SSTables. A heap sorting algorithm is used in each task queue to ensure sorting occurs in time complexity 𝑂(𝑛 log 𝑛). The task queue is a double-ended queue, allowing compaction tasks to be allocated from both ends to the host and device sides. Since multiple tasks are scheduled in the queue, a thread pool is used to manage the pending tasks in the compaction queue.

3.3. Four-level priority scheduling

An asynchronous mechanism-based compaction procedure separates the two phases: compaction triggering and execution. To achieve this, we propose a four-level priority scheduling strategy that assigns priority levels to the four steps involved: triggering compaction tasks, generating tasks, allocating tasks, and executing tasks. This strategy ensures efficient execution of the asynchronous compaction process.

In the case of a multi-level queue, it is important to prevent the queue from becoming starved of compaction tasks. On both the host and device sides, one side may pause compaction while waiting for new task allocations upon completion of the previous allocation. Consequently, the triggering conditions must be adjusted to trigger more frequently, ensuring a sufficient number of compaction tasks are available in the task queue. Additionally, different priorities must be set for scheduling various tasks. ProckStore assigns a score to each priority level, as shown in Fig. 6.

First-level priority: the priority of the compaction trigger. In the stage of triggering a compaction task, the goal of ProckStore is to select the level most urgently in need of compaction. ProckStore sets the first_score to realize the prioritization of the compaction triggering stage. Due to the structure of the LSM-tree, the score in level 𝐿0 is calculated as the ratio of the number of SSTables to the threshold value of level 𝐿0, whereas the score in the other levels is calculated as the ratio of the total size of SSTables to the threshold value of the level. Thus, the calculation of first_score is also divided into level 𝐿0 and the other levels, as in the following equation:

first_score = (N_sst − N_no_comp − N_being_comp) / N_max,  level 𝑖, 𝑖 = 0
first_score = (S_sst − S_no_comp − S_being_comp) / S_max,  level 𝑖, 𝑖 > 0    (1)

where N_sst and N_max denote the number of SSTables in level 𝐿0 and the threshold on the number of SSTables in level 𝐿0, respectively. N_no_comp and N_being_comp denote the number of SSTables in level 𝐿0 contained in compaction tasks that have been picked into the task queue to be executed, and the number of SSTables in level 𝐿0 contained in compaction tasks currently being executed, respectively. S_sst and S_max denote the total data volume of SSTables in level 𝑖 and the threshold on the data volume of SSTables in level 𝑖, respectively. Correspondingly, S_no_comp and S_being_comp denote the data volume of SSTables included in the compaction tasks picked into the task queue of level 𝑖 to be executed, and the data volume of SSTables included in the compaction task currently being executed. Different from the RocksDB score, the SSTables that have been picked into the compaction queue and the SSTables involved in running compaction tasks are subtracted, excluding SSTables that will soon leave the level, which makes the calculation of the first_score more accurate.

When a level triggers a compaction, the generated compaction task is put into the corresponding task queue of the level, where it waits for the compaction module to process it (see Fig. 6). In the asynchronous trigger mode, the compaction task is not executed immediately; the asynchronous compaction manager waits for it to be scheduled. The device side triggers the compaction task according to the first_score of each level (selecting the maximum value) and puts it into the task queue. This provides the basis for prioritizing compaction triggering and execution between levels, which is the first-level prioritization.

A second level of prioritization is the prioritization of SSTable picking. In the compaction task generation phase, we select some SSTables in the level and all the overlapping SSTables from the following level. These SSTables are merged in the compaction operation. ProckStore puts the SSTables to be compacted into the compaction_queue and then reads the first file's information from the queue. The SSTables in the queue undergo compaction operations sequentially, without considering hot and cold data or the size of the compaction task. Thus, the number of SSTables overlapping with the lower level is added to the FileMetaData of each SSTable. The second_score is set to the number of overlapping SSTables, and the meta-information is sorted in ascending order within each level by the size of the second_score, as in Eq. (2):

second_score = Overlap_sst, for each SSTable    (2)
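Under the definitions above, the trigger-stage and picking-stage scores can be sketched as follows. This is a simplified Python rendering of Eqs. (1) and (2), not ProckStore's implementation; the variable names mirror the equations and the example thresholds are illustrative.

```python
def first_score(amount, no_comp, being_comp, threshold):
    """Eq. (1): for L0 the arguments are SSTable counts; for deeper
    levels they are data volumes. SSTables already claimed by queued
    (no_comp) or running (being_comp) compactions are subtracted so
    the score reflects only data still resident in the level."""
    return (amount - no_comp - being_comp) / threshold

def pick_order(sstables):
    """Eq. (2): second_score is the number of SSTables a file overlaps
    in the next level; picking in ascending order keeps each merge
    as small as possible."""
    return sorted(sstables, key=lambda f: f["overlap_sst"])

# L0 holds 6 SSTables, 2 already queued and 1 compacting, threshold
# of 4 files: score (6 - 2 - 1) / 4 = 0.75.
print(first_score(6, 2, 1, 4))  # 0.75

files = [{"name": "A", "overlap_sst": 3},
         {"name": "B", "overlap_sst": 1},
         {"name": "C", "overlap_sst": 2}]
print([f["name"] for f in pick_order(files)])  # ['B', 'C', 'A']
```

The level with the maximum first_score is triggered first; within a level, the file with the smallest second_score is picked first.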
Fig. 6. Four-level priority scheduling in ProckStore. The light gold circles represent SSTables selected for compaction at the current level, while the light blue circles denote
SSTables selected for compaction at the next level. The dark blue circles indicate SSTables that are not selected for compaction. The yellow rectangles represent newly generated
compaction tasks, the light red rectangles signify compaction tasks assigned to the device side, and the dark orange rectangles represent compaction tasks assigned to the host
side.
where Overlap_sst denotes the number of overlapping SSTables. The SSTable with the smallest number of overlapping SSTables at the lower level is prioritized when selecting the SSTables to be compacted. The metadata information of the SSTables is maintained using a linked list to facilitate insertion and deletion. However, querying the overlap of SSTables with the lower level costs O(n) time complexity to maintain the order of the linked list. We can ensure that the minimum number of SSTables is selected in each compaction, reducing compaction time.

A third level of priority is the priority of allocating compaction tasks. In the stage of compaction task allocation, we consider the different computational capabilities of the host and the device sides for compaction tasks. Meanwhile, the compaction processing efficiency varies with the configurations of the host and the device and the data paths of read, write, and transmission. Therefore, it is necessary to select appropriate compaction tasks for both the host and the device. When all the SSTables involved in a compaction are selected, the compaction task information is generated and inserted into the compaction task queue. The compaction task queue of each level is a double-ended priority task queue, sorted in ascending order according to the number of SSTables in a compaction task. The queue is heap sorted with time complexity 𝑂(𝑛 log 𝑛). Initially, the host obtains tasks from the left side of the queue, with fewer SSTables, and the device side gets tasks from the right side, with more SSTables. The host and the device sides record the compaction time.

During the compaction process, the host and the device record the time cost of five consecutive compaction tasks and the data volume of the compacted SSTables. The third_score is the ratio of the data volume of the compacted SSTables to the time taken for five consecutive compaction tasks, which is given as

third_score = S_host_sst / T_host_comp,  for the host
third_score = S_device_sst / T_device_comp,  for the device    (3)

where S_host_sst and T_host_comp denote the total data volume of compacted SSTables on the host and the time cost, respectively. S_device_sst and T_device_comp denote the corresponding quantities on the device.

host side; (3) the score of the host side is equal. In case (1), the default rules of acquiring tasks in the queue remain unchanged. In case (2), the situation is changed so that the device side fetches tasks from the left side of the queue and the host obtains tasks from the right side of the queue. In case (3), the default rules for taking tasks are still maintained, and the host side and the device side re-record and calculate the compaction time at one end, then make judgments according to the comparison results. In a running process, the configuration of the host and device sides cannot change, so the rules for taking tasks from the queue, once decided by the numerical comparison, are not revisited; the decision on the task-taking rules cannot be changed during this process.

A fourth level of prioritization is the priority of executing sub-tasks. After the SSTables are selected, they are integrated into a complete compaction task that reaches the stage of compaction execution on the NDP device or the host. The compaction task is decomposed into multiple sub-tasks, which can be executed in parallel on the device. Notably, a sub-task refers to a sub-compaction performed in the multi-threaded compaction mechanism of RocksDB. In a compaction process, the primary thread first executes a sub-task; the first sub-task is designated for execution by the main thread by default. The remaining sub-tasks spawn sub-threads to be executed concurrently. Then, the primary thread merges the results and writes them back in a unified manner.

The amounts of data and the execution times differ among the sub-tasks, so computational resources are underutilized by default. To address this issue, we prioritize the concurrent execution of sub-tasks. Let fourth_score = S_SST, where S_SST denotes the total data volume of SSTables in each sub-task. When dividing the sub-tasks, we compare the data size of each sub-task. The sub-task containing the smallest data is set to the highest priority. That is, the smaller the fourth_score, the higher the priority, and the highest-priority sub-task is placed onto the primary thread for compaction. The compaction execution time can be illustrated as

T_exe = T_pthread + T_subthread,    (4)

where T_exe, T_pthread, and T_subthread represent the overall execution time,
𝑇𝑑 𝑒𝑣𝑖𝑐 𝑒_ 𝑐 𝑜𝑚𝑝 represent the total data volume of compacted SSTables on primary thread execution time, and sub-thread execution time. The sub-
the device and the time cost, respectively. We use the third_score to task with the least execution time is placed into the primary thread
evaluate the compaction processing capability of both the host and for execution to reduce the execution time. Notably, the sub-thread
device sides. The side with a higher compaction processing capability execution time is determined by the sub-task with the longest execution
handles tasks containing a large number of SSTables from the right end time. This procedure cannot affect the execution time of sub-tasks,
of the queue, while the other side handles tasks from the left end. thereby reducing the overall execution time and improving the system
The larger the value of third_score1 is, the less efficient the com- performance.
paction is. Compared 𝑡𝑖𝑟𝑑_𝑠𝑐 𝑜𝑟𝑒𝑜𝑠𝑡 with 𝑡𝑖𝑟𝑑_𝑠𝑐 𝑜𝑟𝑒𝑑 𝑒𝑣𝑖𝑐 𝑒 , there are
three cases: (1) the score of the host side is greater than that of the 3.4. Triple-level filter compaction
device side; (2) the score of the device side is greater than that of the
The asynchronous compaction method of ProckStore improves the
compaction performance; however, this method brings the write am-
1
It indicates that the compaction operation spends more time processing plification problem. Therefore, we propose the mechanism of triple-
an SSTable. level filtering compaction (see Fig. 7) . In a compaction procedure,
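The third- and fourth-level rules of Section 3.3 can be sketched in Python. This is a minimal sketch under stated assumptions: the task dictionaries, function names, and the serialized fetch loop are illustrative stand-ins, not ProckStore's implementation.

```python
from collections import deque

def third_score(comp_time, data_volume):
    """Eq. (3): time spent per unit of compacted data over the last five
    compaction tasks; a larger score means a less efficient side."""
    return comp_time / data_volume

def make_task_queue(tasks):
    """Double-ended priority queue: tasks sorted ascending by SSTable count,
    so small tasks sit at the left end and large tasks at the right end."""
    return deque(sorted(tasks, key=lambda t: t["num_sstables"]))

def assign_tasks(queue, host_score, device_score):
    """Default rule: the host pops small tasks from the left and the device
    pops large tasks from the right. If the device is the less efficient side
    (higher third_score), the ends are swapped so the more capable side
    handles the tasks with more SSTables. Fetching is serialized here for
    clarity; in practice both sides fetch concurrently."""
    host_from_left = device_score <= host_score
    host_tasks, device_tasks = [], []
    while queue:
        host_tasks.append(queue.popleft() if host_from_left else queue.pop())
        if queue:
            device_tasks.append(queue.pop() if host_from_left else queue.popleft())
    return host_tasks, device_tasks

def pick_primary(subtasks):
    """Fourth level: fourth_score is the total SSTable volume of a sub-task;
    the smallest sub-task runs on the primary thread, which must also merge
    the results of all sub-threads afterwards."""
    return min(subtasks, key=lambda s: s["bytes"])
```

With tasks of 2, 3, 4, 5, and 8 SSTables and equal scores, the host ends up with the 2-, 3-, and 4-SSTable tasks and the device with the 8- and 5-SSTable tasks; raising the device's score swaps the ends.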
H. Sun et al. Journal of Systems Architecture 160 (2025) 103342
Fig. 7. The triple-level filter compaction in ProckStore.
Fig. 8. The transmission module between the host and the device in ProckStore.
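The triple-level filtering described in this section can be sketched as follows. As an illustrative assumption, levels are modeled as in-memory key-to-value dicts rather than on-disk SSTables; the point is that dropping keys duplicated across all three levels from the middle level leaves the merged result unchanged while shrinking the compaction input.

```python
def triple_level_filter(level_i, level_i1, level_i2):
    """Drop from the middle level L(i+1) every key that also appears in both
    L(i) and L(i+2): the newest version in L(i) supersedes it and the oldest
    version in L(i+2) is merged anyway, so filtering shrinks the compaction
    input without changing the merged output."""
    dup = set(level_i) & set(level_i1) & set(level_i2)
    filtered_mid = {k: v for k, v in level_i1.items() if k not in dup}
    # Merge priority: L(i) (newest) overrides L(i+1), which overrides L(i+2).
    merged = {**level_i2, **filtered_mid, **level_i}
    return merged, dup
```

Filtering reduces the number of entries that must be read, transferred, and rewritten, which is where the write-amplification saving comes from.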
In a compaction procedure, triple-level filtering compaction involves SSTables from three levels in order to remove duplicate data. The triggering of the triple-level filtering mechanism, however, requires certain conditions to be met. When performing a compaction involving SSTables in levels 𝐿𝑖 and 𝐿𝑖+1, ProckStore examines the first_score value of level 𝐿𝑖+1. If the value is greater than 1, the triple-level filtering compaction is triggered, additionally involving the SSTables from level 𝐿𝑖+2 that overlap with those from level 𝐿𝑖+1. This mechanism helps reduce duplicate writes and alleviates write amplification.

As triple-level filtering compaction contains overlapping-key-range SSTables of levels 𝐿𝑖, 𝐿𝑖+1, and 𝐿𝑖+2, it raises the problem of excessive compaction data. When performing the three-level compaction, some key ranges can exist in all of levels 𝐿𝑖, 𝐿𝑖+1, and 𝐿𝑖+2. These key ranges can be deleted and filtered out at the intermediate level (i.e., level 𝐿𝑖+1), which affects neither the update of new keys from level 𝐿𝑖 to 𝐿𝑖+2 nor the merging of old keys in level 𝐿𝑖+2. At the stage of generating compaction tasks, we mark the duplicate key ranges when picking the overlapping SSTables from the three levels and filter these duplicate key ranges out of level 𝐿𝑖+1 in advance. Then, the newest keys in level 𝐿𝑖 and the oldest keys in level 𝐿𝑖+2 are retained. This approach reduces the data volume involved in compaction by removing redundancy across the three levels, thereby alleviating the issue of excessive data in compaction operations.

As shown in Fig. 7, when level 𝐿1 performs compaction, the score of level 𝐿2 is greater than 1, satisfying the condition of triple-level filter compaction. Comparing the key ranges of levels 𝐿𝑖, 𝐿𝑖+1, and 𝐿𝑖+2, ProckStore filters out and deletes the keys that exist across all three levels. In Fig. 7, the keys 7, 8, 9, 10, 11, and 13 are filtered out from level 𝐿2 before the compaction operation is performed. These keys are placed in the compaction queue, awaiting the asynchronous manager to initiate the compaction. According to Eq. (1), these keys are marked as 𝑆𝑛𝑜_𝑐𝑜𝑚𝑝, causing the first_score values of levels 𝐿1 and 𝐿2 to drop below 1 once these keys are subtracted. This reduces the excess data in the level.

The default compaction method in ProckStore merges the SSTables with overlapping key ranges in levels 𝐿𝑖 and 𝐿𝑖+1 and then writes new SSTables into level 𝐿𝑖+1. This process introduces write amplification. The SSTables newly written to level 𝐿𝑖+1 may immediately need to be combined with the SSTables with overlapping key ranges in level 𝐿𝑖+2 to form a new compaction task. These SSTables are merged and the new data are written to level 𝐿𝑖+2, resulting in additional write amplification for data previously written to level 𝐿𝑖+1. Consequently, this procedure incurs two instances of write amplification. The triple-level filtering compaction instead combines all the overlapping-key-range SSTables in the three levels into one compaction. The data in level 𝐿𝑖 is written directly to level 𝐿𝑖+2, which eliminates one compaction process and the write amplification from level 𝐿𝑖+1 to 𝐿𝑖+2.

3.5. Transmission in ProckStore

In ProckStore, data is transferred between the host and device sides, as shown in Fig. 8. During compaction, a large amount of data is read from the NDP device to the host, which involves transferring many KV pairs and results in significant data migration overhead. To address this issue, we employ key-value separation for compaction, which minimizes data migration between the host and the NDP device, reduces write amplification, and improves compaction performance. In the compaction process, only the keys of the KV pairs are read, sorted, and written back to the NDP device. The key size is less than 1 KB, while the value size exceeds 1 KB. During compaction on the host side, the NDP device transmits only the keys to the host, while the device keeps the values locally. Afterward, the host processes the keys, sends them back to the device, and the device integrates them into SSTables. This approach significantly reduces data migration between the CPU and memory on the host side and minimizes the overhead of data transfers between the device and the host.

The key-value separation mechanism is implemented during host-side compaction, with the entire KV pair stored on the NDP device. Only the key is processed during host-side compaction, reducing both device-to-host data transfers and host-side compaction operations. Based
on the compaction information, the device separates the keys from the values in the SSTables. The key array stores the address of each value, which is used for subsequent reorganization. The keys are then sent to the host for a sort-merge operation. Afterward, the compacted keys are sent back to the device, where they are reorganized into new SSTables based on the value addresses. Three threads are used, one for each step: (1) the separation thread on the device, (2) the merge-operation thread on the host, and (3) the key-value reorganization thread on the device. The host-side compaction task is divided into the following three steps:

▵ Step 1: The key-value separation thread on the NDP device retrieves the KV pairs based on the SSTable data format. The keys and values are stored in corresponding arrays in the NDP device's memory. In the key array, each key records the subscript of its corresponding value, so looking up a value in the array takes O(1) time. The value array remains in the NDP device's shared memory and waits for the sorted key array to be fetched back from the host. The key array is transferred to the host via the host-NDP interface.

▵ Step 2: The host fetches the key array, sorts the individual keys, and sends the sorted key array to the NDP device for restructuring. All of these steps are organized within a single thread.

▵ Step 3: Upon receiving the new key array, the NDP device finds each key's corresponding value based on its subscript. Simultaneously, the device reconstructs the new SSTables according to the order of the keys. To minimize data transfer time between the host and device, the data volume is reduced, and a separate transfer thread ensures that the communication between the host and device remains unaffected, minimizing transmission latency.

As shown in Fig. 8, only the keys (which are reconstructed on the host side) are passed between the host and the device. When the host performs compaction, a compaction request is sent to the device, which then provides the necessary data from the NDP device. SSTables 1 and 2 from level 𝐿0 and SSTables 3 and 4 from level 𝐿1 are separated. The duplicate keys and offset addresses are passed to the host, which executes the compaction procedure. After deduplication, the keys are re-transmitted to the NDP device, where they are reorganized into a new SSTable (SSTable 7) in level 𝐿1.

By reducing the transmission overhead on the host side, the device reduces the time cost of compaction tasks, aligning with the requirements of the NDP architecture. At the fourth priority level, the host handles most of the compaction tasks, which contain more SSTables, thereby relieving the device's computational load. However, the NDP device not only processes the values for the host but also handles the KV pairs in its own compaction tasks, which increases the device's processing pressure. To alleviate this, cross-level compaction is employed to reduce the computational strain on the NDP device.

When a compaction process is triggered in level 𝐿𝑖 and the first_score of level 𝐿𝑖+1 exceeds one, cross-level compaction is initiated. This process searches for SSTables with overlapping key ranges in the subsequent level 𝐿𝑖+2. Unlike traditional compaction, cross-level compaction in ProckStore continuously searches for overlapping key-range SSTables in level 𝐿𝑖+2. Subsequently, the SSTables from levels 𝐿𝑖, 𝐿𝑖+1, and 𝐿𝑖+2 undergo compaction, and the newly generated SSTables are written to level 𝐿𝑖+2.

The trigger selection in level 𝐿𝑖 follows the priority rules, while the selection of SSTables in levels 𝐿𝑖+1 and 𝐿𝑖+2 is based on their second_score (see Eq. (2)). SSTables that traditional compaction would write to level 𝐿𝑖+1 may be written to level 𝐿𝑖+2 through cross-level compaction. This cross-level approach helps balance the SSTable distribution across levels, reducing the number of compaction operations. However, it introduces a drawback: a compaction involving many SSTables takes longer to execute. For the NDP device, the data transmission time can be ignored, thereby reducing the overall compaction time. As illustrated in Fig. 8, SSTables 1 and 2 in level 𝐿0, SSTables 3 and 4 in level 𝐿1, and SSTables 8 and 9 in level 𝐿2 are involved in compaction on the NDP device, and the new data is written into SSTable 13 in level 𝐿2. With an asynchronous mechanism, priority scheduling, and optimized data transmission under the NDP-empowered KV store, ProckStore efficiently optimizes the compaction process.

4. Experimental settings

Platform. We implemented ProckStore based on RocksDB and conducted experiments to assess its performance. To evaluate ProckStore, we constructed a test platform simulating the NDP architecture, where data transfer between the host and the NDP device occurs over Ethernet. Although this platform was used for validation, ProckStore is scalable to real NDP platforms. SocketRocksDB, a version of RocksDB deployed on the NDP collaborative framework, was used as the baseline. TStore, PStore, and ProckStore all share the NDP-empowered storage framework. The experimental platform comprises two subsystems: a host-side subsystem and a device-side NDP subsystem. The host is equipped with an Intel(R) Core(TM) i3-10100 CPU (8 cores) and 16 GB of DRAM, while the NDP device runs on an ARM-based development board with four Cortex-A76 cores, four Cortex-A55 cores, 16 GB of DRAM, and a 256-GB Western Digital SSD. A network cable with a bandwidth of 1000 Mbps connects the host to the NDP device.

The host system runs Ubuntu 16.04, and RocksDB version 6.10.2 is employed. The NDP platform uses a lightweight embedded operating system. Data transfer between the host and the NDP device is handled through the SOCKET interface, replacing the standard POSIX interface to ensure efficient data transmission. In RocksDB, the buffer and SSTable sizes are set to 4 MB, the block size is 4 KB, and the level settings remain at their default values. The number of sub-tasks on the host is limited to 4, and all other configuration parameters in RocksDB are set to their default values.

Workloads. We evaluate the performance of ProckStore under realistic workloads. The details of the DB_Bench and YCSB-C workloads used in the experiments are presented in Table 1; the Type column lists the different workloads. The DB_Bench workloads are used to assess random-write performance: db_bench_1 is configured in random-write mode with a fixed value size of 1 KB and varying data sizes (10 GB, 20 GB, 30 GB, and 40 GB), and db_bench_2 is configured in random-write mode with multiple value sizes (1 KB, 4 KB, 16 KB, and 64 KB) and two data volumes (10 GB and 40 GB). We also employ YCSB-C to measure ProckStore's performance under mixed read-write workloads.

5. Performance evaluation

We conduct experiments to evaluate the performance of ProckStore in terms of throughput, latency, and write amplification (WA).

5.1. Performance under DB_Bench with various data volumes

In this section, we evaluate the performance of ProckStore using DB_Bench with various data volumes and a 4-KB value. Fig. 9 illustrates the impact of data volume on performance, focusing on throughput, WA, CPU utilization, and bandwidth. ProckStore delivers peak performance with 10-GB workloads, achieving up to 48% higher throughput compared to PStore, and an average improvement of 40%. Under 40-GB workloads, the WA of TStore and SocketRocksDB reaches its maximum, while ProckStore's WA remains constant at 1.4 across all cases. Under 30-GB workloads, ProckStore's throughput decreases by an average of 67% and 61% compared to TStore and SocketRocksDB, respectively. This performance decrement is attributed to the frequency of compaction operations, which consume bandwidth and degrade overall performance. PStore exhibits lower CPU utilization than ProckStore across all workloads; the multi-threaded approach in ProckStore better utilizes the computing resources. In contrast, SocketRocksDB prioritizes data storage over compaction, leading to lower CPU utilization than PStore and ProckStore.
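The WA metric used throughout this evaluation follows the usual LSM-tree definition; a minimal sketch (this is the standard accounting, not ProckStore's exact instrumentation):

```python
def write_amplification(user_bytes, storage_bytes):
    """WA = total bytes written to storage (memtable flushes plus every byte
    rewritten by compaction) divided by the bytes the application wrote."""
    return storage_bytes / user_bytes
```

For example, writing 10 GB of KV pairs while flushes and compaction write 14 GB in total to storage yields WA = 14 / 10 = 1.4, the constant value reported for ProckStore above.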
Table 1
Workload characteristics used in the experiment.

Workloads in DB_Bench:
    Type        Feature       Fillrandom   Value Size (× 1 KB)   Data Size (× 10 GB)
    db_bench_1  100% writes   ✓            4×                    1×, 2×, 3×, 4×
    db_bench_2  100% writes   ✓            1×, 4×, 16×, 64×      1×, 4×

Workloads in YCSB-C (Load data size: 1×, 2× of 10 GB; Run data size: 1×, 2× of 10 GB; record size: 1× of 1 KB):
    Type   Feature                              Distribution
    A      50% Reads, 50% Updates               Zipfian
    B      95% Reads, 5% Updates                Zipfian
    C      100% Reads                           Zipfian
    D      95% Reads, 5% Inserts                Latest
    E      95% Range Queries, 5% Inserts        Uniform
    F      50% Reads, 50% Read-Modify-Writes    Zipfian
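The operation counts implied by the DB_Bench configurations in Table 1 follow from the data size and value size; a sketch (key bytes and metadata overhead are ignored for simplicity):

```python
def num_operations(data_size_gb, value_size_kb):
    """KV pairs needed to write data_size_gb of values of value_size_kb each."""
    return (data_size_gb * 1024 * 1024) // value_size_kb

# db_bench_1: 4-KB values over 10-40 GB of data
db_bench_1_ops = [num_operations(g, 4) for g in (10, 20, 30, 40)]
# db_bench_2: 1- to 64-KB values over a 10-GB data volume
db_bench_2_ops = {v: num_operations(10, v) for v in (1, 4, 16, 64)}
```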
Fig. 9. The results of TStore, PStore, SocketRocksDB, and ProckStore under Fillrandom DB_Bench with various data volumes.
5.1.1. Write amplification (WA)
A large WA indicates significant duplication of write operations, which degrades system performance. In SocketRocksDB, WA is primarily caused by the write-ahead log and by compaction on the host. WA increases with the amount of data, as the number of compaction operations is proportional to the data size. As shown in Fig. 9(b), under a 10-GB workload, the WA of TStore and PStore is reduced by 39% and 62%, respectively, compared to SocketRocksDB, which performs all compaction tasks on the host. By offloading a portion of the compaction tasks to the NDP device, TStore and PStore reduce WA. Notably, ProckStore exhibits a 55% reduction in WA. A similar trend is observed under 20- and 30-GB workloads. For a 40-GB workload, the WA of ProckStore is reduced by 36.4% and 72.0% compared to TStore and SocketRocksDB, respectively.

5.1.2. Throughput
In Fig. 9(c), the operation time of ProckStore ranges from 828.35 micros/op to 1819.18 micros/op. The operation time of ProckStore is lower than that of SocketRocksDB because ProckStore takes less time to execute write and read operations. ProckStore reduces operation time by 72.8% compared to TStore under a 20-GB workload. Meanwhile, under a 40-GB workload, ProckStore reduces the operation time by 24.0% and 61.5% compared to PStore and SocketRocksDB, respectively. In Fig. 9(a), with a 40-GB dataset, the throughput of ProckStore is 4.15× and 1.47× higher than that of TStore and PStore, respectively, while ProckStore achieves 2.75× the throughput of SocketRocksDB. Under a 10-GB workload, ProckStore achieves 2.81× the write throughput of SocketRocksDB through the multi-threaded asynchronous approach. In addition, with a 10-GB dataset, the throughput of ProckStore is 45.2% higher than that of PStore.

Other KV stores (excluding SocketRocksDB) leverage collaborative strategies between the host and the NDP device to accelerate compaction,
thereby enhancing throughput. ProckStore optimizes resource allocation with a multi-threaded asynchronous approach, which improves performance. Its throughput exceeds 4.57 MB/s, achieving a 48% improvement over PStore.

5.1.3. CPU utilization
CPU utilization refers to the proportion of CPU resources consumed by the KV stores under different workloads. TStore utilizes a single-threaded approach on both the host and the device, leading to low CPU utilization (see Figs. 9(d) and 9(e)). As a result, TStore's CPU utilization is lower than that of SocketRocksDB. Despite leveraging multi-threaded concurrency, SocketRocksDB faces a transmission bottleneck between the host and the device. During task processing, the host quickly performs merge operations; however, there is significant latency during read and write operations. By offloading a portion of the tasks to the NDP device, ProckStore reduces CPU idle time and improves CPU utilization on the host. Compared to SocketRocksDB, ProckStore achieves improvements of 97% and 89% in CPU utilization under 10-GB and 40-GB workloads, respectively. ProckStore demonstrates the highest host-side CPU utilization, peaking at 7.03% under a 10-GB workload.

ProckStore's multi-threaded method on the host further enhances CPU utilization. As shown in Fig. 9(e), PStore employs a single-threaded, asynchronous method, offering greater flexibility than traditional scheduling models. Furthermore, reduced compaction time increases the device-side CPU utilization of PStore by over 20.49%, a 73% improvement compared to TStore under a 10-GB workload. In ProckStore, device-side CPU utilization is further enhanced through cross-level compaction: this metric increases by 27%, 33%, 40%, and 35% compared to PStore under 10-, 20-, 30-, and 40-GB workloads, respectively.

5.1.4. Compaction bandwidth
The compaction bandwidth reveals the compaction performance of a KV store. In this paper, the term compaction bandwidth refers to the host-side compaction bandwidth, as ProckStore primarily focuses on optimizing host-side performance. For instance, the four-level priority scheduling in Section 3.3 prioritizes four steps (triggering, task generation, task allocation, and task execution on the host) to perform asynchronous compaction efficiently, and the triple-level filter compaction in Section 3.4 combines two compaction procedures into one, thereby improving host-side compaction performance. We therefore define compaction bandwidth as the ratio of the amount of compacted data to the compaction time on the host side. SocketRocksDB performs all compaction tasks on the host, while the other KV stores provide compaction bandwidth on both the host and the NDP device. In Fig. 9(f), the single-threaded TStore fails to fully leverage the multi-core computational capabilities of the host, resulting in an average bandwidth of 2.35 MB/s.

In contrast, SocketRocksDB uses the multi-threaded method to raise its bandwidth to 2.86 MB/s, which outperforms the other baseline KV stores. This is because the host handles all the tasks and therefore processes a large total amount of data, and the collaborative solution improves processing efficiency on the host. Under 40-GB workloads, ProckStore's bandwidth improves by 3.56× and 1.51× over SocketRocksDB and PStore, respectively.

5.2. Performance under DB_Bench with various value sizes

We configured the workloads with various value sizes and two data volumes (10 GB and 40 GB). A large value size increases the compaction overhead while improving the throughput under workloads with a fixed data volume. ProckStore maintains optimal performance under workloads with different value sizes and both data volumes (see Figs. 10 and 11). ProckStore's throughput increases on average by 63.1% and 77.7% compared to PStore and SocketRocksDB, respectively, in the case of a 1-KB value (see Fig. 10(a)). The performance increases at 64 KB because large-value workloads trigger more frequent compaction and shorter running times. ProckStore has the best performance in terms of bandwidth, with an average improvement of 1.67× compared to PStore and 2.32× compared to SocketRocksDB across all value sizes. Meanwhile, ProckStore achieves the highest host- and device-side CPU utilization. In Fig. 11(e), the device-side CPU utilization of PStore and ProckStore is similar due to task stacking on the device under large data volumes.

5.2.1. Write amplification (WA)
With increasing value sizes, the amount of data on the host grows, exacerbating WA in TStore and SocketRocksDB. In Fig. 11(b), the WA of TStore and SocketRocksDB is the lowest (2.18 and 5.2) with a 16-KB value. Under 1-KB value workloads, WA increases to 2.39 and 6.1, respectively. ProckStore's WA is unaffected by host-side compaction. Under 1-KB and 64-KB workloads, ProckStore reduces WA by 76.2% and 75.1%, respectively, compared to SocketRocksDB, and the reduction increases to 76.4% and 77.6% under 10-GB workloads. This improvement is due to ProckStore's triple-level filter compaction on the host, which reduces the number of compaction operations and the volume of compacted data.

5.2.2. Throughput
In Figs. 10(a) and 11(a), ProckStore's average throughput ranges from 3.8 MB/s to 5.1 MB/s and from 2.7 MB/s to 4.0 MB/s under 10-GB and 40-GB workloads, respectively. It is worth noting that ProckStore's throughput increases compared with PStore, indicating lower response times to foreground requests. Compared with SocketRocksDB, ProckStore improves throughput by 2.04× and 2.1× under 40-GB workloads with 1-KB and 16-KB values, respectively. Compared with PStore, ProckStore improves throughput by 1.51× and 1.58× (see Fig. 11(a)). In particular, compared with TStore, ProckStore achieves 4.1× and 2.68× improvements under workloads with 4-KB and 64-KB values, respectively.

5.2.3. CPU utilization
Large values increase compaction overhead and host-side CPU utilization, which peaks under workloads with a 64-KB value. ProckStore's host-side and device-side CPU utilization reach 10.83% and 29.11%, respectively (see Figs. 10(e) and 11(d)), while SocketRocksDB's values are 8.34% and 18.28%. Additionally, ProckStore's CPU utilization on the two sides is 8.27% and 25.39% under 40-GB workloads with 1-KB values. On average, ProckStore's CPU utilization is 3.35× and 4.1× higher than that of TStore and SocketRocksDB, respectively, and it outperforms PStore in both host- and device-side CPU utilization under all workloads.

5.2.4. Compaction bandwidth
In Figs. 10(f) and 11(f), the compaction bandwidth of the KV stores varies. TStore's device-side bandwidth peaks at 3.14 MB/s, while ProckStore shows an average improvement of 4.29× and 1.61× over TStore and PStore, respectively. SocketRocksDB leverages multi-threaded parallelism to enhance computation and reduce processing time, leading to superior bandwidth performance under 40-GB workloads across all value sizes. However, PStore achieves higher bandwidth than SocketRocksDB under 10-GB workloads. ProckStore outperforms all other stores in terms of bandwidth across all workloads, achieving a 3.54× improvement over SocketRocksDB under workloads with a 64-KB value.

5.3. Performance under YCSB-C

YCSB-C provides realistic workloads, which we use to evaluate the compaction performance of TStore, PStore, SocketRocksDB, and ProckStore. We configure this workload with two data volumes, 10 GB and 20 GB, in the Load and Run phases. We define the configuration with a 10-GB Load and a 10-GB Run as small data volumes, and a 20-GB Load and a 20-GB Run as large data volumes. We use six types of workloads in the experiment.
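The compaction bandwidth figures quoted in Sections 5.1.4 and 5.2.4 are the amount of compacted data divided by the host-side compaction time; as a sketch with illustrative numbers (not measured data):

```python
def compaction_bandwidth_mb_s(compacted_bytes, compaction_seconds):
    """Host-side compaction bandwidth in MB/s: compacted data volume
    divided by the time spent compacting it (Section 5.1.4)."""
    return compacted_bytes / (1024 * 1024) / compaction_seconds
```

For example, 235 MB compacted in 100 s gives 2.35 MB/s, matching the order of magnitude of the averages reported for TStore.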
Fig. 10. The results of TStore, PStore, SocketRocksDB, and ProckStore under Fillrandom DB_Bench with 10-GB data volume and various value sizes.
Fig. 11. The results of TStore, PStore, SocketRocksDB, and ProckStore under Fillrandom DB_Bench with 40-GB data volume and various value sizes.
Fig. 12. The results of TStore, PStore, SocketRocksDB, and ProckStore under YCSB-C with load 10 GB and run 10 GB data volume.
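The six YCSB-C mixes of Table 1 can be captured as configuration data; a sketch (the dictionary keys are illustrative, not YCSB's property names):

```python
YCSB_WORKLOADS = {
    "A": {"read": 0.50, "update": 0.50, "dist": "zipfian"},
    "B": {"read": 0.95, "update": 0.05, "dist": "zipfian"},
    "C": {"read": 1.00, "dist": "zipfian"},
    "D": {"read": 0.95, "insert": 0.05, "dist": "latest"},
    "E": {"scan": 0.95, "insert": 0.05, "dist": "uniform"},
    "F": {"read": 0.50, "read_modify_write": 0.50, "dist": "zipfian"},
}

def write_fraction(mix):
    """Fraction of operations that write (updates, inserts, and RMWs)."""
    return sum(v for k, v in mix.items()
               if k in ("update", "insert", "read_modify_write"))
```

Workloads A and F are the most write-intensive mixes, which is why the Load/Run discussion below focuses on them.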
5.3.1. Case 1: Load 10 GB and Run 10 GB
Load. In YCSB-C, the Load workload is write-intensive, resulting in frequent compaction. ProckStore optimizes compaction under various workloads (Fig. 12). Its throughput outperforms that of SocketRocksDB by a factor of 2.3×. TStore benefits from time-aware dynamic task scheduling, which narrows its performance gap with ProckStore. PStore's asynchronous compaction improves performance, and ProckStore's multi-threaded execution further enhances the asynchronous compaction strategy. Consequently, ProckStore's throughput is 4.24× and 1.80× higher than that of TStore and PStore, respectively. During the Run phase, ProckStore's collaborative mode improves performance under write-intensive workloads. Workloads A and F exhibit the highest write ratios, at 29.0%. Under workload A, ProckStore's throughput is 28.2% and 29.0% higher than that of TStore and PStore, respectively. Under workload F, ProckStore's throughput surpasses that of TStore and PStore by 36.4% and 71.3%, respectively. However, when the write percentage is low, ProckStore's throughput shows minimal variation compared to the other KV stores. Additionally, under read-intensive workloads, ProckStore achieves maximum throughput improvements of 60.7%, 59.8%, 122.4%, and 9.2% under workloads B, C, D, and E, respectively. In contrast, TStore's read performance suffers due to an excessive number of SSTables, which increases the query operation overhead.

Throughput and Latency. Throughput and latency are critical metrics for KV stores. As KV stores are widely deployed in real-world applications, these metrics significantly affect response time. ProckStore maintains its performance advantage in the Load phase when the data size increases from 10 GB to 20 GB. Under a workload with the same amount of data, its throughput is 4.24× that of TStore (see Fig. 12(a)), and compared with SocketRocksDB and PStore, ProckStore's throughput improves by 2.33× and 1.8×, respectively. The advantage of ProckStore becomes even more pronounced under workloads A and F, which involve a higher percentage of writes. The latency results further demonstrate the flexibility of ProckStore's scheduling method. Under workloads D and F, ProckStore has 55.1% and 42.5% lower latency than PStore (see Fig. 12(c)). Compared with TStore and PStore, ProckStore has 20.8% and 22.1% lower latency under workload A, respectively (see Fig. 12(c)). Under the read-intensive workload C, ProckStore's average latency is 13.4% and 37.5% lower than that of TStore and SocketRocksDB, respectively. ProckStore exhibits similar trends under workloads B and D. Under workload E, the average latency of ProckStore is 5.72% and 8.45% lower than that of TStore and PStore, respectively. Moreover, the throughput of ProckStore is never lower than that of the other KV stores.

Write Amplification (WA). ProckStore achieves lower WA than both TStore and SocketRocksDB (see Fig. 12(b)): WA is reduced by approximately 1.2 compared to SocketRocksDB, and ProckStore's host-side multi-threaded method further decreases WA by an average of 62.3% compared to TStore. The minimum WA of ProckStore is 1.20, under workload C. WA in ProckStore is influenced by the write-ahead log and host-side compaction; because its compaction frequency is higher, its WA is greater than that of PStore. Under workloads C and D, WA is 1.20 and 1.36, respectively. However, ProckStore's triple-level filter compaction mechanism mitigates WA compared to SocketRocksDB.

CPU Utilization and Compaction Bandwidth. Figs. 12(d) and 12(e) show the CPU utilization of the KV stores. Notably, TStore runs on a single thread, and the network bandwidth limits the data transfer between the host and the device. Overall, the CPU utilization patterns of SocketRocksDB and ProckStore are similar on both sides, which can be attributed to the reduction in total processing time accompanied by a reduction in compaction time. Fig. 12(f) shows the compaction bandwidth in the Load and Run phases. ProckStore achieves the highest bandwidth on the host: its bandwidth is 8.64× and 3.32× higher than that of TStore and SocketRocksDB, respectively, under workload A, and the improvement reaches 7.97× and 2.99× under workload F. ProckStore improves the bandwidth by exploiting multi-threaded parallelism. The average bandwidth of ProckStore is 56.8% higher than that of PStore due to its efficient task scheduling, which leverages the computational capabilities of both the host and the device.

5.3.2. Case 2: Load 20 GB and Run 20 GB
In the Load phase, the throughput of ProckStore surpasses that of SocketRocksDB and TStore by 3.44× and 3.73×, respectively (see Fig. 13(a)). Although the asynchronous approach of PStore enhances performance, the multi-threaded method of ProckStore integrates with the asynchronous compaction mechanism. Consequently, the throughput of
12
H. Sun et al. Journal of Systems Architecture 160 (2025) 103342
Fig. 13. The results of TStore, PStore, SocketRocksDB, and ProckStore under YCSB-C with load 20 GB and run 20-GB data volume.
ProckStore reaches 1.59× that of PStore. In the Run phase, the multi-threaded asynchronous mode improves the performance of ProckStore under write-intensive workloads A and F, where half of the operations are writes. Specifically, under workload A, ProckStore's throughput exceeds that of TStore and PStore by 21.0% and 23.1%, respectively. Similarly, under workload F, ProckStore achieves 19.1% and 21.3% higher throughput than TStore and PStore, respectively. Workloads A and F involve large data volumes. Under these workloads, when the data volume increased from 10 GB to 20 GB, the throughput of ProckStore decreased by 32.2% and 42.7%, respectively. In addition, the throughput of ProckStore is optimized under read-intensive workloads. In ProckStore, we focus on optimizing compaction, so the read performance improvement is small. For read-intensive workloads, ProckStore achieves 20.4%, 16.3%, and 28.5% improvement under B, C, and D, compared with PStore, respectively. For workloads with small data volumes, ProckStore decreases by 27.8%, 8.4%, and 40.6% under workloads B, C, and D, respectively.

Throughput and Latency. With a data size of 20 GB, ProckStore maintains its performance advantage in the Load phase. Compared with SocketRocksDB and PStore, ProckStore's throughput is improved by 3.44× and 1.58×, respectively, in the Load phase. Both the average latency and the throughput of ProckStore show the highest performance in Figs. 13(a) and 13(c). Under read-intensive workloads such as B and C, ProckStore outperforms SocketRocksDB by about 9.1% and 21.4%, respectively. This improvement is attributed to the triple-filtering compaction, which reduces execution time in the Run phase, thereby increasing throughput.

As shown in Fig. 13(c), under write-intensive workload A, the average latency of ProckStore is 17.6% and 18.9% lower than that of TStore and PStore, respectively. ProckStore shows a similar trend under workload F. In addition, ProckStore's latency is also reduced by 16.9%, 14.1%, and 22.2% under read-intensive workloads B, C, and D, compared with SocketRocksDB, respectively. However, compared with the 10 GB data volume, the latency increases due to the additional compaction operations and the associated lookup costs.

CPU Utilization and Compaction Bandwidth. When the data volume increases from 10 GB to 20 GB, Figs. 13(d) and 13(e) illustrate the changes in CPU utilization for ProckStore under various workloads. ProckStore increases host- and device-side CPU utilization by 18.1% and 32.6%, respectively, compared with PStore under workload C. Under mixed read-write workloads, such as A and F, ProckStore increases host-side CPU utilization by 6.7% and 20.0% and device-side CPU utilization by 12.2% and 13.2%, respectively. Fig. 13(f) shows the compaction bandwidth of the Run phase. ProckStore achieves the highest bandwidth under all workloads. In workload C, ProckStore's bandwidth is 45.4% and 35.1% higher than that of PStore and SocketRocksDB, respectively. This improvement is attributed to ProckStore's utilization of multi-threaded parallelism. However, with large data volumes, ProckStore's bandwidth decreases by 17.7% compared to the 10-GB data volume. Under workload D, ProckStore's average bandwidth is 6.47× and 1.38× higher than that of TStore and SocketRocksDB, respectively. In comparison to the 10-GB data volume, the CPU utilization decreases by 38.1%, 32.3%, 17.7%, 33.7%, 30.9%, and 33.1% under workloads A, B, C, D, E, and F, respectively.

5.3.3. Tail latency

We analyzed the tail latency of ProckStore, including P90, P99, and P999 latencies, and compared it with TStore, SocketRocksDB, and PStore under workloads of different data volumes (10 GB, 20 GB) and a 1-KB value size. The experimental results are shown in Figs. 14 and 15.

The results demonstrate that ProckStore outperforms the other key-value stores, exhibiting lower tail latency. SocketRocksDB's P90 and P99 tail latencies are notably lower than those of TStore and PStore, owing to the multi-version management mechanism in RocksDB. ProckStore's P90 and P99 tail latencies are lower still, thanks to its asynchronous allocation method, which reduces tail latency. Under a 10 GB workload, the most significant reduction in latency occurs when ProckStore lowers P90 latency by 94.07% and 93.89% compared to TStore and PStore under workload E. This improvement is attributed to ProckStore's superior range query performance, while TStore and PStore are not optimized for range queries. Similarly, ProckStore achieves the lowest P99 latency. Fig. 14(b) shows that ProckStore achieves the most significant P99 latency reduction under workload E, lowering P99 by 79.4% and 79.2% compared to TStore and PStore, respectively. It also shows substantial improvements under workload B,
with ProckStore reducing P99 by 75.6% and 76.2% compared to TStore and PStore.

Fig. 14. The tail latency of ProckStore under YCSB-C with load 10 GB and run 10 GB data volume.

Fig. 15. The tail latency of ProckStore under YCSB-C with load 20 GB and run 20 GB data volume.

In Fig. 15, the differences in tail latency become more pronounced under a 20 GB workload. ProckStore reduces P90 latency by 9.32% and 31.06% under workloads A and F, respectively, compared to PStore. Under workload E, ProckStore reduces P90 latency by 93.46% and 93.68% compared to PStore and TStore, respectively. ProckStore's four-level priority scheduling mechanism prevents low-priority requests from blocking high-priority writes, reducing the extreme write latency often caused by flush or compaction blocking in TStore and SocketRocksDB. Similarly, ProckStore reduces P99 tail latency under workloads A and F by 28.9% and 17.9%, respectively, compared to SocketRocksDB and TStore. Under workload C, ProckStore reduces P99 tail latency by 54.9% and 9.45% compared to PStore and SocketRocksDB, respectively. Under workload D, ProckStore reduces P99 tail latency by 23.5% and 6.0% compared to the same alternative KV stores. ProckStore performs best under workload E, reducing P99 tail latency by 79.22%, 78.69%, and 6.23% compared to TStore, PStore, and SocketRocksDB, respectively.

The FIFO scheduling used by traditional KV stores like SocketRocksDB can cause high-priority requests to be blocked, leading to increased tail latency. In contrast, ProckStore's multi-level queue scheduling mechanism enables compaction tasks to be executed in priority order, with high-priority compaction tasks executed first, thereby reducing tail latency.

6. Extended experiment

In this section, we study the impact of multi-threading and the number of subtasks on ProckStore's performance. The results demonstrate the effectiveness of ProckStore under multi-threaded execution and verify its performance under various numbers of subtasks. The environment of the extended experiment is the same as the experimental configuration in Section 4.

Fig. 16. Write performance of ProckStore under DB_Bench with different numbers of subtasks.

6.1. Impact of number of subtasks

To validate the fourth-level prioritization, we conducted experiments to evaluate the impact of various subtasks on the write performance of ProckStore. The extended experiments replicate the configuration from Section 4. We configured DB_Bench with a 10 GB dataset and a 1-KB value. Specifically, we examine the impact of the number of subtasks on the fourth-level prioritization in ProckStore by configuring four types of subtasks on the host. The experimental results are shown in Fig. 16.

As shown in Fig. 16(a), the throughput of ProckStore increases significantly with the number of subtasks. The throughput is 2.33 MB/s, 2.81 MB/s, and 3.13 MB/s for one, two, and three subtasks, respectively, and subsequently stabilizes. With four subtasks, ProckStore achieves a peak throughput of 3.8 MB/s. The average latency shows a similar trend, where ProckStore achieves the lowest latency (0.22 ms) with four subtasks, showing a 17.1% improvement from three to four subtasks. The host-side CPU utilization also reflects ProckStore's performance with different numbers of subtasks, as multi-core CPUs enable parallel execution of multiple threads.

As shown in Fig. 16(c), CPU utilization increases with the number of subtasks, allowing the CPU to utilize its computational resources fully. CPU utilization is 4.78% (the lowest) with one subtask, improving by 10.3% with two subtasks. The highest CPU utilization (6.84%) occurs with four subtasks. However, as the number of subtasks increases, the performance improvements in CPU utilization, throughput, and average latency become less pronounced. This is because, while parallel execution of multiple threads reduces compaction execution time, the overhead from thread creation and synchronization increases. As the number of threads grows, this additional CPU overhead impacts ProckStore's CPU utilization.

6.2. Impact of number of threads

In Section 2.2, we studied the performance of PStore with different numbers of threads. For the multi-threaded comparison experiment of ProckStore, we extended the analysis by comparing its throughput with that of PStore under varying thread counts. The experimental results are shown in Fig. 17.

Fig. 17(a) shows the throughput of ProckStore and PStore under workloads with 4 KB values and a 10 GB data volume. As the number of threads increases, the throughput of PStore does not increase proportionally, and its performance is poor during multi-threaded writes. Specifically, the throughput increases by only 1.58% when the number of threads rises from 8 to 12. In contrast, the throughput of ProckStore increases significantly with the number of threads. At 12 threads, the throughput reaches 7.86 MB/s, which is 10.9% higher than that at 8 threads. ProckStore's throughput is 144.1% higher than PStore's, as its multi-threaded execution efficiently processes the large data volume written by multiple threads, avoiding the computational limitations of single-thread execution in PStore.

As shown in Fig. 17(b), the average latency of PStore decreases with the increase in threads under a 10 GB data volume workload.
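The decomposition behind these experiments, splitting one compaction into range-partitioned subtasks that are merged in parallel worker threads, can be sketched in a few lines. The toy model below is illustrative only: the function names, in-memory data layout, and partitioning policy are our assumptions, not ProckStore's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def merge_runs(runs):
    """Merge sorted (key, value) runs; later runs win on duplicate keys."""
    out = {}
    for run in runs:
        for k, v in run:
            out[k] = v
    return sorted(out.items())

def compact(runs, num_subtasks=4):
    """Toy compaction: split the key space into disjoint ranges and merge
    each range in its own worker thread, mimicking how one compaction
    task is decomposed into parallel subtasks."""
    keys = sorted({k for run in runs for k, _ in run})
    if not keys:
        return []
    step = max(1, len(keys) // num_subtasks)
    bounds = [keys[i] for i in range(0, len(keys), step)][:num_subtasks]

    def subtask(i):
        lo = bounds[i]
        hi = bounds[i + 1] if i + 1 < len(bounds) else None
        sliced = [[(k, v) for k, v in run
                   if k >= lo and (hi is None or k < hi)] for run in runs]
        return merge_runs(sliced)

    with ThreadPoolExecutor(max_workers=num_subtasks) as pool:
        parts = list(pool.map(subtask, range(len(bounds))))
    # Ranges are disjoint and ordered, so concatenation is already sorted.
    return [kv for part in parts for kv in part]
```

Because the key ranges are disjoint, the per-range merges are independent; the pool creation and synchronization in `compact` is exactly the kind of overhead that makes the gains taper off as the subtask and thread counts grow.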
Fig. 17. Write performance of ProckStore under DB_Bench with different number of threads.
However, the decrease in latency is more significant when the number of threads is low, such as from 1 to 4 threads, where the latency drops by 27.8%. For ProckStore, as the number of threads increases, performance improves steadily. At 4 threads, the average latency of ProckStore is 0.154 ms, and at 12 threads, it is reduced to 0.104 ms, a 32.3% reduction. Fig. 17(c) shows that when the number of threads reaches 12, the host-side CPU utilization for PStore and ProckStore is highest, at 8.74% and 11.62%, respectively, with ProckStore showing a 41.92% increase over PStore. Additionally, the CPU utilization of the two systems increases by 7.28% and 10.7%, respectively, when the number of threads increases from 8 to 12. As the number of threads decreases, CPU utilization also drops. With 1 thread, the CPU utilization is at its lowest: 6.48% for PStore and 9.37% for ProckStore.

7. Related work

LSM-tree has become a popular data structure in key-value storage systems, offering an alternative to traditional structures by efficiently handling write-intensive workloads and large-scale datasets. Although KV stores manage data through compaction operations, these processes come at the cost of performance. Consequently, several studies have sought to mitigate the performance impact of compaction in KV stores.

LSM-tree structure. PebblesDB [13] introduces the FLSM data structure, which relaxes the restriction of non-overlapping key ranges within a level, thereby delaying the compaction process and reducing WA. WiscKey [14] separates keys and values to minimize WA during compaction but increases garbage collection overhead. To address this issue, HashKV [15] employs hash partitioning and a hot/cold partitioning strategy, while DiffKV [16] separates keys based on the size of key-value pairs to balance performance. FenceKV [17] enhances HashKV by incorporating a fence-value-based partitioning strategy and key-range-based garbage collection, optimizing range queries. FGKV [18] and Spooky [19] reduce WA by adjusting the data granularity in compaction. FGKV introduces a fine-grained compaction mechanism based on the LSpM-tree structure, minimizing redundant writes of irrelevant data. Spooky partitions the data at the largest level into equal-sized files and partitions the smaller levels according to file boundaries for fine-grained compaction.

For compaction strategies, TRIAD [20] improves LSM-tree performance by optimizing logs, memory, and storage. The works in [21,22] optimize the traditional top-level-driven compaction of LSM-trees by shifting to a lower-level-driven approach, decomposing large compaction tasks into smaller ones to reduce granularity. WipDB [23] utilizes a bucket-sort-like algorithm that minimizes merge operations by writing KV pairs in an approximately sorted list. Although these studies enhance compaction efficiency, they primarily focus on a single device and fail to address the competition for CPU and I/O resources between foreground requests and background tasks. In contrast, NDP devices expand computational resources to process tasks internally, reducing data transfer and resource contention.

Storage architecture. ListDB [24] employs a skip-list as the core data structure at all levels within non-volatile memory (NVM) or persistent memory (PM) to mitigate the WA problem by leveraging byte-addressable in-place merge ordering. This approach reduces the gap between DRAM and NVM write latency and addresses the write stall issue. HiKV [25] utilizes the benefits of hash and B+Tree indexes to design a KV store on hybrid DRAM-NVM storage systems, where hash indexes in NVM are used to enhance indexing performance. In a hybrid NVM-SSD system, WaLSM [26] tackles the WA problem through virtual partitioning, dividing the key space during compaction. Additionally, a reinforcement-learning method is applied to balance the merging strategy of different partitions under various workloads, optimizing read and write performance. TrieKV [27] integrates DRAM, PM, and disk into a unified storage system, utilizing a trie-structured index for all KV pairs in memory, enabling dynamic determination of KV pair locations across storage hierarchies and persistence requirements. Moreover, ROCKSMASH [28] utilizes local storage for frequently accessed data and metadata, while cloud storage is employed for less frequently accessed data.

Computing architecture. Heterogeneous computing [29] (e.g., GPUs, DPUs, and FPGAs) alleviates the computational burden on the CPU. Sun et al. [30] propose an accelerated solution for key-value stores by offloading the compaction task to an FPGA. Similarly, the FPGA-accelerated KV store [31] offloads the compaction task to the FPGA, minimizing competition for CPU resources and accelerating compaction while reducing CPU bottlenecks. LUDA [32] employs GPUs to process SSTables using a co-ordering mechanism that minimizes data movement, thereby reducing CPU pressure. gLSM [33] separates keys and values to minimize data transfer between the CPU and GPU, thereby accelerating compaction. dCompaction [34] leverages DPUs to accelerate the compaction and decompaction of SSTables, offloading compaction tasks to the DPU according to a hierarchical structure, relieving CPU overload. Despite these advances, heterogeneous computing still requires data transfer from host-side memory to the computing units, which can impact overall system performance.

Near-data processing (NDP), which offloads computational tasks from the CPU to the data location, is an emerging computing paradigm. Previous studies [35] investigated storage computing and proposed frameworks for storage- and memory-level processing. Biscuit [36] introduces a generalized framework for NDP. RFNS [37] examines the advantages of reconfigurable NDP-driven servers based on ARM and FPGA architectures for data- and compute-intensive applications. λ-IO [38] designs a unified computational storage stack to manage storage and computing resources through interfaces, runtime systems, and scheduling. HuFu [39] is an I/O scheduling architecture for computable SSDs that allows the system to manage background I/O tasks, offload computational tasks to SSDs, and exploit the parallelism and idle time of flash memory for improved task scheduling. Li et al. [40] address the resource contention problem between user I/O and NDP requests, using the critical path to maximize the parallelism of multiple requests, thereby improving the performance of hybrid NDP-user I/O workflows.
ABNDP [41] leverages a novel hardware-software collaborative optimization approach to solve the challenges of remote data access and computational load balancing without requiring trade-offs.

In addition, hosts and NDP devices employ distinct task scheduling policies to collaborate on compaction tasks [9,10,42]. nKV [43] defines data formats and layouts for computable storage devices and designs both hardware and software architectures to optimize data placement and computation. KV-CSD [44] builds NDP architectures using NVMe SSDs and system-on-chip designs to reduce data movement during queries by offloading tasks. Research such as OI-RAID [45] introduces an additional fault tolerance mechanism by adding an extra level on top of the RAID levels, enabling fast recovery and enhanced reliability. KVRAID [46] utilizes logical-to-physical key conversion to pack similar-sized KV pairs into a single physical object, thereby reducing WA, and applies off-site update techniques to mitigate I/O amplification. Distributed storage systems, such as EdgeKV [47], have also been explored. A sharding strategy is used to distribute data across multiple edge nodes, while consistent hashing ensures balanced data distribution and high availability. ER-KV [48] integrates a hybrid fault-tolerant design combining erasure coding and PBR, providing fault tolerance to ensure system reliability and high availability. Additionally, Song et al. [49] coupled each SSD with a dedicated NDP engine in an NDP server to fully leverage the data transfer bandwidth of SSD arrays. MStore [50] extends an NDP device to multiple devices, utilizing them to perform compaction tasks.

Although NDP devices can handle host-side computational tasks, their resources remain limited. Consequently, it is critical to optimize the use of these resources on the NDP device. The multi-threaded asynchronous method in ProckStore addresses this challenge by fully utilizing computation on both the host and device sides, avoiding resource wastage while ensuring sufficient computational capacity on the NDP device.

8. Conclusions

In this paper, we present ProckStore, an NDP-empowered KV store, to improve compaction performance for large-scale unstructured data storage. In ProckStore, the multi-threaded and asynchronous mechanism leverages computational resources within storage devices, reducing data movement and enhancing compaction efficiency. ProckStore optimally schedules compaction tasks across the host and NDP device by implementing a four-level priority scheduling mechanism. This separation of compaction stages provides parallel processing without interference, achieving efficient resource utilization. In addition, ProckStore uses key-value separation to reduce data transfer between the host and NDP device, minimizing transmission time. Experimental results show that ProckStore outperforms existing synchronous and single-threaded asynchronous NDP-empowered KV stores, achieving up to 4.2× higher throughput than the baseline KV store. ProckStore also reduces WA, compaction time, and CPU utilization.

CRediT authorship contribution statement

Hui Sun: Writing - review & editing, Writing - original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Chao Zhao: Writing - review & editing, Writing - original draft, Visualization, Validation, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Yinliang Yue: Validation, Supervision, Software. Xiao Qin: Supervision, Resources, Methodology, Formal analysis, Data curation.

Declaration of competing interest

The authors declare that there is no conflict of interest regarding the publication of this article.

References

[1] Z. Zhang, Y. Sheng, T. Zhou, et al., H2O: Heavy-hitter oracle for efficient generative inference of large language models, in: Advances in Neural Information Processing Systems, vol. 36, 2024.
[2] H. Lin, Z. Wang, S. Qi, et al., Building a high-performance graph storage on top of tree-structured key-value stores, Big Data Min. Anal. 7 (1) (2023) 156-170.
[3] S. Pei, J. Yang, Q. Yang, REGISTOR: A platform for unstructured data processing inside SSD storage, ACM Trans. Storage (TOS) 15 (1) (2019) 1-24.
[4] IDC, IDC innovators: Privacy-preserving computation, 2023, [EB/OL]. (2023-09-20). https://www.idc.com/getdoc.jsp?containerId=prCHC51469323.
[5] P. O'Neil, E. Cheng, D. Gawlick, E. O'Neil, The log-structured merge-tree (LSM-tree), Acta Inform. 33 (4) (1996) 351-385.
[6] Google, LevelDB, 2025, https://leveldb.org/.
[7] Facebook, RocksDB: a persistent key-value store for fast storage environments, 2016, http://rocksdb.org/.
[8] A. Acharya, M. Uysal, J. Saltz, Active disks: Programming model, algorithms and evaluation, Oper. Syst. Rev. 32 (5) (1998) 81-91.
[9] H. Sun, W. Liu, J. Huang, et al., Collaborative compaction optimization system using near-data processing for LSM-tree-based key-value stores, J. Parallel Distrib. Comput. 131 (2019) 29-43.
[10] H. Sun, W. Liu, Z. Qiao, et al., DStore: A holistic key-value store exploring near-data processing and on-demand scheduling for compaction optimization, IEEE Access 6 (2018) 61233-61253.
[11] H. Sun, et al., Asynchronous compaction acceleration scheme for near-data processing-enabled LSM-tree-based KV stores, ACM Trans. Embed. Comput. Syst. 23 (6) (2024) 1-33.
[12] I.K. Nti, et al., A mini-review of machine learning in big data analytics: Applications, challenges, and prospects, Big Data Min. Anal. 5 (2) (2022) 81-97.
[13] P. Raju, R. Kadekodi, V. Chidambaram, et al., PebblesDB: Building key-value stores using fragmented log-structured merge trees, in: Proceedings of the 26th Symposium on Operating Systems Principles, 2017, pp. 497-514.
[14] L. Lu, T.S. Pillai, H. Gopalakrishnan, et al., WiscKey: Separating keys from values in SSD-conscious storage, ACM Trans. Storage (TOS) 13 (1) (2017) 1-28.
[15] H.H.W. Chan, C.J.M. Liang, Y. Li, et al., HashKV: Enabling efficient updates in KV storage via hashing, in: 2018 USENIX Annual Technical Conference, USENIX ATC 18, 2018, pp. 1007-1019.
[16] Y. Li, Z. Liu, P.P.C. Lee, et al., Differentiated key-value storage management for balanced I/O performance, in: 2021 USENIX Annual Technical Conference, USENIX ATC 21, 2021, pp. 673-687.
[17] C. Tang, J. Wan, C. Xie, FenceKV: Enabling efficient range query for key-value separation, IEEE Trans. Parallel Distrib. Syst. 33 (12) (2022) 3375-3386.
[18] H. Sun, G. Chen, Y. Yue, et al., Improving LSM-tree based key-value stores with fine-grained compaction mechanism, IEEE Trans. Cloud Comput. (2023).
[19] N. Dayan, T. Weiss, S. Dashevsky, et al., Spooky: granulating LSM-tree compactions correctly, in: Proceedings of the VLDB Endowment, Vol. 15, (11), 2022, pp. 3071-3084.
[20] O. Balmau, D. Didona, R. Guerraoui, et al., TRIAD: Creating synergies between memory, disk and log in log structured key-value stores, in: 2017 USENIX Annual Technical Conference, USENIX ATC 17, 2017, pp. 363-375.
[21] Y. Chai, Y. Chai, X. Wang, et al., LDC: a lower-level driven compaction method to optimize SSD-oriented key-value stores, in: 2019 IEEE 35th International Conference on Data Engineering, ICDE, 2019, pp. 722-733.
[22] Y. Chai, Y. Chai, X. Wang, et al., Adaptive lower-level driven compaction to optimize LSM-tree key-value stores, IEEE Trans. Knowl. Data Eng. 34 (6) (2020) 2595-2609.
[23] X. Zhao, S. Jiang, X. Wu, WipDB: A write-in-place key-value store that mimics bucket sort, in: 2021 IEEE 37th International Conference on Data Engineering, ICDE, 2021, pp. 1404-1415.
[24] W. Kim, C. Park, D. Kim, et al., ListDB: Union of write-ahead logs and persistent SkipLists for incremental checkpointing on persistent memory, in: 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 161-177.
[25] F. Xia, D. Jiang, J. Xiong, et al., HiKV: a hybrid index key-value store for DRAM-NVM memory systems, in: 2017 USENIX Annual Technical Conference, USENIX ATC 17, 2017, pp. 349-362.
[26] L. Chen, R. Chen, C. Yang, et al., Workload-aware log-structured merge key-value store for NVM-SSD hybrid storage, in: 2023 IEEE 39th International Conference on Data Engineering, ICDE, 2023, pp. 2207-2219.
[27] H. Sun, et al., TrieKV: A high-performance key-value store design with memory as its first-class citizen, IEEE Trans. Parallel Distrib. Syst. (2024).
[28] P. Xu, N. Zhao, J. Wan, et al., Building a fast and efficient LSM-tree store by integrating local storage with cloud storage, ACM Trans. Archit. Code Optim. (TACO) 19 (3) (2022) 1-26.
[29] H. Zhou, Y. Chen, L. Cui, G. Wang, X. Liu, A GPU-accelerated compaction strategy for LSM-based key-value store system, in: The 38th International Conference on Massive Storage Systems and Technology, 2024, pp. 1-11.
[30] X. Sun, J. Yu, Z. Zhou, et al., FPGA-based compaction engine for accelerating LSM-tree key-value stores, in: 2020 IEEE 36th International Conference on Data Engineering, ICDE, 2020, pp. 1261-1272.
[31] T. Zhang, J. Wang, X. Cheng, et al., FPGA-accelerated compactions for LSM-based key-value store, in: 18th USENIX Conference on File and Storage Technologies, FAST 20, 2020, pp. 225-237.
[32] P. Xu, J. Wan, P. Huang, et al., LUDA: Boost LSM key value store compactions with GPUs, 2020, arXiv preprint arXiv:2004.03054.
[33] H. Sun, J. Xu, X. Jiang, et al., gLSM: Using GPGPU to accelerate compactions in LSM-tree-based key-value stores, ACM Trans. Storage (2023).
[34] C. Ding, J. Zhou, J. Wan, et al., DComp: Efficient offload of LSM-tree compaction with data processing units, in: Proceedings of the 52nd International Conference on Parallel Processing, 2023, pp. 233-243.
[35] E. Riedel, G. Gibson, C. Faloutsos, Active storage for large-scale data mining and multimedia applications, in: Proceedings of 24th Conference on Very Large Databases, 1998, pp. 62-73.
[36] B. Gu, A.S. Yoon, D.H. Bae, et al., Biscuit: A framework for near-data processing of big data workloads, ACM SIGARCH Comput. Archit. News 44 (3) (2016) 153-165.
[37] X. Song, T. Xie, S. Fischer, Two reconfigurable NDP servers: Understanding the impact of near-data processing on data center applications, ACM Trans. Storage (TOS) 17 (4) (2021) 1-27.
[38] Z. Yang, Y. Lu, X. Liao, et al., λ-IO: A unified IO stack for computational storage, in: 21st USENIX Conference on File and Storage Technologies, FAST 23, 2023, pp. 347-362.
[39] Y. Wang, Y. Zhou, F. Wu, et al., Holistic and opportunistic scheduling of background I/Os in flash-based SSDs, IEEE Trans. Comput. (2023).
[40] J. Li, X. Chen, D. Liu, et al., Horae: A hybrid I/O request scheduling technique for near-data processing-based SSD, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 41 (11) (2022) 3803-3813.
[41] B. Tian, Q. Chen, M. Gao, ABNDP: Co-optimizing data access and load balance in near-data processing, in: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol. 3, 2023, pp. 3-17.
[42] H. Sun, W. Liu, J. Huang, et al., Near-data processing-enabled and time-aware compaction optimization for LSM-tree-based key-value stores, in: Proceedings of the 48th International Conference on Parallel Processing, 2019, pp. 1-11.
[43] T. Vincon, A. Bernhardt, I. Petrov, et al., nKV: near-data processing with KV-stores on native computational storage, in: Proceedings of the 16th International Workshop on Data Management on New Hardware, 2020, pp. 1-11.
[44] I. Park, Q. Zheng, D. Manno, et al., KV-CSD: A hardware-accelerated key-value store for data-intensive applications, in: 2023 IEEE International Conference on Cluster Computing, CLUSTER, 2023, pp. 132-144.
[45] N. Wang, Y. Xu, Y. Li, et al., OI-RAID: a two-layer RAID architecture towards fast recovery and high reliability, in: 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN, 2016, pp. 61-72.
[46] M. Qin, A.L.N. Reddy, P.V. Gratz, et al., KVRAID: high performance, write efficient, update friendly erasure coding scheme for KV-SSDs, in: Proceedings of the 14th ACM International Conference on Systems and Storage, 2021, pp. 1-12.
[47] K. Sonbol, Ö. Özkasap, I. Al-Oqily, et al., EdgeKV: Decentralized, scalable, and consistent storage for the edge, J. Parallel Distrib. Comput. 144 (2020) 28-40.
[48] Y. Geng, J. Luo, G. Wang, et al., ER-KV: High performance hybrid fault-tolerant key-value store, in: 2021 IEEE 23rd International Conference on High Performance Computing & Communications; 7th International Conference on Data Science & Systems; 19th International Conference on Smart City; 7th International Conference on Dependability in Sensor, Cloud & Big Data Systems & Application, HPCC/DSS/SmartCity/DependSys, 2021, pp. 179-188.
[49] X. Song, T. Xie, S. Fischer, A near-data processing server architecture and its impact on data center applications, in: High Performance Computing: 34th International Conference, ISC High Performance 2019, Frankfurt/Main, Germany, June 16-20, 2019, Proceedings 34, Springer International Publishing, 2019, pp. 81-98.
[50] H. Sun, Q. Wang, Y.L. Yue, et al., A storage computing architecture with multiple NDP devices for accelerating compaction performance in LSM-tree based KV stores, J. Syst. Archit. 130 (2022) 102681.