Journal of Systems Architecture 160 (2025) 103342

Contents lists available at ScienceDirect

Journal of Systems Architecture

journal homepage: www.elsevier.com/locate/sysarc
ProckStore: An NDP-empowered key-value store with asynchronous and multi-threaded compaction scheme for optimized performance✩

Hui Sun a,∗, Chao Zhao a, Yinliang Yue b, Xiao Qin c

a Anhui University, Jiulong Road 111, Hefei, 230601, Anhui, China
b Zhongguancun Laboratory, Cuihu North Road 2, Beijing, 100094, China
c Auburn University, The Quad Center, Auburn, 36849, AL, USA
ARTICLE INFO

Keywords:
Near-data processing (NDP)
LSM-tree
Asynchronous multi-threaded compaction
Write amplification
Key-value separation

ABSTRACT

With the exponential growth of large-scale unstructured data, LSM-tree-based key-value (KV) stores have become increasingly prevalent in storage systems. However, KV stores face challenges during compaction, particularly when merging and reorganizing SSTables, which leads to high I/O bandwidth consumption and performance degradation due to frequent data migration. Near-data processing (NDP) techniques, which integrate computational units within storage devices, alleviate the data-movement bottleneck to the CPU. The NDP framework is therefore a promising solution to the compaction challenges in KV stores. In this paper, we propose ProckStore, an NDP-enhanced KV store that employs an asynchronous and multi-threaded compaction scheme. ProckStore incorporates a multi-threaded model with a four-level priority scheduling mechanism covering the compaction stages of triggering, selection, execution, and distribution, thereby minimizing task interference and optimizing scheduling efficiency. To reduce write amplification, ProckStore utilizes a triple-level filtering compaction strategy that minimizes unnecessary writes. Additionally, ProckStore adopts a key-value separation approach to reduce data transmission overhead during host-side compaction. Implemented as an extension of RocksDB on an NDP platform, ProckStore demonstrates significant performance improvements in practical applications. Experimental results indicate a 1.6× throughput increase over a single-threaded asynchronous model and a 4.2× improvement compared with synchronous schemes.
1. Introduction

The rapid development of large language models [1], graph databases [2], and social networks [3] has led to the generation of large amounts of real-time data, contributing to a global surge in large-scale data. This data is growing exponentially and is increasingly manifested in semi-structured and unstructured formats, in addition to traditional structured data. For example, semi-structured and unstructured data have grown rapidly in recent years according to IDC [4], and they now account for over 85% of total data volume. To cope with the large amount of unstructured data, LSM-tree-based key-value stores (KV stores) [5] have become widely adopted in large-scale storage systems.

LSM-tree structures are popularly used in modern database engines (e.g., LevelDB [6], RocksDB [7]). In the LSM-tree structure, key-value pairs are first written to a MemTable in memory and then persisted to disk as Sorted String Tables (SSTables) once a preset threshold is reached. On disk, the LSM-tree is organized hierarchically, with each level having a capacity threshold that increases at a fixed rate as the level number grows. When the amount of data in a level exceeds its threshold, some data migrates to lower levels, potentially causing overlapping key ranges between SSTables in different levels. To maintain data organization and prevent duplication, SSTables with overlapping key ranges must be loaded into memory and merged. The sorted and de-duplicated key-value pairs are then rewritten as new SSTables at a lower level. This process, known as compaction, involves frequent read and write operations that consume substantial I/O bandwidth between the host and storage devices, thereby delaying foreground requests and degrading system performance.

GPUs, DPUs, and FPGAs. General-purpose graphics processing units (GPGPU), data processing units (DPU), and field-programmable gate arrays (FPGA) offer additional computational resources to address compaction performance challenges. Near-Data Processing (NDP), introduced in the late 1990s as the ''smart disk'' [8], has regained attention
✩ This work is supported in part by the National Natural Science Foundation of China under Grants 62472002 and 62072001. Xiao Qin's work is supported by the U.S. National Science Foundation (Grants IIS-1618669 and OAC-1642133), the National Aeronautics and Space Administration, United States (Grant 80NSSC20M0044), the National Highway Traffic Safety Administration, United States (Grant 451861-19158), and Wright Media, LLC (Grants 240250 and 240311).
∗ Corresponding author.
E-mail addresses: sunhui@ahu.edu.cn (H. Sun), chaozh@stu.ahu.edu.cn (C. Zhao), yylhust@qq.com (Y. Yue), xqin@auburn.edu (X. Qin).

https://doi.org/10.1016/j.sysarc.2025.103342
Received 31 October 2024; Received in revised form 30 December 2024; Accepted 11 January 2025
Available online 24 January 2025
1383-7621/© 2025 Published by Elsevier B.V.
as an emerging computational paradigm. The enhanced computational power within storage devices has fueled interest in NDP. NDP mitigates the overhead of data movement by reducing data movement to the CPU. The NDP paradigm advocates ''computation close to data'' as an alternative to the computation-centered approach in large-scale systems. This model enables storage devices to use their internal bus for data processing rather than transferring data to the host, where the results would otherwise be computed. Most existing NDP-empowered KV stores, such as Co-KV [9] and TStore [10], tackle compaction tasks using a synchronization-based approach. This line of work splits the compaction task, leveraging either averaging or dynamic time-awareness. In the synchronization model, the host and the device cannot complete tasks simultaneously, leading to long waiting times and inefficient resource usage. PStore [11] addresses the waiting time by using an asynchronous model but fails to fully exploit the benefits of this approach due to its single-threaded processing.

To address these issues, we designed an asynchronous NDP scheme, ProckStore, which utilizes a multi-threaded strategy to perform compaction tasks concurrently. All compaction tasks are managed in a thread pool and scheduled using multiple threads, exploiting the benefits of asynchronous processing, where tasks do not interfere with one another. The tasks are executed independently by individual threads. A four-level priority scheduling mechanism is implemented to ensure efficient scheduling of compaction tasks within the thread pool, following the four stages of the compaction process. To address the write amplification issue, a triple-level filtering compaction method is employed, reducing unnecessary writes and alleviating write amplification during compaction on the host side. Furthermore, the transmission process in the NDP architecture and its compaction module is optimized by utilizing a key-value separation technique, minimizing transmission time by sending only the keys to the host. The contributions of this work are summarized as follows.

Fig. 1. The structure of the LSM-tree and RocksDB. The LSM-tree is composed of components C0, C1, and Cn.
▴ We designed ProckStore with an asynchronous and multi-threaded architecture. Compaction tasks are executed independently within the thread pool without interfering with each other, fully leveraging the asynchronous model. This approach significantly enhances write performance compared with the synchronous model and the single-threaded asynchronous scheme.
▴ ProckStore employs four-level priority scheduling to manage the compaction process, which consists of four stages: compaction trigger, compaction picking, compaction execution, and compaction distribution. This scheduling prioritizes tasks at different stages, ensuring optimal efficiency during asynchronous and multi-threaded compaction.
▴ To optimize performance in the NDP transmission architecture, we implemented key-value separation in the host-side compaction, reducing data transmission overhead. The device-side compaction module employs a cross-level compaction technique to alleviate computational load, thereby improving transmission efficiency and overall system throughput.
▴ ProckStore, an extension of RocksDB on the NDP platform, was evaluated using DB_Bench and YCSB-C. Experimental results demonstrate that ProckStore increases throughput by a factor of 1.6× compared to the single-threaded asynchronous PStore, and achieves a 4.2× throughput improvement over the synchronous TStore.

The paper is organized as follows. Section 2 presents the background and motivation of ProckStore. Section 3 presents a system overview of ProckStore and details of each module. Section 4 lists the hardware and software configurations used in the experiments. Section 5 demonstrates the performance of ProckStore through extensive experiments. Section 6 elaborates on the extended experiments. Section 7 summarizes related work. Finally, we conclude our work in Section 8.

2. Background and motivation

2.1. Background

RocksDB is an LSM-tree-based key-value store developed by Facebook, and it is widely used in Facebook's storage systems to achieve high throughput. In RocksDB, the MemTable and Immutable MemTable are stored in memory, while Sorted String Tables (SSTables) are stored on disk. As shown in Fig. 1, key-value pairs from the application are first written to the commit log and then cached in a sorted in-memory data structure called the MemTable, which has a limited size (e.g., 4 MB). Once the MemTable reaches its predefined capacity, it is converted into an Immutable MemTable. A background thread then writes the Immutable MemTable to disk as an SSTable. On disk, SSTables are organized in levels, with each level growing by a fixed multiple.

In Fig. 1(a), the hierarchy of the LSM-tree represents different components, such as C0, C1, ..., and Cn. Component C0 resides in memory. Newly written data is first appended to the sequential log file and then inserted as an entry in C0. However, the high cost of the memory that accommodates C0 imposes a limit on the size of C0. To migrate entries to a component on the disk, the LSM-tree performs a merge operation when the size of C0 reaches its threshold: it takes some contiguous segment of entries from C0 and merges it into a component on the disk. Components Cn (n ≥ 1) reside on the disk in the LSM-tree. Although C1 is disk-resident, the frequently accessed page nodes in C1 remain in the memory buffer. C1 has a directory structure like a B-tree but is optimized for sequential access on the disk. The in-memory C0 serves high-speed writes, and Cn (n ≥ 1) on the disk is responsible for persistence and batch-sequential writes. Through the hierarchical and merging strategies, the LSM-tree achieves a balance between write optimization and high-efficiency queries.

In Fig. 1(b), the most recently generated SSTable is placed in the lowest level, L0. SSTables in level L0 can have overlapping key ranges,
Fig. 2. The results of PStore with different numbers of threads (1, 4, 8, and 12) under the Fillrandom DB_Bench workload with various data volumes.

Fig. 3. The results of PStore with different numbers of threads (1, 4, 8, and 12) under the Fillrandom DB_Bench workload with various value sizes.
while higher levels are organized by key ranges. Each level has a size threshold for its total SSTables. When this threshold is exceeded, the KV store migrates SSTables from level Lk to level Lk+1 during compaction. The compaction process selects SSTables from level Lk and searches for overlapping key ranges in level Lk+1. A merge operation is then performed on the SSTables with overlapping key ranges to produce new SSTables, which are stored in level Lk+1. Obsolete SSTables in level Lk+1 are deleted from the disk. This compaction process incurs computational and storage overhead, which negatively impacts response time and throughput – a significant drawback of the LSM-tree.

Graph computing [2], machine learning [12], and large language models [1] demand substantial data for model training and inference. As data volumes grow, the overhead of transferring data from storage devices to the CPU for computation rises, consuming system resources and creating bottlenecks between storage and memory in high-performance systems. Traditional storage architectures struggle to meet the demands of data-intensive applications under these conditions. NDP mitigates this challenge by fully utilizing the device's internal bandwidth. By incorporating embedded computing units, storage devices can perform computational tasks, offloading these operations from the host and eliminating the overhead of moving large data volumes. The results can then be retrieved from the storage device, reducing the need for additional data movement. Furthermore, the KV store can leverage NDP to perform compaction tasks internally, improving compaction efficiency.

2.2. Motivation

Most existing studies focus on compaction processing in a single-threaded context. For instance, Co-KV and TStore process compaction tasks synchronously and in a single-threaded mode. PStore, on the other hand, demonstrates its effectiveness in an asynchronous and single-threaded setting. Notably, the asynchronous approach allows compaction tasks to be performed independently; however, it is difficult to fully leverage the benefits of asynchronous processing in a single-threaded environment. Therefore, we investigate the performance of PStore under different thread configurations. Fig. 2(a) presents the throughput of PStore under workloads with 4-KB values and various data volumes. We can draw two key observations.

▵ As the number of threads increases, the throughput of PStore does not grow exponentially as expected. The throughput improvement is minimal during multi-threaded writes, particularly when the number of threads is 12.

▵ With a large number of threads, the throughput of PStore increases slowly. Under 20-GB workloads, when the thread count increases from 8 to 12, the throughput increases by only 0.12 MB/s.

These findings indicate that the asynchronous compaction advantages of PStore in single-threaded mode are insufficient to handle the large volume of multi-threaded writes. As a result, increasing the number of threads does not enhance throughput, particularly as the thread count becomes large. While the asynchronous approach in PStore takes into account the computational imbalance between the host and the NDP device, it fails to implement an appropriate asynchronous compaction method. The limitations of the single-threaded mode hinder the full potential of the asynchronous compaction mechanism in the KV store.

As shown in Fig. 2(b), the average latency decreases under workloads with various data volumes, but this reduction is most pronounced when using a small number of threads. Specifically, the most significant decrease occurs between 1 and 4 threads, where the average latency is reduced by 27.8%. Additionally, the CPU utilization on the host supports these observations (see Fig. 2(c)), with a 34% increase at 12 threads over 1 thread under 10-GB workloads, and a 19% increase as the number of threads grows from 1 to 4. These results reveal that PStore is suitable for single- or few-threaded workloads, and it is difficult to adapt it to multi-threaded applications.
Fig. 4. Overview of the ProckStore system. Q1 is the first compaction-task queue. Sub i (0 < i < n+1) represents the ith sub-compaction task of task 1. Ki and Vi denote the ith key and value, respectively.
We conducted an experiment to study the impact of multiple threads on performance under workloads with 4- and 64-KB value sizes. As shown in Fig. 3, we observe findings similar to those under workloads with various data volumes. The throughput of PStore improves with the larger value size. When increasing the number of threads to 4 under workloads with a 64-KB value, the throughput of PStore is 1.65 MB/s. This metric rises to only 1.83 MB/s with 12 threads, an increment of merely 11% (see Fig. 3(a)). In Fig. 3(b), the average latency decreases as the number of threads increases to 4; the average latency with 4 threads is 24.8% lower than with one thread. The decrease becomes small as the number of threads grows further. The host-side CPU utilization becomes smoother in Fig. 3(c), but the improvement is still most pronounced with a limited number of threads.

Thus, we plan to use a multi-threaded approach to implement the asynchronous compaction mechanism in the KV store. We have redesigned the asynchronous compaction mechanism, extending RocksDB, and fully leveraged its internal multi-threaded capability to develop the asynchronous compaction solution, ProckStore.

3. Design of ProckStore

3.1. System overview

In this paper, we propose ProckStore, an NDP-empowered KV store that incorporates an asynchronous and multi-threaded compaction scheme. ProckStore consists of a host-side subsystem, an NDP device, and a communication module that connects the two sides, as shown in Fig. 4. The host-side subsystem manages I/O requests, while the NDP device, which serves as a storage unit, extends computational resources to process tasks offloaded from the host. The NDP device stores persistent data, with read and write operations akin to those of standard storage devices. We implement various modules on both the host and the NDP device to accommodate task-offloading requirements. Data is stored as SSTables in a leveled structure on the NDP device. SSTables are transferred via the communication channel either to the host or to the NDP computing side for compaction. During transmission, the SSTables pass through a key-value separation module, ensuring that only the keys of the KV pairs are sent to the host for the merge operation. The data flow occurs between the NDP device, the compaction module on the host side, and the compaction module on the device.
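The key-value separation step described in Section 3.1 can be sketched as follows: only keys (with positions) travel to the host for merging, while values stay on the device. This is an illustrative model; the function and field names are assumptions for the sketch, not ProckStore's actual interfaces.

```python
def separate(sstable):
    """Split an SSTable (list of (key, value)) into a key index shipped to
    the host and a value log that stays on the NDP device."""
    key_index = [(key, pos) for pos, (key, _) in enumerate(sstable)]
    value_log = [value for _, value in sstable]
    return key_index, value_log

def host_merge(key_indexes):
    """Host-side merge over keys only: for a duplicate key, the entry from
    the newer (earlier-listed) table wins. Returns the merged key order."""
    seen = {}
    for table_id, index in enumerate(key_indexes):
        for key, pos in index:
            if key not in seen:              # first (newest) table wins
                seen[key] = (table_id, pos)
    return sorted(seen.items())              # [(key, (table_id, pos)), ...]

def device_rebuild(merge_order, value_logs):
    """Device side reattaches values by position, producing the new SSTable."""
    return [(key, value_logs[t][pos]) for key, (t, pos) in merge_order]
```

Since only keys and positions cross the channel, the transfer volume during host-side compaction scales with key size rather than record size, which is the transmission saving the design targets.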
As shown in Fig. 4, we illustrate the data flow between the NDP device, the host-side compaction module, and the device-side compaction module. Initially, data is written from the host side (see the host data stream in Fig. 4), and multiple compaction tasks accumulate in the thread pool. These tasks are allocated to the host and the device by the asynchronous manager (see the host task stream and device task stream in Fig. 4). After the compaction tasks complete, the data is written to the flash memory inside the NDP device through the transmission module (see the device data stream in Fig. 4). The dark blue line represents the host-side data flow, where ProckStore transfers data from flash memory to the host for compaction tasks. The four-level priority scheduling module manages the thread pool and the host-side compaction module, and controls the asynchronous manager (see the control stream in Fig. 4). When a compaction task is generated, the four-level priority scheduling module places it in the compaction queue of the thread pool. It then determines whether the task should be executed on the host or the device and notifies the asynchronous manager to allocate the task. When the host executes a compaction task, the scheduling module issues instructions to the host-side compaction module to execute it.

We present the asynchronous compaction mechanism with the multi-level task queue module in Section 3.2, where compaction tasks are kept in the thread pool. Section 3.3 presents the four-level priority scheduling module, which controls the priority scheduling in the compaction process. The triple-level filtering compaction module is described in Section 3.4. We describe the transport mechanism on the NDP device and the cross-level compaction module in Section 3.5.

The host-side asynchronous compaction management module allocates compaction tasks to the device side. A multi-level queue stores the tasks awaiting execution, and the computational capabilities of both the host and the device are calculated. These tasks are then scheduled to the compaction modules on the host and device sides. The host-side compaction module executes its tasks directly, while the device side receives compaction information via the semantic management module, which facilitates communication between the host and the device. The processed information is sent to the device-side compaction module for task execution. The four-level priority scheduling module manages the entire process, from task triggering to execution. Data and commands are transmitted between the host and the device through the semantic management module. The NDP device encodes (decodes) interacting
Fig. 5. Multilevel task queue in ProckStore.

data, storing the SSTable based on key-range granularity, performing garbage collection, maintaining information, and executing compaction tasks on the files.

3.2. Asynchronous mechanisms

To implement the asynchronous strategy, we decouple the two phases, compaction triggering and execution, to establish the conditions for asynchronous compaction. In contrast, the synchronous approach treats the task from compaction trigger to completion as a single process. In the asynchronous mechanism, compaction tasks are continuously generated whenever the conditions for triggering compaction are met. These tasks, generated during the trigger phase, must then be executed. To manage them efficiently, we propose a multi-level task queue that stores compaction tasks uniformly and lets the asynchronous manager schedule them. To align with the structure of the LSM-tree, a compaction task queue is assigned to each level, with tasks generated during the trigger phase placed into the task queue of their level, awaiting scheduling. In Fig. 5, ProckStore employs a multi-level task queue for each column family. The multi-level task queue selects compaction tasks at each level based on a score value.

We implement the multi-level task queues in a thread pool. Tasks in each queue are sorted in ascending order by the number of SSTables. A heap sorting algorithm is used in each task queue so that sorting runs in O(n log n) time. Each task queue is a double-ended queue, allowing compaction tasks to be allocated from both ends to the host and device sides. Since multiple tasks are scheduled in the queues, a thread pool is used to manage the pending tasks in the compaction queues.

3.3. Four-level priority scheduling

An asynchronous-mechanism-based compaction procedure separates the two phases of compaction triggering and execution. To support this, we propose a four-level priority scheduling strategy that assigns priority levels to the four steps involved: triggering compaction tasks, generating tasks, allocating tasks, and executing tasks. This strategy ensures efficient execution of the asynchronous compaction process.

In the case of a multi-level queue, it is important to prevent the queue from becoming starved of compaction tasks. On either the host or the device side, compaction may pause while waiting for new task allocations once the previously allocated tasks complete. Consequently, the triggering conditions must be adjusted to fire more frequently, ensuring that a sufficient number of compaction tasks is available in the task queue. Additionally, different priorities must be set for scheduling the various tasks. ProckStore assigns a score to each priority level, see Fig. 6.

First-level priority: the priority of the compaction trigger. In the stage of triggering a compaction task, the goal of ProckStore is to select the level that most urgently needs compaction. ProckStore defines first_score to prioritize the compaction triggering stage. Due to the structure of the LSM-tree, the score of level L0 is calculated from the number of SSTables relative to the threshold on the number of SSTables in L0, whereas the score of any other level is calculated from the total size of its SSTables relative to the size threshold of that level. Thus, the calculation of first_score is divided into level L0 and the other levels, as in the following equation:

    first_score = (N_sst − N_no_comp − N_being_comp) / N_max,  for level i, i = 0
    first_score = (S_sst − S_no_comp − S_being_comp) / S_max,  for level i, i > 0        (1)

where N_sst and N_max denote the number of SSTables in level L0 and the threshold on the number of SSTables in level L0, respectively. N_no_comp and N_being_comp denote the number of SSTables of level L0 contained in compaction tasks already picked into the task queue and the number of SSTables of level L0 contained in compaction tasks currently being executed, respectively. S_sst and S_max denote the total data volume of the SSTables in level i and the threshold on the data volume of SSTables in level i, respectively. Similarly, S_no_comp and S_being_comp denote the data volume of SSTables contained in compaction tasks waiting in the queue of level i and the data volume of SSTables in the compaction task of level i currently being executed. Different from the RocksDB score, the SSTables already picked into the compaction queue and the SSTables involved in running compaction tasks are subtracted, excluding SSTables that will soon leave the level and thus making the calculation of first_score more accurate.

When a level triggers a compaction, the generated compaction task is put into the task queue of that level, where it waits for the compaction module to process it (see Fig. 6). In the asynchronous trigger mode, the compaction task is not executed immediately; it waits to be scheduled by the asynchronous compaction manager. The device side triggers the compaction task according to the first_score of each level (selecting the maximum value) and places it into the task queue. This provides the basis for prioritizing compaction triggering and execution between levels, which is the first-level prioritization.

A second level of prioritization is the prioritization of SSTable picking. In the compaction task generation phase, we select some SSTables in a level and all the overlapping SSTables from the following level. These SSTables are merged in the compaction operation. ProckStore puts the SSTables to be compacted into the compaction_queue and then reads from the queue the information of the first file that needs to be compacted. By default, the SSTables in the queue undergo compaction sequentially, without considering hot and cold data or the size of the compaction task. Thus, the number of SSTables overlapping with the lower level is added to the FileMetaData of each SSTable. The second_score is set to this number of overlapping SSTables, and the meta-information of each level is sorted in ascending order of second_score, see Eq. (2):

    second_score = Overlap_sst, for each SSTable        (2)
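The scoring and queueing rules of Sections 3.2 and 3.3 can be made concrete with the sketch below: it computes first_score per Eq. (1) and models a per-level double-ended task queue in which the host takes tasks from the small end while the device takes them from the large end. It is an illustrative model under assumed data structures, not ProckStore's source code.

```python
from collections import deque

def first_score(level, n_sst=0, n_no_comp=0, n_being_comp=0, n_max=1,
                s_sst=0, s_no_comp=0, s_being_comp=0, s_max=1):
    """Eq. (1): SSTable-count ratio for L0, data-volume ratio for deeper
    levels; SSTables already queued or being compacted are subtracted."""
    if level == 0:
        return (n_sst - n_no_comp - n_being_comp) / n_max
    return (s_sst - s_no_comp - s_being_comp) / s_max

class LevelTaskQueue:
    """Double-ended compaction-task queue for one level, kept sorted in
    ascending order by the number of SSTables in each task."""
    def __init__(self):
        self.tasks = deque()

    def push(self, task):
        # task = (num_sstables, task_id); re-sort on insert
        # (the paper uses heap sorting for the same effect)
        self.tasks = deque(sorted(list(self.tasks) + [task]))

    def pop_for_host(self):
        # host takes the task with the fewest SSTables (left end)
        return self.tasks.popleft()

    def pop_for_device(self):
        # device takes the task with the most SSTables (right end)
        return self.tasks.pop()
```

Under this default allocation rule the host drains small tasks while the device handles large ones; the third-level priority described later can swap the two ends when the recorded compaction times show the opposite assignment is faster.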
Fig. 6. Four-level priority scheduling in ProckStore. The light gold circles represent SSTables selected for compaction at the current level, while the light blue circles denote SSTables selected for compaction at the next level. The dark blue circles indicate SSTables that are not selected for compaction. The yellow rectangles represent newly generated compaction tasks, the light red rectangles signify compaction tasks assigned to the device side, and the dark orange rectangles represent compaction tasks assigned to the host side.
where Overlap_sst denotes the number of overlapping SSTables. The SSTable with the smallest number of overlapping SSTables at the lower level is prioritized when selecting the SSTable to be compacted. Compaction and SSTable metadata information are maintained in a linked list to facilitate insertion and deletion. However, querying the overlap of SSTables with the lower level and maintaining the order of the linked list costs O(n) time. This scheme ensures that the minimum number of SSTables is selected in each compaction, reducing compaction time.

A third level of priority is the priority of allocating compaction tasks. In the stage of compaction task allocation, we consider the different computational capabilities of the host and the device sides. Meanwhile, the compaction processing efficiency varies with the configurations of the host and the device and with the data paths of read, write, and transmission. Therefore, it is necessary to select appropriate compaction tasks for both the host and the device. When all the SSTables involved in a compaction are selected, the compaction task information is generated and inserted into the compaction task queue. The compaction task queue of each level is a double-ended priority task queue, sorted in ascending order by the number of SSTables in a compaction task; the queue is heap sorted with time complexity O(n log n). Initially, the host obtains tasks from the left side of the queue, which holds tasks with fewer SSTables, and the device side gets tasks from the right side, which holds tasks with more SSTables. Both the host and the device sides record the compaction time.

During the compaction process, the host and the device record the time cost of five consecutive compaction tasks and the data volume of the compacted SSTables. The third_score is the ratio of the time taken for five consecutive compaction tasks to the data volume of the compacted SSTables, which is given as

    third_score = { T_host_comp / S_host_sst,       for the host
                  { T_device_comp / S_device_sst,   for the device        (3)

where S_host_sst and T_host_comp denote the total data volume of compacted SSTables on the host and the corresponding time cost, respectively, and S_device_sst and T_device_comp denote the total data volume of compacted SSTables on the device and the corresponding time cost, respectively. We use the third_score to evaluate the compaction processing capability of both the host and device sides. The side with the higher compaction processing capability handles tasks containing a large number of SSTables from the right end of the queue, while the other side handles tasks from the left end.

The larger the value of third_score is, the less efficient the compaction is, since a large value indicates that the compaction operation spends more time processing an SSTable. Comparing third_score_host with third_score_device, there are three cases: (1) the score of the host side is greater than that of the device side; (2) the score of the device side is greater than that of the host side; (3) the two scores are equal. In case (1), the default rules for acquiring tasks from the queue remain unchanged. In case (2), the roles are swapped: the device side fetches tasks from the left side of the queue, and the host obtains tasks from the right side. In case (3), the default rules for taking tasks are maintained, and the host and device sides re-record and re-calculate the compaction time, after which the judgment is made again according to the comparison results. Within a running process, the configuration of the host and device sides cannot change, so once the task-fetching rules are decided by this numerical comparison, they are not revised again during the process.

A fourth level of prioritization is the priority of executing sub-tasks. After the SSTables are selected, they are integrated into a complete compaction task that reaches the stage of compaction execution on the NDP device and the host. The compaction task is decomposed into multiple sub-tasks, which can be executed in parallel on the device. Notably, a sub-task refers to a sub-compaction performed in the multi-threaded compaction mechanism of RocksDB. In a compaction process, the primary thread first executes a sub-task; by default, the first sub-task is assigned to the main thread for execution. The remaining sub-tasks are executed concurrently on newly created sub-threads. Then, the primary thread merges the results and writes them back in a unified manner.

The amounts of data and the execution times differ across sub-tasks, so the computational resources are underutilized by default. To address this issue, we prioritize the concurrent execution of sub-tasks. Let fourth_score = S_SST, where S_SST denotes the total data volume of the SSTables in each sub-task. When dividing the sub-tasks, we compare the data size of each sub-task. The sub-task containing the smallest amount of data is given the highest priority; that is, the smaller the fourth_score, the higher the priority, and the highest-priority sub-task is placed into the primary thread for compaction. The compaction execution time can be expressed as

    T_exe = T_pthread + T_subthread,        (4)

where T_exe, T_pthread, and T_subthread represent the overall execution time, the primary-thread execution time, and the sub-thread execution time, respectively. The sub-task with the least execution time is placed into the primary thread for execution to reduce the overall execution time. Notably, the sub-thread execution time is determined by the sub-task with the longest execution time, and this procedure does not affect the execution time of the other sub-tasks, thereby reducing the overall execution time and improving system performance.

3.4. Triple-level filter compaction

The asynchronous compaction method of ProckStore improves compaction performance; however, it introduces a write amplification problem. Therefore, we propose the mechanism of triple-level filtering compaction (see Fig. 7). In a compaction procedure,
Fig. 7. The triple-level filter compaction in ProckStore.
Fig. 8. The transmission module between the host and the device in ProckStore.
triple-level filtering compaction involves SSTables from three levels to remove duplicate data. Its triggering, however, requires certain conditions to be met. When performing a compaction involving SSTables in levels L_i and L_{i+1}, ProckStore checks the first_score value of level L_{i+1}. If the value is greater than 1, the triple-level filtering compaction is triggered, involving SSTables from level L_{i+2} that overlap with those from level L_{i+1}. This mechanism helps reduce duplicate writes and alleviates write amplification.

As triple-level filtering compaction contains overlapping-key-range SSTables of levels L_i, L_{i+1}, and L_{i+2}, it can involve an excessive amount of compaction data. When performing the three-level compaction, some key ranges can exist in all of levels L_i, L_{i+1}, and L_{i+2}. These key ranges can be deleted and filtered at the intermediate level (i.e., level L_{i+1}), which does not affect the update of new keys from level L_i to L_{i+2} or the merging of old keys in level L_{i+2}. At the stage of generating compaction tasks, we mark the duplicate key ranges across the three levels when picking the overlapping SSTables and filter them out of level L_{i+1} in advance. Then, the newest keys in level L_i and the oldest keys in level L_{i+2} are retained. This approach reduces the data volume involved in compaction by reducing redundancy across the three levels, thereby alleviating the issue of excessive data in compaction operations.

As shown in Fig. 7, when level L_1 performs compaction, the score of level L_2 is greater than 1, satisfying the condition of triple-level filter compaction. Comparing the key ranges of levels L_i, L_{i+1}, and L_{i+2}, ProckStore filters out and deletes the keys that exist across all three levels. In Fig. 7, the keys 7, 8, 9, 10, 11, and 13 are filtered out from level L_2 before performing the compaction operation. These keys are placed in the compaction queue, awaiting the asynchronous manager to initiate the compaction. According to Eq. (1), these keys are marked as S_no_comp, causing the first_score value in levels L_1 and L_2 to drop below 1 due to the subtraction of these keys. This reduces the excess data in the level.

The default compaction method in ProckStore merges the SSTables with overlapping key ranges in levels L_i and L_{i+1} and then writes new SSTables into level L_{i+1}. This process introduces write amplification. The SSTables newly written to level L_{i+1} may immediately need to be combined with the overlapping-key-range SSTables in level L_{i+2} to form a new compaction task. These SSTables are merged and the new data are written to level L_{i+2}, resulting in additional write amplification for data previously written to level L_{i+1}. Consequently, this procedure incurs two instances of write amplification. The triple-level filtering compaction instead combines all the overlapping-key-range SSTables in the three levels into a single compaction. The data in level L_i is written directly to level L_{i+2}, which eliminates one compaction pass and the write amplification from level L_{i+1} to L_{i+2}.

3.5. Transmission in ProckStore

In ProckStore, data is transferred between the host and device sides, as shown in Fig. 8. During compaction, a large amount of data is read from the NDP device to the host, which involves transferring many KV pairs and results in significant data migration overhead. To address this issue, we employ key–value separation for compaction, which minimizes data migration between the host and the NDP device, reduces write amplification, and improves compaction performance. In the compaction process, only the keys of the KV pairs are read, sorted, and written back to the NDP device. The key size is less than 1 KB, while the value size exceeds 1 KB. During compaction on the host side, the NDP device transmits only the keys to the host and processes the values locally. Afterward, the host processes the keys and sends them back to the device, which integrates them into an SSTable. This approach significantly reduces data migration between the CPU and memory on the host side and minimizes the overhead of data transfers between the device and the host.

The key–value separation mechanism is implemented during host-side compaction, with the entire KV pair stored on the NDP device. Only the key is processed during host-side compaction, reducing both device-to-host data transfers and host-to-device compaction operations. Based
on the compaction information, the device separates the key from the value in the SSTables. The key array stores the address of each value, which is used for subsequent reorganization. The keys are then sent to the host for a sort-merge operation. Afterward, the compacted keys are sent back to the device, where they are reorganized into new SSTables based on the value addresses. Three threads, one per step, carry out this procedure: (1) the separation thread on the device, (2) the merge-operation thread on the host, and (3) the key–value reorganization thread on the device. The host-side compaction task is divided into the following three steps:

▵ Step 1: The key–value separation thread in the NDP device retrieves the KV pairs based on the SSTable data format. Each key or value is stored in the corresponding array in the NDP device's memory. In the key array, each key records the subscript of its corresponding value, so the time complexity of searching the array is O(1). The value array is retained in the NDP device's memory via memory sharing and waits for the sorted key array to be fetched back from the host. The key array is transferred to the host via the host-NDP interface.

▵ Step 2: The host fetches the key array, sorts the individual keys, and sends the sorted key array to the NDP device for restructuring. All these steps are organized within a single thread.

▵ Step 3: Upon receiving the new key array, the NDP device finds each key's corresponding value based on its subscript. Simultaneously, the device reconstructs the new SSTables according to the order of the keys. To minimize data transfer time between the host and device, the data volume is reduced, and a separate transfer thread ensures that the communication between the host and device remains unaffected, minimizing transmission latency.

As shown in Fig. 8, only the keys (which are reconstructed on the host side) are passed between the host and the device. When the host performs compaction, a compaction request is sent to the device, which then provides the necessary data. SSTables 1 and 2 from level L_0 and SSTables 3 and 4 from level L_1 are separated. The keys, including duplicates, and their offset addresses are passed to the host, which executes the compaction procedure. After deduplication, the keys are re-transmitted to the NDP device, where they are reorganized into a new SSTable (SSTable 7) in level L_1.

By reducing the transmission overhead on the host side, the device reduces the compaction task's time cost, aligning with the NDP architecture's requirements. At the fourth priority level, the host handles most of the compaction tasks, which contain more SSTables, thereby relieving the device's computational load. However, the NDP device not only processes the values for the host but also handles the KV pairs in its own compaction tasks, which increases the device's processing pressure. To alleviate this, cross-level compaction is employed to reduce the computational strain on the NDP device.

When a compaction process is triggered in level L_i and the first_score of level L_{i+1} exceeds one, cross-level compaction is initiated. This process searches for SSTables with overlapping key ranges in the subsequent level L_{i+2}. Unlike traditional compaction, cross-level compaction in ProckStore continuously searches for overlapping key-range SSTables in level L_{i+2}. Subsequently, SSTables from levels L_i, L_{i+1}, and L_{i+2} undergo compaction, and the newly generated SSTables are written to level L_{i+2}.

The trigger selection in level L_i follows the priority rules, while the selection of SSTables in levels L_{i+1} and L_{i+2} is based on their second_score (see Eq. (2)). SSTables that traditional compaction would write to level L_{i+1} may be written to level L_{i+2} through cross-level compaction. This cross-level approach helps balance the SSTable distribution across levels, reducing the number of compaction operations. However, it introduces a drawback: compaction involving many SSTables increases compaction time. For the NDP device, data transmission time can be ignored, thereby reducing overall compaction time. As illustrated in Fig. 8, SSTables 1 and 2 in level L_0, SSTables 3 and 4 in level L_1, and SSTables 8 and 9 in level L_2 are involved in compaction on the NDP device, and the new data is written into SSTable 13 in level L_2. With an asynchronous mechanism, priority scheduling, and optimized data transmission under the NDP-empowered KV store, ProckStore efficiently optimizes the compaction process.

4. Experimental settings

Platform. We implemented ProckStore based on RocksDB and conducted experiments to assess its performance. To evaluate ProckStore, we constructed a test platform simulating the NDP architecture, where data transfer between the host and NDP device occurs over Ethernet. Although this platform was used for validation, ProckStore is scalable to real NDP platforms. SocketRocksDB, a version of RocksDB deployed on the NDP collaborative framework, was used as the baseline. TStore, PStore, and ProckStore all share the NDP-empowered storage framework. The experimental platform comprises two subsystems: a host-side subsystem and a device-side NDP subsystem. The host system is equipped with an Intel(R) Core(TM) i3-10100 CPU (8 cores) and 16 GB of DRAM, while the NDP device runs on an ARM-based development board with four Cortex-A76 cores, four Cortex-A55 cores, 16 GB of DRAM, and a 256-GB Western Digital SSD. A network cable connects the host to the NDP device, with a bandwidth of 1000 Mbps.

The host system runs Ubuntu 16.04, and RocksDB version 6.10.2 is employed. The NDP platform uses a lightweight embedded operating system. Data transfer between the host and the NDP device is facilitated by the SOCKET interface, replacing the standard POSIX interface to ensure efficient data transmission. In RocksDB, the buffer and SSTable sizes are set to 4 MB, the block size is 4 KB, and the level settings remain at default values. The number of sub-tasks on the host is limited to 4, and all other configuration parameters in RocksDB are set to default values.

Workloads. In this section, we evaluate the performance of ProckStore under realistic workloads. The details of the DB_Bench and YCSB-C workloads used in the experiments are presented in Table 1; the "Type" column distinguishes the different workloads. The DB_Bench workloads are used to assess random-write performance: db_bench_1 is configured in random-write mode with a fixed value size of 1 KB and varying data sizes (10 GB, 20 GB, 30 GB, 40 GB), and db_bench_2 is configured in random-write mode with multiple value sizes (1 KB, 4 KB, 16 KB, 64 KB) and two data volumes (10 GB and 40 GB). We also employ YCSB-C to measure ProckStore's performance under mixed read–write workloads.

5. Performance evaluation

We conduct experiments to evaluate the performance of ProckStore in terms of throughput, latency, and write amplification (WA).

5.1. Performance under DB_Bench with various data volumes

In this section, we evaluate the performance of ProckStore using DB_Bench with various data volumes and a 4-KB value. Fig. 9 illustrates the impact of data volume on performance, focusing on throughput, WA, CPU utilization, and bandwidth. ProckStore delivers peak performance with 10-GB workloads, achieving up to 48% higher throughput than PStore, with an average improvement of 40%. Under 40-GB workloads, the WA of TStore and SocketRocksDB reaches its maximum, while ProckStore's WA remains constant at 1.4 across all cases. Under 30-GB workloads, the throughput of TStore and SocketRocksDB decreases by an average of 67% and 61%, respectively, compared to ProckStore. This performance decrement is attributed to the frequency of compaction operations, which consume bandwidth and degrade overall performance. PStore exhibits lower CPU utilization than ProckStore across all workloads; the multi-threaded approach in ProckStore better utilizes the computing resources. In contrast, SocketRocksDB prioritizes data storage over compaction, leading to lower CPU utilization than PStore and ProckStore.
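As a reference for the WA figures reported in this evaluation, write amplification can be computed from byte counters along the following lines. This is a generic sketch: the counter breakdown is our assumption, though RocksDB exposes equivalent statistics.

```python
def write_amplification(user_bytes, flush_bytes, compaction_bytes, wal_bytes=0):
    """WA = bytes physically written by the store (memtable flushes,
    compaction output, and optionally the write-ahead log) divided by
    the bytes the application logically wrote."""
    return (flush_bytes + compaction_bytes + wal_bytes) / user_bytes

# e.g. 10 GB of user writes causing 10 GB of flushes and 4 GB of
# compaction output yield a WA of 1.4
```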
Table 1
Workload Characteristics used in the Experiment.

Workloads in DB_Bench
  Type        Feature      Fillrandom  Value size (x 1 KB)  Data size (x 10 GB)
  db_bench_1  100% writes  ✓           4×                   1×, 2×, 3×, 4×
  db_bench_2  100% writes  ✓           1×, 4×, 16×, 64×     1×, 4×

Workloads in YCSB-C (data size: Load 1×, 2× of 10 GB; Run 1×, 2× of 10 GB; record size 1× of 1 KB)
  Type  Feature                            Distribution
  A     50% Reads, 50% Updates             Zipfian
  B     95% Reads, 5% Updates              Zipfian
  C     100% Reads                         Zipfian
  D     95% Reads, 5% Inserts              Latest
  E     95% Range Queries, 5% Inserts      Uniform
  F     50% Reads, 50% Read-Modify-Writes  Zipfian
Fig. 9. The results of TStore, PStore, SocketRocksDB, and ProckStore under Fillrandom DB_Bench with various data volumes.
5.1.1. Write amplification (WA)

A large WA indicates significant duplication of write operations, which degrades system performance. In SocketRocksDB, WA is primarily caused by the write-ahead log and compaction on the host. WA increases with the amount of data, as the number of compaction operations is proportional to the data size. As shown in Fig. 9(b), under a 10-GB workload, the WA of TStore and PStore is reduced by 39% and 62%, respectively, compared to SocketRocksDB, which performs all compaction tasks on the host. By offloading a portion of the compaction tasks to the NDP device, TStore and PStore reduce WA; notably, ProckStore exhibits a 55% reduction in WA. A similar trend is observed under 20- and 30-GB workloads. For a 40-GB workload, the WA of ProckStore is reduced by 36.4% and 72.0% compared to TStore and SocketRocksDB, respectively.

5.1.2. Throughput

In Fig. 9(c), the operation time of ProckStore ranges from 828.35 micros/op to 1819.18 micros/op. The operation time of ProckStore is lower than that of SocketRocksDB because it takes less time to execute write and read operations. ProckStore reduces operation time by 72.8% compared to TStore under a 20-GB workload. Under a 40-GB workload, ProckStore reduces the operation time by 24.0% and 61.5% compared to PStore and SocketRocksDB, respectively. In Fig. 9(a), with a 40-GB dataset, the throughput of ProckStore is 4.15× and 1.47× higher than that of TStore and PStore; meanwhile, ProckStore achieves a throughput 2.75× higher than SocketRocksDB. Under a 10-GB workload, ProckStore achieves 2.81× the write throughput of SocketRocksDB through its multi-threaded asynchronous approach. In addition, with a 10-GB dataset, the throughput of ProckStore is 45.2% higher than that of PStore.

Other KV stores (excluding SocketRocksDB) leverage collaborative strategies between the host and NDP device to accelerate compaction,
thereby enhancing throughput. ProckStore optimizes resource allocation with a multi-threaded asynchronous approach, which improves performance. Its throughput exceeds 4.57 MB/s, achieving a 48% improvement over PStore.

5.1.3. CPU utilization

CPU utilization refers to the proportion of CPU resources consumed by the KV stores under different workloads. TStore utilizes a single-threaded approach on both the host and device, leading to low CPU utilization (see Figs. 9(d) and 9(e)); as a result, TStore's CPU utilization is lower than that of SocketRocksDB. Despite leveraging multi-threaded concurrency, SocketRocksDB faces a transmission bottleneck between the host and the device: during task processing, the host quickly performs merge operations, but there is significant latency during read and write operations. By offloading a portion of tasks to the NDP device, ProckStore reduces CPU idle time and improves CPU utilization on the host. Compared to SocketRocksDB, ProckStore achieves improvements of 97% and 89% in CPU utilization under 10-GB and 40-GB workloads, respectively. ProckStore demonstrates the highest host-side CPU utilization, peaking at 7.03% under a 10-GB workload.

ProckStore's multi-threaded method on the host further enhances CPU utilization. As shown in Fig. 9(e), PStore employs a single-threaded, asynchronous method, offering greater flexibility than traditional scheduling models. Furthermore, reduced compaction time increases the device-side CPU utilization of PStore to over 20.49%, a 73% improvement compared to TStore under a 10-GB workload. In ProckStore, device-side CPU utilization is further enhanced through cross-level compaction: this metric increases by 27%, 33%, 40%, and 35% compared to PStore under 10-, 20-, 30-, and 40-GB workloads, respectively.

5.1.4. Compaction bandwidth

The compaction bandwidth reveals the compaction performance of a KV store. In this paper, the term "compaction bandwidth" refers to the host-side compaction bandwidth, as our proposed ProckStore primarily focuses on optimizing host-side performance. For instance, the four-level priority scheduling in Section 3.3 prioritizes four steps—triggering, task generation, task allocation, and task execution on the host—to perform asynchronous compaction efficiently, and the triple-level filter compaction in Section 3.4 combines two compaction procedures into one, thereby improving host-side compaction performance. Therefore, we define compaction bandwidth as the ratio of the amount of compacted data to the compaction time on the host side. SocketRocksDB performs compaction tasks on the host, while the other KV stores provide compaction bandwidth on both the host and the NDP device. In Fig. 9(f), the single-threaded TStore fails to fully leverage the multi-core computational capabilities of the host, resulting in an average bandwidth of 2.35 MB/s.

In contrast, SocketRocksDB uses the multi-threaded method to raise its bandwidth to 2.86 MB/s, outperforming the other KV stores; this is because the host handles all the tasks, resulting in a large total amount of data processed on the host. The collaborative solution improves processing efficiency on the host. Under 40-GB workloads, ProckStore's bandwidth improves by 3.56× and 1.51× over SocketRocksDB and PStore, respectively.

5.2. Performance under DB_Bench with various value sizes

We configured the workloads with various value sizes and two data volumes (10 GB and 40 GB). A large value size increases the compaction overhead while improving the throughput under workloads with a fixed data volume. ProckStore maintains optimal performance under workloads with different value sizes and two data volumes (see Figs. 10 and 11). ProckStore's throughput increases on average by 63.1% and 77.7% compared to PStore and SocketRocksDB in the case of a 1-KB value (see Fig. 10(a)). The performance increases at 64 KB because large-value workloads trigger more frequent compaction and shorter running times. ProckStore has the best performance in terms of bandwidth, with an average improvement of 1.67× compared to PStore and 2.32× compared to SocketRocksDB across all value sizes. Meanwhile, ProckStore achieves the highest host- and device-side CPU utilization. In Fig. 11(e), the device-side CPU utilizations of PStore and ProckStore are similar due to task stacking on the device under large data volumes.

5.2.1. Write amplification (WA)

With increasing value sizes, the amount of data on the host grows, exacerbating WA in TStore and SocketRocksDB. In Fig. 11(b), the WA of TStore and SocketRocksDB is the lowest (2.18 and 5.2) with a 16-KB value. Under 1-KB value workloads, WA increases to 2.39 and 6.1, respectively. ProckStore's WA is unaffected by host-side compaction. Under 1-KB and 64-KB workloads, ProckStore reduces WA by 76.2% and 75.1%, respectively, compared to SocketRocksDB; the reduction increases to 76.4% and 77.6% under 10-GB workloads. This improvement is due to ProckStore's triple-filter compaction on the host, which reduces compaction operations and the volume of compacted data.

5.2.2. Throughput

In Figs. 10(a) and 11(a), ProckStore's average throughput ranges from 3.8 MB/s to 5.1 MB/s and from 2.7 MB/s to 4.0 MB/s under 10-GB and 40-GB workloads, respectively. It is worth noting that ProckStore's throughput increases compared with PStore, indicating lower response times to foreground requests. Compared with SocketRocksDB, ProckStore improves by 2.04× and 2.1× under 40-GB workloads with 1-KB and 16-KB values, respectively. Compared with PStore, ProckStore improves throughput by 1.51× and 1.58× (see Fig. 11(a)). In particular, compared with TStore, ProckStore achieves 4.1× and 2.68× improvements under workloads with 4-KB and 64-KB values, respectively.

5.2.3. CPU utilization

Large values increase compaction overhead and host-side CPU utilization, peaking under workloads with a 64-KB value. ProckStore's host-side and device-side CPU utilization reach 10.83% and 29.11%, respectively (see Figs. 10(e) and 11(d)), while SocketRocksDB's values are 8.34% and 18.28%. Additionally, ProckStore's CPU utilization on both sides is 8.27% and 25.39% under 40-GB workloads with 1-KB values. On average, ProckStore's CPU utilization is 3.35× and 4.1× higher than that of TStore and SocketRocksDB, respectively, and it outperforms PStore in both host- and device-side CPU utilization under all workloads.

5.2.4. Compaction bandwidth

In Figs. 10(f) and 11(f), the compaction bandwidth of the KV stores varies. TStore's device-side bandwidth peaks at 3.14 MB/s, while ProckStore shows an average improvement of 4.29× and 1.61× over TStore and PStore, respectively. SocketRocksDB leverages multi-threaded parallelism to enhance computation and reduce processing time, leading to superior bandwidth performance under 40-GB workloads across all value sizes. However, PStore achieves higher bandwidth than SocketRocksDB under 10-GB workloads. ProckStore outperforms all other stores in terms of bandwidth across all workloads, achieving a 3.54× improvement over SocketRocksDB under workloads with a 64-KB value.

5.3. Performance under YCSB-C

YCSB-C provides real-world workloads, which we use to evaluate the compaction performance of TStore, PStore, SocketRocksDB, and ProckStore. We configure this workload with two types of data volumes: 10 GB and 20 GB in the Load and Run phases. We define the configuration with a 10-GB Load and a 10-GB Run as the small data volume, and a 20-GB Load and a 20-GB Run as the large data volume. We use six types of workloads in the experiment.
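The six YCSB-C mixes from Table 1 can be captured in a small mapping for driving experiments or sanity checks. This is a convenience representation of ours, not an artifact of the benchmark:

```python
# Operation ratios and request distributions of the six YCSB-C
# workloads, as listed in Table 1.
YCSB_WORKLOADS = {
    "A": {"read": 0.50, "update": 0.50, "dist": "zipfian"},
    "B": {"read": 0.95, "update": 0.05, "dist": "zipfian"},
    "C": {"read": 1.00, "dist": "zipfian"},
    "D": {"read": 0.95, "insert": 0.05, "dist": "latest"},
    "E": {"scan": 0.95, "insert": 0.05, "dist": "uniform"},
    "F": {"read": 0.50, "rmw": 0.50, "dist": "zipfian"},
}

def write_fraction(workload):
    """Fraction of operations that write: updates, inserts, and the
    write half of read-modify-writes."""
    return (workload.get("update", 0.0)
            + workload.get("insert", 0.0)
            + workload.get("rmw", 0.0))
```

By this measure, workloads A and F are the most write-intensive, while C issues no writes at all, which matches the read-intensive vs. write-intensive grouping used in the discussion below.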
Fig. 10. The results of TStore, PStore, SocketRocksDB, and ProckStore under Fillrandom DB_Bench with 10-GB data volume and various value sizes.
Fig. 11. The results of TStore, PStore, SocketRocksDB, and ProckStore under Fillrandom DB_Bench with 40-GB data volume and various value sizes.
Fig. 12. The results of TStore, PStore, SocketRocksDB, and ProckStore under YCSB-C with load 10 GB and run 10 GB data volume.
5.3.1. Case 1: Load 10 GB and run 10 GB

Load. In YCSB-C, the Load workload is write-intensive, resulting in frequent compaction. ProckStore optimizes compaction under various workloads (Fig. 12). Its throughput outperforms SocketRocksDB by a factor of 2.3×. TStore benefits from time-aware dynamic task scheduling, which narrows its performance gap with ProckStore. PStore's asynchronous compaction improves performance, and ProckStore's multi-threaded execution further enhances the asynchronous compaction strategy. Consequently, ProckStore's throughput is 4.24× and 1.80× higher than that of TStore and PStore, respectively. During the Run phase, ProckStore's collaborative mode improves performance under write-intensive workloads. Workloads A and F exhibit the highest write ratios at 29.0%. Under workload A, ProckStore's throughput is 28.2% and 29.0% higher than that of TStore and PStore, respectively. Under workload F, ProckStore's throughput surpasses that of TStore and PStore by 36.4% and 71.3%, respectively. However, when the write percentage is low, ProckStore's throughput shows minimal variation compared to the other KV stores. Additionally, under read-intensive workloads, ProckStore achieves maximum throughput improvements of 60.7%, 59.8%, 122.4%, and 9.2% under workloads B, C, D, and E, respectively. In contrast, TStore's read performance suffers from the excessive number of SSTables, which increases query overhead.

Throughput and Latency. Throughput and latency are critical metrics for KV stores. As KV stores are widely deployed in real-world applications, these metrics significantly affect response time. ProckStore maintains its performance advantage in the Load phase when the data size increases from 10 GB to 20 GB under otherwise identical workloads. Its throughput is 4.24× that of TStore (see Fig. 12(a)). Compared with SocketRocksDB and PStore, ProckStore's throughput improves by 2.33× and 1.8×, respectively, under the same workload. The advantage of ProckStore becomes even more pronounced under workloads A and F, which involve a higher percentage of writes. Latency results further demonstrate the flexibility of ProckStore's scheduling method. Under workloads D and F, ProckStore has 55.1% and 42.5% lower latency than PStore (see Fig. 12(c)). Compared with TStore and PStore, ProckStore has 20.8% and 22.1% lower latency under workload A, respectively (see Fig. 12(c)). Under read-intensive workload C, ProckStore's average latency is 13.4% and 37.5% lower than that of TStore and SocketRocksDB, respectively. ProckStore exhibits similar trends under workloads B and D. The average latency of ProckStore under write-intensive workloads is 5.72% and 8.45% lower than that of TStore and PStore under workload E, respectively. Moreover, the throughput of ProckStore is not lower than that of the other KV stores.

Write Amplification (WA). ProckStore achieves lower WA than both TStore and SocketRocksDB (see Fig. 12(b)). WA is reduced by approximately 1.2 compared to SocketRocksDB. ProckStore's host-side multi-threaded method further decreases WA by an average of 62.3% compared to TStore. The minimum WA of ProckStore is 1.20 under workload C. WA in ProckStore is influenced by the write-ahead log and host-side compaction; because its compaction frequency is higher, its WA is greater than that of PStore. Under workloads C and D, WA is 1.20 and 1.36, respectively. However, ProckStore's triple-level filter compaction mechanism mitigates WA compared to SocketRocksDB.

CPU Utilization and Compaction Bandwidth. Figs. 12(d) and 12(e) show the host-side CPU utilization of the KV stores. Notably, TStore runs on a single thread. The bandwidth limits the data transfer between the host and the device. Overall, the CPU utilization patterns of SocketRocksDB and ProckStore are similar on both sides, which can be attributed to the reduction in total processing time, accompanied by a reduction in compaction time. Fig. 12(f) shows the compaction bandwidth in the Load and Run phases. ProckStore achieves the highest bandwidth on the host. Its bandwidth is 8.64× and 3.32× higher than that of TStore and SocketRocksDB, respectively, under workload A. This improvement increases to 7.97× and 2.99× under workload F. ProckStore improves the bandwidth by exploiting multi-threaded parallelism. The average bandwidth of ProckStore is 56.8% higher than that of PStore due to its efficient task scheduling, which leverages the computational capabilities of both the host and the device.

Fig. 13. The results of TStore, PStore, SocketRocksDB, and ProckStore under YCSB-C with load 20 GB and run 20 GB data volume.

5.3.2. Case 2: Load 20 GB and run 20 GB

In the Load phase, the throughput of ProckStore surpasses that of SocketRocksDB and TStore by 3.44× and 3.73×, respectively (see Fig. 13(a)). Although the asynchronous approach of PStore enhances performance, the multi-threaded method of ProckStore integrates with the asynchronous compaction mechanism. Consequently, the throughput of ProckStore reaches 1.59× that of PStore. In the Run phase, the multi-threaded asynchronous mode improves the performance of ProckStore under write-intensive workloads A and F, where half of the operations are writes. Specifically, under workload A, ProckStore's throughput exceeds that of TStore and PStore by 21.0% and 23.1%, respectively. Similarly, under workload F, ProckStore achieves 19.1% and 21.3% higher throughput than TStore and PStore, respectively. Workloads A and F involve large data volumes; as the data volume increases from 10 GB to 20 GB under these workloads, the throughput of ProckStore decreases by 32.2% and 42.7%, respectively. In addition, the throughput of ProckStore is optimized under read-intensive workloads. ProckStore focuses on optimizing compaction, so the read performance improvement is small. For read-intensive workloads, ProckStore achieves 20.4%, 16.3%, and 28.5% improvements under B, C, and D, respectively, compared with PStore. Compared with the smaller data volume, ProckStore's throughput decreases by 27.8%, 8.4%, and 40.6% under workloads B, C, and D, respectively.

Throughput and Latency. With a data size of 20 GB, ProckStore maintains its performance advantage in the Load phase. Compared with SocketRocksDB and PStore, ProckStore's throughput is improved by 3.44× and 1.58×, respectively, in the Load phase. Both the average latency and the throughput of ProckStore are the best in Figs. 13(a) and 13(c). Under read-intensive workloads such as B and C, ProckStore outperforms SocketRocksDB by about 9.1% and 21.4%, respectively. This improvement is attributed to the triple-filtering compaction, which reduces execution time in the Run phase, thereby increasing throughput.

As shown in Fig. 13(c), under write-intensive workload A, the average latency of ProckStore is 17.6% and 18.9% lower than that of TStore and PStore, respectively. ProckStore shows a similar trend under workload F. In addition, ProckStore's latency is also reduced by 16.9%, 14.1%, and 22.2% under read-intensive workloads B, C, and D, respectively, compared with SocketRocksDB. However, compared with the 10 GB data volume, the latency increases due to the additional compaction operations and the associated lookup costs.

CPU Utilization and Compaction Bandwidth. Figs. 13(d) and 13(e) illustrate the changes in CPU utilization for ProckStore under various workloads when the data volume increases from 10 GB to 20 GB. ProckStore increases host- and device-side CPU utilization by 18.1% and 32.6%, respectively, compared with PStore under workload C. Under mixed read–write workloads, such as A and F, ProckStore increases host-side CPU utilization by 6.7% and 20.0% and device-side CPU utilization by 12.2% and 13.2%, respectively. Fig. 13(f) shows the compaction bandwidth of the Run phase. ProckStore achieves the highest bandwidth under all workloads. In workload C, ProckStore's bandwidth is 45.4% and 35.1% higher than that of PStore and SocketRocksDB, respectively. This improvement is attributed to ProckStore's utilization of multi-threaded parallelism. However, with the large data volume, ProckStore's bandwidth decreases by 17.7% compared to the 10 GB data volume. Under workload D, ProckStore's average bandwidth is 6.47× and 1.38× higher than that of TStore and SocketRocksDB, respectively. In comparison to the 10 GB data volume, the CPU utilization decreases by 38.1%, 32.3%, 17.7%, 33.7%, 30.9%, and 33.1% under workloads A, B, C, D, E, and F, respectively.

Fig. 14. The tail latency of ProckStore under YCSB-C with load 10 GB and run 10 GB data volume.

Fig. 15. The tail latency of ProckStore under YCSB-C with load 20 GB and run 20 GB data volume.

5.3.3. Tail latency

We analyzed the tail latency of ProckStore, including the P90, P99, and P999 latencies, and compared it with TStore, SocketRocksDB, and PStore under workloads of different data volumes (10 GB, 20 GB) and a 1-KB value size. The experimental results are shown in Figs. 14 and 15.

The results demonstrate that ProckStore outperforms the other key-value stores, exhibiting lower tail latency. SocketRocksDB's P90 and P99 tail latencies are notably higher than those of the other KV stores, due to the multi-version management mechanism in RocksDB. ProckStore's P90 and P99 tail latencies are lower than those of the other KV stores thanks to its asynchronous allocation method, which reduces tail latency. Under the 10 GB workload, the most significant reduction occurs when ProckStore lowers P90 latency by 94.07% and 93.89% compared to TStore and PStore under workload E. This improvement is attributed to ProckStore's superior range query performance, while TStore and PStore are not optimized for range queries. Similarly, ProckStore achieves the lowest P99 latency. Fig. 14(b) shows that ProckStore achieves the most significant P99 latency reduction under workload E, lowering P99 by 79.4% and 79.2% compared to TStore and PStore, respectively. It also shows substantial improvements under workload B, with ProckStore reducing P99 by 75.6% and 76.2% compared to TStore and PStore.

In Fig. 15, the differences in tail latency become more pronounced under the 20 GB workload. ProckStore reduces P90 latency by 9.32% and 31.06% under workloads A and F, respectively, compared to PStore. Under workload E, ProckStore reduces P90 latency by 93.46% and 93.68% compared to PStore and TStore, respectively. ProckStore's four-level priority scheduling mechanism prevents low-priority requests from blocking high-priority writes, reducing the extreme write latency that is often caused by blocking flush or compaction in TStore and SocketRocksDB. Similarly, ProckStore reduces P99 tail latency under workloads A and F by 28.9% and 17.9%, respectively, compared to SocketRocksDB and TStore. Under workload C, ProckStore reduces P99 tail latency by 54.9% and 9.45% compared to PStore and SocketRocksDB, respectively. Under workload D, ProckStore reduces P99 tail latency by 23.5% and 6.0% compared to the same alternative KV stores. ProckStore performs best under workload E, reducing P99 tail latency by 79.22%, 78.69%, and 6.23% compared to TStore, PStore, and SocketRocksDB, respectively.

The FIFO scheduling used by traditional KV stores like SocketRocksDB can cause high-priority requests to be blocked, leading to increased tail latency. In contrast, ProckStore's multi-level queue scheduling mechanism enables compaction tasks to be executed in priority order, with high-priority compaction tasks executed first, thereby reducing tail latency.

6. Extended experiment

In this section, we study the impact of multi-threading and the number of subtasks on ProckStore's performance. The results demonstrate the effectiveness of ProckStore under multi-threading and verify its performance under multiple subtask counts. The environment of the extended experiment is the same as the experimental configuration in Section 4.

Fig. 16. Write performance of ProckStore under DB_Bench with different numbers of subtasks.

6.1. Impact of number of subtasks

To validate the fourth-level prioritization, we conducted experiments to evaluate the impact of various subtasks on the write performance of ProckStore. The extended experiments replicate the configuration from Section 4. We configured DB_Bench with a 10 GB dataset and a 1-KB value. Specifically, we examine the impact of the number of subtasks on the fourth-level prioritization in ProckStore by configuring four types of subtasks on the host. The experimental results are shown in Fig. 16.

As shown in Fig. 16(a), the throughput of ProckStore increases significantly with the number of subtasks. The throughput is 2.33 MB/s, 2.81 MB/s, and 3.13 MB/s for one, two, and three subtasks, respectively, and subsequently stabilizes. With four subtasks, ProckStore achieves a peak throughput of 3.8 MB/s. The average latency shows a similar trend: ProckStore achieves the lowest latency (0.22 ms) with four subtasks, showing a 17.1% improvement from three to four subtasks. The host-side CPU utilization also reflects ProckStore's performance with different numbers of subtasks, as multi-core CPUs enable parallel execution of multiple threads.

As shown in Fig. 16(c), CPU utilization increases with the number of subtasks, allowing the CPU to utilize its computational resources fully. CPU utilization is 4.78% (the lowest) with one subtask, improving by 10.3% with two subtasks. The highest CPU utilization (6.84%) occurs with four subtasks. However, as the number of subtasks increases, the performance improvements in CPU utilization, throughput, and average latency become less pronounced. This is because while parallel execution of multiple threads reduces compaction execution time, the overhead from thread creation and synchronization increases. As the number of threads grows, this additional CPU overhead impacts ProckStore's CPU utilization.

6.2. Impact of number of threads

In Section 2.2, we studied the performance of PStore with different numbers of threads. For the multi-threaded comparison experiment of ProckStore, we extended the analysis by comparing its throughput with that of PStore under varying thread counts. The experimental results are shown in Fig. 17.

Fig. 17(a) shows the throughput of ProckStore and PStore under workloads with 4-KB values and a 10 GB data volume. As the number of threads increases, the throughput of PStore does not increase exponentially, and its performance is poor during multi-threaded writes. Specifically, the throughput increases by only 1.58% when the number of threads rises from 8 to 12. In contrast, the throughput of ProckStore increases significantly with the number of threads. At 12 threads, the throughput reaches 7.86 MB/s, which is 10.9% higher than that at 8 threads. ProckStore's throughput is 144.1% higher than PStore's, as its multi-threaded execution efficiently processes the large data volume written by multiple threads, avoiding the computational limitations of single-thread execution in PStore.

As shown in Fig. 17(b), the average latency of PStore decreases with the increase in threads under the 10 GB data volume workload.
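The write-amplification metric used throughout this evaluation follows the standard definition: the total bytes written to the device (flush, write-ahead-log, and compaction traffic) divided by the bytes the application logically wrote, so WA = 1.20 means only 20% extra device writes. A minimal sketch of the computation; the parameter names are illustrative and not taken from ProckStore's code:

```python
def write_amplification(user_bytes, flush_bytes, wal_bytes, compaction_bytes):
    """WA = total bytes the device wrote / bytes the application wrote."""
    if user_bytes == 0:
        raise ValueError("no user writes recorded")
    return (flush_bytes + wal_bytes + compaction_bytes) / user_bytes

# Example: 10 units of user data causing 12 units of device writes -> WA = 1.2
wa = write_amplification(user_bytes=10, flush_bytes=10, wal_bytes=1, compaction_bytes=1)
```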
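The P90/P99/P999 figures reported in this section are tail percentiles of the per-operation latency distribution. A stdlib-only sketch using the nearest-rank method (YCSB's own histogram-based reporting may differ in interpolation details):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample >= p percent of the data."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]

# A trace with two slow outliers: the tail, not the mean, exposes them.
latencies_us = [95, 98, 102, 105, 110, 120, 125, 130, 3000, 5000]
p90 = percentile(latencies_us, 90)  # dominated by the outliers
p99 = percentile(latencies_us, 99)
```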
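Splitting one compaction into independent, non-overlapping subtasks is what lets a pool of worker threads use multiple cores. A hedged sketch of the idea; the key-range partitioning and merge routine below are simplified placeholders, not ProckStore's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def split_by_key_range(pairs, n):
    """Partition sorted KV pairs into up to n contiguous, non-overlapping chunks."""
    step = max(1, -(-len(pairs) // n))  # ceiling division
    return [pairs[i:i + step] for i in range(0, len(pairs), step)]

def merge_subtask(chunk):
    """Stand-in for merging one key range: keep the newest version per key."""
    newest = {}
    for key, value in chunk:
        newest[key] = value
    return sorted(newest.items())

def parallel_compaction(pairs, subtasks=4):
    chunks = split_by_key_range(sorted(pairs), subtasks)
    with ThreadPoolExecutor(max_workers=subtasks) as pool:
        merged = list(pool.map(merge_subtask, chunks))
    return [kv for chunk in merged for kv in chunk]
```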
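The contrast between FIFO and multi-level priority scheduling can be sketched with a small min-heap. The four levels below are illustrative stand-ins for ProckStore's priority classes, not its actual task encoding:

```python
import heapq

# Lower number = higher priority; the level names are assumptions for illustration.
PRIORITY = {"flush": 0, "l0_compaction": 1, "deep_compaction": 2, "distribution": 3}

class CompactionScheduler:
    def __init__(self):
        self._queue = []  # min-heap of (priority, seq, task)
        self._seq = 0     # tie-breaker keeps FIFO order within one level

    def submit(self, kind, task):
        heapq.heappush(self._queue, (PRIORITY[kind], self._seq, task))
        self._seq += 1

    def next_task(self):
        return heapq.heappop(self._queue)[2]

sched = CompactionScheduler()
sched.submit("deep_compaction", "merge L2->L3")
sched.submit("flush", "flush memtable #7")
sched.submit("l0_compaction", "merge L0->L1")
first = sched.next_task()  # the flush jumps ahead of the earlier compactions
```

Under plain FIFO the two compactions would run first and the flush, which gates foreground writes, would wait; the priority queue inverts that, which is the effect the tail-latency numbers above reflect.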
Fig. 17. Write performance of ProckStore under DB_Bench with different numbers of threads.
However, the decrease in latency is more significant when the number of threads is low; for example, from 1 to 4 threads, the latency drops by 27.8%. For ProckStore, as the number of threads increases, performance improves steadily. At 4 threads, the average latency of ProckStore is 0.154 ms, and at 12 threads it falls to 0.104 ms, a 32.3% reduction. Fig. 17(c) shows that when the number of threads reaches 12, the host-side CPU utilization of PStore and ProckStore is highest, at 8.74% and 11.62%, respectively, with ProckStore showing a 41.92% increase over PStore. Additionally, the CPU utilization of the two systems increases by 7.28% and 10.7%, respectively, when the number of threads increases from 8 to 12. As the number of threads decreases, CPU utilization also drops. With 1 thread, the CPU utilization is at its lowest: 6.48% for PStore and 9.37% for ProckStore.

7. Related work

The LSM-tree has become a popular data structure in key-value storage systems, offering an alternative to traditional structures by efficiently handling write-intensive workloads and large-scale datasets. Although KV stores manage data through compaction operations, these processes come at the cost of performance. Consequently, several studies have sought to mitigate the performance impact of compaction in KV stores.

LSM-tree structure. PebblesDB [13] introduces the FLSM data structure, which relaxes the requirement of non-overlapping key ranges within a level, thereby delaying the compaction process and reducing WA. WiscKey [14] separates keys and values to minimize WA during compaction but increases garbage collection overhead. To address this issue, HashKV [15] employs hash partitioning and a hot/cold partitioning strategy, while DiffKV [16] separates keys based on the size of key-value pairs to balance performance. FenceKV [17] enhances HashKV by incorporating a fence-value-based partitioning strategy and key-range-based garbage collection, optimizing range queries. FGKV [18] and Spooky [19] reduce WA by adjusting the data granularity in compaction. FGKV introduces a fine-grained compaction mechanism based on the LSpM-tree structure, minimizing redundant writes of irrelevant data. Spooky partitions the data at the largest level into equal-sized files and partitions the smaller levels according to file boundaries for fine-grained compaction.

For compaction strategies, TRIAD [20] improves LSM-tree performance by optimizing logs, memory, and storage. The work in [21,22] optimizes the traditional top-level-driven compaction of LSM-trees by shifting to a lower-level-driven approach, decomposing large compaction tasks into smaller ones to reduce granularity. WipDB [23] utilizes a bucket-sort-like algorithm that minimizes merge operations by writing KV pairs in an approximately sorted list. Although these studies enhance compaction efficiency, they primarily focus on a single device and fail to address the competition for CPU and I/O resources between foreground requests and background tasks. In contrast, NDP devices expand computational resources to process tasks internally, reducing data transfer and resource contention.

Storage architecture. ListDB [24] employs a skip-list as the core data structure at all levels within non-volatile memory (NVM) or persistent memory (PM) to mitigate the WA problem by leveraging byte-addressable in-place merge ordering. This approach reduces the gap between DRAM and NVM write latency and addresses the write stall issue. HiKV [25] utilizes the benefits of hash and B+-tree indexes to design a KV store on hybrid DRAM-NVM storage systems, where hash indexes in NVM are used to enhance indexing performance. In a hybrid NVM-SSD system, WaLSM [26] tackles the WA problem through virtual partitioning, dividing the key space during compaction. Additionally, a reinforcement-learning method is applied to balance the merging strategy of different partitions under various workloads, optimizing read and write performance. TrieKV [27] integrates DRAM, PM, and disk into a unified storage system, utilizing a trie-structured index for all KV pairs in memory, enabling dynamic determination of KV pair locations across storage hierarchies and persistence requirements. Moreover, ROCKSMASH [28] utilizes local storage for frequently accessed data and metadata, while cloud storage is employed for less frequently accessed data.

Computing architecture. Heterogeneous computing [29] (e.g., GPUs, DPUs, and FPGAs) alleviates the computational burden on the CPU. Sun et al. [30] propose an accelerated solution for key-value stores by offloading the compaction task to an FPGA. Similarly, the FPGA-accelerated KV store [31] offloads the compaction task to the FPGA, minimizing competition for CPU resources and accelerating compaction while reducing CPU bottlenecks. LUDA [32] employs GPUs to process SSTables using a co-ordering mechanism that minimizes data movement, thereby reducing CPU pressure. gLSM [33] separates keys and values to minimize data transfer between the CPU and GPU, thereby accelerating compaction. dCompaction [34] leverages DPUs to accelerate the compaction and decompaction of SSTables, offloading compaction tasks to the DPU according to a hierarchical structure, relieving CPU overload. Despite these advances, heterogeneous computing still requires data transfer from host-side memory to the computing units, which can impact overall system performance.

Near-data processing (NDP), which offloads computational tasks from the CPU to the data location, is an emerging computing paradigm. Previous studies [35] investigated storage computing and proposed frameworks for storage- and memory-level processing. Biscuit [36] introduces a generalized framework for NDP. RFNS [37] examines the advantages of reconfigurable NDP-driven servers based on ARM and FPGA architectures for data- and compute-intensive applications. 𝜆-IO [38] designs a unified computational storage stack to manage storage and computing resources through interfaces, runtime systems, and scheduling. HuFu [39] is an I/O scheduling architecture for computable SSDs that allows the system to manage background I/O tasks, offload computational tasks to SSDs, and exploit the parallelism and idle time of flash memory for improved task scheduling. Li et al. [40] address the resource contention problem between user I/O and NDP requests, using the critical path to maximize the parallelism of multiple requests, thereby improving the performance of hybrid NDP–user I/O workflows.
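The diminishing returns beyond a few threads are consistent with Amdahl's law, with synchronization and thread-management overhead acting as the serial fraction. The model below is our illustration, not a calculation from the measured data:

```python
def amdahl_speedup(parallel_fraction, threads):
    """Upper bound on speedup when a fixed fraction of the work stays serial."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / threads)

# If roughly 80% of the compaction work parallelizes, going from 8 to 12
# threads buys little: the serial 20% (synchronization, thread management)
# increasingly dominates the bound.
speedups = {n: round(amdahl_speedup(0.8, n), 2) for n in (1, 4, 8, 12)}
```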
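Key-value separation, introduced by the WiscKey line of work, keeps only keys and value locations in the LSM-tree and appends values to a separate log, so compaction rewrites small index entries instead of full values. A minimal in-memory sketch (the value-log layout and garbage collection are elided):

```python
class KVSeparatedStore:
    """Keys and value offsets live in the index; values live in an append-only log."""

    def __init__(self):
        self.vlog = []   # append-only value log (stand-in for a file)
        self.index = {}  # LSM-tree stand-in: key -> offset into vlog

    def put(self, key, value):
        self.index[key] = len(self.vlog)  # compaction would move only this pointer
        self.vlog.append(value)

    def get(self, key):
        return self.vlog[self.index[key]]  # one extra hop: index, then log

store = KVSeparatedStore()
store.put("user42", b"profile-blob")
store.put("user42", b"profile-blob-v2")  # the old value becomes log garbage
```

The trade-off named in the text is visible here: updates leave stale entries in the log, so a separate garbage-collection pass is needed to reclaim them.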
ABNDP [41] leverages a novel hardware–software collaborative optimization approach to solve the challenges of remote data access and computational load balancing without requiring trade-offs.

In addition, hosts and NDP devices employ distinct task scheduling policies to collaborate on compaction tasks [9,10,42]. nKV [43] defines data formats and layouts for computable storage devices and designs both hardware and software architectures to optimize data placement and computation. KV-CSD [44] builds NDP architectures using NVMe SSDs and system-on-chip designs to reduce data movement during queries by offloading tasks. Research such as OI-RAID [45] introduces an additional fault tolerance mechanism by adding an extra level on top of the RAID levels, enabling fast recovery and enhanced reliability. KVRAID [46] utilizes logical-to-physical key conversion to pack similar-sized KV pairs into a single physical object, thereby reducing WA, and applies off-site update techniques to mitigate I/O amplification. Distributed storage systems, such as EdgeKV, have also been explored [47]. A sharding strategy is used to distribute data across multiple edge nodes, while consistent hashing ensures balanced data distribution and high availability. ERKV [48] integrates a hybrid fault-tolerant design combining erasure coding and PBR, providing fault tolerance to ensure system reliability and high availability. Additionally, Song et al. [49] coupled each SSD with a dedicated NDP engine in an NDP server to fully leverage the data transfer bandwidth of SSD arrays. MStore [50] extends an NDP device to multiple devices, utilizing them to perform compaction tasks.

Although NDP devices can handle host-side computational tasks, their resources remain limited. Consequently, it is critical to optimize the use of these resources on the NDP device. The multi-threaded asynchronous method in ProckStore addresses this challenge by fully utilizing computation on both the host and device sides, avoiding resource wastage while ensuring sufficient computational capacity on the NDP device.

8. Conclusions

In this paper, we present ProckStore, an NDP-empowered KV store, to improve compaction performance for large-scale unstructured data storage. In ProckStore, the multi-threaded and asynchronous mechanism leverages computational resources within storage devices, reducing data movement and enhancing compaction efficiency. ProckStore optimally schedules compaction tasks across the host and NDP device by implementing a four-level priority scheduling mechanism. This separation of compaction stages provides parallel processing without interference, achieving efficient resource utilization. In addition, ProckStore uses key-value separation to reduce data transfer between the host and NDP device, minimizing transmission time. Experimental results unveil that ProckStore outperforms existing synchronous and single-threaded asynchronous NDP-empowered KV stores, achieving up to 4.2× higher throughput than the baseline KV store. ProckStore also reduces WA, compaction time, and CPU utilization.

CRediT authorship contribution statement

Hui Sun: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Chao Zhao: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Yinliang Yue: Validation, Supervision, Software. Xiao Qin: Supervision, Resources, Methodology, Formal analysis, Data curation.

Declaration of competing interest

The authors declare that there is no conflict of interest regarding the publication of this article.

References

[1] Z. Zhang, Y. Sheng, T. Zhou, et al., H2O: Heavy-hitter oracle for efficient generative inference of large language models, in: Advances in Neural Information Processing Systems, vol. 36, 2024.
[2] H. Lin, Z. Wang, S. Qi, et al., Building a high-performance graph storage on top of tree-structured key-value stores, Big Data Min. Anal. 7 (1) (2023) 156–170.
[3] S. Pei, J. Yang, Q. Yang, REGISTOR: A platform for unstructured data processing inside SSD storage, ACM Trans. Storage (TOS) 15 (1) (2019) 1–24.
[4] IDC, IDC innovators: Privacy-preserving computation, 2023, [EB/OL]. (2023-09-20). https://www.idc.com/getdoc.jsp?containerId=prCHC51469323.
[5] P. O'Neil, E. Cheng, D. Gawlick, E. O'Neil, The log-structured merge-tree (LSM-tree), Acta Inform. 33 (4) (1996) 351–385.
[6] Google, LevelDB, 2025, https://leveldb.org/.
[7] Facebook, RocksDB: A persistent key-value store for fast storage environments, 2016, http://rocksdb.org/.
[8] A. Acharya, M. Uysal, J. Saltz, Active disks: Programming model, algorithms and evaluation, Oper. Syst. Rev. 32 (5) (1998) 81–91.
[9] H. Sun, W. Liu, J. Huang, et al., Collaborative compaction optimization system using near-data processing for LSM-tree-based key-value stores, J. Parallel Distrib. Comput. 131 (2019) 29–43.
[10] H. Sun, W. Liu, Z. Qiao, et al., DStore: A holistic key-value store exploring near-data processing and on-demand scheduling for compaction optimization, IEEE Access 6 (2018) 61233–61253.
[11] H. Sun, et al., Asynchronous compaction acceleration scheme for near-data processing-enabled LSM-tree-based KV stores, ACM Trans. Embed. Comput. Syst. 23 (6) (2024) 1–33.
[12] I.K. Nti, et al., A mini-review of machine learning in big data analytics: Applications, challenges, and prospects, Big Data Min. Anal. 5 (2) (2022) 81–97.
[13] P. Raju, R. Kadekodi, V. Chidambaram, et al., PebblesDB: Building key-value stores using fragmented log-structured merge trees, in: Proceedings of the 26th Symposium on Operating Systems Principles, 2017, pp. 497–514.
[14] L. Lu, T.S. Pillai, H. Gopalakrishnan, et al., WiscKey: Separating keys from values in SSD-conscious storage, ACM Trans. Storage (TOS) 13 (1) (2017) 1–28.
[15] H.H.W. Chan, C.J.M. Liang, Y. Li, et al., HashKV: Enabling efficient updates in KV storage via hashing, in: 2018 USENIX Annual Technical Conference, USENIX ATC 18, 2018, pp. 1007–1019.
[16] Y. Li, Z. Liu, P.P.C. Lee, et al., Differentiated key-value storage management for balanced I/O performance, in: 2021 USENIX Annual Technical Conference, USENIX ATC 21, 2021, pp. 673–687.
[17] C. Tang, J. Wan, C. Xie, FenceKV: Enabling efficient range query for key-value separation, IEEE Trans. Parallel Distrib. Syst. 33 (12) (2022) 3375–3386.
[18] H. Sun, G. Chen, Y. Yue, et al., Improving LSM-tree based key-value stores with fine-grained compaction mechanism, IEEE Trans. Cloud Comput. (2023).
[19] N. Dayan, T. Weiss, S. Dashevsky, et al., Spooky: Granulating LSM-tree compactions correctly, in: Proceedings of the VLDB Endowment, Vol. 15, (11) 2022, pp. 3071–3084.
[20] O. Balmau, D. Didona, R. Guerraoui, et al., TRIAD: Creating synergies between memory, disk and log in log structured key-value stores, in: 2017 USENIX Annual Technical Conference, USENIX ATC 17, 2017, pp. 363–375.
[21] Y. Chai, Y. Chai, X. Wang, et al., LDC: A lower-level driven compaction method to optimize SSD-oriented key-value stores, in: 2019 IEEE 35th International Conference on Data Engineering, ICDE, 2019, pp. 722–733.
[22] Y. Chai, Y. Chai, X. Wang, et al., Adaptive lower-level driven compaction to optimize LSM-tree key-value stores, IEEE Trans. Knowl. Data Eng. 34 (6) (2020) 2595–2609.
[23] X. Zhao, S. Jiang, X. Wu, WipDB: A write-in-place key-value store that mimics bucket sort, in: 2021 IEEE 37th International Conference on Data Engineering, ICDE, 2021, pp. 1404–1415.
[24] W. Kim, C. Park, D. Kim, et al., ListDB: Union of write-ahead logs and persistent SkipLists for incremental checkpointing on persistent memory, in: 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 161–177.
[25] F. Xia, D. Jiang, J. Xiong, et al., HiKV: A hybrid index key-value store for DRAM-NVM memory systems, in: 2017 USENIX Annual Technical Conference, USENIX ATC 17, 2017, pp. 349–362.
[26] L. Chen, R. Chen, C. Yang, et al., Workload-aware log-structured merge key-value store for NVM-SSD hybrid storage, in: 2023 IEEE 39th International Conference on Data Engineering, ICDE, 2023, pp. 2207–2219.
[27] H. Sun, et al., TrieKV: A high-performance key-value store design with memory as its first-class citizen, IEEE Trans. Parallel Distrib. Syst. (2024).
[28] P. Xu, N. Zhao, J. Wan, et al., Building a fast and efficient LSM-tree store by integrating local storage with cloud storage, ACM Trans. Archit. Code Optim. (TACO) 19 (3) (2022) 1–26.
[29] H. Zhou, Y. Chen, L. Cui, G. Wang, X. Liu, A GPU-accelerated compaction strategy for LSM-based key-value store system, in: The 38th International Conference on Massive Storage Systems and Technology, 2024, pp. 1–11.
[30] X. Sun, J. Yu, Z. Zhou, et al., FPGA-based compaction engine for accelerating LSM-tree key-value stores, in: 2020 IEEE 36th International Conference on Data Engineering, ICDE, 2020, pp. 1261–1272.
[31] T. Zhang, J. Wang, X. Cheng, et al., FPGA-accelerated compactions for LSM-based key-value store, in: 18th USENIX Conference on File and Storage Technologies, FAST 20, 2020, pp. 225–237.
[32] P. Xu, J. Wan, P. Huang, et al., LUDA: Boost LSM key value store compactions with GPUs, 2020, arXiv preprint arXiv:2004.03054.
[33] H. Sun, J. Xu, X. Jiang, et al., gLSM: Using GPGPU to accelerate compactions in LSM-tree-based key-value stores, ACM Trans. Storage (2023).
[34] C. Ding, J. Zhou, J. Wan, et al., Dcomp: Efficient offload of LSM-tree compaction with data processing units, in: Proceedings of the 52nd International Conference on Parallel Processing, 2023, pp. 233–243.
[35] E. Riedel, G. Gibson, C. Faloutsos, Active storage for large-scale data mining and multimedia applications, in: Proceedings of 24th Conference on Very Large Databases, 1998, pp. 62–73.
[36] B. Gu, A.S. Yoon, D.H. Bae, et al., Biscuit: A framework for near-data processing of big data workloads, ACM SIGARCH Comput. Archit. News 44 (3) (2016) 153–165.
[37] X. Song, T. Xie, S. Fischer, Two reconfigurable NDP servers: Understanding the impact of near-data processing on data center applications, ACM Trans. Storage (TOS) 17 (4) (2021) 1–27.
[38] Z. Yang, Y. Lu, X. Liao, et al., 𝜆-IO: A unified IO stack for computational storage, in: 21st USENIX Conference on File and Storage Technologies, FAST 23, 2023, pp. 347–362.
[39] Y. Wang, Y. Zhou, F. Wu, et al., Holistic and opportunistic scheduling of background I/Os in flash-based SSDs, IEEE Trans. Comput. (2023).
[40] J. Li, X. Chen, D. Liu, et al., Horae: A hybrid I/O request scheduling technique for near-data processing-based SSD, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 41 (11) (2022) 3803–3813.
[41] B. Tian, Q. Chen, M. Gao, ABNDP: Co-optimizing data access and load balance in near-data processing, in: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol. 3, 2023, pp. 3–17.
[42] H. Sun, W. Liu, J. Huang, et al., Near-data processing-enabled and time-aware compaction optimization for LSM-tree-based key-value stores, in: Proceedings of the 48th International Conference on Parallel Processing, 2019, pp. 1–11.
[43] T. Vincon, A. Bernhardt, I. Petrov, et al., nKV: near-data processing with KV-stores on native computational storage, in: Proceedings of the 16th International Workshop on Data Management on New Hardware, 2020, pp. 1–11.
[44] I. Park, Q. Zheng, D. Manno, et al., KV-CSD: A hardware-accelerated key-value store for data-intensive applications, in: 2023 IEEE International Conference on Cluster Computing, CLUSTER, 2023, pp. 132–144.
[45] N. Wang, Y. Xu, Y. Li, et al., OI-RAID: a two-layer RAID architecture towards fast recovery and high reliability, in: 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN, 2016, pp. 61–72.
[46] M. Qin, A.L.N. Reddy, P.V. Gratz, et al., KVRAID: high performance, write efficient, update friendly erasure coding scheme for KV-SSDs, in: Proceedings of the 14th ACM International Conference on Systems and Storage, 2021, pp. 1–12.
[47] K. Sonbol, Ö. Özkasap, I. Al-Oqily, et al., EdgeKV: Decentralized, scalable, and consistent storage for the edge, J. Parallel Distrib. Comput. 144 (2020) 28–40.
[48] Y. Geng, J. Luo, G. Wang, et al., Er-kv: High performance hybrid fault-tolerant key-value store, in: 2021 IEEE 23rd International Conference on High Performance Computing & Communications; 7th International Conference on Data Science & Systems; 19th International Conference on Smart City; 7th International Conference on Dependability in Sensor, Cloud & Big Data Systems & Application, HPCC/DSS/SmartCity/DependSys, 2021, pp. 179–188.
[49] X. Song, T. Xie, S. Fischer, A near-data processing server architecture and its impact on data center applications, in: High Performance Computing: 34th International Conference, ISC High Performance 2019, Frankfurt/Main, Germany, June 16–20, 2019, Proceedings 34, Springer International Publishing, 2019, pp. 81–98.
[50] H. Sun, Q. Wang, Y.L. Yue, et al., A storage computing architecture with multiple NDP devices for accelerating compaction performance in LSM-tree based KV stores, J. Syst. Archit. 130 (2022) 102681.