Journal of Systems Architecture 160 (2025) 103342
Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/sysarc

ProckStore: An NDP-empowered key-value store with an asynchronous and multi-threaded compaction scheme for optimized performance✩

Hui Sun a,∗, Chao Zhao a, Yinliang Yue b, Xiao Qin c
a Anhui University, Jiu long road 111, Hefei, 230601, Anhui, China
b Zhongguancun Laboratory, Cuihu North Road 2, Beijing, 100094, China
c Auburn University, The Quad Center Auburn, Auburn, 36849, AL, USA

Keywords: Near-data processing (NDP); LSM-tree; Asynchronous multi-threaded compaction; Write amplification; Key-value separation

Abstract: With the exponential growth of large-scale unstructured data, LSM-tree-based key-value (KV) stores have become increasingly prevalent in storage systems. However, KV stores face challenges during compaction, particularly when merging and reorganizing SSTables, which leads to high I/O bandwidth consumption and performance degradation due to frequent data migration. Near-data processing (NDP) techniques, which integrate computational units within storage devices, alleviate the data-movement bottleneck to the CPU. The NDP framework is therefore a promising solution to the compaction challenges in KV stores. In this paper, we propose ProckStore, an NDP-enhanced KV store that employs an asynchronous and multi-threaded compaction scheme. ProckStore incorporates a multi-threaded model with a four-level priority scheduling mechanism covering the compaction stages of triggering, selection, execution, and distribution, thereby minimizing task interference and optimizing scheduling efficiency. To reduce write amplification, ProckStore utilizes a triple-level filtering compaction strategy that minimizes unnecessary writes. Additionally, ProckStore adopts a key-value separation approach to reduce data transmission overhead during host-side compaction.
Implemented as an extension of RocksDB on an NDP platform, ProckStore demonstrates significant performance improvements in practical applications. Experimental results indicate a 1.6× throughput increase over the single-threaded asynchronous model and a 4.2× improvement compared with synchronous schemes.

✩ This work is supported in part by the National Natural Science Foundation of China under Grants 62472002 and 62072001. Xiao Qin's work is supported by the U.S. National Science Foundation (Grants IIS-1618669 and OAC-1642133), the National Aeronautics and Space Administration, United States (Grant 80NSSC20M0044), the National Highway Traffic Safety Administration, United States (Grant 451861-19158), and Wright Media, LLC (Grants 240250 and 240311).
∗ Corresponding author. E-mail addresses: sunhui@ahu.edu.cn (H. Sun), chaozh@stu.ahu.edu.cn (C. Zhao), yylhust@qq.com (Y. Yue), xqin@auburn.edu (X. Qin).
https://doi.org/10.1016/j.sysarc.2025.103342
Received 31 October 2024; Received in revised form 30 December 2024; Accepted 11 January 2025; Available online 24 January 2025.
1383-7621/© 2025 Published by Elsevier B.V.

1. Introduction

The rapid development of large language models [1], graph databases [2], and social networks [3] has led to the real-time generation of large amounts of data, contributing to a global surge in large-scale data. This data is growing exponentially and is increasingly manifested in semi-structured and unstructured formats, in addition to traditional structured data. For example, semi-structured and unstructured data have grown rapidly in recent years according to IDC [4], and they now account for over 85% of total data volume. To cope with the large amount of unstructured data, LSM-tree-based key-value stores (KV stores) [5] have become widely adopted in large-scale storage systems. LSM-tree structures are popular in modern database engines (e.g., LevelDB [6] and RocksDB [7]).

In the LSM-tree structure, key-value pairs are first written to an immutable MemTable in memory and then persisted to disk as Sorted String Tables (SSTables) once a preset threshold is reached. On disk, the LSM-tree is organized hierarchically, with each level having a capacity threshold that increases at a fixed rate as the level number grows. When the amount of data in a level exceeds its threshold, some data migrates to lower levels, potentially causing overlapping key ranges between SSTables in different levels. To maintain data organization and prevent duplication, SSTables with overlapping key ranges must be loaded into memory and merged. The sorted and de-duplicated key-value pairs are then rewritten as new SSTables at a lower level. This process, known as compaction, involves frequent read and write operations that consume substantial I/O bandwidth between the host and storage devices, thereby delaying foreground requests and degrading system performance.

General-purpose graphics processing units (GPGPUs), data processing units (DPUs), and field-programmable gate arrays (FPGAs) offer additional computational resources to address compaction performance challenges. Near-Data Processing (NDP), introduced in the late 1990s as the ''smart disk'' [8], has regained attention as an emerging computational paradigm. The growing computational power within storage devices has fueled interest in NDP, which mitigates the overhead of moving data to the CPU. The NDP paradigm advocates ''computation close to data'' as an alternative to the computation-centered approach in large-scale systems. This model enables storage devices to use their internal bus for data processing rather than transferring data to the host, where the results would otherwise be computed. Most existing NDP-empowered KV stores, such as Co-KV [9] and TStore [10], tackle compaction tasks using a synchronization-based approach.
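The hierarchical sizing and compaction-triggering behavior described above can be sketched in a few lines. This is a minimal Python illustration; the four-file L0 limit, the 256 MB base capacity, and the tenfold growth factor are illustrative assumptions (RocksDB exposes them as tunable options), not ProckStore's exact parameters:

```python
# Minimal sketch of LSM-tree level sizing and compaction triggering.
# The four-file L0 limit and tenfold growth factor are illustrative
# assumptions; RocksDB exposes them as configurable options.

LEVEL0_FILE_LIMIT = 4                  # L0 is triggered by file count
BASE_LEVEL_BYTES = 256 * 1024 * 1024   # capacity threshold of L1
GROWTH_FACTOR = 10                     # each level is a fixed multiple larger

def level_capacity(level: int) -> int:
    """Capacity threshold (bytes) of level L1..Ln; L0 is counted in files."""
    return BASE_LEVEL_BYTES * GROWTH_FACTOR ** (level - 1)

def needs_compaction(level: int, num_files: int, level_bytes: int) -> bool:
    """A level triggers compaction once it exceeds its threshold."""
    if level == 0:
        return num_files >= LEVEL0_FILE_LIMIT
    return level_bytes > level_capacity(level)
```

With such a policy, each compaction of an overfull level merges its SSTables with the overlapping SSTables of the next level, which is exactly the I/O-heavy step that NDP offloading targets.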
These works split the compaction task, leveraging either average-based partitioning or dynamic time-awareness. In the synchronization model, the host and the device cannot complete tasks simultaneously, leading to long waiting times and inefficient resource usage. PStore [11] addresses the waiting time by using an asynchronous model but fails to fully exploit the benefits of this approach due to its single-threaded processing.

To address these issues, we designed ProckStore, an asynchronous NDP scheme that utilizes a multi-threaded strategy to perform compaction tasks concurrently. All compaction tasks are managed in a thread pool and scheduled using multiple threads, exploiting the benefits of asynchronous processing, where tasks do not interfere with one another; each task is executed independently by an individual thread. A four-level priority scheduling mechanism, following the four stages of the compaction process, ensures efficient scheduling of compaction tasks within the thread pool. To address the write amplification issue, a triple-level filtering compaction method is employed, reducing unnecessary writes and alleviating write amplification during compaction on the host side. Furthermore, the transmission process in the NDP architecture and its compaction module is optimized by a key-value separation technique, minimizing transmission time by sending only the keys to the host.

The contributions of this work are summarized as follows.

▴ We designed ProckStore with an asynchronous and multi-threaded scheme. Compaction tasks are executed independently within the thread pool without interfering with each other, fully leveraging the asynchronous model. This significantly improves write performance compared with the synchronous mode and the single-threaded asynchronous scheme.

▴ ProckStore employs four-level priority scheduling to manage the compaction process, which consists of four stages: compaction trigger, compaction picking, compaction execution, and compaction distribution. This scheduling prioritizes tasks at different stages, ensuring optimal efficiency during asynchronous and multi-threaded compaction.

▴ To optimize performance in the NDP transmission architecture, we implemented key-value separation in the host-side compaction, reducing data transmission overhead. The device-side compaction module employs a cross-level compaction technique to alleviate computational load, thereby improving transmission efficiency and overall system throughput.

▴ ProckStore, an extension of RocksDB on the NDP platform, was evaluated using DB_Bench and YCSB-C. Experimental results demonstrate that ProckStore increases throughput by a factor of 1.6× compared to the single-threaded asynchronous PStore, and achieves a 4.2× throughput improvement over the synchronous TStore.

The paper is organized as follows. Section 2 presents the background and motivation of ProckStore. Section 3 presents a system overview of ProckStore and details of each module. Section 4 lists the hardware and software configurations used in the experiments. Section 5 demonstrates the performance of ProckStore through extensive experiments. Section 6 elaborates on the extended experiments. Section 7 summarizes related work. Finally, we conclude our work in Section 8.

2. Background and motivation

2.1. Background

RocksDB is an LSM-tree-based key-value store developed by Facebook, and it is widely used in Facebook's storage systems to achieve high throughput. In RocksDB, the MemTable and Immutable MemTable are stored in memory, while the Sorted String Table (SSTable) is stored on disk. As shown in Fig. 1, key-value pairs from the application are first written to the commit log and then cached in a sorted data structure called the MemTable, which has a limited size (e.g., 4 MB) in memory. Once the MemTable reaches its predefined capacity, it is converted into an Immutable MemTable. A background thread then writes the Immutable MemTable to disk as an SSTable. On disk, SSTables are organized in levels, with each level growing by a fixed multiple.

Fig. 1. The structure of LSM-tree and RocksDB. The LSM-tree is composed of components C0, C1, ..., and Cn.

In Fig. 1(a), the hierarchy of the LSM-tree represents different components C0, C1, ..., and Cn. Component C0 resides in memory. New write data is first written into the sequential log file and then inserted as an entry into C0. However, the high cost of the memory that accommodates C0 imposes a limit on its size. To migrate entries to a component on the disk, the LSM-tree performs a merge operation when the size of C0 reaches its threshold, taking a contiguous segment of entries from C0 and merging it into a component on the disk. Components Cn (n ≥ 1) reside on the disk. Although C1 is disk-resident, its frequently accessed page nodes remain in the memory buffer. C1 has a directory structure like a B-tree but is optimized for sequential access on the disk. The in-memory C0 serves high-speed writes, while Cn (n ≥ 1) on the disk is responsible for persistence and batch-sequential writes. Through the hierarchical and merging strategies, the LSM-tree achieves a balance between write optimization and high-efficiency queries.

In Fig. 1(b), the most recently generated SSTable is placed in the lowest level, L0. SSTables in level L0 can have overlapping key ranges,
while higher levels are organized by key ranges. Each level has a size threshold for its total SSTables. When this threshold is exceeded, the KV store migrates SSTables from level Lk to level Lk+1 during compaction. The compaction process selects SSTables from level Lk and searches for overlapping key ranges in level Lk+1. A merge operation is then performed on the SSTables with overlapping key ranges to produce new SSTables, which are stored in level Lk+1. Obsolete SSTables in level Lk+1 are deleted from the disk. This compaction process incurs computational and storage overhead, which negatively impacts response time and throughput, a significant drawback of the LSM-tree.

Graph computing [2], machine learning [12], and large language models [1] demand substantial data for model training and inference. As data volumes grow, the overhead of transferring data from storage devices to the CPU for computation rises, consuming system resources and creating bottlenecks between storage and memory in high-performance systems. Traditional storage architectures struggle to meet the demands of data-intensive applications under these conditions. NDP mitigates this challenge by fully utilizing the device's internal bandwidth. By incorporating embedded computing units, storage devices can perform computational tasks, offloading these operations from the host and eliminating the overhead of moving large data volumes. The results can then be retrieved from the storage device, reducing the need for additional data movement. Furthermore, the KV store can leverage NDP to perform compaction tasks internally, improving compaction efficiency.

2.2. Motivation

Fig. 2. The results of PStore with different numbers of threads (1, 4, 8, and 12) under the Fillrandom DB_Bench workload with various data volumes.

Fig. 3. The results of PStore with different numbers of threads (1, 4, 8, and 12) under the Fillrandom DB_Bench workload with various value sizes.

Most existing studies focus on compaction processing in a single-threaded context. For instance, Co-KV and TStore process compaction tasks synchronously and in a single-threaded mode. PStore, on the other hand, demonstrates its effectiveness in an asynchronous and single-threaded setting. Notably, the asynchronous approach allows compaction tasks to be performed independently; however, it is difficult to fully leverage the benefits of asynchronous processing in a single-threaded environment. Therefore, we investigate the performance of PStore under different thread configurations. Fig. 2(a) presents the throughput of PStore under workloads with 4-KB values and various data volumes. We can draw two key observations.

▵ As the number of threads increases, the throughput of PStore does not grow proportionally as expected. The throughput improvement is minimal during multi-threaded writes, particularly when the number of threads is 12.

▵ With a large number of threads, the throughput of PStore increases slowly. Under 20-GB workloads, when the thread count increases from 8 to 12, the throughput only increases by 0.12 MB/s.

These findings indicate that the asynchronous compaction advantages of PStore in single-threaded mode are insufficient to handle the large volume of multi-threaded writes. As a result, increasing the number of threads does not enhance throughput, particularly as the thread count becomes large. While the asynchronous approach in PStore takes into account the computational imbalance between the host and the NDP device, it fails to implement an appropriate asynchronous compaction method. The limitations of the single-threaded mode hinder the full potential of the asynchronous compaction mechanism in the KV store.

As shown in Fig. 2(b), the average latency decreases under workloads with various data volumes, and the reduction is most pronounced when using a small number of threads. Specifically, the most significant decrease occurs between 1 and 4 threads, where the average latency is reduced by 27.8%. The CPU utilization on the host supports these observations (see Fig. 2(c)): under 10-GB workloads it increases by 34% at 12 threads over 1 thread, and by 19% as the number of threads grows from 1 to 4. These results reveal that PStore is suited to single- or few-threaded workloads and struggles to adapt to multi-threaded applications.

Fig. 4. Overview system of ProckStore. Q1 is the first compaction-task queue. Sub i(0 ...

... the device side handles data, storing the SSTable based on key-range granularity, performing garbage collection, maintaining information, and executing compaction tasks on the files.

Fig. 5. Multilevel Task Queue in ProckStore.

ProckStore triggers compactions according to a per-level score, first_score, given in Eq. (1):

    first_score = (N_sst - N_no_comp - N_being_comp) / N_max, for level L0,
    first_score = (S_sst - S_no_comp - S_being_comp) / S_max, for level Li (i > 0),    (1)

where N_sst and N_max denote the number of SSTables in level L0 and the threshold on that number, respectively; N_no_comp and N_being_comp denote the number of SSTables in level L0 that have been picked into the task queue awaiting execution and the number of SSTables in level L0 contained in compaction tasks currently being executed. Similarly, S_sst and S_max denote the total data volume of SSTables in level i and the threshold on that volume, while S_no_comp and S_being_comp denote the data volume of SSTables included in the queued compaction tasks of level i and in the compaction task currently being executed. Different from the RocksDB score, the SSTables that have been picked into the compaction queue and those involved in running compaction tasks are subtracted, excluding data that will soon leave the level and making the calculation of first_score more accurate.

3.2. Asynchronous mechanisms

To implement the asynchronous strategy, we decouple the two phases, compaction triggering and execution, to establish the conditions for asynchronous compaction. In contrast, the synchronous approach treats the task from compaction trigger to completion as a single process. In the asynchronous mechanism, compaction tasks are continuously generated whenever the triggering conditions are met. These tasks, generated during the trigger phase, must then be executed. To manage them efficiently, we propose a multi-level task queue that stores compaction tasks uniformly until the asynchronous manager schedules them. To align with the structure of the LSM-tree, a compaction task queue is assigned to each level, with tasks generated during the trigger phase placed into the queue of that level, awaiting scheduling. As shown in Fig. 5, ProckStore employs a multi-task queue for each column family, and the multi-task queue selects compaction tasks at each level based on a score value.

When a level triggers a compaction, the generated compaction task is put into the corresponding task queue of that level, and the compaction module waits for its processing (see Fig. 6). In the asynchronous trigger mode, the compaction task is not executed immediately; it waits for the asynchronous compaction manager to schedule it. The device side triggers the compaction task according to the first_score of each level, selecting the maximum value, and places it into the task queue.
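To make Eq. (1) concrete, the following is a minimal Python sketch of the first_score computation and the max-score trigger. The flat argument list and the "score must exceed 1" trigger condition are illustrative simplifications of ProckStore's per-level statistics, not its exact implementation:

```python
# Sketch of first_score (Eq. (1)): SSTables already picked into the task
# queue (no_comp) or currently being compacted (being_comp) are subtracted,
# so the score reflects only data that will actually remain in the level.
# The flat argument list is an illustrative stand-in for per-level stats.

def first_score(level: int, n_sst=0, n_no_comp=0, n_being_comp=0, n_max=4,
                s_sst=0, s_no_comp=0, s_being_comp=0, s_max=1) -> float:
    if level == 0:  # L0 is scored by SSTable count
        return (n_sst - n_no_comp - n_being_comp) / n_max
    # deeper levels are scored by data volume
    return (s_sst - s_no_comp - s_being_comp) / s_max

def pick_trigger_level(scores):
    """First-level priority: trigger the level with the maximum score;
    here we assume a trigger only fires when that score exceeds 1."""
    level = max(range(len(scores)), key=scores.__getitem__)
    return level if scores[level] > 1.0 else None
```

A task generated for the picked level would then be appended to that level's queue and await the asynchronous compaction manager.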
We implement the multi-level task queues in a thread pool. Tasks in each queue are sorted in ascending order by the number of SSTables, using a heap-based algorithm with O(n log n) sorting time. The task queue is a double-ended queue, allowing compaction tasks to be allocated from both ends to the host and device sides. Since multiple tasks are scheduled in the queue, a thread pool is used to manage the pending tasks in the compaction queue.

3.3. Four-level priority scheduling

An asynchronous-mechanism-based compaction procedure separates the two phases, compaction triggering and execution. To coordinate them, we propose a four-level priority scheduling strategy that assigns priority levels to the four steps involved: triggering compaction tasks, generating tasks, allocating tasks, and executing tasks. This strategy ensures efficient execution of the asynchronous compaction process.

Fig. 6. Four-level priority scheduling in ProckStore. The light gold circles represent SSTables selected for compaction at the current level, while the light blue circles denote SSTables selected for compaction at the next level. The dark blue circles indicate SSTables that are not selected for compaction. The yellow rectangles represent newly generated compaction tasks, the light red rectangles signify compaction tasks assigned to the device side, and the dark orange rectangles represent compaction tasks assigned to the host side.

The first level of prioritization concerns compaction triggering. The device triggers the compaction task with the maximum first_score among the levels and puts it into the task queue. This provides the basis for prioritizing compaction triggering and execution between levels.

The second level of prioritization concerns SSTable picking. In the compaction task generation phase, we select some SSTables in a level and all the overlapping SSTables from the following level; these SSTables are merged in the compaction operation. ProckStore puts the picked SSTables into the compaction_queue and then reads the information of the first file to be compacted from the queue. By default, the SSTables in the queue are compacted sequentially, without considering hot and cold data or the size of the compaction task. Therefore, the number of SSTables overlapping with the lower level is added to the FileMetaData of each SSTable. The second_score is set to the number of overlapping SSTables, and the meta-information of each level is sorted in ascending order of second_score, as given in Eq. (2):

    second_score = Overlap_sst, for each SSTable,    (2)

where Overlap_sst denotes the number of overlapping SSTables. The SSTable with the smallest number of overlapping SSTables at the lower level is selected first for compaction. The compaction information and the metadata of SSTables are maintained in a linked list to facilitate insertion and deletion; querying the overlap with the lower level and maintaining the order of the linked list costs O(n) time. In this way, the minimum number of SSTables is selected in each compaction, reducing compaction time.

The third level of prioritization concerns the allocation of compaction tasks. In the allocation stage, we consider the different computational capabilities of the host and the device sides. Moreover, the compaction processing efficiency varies with the configurations of the host and the device and with the data paths of read, write, and transmission. Therefore, it is necessary to select appropriate compaction tasks for both the host and the device. When all the SSTables involved in a compaction are selected, the compaction task information is generated and inserted into the compaction task queue. The compaction task queue of each level is a double-ended priority queue, sorted in ascending order by the number of SSTables in a compaction task and heap-sorted with time complexity O(n log n). Initially, the host obtains tasks from the left end of the queue, which contain fewer SSTables, and the device side obtains tasks from the right end, which contain more SSTables. Both sides record the compaction time.

During the compaction process, the host and the device record the time cost of five consecutive compaction tasks and the data volume of the compacted SSTables. The third_score is the ratio of the time taken for five consecutive compaction tasks to the data volume of the compacted SSTables:

    third_score = T_host_comp / S_host_sst, for the host;
    third_score = T_device_comp / S_device_sst, for the device,    (3)

where S_host_sst and T_host_comp denote the total data volume of compacted SSTables on the host and the corresponding time cost, respectively, and S_device_sst and T_device_comp denote the total data volume of compacted SSTables on the device and the corresponding time cost, respectively. We use the third_score to evaluate the compaction processing capability of both the host and device sides: the larger the third_score,¹ the less efficient the compaction. The side with the higher compaction processing capability handles tasks containing a large number of SSTables from the right end of the queue, while the other side handles tasks from the left end.

¹ A larger third_score indicates that the compaction operation spends more time processing an SSTable.

Comparing third_score_host with third_score_device yields three cases: (1) the score of the host side is greater than that of the device side; (2) the score of the device side is greater than that of the host side; (3) the two scores are equal. In case (1), the default rules for acquiring tasks from the queue remain unchanged. In case (2), the rules are reversed: the device side fetches tasks from the left end of the queue, and the host obtains tasks from the right end. In case (3), the default rules are maintained, and both sides re-record and recalculate the compaction time at one end, then decide according to the comparison results. In a running process, the configurations of the host and device sides do not change, so the task-fetching rules, once decided by this comparison, are not changed again.

The fourth level of prioritization concerns the execution of sub-tasks. After the SSTables are selected, they are integrated into a complete compaction task that reaches the execution stage on the NDP device or the host. The compaction task is decomposed into multiple sub-tasks, which can be executed in parallel on the device. Notably, a sub-task refers to a sub-compaction performed in the multi-threaded compaction mechanism of RocksDB. In a compaction process, the primary thread first executes a sub-task; by default, the first sub-task is assigned to the main thread. The remaining sub-tasks are executed concurrently by newly created sub-threads. Then, the primary thread merges the results and writes them back in a unified manner.

The amount of data and the execution time differ among sub-tasks, so computational resources are underutilized by default. To address this issue, we prioritize the concurrent execution of sub-tasks. Let fourth_score = S_SST, where S_SST denotes the total data volume of SSTables in a sub-task. When dividing the sub-tasks, we compare their data sizes; the sub-task containing the least data is assigned the highest priority. That is, the smaller the fourth_score, the higher the priority, and the highest-priority sub-task is placed into the primary thread for compaction. The compaction execution time can be expressed as

    T_exe = T_pthread + T_subthread,    (4)

where T_exe, T_pthread, and T_subthread represent the overall execution time, the primary-thread execution time, and the sub-thread execution time, respectively. The sub-task with the least execution time is placed into the primary thread to shorten its execution. Notably, the sub-thread execution time is determined by the sub-task with the longest execution time, which this procedure does not affect; the scheme therefore reduces the overall execution time and improves system performance.

3.4. Triple-level filter compaction

The asynchronous compaction method of ProckStore improves compaction performance; however, it aggravates the write amplification problem. Therefore, we propose the mechanism of triple-level filtering compaction (see Fig. 7). In a compaction procedure, triple-level filtering compaction involves SSTables from three levels to remove duplicate data. Triggering the mechanism, however, requires certain conditions to be met. When performing a compaction involving SSTables in levels Li and Li+1, ProckStore checks the first_score value of level Li+1. If the value is greater than 1, the triple-level filtering compaction is triggered, additionally involving the SSTables from level Li+2 that overlap with those from level Li+1. This mechanism helps reduce duplicate writes and alleviates write amplification.

Fig. 7. The triple-level filter compaction in ProckStore.

Fig. 8. The transmission module between the host and the device in ProckStore.

The default compaction method merges the SSTables with overlapping key ranges in levels Li and Li+1 and then writes new SSTables into level Li+1. This process introduces write amplification: the SSTables newly written to level Li+1 may immediately need to be combined with the overlapping-key-range SSTables in level Li+2 to form a new compaction task. These SSTables are merged and the new data is written to level Li+2, causing additional write amplification for data previously written to level Li+1. Consequently, this procedure incurs two instances of write amplification. The triple-level filtering compaction instead combines all the overlapping-key-range SSTables of the three levels in one compaction. The data in level Li is written directly to level Li+2, which eliminates one compaction pass and the write amplification from level Li+1 to Li+2.

As triple-level filtering compaction involves the overlapping-key-range SSTables of levels Li, Li+1, and Li+2, it raises the problem of excessive compaction data. When performing the three-level compaction, some key ranges can exist in all of levels Li, Li+1, and Li+2. Such key ranges can be deleted and filtered out at the intermediate level (i.e., level Li+1) without affecting the update of new keys from level Li to Li+2 or the merging of old keys in level Li+2. At the stage of generating compaction tasks, we mark the duplicate key ranges when picking the overlapping SSTables from the three levels and filter them out of level Li+1 in advance. The newest keys in level Li and the oldest keys in level Li+2 are then retained. This approach reduces the data volume involved in compaction by reducing data redundancy across the three levels, thereby alleviating the issue of excessive compaction data.

As shown in Fig. 7, when level L1 performs compaction, the score of level L2 is greater than 1, satisfying the condition of triple-level filter compaction. Comparing the key ranges of the three levels, ProckStore filters out and deletes the keys that exist in all of them; in Fig. 7, the keys 7, 8, 9, 10, 11, and 13 are filtered out of level L2 before the compaction operation is performed. These keys are placed in the compaction queue, awaiting the asynchronous manager to initiate the compaction. According to Eq. (1), these keys are marked as S_no_comp, causing the first_score values of levels L1 and L2 to drop below 1 due to their subtraction. This results in a reduction of excess data in the level.

3.5. Transmission in ProckStore

In ProckStore, data is transferred between the host and device sides, as shown in Fig. 8. During compaction, a large amount of data is read from the NDP device to the host, which involves transferring many KV pairs and results in significant data migration overhead. To address this issue, we employ key-value separation for host-side compaction, which minimizes data migration between the host and the NDP device, reduces write amplification, and improves compaction performance. In the compaction process, only the keys of the KV pairs are read, sorted, and written back to the NDP device. The key size is less than 1 KB, while the value size exceeds 1 KB. During compaction on the host side, the NDP device transmits only the keys to the host, while the device keeps the values locally. Afterward, the host processes the keys and sends them back to the device, which integrates them into an SSTable. This approach significantly reduces data migration between the CPU and memory on the host side and minimizes the overhead of data transfers between the device and host.

The key-value separation mechanism is implemented during host-side compaction, with the entire KV pair stored on the NDP device.
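The keys-only exchange described above can be sketched as follows. This is a hypothetical Python sketch of the separate/sort/rebuild cycle; the list-based in-memory layout and the helper names are our illustrative assumptions, not ProckStore's on-device format:

```python
# Hypothetical sketch of key-value separation: only keys (tagged with
# value subscripts) cross the host-device link; values never leave the
# device. The list-based layout is an illustrative assumption.

def separate(kv_pairs):
    """Device side: split KV pairs; each key carries the subscript of its
    value so the later lookup is O(1)."""
    keys = [(key, i) for i, (key, _value) in enumerate(kv_pairs)]
    values = [value for _key, value in kv_pairs]
    return keys, values

def host_sort_merge(keys):
    """Host side: sort the (small) key array and return it to the device."""
    return sorted(keys, key=lambda entry: entry[0])

def rebuild_sstable(sorted_keys, values):
    """Device side: reassemble KV pairs in key order via the subscripts."""
    return [(key, values[i]) for key, i in sorted_keys]
```

Because only the key array (plus small subscripts) is transferred while the large values stay resident on the device, the volume of data crossing the host-device link shrinks roughly in proportion to the value-to-key size ratio.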
Based on the compaction information, the device separates the key from the value in the SSTables. The key array stores the address of each value, which is used for the subsequent reorganization. The keys are then sent to the host for a sort-merge operation. Afterward, the compacted keys are sent back to the device, where they are reorganized into new SSTables based on the value addresses. There is one thread for each step: (1) the separation thread on the device, (2) the merge-operation thread on the host, and (3) the key–value reorganization thread on the device. The host-side compaction task is divided into the following three steps:

▵ Step 1: The key–value separation thread in the NDP device retrieves the KV pairs based on the SSTable data format. The keys and values are stored in the corresponding arrays in the NDP device's memory. In the key array, each key records the subscript of its corresponding value, so the time complexity of locating a value in the array is O(1). The value array stays on the NDP device via memory sharing and waits for the sorted key array to be fetched back from the host. The key array is transferred to the host via the host–NDP interface.

▵ Step 2: The host fetches the key array, sorts the individual keys, and sends the sorted key array back to the NDP device for restructuring. All of these steps are organized within a single thread.

▵ Step 3: Upon receiving the new key array, the NDP device finds each key's corresponding value based on its subscript. Simultaneously, the device reconstructs the new SSTables according to the order of the keys. To minimize the data transfer time between the host and the device, the transferred data volume is kept small, and a separate transfer thread ensures that the communication between the host and the device remains unaffected, minimizing transmission latency.
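The three steps can be sketched as follows. This is an illustrative model under our own naming, not ProckStore's implementation: only the key arrays (carrying value subscripts) cross the host–device boundary, and the device rebuilds the SSTable by O(1) subscript lookups.

```python
# Illustrative sketch (our naming, not ProckStore's) of the three-step
# key-value separation pipeline: separate on the device, sort-merge the
# keys on the host, reorganize on the device.

def step1_separate(sstable):
    """Device side: split KV pairs; each key records its value's subscript."""
    keys, values = [], []
    for k, v in sstable:
        keys.append((k, len(values)))  # (key, index into the value array)
        values.append(v)
    return keys, values                # only `keys` crosses to the host

def step2_sort_merge(key_arrays):
    """Host side: sort-merge key arrays; newer tables win on duplicates."""
    merged = {}
    for table_id, keys in enumerate(key_arrays):
        for k, idx in keys:
            merged[k] = (table_id, idx)  # later (newer) tables overwrite
    return sorted(merged.items())

def step3_reorganize(sorted_keys, value_arrays):
    """Device side: rebuild the SSTable; value lookup by subscript is O(1)."""
    return [(k, value_arrays[t][i]) for k, (t, i) in sorted_keys]

k1, v1 = step1_separate([("b", "B-old"), ("a", "A")])
k2, v2 = step1_separate([("b", "B-new"), ("c", "C")])
new_sst = step3_reorganize(step2_sort_merge([k1, k2]), [v1, v2])
# new_sst is sorted by key and keeps the newer version of "b"
```

Because the values never leave the device, the host–device traffic is proportional to the key volume (under 1 KB per pair) rather than the value volume, which is the source of the transfer savings described above.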
As shown in Fig. 8, only the keys (which are reconstructed on the host side) are passed between the host and the device. When the host performs compaction, a compaction request is sent to the device, which then provides the necessary data from the NDP device. SSTables 1 and 2 from level L0 and SSTables 3 and 4 from level L1 are separated. The duplicate keys and their offset addresses are passed to the host, which executes the compaction procedure. After deduplication, the keys are re-transmitted to the NDP device, where they are reorganized into a new SSTable (SSTable 7) in level L1.

By reducing the transmission overhead on the host side, the device reduces the compaction task's time cost, aligning with the requirements of the NDP architecture. At the fourth priority level, the host handles most of the compaction tasks, which contain more SSTables, thereby relieving the device's computational load. However, the NDP device not only processes the values for the host but also handles the KV pairs in its own compaction tasks, which increases the device's processing pressure. To alleviate this, cross-level compaction is employed to reduce the computational strain on the NDP device.

When a compaction process is triggered in level Li and the first_score of level Li+1 exceeds one, cross-level compaction is initiated. This process searches for SSTables with overlapping key ranges in the subsequent level Li+2. Unlike traditional compaction, cross-level compaction in ProckStore continuously searches for overlapping-key-range SSTables in level Li+2. Subsequently, the SSTables from levels Li, Li+1, and Li+2 undergo compaction, and the newly generated SSTables are written to level Li+2.

The trigger selection in level Li follows the priority rules, while the selection of SSTables in levels Li+1 and Li+2 is based on their second_score (see Eq. (2)). SSTables that traditional compaction would write to level Li+1 may be written to level Li+2 through cross-level compaction. This cross-level approach helps balance the SSTable distribution across levels, reducing the number of compaction operations. It introduces a drawback: compaction involving many SSTables increases the compaction time. For the NDP device, however, the data transmission time can be ignored, which reduces the overall compaction time. As illustrated in Fig. 8, SSTables 1 and 2 in level L0, SSTables 3 and 4 in level L1, and SSTables 8 and 9 in level L2 are involved in compaction on the NDP device, and the new data is written into SSTable 13 in level L2. With an asynchronous mechanism, priority scheduling, and optimized data transmission under the NDP-empowered KV store, ProckStore efficiently optimizes the compaction process.
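A compact way to express the trigger logic above is the sketch below. It is ours, not ProckStore's code; `first_score` and `second_score` stand in for the paper's Eqs. (1) and (2), which are not reproduced here.

```python
# Hedged sketch of the cross-level compaction trigger; first_score and
# second_score stand in for the paper's Eqs. (1) and (2).

def select_by(score, sstables):
    """Order a level's SSTables by their second_score, highest first."""
    return sorted(sstables, key=score, reverse=True)

def pick_compaction(levels, i, first_score, second_score):
    """Choose input levels and the target level for a compaction at level i."""
    inputs = [levels[i]]
    if first_score(levels[i + 1]) > 1.0:
        # Cross-level compaction: also pull overlapping SSTables from Li+1
        # and Li+2, and write the merged output directly to Li+2.
        inputs.append(select_by(second_score, levels[i + 1]))
        inputs.append(select_by(second_score, levels[i + 2]))
        return inputs, i + 2
    # Traditional two-level compaction into Li+1.
    inputs.append(levels[i + 1])
    return inputs, i + 1

levels = [["ab", "c"], ["dd"], ["f", "eee"]]  # toy SSTables per level
inputs, target = pick_compaction(levels, 0, lambda lvl: 2.0, len)
# first_score > 1, so all three levels participate and target == 2
```

Writing the output straight to Li+2 is what skips the intermediate Li+1 rewrite, trading a larger single compaction for one fewer compaction pass.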
4. Experimental settings

Platform. We implemented ProckStore based on RocksDB and conducted experiments to assess its performance. To evaluate ProckStore, we constructed a test platform simulating the NDP architecture, where data transfer between the host and the NDP device occurs over Ethernet. Although this platform was used for validation, ProckStore is scalable to real NDP platforms. SocketRocksDB, a version of RocksDB deployed on the NDP collaborative framework, was used as the baseline. TStore, PStore, and ProckStore all share the NDP-empowered storage framework. The experimental platform comprises two subsystems: a host-side subsystem and a device-side NDP subsystem. The host is equipped with an Intel(R) Core(TM) i3-10100 CPU (8 cores) and 16 GB of DRAM, while the NDP device runs on an ARM-based development board with four Cortex-A76 cores, four Cortex-A55 cores, 16 GB of DRAM, and a 256-GB Western Digital SSD. A network cable with a bandwidth of 1000 Mbps connects the host to the NDP device.

The host runs Ubuntu 16.04, and RocksDB version 6.10.2 is employed. The NDP platform uses a lightweight embedded operating system. Data transfer between the host and the NDP device is handled through the SOCKET interface, replacing the standard POSIX interface to ensure efficient data transmission. In RocksDB, the buffer and SSTable sizes are set to 4 MB, the block size is 4 KB, and the level settings remain at their default values. The number of sub-tasks on the host is limited to 4, and all other RocksDB configuration parameters are set to their defaults.

Workloads. We evaluate the performance of ProckStore under realistic workloads. The details of the DB_Bench and YCSB-C workloads used in the experiments are presented in Table 1; the "Type" column lists the access mix of each workload. The DB_Bench workloads assess random-write performance: db_bench_1 is configured in random-write mode with a fixed 4-KB value size and varying data sizes (10 GB, 20 GB, 30 GB, 40 GB), and db_bench_2 is configured in random-write mode with multiple value sizes (1 KB, 4 KB, 16 KB, 64 KB) and two data volumes (10 GB and 40 GB). We also employ YCSB-C to measure ProckStore's performance under mixed read–write workloads.

Table 1
Workload characteristics used in the experiments.

Workloads in DB_Bench (Fillrandom; value-size unit: 1 KB; data-size unit: 10 GB):
  Workload     Type         Value size        Data size
  db_bench_1   100% writes  4x                1x, 2x, 3x, 4x
  db_bench_2   100% writes  1x, 4x, 16x, 64x  1x, 4x

Workloads in YCSB-C (Load: 1x, 2x of 10 GB; Run: 1x, 2x of 10 GB; record size: 1 KB):
  Workload  Type                               Distribution
  A         50% Reads, 50% Updates             Zipfian
  B         95% Reads, 5% Updates              Zipfian
  C         100% Reads                         Zipfian
  D         95% Reads, 5% Inserts              Latest
  E         95% Range Queries, 5% Inserts      Uniform
  F         50% Reads, 50% Read-Modify-Writes  Zipfian

Fig. 9. The results of TStore, PStore, SocketRocksDB, and ProckStore under Fillrandom DB_Bench with various data volumes.

5. Performance evaluation

We conduct experiments to evaluate the performance of ProckStore in terms of throughput, latency, and write amplification (WA).

5.1. Performance under DB_Bench with various data volumes

In this section, we evaluate the performance of ProckStore using DB_Bench with various data volumes and a 4-KB value. Fig. 9 illustrates the impact of data volume on performance, focusing on throughput, WA, CPU utilization, and bandwidth. ProckStore delivers peak performance with 10-GB workloads, achieving up to 48% higher throughput than PStore, with an average improvement of 40%. Under 40-GB workloads, the WA of TStore and SocketRocksDB reaches its maximum, while ProckStore's WA remains constant at 1.4 across all cases. Under 30-GB workloads, the throughput of TStore and SocketRocksDB decreases by an average of 67% and 61%, respectively, relative to ProckStore; this decrement is attributed to frequent compaction operations, which consume bandwidth and degrade overall performance. PStore exhibits lower CPU utilization than ProckStore across all workloads, as the multi-threaded approach in ProckStore makes better use of the computing resources. In contrast, SocketRocksDB prioritizes data storage over compaction, leading to lower CPU utilization than PStore and ProckStore.

5.1.1. Write amplification (WA)
A large WA indicates significant duplication of write operations, which degrades system performance. In SocketRocksDB, WA is primarily caused by the write-ahead log and by compaction on the host. WA increases with the amount of data, as the number of compaction operations is proportional to the data size. As shown in Fig. 9(b), under a 10-GB workload, the WA of TStore and PStore is reduced by 39% and 62%, respectively, compared to SocketRocksDB, which performs all compaction tasks on the host. By offloading a portion of the compaction tasks to the NDP device, TStore and PStore reduce WA. Notably, ProckStore exhibits a 55% reduction in WA. A similar trend is observed under 20- and 30-GB workloads. For a 40-GB workload, ProckStore's WA is reduced by 36.4% and 72.0% compared to TStore and SocketRocksDB, respectively.

5.1.2. Throughput
In Fig. 9(c), the operation time of ProckStore ranges from 828.35 micros/op to 1819.18 micros/op. The operation time of ProckStore is lower than that of SocketRocksDB because it takes less time to execute write and read operations. ProckStore reduces operation time by 72.8% compared to TStore under a 20-GB workload; under a 40-GB workload, it reduces the operation time by 24.0% and 61.5% compared to PStore and SocketRocksDB, respectively. In Fig. 9(a), with a 40-GB dataset, the throughput of ProckStore is 4.15× and 1.47× higher than that of TStore and PStore, and 2.75× higher than that of SocketRocksDB. Under a 10-GB workload, ProckStore achieves 2.81× the write throughput of SocketRocksDB through its multi-threaded asynchronous approach. In addition, with a 10-GB dataset, the throughput of ProckStore is 45.2% higher than that of PStore.

The other KV stores (excluding SocketRocksDB) leverage collaborative strategies between the host and the NDP device to accelerate compaction, thereby enhancing throughput. ProckStore further optimizes resource allocation with its multi-threaded asynchronous approach, which improves performance; its throughput exceeds 4.57 MB/s, a 48% improvement over PStore.
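For reference, write amplification in an LSM-tree store is conventionally computed as the ratio of bytes physically written by flushes and compactions to bytes logically written by the user. The sketch below is our formulation of that standard metric, not necessarily the paper's exact accounting:

```python
# Our formulation of the standard WA metric, not the paper's exact
# accounting: WA = bytes physically written / bytes logically written.

def write_amplification(user_bytes, flush_bytes, compaction_bytes):
    return (flush_bytes + compaction_bytes) / user_bytes

# Example: 10 GB of user writes causing 10 GB of flushes plus 4 GB of
# compaction rewrites yield WA = 1.4, the value ProckStore sustains above.
wa = write_amplification(10 * 2**30, 10 * 2**30, 4 * 2**30)
```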
5.1.3. CPU utilization
CPU utilization refers to the proportion of CPU resources consumed by the KV stores under different workloads. TStore uses a single-threaded approach on both the host and the device, leading to low CPU utilization (see Figs. 9(d) and 9(e)); as a result, TStore's CPU utilization is lower than that of SocketRocksDB. Despite leveraging multi-threaded concurrency, SocketRocksDB faces a transmission bottleneck between the host and the device: during task processing, the host quickly performs merge operations, but there is significant latency during read and write operations. By offloading a portion of the tasks to the NDP device, ProckStore reduces CPU idle time and improves CPU utilization on the host. Compared to SocketRocksDB, ProckStore achieves improvements of 97% and 89% in CPU utilization under 10-GB and 40-GB workloads, respectively. ProckStore demonstrates the highest host-side CPU utilization, peaking at 7.03% under a 10-GB workload; its multi-threaded method on the host further enhances CPU utilization.

As shown in Fig. 9(e), PStore employs a single-threaded, asynchronous method, offering greater flexibility than traditional scheduling models. Furthermore, reduced compaction time raises the device-side CPU utilization of PStore above 20.49%, a 73% improvement compared to TStore under a 10-GB workload. In ProckStore, device-side CPU utilization is further enhanced through cross-level compaction: this metric increases by 27%, 33%, 40%, and 35% compared to PStore under 10-, 20-, 30-, and 40-GB workloads, respectively.

5.1.4. Compaction bandwidth
The compaction bandwidth unveils the compaction performance of a KV store. In this paper, the term "compaction bandwidth" refers to the host-side compaction bandwidth, as ProckStore primarily focuses on optimizing host-side performance. For instance, the four-level priority scheduling in Section 3.3 prioritizes four steps (triggering, task generation, task allocation, and task execution on the host) to perform asynchronous compaction efficiently, and the triple-level filter compaction in Section 3.4 combines two compaction procedures into one, thereby improving host-side compaction performance. We therefore define the compaction bandwidth as the ratio of the amount of compacted data on the host side to the compaction time. SocketRocksDB performs all compaction tasks on the host, while the other KV stores provide compaction bandwidth on both the host and the NDP device. In Fig. 9(f), the single-threaded TStore fails to fully leverage the multi-core computational capabilities of the host, resulting in an average bandwidth of 2.35 MB/s. In contrast, SocketRocksDB uses a multi-threaded method to raise the bandwidth to 2.86 MB/s, outperforming the other baseline KV stores; because the host handles all the tasks, the total amount of data processed there is large. The collaborative solution improves processing efficiency on the host. Under 40-GB workloads, ProckStore's bandwidth improves by 3.56× and 1.51× over SocketRocksDB and PStore, respectively.

5.2. Performance under DB_Bench with various value sizes

We configured the workloads with various value sizes and two data volumes (10 GB and 40 GB). A large value size increases the compaction overhead while improving the throughput under workloads with a fixed data volume. ProckStore maintains the best performance under workloads with different value sizes and both data volumes (see Figs. 10 and 11). ProckStore's throughput increases on average by 63.1% and 77.7% compared to PStore and SocketRocksDB in the 1-KB-value case (see Fig. 10(a)). The performance increases at 64 KB because large-value workloads trigger more frequent compaction and shorter running times. ProckStore has the best performance in terms of bandwidth, with an average improvement of 1.67× compared to PStore and 2.32× compared to SocketRocksDB across all value sizes. Meanwhile, ProckStore achieves the highest host- and device-side CPU utilization. In Fig. 11(e), the device-side CPU utilizations of PStore and ProckStore are similar due to task stacking on the device under large data volumes.

5.2.1. Write amplification (WA)
With increasing value sizes, the amount of data on the host grows, exacerbating WA in TStore and SocketRocksDB. In Fig. 11(b), the WA of TStore and SocketRocksDB is lowest (2.18 and 5.2) with a 16-KB value; under 1-KB-value workloads, WA increases to 2.39 and 6.1, respectively. ProckStore's WA is unaffected by host-side compaction. Under 1-KB and 64-KB workloads, ProckStore reduces WA by 76.2% and 75.1%, respectively, compared to SocketRocksDB, and the reduction reaches 76.4% and 77.6% under 10-GB workloads. This improvement is due to ProckStore's triple-filter compaction on the host, which reduces the number of compaction operations and the volume of compacted data.

5.2.2. Throughput
In Figs. 10(a) and 11(a), ProckStore's average throughput ranges from 3.8 MB/s to 5.1 MB/s and from 2.7 MB/s to 4.0 MB/s under 10-GB and 40-GB workloads, respectively. It is worth noting that ProckStore's throughput increases compared with PStore, indicating lower response times to foreground requests. Compared with SocketRocksDB, ProckStore improves by 2.04× and 2.1× under 40-GB workloads with 1-KB and 16-KB values, respectively; compared with PStore, it improves throughput by 1.51× and 1.58× (see Fig. 11(a)). In particular, compared with TStore, ProckStore achieves 4.1× and 2.68× improvements under workloads with 4-KB and 64-KB values, respectively.

5.2.3. CPU utilization
Large values increase the compaction overhead and the host-side CPU utilization, which peaks under workloads with a 64-KB value. ProckStore's host-side and device-side CPU utilization reach 10.83% and 29.11%, respectively (see Figs. 10(e) and 11(d)), while SocketRocksDB's values are 8.34% and 18.28%. Additionally, ProckStore's CPU utilization on the two sides is 8.27% and 25.39% under 40-GB workloads with 1-KB values. On average, ProckStore's CPU utilization is 3.35× and 4.1× higher than that of TStore and SocketRocksDB, respectively, and it outperforms PStore in both host- and device-side CPU utilization under all workloads.

5.2.4. Compaction bandwidth
In Figs. 10(f) and 11(f), the compaction bandwidth of the KV stores varies. TStore's device-side bandwidth peaks at 3.14 MB/s, while ProckStore shows average improvements of 4.29× and 1.61× over TStore and PStore, respectively. SocketRocksDB leverages multi-threaded parallelism to enhance computation and reduce processing time, leading to superior bandwidth under 40-GB workloads across all value sizes; however, PStore achieves higher bandwidth than SocketRocksDB under 10-GB workloads. ProckStore outperforms all other stores in terms of bandwidth across all workloads, achieving a 3.54× improvement over SocketRocksDB under workloads with a 64-KB value.
5.3. Performance under YCSB-C

YCSB-C provides real-world workloads, which we use to evaluate the compaction performance of TStore, PStore, SocketRocksDB, and ProckStore. We configure this workload with two data volumes, 10 GB and 20 GB, in the Load and Run phases. We define the configuration with a 10-GB Load and a 10-GB Run as the small data volume, and a 20-GB Load and a 20-GB Run as the large data volume. We use six types of workloads in the experiment.

Fig. 10. The results of TStore, PStore, SocketRocksDB, and ProckStore under Fillrandom DB_Bench with 10-GB data volume and various value sizes.

Fig. 11. The results of TStore, PStore, SocketRocksDB, and ProckStore under Fillrandom DB_Bench with 40-GB data volume and various value sizes.

Fig. 12. The results of TStore, PStore, SocketRocksDB, and ProckStore under YCSB-C with load 10 GB and run 10 GB data volume.

5.3.1. Case 1: Load 10 GB and run 10 GB
Load. In YCSB-C, the Load phase is write-intensive, resulting in frequent compaction. ProckStore optimizes compaction under the various workloads (Fig. 12); its throughput outperforms SocketRocksDB by a factor of 2.3×. TStore benefits from time-aware dynamic task scheduling, which narrows its performance gap to ProckStore. PStore's asynchronous compaction improves performance, and ProckStore's multi-threaded execution further enhances the asynchronous compaction strategy. Consequently, ProckStore's throughput is 4.24× and 1.80× higher than that of TStore and PStore, respectively. During the Run phase, ProckStore's collaborative mode improves performance under write-intensive workloads. Workloads A and F exhibit the highest write ratios. Under workload A, ProckStore's throughput is 28.2% and 29.0% higher than that of TStore and PStore, respectively; under workload F, it surpasses TStore and PStore by 36.4% and 71.3%, respectively. However, when the write percentage is low, ProckStore's throughput shows minimal variation compared to the other KV stores. Additionally, under read-intensive workloads, ProckStore achieves maximum throughput improvements of 60.7%, 59.8%, 122.4%, and 9.2% under workloads B, C, D, and E, respectively. In contrast, TStore's read performance suffers from its excessive number of SSTables, which increases the query overhead.

Throughput and Latency. Throughput and latency are critical metrics for KV stores; as KV stores are widely deployed in real-world applications, these metrics significantly affect response time. ProckStore maintains its performance advantage in the Load phase even when the data size increases from 10 GB to 20 GB. Under the same workload, its throughput is 4.24× that of TStore (see Fig. 12(a)), and compared with SocketRocksDB and PStore, its throughput improves by 2.33× and 1.8×, respectively. The advantage of ProckStore becomes even more pronounced under workloads A and F, which involve a higher percentage of writes. The latency results further demonstrate the flexibility of ProckStore's scheduling method. Under workloads D and F, ProckStore has 55.1% and 42.5% lower latency than PStore (see Fig. 12(c)); compared with TStore and PStore, it has 20.8% and 22.1% lower latency under workload A, respectively. Under the read-intensive workload C, ProckStore's average latency is 13.4% and 37.5% lower than that of TStore and SocketRocksDB, respectively, and ProckStore exhibits similar trends under workloads B and D. Under workload E, the average latency of ProckStore is 5.72% and 8.45% lower than that of TStore and PStore, respectively. Moreover, the throughput of ProckStore is never lower than that of the other KV stores.

Write Amplification (WA). ProckStore achieves lower WA than both TStore and SocketRocksDB (see Fig. 12(b)); its WA is reduced by approximately 1.2 compared to SocketRocksDB. ProckStore's host-side multi-threaded method further decreases WA, by an average of 62.3% compared to TStore. The minimum WA of ProckStore, 1.20, occurs under workload C; under workloads C and D, the WA is 1.20 and 1.36, respectively. WA in ProckStore is influenced by the write-ahead log and by host-side compaction; because its compaction frequency is higher, its WA is greater than PStore's. Nevertheless, ProckStore's triple-level filter compaction mechanism mitigates WA compared to SocketRocksDB.

CPU Utilization and Compaction Bandwidth. Figs. 12(d) and 12(e) show the host-side CPU utilization of the KV stores. Notably, TStore runs on a single thread, and the network bandwidth limits data transfer between the host and the device. Overall, the CPU-utilization patterns of SocketRocksDB and ProckStore are similar on both sides, which can be attributed to the reduction in total processing time accompanied by a reduction in compaction time. Fig. 12(f) shows the compaction bandwidth in the Load and Run phases. ProckStore achieves the highest bandwidth on the host: its bandwidth is 8.64× and 3.32× higher than that of TStore and SocketRocksDB, respectively, under workload A, and the improvement reaches 7.97× and 2.99× under workload F. ProckStore improves the bandwidth by exploiting multi-threaded parallelism; its average bandwidth is 56.8% higher than PStore's due to its efficient task scheduling, which leverages the computational capabilities of both the host and the device.

Fig. 13. The results of TStore, PStore, SocketRocksDB, and ProckStore under YCSB-C with load 20 GB and run 20 GB data volume.

5.3.2. Case 2: Load 20 GB and run 20 GB
In the Load phase, the throughput of ProckStore surpasses that of SocketRocksDB and TStore by 3.44× and 3.73×, respectively (see Fig. 13(a)). Although the asynchronous approach of PStore enhances performance, the multi-threaded method of ProckStore integrates with the asynchronous compaction mechanism; consequently, the throughput of ProckStore reaches 1.59× that of PStore. In the Run phase, the multi-threaded asynchronous mode improves the performance of ProckStore under the write-intensive workloads A and F, where half of the operations are writes. Specifically, under workload A, ProckStore's throughput exceeds that of TStore and PStore by 21.0% and 23.1%, respectively; under workload F, it achieves 19.1% and 21.3% higher throughput than TStore and PStore, respectively.

Workloads A and F involve large data volumes. Under these workloads, as the data volume increased from 10 GB to 20 GB, the throughput of ProckStore decreased by 32.2% and 42.7%, respectively. The throughput of ProckStore is also improved under read-intensive workloads, although, since ProckStore focuses on optimizing compaction, the read-performance improvement is small. For read-intensive workloads, ProckStore achieves 20.4%, 16.3%, and 28.5% improvements under B, C, and D compared with PStore, respectively. Relative to the small data volume, ProckStore's results decrease by 27.8%, 8.4%, and 40.6% under workloads B, C, and D, respectively.

Throughput and Latency. With a data size of 20 GB, ProckStore maintains its performance advantage in the Load phase: compared with SocketRocksDB and PStore, its throughput improves by 3.44× and 1.58×, respectively. Both the average latency and the throughput of ProckStore are the best among the four stores in Figs. 13(a) and 13(c). Under read-intensive workloads such as B and C, ProckStore outperforms SocketRocksDB by about 9.1% and 21.4%, respectively; this improvement is attributed to the triple-filtering compaction, which reduces execution time in the Run phase and thereby increases throughput.

As shown in Fig. 13(c), under the write-intensive workload A, the average latency of ProckStore is 17.6% and 18.9% lower than that of TStore and PStore, respectively, and ProckStore shows a similar trend under workload F. In addition, ProckStore's latency is reduced by 16.9%, 14.1%, and 22.2% under the read-intensive workloads B, C, and D, respectively, compared with SocketRocksDB. However, compared with the 10-GB data volume, the latency increases due to the additional compaction operations and the associated lookup costs.

CPU Utilization and Compaction Bandwidth. Figs. 13(d) and 13(e) illustrate the changes in CPU utilization for ProckStore under the various workloads when the data volume increases from 10 GB to 20 GB. ProckStore increases host- and device-side CPU utilization by 18.1% and 32.6%, respectively, compared with PStore under workload C. Under mixed read–write workloads such as A and F, ProckStore increases host-side CPU utilization by 6.7% and 20.0% and device-side CPU utilization by 12.2% and 13.2%, respectively. Fig. 13(f) shows the compaction bandwidth in the Run phase. ProckStore achieves the highest bandwidth under all workloads: in workload C, its bandwidth is 45.4% and 35.1% higher than that of PStore and SocketRocksDB, respectively, an improvement attributed to ProckStore's multi-threaded parallelism. However, with the large data volume, ProckStore's bandwidth decreases by 17.7% compared to the 10-GB data volume. Under workload D, ProckStore's average bandwidth is 6.47× and 1.38× higher than that of TStore and SocketRocksDB, respectively. In comparison with the 10-GB data volume, the CPU utilization decreases by 38.1%, 32.3%, 17.7%, 33.7%, 30.9%, and 33.1% under workloads A, B, C, D, E, and F, respectively.

5.3.3. Tail latency
We analyzed the tail latency of ProckStore, including the P90, P99, and P999 latencies, and compared it with that of TStore, SocketRocksDB, and PStore under workloads with different data volumes (10 GB and 20 GB) and a 1-KB value size. The experimental results are shown in Figs. 14 and 15.
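A P90/P99/P999 value is the nearest-rank percentile of the per-operation latency samples. A minimal sketch of that computation (ours, not the paper's measurement harness):

```python
# Nearest-rank percentile over per-operation latency samples (our sketch,
# not the paper's measurement harness).

def percentile(samples, p):
    """Return the p-th percentile (nearest rank, 1-based) of the samples."""
    ordered = sorted(samples)
    rank = -(-p * len(ordered) // 100)  # ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]

latencies_ms = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # one slow outlier
p50, p90, p99 = (percentile(latencies_ms, p) for p in (50, 90, 99))
# The outlier dominates P99 but leaves the median untouched, which is
# why tail percentiles are reported separately from average latency.
```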
The results demonstrate that ProckStore outperforms the other key-value stores, exhibiting lower tail latency. SocketRocksDB's P90 and P99 tail latencies are notably lower than those of TStore and PStore due to the multi-version management mechanism in RocksDB, and ProckStore's P90 and P99 tail latencies are lower than those of all the other KV stores thanks to its asynchronous allocation method. Under a 10-GB workload, the most significant reduction occurs under workload E, where ProckStore lowers the P90 latency by 94.07% and 93.89% compared to TStore and PStore, respectively. This improvement is attributed to ProckStore's superior range-query performance, for which TStore and PStore are not optimized. Similarly, ProckStore achieves the lowest P99 latency: Fig. 14(b) shows that the most significant P99 reduction also occurs under workload E, where ProckStore lowers the P99 latency by 79.4% and 79.2% compared to TStore and PStore, respectively. It also shows substantial improvements under workload B, reducing the P99 latency by 75.6% and 76.2% compared to TStore and PStore.

In Fig. 15, the differences in tail latency become more pronounced under a 20-GB workload. ProckStore reduces the P90 latency by 9.32% and 31.06% under workloads A and F, respectively, compared to PStore. Under workload E, ProckStore reduces the P90 latency by 93.46% and 93.68% compared to PStore and TStore, respectively. ProckStore's four-level priority scheduling mechanism prevents low-priority requests from blocking high-priority writes, reducing the extreme write latency that flushes or compaction often cause in TStore and SocketRocksDB. Similarly, ProckStore reduces the P99 tail latency under workloads A and F by 28.9% and 17.9%, respectively, compared to SocketRocksDB and TStore. Under workload C, ProckStore reduces the P99 tail latency by 54.9% and 9.45% compared to PStore and SocketRocksDB, respectively, and under workload D by 23.5% and 6.0% compared to the same alternative KV stores. ProckStore performs best under workload E, reducing the P99 tail latency by 79.22%, 78.69%, and 6.23% compared to TStore, PStore, and SocketRocksDB, respectively.

The FIFO scheduling used by traditional KV stores such as SocketRocksDB can cause high-priority requests to be blocked, increasing tail latency. In contrast, ProckStore's multi-level queue scheduling mechanism executes compaction tasks in priority order, with high-priority compaction tasks executed first, thereby reducing tail latency.

Fig. 14. The tail latency of ProckStore under YCSB-C with load 10 GB and run 10 GB data volume.

Fig. 15. The tail latency of ProckStore under YCSB-C with load 20 GB and run 20 GB data volume.

6. Extended experiment

In this section, we study the impact of multi-threading and of the number of subtasks on ProckStore's performance. The results demonstrate the effectiveness of ProckStore under multi-threading and verify its performance under multiple subtask counts. The environment of the extended experiment is the same as the experimental configuration in Section 4.

6.1. Impact of the number of subtasks

To validate the fourth-level prioritization, we conducted experiments to evaluate the impact of the number of subtasks on the write performance of ProckStore. The extended experiments replicate the configuration from Section 4; we configured DB_Bench with a 10-GB dataset and a 1-KB value. Specifically, we examine the impact of the number of subtasks on the fourth-level prioritization in ProckStore by configuring four subtask counts on the host. The experimental results are shown in Fig. 16.

Fig. 16. Write performance of ProckStore under DB_Bench with different numbers of subtasks.

As shown in Fig. 16(a), the throughput of ProckStore increases significantly with the number of subtasks. The throughput is 2.33 MB/s, 2.81 MB/s, and 3.13 MB/s for one, two, and three subtasks, respectively, and subsequently stabilizes; with four subtasks, ProckStore achieves a peak throughput of 3.8 MB/s. The average latency shows a similar trend: ProckStore achieves its lowest latency (0.22 ms) with four subtasks, a 17.1% improvement from three to four subtasks. The host-side CPU utilization also reflects ProckStore's performance with different numbers of subtasks, as multi-core CPUs enable the parallel execution of multiple threads.

As shown in Fig. 16(c), CPU utilization increases with the number of subtasks, allowing the CPU to use its computational resources more fully. CPU utilization is lowest (4.78%) with one subtask and improves by 10.3% with two subtasks; the highest CPU utilization (6.84%) occurs with four subtasks. However, as the number of subtasks increases, the improvements in CPU utilization, throughput, and average latency become less pronounced: while the parallel execution of multiple threads reduces the compaction execution time, the overhead of thread creation and synchronization grows, and this additional CPU overhead limits ProckStore's CPU utilization.
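The subtask mechanism evaluated above can be pictured as splitting one host-side compaction into disjoint key-range partitions that are merged in parallel. The sketch below is illustrative only: ProckStore's implementation is multi-threaded code inside RocksDB, and in CPython, threads add little for CPU-bound sorting, which incidentally mirrors the diminishing returns reported above.

```python
# Illustrative model (ours) of splitting one host-side compaction into N
# parallel subtasks over disjoint key-range partitions; not ProckStore's
# actual implementation.
from concurrent.futures import ThreadPoolExecutor

def merge_subtask(runs):
    """Sort-merge and deduplicate the key lists of one partition."""
    return sorted(set(k for run in runs for k in run))

def parallel_compaction(partitions, n_subtasks):
    with ThreadPoolExecutor(max_workers=n_subtasks) as pool:
        parts = list(pool.map(merge_subtask, partitions))
    merged = []
    for part in parts:       # partitions are key-ordered, so simple
        merged.extend(part)  # concatenation yields one sorted run
    return merged

partitions = [[[1, 3], [2, 3]], [[5, 4], [6]]]  # two key-range partitions
result = parallel_compaction(partitions, n_subtasks=4)
# result == [1, 2, 3, 4, 5, 6]
```

Because the partitions cover disjoint key ranges, subtasks need no coordination during the merge, which is what makes the fourth-level scheduling amenable to parallel execution in the first place.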
Fig. 17. Write performance of ProckStore under DB_Bench with different numbers of threads.

Fig. 17(a) shows the throughput of ProckStore and PStore under workloads with 4-KB values and a 10 GB data volume. As the number of threads increases, the throughput of PStore does not scale proportionally, and its performance is poor during multi-threaded writes. Specifically, its throughput increases by only 1.58% when the number of threads rises from 8 to 12. In contrast, the throughput of ProckStore increases significantly with the number of threads. At 12 threads, the throughput reaches 7.86 MB/s, which is 10.9% higher than that at 8 threads. ProckStore's throughput is 144.1% higher than PStore's, as its multi-threaded execution efficiently processes the large data volume written by multiple threads, avoiding the computational limitations of single-threaded execution in PStore.

As shown in Fig. 17(b), the average latency of PStore decreases as the number of threads increases under a 10 GB data volume workload. However, the decrease is more significant when the number of threads is low; for example, from 1 to 4 threads, the latency drops by 27.8%. For ProckStore, performance improves steadily as the number of threads increases. At 4 threads, the average latency of ProckStore is 0.154 ms, and at 12 threads it falls to 0.104 ms, a 32.3% reduction. Fig. 17(c) shows that when the number of threads reaches 12, the host-side CPU utilization of PStore and ProckStore is highest, at 8.74% and 11.62%, respectively, with ProckStore showing a 41.92% increase over PStore. Additionally, the CPU utilization of the two systems increases by 7.28% and 10.7%, respectively, when the number of threads increases from 8 to 12. As the number of threads decreases, CPU utilization also drops; at 1 thread, CPU utilization is at its lowest, 6.48% for PStore and 9.37% for ProckStore.

7. Related work

LSM-trees have become a popular data structure in key-value storage systems, offering an alternative to traditional structures by efficiently handling write-intensive workloads and large-scale datasets. Although KV stores manage data through compaction operations, these processes come at the cost of performance. Consequently, several studies have sought to mitigate the performance impact of compaction in KV stores.

LSM-tree structure. PebblesDB [13] introduces the FLSM data structure, which relaxes the requirement of non-overlapping key ranges within a level, thereby delaying the compaction process and reducing WA. WiscKey [14] separates keys and values to minimize WA during compaction but increases garbage-collection overhead. To address this issue, HashKV [15] employs hash partitioning and a hot/cold partitioning strategy, while DiffKV [16] separates keys based on the size of key-value pairs to balance performance. FenceKV [17] enhances HashKV by incorporating a fence-value-based partitioning strategy and key-range-based garbage collection, optimizing range queries. FGKV [18] and Spooky [19] reduce WA by adjusting the data granularity in compaction: FGKV introduces a fine-grained compaction mechanism based on the LSpM-tree structure, minimizing redundant writes of irrelevant data, whereas Spooky partitions the data at the largest level into equal-sized files and partitions the smaller levels according to file boundaries for fine-grained compaction.

For compaction strategies, TRIAD [20] improves LSM-tree performance by optimizing logs, memory, and storage. LDC [21] and its adaptive variant [22] optimize the traditional top-level-driven compaction of LSM-trees by shifting to a lower-level-driven approach, decomposing large compaction tasks into smaller ones to reduce granularity. WipDB [23] utilizes a bucket-sort-like algorithm that minimizes merge operations by writing KV pairs in an approximately sorted list. Although these studies enhance compaction efficiency, they primarily focus on a single device and fail to address the competition for CPU and I/O resources between foreground requests and background tasks. In contrast, NDP devices expand computational resources to process tasks internally, reducing data transfer and resource contention.

Storage architecture. ListDB [24] employs a skip-list as the core data structure at all levels within non-volatile memory (NVM) or persistent memory (PM), mitigating the WA problem through byte-addressable in-place merging. This approach reduces the gap between DRAM and NVM write latency and addresses the write-stall issue. HiKV [25] exploits the benefits of hash and B+-tree indexes to design a KV store for hybrid DRAM-NVM storage systems, where hash indexes in NVM are used to enhance indexing performance. In a hybrid NVM-SSD system, WaLSM [26] tackles the WA problem through virtual partitioning, dividing the key space during compaction; a reinforcement-learning method is then applied to balance the merging strategies of different partitions under various workloads, optimizing read and write performance. TrieKV [27] integrates DRAM, PM, and disk into a unified storage system, utilizing a trie-structured index for all KV pairs in memory and enabling dynamic determination of KV-pair locations across storage hierarchies and persistence requirements. Moreover, ROCKSMASH [28] uses local storage for frequently accessed data and metadata, while cloud storage is employed for less frequently accessed data.

Computing architecture. Heterogeneous computing [29] (e.g., GPUs, DPUs, and FPGAs) alleviates the computational burden on the CPU. Sun et al. [30] propose an accelerated solution for key-value stores by offloading the compaction task to an FPGA. Similarly, an FPGA-accelerated KV store [31] offloads compaction to the FPGA, minimizing competition for CPU resources and accelerating compaction while reducing CPU bottlenecks. LUDA [32] employs GPUs to process SSTables with a co-ordering mechanism that minimizes data movement, thereby reducing CPU pressure. gLSM [33] separates keys and values to minimize data transfer between the CPU and GPU, thereby accelerating compaction. dCompaction [34] leverages DPUs to accelerate the compaction and decompaction of SSTables, offloading compaction tasks to the DPU according to a hierarchical structure and relieving CPU overload. Despite these advances, heterogeneous computing still requires data transfer from host-side memory to the computing units, which can impact overall system performance.

Near-data processing (NDP), which offloads computational tasks from the CPU to where the data resides, is an emerging computing paradigm. Previous studies [35] investigated storage computing and proposed frameworks for storage- and memory-level processing. Biscuit [36] introduces a generalized framework for NDP. RFNS [37] examines the advantages of reconfigurable NDP-driven servers based on ARM and FPGA architectures for data- and compute-intensive applications. λ-IO [38] designs a unified computational storage stack that manages storage and computing resources through interfaces, runtime systems, and scheduling. HuFu [39] is an I/O scheduling architecture for computable
SSDs that allows the system to manage background I/O tasks, offload computational tasks to SSDs, and exploit the parallelism and idle time of flash memory for improved task scheduling. Li et al. [40] address the resource-contention problem between user I/O and NDP requests, using the critical path to maximize the parallelism of multiple requests and thereby improving the performance of hybrid NDP-user I/O workflows. ABNDP [41] leverages a novel hardware-software collaborative optimization approach to solve the challenges of remote data access and computational load balancing without requiring trade-offs.

In addition, hosts and NDP devices can employ distinct task-scheduling policies to collaborate on compaction tasks [9,10,42]. nKV [43] defines data formats and layouts for computable storage devices and designs both hardware and software architectures to optimize data placement and computation. KV-CSD [44] builds NDP architectures using NVMe SSDs and system-on-chip designs to reduce data movement during queries by offloading tasks. OI-RAID [45] introduces an additional fault-tolerance mechanism by adding an extra level on top of the RAID levels, enabling fast recovery and enhanced reliability. KVRAID [46] utilizes logical-to-physical key conversion to pack similar-sized KV pairs into a single physical object, thereby reducing WA, and applies off-site update techniques to mitigate I/O amplification. Distributed storage systems, such as EdgeKV [47], have also been explored: a sharding strategy distributes data across multiple edge nodes, while consistent hashing ensures balanced data distribution and high availability. ER-KV [48] integrates a hybrid fault-tolerant design combining erasure coding and PBR, providing fault tolerance to ensure system reliability and high availability. Additionally, Song et al. [49] couple each SSD with a dedicated NDP engine in an NDP server to fully leverage the data-transfer bandwidth of SSD arrays. MStore [50] extends the single-NDP-device design to multiple devices, utilizing them to perform compaction tasks.

Although NDP devices can handle host-side computational tasks, their resources remain limited. Consequently, it is critical to optimize the use of these resources on the NDP device. The multi-threaded asynchronous method in ProckStore addresses this challenge by fully utilizing computation on both the host and device sides, avoiding resource wastage while ensuring sufficient computational capacity on the NDP device.

8. Conclusions

In this paper, we present ProckStore, an NDP-empowered KV store, to improve compaction performance for large-scale unstructured data storage. In ProckStore, the multi-threaded and asynchronous mechanism leverages computational resources within storage devices, reducing data movement and enhancing compaction efficiency. ProckStore optimally schedules compaction tasks across the host and NDP device by implementing a four-level priority scheduling mechanism. This separation of compaction stages provides parallel processing without interference, achieving efficient resource utilization. In addition, ProckStore uses key-value separation to reduce data transfer between the host and NDP device, minimizing transmission time. Experimental results show that ProckStore outperforms existing synchronous and single-threaded asynchronous NDP-empowered KV stores, achieving up to 4.2× higher throughput than the baseline KV store. ProckStore also reduces WA, compaction time, and CPU utilization.

CRediT authorship contribution statement

Hui Sun: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Chao Zhao: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Yinliang Yue: Validation, Supervision, Software. Xiao Qin: Supervision, Resources, Methodology, Formal analysis, Data curation.

Declaration of competing interest

The authors declare that there is no conflict of interest regarding the publication of this article.

References

[1] Z. Zhang, Y. Sheng, T. Zhou, et al., H2O: Heavy-hitter oracle for efficient generative inference of large language models, in: Advances in Neural Information Processing Systems, vol. 36, 2024.
[2] H. Lin, Z. Wang, S. Qi, et al., Building a high-performance graph storage on top of tree-structured key-value stores, Big Data Min. Anal. 7 (1) (2023) 156–170.
[3] S. Pei, J. Yang, Q. Yang, REGISTOR: A platform for unstructured data processing inside SSD storage, ACM Trans. Storage (TOS) 15 (1) (2019) 1–24.
[4] IDC, IDC innovators: Privacy-preserving computation, 2023, [EB/OL]. (2023-09-20). https://www.idc.com/getdoc.jsp?containerId=prCHC51469323.
[5] P. O'Neil, E. Cheng, D. Gawlick, E. O'Neil, The log-structured merge-tree (LSM-tree), Acta Inform. 33 (4) (1996) 351–385.
[6] Google, LevelDB, 2025, https://leveldb.org/.
[7] Facebook, RocksDB: a persistent key-value store for fast storage environments, 2016, http://rocksdb.org/.
[8] A. Acharya, M. Uysal, J. Saltz, Active disks: Programming model, algorithms and evaluation, Oper. Syst. Rev. 32 (5) (1998) 81–91.
[9] H. Sun, W. Liu, J. Huang, et al., Collaborative compaction optimization system using near-data processing for LSM-tree-based key-value stores, J. Parallel Distrib. Comput. 131 (2019) 29–43.
[10] H. Sun, W. Liu, Z. Qiao, et al., DStore: A holistic key-value store exploring near-data processing and on-demand scheduling for compaction optimization, IEEE Access 6 (2018) 61233–61253.
[11] H. Sun, et al., Asynchronous compaction acceleration scheme for near-data processing-enabled LSM-tree-based KV stores, ACM Trans. Embed. Comput. Syst. 23 (6) (2024) 1–33.
[12] I.K. Nti, et al., A mini-review of machine learning in big data analytics: Applications, challenges, and prospects, Big Data Min. Anal. 5 (2) (2022) 81–97.
[13] P. Raju, R. Kadekodi, V. Chidambaram, et al., PebblesDB: Building key-value stores using fragmented log-structured merge trees, in: Proceedings of the 26th Symposium on Operating Systems Principles, 2017, pp. 497–514.
[14] L. Lu, T.S. Pillai, H. Gopalakrishnan, et al., WiscKey: Separating keys from values in SSD-conscious storage, ACM Trans. Storage (TOS) 13 (1) (2017) 1–28.
[15] H.H.W. Chan, C.J.M. Liang, Y. Li, et al., HashKV: Enabling efficient updates in KV storage via hashing, in: 2018 USENIX Annual Technical Conference, USENIX ATC 18, 2018, pp. 1007–1019.
[16] Y. Li, Z. Liu, P.P.C. Lee, et al., Differentiated key-value storage management for balanced I/O performance, in: 2021 USENIX Annual Technical Conference, USENIX ATC 21, 2021, pp. 673–687.
[17] C. Tang, J. Wan, C. Xie, FenceKV: Enabling efficient range query for key-value separation, IEEE Trans. Parallel Distrib. Syst. 33 (12) (2022) 3375–3386.
[18] H. Sun, G. Chen, Y. Yue, et al., Improving LSM-tree based key-value stores with fine-grained compaction mechanism, IEEE Trans. Cloud Comput. (2023).
[19] N. Dayan, T. Weiss, S. Dashevsky, et al., Spooky: granulating LSM-tree compactions correctly, in: Proceedings of the VLDB Endowment, vol. 15, (11), 2022, pp. 3071–3084.
[20] O. Balmau, D. Didona, R. Guerraoui, et al., TRIAD: Creating synergies between memory, disk and log in log structured key-value stores, in: 2017 USENIX Annual Technical Conference, USENIX ATC 17, 2017, pp. 363–375.
[21] Y. Chai, Y. Chai, X. Wang, et al., LDC: a lower-level driven compaction method to optimize SSD-oriented key-value stores, in: 2019 IEEE 35th International Conference on Data Engineering, ICDE, 2019, pp. 722–733.
[22] Y. Chai, Y. Chai, X. Wang, et al., Adaptive lower-level driven compaction to optimize LSM-tree key-value stores, IEEE Trans. Knowl. Data Eng. 34 (6) (2020) 2595–2609.
[23] X. Zhao, S. Jiang, X. Wu, WipDB: A write-in-place key-value store that mimics bucket sort, in: 2021 IEEE 37th International Conference on Data Engineering, ICDE, 2021, pp. 1404–1415.
[24] W. Kim, C. Park, D. Kim, et al., ListDB: Union of write-ahead logs and persistent SkipLists for incremental checkpointing on persistent memory, in: 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 22, 2022, pp. 161–177.
[25] F. Xia, D. Jiang, J. Xiong, et al., HiKV: a hybrid index key-value store for DRAM-NVM memory systems, in: 2017 USENIX Annual Technical Conference, USENIX ATC 17, 2017, pp. 349–362.
[26] L. Chen, R. Chen, C. Yang, et al., Workload-aware log-structured merge key-value store for NVM-SSD hybrid storage, in: 2023 IEEE 39th International Conference on Data Engineering, ICDE, 2023, pp. 2207–2219.
[27] H. Sun, et al., TrieKV: A high-performance key-value store design with memory as its first-class citizen, IEEE Trans. Parallel Distrib. Syst. (2024).
[28] P. Xu, N. Zhao, J. Wan, et al., Building a fast and efficient LSM-tree store by integrating local storage with cloud storage, ACM Trans. Archit. Code Optim. (TACO) 19 (3) (2022) 1–26.
[29] H. Zhou, Y. Chen, L. Cui, G. Wang, X. Liu, A GPU-accelerated compaction strategy for LSM-based key-value store system, in: The 38th International Conference on Massive Storage Systems and Technology,
2024, pp. 1–11.
[30] X. Sun, J. Yu, Z. Zhou, et al., FPGA-based compaction engine for accelerating LSM-tree key-value stores, in: 2020 IEEE 36th International Conference on Data Engineering, ICDE, 2020, pp. 1261–1272.
[31] T. Zhang, J. Wang, X. Cheng, et al., FPGA-accelerated compactions for LSM-based key-value store, in: 18th USENIX Conference on File and Storage Technologies, FAST 20, 2020, pp. 225–237.
[32] P. Xu, J. Wan, P. Huang, et al., LUDA: Boost LSM key value store compactions with GPUs, 2020, arXiv preprint arXiv:2004.03054.
[33] H. Sun, J. Xu, X. Jiang, et al., gLSM: Using GPGPU to accelerate compactions in LSM-tree-based key-value stores, ACM Trans. Storage (2023).
[34] C. Ding, J. Zhou, J. Wan, et al., Dcomp: Efficient offload of LSM-tree compaction with data processing units, in: Proceedings of the 52nd International Conference on Parallel Processing, 2023, pp. 233–243.
[35] E. Riedel, G. Gibson, C. Faloutsos, Active storage for large-scale data mining and multimedia applications, in: Proceedings of 24th Conference on Very Large Databases, 1998, pp. 62–73.
[36] B. Gu, A.S. Yoon, D.H. Bae, et al., Biscuit: A framework for near-data processing of big data workloads, ACM SIGARCH Comput. Archit. News 44 (3) (2016) 153–165.
[37] X. Song, T. Xie, S. Fischer, Two reconfigurable NDP servers: Understanding the impact of near-data processing on data center applications, ACM Trans. Storage (TOS) 17 (4) (2021) 1–27.
[38] Z. Yang, Y. Lu, X. Liao, et al., λ-IO: A unified IO stack for computational storage, in: 21st USENIX Conference on File and Storage Technologies, FAST 23, 2023, pp. 347–362.
[39] Y. Wang, Y. Zhou, F. Wu, et al., Holistic and opportunistic scheduling of background I/Os in flash-based SSDs, IEEE Trans. Comput. (2023).
[40] J. Li, X. Chen, D. Liu, et al., Horae: A hybrid I/O request scheduling technique for near-data processing-based SSD, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 41 (11) (2022) 3803–3813.
[41] B. Tian, Q. Chen, M. Gao, ABNDP: Co-optimizing data access and load balance in near-data processing, in: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, vol. 3, 2023, pp. 3–17.
[42] H. Sun, W. Liu, J. Huang, et al., Near-data processing-enabled and time-aware compaction optimization for LSM-tree-based key-value stores, in: Proceedings of the 48th International Conference on Parallel Processing, 2019, pp. 1–11.
[43] T. Vincon, A. Bernhardt, I. Petrov, et al., nKV: near-data processing with KV-stores on native computational storage, in: Proceedings of the 16th International Workshop on Data Management on New Hardware, 2020, pp. 1–11.
[44] I. Park, Q. Zheng, D. Manno, et al., KV-CSD: A hardware-accelerated key-value store for data-intensive applications, in: 2023 IEEE International Conference on Cluster Computing, CLUSTER, 2023, pp. 132–144.
[45] N. Wang, Y. Xu, Y. Li, et al., OI-RAID: a two-layer RAID architecture towards fast recovery and high reliability, in: 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN, 2016, pp. 61–72.
[46] M. Qin, A.L.N. Reddy, P.V. Gratz, et al., KVRAID: high performance, write efficient, update friendly erasure coding scheme for KV-SSDs, in: Proceedings of the 14th ACM International Conference on Systems and Storage, 2021, pp. 1–12.
[47] K. Sonbol, Ö. Özkasap, I. Al-Oqily, et al., EdgeKV: Decentralized, scalable, and consistent storage for the edge, J. Parallel Distrib. Comput. 144 (2020) 28–40.
[48] Y. Geng, J. Luo, G. Wang, et al., ER-KV: High performance hybrid fault-tolerant key-value store, in: 2021 IEEE 23rd International Conference on High Performance Computing & Communications; 7th International Conference on Data Science & Systems; 19th International Conference on Smart City; 7th International Conference on Dependability in Sensor, Cloud & Big Data Systems & Application, HPCC/DSS/SmartCity/DependSys, 2021, pp. 179–188.
[49] X. Song, T. Xie, S. Fischer, A near-data processing server architecture and its impact on data center applications, in: High Performance Computing: 34th International Conference, ISC High Performance 2019, Frankfurt/Main, Germany, June 16–20, 2019, Proceedings 34, Springer International Publishing, 2019, pp. 81–98.
[50] H. Sun, Q. Wang, Y.L. Yue, et al., A storage computing architecture with multiple NDP devices for accelerating compaction performance in LSM-tree based KV stores, J. Syst. Archit. 130 (2022) 102681.