Journal of Systems Architecture 160 (2025) 103348
Contents lists available at ScienceDirect
Journal of Systems Architecture
journal homepage: www.elsevier.com/locate/sysarc
StorStack: A full-stack design for in-storage file systems
Juncheng Hu, Shuo Chen, Haoyang Wei, Guoyu Wang, Chenju Pei, Xilong Che
College of Computer Science and Technology, Jilin University, Changchun 130022, China
ARTICLE INFO

Keywords:
File system
In-storage Computing
Storage-class Memory

ABSTRACT

Due to the increasingly significant cost of data movement, In-storage Computing has attracted considerable attention in academia. While most In-storage Computing works allow direct data processing, these methods do not completely eliminate the participation of the CPU during file access, and data still needs to be moved from the file system into memory for processing. Even though there are attempts to put file systems into storage devices to solve this problem, the performance of such systems is not ideal when facing high-latency storage devices, due to bypassing the kernel and lacking a page cache.

To address the above issues, we propose StorStack, a full-stack, highly configurable in-storage file system framework and simulator that facilitates architecture- and system-level research. By offloading the file system into the storage device, the file system can be closer to the data, reducing the overhead of data movement. Meanwhile, it also avoids kernel traps and reduces communication overhead. More importantly, this design enables In-storage Computing applications to completely eliminate CPU participation. StorStack also provides a user-level cache to maintain performance when storage device access latency is high. To study performance, we implement a StorStack prototype and evaluate it under various benchmarks on QEMU and Linux. The results show that StorStack achieves up to 7x performance improvement with direct access and 5.2x with cache.
1. Introduction

In traditional computing architectures, data must be transferred from storage devices to memory for processing, which not only consumes the computing resources of the host, but also results in high energy consumption and I/O latency. As data scales continue to expand, In-storage Computing has been proposed to alleviate the pressure of data movement [1,2]. The core idea is to perform computations directly where the data is stored, without the need to move the data. The emergence of high-speed storage devices like SSDs [3] and SCMs [4,5] has significantly advanced research in In-storage Computing and transformed computer storage systems. To fully leverage the potential of storage systems and exploit the characteristics of this new computing paradigm, a redesign of storage stack software is required.

As the most essential part of the storage stack software, file systems have resided in the operating system kernel for a very long time because they need to perform integrity assurance and access control to ensure data security. The kernel is considered a trusted field compared to user space. However, this seemingly good design has been challenged by new technologies. With the emergence of faster storage devices such as SSDs and SCMs, access latency has decreased significantly compared to HDDs [6], leading to the software overhead of file systems [7,8] becoming a major performance bottleneck. Meanwhile, the design and operation of file systems determine their reliance on the CPU when accessing the file system. For In-storage Computing, although researchers are gradually reducing CPU involvement, current file systems still rely on the CPU to handle complex file management tasks and ensure system security and integrity.

On the one hand, to reduce the software overhead of file systems, many works target the kernel trap. For example, there are efforts to move the file system into user space [8-13]. But running in user space may compromise the reliability of the file system, since bugs or malicious software may cause crashes and data loss. Some of these works try to move the critical parts of the file system back into the kernel. But in most cases, data-plane operations are interleaved with control-plane operations, which may diminish the performance improvement brought by kernel bypassing. In recent years, firmware file systems have been proposed, which move file systems onto the storage device controller [14-16] to completely get rid of the kernel trap. However, those file systems are designed to be strongly coupled with the storage device, which deprives the device of replaceable file systems and of compatibility with conventional operating systems. In addition, these firmware file systems do not provide comprehensive security guarantees.
Corresponding author.
E-mail addresses: jchu@jlu.edu.cn (J. Hu), chenshuo22@mails.jlu.edu.cn (S. Chen), hywei23@mails.jlu.edu.cn (H. Wei), wgy21@mails.jlu.edu.cn
(G. Wang), peicj2121@mails.jlu.edu.cn (C. Pei), chexilong@jlu.edu.cn (X. Che).
https://doi.org/10.1016/j.sysarc.2025.103348
Received 29 August 2024; Received in revised form 24 November 2024; Accepted 18 January 2025
Available online 27 January 2025
1383-7621/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
On the other hand, to fully leverage the advantages of In-storage Computing, it is necessary to eliminate the participation of the host-side OS from the storage access path. In-storage Computing advocates a data-centric approach, where computation units are embedded within the storage devices to enable direct data processing. However, in the process of accessing files, traditional file systems still require CPU involvement. To know which data should be transferred next, file access must first be handled by the host-side file system in the operating system kernel. This CPU intervention limits the computational capacity improvements that In-storage Computing can offer.

Another point worth noting is that numerous studies propose improving system performance by allowing user applications to bypass the kernel and communicate directly with storage devices. This method demonstrates significant performance improvements when dealing with high-speed storage devices. However, due to the diversity of storage devices and their varying latencies, system performance may suffer when bypassing the high-speed cache, especially when using high-latency, low-speed storage devices. Therefore, the impact of cache configuration on performance is also a subject of our further research. In summary, despite various attempts to optimize file system performance and reduce CPU involvement, current solutions still have several issues.

To further optimize the performance and security of file systems and fully unleash the potential of in-storage computing, we propose StorStack, a full-stack, highly configurable, in-storage file system framework and simulator for high-speed storage devices such as SSDs and SCMs. Since file systems always have a fixed primary functionality of managing the data mapping, which is similar in function to the flash translation layer (FTL) on the storage controller, we consider it natural and reasonable to run the file system on the storage controller.

StorStack has three main components: a device firmware runtime for file systems, enabling file systems to run directly on the storage device; a user library to expose POSIX interfaces to user applications; and a kernel driver to guarantee access control. By moving the file system into the storage, StorStack aims to gain performance improvement from the concept of In-storage Computing, which brings the file system closer to the data. Moreover, the file system code is removed from the kernel, which avoids the latency and context switches caused by kernel traps during file access. More importantly, StorStack can remove the CPU from the storage access path of In-storage Computing applications, maximizing the potential of In-storage Computing. To ensure the security and reliability of the file system, StorStack designs an efficient security mechanism, introducing the device-side controller as the runtime and retaining control-plane operations within the host kernel. By reducing the ratio of control-plane to data-plane operations, kernel traps are minimized, enhancing performance. StorStack also includes a user-level cache to explore the impact of caching on the performance of in-storage file systems.

We implemented StorStack as a prototype and evaluated it on QEMU and Linux 5.15. Experimental results demonstrate that StorStack performs up to 5.2x faster than Ext4 with cache and 7x with direct access. Regarding the cache, we find that as access latency increases, file systems with a cache always maintain high speeds, whereas the speed of file systems without a cache decreases significantly.

2. Background and related work

The storage and memory system has changed a lot in the past decades. With the development of speed, capacity, and size, and the emergence of new types of storage, a rethink of both hardware and software is required to exploit the potential of the system in the next era. In this section, we first discuss the trends of two novel high-speed non-volatile storage technologies, then explore the significance of applying In-storage Computing on these storage devices. Finally, we briefly introduce three classes of file systems in different locations.

2.1. Hardware trends

Compared to the large, slow HDD, the solid-state drive (SSD) is a kind of flash-based non-volatile storage with a small form factor, high speed, and low energy cost [17,18]. SSDs on the market today can provide up to 30 TB of capacity and 7 GB/s throughput on sequential read/write. To fully exploit this high performance, modern SSDs have switched from SATA to PCIe and NVMe. PCIe 5.0 [19] supports up to 16 lanes and a 32 GT/s data rate, which leads to more than 60 GB/s of bandwidth. NVMe [3] is a communication protocol for non-volatile memories attached via PCIe, supporting up to 65,535 I/O queues, each with up to 65,536 entries. It also supports SSD-friendly operations like ZNS and KV, which can further enhance SSD throughput capabilities.

Storage-class memory (SCM), also referred to as persistent memory (PMEM) or non-volatile memory (NVM), is a different type of storage device that is fast and byte-addressable like DRAM, but can also retain data without power like SSDs. Various technologies such as PRAM [20,21], MRAM [22], and ReRAM [23,24] have been explored to implement SCM, each exhibiting different performance characteristics. SCM provides higher bandwidth than SSD; it offers latency close to DRAM, and its capacity falls between SSD and DRAM [25]. As new blood in the storage hierarchy, SCM can provide more possibilities to multiple workloads [26-29].

Consequently, while the increased bandwidth and reduced latency of storage devices have substantially boosted the performance of computer systems and enabled novel application scenarios, these advancements also introduce several challenges, including heightened complexity in data management, the need to balance cost and efficiency, and issues related to technical compatibility and migration.

2.2. In-storage computing

While these new storage devices have significantly altered the memory hierarchy of computer systems, the memory wall between the CPU and off-chip memory is still the bottleneck of the whole system, especially with the rise of data-intensive workloads and the slowdown of Moore's law and Dennard scaling. To reduce the overhead of data movement, In-storage Computing (ISC) [30-32] has been proposed, gaining increasing attention with advancements in integration technologies. However, most current research predominantly focuses on offloading user-defined tasks to storage devices, and this approach still faces limitations in practice.

First, existing ISC methods exhibit significant shortcomings in terms of compatibility and portability. On the host side, developers must design custom APIs for ISC, which are incompatible with existing system interfaces such as POSIX, demanding substantial modifications to the host code [32]. On the drive side, the drive program either collaborates with the host file system to access the correct file data [33] or manages the drive as a bare block device without a file system. However, most systems still rely on file-system-based external storage access, with the file system typically running on the CPU. Consequently, ISC tasks often require CPU involvement when accessing external storage data.

Secondly, current approaches lack adequate protection and isolation for ISC applications. To fully leverage the high speed of modern storage devices, multiple ISC applications may need to execute concurrently. Without proper data protection mechanisms, malicious or erroneous ISC tasks could access unauthorized data. Without isolation, the execution of one ISC task could compromise the performance and security of others. However, most existing research [1,34,35] assumes that ISC tasks operate in an exclusive execution environment, failing to address these concerns effectively. Additionally, when specific code is offloaded to storage devices, attackers can exploit vulnerabilities in in-storage software and hardware firmware, such as buffer overflows [36,37] or bus snooping attacks, to escalate privileges and harm the system.
2.3. File system

The evolution of storage hardware poses higher demands on software systems. As a crucial part of the software stack of the storage system, file systems should be redesigned to minimize software overheads, especially the involvement of the OS kernel on the data path. Many efforts have explored the possibility of different file system locations.

Kernel file systems. Numerous typical file systems are implemented inside the kernel as kernel file systems, including Ext4, XFS, etc. Due to the isolation of kernel space, kernel file systems can easily manage data and metadata with reliability guarantees [38]. Recent works on kernel file systems have sought to exploit the capabilities of modern storage devices. For example, F2FS [39] is built on append-only logging to adapt to the characteristics of flash memory. PMFS [38] introduces a new hardware primitive to avoid the consistency issues caused by the CPU cache while accessing SCM. DAX [40] bypasses the buffer cache of the system to support direct access to the storage hardware, so that redundant data movement between DRAM and SCM is removed. NOVA [41] explores the hybrid of DRAM and SCM as a specially designed log-structured file system. However, kernel file systems have several limitations. Firstly, the development and debugging process within kernel space is inherently complex and difficult. Furthermore, every file system access necessitates a kernel trap, which inevitably introduces latency. Additionally, the frequent context switching between user processes and the kernel increases CPU overhead.

User-space file systems. User-space file systems are implemented mostly in user space to bypass the kernel and reduce the overhead associated with kernel traps. However, since most user-space file systems are implemented in untrusted environments, ensuring data security and reliability becomes challenging. User-space file systems need sophisticated designs, usually involving collaboration between kernel space and user space, to keep them reliable. For example, Strata [11] separates the file system into a per-process user-space update log for concurrent writing and a read-only kernel-space shared area for data persistence. Moneta-D [9] provides hardware virtual channel support with a kernel-space file system protection policy and a user-space driver to access the hardware. There are also efforts to implement the control plane of the file system as a trusted user-space process [8,12].

Firmware file systems. Works that offload part or the whole of the file system into the storage device firmware are categorized as firmware file systems. There are three representative works on firmware file systems: DevFS [14], CrossFS [15], and FusionFS [16]. DevFS and CrossFS explore the possibility of moving the file system to the storage side to benefit from kernel bypass. FusionFS goes further than the previous two works and attempts to gain performance by combining multiple storage access operations. However, we have identified several problems with these file systems. First, they are tightly coupled with specific storage devices, which makes it hard for users to select alternative file systems or upgrade the software version of the current file system. Second, none of them are designed to operate effectively in scenarios with significant communication latency. Third, the lack of security mechanisms limits their applicability in real-world environments.

2.4. Motivation

Although kernel file systems are well-designed and time-tested, their design principles, which assume high device access latency, are no longer suitable for modern high-speed devices. User-space file systems and firmware file systems have explored new approaches to file system implementation in the era of high-speed storage; however, they may lead to inferior performance with traditional devices, compromised security controls, or inflexible, non-replaceable file systems. To address these issues, we introduce StorStack, a fast, flexible, and secure in-storage file system framework. The detailed comparison between StorStack and previous file systems is shown in Table 1.

3. Design

In this section, we first discuss the design principles of StorStack, followed by an overview of its architecture, the connection between host and device, scheduling mechanisms, and reliability designs.

3.1. Principles

1. Provide a full-stack framework to enable in-storage file systems without compromising performance. To support in-storage FS, StorStack's design includes a user library, a kernel driver, and a firmware FS runtime. By bringing FS code out of the kernel and closer to the data, StorStack avoids the kernel trap and reduces the communication overhead. StorStack also incorporates a user-level cache to maintain performance when the access latency of the device is high.

2. Make full use of the heterogeneity of the host CPU and storage device controller. The in-storage FS yields host CPU time to user application code and cuts energy cost, while conflicts due to concurrent access are resolved on the host CPU to maintain performance. If necessary, the cache is also retained on the host side and is managed in user space. Such a heterogeneous system can maximize the overall performance and minimize the power consumption of the system.

3. Guarantee the reliability of the file system with minimal overhead. To provide essential guarantees such as permission checking, StorStack keeps its control plane within the trusted area. Additionally, to enhance performance, a token mechanism is introduced to prevent StorStack from accessing the kernel during data-plane operations.

4. Keep compatible with conventional operating systems. The design of StorStack does not require changes to current operating systems. Instead, the user lib and kernel driver of StorStack are add-ons. Even without them, the StorStack storage device can be accessed with typical block- or byte-based interfaces, just like traditional SSDs or SCMs. StorStack also supports per-partition replaceable file systems, a regular function in current operating systems that is not supported by firmware file systems.

5. Support heterogeneous computing. By providing a device-level file interface, StorStack may enable multiple advanced heterogeneous access patterns, including In-storage Computing (ISC) [31,32,42,43] and direct I/O access from GPUs [44,45] or NICs [42,46]. In this work, we provide basic support for these patterns and plan to further explore them in future research.

6. Run with a reasonable hardware setup on the storage device. Previous research on firmware file systems has assumed that device controller hardware capabilities are severely limited. However, today's high-end storage devices feature up to 4 cores and DRAM capacity that can reach 1% of their storage capacity [47]. As in-storage processing evolves, hardware configurations will continue to improve [30,43,48-50]. In StorStack, we assume that the device possesses sufficient capabilities to run file systems alongside a runtime environment. Future research can investigate the benefits of integrating in-storage file systems with additional device-side capabilities, such as power-loss protection capacitors or the flash translation layer.

3.2. Architecture

To support in-storage file systems with compatibility, flexibility, and reliability, StorStack has three major parts distributed over user space, kernel space, and the device side.
Table 1
The detailed comparison between StorStack and previous file systems.

                 Software access   Expected hardware   FS position   Host-side   Replaceable   Isolated access
                 latency           latency                           cache       FS            control
Kernel FS        High              High                Host          ✓           ✓             ✓
User-space FS    Low               Low                 Host          ◦           ✓             ◦
Prev. Firm FS    Low               Low                 Device        ×           ×             ×
StorStack        Low               Either              Device        ✓           ✓             ✓
Fig. 1. StorStack Architecture. StorStack consists of three major modules: the U-lib, the K-lib, and the Firm-RT; and there are two workflows: a data-plane workflow, and
a control-plane workflow. The interconnection between them is shown in the figure.
3.2.1. High-level design

As shown in Fig. 1, StorStack consists of three major parts: a user lib (U-lib), a kernel driver (K-lib), and an FS runtime in device firmware (Firm-RT).

U-lib. The U-lib is the interface for user applications to access the in-storage FS, offered as a dynamic link library. The main job of the U-lib is to expose POSIX file operations to users, provide the user-level cache, and manage the connection with the device. It also cooperates with the K-lib and the Firm-RT to ensure the reliability of the system.

K-lib. The K-lib is a kernel module that provides control-plane operations with reliability. Its work includes resource allocation and permission checking. Although it resides in the kernel, the functions of the K-lib are designed to be rarely called to avoid the performance penalty associated with kernel traps.

Firm-RT. The Firm-RT is a runtime on the storage firmware that offers essential hardware and software support for the in-storage FS to run on the device controller. To serve the FS, the Firm-RT communicates with both the U-lib for data-plane operations and the K-lib for control-plane operations.

3.2.2. StorStack workflow

For clarity, the workflow of StorStack is divided into a data plane and a control plane. The data-plane workflow handles data accesses from user space, and the control plane is responsible for maintaining the system's functionality, safety, and reliability.

For the data plane (red lines in Fig. 1), when a user application calls a file operation in StorStack, the host-side U-lib will check the cache if the cache is used. If the cache is bypassed or penetrated, the U-lib packs the operation into an extended NVMe protocol command and transmits it to the device-side Firm-RT. The Firm-RT receives the NVMe command, checks its validity, and then forwards the command to the FS. The FS handles the file operation and then works with the FTL or other hardware instruments to arrange the data blocks on the storage media. The primary distinction between this routine and a typical kernel-based file system lies in the fact that the file system logic is inside the storage device; StorStack thereby eliminates the need for kernel traps during data access.

The control plane (blue dashed lines in Fig. 1) provides the necessary support for the data plane to work properly. Control-plane operations on the host side, including memory resource allocation and identity token assignment, are delegated to the kernel to ensure security and reliability. The host-side control-plane operations are designed to be rarely called to reduce kernel trap overhead. On the device, the control plane assists in checking the authentication of requests, managing the FS, and handling other management operations. More detailed security and reliability policies are described in Section 3.5.

3.2.3. Organization on the storage

In StorStack, file systems are stored in the storage media with pointers originating from partitions, so that the framework can choose the right FS to access a partition. We dedicate a partition to store all the FS binaries that are used by user-created partitions, and each FS in this partition can be indexed by a number. Here we assume that a GUID partition table (GPT) is used to organize the partitions. Each user-created partition is associated with an FS when it is formatted, and the FS will be added to the dedicated FS partition if it is not there yet. To indicate the relation between a user-created partition and its FS, the index number of the FS is added to the attribute flag bits of the partition's GPT entry. The organization is illustrated in Fig. 2.
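The flag-based association can be sketched as a pair of bit-field helpers over the 64-bit attribute field of a GPT entry. This is only an illustration: the paper does not specify which attribute bits StorStack uses, so the choice of the type-specific bits (48-63, which the GPT specification reserves for per-partition-type use) and all names below are our assumptions.

```c
#include <stdint.h>

/* Assumed layout: the FS index lives in the type-specific attribute
 * bits (48-63) of the partition's GPT entry. */
#define FS_INDEX_SHIFT 48
#define FS_INDEX_MASK  0xFFFFULL

/* Record the FS index in the entry's attributes, keeping other bits. */
uint64_t set_fs_index(uint64_t attrs, uint16_t fs_index) {
    attrs &= ~(FS_INDEX_MASK << FS_INDEX_SHIFT);   /* clear old index */
    return attrs | ((uint64_t)fs_index << FS_INDEX_SHIFT);
}

/* Recover the FS index so the framework can pick the right FS binary. */
uint16_t get_fs_index(uint64_t attrs) {
    return (uint16_t)((attrs >> FS_INDEX_SHIFT) & FS_INDEX_MASK);
}
```

Because only the reserved high bits are touched, a conventional OS that ignores type-specific attribute bits can still mount the partition through its normal kernel file system routine.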
Fig. 2. Partition organization. Figure shows how the FS is stored on the storage and associated with the partition.
This design allows StorStack to provide different file systems to different partitions. Meanwhile, the GPT and the partitions are still available for the typical kernel file system routine.

3.3. File access pattern

The U-lib provides POSIX IO and AIO interfaces to user applications, and the complicated reliability and performance designs are transparent to users. For regular IO interfaces, the write operations (write, pwrite) act differently with and without cache. When the cache is used, writes return as soon as an operation passes some simple checks and is put into the queue. The interface does not promise that the data is written to the disk before it returns, just like a traditional kernel file system, unless fsync is called. Without cache, writes block the process until the data is written to the storage. The read interfaces (read, pread) do not return until the data is available, regardless of whether there is a cache. The AIO interfaces return immediately when an operation is put into the queue, and the real return value can be fetched by non-blocking check, blocking suspend, or signal.

To make sure that StorStack performs well on high-latency storage devices, an optional user-level per-process cache is provided. Because the reliability of StorStack can only be ensured by the device-side file system but not the U-lib, we choose a per-process cache to prevent malicious processes from polluting data by writing to a global cache without checks. The user-level cache has two ways to deal with write operations: the write-back method returns immediately after the data is put into the cache; the write-around method drops the dirty data in the cache and returns after the operation is put into the queue. The write-back cache has higher performance than the write-around cache, while the write-around cache can provide higher data consistency. In fact, our evaluation shows that the write-back cache in StorStack can outperform the page cache inside the kernel.

3.4. Connectivity

Here we discuss how the host-side U-lib and K-lib communicate with the device-side Firm-RT. StorStack's communication is based on NVMe to take full advantage of high-speed storage devices. We also propose a multi-queue design to improve the performance of the device-side FS.

3.4.1. Communication protocol

The communication protocol between the host CPU and the StorStack device is a queued protocol extended from NVMe [3]. NVMe is a protocol for accessing non-volatile memories connected via PCIe that supports multiple queues to maximize throughput, which is suitable for novel high-speed storage devices such as SSDs and SCMs.

To enable the transfer of file operations, we extend the NVMe command list to incorporate the POSIX I/O interface. Meanwhile, the regular data access pattern of NVMe is retained to enable normal disk access when the system does not support StorStack. It is noteworthy that the protocol can be further extended under StorStack to support more paradigms like transactional access [51], log-structured access [52,53], operation fusing [16], or In-storage Computing. We will leave these further explorations to our future work.

With StorStack, heterogeneous hardware like GPUs can implement this extended protocol to access files directly without involving the CPU. For different types of hardware, there are two ways to transmit data. For hardware that has its own memory (memory-mapped), like GPUs, StorStack can place the data directly into that memory via the PCIe bus. For hardware without memory (I/O-mapped), StorStack puts the data into main memory. The choice of data destination is directed by the target device driver.

3.4.2. Multi-queue arrangement

NVMe uses multiple queues to improve performance, supporting up to 65,535 I/O queues, with up to 65,536 commands per queue. Normally, NVMe offers at least a pair of queues (one submission queue and one completion queue) for each core to fully utilize the bandwidth without introducing locks. In StorStack, file operations are processed on the device side, particularly when the storage device features a multi-core controller. To fully utilize the parallelism of the controller cores while minimizing the potential conflicts of concurrent file access, StorStack introduces a special queue organization.

As Fig. 3 shows, every user process in StorStack is assigned a bunch of queue pairs, the number of which is equal to the storage device controller core count. Each queue pair of the bunch is bound to a controller core of the storage device, so that a process can distribute any file operation to a specific controller core. Meanwhile, each user thread has its exclusive queue pair bunch to avoid queue contention on the host side.

The purpose of this arrangement is to enable the host-side applications to control which operation should be dispatched to which controller core. For example, read-intensive applications can issue read operations to all cores with a round-robin strategy. For write-intensive applications, different threads can send the write operations on the same file to the same controller core to reduce lock contention between controller cores. We will leave the exploration of scheduling policies for different workloads to future work.

3.5. Security and reliability

From a hardware perspective, the privileged mode (ring 0) that the kernel runs in and the user mode that user applications run in are isolated, which means access to resources is restricted by hardware. The privileged mode can thus be treated as a trusted area, whereas the user mode is an untrusted area. StorStack introduces the device-side controller as a runtime, which is also isolated from user code and thus viewed as a trusted area.

For safety, everything critical to the correctness of the system should be placed in the trusted area. Typical kernel file systems are placed inside the kernel as they need to manage the data on block devices.
Fig. 3. Queue arrangement and scheduling policies. This figure shows how the queue pairs are mapped between host CPU threads and device controller cores.
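The scheduling policies sketched around Fig. 3, reads fanned out round-robin over all controller-core queues, writes on the same file pinned to one core to avoid cross-core lock contention, can be illustrated by a small host-side dispatcher. This is a hypothetical sketch (the names `QueueDispatcher` and `submit` are not StorStack's API), not the prototype's code:

```python
# Hypothetical sketch of the host-side dispatch policy: reads go round-robin
# across all controller-core queues; writes on the same file always map to the
# same queue, reducing lock contention between controller cores.
import zlib

class QueueDispatcher:
    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.next_read_queue = 0  # round-robin cursor for read operations

    def submit(self, op, path):
        if op == "read":
            q = self.next_read_queue
            self.next_read_queue = (self.next_read_queue + 1) % self.num_cores
        else:  # write: hash the file path so one file maps to one core
            q = zlib.crc32(path.encode()) % self.num_cores
        return q  # index of the submission queue / controller core

d = QueueDispatcher(num_cores=4)
assert [d.submit("read", "/a") for _ in range(5)] == [0, 1, 2, 3, 0]
# all writes to one file land on the same controller core
assert len({d.submit("write", "/a") for _ in range(3)}) == 1
```

A real policy would also balance write queues across files; the point here is only the per-file affinity for writes versus fan-out for reads.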
StorStack shifts the FS to the device side, which is also a trusted area. Meanwhile, as described in Section 3.2.2, StorStack separates the host-side workflow into a control plane and a data plane. The control plane is designed to reside in the host-side trusted area, i.e. the kernel, to cooperate with the device-side FS to ensure security and reliability.

An important design principle of the control plane is to reduce the overhead of kernel traps. In StorStack, this is done by reducing the proportion of control-plane operations relative to data-plane operations. There are two types of control-plane workflow on the host side: resource allocation and access control. Both of them are designed to be called rarely.

3.5.1. Resource allocation
The U-lib of StorStack is a user-space driver that communicates with the NVMe storage device. It needs to set up VFIO and manage DMA memory mapping to enable direct access from user space. It also needs to allocate areas for caches. These operations involve the kernel but only need to be run once when the device is initialized, so there will not be any performance loss in regular file access.

3.5.2. Permission checking
To provide access control, file systems must check the user's permission to make sure that a file operation is legal. In kernel file systems, the file system can use the process structure in the kernel to validate the process's identity, and then compare it with the permission information stored in the file's inode. In StorStack, however, the file system resides on the device rather than in the kernel, so the kernel needs to share the process's information with the device to support permission checking.

To avoid entering the kernel frequently, DevFS [14] maintains a table in the device that maps CPU IDs to process credentials. All requests are tagged with the ID of the CPU that the process runs on before they are sent to the device, and the kernel is modified to update the table whenever a process is scheduled on a host CPU. There are two problems with this mechanism. Firstly, it assumes that the CPU ID is unforgeable, but a malicious process can potentially exploit the ID of another CPU to escalate its privilege. Secondly, it requires a modification to the process scheduler, which is a core module of the kernel, making it incompatible with standard OS kernels and potentially slowing down the system.

In StorStack, we propose a new method to share the credential of the process, with less communication, a safer guarantee, and no change to the Linux kernel. The process is shown in Fig. 4. When the U-lib is initialized in a process, it calls the K-lib (a kernel driver) via ioctl() (a system call) to get a credential token. The K-lib generates a secret key if one has not been set yet, then saves it and copies it to the device via the kernel NVMe driver. Once the key is set, the K-lib uses it to encrypt the process's credential information (i.e. the uid) into a MAC (Message Authentication Code). The resulting token, which is the output of the encryption, is then returned to the process. Since the secret key is stored in the kernel, the process cannot forge a token but can only use the one assigned by the kernel, which proves the authenticity of the uid claimed by the process. Before being sent to the device, every request from the process is tagged with the process's uid and the token, so that the device can use the secret key and the token to verify the uid and check the identity of the request. This mechanism requires only one communication between the kernel and the device to share the secret key, and one kernel trap to initialize the token for each process. Also, the K-lib is implemented as a kernel driver, without any modification to the core functions of the kernel, which makes it compatible with conventional operating systems.

Fig. 4. Permission checking. This figure shows how the user space, the kernel space, and the device work together to check the validity of a request without frequent kernel traps.

3.5.3. Device lock
StorStack is designed to support direct I/O not only from CPUs, but also from different types of heterogeneous computing devices. To prevent concurrent access to the same file from multiple devices, a concurrency control method is required. A common practice is to implement a distributed lock across all devices, but this can be too costly for low-level hardware. In StorStack, we provide in-storage file-level locking mechanisms to protect files from unexpected access by multiple devices.

StorStack supports two types of lock: (1) spinning lock, where an error code is returned to the caller if the file it accesses is already locked by another device, allowing the caller to keep attempting to acquire the lock until the file is unlocked; (2) sleeping lock, where if the file is locked, any requests from other devices to that file wait in the submission queue until the file is unlocked. From the perspective of concurrency, StorStack supports both shared locks and exclusive locks, which behave exactly the same as those on other systems.
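The two lock flavors can be sketched as a minimal in-memory model. The names (`FileLockTable`, `try_acquire`) are illustrative, not the device firmware: the spinning variant returns an error code immediately, while the sleeping variant parks the request in a per-file queue until release. Only the exclusive case is modeled here:

```python
# Hypothetical sketch of the in-storage file-level locks: a "spinning" caller
# gets an error code when the file is held by another device; a "sleeping"
# request is queued and granted the lock when the holder releases it.
from collections import defaultdict, deque

EBUSY = -16  # error code returned to a spinning caller

class FileLockTable:
    def __init__(self):
        self.owner = {}                    # path -> device id holding the lock
        self.waiters = defaultdict(deque)  # path -> queued (sleeping) devices

    def try_acquire(self, path, dev, sleeping=False):
        holder = self.owner.get(path)
        if holder is None or holder == dev:
            self.owner[path] = dev
            return 0                        # lock granted
        if sleeping:
            self.waiters[path].append(dev)  # wait in the submission queue
            return None                     # request parked, no error
        return EBUSY                        # spinning: caller retries later

    def release(self, path, dev):
        if self.owner.get(path) == dev:
            if self.waiters[path]:
                self.owner[path] = self.waiters[path].popleft()
            else:
                del self.owner[path]

t = FileLockTable()
assert t.try_acquire("/f", dev=1) == 0
assert t.try_acquire("/f", dev=2) == EBUSY           # spinning: error code
assert t.try_acquire("/f", dev=3, sleeping=True) is None
t.release("/f", dev=1)
assert t.owner["/f"] == 3                             # queued device now holds it
```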
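The credential-token flow of Section 3.5.2 can likewise be made concrete. The sketch below uses Python's `hmac` module to stand in for the K-lib's HMAC step and the device-side check; the function names are assumptions for illustration, not StorStack's actual interface:

```python
# Hypothetical sketch of the Section 3.5.2 token mechanism: the kernel holds a
# secret key, hands each process an HMAC over its uid, and the device, which
# shares the key, recomputes the MAC to verify the uid attached to a request.
import hmac, hashlib, os

SECRET_KEY = os.urandom(32)  # generated once by the K-lib, copied to the device

def klib_issue_token(uid):
    """Kernel side: MAC over the process's uid, returned via ioctl()."""
    return hmac.new(SECRET_KEY, str(uid).encode(), hashlib.sha256).digest()

def device_check_request(uid, token):
    """Device side: recompute the MAC with the shared key and compare."""
    expected = hmac.new(SECRET_KEY, str(uid).encode(), hashlib.sha256).digest()
    return hmac.compare_digest(expected, token)

token = klib_issue_token(uid=1000)
assert device_check_request(1000, token)   # legitimate request passes
assert not device_check_request(0, token)  # forged uid (root) is rejected
```

Because the process never sees `SECRET_KEY`, it cannot mint a token for a uid other than the one the kernel attested, which is exactly the forgery problem with DevFS-style CPU-ID tagging.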
Fig. 5. Random and sequential r/w. Figure shows the basic performance of StorStack compared with Ext-4, under different cache, block size, and in-storage file system settings.
3.6. Implementation

We have implemented a prototype of StorStack, which consists of three parts: the U-lib, the K-lib and the Firm-RT. The source code of this prototype is available at https://anonymous.4open.science/r/StorStack-524F/.

The U-lib is implemented under Linux 5.15, utilizing SPDK [54] to access storage devices from user space. The SPDK library is modified in StorStack to transfer POSIX I/O operations over NVMe. The U-lib comprises two major components: a dynamic link library that provides interfaces and a user-level cache for accessing the device, and a daemon program responsible for managing the connection to the device.

The K-lib is implemented as a simple kernel module in the Linux 5.15 kernel. It only takes charge of two things: creating the secret key when StorStack is initialized, so that the K-lib and the Firm-RT can use it to encrypt and decrypt the MAC tokens for process credentials; and generating the MAC token from the uid of the current process with the HMAC algorithm when the process initializes, then returning it to the U-lib. The interface of the K-lib is exposed to user space through ioctl.

The Firm-RT is the only component located on the device side. In this work, the Firm-RT is not implemented on actual storage hardware but is instead simulated using QEMU and the system running on its host machine. There are two reasons for the simulation: first, although there are several works regarding programmable storage controllers [49,55-57], these solutions are either expensive or lack high-level programmability, as most of them are based on FPGAs; second, by simulating with various latency settings, we can evaluate the performance of StorStack on different types of storage devices, which would be costly with real hardware. In our prototype, QEMU has been modified to handle the extended NVMe POSIX I/O operations and to check the token of each operation.

4. Evaluation

In this section, we evaluate the performance of StorStack and compare it with popular file systems to answer the following questions:

• Is StorStack efficient enough compared to widely used kernel file systems?
• How much performance is gained from the kernel trap avoidance?
• How does StorStack perform on different types of devices?
• How is the concurrency performance of StorStack?

Fig. 6. Time cost for a single operation.

4.1. Experimental setup

Our experiment platform is a 20-core 2.4 GHz Intel Xeon server equipped with 64 GB DDR4 memory and a 512 GB SSD. Among the cores, 8 cores with 16 GB memory are assigned to the QEMU VM to simulate the StorStack host; the other cores with 16 GB memory are reserved to emulate the StorStack device. Both the StorStack host and the StorStack device run on Linux 5.15.

StorStack's expected settings on the device require only a minimal embedded system with abstractions of hardware functions and the necessary libraries, but due to our simulation requirements, we choose Linux as the device-side environment to support the execution of QEMU.

In this section, we evaluate the performance of StorStack using Filebench [58], a widely used benchmarking suite for testing file system performance. We access StorStack under various configurations, including different cache options, device access latencies, thread numbers and read/write ratios, to address the four questions raised above.
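The user-level write-back cache that the U-lib places in front of the device (Sections 3.3 and 3.6) can be sketched as below. This is a minimal model under assumed names (`WriteBackCache`, `Backend`), not the prototype's code: hits are served from user-space DRAM with no kernel trap, and dirty blocks reach the device only on eviction or an explicit flush:

```python
# Hypothetical sketch of a user-level write-back cache in front of the device:
# hits stay in user-space memory (no kernel trap); dirty blocks are written
# back to the device only on LRU eviction or an explicit flush.
from collections import OrderedDict

class Backend:
    """Stands in for the NVMe submission path to the in-storage FS."""
    def __init__(self):
        self.blocks = {}
        self.ios = 0  # count of device accesses

    def read(self, key):
        self.ios += 1
        return self.blocks.get(key, b"\0")

    def write(self, key, data):
        self.ios += 1
        self.blocks[key] = data

class WriteBackCache:
    def __init__(self, backend, capacity=64):
        self.backend, self.capacity = backend, capacity
        self.cache = OrderedDict()  # key -> (data, dirty), LRU order

    def _evict_if_full(self):
        if len(self.cache) > self.capacity:
            key, (data, dirty) = self.cache.popitem(last=False)
            if dirty:
                self.backend.write(key, data)  # write-back on eviction

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key][0]          # hit: no device access
        data = self.backend.read(key)          # miss: fetch from device
        self.cache[key] = (data, False)
        self._evict_if_full()
        return data

    def write(self, key, data):
        self.cache[key] = (data, True)         # absorbed in user space
        self.cache.move_to_end(key)
        self._evict_if_full()

    def flush(self):
        for key, (data, dirty) in self.cache.items():
            if dirty:
                self.backend.write(key, data)
        self.cache = OrderedDict((k, (d, False)) for k, (d, _) in self.cache.items())

dev = Backend()
c = WriteBackCache(dev)
c.write(("f", 0), b"hello")
assert c.read(("f", 0)) == b"hello" and dev.ios == 0  # absorbed, no device I/O
c.flush()
assert dev.ios == 1                                   # one write-back on flush
```

A write-around cache would instead send every write straight to `backend.write`, trading the performance of absorbed writes for stronger consistency, which is the trade-off measured in Section 4.2.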
Fig. 7. Performance with simulated latency. This figure shows the change in throughput as a function of simulated device access latency.
Fig. 8. Multi-thread Performance.
4.2. Random and sequential r/w

First, we evaluate StorStack's performance with single-thread random and sequential read/write tests. The random tests run on a 1 GB file with 1K, 4K, and 16K byte I/O sizes. The sequential tests run on an 8 GB file with 8K, 32K, and 128K byte I/O sizes. Both files are stored in DRAM, which is simulated as a PMEM by memmap. The tests are performed on StorStack (referred to as SS) with two different in-storage FS settings: SS+Ext-4 and SS+Ext-4_DAX. Then we compare them with Ext-4. We also evaluate the performance of SS without cache (SS NC) and Ext-4 with direct IO (Ext-4_DIO) to study the performance improvement of direct access.

Fig. 5 shows the results of the random and sequential tests. In both tests, SS outperforms traditional kernel-level Ext-4, due to our kernel-bypass and near-data file system design. SS+Ext-4_DAX with the user-level write-back cache achieves on average 1.98x, 4.25x, 3.59x, and 4.08x performance gains on random read, random write, sequential read, and sequential write respectively, compared with Ext-4 with the page cache. For direct access, the speedup is 6.41x, 6.21x, 4.72x, and 1.90x respectively. Another interesting phenomenon is that in cached StorStack, the performance of SS+Ext-4 and SS+Ext-4_DAX is similar, indicating that the choice of the in-storage file system matters little because most operations are handled by the user-level cache. However, in the uncached tests, SS+Ext-4_DAX shows better results, which means that the in-storage file system may influence the overall performance in direct access.

4.3. Profit of kernel bypassing

We measure the time cost of a single operation to study the profit of kernel bypassing. The cached test demonstrates the impact of the kernel trap on access to the in-memory page cache. The uncached test shows the impact of both the kernel trap and write amplification on direct access to the storage device. Both tests use a 4KB block size, and the files are stored on the simulated PMEM. The results in Fig. 6 indicate that compared to Ext-4, SS+Ext-4_DAX reduces latency by 91.91%, 50.46%, 69.83%, and 81.83% on cached read, cached write, uncached read, and uncached write respectively.

When the cache hits, the data resides in fast DRAM, resulting in low data-fetch latency. In this scenario, traditional Ext-4 exhibits higher access latency, as the kernel trap accounts for most of the latency. In contrast, StorStack shows lower latency because its cache is implemented inside user space, eliminating the need for kernel traps. When a cache miss occurs, the primary overhead shifts to the multiple rounds of storage device access, which further increases the performance gap between traditional Ext-4 and StorStack.

4.4. Impact of access latency

Storage devices with different access latencies may influence the performance of file systems. In this experiment, we use multiple latency settings to simulate devices with different access speeds. The latency is simulated on the device side by QEMU.

We compare the performance of SS with Ext-4 under cached and uncached settings using several latency settings. The latency ranges from 0 μs to 25 μs to simulate connection methods from DDR to PCIe to RDMA. Tests run with a 4KB block size.

Fig. 7 shows the result of this test. With a cache, both SS and Ext-4 are not susceptible to the rise of latency. However, without a cache, the performance of SS degrades by 78.20%, from 526 MB/s at 0 simulated latency to 115 MB/s at 25 μs latency. The performance of Ext-4 also drops by 20.98%, from 54 MB/s to 43 MB/s. Note that the experiment introduces extra latency due to QEMU, so the simulated 0 latency is actually larger than 0, meaning that the curve could go even higher on the left side of the graph. The result illustrates that direct access in SS should only be enabled on ultra-low-latency devices. For other hardware, it is better to enable the cache.

4.5. Multi-thread performance

To study the performance of StorStack under multiple threads, we evaluate SS and Ext-4 under a multi-thread micro-benchmark. The benchmark performs parallel 4KB file operations on one file with 4 threads, each thread is a reader or a writer, and the ratio of readers and
writers is set to 4:0, 3:1, 1:3, and 0:4. Fig. 8 shows the result. StorStack is faster than Ext-4 in all concurrent read and write scenarios of our test. For the cached scenario, SS is on average 2.88x faster than Ext-4 across all read-write ratios. For the uncached scenario, the speedup is 17.34x.

5. Conclusion

In this paper, we present StorStack, a full-stack framework and simulator for in-storage file systems. The StorStack components across user space, kernel space, and device space collaborate to enable file systems to run inside the storage device efficiently and reliably. We implement a prototype of StorStack and evaluate it with various settings. Experimental results show that StorStack outperforms current kernel file systems in both cached and uncached scenarios. Some further performance optimizations, such as the combination of file systems and storage hardware capabilities, the exploration of multi-queue scheduling strategies for different workloads, and the performance of direct access from heterogeneous devices, are left to future work.

CRediT authorship contribution statement

Juncheng Hu: Writing - review & editing, Writing - original draft. Shuo Chen: Formal analysis, Data curation. Haoyang Wei: Formal analysis, Data curation. Guoyu Wang: Writing - review & editing, Writing - original draft. Chenju Pei: Formal analysis, Data curation. Xilong Che: Methodology, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was funded by the National Key Research and Development Programme No. 2024YFB3310200, by the Key Scientific and Technological R&D Plan of Jilin Province of China under Grant No. 20230201066GX, and by the Central University Basic Scientific Research Fund under Grant No. 2023-JCXK-04.

References

[1] G. Koo, K.K. Matam, T. I, H.K.G. Narra, J. Li, H.-W. Tseng, S. Swanson, M. Annavaram, Summarizer: trading communication with computing near storage, in: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017, pp. 219-231.
[2] S. Seshadri, M. Gahagan, S. Bhaskaran, T. Bunker, A. De, Y. Jin, Y. Liu, S. Swanson, Willow: a user-programmable SSD, in: 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI 14, 2014.
[3] NVMe specifications, https://nvmexpress.org/specifications/.
[4] Intel, Intel® Optane™ Persistent Memory, https://www.intel.com/content/www/us/en/products/docs/memory-storage/optane-persistent-memory/overview.html.
[5] S. Mittal, J.S. Vetter, A survey of software techniques for using non-volatile memories for storage and main memory systems, IEEE Trans. Parallel Distrib. Syst. 27 (5) (2016) 1537-1550, http://dx.doi.org/10.1109/TPDS.2015.2442980.
[6] M. Wei, M. Bjørling, P. Bonnet, S. Swanson, I/O speculation for the microsecond era, in: 2014 USENIX Annual Technical Conference, USENIX ATC 14, 2014, pp. 475-481.
[7] S. Peter, J. Li, I. Zhang, D.R.K. Ports, D. Woos, A. Krishnamurthy, T. Anderson, T. Roscoe, Arrakis: the operating system is the control plane, in: 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI 14, 2014, pp. 1-16.
[8] H. Volos, S. Nalli, S. Panneerselvam, V. Varadarajan, P. Saxena, M.M. Swift, Aerie: flexible file-system interfaces to storage-class memory, in: Proceedings of the Ninth European Conference on Computer Systems, EuroSys 14, Association for Computing Machinery, New York, NY, USA, 2014, pp. 1-14, http://dx.doi.org/10.1145/2592798.2592810.
[9] A.M. Caulfield, T.I. Mollov, L.A. Eisner, A. De, J. Coburn, S. Swanson, Providing safe, user space access to fast, solid state disks, ACM SIGPLAN Not. 47 (4) (2012) 387-400, http://dx.doi.org/10.1145/2248487.2151017.
[10] M. Dong, H. Bu, J. Yi, B. Dong, H. Chen, Performance and protection in the ZoFS user-space NVM file system, in: Proceedings of the 27th ACM Symposium on Operating Systems Principles, ACM, Huntsville, Ontario, Canada, 2019, pp. 478-493, http://dx.doi.org/10.1145/3341301.3359637.
[11] Y. Kwon, H. Fingler, T. Hunt, S. Peter, E. Witchel, T. Anderson, Strata: a cross media file system, in: Proceedings of the 26th Symposium on Operating Systems Principles, SOSP 17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 460-477, http://dx.doi.org/10.1145/3132747.3132770.
[12] J. Liu, A.C. Arpaci-Dusseau, R.H. Arpaci-Dusseau, S. Kannan, File systems as processes, in: 11th USENIX Workshop on Hot Topics in Storage and File Systems, HotStorage 19, USENIX Association, Renton, WA, 2019.
[13] S. Zhong, C. Ye, G. Hu, S. Qu, A. Arpaci-Dusseau, R. Arpaci-Dusseau, M. Swift, MadFS: per-file virtualization for userspace persistent memory filesystems, in: 21st USENIX Conference on File and Storage Technologies, FAST 23, 2023, pp. 265-280.
[14] S. Kannan, A.C. Arpaci-Dusseau, R.H. Arpaci-Dusseau, Y. Wang, J. Xu, G. Palani, Designing a true direct-access file system with DevFS, in: 16th USENIX Conference on File and Storage Technologies, FAST 18, USENIX Association, Oakland, CA, 2018, pp. 241-256.
[15] Y. Ren, C. Min, S. Kannan, CrossFS: a cross-layered direct-access file system, in: 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 20, USENIX Association, 2020, pp. 137-154.
[16] J. Zhang, Y. Ren, S. Kannan, FusionFS: fusing I/O operations using CISCOps in firmware file systems, in: 20th USENIX Conference on File and Storage Technologies, FAST 22, USENIX Association, Santa Clara, CA, 2022, pp. 297-312.
[17] N. Agrawal, V. Prabhakaran, T. Wobber, J.D. Davis, M. Manasse, R. Panigrahy, Design tradeoffs for SSD performance, in: USENIX 2008 Annual Technical Conference, ATC 08, USENIX Association, USA, 2008, pp. 57-70.
[18] F. Chen, D.A. Koufaty, X. Zhang, Understanding intrinsic characteristics and system implications of flash memory based solid state drives, in: Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 09, Association for Computing Machinery, New York, NY, USA, 2009, pp. 181-192, http://dx.doi.org/10.1145/1555349.1555371.
[19] Welcome to PCI-SIG | PCI-SIG, https://pcisig.com/.
[20] Y. Choi, I. Song, M.-H. Park, H. Chung, S. Chang, B. Cho, J. Kim, Y. Oh, D. Kwon, J. Sunwoo, J. Shin, Y. Rho, C. Lee, M.G. Kang, J. Lee, Y. Kwon, S. Kim, J. Kim, Y.-J. Lee, Q. Wang, S. Cha, S. Ahn, H. Horii, J. Lee, K. Kim, H. Joo, K. Lee, Y.-T. Lee, J. Yoo, G. Jeong, A 20nm 1.8V 8Gb PRAM with 40MB/s program bandwidth, in: 2012 IEEE International Solid-State Circuits Conference, 2012, pp. 46-48, http://dx.doi.org/10.1109/ISSCC.2012.6176872.
[21] H. Volos, A.J. Tack, M.M. Swift, Mnemosyne: lightweight persistent memory, ACM SIGARCH Comput. Archit. News 39 (1) (2011) 91-104, http://dx.doi.org/10.1145/1961295.1950379.
[22] S.-W. Chung, T. Kishi, J.W. Park, M. Yoshikawa, K.S. Park, T. Nagase, K. Sunouchi, H. Kanaya, G.C. Kim, K. Noma, M.S. Lee, A. Yamamoto, K.M. Rho, K. Tsuchida, S.J. Chung, J.Y. Yi, H.S. Kim, Y. Chun, H. Oyamatsu, S.J. Hong, 4Gbit density STT-MRAM using perpendicular MTJ realized with compact cell structure, in: 2016 IEEE International Electron Devices Meeting, IEDM, 2016, pp. 27.1.1-27.1.4, http://dx.doi.org/10.1109/IEDM.2016.7838490.
[23] H. Akinaga, H. Shima, Resistive random access memory (ReRAM) based on metal oxides, Proc. IEEE 98 (12) (2010) 2237-2251, http://dx.doi.org/10.1109/JPROC.2010.2070830.
[24] K. Kawai, A. Kawahara, R. Yasuhara, S. Muraoka, Z. Wei, R. Azuma, K. Tanabe, K. Shimakawa, Highly-reliable TaOx ReRAM technology using automatic forming circuit, in: 2014 IEEE International Conference on IC Design & Technology, 2014, pp. 1-4, http://dx.doi.org/10.1109/ICICDT.2014.6838600.
[25] K. Suzuki, S. Swanson, The Non-Volatile Memory Technology Database (NVMDB), Tech. Rep. CS2015-1011, Department of Computer Science & Engineering, University of California, San Diego, 2015.
[26] S. Matsuura, Designing a persistent-memory-native storage engine for SQL database systems, in: 2021 IEEE 10th Non-Volatile Memory Systems and Applications Symposium, NVMSA, IEEE, Beijing, China, 2021, pp. 1-6, http://dx.doi.org/10.1109/NVMSA53655.2021.9628842.
[27] R. Tadakamadla, M. Patocka, T. Kani, S.J. Norton, Accelerating database workloads with DM-WriteCache and persistent memory, in: Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering, ICPE 19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 255-263, http://dx.doi.org/10.1145/3297663.3309669.
[28] W. Wang, C. Yang, R. Zhang, S. Nie, X. Chen, D. Liu, Themis: malicious wear detection and defense for persistent memory file systems, in: 2020 IEEE 26th International Conference on Parallel and Distributed Systems, ICPADS, 2020, pp. 140-147, http://dx.doi.org/10.1109/ICPADS51040.2020.00028.
[29] B. Zhu, Y. Chen, Q. Wang, Y. Lu, J. Shu, Octopus+: an RDMA-enabled distributed persistent memory file system, ACM Trans. Storage 17 (3) (2021) 1-25, http://dx.doi.org/10.1145/3448418.
[30] J. Do, V.C. Ferreira, H. Bobarshad, M. Torabzadehkashi, S. Rezaei, A. Heydarigorji, D. Souza, B.F. Goldstein, L. Santiago, M.S. Kim, P.M.V. Lima, F.M.G. França, V. Alves, Cost-effective, energy-efficient, and scalable storage computing for large-scale AI applications, ACM Trans. Storage 16 (4) (2020) 21:1-21:37, http://dx.doi.org/10.1145/3415580.
[31] L. Kang, Y. Xue, W. Jia, X. Wang, J. Kim, C. Youn, M.J. Kang, H.J. Lim, B. Jacob, J. Huang, IceClave: a trusted execution environment for in-storage computing, in: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 199-211, http://dx.doi.org/10.1145/3466752.3480109.
[32] Z. Ruan, T. He, J. Cong, INSIDER: designing in-storage computing system for emerging high-performance drive, in: 2019 USENIX Annual Technical Conference, USENIX ATC 19, USENIX Association, Renton, WA, 2019, pp. 379-394.
[33] A.M. Caulfield, T.I. Mollov, L.A. Eisner, A. De, J. Coburn, S. Swanson, Providing safe, user space access to fast, solid state disks, ACM SIGPLAN Not. 47 (4) (2012) 387-400.
[34] S. Cho, C. Park, H. Oh, S. Kim, Y. Yi, G.R. Ganger, Active disk meets flash: a case for intelligent SSDs, in: Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, 2013, pp. 91-102.
[35] J. Do, Y.-S. Kee, J.M. Patel, C. Park, K. Park, D.J. DeWitt, Query processing on smart SSDs: opportunities and challenges, in: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013, pp. 1221-1230.
[36] C. Cowan, S. Beattie, J. Johansen, P. Wagle, PointGuard: protecting pointers from buffer overflow vulnerabilities, in: 12th USENIX Security Symposium, USENIX Security 03, 2003.
[37] L. Szekeres, M. Payer, T. Wei, D. Song, SoK: eternal war in memory, in: 2013 IEEE Symposium on Security and Privacy, IEEE, 2013, pp. 48-62.
[38] S.R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, J. Jackson, System software for persistent memory, in: Proceedings of the Ninth European Conference on Computer Systems, EuroSys 14, ACM Press, Amsterdam, The Netherlands, 2014, pp. 1-15, http://dx.doi.org/10.1145/2592798.2592814.
[39] C. Lee, D. Sim, J. Hwang, S. Cho, F2FS: a new file system for flash storage, in: 13th USENIX Conference on File and Storage Technologies, FAST 15, USENIX Association, Santa Clara, CA, 2015, pp. 273-286.
[40] DAX, https://www.kernel.org/doc/Documentation/filesystems/dax.txt.
[41] J. Xu, S. Swanson, NOVA: a log-structured file system for hybrid volatile/non-volatile main memories, in: Proceedings of the 14th USENIX Conference on File and Storage Technologies, FAST 16, USENIX Association, USA, 2016, pp. 323-338.
[42] M. Torabzadehkashi, S. Rezaei, A. HeydariGorji, H. Bobarshad, V. Alves, N. Bagherzadeh, Computational storage: an efficient and scalable platform for big data and HPC applications, J. Big Data 6 (1) (2019) 100, http://dx.doi.org/10.1186/s40537-019-0265-5.
[43] W. Cao, Y. Liu, Z. Cheng, N. Zheng, W. Li, W. Wu, L. Ouyang, P. Wang, Y. Wang, R. Kuan, Z. Liu, F. Zhu, T. Zhang, POLARDB meets computational storage: efficiently support analytical workloads in cloud-native relational database, in: Proceedings of the 18th USENIX Conference on File and Storage Technologies, FAST 20, USENIX Association, USA, 2020, pp. 29-42.
[44] Nvidia, NVIDIA RTX IO: GPU accelerated storage technology, https://www.nvidia.com/en-us/geforce/news/rtx-io-gpu-accelerated-storage-technology/.
[45] AMD, Radeon™ Pro SSG graphics, https://www.amd.com/en/products/professional-graphics/radeon-pro-ssg.
[46] Z. An, Z. Zhang, Q. Li, J. Xing, H. Du, Z. Wang, Z. Huo, J. Ma, Optimizing the datapath for key-value middleware with NVMe SSDs over RDMA interconnects, in: 2017 IEEE International Conference on Cluster Computing, CLUSTER, 2017, pp. 582-586, http://dx.doi.org/10.1109/CLUSTER.2017.69.
[47] Samsung, Samsung 990 PRO with heatsink, https://semiconductor.samsung.com/content/semiconductor/global/consumer-storage/internal-ssd/990-pro-with-heatsink.html.
[48] Arm Ltd., ARM computational storage solution, https://www.arm.com/solutions/storage/computational-storage.
[49] Samsung, Samsung SmartSSD, https://www.xilinx.com/applications/data-center/computational-storage/smartssd.html.
[50] ScaleFlux, https://scaleflux.com/.
[51] E. Gal, S. Toledo, A transactional flash file system for microcontrollers, in: 2005 USENIX Annual Technical Conference, USENIX ATC 05, 2005.
[52] J. Koo, J. Im, J. Song, J. Park, E. Lee, B.S. Kim, S. Lee, Modernizing file system through in-storage indexing, in: Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation, OSDI 21, USENIX Association, Berkeley, 2021, pp. 75-92, http://dx.doi.org/10.5281/zenodo.4659803.
[53] LevelDB, https://github.com/google/leveldb.
[54] Storage performance development kit, https://spdk.io/.
[55] DFC open source, https://github.com/DFC-OpenSource.
[56] M. Jung, OpenExpress: fully hardware automated open research framework for future fast NVMe devices, in: 2020 USENIX Annual Technical Conference, USENIX ATC 20, 2020, pp. 649-656.
[57] J. Kwak, S. Lee, K. Park, J. Jeong, Y.H. Song, Cosmos+ OpenSSD: rapid prototype for flash storage systems, ACM Trans. Storage 16 (3) (2020) 15:1-15:35, http://dx.doi.org/10.1145/3385073.
[58] Filebench, https://github.com/filebench/filebench.

Juncheng Hu received the bachelor's degree and the Doctor of Engineering degree from Jilin University in 2017 and 2022, respectively, where he is currently a lecturer. His research interests include data mining, machine learning, computer networks and parallel computing.
jchu@jlu.edu.cn

Shuo Chen has been working toward the master's degree with the College of Computer Science and Technology, Jilin University, since 2022. His research field is computer architecture, mainly focusing on optimization for caching systems.
chenshuo22@mails.jlu.edu.cn

Wei Haoyang, a master's student in Computer Science and Technology at Jilin University (class of 2023), focuses on computer architecture research, with a primary interest in the application of new storage devices.
hywei23@mails.jlu.edu.cn

Guoyu Wang is currently working toward the doctoral degree with the College of Computer Science and Technology, Jilin University.
wgy21@mails.jlu.edu.cn

Pei Chenju is an undergraduate student at the School of Computer Science and Technology at Jilin University. His field of research is computer system architecture, and he is currently investigating new L7 load balancing solutions.
peicj2121@mails.jlu.edu.cn

Che Xilong received the M.S. and Ph.D. degrees in Computer Science from Jilin University in 2006 and 2009, respectively. Currently, he is a full professor and doctoral supervisor at the College of Computer Science and Technology, Jilin University, China. His current research areas are parallel & distributed computing, high performance computing architectures, and related optimizations. He is a member of the China Computer Federation. Corresponding author of this paper.
chexilong@jlu.edu.cn