Journal of Systems Architecture 160 (2025) 103348
Contents lists available at ScienceDirect

Journal of Systems Architecture

journal homepage: www.elsevier.com/locate/sysarc

StorStack: A full-stack design for in-storage file systems

Juncheng Hu, Shuo Chen, Haoyang Wei, Guoyu Wang, Chenju Pei, Xilong Che ∗

College of Computer Science and Technology, Jilin University, Changchun, 130022, China

ARTICLE INFO

Keywords:
File system
In-storage Computing
Storage-class Memory

ABSTRACT

Due to the increasingly significant cost of data movement, In-storage Computing has attracted considerable attention in academia. While most In-storage Computing works allow direct data processing, these methods do not completely eliminate the participation of the CPU during file access, and data still needs to be moved from the file system into memory for processing. Even though there are attempts to put file systems into storage devices to solve this problem, the performance of such systems is not ideal when facing high-latency storage devices, due to bypassing the kernel and lacking a page cache.

To address the above issues, we propose StorStack, a full-stack, highly configurable in-storage file system framework and simulator that facilitates architecture- and system-level research. By offloading the file system into the storage device, the file system can be closer to the data, reducing the overhead of data movement. Meanwhile, it also avoids kernel traps and reduces communication overhead. More importantly, this design enables In-storage Computing applications to completely eliminate CPU participation. StorStack also provides a user-level cache to maintain performance when storage device access latency is high. To study performance, we implement a StorStack prototype and evaluate it under various benchmarks on QEMU and Linux. The results show that StorStack achieves up to 7x performance improvement with direct access and 5.2x with cache.

1. Introduction

In traditional computing architectures, data must be transferred from storage devices to memory for processing, which not only consumes the computing resources of the host, but also results in high energy consumption and I/O latency. As data scales continue to expand, In-storage Computing has been proposed to alleviate the pressure of data movement [1,2]. The core idea is to perform computations directly where the data is stored, without the need to move the data. The emergence of high-speed storage devices like SSDs [3] and SCMs [4,5] has significantly advanced research in In-storage Computing and transformed computer storage systems. To fully leverage the potential of storage systems and exploit the characteristics of this new computing paradigm, a redesign of storage stack software is required.

As the most essential part of the storage stack software, file systems have been residing in the operating system kernel for a very long time because they need to perform integrity assurance and access control to ensure data security. The kernel is considered a trusted domain compared to the user space. However, this seemingly good design has been challenged by new technologies. With the emergence of faster storage devices such as SSDs and SCMs, access latency decreases significantly compared to HDDs [6], leading to the software overhead of file systems [7,8] becoming a major performance bottleneck. Meanwhile, the design and operation of file systems determine their reliance on the CPU when accessing the file system. For In-storage Computing, although researchers are gradually reducing CPU involvement, current file systems still rely on the CPU to handle complex file management tasks and ensure system security and integrity.

On the one hand, to reduce the software overhead of file systems, many works aim at the kernel trap. For example, there are some efforts to move the file system into user space [8–13]. But running in user space may compromise the reliability of the file system, so bugs or malicious software may cause crashes and data loss. Some of these works try to move the critical parts of the file system back to the kernel. But in most cases, data-plane operations are interleaved with control-plane operations, which may diminish the performance improvement brought by kernel bypassing. In recent years, firmware file systems have been proposed, which move file systems onto the storage device controller [14–16] to completely get rid of the kernel trap. However, those file systems are designed to be strongly coupled with the storage device, making the device lack the replaceability of the file system and compatibility with conventional operating systems. In addition, these firmware file systems do not provide comprehensive security guarantees.

∗ Corresponding author.
E-mail addresses: jchu@jlu.edu.cn (J. Hu), chenshuo22@mails.jlu.edu.cn (S. Chen), hywei23@mails.jlu.edu.cn (H. Wei), wgy21@mails.jlu.edu.cn (G. Wang), peicj2121@mails.jlu.edu.cn (C. Pei), chexilong@jlu.edu.cn (X. Che).

https://doi.org/10.1016/j.sysarc.2025.103348
Received 29 August 2024; Received in revised form 24 November 2024; Accepted 18 January 2025
Available online 27 January 2025
1383-7621/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
On the other hand, to fully leverage the advantages of In-storage Computing, it is necessary to eliminate the participation of the host-side OS from the storage access path. In-storage Computing advocates a data-centric approach, where computation units are embedded within the storage devices to enable direct data processing. However, in the process of accessing files, traditional file systems still require CPU involvement. To know which data should be transferred next, file access must first be handled by the host-side file system in the operating system kernel. This CPU intervention limits the computational capacity improvements that In-storage Computing can offer.

Another point worth noting is that numerous studies propose improving system performance by allowing user applications to bypass the kernel and communicate directly with storage devices. This method demonstrates significant performance improvements when dealing with high-speed storage devices. However, due to the diversity of storage devices and their varying latencies, system performance may suffer when bypassing the high-speed cache, especially when using high-latency, low-speed storage devices. Therefore, the impact of cache configuration on performance is also a subject of our further research. In summary, despite various attempts to optimize file system performance and reduce CPU involvement, current solutions still have several issues.

To further optimize the performance and security of file systems and fully unleash the potential of in-storage computing, we propose StorStack, a full-stack, highly configurable in-storage file system framework and simulator for high-speed storage devices such as SSDs and SCMs. Since file systems always have a fixed primary functionality of managing the data mapping, which is similar in function to the flash translation layer (FTL) on the storage controller, we consider it natural and reasonable to run the file system on the storage controller.

StorStack has three main components: a device firmware runtime that enables file systems to run directly on the storage device, a user library to expose POSIX interfaces to user applications, and a kernel driver to guarantee access control. By moving the file system into the storage, StorStack aims to gain performance improvement from the concept of In-storage Computing that brings the file system closer to the data. Moreover, the file system code is removed from the kernel, which avoids the latency and context switches caused by kernel traps during file access. More importantly, StorStack can remove the CPU from the storage access path of In-storage Computing applications, maximizing the potential of In-storage Computing. To ensure the security and reliability of the file system, StorStack designs an efficient security mechanism, introducing a device-side controller as the runtime and retaining control-plane operations within the host kernel. By reducing the ratio of control-plane to data-plane operations, kernel traps are minimized, enhancing performance. StorStack also includes a user-level cache to explore the impact of caching on the performance of in-storage file systems.

We implemented StorStack as a prototype and evaluated it on QEMU and Linux 5.15. Experimental results demonstrate that StorStack performs up to 5.2x faster than Ext4 with cache and 7x with direct access. Regarding the cache, we find that as access latency increases, file systems with cache always maintain high speeds, whereas the speed of file systems without cache decreases significantly.

2. Background and related work

The storage and memory system has changed a lot in the past decades. With the development of speed, capacity, and size, and the emergence of new types of storage, a rethink of both hardware and software is required to exploit the potential of the system in the next era. In this section, we first discuss the trends of two novel high-speed non-volatile storage technologies, then explore the significance of applying In-storage Computing on these storage devices. Finally, we briefly introduce file systems in three different locations.

2.1. Hardware trends

Compared to the large, slow HDD, the solid-state drive (SSD) is a kind of flash-based non-volatile storage with small form factor, high speed, and low energy costs [17,18]. SSDs on the market today can provide up to 30 TB of capacity and 7 GB/s throughput on sequential read/write. To fully exploit this high performance, modern SSDs have switched from SATA to PCIe and NVMe. PCIe 5.0 [19] supports up to 16 lanes and a 32 GT/s data rate, which leads to more than 60 GB/s bandwidth. NVMe [3] is a communication protocol for non-volatile memories attached via PCIe, supporting up to 65,535 I/O queues, each with 65,535 depth. It also supports SSD-friendly operations like ZNS and KV, which can further enhance SSDs' throughput capabilities.

Storage class memory (SCM), also referred to as persistent memory (PMEM) or non-volatile memory (NVM), is a different type of storage device that is fast and byte-addressable like DRAM, but can also retain data without power like SSDs. Various technologies such as PRAM [20,21], MRAM [22], and ReRAM [23,24] have been explored to implement SCM, each exhibiting different performance characteristics. SCM provides higher bandwidth than SSD; it offers latency close to DRAM, and its capacity falls between SSD and DRAM [25]. As new blood in the storage hierarchy, SCM can provide more possibilities to multiple workloads [26–29].

Consequently, while the increased bandwidth and reduced latency of storage devices have substantially boosted the performance of computer systems and enabled novel application scenarios, these advancements also introduce several challenges, including heightened complexity in data management, the need to balance cost and efficiency, and issues related to technical compatibility and migration.

2.2. In-storage computing

While these new storage devices have significantly altered the memory hierarchy of computer systems, the memory wall between the CPU and off-chip memory is still the bottleneck of the whole system, especially with the rise of data-intensive workloads and the slowdown of Moore's law and Dennard scaling. To reduce the overhead of data movement, In-storage Computing (ISC) [30–32] has been proposed, gaining increasing attention with advancements in integration technologies. However, most current research predominantly focuses on offloading user-defined tasks to storage devices, and this approach still faces limitations in practice.

First, existing ISC methods exhibit significant shortcomings in terms of compatibility and portability. On the host side, developers must design custom APIs for ISC, which are incompatible with existing system interfaces such as POSIX, demanding substantial modifications to the host code [32]. On the drive side, the drive program either collaborates with the host file system to access the correct file data [33] or manages the drive as a bare block device without a file system. However, most systems still rely on file system-based external storage access, with the file system typically running on the CPU. Consequently, ISC tasks often require CPU involvement when accessing external storage data.

Secondly, current approaches lack adequate protection and isolation for ISC applications. To fully leverage the high speed of modern storage devices, multiple ISC applications may need to execute concurrently. Without proper data protection mechanisms, malicious or erroneous ISC tasks could access unauthorized data. Without isolation, the execution of one ISC task could compromise the performance and security of others. However, most existing research [1,34,35] assumes that ISC tasks operate in an exclusive execution environment, failing to address these concerns effectively. Additionally, when specific code is offloaded to storage devices, attackers can exploit vulnerabilities in in-storage software and hardware firmware, such as buffer overflows [36,37] or bus snooping attacks, to escalate privileges and harm the system.
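As a quick sanity check of the PCIe 5.0 figures quoted in Section 2.1 (16 lanes at 32 GT/s, more than 60 GB/s), the effective bandwidth can be computed from the raw line rate and PCIe 5.0's 128b/130b encoding. The helper below is our own sketch, not from the paper:

```c
#include <assert.h>

/* Effective PCIe payload bandwidth in GB/s. PCIe 5.0 runs each lane at
 * 32 GT/s and uses 128b/130b encoding, so 128 of every 130 transferred
 * bits carry payload. Protocol overhead (TLP headers, flow control) is
 * ignored here, so real-world figures are somewhat lower. */
static double pcie_bandwidth_gbs(int lanes, double gt_per_s)
{
    double payload_bits_per_lane = gt_per_s * 1e9 * 128.0 / 130.0;
    return payload_bits_per_lane / 8.0 * lanes / 1e9; /* bytes/s -> GB/s */
}
```

For a x16 link at 32 GT/s this gives about 63 GB/s, consistent with the "more than 60 GB/s" figure in the text.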
2.3. File system

The evolution of storage hardware poses higher demands on software systems. As a crucial part of the software stack of the storage system, file systems should be redesigned to minimize software overheads, especially the involvement of the OS kernel on the data path. Many efforts have explored the possibility of different file system locations.

Kernel file systems. Numerous typical file systems are implemented inside the kernel as kernel file systems, including Ext4, XFS, etc. Due to the isolation of kernel space, kernel file systems can easily manage data and metadata with reliability guarantees [38]. Recent works on kernel file systems have sought to exploit the capabilities of modern storage devices. For example, F2FS [39] is built on append-only logging to adapt to the characteristics of flash memory. PMFS [38] introduces a new hardware primitive to avoid the consistency issues caused by the CPU cache while accessing SCM. DAX [40] bypasses the buffer cache of the system to support direct access to the storage hardware so that the redundant data movement between DRAM and SCM is removed. NOVA [41] explores the hybrid of DRAM and SCM as a specially designed log-structured file system. However, kernel file systems have several limitations. Firstly, the development and debugging process within kernel space is inherently complex and difficult. Furthermore, every file system access necessitates a kernel trap, which inevitably introduces latency. Additionally, the frequent context switching between user processes and the kernel increases CPU overhead.

User-space file systems. User-space file systems are implemented mostly in user space to bypass the kernel and reduce the overhead associated with kernel traps. However, since most user-space file systems are implemented in untrusted environments, ensuring data security and reliability becomes challenging. User-space file systems need sophisticated design, usually the collaboration between kernel space and user space, to keep them reliable. For example, Strata [11] separates the file system into a per-process user-space update log for concurrent writing and a read-only kernel-space shared area for data persistence. Moneta-D [9] provides hardware virtual channel support with a kernel-space file system protection policy and a user-space driver to access the hardware. There are also efforts to implement the control plane of the file system as a trusted user-space process [8,12].

Firmware file systems. Works that offload part or the whole of the file system into the storage device firmware are categorized as firmware file systems. There are three representative works on firmware file systems: DevFS [14], CrossFS [15], and FusionFS [16]. DevFS and CrossFS explore the possibility of moving the file system to the storage side to benefit from kernel bypass. FusionFS goes further than the previous two works and attempts to gain performance by combining multiple storage access operations. However, we have identified several problems with these file systems. First, these firmware file systems are tightly coupled with specific storage devices, which makes it hard for users to select alternative file systems or upgrade the software version of the current file system. Second, none of these file systems are designed to operate effectively in scenarios with significant communication latency. Third, the lack of security mechanisms limits their applicability in real-world environments.

2.4. Motivation

Although kernel file systems are well-designed and time-tested, their design principles, which assume high device access latency, are no longer suitable for modern high-speed devices. User-space file systems and firmware file systems have explored new approaches to file system implementation in the era of high-speed storage; however, they may lead to inferior performance with traditional devices, compromised security controls, or inflexible, non-replaceable file systems. To address these issues, we introduce StorStack, a fast, flexible, and secure in-storage file system framework. The detailed comparison between StorStack and previous file systems is shown in Table 1.

3. Design

In this section, we first discuss the design principles of StorStack, followed by an overview of its architecture, the connection between host and device, scheduling mechanisms, and reliability designs.

3.1. Principles

1. Provide a full-stack framework to enable in-storage file systems without compromising performance. To support in-storage FS, StorStack's design includes a user library, a kernel driver, and a firmware FS runtime. By bringing FS code out of the kernel and closer to the data, StorStack avoids the kernel trap and reduces the communication overhead. StorStack also incorporates a user-level cache to maintain performance when the access latency of the device is high.

2. Make full use of the heterogeneity of the host CPU and storage device controller. The in-storage FS yields the host CPU time to user application code and cuts the energy cost, while conflicts due to concurrent access are resolved on the host CPU to maintain performance. If necessary, the cache is also retained on the host side and is managed in user space. Such a heterogeneous system can maximize the overall performance and minimize the power consumption of the system.

3. Guarantee the reliability of the file system with minimal overhead. To provide essential guarantees such as permission checking, StorStack keeps its control plane within the trusted area. Additionally, to enhance performance, a token mechanism is introduced to prevent StorStack from accessing the kernel during data-plane operations.

4. Keep compatible with conventional operating systems. The design of StorStack does not require changes to current operating systems. Instead, the user lib and kernel driver of StorStack are add-ons. Even without them, the StorStack storage device can be accessed with typical block- or byte-based interfaces, just like traditional SSDs or SCMs. StorStack also supports per-partition replaceable file systems, which is a regular function in current operating systems but not supported by firmware file systems.

5. Support heterogeneous computing. By providing a device-level file interface, StorStack may enable multiple advanced heterogeneous access patterns, including In-storage Computing (ISC) [31,32,42,43] and direct I/O access from GPUs [44,45] or NICs [42,46]. In this work, we provide basic support for these patterns and plan to further explore them in future research.

6. Run with a reasonable hardware setup on the storage device. Previous research on firmware file systems has assumed that device controller hardware capabilities are severely limited. However, today's high-end storage devices feature up to 4 cores and DRAM capacity that can reach 1% of their storage capacity [47]. As in-storage processing evolves, hardware configurations will continue to improve [30,43,48–50]. In StorStack, we assume that the device possesses sufficient capabilities to run file systems alongside a runtime environment. Future research can investigate the benefits of integrating in-storage file systems with additional device-side capabilities, such as power-loss protection capacitors or the flash translation layer.

3.2. Architecture

To support in-storage file systems with compatibility, flexibility, and reliability, StorStack has three major parts distributed over user space, kernel space, and the device side.
Table 1
The detailed comparison between StorStack and previous file systems.

                 Software access  Expected hardware  FS position  Host-side  Replaceable  Isolated access
                 latency          latency                         cache      FS           control
Kernel FS        High             High               Host         ✓          ✓            ✓
User-space FS    Low              Low                Host         ◦          ✓            ◦
Prev. Firm FS    Low              Low                Device       ×          ×            ×
StorStack        Low              Either             Device       ✓          ✓            ✓

Fig. 1. StorStack Architecture. StorStack consists of three major modules: the U-lib, the K-lib, and the Firm-RT; and there are two workflows: a data-plane workflow and a control-plane workflow. The interconnection between them is shown in the figure.

3.2.1. High-level design

As shown in Fig. 1, StorStack consists of three major parts: a user lib (U-lib), a kernel driver (K-lib), and an FS runtime in device firmware (Firm-RT).

U-lib. The U-lib is the interface for user applications to access the in-storage FS, offered as a dynamic link library. The main job of the U-lib is to expose POSIX file operations to users, provide the user-level cache, and manage the connection with the device. It also cooperates with the K-lib and the Firm-RT to ensure the reliability of the system.

K-lib. The K-lib is a kernel module that provides control-plane operations with reliability. Its work includes resource allocation and permission checking. Although it resides in the kernel, the functions of the K-lib are designed to be rarely called to avoid the performance penalty associated with kernel traps.

Firm-RT. The Firm-RT is a runtime on the storage firmware that offers essential hardware and software support for the in-storage FS to run on the device controller. To serve the FS, the Firm-RT communicates with both the U-lib for data-plane operations and the K-lib for control-plane operations.

3.2.2. StorStack workflow

For clarity, the workflow of StorStack is divided into a data plane and a control plane. The data-plane workflow handles data accesses from user space, and the control plane is responsible for maintaining the system's functionality, safety, and reliability.

For the data plane (red lines in Fig. 1), when a user application calls a file operation in StorStack, the host-side U-lib will check the cache if the cache is used. If the cache is bypassed or penetrated, the U-lib packs the operation into an extended NVMe protocol command and subsequently transmits it to the device-side Firm-RT. The Firm-RT receives the NVMe command, checks its validity, and then forwards the command to the FS. The FS handles the file operation and then works with the FTL or other hardware instruments to arrange the data blocks on the storage media. The primary distinction between this routine and a typical kernel-based file system lies in the fact that the file system logic is inside the storage device, so StorStack eliminates the need for kernel traps during data access.

The control plane (blue dashed lines in Fig. 1) provides necessary support for the data plane to work properly. Control-plane operations on the host side, including memory resource allocation and identity token assignment, are delegated to the kernel to ensure security and reliability. The host-side control-plane operations are designed to be rarely called to reduce kernel trap overhead. On the device, the control plane assists in checking the authentication of requests, managing the FS, and dealing with other management operations. More detailed security and reliability policies will be described in Section 3.5.

3.2.3. Organization on the storage

In StorStack, file systems are stored in the storage media with pointers originating from partitions, so that the framework can choose the right FS to access a partition. We dedicate a partition to store all the FS binaries that are used by user-created partitions, and each FS in this partition can be indexed by a number. Here we assume that a GUID partition table (GPT) is used to organize the partitions. Each user-created partition is associated with an FS when it is formatted, and the FS will be added to the FS partition we just mentioned if it was not there yet. To indicate the relation between the user-created partition and its FS, the index number of the FS is added to the attribute flag bits of the partition's GPT entry. The organization is illustrated in Fig. 2.
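The data-plane packing step in Section 3.2.2, where the U-lib turns a POSIX call into an extended NVMe command, could look roughly like the sketch below. NVMe submission-queue entries are 64 bytes, and the specification reserves an opcode range for vendor-specific commands; the layout, field names, and opcode values here are illustrative assumptions, not StorStack's actual wire format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opcodes for file operations, carried in the
 * vendor-specific NVMe opcode range (0xC0-0xFF). */
enum storstack_op { SS_OPEN = 0xC0, SS_READ = 0xC1, SS_WRITE = 0xC2, SS_CLOSE = 0xC3 };

/* A 64-byte command mirroring the NVMe submission-queue entry size.
 * Field names are illustrative, not the actual StorStack layout. */
struct storstack_cmd {
    uint8_t  opcode;      /* one of enum storstack_op          */
    uint8_t  flags;
    uint16_t cid;         /* command identifier                */
    uint32_t fd;          /* in-storage file descriptor        */
    uint64_t offset;      /* file offset for read/write        */
    uint64_t length;      /* transfer length in bytes          */
    uint64_t prp1, prp2;  /* host DMA addresses, as in NVMe    */
    uint8_t  token[16];   /* identity token issued by the K-lib */
    uint8_t  rsvd[8];
};

/* Pack a pread-style request; the device DMAs the result to buf_addr. */
static struct storstack_cmd pack_pread(uint16_t cid, uint32_t fd,
                                       uint64_t off, uint64_t len,
                                       uint64_t buf_addr)
{
    struct storstack_cmd c;
    memset(&c, 0, sizeof c);
    c.opcode = SS_READ;
    c.cid    = cid;
    c.fd     = fd;
    c.offset = off;
    c.length = len;
    c.prp1   = buf_addr;
    return c;
}
```

The token field reflects the token mechanism of Section 3.1 (principle 3): the device-side control plane can authenticate the request without a host-side kernel trap on the data path.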
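Section 3.2.3 records each partition's FS index in the attribute flag bits of its GPT entry. A GPT entry's attribute field is 64 bits wide, with bits 48-63 reserved for type-specific use; keeping an 8-bit FS index at bit 48, as below, is our illustrative assumption rather than StorStack's documented layout.

```c
#include <assert.h>
#include <stdint.h>

/* The GPT entry's 64-bit attribute field reserves bits 48-63 for
 * type-specific use; we (hypothetically) keep an 8-bit FS index there. */
#define FS_INDEX_SHIFT 48
#define FS_INDEX_MASK  ((uint64_t)0xFF << FS_INDEX_SHIFT)

/* Store an FS index without disturbing the other attribute bits. */
static uint64_t set_fs_index(uint64_t attrs, uint8_t fs_index)
{
    return (attrs & ~FS_INDEX_MASK) | ((uint64_t)fs_index << FS_INDEX_SHIFT);
}

/* Recover the FS index from a partition's attribute field. */
static uint8_t get_fs_index(uint64_t attrs)
{
    return (uint8_t)((attrs & FS_INDEX_MASK) >> FS_INDEX_SHIFT);
}
```

A formatting tool would call set_fs_index when associating a partition with an FS, and the Firm-RT would call get_fs_index to pick the right FS binary from the dedicated FS partition.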
Fig. 2. Partition organization. Figure shows how the FS is stored on the storage and associated with the partition.
This design allows StorStack to provide different file systems to different partitions. Meanwhile, the GPT and the partitions are still available for the typical kernel file system routine.

3.3. File access pattern

The U-lib provides POSIX IO and AIO interfaces to user applications, and the complicated reliability and performance designs are transparent to users. For regular IO interfaces, the write operations (write, pwrite) act differently with and without cache. When the cache is used, writes will return as soon as an operation passes some simple checks and is put into the queue. The interface will not promise that the data is written to the disk before it returns, just like a traditional kernel file system, unless fsync is called. Without cache, the writes will block the process until the data is written to the storage. The read interfaces (read, pread) will not return until the data is available, regardless of whether there is a cache. The AIO interfaces return immediately when an operation is put into the queue, and the real return value can be fetched by non-blocking check, blocking suspend, or signal.

To make sure that StorStack performs well on high-latency storage devices, an optional user-level per-process cache is provided. Because the reliability of StorStack can only be ensured by the device-side file system and not the U-lib, we choose a per-process cache to prevent malicious processes from polluting data by writing to a global cache without checks. The user-level cache has two ways to deal with write operations: the write-back method returns immediately after the data is put into the cache; the write-around method drops the dirty data in the cache and returns after the operation is put into the queue. The write-back cache has higher performance than the write-around cache, while the write-around cache can provide higher data consistency. In fact, our evaluation shows that the write-back cache in StorStack can outperform the page cache inside the kernel.

3.4. Connectivity

Here we discuss how the host-side U-lib and K-lib communicate with the device-side Firm-RT. StorStack's communication is based on NVMe to take full advantage of high-speed storage devices. We also propose a multi-queue design to improve the performance of the device-side FS.

3.4.1. Communication protocol

The communication protocol between the host CPU and the StorStack device is a queued protocol extended from NVMe [3]. NVMe is a protocol for accessing non-volatile memories connected via PCIe that supports multiple queues to maximize throughput, which is suitable for novel high-speed storage devices such as SSDs and SCMs.

To enable the transfer of file operations, we extend the NVMe command list to incorporate the POSIX I/O interface. Meanwhile, the regular data access pattern of NVMe is retained to enable normal disk access when the system does not support StorStack. It is noteworthy that the protocol can be further extended under StorStack to support more paradigms like transactional access [51], log-structured access [52,53], operation fusing [16], or In-storage Computing. We will leave these further explorations to our future work.

With StorStack, heterogeneous hardware like GPUs can implement this extended protocol to access files directly without involving the CPU. For different types of hardware, there are two ways to transmit data. For devices that have their own memory (memory-mapped), like GPUs, StorStack can directly place the data into their memory via the PCIe bus. For hardware without memory (I/O-mapped), StorStack should put the data into the main memory. The manipulation of the data destination is directed by the target device driver.

3.4.2. Multi-queue arrangement

NVMe uses multiple queues to improve performance, supporting up to 65,536 I/O queues, with 65,536 commands per queue. Normally, NVMe offers at least a pair of queues (one submission queue and one completion queue) for each core to fully utilize the bandwidth without introducing locks. In StorStack, file operations are processed on the device side, particularly when the storage device features a multi-core controller. To fully utilize the parallelism of the controller cores while minimizing the potential conflicts of concurrent file access, StorStack introduces a special queue organization.

As Fig. 3 shows, every user process in StorStack is assigned a bunch of queue pairs, the number of which is equal to the storage device controller core count. Each queue pair of the queue-pair bunch is bound to a controller core of the storage device, so that a process can distribute any file operation to a specific controller core. Meanwhile, each user thread has its exclusive queue-pair bunch to avoid queue contention on the host side.

The purpose of this arrangement is to enable host-side applications to control which operation should be dispatched to which controller core. For example, read-intensive applications can issue read operations to all cores with a round-robin strategy. For write-intensive applications, different threads can send the write operations on the same file to the same controller core to reduce lock contention between controller cores. We will leave the exploration of the scheduling policy for different workloads to future work.

3.5. Security and reliability

From a hardware perspective, the privileged mode (ring 0) that the kernel runs in and the user mode that user applications run in are isolated, which means access to resources is restricted by hardware. The privileged mode can thus be treated as a trusted area, whereas the user mode is an untrusted area. StorStack introduces the device-side controller as a runtime, which is also isolated from user code and thus viewed as a trusted area.

For safety, everything critical to the correctness of the system should be placed in the trusted area. Typical kernel file systems are placed inside the kernel as they need to manage the data on block devices.
|
||
|
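The two write policies of the user-level cache described above can be sketched as follows. This is a minimal illustration under our own simplifications: the `DeviceQueue` stand-in, the dict-based block store, and the explicit `flush` method are assumptions for demonstration, not the StorStack implementation.

```python
# Minimal sketch of the write-back and write-around policies of a
# user-level cache. DeviceQueue stands in for a StorStack submission
# queue; the real U-lib submits extended NVMe commands instead.

class DeviceQueue:
    def __init__(self):
        self.submitted = []          # operations handed to the device

    def submit(self, op):
        self.submitted.append(op)    # asynchronous submission; no wait

class UserLevelCache:
    def __init__(self, queue, policy="write-back"):
        assert policy in ("write-back", "write-around")
        self.queue, self.policy = queue, policy
        self.blocks = {}             # block number -> data
        self.dirty = set()           # blocks not yet on the device

    def write(self, blkno, data):
        if self.policy == "write-back":
            # Return as soon as the data is in the cache; the dirty
            # block reaches the device only on a later flush.
            self.blocks[blkno] = data
            self.dirty.add(blkno)
        else:  # write-around
            # Drop any cached copy and return once the operation is
            # queued, so the device is never behind the cache.
            self.blocks.pop(blkno, None)
            self.dirty.discard(blkno)
            self.queue.submit(("write", blkno, data))

    def flush(self):
        for blkno in sorted(self.dirty):
            self.queue.submit(("write", blkno, self.blocks[blkno]))
        self.dirty.clear()
```

The sketch makes the trade-off visible: write-back completes writes at DRAM speed but leaves dirty state behind, while write-around keeps the device authoritative for every written block.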
Fig. 3. Queue arrangement and scheduling policies. This figure shows how the queue pairs are mapped between host CPU threads and device controller cores.

StorStack shifts the FS to the device side, which is also a trusted area. Meanwhile, as described in Section 3.2.2, StorStack separates the host-side workflow into a control plane and a data plane. The control plane is designed to reside in the host-side trusted area, i.e. the kernel, to cooperate with the device-side FS to ensure security and reliability.

An important design principle of the control plane is to reduce the overhead of kernel traps. In StorStack, this is done by reducing the proportion of control-plane operations relative to data-plane operations. There are two types of control-plane workflow on the host side: resource allocation and access control. Both are designed to be called rarely.

3.5.1. Resource allocation
The U-lib of StorStack is a user-space driver that communicates with the NVMe storage device. It needs to set up VFIO and manage DMA memory mappings to enable direct access from user space. It also needs to allocate areas for caches. These operations involve the kernel but only need to run once when the device is initialized, so they incur no performance loss in regular file access.

3.5.2. Permission checking
To provide access control, file systems must check the user's permission to make sure that a file operation is legal. In kernel file systems, the file system can use the process structure in the kernel to validate the process's identity, and then compare it with the permission information stored in the file's inode. In StorStack, however, the file system resides on the device rather than in the kernel, so the kernel needs to share the process's information with the device to support permission checking.

To avoid entering the kernel frequently, DevFS [14] maintains a table in the device that maps CPU IDs to process credentials. All requests are tagged with the ID of the CPU that the process runs on before they are sent to the device, and the kernel is modified to update the table whenever a process is scheduled on a host CPU. There are two problems with this mechanism. Firstly, it assumes that the CPU ID is unforgeable, but a malicious process can potentially exploit the ID of another CPU to escalate its privilege. Secondly, it requires a modification to the process scheduler, which is a core module of the kernel, making it incompatible with standard OS kernels and possibly slowing down the system.

In StorStack, we propose a new method to share the credential of the process, with less communication, a safer guarantee, and no change to the Linux kernel. The process is shown in Fig. 4. When the U-lib is initialized in a process, it calls the K-lib (a kernel driver) via ioctl() (a system call) to get a credential token. The K-lib generates a secret key if one has not been set yet, then saves it and copies it to the device via the kernel NVMe driver. Once the key is set, the K-lib uses it to encrypt the process's credential information (i.e. the uid) into a MAC (Message Authentication Code). The resulting token, which is the output of the encryption, is then returned to the process. Since the secret key is stored in the kernel, the process cannot forge a token but can only use the one assigned by the kernel, which proves the authenticity of the uid claimed by the process. Before being sent to the device, every request from the process is tagged with the process's uid and the token, so that the device can use the secret key and the token to verify the uid and check the identity of the request. This mechanism requires only one communication between the kernel and the device to share the secret key, and one kernel trap to initialize the token for each process. Also, the K-lib is implemented as a kernel driver, without any modification to the core functions of the kernel, which makes it compatible with conventional operating systems.

Fig. 4. Permission checking. This figure shows how the user space, the kernel space, and the device work together to check the validity of a request without frequent kernel traps.

3.5.3. Device lock
StorStack is designed to support direct I/O not only from CPUs, but also from different types of heterogeneous computing devices. To prevent concurrent access to the same file from multiple devices, a concurrency control method is required. A common practice is to implement a distributed lock across all devices, but this can be too costly for low-level hardware. In StorStack, we provide in-storage file-level locking mechanisms to protect files from unexpected access by multiple devices.

StorStack supports two types of lock: (1) the spinning lock, where an error code is returned to the caller if the file it accesses is already locked by another device, allowing the caller to keep attempting to acquire the lock until the file is unlocked; (2) the sleeping lock, where, if the file is locked, any requests from other devices to that file wait in the submission queue until the file is unlocked. From the perspective of concurrency, StorStack supports both shared and exclusive locks, which behave exactly the same as those on other systems.
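The semantics of the two in-storage lock types of Section 3.5.3 can be modeled as follows. This is an illustrative host-side sketch only; the `EBUSY`-style error code and the parked-request layout are our assumptions, not StorStack's device firmware.

```python
# Model of StorStack's two file-lock behaviors: a spinning lock that
# fails fast with an error code, and a sleeping lock that parks the
# request (as if left in the submission queue) until release.

from collections import deque

EBUSY = -16          # assumed error code returned by the spinning lock

class InStorageFileLock:
    def __init__(self):
        self.holder = None           # device currently holding the lock
        self.waiting = deque()       # sleeping requests parked in order

    def try_acquire_spin(self, device):
        """Spinning lock: fail immediately so the caller may retry."""
        if self.holder is None or self.holder == device:
            self.holder = device
            return 0
        return EBUSY

    def acquire_sleep(self, device, request):
        """Sleeping lock: park the request until the file is unlocked."""
        if self.holder is None or self.holder == device:
            self.holder = device
            return "served"
        self.waiting.append((device, request))
        return "queued"

    def release(self):
        self.holder = None
        if self.waiting:             # hand the lock to the oldest waiter
            device, request = self.waiting.popleft()
            self.holder = device
            return (device, request)
        return None
```

The difference is who pays for the wait: the spinning lock pushes retries back to the caller, while the sleeping lock lets the device serve parked requests in arrival order on release.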
Fig. 5. Random and sequential r/w. This figure shows the basic performance of StorStack compared with Ext-4, under different cache, block size, and in-storage file system settings.

3.6. Implementation

We have implemented a prototype of StorStack, which consists of three parts: the U-lib, the K-lib, and the Firm-RT. The source code of this prototype is available at https://anonymous.4open.science/r/StorStack-524F/.

The U-lib is implemented under Linux 5.15, utilizing SPDK [54] to access storage devices from user space. The SPDK library is modified in StorStack to transfer POSIX I/O operations over NVMe. The U-lib comprises two major components: a dynamic link library that provides the interfaces and a user-level cache for accessing the device, and a daemon program responsible for managing the connection to the device.

The K-lib is implemented as a simple kernel module for the Linux 5.15 kernel. It takes charge of only two things: creating the secret key when StorStack is initialized, so that the K-lib and the Firm-RT can use it to encrypt and decrypt the MAC token for processes' credentials; and generating the MAC token from the uid of the current process with the HMAC algorithm when the process initializes, then returning it to the U-lib. The interface of the K-lib is exposed to user space through ioctl.

The Firm-RT is the only component located on the device side. In this work, the Firm-RT is not implemented on actual storage hardware but is instead simulated using QEMU and the system running on its host machine. There are two reasons for the simulation: first, although there are several works on programmable storage controllers [49,55–57], these solutions are either expensive or lack high-level programmability, as most of them are based on FPGAs; second, by simulating with various latency settings, we can evaluate the performance of StorStack on different types of storage devices, which would be costly with real hardware. In our prototype, QEMU has been modified to handle the extended NVMe POSIX I/O operations and to check the token of each operation.

4. Evaluation

In this section, we evaluate the performance of StorStack and compare it with popular file systems to answer the following questions:

• Is StorStack efficient enough compared to widely used kernel file systems?
• How much performance is gained from kernel trap avoidance?
• How does StorStack perform on different types of devices?
• How is the concurrency performance of StorStack?

Fig. 6. Time cost for a single operation.

4.1. Experimental setup

Our experiment platform is a 20-core 2.4 GHz Intel Xeon server equipped with 64 GB DDR4 memory and a 512 GB SSD. Of these, 8 cores with 16 GB memory are assigned to the QEMU VM that simulates the StorStack host; the other cores with 16 GB memory are reserved to emulate the StorStack device. Both the StorStack host and the StorStack device run on Linux 5.15.

StorStack's expected setting on the device requires only a minimal embedded system with abstractions of the hardware functions and the necessary libraries, but due to our simulation requirements, we choose Linux as the device-side environment to support the execution of QEMU.

We evaluate the performance of StorStack using Filebench [58], a widely used benchmarking suite for testing file system performance. We access StorStack under various configurations, including different cache options, device access latencies, thread counts, and read/write ratios, to address the four questions raised above.
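The K-lib token scheme described in the implementation can be sketched with a standard HMAC. The paper specifies an HMAC over the process uid; the choice of SHA-256 and the message framing below are our assumptions for illustration.

```python
# Sketch of the credential-token scheme: the "kernel" holds the secret
# key, shares it with the "device" once, and issues one MAC token per
# process; the device then verifies (uid, token) pairs on every request.

import hashlib
import hmac
import os

class KLib:
    """Stands in for the kernel driver that holds the secret key."""

    def __init__(self):
        self.secret = os.urandom(32)  # created once, at StorStack init

    def issue_token(self, uid: int) -> bytes:
        # One ioctl()-style kernel trap per process: MAC over the uid.
        return hmac.new(self.secret, str(uid).encode(), hashlib.sha256).digest()

class FirmRT:
    """Stands in for the device, given the key over NVMe once."""

    def __init__(self, secret: bytes):
        self.secret = secret

    def verify(self, uid: int, token: bytes) -> bool:
        # Recompute the MAC for the claimed uid; compare in constant time.
        expected = hmac.new(self.secret, str(uid).encode(), hashlib.sha256).digest()
        return hmac.compare_digest(expected, token)

klib = KLib()
device = FirmRT(klib.secret)      # the single kernel-to-device key share
token = klib.issue_token(1000)    # the single kernel trap for this process
```

Because only the kernel and the device ever see `secret`, a user process can present its token but cannot mint one for another uid, which is exactly the forgery the CPU-ID tagging scheme could not rule out.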
Fig. 7. Performance with simulated latency. This figure shows the change in throughput as a function of simulated device access latency.

Fig. 8. Multi-thread performance.

4.2. Random and sequential r/w

First, we evaluate StorStack's performance with single-thread random and sequential read/write tests. The random tests run on a 1 GB file with 1 KB, 4 KB, and 16 KB I/O sizes. The sequential tests run on an 8 GB file with 8 KB, 32 KB, and 128 KB I/O sizes. Both files are stored in DRAM, which is simulated as PMEM via memmap. The tests are performed on StorStack (referred to as SS) with two different in-storage FS settings: SS+Ext-4 and SS+Ext-4_DAX. We then compare them with Ext-4. We also evaluate the performance of SS without cache (SS NC) and Ext-4 with direct I/O (Ext-4_DIO) to study the performance improvement of direct access.

Fig. 5 shows the results of the random and sequential tests. In both tests, SS outperforms the traditional kernel-level Ext-4, due to our kernel-bypass and near-data file system design. SS+Ext-4_DAX with the user-level write-back cache achieves on average 1.98x, 4.25x, 3.59x, and 4.08x performance gains on random read, random write, sequential read, and sequential write respectively, compared with Ext-4 with the page cache. For direct access, the speedup is 6.41x, 6.21x, 4.72x, and 1.90x respectively. Another interesting phenomenon is that in cached StorStack, the performance of SS+Ext-4 and SS+Ext-4_DAX is similar, indicating that the choice of the in-storage file system does not matter much because most operations are handled by the user-level cache. In the uncached tests, however, SS+Ext-4_DAX shows better results, which means that the in-storage file system may influence the overall performance of direct access.

4.3. Profit of kernel bypassing

We measure the time cost of a single operation to study the profit of kernel bypassing. The cached test demonstrates the impact of the kernel trap on access to the in-memory page cache. The uncached test shows the impact of both the kernel trap and write amplification on direct access to the storage device. Both tests use a 4 KB block size, and the files are stored on the simulated PMEM. The results in Fig. 6 indicate that, compared to Ext-4, SS+Ext-4_DAX reduces latency by 91.91%, 50.46%, 69.83%, and 81.83% on cached read, cached write, uncached read, and uncached write respectively.

When the cache hits, the data resides in fast DRAM, resulting in low data-fetch latency. In this scenario, traditional Ext-4 exhibits higher access latency, as the kernel trap accounts for most of the latency. In contrast, StorStack shows lower latency because its cache is implemented in user space, eliminating the need for kernel traps. When a cache miss occurs, the primary overhead shifts to the multiple rounds of storage device access, which further widens the performance gap between traditional Ext-4 and StorStack.

4.4. Impact of access latency

Storage devices with different access latencies may influence the performance of file systems. In this experiment, we use multiple latency settings to simulate devices with different access speeds. The latency is simulated on the device side by QEMU.

We compare the performance of SS with Ext-4 under cached and uncached settings using several latency settings. The latency ranges from 0 μs to 25 μs to simulate connection methods from DDR to PCIe to RDMA. The tests run with a 4 KB block size.

Fig. 7 shows the result of this test. With a cache, neither SS nor Ext-4 is susceptible to the rise in latency. Without a cache, however, the performance of SS degrades by 78.20%, from 526 MB/s at 0 simulated latency to 115 MB/s at 25 μs latency. The performance of Ext-4 also drops by 20.98%, from 54 MB/s to 43 MB/s. Note that the experiment introduces extra latency due to QEMU, so the simulated 0 latency is actually larger than 0, meaning that the curves could rise even higher at the left side of the graph. The result illustrates that direct access in SS should only be enabled on ultra-low-latency devices; for other hardware, it is better to enable the cache.

4.5. Multi-thread performance

To study the performance of StorStack under multiple threads, we evaluate SS and Ext-4 under a multi-thread micro-benchmark. The benchmark performs parallel 4 KB file operations on one file with 4 threads; each thread is a reader or a writer, and the ratio of readers and
writers is set to 4:0, 3:1, 1:3, and 0:4. Fig. 8 shows the result. StorStack is faster than Ext-4 in all concurrent read and write scenarios of our test. For the cached scenario, SS is on average 2.88x faster than Ext-4 across all read-write ratios. For the uncached scenario, the speedup is 17.34x.

5. Conclusion

In this paper, we present StorStack, a full-stack design for an in-storage file system framework and simulator. The StorStack components across user space, kernel space, and device space collaborate to enable file systems to run inside the storage device efficiently and reliably. We implement a prototype of StorStack and evaluate it with various settings. Experimental results show that StorStack outperforms current kernel file systems in both cached and uncached scenarios. Some further performance optimizations, such as the combination of file system and storage hardware capabilities, the exploration of multi-queue scheduling strategies for different workloads, and the performance of direct access from heterogeneous devices, are left to future work.

CRediT authorship contribution statement

Juncheng Hu: Writing – review & editing, Writing – original draft. Shuo Chen: Formal analysis, Data curation. Haoyang Wei: Formal analysis, Data curation. Guoyu Wang: Writing – review & editing, Writing – original draft. Chenju Pei: Formal analysis, Data curation. Xilong Che: Methodology, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was funded by the National Key Research and Development Programme No. 2024YFB3310200, by the Key Scientific and Technological R&D Plan of Jilin Province of China under Grant No. 20230201066GX, and by the Central University Basic Scientific Research Fund under Grant No. 2023-JCXK-04.

References

[1] G. Koo, K.K. Matam, T. I, H.K.G. Narra, J. Li, H.-W. Tseng, S. Swanson, M. Annavaram, Summarizer: trading communication with computing near storage, in: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017, pp. 219–231.
[2] S.S.M. Gahagan, S. Bhaskaran, T. Bunker, A. De, Y. Jin, Y. Liu, S. Swanson, Willow: A user-programmable SSD, in: OSDI, 2014.
[3] NVMe specifications, https://nvmexpress.org/specifications/.
[4] Intel, Intel® Optane™ Persistent Memory, https://www.intel.com/content/www/us/en/products/docs/memory-storage/optane-persistent-memory/overview.html.
[5] S. Mittal, J.S. Vetter, A survey of software techniques for using non-volatile memories for storage and main memory systems, IEEE Trans. Parallel Distrib. Syst. 27 (5) (2016) 1537–1550, http://dx.doi.org/10.1109/TPDS.2015.2442980.
[6] M. Wei, M. Bjørling, P. Bonnet, S. Swanson, I/O speculation for the microsecond era, in: 2014 USENIX Annual Technical Conference, USENIX ATC 14, 2014, pp. 475–481.
[7] S. Peter, J. Li, I. Zhang, D.R.K. Ports, D. Woos, A. Krishnamurthy, T. Anderson, T. Roscoe, Arrakis: the operating system is the control plane, in: 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI 14, 2014, pp. 1–16.
[8] H. Volos, S. Nalli, S. Panneerselvam, V. Varadarajan, P. Saxena, M.M. Swift, Aerie: Flexible file-system interfaces to storage-class memory, in: Proceedings of the Ninth European Conference on Computer Systems, in: EuroSys '14, Association for Computing Machinery, New York, NY, USA, 2014, pp. 1–14, http://dx.doi.org/10.1145/2592798.2592810.
[9] A.M. Caulfield, T.I. Mollov, L.A. Eisner, A. De, J. Coburn, S. Swanson, Providing safe, user space access to fast, solid state disks, ACM SIGPLAN Not. 47 (4) (2012) 387–400, http://dx.doi.org/10.1145/2248487.2151017.
[10] M. Dong, H. Bu, J. Yi, B. Dong, H. Chen, Performance and protection in the ZoFS user-space NVM file system, in: Proceedings of the 27th ACM Symposium on Operating Systems Principles, ACM, Huntsville, Ontario, Canada, 2019, pp. 478–493, http://dx.doi.org/10.1145/3341301.3359637.
[11] Y. Kwon, H. Fingler, T. Hunt, S. Peter, E. Witchel, T. Anderson, Strata: A cross media file system, in: Proceedings of the 26th Symposium on Operating Systems Principles, in: SOSP '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 460–477, http://dx.doi.org/10.1145/3132747.3132770.
[12] J. Liu, A.C. Arpaci-Dusseau, R.H. Arpaci-Dusseau, S. Kannan, File systems as processes, in: 11th USENIX Workshop on Hot Topics in Storage and File Systems, HotStorage 19, USENIX Association, Renton, WA, 2019.
[13] S. Zhong, C. Ye, G. Hu, S. Qu, A. Arpaci-Dusseau, R. Arpaci-Dusseau, M. Swift, MadFS: per-file virtualization for userspace persistent memory filesystems, in: 21st USENIX Conference on File and Storage Technologies, FAST 23, 2023, pp. 265–280.
[14] S. Kannan, A.C. Arpaci-Dusseau, R.H. Arpaci-Dusseau, Y. Wang, J. Xu, G. Palani, Designing a true direct-access file system with DevFS, in: 16th USENIX Conference on File and Storage Technologies, FAST 18, USENIX Association, Oakland, CA, 2018, pp. 241–256.
[15] Y. Ren, C. Min, S. Kannan, CrossFS: A cross-layered direct-access file system, in: 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 20, USENIX Association, 2020, pp. 137–154.
[16] J. Zhang, Y. Ren, S. Kannan, FusionFS: fusing I/O operations using CISCOps in firmware file systems, in: 20th USENIX Conference on File and Storage Technologies, FAST 22, USENIX Association, Santa Clara, CA, 2022, pp. 297–312.
[17] N. Agrawal, V. Prabhakaran, T. Wobber, J.D. Davis, M. Manasse, R. Panigrahy, Design tradeoffs for SSD performance, in: USENIX 2008 Annual Technical Conference, in: ATC'08, USENIX Association, USA, 2008, pp. 57–70.
[18] F. Chen, D.A. Koufaty, X. Zhang, Understanding intrinsic characteristics and system implications of flash memory based solid state drives, in: Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems, in: SIGMETRICS '09, Association for Computing Machinery, New York, NY, USA, 2009, pp. 181–192, http://dx.doi.org/10.1145/1555349.1555371.
[19] Welcome to PCI-SIG | PCI-SIG, https://pcisig.com/.
[20] Y. Choi, I. Song, M.-H. Park, H. Chung, S. Chang, B. Cho, J. Kim, Y. Oh, D. Kwon, J. Sunwoo, J. Shin, Y. Rho, C. Lee, M.G. Kang, J. Lee, Y. Kwon, S. Kim, J. Kim, Y.-J. Lee, Q. Wang, S. Cha, S. Ahn, H. Horii, J. Lee, K. Kim, H. Joo, K. Lee, Y.-T. Lee, J. Yoo, G. Jeong, A 20nm 1.8V 8Gb PRAM with 40MB/s program bandwidth, in: 2012 IEEE International Solid-State Circuits Conference, 2012, pp. 46–48, http://dx.doi.org/10.1109/ISSCC.2012.6176872.
[21] H. Volos, A.J. Tack, M.M. Swift, Mnemosyne: Lightweight persistent memory, ACM SIGARCH Comput. Archit. News 39 (1) (2011) 91–104, http://dx.doi.org/10.1145/1961295.1950379.
[22] S.-W. Chung, T. Kishi, J.W. Park, M. Yoshikawa, K.S. Park, T. Nagase, K. Sunouchi, H. Kanaya, G.C. Kim, K. Noma, M.S. Lee, A. Yamamoto, K.M. Rho, K. Tsuchida, S.J. Chung, J.Y. Yi, H.S. Kim, Y. Chun, H. Oyamatsu, S.J. Hong, 4Gbit density STT-MRAM using perpendicular MTJ realized with compact cell structure, in: 2016 IEEE International Electron Devices Meeting, IEDM, 2016, pp. 27.1.1–27.1.4, http://dx.doi.org/10.1109/IEDM.2016.7838490.
[23] H. Akinaga, H. Shima, Resistive random access memory (ReRAM) based on metal oxides, Proc. IEEE 98 (12) (2010) 2237–2251, http://dx.doi.org/10.1109/JPROC.2010.2070830.
[24] K. Kawai, A. Kawahara, R. Yasuhara, S. Muraoka, Z. Wei, R. Azuma, K. Tanabe, K. Shimakawa, Highly-reliable TaOx ReRAM technology using automatic forming circuit, in: 2014 IEEE International Conference on IC Design & Technology, 2014, pp. 1–4, http://dx.doi.org/10.1109/ICICDT.2014.6838600.
[25] K. Suzuki, S. Swanson, The Non-Volatile Memory Technology Database (NVMDB), Tech. Rep. CS2015-1011, Department of Computer Science & Engineering, University of California, San Diego, 2015.
[26] S. Matsuura, Designing a persistent-memory-native storage engine for SQL database systems, in: 2021 IEEE 10th Non-Volatile Memory Systems and Applications Symposium, NVMSA, IEEE, Beijing, China, 2021, pp. 1–6, http://dx.doi.org/10.1109/NVMSA53655.2021.9628842.
[27] R. Tadakamadla, M. Patocka, T. Kani, S.J. Norton, Accelerating database workloads with DM-WriteCache and persistent memory, in: Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering, in: ICPE '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 255–263, http://dx.doi.org/10.1145/3297663.3309669.
[28] W. Wang, C. Yang, R. Zhang, S. Nie, X. Chen, D. Liu, Themis: malicious wear detection and defense for persistent memory file systems, in: 2020 IEEE 26th International Conference on Parallel and Distributed Systems, ICPADS, 2020, pp. 140–147, http://dx.doi.org/10.1109/ICPADS51040.2020.00028.
[29] B. Zhu, Y. Chen, Q. Wang, Y. Lu, J. Shu, Octopus+: An RDMA-enabled distributed persistent memory file system, ACM Trans. Storage 17 (3) (2021) 1–25, http://dx.doi.org/10.1145/3448418.
[30] J. Do, V.C. Ferreira, H. Bobarshad, M. Torabzadehkashi, S. Rezaei, A. Heydarigorji, D. Souza, B.F. Goldstein, L. Santiago, M.S. Kim, P.M.V. Lima, F.M.G. França, V. Alves, Cost-effective, energy-efficient, and scalable storage computing for large-scale AI applications, ACM Trans. Storage 16 (4) (2020) 21:1–21:37, http://dx.doi.org/10.1145/3415580.
[31] L. Kang, Y. Xue, W. Jia, X. Wang, J. Kim, C. Youn, M.J. Kang, H.J. Lim, B. Jacob, J. Huang, IceClave: A trusted execution environment for in-storage computing, in: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, in: MICRO '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 199–211, http://dx.doi.org/10.1145/3466752.3480109.
[32] Z. Ruan, T. He, J. Cong, INSIDER: designing in-storage computing system for emerging high-performance drive, in: 2019 USENIX Annual Technical Conference, USENIX ATC 19, USENIX Association, Renton, WA, 2019, pp. 379–394.
[33] A.M. Caulfield, T.I. Mollov, L.A. Eisner, A. De, J. Coburn, S. Swanson, Providing safe, user space access to fast, solid state disks, ACM SIGPLAN Not. 47 (4) (2012) 387–400.
[34] S. Cho, C. Park, H. Oh, S. Kim, Y. Yi, G.R. Ganger, Active disk meets flash: A case for intelligent SSDs, in: Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, 2013, pp. 91–102.
[35] J. Do, Y.-S. Kee, J.M. Patel, C. Park, K. Park, D.J. DeWitt, Query processing on smart SSDs: Opportunities and challenges, in: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013, pp. 1221–1230.
[36] C. Cowan, S. Beattie, J. Johansen, P. Wagle, PointGuard: Protecting pointers from buffer overflow vulnerabilities, in: 12th USENIX Security Symposium, USENIX Security 03, 2003.
[37] L. Szekeres, M. Payer, T. Wei, D. Song, SoK: Eternal war in memory, in: 2013 IEEE Symposium on Security and Privacy, IEEE, 2013, pp. 48–62.
[38] S.R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, J. Jackson, System software for persistent memory, in: Proceedings of the Ninth European Conference on Computer Systems - EuroSys '14, ACM Press, Amsterdam, The Netherlands, 2014, pp. 1–15, http://dx.doi.org/10.1145/2592798.2592814.
[39] C. Lee, D. Sim, J. Hwang, S. Cho, F2FS: A new file system for flash storage, in: 13th USENIX Conference on File and Storage Technologies, FAST 15, USENIX Association, Santa Clara, CA, 2015, pp. 273–286.
[40] DAX, https://www.kernel.org/doc/Documentation/filesystems/dax.txt.
[41] J. Xu, S. Swanson, NOVA: A log-structured file system for hybrid volatile/non-volatile main memories, in: Proceedings of the 14th USENIX Conference on File and Storage Technologies, in: FAST'16, USENIX Association, USA, 2016, pp. 323–338.
[42] M. Torabzadehkashi, S. Rezaei, A. HeydariGorji, H. Bobarshad, V. Alves, N. Bagherzadeh, Computational storage: An efficient and scalable platform for big data and HPC applications, J. Big Data 6 (1) (2019) 100, http://dx.doi.org/10.1186/s40537-019-0265-5.
[43] W. Cao, Y. Liu, Z. Cheng, N. Zheng, W. Li, W. Wu, L. Ouyang, P. Wang, Y. Wang, R. Kuan, Z. Liu, F. Zhu, T. Zhang, POLARDB meets computational storage: efficiently support analytical workloads in cloud-native relational database, in: Proceedings of the 18th USENIX Conference on File and Storage Technologies, in: FAST'20, USENIX Association, USA, 2020, pp. 29–42.
[44] Nvidia, NVIDIA RTX IO: GPU accelerated storage technology, https://www.nvidia.com/en-us/geforce/news/rtx-io-gpu-accelerated-storage-technology/.
[45] AMD, Radeon™ Pro SSG graphics, https://www.amd.com/en/products/professional-graphics/radeon-pro-ssg.
[46] Z. An, Z. Zhang, Q. Li, J. Xing, H. Du, Z. Wang, Z. Huo, J. Ma, Optimizing the datapath for key-value middleware with NVMe SSDs over RDMA interconnects, in: 2017 IEEE International Conference on Cluster Computing, CLUSTER, 2017, pp. 582–586, http://dx.doi.org/10.1109/CLUSTER.2017.69.
[47] Samsung, Samsung 990 PRO with heatsink, https://semiconductor.samsung.com/content/semiconductor/global/consumer-storage/internal-ssd/990-pro-with-heatsink.html.
[48] ARM Ltd., ARM computational storage solution, https://www.arm.com/solutions/storage/computational-storage.
[49] Samsung, Samsung SmartSSD, https://www.xilinx.com/applications/data-center/computational-storage/smartssd.html.
[50] ScaleFlux, https://scaleflux.com/.
[51] E. Gal, S. Toledo, A transactional flash file system for microcontrollers, in: 2005 USENIX Annual Technical Conference, USENIX ATC 05, 2005.
[52] J. Koo, J. Im, J. Song, J. Park, E. Lee, B.S. Kim, S. Lee, Modernizing file system through in-storage indexing, in: Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation, OSDI '21, USENIX Association, Berkeley, 2021, pp. 75–92, http://dx.doi.org/10.5281/zenodo.4659803.
[53] LevelDB, https://github.com/google/leveldb.
[54] Storage performance development kit, https://spdk.io/.
[55] DFC open source, https://github.com/DFC-OpenSource.
[56] M. Jung, OpenExpress: fully hardware automated open research framework for future fast NVMe devices, in: 2020 USENIX Annual Technical Conference, USENIX ATC 20, 2020, pp. 649–656.
[57] J. Kwak, S. Lee, K. Park, J. Jeong, Y.H. Song, Cosmos+ OpenSSD: rapid prototype for flash storage systems, ACM Trans. Storage 16 (3) (2020) 15:1–15:35, http://dx.doi.org/10.1145/3385073.
[58] Filebench, https://github.com/filebench/filebench.

Juncheng Hu received the bachelor's degree and the Doctor of Engineering degree from Jilin University in 2017 and 2022, respectively, where he is currently a lecturer. His research interests include data mining, machine learning, computer networks, and parallel computing. jchu@jlu.edu.cn

Shuo Chen has been working toward the master's degree with the College of Computer Science and Technology, Jilin University, since 2022. His research field is computer architecture, mainly focusing on optimization for caching systems. chenshuo22@mails.jlu.edu.cn

Haoyang Wei, a Master's student (enrolled in 2023) in Computer Science and Technology at Jilin University, focuses on computer architecture research, with a primary interest in the application of new storage devices. hywei23@mails.jlu.edu.cn

Guoyu Wang is currently working toward the doctoral degree with the College of Computer Science and Technology, Jilin University. wgy21@mails.jlu.edu.cn

Chenju Pei is an undergraduate student at the College of Computer Science and Technology, Jilin University. His field of research is computer system architecture, and he is currently investigating new L7 load balancing solutions. peicj2121@mails.jlu.edu.cn

Xilong Che received the M.S. and Ph.D. degrees in Computer Science from Jilin University in 2006 and 2009, respectively. Currently, he is a full professor and doctoral supervisor at the College of Computer Science and Technology, Jilin University, China. His current research areas are parallel and distributed computing, high performance computing architectures, and related optimizations. He is a member of the China Computer Federation. He is the corresponding author of this paper. chexilong@jlu.edu.cn