Journal of Systems Architecture 160 (2025) 103339
Contents lists available at ScienceDirect
Journal of Systems Architecture
journal homepage: www.elsevier.com/locate/sysarc
REC: Enhancing fine-grained cache coherence protocol in multi-GPU systems

Gun Ko, Jiwon Lee, Hongju Kal, Hyunwuk Lee, Won Woo Ro ∗

Yonsei University, 50 Yonsei-ro Seodaemun-gu, Seoul, 03722, Republic of Korea
ARTICLE INFO

Keywords:
Multi-GPU
Data sharing
Cache coherence
Cache architecture

ABSTRACT

With the increasing demands of modern workloads, multi-GPU systems have emerged as a scalable solution, extending performance beyond the capabilities of single GPUs. However, these systems face significant challenges in managing memory across multiple GPUs, particularly due to the Non-Uniform Memory Access (NUMA) effect, which introduces latency penalties when accessing remote memory. To mitigate NUMA overheads, GPUs typically cache remote memory accesses across multiple levels of the cache hierarchy, which are kept coherent using cache coherence protocols. The traditional GPU bulk-synchronous programming (BSP) model relies on coarse-grained invalidations and cache flushes at kernel boundaries, which are insufficient for the fine-grained communication patterns required by emerging applications. In multi-GPU systems, where NUMA is a major bottleneck, the substantial data movement resulting from bulk cache invalidations exacerbates performance overheads. Recent cache coherence protocols for multi-GPUs enable flexible data sharing through coherence directories that track shared data at a fine-grained level across GPUs. However, these directories are limited in capacity, leading to frequent evictions and unnecessary invalidations, which increase cache misses and degrade performance. To address these challenges, we propose REC, a low-cost architectural solution that enhances the effective tracking capacity of coherence directories by leveraging memory access locality. REC coalesces multiple tag addresses from remote read requests within common address ranges, reducing directory storage overhead while maintaining fine-grained coherence for writes. Our evaluation on a 4-GPU system shows that REC reduces L2 cache misses by 53.5% and improves overall system performance by 32.7% across a variety of GPU workloads.
1. Introduction

Multi-GPU systems have emerged to meet the growing demands of modern workloads, offering scalable performance beyond what a single GPU can deliver. However, as multi-GPU architectures scale in size and complexity [1,2], managing memory across multiple GPUs becomes increasingly challenging [3–7]. One of the primary challenges arises from the bandwidth discrepancy between local and remote memory, commonly known as the Non-Uniform Memory Access (NUMA) effect [3,4]. To mitigate the NUMA penalty, GPUs generally rely on caching remote memory accesses, allowing them to be served with local bandwidth [5,8–10]. This caching strategy is often extended across multiple levels of the cache hierarchy, including both private on-chip caches and shared caches [3,4,11,12], to better accommodate the diverse access patterns of emerging workloads.

While remote data caching offers significant performance benefits in multi-GPU systems, it also requires extending coherence throughout the cache hierarchy. Conventional GPUs rely on a simple software-inserted bulk-synchronous programming (BSP) model [11], which performs cache invalidation and flush operations at the start and end of each kernel. However, as recent GPU applications increasingly require more frequent and fine-grained communication both within and across kernels [11,13–15], these frequent synchronizations can lead to substantial cache operation and data movement overheads. Additionally, precisely managing the synchronizations places additional burdens on programmers, complicating the optimization of multi-GPU systems.

Ren et al. [11] proposed HMG, a hierarchical cache coherence protocol designed for L2 caches in large-scale multi-GPU systems. HMG employs coherence directories to record cache line addresses and their associated sharers upon receiving remote read requests. Any writes to these addresses trigger invalidations. Once capacity is reached, existing entries are evicted from the directory, triggering invalidation requests to the sharer GPUs. These invalidations are unnecessary, as the corresponding cache lines do not immediately require coherence to be maintained. When GPUs access data across a wide range of addresses, the large number of directory insertions leads to many unnecessary invalidations for cache lines that have not yet been fully utilized. Subsequent accesses to these cache lines result in cache misses, requiring data to be fetched again over bandwidth-limited inter-GPU links.
∗ Corresponding author.
E-mail address: wro@yonsei.ac.kr (W.W. Ro).
https://doi.org/10.1016/j.sysarc.2025.103339
Received 10 September 2024; Received in revised form 27 December 2024; Accepted 5 January 2025
Available online 9 January 2025
1383-7621/© 2025 Published by Elsevier B.V.
Fig. 1. Performance of each caching scheme normalized to a system that enables remote data caching in both L1 and L2 caches using software and hardware coherence protocols, respectively. "No caching" refers to a system that disables remote data caching, simplifying coherence.

Fig. 2. Baseline multi-GPU system. Each GPU has a coherence directory that records and tracks the status of shared data at given addresses along with the corresponding sharer IDs.

To evaluate the implications of the coherence protocol, we measure the performance impact of unnecessary invalidations on a 4-GPU system that caches remote data in both L1 and L2 caches. L1 caches are assumed to be software-managed, while L2 caches are managed under fine-grained invalidation through coherence directories. As Fig. 1 shows, there exists a significant performance opportunity in eliminating unnecessary invalidations caused by frequent directory evictions. Increasing the size of the coherence directory can delay evictions and the corresponding invalidation requests, but at the cost of increased hardware. Our observations indicate that to eliminate unnecessary invalidations, the size of the coherence directory would need to be substantially increased, accounting for 30.4% of the L2 cache size. As the size of GPU L2 caches continues to grow [16,17], the aggregate storage overhead of coherence directories becomes substantial, causing inefficiency in scaling for multi-GPU environments (discussed in Section 3.3).

In this paper, we propose Range-based Directory Entry Coalescing (REC), an architectural solution that mitigates unnecessary invalidation overhead by increasing the effective tracking capacity of the coherence directory without incurring significant hardware costs. Our key insight is that since directory updates are performed upon receiving remote read requests, leveraging memory access locality provides an opportunity to coalesce multiple tag addresses of shared data based on their common address range. To achieve this, we employ a coherence directory design that aggregates data from incoming remote reads sharing a common base address within the same address range, storing only the offset and the sharer IDs. We reduce the storage requirements of directory entries by designing them in a base-and-offset format, recording the common high-order bits of addresses and using a bit-vector to indicate the index of each coalesced entry within the target range. For incoming writes, if they are found in the coherence directory, invalidations are propagated only to the corresponding address, maintaining fine-grained coherence in multi-GPU systems.

To summarize, this paper makes the following contributions:

• We identify a performance bottleneck of fine-grained shared data tracking mechanisms in multi-GPU systems. Our analysis demonstrates that such methods generate unnecessary invalidations at coherence directory evictions, incurring a significant performance penalty due to increased cache miss rates.

• We show that simply employing larger coherence directories incurs significant storage overhead. Our analysis shows that the baseline multi-GPU system requires a 12× increase in the directories to eliminate redundant invalidations.

• We propose REC, which increases the effective coverage of the coherence directory by enabling each entry to coalesce and track multiple memory addresses along with the associated sharers. By reducing L2 cache misses by 53.5%, REC improves overall performance by 32.7% on average across our evaluated GPU workloads.

2. Background

2.1. Multi-GPU architecture

The slowdown of transistor scaling has made it increasingly difficult for single GPUs to meet the growing demands of modern workloads. Alternatively, multi-GPU systems have emerged as a viable path forward, offering enhanced performance and memory capacity by leveraging multiple GPUs connected using high-bandwidth interconnects such as PCIe and NVLink [18]. However, these inter-GPU links are likely to have bandwidth that falls far behind the local memory bandwidth [3,4,8]. The NUMA effect that arises from this large bandwidth gap can significantly impact multi-GPU performance, making it crucial to optimize remote access bottlenecks to maximize efficiency.

Fig. 2 illustrates the architectural details of our target multi-GPU system. Each GPU is divided into several SAs, with each comprising a number of CUs. Every CU has its own private L1 vector cache (L1V$), while the L1 scalar cache (L1S$) and L1 instruction cache (L1I$) are shared across all CUs within an SA. Additionally, each GPU contains a larger L2 cache that is shared across all SAs. When a data access misses in the local cache hierarchy, it is forwarded to either local or remote GPU memory, depending on the data location. For local memory accesses, the cache lines are stored in both the shared L2 cache and the L1 cache private to the requesting CU. In the case of remote-GPU memory accesses, the data can be cached either only in the L1 cache of the requesting CU [4,5,8] or in both the L2 and L1 caches [3,11,12]. Caching data from remote memory nodes helps mitigate the performance degradation caused by accessing those nodes.

2.2. Remote data caching in multi-GPU

While caching remote data only in the L1 cache can save L2 cache capacity, it limits the sharing of remote data among CUs. As a result, such an approach provides a lower performance gain when unnecessary invalidation overhead is eliminated in its counterpart, as shown in Fig. 1. For this reason, in this study, we assume the baseline multi-GPU architecture allows caching of remote data in both L1 and L2 caches.

A step-by-step process of remote data caching is shown in Fig. 2. Upon generating a memory request, an L1 cache lookup is performed by the requesting CU (1). When data is not present in the L1, an L2 cache lookup is generated to check if the remote data is cached in the L2 (2). If the data is found in the L2 cache, it is returned to the requesting CU and cached in its local L1 cache. If the data is not found in the L2 cache, the request is forwarded to the remote GPU memory at the given physical address. Subsequently, the requested data is returned at a cache line granularity and cached in both the L1 and L2 caches (3). At the same time, the coherence directory, which maintains information about data locations across multiple GPUs, is
Fig. 3. Coherence protocol flows in detail. The baseline hardware protocol has two stable states: valid and invalid, with no transient states or acknowledgments required for write permissions.

Fig. 4. L2 cache miss rates in the baseline and an idealized system where no invalidations are propagated by coherence directory evictions. Cold misses are excluded from the results.

updated with the corresponding entry and the sharer GPU (4). Writes to remote data in the home GPU are also performed in the local L2 cache, following the write-through policy, as the corresponding GPU may access the written data in the future. Remote writes arriving at the home GPU trigger invalidation messages to be sent out to the sharer GPU(s), and the requesting GPU is recorded as a sharer (4).

2.3. Cache coherence in multi-GPU

Existing hardware protocols, such as GPU-VI [19], employ coherence directories to track sharers (i.e., L1s) and propagate write-initiated cache invalidations within a single GPU. Bringing the notion into multi-GPU environments, Ren et al. proposed HMG [11], a hierarchical design that efficiently manages both intra- and inter-GPU coherence. HMG includes two layers for selecting home nodes to track sharers: (1) the GPU-module (GPM) level, which selects a home GPM within a GPU, and (2) the inter-GPU level, which selects a home GPU across the entire system. A GPM is a chiplet in multi-chip-module GPUs. With this, HMG reduces the complexity of tracking and maintaining coherence across a large number of sharers. HMG also optimizes performance by eliminating all transient states and most invalidation acknowledgments, leveraging the weak memory models of modern GPUs [11].

Each GPU has a coherence directory attached to its L2 cache, managed by the cache controllers. The directory is organized in a set-associative structure, and each entry contains the following fields: tag, sharer IDs, and coherence state. The tag field stores the cache line address for the data copied and fetched by the sharer. The sharer ID field is a bit-vector representing the list of sharers, excluding the home GPU. Each entry is in one of two stable states: valid or invalid. Unlike HMG [11], the baseline coherence directory tracks one cache line per entry. In contrast, a directory entry in HMG is designed to track four cache lines using a single tag address and sharer ID field, which limits its ability to manage each cache line at a fine granularity. Consequently, a write to any address tracked by a directory entry may unnecessarily invalidate other cache lines within the same range, potentially causing inefficiencies in remote data caching. We discuss the importance of reducing unnecessary cache line invalidations in detail in Section 3.1. As with typical memory allocation in multi-GPU systems, the physical address space is partitioned among the GPUs in the system. Therefore, data at any given physical address is designated to one GPU (i.e., the home GPU), and every access by a remote GPU references the coherence directory of the home GPU. For example, in Fig. 2, GPU0 requests data at address 0xA from GPU1, which is the home GPU; the corresponding entry is then inserted into the directory of GPU1 with the relevant information.

Fig. 3 shows the detailed state transitions and actions initiated by the coherence directory. Note that local and remote refer to the sources of the memory requests received: local refers to accesses from the local CUs, and remote refers to accesses from the remote GPUs.

Local reads: Local read requests arriving at the L2 cache are directed to either locally- or remotely-mapped data. On cache hits, the data is returned and guaranteed to be consistent because it is either the most up-to-date data (if mapped to local DRAM) or correctly managed by the protocol (if mapped to a remote GPU). On cache misses, the requests are forwarded to either local DRAM or a remote GPU. In all cases, the directory of the requesting GPU remains unchanged.

Remote reads: For remote reads that arrive at the home GPU, the coherence directory records the ID of the requesting GPU at the given cache line address. If the line is already being tracked (i.e., the entry is found and valid), the directory simply adds the requester to the sharer field and keeps the entry in the valid state. If the line is not being tracked, the directory finds an empty spot to allocate a new entry and marks it as valid. When the directory is full and every entry is valid, it evicts an existing entry and replaces it with the new entry (discussed below).

Local writes: Local writes to data mapped to the home GPU memory look up the directory to find whether a matching entry at the line address exists. If found, invalidations are propagated to the recorded sharers in the background, and the directory entry becomes invalid.

Remote writes: By default, L2 caches use a write-back policy for local writes. As described in Section 2.2, remote writes update both the L2 cache of the requester and local memory, similar to a write-through policy. Consequently, the directory maintains the entry as valid by adding the requester to the sharer list and sends out invalidations to the other sharers recorded in the original entry.

Directory entry eviction/replacement: Coherence directories are implemented in a set-associative structure. Thus, capacity and conflict misses occur as directory lookups are initiated by the read requests continuously received from remote GPUs. To notify the sharers that the information in the evicted entry is no longer traceable, invalidations are sent out as with writes.

Acquire and release: At the start of a kernel, invalidations are performed in L1 caches, as coherence is maintained using software bulk synchronizations. However, the invalidations are not propagated beyond L1 caches, as L2 caches are kept coherent with the fine-grained directory protocol. Release operations flush dirty data in both L1 and L2 caches.

3. Motivation

In multi-GPU systems, coherence is managed explicitly through cache invalidations to ensure data consistency across multiple GPUs. When invalidation requests are received, sharer GPUs must look up and invalidate the corresponding cache lines. Subsequent accesses to these invalidated cache lines result in cache misses, which are then forwarded to the home GPU. This, in turn, can negate the performance benefits of local caching, as it undermines the effectiveness of caching mechanisms intended to reduce remote access bottlenecks. In this section, we analyze the behavior of cache invalidation and its impact on the overall
Fig. 5. Fraction of evict-initiated and write-initiated invalidations in the baseline multi-GPU system. The results are based on invalidation requests that hit in the sharer-side L2 caches.

Fig. 6. Performance impact of increasing coherence directory sizes. To eliminate unnecessary invalidations, GPUs require a directory size up to 12× larger than the baseline.

performance of multi-GPU systems. We identify the sources of invalidation and explore a straightforward solution to mitigate the associated bottlenecks. Our experiments are conducted using MGPUSim [20], a multi-GPU simulation framework that we have extended to support the hardware cache coherence protocol. The detailed configuration is provided in Table 2.

3.1. Impact of cache invalidation

To ensure data consistency across multiple GPUs, invalidation requests are propagated by the home GPU in two cases: (1) when write requests are received and (2) when an entry is evicted from the coherence directory due to capacity and conflict misses. Invalidation requests triggered by writes are crucial for maintaining data consistency, as they ensure that no stale data is accessed in the sharer GPU caches. On the other hand, invalidations generated by directory eviction aim to notify the sharers that the coherence information is no longer traceable, even if the data is still valid. A detailed background on the protocol flows with invalidations is given in Section 2.3.

Broadcasting invalidations does not significantly impact cache efficiency if the cache lines are already evicted or no longer in use. However, when applications exhibit frequent remote memory accesses, the generation of new directory entries increases invalidation requests from eviction, invalidating the associated cache lines prematurely. These premature invalidations lead to higher cache miss rates, as subsequent accesses to the invalidated cache lines result in misses. As remote data misses exacerbate NUMA overheads, they need to be reduced to improve multi-GPU performance.

Fig. 4 shows the impact on cache miss rate when eliminating unnecessary invalidations across the benchmarks listed in Table 3 running on a 4-GPU system. The figure demonstrates that the baseline system experiences a cache miss rate more than double (average 2.4×) that of the idealized system without the unnecessary invalidations. This increase is mainly due to frequent invalidation requests, which prematurely invalidate cache lines before they can be fully utilized, leading to an increase in the number of remote memory accesses. The result strongly motivates us to further study the source of these frequent invalidations to improve the efficiency of remote data caching in multi-GPU systems.

To demonstrate the performance opportunity, Fig. 1 presents a study showing the performance of idealized caching without the invalidation overhead. With no invalidations to unmodified cache lines, remote data can be fully utilized as needed until it is naturally replaced by the typical cache replacement policy. The performance of the baseline and ideal system is represented in the first and fourth bars, respectively, in Fig. 1. The result shows that an ideal system with no unnecessary cache invalidation overheads outperforms the baseline by up to 2.79× (average 36.9%). As demonstrated by Figs. 1 and 4, reducing premature cache invalidations is crucial in improving the efficiency of remote data caching in multi-GPU systems.

3.2. Source of premature invalidation

As described in Section 2.3, when a coherence directory becomes full, the GPU needs to evict an old entry and replace it with a new one upon receiving a remote read request; an invalidation request must be sent out to the sharer(s) in the evicted entry. Fig. 5 shows the distribution of invalidations triggered by directory eviction and write requests, referred to as evict-initiated and write-initiated invalidations, respectively. The measurements are taken based on the invalidations that hit in the sharer-side L2 caches after the requests are received. We observe that a significant fraction of invalidations (79.5% on average) are performed by requests from directory evictions in the home GPUs. These invalidations, considered unnecessary as they do not require immediate action, should be delayed until remote GPUs have made full use of the data.

We also show the percentage of write-initiated invalidations in Fig. 5. One can observe that applications such as FIR, LU, and MM2 experience a significant number of invalidations due to write requests. These workloads exhibit fine-grained communication within and across dependent kernels, necessitating the invalidation of corresponding cache lines in the remote L2 cache upon any modification to the shared data. Although these applications exhibit a high percentage of write-initiated invalidations, their impact on cache miss rates may be negligible if the GPUs do not subsequently require access to the invalidated cache lines. Nonetheless, the results from Fig. 4 clearly demonstrate the importance of minimizing unnecessary cache invalidations.

So far, we have discussed how prematurely invalidating remote data leads to increased cache miss rates, which negatively impacts multi-GPU performance. We also show that a large fraction of invalidation requests stems from directory evictions, which frequently occur due to the high volume of remote accesses. These accesses trigger numerous directory updates, overwhelming the baseline coherence directory's capacity to effectively manage coherence. A straightforward solution to mitigate premature invalidations is to increase the size of the coherence directory, providing more coverage to track sharers and reducing eviction rates. In the following section, we analyze the performance impact of larger coherence directory sizes. It is important to note that this paper primarily focuses on delaying invalidations caused by directory evictions, as write-initiated invalidations are necessary and must be performed immediately for correctness.

3.3. Increasing directory sizes

A simple approach to delay directory evictions, thereby minimizing premature invalidations, is to increase the size of coherence directories. Limited directory sizes lead to significant evict-initiated invalidations, which can undermine the performance benefits of local caching. To quantify the benefits of larger directories, we conduct a quantitative analysis of performance improvements with increasing directory sizes. In our simulated 4-GPU system, each GPU has an L2 cache size of 2 MB, with each cache line being 64B. Each coherence directory tracks
Fig. 7. Average performance improvement per increased directory storage in the baseline coherence directory design. The results are normalized to the system with an 8K-entry coherence directory.

the identity of all sharers excluding the home GPU (i.e., three GPUs). To cover the entire L2 cache space for three GPUs, an ideal coherence directory would require approximately 96K entries, or about 12× the baseline 8K entries.
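The required capacity above follows directly from the stated parameters; a quick arithmetic check (illustrative only, restating figures given in the text):

```python
# Figures from the text: 2 MB L2 per GPU, 64 B cache lines, and three
# remote sharers tracked by each home directory in a 4-GPU system.
L2_BYTES = 2 * 1024 * 1024
LINE_BYTES = 64
REMOTE_GPUS = 3
BASELINE_ENTRIES = 8 * 1024  # 8K-entry baseline directory

lines_per_l2 = L2_BYTES // LINE_BYTES        # 32K cache lines per L2
ideal_entries = lines_per_l2 * REMOTE_GPUS   # entries to cover all three sharers
print(ideal_entries)                         # 98304, i.e. 96K entries
print(ideal_entries // BASELINE_ENTRIES)     # 12, i.e. 12x the baseline
```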
Fig. 6 illustrates the normalized performance when increasing the directory sizes by 2×-12× the baseline. With an ideal directory size, unnecessary invalidations from directory evictions can be eliminated, leaving only write-initiated invalidations. The results show that applications exhibit significant performance gains as the directory size increases, with some benchmarks (e.g., ATAX, PR, and ST) requiring 8×-12× the baseline size to achieve the highest speed-up. Specifically, benchmarks such as PR and ST show irregular memory access patterns that span a wide address range, leading to higher chances of conflict misses when updating coherence directories. Most other tested benchmarks require up to six times the baseline directory size to achieve maximum attainable performance; the average speedup with six times the size is 1.35×.

Each entry in the coherence directory comprises a tag, sharer list, and coherence state. We assume 48 bits for tag addresses, a 3-bit vector for tracking sharers, and one bit for the directory entry state; thus, each entry requires a total of 52 bits of storage. Our baseline directory implementation has 8K entries and occupies approximately 2.5% of the L2 cache [11]. Therefore, the storage cost of the baseline directory in each GPU is 52 × 8192/8/1024 = 52 kB, assuming 8 bits per byte and 1024 bytes per kilobyte. From our observation in Fig. 6, applications require directory sizes from 6× up to 12× the baseline to achieve maximum performance. This corresponds to a total storage cost of 312–624 kB, which is an additional 15.2–30.4% of the L2 cache size. While increasing directory size can significantly improve performance, the associated hardware costs are substantial. To show the inefficiency of simply scaling directory sizes, we calculate the performance per storage using the results in Fig. 6 and the number of directory entries. Fig. 7 illustrates the results relative to the baseline with 8K entries, showing that performance improvements per increased storage do not scale proportionally with larger coherence directories. Additionally, since GPU applications require different directory sizes to achieve maximum performance, simply increasing the directory size is not an efficient solution. Moreover, as GPU L2 caches continue to grow [16,17], the cost of maintaining proportionally larger coherence directories will only amplify these overheads. Therefore, improving coherence directory coverage without significant storage overhead motivates the need for more efficient fine-grained hardware protocols in multi-GPU systems.

4. REC architecture

This work aims to enhance coherence directory coverage while avoiding significant hardware overhead, overall reducing unnecessary cache invalidations in multi-GPU systems. We introduce REC, an architecture that coalesces directory entries by leveraging the spatial locality in memory accesses observed in GPU workloads. In this section, we provide an overview of the REC design and discuss its integration with existing multi-GPU coherence protocols.

Fig. 8. A high-level overview of (a) the baseline and (b) the proposed REC architecture with simplified 2-entry coherence directories. The figure illustrates a scenario where GPU1 accesses the memory of GPU0 in the order of 0x1000, 0x1040, 0x1080, and 0x1000 by each CU. In the baseline directory, the entry that tracks the status of data at 0x1000 is evicted to record the address 0x1080. The proposed directory coalesces the three addresses with the same base address into one entry.

4.1. Hardware overview

As shown in Section 3.2, a significant fraction of cache invalidations are generated by frequent directory evictions. These invalidations lead to increased cache misses, as data is prematurely invalidated from the cache, requiring subsequent accesses to fetch the data from remote memory. While simply increasing the directory size can address this bottleneck, the associated hardware cost can become substantial. To address this, we propose REC, an architectural solution that compresses remote GPU access information, retaining as much data as possible before eviction occurs. It aggregates data from incoming remote read requests so that (1) multiple reads to the same address range share a common base address, storing only the offset and source GPU information, and (2) the coalescing process does not result in any loss of information, maintaining the accuracy of the coherence protocol. We now discuss the design overview of REC and the details of the associated hardware components.

Fig. 8(a) shows how the baseline GPU handles a sequence of incoming read requests. The cache controller records the tag addresses and the corresponding sharer IDs in the order that the requests arrive. When the coherence directory reaches its capacity, the cache controller follows a typical FIFO policy to replace the oldest entry with a new one within the set. Once an entry is evicted, the information it held can no longer be tracked, triggering an invalidation request to be sent to the GPU listed in the entry. Upon receiving this request, the sharer GPU checks its L2 cache and invalidates the corresponding cache line, leading to a cache miss on any subsequent access to the cache line.

To delay invalidations caused by directory evictions without significant hardware overhead, we introduce the REC architecture, which enhances the baseline coherence directory by leveraging spatial locality to merge multiple addresses into a single entry. As illustrated in Fig. 8(b), REC stores tag addresses with common high-order bits as a single entry using a base-plus-offset format. When a new read request matches the base address in an existing entry, the offset and sharer information are appended to that entry, reducing the need for additional entries and delaying evictions. The base address represents the shared high-order bits, covering a range of addresses and reducing the storage required compared to storing full tag addresses individually. Additionally, REC uses position bits to efficiently track multiple addresses within the specified range, further minimizing storage overhead.
|
||
|
||
5
|
||
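The base-plus-offset coalescing described above can be illustrated with a minimal Python sketch. It assumes 64B cache lines, a 1 kB coalescing range, and a 4-GPU system as in the paper; the class and function names are our own, not part of the hardware design.

```python
LINE = 64          # cache line size (bytes)
RANGE = 1024       # coalescing range (bytes) -> 16 lines per entry
SHARERS = 3        # remote GPUs tracked per line in a 4-GPU system

def split(addr):
    """Split a 64B-aligned address into (base, line index within the range)."""
    return addr >> 10, (addr % RANGE) // LINE

class RecDirectory:
    """One directory set; entries map a base address to a position/sharer bit vector."""
    def __init__(self, ways=8):
        self.ways = ways
        self.entries = {}                      # base -> 64-bit vector

    def remote_read(self, addr, gpu):
        base, line = split(addr)
        p = line * (SHARERS + 1)               # position bit for this line
        bits = (1 << p) | (1 << (p + 1 + gpu)) # position bit + sharer bit
        if base in self.entries:               # coalesce into the existing entry
            self.entries[base] |= bits
        else:
            if len(self.entries) == self.ways: # set full: evict the oldest (FIFO)
                self.entries.pop(next(iter(self.entries)))
            self.entries[base] = bits          # insert a new entry

d = RecDirectory()
for a in (0x8000, 0x8040, 0x8080):  # three consecutive lines in one 1 kB range
    d.remote_read(a, gpu=0)         # gpu = index among the 3 remote sharers
assert len(d.entries) == 1          # one coalesced entry instead of three
```

With a baseline directory, the three reads above would consume three entries; here they share one, which is the mechanism that delays evictions.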
G. Ko et al. Journal of Systems Architecture 160 (2025) 103339
Table 1
Trade-offs between addressable range and storage for each entry. Note that one valid bit, not shown in the table, is included in the overall calculation.

                       Addressable range
                       64B    128B   256B   1 kB    4 kB
Base address bits      48     41     40     38      36
Position/Sharer bits   −/3    2/6    4/12   16/48   64/192
Total bits per entry   52     50     57     103     293

Table 2
Baseline GPU configuration.

Parameter            Configuration
Number of SAs        16
Number of CUs        4 per SA
L1 vector cache      1 per CU, 16 kB 4-way
L1 inst cache        1 per SA, 32 kB 4-way
L1 scalar cache      1 per SA, 16 kB 4-way
L2 cache             2 MB 16-way, 16 banks, write-back
Cache line size      64B
Coherence directory  8K entries, 8-way
DRAM capacity        4 GB HBM, 16 banks
DRAM bandwidth       1 TB/s [11]
Inter-GPU bandwidth  300 GB/s, bi-directional

Table 3
Tested workloads.

Benchmark                                        Abbr.  Memory footprint
Matrix transpose and vector multiplication [21]  ATAX   128 MB
2-D convolution [21]                             C2D    512 MB
Finite impulse response [22]                     FIR    128 MB
Matrix-multiply [21]                             GEMM   128 MB
Vector multiplication and matrix addition [21]   GEMV   256 MB
2-D Jacobi solver [21]                           J2D    128 MB
LU decomposition [21]                            LU     128 MB
2 matrix multiplications [21]                    MM2    128 MB
3 matrix multiplications [21]                    MM3    64 MB
PageRank [22]                                    PR     256 MB
Simple convolution [23]                          SC     512 MB
Stencil 2D [24]                                  ST     128 MB

Fig. 9. Coherence directory entry structure for 64B cache lines. In our design, each entry stores up to 16 coalesced entries based on 1 kB range.

Fig. 10. Overview of the REC protocol flows. In the example coherence directory, entry insertion and offset addition operations are highlighted in blue, while eviction and offset deletion operations are shown in red.

Determining the address range within which REC coalesces entries is one of the key design considerations, as it directly impacts the number of bits required for each entry. Table 1 shows a list of design choices for implementing REC with varying addressable ranges and their potential trade-offs. The number of required base address bits is calculated using 2^n = addressable_range, where n is the number of bits right-shifted from the original tag address. Also, the number of required position bits is determined by the maximum number of coalesceable cache line addresses within the target range, assuming a 64B line size. Then, the number of sharer bits required is (g−1) × num_position_bits, where g is the number of GPUs. For example, if REC is designed to coalesce with an addressable range of 256B, each entry would require 40, 4, and 12 bits for the base address, position, and sharer fields, respectively. Lastly, one valid bit is added to each entry. In Table 1, we show the total bits required per entry under the addressable ranges from 128B to 4 kB for comparing the storage costs. REC designs with larger addressable ranges can benefit from increased directory coverage but at the cost of storage. In the evaluation of this paper, we tested various addressable ranges for REC. Each design is configured to coalesce the maximum number of offsets within its specified range. Later in the results, we confirm that a 1 kB coalesceable range offers the best trade-off, balancing reasonable size overhead per entry with the ability to coalesce a significant number of entries before evictions occur (discussed in Section 5.2).

Based on these findings, the format of a directory entry is as illustrated in Fig. 9. Each entry comprises a base address, coalesced entries, and a valid bit. When the first remote read request arrives at the home GPU, the cache controller sets the base address by right-shifting the tag address by the number of bits needed to represent the offset within the specified range. For a 48-bit tag, the address is right-shifted by 10 bits (considering a 64B-aligned 1 kB range), and the resulting bits from positions 64 to 101 are used to store the base address. The coalesced entry is identified using the offset within the 1 kB range, represented by a position bit, followed by three bits for recording the sharers. The position bit is calculated as:

𝑝 = ((Tag mod 𝑚) / 64) × (𝑛 + 1)

where 𝑚 denotes the coalescing range, and 𝑛 is the number of sharers, which are set to 1 kB and 3, respectively. Once the position is determined, the corresponding position and sharer bits are set to 1 using a bitwise OR operation. Given that the 1 kB range allows each entry to record up to 16 individual tag addresses, we use the lower 64 bits to store the coalesced entries. Furthermore, the position bit can also function as the valid bit for each coalesced entry, meaning only one valid bit is necessary to indicate whether the entire entry is valid or not.

4.2. REC protocol flows

The baseline coherence protocol operates with two stable states, valid and invalid, allowing it to remain lightweight and efficient. In our proposed coherence directory design, each entry represents the validity of an entire address range instead of tracking individual tag addresses and associated sharers. This enables the state transitions to be managed at a coarser granularity during directory evictions. Additionally, REC supports fine-grained control over write requests by tracking specific offsets within these address ranges, avoiding the need to evict entire entries. Fig. 10 highlights the architecture of REC and how it handles received requests differently from the baseline. REC does not require additional coherence states but instead modifies the transitions triggered under specific conditions.
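The per-entry bit counts in Table 1 follow directly from these rules. A short Python sketch (with function and parameter names of our own choosing) reproduces the totals, and also the larger-system entry sizes discussed later in Section 4.3:

```python
# Per-entry storage for a REC directory entry: base address bits, one
# position bit per coalesceable cache line, (g - 1) sharer bits per line,
# and one valid bit. Defaults match the paper's 4-GPU, 64B-line setup.
def entry_bits(addr_range, gpus=4, line=64, tag_bits=48):
    shift = addr_range.bit_length() - 1   # n, with 2^n = addressable range
    base = tag_bits - shift               # base address bits
    positions = addr_range // line        # one position bit per cache line
    sharers = (gpus - 1) * positions      # sharer bits
    return base + positions + sharers + 1 # plus one valid bit

# Matches the "Total bits per entry" row of Table 1 for 128B .. 4 kB:
assert [entry_bits(r) for r in (128, 256, 1024, 4096)] == [50, 57, 103, 293]

# Scalability numbers from Section 4.3: 1 kB range with 8 and 16 GPUs.
assert entry_bits(1024, gpus=8) == 167
assert entry_bits(1024, gpus=16) == 295
```

The same function makes the trade-off explicit: the base shrinks slowly with the range, while position and sharer bits grow linearly, which is why the 4 kB design balloons to 293 bits.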
Remote reads: When the GPU receives a read request from a remote GPU, the cache controller extracts the base and offset from the tag address ( A ). The controller then looks up the coherence directory for an entry with the matching base address ( B ). If a valid entry is found, the position bit corresponding to the offset calculated using the formula in Section 4.1 and the associated sharer bit are set ( C ). For example, the position bit is 340₁₆/64 × 4 = 52, representing the 14th cache line within the specified 1 kB range. The sharer bit is determined by the source GPU index (e.g., GPU1). Therefore, bits 52 and 53 are set to 1. It can happen that the position bit is already set; nevertheless, the controller still performs a bitwise OR on the bits at the corresponding positions. Since the entry already exists in the directory, it remains valid. Otherwise, if no valid entry is found, a new entry is created with the base address, and the position and sharer bits are set. With the insertion of a new entry, the state transitions from invalid to valid.

Local writes: When the write request is performed locally ( D ), the cache controller must determine whether it needs to send out invalidation requests to the sharers that hold a copy of the data. For this, the controller again looks up the directory with the calculated base address and offset ( E ). If an entry is found and the offset is valid (i.e., the position bit is set), the invalidation request is generated and propagated to the recorded sharers immediately ( F ). The state transition is handled differently based on two conditions. First, when another offset is tracked under the common address range, the directory entry should remain valid. Thus, the controller clears only the position and sharer bits for the specific offset of the target address. For example, in Fig. 10, the directory entry has another offset (at 𝑝 = 56) recorded under the same base address. Once the invalidation request is sent out to GPU1, the controller only clears bits 0 and 1. If the cleared bits are the last ones, the entire directory entry transitions to an invalid state to make room for new entries.

Remote writes: For the remote write request, the cache controller begins the same directory lookup process by calculating the base and offset from the tag ( G ). In our target multi-GPU system, the source GPU also performs writes to the copy of data in its local L2 cache (discussed in Section 2.2). Therefore, the controller handles remote write requests differently from local writes. When an entry already exists in the directory (i.e., hits), there may be two circumstances: (1) the target offset is invalid but the entry has other valid offsets, and (2) the target offset is already valid and one or more sharers are being tracked. If the target offset is invalid, the controller simply adds the offset and the sharer to the entry in the same way it handles remote reads. If the offset is valid, the controller adds the source GPU to the sharer list by setting its corresponding bit and clearing the other sharer bits ( H ), then sends invalidation requests to all other sharers ( I ). In Fig. 10, the entry and the target offset (at 𝑝 = 56) are both already recorded. The controller, thus, additionally sets bit 58 to add GPU2 as a sharer while clearing bit 59, and sends the invalidation request to GPU3. In either case, the directory entry remains valid. When the directory misses, the cache controller allocates a new entry to record the base, offset, and sharer from the write request. Then, the entry state transitions to valid.

Directory entry eviction/replacement: When the coherence directory becomes full, it needs to replace an entry with the newly inserted one. The baseline coherence directory uses a FIFO replacement policy. However, for workloads that exhibit irregular memory access patterns, capturing locality becomes a challenge. To address this, REC adopts a replacement policy, similar to LRU, to better retain entries that are more likely to be accessed again. When the cache controller receives a remote read request and does not find an entry with the matching base address ( J ), it determines an entry for replacement ( K ). The evicted entry is then replaced with the new entry from the incoming request ( L ). Meanwhile, the controller retrieves the base address and every merged offset from the evicted entry and reconstructs the original tag addresses. Invalidation requests are propagated to every recorded sharer associated with each tag address ( M ). Lastly, the entry transitions to an invalid state.

4.3. Discussion

Overheads: In our design, the coherence directory consists of 8K entries, with each entry covering a 1 kB range of addresses. Each entry comprises a 38-bit base address field, a 64-bit vector for offsets and sharers, and a valid bit (detailed in Table 1). Thus, the total directory size is 8192 × 103 / 8 / 1024 = 103 kB. We also estimate the area and power overhead of the coherence directory in REC, using CACTI 7.0 [25]. The results show that the directory occupies 3.94% of the area and consumes 3.28% of the power of the GPU L2 cache. REC requires no additional hardware extensions for managing the coherence directory. The existing cache controller handles operations such as base address calculation and bitwise manipulation efficiently.

Comparison to prior work: As discussed in Section 2.3, HMG [11] designs each coherence directory entry to track four cache lines at a coarse granularity. We empirically show, in Section 3.3, that GPUs require a directory size up to 12× the baseline to eliminate unnecessary cache line invalidations. Since REC coalesces up to 16 consecutive cache line addresses per entry, REC can track a significantly larger number of cache lines compared to the prior work. Moreover, REC precisely tracks each address by storing the offset and sharer information. Thus, REC fully supports fine-grained management of cache lines under write operations.

Scalability: REC requires modifications to its design in large-scale systems, specifically to the sharer bit field. For an 8-GPU system, REC requires (8−1) × 16 = 112 bits to record sharers in each entry. Then, the size of each entry becomes 112 + 38 + 16 + 1 = 167 bits, which is approximately three times the baseline size, where each entry costs 56 bits, including a 4-bit increase for sharers. Similarly, for a 16-GPU system, REC requires 295 bits per entry, roughly five times the baseline size. However, as observed in Section 3.3, an ideal GPU requires up to 12 times the baseline directory size even in a 4-GPU system, implying that simply increasing the baseline directory size is insufficient to meet scalability demands.

5. Evaluation

5.1. Methodology

We use MGPUSim [20], a cycle-accurate multi-GPU simulator, to model the baseline and REC architectures with four AMD GPUs connected using inter-GPU links of 300 GB/s bandwidth [26]. The configuration of the modeled GPU architecture is detailed in Table 2. Each GPU includes L1 scalar and instruction caches shared within each SA, while the L1 vector cache is private to each CU, and the L2 cache is shared across the GPU. We extend remote data caching to the L2 caches, allowing data from any GPU in the system to be cached in the L2 cache of any other GPU. Since MGPUSim does not include support for hardware cache coherence, we extend the simulator by implementing a coherence directory managed by the L2 cache controller. The coherence directory is implemented with a set-associative structure to reduce lookup latency. Since the baseline coherence directory is decoupled from the caches, its way associativity as well as its size can be scaled independently. In our evaluation, the coherence directory is designed with an 8-way set-associative structure to reduce conflict misses, containing 8K entries in both the baseline and REC architectures. Upon receiving remote read requests, the cache controller updates the coherence directory by recording the addresses and the associated sharers. Once the capacity of the directory is reached, the cache controller evicts an entry and sends out invalidation requests to the recorded sharers. Upon receiving write requests, the controller looks up the directory to find whether data with matching addresses are shared by remote GPUs. If matching entries are found, invalidation requests are propagated to the sharers except the source GPU. Additionally, since L2 caches are managed by coherence directories, acquire operations do not perform invalidations on L2 caches, but release operations flush the L2 caches. We use workloads from a diverse set of benchmark suites, including AMDAPPSDK [23], Heteromark [22], Polybench [21], and SHOC [24]. Table 3 lists the workloads with their memory footprints.
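The bit manipulations in these flows can be checked with a short Python sketch. The 0x340 (340₁₆) offset, the 𝑝 = 56 offset, and the GPU indices follow the running example above; the function names are illustrative, and sharer indices count only the remote GPUs (0 → GPU1, 1 → GPU2, 2 → GPU3).

```python
# Worked examples from the REC protocol flows (1 kB range, 3 sharers/line).
def position_bit(tag, m=1024, n=3):
    return (tag % m) // 64 * (n + 1)   # p = ((Tag mod m) / 64) * (n + 1)

# Remote read: offset 340 (hex) maps to p = 52, the 14th line in the range;
# the sharer bit of GPU1 is p + 1 = 53, so bits 52 and 53 are set.
p = position_bit(0x340)
assert p == 52 and p // 4 == 13        # 14th line (index 13)

# Remote write at p = 56: GPU3 (bit 59) already shares the line. The writer
# GPU2 is added (bit 58, step H) and GPU3 is cleared and invalidated (step I).
entry = (1 << 56) | (1 << 59)          # position bit set; GPU3 is a sharer
entry |= 1 << 58                       # add GPU2 as a sharer
entry &= ~(1 << 59)                    # clear GPU3 before its invalidation
assert entry == (1 << 56) | (1 << 58)

# Eviction (step M): reconstruct each coalesced 64B-aligned address and its
# sharers from the base address and the position/sharer vector.
def evicted_addresses(base, vector, lines=16, n=3):
    out = []
    for line in range(lines):
        q = line * (n + 1)             # position bit for this line
        if vector >> q & 1:            # line was tracked in the entry
            sharers = [g for g in range(n) if vector >> (q + 1 + g) & 1]
            out.append(((base << 10) | line * 64, sharers))
    return out

vec = (1 << 0) | (1 << 1) | (1 << 8) | (1 << 10)  # line 0: GPU1; line 2: GPU2
assert evicted_addresses(32, vec) == [(0x8000, [0]), (0x8080, [1])]
```

Because the base and position index together uniquely identify each 64B line, the reconstruction at eviction loses no address information, matching the discussion in Section 4.3.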
Fig. 11. Performance comparison of the baseline with double-sized coherence directory, HMG [11], REC, and an idealized system with zero unnecessary invalidations. Performance is normalized to the baseline with 8K-entry coherence directory.

Fig. 12. Number of coalesced cache line addresses at directory entry eviction under REC with varying addressable ranges. REC in this work coalesces with 1 kB addressable range.

Fig. 13. Total number of L2 cache misses in the baseline with double-sized coherence directory, HMG [11], and REC relative to the baseline.

5.2. Performance analysis

Fig. 11 shows the performance of the baseline with a coherence directory double in size, HMG [11], REC, and an ideal multi-GPU system with zero unnecessary invalidations, relative to the baseline. First, we include the performance of the baseline with double the coherence directory size to compare REC at the same storage cost. The result shows that the baseline with double the directory size achieves an average speedup of 7.3%. The baseline coherence directory tracks each remote access individually, on a per-entry basis. As discussed in Section 3.3, doubling the size of the coherence directory does not mitigate the unnecessary cache line invalidations for applications with significant directory evictions. Also, results show that HMG and REC achieve average speedups of 16.7% and 32.7% across the evaluated workloads. We observe that REC outperforms the prior scheme for two reasons. First, REC delays directory evictions by allowing each entry to record more cache line addresses over a wider range. Since HMG uses each directory entry to track four cache lines, an entire coherence directory can track cache lines up to 4× the baseline. On the other hand, the directory in REC can record up to 16× the number of entries. Second, REC manages write operations to shared cache lines at a fine granularity by searching the directory with exact addresses and sharers, propagating invalidations only when necessary. Since each directory entry of HMG stores only a single address and sharer ID field that covers four cache lines, writes to any of these cache lines trigger invalidation requests to every cache line and recorded sharer, which leads to false positives. In contrast, REC does not allow any false positives and performs invalidations only on the modified cache lines and the associated sharers. As a result, REC reduces unnecessary invalidations on cache lines that are actively being accessed by the requesting GPUs, minimizing redundant remote memory accesses.

To investigate the effectiveness of REC under the different addressable ranges listed in Table 1, we also measure the number of coalesced cache line addresses when an entry is evicted and plot the results in Fig. 12. We observe that the directory entries capture an average of 1.8, 3.4, 12.9, and 54.7 addresses until eviction under REC with 128B, 256B, 1 kB, and 4 kB coalesceable ranges. Specifically, REC captures more than 14 addresses before directory eviction for applications with strong spatial locality.

Fig. 12 also illustrates the characteristics of limited locality for certain workloads where REC benefits less. In ATAX, PR, and ST, REC coalesces 3.9, 6.1, and 5.8 addresses, respectively. This is because these applications exhibit locality that is challenging to capture due to their irregular memory access patterns that span a wide range of addresses. To delay the eviction of entries in irregular workloads, we design our proposed coherence directory with an LRU-like replacement policy (discussed in Section 4.2). Another interesting observation is that the performance improvement of GEMV with REC is higher than the improvement seen when eliminating unnecessary invalidations. Our approach delays invalidations, but still performs them when the directories become full. During cache line replacement, the controller prioritizes invalid cache lines before applying the LRU policy. As a result, this delays the replacement of useful cache lines, thereby improving cache efficiency.

L2 cache misses: The performance improvement of REC is largely attributed to the reduction in cache misses caused by unnecessary invalidations from frequent evictions in the coherence directory of home GPUs. Fig. 13 shows the total number of L2 cache misses in the baseline with double-sized directory, HMG, and REC relative to the baseline. Cold misses are excluded from the results. We observe that REC reduces L2 cache misses by 53.5%. In contrast, the baseline with double-sized directory and HMG experience 1.79× and 1.40× higher numbers of cache misses than REC, since neither approach is sufficient to delay evict-initiated cache line invalidations. The result is closely related to the reduction in remote access latency, as the corresponding misses are forwarded to the remote GPUs. Addressing the remote GPU access bottleneck is performance-critical in multi-GPU systems.

Unnecessary invalidations: In the baseline, invalidation requests propagated from frequent directory evictions in the home GPU lead to higher chances of finding the corresponding cache lines still valid in the sharer-side L2 caches. This results in premature invalidations of cache lines that are actively in use, exacerbating the cache miss rate. In REC, the invalidation requests generated by directory evictions have reduced chances of invalidating valid cache lines. Fig. 14 shows that the number of unnecessary invalidations performed in remote L2 caches (i.e., where they are hits) is reduced by 84.4%. Since REC significantly delays evict-initiated invalidation requests, many cache lines have already been evicted from the caches by the time these requests are issued.

Inter-GPU transactions: The reduction in unnecessary invalidations enhances the utilization of data within the sharer GPUs and minimizes redundant accesses over inter-GPU links. Fig. 14 shows the total number of inter-GPU transactions compared to the baseline. As illustrated, REC reduces inter-GPU transactions by an average of 34.9%. The reduced inter-GPU transactions directly contribute to the overall performance improvement in multi-GPU systems.

Bandwidth impact: Fig. 15 shows the total inter-GPU bandwidth costs of invalidation requests. As presented in Section 3.2, a large fraction of invalidation requests are propagated due to frequent directory evictions. Since REC delays invalidation requests from directory evictions by allowing each entry to coalesce multiple tag addresses, the
bandwidth in most of the workloads becomes only a few gigabytes per second.

Fig. 14. Total number of unnecessary invalidations (bars) and inter-GPU transactions (plots) relative to the baseline.

Fig. 15. Total bandwidth consumption of invalidation requests.

Fig. 16. L2 cache lookup latency.

Fig. 17. Performance of REC under varying (a) coalescing address ranges and (b) number of directory entries. Results are shown relative to the baseline with an 8K-entry coherence directory.

Fig. 18. Performance comparison of REC using FIFO and LRU replacement policies. Performance is normalized to the baseline coherence directory with FIFO policy.

Fig. 19. Performance impact of different L2 cache sizes in the baseline and REC. Performance is normalized to the baseline with 2 MB L2 cache.

Cache lookup latency: Fig. 16 illustrates the average L2 cache lookup latency of REC normalized to the baseline. The results show that the lookup latency is reduced by 14.8% compared to the baseline. REC affects the average lookup latency because evict-initiated invalidation requests are propagated in bursts. However, since REC significantly delays directory evictions by coalescing multiple tag addresses, the overall latency decreases for most of the evaluated workloads.

5.3. Sensitivity analysis

Coalescing range: One important design decision in optimizing REC is determining the range over which to coalesce when remote read requests are received. As discussed in Section 4.1, a trade-off exists between the range an entry coalesces and the number of bits required: the larger the range, the more bits are needed to store the remote GPU access information. Fig. 17(a) shows that the performance of REC improves as the coalescing range increases, with performance gains beginning to saturate at 1 kB. For our applications, a 1 kB range is sufficient to capture the majority of memory access locality within the workloads. Since coalescing beyond 4 kB incurs excessive overhead in terms of bits required per entry (with 4 kB already requiring nearly 6× the baseline size), the potential performance improvement may not be substantial enough to offset the additional cost. Therefore, we choose a 1 kB range for our implementation.

Entry count: In our evaluation, we use a directory size of 8K entries to match the baseline coherence directory. Fig. 17(b) shows the performance of REC with varying numbers of directory entries, ranging from 2K to 32K. On average, REC outperforms the baseline, even with fewer entries than the baseline system with an 8K-entry coherence directory. This is because the coverage of each coherence directory in REC can increase by up to 16× when locality is fully utilized. Although applications with limited locality show performance improvements as the directory size increases, these gains are relatively modest when considered against the additional hardware costs.

FIFO replacement: Fig. 18 presents the performance of REC with a FIFO replacement policy. Our evaluation shows that the choice of replacement policy has a relatively small impact on the overall performance. For the workloads with regular and more predictable memory access patterns, the FIFO replacement policy is already effective in coalescing a sufficient number of addresses under the target ranges (shown in Fig. 12). However, for some applications, such as ATAX, PR, and ST, performance is lower with FIFO compared to REC due to their limited locality patterns. These applications, therefore, benefit from using an LRU-like replacement policy.

L2 cache size: The performance impact of different L2 cache sizes is shown in Fig. 19. The results are normalized to the baseline with a 2 MB L2 cache. The benefits from increasing L2 cache capacity are limited by the baseline coherence directory. In contrast, the performance of REC improves as L2 cache size increases, demonstrating its ability to leverage larger caches effectively. Another observation is that the performance improvement with smaller L2 capacity is less significant compared to larger L2 caches. This is because the coverage of the
Fig. 20. Performance impact of different inter-GPU bandwidth in the baseline and REC. Fig. 23. Performance of REC in different GPU architecture.
|
||
Performance is normalized to the baseline with 300 GB/s inter-GPU bandwidth.
|
||
|
||
|
||
|
||
|
||
Fig. 21. Performance of REC with different number of SAs normalized to the baseline
|
||
Fig. 24. Performance of REC with DNN applications.
|
||
with 16 SAs.
|
||
|
||
|
||
|
||
16-GPU systems, respectively. We observe that the performance im-
|
||
provement decreases as the number of GPUs increases. This is because,
|
||
with more GPUs, the application dataset is more distributed, and the
|
||
amount of data allocated to each GPU’s memory decreases, resulting
|
||
in reduced pressure on each coherence directory for tracking shared
|
||
copies. Additionally, we compare REC with the baseline configured
|
||
with different directory sizes to match equal storage costs (discussed in
|
||
Section 4.3). We observe that REC achieves performance improvements
|
||
of 2.04× and 1.83× over the baseline with directory sizes increased by
|
||
Fig. 22. Performance comparison of REC and the baseline with equal storage cost 3× and 5×, respectively. The results confirm that simply increasing di-
|
||
under different number of GPUs. Performance is normalized to the baseline with 8K
|
||
rectory sizes is not an efficient approach, even in large-scale multi-GPU
|
||
entries.
|
||
systems.
|
||
|
||
5.4. REC with Different GPU Architecture
|
||
baseline coherence directory relatively increases as the L2 cache size
|
||
decreases. To further explore the performance sensitivity to different
|
||
L2 cache sizes, we evaluate REC in systems with L2 cache sizes of 0.5 MB and 8 MB. We find that REC achieves an average performance improvement of 6.3% and 26.7% compared to the baseline with 0.5 MB and 8 MB L2 caches, respectively. Additionally, the performance gain of REC decreases as the L2 cache size increases, since the effectiveness of REC also diminishes with larger caches. Nevertheless, the results emphasize the importance of the coherence protocol in improving cache efficiency.

Inter-GPU bandwidth: The bandwidth of inter-GPU links is a critical factor in scaling multi-GPU performance. Fig. 20 shows the performance of the baseline and REC under different inter-GPU bandwidths, relative to the 300 GB/s baseline. The results demonstrate that REC outperforms the baseline, even in applications where performance begins to saturate with increased bandwidth.

Number of SAs: We also evaluate REC while increasing the number of SAs, as shown in Fig. 21. The performance improvement of REC decreases compared to the system with 16 SAs since the increased number of SAs improves the thread-level parallelism of GPUs. However, a system with a larger number of SAs also elevates the intensity of data sharing and thus increases the frequency of coherence directory evictions. As a result, REC outperforms the baseline with 16 SAs by 17.1%.

Number of GPUs: We evaluate REC in 8-GPU and 16-GPU systems, as shown in Fig. 22. To ensure a fair comparison, we do not change the workload sizes. The results show that REC provides performance improvements of 24.7% and 14.7% over the baseline in 8-GPU and 16-GPU systems, respectively.

We extend the evaluation of REC to include a different GPU architecture by adapting the simulation environment to a more recent NVIDIA-styled GPU [27]. This involves increasing the number of computation and memory resources compared to the AMD GPU setup. Specifically, we change the GPU configuration to include 128 CUs, each with a 128 kB L1V cache. The L2 cache size is increased to 72 MB with the cache line size adjusted to 128 B. With the increased cache line size, we configure the addressable range of REC to 2 kB, allowing for coalescing up to the same number of tag addresses. We also scale the input sizes of the workloads as far as the simulations remain feasible. The performance results in Fig. 23 show that REC achieves a 12.9% performance improvement over the baseline. This indicates that our proposed REC also benefits the NVIDIA-like GPU architecture.

5.5. Effectiveness of REC on DNN applications

We evaluate the performance improvement of REC in training two DNN models, VGG16 and ResNet18, using the Tiny-Imagenet-200 dataset [28]. As shown in Fig. 24, REC outperforms the baseline for training VGG16 and ResNet18 by 5.6% and 8.9%, respectively. The results imply that REC also has benefits in multi-GPU training on DNN workloads. Additionally, GPUs have recently gained significant attention for training large language models (LLMs). The computation of LLM training comprises multiple decoder blocks, each primarily consisting of a series of matrix and vector operations [29].
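The range-based coalescing described above (a 2 kB addressable range over 128 B cache lines, i.e., up to 16 tag addresses per directory entry) can be sketched as follows. This is a minimal illustrative model, not the paper's hardware implementation: the class and method names are our own, and per-GPU sharer tracking is omitted for brevity.

```python
# Illustrative sketch (not REC's actual hardware design): a coherence
# directory that coalesces every cache-line tag falling inside one aligned
# address range into a single entry with a per-line presence bit vector.
LINE_SIZE = 128                               # bytes per cache line
RANGE_SIZE = 2048                             # REC addressable range (2 kB)
LINES_PER_RANGE = RANGE_SIZE // LINE_SIZE     # 16 tag addresses per entry

class RangeEntry:
    """One directory entry covering an aligned 2 kB region."""
    def __init__(self, range_tag):
        self.range_tag = range_tag            # address // RANGE_SIZE
        self.line_bits = 0                    # which lines are tracked

class CoalescedDirectory:
    def __init__(self):
        self.entries = {}                     # range_tag -> RangeEntry

    def track(self, addr):
        """Record a shared line; coalesces into an existing entry when
        another line of the same 2 kB range is already tracked."""
        tag = addr // RANGE_SIZE
        entry = self.entries.setdefault(tag, RangeEntry(tag))
        entry.line_bits |= 1 << ((addr % RANGE_SIZE) // LINE_SIZE)

    def invalidate_line(self, addr):
        """Write-initiated invalidation at cache-line granularity: clear
        only the written line's bit, never the whole range."""
        tag = addr // RANGE_SIZE
        entry = self.entries.get(tag)
        if entry is None:
            return []
        bit = (addr % RANGE_SIZE) // LINE_SIZE
        entry.line_bits &= ~(1 << bit)
        if entry.line_bits == 0:
            del self.entries[tag]             # entry freed only when empty
        return [tag * RANGE_SIZE + bit * LINE_SIZE]

d = CoalescedDirectory()
for a in (0x1000, 0x1080, 0x1100):            # three lines, same 2 kB range
    d.track(a)
assert len(d.entries) == 1                    # one coalesced entry, not three
d.invalidate_line(0x1080)                     # a write evicts only its line
assert len(d.entries) == 1                    # the other two lines survive
```

Coalescing three lines from the same range consumes one directory entry instead of three, while a write still invalidates only its own 128 B line, matching the fine-grained write-initiated invalidations described in the text.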
In our evaluation, we observe that REC improves multi-GPU performance by 20.2% and 20.4% on GEMM and GEMV workloads, respectively. Considering real-world LLM training, the memory requirements can become significant with large parameter counts, which can pressure memory systems and lead to under-utilization of computation resources [29]. Since REC improves the cache efficiency in multi-GPU systems, we expect a higher performance potential from REC in real-world LLM training.

6. Related work

Several prior works have proposed GPU memory consistency and cache coherence mechanisms optimized for general-purpose domains [13–15,19,30–32]. GPU-VI [19] reduces stalls at the cache controller by employing write-through, write-no-allocate L1 caches and treating loads to pending writes as misses. To maintain write atomicity, GPU-VI adds transient states and state transitions and requires invalidation acknowledgments before write completion. REC is implemented based on the relaxed memory models commonly adopted in recent GPU architectures, which do not require acknowledgments to be sent or received over long-latency inter-GPU links. HMG [11] proposes a lightweight directory protocol by addressing up-to-date memory consistency and coherence requirements. HMG integrates separate layers for managing inter-GPM and inter-GPU level coherence, reducing network traffic and complexity in deeply hierarchical multi-GPU systems. REC primarily addresses the increased cache misses to remotely fetched data caused by frequent invalidations. Additionally, REC can be extended to support the hierarchical multi-GPU systems proposed by HMG without significant hardware modifications.

Other efforts aim to design efficient cache coherence protocols for other processor domains. Wang et al. [33] suggested a method to efficiently support dynamic task parallelism on heterogeneous cache-coherent systems. Zuckerman et al. [34] proposed Cohmeleon, which orchestrates coherence for accelerators in heterogeneous system-on-chip designs. HieraGen [35] and HeteroGen [36] are automated tools for generating hierarchical and heterogeneous cache coherence protocols, respectively, for generic processor designs. Li et al. [37] proposed methodologies to determine the minimum number of virtual networks for cache coherence protocols that can avoid deadlocks. However, these studies do not address the challenges of redundant invalidations in the cache coherence mechanisms of multi-GPU systems.

Significant research has addressed the NUMA effect in multi-GPU systems by proposing efficient page placement and migration strategies [5,6,38], data transfer and replication methods [4,7,8,10,39,40], and address translation schemes [41–43]. In particular, several works have focused on improving the management of shared data within the local memory hierarchy. NUMA-aware cache partitioning [3] dynamically allocates cache space to accommodate data from both local and remote memory by monitoring inter-GPU and local DRAM bandwidths. The authors also extend software coherence with bulk invalidations to L2 caches and evaluate the overhead associated with unnecessary invalidations. SAC [12] proposes reconfigurable last-level caches (LLCs) that can be utilized as either memory-side or SM-side, depending on predicted application behavior in terms of effective LLC bandwidth. SAC evaluates the performance of both software and hardware extensions for LLC coherence. In contrast, REC specifically targets the issue of unnecessary invalidations under hardware coherence, which can undermine the efficiency of remote data caching. It introduces a new directory structure, carefully examining the trade-off between performance and storage overhead.

Recent studies on multi-GPU and multi-node GPU systems also address challenges in various domains. Researchers proposed methods to accelerate deep learning applications [44], graph neural networks [45], and graphics rendering applications [46] in multi-GPU systems. Na et al. [47] addressed security challenges in inter-GPU communications under a unified virtual memory framework. Barre Chord [48] leverages page allocation schemes in multi-chip-module GPUs to reduce address translation overheads. Villa et al. [49] studied designing trustworthy system-level simulation methodologies for single- and multi-GPU systems. Lastly, NGS [50] enables multiple nodes in a data center network to share the compute resources of GPUs on top of a virtualization technique.

7. Conclusion

In this paper, we propose REC to improve the efficiency of cache coherence in multi-GPU systems. Our analysis shows that the limited capacity of coherence directories in fine-grained hardware protocols frequently leads to evictions and unnecessary invalidations of shared data. As a result, the increase in cache misses exacerbates NUMA overhead, leading to significant performance degradation in multi-GPU systems. To address this challenge, REC leverages memory access locality to coalesce multiple tag addresses within common address ranges, effectively increasing the coverage of coherence directories without incurring significant hardware overhead. Additionally, REC maintains write-initiated invalidations at a fine granularity to ensure precise and flexible coherence across GPUs. Experiments show that REC reduces L2 cache misses by 53.5% and improves overall system performance by 32.7%.

CRediT authorship contribution statement

Gun Ko: Writing – original draft, Visualization, Validation, Software, Resources, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Jiwon Lee: Formal analysis, Conceptualization. Hongju Kal: Validation, Conceptualization. Hyunwuk Lee: Visualization, Validation. Won Woo Ro: Supervision, Project administration, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2024-00402898, Simulation-based High-speed/High-Accuracy Data Center Workload/System Analysis Platform).

Data availability

The authors are unable or have chosen not to specify which data has been used.

References

[1] NVIDIA, NVIDIA DGX-2, 2018, https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-2/dgx-2-print-datasheet-738070-nvidia-a4-web-uk.pdf.
[2] NVIDIA, NVIDIA DGX A100 system architecture, 2020, https://download.boston.co.uk/downloads/3/8/6/386750a7-52cd-4872-95e4-7196ab92b51c/DGX%20A100%20System%20Architecture%20Whitepaper.pdf.
[3] U. Milic, O. Villa, E. Bolotin, A. Arunkumar, E. Ebrahimi, A. Jaleel, A. Ramirez, D. Nellans, Beyond the socket: NUMA-aware GPUs, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2017, pp. 123–135.
[4] V. Young, A. Jaleel, E. Bolotin, E. Ebrahimi, D. Nellans, O. Villa, Combining HW/SW mechanisms to improve NUMA performance of multi-GPU systems, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2018, pp. 339–351.
[5] T. Baruah, Y. Sun, A.T. Dinçer, S.A. Mojumder, J.L. Abellán, Y. Ukidave, A. Joshi, N. Rubin, J. Kim, D. Kaeli, Griffin: Hardware-software support for efficient page migration in multi-GPU systems, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2020, pp. 596–609.
[6] M. Khairy, V. Nikiforov, D. Nellans, T.G. Rogers, Locality-centric data and threadblock management for massive GPUs, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2020, pp. 1022–1036.
[7] H. Muthukrishnan, D. Lustig, D. Nellans, T. Wenisch, GPS: A global publish-subscribe model for multi-GPU memory management, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 46–58.
[8] L. Belayneh, H. Ye, K.-Y. Chen, D. Blaauw, T. Mudge, R. Dreslinski, N. Talati, Locality-aware optimizations for improving remote memory latency in multi-GPU systems, in: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2022, pp. 304–316.
[9] S.B. Dutta, H. Naghibijouybari, A. Gupta, N. Abu-Ghazaleh, A. Marquez, K. Barker, Spy in the GPU-box: Covert and side channel attacks on multi-GPU systems, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2023, pp. 633–645.
[10] H. Muthukrishnan, D. Lustig, O. Villa, T. Wenisch, D. Nellans, FinePack: Transparently improving the efficiency of fine-grained transfers in multi-GPU systems, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2023, pp. 516–529.
[11] X. Ren, D. Lustig, E. Bolotin, A. Jaleel, O. Villa, D. Nellans, HMG: Extending cache coherence protocols across modern hierarchical multi-GPU systems, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2020, pp. 582–595.
[12] S. Zhang, M. Naderan-Tahan, M. Jahre, L. Eeckhout, SAC: Sharing-aware caching in multi-chip GPUs, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2023, pp. 605–617.
[13] B.A. Hechtman, S. Che, D.R. Hower, Y. Tian, B.M. Beckmann, M.D. Hill, S.K. Reinhardt, D.A. Wood, QuickRelease: A throughput-oriented approach to release consistency on GPUs, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2014, pp. 189–200.
[14] M.D. Sinclair, J. Alsop, S.V. Adve, Efficient GPU synchronization without scopes: Saying no to complex consistency models, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2015, pp. 647–659.
[15] J. Alsop, M.S. Orr, B.M. Beckmann, D.A. Wood, Lazy release consistency for GPUs, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2016, pp. 1–13.
[16] NVIDIA, NVIDIA Tesla V100 GPU architecture, 2017, https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
[17] NVIDIA, NVIDIA A100 tensor core GPU architecture, 2020, https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf.
[18] NVIDIA, NVIDIA NVLink high-speed GPU interconnect, 2024, https://www.nvidia.com/en-us/design-visualization/nvlink-bridges/.
[19] I. Singh, A. Shriraman, W.W.L. Fung, M. O’Connor, T.M. Aamodt, Cache coherence for GPU architectures, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2013, pp. 578–590.
[20] Y. Sun, T. Baruah, S.A. Mojumder, S. Dong, X. Gong, S. Treadway, Y. Bao, S. Hance, C. McCardwell, V. Zhao, H. Barclay, A.K. Ziabari, Z. Chen, R. Ubal, J.L. Abellán, J. Kim, A. Joshi, D. Kaeli, MGPUSim: Enabling multi-GPU performance modeling and optimization, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2019, pp. 197–209.
[21] T. Yuki, L.-N. Pouchet, Polybench 4.0, 2015.
[22] Y. Sun, X. Gong, A.K. Ziabari, L. Yu, X. Li, S. Mukherjee, C. Mccardwell, A. Villegas, D. Kaeli, Hetero-mark, a benchmark suite for CPU-GPU collaborative computing, in: Proceedings of IEEE International Symposium on Workload Characterization, 2016, pp. 1–10.
[23] AMD, AMD APP SDK OpenCL optimization guide, 2015.
[24] A. Danalis, G. Marin, C. McCurdy, J.S. Meredith, P.C. Roth, K. Spafford, V. Tipparaju, J.S. Vetter, The Scalable Heterogeneous Computing (SHOC) benchmark suite, in: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010, pp. 63–74.
[25] R. Balasubramonian, A.B. Kahng, N. Muralimanohar, A. Shafiee, V. Srinivas, CACTI 7: New tools for interconnect exploration in innovative off-chip memories, ACM Trans. Archit. Code Optim. 14 (2) (2017) 14:1–25.
[26] NVIDIA, NVIDIA DGX-1 with Tesla V100 system architecture, 2017, pp. 1–43.
[27] NVIDIA, NVIDIA ADA GPU architecture, 2023, https://images.nvidia.com/aem-dam/Solutions/Data-Center/l4/nvidia-ada-gpu-architecture-whitepaper-v2.1.pdf.
[28] Y. Le, X. Yang, Tiny ImageNet visual recognition challenge, 2015, http://vision.stanford.edu/teaching/cs231n/reports/2015/pdfs/yle_project.pdf.
[29] G. Heo, S. Lee, J. Cho, H. Choi, S. Lee, H. Ham, G. Kim, D. Mahajan, J. Park, NeuPIMs: NPU-PIM heterogeneous acceleration for batched LLM inferencing, in: Proceedings of ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024, pp. 722–737.
[30] K. Koukos, A. Ros, E. Hagersten, S. Kaxiras, Building heterogeneous Unified Virtual Memories (UVMs) without the overhead, ACM Trans. Archit. Code Optim. 13 (1) (2016).
[31] X. Ren, M. Lis, Efficient sequential consistency in GPUs via relativistic cache coherence, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2017, pp. 625–636.
[32] S. Puthoor, M.H. Lipasti, Turn-based spatiotemporal coherence for GPUs, ACM Trans. Archit. Code Optim. 20 (3) (2023).
[33] M. Wang, T. Ta, L. Cheng, C. Batten, Efficiently supporting dynamic task parallelism on heterogeneous cache-coherent systems, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2020, pp. 173–186.
[34] J. Zuckerman, D. Giri, J. Kwon, P. Mantovani, L.P. Carloni, Cohmeleon: Learning-based orchestration of accelerator coherence in heterogeneous SoCs, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 350–365.
[35] N. Oswald, V. Nagarajan, D.J. Sorin, HieraGen: Automated generation of concurrent, hierarchical cache coherence protocols, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2020, pp. 888–899.
[36] N. Oswald, V. Nagarajan, D.J. Sorin, V. Gavrielatos, T. Olausson, R. Carr, HeteroGen: Automatic synthesis of heterogeneous cache coherence protocols, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2022, pp. 756–771.
[37] W. Li, N. Oswald, V. Nagarajan, D.J. Sorin, Determining the minimum number of virtual networks for different coherence protocols, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2024, pp. 182–197.
[38] Y. Wang, B. Li, A. Jaleel, J. Yang, X. Tang, GRIT: Enhancing multi-GPU performance with fine-grained dynamic page placement, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2024, pp. 1080–1094.
[39] M.K. Tavana, Y. Sun, N.B. Agostini, D. Kaeli, Exploiting adaptive data compression to improve performance and energy-efficiency of compute workloads in multi-GPU systems, in: Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2019, pp. 664–674.
[40] H. Muthukrishnan, D. Nellans, D. Lustig, J.A. Fessler, T.F. Wenisch, Efficient multi-GPU shared memory via automatic optimization of fine-grained transfers, in: Proceedings of the ACM/IEEE International Symposium on Computer Architecture, 2021, pp. 139–152.
[41] B. Li, J. Yin, Y. Zhang, X. Tang, Improving address translation in multi-GPUs via sharing and spilling aware TLB design, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 1154–1168.
[42] B. Li, J. Yin, A. Holey, Y. Zhang, J. Yang, X. Tang, Trans-FW: Short circuiting page table walk in multi-GPU systems via remote forwarding, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2023, pp. 456–470.
[43] B. Li, Y. Guo, Y. Wang, A. Jaleel, J. Yang, X. Tang, IDYLL: Enhancing page translation in multi-GPUs via light weight PTE invalidations, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2023, pp. 1163–1177.
[44] E. Choukse, M.B. Sullivan, M. O’Connor, M. Erez, J. Pool, D. Nellans, Buddy compression: Enabling larger memory for deep learning and HPC workloads on GPUs, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2020, pp. 926–939.
[45] Y. Tan, Z. Bai, D. Liu, Z. Zeng, Y. Gan, A. Ren, X. Chen, K. Zhong, BGS: Accelerate GNN training on multiple GPUs, J. Syst. Archit. 153 (2024) 103162.
[46] X. Ren, M. Lis, CHOPIN: Scalable graphics rendering in multi-GPU systems via parallel image composition, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2021, pp. 709–722.
[47] S. Na, J. Kim, S. Lee, J. Huh, Supporting secure multi-GPU computing with dynamic and batched metadata management, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2024, pp. 204–217.
[48] Y. Feng, S. Na, H. Kim, H. Jeon, Barre chord: Efficient virtual memory translation for multi-chip-module GPUs, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2024, pp. 834–847.
[49] O. Villa, D. Lustig, Z. Yan, E. Bolotin, Y. Fu, N. Chatterjee, Need for speed: Experiences building a trustworthy system-level GPU simulator, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2021, pp. 868–880.
[50] J. Prades, C. Reaño, F. Silla, NGS: A network GPGPU system for orchestrating remote and virtual accelerators, J. Syst. Archit. 151 (2024) 103138.
Gun Ko received the B.S. degree in electrical engineering from Pennsylvania State University in 2017. He is currently pursuing the Ph.D. degree with the Embedded Systems and Computer Architecture Laboratory, School of Electrical and Electronic Engineering, Yonsei University, Seoul, South Korea. His current research interests include GPU memory systems, multi-GPU systems, and virtual memory.

Jiwon Lee received the B.S. and Ph.D. degrees in electrical and electronic engineering from Yonsei University, Seoul, South Korea, in 2018 and 2024, respectively. He currently works in the memory division at Samsung Electronics. His research interests include virtual memory, GPU memory systems, and storage systems.

Hongju Kal received the B.S. degree from Seoul National University of Science and Technology and the Ph.D. degree from the School of Electrical and Electronic Engineering, Yonsei University, Seoul, South Korea, in 2018 and 2024, respectively. He currently works in the memory division at Samsung Electronics. His current research interests include memory architectures, memory hierarchies, near memory processing, and neural network accelerators.

Hyunwuk Lee received his B.S. and Ph.D. degrees in electrical and electronic engineering from Yonsei University, Seoul, Korea, in 2018 and 2024, respectively. He currently works in the memory division at Samsung Electronics. His research interests include neural network accelerators and GPU systems.

Won Woo Ro received the B.S. degree in electrical engineering from Yonsei University, Seoul, South Korea, in 1996, and the M.S. and Ph.D. degrees in electrical engineering from the University of Southern California, in 1999 and 2004, respectively. He worked as a Research Scientist with the Electrical Engineering and Computer Science Department, University of California, Irvine. He currently works as a Professor with the School of Electrical and Electronic Engineering, Yonsei University. Prior to joining Yonsei University, he worked as an Assistant Professor with the Department of Electrical and Computer Engineering, California State University, Northridge. His industry experience includes a college internship with Apple Computer, Inc., and a contract software engineer with ARM, Inc. His current research interests include high-performance microprocessor design, GPU microarchitectures, neural network accelerators, and memory hierarchy design.