Journal of Systems Architecture 160 (2025) 103339
REC: Enhancing fine-grained cache coherence protocol in multi-GPU systems
Gun Ko, Jiwon Lee, Hongju Kal, Hyunwuk Lee, Won Woo Ro
Yonsei University, 50 Yonsei-ro Seodaemun-gu, Seoul, 03722, Republic of Korea
ARTICLE INFO

Keywords: Multi-GPU, Data sharing, Cache coherence, Cache architecture

ABSTRACT

With the increasing demands of modern workloads, multi-GPU systems have emerged as a scalable solution, extending performance beyond the capabilities of single GPUs. However, these systems face significant challenges in managing memory across multiple GPUs, particularly due to the Non-Uniform Memory Access (NUMA) effect, which introduces latency penalties when accessing remote memory. To mitigate NUMA overheads, GPUs typically cache remote memory accesses across multiple levels of the cache hierarchy, which are kept coherent using cache coherence protocols. The traditional GPU bulk-synchronous programming (BSP) model relies on coarse-grained invalidations and cache flushes at kernel boundaries, which are insufficient for the fine-grained communication patterns required by emerging applications. In multi-GPU systems, where NUMA is a major bottleneck, substantial data movement resulting from the bulk cache invalidations exacerbates performance overheads. Recent cache coherence protocols for multi-GPUs enable flexible data sharing through coherence directories that track shared data at a fine-grained level across GPUs. However, these directories are limited in capacity, leading to frequent evictions and unnecessary invalidations, which increase cache misses and degrade performance. To address these challenges, we propose REC, a low-cost architectural solution that enhances the effective tracking capacity of coherence directories by leveraging memory access locality. REC coalesces multiple tag addresses from remote read requests within common address ranges, reducing directory storage overhead while maintaining fine-grained coherence for writes. Our evaluation on a 4-GPU system shows that REC reduces L2 cache misses by 53.5% and improves overall system performance by 32.7% across a variety of GPU workloads.
1. Introduction

Multi-GPU systems have emerged to meet the growing demands of modern workloads, offering scalable performance beyond what a single GPU can deliver. However, as multi-GPU architectures scale in size and complexity [1,2], managing memory across multiple GPUs becomes increasingly challenging [3–7]. One of the primary challenges arises from the bandwidth discrepancy between local and remote memory, commonly known as the Non-Uniform Memory Access (NUMA) effect [3,4]. To mitigate the NUMA penalty, GPUs generally rely on caching remote memory accesses, allowing them to be served with local bandwidth [5,8–10]. This caching strategy is often extended across multiple levels of the cache hierarchy, including both private on-chip caches and shared caches [3,4,11,12], to better accommodate the diverse access patterns of emerging workloads.

While remote data caching offers significant performance benefits in multi-GPU systems, it also requires extending coherence throughout the cache hierarchy. Conventional GPUs rely on a simple software-inserted bulk-synchronous programming (BSP) model [11], which performs cache invalidation and flush operations at the start and end of each kernel. However, as recent GPU applications increasingly require more frequent and fine-grained communication both within and across kernels [11,13–15], these frequent synchronizations can lead to substantial cache operation and data movement overheads. Additionally, precisely managing the synchronizations places additional burdens on programmers, complicating the optimization of multi-GPU systems.

Ren et al. [11] proposed HMG, a hierarchical cache coherence protocol designed for L2 caches in large-scale multi-GPU systems. HMG employs coherence directories to record cache line addresses and their associated sharers upon receiving remote read requests. Any writes to these addresses trigger invalidations. Once capacity is reached, existing entries are evicted from the directory, triggering invalidation requests to the sharer GPUs. These invalidations are unnecessary, as the corresponding cache lines do not immediately require coherence to be maintained. When GPUs access data across a wide range of addresses, frequent directory insertions lead to a number of unnecessary invalidations for cache lines that have not yet been fully utilized. Subsequent accesses to these cache lines result in cache misses, requiring data to be fetched again over bandwidth-limited inter-GPU links.
Corresponding author.
E-mail address: wro@yonsei.ac.kr (W.W. Ro).
https://doi.org/10.1016/j.sysarc.2025.103339
Received 10 September 2024; Received in revised form 27 December 2024; Accepted 5 January 2025
Available online 9 January 2025
1383-7621/© 2025 Published by Elsevier B.V.
Fig. 1. Performance of each caching scheme normalized to a system that enables remote data caching in both L1 and L2 caches using software and hardware coherence protocols, respectively. No caching refers to a system that disables remote data caching, simplifying coherence.

Fig. 2. Baseline multi-GPU system. Each GPU has a coherence directory that records and tracks the status of shared data at given addresses along with the corresponding sharer IDs.
To evaluate the implications of the coherence protocol, we measure the performance impact of unnecessary invalidations on a 4-GPU system that caches remote data in both L1 and L2 caches. L1 caches are assumed to be software-managed, while L2 caches are managed under fine-grained invalidation through coherence directories. As Fig. 1 shows, there exists a significant performance opportunity in eliminating unnecessary invalidations caused by frequent directory evictions. Increasing the size of the coherence directory can delay evictions and the corresponding invalidation requests, but at the cost of increased hardware. Our observations indicate that to eliminate unnecessary invalidations, the size of the coherence directory would need to be substantially increased, accounting for 30.4% of the L2 cache size. As the size of GPU L2 caches continues to grow [16,17], the aggregate storage overhead of coherence directories becomes substantial, causing inefficiency in scaling for multi-GPU environments (discussed in Section 3.3).

In this paper, we propose Range-based Directory Entry Coalescing (REC), an architectural solution that mitigates unnecessary invalidation overhead by increasing the effective tracking capacity of the coherence directory without incurring significant hardware costs. Our key insight is that since directory updates are performed upon receiving remote read requests, leveraging memory access locality provides an opportunity to coalesce multiple tag addresses of shared data based on their common address range. To achieve this, we employ a coherence directory design that aggregates data from incoming remote reads sharing a common base address within the same address range, storing only the offset and the sharer IDs. We reduce the storage requirements of directory entries by designing them in a base-and-offset format, recording the common high-order bits of addresses and using a bit-vector to indicate the index of each coalesced entry within the target range. For incoming writes, if they are found in the coherence directory, invalidations are propagated only to the corresponding address, maintaining fine-grained coherence in multi-GPU systems.

To summarize, this paper makes the following contributions:

• We identify a performance bottleneck of fine-grained shared data tracking mechanisms in multi-GPU systems. Our analysis demonstrates that such methods generate unnecessary invalidations at coherence directory evictions, which incur a significant performance bottleneck due to increased cache miss rates.

• We show that simply employing larger coherence directories incurs significant storage overhead. Our analysis shows that the baseline multi-GPU system requires a 12× increase in the directories to eliminate redundant invalidations.

• We propose REC, which increases the effective coverage of the coherence directory by enabling each entry to coalesce and track multiple memory addresses along with the associated sharers. By reducing L2 cache misses by 53.5%, REC improves overall performance by 32.7% on average across our evaluated GPU workloads.

2. Background

2.1. Multi-GPU architecture

The slowdown of transistor scaling has made it increasingly difficult for single GPUs to meet the growing demands of modern workloads. Alternatively, multi-GPU systems have emerged as a viable path forward, offering enhanced performance and memory capacity by leveraging multiple GPUs connected using high-bandwidth interconnects such as PCIe and NVLink [18]. However, these inter-GPU links are likely to have bandwidth that falls far behind the local memory bandwidth [3,4,8]. The NUMA effect that arises from this large bandwidth gap can significantly impact multi-GPU performance, making it crucial to optimize remote access bottlenecks to maximize efficiency.

Fig. 2 illustrates the architectural details of our target multi-GPU system. Each GPU is divided into several SAs, each comprising a number of CUs. Every CU has its own private L1 vector cache (L1V$), while the L1 scalar cache (L1S$) and L1 instruction cache (L1I$) are shared across all CUs within an SA. Additionally, each GPU contains a larger L2 cache that is shared across all SAs. When a data access misses in the local cache hierarchy, it is forwarded to either local or remote GPU memory, depending on the data location. For local memory accesses, the cache lines are stored in both the shared L2 cache and the L1 cache private to the requesting CU. In the case of remote-GPU memory accesses, the data can be cached either only in the L1 cache of the requesting CU [4,5,8] or in both the L2 and L1 caches [3,11,12]. Caching data from remote memory nodes helps mitigate the performance degradation caused by accessing those nodes.

2.2. Remote data caching in multi-GPU

While caching remote data only in the L1 cache can save L2 cache capacity, it limits the sharing of remote data among CUs. As a result, such an approach provides lower performance gain when unnecessary invalidation overhead is eliminated in its counterpart, as shown in Fig. 1. For this reason, in this study, we assume the baseline multi-GPU architecture allows caching of remote data in both L1 and L2 caches.

A step-by-step process of remote data caching is shown in Fig. 2. Upon generating a memory request, an L1 cache lookup is performed by the requesting CU (1). When the data is not present in the L1, an L2 cache lookup is generated to check if the remote data is cached in the L2 (2). If the data is found in the L2 cache, it is returned to the requesting CU and cached in its local L1 cache. If the data is not found in the L2 cache, the request is forwarded to the remote GPU memory at the given physical address. Subsequently, the requested data is returned at a cache line granularity and cached in both the L1 and L2 caches (3). At the same time, the coherence directory, which maintains information about data locations across multiple GPUs, is
Fig. 3. Coherence protocol flows in detail. The baseline hardware protocol has two stable states: valid and invalid, with no transient states or acknowledgments required for write permissions.

Fig. 4. L2 cache miss rates in baseline and idealized system where no invalidations are propagated by coherence directory evictions. Cold misses are excluded from the results.
updated with the corresponding entry and the sharer GPU (4). Writes to remote data in the home GPU are also performed in the local L2 cache, following the write-through policy, as the corresponding GPU may access the written data in the future. Remote writes arriving at the home GPU trigger invalidation messages to be sent out to the sharer GPU(s), and the requesting GPU is recorded as a sharer (4).

2.3. Cache coherence in multi-GPU

Existing hardware protocols, such as GPU-VI [19], employ coherence directories to track sharers (i.e., L1s) and propagate write-initiated cache invalidations within a single GPU. Bringing the notion into multi-GPU environments, Ren et al. proposed HMG [11], a hierarchical design that efficiently manages both intra- and inter-GPU coherence. HMG includes two layers for selecting home nodes to track sharers: (1) the intra-GPU module (GPM) level, which selects a home GPM within a GPU, and (2) the inter-GPU level, which selects a home GPU across the entire system. A GPM is a chiplet in multi-chip module GPUs. With this, HMG reduces the complexity of tracking and maintaining coherence across a large number of sharers. HMG also optimizes performance by eliminating all transient states and most invalidation acknowledgments, leveraging the weak memory models in modern GPUs [11].

Each GPU has a coherence directory attached to its L2 cache, managed by the cache controllers. The directory is organized in a set-associative structure, and each entry contains the following fields: tag, sharer IDs, and coherence state. The tag field stores the cache line address for the data copied and fetched by the sharer. The sharer ID field is a bit-vector representing the list of sharers, excluding the home GPU. Each entry is in one of two stable states: valid or invalid. Unlike HMG [11], the baseline coherence directory tracks one cache line per entry. In contrast, a directory entry in HMG is designed to track four cache lines using a single tag address and sharer ID field, which limits its ability to manage each cache line at a fine granularity. Consequently, a write to any address tracked by a directory entry may unnecessarily invalidate other cache lines within the same range, potentially causing inefficiencies in remote data caching. We discuss the importance of reducing unnecessary cache line invalidations in detail in Section 3.1. As in typical memory allocation in multi-GPU systems, the physical address space is partitioned among the GPUs in the system. Therefore, data at any given physical address is designated to one GPU (i.e., the home GPU), and every access by a remote GPU references the coherence directory of the home GPU. For example, in Fig. 2, GPU0 requests data at address 0xA from GPU1, which is the home GPU; the corresponding entry is then inserted into the directory of GPU1 with the relevant information.

Fig. 3 shows the detailed state transitions and actions initiated by the coherence directory. Note that local and remote refer to the sources of the memory requests received: local refers to accesses from the local CUs, and remote refers to accesses from remote GPUs.

Local reads: Local read requests arriving at the L2 cache are directed to either locally- or remotely-mapped data. On cache hits, the data is returned and guaranteed to be consistent because it is either the most up-to-date data (if mapped to local DRAM) or correctly managed by the protocol (if mapped to a remote GPU). On cache misses, the requests are forwarded to either local DRAM or a remote GPU. In all cases, the directory of the requesting GPU remains unchanged.

Remote reads: For remote reads that arrive at the home GPU, the coherence directory records the ID of the requesting GPU at the given cache line address. If the line is already being tracked (i.e., the entry is found and valid), the directory simply adds the requester to the sharer field and keeps the entry in the valid state. If the line is not being tracked, the directory finds an empty spot to allocate a new entry and marks it as valid. When the directory is full and every entry is valid, it evicts an existing entry and replaces it with the new entry (discussed below).

Local writes: Local writes to data mapped to the home GPU memory look up the directory to find whether a matching entry at the line address exists. If found, invalidations are propagated to the recorded sharers in the background, and the directory entry becomes invalid.

Remote writes: By default, L2 caches use a write-back policy for local writes. As described in Section 2.2, remote writes update both the L2 cache of the requester and local memory, similar to a write-through policy. Consequently, the directory maintains the entry as valid by adding the requester to the sharer list and sends out invalidations to the other sharers recorded in the original entry.

Directory entry eviction/replacement: Coherence directories are implemented in a set-associative structure. Thus, capacity and conflict misses occur as directory lookups are initiated by the read requests continuously received from remote GPUs. To notify the sharers that the information in the evicted entry is no longer traceable, invalidations are sent out as with writes.

Acquire and release: At the start of a kernel, invalidations are performed in L1 caches, as coherence there is maintained using software bulk synchronizations. However, the invalidations are not propagated beyond the L1 caches, as L2 caches are kept coherent with the fine-grained directory protocol. Release operations flush dirty data in both L1 and L2 caches.

3. Motivation

In multi-GPU systems, coherence is managed explicitly through cache invalidations to ensure data consistency across multiple GPUs. When invalidation requests are received, sharer GPUs must look up and invalidate the corresponding cache lines. Subsequent accesses to these invalidated cache lines result in cache misses, which are then forwarded to the home GPU. This, in turn, can negate the performance benefits of local caching, as it undermines the effectiveness of caching mechanisms intended to reduce remote access bottlenecks. In this section, we analyze the behavior of cache invalidation and its impact on the overall
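The per-request directory actions above can be sketched as a small behavioral model (a minimal Python sketch, not the authors' implementation; the flat FIFO-ordered table stands in for the set-associative structure, and the entry layout follows the field description in this section):

```python
from collections import OrderedDict

class CoherenceDirectory:
    """Minimal model of the baseline per-GPU coherence directory.

    Tracks one cache line per entry: {line_address: set_of_sharer_ids}.
    A real directory is set-associative; a single FIFO-ordered table
    is used here for brevity.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # line address -> sharer GPU IDs
        self.invalidations = []        # log of (reason, address, sharers)

    def remote_read(self, addr, requester):
        """A remote GPU fetched this line: record it as a sharer."""
        if addr in self.entries:       # already tracked: just add the sharer
            self.entries[addr].add(requester)
            return
        if len(self.entries) >= self.capacity:
            # Directory full: evict the oldest entry and invalidate its
            # sharers' copies, even though the data may still be useful.
            old_addr, sharers = self.entries.popitem(last=False)
            self.invalidations.append(("evict", old_addr, sharers))
        self.entries[addr] = {requester}

    def local_write(self, addr):
        """A local write: invalidate all remote copies of the line."""
        if addr in self.entries:
            sharers = self.entries.pop(addr)
            self.invalidations.append(("write", addr, sharers))

    def remote_write(self, addr, requester):
        """A remote write-through: invalidate the other sharers but keep
        the entry valid with the writer recorded as a sharer."""
        others = self.entries.get(addr, set()) - {requester}
        if others:
            self.invalidations.append(("write", addr, others))
        self.entries[addr] = {requester}
```

With a 2-entry directory, three remote reads to distinct lines already force an evict-initiated invalidation of the first line even though it was never written; Section 3 quantifies how often this happens in practice.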
Fig. 5. Fraction of evict-initiated and write-initiated invalidations in the baseline multi-GPU system. The results are based on invalidation requests that hit in the sharer-side L2 caches.

Fig. 6. Performance impact of increasing coherence directory sizes. To eliminate unnecessary invalidations, GPUs require a directory size up to 12× larger than the baseline.
performance of multi-GPU systems. We identify the sources of invalidation and explore a straightforward solution to mitigate the associated bottlenecks. Our experiments are conducted using MGPUSim [20], a multi-GPU simulation framework that we have extended to support the hardware cache coherence protocol. The detailed configuration is provided in Table 2.

3.1. Impact of cache invalidation

To ensure data consistency across multiple GPUs, invalidation requests are propagated by the home GPU in two cases: (1) when write requests are received and (2) when an entry is evicted from the coherence directory due to capacity and conflict misses. Invalidation requests triggered by writes are crucial for maintaining data consistency, as they ensure that no stale data is accessed in the sharer GPU caches. On the other hand, invalidations generated by directory evictions aim to notify the sharers that the coherence information is no longer traceable, even if the data is still valid. A detailed background on the protocol flows with invalidations is given in Section 2.3.

Broadcasting invalidations does not significantly impact cache efficiency if the cache lines are already evicted or no longer in use. However, when applications exhibit frequent remote memory accesses, the generation of new directory entries increases invalidation requests from evictions, invalidating the associated cache lines prematurely. These premature invalidations lead to higher cache miss rates, as subsequent accesses to the invalidated cache lines result in misses. As remote data misses exacerbate NUMA overheads, they need to be reduced to improve multi-GPU performance.

Fig. 4 shows the impact on cache miss rates of eliminating unnecessary invalidations across the benchmarks listed in Table 3 running on a 4-GPU system. The figure demonstrates that the baseline system experiences a cache miss rate more than double (average 2.4×) that of the idealized system without the unnecessary invalidations. This increase is mainly due to frequent invalidation requests, which prematurely invalidate cache lines before they can be fully utilized, leading to an increase in the number of remote memory accesses. The result strongly motivates us to further study the source of these frequent invalidations to improve the efficiency of remote data caching in multi-GPU systems.

To demonstrate the performance opportunity, Fig. 1 presents a study showing the performance of idealized caching without the invalidation overhead. With no invalidations to unmodified cache lines, remote data can be fully utilized as needed until it is naturally replaced by the typical cache replacement policy. The performance of the baseline and ideal systems is represented in the first and fourth bars, respectively, in Fig. 1. The result shows that an ideal system with no unnecessary cache invalidation overheads outperforms the baseline by up to 2.79× (average 36.9%). As demonstrated by Figs. 1 and 4, reducing premature cache invalidations is crucial to improving the efficiency of remote data caching in multi-GPU systems.

3.2. Source of premature invalidation

As described in Section 2.3, when a coherence directory becomes full, the GPU needs to evict an old entry and replace it with a new one upon receiving a remote read request; an invalidation request must be sent out to the sharer(s) in the evicted entry. Fig. 5 shows the distribution of invalidations triggered by directory evictions and write requests, referred to as evict-initiated and write-initiated invalidations, respectively. The measurements are taken based on the invalidations that hit in the sharer-side L2 caches after receiving the requests. We observe that a significant fraction of invalidations (average 79.5%) is performed by the requests from directory evictions in the home GPUs. These invalidations, considered unnecessary as they do not require immediate action, should be delayed until remote GPUs have made full use of the data.

We also show the percentage of write-initiated invalidations in Fig. 5. One can observe that applications such as FIR, LU, and MM2 experience a significant number of invalidations due to write requests. These workloads exhibit fine-grained communication within and across dependent kernels, necessitating the invalidation of corresponding cache lines in the remote L2 cache upon any modification to the shared data. Although these applications exhibit a high percentage of write-initiated invalidations, their impact on cache miss rates may be negligible if the GPUs do not subsequently require access to the invalidated cache lines. Nonetheless, the results in Fig. 4 clearly demonstrate the importance of minimizing unnecessary cache invalidations.

So far, we have discussed how prematurely invalidating remote data leads to increased cache miss rates, which negatively impacts multi-GPU performance. We also show that a large fraction of invalidation requests stems from directory evictions, which frequently occur due to the high volume of remote accesses. These accesses trigger numerous directory updates, overwhelming the baseline coherence directory's capacity to effectively manage coherence. A straightforward solution to mitigate premature invalidations is to increase the size of the coherence directory, providing more coverage to track sharers and reducing eviction rates. In the following section, we analyze the performance impact of larger coherence directory sizes. It is important to note that this paper primarily focuses on delaying invalidations caused by directory evictions, as write-initiated invalidations are necessary and must be performed immediately for correctness.

3.3. Increasing directory sizes

A simple approach to delaying directory evictions, and thereby minimizing premature invalidations, is to increase the size of coherence directories. Limited directory sizes lead to significant evict-initiated invalidations, which can undermine the performance benefits of local caching. To quantify the benefits of larger directories, we conduct a quantitative analysis of performance improvements with increasing directory sizes. In our simulated 4-GPU system, each GPU has an L2 cache size of 2 MB, with each cache line being 64B. Each coherence directory tracks
Fig. 7. Average performance improvement per increased directory storage in the
baseline coherence directory design. The results are normalized to the system with
8K-entry coherence directory.
the identity of all sharers excluding the home GPU (i.e., three GPUs).
To cover the entire L2 cache space for three GPUs, an ideal coherence
directory would require approximately 96K entries, or about 12× the
baseline 8K entries.
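The sizing above follows directly from the system parameters (a quick check using the values from Table 2 and this section):

```python
L2_BYTES = 2 * 1024 * 1024     # 2 MB L2 cache per GPU
LINE_BYTES = 64                # 64B cache lines
REMOTE_GPUS = 3                # sharers per home GPU (4 GPUs minus the home)
BASELINE_ENTRIES = 8 * 1024    # baseline 8K-entry directory

lines_per_l2 = L2_BYTES // LINE_BYTES        # 32K lines per L2 cache
ideal_entries = lines_per_l2 * REMOTE_GPUS   # one entry per remote-cached line
print(ideal_entries // 1024, "K entries")    # 96 K entries
print(ideal_entries // BASELINE_ENTRIES, "x the baseline")  # 12 x the baseline
```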
Fig. 6 illustrates the normalized performance for increasing the
Fig. 8. A high-level overview of (a) baseline and (b) proposed REC architecture with
directory sizes by 2×-12× the baseline. With an ideal directory size,
simplified 2-entry coherence directories. The figure illustrates a scenario where GPU1
unnecessary invalidations from directory evictions can be eliminated, accesses memory of GPU0 in order of 0 × 1000, 0 × 1040, 0 × 1080, and 0 × 1000
leaving only write-initiated invalidations. The results show that ap- by each CU. In the baseline directory, entry that tracks status of data at 0 × 1000
plications exhibit significant performance gains as the directory size is evicted for recording the address 0 × 1080. The proposed directory coalesces three
increases, with some benchmarks (e.g., ATAX, PR, and ST) requiring addresses with same base address into one entry.
8×-12× the baseline size to achieve the highest speed-up. Specifically,
benchmarks such as PR and ST show irregular memory access patterns
that span a wide address range, leading to higher chances of conflict 4.1. Hardware overview
misses when updating coherence directories. Most other tested bench-
marks require up to six times the baseline directory size to achieve As shown in Section 3.2, a significant fraction of cache invalidations
maximum attainable performance; the average speedup with six times are generated by the frequent directory evictions. These invalidations
the size is 1.35×. lead to increased cache misses, as data is prematurely invalidated from
Each entry in the coherence directory comprises a tag, sharer list, the cache, requiring subsequent accesses to fetch the data from remote
and coherence state. We assume 48 bits for tag addresses, a 3-bit memory. While simply increasing the directory size can address this
vector for tracking sharers, and one bit for the directory entry state; bottleneck, the associated cost of hardware can become substantial. To
thus, each entry requires a total of 52 bits of storage. Our baseline address this, we propose REC, an architectural solution that compresses
directory implementation has 8K entries and occupies approximately remote GPU access information, retaining as much data as possible
2.5% of the L2 cache [11]. Therefore, the storage cost of the baseline before eviction occurs. It aggregates data from incoming remote read
directory in each GPU is 52 × 8192/8/1024 = 52 kB, assuming 8 requests so that (1) multiple reads to the same address range share
bits per byte and 1024 bytes per kilobyte. From our observation in a common base address, storing only the offset and source GPU in-
Fig. 6, applications require directory sizes from 6× up to 12× the formation, and (2) the coalescing process does not result in any loss
baseline to achieve maximum performance. This corresponds to a total
of information, maintaining the accuracy of the coherence protocol.
storage cost of 312-624 kB, which is an additional 15.230.4% of
We now discuss the design overview of REC and the details of the
the L2 cache size. While increasing directory size can significantly
associated hardware components.
improve performance, the associated hardware costs are substantial.
Fig. 8(a) shows how the baseline GPU handles a sequence of in-
To show the inefficiency of simply scaling directory sizes, we calculate
coming read requests. The cache controller records the tag addresses
the performance per storage using the results in Fig. 6 and the number
and the corresponding sharer IDs in the order that the requests arrive.
of directory entries. Fig. 7 illustrates the results relative to the baseline
When the coherence directory reaches its capacity, the cache controller
with 8K entries, showing that performance improvements per increased
follows a typical FIFO policy to replace the oldest entry with a new
storage do not scale proportionally with larger coherence directories.
Additionally, since GPU applications require different directory sizes
to achieve maximum performance, simply increasing the directory size
is not an efficient solution. Moreover, as GPU L2 caches continue to
grow [16,17], the cost of maintaining proportionally larger coherence
directories will only amplify these overheads. Therefore, improving
coherence directory coverage without significant storage overhead
motivates the need for more efficient fine-grained hardware protocols
in multi-GPU systems.

4. REC architecture

This work aims to enhance coherence directory coverage while
avoiding significant hardware overhead, overall reducing unnecessary
cache invalidations in multi-GPU systems. We introduce REC, an
architecture that coalesces directory entries by leveraging the spatial
locality in memory accesses observed in GPU workloads. In this section,
we provide an overview of the REC design and discuss its integration
with existing multi-GPU coherence protocols.

When the baseline directory set becomes full, the controller must evict
one within the set. Once an entry is evicted, the information it held
can no longer be tracked, triggering an invalidation request to be sent
to the GPU listed in the entry. Upon receiving this request, the sharer
GPU checks its L2 cache and invalidates the corresponding cache line,
leading to a cache miss on any subsequent access to the cache line.

To delay invalidations caused by directory evictions without significant
hardware overhead, we introduce the REC architecture, which
enhances the baseline coherence directory by leveraging spatial locality
to merge multiple addresses into a single entry. As illustrated in
Fig. 8(b), REC stores tag addresses with common high-order bits as a
single entry using a base-plus-offset format. When a new read request
matches the base address in an existing entry, the offset and sharer
information are appended to that entry, reducing the need for additional
entries and delaying evictions. The base address represents the shared
high-order bits, covering a range of addresses and reducing the storage
required compared to storing full tag addresses individually. Additionally,
REC uses position bits to efficiently track multiple addresses within
the specified range, further minimizing storage overhead.
G. Ko et al. Journal of Systems Architecture 160 (2025) 103339
Table 1
Trade-offs between addressable range and storage for each entry. Note that one valid
bit, not shown in the table, is included in the overall calculation.

                          Addressable range
                          64B    128B   256B   1 kB    4 kB
Base address bits         48     41     40     38      36
Position/Sharer bits      0/3    2/6    4/12   16/48   64/192
Total bits per entry      52     50     57     103     293

Table 2
Baseline GPU configuration.

Parameter             Configuration
Number of SAs         16
Number of CUs         4 per SA
L1 vector cache       1 per CU, 16 kB 4-way
L1 inst cache         1 per SA, 32 kB 4-way
L1 scalar cache       1 per SA, 16 kB 4-way
L2 cache              2 MB 16-way, 16 banks, write-back
Cache line size       64B
Coherence directory   8K entries, 8-way
DRAM capacity         4 GB HBM, 16 banks
DRAM bandwidth        1 TB/s [11]
Inter-GPU bandwidth   300 GB/s, bi-directional

Table 3
Tested workloads.

Benchmark                                          Abbr.   Memory footprint
Matrix transpose and vector multiplication [21]    ATAX    128 MB
2-D convolution [21]                               C2D     512 MB
Finite impulse response [22]                       FIR     128 MB
Matrix-multiply [21]                               GEMM    128 MB
Vector multiplication and matrix addition [21]     GEMV    256 MB
2-D Jacobi solver [21]                             J2D     128 MB
LU decomposition [21]                              LU      128 MB
2 matrix multiplications [21]                      MM2     128 MB
3 matrix multiplications [21]                      MM3     64 MB
PageRank [22]                                      PR      256 MB
Simple convolution [23]                            SC      512 MB
Stencil 2D [24]                                    ST      128 MB

Fig. 9. Coherence directory entry structure for 64B cache lines. In our design, each
entry stores up to 16 coalesced entries based on a 1 kB range.

Fig. 10. Overview of the REC protocol flows. In the example coherence directory,
entry insertion and offset addition operations are highlighted in blue, while eviction
and offset deletion operations are shown in red.

Determining the address range within which REC coalesces entries is
one of the key design considerations, as it directly impacts the number
of bits required for each entry. Table 1 lists design choices for
implementing REC with varying addressable ranges and their trade-offs.
The number of required base address bits is calculated using
2^n = addressable_range, where n is the number of bits right-shifted
from the original tag address. Also, the number of required position
bits is determined by the maximum number of coalesceable cache line
addresses within the target range, assuming a 64B line size. Then, the
number of sharer bits required is (n-1) × num_position_bits, where n is
the number of GPUs. For example, if REC is designed to coalesce with an
addressable range of 256B, each entry would require 40, 4, and 12 bits
for the base address, position, and sharer fields, respectively. Lastly, one
valid bit is added to each entry. In Table 1, we show the total bits
required per entry under addressable ranges from 128B to 4 kB
for comparing the storage costs. REC designs with larger addressable
ranges can benefit from increased directory coverage but at the cost of
storage. In the evaluation of this paper, we tested various addressable
ranges for REC. Each design is configured to coalesce the maximum
number of offsets within its specified range. Later in the results, we
confirm that a 1 kB coalesceable range offers the best trade-off, balancing
reasonable size overhead per entry with the ability to coalesce
a significant number of entries before evictions occur (discussed in
Section 5.2).

Based on these findings, the format of a directory entry is as
illustrated in Fig. 9. Each entry comprises a base address, coalesced
entries, and a valid bit. When the first remote read request arrives at the
home GPU, the cache controller sets the base address by right-shifting
the tag address by the number of bits needed to represent the offset
within the specified range. For a 48-bit tag, the address is right-shifted
by 10 bits (considering a 64B-aligned 1 kB range), and bits 64 to 101
of the entry are used to store the base address. The coalesced entry is
identified using the offset within the 1 kB range, represented by a
position bit, followed by three bits for recording the sharers. The
position bit is calculated as:

p = ((Tag mod m) / 64) × (n + 1)

where m denotes the coalescing range and n is the number of sharers,
which are set to 1 kB and 3, respectively. Once the position is
determined, the corresponding position and sharer bits are set to
1 using a bitwise OR operation. Given that the 1 kB range allows each
entry to record up to 16 individual tag addresses, we use the lower 64
bits to store the coalesced entries. Furthermore, the position bit can
also function as the valid bit for each coalesced entry, meaning only
one valid bit is necessary to indicate whether the entire entry is valid
or not.

4.2. REC protocol flows

The baseline coherence protocol operates with two stable states,
valid and invalid, allowing it to remain lightweight and efficient. In
our proposed coherence directory design, each entry represents the
validity of an entire address range instead of tracking individual tag
addresses and associated sharers. This enables the state transitions
to be managed at a coarser granularity during directory evictions.
Additionally, REC supports fine-grained control over write requests by
tracking specific offsets within these address ranges, avoiding the need
to evict entire entries. Fig. 10 highlights the architecture of REC and
how it handles received requests differently from the baseline. REC
does not require additional coherence states but instead modifies the
transitions triggered under specific conditions.
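The bit-cost arithmetic behind Table 1 and the position-bit formula can be checked with a short sketch (our own illustration; `entry_bits` and `position_bit` are hypothetical names, and the parameters are the paper's stated ones: 48-bit tags, 64B lines, a 4-GPU system):

```python
import math

LINE = 64        # cache line size in bytes
TAG_BITS = 48    # tag address width assumed in Section 4.1
NUM_GPUS = 4     # 4-GPU system, i.e. n - 1 = 3 remote sharers per line

def entry_bits(addressable_range):
    """Total bits per REC entry for a given coalescing range (Table 1)."""
    base = TAG_BITS - int(math.log2(addressable_range))  # right-shifted base address
    positions = addressable_range // LINE                # one position bit per line
    sharers = (NUM_GPUS - 1) * positions                 # (n-1) sharer bits per position
    return base + positions + sharers + 1                # plus one valid bit

def position_bit(tag, m=1024, sharers=NUM_GPUS - 1):
    """p = ((Tag mod m) / 64) x (n + 1): bit index of a line's position bit."""
    return ((tag % m) // LINE) * (sharers + 1)
```

For the 1 kB design this reproduces the 103 bits per entry of Table 1 (so an 8K-entry directory costs 8192 × 103 / 8 / 1024 = 103 kB, as computed in Section 4.3), and the offset 0x340 used in the Section 4.2 example maps to position bit 52, with sharer GPU1 occupying the adjacent bit 53.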
Remote reads: When the GPU receives a read request from a
remote GPU, the cache controller extracts the base and offset from the
tag address ( A ). The controller then looks up the coherence directory
for an entry with the matching base address ( B ). If a valid entry is
found, the position bit corresponding to the offset, calculated using the
formula in Section 4.1, and the associated sharer bit are set ( C ). For
example, for the offset 0x340 the position bit is 0x340/64 × 4 = 52,
representing the 14th cache line within the specified 1 kB range. The
sharer bit is determined by the source GPU index (e.g., GPU1). Therefore,
bits 52 and 53 are set to 1. It can happen that the position bit is
already set; nevertheless, the controller still performs a bitwise OR on
the bits at the corresponding positions. Since the entry already exists
in the directory, it remains valid. Otherwise, if no valid entry is found,
a new entry is created with the base address, and the position and
sharer bits are set. With the insertion of a new entry, the state
transitions from invalid to valid.

Local writes: When a write request is performed locally ( D ), the
cache controller must determine whether it needs to send out
invalidation requests to the sharers that hold a copy of the data. For this,
the controller again looks up the directory with the calculated base
address and offset ( E ). If an entry is found and the offset is valid
(i.e., the position bit is set), the invalidation request is generated
and propagated to the recorded sharers immediately ( F ). The state
transition is handled differently based on two conditions. First, when
another offset is tracked under the common address range, the directory
entry should remain valid. Thus, the controller clears only the position
and sharer bits for the specific offset of the target address. For example,
in Fig. 10, the directory entry has another offset (at p = 56) recorded
under the same base address. Once the invalidation request is sent out
to GPU1, the controller only clears bits 0 and 1. If the cleared bits are
the last ones, the entire directory entry transitions to an invalid state
to make room for new entries.

Remote writes: For a remote write request, the cache controller
begins the same directory lookup process by calculating the base and
offset from the tag ( G ). In our target multi-GPU system, the source
GPU also performs writes to the copy of data in its local L2 cache
(discussed in Section 2.2). Therefore, the controller handles remote
write requests differently from local writes. When an entry already
exists in the directory (i.e., a hit), there may be two circumstances: (1)
the target offset is invalid but the entry has other valid offsets, or (2)
the target offset is already valid and one or more sharers are being
tracked. If the target offset is invalid, the controller simply adds the
offset and the sharer to the entry in the same way it handles remote
reads. If the offset is valid, the controller adds the source GPU to the
sharer list by setting its corresponding bit and clearing the other sharer
bits ( H ), then sends invalidation requests to all other sharers ( I ). In
Fig. 10, the entry and the target offset (at p = 56) are both already
recorded. The controller thus additionally sets bit 58 to add GPU2 as
a sharer while clearing bit 59, and sends the invalidation request
to GPU3. In either case, the directory entry remains valid. When the
directory misses, the cache controller allocates a new entry to record
the base, offset, and sharer from the write request. Then, the entry state
transitions to valid.

Directory entry eviction/replacement: When the coherence directory
becomes full, it needs to replace an entry with the newly inserted
one. The baseline coherence directory uses a FIFO replacement policy.
However, for workloads that exhibit irregular memory access patterns,
capturing locality becomes a challenge. To address this, REC
adopts a replacement policy similar to LRU to better retain entries
that are more likely to be accessed again. When the cache controller
receives a remote read request and does not find an entry with the
matching base address ( J ), it determines an entry for replacement
( K ). The evicted entry is then replaced with the new entry from the
incoming request ( L ). Meanwhile, the controller retrieves the base
address and every merged offset from the evicted entry and reconstructs
the original tag addresses. Invalidation requests are propagated to every
recorded sharer associated with each tag address ( M ). Lastly, the entry
transitions to an invalid state.

4.3. Discussion

Overheads: In our design, the coherence directory consists of 8K
entries, with each entry covering a 1 kB range of addresses. Each entry
comprises a 38-bit base address field, a 64-bit vector for offsets and
sharers, and a valid bit (detailed in Table 1). Thus, the total directory
size is 8192 × 103 / 8 / 1024 = 103 kB. We also estimate the area
and power overhead of the coherence directory in REC using CACTI
7.0 [25]. The results show that the directory occupies 3.94% of the area
and consumes 3.28% of the power of the GPU L2 cache. REC requires no
additional hardware extensions for managing the coherence directory.
The existing cache controller handles operations such as base address
calculation and bitwise manipulation efficiently.

Comparison to prior work: As discussed in Section 2.3, HMG [11]
designs each coherence directory entry to track four cache lines at
a coarse granularity. We empirically show, in Section 3.3, that GPUs
require a directory size up to 12× the baseline to eliminate unnecessary
cache line invalidations. Since REC coalesces up to 16 consecutive
cache line addresses per entry, REC can track a significantly larger
number of cache lines compared to the prior work. Moreover, REC precisely
tracks each address by storing the offset and sharer information. Thus,
REC fully supports fine-grained management of cache lines under write
operations.

Scalability: REC requires modifications to its design in large-scale
systems, specifically to the sharer bit field. For an 8-GPU system, REC
requires (8-1) × 16 = 112 bits to record sharers in each entry. Then,
the size of each entry becomes 112 + 38 + 16 + 1 = 167 bits, which
is approximately three times the baseline size, where each entry costs
56 bits, including a 4-bit increase for sharers. Similarly, for a 16-GPU
system, REC requires 295 bits per entry, roughly five times the baseline
size. However, as observed in Section 3.3, an ideal GPU requires up to
12 times the baseline directory size even in a 4-GPU system, implying
that simply increasing the baseline directory size is insufficient to meet
scalability demands.

5. Evaluation

5.1. Methodology

We use MGPUSim [20], a cycle-accurate multi-GPU simulator, to
model the baseline and REC architectures with four AMD GPUs connected
using inter-GPU links of 300 GB/s bandwidth [26]. The configuration of
the modeled GPU architecture is detailed in Table 2. Each GPU includes
L1 scalar and instruction caches shared within each SA, while the L1
vector cache is private to each CU, and the L2 cache is shared across the
GPU. We extend remote data caching to the L2 caches, allowing data
from any GPU in the system to be cached in the L2 cache of any other
GPU. Since MGPUSim does not include support for hardware cache
coherence, we extend the simulator by implementing a coherence
directory managed by the L2 cache controller. The coherence directory is
implemented with a set-associative structure to reduce lookup latency.
Since the baseline coherence directory is decoupled from the caches,
its way associativity as well as its size can be scaled independently.
In our evaluation, the coherence directory is designed with an 8-way
set-associative structure to reduce conflict misses, containing 8K entries
in both the baseline and REC architectures. Upon receiving remote read
requests, the cache controller updates the coherence directory by
recording the addresses and the associated sharers. Once the capacity of
the directory is reached, the cache controller evicts an entry and sends
out invalidation requests to the recorded sharers. Upon receiving write
requests, the controller looks up the directory to find whether data
with matching addresses are shared by remote GPUs. If matching
entries are found, invalidation requests are propagated to the sharers
except the source GPU. Additionally, since L2 caches are managed
by coherence directories, acquire operations do not perform invalidations
on L2 caches, but release operations flush the L2 caches. We
use workloads from a diverse set of benchmark suites, including
AMDAPPSDK [23], Heteromark [22], Polybench [21], and SHOC [24].
Table 3 lists the workloads with their memory footprints.
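The four protocol flows of Section 4.2 can be summarized in a toy, single-set model (our own simplification, not the paper's hardware; the class and method names are hypothetical, and we assume remote GPUs are indexed 1-3 so that the sharer bit of GPUk for a line at position p is bit p+k, consistent with the bit 52/53 and 58/59 examples above):

```python
RANGE = 1024             # 1 kB coalescing range
LINE = 64                # 64B cache lines
SHARERS = 3              # n = 3 remote sharers in a 4-GPU system
STRIDE = SHARERS + 1     # bits consumed per tracked line

def pos(tag):
    # Position bit of the cache line holding `tag` within its 1 kB range.
    return ((tag % RANGE) // LINE) * STRIDE

class RecDirectory:
    """Toy single-set REC directory with LRU-like replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}    # base -> 64-bit position/sharer vector, LRU order

    def _touch(self, base):
        self.entries[base] = self.entries.pop(base)   # move to MRU position

    def remote_read(self, tag, gpu):
        """Record a sharer; may evict the LRU entry, returning invalidations."""
        base, invalidations = tag // RANGE, []
        if base not in self.entries:
            if len(self.entries) >= self.capacity:
                invalidations = self.evict(next(iter(self.entries)))  # LRU victim
            self.entries[base] = 0
        self._touch(base)
        p = pos(tag)
        self.entries[base] |= (1 << p) | (1 << (p + gpu))  # position + sharer bit
        return invalidations

    def local_write(self, tag):
        """Invalidate sharers of one line; keep the entry if other lines remain."""
        base, p = tag // RANGE, pos(tag)
        vec = self.entries.get(base, 0)
        if not (vec >> p) & 1:
            return []
        sharers = [g for g in range(1, STRIDE) if (vec >> (p + g)) & 1]
        vec &= ~(((1 << STRIDE) - 1) << p)      # clear position + sharer bits
        if vec:
            self.entries[base] = vec
        else:
            del self.entries[base]              # last offset: entry -> invalid
        return sharers

    def remote_write(self, tag, gpu):
        """The writer becomes the sole sharer; all other sharers are invalidated.

        Capacity handling on the write-miss path is omitted for brevity.
        """
        base, p = tag // RANGE, pos(tag)
        if base not in self.entries:
            self.entries[base] = 0
        self._touch(base)
        vec = self.entries[base]
        others = [g for g in range(1, STRIDE) if g != gpu and (vec >> (p + g)) & 1]
        vec &= ~(((1 << STRIDE) - 1) << p)
        self.entries[base] = vec | (1 << p) | (1 << (p + gpu))
        return others

    def evict(self, base):
        """Reconstruct tags from merged offsets and invalidate every sharer."""
        vec = self.entries.pop(base)
        invalidations = []
        for line in range(RANGE // LINE):
            p = line * STRIDE
            if (vec >> p) & 1:
                tag = base * RANGE + line * LINE
                invalidations += [(tag, g) for g in range(1, STRIDE)
                                  if (vec >> (p + g)) & 1]
        return invalidations
```

Replaying the Fig. 10-style scenario on this model, a remote read of offset 0x340 by GPU1 sets bits 52 and 53, a local write to the line at p = 0 clears only bits 0 and 1 while the entry stays valid, and a remote write by GPU2 to a line shared by GPU3 sets bit 58, clears bit 59, and invalidates GPU3.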
Fig. 11. Performance comparison of the baseline with double-sized coherence
directory, HMG [11], REC, and an idealized system with zero unnecessary invalidations.
Performance is normalized to the baseline with 8K-entry coherence directory.

Fig. 12. Number of coalesced cache line addresses at directory entry eviction under
REC with varying addressable ranges. REC in this work coalesces with a 1 kB addressable
range.

Fig. 13. Total number of L2 cache misses in the baseline with double-sized coherence
directory, HMG [11], and REC relative to the baseline.

5.2. Performance analysis

Fig. 11 shows the performance of the baseline with a coherence
directory double in size, HMG [11], REC, and an ideal multi-GPU system
with zero unnecessary invalidations, relative to the baseline. First, we
include the performance of the baseline with double the coherence directory
size to compare REC against the same storage cost. The results show that
the baseline with double the directory size achieves an average speedup
of 7.3%. The baseline coherence directory tracks each remote access
individually, on a per-entry basis. As discussed in Section 3.3, doubling
the size of the coherence directory does not mitigate the unnecessary cache
line invalidations for applications with significant directory evictions.
Also, the results show that HMG and REC achieve average speedups of
16.7% and 32.7% across the evaluated workloads. We observe that
REC outperforms the prior scheme for two reasons. First, REC delays
directory evictions by allowing each entry to record more cache line
addresses over a wider range. Since HMG uses each directory entry to
track four cache lines, an entire coherence directory can track cache
lines up to 4× the baseline. On the other hand, the directory in REC
can record up to 16× the number of entries. Second, REC manages write
operations to shared cache lines at a fine granularity by searching the
directory with exact addresses and sharers, propagating invalidations
only when necessary. Since each directory entry of HMG stores only
a single address and sharer ID field that covers four cache lines, a
write to any of these cache lines triggers invalidation requests to every
cache line and recorded sharer, which leads to false positives. In
contrast, REC does not allow any false positives and performs invalidations
only on the modified cache lines and the associated sharers. As
a result, REC reduces unnecessary invalidations on cache lines that are
actively being accessed by the requesting GPUs, minimizing redundant
remote memory accesses. To investigate the effectiveness of REC under
the different addressable ranges listed in Table 1, we also measure the
number of coalesced cache line addresses when an entry is evicted
and plot them in Fig. 12. We observe that the directory entries capture an
average of 1.8, 3.4, 12.9, and 54.7 addresses until eviction under REC
with 128B, 256B, 1 kB, and 4 kB coalesceable ranges. Specifically,
REC captures more than 14 addresses before directory eviction for
applications with strong spatial locality.

Fig. 12 also illustrates the limited locality of certain workloads
where REC benefits less. In ATAX, PR, and ST, REC coalesces 3.9, 6.1,
and 5.8 addresses, respectively. This is because these applications
exhibit locality that is challenging to capture due to their irregular
memory access patterns, which span a wide range of addresses. To
delay the eviction of entries in irregular workloads, we design our
proposed coherence directory with an LRU-like replacement policy
(discussed in Section 4.2). Another interesting observation is that the
performance improvement of GEMV with REC is higher than the
improvement seen when eliminating unnecessary invalidations. Our
approach delays invalidations but still performs them when the
directories become full. During cache line replacement, the controller
prioritizes invalid cache lines before applying the LRU policy. As a
result, this delays the replacement of useful cache lines, thereby
improving cache efficiency.

L2 cache misses: The performance improvement of REC is largely
attributed to the reduction in cache misses caused by unnecessary
invalidations from frequent evictions in the coherence directory of
home GPUs. Fig. 13 shows the total number of L2 cache misses in the
baseline with double-sized directory, HMG, and REC relative to the
baseline. Cold misses are excluded from the results. We observe that
REC reduces L2 cache misses by 53.5%. In contrast, the baseline with
double-sized directory and HMG experience 1.79× and 1.40× more
cache misses than REC, since neither approach is sufficient
to delay evict-initiated cache line invalidations. The result is closely
related to the reduction in remote access latency, as the corresponding
misses are forwarded to the remote GPUs. Addressing the remote GPU
access bottleneck is performance-critical in multi-GPU systems.

Unnecessary invalidations: In the baseline, invalidation requests
propagated from frequent directory evictions in the home GPU lead to
higher chances of finding the corresponding cache lines still valid in the
sharer-side L2 caches. This results in premature invalidations of cache
lines that are actively in use, exacerbating the cache miss rate. In REC,
the invalidation requests generated by directory evictions have a lower
chance of invalidating still-valid cache lines. Fig. 14 shows that the number
of unnecessary invalidations performed in remote L2 caches (i.e., where
they are hits) is reduced by 84.4%. Since REC significantly delays evict-
initiated invalidation requests, many cache lines have already been
evicted from the caches by the time these requests are issued.

Inter-GPU transactions: The reduction in unnecessary invalidations
enhances the utilization of data within the sharer GPUs and minimizes
redundant accesses over inter-GPU links. Fig. 14 shows the
total number of inter-GPU transactions compared to the baseline. As
illustrated, REC reduces inter-GPU transactions by an average of 34.9%.
The reduced inter-GPU traffic directly contributes to the overall
performance improvement in multi-GPU systems.

Bandwidth impact: Fig. 15 shows the total inter-GPU bandwidth
costs of invalidation requests. As presented in Section 3.2, a large
fraction of invalidation requests are propagated due to frequent directory
evictions. Since REC delays invalidation requests from directory
evictions by allowing each entry to coalesce multiple tag addresses, the
bandwidth in most of the workloads becomes only a few gigabytes per
second.

Fig. 14. Total number of unnecessary invalidations (bars) and inter-GPU transactions
(plots) relative to the baseline.

Fig. 15. Total bandwidth consumption of invalidation requests.

Fig. 16. L2 cache lookup latency.

Fig. 17. Performance of REC under varying (a) coalescing address ranges and (b)
number of directory entries. Results are shown relative to the baseline with an 8K-
entry coherence directory.

Fig. 18. Performance comparison of REC using FIFO and LRU replacement policies.
Performance is normalized to the baseline coherence directory with FIFO policy.

Fig. 19. Performance impact of different L2 cache sizes in the baseline and REC.
Performance is normalized to the baseline with 2 MB L2 cache.

Cache lookup latency: Fig. 16 illustrates the average L2 cache lookup
latency of REC normalized to the baseline. The results show that the
lookup latency is reduced by 14.8% compared to the baseline. REC affects
the average lookup latency because evict-initiated invalidation requests are
propagated in bursts. However, since REC significantly delays directory
evictions by coalescing multiple tag addresses, the overall latency
decreases for most of the evaluated workloads.

5.3. Sensitivity analysis

Coalescing range: One important design decision in optimizing REC
is determining the range over which to coalesce when remote read
requests are received. As discussed in Section 4.1, a trade-off exists
between the range an entry coalesces and the number of bits required:
the larger the range, the more bits are needed to store the remote
GPU access information. Fig. 17(a) shows that the performance of REC
improves as the coalescing range increases, with performance gains
beginning to saturate at 1 kB. For our applications, a 1 kB range is
sufficient to capture the majority of memory access locality within the
workloads. Since coalescing beyond 4 kB incurs excessive overhead in
terms of bits required per entry (with 4 kB already requiring nearly 6×
the baseline size), the potential performance improvement may not be
substantial enough to offset the additional cost. Therefore, we choose a
1 kB range for our implementation.

Entry size: In our evaluation, we use a directory size of 8K entries
to match the baseline coherence directory. Fig. 17(b) shows the
performance of REC with varying entry counts, ranging from 2K to 32K. On
average, REC outperforms the baseline, even with reduced entry counts
compared to the baseline system with an 8K-entry coherence directory.
This is because the coverage of each coherence directory in REC
can increase by up to 16× when locality is fully utilized. Although
applications with limited locality show performance improvements as
the directory size increases, these gains are relatively modest when
considered against the additional hardware costs.

FIFO replacement: Fig. 18 presents the performance of REC with
a FIFO replacement policy. Our evaluation shows that the choice of
replacement policy has a relatively small impact on overall performance.
For workloads with regular and more predictable memory
access patterns, the FIFO replacement policy is already effective
in coalescing a sufficient number of addresses under the target ranges
(shown in Fig. 12). However, for some applications, such as ATAX,
PR, and ST, performance is lower with FIFO compared to REC due
to their limited locality patterns. These applications, therefore, benefit
from using an LRU-like replacement policy.

L2 cache size: The performance impact of different L2 cache sizes is
shown in Fig. 19. The results are normalized to the baseline with a
2 MB L2 cache. The benefits from increasing L2 cache capacity are
limited by the baseline coherence directory. In contrast, the performance
of REC improves as L2 cache size increases, demonstrating its
ability to leverage larger caches effectively. Another observation is that
the performance improvement with smaller L2 capacity is less significant
compared to larger L2 caches. This is because the coverage of the
baseline coherence directory relatively increases as the L2 cache size
decreases. To further explore the performance sensitivity to different
L2 cache sizes, we evaluate REC in systems with L2 cache sizes of
0.5 MB and 8 MB. We find that REC achieves an average performance
improvement of 6.3% and 26.7% compared to the baseline with 0.5 MB
and 8 MB L2 caches, respectively. Additionally, the relative improvement
of REC decreases as the L2 cache size grows, since larger caches reduce
the effectiveness of REC. Nevertheless, the results emphasize
the importance of the coherence protocol in improving cache efficiency.

Fig. 20. Performance impact of different inter-GPU bandwidth in the baseline and REC.
Performance is normalized to the baseline with 300 GB/s inter-GPU bandwidth.

Fig. 21. Performance of REC with different number of SAs normalized to the baseline
with 16 SAs.

Fig. 22. Performance comparison of REC and the baseline with equal storage cost
under different number of GPUs. Performance is normalized to the baseline with 8K
entries.

Inter-GPU bandwidth: The bandwidth of inter-GPU links is a critical
factor in scaling multi-GPU performance. Fig. 20 shows the performance
of the baseline and REC under different inter-GPU bandwidths,
relative to the 300 GB/s baseline. The results demonstrate that REC
outperforms the baseline, even in applications where performance begins
to saturate with increased bandwidth.

Number of SAs: We also evaluate REC with an increasing number
of SAs, as shown in Fig. 21. The performance improvement of REC
decreases compared to the system with 16 SAs, since the increased
number of SAs improves the thread-level parallelism of GPUs. However, a
system with a larger number of SAs also elevates the intensity of data
sharing and thus increases the frequency of coherence directory evictions.
As a result, REC still outperforms the baseline by 17.1%.

Number of GPUs: We evaluate REC in 8-GPU and 16-GPU systems,
as shown in Fig. 22. To ensure a fair comparison, we do not change
the workload sizes. The results show that REC provides performance
improvements of 24.7% and 14.7% over the baseline in 8-GPU and
16-GPU systems, respectively. We observe that the performance
improvement decreases as the number of GPUs increases. This is because,
with more GPUs, the application dataset is more distributed, and the
amount of data allocated to each GPU's memory decreases, resulting
in reduced pressure on each coherence directory for tracking shared
copies. Additionally, we compare REC with the baseline configured
with different directory sizes to match equal storage costs (discussed in
Section 4.3). We observe that REC achieves performance improvements
of 2.04× and 1.83× over the baseline with directory sizes increased by
3× and 5×, respectively. The results confirm that simply increasing
directory sizes is not an efficient approach, even in large-scale multi-GPU
systems.

5.4. REC with different GPU architecture

Fig. 23. Performance of REC in different GPU architecture.

We extend the evaluation of REC to a different GPU architecture
by adapting the simulation environment to a more recent
NVIDIA-styled GPU [27]. This involves increasing the computation
and memory resources compared to the AMD GPU setup.
Specifically, we change the GPU configuration to include 128 CUs, each
with a 128 kB L1V cache. The L2 cache size is increased to 72 MB,
with the cache line size adjusted to 128B. With the increased cache
line size, we configure the addressable range of REC to 2 kB, allowing
it to coalesce up to the same number of tag addresses. We also scale
the input sizes of the workloads to the extent that the simulations remain
feasible. The performance results, shown in Fig. 23, indicate that REC
achieves a 12.9% performance improvement over the baseline. This
indicates that our proposed REC also benefits an NVIDIA-like GPU
architecture.

5.5. Effectiveness of REC on DNN applications

Fig. 24. Performance of REC with DNN applications.

We evaluate the performance improvement of REC in training
two DNN models, VGG16 and ResNet18, using the Tiny-Imagenet-200
dataset [28]. As shown in Fig. 24, REC outperforms the baseline for
training VGG16 and ResNet18 by 5.6% and 8.9%, respectively. The
results imply that REC also benefits multi-GPU training of
DNN workloads. Additionally, GPUs have recently gained significant
attention for training large language models (LLMs). The computation
of LLM training comprises multiple decoder blocks, each primarily
consisting of series of matrix and vector operations [29]. In our evaluation,
we observe that REC improves multi-GPU performance by 20.2% and
20.4% on GEMM and GEMV workloads, respectively. Considering real-
world LLM training, the memory requirements can become significant
with large parameter counts, which can pressure memory systems and lead
to under-utilization of computation resources [29]. Since REC
improves cache efficiency in multi-GPU systems, we expect a higher
performance potential from REC in real-world LLM training.

6. Related work

Several prior works have proposed GPU memory consistency and
cache coherence mechanisms optimized for general-purpose domains
[13-15,19,30-32]. GPU-VI [19] reduces stalls at the cache controller
by employing write-through, write-no-allocate L1 caches and treating
loads to pending writes as misses. To maintain write atomicity,
GPU-VI adds transient states and state transitions and requires
invalidation acknowledgments before write completion. REC is implemented
based on the relaxed memory models commonly adopted in recent
GPU architectures, which do not require acknowledgments to be sent
or received over long-latency inter-GPU links. HMG [11] proposes a
lightweight directory protocol addressing up-to-date memory consistency
and coherence requirements. HMG integrates separate layers for
managing inter-GPM and inter-GPU level coherence, reducing network
traffic and complexity in deeply hierarchical multi-GPU systems. REC
primarily addresses the increased cache misses to remotely fetched data
caused by frequent invalidations. Additionally, REC can be extended
to support hierarchical multi-GPU systems as proposed by HMG without
significant hardware modifications.

Other efforts aim to design efficient cache coherence protocols for
other processor domains. Wang et al. [33] suggested a method to
efficiently support dynamic task parallelism on heterogeneous cache
coherent systems. Zuckerman et al. [34] proposed Cohmeleon, which
orchestrates coherence for accelerators in heterogeneous system-on-
chip designs. HieraGen [35] and HeteroGen [36] are automated tools
for generating hierarchical and heterogeneous cache coherence protocols,
respectively, for generic processor designs. Li et al. [37] proposed
methodologies to determine the minimum number of virtual networks

translation overheads, and [47]. Villa et al. [49] studied designing
trustworthy system-level simulation methodologies for single- and
multi-GPU systems. Lastly, NGS [50] enables multiple nodes in a data
center network to share the compute resources of GPUs on top of a
virtualization technique.

7. Conclusion

In this paper, we propose REC to improve the efficiency of cache
coherence in multi-GPU systems. Our analysis shows that the limited
capacity of coherence directories in fine-grained hardware protocols
frequently leads to evictions and unnecessary invalidations of shared
data. As a result, the increase in cache misses exacerbates NUMA
overhead, leading to significant performance degradation in multi-GPU
systems. To address this challenge, REC leverages memory access locality
to coalesce multiple tag addresses within common address ranges,
effectively increasing the coverage of coherence directories without
incurring significant hardware overhead. Additionally, REC maintains
write-initiated invalidations at a fine granularity to ensure precise and
flexible coherence across GPUs. Experiments show that REC reduces
L2 cache misses by 53.5% and improves overall system performance
by 32.7%.

CRediT authorship contribution statement

Gun Ko: Writing - original draft, Visualization, Validation, Software,
Resources, Methodology, Investigation, Formal analysis, Data
curation, Conceptualization. Jiwon Lee: Formal analysis, Conceptualization.
Hongju Kal: Validation, Conceptualization. Hyunwuk Lee:
Visualization, Validation. Won Woo Ro: Supervision, Project administration,
Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to
for cache coherence protocols that can avoid deadlocks. However, these influence the work reported in this paper.
studies do not address the challenges of redundant invalidations in the
cache coherence mechanisms of multi-GPU systems.
Acknowledgments
Significant research has addressed the NUMA effect in multi-GPU
systems by proposing efficient page placement and migration strate-
This work was supported by Institute of Information & communica-
gies [5,6,38], data transfer and replication methods [4,7,8,10,39,40],
tions Technology Planning & Evaluation (IITP) grant funded by the Ko-
and address translation schemes [4143]. In particular, several works
rea government (MSIT) (No. 2024-00402898, Simulation-based High-
have focused on improving the management of shared data within the
speed/High-Accuracy Data Center Workload/System Analysis Platform)
local memory hierarchy. NUMA-aware cache partitioning [3] dynami-
cally allocates cache space to accommodate data from both local and
remote memory by monitoring inter-GPU and local DRAM bandwidths. Data availability
The authors also extend software coherence with bulk invalidations
to L2 caches and evaluate the overhead associated with unnecessary The authors are unable or have chosen not to specify which data
invalidations. SAC [12] proposes reconfigurable last-level caches (LLC) has been used.
that can be utilized as either memory-side or SM-side, depending on
predicted application behavior in terms of effective LLC bandwidth.
References
SAC evaluates the performance of both software and hardware ex-
tensions for LLC coherence. In contrast, REC specifically targets the
[1] NVIDIA, NVIDIA DGX-2, 2018, https://www.nvidia.com/content/dam/en-
issue of unnecessary invalidations under hardware coherence, which zz/Solutions/Data-Center/dgx-2/dgx-2-print-datasheet-738070-nvidia-a4-web-
can undermine the efficiency of remote data caching. It introduces uk.pdf.
a new directory structure, carefully examining the trade-off between [2] NVIDIA, NVIDIA DGX A100 system architecture, 2020, https://download.
performance and storage overhead. boston.co.uk/downloads/3/8/6/386750a7-52cd-4872-95e4-7196ab92b51c/
DGX%20A100%20System%20Architecture%20Whitepaper.pdf.
Recent studies on multi-GPU and multi-node GPU systems also ad-
[3] U. Milic, O. Villa, E. Bolotin, A. Arunkumar, E. Ebrahimi, A. Jaleel, A. Ramirez,
dress challenges in various domains. Researchers proposed methods to
D. Nellans, Beyond the socket: NUMA-aware GPUs, in: Proceedings of IEEE/ACM
accelerate deep learning applications [44], graph neural networks [45], International Symposium on Microarchitecture, 2017, pp. 123135.
and graphics rendering applications [46] in multi-GPU systems. Na [4] V. Young, A. Jaleel, E. Bolotin, E. Ebrahimi, D. Nellans, O. Villa, Combining
et al. [47] addressed security challenges in inter-GPU communications HW/SW mechanisms to improve NUMA performance of multi-GPU systems, in:
under unified virtual memory framework. Barre Chord [48] leverages Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2018,
page allocation schemes in multi-chip-module GPUs to reduce address pp. 339351.
11
G. Ko et al. Journal of Systems Architecture 160 (2025) 103339
[5] T. Baruah, Y. Sun, A.T. Dinçer, S.A. Mojumder, J.L. Abellán, Y. Ukidave, A. Joshi, N. Rubin, J. Kim, D. Kaeli, Griffin: Hardware-software support for efficient page migration in multi-GPU systems, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2020, pp. 596-609.
[6] M. Khairy, V. Nikiforov, D. Nellans, T.G. Rogers, Locality-centric data and threadblock management for massive GPUs, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2020, pp. 1022-1036.
[7] H. Muthukrishnan, D. Lustig, D. Nellans, T. Wenisch, GPS: A global publish-subscribe model for multi-GPU memory management, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 46-58.
[8] L. Belayneh, H. Ye, K.-Y. Chen, D. Blaauw, T. Mudge, R. Dreslinski, N. Talati, Locality-aware optimizations for improving remote memory latency in multi-GPU systems, in: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2022, pp. 304-316.
[9] S.B. Dutta, H. Naghibijouybari, A. Gupta, N. Abu-Ghazaleh, A. Marquez, K. Barker, Spy in the GPU-box: Covert and side channel attacks on multi-GPU systems, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2023, pp. 633-645.
[10] H. Muthukrishnan, D. Lustig, O. Villa, T. Wenisch, D. Nellans, FinePack: Transparently improving the efficiency of fine-grained transfers in multi-GPU systems, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2023, pp. 516-529.
[11] X. Ren, D. Lustig, E. Bolotin, A. Jaleel, O. Villa, D. Nellans, HMG: Extending cache coherence protocols across modern hierarchical multi-GPU systems, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2020, pp. 582-595.
[12] S. Zhang, M. Naderan-Tahan, M. Jahre, L. Eeckhout, SAC: Sharing-aware caching in multi-chip GPUs, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2023, pp. 605-617.
[13] B.A. Hechtman, S. Che, D.R. Hower, Y. Tian, B.M. Beckmann, M.D. Hill, S.K. Reinhardt, D.A. Wood, QuickRelease: A throughput-oriented approach to release consistency on GPUs, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2014, pp. 189-200.
[14] M.D. Sinclair, J. Alsop, S.V. Adve, Efficient GPU synchronization without scopes: Saying no to complex consistency models, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2015, pp. 647-659.
[15] J. Alsop, M.S. Orr, B.M. Beckmann, D.A. Wood, Lazy release consistency for GPUs, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2016, pp. 1-13.
[16] NVIDIA, NVIDIA TESLA V100 GPU architecture, 2017, https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
[17] NVIDIA, NVIDIA A100 tensor core GPU architecture, 2020, https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf.
[18] NVIDIA, NVIDIA NVLink high-speed GPU interconnect, 2024, https://www.nvidia.com/en-us/design-visualization/nvlink-bridges/.
[19] I. Singh, A. Shriraman, W.W.L. Fung, M. O'Connor, T.M. Aamodt, Cache coherence for GPU architectures, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2013, pp. 578-590.
[20] Y. Sun, T. Baruah, S.A. Mojumder, S. Dong, X. Gong, S. Treadway, Y. Bao, S. Hance, C. McCardwell, V. Zhao, H. Barclay, A.K. Ziabari, Z. Chen, R. Ubal, J.L. Abellán, J. Kim, A. Joshi, D. Kaeli, MGPUSim: Enabling multi-GPU performance modeling and optimization, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2019, pp. 197-209.
[21] T. Yuki, L.-N. Pouchet, Polybench 4.0, 2015.
[22] Y. Sun, X. Gong, A.K. Ziabari, L. Yu, X. Li, S. Mukherjee, C. Mccardwell, A. Villegas, D. Kaeli, Hetero-mark, a benchmark suite for CPU-GPU collaborative computing, in: Proceedings of IEEE International Symposium on Workload Characterization, 2016, pp. 1-10.
[23] AMD, AMD app SDK OpenCL optimization guide, 2015.
[24] A. Danalis, G. Marin, C. McCurdy, J.S. Meredith, P.C. Roth, K. Spafford, V. Tipparaju, J.S. Vetter, The Scalable Heterogeneous Computing (SHOC) benchmark suite, in: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010, pp. 63-74.
[25] R. Balasubramonian, A.B. Kahng, N. Muralimanohar, A. Shafiee, V. Srinivas, CACTI 7: New tools for interconnect exploration in innovative off-chip memories, ACM Trans. Archit. Code Optim. 14 (2) (2017) 14:1-25.
[26] NVIDIA, NVIDIA DGX-1 with Tesla V100 system architecture, 2017, pp. 1-43.
[27] NVIDIA, NVIDIA ADA GPU architecture, 2023, https://images.nvidia.com/aem-dam/Solutions/Data-Center/l4/nvidia-ada-gpu-architecture-whitepaper-v2.1.pdf.
[28] Y. Le, X. Yang, Tiny ImageNet visual recognition challenge, 2015, http://vision.stanford.edu/teaching/cs231n/reports/2015/pdfs/yle_project.pdf.
[29] G. Heo, S. Lee, J. Cho, H. Choi, S. Lee, H. Ham, G. Kim, D. Mahajan, J. Park, NeuPIMs: NPU-PIM heterogeneous acceleration for batched LLM inferencing, in: Proceedings of ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024, pp. 722-737.
[30] K. Koukos, A. Ros, E. Hagersten, S. Kaxiras, Building heterogeneous Unified Virtual Memories (UVMs) without the overhead, ACM Trans. Archit. Code Optim. 13 (1) (2016).
[31] X. Ren, M. Lis, Efficient sequential consistency in GPUs via relativistic cache coherence, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2017, pp. 625-636.
[32] S. Puthoor, M.H. Lipasti, Turn-based spatiotemporal coherence for GPUs, ACM Trans. Archit. Code Optim. 20 (3) (2023).
[33] M. Wang, T. Ta, L. Cheng, C. Batten, Efficiently supporting dynamic task parallelism on heterogeneous cache-coherent systems, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2020, pp. 173-186.
[34] J. Zuckerman, D. Giri, J. Kwon, P. Mantovani, L.P. Carloni, Cohmeleon: Learning-based orchestration of accelerator coherence in heterogeneous SoCs, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 350-365.
[35] N. Oswald, V. Nagarajan, D.J. Sorin, HieraGen: Automated generation of concurrent, hierarchical cache coherence protocols, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2020, pp. 888-899.
[36] N. Oswald, V. Nagarajan, D.J. Sorin, V. Gavrielatos, T. Olausson, R. Carr, HeteroGen: Automatic synthesis of heterogeneous cache coherence protocols, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2022, pp. 756-771.
[37] W. Li, N. Oswald, V. Nagarajan, D.J. Sorin, Determining the minimum number of virtual networks for different coherence protocols, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2024, pp. 182-197.
[38] Y. Wang, B. Li, A. Jaleel, J. Yang, X. Tang, GRIT: Enhancing multi-GPU performance with fine-grained dynamic page placement, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2024, pp. 1080-1094.
[39] M.K. Tavana, Y. Sun, N.B. Agostini, D. Kaeli, Exploiting adaptive data compression to improve performance and energy-efficiency of compute workloads in multi-GPU systems, in: Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2019, pp. 664-674.
[40] H. Muthukrishnan, D. Nellans, D. Lustig, J.A. Fessler, T.F. Wenisch, Efficient multi-GPU shared memory via automatic optimization of fine-grained transfers, in: Proceedings of the ACM/IEEE International Symposium on Computer Architecture, 2021, pp. 139-152.
[41] B. Li, J. Yin, Y. Zhang, X. Tang, Improving address translation in multi-GPUs via sharing and spilling aware TLB design, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 1154-1168.
[42] B. Li, J. Yin, A. Holey, Y. Zhang, J. Yang, X. Tang, Trans-FW: Short circuiting page table walk in multi-GPU systems via remote forwarding, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2023, pp. 456-470.
[43] B. Li, Y. Guo, Y. Wang, A. Jaleel, J. Yang, X. Tang, IDYLL: Enhancing page translation in multi-GPUs via light weight PTE invalidations, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2023, pp. 1163-1177.
[44] E. Choukse, M.B. Sullivan, M. O'Connor, M. Erez, J. Pool, D. Nellans, Buddy compression: Enabling larger memory for deep learning and HPC workloads on GPUs, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2020, pp. 926-939.
[45] Y. Tan, Z. Bai, D. Liu, Z. Zeng, Y. Gan, A. Ren, X. Chen, K. Zhong, BGS: Accelerate GNN training on multiple GPUs, J. Syst. Archit. 153 (2024) 103162.
[46] X. Ren, M. Lis, CHOPIN: Scalable graphics rendering in multi-GPU systems via parallel image composition, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2021, pp. 709-722.
[47] S. Na, J. Kim, S. Lee, J. Huh, Supporting secure multi-GPU computing with dynamic and batched metadata management, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2024, pp. 204-217.
[48] Y. Feng, S. Na, H. Kim, H. Jeon, Barre chord: Efficient virtual memory translation for multi-chip-module GPUs, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2024, pp. 834-847.
[49] O. Villa, D. Lustig, Z. Yan, E. Bolotin, Y. Fu, N. Chatterjee, Need for speed: Experiences building a trustworthy system-level GPU simulator, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2021, pp. 868-880.
[50] J. Prades, C. Reaño, F. Silla, NGS: A network GPGPU system for orchestrating remote and virtual accelerators, J. Syst. Archit. 151 (2024) 103138.
Gun Ko received the B.S. degree in electrical engineering from Pennsylvania State University in 2017. He is currently pursuing the Ph.D. degree with the Embedded Systems and Computer Architecture Laboratory, School of Electrical and Electronic Engineering, Yonsei University, Seoul, South Korea. His current research interests include GPU memory systems, multi-GPU systems, and virtual memory.

Jiwon Lee received the B.S. and Ph.D. degrees in electrical and electronic engineering from Yonsei University, Seoul, South Korea, in 2018 and 2024, respectively. He currently works in the memory division at Samsung Electronics. His research interests include virtual memory, GPU memory systems, and storage systems.

Hongju Kal received the B.S. degree from Seoul National University of Science and Technology and the Ph.D. degree from the School of Electrical and Electronic Engineering, Yonsei University, Seoul, South Korea, in 2018 and 2024, respectively. He currently works in the memory division at Samsung Electronics. His current research interests include memory architectures, memory hierarchies, near memory processing, and neural network accelerators.

Hyunwuk Lee received his B.S. and Ph.D. degrees in electrical and electronic engineering from Yonsei University, Seoul, Korea, in 2018 and 2024, respectively. He currently works in the memory division at Samsung Electronics. His research interests include neural network accelerators and GPU systems.

Won Woo Ro received the B.S. degree in electrical engineering from Yonsei University, Seoul, South Korea, in 1996, and the M.S. and Ph.D. degrees in electrical engineering from the University of Southern California, in 1999 and 2004, respectively. He worked as a Research Scientist with the Electrical Engineering and Computer Science Department, University of California, Irvine. He currently works as a Professor with the School of Electrical and Electronic Engineering, Yonsei University. Prior to joining Yonsei University, he worked as an Assistant Professor with the Department of Electrical and Computer Engineering, California State University, Northridge. His industry experience includes a college internship with Apple Computer, Inc., and a contract software engineer with ARM, Inc. His current research interests include high-performance microprocessor design, GPU microarchitectures, neural network accelerators, and memory hierarchy design.