Journal of Systems Architecture 160 (2025) 103339
REC: Enhancing fine-grained cache coherence protocol in multi-GPU systems
Gun Ko, Jiwon Lee, Hongju Kal, Hyunwuk Lee, Won Woo Ro
Yonsei University, 50 Yonsei-ro Seodaemun-gu, Seoul, 03722, Republic of Korea
ARTICLE INFO

Keywords: Multi-GPU, Data sharing, Cache coherence, Cache architecture

ABSTRACT

With the increasing demands of modern workloads, multi-GPU systems have emerged as a scalable solution, extending performance beyond the capabilities of single GPUs. However, these systems face significant challenges in managing memory across multiple GPUs, particularly due to the Non-Uniform Memory Access (NUMA) effect, which introduces latency penalties when accessing remote memory. To mitigate NUMA overheads, GPUs typically cache remote memory accesses across multiple levels of the cache hierarchy, which are kept coherent using cache coherence protocols. The traditional GPU bulk-synchronous programming (BSP) model relies on coarse-grained invalidations and cache flushes at kernel boundaries, which are insufficient for the fine-grained communication patterns required by emerging applications. In multi-GPU systems, where NUMA is a major bottleneck, substantial data movement resulting from the bulk cache invalidations exacerbates performance overheads. Recent cache coherence protocols for multi-GPUs enable flexible data sharing through coherence directories that track shared data at a fine-grained level across GPUs. However, these directories are limited in capacity, leading to frequent evictions and unnecessary invalidations, which increase cache misses and degrade performance. To address these challenges, we propose REC, a low-cost architectural solution that enhances the effective tracking capacity of coherence directories by leveraging memory access locality. REC coalesces multiple tag addresses from remote read requests within common address ranges, reducing directory storage overhead while maintaining fine-grained coherence for writes. Our evaluation on a 4-GPU system shows that REC reduces L2 cache misses by 53.5% and improves overall system performance by 32.7% across a variety of GPU workloads.
1. Introduction

Multi-GPU systems have emerged to meet the growing demands of modern workloads, offering scalable performance beyond what a single GPU can deliver. However, as multi-GPU architectures scale in size and complexity [1,2], managing memory across multiple GPUs becomes increasingly challenging [3–7]. One of the primary challenges arises from the bandwidth discrepancy between local and remote memory, commonly known as the Non-Uniform Memory Access (NUMA) effect [3,4]. To mitigate the NUMA penalty, GPUs generally rely on caching remote memory accesses, allowing them to be served with local bandwidth [5,8–10]. This caching strategy is often extended across multiple levels of the cache hierarchy, including both private on-chip caches and shared caches [3,4,11,12], to better accommodate the diverse access patterns of emerging workloads.

While remote data caching offers significant performance benefits in multi-GPU systems, it also requires extending coherence throughout the cache hierarchy. Conventional GPUs rely on a simple software-inserted bulk-synchronous programming (BSP) model [11], which performs cache invalidation and flush operations at the start and end of each kernel. However, as recent GPU applications increasingly require more frequent and fine-grained communication both within and across kernels [11,13–15], these frequent synchronizations can lead to substantial cache operation and data movement overheads. Additionally, precisely managing the synchronizations places additional burdens on programmers, complicating the optimization of multi-GPU systems.

Ren et al. [11] proposed HMG, a hierarchical cache coherence protocol designed for L2 caches in large-scale multi-GPU systems. HMG employs coherence directories to record cache line addresses and their associated sharers upon receiving remote read requests. Any writes to these addresses trigger invalidations. Once capacity is reached, existing entries are evicted from the directory, triggering invalidation requests to the sharer GPUs. These invalidations are unnecessary, as the corresponding cache lines do not immediately require coherence to be maintained. When GPUs access data across a wide range of addresses, frequent directory insertions lead to a number of unnecessary invalidations for cache lines that have not yet been fully utilized. Subsequent accesses to these cache lines result in cache misses, requiring data to be fetched again over bandwidth-limited inter-GPU links.
Corresponding author.
E-mail address: wro@yonsei.ac.kr (W.W. Ro).
https://doi.org/10.1016/j.sysarc.2025.103339
Received 10 September 2024; Received in revised form 27 December 2024; Accepted 5 January 2025
Available online 9 January 2025
1383-7621/© 2025 Published by Elsevier B.V.
Fig. 1. Performance of each caching scheme normalized to a system that enables remote data caching in both L1 and L2 caches using software and hardware coherence protocols, respectively. No caching refers to a system that disables remote data caching, simplifying coherence.

Fig. 2. Baseline multi-GPU system. Each GPU has a coherence directory that records and tracks the status of shared data at given addresses along with the corresponding sharer IDs.
To evaluate the implications of the coherence protocol, we measure the performance impact of unnecessary invalidations on a 4-GPU system that caches remote data in both L1 and L2 caches. L1 caches are assumed to be software-managed, while L2 caches are managed under fine-grained invalidation through coherence directories. As Fig. 1 shows, there exists a significant performance opportunity in eliminating unnecessary invalidations caused by frequent directory evictions. Increasing the size of the coherence directory can delay evictions and the corresponding invalidation requests, but at the cost of increased hardware. Our observations indicate that to eliminate unnecessary invalidations, the size of the coherence directory would need to be substantially increased, accounting for 30.4% of the L2 cache size. As the size of GPU L2 caches continues to grow [16,17], the aggregate storage overhead of coherence directories becomes substantial, causing inefficiency in scaling for multi-GPU environments (discussed in Section 3.3).

In this paper, we propose Range-based Directory Entry Coalescing (REC), an architectural solution that mitigates unnecessary invalidation overhead by increasing the effective tracking capacity of the coherence directory without incurring significant hardware costs. Our key insight is that since directory updates are performed upon receiving remote read requests, leveraging memory access locality provides an opportunity to coalesce multiple tag addresses of shared data based on their common address range. To achieve this, we employ a coherence directory design that aggregates data from incoming remote reads sharing a common base address within the same address range, storing only the offset and the sharer IDs. We reduce the storage requirements of directory entries by designing them in a base-and-offset format, recording the common high-order bits of addresses and using a bit-vector to indicate the index of each coalesced entry within the target range. For incoming writes, if they are found in the coherence directory, invalidations are propagated only to the corresponding address, maintaining fine-grained coherence in multi-GPU systems.

To summarize, this paper makes the following contributions:

• We identify a performance bottleneck of fine-grained shared data tracking mechanisms in multi-GPU systems. Our analysis demonstrates that such methods generate unnecessary invalidations at coherence directory evictions, which incur a significant performance bottleneck due to increased cache miss rates.

• We show that simply employing larger coherence directories incurs significant storage overhead. Our analysis shows that the baseline multi-GPU system requires a 12× increase in the directories to eliminate redundant invalidations.

• We propose REC, which increases the effective coverage of the coherence directory by enabling each entry to coalesce and track multiple memory addresses along with the associated sharers. By reducing L2 cache misses by 53.5%, REC improves overall performance by 32.7% on average across our evaluated GPU workloads.

2. Background

2.1. Multi-GPU architecture

The slowdown of transistor scaling has made it increasingly difficult for single GPUs to meet the growing demands of modern workloads. Alternatively, multi-GPU systems have emerged as a viable path forward, offering enhanced performance and memory capacity by leveraging multiple GPUs connected using high-bandwidth interconnects such as PCIe and NVLink [18]. However, these inter-GPU links are likely to have bandwidth that falls far behind the local memory bandwidth [3,4,8]. The NUMA effect that arises from this large bandwidth gap can significantly impact multi-GPU performance, making it crucial to optimize remote access bottlenecks to maximize efficiency.

Fig. 2 illustrates the architectural details of our target multi-GPU system. Each GPU is divided into several SAs, each comprising a number of CUs. Every CU has its own private L1 vector cache (L1V$), while the L1 scalar cache (L1S$) and L1 instruction cache (L1I$) are shared across all CUs within an SA. Additionally, each GPU contains a larger L2 cache that is shared across all SAs. When a data access misses in the local cache hierarchy, it is forwarded to either local or remote GPU memory, depending on the data location. For local memory accesses, the cache lines are stored in both the shared L2 cache and the L1 cache private to the requesting CU. In the case of remote-GPU memory accesses, the data can be cached either only in the L1 cache of the requesting CU [4,5,8] or in both the L2 and L1 caches [3,11,12]. Caching data from remote memory nodes helps mitigate the performance degradation caused by accessing those nodes.

2.2. Remote data caching in multi-GPU

While caching remote data only in the L1 cache can save L2 cache capacity, it limits the sharing of remote data among CUs. As a result, such an approach provides lower performance gain when unnecessary invalidation overhead is eliminated in its counterpart, as shown in Fig. 1. For this reason, in this study, we assume the baseline multi-GPU architecture allows caching of remote data in both L1 and L2 caches.

A step-by-step process of remote data caching is shown in Fig. 2. Upon generating a memory request, an L1 cache lookup is performed by the requesting CU (1). When the data is not present in the L1, an L2 cache lookup is generated to check if the remote data is cached in the L2 (2). If the data is found in the L2 cache, it is returned to the requesting CU and cached in its local L1 cache. If the data is not found in the L2 cache, the request is forwarded to the remote GPU memory at the given physical address. Subsequently, the requested data is returned at a cache line granularity and cached in both the L1 and L2 caches (3). At the same time, the coherence directory, which maintains information about data locations across multiple GPUs, is
Fig. 3. Coherence protocol flows in detail. The baseline hardware protocol has two stable states: valid and invalid, with no transient states or acknowledgments required for write permissions.

Fig. 4. L2 cache miss rates in baseline and idealized system where no invalidations are propagated by coherence directory evictions. Cold misses are excluded from the results.
updated with the corresponding entry and the sharer GPU (4). Writes to remote data in the home GPU are also performed in the local L2 cache, following the write-through policy, as the corresponding GPU may access the written data in the future. Remote writes arriving at the home GPU trigger invalidation messages to be sent out to the sharer GPU(s), and the requesting GPU is recorded as a sharer (4).

2.3. Cache coherence in multi-GPU

Existing hardware protocols, such as GPU-VI [19], employ coherence directories to track sharers (i.e., L1s) and propagate write-initiated cache invalidations within a single GPU. Bringing the notion into multi-GPU environments, Ren et al. proposed HMG [11], a hierarchical design that efficiently manages both intra- and inter-GPU coherence. HMG includes two layers for selecting home nodes to track sharers: (1) the intra-GPU module (GPM) level, which selects a home GPM within a GPU, and (2) the inter-GPU level, which selects a home GPU across the entire system. A GPM is a chiplet in multi-chip module GPUs. With this, HMG reduces the complexity of tracking and maintaining coherence across a large number of sharers. HMG also optimizes performance by eliminating all transient states and most invalidation acknowledgments, leveraging the weak memory models in modern GPUs [11].

Each GPU has a coherence directory attached to its L2 cache, managed by the cache controllers. The directory is organized in a set-associative structure, and each entry contains the following fields: tag, sharer IDs, and coherence state. The tag field stores the cache line address for the data copied and fetched by the sharer. The sharer ID field is a bit-vector representing the list of sharers, excluding the home GPU. Each entry is in one of two stable states: valid or invalid. Unlike HMG [11], the baseline coherence directory tracks one cache line per entry. In contrast, a directory entry in HMG is designed to track four cache lines using a single tag address and sharer ID field, which limits its ability to manage each cache line at a fine granularity. Consequently, a write to any address tracked by a directory entry may unnecessarily invalidate other cache lines within the same range, potentially causing inefficiencies in remote data caching. We discuss the importance of reducing unnecessary cache line invalidations in detail in Section 3.1. As in typical memory allocation in multi-GPU systems, the physical address space is partitioned among the GPUs in the system. Therefore, data at any given physical address is designated to one GPU (i.e., the home GPU), and every access by a remote GPU references the coherence directory of the home GPU. For example, in Fig. 2, GPU0 requests data at address 0xA from GPU1, which is the home GPU; the corresponding entry is then inserted into the directory of GPU1 with the relevant information.

Fig. 3 shows the detailed state transitions and actions initiated by the coherence directory. Note that local and remote refer to the sources of the memory requests received: local refers to accesses from the local CUs, and remote refers to accesses from remote GPUs.

Local reads: Local read requests arriving at the L2 cache are directed to either locally- or remotely-mapped data. On cache hits, the data is returned and guaranteed to be consistent because it is either the most up-to-date data (if mapped to local DRAM) or correctly managed by the protocol (if mapped to a remote GPU). On cache misses, the requests are forwarded to either local DRAM or a remote GPU. In all cases, the directory of the requesting GPU remains unchanged.

Remote reads: For remote reads that arrive at the home GPU, the coherence directory records the ID of the requesting GPU at the given cache line address. If the line is already being tracked (i.e., the entry is found and valid), the directory simply adds the requester to the sharer field and keeps the entry in the valid state. If the line is not being tracked, the directory finds an empty spot to allocate a new entry and marks it as valid. When the directory is full and every entry is valid, it evicts an existing entry and replaces it with the new entry (discussed below).

Local writes: Local writes to data mapped to the home GPU memory look up the directory to find whether a matching entry at the line address exists. If found, invalidations are propagated to the recorded sharers in the background, and the directory entry becomes invalid.

Remote writes: By default, L2 caches use a write-back policy for local writes. As described in Section 2.2, remote writes update both the L2 cache of the requester and local memory, similar to a write-through policy. Consequently, the directory maintains the entry as valid by adding the requester to the sharer list and sends out invalidations to the other sharers recorded in the original entry.

Directory entry eviction/replacement: Coherence directories are implemented in a set-associative structure. Thus, capacity and conflict misses occur as directory lookups are initiated by the read requests continuously received from remote GPUs. To notify the sharers that the information in the evicted entry is no longer traceable, invalidations are sent out as with writes.

Acquire and release: At the start of a kernel, invalidations are performed in L1 caches, as coherence there is maintained using software bulk synchronizations. However, the invalidations are not propagated beyond the L1 caches, as L2 caches are kept coherent with the fine-grained directory protocol. Release operations flush dirty data in both L1 and L2 caches.

3. Motivation

In multi-GPU systems, coherence is managed explicitly through cache invalidations to ensure data consistency across multiple GPUs. When invalidation requests are received, sharer GPUs must look up and invalidate the corresponding cache lines. Subsequent accesses to these invalidated cache lines result in cache misses, which are then forwarded to the home GPU. This, in turn, can negate the performance benefits of local caching, as it undermines the effectiveness of caching mechanisms intended to reduce remote access bottlenecks. In this section, we analyze the behavior of cache invalidation and its impact on the overall
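The per-request directory actions above can be sketched as a small behavioral model (a minimal Python sketch, not the authors' implementation; the flat FIFO-ordered table stands in for the set-associative structure, and the entry layout follows the field description in this section):

```python
from collections import OrderedDict

class CoherenceDirectory:
    """Minimal model of the baseline per-GPU coherence directory.

    Tracks one cache line per entry: {line_address: set_of_sharer_ids}.
    A real directory is set-associative; a single FIFO-ordered table
    is used here for brevity.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # line address -> sharer GPU IDs
        self.invalidations = []        # log of (reason, address, sharers)

    def remote_read(self, addr, requester):
        """A remote GPU fetched this line: record it as a sharer."""
        if addr in self.entries:       # already tracked: just add the sharer
            self.entries[addr].add(requester)
            return
        if len(self.entries) >= self.capacity:
            # Directory full: evict the oldest entry and invalidate its
            # sharers' copies, even though the data may still be useful.
            old_addr, sharers = self.entries.popitem(last=False)
            self.invalidations.append(("evict", old_addr, sharers))
        self.entries[addr] = {requester}

    def local_write(self, addr):
        """A local write: invalidate all remote copies of the line."""
        if addr in self.entries:
            sharers = self.entries.pop(addr)
            self.invalidations.append(("write", addr, sharers))

    def remote_write(self, addr, requester):
        """A remote write-through: invalidate the other sharers but keep
        the entry valid with the writer recorded as a sharer."""
        others = self.entries.get(addr, set()) - {requester}
        if others:
            self.invalidations.append(("write", addr, others))
        self.entries[addr] = {requester}
```

With a 2-entry directory, three remote reads to distinct lines already force an evict-initiated invalidation of the first line even though it was never written; Section 3 quantifies how often this happens in practice.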
Fig. 5. Fraction of evict-initiated and write-initiated invalidations in the baseline multi-GPU system. The results are based on invalidation requests that hit in the sharer-side L2 caches.

Fig. 6. Performance impact of increasing coherence directory sizes. To eliminate unnecessary invalidations, GPUs require a directory size up to 12× larger than the baseline.
performance of multi-GPU systems. We identify the sources of invalidation and explore a straightforward solution to mitigate the associated bottlenecks. Our experiments are conducted using MGPUSim [20], a multi-GPU simulation framework that we have extended to support the hardware cache coherence protocol. The detailed configuration is provided in Table 2.

3.1. Impact of cache invalidation

To ensure data consistency across multiple GPUs, invalidation requests are propagated by the home GPU in two cases: (1) when write requests are received and (2) when an entry is evicted from the coherence directory due to capacity and conflict misses. Invalidation requests triggered by writes are crucial for maintaining data consistency, as they ensure that no stale data is accessed in the sharer GPU caches. On the other hand, invalidations generated by directory evictions aim to notify the sharers that the coherence information is no longer traceable, even if the data is still valid. A detailed background on the protocol flows with invalidations is given in Section 2.3.

Broadcasting invalidations does not significantly impact cache efficiency if the cache lines are already evicted or no longer in use. However, when applications exhibit frequent remote memory accesses, the generation of new directory entries increases invalidation requests from evictions, invalidating the associated cache lines prematurely. These premature invalidations lead to higher cache miss rates, as subsequent accesses to the invalidated cache lines result in misses. As remote data misses exacerbate NUMA overheads, they need to be reduced to improve multi-GPU performance.

Fig. 4 shows the impact on cache miss rates of eliminating unnecessary invalidations across the benchmarks listed in Table 3 running on a 4-GPU system. The figure demonstrates that the baseline system experiences a cache miss rate more than double (average 2.4×) that of the idealized system without the unnecessary invalidations. This increase is mainly due to frequent invalidation requests, which prematurely invalidate cache lines before they can be fully utilized, leading to an increase in the number of remote memory accesses. The result strongly motivates us to further study the source of these frequent invalidations to improve the efficiency of remote data caching in multi-GPU systems.

To demonstrate the performance opportunity, Fig. 1 presents a study showing the performance of idealized caching without the invalidation overhead. With no invalidations to unmodified cache lines, remote data can be fully utilized as needed until it is naturally replaced by the typical cache replacement policy. The performance of the baseline and ideal systems is represented in the first and fourth bars, respectively, in Fig. 1. The result shows that an ideal system with no unnecessary cache invalidation overheads outperforms the baseline by up to 2.79× (average 36.9%). As demonstrated by Figs. 1 and 4, reducing premature cache invalidations is crucial to improving the efficiency of remote data caching in multi-GPU systems.

3.2. Source of premature invalidation

As described in Section 2.3, when a coherence directory becomes full, the GPU needs to evict an old entry and replace it with a new one upon receiving a remote read request; an invalidation request must be sent out to the sharer(s) in the evicted entry. Fig. 5 shows the distribution of invalidations triggered by directory evictions and write requests, referred to as evict-initiated and write-initiated invalidations, respectively. The measurements are taken based on the invalidations that hit in the sharer-side L2 caches after receiving the requests. We observe that a significant fraction of invalidations (average 79.5%) is performed by the requests from directory evictions in the home GPUs. These invalidations, considered unnecessary as they do not require immediate action, should be delayed until remote GPUs have made full use of the data.

We also show the percentage of write-initiated invalidations in Fig. 5. One can observe that applications such as FIR, LU, and MM2 experience a significant number of invalidations due to write requests. These workloads exhibit fine-grained communication within and across dependent kernels, necessitating the invalidation of corresponding cache lines in the remote L2 cache upon any modification to the shared data. Although these applications exhibit a high percentage of write-initiated invalidations, their impact on cache miss rates may be negligible if the GPUs do not subsequently require access to the invalidated cache lines. Nonetheless, the results in Fig. 4 clearly demonstrate the importance of minimizing unnecessary cache invalidations.

So far, we have discussed how prematurely invalidating remote data leads to increased cache miss rates, which negatively impacts multi-GPU performance. We also show that a large fraction of invalidation requests stems from directory evictions, which frequently occur due to the high volume of remote accesses. These accesses trigger numerous directory updates, overwhelming the baseline coherence directory's capacity to effectively manage coherence. A straightforward solution to mitigate premature invalidations is to increase the size of the coherence directory, providing more coverage to track sharers and reducing eviction rates. In the following section, we analyze the performance impact of larger coherence directory sizes. It is important to note that this paper primarily focuses on delaying invalidations caused by directory evictions, as write-initiated invalidations are necessary and must be performed immediately for correctness.

3.3. Increasing directory sizes

A simple approach to delaying directory evictions, and thereby minimizing premature invalidations, is to increase the size of coherence directories. Limited directory sizes lead to significant evict-initiated invalidations, which can undermine the performance benefits of local caching. To quantify the benefits of larger directories, we conduct a quantitative analysis of performance improvements with increasing directory sizes. In our simulated 4-GPU system, each GPU has an L2 cache size of 2 MB, with each cache line being 64B. Each coherence directory tracks
Fig. 7. Average performance improvement per increased directory storage in the
baseline coherence directory design. The results are normalized to the system with
8K-entry coherence directory.
the identity of all sharers excluding the home GPU (i.e., three GPUs).
To cover the entire L2 cache space for three GPUs, an ideal coherence
directory would require approximately 96K entries, or about 12× the
baseline 8K entries.
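The sizing above follows directly from the system parameters (a quick check using the values from Table 2 and this section):

```python
L2_BYTES = 2 * 1024 * 1024     # 2 MB L2 cache per GPU
LINE_BYTES = 64                # 64B cache lines
REMOTE_GPUS = 3                # sharers per home GPU (4 GPUs minus the home)
BASELINE_ENTRIES = 8 * 1024    # baseline 8K-entry directory

lines_per_l2 = L2_BYTES // LINE_BYTES        # 32K lines per L2 cache
ideal_entries = lines_per_l2 * REMOTE_GPUS   # one entry per remote-cached line
print(ideal_entries // 1024, "K entries")    # 96 K entries
print(ideal_entries // BASELINE_ENTRIES, "x the baseline")  # 12 x the baseline
```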
Fig. 6 illustrates the normalized performance for increasing the
Fig. 8. A high-level overview of (a) baseline and (b) proposed REC architecture with
directory sizes by 2×-12× the baseline. With an ideal directory size,
simplified 2-entry coherence directories. The figure illustrates a scenario where GPU1
unnecessary invalidations from directory evictions can be eliminated, accesses memory of GPU0 in order of 0 × 1000, 0 × 1040, 0 × 1080, and 0 × 1000
leaving only write-initiated invalidations. The results show that ap- by each CU. In the baseline directory, entry that tracks status of data at 0 × 1000
plications exhibit significant performance gains as the directory size is evicted for recording the address 0 × 1080. The proposed directory coalesces three
increases, with some benchmarks (e.g., ATAX, PR, and ST) requiring addresses with same base address into one entry.
8×-12× the baseline size to achieve the highest speed-up. Specifically,
benchmarks such as PR and ST show irregular memory access patterns
that span a wide address range, leading to higher chances of conflict 4.1. Hardware overview
misses when updating coherence directories. Most other tested bench-
marks require up to six times the baseline directory size to achieve As shown in Section 3.2, a significant fraction of cache invalidations
maximum attainable performance; the average speedup with six times are generated by the frequent directory evictions. These invalidations
the size is 1.35×. lead to increased cache misses, as data is prematurely invalidated from
Each entry in the coherence directory comprises a tag, sharer list, the cache, requiring subsequent accesses to fetch the data from remote
and coherence state. We assume 48 bits for tag addresses, a 3-bit memory. While simply increasing the directory size can address this
vector for tracking sharers, and one bit for the directory entry state; bottleneck, the associated cost of hardware can become substantial. To
thus, each entry requires a total of 52 bits of storage. Our baseline address this, we propose REC, an architectural solution that compresses
directory implementation has 8K entries and occupies approximately remote GPU access information, retaining as much data as possible
2.5% of the L2 cache [11]. Therefore, the storage cost of the baseline before eviction occurs. It aggregates data from incoming remote read
directory in each GPU is 52 × 8192/8/1024 = 52 kB, assuming 8 requests so that (1) multiple reads to the same address range share
bits per byte and 1024 bytes per kilobyte. From our observation in a common base address, storing only the offset and source GPU in-
Fig. 6, applications require directory sizes from 6× up to 12× the formation, and (2) the coalescing process does not result in any loss
baseline to achieve maximum performance. This corresponds to a total
of information, maintaining the accuracy of the coherence protocol.
storage cost of 312-624 kB, which is an additional 15.230.4% of
We now discuss the design overview of REC and the details of the
the L2 cache size. While increasing directory size can significantly
associated hardware components.
improve performance, the associated hardware costs are substantial.
Fig. 8(a) shows how the baseline GPU handles a sequence of in-
To show the inefficiency of simply scaling directory sizes, we calculate
coming read requests. The cache controller records the tag addresses
the performance per storage using the results in Fig. 6 and the number
and the corresponding sharer IDs in the order that the requests arrive.
of directory entries. Fig. 7 illustrates the results relative to the baseline
When the coherence directory reaches its capacity, the cache controller
with 8K entries, showing that performance improvements per increased
follows a typical FIFO policy to replace the oldest entry with a new
storage do not scale proportionally with larger coherence directories.
Additionally, since GPU applications require different directory sizes
to achieve maximum performance, simply increasing the directory size
is not an efficient solution. Moreover, as GPU L2 caches continue to
grow [16,17], the cost of maintaining proportionally larger coherence
directories will only amplify these overheads. Therefore, improving
coherence directory coverage without significant storage overhead
motivates the need for more efficient fine-grained hardware protocols
in multi-GPU systems.

4. REC architecture

This work aims to enhance coherence directory coverage while
avoiding significant hardware overhead, overall reducing unnecessary
cache invalidations in multi-GPU systems. We introduce REC, an
architecture that coalesces directory entries by leveraging the spatial
locality in memory accesses observed in GPU workloads. In this section,
we provide an overview of the REC design and discuss its integration
with existing multi-GPU coherence protocols.

When the baseline directory set becomes full, the controller must evict
one within the set. Once an entry is evicted, the information it held
can no longer be tracked, triggering an invalidation request to be sent
to the GPU listed in the entry. Upon receiving this request, the sharer
GPU checks its L2 cache and invalidates the corresponding cache line,
leading to a cache miss on any subsequent access to the cache line.

To delay invalidations caused by directory evictions without significant
hardware overhead, we introduce the REC architecture, which
enhances the baseline coherence directory by leveraging spatial locality
to merge multiple addresses into a single entry. As illustrated in
Fig. 8(b), REC stores tag addresses with common high-order bits as a
single entry using a base-plus-offset format. When a new read request
matches the base address in an existing entry, the offset and sharer
information are appended to that entry, reducing the need for additional
entries and delaying evictions. The base address represents the shared
high-order bits, covering a range of addresses and reducing the storage
required compared to storing full tag addresses individually. Additionally,
REC uses position bits to efficiently track multiple addresses within
the specified range, further minimizing storage overhead.
G. Ko et al. Journal of Systems Architecture 160 (2025) 103339
Table 1
Trade-offs between addressable range and storage for each entry. Note that one valid
bit, not shown in the table, is included in the overall calculation.

                          Addressable range
                          64B    128B   256B   1 kB    4 kB
Base address bits         48     41     40     38      36
Position/Sharer bits      0/3    2/6    4/12   16/48   64/192
Total bits per entry      52     50     57     103     293

Table 2
Baseline GPU configuration.

Parameter             Configuration
Number of SAs         16
Number of CUs         4 per SA
L1 vector cache       1 per CU, 16 kB 4-way
L1 inst cache         1 per SA, 32 kB 4-way
L1 scalar cache       1 per SA, 16 kB 4-way
L2 cache              2 MB 16-way, 16 banks, write-back
Cache line size       64B
Coherence directory   8K entries, 8-way
DRAM capacity         4 GB HBM, 16 banks
DRAM bandwidth        1 TB/s [11]
Inter-GPU bandwidth   300 GB/s, bi-directional

Table 3
Tested workloads.

Benchmark                                          Abbr.   Memory footprint
Matrix transpose and vector multiplication [21]    ATAX    128 MB
2-D convolution [21]                               C2D     512 MB
Finite impulse response [22]                       FIR     128 MB
Matrix-multiply [21]                               GEMM    128 MB
Vector multiplication and matrix addition [21]     GEMV    256 MB
2-D Jacobi solver [21]                             J2D     128 MB
LU decomposition [21]                              LU      128 MB
2 matrix multiplications [21]                      MM2     128 MB
3 matrix multiplications [21]                      MM3     64 MB
PageRank [22]                                      PR      256 MB
Simple convolution [23]                            SC      512 MB
Stencil 2D [24]                                    ST      128 MB

Fig. 9. Coherence directory entry structure for 64B cache lines. In our design, each
entry stores up to 16 coalesced entries based on a 1 kB range.

Fig. 10. Overview of the REC protocol flows. In the example coherence directory,
entry insertion and offset addition operations are highlighted in blue, while eviction
and offset deletion operations are shown in red.

Determining the address range within which REC coalesces entries is
one of the key design considerations, as it directly impacts the number
of bits required for each entry. Table 1 lists design choices for
implementing REC with varying addressable ranges and their trade-offs.
The number of required base address bits is calculated using
2^n = addressable_range, where n is the number of bits right-shifted
from the original tag address. Also, the number of required position
bits is determined by the maximum number of coalesceable cache line
addresses within the target range, assuming a 64B line size. Then, the
number of sharer bits required is (n-1) × num_position_bits, where n is
the number of GPUs. For example, if REC is designed to coalesce with an
addressable range of 256B, each entry would require 40, 4, and 12 bits
for the base address, position, and sharer fields, respectively. Lastly, one
valid bit is added to each entry. In Table 1, we show the total bits
required per entry under addressable ranges from 128B to 4 kB
for comparing the storage costs. REC designs with larger addressable
ranges can benefit from increased directory coverage but at the cost of
storage. In the evaluation of this paper, we tested various addressable
ranges for REC. Each design is configured to coalesce the maximum
number of offsets within its specified range. Later in the results, we
confirm that a 1 kB coalesceable range offers the best trade-off, balancing
reasonable size overhead per entry with the ability to coalesce
a significant number of entries before evictions occur (discussed in
Section 5.2).

Based on these findings, the format of a directory entry is as
illustrated in Fig. 9. Each entry comprises a base address, coalesced
entries, and a valid bit. When the first remote read request arrives at the
home GPU, the cache controller sets the base address by right-shifting
the tag address by the number of bits needed to represent the offset
within the specified range. For a 48-bit tag, the address is right-shifted
by 10 bits (considering a 64B-aligned 1 kB range), and bits 64 to 101
of the entry are used to store the base address. The coalesced entry is
identified using the offset within the 1 kB range, represented by a
position bit, followed by three bits for recording the sharers. The
position bit is calculated as:

p = ((Tag mod m) / 64) × (n + 1)

where m denotes the coalescing range and n is the number of sharers,
which are set to 1 kB and 3, respectively. Once the position is
determined, the corresponding position and sharer bits are set to
1 using a bitwise OR operation. Given that the 1 kB range allows each
entry to record up to 16 individual tag addresses, we use the lower 64
bits to store the coalesced entries. Furthermore, the position bit can
also function as the valid bit for each coalesced entry, meaning only
one valid bit is necessary to indicate whether the entire entry is valid
or not.

4.2. REC protocol flows

The baseline coherence protocol operates with two stable states,
valid and invalid, allowing it to remain lightweight and efficient. In
our proposed coherence directory design, each entry represents the
validity of an entire address range instead of tracking individual tag
addresses and associated sharers. This enables the state transitions
to be managed at a coarser granularity during directory evictions.
Additionally, REC supports fine-grained control over write requests by
tracking specific offsets within these address ranges, avoiding the need
to evict entire entries. Fig. 10 highlights the architecture of REC and
how it handles received requests differently from the baseline. REC
does not require additional coherence states but instead modifies the
transitions triggered under specific conditions.
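The bit-cost arithmetic behind Table 1 and the position-bit formula can be checked with a short sketch (our own illustration; `entry_bits` and `position_bit` are hypothetical names, and the parameters are the paper's stated ones: 48-bit tags, 64B lines, a 4-GPU system):

```python
import math

LINE = 64        # cache line size in bytes
TAG_BITS = 48    # tag address width assumed in Section 4.1
NUM_GPUS = 4     # 4-GPU system, i.e. n - 1 = 3 remote sharers per line

def entry_bits(addressable_range):
    """Total bits per REC entry for a given coalescing range (Table 1)."""
    base = TAG_BITS - int(math.log2(addressable_range))  # right-shifted base address
    positions = addressable_range // LINE                # one position bit per line
    sharers = (NUM_GPUS - 1) * positions                 # (n-1) sharer bits per position
    return base + positions + sharers + 1                # plus one valid bit

def position_bit(tag, m=1024, sharers=NUM_GPUS - 1):
    """p = ((Tag mod m) / 64) x (n + 1): bit index of a line's position bit."""
    return ((tag % m) // LINE) * (sharers + 1)
```

For the 1 kB design this reproduces the 103 bits per entry of Table 1 (so an 8K-entry directory costs 8192 × 103 / 8 / 1024 = 103 kB, as computed in Section 4.3), and the offset 0x340 used in the Section 4.2 example maps to position bit 52, with sharer GPU1 occupying the adjacent bit 53.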
Remote reads: When the GPU receives a read request from a
remote GPU, the cache controller extracts the base and offset from the
tag address ( A ). The controller then looks up the coherence directory
for an entry with the matching base address ( B ). If a valid entry is
found, the position bit corresponding to the offset, calculated using the
formula in Section 4.1, and the associated sharer bit are set ( C ). For
example, for the offset 0x340 the position bit is 0x340/64 × 4 = 52,
representing the 14th cache line within the specified 1 kB range. The
sharer bit is determined by the source GPU index (e.g., GPU1). Therefore,
bits 52 and 53 are set to 1. It can happen that the position bit is
already set; nevertheless, the controller still performs a bitwise OR on
the bits at the corresponding positions. Since the entry already exists
in the directory, it remains valid. Otherwise, if no valid entry is found,
a new entry is created with the base address, and the position and
sharer bits are set. With the insertion of a new entry, the state
transitions from invalid to valid.

Local writes: When a write request is performed locally ( D ), the
cache controller must determine whether it needs to send out
invalidation requests to the sharers that hold a copy of the data. For this,
the controller again looks up the directory with the calculated base
address and offset ( E ). If an entry is found and the offset is valid
(i.e., the position bit is set), the invalidation request is generated
and propagated to the recorded sharers immediately ( F ). The state
transition is handled differently based on two conditions. First, when
another offset is tracked under the common address range, the directory
entry should remain valid. Thus, the controller clears only the position
and sharer bits for the specific offset of the target address. For example,
in Fig. 10, the directory entry has another offset (at p = 56) recorded
under the same base address. Once the invalidation request is sent out
to GPU1, the controller only clears bits 0 and 1. If the cleared bits are
the last ones, the entire directory entry transitions to an invalid state
to make room for new entries.

Remote writes: For a remote write request, the cache controller
begins the same directory lookup process by calculating the base and
offset from the tag ( G ). In our target multi-GPU system, the source
GPU also performs writes to the copy of data in its local L2 cache
(discussed in Section 2.2). Therefore, the controller handles remote
write requests differently from local writes. When an entry already
exists in the directory (i.e., a hit), there may be two circumstances: (1)
the target offset is invalid but the entry has other valid offsets, or (2)
the target offset is already valid and one or more sharers are being
tracked. If the target offset is invalid, the controller simply adds the
offset and the sharer to the entry in the same way it handles remote
reads. If the offset is valid, the controller adds the source GPU to the
sharer list by setting its corresponding bit and clearing the other sharer
bits ( H ), then sends invalidation requests to all other sharers ( I ). In
Fig. 10, the entry and the target offset (at p = 56) are both already
recorded. The controller thus additionally sets bit 58 to add GPU2 as
a sharer while clearing bit 59, and sends the invalidation request
to GPU3. In either case, the directory entry remains valid. When the
directory misses, the cache controller allocates a new entry to record
the base, offset, and sharer from the write request. Then, the entry state
transitions to valid.

Directory entry eviction/replacement: When the coherence directory
becomes full, it needs to replace an entry with the newly inserted
one. The baseline coherence directory uses a FIFO replacement policy.
However, for workloads that exhibit irregular memory access patterns,
capturing locality becomes a challenge. To address this, REC
adopts a replacement policy similar to LRU to better retain entries
that are more likely to be accessed again. When the cache controller
receives a remote read request and does not find an entry with the
matching base address ( J ), it determines an entry for replacement
( K ). The evicted entry is then replaced with the new entry from the
incoming request ( L ). Meanwhile, the controller retrieves the base
address and every merged offset from the evicted entry and reconstructs
the original tag addresses. Invalidation requests are propagated to every
recorded sharer associated with each tag address ( M ). Lastly, the entry
transitions to an invalid state.

4.3. Discussion

Overheads: In our design, the coherence directory consists of 8K
entries, with each entry covering a 1 kB range of addresses. Each entry
comprises a 38-bit base address field, a 64-bit vector for offsets and
sharers, and a valid bit (detailed in Table 1). Thus, the total directory
size is 8192 × 103 / 8 / 1024 = 103 kB. We also estimate the area
and power overhead of the coherence directory in REC using CACTI
7.0 [25]. The results show that the directory occupies 3.94% of the area
and consumes 3.28% of the power of the GPU L2 cache. REC requires no
additional hardware extensions for managing the coherence directory.
The existing cache controller handles operations such as base address
calculation and bitwise manipulation efficiently.

Comparison to prior work: As discussed in Section 2.3, HMG [11]
designs each coherence directory entry to track four cache lines at
a coarse granularity. We empirically show, in Section 3.3, that GPUs
require a directory size up to 12× the baseline to eliminate unnecessary
cache line invalidations. Since REC coalesces up to 16 consecutive
cache line addresses per entry, REC can track a significantly larger
number of cache lines compared to the prior work. Moreover, REC precisely
tracks each address by storing the offset and sharer information. Thus,
REC fully supports fine-grained management of cache lines under write
operations.

Scalability: REC requires modifications to its design in large-scale
systems, specifically to the sharer bit field. For an 8-GPU system, REC
requires (8-1) × 16 = 112 bits to record sharers in each entry. Then,
the size of each entry becomes 112 + 38 + 16 + 1 = 167 bits, which
is approximately three times the baseline size, where each entry costs
56 bits, including a 4-bit increase for sharers. Similarly, for a 16-GPU
system, REC requires 295 bits per entry, roughly five times the baseline
size. However, as observed in Section 3.3, an ideal GPU requires up to
12 times the baseline directory size even in a 4-GPU system, implying
that simply increasing the baseline directory size is insufficient to meet
scalability demands.

5. Evaluation

5.1. Methodology

We use MGPUSim [20], a cycle-accurate multi-GPU simulator, to
model the baseline and REC architectures with four AMD GPUs connected
using inter-GPU links of 300 GB/s bandwidth [26]. The configuration of
the modeled GPU architecture is detailed in Table 2. Each GPU includes
L1 scalar and instruction caches shared within each SA, while the L1
vector cache is private to each CU, and the L2 cache is shared across the
GPU. We extend remote data caching to the L2 caches, allowing data
from any GPU in the system to be cached in the L2 cache of any other
GPU. Since MGPUSim does not include support for hardware cache
coherence, we extend the simulator by implementing a coherence
directory managed by the L2 cache controller. The coherence directory is
implemented with a set-associative structure to reduce lookup latency.
Since the baseline coherence directory is decoupled from the caches,
its way associativity as well as its size can be scaled independently.
In our evaluation, the coherence directory is designed with an 8-way
set-associative structure to reduce conflict misses, containing 8K entries
in both the baseline and REC architectures. Upon receiving remote read
requests, the cache controller updates the coherence directory by
recording the addresses and the associated sharers. Once the capacity of
the directory is reached, the cache controller evicts an entry and sends
out invalidation requests to the recorded sharers. Upon receiving write
requests, the controller looks up the directory to find whether data
with matching addresses are shared by remote GPUs. If matching
entries are found, invalidation requests are propagated to the sharers
except the source GPU. Additionally, since L2 caches are managed
by coherence directories, acquire operations do not perform invalidations
on L2 caches, but release operations flush the L2 caches. We
use workloads from a diverse set of benchmark suites, including
AMDAPPSDK [23], Heteromark [22], Polybench [21], and SHOC [24].
Table 3 lists the workloads with their memory footprints.
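The four protocol flows of Section 4.2 can be summarized in a toy, single-set model (our own simplification, not the paper's hardware; the class and method names are hypothetical, and we assume remote GPUs are indexed 1-3 so that the sharer bit of GPUk for a line at position p is bit p+k, consistent with the bit 52/53 and 58/59 examples above):

```python
RANGE = 1024             # 1 kB coalescing range
LINE = 64                # 64B cache lines
SHARERS = 3              # n = 3 remote sharers in a 4-GPU system
STRIDE = SHARERS + 1     # bits consumed per tracked line

def pos(tag):
    # Position bit of the cache line holding `tag` within its 1 kB range.
    return ((tag % RANGE) // LINE) * STRIDE

class RecDirectory:
    """Toy single-set REC directory with LRU-like replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}    # base -> 64-bit position/sharer vector, LRU order

    def _touch(self, base):
        self.entries[base] = self.entries.pop(base)   # move to MRU position

    def remote_read(self, tag, gpu):
        """Record a sharer; may evict the LRU entry, returning invalidations."""
        base, invalidations = tag // RANGE, []
        if base not in self.entries:
            if len(self.entries) >= self.capacity:
                invalidations = self.evict(next(iter(self.entries)))  # LRU victim
            self.entries[base] = 0
        self._touch(base)
        p = pos(tag)
        self.entries[base] |= (1 << p) | (1 << (p + gpu))  # position + sharer bit
        return invalidations

    def local_write(self, tag):
        """Invalidate sharers of one line; keep the entry if other lines remain."""
        base, p = tag // RANGE, pos(tag)
        vec = self.entries.get(base, 0)
        if not (vec >> p) & 1:
            return []
        sharers = [g for g in range(1, STRIDE) if (vec >> (p + g)) & 1]
        vec &= ~(((1 << STRIDE) - 1) << p)      # clear position + sharer bits
        if vec:
            self.entries[base] = vec
        else:
            del self.entries[base]              # last offset: entry -> invalid
        return sharers

    def remote_write(self, tag, gpu):
        """The writer becomes the sole sharer; all other sharers are invalidated.

        Capacity handling on the write-miss path is omitted for brevity.
        """
        base, p = tag // RANGE, pos(tag)
        if base not in self.entries:
            self.entries[base] = 0
        self._touch(base)
        vec = self.entries[base]
        others = [g for g in range(1, STRIDE) if g != gpu and (vec >> (p + g)) & 1]
        vec &= ~(((1 << STRIDE) - 1) << p)
        self.entries[base] = vec | (1 << p) | (1 << (p + gpu))
        return others

    def evict(self, base):
        """Reconstruct tags from merged offsets and invalidate every sharer."""
        vec = self.entries.pop(base)
        invalidations = []
        for line in range(RANGE // LINE):
            p = line * STRIDE
            if (vec >> p) & 1:
                tag = base * RANGE + line * LINE
                invalidations += [(tag, g) for g in range(1, STRIDE)
                                  if (vec >> (p + g)) & 1]
        return invalidations
```

Replaying the Fig. 10-style scenario on this model, a remote read of offset 0x340 by GPU1 sets bits 52 and 53, a local write to the line at p = 0 clears only bits 0 and 1 while the entry stays valid, and a remote write by GPU2 to a line shared by GPU3 sets bit 58, clears bit 59, and invalidates GPU3.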
Fig. 11. Performance comparison of the baseline with double-sized coherence
directory, HMG [11], REC, and an idealized system with zero unnecessary invalidations.
Performance is normalized to the baseline with 8K-entry coherence directory.

Fig. 12. Number of coalesced cache line addresses at directory entry eviction under
REC with varying addressable ranges. REC in this work coalesces with a 1 kB addressable
range.

Fig. 13. Total number of L2 cache misses in the baseline with double-sized coherence
directory, HMG [11], and REC relative to the baseline.

5.2. Performance analysis

Fig. 11 shows the performance of the baseline with a coherence
directory double in size, HMG [11], REC, and an ideal multi-GPU system
with zero unnecessary invalidations, relative to the baseline. First, we
include the performance of the baseline with double the coherence directory
size to compare REC against the same storage cost. The results show that
the baseline with double the directory size achieves an average speedup
of 7.3%. The baseline coherence directory tracks each remote access
individually, on a per-entry basis. As discussed in Section 3.3, doubling
the size of the coherence directory does not mitigate the unnecessary cache
line invalidations for applications with significant directory evictions.
Also, the results show that HMG and REC achieve average speedups of
16.7% and 32.7% across the evaluated workloads. We observe that
REC outperforms the prior scheme for two reasons. First, REC delays
directory evictions by allowing each entry to record more cache line
addresses over a wider range. Since HMG uses each directory entry to
track four cache lines, an entire coherence directory can track cache
lines up to 4× the baseline. On the other hand, the directory in REC
can record up to 16× the number of entries. Second, REC manages write
operations to shared cache lines at a fine granularity by searching the
directory with exact addresses and sharers, propagating invalidations
only when necessary. Since each directory entry of HMG stores only
a single address and sharer ID field that covers four cache lines, a
write to any of these cache lines triggers invalidation requests to every
cache line and recorded sharer, which leads to false positives. In
contrast, REC does not allow any false positives and performs invalidations
only on the modified cache lines and the associated sharers. As
a result, REC reduces unnecessary invalidations on cache lines that are
actively being accessed by the requesting GPUs, minimizing redundant
remote memory accesses. To investigate the effectiveness of REC under
the different addressable ranges listed in Table 1, we also measure the
number of coalesced cache line addresses when an entry is evicted
and plot them in Fig. 12. We observe that the directory entries capture an
average of 1.8, 3.4, 12.9, and 54.7 addresses until eviction under REC
with 128B, 256B, 1 kB, and 4 kB coalesceable ranges. Specifically,
REC captures more than 14 addresses before directory eviction for
applications with strong spatial locality.

Fig. 12 also illustrates the limited locality of certain workloads
where REC benefits less. In ATAX, PR, and ST, REC coalesces 3.9, 6.1,
and 5.8 addresses, respectively. This is because these applications
exhibit locality that is challenging to capture due to their irregular
memory access patterns, which span a wide range of addresses. To
delay the eviction of entries in irregular workloads, we design our
proposed coherence directory with an LRU-like replacement policy
(discussed in Section 4.2). Another interesting observation is that the
performance improvement of GEMV with REC is higher than the
improvement seen when eliminating unnecessary invalidations. Our
approach delays invalidations but still performs them when the
directories become full. During cache line replacement, the controller
prioritizes invalid cache lines before applying the LRU policy. As a
result, this delays the replacement of useful cache lines, thereby
improving cache efficiency.

L2 cache misses: The performance improvement of REC is largely
attributed to the reduction in cache misses caused by unnecessary
invalidations from frequent evictions in the coherence directory of
home GPUs. Fig. 13 shows the total number of L2 cache misses in the
baseline with double-sized directory, HMG, and REC relative to the
baseline. Cold misses are excluded from the results. We observe that
REC reduces L2 cache misses by 53.5%. In contrast, the baseline with
double-sized directory and HMG experience 1.79× and 1.40× more
cache misses than REC, since neither approach is sufficient
to delay evict-initiated cache line invalidations. The result is closely
related to the reduction in remote access latency, as the corresponding
misses are forwarded to the remote GPUs. Addressing the remote GPU
access bottleneck is performance-critical in multi-GPU systems.

Unnecessary invalidations: In the baseline, invalidation requests
propagated from frequent directory evictions in the home GPU lead to
higher chances of finding the corresponding cache lines still valid in the
sharer-side L2 caches. This results in premature invalidations of cache
lines that are actively in use, exacerbating the cache miss rate. In REC,
the invalidation requests generated by directory evictions have a lower
chance of invalidating still-valid cache lines. Fig. 14 shows that the number
of unnecessary invalidations performed in remote L2 caches (i.e., where
they are hits) is reduced by 84.4%. Since REC significantly delays evict-
initiated invalidation requests, many cache lines have already been
evicted from the caches by the time these requests are issued.

Inter-GPU transactions: The reduction in unnecessary invalidations
enhances the utilization of data within the sharer GPUs and minimizes
redundant accesses over inter-GPU links. Fig. 14 shows the
total number of inter-GPU transactions compared to the baseline. As
illustrated, REC reduces inter-GPU transactions by an average of 34.9%.
The reduced inter-GPU traffic directly contributes to the overall
performance improvement in multi-GPU systems.

Bandwidth impact: Fig. 15 shows the total inter-GPU bandwidth
costs of invalidation requests. As presented in Section 3.2, a large
fraction of invalidation requests are propagated due to frequent directory
evictions. Since REC delays invalidation requests from directory
evictions by allowing each entry to coalesce multiple tag addresses, the
bandwidth in most of the workloads becomes only a few gigabytes per
second.

Fig. 14. Total number of unnecessary invalidations (bars) and inter-GPU transactions
(plots) relative to the baseline.

Fig. 15. Total bandwidth consumption of invalidation requests.

Fig. 16. L2 cache lookup latency.

Fig. 17. Performance of REC under varying (a) coalescing address ranges and (b)
number of directory entries. Results are shown relative to the baseline with an 8K-
entry coherence directory.

Fig. 18. Performance comparison of REC using FIFO and LRU replacement policies.
Performance is normalized to the baseline coherence directory with FIFO policy.

Fig. 19. Performance impact of different L2 cache sizes in the baseline and REC.
Performance is normalized to the baseline with 2 MB L2 cache.

Cache lookup latency: Fig. 16 illustrates the average L2 cache lookup
latency of REC normalized to the baseline. The results show that the
lookup latency is reduced by 14.8% compared to the baseline. REC affects
the average lookup latency because evict-initiated invalidation requests are
propagated in bursts. However, since REC significantly delays directory
evictions by coalescing multiple tag addresses, the overall latency
decreases for most of the evaluated workloads.

5.3. Sensitivity analysis

Coalescing range: One important design decision in optimizing REC
is determining the range over which to coalesce when remote read
requests are received. As discussed in Section 4.1, a trade-off exists
between the range an entry coalesces and the number of bits required:
the larger the range, the more bits are needed to store the remote
GPU access information. Fig. 17(a) shows that the performance of REC
improves as the coalescing range increases, with performance gains
beginning to saturate at 1 kB. For our applications, a 1 kB range is
sufficient to capture the majority of memory access locality within the
workloads. Since coalescing beyond 4 kB incurs excessive overhead in
terms of bits required per entry (with 4 kB already requiring nearly 6×
the baseline size), the potential performance improvement may not be
substantial enough to offset the additional cost. Therefore, we choose a
1 kB range for our implementation.

Entry size: In our evaluation, we use a directory size of 8K entries
to match the baseline coherence directory. Fig. 17(b) shows the
performance of REC with varying entry counts, ranging from 2K to 32K. On
average, REC outperforms the baseline, even with reduced entry counts
compared to the baseline system with an 8K-entry coherence directory.
This is because the coverage of each coherence directory in REC
can increase by up to 16× when locality is fully utilized. Although
applications with limited locality show performance improvements as
the directory size increases, these gains are relatively modest when
considered against the additional hardware costs.

FIFO replacement: Fig. 18 presents the performance of REC with
a FIFO replacement policy. Our evaluation shows that the choice of
replacement policy has a relatively small impact on overall performance.
For workloads with regular and more predictable memory
access patterns, the FIFO replacement policy is already effective
in coalescing a sufficient number of addresses under the target ranges
(shown in Fig. 12). However, for some applications, such as ATAX,
PR, and ST, performance is lower with FIFO compared to REC due
to their limited locality patterns. These applications, therefore, benefit
from using an LRU-like replacement policy.

L2 cache size: The performance impact of different L2 cache sizes is
shown in Fig. 19. The results are normalized to the baseline with a
2 MB L2 cache. The benefits from increasing L2 cache capacity are
limited by the baseline coherence directory. In contrast, the performance
of REC improves as L2 cache size increases, demonstrating its
ability to leverage larger caches effectively. Another observation is that
the performance improvement with smaller L2 capacity is less significant
compared to larger L2 caches. This is because the coverage of the
baseline coherence directory relatively increases as the L2 cache size
decreases. To further explore the performance sensitivity to different
L2 cache sizes, we evaluate REC in systems with L2 cache sizes of
0.5 MB and 8 MB. We find that REC achieves an average performance
improvement of 6.3% and 26.7% compared to the baseline with 0.5 MB
and 8 MB L2 caches, respectively. Additionally, the relative improvement
of REC decreases as the L2 cache size grows, since larger caches reduce
the effectiveness of REC. Nevertheless, the results emphasize
the importance of the coherence protocol in improving cache efficiency.

Fig. 20. Performance impact of different inter-GPU bandwidth in the baseline and REC.
Performance is normalized to the baseline with 300 GB/s inter-GPU bandwidth.

Fig. 21. Performance of REC with different number of SAs normalized to the baseline
with 16 SAs.

Fig. 22. Performance comparison of REC and the baseline with equal storage cost
under different number of GPUs. Performance is normalized to the baseline with 8K
entries.

Inter-GPU bandwidth: The bandwidth of inter-GPU links is a critical
factor in scaling multi-GPU performance. Fig. 20 shows the performance
of the baseline and REC under different inter-GPU bandwidths,
relative to the 300 GB/s baseline. The results demonstrate that REC
outperforms the baseline, even in applications where performance begins
to saturate with increased bandwidth.

Number of SAs: We also evaluate REC with an increasing number
of SAs, as shown in Fig. 21. The performance improvement of REC
decreases compared to the system with 16 SAs, since the increased
number of SAs improves the thread-level parallelism of GPUs. However, a
system with a larger number of SAs also elevates the intensity of data
sharing and thus increases the frequency of coherence directory evictions.
As a result, REC still outperforms the baseline by 17.1%.

Number of GPUs: We evaluate REC in 8-GPU and 16-GPU systems,
as shown in Fig. 22. To ensure a fair comparison, we do not change
the workload sizes. The results show that REC provides performance
improvements of 24.7% and 14.7% over the baseline in 8-GPU and
16-GPU systems, respectively. We observe that the performance
improvement decreases as the number of GPUs increases. This is because,
with more GPUs, the application dataset is more distributed, and the
amount of data allocated to each GPU's memory decreases, resulting
in reduced pressure on each coherence directory for tracking shared
copies. Additionally, we compare REC with the baseline configured
with different directory sizes to match equal storage costs (discussed in
Section 4.3). We observe that REC achieves performance improvements
of 2.04× and 1.83× over the baseline with directory sizes increased by
3× and 5×, respectively. The results confirm that simply increasing
directory sizes is not an efficient approach, even in large-scale multi-GPU
systems.

5.4. REC with different GPU architecture

Fig. 23. Performance of REC in different GPU architecture.

We extend the evaluation of REC to a different GPU architecture
by adapting the simulation environment to a more recent
NVIDIA-styled GPU [27]. This involves increasing the computation
and memory resources compared to the AMD GPU setup.
Specifically, we change the GPU configuration to include 128 CUs, each
with a 128 kB L1V cache. The L2 cache size is increased to 72 MB,
with the cache line size adjusted to 128B. With the increased cache
line size, we configure the addressable range of REC to 2 kB, allowing
it to coalesce up to the same number of tag addresses. We also scale
the input sizes of the workloads to the extent that the simulations remain
feasible. The performance results, shown in Fig. 23, indicate that REC
achieves a 12.9% performance improvement over the baseline. This
indicates that our proposed REC also benefits an NVIDIA-like GPU
architecture.

5.5. Effectiveness of REC on DNN applications

Fig. 24. Performance of REC with DNN applications.

We evaluate the performance improvement of REC in training
two DNN models, VGG16 and ResNet18, using the Tiny-Imagenet-200
dataset [28]. As shown in Fig. 24, REC outperforms the baseline for
training VGG16 and ResNet18 by 5.6% and 8.9%, respectively. The
results imply that REC also benefits multi-GPU training of
DNN workloads. Additionally, GPUs have recently gained significant
attention for training large language models (LLMs). The computation
of LLM training comprises multiple decoder blocks, each primarily
consisting of series of matrix and vector operations [29]. In our evaluation,
we observe that REC improves multi-GPU performance by 20.2% and
20.4% on GEMM and GEMV workloads, respectively. Considering real-
world LLM training, the memory requirements can become significant
with large parameter counts, which can pressure memory systems and lead
to under-utilization of computation resources [29]. Since REC
improves cache efficiency in multi-GPU systems, we expect a higher
performance potential from REC in real-world LLM training.

6. Related work

Several prior works have proposed GPU memory consistency and
cache coherence mechanisms optimized for general-purpose domains
[13-15,19,30-32]. GPU-VI [19] reduces stalls at the cache controller
by employing write-through, write-no-allocate L1 caches and treating
loads to pending writes as misses. To maintain write atomicity,
GPU-VI adds transient states and state transitions and requires
invalidation acknowledgments before write completion. REC is implemented
based on the relaxed memory models commonly adopted in recent
GPU architectures, which do not require acknowledgments to be sent
or received over long-latency inter-GPU links. HMG [11] proposes a
lightweight directory protocol addressing up-to-date memory consistency
and coherence requirements. HMG integrates separate layers for
managing inter-GPM and inter-GPU level coherence, reducing network
traffic and complexity in deeply hierarchical multi-GPU systems. REC
primarily addresses the increased cache misses to remotely fetched data
caused by frequent invalidations. Additionally, REC can be extended
to support hierarchical multi-GPU systems as proposed by HMG without
significant hardware modifications.

Other efforts aim to design efficient cache coherence protocols for
other processor domains. Wang et al. [33] suggested a method to
efficiently support dynamic task parallelism on heterogeneous cache
coherent systems. Zuckerman et al. [34] proposed Cohmeleon, which
orchestrates coherence for accelerators in heterogeneous system-on-
chip designs. HieraGen [35] and HeteroGen [36] are automated tools
for generating hierarchical and heterogeneous cache coherence protocols,
respectively, for generic processor designs. Li et al. [37] proposed
methodologies to determine the minimum number of virtual networks

translation overheads, and [47]. Villa et al. [49] studied designing
trustworthy system-level simulation methodologies for single- and
multi-GPU systems. Lastly, NGS [50] enables multiple nodes in a data
center network to share the compute resources of GPUs on top of a
virtualization technique.

7. Conclusion

In this paper, we propose REC to improve the efficiency of cache
coherence in multi-GPU systems. Our analysis shows that the limited
capacity of coherence directories in fine-grained hardware protocols
frequently leads to evictions and unnecessary invalidations of shared
data. As a result, the increase in cache misses exacerbates NUMA
overhead, leading to significant performance degradation in multi-GPU
systems. To address this challenge, REC leverages memory access locality
to coalesce multiple tag addresses within common address ranges,
effectively increasing the coverage of coherence directories without
incurring significant hardware overhead. Additionally, REC maintains
write-initiated invalidations at a fine granularity to ensure precise and
flexible coherence across GPUs. Experiments show that REC reduces
L2 cache misses by 53.5% and improves overall system performance
by 32.7%.

CRediT authorship contribution statement

Gun Ko: Writing - original draft, Visualization, Validation, Software,
Resources, Methodology, Investigation, Formal analysis, Data
curation, Conceptualization. Jiwon Lee: Formal analysis, Conceptualization.
Hongju Kal: Validation, Conceptualization. Hyunwuk Lee:
Visualization, Validation. Won Woo Ro: Supervision, Project administration,
Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to
for cache coherence protocols that can avoid deadlocks. However, these influence the work reported in this paper.
studies do not address the challenges of redundant invalidations in the
cache coherence mechanisms of multi-GPU systems.
Acknowledgments
Significant research has addressed the NUMA effect in multi-GPU
systems by proposing efficient page placement and migration strate-
This work was supported by Institute of Information & communica-
gies [5,6,38], data transfer and replication methods [4,7,8,10,39,40],
tions Technology Planning & Evaluation (IITP) grant funded by the Ko-
and address translation schemes [4143]. In particular, several works
rea government (MSIT) (No. 2024-00402898, Simulation-based High-
have focused on improving the management of shared data within the
speed/High-Accuracy Data Center Workload/System Analysis Platform)
local memory hierarchy. NUMA-aware cache partitioning [3] dynami-
cally allocates cache space to accommodate data from both local and
remote memory by monitoring inter-GPU and local DRAM bandwidths. Data availability
The authors also extend software coherence with bulk invalidations
to L2 caches and evaluate the overhead associated with unnecessary The authors are unable or have chosen not to specify which data
invalidations. SAC [12] proposes reconfigurable last-level caches (LLC) has been used.
that can be utilized as either memory-side or SM-side, depending on
predicted application behavior in terms of effective LLC bandwidth.
References
SAC evaluates the performance of both software and hardware ex-
tensions for LLC coherence. In contrast, REC specifically targets the
[1] NVIDIA, NVIDIA DGX-2, 2018, https://www.nvidia.com/content/dam/en-
issue of unnecessary invalidations under hardware coherence, which zz/Solutions/Data-Center/dgx-2/dgx-2-print-datasheet-738070-nvidia-a4-web-
can undermine the efficiency of remote data caching. It introduces uk.pdf.
a new directory structure, carefully examining the trade-off between [2] NVIDIA, NVIDIA DGX A100 system architecture, 2020, https://download.
performance and storage overhead. boston.co.uk/downloads/3/8/6/386750a7-52cd-4872-95e4-7196ab92b51c/
DGX%20A100%20System%20Architecture%20Whitepaper.pdf.
Recent studies on multi-GPU and multi-node GPU systems also ad-
[3] U. Milic, O. Villa, E. Bolotin, A. Arunkumar, E. Ebrahimi, A. Jaleel, A. Ramirez,
dress challenges in various domains. Researchers proposed methods to
D. Nellans, Beyond the socket: NUMA-aware GPUs, in: Proceedings of IEEE/ACM
accelerate deep learning applications [44], graph neural networks [45], International Symposium on Microarchitecture, 2017, pp. 123135.
and graphics rendering applications [46] in multi-GPU systems. Na [4] V. Young, A. Jaleel, E. Bolotin, E. Ebrahimi, D. Nellans, O. Villa, Combining
et al. [47] addressed security challenges in inter-GPU communications HW/SW mechanisms to improve NUMA performance of multi-GPU systems, in:
under unified virtual memory framework. Barre Chord [48] leverages Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2018,
page allocation schemes in multi-chip-module GPUs to reduce address pp. 339351.
11
G. Ko et al. Journal of Systems Architecture 160 (2025) 103339
[5] T. Baruah, Y. Sun, A.T. Dinçer, S.A. Mojumder, J.L. Abellán, Y. Ukidave, A. Joshi, N. Rubin, J. Kim, D. Kaeli, Griffin: Hardware-software support for efficient page migration in multi-GPU systems, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2020, pp. 596-609.
[6] M. Khairy, V. Nikiforov, D. Nellans, T.G. Rogers, Locality-centric data and threadblock management for massive GPUs, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2020, pp. 1022-1036.
[7] H. Muthukrishnan, D. Lustig, D. Nellans, T. Wenisch, GPS: A global publish-subscribe model for multi-GPU memory management, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 46-58.
[8] L. Belayneh, H. Ye, K.-Y. Chen, D. Blaauw, T. Mudge, R. Dreslinski, N. Talati, Locality-aware optimizations for improving remote memory latency in multi-GPU systems, in: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2022, pp. 304-316.
[9] S.B. Dutta, H. Naghibijouybari, A. Gupta, N. Abu-Ghazaleh, A. Marquez, K. Barker, Spy in the GPU-box: Covert and side channel attacks on multi-GPU systems, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2023, pp. 633-645.
[10] H. Muthukrishnan, D. Lustig, O. Villa, T. Wenisch, D. Nellans, FinePack: Transparently improving the efficiency of fine-grained transfers in multi-GPU systems, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2023, pp. 516-529.
[11] X. Ren, D. Lustig, E. Bolotin, A. Jaleel, O. Villa, D. Nellans, HMG: Extending cache coherence protocols across modern hierarchical multi-GPU systems, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2020, pp. 582-595.
[12] S. Zhang, M. Naderan-Tahan, M. Jahre, L. Eeckhout, SAC: Sharing-aware caching in multi-chip GPUs, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2023, pp. 605-617.
[13] B.A. Hechtman, S. Che, D.R. Hower, Y. Tian, B.M. Beckmann, M.D. Hill, S.K. Reinhardt, D.A. Wood, QuickRelease: A throughput-oriented approach to release consistency on GPUs, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2014, pp. 189-200.
[14] M.D. Sinclair, J. Alsop, S.V. Adve, Efficient GPU synchronization without scopes: Saying no to complex consistency models, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2015, pp. 647-659.
[15] J. Alsop, M.S. Orr, B.M. Beckmann, D.A. Wood, Lazy release consistency for GPUs, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2016, pp. 1-13.
[16] NVIDIA, NVIDIA TESLA V100 GPU architecture, 2017, https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
[17] NVIDIA, NVIDIA A100 tensor core GPU architecture, 2020, https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf.
[18] NVIDIA, NVIDIA NVLink high-speed GPU interconnect, 2024, https://www.nvidia.com/en-us/design-visualization/nvlink-bridges/.
[19] I. Singh, A. Shriraman, W.W.L. Fung, M. O'Connor, T.M. Aamodt, Cache coherence for GPU architectures, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2013, pp. 578-590.
[20] Y. Sun, T. Baruah, S.A. Mojumder, S. Dong, X. Gong, S. Treadway, Y. Bao, S. Hance, C. McCardwell, V. Zhao, H. Barclay, A.K. Ziabari, Z. Chen, R. Ubal, J.L. Abellán, J. Kim, A. Joshi, D. Kaeli, MGPUSim: Enabling multi-GPU performance modeling and optimization, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2019, pp. 197-209.
[21] T. Yuki, L.-N. Pouchet, Polybench 4.0, 2015.
[22] Y. Sun, X. Gong, A.K. Ziabari, L. Yu, X. Li, S. Mukherjee, C. Mccardwell, A. Villegas, D. Kaeli, Hetero-mark, a benchmark suite for CPU-GPU collaborative computing, in: Proceedings of IEEE International Symposium on Workload Characterization, 2016, pp. 1-10.
[23] AMD, AMD app SDK OpenCL optimization guide, 2015.
[24] A. Danalis, G. Marin, C. McCurdy, J.S. Meredith, P.C. Roth, K. Spafford, V. Tipparaju, J.S. Vetter, The Scalable Heterogeneous Computing (SHOC) benchmark suite, in: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010, pp. 63-74.
[25] R. Balasubramonian, A.B. Kahng, N. Muralimanohar, A. Shafiee, V. Srinivas, CACTI 7: New tools for interconnect exploration in innovative off-chip memories, ACM Trans. Archit. Code Optim. 14 (2) (2017) 14:1-25.
[26] NVIDIA, NVIDIA DGX-1 with Tesla V100 system architecture, 2017, pp. 1-43.
[27] NVIDIA, NVIDIA ADA GPU architecture, 2023, https://images.nvidia.com/aem-dam/Solutions/Data-Center/l4/nvidia-ada-gpu-architecture-whitepaper-v2.1.pdf.
[28] Y. Le, X. Yang, Tiny ImageNet visual recognition challenge, 2015, http://vision.stanford.edu/teaching/cs231n/reports/2015/pdfs/yle_project.pdf.
[29] G. Heo, S. Lee, J. Cho, H. Choi, S. Lee, H. Ham, G. Kim, D. Mahajan, J. Park, NeuPIMs: NPU-PIM heterogeneous acceleration for batched LLM inferencing, in: Proceedings of ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024, pp. 722-737.
[30] K. Koukos, A. Ros, E. Hagersten, S. Kaxiras, Building heterogeneous Unified Virtual Memories (UVMs) without the overhead, ACM Trans. Archit. Code Optim. 13 (1) (2016).
[31] X. Ren, M. Lis, Efficient sequential consistency in GPUs via relativistic cache coherence, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2017, pp. 625-636.
[32] S. Puthoor, M.H. Lipasti, Turn-based spatiotemporal coherence for GPUs, ACM Trans. Archit. Code Optim. 20 (3) (2023).
[33] M. Wang, T. Ta, L. Cheng, C. Batten, Efficiently supporting dynamic task parallelism on heterogeneous cache-coherent systems, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2020, pp. 173-186.
[34] J. Zuckerman, D. Giri, J. Kwon, P. Mantovani, L.P. Carloni, Cohmeleon: Learning-based orchestration of accelerator coherence in heterogeneous SoCs, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 350-365.
[35] N. Oswald, V. Nagarajan, D.J. Sorin, HieraGen: Automated generation of concurrent, hierarchical cache coherence protocols, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2020, pp. 888-899.
[36] N. Oswald, V. Nagarajan, D.J. Sorin, V. Gavrielatos, T. Olausson, R. Carr, HeteroGen: Automatic synthesis of heterogeneous cache coherence protocols, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2022, pp. 756-771.
[37] W. Li, N. Oswald, V. Nagarajan, D.J. Sorin, Determining the minimum number of virtual networks for different coherence protocols, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2024, pp. 182-197.
[38] Y. Wang, B. Li, A. Jaleel, J. Yang, X. Tang, GRIT: Enhancing multi-GPU performance with fine-grained dynamic page placement, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2024, pp. 1080-1094.
[39] M.K. Tavana, Y. Sun, N.B. Agostini, D. Kaeli, Exploiting adaptive data compression to improve performance and energy-efficiency of compute workloads in multi-GPU systems, in: Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2019, pp. 664-674.
[40] H. Muthukrishnan, D. Nellans, D. Lustig, J.A. Fessler, T.F. Wenisch, Efficient multi-GPU shared memory via automatic optimization of fine-grained transfers, in: Proceedings of the ACM/IEEE International Symposium on Computer Architecture, 2021, pp. 139-152.
[41] B. Li, J. Yin, Y. Zhang, X. Tang, Improving address translation in multi-GPUs via sharing and spilling aware TLB design, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 1154-1168.
[42] B. Li, J. Yin, A. Holey, Y. Zhang, J. Yang, X. Tang, Trans-FW: Short circuiting page table walk in multi-GPU systems via remote forwarding, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2023, pp. 456-470.
[43] B. Li, Y. Guo, Y. Wang, A. Jaleel, J. Yang, X. Tang, IDYLL: Enhancing page translation in multi-GPUs via light weight PTE invalidations, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2023, pp. 1163-1177.
[44] E. Choukse, M.B. Sullivan, M. O'Connor, M. Erez, J. Pool, D. Nellans, Buddy compression: Enabling larger memory for deep learning and HPC workloads on GPUs, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2020, pp. 926-939.
[45] Y. Tan, Z. Bai, D. Liu, Z. Zeng, Y. Gan, A. Ren, X. Chen, K. Zhong, BGS: Accelerate GNN training on multiple GPUs, J. Syst. Archit. 153 (2024) 103162.
[46] X. Ren, M. Lis, CHOPIN: Scalable graphics rendering in multi-GPU systems via parallel image composition, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2021, pp. 709-722.
[47] S. Na, J. Kim, S. Lee, J. Huh, Supporting secure multi-GPU computing with dynamic and batched metadata management, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2024, pp. 204-217.
[48] Y. Feng, S. Na, H. Kim, H. Jeon, Barre chord: Efficient virtual memory translation for multi-chip-module GPUs, in: Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2024, pp. 834-847.
[49] O. Villa, D. Lustig, Z. Yan, E. Bolotin, Y. Fu, N. Chatterjee, Need for speed: Experiences building a trustworthy system-level GPU simulator, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2021, pp. 868-880.
[50] J. Prades, C. Reaño, F. Silla, NGS: A network GPGPU system for orchestrating remote and virtual accelerators, J. Syst. Archit. 151 (2024) 103138.
Gun Ko received the B.S. degree in electrical engineering from Pennsylvania State University in 2017. He is currently pursuing the Ph.D. degree with the Embedded Systems and Computer Architecture Laboratory, School of Electrical and Electronic Engineering, Yonsei University, Seoul, South Korea. His current research interests include GPU memory systems, multi-GPU systems, and virtual memory.

Jiwon Lee received the B.S. and Ph.D. degrees in electrical and electronic engineering from Yonsei University, Seoul, South Korea, in 2018 and 2024, respectively. He currently works in the memory division at Samsung Electronics. His research interests include virtual memory, GPU memory systems, and storage systems.

Hongju Kal received the B.S. degree from Seoul National University of Science and Technology and the Ph.D. degree from the School of Electrical and Electronic Engineering, Yonsei University, Seoul, South Korea, in 2018 and 2024, respectively. He currently works in the memory division at Samsung Electronics. His current research interests include memory architectures, memory hierarchies, near memory processing, and neural network accelerators.

Hyunwuk Lee received his B.S. and Ph.D. degrees in electrical and electronic engineering from Yonsei University, Seoul, Korea, in 2018 and 2024, respectively. He currently works in the memory division at Samsung Electronics. His research interests include neural network accelerators and GPU systems.

Won Woo Ro received the B.S. degree in electrical engineering from Yonsei University, Seoul, South Korea, in 1996, and the M.S. and Ph.D. degrees in electrical engineering from the University of Southern California, in 1999 and 2004, respectively. He worked as a Research Scientist with the Electrical Engineering and Computer Science Department, University of California, Irvine. He currently works as a Professor with the School of Electrical and Electronic Engineering, Yonsei University. Prior to joining Yonsei University, he worked as an Assistant Professor with the Department of Electrical and Computer Engineering, California State University, Northridge. His industry experience includes a college internship with Apple Computer, Inc., and a contract software engineer with ARM, Inc. His current research interests include high-performance microprocessor design, GPU microarchitectures, neural network accelerators, and memory hierarchy design.