Journal of Systems Architecture 160 (2025) 103339

Contents lists available at ScienceDirect
Journal of Systems Architecture
journal homepage: www.elsevier.com/locate/sysarc

REC: Enhancing fine-grained cache coherence protocol in multi-GPU systems

Gun Ko, Jiwon Lee, Hongju Kal, Hyunwuk Lee, Won Woo Ro ∗
Yonsei University, 50 Yonsei-ro Seodaemun-gu, Seoul, 03722, Republic of Korea

ARTICLE INFO

Keywords:
Multi-GPU
Data sharing
Cache coherence
Cache architecture

ABSTRACT

With the increasing demands of modern workloads, multi-GPU systems have emerged as a scalable solution, extending performance beyond the capabilities of single GPUs. However, these systems face significant challenges in managing memory across multiple GPUs, particularly due to the Non-Uniform Memory Access (NUMA) effect, which introduces latency penalties when accessing remote memory. To mitigate NUMA overheads, GPUs typically cache remote memory accesses across multiple levels of the cache hierarchy, which are kept coherent using cache coherence protocols. The traditional GPU bulk-synchronous programming (BSP) model relies on coarse-grained invalidations and cache flushes at kernel boundaries, which are insufficient for the fine-grained communication patterns required by emerging applications. In multi-GPU systems, where NUMA is a major bottleneck, the substantial data movement resulting from bulk cache invalidations exacerbates performance overheads. Recent cache coherence protocols for multi-GPUs enable flexible data sharing through coherence directories that track shared data at a fine-grained level across GPUs. However, these directories are limited in capacity, leading to frequent evictions and unnecessary invalidations, which increase cache misses and degrade performance. To address these challenges, we propose REC, a low-cost architectural solution that enhances the effective tracking capacity of coherence directories by leveraging memory access locality.
REC coalesces multiple tag addresses from remote read requests within common address ranges, reducing directory storage overhead while maintaining fine-grained coherence for writes. Our evaluation on a 4-GPU system shows that REC reduces L2 cache misses by 53.5% and improves overall system performance by 32.7% across a variety of GPU workloads.

∗ Corresponding author. E-mail address: wro@yonsei.ac.kr (W.W. Ro).
https://doi.org/10.1016/j.sysarc.2025.103339
Received 10 September 2024; Received in revised form 27 December 2024; Accepted 5 January 2025; Available online 9 January 2025
1383-7621/© 2025 Published by Elsevier B.V.

1. Introduction

Multi-GPU systems have emerged to meet the growing demands of modern workloads, offering scalable performance beyond what a single GPU can deliver. However, as multi-GPU architectures scale in size and complexity [1,2], managing memory across multiple GPUs becomes increasingly challenging [3–7]. One of the primary challenges arises from the bandwidth discrepancy between local and remote memory, commonly known as the Non-Uniform Memory Access (NUMA) effect [3,4]. To mitigate the NUMA penalty, GPUs generally rely on caching remote memory accesses, allowing them to be served with local bandwidth [5,8–10]. This caching strategy is often extended across multiple levels of the cache hierarchy, including both private on-chip caches and shared caches [3,4,11,12], to better accommodate the diverse access patterns of emerging workloads.

While remote data caching offers significant performance benefits in multi-GPU systems, it also requires extending coherence throughout the cache hierarchy. Conventional GPUs rely on a simple software-inserted bulk-synchronous programming (BSP) model [11], which performs cache invalidation and flush operations at the start and end of each kernel. However, as recent GPU applications increasingly require more frequent and fine-grained communication both within and across kernels [11,13–15], these frequent synchronizations can lead to substantial cache operation and data movement overheads. Additionally, precisely managing the synchronizations places additional burdens on programmers, complicating the optimization of multi-GPU systems.

Ren et al. [11] proposed HMG, a hierarchical cache coherence protocol designed for L2 caches in large-scale multi-GPU systems. HMG employs coherence directories to record cache line addresses and their associated sharers upon receiving remote read requests. Any writes to these addresses trigger invalidations. Once capacity is reached, existing entries are evicted from the directory, triggering invalidation requests to the sharer GPUs. These invalidations are unnecessary, as the corresponding cache lines do not immediately require coherence to be maintained. When GPUs access data across a wide range of addresses, frequent directory insertions lead to a number of unnecessary invalidations for cache lines that have not yet been fully utilized. Subsequent accesses to these cache lines result in cache misses, requiring data to be fetched again over bandwidth-limited inter-GPU links.

Fig. 1. Performance of each caching scheme normalized to a system that enables remote data caching in both L1 and L2 caches using software and hardware coherence protocols, respectively. "No caching" refers to a system that disables remote data caching, simplifying coherence.

Fig. 2. Baseline multi-GPU system. Each GPU has a coherence directory that records and tracks the status of shared data at given addresses along with the corresponding sharer IDs.

To evaluate the implications of the coherence protocol, we measure the performance impact of unnecessary invalidations on a 4-GPU system that caches remote data in both L1 and L2 caches. L1 caches are assumed to be software-managed, while L2 caches are managed under fine-grained invalidation through coherence directories. As Fig. 1 shows, there exists a significant performance opportunity in eliminating unnecessary invalidations caused by frequent directory evictions. Increasing the size of the coherence directory can delay evictions and the corresponding invalidation requests, but at the cost of increased hardware. Our observations indicate that, to eliminate unnecessary invalidations, the size of the coherence directory would need to be substantially increased, accounting for 30.4% of the L2 cache size. As the size of GPU L2 caches continues to grow [16,17], the aggregate storage overhead of coherence directories becomes substantial, causing inefficiency in scaling for multi-GPU environments (discussed in Section 3.3).

In this paper, we propose Range-based Directory Entry Coalescing (REC), an architectural solution that mitigates unnecessary invalidation overhead by increasing the effective tracking capacity of the coherence directory without incurring significant hardware costs. Our key insight is that, since directory updates are performed upon receiving remote read requests, leveraging memory access locality provides an opportunity to coalesce multiple tag addresses of shared data based on their common address range. To achieve this, we employ a coherence directory design that aggregates data from incoming remote reads sharing a common base address within the same address range, storing only the offset and the sharer IDs. We reduce the storage requirements of directory entries by designing them in a base-and-offset format, recording the common high-order bits of addresses and using a bit-vector to indicate the index of each coalesced entry within the target range. For incoming writes, if they are found in the coherence directory, invalidations are propagated only to the corresponding address, maintaining fine-grained coherence in multi-GPU systems.

To summarize, this paper makes the following contributions:

• We identify a performance bottleneck of fine-grained shared data tracking mechanisms in multi-GPU systems. Our analysis demonstrates that such methods generate unnecessary invalidations at coherence directory evictions, which incurs a significant performance bottleneck due to increased cache miss rates.
• We show that simply employing larger coherence directories incurs significant storage overhead. Our analysis shows that the baseline multi-GPU system requires a 12× increase in the directories to eliminate redundant invalidations.
• We propose REC, which increases the effective coverage of the coherence directory by enabling each entry to coalesce and track multiple memory addresses along with the associated sharers. By reducing L2 cache misses by 53.5%, REC improves overall performance by 32.7% on average across our evaluated GPU workloads.

2. Background

2.1. Multi-GPU architecture

The slowdown of transistor scaling has made it increasingly difficult for single GPUs to meet the growing demands of modern workloads. Alternatively, multi-GPU systems have emerged as a viable path forward, offering enhanced performance and memory capacity by leveraging multiple GPUs connected using high-bandwidth interconnects such as PCIe and NVLink [18]. However, these inter-GPU links are likely to have bandwidth that falls far behind the local memory bandwidth [3,4,8]. The NUMA effect that arises from this large bandwidth gap can significantly impact multi-GPU performance, making it crucial to optimize remote access bottlenecks to maximize efficiency.

Fig. 2 illustrates the architectural details of our target multi-GPU system. Each GPU is divided into several shader arrays (SAs), each comprising a number of compute units (CUs). Every CU has its own private L1 vector cache (L1V$), while the L1 scalar cache (L1S$) and L1 instruction cache (L1I$) are shared across all CUs within an SA. Additionally, each GPU contains a larger L2 cache that is shared across all SAs. When a data access misses in the local cache hierarchy, it is forwarded to either local or remote GPU memory, depending on the data location. For local memory accesses, the cache lines are stored in both the shared L2 cache and the L1 cache private to the requesting CU. In the case of remote-GPU memory accesses, the data can be cached either only in the L1 cache of the requesting CU [4,5,8] or in both the L2 and L1 caches [3,11,12]. Caching remote data locally helps mitigate the performance degradation caused by accessing remote memory nodes.

2.2. Remote data caching in multi-GPU

While caching remote data only in the L1 cache can save L2 cache capacity, it limits the sharing of remote data among CUs. As a result, such an approach provides a lower performance gain when unnecessary invalidation overhead is eliminated than its counterpart, as shown in Fig. 1. For this reason, in this study, we assume the baseline multi-GPU architecture allows caching of remote data in both L1 and L2 caches.

A step-by-step process of remote data caching is shown in Fig. 2. Upon generating a memory request, an L1 cache lookup is performed by the requesting CU (1). When the data is not present in the L1, an L2 cache lookup is generated to check if the remote data is cached in the L2 (2). If the data is found in the L2 cache, it is returned to the requesting CU and cached in its local L1 cache. If the data is not found in the L2 cache, the request is forwarded to the remote GPU memory at the given physical address. Subsequently, the requested data is returned at a cache line granularity and cached in both the L1 and L2 caches (3).
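The lookup sequence for steps (1)-(3) can be sketched as a minimal simulation. This is a sketch under assumed names (`GPU`, `Directory`, `remote_read`, and the set-based caches are illustrative, not the simulator's actual API):

```python
# Sketch of the three-step remote-read path described above.
# All class and field names are illustrative assumptions, not MGPUSim's API.

class Directory:
    """Home-GPU coherence directory: cache-line address -> set of sharer GPU IDs."""
    def __init__(self):
        self.sharers = {}

    def record_sharer(self, line_addr, gpu_id):
        self.sharers.setdefault(line_addr, set()).add(gpu_id)

class GPU:
    def __init__(self, gpu_id):
        self.gpu_id = gpu_id
        self.l1 = set()   # cached line addresses (per-CU detail omitted)
        self.l2 = set()
        self.directory = Directory()

def remote_read(requester, home, line_addr):
    """Serve a read by the requester to the home GPU's memory, steps (1)-(3)."""
    if line_addr in requester.l1:        # (1) L1 lookup
        return "L1 hit"
    if line_addr in requester.l2:        # (2) L2 lookup
        requester.l1.add(line_addr)
        return "L2 hit"
    # (3) miss: fetch one cache line from the home GPU's memory and
    # cache it in both L1 and L2; the home directory records the sharer
    requester.l2.add(line_addr)
    requester.l1.add(line_addr)
    home.directory.record_sharer(line_addr, requester.gpu_id)
    return "remote fetch"

gpu0, gpu1 = GPU(0), GPU(1)
print(remote_read(gpu0, gpu1, 0xA))   # remote fetch; GPU1 now tracks GPU0 as a sharer
print(remote_read(gpu0, gpu1, 0xA))   # L1 hit
```

A second read to the same line is served locally, which is exactly the benefit that premature invalidations later destroy.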
At the same time, the coherence directory, which maintains information about data locations across multiple GPUs, is updated with the corresponding entry and the sharer GPU (4). Writes to remote data in the home GPU are also performed in the local L2 cache, following the write-through policy, as the corresponding GPU may access the written data in the future. Remote writes arriving at the home GPU trigger invalidation messages to be sent out to the sharer GPU(s), and the requesting GPU is recorded as a sharer (4).

Fig. 3. Coherence protocol flows in detail. The baseline hardware protocol has two stable states: valid and invalid, with no transient states or acknowledgments required for write permissions.

Fig. 4. L2 cache miss rates in the baseline and an idealized system where no invalidations are propagated by coherence directory evictions. Cold misses are excluded from the results.

2.3. Cache coherence in multi-GPU

Existing hardware protocols, such as GPU-VI [19], employ coherence directories to track sharers (i.e., L1s) and propagate write-initiated cache invalidations within a single GPU. Bringing the notion into multi-GPU environments, Ren et al. proposed HMG [11], a hierarchical design that efficiently manages both intra- and inter-GPU coherence. HMG includes two layers for selecting home nodes to track sharers: (1) the inter-GPU-module (GPM) level that selects a home GPM within a GPU and (2) the inter-GPU level that selects a home GPU across the entire system. A GPM is a chiplet in multi-chip-module GPUs. With this, HMG reduces the complexity of tracking and maintaining coherence across a large number of sharers. HMG also optimizes performance by eliminating all transient states and most invalidation acknowledgments, leveraging the weak memory models of modern GPUs [11].

Each GPU has a coherence directory attached to its L2 cache, managed by the cache controllers. The directory is organized in a set-associative structure, and each entry contains the following fields: tag, sharer IDs, and coherence state. The tag field stores the cache line address for the data copied and fetched by the sharer. The sharer ID field is a bit-vector representing the list of sharers, excluding the home GPU. Each entry is in one of two stable states: valid or invalid. Unlike HMG [11], the baseline coherence directory tracks one cache line per entry. In contrast, a directory entry in HMG is designed to track four cache lines using a single tag address and sharer ID field, which limits its ability to manage each cache line at a fine granularity. Consequently, a write to any address tracked by a directory entry may unnecessarily invalidate other cache lines within the same range, potentially causing inefficiencies in remote data caching. We discuss the importance of reducing unnecessary cache line invalidations in detail in Section 3.1.

Like typical memory allocation in multi-GPU systems, the physical address space is partitioned among the GPUs in the system. Therefore, data at any given physical address is designated to one GPU (i.e., the home GPU), and every access by a remote GPU references the coherence directory of the home GPU. For example, in Fig. 2, GPU0 requests data at address 0xA from GPU1, which is the home GPU; the corresponding entry is then inserted into the directory of GPU1 with the relevant information.

Fig. 3 shows the detailed state transitions and actions initiated by the coherence directory. Note that local and remote refer to the sources of the memory requests received: local refers to accesses from the local CUs, and remote refers to accesses from remote GPUs.

Local reads: Local read requests arriving at the L2 cache are directed to either locally- or remotely-mapped data. On cache hits, the data is returned and guaranteed to be consistent because it is either the most up-to-date data (if mapped to local DRAM) or correctly managed by the protocol (if mapped to a remote GPU). On cache misses, the requests are forwarded to either local DRAM or a remote GPU. In all cases, the directory of the requesting GPU remains unchanged.

Remote reads: For remote reads that arrive at the home GPU, the coherence directory records the ID of the requesting GPU at the given cache line address. If the line is already being tracked (i.e., the entry is found and valid), the directory simply adds the requester to the sharer field and keeps the entry in the valid state. If the line is not being tracked, the directory finds an empty spot to allocate a new entry and marks it as valid. When the directory is full and every entry is valid, it evicts an existing entry and replaces it with the new entry (discussed below).

Local writes: Local writes to data mapped to the home GPU memory look up the directory to find whether a matching entry at the line address exists. If found, invalidations are propagated to the recorded sharers in the background, and the directory entry becomes invalid.

Remote writes: By default, L2 caches use a write-back policy for local writes. As described in Section 2.2, remote writes update both the L2 cache of the requester and local memory, similar to a write-through policy. Consequently, the directory maintains the entry as valid by adding the requester to the sharer list and sends out invalidations to the other sharers recorded in the original entry.

Directory entry eviction/replacement: Coherence directories are implemented in a set-associative structure. Thus, capacity and conflict misses occur as directory lookups are initiated by the read requests continuously received from remote GPUs. To notify the sharers that the information in the evicted entry is no longer traceable, invalidations are sent out as with writes.

Acquire and release: At the start of a kernel, invalidations are performed in L1 caches, as coherence there is maintained using software bulk synchronizations. However, the invalidations are not propagated beyond L1 caches, as L2 caches are kept coherent with the fine-grained directory protocol. Release operations flush dirty data in both L1 and L2 caches.

3. Motivation

In multi-GPU systems, coherence is managed explicitly through cache invalidations to ensure data consistency across multiple GPUs. When invalidation requests are received, sharer GPUs must look up and invalidate the corresponding cache lines. Subsequent accesses to these invalidated cache lines result in cache misses, which are then forwarded to the home GPU. This, in turn, can negate the performance benefits of local caching, as it undermines the effectiveness of caching mechanisms intended to reduce remote access bottlenecks. In this section, we analyze the behavior of cache invalidation and its impact on the overall performance of multi-GPU systems. We identify the sources of invalidation and explore a straightforward solution to mitigate the associated bottlenecks. Our experiments are conducted using MGPUSim [20], a multi-GPU simulation framework that we have extended to support the hardware cache coherence protocol. The detailed configuration is provided in Table 2.

3.1. Impact of cache invalidation

To ensure data consistency across multiple GPUs, invalidation requests are propagated by the home GPU in two cases: (1) when write requests are received and (2) when an entry is evicted from the coherence directory due to capacity and conflict misses. Invalidation requests triggered by writes are crucial for maintaining data consistency, as they ensure that no stale data is accessed in the sharer GPU caches. On the other hand, invalidations generated by directory eviction aim to notify the sharers that the coherence information is no longer traceable, even if the data is still valid. A detailed background on the protocol flows with invalidations is given in Section 2.3.

Broadcasting invalidations does not significantly impact cache efficiency if the cache lines are already evicted or no longer in use. However, when applications exhibit frequent remote memory accesses, the generation of new directory entries increases invalidation requests from eviction, invalidating the associated cache lines prematurely. These premature invalidations lead to higher cache miss rates, as subsequent accesses to the invalidated cache lines result in misses. As remote data misses exacerbate NUMA overheads, they need to be reduced to improve multi-GPU performance.

Fig. 4 shows the impact on the cache miss rate of eliminating unnecessary invalidations across the benchmarks listed in Table 3, running on a 4-GPU system. The figure demonstrates that the baseline system experiences a cache miss rate more than double (average 2.4×) that of the idealized system without the unnecessary invalidations. This increase is mainly due to frequent invalidation requests, which prematurely invalidate cache lines before they can be fully utilized, leading to an increase in the number of remote memory accesses. The result strongly motivates us to further study the source of these frequent invalidations to improve the efficiency of remote data caching in multi-GPU systems.

To demonstrate the performance opportunity, Fig. 1 presents a study showing the performance of idealized caching without the invalidation overhead. With no invalidations to unmodified cache lines, remote data can be fully utilized as needed until they are naturally replaced by the typical cache replacement policy. The performance of the baseline and ideal systems is represented in the first and fourth bars, respectively, in Fig. 1. The result shows that an ideal system with no unnecessary cache invalidation overheads outperforms the baseline by up to 2.79× (average 36.9%). As demonstrated by Figs. 1 and 4, reducing premature cache invalidations is crucial to improving the efficiency of remote data caching in multi-GPU systems.

Fig. 5. Fraction of evict-initiated and write-initiated invalidations in the baseline multi-GPU system. The results are based on invalidation requests that hit in the sharer-side L2 caches.

Fig. 6. Performance impact of increasing coherence directory sizes. To eliminate unnecessary invalidations, GPUs require a directory size up to 12× larger than the baseline.

3.2. Source of premature invalidation

As described in Section 2.3, when a coherence directory becomes full, the GPU needs to evict an old entry and replace it with a new one upon receiving a remote read request; an invalidation request must be sent out to the sharer(s) in the evicted entry. Fig. 5 shows the distribution of invalidations triggered by directory eviction and write requests, referred to as evict-initiated and write-initiated invalidations, respectively. The measurements are taken based on the invalidations that hit in the sharer-side L2 caches after receiving the requests. We observe that a significant amount of invalidations (average 79.5%) are performed by the requests from directory evictions in the home GPUs. These invalidations, considered unnecessary as they do not require immediate action, should be delayed until remote GPUs have full use of the data.

We also show the percentage of write-initiated invalidations in Fig. 5. One can observe that applications such as FIR, LU, and MM2 experience a significant number of invalidations due to write requests. These workloads exhibit fine-grained communication within and across dependent kernels, necessitating the invalidation of corresponding cache lines in the remote L2 cache upon any modification to the shared data. Although these applications exhibit a high percentage of write-initiated invalidations, their impact on cache miss rates may be negligible if the GPUs do not subsequently require access to the invalidated cache lines. Nonetheless, the results from Fig. 4 clearly demonstrate the importance of minimizing unnecessary cache invalidations.

So far, we have discussed how prematurely invalidating remote data leads to increased cache miss rates, which negatively impacts multi-GPU performance. We also show that a large fraction of invalidation requests stems from directory evictions, which frequently occur due to the high volume of remote accesses. These accesses trigger numerous directory updates, overwhelming the baseline coherence directory's capacity to effectively manage coherence. A straightforward solution to mitigate premature invalidations is to increase the size of the coherence directory, providing more coverage to track sharers and reducing eviction rates. In the following section, we analyze the performance impact of larger coherence directories. It is important to note that this paper primarily focuses on delaying invalidations caused by directory evictions, as write-initiated invalidations are necessary and must be performed immediately for correctness.

3.3. Increasing directory sizes

A simple approach to delay directory evictions, thereby minimizing premature invalidations, is to increase the size of coherence directories. Limited directory sizes lead to significant evict-initiated invalidations, which can undermine the performance benefits of local caching. To quantify the benefits of larger directories, we conduct a quantitative analysis of performance improvements with increasing directory sizes. In our simulated 4-GPU system, each GPU has an L2 cache size of 2 MB, with each cache line being 64B. Each coherence directory tracks the identity of all sharers excluding the home GPU (i.e., three GPUs). To cover the entire L2 cache space for three GPUs, an ideal coherence directory would require approximately 96K entries, or about 12× the baseline 8K entries.

Fig. 7. Average performance improvement per increased directory storage in the baseline coherence directory design. The results are normalized to the system with the 8K-entry coherence directory.

Fig. 6 illustrates the normalized performance for increasing the directory sizes by 2×–12× the baseline. With an ideal directory size, unnecessary invalidations from directory evictions can be eliminated, leaving only write-initiated invalidations. The results show that applications exhibit significant performance gains as the directory size increases, with some benchmarks (e.g., ATAX, PR, and ST) requiring 8×–12× the baseline size to achieve the highest speed-up. Specifically, benchmarks such as PR and ST show irregular memory access patterns that span a wide address range, leading to higher chances of conflict misses when updating coherence directories. Most other tested benchmarks require up to six times the baseline directory size to achieve maximum attainable performance; the average speedup with six times the size is 1.35×.

Each entry in the coherence directory comprises a tag, sharer list, and coherence state. We assume 48 bits for tag addresses, a 3-bit vector for tracking sharers, and one bit for the directory entry state; thus, each entry requires a total of 52 bits of storage. Our baseline directory implementation has 8K entries and occupies approximately 2.5% of the L2 cache [11]. Therefore, the storage cost of the baseline directory in each GPU is 52 × 8192 / 8 / 1024 = 52 kB, assuming 8 bits per byte and 1024 bytes per kilobyte. From our observation in Fig. 6, applications require directory sizes from 6× up to 12× the baseline to achieve maximum performance. This corresponds to a total storage cost of 312–624 kB, which is an additional 15.2–30.4% of the L2 cache size. While increasing directory size can significantly improve performance, the associated hardware costs are substantial.

To show the inefficiency of simply scaling directory sizes, we calculate the performance per storage using the results in Fig. 6 and the number of directory entries. Fig. 7 illustrates the results relative to the baseline with 8K entries, showing that performance improvements per increased storage do not scale proportionally with larger coherence directories. Additionally, since GPU applications require different directory sizes to achieve maximum performance, simply increasing the directory size is not an efficient solution. Moreover, as GPU L2 caches continue to grow [16,17], the cost of maintaining proportionally larger coherence directories will only amplify these overheads. Therefore, improving coherence directory coverage without significant storage overhead motivates the need for more efficient fine-grained hardware protocols in multi-GPU systems.

Fig. 8. A high-level overview of (a) baseline and (b) proposed REC architecture with simplified 2-entry coherence directories. The figure illustrates a scenario where GPU1 accesses memory of GPU0 in order of 0x1000, 0x1040, 0x1080, and 0x1000 by each CU. In the baseline directory, the entry that tracks the status of data at 0x1000 is evicted for recording the address 0x1080. The proposed directory coalesces three addresses with the same base address into one entry.

4. REC architecture

4.1. Hardware overview

As shown in Section 3.2, a significant fraction of cache invalidations is generated by frequent directory evictions. These invalidations lead to increased cache misses, as data is prematurely invalidated from the cache, requiring subsequent accesses to fetch the data from remote memory. While simply increasing the directory size can address this bottleneck, the associated cost of hardware can become substantial. To address this, we propose REC, an architectural solution that compresses remote GPU access information, retaining as much data as possible before eviction occurs. It aggregates data from incoming remote read requests so that (1) multiple reads to the same address range share a common base address, storing only the offset and source GPU information, and (2) the coalescing process does not result in any loss of information, maintaining the accuracy of the coherence protocol. We now discuss the design overview of REC and the details of the associated hardware components.

Fig. 8(a) shows how the baseline GPU handles a sequence of incoming read requests. The cache controller records the tag addresses and the corresponding sharer IDs in the order that the requests arrive. When the coherence directory reaches its capacity, the cache controller follows a typical FIFO policy to replace the oldest entry with a new one within the set. Once an entry is evicted, the information it held can no longer be tracked, triggering an invalidation request to be sent to the GPU listed in the entry. Upon receiving this request, the sharer GPU checks its L2 cache and invalidates the corresponding cache line, leading to a cache miss on any subsequent access to the cache line.

To delay invalidations caused by directory evictions without significant hardware overhead, we introduce the REC architecture, which enhances the baseline coherence directory by leveraging spatial locality to merge multiple addresses into a single entry. As illustrated in Fig. 8(b), REC stores tag addresses with common high-order bits as a single entry using a base-plus-offset format.
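The base-plus-offset coalescing just described, applied to the Fig. 8(b) access sequence, can be sketched as follows. This is a sketch under stated assumptions: the `RECDirectory` class and its dictionary layout are illustrative, not the hardware organization; only the range arithmetic (1 kB ranges, 64B lines) follows the text.

```python
# Sketch of REC's entry coalescing for a 1 kB range and 64B cache lines.
# The base address is the tag right-shifted by 10 bits; each 64B line inside
# the range occupies one offset slot (0..15). Names are illustrative.

RANGE_BITS = 10   # 2**10 = 1 kB coalescing range
LINE_BITS = 6     # 2**6  = 64B cache line

class RECDirectory:
    def __init__(self):
        self.entries = {}   # base address -> {offset slot -> set of sharer GPU IDs}

    def remote_read(self, addr, gpu_id):
        base = addr >> RANGE_BITS
        slot = (addr % (1 << RANGE_BITS)) >> LINE_BITS   # position within range
        self.entries.setdefault(base, {}).setdefault(slot, set()).add(gpu_id)

rec = RECDirectory()
for addr in (0x1000, 0x1040, 0x1080):   # the Fig. 8(b) access sequence by GPU1
    rec.remote_read(addr, gpu_id=1)

# One coalesced entry tracks all three lines instead of three full tags.
print(len(rec.entries))                    # 1
print(sorted(rec.entries[0x1000 >> 10]))   # [0, 1, 2]
```

Because all three addresses share the base 0x1000 >> 10, the baseline's capacity eviction in Fig. 8(a) never occurs here: the third access adds an offset instead of displacing an entry.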
When a new read request matches the base address in an existing entry, the offset and sharer information are appended to that entry, reducing the need for additional entries and delaying evictions. The base address represents the shared high-order bits, covering a range of addresses and reducing the storage required compared to storing full tag addresses individually. Additionally, REC uses position bits to efficiently track multiple addresses within the specified range, further minimizing storage overhead.

This work aims to enhance coherence directory coverage while avoiding significant hardware overhead, overall reducing unnecessary cache invalidations in multi-GPU systems. We introduce REC, an architecture that coalesces directory entries by leveraging the spatial locality in memory accesses observed in GPU workloads. In this section, we provide an overview of the REC design and discuss its integration with existing multi-GPU coherence protocols.

Table 1
Trade-offs between addressable range and storage for each entry. Note that one valid bit, not shown in the table, is included in the overall calculation.
Addressable range      64B    128B   256B   1 kB    4 kB
Base address bits      48     41     40     38      36
Position/Sharer bits   −/3    2/6    4/12   16/48   64/192
Total bits per entry   52     50     57     103     293

Table 2
Baseline GPU configuration.
Parameter             Configuration
Number of SAs         16
Number of CUs         4 per SA
L1 vector cache       1 per CU, 16 kB 4-way
L1 inst cache         1 per SA, 32 kB 4-way
L1 scalar cache       1 per SA, 16 kB 4-way
L2 cache              2 MB 16-way, 16 banks, write-back
Cache line size       64B
Coherence directory   8K entries, 8-way
DRAM capacity         4 GB HBM, 16 banks
DRAM bandwidth        1 TB/s
Inter-GPU bandwidth   300 GB/s, bi-directional

Table 3
Tested workloads.
Benchmark                                          Abbr.   Memory footprint
Matrix transpose and vector multiplication [21]    ATAX    128 MB
2-D convolution [21]                               C2D     512 MB
Finite impulse response [22]                       FIR     128 MB
Matrix-multiply [21]                               GEMM    128 MB
Vector multiplication and matrix addition [21]     GEMV    256 MB
2-D jacobi solver [21]                             J2D     128 MB
LU decomposition [21]                              LU      128 MB
2 matrix multiplications [21]                      MM2     128 MB
3 matrix multiplications [21]                      MM3     64 MB
PageRank [22]                                      PR      256 MB
Simple convolution [23]                            SC      512 MB
Stencil 2D [24]                                    ST      128 MB

Fig. 10. Overview of the REC protocol flows. In the example coherence directory, entry insertion and offset addition operations are highlighted in blue, while eviction and offset deletion operations are shown in red.

We assume the baseline directory configuration of HMG [11] for comparing the storage costs. REC designs with larger addressable ranges can benefit from increased directory coverage, but at the cost of storage. In the evaluation of this paper, we tested various addressable ranges for REC. Each design is configured to coalesce the maximum number of offsets within its specified range. Later in the results, we confirm that a 1 kB coalesceable range offers the best trade-off, balancing reasonable size overhead per entry with the ability to coalesce a significant number of entries before evictions occur (discussed in Section 5.2).

Based on these findings, the format of a directory entry is as illustrated in Fig. 9. Each entry comprises a base address, coalesced entries, and a valid bit. When the first remote read request arrives at the home GPU, the cache controller sets the base address by right-shifting the tag address by the number of bits needed to represent the offset within the specified range. For a 48-bit tag, the address is right-shifted by 10 bits (considering a 64B-aligned 1 kB range), and the resulting bits from positions 64 to 101 are used to store the base address. The coalesced entry is identified using the offset within the 1 kB range, represented by a position bit, followed by three bits for recording the sharers. The position bit is calculated as:

p = ((Tag mod m) / 64) × (n + 1)

where m denotes the coalescing range and n is the number of sharers, which are set to 1 kB and 3, respectively.
Once the position is determined, the corresponding position and the sharer bit are set to 1 using bitwise OR operation. Given that the 1 kB range allows each entry to record up to 16 individual tag addresses, we use the lower 64 Fig. 9. Coherence directory entry structure for 64B cache lines. In our design, each bits to store the coalesced entries. Furthermore, the position bit can entry stores up to 16 coalesced entries based on 1 kB range. also function as the valid bit for each coalesced entry, meaning only one valid bit is necessary to indicate whether the entire entry is valid or not. Determining the address range within which REC coalesces entries is one of the key design considerations, as it directly impacts the number 4.2. REC protocol flows of bits required for each entry. Table 1 shows a list of design choices for implementing REC with varying addressable ranges and their potential The baseline coherence protocol operates with two stable states- trade-offs. The number of required base address bits is calculated using valid and invalid-allowing it to remain lightweight and efficient. In 2n = addressable_range, where n is the number of bits right-shifted our proposed coherence directory design, each entry represents the from the original tag address. Also, the number of required position validity of an entire address range instead of tracking individual tag bits is determined by the maximum number of coalesceable cache line addresses and associated sharers. This enables the state transitions addresses within the target range, assuming 64B line size. Then, the to be managed at a coarser granularity during directory evictions. number of sharer bits required is (n-1)×num_position_bits, where n is Additionally, REC supports fine-grained control over write requests by the number of GPUs. 
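Before walking through the protocol flows, the entry-format arithmetic above can be made concrete. The sketch below (Python, illustrative only; not the authors' implementation) reproduces the base-address shift, the position-bit formula p = ((Tag mod m) / 64) × (n + 1), and the per-entry bit count of the 1 kB configuration in Table 1.

```python
# Illustrative model of REC's entry-format arithmetic (Section 4.1).
# Parameters follow the paper: 1 kB coalescing range (m), 64B cache lines,
# and n = 3 sharer bits per offset (a 4-GPU system).
M_RANGE = 1024               # coalescing range m, in bytes
LINE = 64                    # cache line size, in bytes
N_SHARERS = 3                # sharer bits per coalesced offset
SLOT = 1 + N_SHARERS         # one position bit plus the sharer bits

def base_address(tag: int) -> int:
    """Right-shift the tag by log2(m) = 10 bits to obtain the base."""
    return tag >> (M_RANGE.bit_length() - 1)

def position_bit(tag: int) -> int:
    """p = ((Tag mod m) / 64) * (n + 1), per the formula in Section 4.1."""
    return ((tag % M_RANGE) // LINE) * SLOT

# Paper's example: offset 0x340 is the 14th line of its 1 kB range,
# so the position bit is 52 and GPU1's sharer bit is the next bit, 53.
assert position_bit(0x340) == 52

# Per-entry storage for the 1 kB design: 38-bit base + 16 slots x 4 bits + 1 valid,
# consistent with the 103 kB total (8192 entries x 103 bits / 8 / 1024).
entry_bits = 38 + (M_RANGE // LINE) * SLOT + 1
assert entry_bits == 103
```

Setting both the position bit and the matching sharer bit with a single bitwise OR is what lets the controller treat insertion and sharer addition uniformly, as described in the flows below.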
Remote reads: When a GPU receives a read request from a remote GPU, the cache controller extracts the base and offset from the tag address (A). The controller then looks up the coherence directory for an entry with a matching base address (B). If a valid entry is found, the position bit corresponding to the offset, calculated using the formula in Section 4.1, and the associated sharer bit are set (C). For example, for an offset of 0x340 within the range, the position bit is 0x340/64 × 4 = 52, representing the 14th cache line within the specified 1 kB range. The sharer bit is determined by the source GPU index (e.g., GPU1); therefore, bits 52 and 53 are set to 1. The position bit may already be set; nevertheless, the controller still performs a bitwise OR on the bits at the corresponding positions. Since the entry already exists in the directory, it remains valid. Otherwise, if no valid entry is found, a new entry is created with the base address, and the position and sharer bits are set. With the insertion of a new entry, the state transitions from invalid to valid.

Local writes: When a write request is performed locally (D), the cache controller must determine whether it needs to send invalidation requests to the sharers that hold a copy of the data. For this, the controller again looks up the directory with the calculated base address and offset (E). If an entry is found and the offset is valid (i.e., the position bit is set), an invalidation request is generated and propagated to the recorded sharers immediately (F). The state transition is handled differently based on two conditions. First, when another offset is tracked under the common address range, the directory entry should remain valid; thus, the controller clears only the position and sharer bits for the specific offset of the target address. For example, in Fig. 10, the directory entry has another offset (at p = 56) recorded under the same base address. Once the invalidation request is sent to GPU1, the controller clears only bits 0 and 1. If the cleared bits are the last ones, the entire directory entry transitions to an invalid state to make room for new entries.

Remote writes: For a remote write request, the cache controller begins the same directory lookup process by calculating the base and offset from the tag (G). In our target multi-GPU system, the source GPU also performs writes to the copy of the data in its local L2 cache (discussed in Section 2.2); therefore, the controller handles remote write requests differently from local writes. When an entry already exists in the directory (i.e., a hit), there are two possible circumstances: (1) the target offset is invalid but the entry has other valid offsets, and (2) the target offset is already valid and one or more sharers are being tracked. If the target offset is invalid, the controller simply adds the offset and the sharer to the entry in the same way it handles remote reads. If the offset is valid, the controller adds the source GPU to the sharer list by setting its corresponding bit and clearing the other sharer bits (H), then sends invalidation requests to all other sharers (I). In Fig. 10, the entry and the target offset (at p = 56) are both already recorded. The controller thus additionally sets bit 58 to add GPU2 as a sharer, clears bit 59, and sends the invalidation request to GPU3. In either case, the directory entry remains valid. When the directory misses, the cache controller allocates a new entry to record the base, offset, and sharer from the write request, and the entry state transitions to valid.

Directory entry eviction/replacement: When the coherence directory becomes full, it needs to replace an entry with a newly inserted one. The baseline coherence directory uses a FIFO replacement policy; however, for workloads that exhibit irregular memory access patterns, capturing locality becomes a challenge. To address this, REC adopts a replacement policy similar to LRU to better retain entries that are more likely to be accessed again. When the cache controller receives a remote read request and does not find an entry with a matching base address (J), it determines an entry for replacement (K). The evicted entry is then replaced with the new entry from the incoming request (L). Meanwhile, the controller retrieves the base address and every merged offset from the evicted entry and reconstructs the original tag addresses. Invalidation requests are propagated to every recorded sharer associated with each tag address (M). Lastly, the entry transitions to an invalid state.

4.3. Discussion

Overheads: In our design, the coherence directory consists of 8K entries, with each entry covering a 1 kB range of addresses. Each entry comprises a 38-bit base address field, a 64-bit vector for offsets and sharers, and a valid bit (detailed in Table 1). Thus, the total directory size is 8192 × 103 bits / 8 / 1024 = 103 kB. We also estimate the area and power overhead of the coherence directory in REC using CACTI 7.0 [25]. The results show that the directory occupies 3.94% of the area and consumes 3.28% of the power of the GPU L2 cache. REC requires no additional hardware extensions for managing the coherence directory; the existing cache controller handles operations such as base address calculation and bitwise manipulation efficiently.

Comparison to prior work: As discussed in Section 2.3, HMG [11] designs each coherence directory entry to track four cache lines at a coarse granularity. We empirically show, in Section 3.3, that GPUs require a directory size of up to 12× the baseline to eliminate unnecessary cache line invalidations. Since REC coalesces up to 16 consecutive cache line addresses per entry, REC can track a significantly larger number of cache lines than the prior work. Moreover, REC precisely tracks each address by storing the offset and sharer information. Thus, REC fully supports fine-grained management of cache lines under write operations.

Scalability: REC requires modifications to its design in large-scale systems, specifically to the sharer bit field. For an 8-GPU system, REC requires (8-1) × 16 = 112 bits to record sharers in each entry. The size of each entry then becomes 112 + 38 + 16 + 1 = 167 bits, which is approximately three times the baseline entry size of 56 bits (itself including a 4-bit increase for sharers). Similarly, for a 16-GPU system, REC requires 295 bits per entry, roughly five times the baseline size. However, as observed in Section 3.3, an ideal GPU requires up to 12 times the baseline directory size even in a 4-GPU system, implying that simply increasing the baseline directory size is insufficient to meet scalability demands.

5. Evaluation

5.1. Methodology

We use MGPUSim [20], a cycle-accurate multi-GPU simulator, to model the baseline and REC architectures with four AMD GPUs connected using inter-GPU links of 300 GB/s bandwidth [26]. The configuration of the modeled GPU architecture is detailed in Table 2. Each GPU includes L1 scalar and instruction caches shared within each SA, while the L1 vector cache is private to each CU and the L2 cache is shared across the GPU. We extend remote data caching to the L2 caches, allowing data from any GPU in the system to be cached in the L2 cache of any other GPU. Since MGPUSim does not include support for hardware cache coherence, we extend the simulator by implementing a coherence directory managed by the L2 cache controller. The coherence directory is implemented with a set-associative structure to reduce lookup latency. Since the baseline coherence directory is decoupled from the caches, its associativity as well as its size can be scaled independently. In our evaluation, the coherence directory is designed with an 8-way set-associative structure to reduce conflict misses, containing 8K entries in both the baseline and REC architectures. Upon receiving remote read requests, the cache controller updates the coherence directory, recording the addresses and the associated sharers. Once the capacity of the directory is reached, the cache controller evicts an entry and sends invalidation requests to the recorded sharers. Upon receiving write requests, the controller looks up the directory to find whether data with matching addresses are shared by remote GPUs. If matching entries are found, invalidation requests are propagated to the sharers except the source GPU. Additionally, since the L2 caches are managed by coherence directories, acquire operations do not perform invalidations on the L2 caches, but release operations flush the L2 caches. We use workloads from a diverse set of benchmark suites, including AMDAPPSDK [23], Heteromark [22], Polybench [21], and SHOC [24]. Table 3 lists the workloads with their memory footprints.

Fig. 11. Performance comparison of the baseline with double-sized coherence directory, HMG [11], REC, and an idealized system with zero unnecessary invalidations. Performance is normalized to the baseline with 8K-entry coherence directory.

Fig. 12. Number of coalesced cache line addresses at directory entry eviction under REC with varying addressable ranges. REC in this work coalesces with a 1 kB addressable range.
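The directory actions of Section 4.2 can be condensed into a toy model. The sketch below (Python; our reading of the flows, not the simulator code) mimics remote-read insertion (A-C), fine-grained local-write invalidation (D-F), and the tag reconstruction performed on eviction (M). Sharer slots 1-3 stand for the three remote GPUs as seen from the home GPU.

```python
# Toy model of the REC directory actions in Section 4.2 (illustrative only).
M_RANGE, LINE, SLOT = 1024, 64, 4      # 1 kB range, 64B lines, 1+3 bits per offset

class RecDirectory:
    def __init__(self):
        self.entries = {}              # base address -> 64-bit offset/sharer vector

    def _pos(self, addr):
        return ((addr % M_RANGE) // LINE) * SLOT

    def remote_read(self, addr, sharer):
        """Steps A-C: OR the position bit and the sharer bit into the entry."""
        base, p = addr // M_RANGE, self._pos(addr)
        self.entries[base] = self.entries.get(base, 0) | (1 << p) | (1 << (p + sharer))

    def local_write(self, addr):
        """Steps D-F: invalidate sharers of one offset; keep other offsets valid."""
        base, p = addr // M_RANGE, self._pos(addr)
        vec = self.entries.get(base, 0)
        if not vec & (1 << p):
            return []                                  # offset not tracked
        sharers = [s for s in range(1, SLOT) if vec & (1 << (p + s))]
        vec &= ~(((1 << SLOT) - 1) << p)               # clear this offset only
        if vec:
            self.entries[base] = vec                   # entry stays valid
        else:
            del self.entries[base]                     # last offset: entry invalid
        return sharers

    def evict(self, base):
        """Step M: rebuild every coalesced tag address before invalidating."""
        vec = self.entries.pop(base)
        return [base * M_RANGE + slot * LINE
                for slot in range(M_RANGE // LINE) if vec & (1 << (slot * SLOT))]

d = RecDirectory()
d.remote_read(0x340, 1)            # GPU1 reads: bits 52 and 53 are set
d.remote_read(0x380, 2)            # GPU2 reads the next line: bits 56 and 58
assert d.local_write(0x340) == [1]  # invalidate only GPU1; entry remains valid
assert d.evict(0) == [0x380]        # eviction reconstructs the remaining tag
```

Note how a local write clears a single 4-bit slot rather than the whole entry, which is exactly the fine-grained behavior that distinguishes REC from entry-granularity invalidation.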
5.2. Performance analysis

Fig. 11 shows the performance of the baseline with a coherence directory of double the size, HMG [11], REC, and an ideal multi-GPU system with zero unnecessary invalidations, all relative to the baseline. First, we include the baseline with a double-sized coherence directory to compare REC against the same storage cost. The results show that the baseline with double the directory size achieves an average speedup of 7.3%. The baseline coherence directory tracks each remote access individually, on a per-entry basis; as discussed in Section 3.3, doubling the size of the coherence directory does not mitigate the unnecessary cache line invalidations for applications with significant directory evictions.

The results also show that HMG and REC achieve average speedups of 16.7% and 32.7% across the evaluated workloads. We observe that REC outperforms the prior scheme for two reasons. First, REC delays directory evictions by allowing each entry to record more cache line addresses over a wider range. Since HMG uses each directory entry to track four cache lines, an entire coherence directory can track up to 4× as many cache lines as the baseline, whereas the directory in REC can record up to 16× as many. Second, REC manages write operations to shared cache lines at a fine granularity by searching the directory with exact addresses and sharers, propagating invalidations only when necessary. Since each directory entry of HMG stores only a single address and a single sharer ID field covering four cache lines, a write to any of these cache lines triggers invalidation requests for every covered cache line and recorded sharer, producing false positives. In contrast, REC does not allow any false positives and performs invalidations only for the modified cache lines and the associated sharers. As a result, REC reduces unnecessary invalidations of cache lines that are actively being accessed by the requesting GPUs, minimizing redundant remote memory accesses. This, in turn, delays the replacement of useful cache lines, thereby improving cache efficiency.

L2 cache misses: The performance improvement of REC is largely attributed to the reduction in cache misses caused by unnecessary invalidations from frequent evictions in the coherence directories of home GPUs. Fig. 13 shows the total number of L2 cache misses in the baseline with a double-sized directory, HMG, and REC relative to the baseline; cold misses are excluded from the results. We observe that REC reduces L2 cache misses by 53.5%. In contrast, the baseline with a double-sized directory and HMG experience 1.79× and 1.40× more cache misses than REC, since neither approach sufficiently delays evict-initiated cache line invalidations. The result is closely related to the reduction in remote access latency, as the corresponding misses are forwarded to the remote GPUs; addressing the remote GPU access bottleneck is performance-critical in multi-GPU systems.

Fig. 13. Total number of L2 cache misses in the baseline with double-sized coherence directory, HMG [11], and REC relative to the baseline.

To investigate the effectiveness of REC under the different addressable ranges listed in Table 1, we also measure the number of coalesced cache line addresses when an entry is evicted and plot the results in Fig. 12. We observe that directory entries capture an average of 1.8, 3.4, 12.9, and 54.7 addresses before eviction under REC with 128B, 256B, 1 kB, and 4 kB coalesceable ranges, respectively. Specifically, REC captures more than 14 addresses before directory eviction for applications with strong spatial locality. Fig. 12 also illustrates the limited locality of certain workloads, where REC benefits less. In ATAX, PR, and ST, REC coalesces 3.9, 6.1, and 5.8 addresses, respectively. This is because these applications exhibit locality that is challenging to capture due to irregular memory access patterns spanning a wide range of addresses. To delay the eviction of entries in irregular workloads, we design our proposed coherence directory with an LRU-like replacement policy (discussed in Section 4.2). Another interesting observation is that the performance improvement of GEMV with REC is higher than the improvement seen when eliminating unnecessary invalidations. Our approach delays invalidations but still performs them when the directories become full; during cache line replacement, the controller prioritizes invalid cache lines before applying the LRU policy.

Unnecessary invalidations: In the baseline, invalidation requests propagated by frequent directory evictions in the home GPU have a high chance of finding the corresponding cache lines still valid in the sharer-side L2 caches. This results in premature invalidations of cache lines that are actively in use, exacerbating the cache miss rate. In REC, the invalidation requests generated by directory evictions have a much lower chance of invalidating valid cache lines. Fig. 14 shows that the number of unnecessary invalidations performed in remote L2 caches (i.e., those that hit) is reduced by 84.4%. Since REC significantly delays evict-initiated invalidation requests, many cache lines have already been evicted from the caches by the time these requests are issued.

Inter-GPU transactions: The reduction in unnecessary invalidations enhances the utilization of data within the sharer GPUs and minimizes redundant accesses over inter-GPU links. Fig. 14 also shows the total number of inter-GPU transactions compared to the baseline. As illustrated, REC reduces inter-GPU transactions by an average of 34.9%. The reduced inter-GPU traffic contributes directly to the overall performance improvement in multi-GPU systems.

Fig. 14. Total number of unnecessary invalidations (bars) and inter-GPU transactions (plots) relative to the baseline.

Bandwidth impact: Fig. 15 shows the total inter-GPU bandwidth consumed by invalidation requests. As presented in Section 3.2, a large fraction of invalidation requests are propagated due to frequent directory evictions. Since REC delays invalidation requests from directory evictions by allowing each entry to coalesce multiple tag addresses, this bandwidth drops to only a few gigabytes per second in most of the workloads.

Fig. 15. Total bandwidth consumption of invalidation requests.

Cache lookup latency: Fig. 16 illustrates the average L2 cache lookup latency of REC normalized to the baseline. The results show that the lookup latency is reduced by 14.8% compared to the baseline. REC affects the average lookup latency because evict-initiated invalidation requests are propagated in bursts; however, since REC significantly delays directory evictions by coalescing multiple tag addresses, the overall latency decreases for most of the evaluated workloads.

Fig. 16. L2 cache lookup latency.

Fig. 17. Performance of REC under varying (a) coalescing address ranges and (b) number of directory entries. Results are shown relative to the baseline with an 8K-entry coherence directory.

Fig. 18. Performance comparison of REC using FIFO and LRU replacement policies. Performance is normalized to the baseline coherence directory with FIFO policy.

Fig. 19. Performance impact of different L2 cache sizes in the baseline and REC. Performance is normalized to the baseline with 2 MB L2 cache.
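The benefit of the LRU-like retention argued above can be illustrated with a minimal replacement-policy comparison (Python; a generic sketch, not the controller's actual logic): under FIFO, a re-referenced base address is still evicted first, while an LRU-like policy retains it.

```python
# Minimal FIFO vs. LRU-like eviction comparison for one directory set
# (illustrative only; the paper's controller internals are not modeled here).
from collections import OrderedDict

def eviction_order(policy, capacity, accesses):
    """Simulate one directory set and return the bases evicted, in order."""
    d, evicted = OrderedDict(), []
    for base in accesses:
        if base in d:
            if policy == "lru":
                d.move_to_end(base)            # refresh recency on a hit
            continue                           # FIFO ignores hits entirely
        if len(d) == capacity:
            victim, _ = d.popitem(last=False)  # evict the oldest entry
            evicted.append(victim)
        d[base] = True
    return evicted

accesses = ["A", "B", "A", "C", "D"]           # base "A" is re-referenced
assert eviction_order("fifo", 2, accesses) == ["A", "B"]  # FIFO drops A first anyway
assert eviction_order("lru",  2, accesses) == ["B", "A"]  # LRU-like keeps A longer
```

For regular streaming access patterns the two policies evict in the same order, which matches the observation in Section 5.3 that the policy choice matters mainly for irregular workloads such as ATAX, PR, and ST.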
5.3. Sensitivity analysis

Coalescing range: One important design decision in optimizing REC is determining the range over which to coalesce when remote read requests are received. As discussed in Section 4.1, a trade-off exists between the range an entry coalesces and the number of bits required: the larger the range, the more bits are needed to store the remote GPU access information. Fig. 17(a) shows that the performance of REC improves as the coalescing range increases, with the gains beginning to saturate at 1 kB. For our applications, a 1 kB range is sufficient to capture the majority of the memory access locality within the workloads. Since coalescing beyond 4 kB incurs excessive per-entry storage overhead (with 4 kB already requiring nearly 6× the baseline entry size), the potential performance improvement may not be substantial enough to offset the additional cost. Therefore, we choose a 1 kB range for our implementation.

Entry size: In our evaluation, we use a directory size of 8K entries to match the baseline coherence directory. Fig. 17(b) shows the performance of REC with varying numbers of entries, ranging from 2K to 32K. On average, REC outperforms the baseline even with fewer entries than the baseline system's 8K-entry coherence directory. This is because the coverage of each coherence directory entry in REC can increase by up to 16× when locality is fully utilized. Although applications with limited locality show performance improvements as the directory size increases, these gains are relatively modest when considered against the additional hardware costs.

FIFO replacement: Fig. 18 presents the performance of REC with a FIFO replacement policy. Our evaluation shows that the choice of replacement policy has a relatively small impact on overall performance. For workloads with regular and more predictable memory access patterns, the FIFO replacement policy is already effective in coalescing a sufficient number of addresses within the target ranges (shown in Fig. 12). However, for some applications, such as ATAX, PR, and ST, performance is lower with FIFO due to their limited locality patterns. These applications, therefore, benefit from using an LRU-like replacement policy.

L2 cache size: The performance impact of different L2 cache sizes is shown in Fig. 19. The results are normalized to the baseline with a 2 MB L2 cache. The benefits of increasing L2 cache capacity are limited by the baseline coherence directory. In contrast, the performance of REC improves as the L2 cache size increases, demonstrating its ability to leverage larger caches effectively. Another observation is that the performance improvement with a smaller L2 capacity is less significant compared to larger L2 caches. This is because the coverage of the baseline coherence directory relatively increases as the L2 cache size decreases. To further explore the performance sensitivity to different L2 cache sizes, we evaluate REC in systems with L2 cache sizes of 0.5 MB and 8 MB. We find that REC achieves average performance improvements of 6.3% and 26.7% compared to the baseline with 0.5 MB and 8 MB L2 caches, respectively. Additionally, the relative improvement of REC diminishes as the L2 cache size grows, since the effectiveness of REC is also reduced with larger caches. Nevertheless, the results emphasize the importance of the coherence protocol in improving cache efficiency.

Inter-GPU bandwidth: The bandwidth of inter-GPU links is a critical factor in scaling multi-GPU performance. Fig. 20 shows the performance of the baseline and REC under different inter-GPU bandwidths, relative to the 300 GB/s baseline. The results demonstrate that REC outperforms the baseline even in applications where performance begins to saturate with increased bandwidth.

Number of SAs: We also evaluate REC with an increased number of SAs, as shown in Fig. 21. The performance improvement of REC decreases compared to the system with 16 SAs, since the increased number of SAs improves the thread-level parallelism of the GPUs. However, a system with a larger number of SAs also elevates the intensity of data sharing and thus increases the frequency of coherence directory evictions. As a result, REC outperforms the baseline with 16 SAs by 17.1%.

Number of GPUs: We evaluate REC in 8-GPU and 16-GPU systems, as shown in Fig. 22. To ensure a fair comparison, we do not change the workload sizes. The results show that REC provides performance improvements of 24.7% and 14.7% over the baseline in the 8-GPU and 16-GPU systems, respectively. We observe that the performance improvement decreases as the number of GPUs increases. This is because, with more GPUs, the application dataset is more distributed and the amount of data allocated to each GPU's memory decreases, reducing the pressure on each coherence directory for tracking shared copies. Additionally, we compare REC with the baseline configured with different directory sizes to match equal storage costs (discussed in Section 4.3). We observe that REC achieves performance improvements of 2.04× and 1.83× over the baseline with directory sizes increased by 3× and 5×, respectively. The results confirm that simply increasing directory sizes is not an efficient approach, even in large-scale multi-GPU systems.

Fig. 20. Performance impact of different inter-GPU bandwidths in the baseline and REC. Performance is normalized to the baseline with 300 GB/s inter-GPU bandwidth.

Fig. 21. Performance of REC with different numbers of SAs, normalized to the baseline with 16 SAs.

Fig. 22. Performance comparison of REC and the baseline with equal storage cost under different numbers of GPUs. Performance is normalized to the baseline with 8K entries.

5.4. REC with different GPU architecture

We extend the evaluation of REC to a different GPU architecture by adapting the simulation environment to a more recent NVIDIA-style GPU [27]. This involves increasing the computation and memory resources compared to the AMD GPU setup. Specifically, we change the GPU configuration to include 128 CUs, each with a 128 kB L1V cache. The L2 cache size is increased to 72 MB, with the cache line size adjusted to 128B. With the increased cache line size, we configure the addressable range of REC to 2 kB, allowing the same number of tag addresses to be coalesced per entry. We also scale the input sizes of the workloads to the extent that the simulations remain feasible. The performance results, shown in Fig. 23, indicate that REC achieves a 12.9% performance improvement over the baseline. This indicates that our proposed REC also benefits the NVIDIA-like GPU architecture.

Fig. 23. Performance of REC in a different GPU architecture.

Fig. 24. Performance of REC with DNN applications.
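The equal-storage comparison in the GPU-count study above relies on the per-entry bit counts derived in Sections 4.1 and 4.3; they can be re-checked with a few lines (Python, illustrative; the 48-bit tag and per-range shifts follow Table 1).

```python
# Re-derivation of the per-entry storage in Table 1 and Section 4.3 (illustrative).
def rec_entry_bits(num_gpus, range_bytes=1024, line_bytes=64, tag_bits=48):
    """Base address + (position + sharer) bits per coalesced slot + valid bit."""
    slots = range_bytes // line_bytes
    base = tag_bits - (range_bytes.bit_length() - 1)   # tag right-shifted by log2(m)
    return base + slots * (1 + (num_gpus - 1)) + 1

def baseline_entry_bits(num_gpus, tag_bits=48):
    """Full tag + one sharer bit per remote GPU + valid bit (64B granularity)."""
    return tag_bits + (num_gpus - 1) + 1

# Table 1 totals for a 4-GPU system (128B, 256B, 1 kB, 4 kB ranges):
assert [rec_entry_bits(4, r) for r in (128, 256, 1024, 4096)] == [50, 57, 103, 293]
# Section 4.3 scaling: 167 bits for 8 GPUs (~3x the 56-bit baseline entry)
# and 295 bits for 16 GPUs (~5x the 64-bit baseline entry).
assert rec_entry_bits(8) == 167 and baseline_entry_bits(8) == 56
assert rec_entry_bits(16) == 295 and baseline_entry_bits(16) == 64
```

The same function also covers the NVIDIA-style configuration of Section 5.4: a 2 kB range with 128B lines yields 2048 / 128 = 16 slots, i.e., the same number of coalesceable tag addresses per entry.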
5.5. Effectiveness of REC on DNN applications

We evaluate the performance improvement of REC in training two DNN models, VGG16 and ResNet18, using the Tiny-ImageNet-200 dataset [28]. As shown in Fig. 24, REC outperforms the baseline when training VGG16 and ResNet18 by 5.6% and 8.9%, respectively. The results imply that REC is also beneficial for multi-GPU training of DNN workloads. Additionally, GPUs have recently gained significant attention for training large language models (LLMs). The computation of LLM training comprises multiple decoder blocks, each primarily consisting of a series of matrix and vector operations [29]. In our evaluation, we observe that REC improves multi-GPU performance by 20.2% and 20.4% on the GEMM and GEMV workloads, respectively. In real-world LLM training, the memory requirements of large parameter counts can pressure memory systems and lead to under-utilization of computation resources [29]. Since REC improves cache efficiency in multi-GPU systems, we expect an even higher performance potential from REC in real-world LLM training.

6. Related work

Several prior works have proposed GPU memory consistency and cache coherence mechanisms optimized for general-purpose domains [13–15,19,30–32]. GPU-VI [19] reduces stalls at the cache controller by employing write-through, write-no-allocate L1 caches and treating loads to pending writes as misses. To maintain write atomicity, GPU-VI adds transient states and state transitions and requires invalidation acknowledgments before write completion. REC is implemented on top of the relaxed memory models commonly adopted in recent GPU architectures, which do not require acknowledgments to be sent or received over long-latency inter-GPU links. HMG [11] proposes a lightweight directory protocol that addresses up-to-date memory consistency and coherence requirements. HMG integrates separate layers for managing inter-GPM and inter-GPU level coherence, reducing network traffic and complexity in deeply hierarchical multi-GPU systems. REC primarily addresses the increased cache misses to remotely fetched data caused by frequent invalidations. Additionally, REC can be extended to support hierarchical multi-GPU systems as proposed by HMG without significant hardware modifications.

Other efforts aim to design efficient cache coherence protocols for other processor domains. Wang et al. [33] suggested a method to efficiently support dynamic task parallelism on heterogeneous cache coherent systems. Zuckerman et al. [34] proposed Cohmeleon, which orchestrates coherence for accelerators in heterogeneous system-on-chip designs. HieraGen [35] and HeteroGen [36] are automated tools for generating hierarchical and heterogeneous cache coherence protocols, respectively, for generic processor designs. Li et al. [37] proposed methodologies to determine the minimum number of virtual networks required for deadlock-free cache coherence protocols. However, these studies do not address the challenges of redundant invalidations in the cache coherence mechanisms of multi-GPU systems.

Significant research has addressed the NUMA effect in multi-GPU systems by proposing efficient page placement and migration strategies [5,6,38], data transfer and replication methods [4,7,8,10,39,40], and address translation schemes [41–43]. In particular, several works have focused on improving the management of shared data within the local memory hierarchy. NUMA-aware cache partitioning [3] dynamically allocates cache space to accommodate data from both local and remote memory by monitoring inter-GPU and local DRAM bandwidths. The authors also extend software coherence with bulk invalidations to L2 caches and evaluate the overhead associated with unnecessary invalidations. SAC [12] proposes reconfigurable last-level caches (LLCs) that can be utilized as either memory-side or SM-side, depending on predicted application behavior in terms of effective LLC bandwidth. SAC evaluates the performance of both software and hardware extensions for LLC coherence. In contrast, REC specifically targets the issue of unnecessary invalidations under hardware coherence, which can undermine the efficiency of remote data caching, and introduces a new directory structure that carefully examines the trade-off between performance and storage overhead.

Recent studies on multi-GPU and multi-node GPU systems also address challenges in various domains. Researchers have proposed methods to accelerate deep learning applications [44], graph neural networks [45], and graphics rendering applications [46] in multi-GPU systems. Na et al. [47] addressed security challenges in inter-GPU communications under a unified virtual memory framework. Barre Chord [48] leverages page allocation schemes in multi-chip-module GPUs to reduce address translation overheads. Villa et al. [49] studied the design of trustworthy system-level simulation methodologies for single- and multi-GPU systems. Lastly, NGS [50] enables multiple nodes in a data center network to share the compute resources of GPUs on top of a virtualization technique.

7. Conclusion

In this paper, we propose REC to improve the efficiency of cache coherence in multi-GPU systems. Our analysis shows that the limited capacity of coherence directories in fine-grained hardware protocols frequently leads to evictions and unnecessary invalidations of shared data. The resulting increase in cache misses exacerbates NUMA overhead, leading to significant performance degradation in multi-GPU systems. To address this challenge, REC leverages memory access locality to coalesce multiple tag addresses within common address ranges, effectively increasing the coverage of coherence directories without incurring significant hardware overhead. Additionally, REC maintains write-initiated invalidations at a fine granularity to ensure precise and flexible coherence across GPUs. Experiments show that REC reduces L2 cache misses by 53.5% and improves overall system performance by 32.7%.

CRediT authorship contribution statement

Gun Ko: Writing – original draft, Visualization, Validation, Software, Resources, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Jiwon Lee: Formal analysis, Conceptualization. Hongju Kal: Validation, Conceptualization. Hyunwuk Lee: Visualization, Validation. Won Woo Ro: Supervision, Project administration, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2024-00402898, Simulation-based High-speed/High-Accuracy Data Center Workload/System Analysis Platform).

Data availability

The authors are unable or have chosen not to specify which data has been used.

References

[1] NVIDIA, NVIDIA DGX-2, 2018, https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-2/dgx-2-print-datasheet-738070-nvidia-a4-web-uk.pdf.
[2] NVIDIA, NVIDIA DGX A100 system architecture, 2020, https://download.boston.co.uk/downloads/3/8/6/386750a7-52cd-4872-95e4-7196ab92b51c/DGX%20A100%20System%20Architecture%20Whitepaper.pdf.
[3] U. Milic, O. Villa, E. Bolotin, A. Arunkumar, E. Ebrahimi, A. Jaleel, A. Ramirez, D. Nellans, Beyond the socket: NUMA-aware GPUs, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2017, pp. 123–135.
[4] V. Young, A. Jaleel, E. Bolotin, E. Ebrahimi, D. Nellans, O. Villa, Combining HW/SW mechanisms to improve NUMA performance of multi-GPU systems, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2018, pp. 339–351.
[5] T. Baruah, Y. Sun, A.T. Dinçer, S.A. Mojumder, J.L. Abellán, Y. Ukidave, A. Joshi, N. Rubin, J. Kim, D. Kaeli, Griffin: Hardware-software support for efficient page migration in multi-GPU systems, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2020, pp. 596–609.
[6] M. Khairy, V. Nikiforov, D. Nellans, T.G. Rogers, Locality-centric data and threadblock management for massive GPUs, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture,
[30] K. Koukos, A. Ros, E. Hagersten, S. Kaxiras, Building heterogeneous Unified Virtual Memories (UVMs) without the overhead, ACM Trans. Archit. Code Optim. 13 (1) (2016).
[31] X. Ren, M. Lis, Efficient sequential consistency in GPUs via relativistic cache coherence, in: Proceedings of IEEE International Symposium on High Performance Computer Architecture, 2017, pp. 625–636.
Symposium on Microarchitecture, 2020, pp. 1022–1036. [32] S. Puthoor, M.H. Lipasti, Turn-based spatiotemporal coherence for GPUs, ACM [7] H. Muthukrishnan, D. Lustig, D. Nellans, T. Wenisch, GPS: A global publish- Trans. Archit. Code Optim. 20 (3) (2023). subscribe model for multi-GPU memory management, in: Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 46–58. [33] M. Wang, T. Ta, L. Cheng, C. Batten, Efficiently supporting dynamic task paral- [8] L. Belayneh, H. Ye, K.-Y. Chen, D. Blaauw, T. Mudge, R. Dreslinski, N. Talati, lelism on heterogeneous cache-coherent systems, in: Proceedings of ACM/IEEE Locality-aware optimizations for improving remote memory latency in multi-GPU International Symposium on Computer Architecture, 2020, pp. 173–186. systems, in: Proceedings of the International Conference on Parallel Architectures [34] J. Zuckerman, D. Giri, J. Kwon, P. Mantovani, L.P. Carloni, Cohmeleon: and Compilation Techniques, 2022, pp. 304–316. Learning-based orchestration of accelerator coherence in heterogeneous SoCs, in: [9] S.B. Dutta, H. Naghibijouybari, A. Gupta, N. Abu-Ghazaleh, A. Marquez, K. Proceedings of IEEE/ACM International Symposium on Microarchitecture, 2021, Barker, Spy in the GPU-box: Covert and side channel attacks on multi-GPU pp. 350–365. systems, in: Proceedings of ACM/IEEE International Symposium on Computer [35] N. Oswald, V. Nagarajan, D.J. Sorin, HieraGen: Automated generation of con- Architecture, 2023, pp. 633–645. current, hierarchical cache coherence protocols, in: Proceedings of ACM/IEEE [10] H. Muthukrishnan, D. Lustig, O. Villa, T. Wenisch, D. Nellans, FinePack: International Symposium on Computer Architecture, 2020, pp. 888–899. Transparently improving the efficiency of fine-grained transfers in multi-GPU systems, in: Proceedings of IEEE International Symposium on High Performance [36] N. Oswald, V. Nagarajan, D.J. Sorin, V. Gavrielatos, T. Olausson, R. Carr, Computer Architecture, 2023, pp. 
516–529. HeteroGen: Automatic synthesis of heterogeneous cache coherence protocols, in: [11] X. Ren, D. Lustig, E. Bolotin, A. Jaleel, O. Villa, D. Nellans, HMG: Extending Proceedings of IEEE International Symposium on High Performance Computer cache coherence protocols across modern hierarchical multi-GPU systems, in: Architecture, 2022, pp. 756–771. Proceedings of IEEE International Symposium on High Performance Computer [37] W. Li, A.G.U. of Amsterdam, N. Oswald, V. Nagarajan, D.J. Sorin, Determining Architecture, 2020, pp. 582–595. the minimum number of virtual networks for different coherence protocols, in: [12] S. Zhang, M. Naderan-Tahan, M. Jahre, L. Eeckhout, SAC: Sharing-aware caching Proceedings of ACM/IEEE International Symposium on Computer Architecture, in multi-chip GPUs, in: Proceedings of ACM/IEEE International Symposium on 2024, pp. 182–197. Computer Architecture, 2023, pp. 605–617. [38] Y. Wang, B. Li, A. Jaleel, J. Yang, X. Tang, GRIT: Enhancing multi-GPU [13] B.A. Hechtman, S. Che, D.R. Hower, Y. Tian, B.M. Beckmann, M.D. Hill, S.K. performance with fine-grained dynamic page placement, in: Proceedings of IEEE Reinhardt, D.A. Wood, QuickRelease: A throughput-oriented approach to release International Symposium on High Performance Computer Architecture, 2024, pp. consistency on GPUs, in: Proceedings of IEEE International Symposium on High 1080–1094. Performance Computer Architecture, 2014, pp. 189–200. [39] M.K. Tavana, Y. Sun, N.B. Agostini, D. Kaeli, Exploiting adaptive data com- [14] M.D. Sinclair, J. Alsop, S.V. Adve, Efficient GPU synchronization without pression to improve performance and energy-efficiency of compute workloads in scopes: Saying no to complex consistency models, in: Proceedings of IEEE/ACM multi-GPU systems, in: Proceedings of IEEE International Parallel and Distributed International Symposium on Microarchitecture, 2015, pp. 647–659. [15] J. Alsop, M.S. Orr, B.M. Beckmann, D.A. 
Wood, Lazy release consis- Processing Symposium, 2019, pp. 664–674. tency for GPUs, in: Proceedings of IEEE/ACM International Symposium on [40] H. Muthukrishnan, D. Nellans, D. Lustig, J.A. Fessler, T.F. Wenisch, Efficient Microarchitecture, 2016, pp. 1–13. multi-GPU shared memory via automatic optimization of fine-grained trans- [16] NVIDIA, NVIDIA TESLA V100 GPU architecture, 2017, https://images.nvidia. fers, in: Proceedings of the ACM/IEEE International Symposium on Computer com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf. Architecture, 2021, pp. 139–152. [17] NVIDIA, NVIDIA A100 tensor core GPU architecture, 2020, https: [41] B. Li, J. Yin, Y. Zhang, X. Tang, Improving address translation in multi- //images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere- GPUs via sharing and spilling aware TLB design, in: Proceedings of IEEE/ACM architecture-whitepaper.pdf. International Symposium on Microarchitecture, 2021, pp. 1154–1168. [18] NVIDIA, NVIDIA NVLink high-speed GPU interconnect, 2024, https://www. [42] B. Li, J. Yin, A. Holey, Y. Zhang, J. Yang, X. Tang, Trans-FW: Short circuiting nvidia.com/en-us/design-visualization/nvlink-bridges/. page table walk in multi-GPU systems via remote forwarding, in: Proceedings [19] I. Singh, A. Shriraman, W.W.L. Fung, M. O’Connor, T.M. Aamodt, Cache coher- of IEEE International Symposium on High Performance Computer Architecture, ence for GPU architectures, in: Proceedings of IEEE International Symposium on 2023, pp. 456–470. High Performance Computer Architecture, 2013, pp. 578–590. [20] Y. Sun, T. Baruah, S.A. Mojumder, S. Dong, X. Gong, S. Treadway, Y. Bao, [43] B. Li, Y. Guo, Y. Wang, A. Jaleel, J. Yang, X. Tang, IDYLL: Enhancing page S. Hance, C. McCardwell, V. Zhao, H. Barclay, A.K. Ziabari, Z. Chen, R. translation in multi-GPUs via light weight PTE invalidations, in: Proceedings of Ubal, J.L. Abellán, J. Kim, A. Joshi, D. 
Kaeli, MGPUSim: Enabling multi- IEEE/ACM International Symposium on Microarchitecture, 2015, pp. 1163–1177. GPU performance modeling and optimization, in: Proceedings of ACM/IEEE [44] E. Choukse, M.B. Sullivan, M. O’Connor, M. Erez, J. Pool, D. Nellans, Buddy International Symposium on Computer Architecture, 2019, pp. 197–209. compression: Enabling larger memory for deep learning and HPC workloads [21] T. Yuki, L.-N. Pouchet, Polybench 4.0, 2015. on GPUs, in: Proceedings of ACM/IEEE International Symposium on Computer [22] Y. Sun, X. Gong, A.K. Ziabari, L. Yu, X. Li, S. Mukherjee, C. Mccardwell, A. Architecture, 2020, pp. 926–939. Villegas, D. Kaeli, Hetero-mark, a benchmark suite for CPU-GPU collaborative [45] Y. Tan, Z. Bai, D. Liu, Z. Zeng, Y. Gan, A. Ren, X. Chen, K. Zhong, BGS: Accelerate computing, in: Proceedings of IEEE International Symposium on Workload GNN training on multiple GPUs, J. Syst. Archit. 153 (2024) 103162. Characterization, 2016, pp. 1–10. [23] AMD, AMD app SDK OpenCL optimization guide, 2015. [46] X. Ren, M. Lis, CHOPIN: Scalable graphics rendering in multi-GPU systems via [24] A. Danalis, G. Marin, C. McCurdy, J.S. Meredith, P.C. Roth, K. Spafford, V. Tip- parallel image composition, in: Proceedings of IEEE International Symposium on paraju, J.S. Vetter, The Scalable Heterogeneous Computing (SHOC) benchmark High Performance Computer Architecture, 2021, pp. 709–722. suite, in: Proceedings of the 3rd Workshop on General-Purpose Computation on [47] S. Na, J. Kim, S. Lee, J. Huh, Supporting secure multi-GPU computing with dy- Graphics Processing Units, 2010, pp. 63–74. namic and batched metadata management, in: Proceedings of IEEE International [25] R. Balasubramonian, A.B. Kahng, N. Muralimanohar, A. Shafiee, V. Srinivas, Symposium on High Performance Computer Architecture, 2024, pp. 204–217. CACTI 7: New tools for interconnect exploration in innovative off-chip memories, [48] Y. Feng, S. Na, H. Kim, H. 
Jeon, Barre chord: Efficient virtual memory trans- ACM Trans. Archit. Code Optim. 14 (2) (2017) 14:1–25. lation for multi-chip-module GPUs, in: Proceedings of ACM/IEEE International [26] NVIDIA, NVIDIA DGX-1 with tesla V100 system architecture, 2017, pp. 1–43. Symposium on Computer Architecture, 2024, pp. 834–847. [27] NVIDIA, NVIDIA ADA GPU architecture, 2023, https://images.nvidia.com/aem- dam/Solutions/Data-Center/l4/nvidia-ada-gpu-architecture-whitepaper- [49] O. Villa, D. Lustig, Z. Yan, E. Bolotin, Y. Fu, N. Chatterjee, Need for speed: v2.1.pdf. Experiences building a trustworthy system-level GPU simulator, in: Proceedings [28] Y. Le, X. Yang, Tiny ImageNet visual recognition challenge, 2015, https://http: of IEEE International Symposium on High Performance Computer Architecture, //vision.stanford.edu/teaching/cs231n/reports/2015/pdfs/yle_project.pdf. 2021, pp. 868–880. [29] G. Heo, S. Lee, J. Cho, H. Choi, S. Lee, H. Ham, G. Kim, D. Mahajan, J. Park, [50] J. Prades, C. Reaño, F. Silla, NGS: A network GPGPU system for orchestrating NeuPIMs: NPU-PIM heterogeneous acceleration for batched LLM inferencing, remote and virtual accelerators, J. Syst. Archit. 151 (2024) 103138. in: Proceedings of ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024, pp. 722–737. 12 G. Ko et al. Journal of Systems Architecture 160 (2025) 103339 Gun Ko received the B.S. degree in electrical engineering Hyunwuk Lee received his B.S. and Ph.D. degrees in from Pennsylvania State University in 2017. He is currently electrical and electronic engineering from Yonsei University, pursuing the Ph.D. degree with the Embedded Systems Seoul, Korea, in 2018 and 2024, respectively. He currently and Computer Architecture Laboratory, School of Electrical works in the memory division at Samsung Electronics. His and Electronic Engineering, Yonsei University, Seoul, South research interests include neural network accelerators and Korea. 
His current research interests include GPU memory GPU systems. systems, multi-GPU systems, and virtual memory. Jiwon Lee received the B.S. and Ph.D. degrees in electrical and electronic engineering from Yonsei University, Seoul, Won Woo Ro received the B.S. degree in electrical engineer- South Korea, in 2018 and 2024, respectively. He currently ing from Yonsei University, Seoul, South Korea, in 1996, and works in the memory division at Samsung Electronics. His the M.S. and Ph.D. degrees in electrical engineering from the research interests include virtual memory, GPU memory University of Southern California, in 1999 and 2004, respec- systems, and storage systems. tively. He worked as a Research Scientist with the Electrical Engineering and Computer Science Department, University of California, Irvine. He currently works as a Professor with the School of Electrical and Electronic Engineering, Yonsei University. Prior to joining Yonsei University, he worked as an Assistant Professor with the Department Hongju Kal received the B.S. degree from Seoul National of Electrical and Computer Engineering, California State University of Science and Technology and Ph.D. degree from University, Northridge. His industry experience includes a Yonsei University in school of electric and electronic engi- college internship with Apple Computer, Inc., and a contract neering, Seoul, South Korea in 2018 and 2024, respectively. software engineer with ARM, Inc. His current research He currently works in the memory division at Samsung interests include high-performance microprocessor design, Electronics. His current research interests include memory GPU microarchitectures, neural network accelerators, and architectures, memory hierarchies, near memory processing, memory hierarchy design. and neural network accelerators. 13