Journal of Systems Architecture 160 (2025) 103341

A load-balanced acceleration method for small and irregular batch matrix multiplication on GPU

Yu Zhang a, Lu Lu a,b,∗, Zhanyu Yang a, Zhihong Liang c,d, Siliang Suo c,d

a School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
b Peng Cheng Laboratory, Shenzhen 518055, China
c Electric Power Research Institute, CSG, Guangzhou, China
d Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou, China

Keywords: Batch GEMM; Thread workload; Multi-thread kernel; Tiling algorithm

ABSTRACT

As an essential mathematical operation, GEneral Matrix Multiplication (GEMM) plays a vital role in many applications, such as high-performance computing and machine learning. In practice, the performance of GEMM is limited by the dimensions of the matrices and the diversity of GPU hardware architectures; when dealing with batched, irregular, and small matrices, GEMM usually performs poorly. To this end, a common approach is to segment the matrix into multiple tiles and utilize the parallelism between workgroups in the GPU to compute the results. However, previous works only consider tile size and inter-workgroup parallelism, and ignore the low computational efficiency and hardware resource utilization caused by the difference in workloads between wavefronts. To address these issues, we propose a load-balanced batch GEMM acceleration method consisting of a multi-thread kernel design and an efficient tiling algorithm. The multi-thread kernel design addresses the workload unbalance between wavefronts in different workgroups, and the efficient tiling algorithm chooses the optimal tiling scheme with a new thread-level parallelism calculation method to achieve load-balanced task allocation.
Finally, various comparative experiments were conducted on two GPU platforms: AMD and NVIDIA. Experimental results indicate the proposed method outperforms previous methods.

∗ Corresponding author at: School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China.
E-mail addresses: yuzhang0722@163.com (Y. Zhang), lul@scut.edu.cn (L. Lu), yangzhanyu@hotmail.com (Z. Yang), liangzh@csg.cn (Z. Liang), suosl@csg.cn (S. Suo).
https://doi.org/10.1016/j.sysarc.2025.103341
Received 3 September 2024; Received in revised form 3 November 2024; Accepted 8 January 2025; Available online 23 January 2025
1383-7621/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

1. Introduction

GEneral Matrix Multiplication (GEMM) is a standard computing kernel that plays an important role in high-performance computing [1], artificial intelligence [2], image processing [3], and other research fields. With the explosive growth of data volume and the emergence of various algorithms, the demand for high-performance GEMM computing is increasing [4,5]. Additional stream processors and memory are integrated into GPUs to cater to this trend, providing tremendous computational power for GEMM acceleration. To fully utilize this hardware acceleration capability, AMD and NVIDIA provide developers with platforms for parallel computing on the GPU (ROCm and CUDA). Based on these parallel computing acceleration platforms, various optimization algorithms and acceleration libraries have been proposed and demonstrated to have powerful effects, such as rocBLAS [6], cuBLAS [7], MAGMA [8], etc. These methods achieve optimal computational task allocation through hardware resource scheduling and thread parallelism to accelerate the matrix multiplication operation [9,10].

Many real-world applications, such as deep learning, involve irregular, small-size matrix multiplication operations in their computations [11]. For example, in Convolutional Neural Networks (CNN) [12–14], the structure of these models contains a large number of convolutional layers, and the scale of the convolution kernel tends to be small (e.g., ''1*1'' and ''3*3''). Convolution operations are converted to GEMM using the Im2col function, and the dimension of the matrix is typically less than 1000 [15,16]. These small GEMM computations prevent the GPU from fully exploiting its hardware computing potential. In this case, the scheduling overhead between batch GEMMs and the irregularity of the matrices pose challenges to computational performance [17,18]. For a GEMM, tiling is the standard solution method: the matrix is segmented into multiple tiles, and a thread block is responsible for computing an individual tile. Since each tile is independent, multiple tiles can be computed in parallel by multiple threads in the GPU to speed up the computation process of GEMM. A larger tile dimension increases the Thread-Level Parallelism (TLP) of a single tile but also reduces the number of tiles, which can result in a failure to fully utilize the hardware resources of the GPU [19,20]. The Instruction-Level Parallelism (ILP) of a single thread is related to the K-dimension. Generally, a large enough matrix can fully use GPU hardware resources and achieve higher TLP and ILP [21,22].

To improve computational efficiency, previous studies have proposed several acceleration methods for matrix multiplication. For instance, rocBLAS [6] and cuBLAS [7] provide batch GEMM APIs (rocblas_sgemm_batched and cublasSgemmBatched), which support multiple GEMMs being calculated simultaneously on GPUs. However, these APIs support only uniform matrix sizes, which considerably limits their applications. NVIDIA also provides a C++-style template library, CUTLASS [23], which utilizes built-in tile templates and sorting to accelerate matrix multiplication operations. In fact, the size of matrices is variable in many real-world applications [11]. To address this issue, a Vbatch GEMM routine that supports batch GEMM in various sizes is designed and implemented by MAGMA (magmablas_sgemm_vbatched). It adapts to batch GEMMs with multiple tiling strategies, assigning an appropriate tile to each GEMM for large performance gains. Although variable sizes are supported in MAGMA, it still has some limitations. First, MAGMA only supports some coarse-grained tiling strategies that are not appropriate for all GEMMs; coarse-grained tiling results in an unbalanced kernel workload and reduced GPU utilization. Second, the grid size is determined by the tiling of the largest matrix, which leads to idle threads and a waste of GPU computing power. Third, the lack of an evaluation criterion for tiling leads to lower efficiency of strategy choice.

To thoroughly support batch GEMM with variable sizes, it is essential to design a tiling algorithm that can be adapted to all GEMMs and that adaptively chooses tile sizes, rather than being limited to a single size. The optimal tiling for each GEMM is different, depending on the matrix dimensions (M, N, K); how to choose a suitable tile is a challenge for batch GEMM. At the same time, an evaluation criterion based on the current GPU hardware and tiling strategy is also essential. With such a criterion, an appropriate tiling for each GEMM can be chosen to fully utilize the GPU computing capabilities and achieve better computational performance. How to measure the effectiveness of a tiling algorithm on the GPU hardware is a challenging problem. Tiles of various sizes can lead to significant differences in computational effort within each workgroup, and further to an unbalanced distribution of computational tasks and excessive load differences between threads. Hence, for tiles of various sizes, balancing thread computation and data loading during computation is also a challenge for batch GEMM.

To address the above challenges, we propose a batch GEMM acceleration method with a multi-thread kernel design. Furthermore, an efficient tiling algorithm is proposed to achieve load balance and higher hardware resource utilization. Our contributions can be summarized as follows:

• A multi-threaded kernel design scheme is proposed to balance thread computation and data loading in different workgroups to compute the various tiles.
• A novel TLP computation method is designed to select the optimal tiling algorithm by combining the kernel occupancy of the GPU and the tiling operation.
• An efficient tiling algorithm is implemented by considering the GPU hardware architecture and the batch GEMM workload.
• The proposed method can efficiently handle batches of irregular GEMMs and achieves state-of-the-art performance on AMD and NVIDIA GPU platforms.

The rest of the paper is organized as follows. Section 2 provides related work and motivation. Section 3 introduces background on batch GEMM, GPU architecture, and kernel occupancy. Section 4 presents the details of the multi-thread kernel design and load-balanced tiling algorithm. Section 5 demonstrates and evaluates the experimental results. Section 6 provides the conclusions of the paper and future work.

2. Related work and motivation

2.1. Related work

Several approaches have been proposed for batch GEMM computation, which mainly focus on algorithm-level or architecture-level optimization. The former mainly explores lower bounds on the time complexity of GEMM operations at the mathematical level and optimizes the computational effort. The latter is based on the features of different GPU architectures and uses corresponding optimization techniques to improve the computational efficiency of GEMM. In algorithm-level optimization, Strassen [24] proposed a novel GEMM algorithm based on the property that matrix addition is faster than matrix multiplication, using seven multiplications and multiple addition operations instead of eight multiplications. This approach mathematically reduced the time complexity of GEMM to O(n^2.81) for the first time. To reduce the extra memory space required by Strassen's algorithm, three different methods were proposed in [25]: pre-additions, overwriting the input matrix, and recursive scheduling. At the same time, due to the powerful effect of deep neural networks in various domains, Fawzi et al. [26] transformed the search for the optimal complexity of matrix multiplication into a tensor decomposition problem and used reinforcement learning to explore lower bounds on the complexity of matrix multiplication; in particular, for a 4 × 4 matrix, the number of multiplications was as low as 47, better than two-level Strassen, which involves 49 multiplications. Although the above approaches reduce the mathematical complexity of matrix multiplication, it is difficult to realize their performance benefits on GPUs because they neglect computational scheduling strategies and the multi-level memory architecture.

In architecture-level optimization, GPU vendors (NVIDIA and AMD) have designed and implemented computing libraries such as cuBLAS [7] and rocBLAS [6] based on their parallel computing platforms to improve GPU hardware utilization and parallelism. However, due to the restriction to uniform-sized matrices, their performance is poor when faced with small and irregular batch GEMMs. Although NVIDIA provides a C++-style template library, the small size of the matrices and the lack of assembly-level optimizations make it difficult for CUTLASS to fully exploit its performance advantages for irregular and small matrix multiplication [23]. These irregular and small-sized matrices often lead to unbalanced workloads among threads in different workgroups, which can reduce kernel performance. For Sparse GEneral Matrix–Matrix multiplication (SpGEMM), the matrix's sparsity leads to significant differences in thread workloads [27,28]. To address the unbalanced workload, Chen et al. [29] optimized matrix segmentation by analyzing the distribution of floating-point calculations in CSR-based SpGEMM, achieving load balance and performance improvement on Sunway TaihuLight. For the issue of workload unbalance in threads, it is necessary to conduct a detailed analysis of the computation process and hardware platform characteristics to design an efficient parallel framework implementation [30,31]. Xiao et al. [32] introduced a fine-grained partitioning strategy to select appropriate segmentation dimensions, efficiently utilizing multi-thread parallelism and improving the performance of binary sparse tensor contractions. The diversity of matrix sizes makes it difficult to use a unified routine for calculation, resulting in some threads being idle in the CU [33,34]. Indeed, the size of matrices is variable and irregular in various scientific computing scenarios. To overcome the restriction to uniform matrix sizes, MAGMA [8] proposes a Vbatch routine to support batch GEMM with various sizes. It uses a 3D grid for the batch GEMM kernel design, where grid.z represents the batch size.
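To make this grid construction concrete, the sketch below (our own Python illustration; MAGMA itself implements this in C) derives a Vbatch-style 3D grid from the largest GEMM in the batch and counts the workgroups that end up idle. The tile sizes and helper names are illustrative assumptions, not MAGMA's API.

```python
import math

def vbatch_grid(shapes, tile_m=32, tile_n=32):
    """Size a Vbatch-style 3D grid for a variable-size batch.

    shapes: list of (M, N, K) tuples, one per GEMM in the batch.
    The 2D plane (grid.x, grid.y) is sized by the LARGEST GEMM,
    so smaller GEMMs leave some of their workgroups idle.
    """
    max_m = max(m for m, _, _ in shapes)
    max_n = max(n for _, n, _ in shapes)
    grid_x = math.ceil(max_m / tile_m)
    grid_y = math.ceil(max_n / tile_n)
    grid_z = len(shapes)  # one 2D plane per GEMM
    return grid_x, grid_y, grid_z

def idle_workgroups(shapes, tile_m=32, tile_n=32):
    """Workgroups that do no useful work because their GEMM is
    smaller than the largest one in the batch."""
    gx, gy, gz = vbatch_grid(shapes, tile_m, tile_n)
    total = gx * gy * gz
    useful = sum(math.ceil(m / tile_m) * math.ceil(n / tile_n)
                 for m, n, _ in shapes)
    return total - useful

# An irregular batch: one 512x512 GEMM dominates three 64x64 GEMMs.
batch = [(512, 512, 64), (64, 64, 64), (64, 64, 64), (64, 64, 64)]
print(vbatch_grid(batch))      # -> (16, 16, 4)
print(idle_workgroups(batch))  # -> 756
```

For an irregular batch dominated by one large GEMM, most workgroups in the smaller planes are idle, which is exactly the waste of GPU computing resources discussed in the text.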
Each GEMM corresponds to one of the 2D-grid planes, and The source code of this paper can be obtained in this repository link: the size of the two-dimensional plane (grid.x, grid.y) is determined by https://github.com/zhangyu0722/BatchGEMM.git. the largest GEMM. In the case of irregular GEMM, if the dimension 2 Y. Zhang et al. Journal of Systems Architecture 160 (2025) 103341 Fig. 1. GEMM and batch GEMM schematic diagram. difference between the largest GEMM and the rest is too large, a large 3. Background number of threads and workgroups will be idle, resulting in a waste of GPU computing resources. For various parallel acceleration platforms, 3.1. GEMM and batch GEMM different hardware characteristics, such as register size and number of CUs, will affect the allocation of computing resources in the kernel. To For a single GEMM, its accumulation routine is 𝐶 = 𝛼 𝐴𝐵+𝛽 𝐶, where ensure kernel performance, it is necessary to flexibly set parameters 𝐴 ∈ 𝑅𝑀×𝐾 , 𝐵 ∈ 𝑅𝐾×𝑁 and 𝐶 ∈ 𝑅𝑀×𝑁 are dense matrices, 𝑀, 𝑁, and based on different matrix sizes and hardware architectures [9,35]. 𝐾 represent matrix dimensions, and 𝛼 and 𝛽 are constant scalars. A To solve this problem, a coordinated tiling and batching strategy is common approach is tiling matrix C into multiple tiles [21,36], which proposed in [21], where a different tiling strategy is used for each utilizes the parallel computing of thread in GPU to calculate each tile GEMM in batch GEMM and appropriate batching is used according and splices together the result. As shown in Fig. 1 (b), given a GEMM to the tile size to improve the computational efficiency of the GPU. with size 𝑀 × 𝑁 × 𝐾, the matrix C is segmented into multiple tiles with Wang et al. [36] proposed the sort-up algorithm based on the GEMM 𝑇𝑚 × 𝑇𝑛 . 
Each workgroup is responsible for the calculation of a tile and workload and split-down in the tiling process, which can segment large needs to access the row section of matrix A with size 𝑇𝑚 ×𝐾 and column tiles into multiple smaller tiles. This approach can make better use of section of matrix B with size 𝐾 × 𝑇𝑛 . However, the row cross-section CU utilization when the number of GEMM is limited. of A and the column cross-section of B (represented in Fig. 1 (b) by 2.2. Motivation the gray parts of matrices A and B, respectively) are too large to store in shared memory and registers. Hence, the row section of A and the Although the above-mentioned methods improve the parallel com- column section of B are segments of multiple A tiles with 𝑇𝑚 × 𝑇𝑘 and B puting efficiency of batch GEMM on GPU from various perspectives, tiles with 𝑇𝑘 × 𝑇𝑛 , respectively. The partial result of C can be obtained there are two problems. One is that the workload of threads varies by calculating with A tile and B tile, and accumulative partial results significantly across the kernel. In the above approach, tiles with various can obtain the final result. sizes are designed, and each tile is responsible for the corresponding To batch-run multiple GEMMs, a naive routine is computed for kernel, where the number of threads is fixed. In general, larger tiles each GEMM individually. However, when the matrix size is small, have better TLP. This will also increase the workload of each thread a single GEMM does not fully utilize the GPU’s computing power, for large-size tiles, and the thread responsible for computing large tiles leaving the CU idle [37,38]. To avoid this situation, a batch GEMM requires more hardware resources (VGPR, SGPR, LDS) and computing method is proposed to design multiple kernels for various GEMM in time. The other one is that differences between wavefronts within dif- the GPUs [36,39]. 
Compared to GEMM, batch GEMM is expressed in ferent workgroups are ignored in the TLP calculations. The workgroup (𝑀 × 𝑁 × 𝐾 × 𝐵𝑠𝑖𝑧𝑒 ), where 𝑀, 𝑁 and 𝐾 represent the dimensions of will be transformed into multiple wavefronts during GPU computation the matrix, and 𝐵𝑠𝑖𝑧𝑒 represents the batch size. A batch GEMM is 3D- and be executed in parallel on the CU. Each CU can run multiple dimension grid, where grid.z is batch sizes, and grid.x and grid.y are the wavefronts simultaneously, and the number of wavefronts depends on lengths and widths of a two-dimensional plane respectively [40]. To the hardware resources required by the wavefront. Thus, the TLP on the balance the workload of a batch GEMM, a variety of tile sizes are used GPU should be determined by the number of threads in the wavefront for GEMM tiling. The two-dimensional grid size has the corresponding that can be executed in parallel on the CU. matrix C and tiling strategy. Each tile is responsible for the correspond- To solve the above problems, we propose an efficient and load- balanced batch GEMM acceleration method, which consists of two ing workgroup. A workgroup is decomposed into multiple wavefronts parts: a multi-thread kernel design scheme and an efficient tiling algo- that execute on the CU. The 3D grid of batch GEMM is shown in Fig. 1 rithm. A multi-thread kernel design is proposed to balance the amount (a). of loading and computation in the thread corresponding to each tile. Tiles with various sizes correspond to the number of threads selected. 3.2. GPU architecture and kernel occupancy Although this is limited by the parallel programming interfaces of the CUDA and ROCm platforms, the number of threads responsible for With the improvement of hardware architecture and parallel com- computing a tile is uniform. 
To overcome this shortcoming, we use the puting programming platforms (such as ROCm1 and CUDA2 ), GPUs corresponding filtering operation in the kernel execution process to ef- are becoming the most popular hardware accelerator. The two most fectively alleviate this problem. An efficient tiling algorithm can choose commonly used GPUs are AMD and NVIDIA, widely used in various the optimal scheme based on different GEMMs and GPUs. To measure scientific computing platforms. However, some basic concepts of ex- the effect of tiling, we propose a new way of TLP computation based pression in ROCm and CUDA are different. We chose AMD’s official on wavefronts. The optimal tiling scheme is obtained by adjusting the tiling strategy according to the TLP. Finally, we obtain an efficient tiling algorithm based on the new TLP calculation method. In Section 4, the 1 https://rocm.docs.amd.com/en/latest/ 2 details of the proposed method are introduced. https://docs.nvidia.com/cuda/ 3 Y. Zhang et al. Journal of Systems Architecture 160 (2025) 103341 Table 1 ROCm/CUDA terminology. ROCm CUDA Description Compute Unit (CU) Streaming One of many parallel vector processors in a GPU that contains Multiprocessor (SM) parallel ALUs. All waves in a workgroup are assigned to the same CU. Kernel Kernel Functions launched to the GPU that are executed by multiple parallel workers on the GPU. Kernels can work in parallel with CPU. Wavefront Warp Collection of operations that execute in lockstep, run the same instructions, and follow the same control-flow path. Individual lanes can be masked off. Workgroup Thread block Think of this as a vector thread. A 64-wide wavefront is a 64-wide vector op. Work-item/Thread Thread GPU programming models can treat this as a separate thread of execution, though this does not necessarily get forward sub-wavefront progress. Global Memory Global Memory DRAM memory accessible by the GPU that goes through some layers cache. 
Local Memory Shared Memory Scratchpad that allows communication between wavefront in a workgroup. Private Memory Local Memory Per-thread private memory often mapped to registers. terminology for this paper to provide precise specifications. To clarify resources. In order to fully utilize the hardware resources of the GPU some differences and relationships between ROCm and CUDA terms, a and improve the efficiency of parallel computing, the kernel occupancy comparison of terminology is given in Table 1. should be improved as much as possible without data overflow [46,47]. A GPU is composed of multiple Shader Engines (SE) and a com- In batch GEMM, an efficient kernel design should properly allocate mand processor. Each SE has its own workload manager. One SE is the data loading and computation workload for each work-item in the integrated with multiple CU and workload manager. Each CU contains wavefront, so that the memory space and computing power on the CU an enormous amount of Arithmetic and Logic Units (ALUs), a small can be more efficiently utilized [48,49]. number of control units, and caches. Hence, GPUs are suitable for a large number of simple parallel computing tasks. A GPU kernel consists 4. Overview of one or multiple workgroups, the size of which is determined by the number of wavefronts and threads. On the memory hierarchy, the GPU 4.1. Multi-thread kernel design has global memory, local memory, and private memory from slow to fast according to memory access speed, and local memory and private Tile size and kernel design are closely related in the design of batch memory are much smaller than global memory [41,42]. GEMM algorithms, and there are two matrix tile design routes. The Kernel Occupancy represents the actual utilization of computing first way is to design a tile to adapt to all GEMMs, and the second unit resources by a kernel function on GPU, which is the ratio of is to design the various tiles to adapt to different GEMMs. 
Compared actived wavefront to the maximum wavefront supported by CU [35,43]. with the first method, for irregular GEMM, the latter method is more An active wavefront running on CU requires resources such as Vec- flexible and efficient to utilize the computing resources of GPU. For tor General-Purpose Register (VGPR), Scalar General-Purpose Registers GEMMs with various shapes and sizes, using a single tile can easily (SGPR), Local Data Share (LDS), etc. A wavefront can be activated lead to increased workload differences between threads in multiple and run on a CU when all required resources are available. When the workgroups, affecting the allocation of computing resources. In this utilization of CU resources is low, the number of active wavefronts paper, we perform a multi-thread kernel design for the second matrix is small, which leads to the waste of hardware resources and the segmentation method. Two different tile design strategies are shown degradation of the parallel performance of the kernel. On the other in Fig. 2. Here we present the effect of two different tile strategies on hand, when the number of active wavefronts in the CU increases, the the occupancy of the 3D grid. For the batch GEMM, different tile sizes resources used by each wavefront and the available register storage lead to different numbers of workgroups, resulting in different 3D grid space of each work-item in the wavefront decrease [44,45]. occupancies. The number of active wavefronts on a CU is mainly limited by the For a single GEMM, matrix C is tiled into multiple tiles. The tile following factors: the number of work-items in each workgroup and size can be flexibly designed, and each tile can be run in parallel the sizes of VGPR, SGPR, and LDS. For example, in AMD’s MI1003 without data interference. Each tile is calculated by the corresponding and MI210,4 a wavefront consists of 64 work-terms. When the number workgroup and can be represented by a 2D-grid as a whole. 
When the of work-items in a workgroup is less than or equal to 64, only one size and number of tiles is large enough, efficient parallel execution wavefront is included. The VGPR, SGPR, and LDS sizes on the CU have a efficiency can usually be obtained. However, in real-world cases, the corresponding upper bound for each work-item. According to the kernel size of matrices in batch GEMM tends to be small and irregular, design, the resources on the CU need to be allocated before executing which leads to poor performance of traditional methods. Therefore, the each work-item. When resource requirements of the work-item are previous method adopts a variety of tiles to adapt to the corresponding satisfied, the wavefront can be active and run on the CU. Otherwise, GEMM, and each tile is based on a unified number of threads, which it will not run until other wavefronts accomplish tasks and release will lead to the workload of threads in large-scale tiles being much larger than that of small tiles. This gap in the workload of threads 3 https://www.amd.com/system/files/documents/instinct-mi100- results in unbalanced thread loading and reduces GPU parallel com- brochure.pdf puting efficiency. Table 2 lists the detailed parameters for tiles with 4 https://www.amd.com/content/dam/amd/en/documents/instinct- various sizes based on the same work-item design (The number of work- business-docs/white-papers/amd-cdna2-white-paper.pdf items in the kernel is 128). 𝑊𝐶 𝑃 and 𝑊𝐷𝐿 represent the computation 4 Y. Zhang et al. Journal of Systems Architecture 160 (2025) 103341 Fig. 2. Two different tile design strategies for batch GEMMs. ((a) All GEMMs adopt the same tiling scheme, which is divided into multiple tiles of the same size. (b) Different GEMMs adopt different tiling schemes and are divided into multiple tiles of different sizes.). 
Table 2 speed of global memory is considerably lower than that of registers, The common kernel design scheme for batch GEMM (There are significant workload threads’ data access efficiency decreases, and overall time consumption gaps between threads). increases. At the same time, since the variety of thread workloads, Tile 𝑇𝑚 𝑇𝑛 𝑇𝑘 𝑊𝐶 𝑃 𝑊𝐷𝐿 when a thread with a heavy workload is run on the CU, the number small 16 16 8/16 2 4/6 of active wavefronts on the CU is less, resulting in the CU’s kernel medium 32 32 8/16 8 12/16 large 64 64 8/16 32 40/48 occupancy (The ratio between the number of active wavefronts and the maximum number of supported wavefronts) will be reduced. The state of the CU with low kernel occupancy will be longer due to the longer work-item computation time. amount and data loading amount of work-item, respectively, and their To solve this problem, we propose a multi-thread kernel design, calculation expressions are considered as: which ensures that the workload of each thread is balanced as much as 𝑇 × 𝑇𝑛 possible. The experimental results in Fig. 3 show that multiple kernels’ 𝑊𝐶 𝑃 = 𝑚 (1) 𝑊𝑛𝑢𝑚 performance varies when calculating the same tile. For example, the 𝑇𝑚 × 𝑇𝑛 + 𝑇𝑚 × 𝑇𝑘 + 𝑇𝑘 × 𝑇𝑛 128-thread kernel performs best when calculating a tile with ‘‘32*32’’, 𝑊𝐷 𝐿 = (2) as shown in Fig. 3. The performance gap mentioned above is mainly 𝑊𝑛𝑢𝑚 because of the varying workloads of threads under different kernels, where 𝑊𝑛𝑢𝑚 represents the number of work-items responsible for com- which affects the overall performance. For the 128-thread kernel, when puting the tile. calculating a tile with ‘‘32*32’’, each thread needs to complete the For different tiles, there is a significant gap in workload between calculation of 8 elements and the loading of 16 elements. When cal- threads (𝑊𝐶 𝑃 ∈ [2, 32] and 𝑊𝐷𝐿 ∈ [4, 48]). The choice of 𝑇𝑘 also has a culating a tile with ‘‘64*64’’, the workload of the threads is heavy, and certain impact on the data load of work-item. 
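The per-work-item workload expressions are easy to evaluate numerically. The helper below (our own illustrative code, not from the paper's repository) reproduces the W_CP/W_DL entries of the kernel-design tables and shows how doubling the thread count halves both quantities for the large tile.

```python
def workload(tm, tn, tk, wnum):
    """Per-work-item workload for a Tm x Tn tile with K-step Tk,
    computed by wnum work-items: Eq. (1) gives the number of C
    elements each work-item accumulates, Eq. (2) the elements it
    must move for the C tile plus one A tile and one B tile."""
    w_cp = tm * tn / wnum                        # Eq. (1)
    w_dl = (tm * tn + tm * tk + tk * tn) / wnum  # Eq. (2)
    return w_cp, w_dl

# 128-thread kernel on the "large" 64x64 tile with Tk = 16:
print(workload(64, 64, 16, 128))   # -> (32.0, 48.0), heavily loaded
# The same tile with a 256-thread kernel: both quantities halve.
print(workload(64, 64, 16, 256))   # -> (16.0, 24.0)
```

Running the helper over all tile configurations makes the workload gap explicit: with a fixed 128-thread kernel, W_CP spans 2 to 32, whereas varying the thread count with the tile size keeps the per-thread workload in a much narrower band.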
Each thread is responsible for more data loads when T_k is larger. For example, in a large tile, when the value of T_k is set to 8 or 16, each work-item is responsible for loading 40 and 48 elements, respectively. The workload differences caused by these different tile sizes impact kernel performance.

To explore the impact of the number of work-items per workgroup and of the tile size on the performance of batch GEMM, several experiments were performed; the results are given in Fig. 3. As shown in Fig. 3, under the condition that the number of GEMMs is large and M, N, and K are large enough, various thread kernels (with 64, 128, 256, and 512 threads) are used to compute multiple tiles (the nine tiles are shown in Fig. 3). The four thread kernels commonly used in previous work are selected as benchmarks [21,34,36], and we investigate their performance under the various tiles in comparative experiments. Fig. 3 shows that a kernel's performance first increases and then decreases across the different tiles. When the tile size is small, the thread's workload is also tiny: threads in the kernel compute only a few elements, so their computing power is not fully utilized. As the tile size increases, the number of elements that each thread must calculate and store also increases, and as long as the register data does not overflow, the computing efficiency of the thread keeps improving. When the tile assigned to a thread is too large, the register data overflows and is spilled to global memory. For example, for a 64-thread kernel computing "8*8" and "32*32" tiles, each thread needs to compute 1 and 32 elements of matrix C, respectively; obviously, "32*32" requires more register memory. However, the register memory of each thread is precious. When its maximum capacity is exceeded, data is transferred to global memory for storage, and because access to global memory is far slower than access to registers, performance suffers. In the latter case, each thread needs to complete the calculation of 32 elements and the loading of 64 elements; when calculating larger tiles, the workload of the thread increases significantly.

To avoid significant differences in workload between threads, we use a multi-thread kernel that computes the various tiles by considering the computation amount (W_CP) and data loading amount (W_DL) of the threads in the kernel. For larger tiles such as "32*64" and "64*64", a 256-thread kernel is used for computation. In this way, increasing the number of threads reduces each thread's computation amount and data loading amount, thereby narrowing the gaps between threads' workloads and achieving load balancing. There are five tiles and two kernels (W_num) for small and irregular batch matrix multiplication, as shown in Table 3. Compared with Table 2, we balance the thread workload by setting the tile size and the number of kernel threads so that thread computation and data loading are as consistent as possible across different workgroups.

In the calculation process of GEMM, five tile types are designed for GEMM calculations of different sizes, from "small" to "large". To ensure that the amounts of computation and data loading of the work-items responsible for different tiles are as equal as possible, the number of threads varies with the tile size. In Table 3, two thread numbers are used (128 and 256), and the computation amount (W_CP) and data loading amount (W_DL) of a work-item in each scheme are given. Although the current ROCm and CUDA programming interfaces only support kernels with a uniform thread number, we use a screening operation in the early stage of kernel execution to achieve the effect of a multi-thread kernel design. For example, in this paper the number of kernel threads is set to 256. When the "small", "small-medium", and "medium" tiles are executed, the extra threads are terminated immediately and the corresponding computing resources are released, because these tiles only need 128 threads. Terminating threads early allows a better allocation of computational resources to the threads responsible for computing other tiles. With this implementation, we achieve the effect of a multi-thread kernel. Even though performance may degrade compared with an actual multi-thread kernel, the experimental results in Section 5 demonstrate the excellent performance of this method.

Fig. 3. Experimental results of multi-thread kernel.

Table 3
The multi-thread kernel design scheme with a more balanced workload.

Tile           T_m   T_n   T_k   W_num   W_CP   W_DL
small           16    16    16     128      2      6
small-medium    16    32    16     128      4     10
medium          32    32    16     128      8     16
medium-large    32    64    16     256      8     14
large           64    64    16     256     16     24
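The W_CP and W_DL columns of Table 3 can be reproduced from the tile parameters alone. The sketch below is our illustration, not code from the paper; it assumes W_CP = (T_m × T_n)/W_num and that W_DL counts one A sub-tile, one B sub-tile, and the C tile per work-item, i.e. (T_m × T_k + T_k × T_n + T_m × T_n)/W_num — an interpretation that matches all five rows of Table 3:

```python
# Per-work-item workload for a tile, consistent with the W_CP and W_DL
# columns of Table 3. W_CP counts the C elements each thread computes;
# W_DL counts the A, B, and C elements each thread moves, divided by
# the kernel thread count W_num.
def workload(tm, tn, tk, w_num):
    w_cp = (tm * tn) // w_num
    w_dl = (tm * tk + tk * tn + tm * tn) // w_num
    return w_cp, w_dl

# The five tile types of Table 3: name -> (T_m, T_n, T_k, W_num)
tiles = {
    "small":        (16, 16, 16, 128),
    "small-medium": (16, 32, 16, 128),
    "medium":       (32, 32, 16, 128),
    "medium-large": (32, 64, 16, 256),
    "large":        (64, 64, 16, 256),
}

for name, (tm, tn, tk, w_num) in tiles.items():
    w_cp, w_dl = workload(tm, tn, tk, w_num)
    print(f"{name:12s}  W_CP={w_cp:2d}  W_DL={w_dl:2d}")
```

Under this reading, doubling the thread count for the two largest tiles is exactly what keeps W_CP and W_DL in the same range as the smaller tiles.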
4.2. Tiling algorithm

4.2.1. Criteria for evaluation

Tiling can be seen as a re-assignment of the GEMM computation task: an efficient tiling algorithm transforms GEMM operations and improves hardware resource utilization. Once the various kernel designs are implemented, choosing an appropriate tiling scheme becomes a crucial issue. In general, for a GEMM, there is better parallelism within the workgroup when the tile size is larger. However, a larger tile also means fewer tiles; if there are too few tiles, the CUs cannot be fully utilized, and computing resources are wasted. Therefore, choosing suitable tiling evaluation criteria is crucial. In previous studies, TLP (Thread-Level Parallelism) was used to quantify the parallelism of tiling strategies on GPUs. Given a GEMM and a tiling strategy, its TLP can be calculated as follows:

    TLP = Σ_i (M_i × N_i)/(T_{m_i} × T_{n_i}) × T_workgroup    (3)

where M_i and N_i are the dimensions of matrix C of the ith GEMM, T_{m_i} and T_{n_i} are the tile sizes chosen for matrix C, and T_workgroup is the number of threads in a workgroup. However, this formulation considers TLP only at the workgroup level. In fact, during the computation of a GEMM, each workgroup is further transformed into wavefronts that run on the CUs. The execution process of batch GEMM can be divided into four phases: segmentation, workgroup, wavefront, and execution. In the segmentation phase, the GEMMs are tiled into tiles of various sizes, and each tile is computed by a workgroup. Workgroups are then transformed into wavefronts based on their hardware resource requirements and their number of work-items. Finally, these wavefronts run in parallel on multiple CUs to perform the batch GEMM calculations. Because tile sizes differ, the computation and data loading amounts of threads are not uniform across wavefronts, which leads to unbalanced hardware resource requirements and different wavefront execution times on the CUs. The overall time of the batch GEMM is the maximum over all CU execution times; if the workload difference between wavefronts is too significant, the execution time of one wavefront becomes excessive and increases the overall calculation time. Eq. (3), however, does not consider the workload gaps between wavefronts. To solve this problem, we propose a new TLP calculation method:

    TLP_new = φ(Σ_i (M_i × N_i)/(T_{m_i} × T_{n_i})) × T_wavefront    (4)

where M_i, N_i, T_{m_i}, and T_{n_i} have the same meaning as in Eq. (3), T_wavefront is the number of work-items in a wavefront, and φ represents the conversion from workgroups to wavefronts.

The conversion mainly considers the following factors: the number of work-items in the workgroup; the VGPR, SGPR, and LDS required by a work-item; and the maximum number of wavefronts supported by the CU. These factors are tied to the GPU hardware architecture. Take AMD's MI210, based on the CDNA 2.0 architecture, as an example. Under the limit on the number of work-items in a workgroup, the number of wavefronts can be calculated as follows:

    WF_wg = 16 × ceil(WI_wg / 64)    (5)

where WF_wg is the maximum number of wavefronts under the work-item limit and WI_wg is the number of work-items in the workgroup. Eq. (5) indicates that when the number of work-items is at most 64, a workgroup contains only one wavefront, and the number of workgroups in a CU is limited to 16.

Limited by the sizes of the VGPR, SGPR, and LDS, the number of wavefronts can be calculated as follows:

    WF_V = 4 × floor(VGPR_max / (VGPR_used × 64))    (6)

where WF_V is the maximum number of wavefronts under the VGPR limit, VGPR_max is the VGPR size of a Single Instruction Multiple Data (SIMD) unit, and VGPR_used is the VGPR used by a work-item. In the CDNA 2.0 architecture, each CU consists of four SIMDs.

    WF_S = floor(SGPR_max / SGPR_used)    (7)

where WF_S is the maximum number of wavefronts under the SGPR limit, SGPR_max is the SGPR size of the CU, and SGPR_used is the SGPR used by a wavefront.

    WF_L = floor(LDS_max / LDS_used) × ceil(WI_wg / 64)    (8)

where WF_L is the maximum number of wavefronts under the LDS limit, LDS_max is the LDS available to a workgroup, LDS_used is the LDS used by a workgroup, and WI_wg has the same meaning as in Eq. (5).

To sum up, the number of wavefronts must meet all of the above limits:

    WF = min(WF_wg, WF_V, WF_S, WF_L, WF_C)    (9)

where WF is the number of activated wavefronts and WF_C is the maximum number of wavefronts supported by the CU.

The number of wavefronts and the corresponding number of threads are introduced into Eq. (4) to compute the TLP more accurately. Compared with Eq. (4), Eq. (3) considers the workload only at the workgroup level and neglects the runtime conversion from workgroups to wavefronts. Eq. (3) is valid only if two conditions are satisfied: first, all thread computation and data loading amounts are consistent; second, the hardware resources required by the activated wavefronts do not exceed the CU limits. Note that for GEMM at different precisions, threads have different requirements for computing resources (VGPR, SGPR, LDS) during computation. Therefore, for matrices of different precisions, the values of VGPR_used, SGPR_used, and LDS_used in Eqs. (6)–(8) differ, which affects the number of activated wavefronts.

4.2.2. Tiling fine-tuning

For batch GEMM, an initial tiling scheme is first assigned to address the context switching and low hardware resource utilization caused by the variable scale of the matrices. The tiling scheme is then adjusted according to the TLP estimate of the batch GEMM and the GPU hardware architecture, and finally the best tiling scheme is obtained. In the first stage, the tile size chosen for each GEMM according to the matrix dimensions should meet the following conditions:

    T_{m_i} ≤ M_i and M_i mod T_{m_i} = 0
    T_{n_i} ≤ N_i and N_i mod T_{n_i} = 0    (10)
    T_{k_i} ≤ K_i and K_i mod T_{k_i} = 0

where T_{m_i} and T_{n_i} are the tile dimensions of the tiling scheme and T_{k_i} is the sub-tile size along the K dimension. Two issues remain. (1) After the first phase, the batch GEMM has only an "initial scheme", which cannot achieve optimal parallel computing efficiency. (2) Due to the variability of matrix sizes in batch GEMM, one or several of B_size, M, N, and K may be particularly small, which we call an extreme GEMM case. In this case, the "initial scheme" cannot produce enough tiles, leaving some CUs idle and wasting GPU computing power.

To solve these problems, the "initial scheme" is adjusted reasonably and efficiently in the second stage. For larger matrices, smaller tiles are used for segmentation; reducing the tile size increases the number of tiles and keeps the CUs from idling. The details are as follows: for a GEMM with an appropriate "initial scheme", to avoid wasting GPU hardware resources, some larger GEMMs are cut with smaller tiles to ensure that the number of tiles is sufficient. For example, tiles whose initial size is "64*64" are re-segmented with "32*32" tiles, so the number of tiles increases as the tile size decreases. This fine-tuning approach ensures that the CUs are not idle, increasing hardware resource utilization at the expense of intra-tile parallelism.

Algorithm 1 The tiling algorithm.
 1: Initialize TLP_threshold, TLP = 0, total_workgroup = 0, total_wavefront = 0;
 2: for i = 0 to B_size − 1 do
 3:   Calculate T_{m_i}, T_{n_i} according to Eq. (10);
 4:   total_workgroup += (M_i * N_i) / (T_{m_i} * T_{n_i});
 5: end for
 6: TLP_new = φ(total_workgroup) × T_wavefront;
 7: Tile[] classes range from "large" down to "small";
 8: while TLP_new < TLP_threshold do
 9:   for j = 0 to B_size − 1 do
10:     if Tile[j] is "large" then
11:       Set Tile[j] to "medium-large";
12:     else if Tile[j] is "medium-large" then
13:       Set Tile[j] to "medium";
14:     else if Tile[j] is "medium" then
15:       Set Tile[j] to "small-medium";
16:     else if Tile[j] is "small-medium" then
17:       Set Tile[j] to "small";
18:     end if
19:     total_workgroup += (M_j * N_j) / (T_{m_j} * T_{n_j});
20:   end for
21:   TLP_new = φ(total_workgroup) × T_wavefront;
22: end while

TLP_threshold is used as a threshold to ensure parallelism among multiple tiles in the fine-tuning phase. Note that TLP_threshold has an important influence on the selection of the tiling scheme for different hardware architectures. As a measure, the TLP values of the batch GEMM vary with the tiling scheme. The setting of TLP_threshold is related to the GPU architecture because it uses the number of wavefronts and the number of threads per wavefront to measure the parallelism of a tiling scheme. The hardware resources and the maximum number of wavefronts supported by each CU are diverse, so a corresponding TLP_threshold should be set for each GPU architecture.

The specific process of selecting a tiling scheme for batch GEMM is given in Algorithm 1: (1) Given a batch GEMM, an "initial scheme" is obtained according to Eq. (10). (2) The TLP of this scheme is calculated from the given batch GEMM and tiling scheme. (3) The TLP of the current tiling scheme is compared with TLP_threshold. If the threshold is not reached, the fine-tuning operation is performed, the current tiling scheme is changed, and the procedure returns to step (2). If the current TLP is greater than or equal to the threshold, go to step (4). (4) The batch GEMM is computed according to the final tiling scheme. In this procedure, the TLP serves as the evaluation criterion for the effectiveness of a tiling scheme on the batch GEMM; if the threshold is not reached, fine-tuning adjusts the scheme and improves the utilization of GPU hardware resources. The optimal tiling scheme thus obtained ensures an optimal implementation at the GEMM and workgroup levels. Once the final tiling scheme is fixed, the multi-thread kernel computes according to the tile sizes, so that the wavefront and work-item levels reach a "workload balance" state.
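The selection procedure of Algorithm 1 can be sketched on the host side as follows. This is our illustration, not the authors' implementation: the tile classes come from Table 3, φ is simplified to the work-item conversion of Eq. (5) without the VGPR/SGPR/LDS caps of Eqs. (6)–(9), and the threshold and batch values are arbitrary examples:

```python
from math import ceil

# Ordered tile classes (Table 3): name -> (T_m, T_n)
TILES = ["small", "small-medium", "medium", "medium-large", "large"]
TILE_DIMS = {"small": (16, 16), "small-medium": (16, 32),
             "medium": (32, 32), "medium-large": (32, 64),
             "large": (64, 64)}
T_WAVEFRONT = 64  # work-items per wavefront on CDNA 2.0

def phi(total_workgroup, wi_per_wg=128):
    # Simplified workgroup-to-wavefront conversion: each workgroup of
    # wi_per_wg work-items yields ceil(wi_per_wg/64) wavefronts. A full
    # phi would also apply the resource limits of Eq. (9).
    return total_workgroup * ceil(wi_per_wg / 64)

def tlp_new(gemms, scheme):
    # Eq. (4): workgroup count per GEMM, converted to wavefronts.
    total_workgroup = sum((m * n) // (TILE_DIMS[s][0] * TILE_DIMS[s][1])
                          for (m, n, _), s in zip(gemms, scheme))
    return phi(total_workgroup) * T_WAVEFRONT

def fine_tune(gemms, scheme, threshold):
    # Algorithm 1 fine-tuning: shrink every tile one size class per pass
    # until the TLP estimate reaches the threshold (or nothing can shrink).
    while tlp_new(gemms, scheme) < threshold:
        if all(s == "small" for s in scheme):
            break
        scheme = [TILES[max(TILES.index(s) - 1, 0)] for s in scheme]
    return scheme

# A toy batch of two GEMMs (M, N, K), both initially tiled "large".
gemms = [(128, 128, 64), (64, 64, 64)]
print(fine_tune(gemms, ["large", "large"], threshold=4096))
# -> ['small-medium', 'small-medium']
```

Shrinking the tiles raises the workgroup (and thus wavefront) count, which is exactly how the fine-tuning phase trades intra-tile parallelism for CU utilization.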
The proposed method is implemented on the GPU platforms of both AMD and NVIDIA. The hardware characteristics of the GPU platform can also significantly impact GEMM performance. For example, on the AMD and NVIDIA platforms, threads are organized into wavefronts and warps as the basic execution units, containing 64 and 32 threads, respectively. The number of threads in the kernel needs to be an integer multiple of the wavefront or warp size to improve kernel occupancy. Meanwhile, the sizes of the registers and shared memory can affect parameter settings during implementation on different hardware architectures. Based on these differences, the proposed method considers parallelism at the wavefront or warp level when performing matrix segmentation on the two GPU platforms. In this way, it can flexibly select tiling schemes based on the hardware characteristics of the GPU to achieve optimal performance, avoid exceeding the maximum register limit, and prevent data overflow, which improves its applicability across hardware architectures.

5. Evaluation

5.1. Setup

Experiment platform and matrix generation. The overall configuration of the experimental platform and the details of the two GPUs are shown in Tables 4 and 5, respectively.

Table 4
The configuration of platforms for evaluation.

Platform setup   AMD-platform    NVIDIA-platform
CPU              EPYC 7763       Platinum 8358
GPU              MI210           A800
OS               Ubuntu 20.04    Ubuntu 20.04
ROCm/CUDA        ROCm 5.6        CUDA 12.0

Table 5
The configuration of GPUs for evaluation.

Name          MI210                         A800
Architecture  CDNA 2.0                      Ampere
Core          1700 MHz                      1410 MHz
Caches        L1 16 KB (per CU), L2 16 MB   L1 192 KB (per SM), L2 40 MB
Memory        64 GB 3.2 Gbps HBM2           80 GB 2.4 Gbps HBM2
Bandwidth     1.6 TB/s                      2.04 TB/s

To ensure the irregularity and variability of the input matrices, the GEMM size parameters M, N, and K are randomly generated within corresponding ranges ([Min, Max_M(N)] and [Min, Max_K]). Max_M, Max_N, and Max_K represent the upper bounds of M, N, and K, respectively; the lower bound of each experiment is denoted uniformly by Min and set to 16 in this paper. For example, Max_M(N) = 512 and Max_K = 128 indicate matrix dimension ranges M ∈ [16, 512], N ∈ [16, 512], and K ∈ [16, 128]. In this way, multiple sets of matrix dimension ranges are obtained, and the parameters needed for GEMM generation are selected at random from the different value ranges.

Comparison method. First, for the two GPU experimental platforms, the default GEMM processing methods rocBLAS [6] and cuBLAS [7], provided by the respective GPU manufacturers, are chosen as the basic comparison methods to demonstrate the effectiveness of the proposed method. Since these methods do not support batch invocation, rocBLAS and cuBLAS compute the batch GEMM in a loop; no stream operations are used during the computation. We also compare with CUTLASS [23], which supports batch GEMM based on sorting and built-in tiles. We then compare with MAGMA [8], supported by the University of Tennessee ICL Lab, which only extends grid.z to support batch GEMM and has no fine-grained optimization strategy; the MAGMA comparison experiments were run on both GPU platforms. Finally, to show the advancement of our proposed method, we compare with the state-of-the-art methods of Wang [36] and Li [21] on their respective platforms. All of the above methods perform a warm-up operation to eliminate the effect of the first kernel launch.

Evaluation criteria. In the following experiments, there are 12 sets of GEMM dimension ranges. The experiments with batch sizes 8, 16, 32, 64, 128, and 256 were run continuously for ten epochs under each set of value ranges. The experimental results are reported as the average GFLOPS (Giga Floating-point Operations Per Second), calculated as:

    GFLOPS = Σ_{i=0}^{n−1} 2(M_i × N_i × K_i) / (total_time × 1.0e9)    (11)

where M_i, N_i, and K_i are the matrix dimensions of the ith GEMM, total_time is the running time on the GPU, and n is the batch size. For simplicity, the experimental data are single-precision floating-point numbers stored in row-major format. The experimental results are averaged over 10 consecutive runs and rounded to two decimal places.

5.2. Speed up

On the two platforms, we first compare with the default methods rocBLAS and cuBLAS. Because these two methods do not support batched irregular GEMMs, we convert the batch GEMM into multiple single GEMMs and compute the results. The specific experimental results are shown in Figs. 4–5: the proposed method achieves 5.09× and 7.18× average speedups over rocBLAS and cuBLAS, respectively. This result arises primarily because these libraries do not support GEMMs of different scales within one batched call, so they can only compute one GEMM at a time. When faced with small matrices, the computational resources of the GPU cannot be fully utilized due to the cost of context switching between multiple GEMMs. As the batch size gradually increases, the advantage of the proposed method becomes more evident, showing that for batched, irregular GEMMs, rocBLAS and cuBLAS are at a disadvantage in computational efficiency and in switching between instances.

Meanwhile, we also compare with CUTLASS, which handles batch GEMM using sorting to mitigate the significant workload differences between multiple matrix multiplications. Fig. 5 shows that the proposed method has a 4.64× speedup, because CUTLASS's built-in tiles are unsuitable when the matrix dimensions are small. Therefore, the proposed method achieves better acceleration than CUTLASS for batched, irregular, small-size matrix multiplication. We then perform a detailed comparison and analysis based on MAGMA: the proposed method achieves 4.37× and 3.36× speed improvements over MAGMA. Figs. 4–5 show that the advantage of our method becomes more pronounced as the batch size increases. This is because MAGMA only uses the largest GEMM size in the batch to set grid.x; due to the irregularity of the matrix sizes, a large number of computational resources in the grid sit idle. The proposed method instead employs fine-grained filtering operations to ensure efficient utilization of computational resources, which is more evident when the differences between matrix dimensions are significant.

As shown in Fig. 4, the proposed method achieves an average 1.88× speedup compared to Wang. Notably, the advantage of the proposed method is more pronounced when Max_K and Max_M are small. For example, in the case of (Max_M(N) = 128, Max_K = 128), the average speedup reaches 1.95×. This is mainly because when the matrix dimensions are small, there are not enough tiles to cover the time consumption of data loading in the wavefront, which is more pronounced in workgroups with heavy loads. The proposed method adjusts the wavefront workload corresponding to the tiles through the multi-thread kernel and ensures consistent computation and data loading across workgroups. This also shows that load and computation balance between wavefronts is conducive to improving the efficiency of GPU parallel computing. On the NVIDIA platform, Fig. 5 shows that the proposed method has an average 1.94× speedup over Li, and the advantage becomes clearer as the batch size increases. There are two reasons for this speedup: (1) Li et al. used batching to balance the workload among different blocks but did not consider the difference between the workloads of threads in different tiles. (2) When selecting the tiling scheme, their TLP is calculated only at the block level, neglecting the fine-grained warp level, which leads to an inaccurate TLP estimate. The proposed method adjusts the wavefront workload corresponding to the tiles through the multi-thread kernel and ensures consistent computation and data loading across different workgroups.

Fig. 4. The comparative results on MI210 (5.09×, 4.37×, 1.88× speedup over rocBLAS, MAGMA, Wang).

Fig. 5. The comparative results on A800 (7.18×, 4.64×, 3.63×, 1.94× speedup over cuBLAS, CUTLASS, MAGMA, Li).

5.3. Kernel occupancy

To explore the difference between the proposed method and the comparison methods in GPU resource utilization, we present the kernel occupancy of the various methods on the two GPU platforms. Kernel occupancy is expressed as:

    kernel occupancy = Num_actived / Num_total    (12)

To obtain accurate performance metrics, we use the profiling tools provided by AMD and NVIDIA, Omniperf (https://github.com/ROCm/omniperf) and Nsight (https://docs.nvidia.com/nsight-compute/NsightCompute/index.html), to evaluate the resource utilization of the kernel during execution. Owing to the architectural differences between the AMD MI210 and the NVIDIA A800, kernel occupancy has distinct interpretations: on the AMD platform, Num_actived is the number of activated wavefronts and Num_total is the theoretical number of wavefronts that the CU can execute simultaneously; on the NVIDIA platform, Num_actived and Num_total represent the number of active warps and the number of warps that are theoretically parallelizable simultaneously. The results of the experiment are shown in Fig. 6.

Fig. 6. The kernel occupancy on two GPU platforms.
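As a concrete illustration of how Eqs. (5)–(9) feed into the occupancy of Eq. (12) on the AMD platform, the following sketch computes the per-CU wavefront cap for one workgroup configuration. The register and LDS figures are illustrative placeholders of our choosing, not measurements from the paper:

```python
from math import ceil, floor

def max_wavefronts(wi_wg, vgpr_max, vgpr_used, sgpr_max, sgpr_used,
                   lds_max, lds_used, wf_cu):
    # Wavefront limits per CU in the style of Eqs. (5)-(9).
    wf_wg = 16 * ceil(wi_wg / 64)                        # Eq. (5)
    wf_v = 4 * floor(vgpr_max / (vgpr_used * 64))        # Eq. (6)
    wf_s = floor(sgpr_max / sgpr_used)                   # Eq. (7)
    wf_l = floor(lds_max / lds_used) * ceil(wi_wg / 64)  # Eq. (8)
    return min(wf_wg, wf_v, wf_s, wf_l, wf_cu)           # Eq. (9)

def occupancy(num_actived, num_total):
    # Eq. (12): activated wavefronts over the theoretical maximum.
    return num_actived / num_total

# Illustrative 256-work-item workgroup: 64 VGPRs per work-item,
# 64 SGPRs per wavefront, 16 KB of LDS per workgroup, and a CU
# that supports at most 32 wavefronts.
wf = max_wavefronts(wi_wg=256, vgpr_max=32768, vgpr_used=64,
                    sgpr_max=2048, sgpr_used=64,
                    lds_max=65536, lds_used=16384, wf_cu=32)
print(wf, occupancy(wf, 32))  # prints: 16 0.5
```

With these numbers the LDS limit of Eq. (8) is the binding one, so the kernel runs at half the theoretical wavefront occupancy — exactly the kind of gap Fig. 6 visualizes.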
Comparing with rocBLAS and cuBLAS, the proposed method has a clear advantage in the case of batch GEMM. It is also in the best position relative to the other methods (CUTLASS, MAGMA, Wang, Li), showing high efficiency in the utilization of GPU resources. As shown in Fig. 6, the proposed method consistently maintains the optimal kernel occupancy on both GPU platforms, which indicates that it can better exploit the computing power of the GPU.

5.4. The overhead of the tiling algorithm

This section presents the proportion of the runtime taken up by the tiling algorithm when executing the proposed method on the two GPU platforms with various batch sizes. The experimental results are presented in Fig. 7. From Fig. 7, it is evident that the tiling algorithm's share of the runtime decreases as the batch size increases. When the batch size is 8, the runtime of the tiling algorithm on the two GPU platforms is 6.06% and 6.37%, respectively. As the batch size increases, more and more GEMMs are executed on the GPU, their execution takes up most of the time, and the runtime portion of the tiling algorithm shrinks; for example, with a batch size of 1024, the tiling algorithm takes less than 1% of the runtime. The experimental results on the two GPUs indicate that the time overhead of the tiling algorithm in the batch GEMM execution process is negligible, especially when the batch size is large. In real-world scenarios such as deep learning, where a large number of GEMM operations are often required, the tiling algorithm introduces even less overhead in the execution process.

Fig. 7. The time overhead of tiling algorithm.

5.5. The performance benefits of the proposed TLP

This section presents comparative experimental results on the two GPU platforms to provide a more detailed evaluation of the proposed TLP. The detailed results are shown in Figs. 8–9, from which it is clear that the proposed TLP performs better overall than the traditional TLP, with speedups of 1.077× on MI210 and 1.085× on A800. From Fig. 8, the proposed method improves performance significantly when the batch size is larger. For example, on MI210 the proposed method has an average speedup of 1.04× when batch size <= 16; when batch size >= 32, it improves performance by 1.10×. This gap exists because when the batch size and matrix dimensions are small, it is difficult to utilize the hardware resources fully; when there are a large number of tiles, the proposed TLP can more accurately evaluate the threads' workload and select the optimal tiling scheme. The same trend appears on the A800 platform, where the proposed TLP yields improvements of 1.04× and 1.11× for batch size <= 16 and batch size >= 32, respectively. The comparative results on the two GPU platforms further demonstrate the effectiveness of the proposed TLP.

Fig. 8. The performance improvement of the proposed TLP on MI210 (1.077× average speedup).

Fig. 9. The performance improvement of the proposed TLP on A800 (1.085× average speedup).

5.6. The latency

This section compares kernel latency on the two GPU platforms to provide a more detailed evaluation of the proposed method. We measured kernel latency with different batch sizes in the comparative experiments; the detailed results are shown in Fig. 10. On MI210, the proposed method reduces latency by 3.87×, 4.53×, and 1.62× compared to rocBLAS, MAGMA, and Wang, respectively. It has the lowest latency on MI210, indicating higher computational efficiency and an effective reduction of latency. On A800, the proposed method shows performance improvements of 3.02×, 2.59×, 2.45×, and 1.89× compared to cuBLAS, MAGMA, CUTLASS, and Li, respectively. Fig. 10 shows that kernel latency increases on both GPU platforms as the batch size grows, with rocBLAS and cuBLAS exhibiting the highest latency: the traditional loop scheduling method significantly increases latency through context switching between kernels when the batch size is large. Fig. 10 also shows that some methods exhibit different latency behavior at different batch sizes. For example, when batch size <= 16, MAGMA has the highest latency on the two GPU platforms; when the batch size is large, its computational performance improves, indicating that MAGMA performs better when there are many matrices. The experimental results on the two platforms show that the proposed method has the lowest latency at all batch sizes, indicating better performance and broad applicability.

Fig. 10. The latency performance of the kernel on two GPU platforms.

5.7. The improved performance on Inception layers of CNN

Modern CNN architectures often have multiple branches to capture features at different scales, and the convolution operations of different scales in each branch can be represented as batch GEMM operations with various dimensions, e.g., in GoogleNet [13], DenseNet [50], SqueezeNet [12], etc. To demonstrate the effectiveness of the proposed method in real-world scenarios, we use various Inception modules as a typical application and perform the forward computation process on the two GPU platforms. The Inception module involves a large number of irregular, small-size GEMM operations. The deep learning frameworks MIOpen (https://github.com/ROCm/MIOpen) and cuDNN (https://github.com/NVIDIA/cudnn-frontend) are used as the benchmark implementations on the GPU platforms. In this section, we select several commonly used Inception modules to evaluate the speedup of the proposed method; the GEMM sizes in the Inception modules are shown in Table 6. Fig. 11 shows the speedup of the proposed method in each Inception module: the average speedups are 2.88× and 1.87×, respectively, and the gray boxes in Fig. 11 represent the average speedup ratios of the different Inception modules. The experimental results suggest that the Inception 8–9 series has the highest average speedup ratios among the Inception modules (3.68× and 2.66×, respectively), because Inception 8–9 contains more matrix shapes than the other Inception modules, and the dimensions of these matrices are smaller than those of the former two. Finally, the proposed method has been shown to significantly accelerate CNN models with various branch structures on the two GPU platforms, particularly in scenarios involving multiple branches, irregular shapes, and small dimensions.

Fig. 11. The speedup performance on Inception layers.

Table 6
The size of GEMM in various Inception modules.

Inception module   GEMM size (M × N × K)
Inception-1        784 × 96 × 192, 784 × 64 × 192, 784 × 32 × 192, 784 × 16 × 192
Inception-2        784 × 64 × 192, 784 × 32 × 192, 784 × 128 × 192
Inception-3        196 × 192 × 192, 196 × 16 × 192, 196 × 96 × 192, 196 × 64 × 192
Inception-4        196 × 64 × 192, 196 × 24 × 192, 196 × 160 × 192
Inception-5        196 × 64 × 192, 196 × 128 × 192, 196 × 24 × 192
Inception-6        196 × 112 × 192, 196 × 144 × 192, 196 × 32 × 192, 196 × 64 × 192
Inception-7        196 × 256 × 192, 196 × 160 × 192, 196 × 128 × 192
Inception-8        49 × 160 × 192, 49 × 128 × 192, 49 × 256 × 192, 49 × 160 × 192, 49 × 32 × 192
Inception-9        49 × 192 × 192, 49 × 128 × 192, 49 × 384 × 192, 49 × 192 × 192, 49 × 48 × 192
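The Table 6 shapes are what the im2col mapping produces for the convolution branches: M = H·W, N = C_out, K = C_in·kh·kw. The sketch below is our illustration and reproduces the Inception-1 row under the assumption of 28 × 28 feature maps with 192 input channels and 1 × 1 kernels (consistent with the first Inception stage of GoogleNet):

```python
# Map a convolution (H x W x C_in feature map, C_out filters of size
# kh x kw) to the GEMM dimensions used in Table 6 via im2col:
# M = H*W spatial positions, N = C_out filters, K = C_in*kh*kw.
def conv_to_gemm(h, w, c_in, c_out, kh=1, kw=1):
    return (h * w, c_out, c_in * kh * kw)

# Four parallel 1x1-convolution branches over a 28x28x192 input
# (assumed shapes, chosen to match the Inception-1 row of Table 6).
branches = [conv_to_gemm(28, 28, 192, c_out) for c_out in (96, 64, 32, 16)]
print(branches)
# -> [(784, 96, 192), (784, 64, 192), (784, 32, 192), (784, 16, 192)]
```

Because every branch shares K = 192 but differs in N, the module naturally yields the batched, irregular, small GEMMs that the proposed method targets.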
6. Conclusion

In this paper, we propose a load-balanced batch GEMM acceleration method for the problems of low parallel computing efficiency and poor hardware resource utilization in batched, irregular, and variable matrix multiplication scenarios. Kernel occupancy and hardware resource utilization are effectively improved by a multi-thread kernel design that balances the computation and data loading of the work-items. A novel approach to TLP computation is devised, in which the parallelism of a tiling scheme is measured by the number of activated wavefronts; this allows the optimal tiling scheme to be selected for different GPU architectures. Experiments on two GPU platforms validate the effectiveness and progress of our proposed method.

Future work includes exploring batch GEMM at various precisions. With the development of Transformer-based models, many GEMM operations are involved in the training and inference of Large Language Models (LLMs), which often use lower precisions such as FP16, FP8, etc. For example, quantized LLMs often involve GEMM operations in which the weight matrices and activation values have different precisions, e.g., W4A16 and W8A8. More complex precisions and storage formats pose challenges to the performance of GEMM operations.

CRediT authorship contribution statement

Yu Zhang: Writing – review & editing, Writing – original draft. Lu Lu: Writing – review & editing, Supervision. Zhanyu Yang: Writing – review & editing. Zhihong Liang: Supervision, Conceptualization. Siliang Suo: Supervision, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of Guangdong Province (2024A1515010204) and the Technological Research Project of Southern Power Grid Company (ZBKJXM20232483).

Data availability

No data was used for the research described in the article.

References

[1] P. Valero-Lara, I. Jorquera, F. Lui, J. Vetter, Mixed-precision S/DGEMM using the TF32 and TF64 frameworks on low-precision AI tensor cores, in: Proceedings of the SC'23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 2023, pp. 179–186.
[2] H. Martínez, S. Catalán, A. Castelló, E.S. Quintana-Ortí, Parallel GEMM-based convolutions for deep learning on multicore ARM and RISC-V architectures, J. Syst. Archit. (2024) 103186.
[3] J. Fornt, P. Fontova-Musté, M. Caro, J. Abella, F. Moll, J. Altet, C. Studer, An energy-efficient GEMM-based convolution accelerator with on-the-fly im2col, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 31 (11) (2023) 1874–1878.
[4] H. Kim, W.J. Song, LAS: locality-aware scheduling for GEMM-accelerated convolutions in GPUs, IEEE Trans. Parallel Distrib. Syst. 34 (5) (2023) 1479–1494.
[5] W. Yang, J. Fang, D. Dong, X. Su, Z. Wang, Optimizing full-spectrum matrix multiplications on ARMv8 multi-core CPUs, IEEE Trans. Parallel Distrib. Syst. (2024).
[6] AMD, Next generation BLAS implementation for ROCm platform, 2024, https://github.com/ROCm/rocBLAS.
[7] B. Tuomanen, Hands-On GPU Programming with Python and CUDA: Explore High-Performance Parallel Computing with CUDA, Packt Publishing Ltd, 2018.
[8] ICL, Matrix algebra for GPU and multicore architectures, 2024, https://icl.utk.edu/magma/.
[9] T. Faingnaert, T. Besard, B. De Sutter, Flexible performant GEMM kernels on GPUs, IEEE Trans. Parallel Distrib. Syst. 33 (9) (2021) 2230–2248.
[10] W.S. Moses, I.R. Ivanov, J. Domke, T. Endo, J. Doerfert, O. Zinenko, High-performance GPU-to-CPU transpilation and optimization via high-level parallel constructs, in: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2023, pp. 119–134.
[11] H. Kim, H. Nam, W. Jung, J. Lee, Performance analysis of CNN frameworks for GPUs, in: 2017 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS, IEEE, 2017, pp. 55–64.
[12] F.N. Iandola, S. Han, M.W. Moskewicz, K. Ashraf, W.J. Dally, K. Keutzer, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size, 2016, arXiv preprint arXiv:1602.07360.
[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[14] G. Pant, D. Yadav, A. Gaur, ResNeXt convolution neural network topology-based deep learning model for identification and classification of pediastrum, Algal Res. 48 (2020) 101932.
[15] S. Barrachina, M.F. Dolz, P. San Juan, E.S. Quintana-Ortí, Efficient and portable GEMM-based convolution operators for deep neural network training on multicore processors, J. Parallel Distrib. Comput. 167 (2022) 240–254.
[16] S. Rajbhandari, Y. He, O. Ruwase, M. Carbin, T. Chilimbi, Optimizing CNNs on multicores for scalability, performance and goodput, ACM SIGARCH Comput. Archit. News 45 (1) (2017) 267–280.
[17] C. Rivera, J. Chen, N. Xiong, S.L. Song, D. Tao, ISM2: Optimizing irregular-shaped matrix-matrix multiplication on GPUs, 2020, arXiv preprint arXiv:2002.03258.
[18] K. Matsumoto, N. Nakasato, S.G. Sedukhin, Performance tuning of matrix multiplication in OpenCL on different GPUs and CPUs, in: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, IEEE, 2012, pp. 396–405.
[19] G.E. Moon, H. Kwon, G. Jeong, P. Chatarasi, S. Rajamanickam, T. Krishna, Evaluating spatial accelerator architectures with tiled matrix-matrix multiplication, IEEE Trans. Parallel Distrib. Syst. 33 (4) (2021) 1002–1014.
[35] G. Alaejos, A. Castelló, H. Martínez, P. Alonso-Jordá, F.D. Igual, E.S. Quintana-Ortí, Micro-kernels for portable and efficient matrix multiplication in deep learning, J. Supercomput. 79 (7) (2023) 8124–8147.
[36] R. Wang, Z. Yang, H. Xu, L. Lu, A high-performance batched matrix multiplication framework for GPUs under unbalanced input distribution, J. Supercomput. 78 (2) (2022) 1741–1758.
[37] Y. Zhang, Y. Wang, Z. Mo, Y. Zhou, T. Sun, G. Xu, C. Xing, L. Yang, Accelerating small matrix multiplications by adaptive batching strategy on GPU, in: 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application, HPCC/DSS/SmartCity/DependSys, IEEE, 2022, pp.
882–887. [20] Q. Han, H. Yang, M. Dun, Z. Luan, L. Gan, G. Yang, D. Qian, Towards [38] A. Abdelfattah, S. Tomov, J. Dongarra, Matrix multiplication on batches of small efficient tile low-rank GEMM computation on sunway many-core processors, J. matrices in half and half-complex precisions, J. Parallel Distrib. Comput. 145 Supercomput. 77 (5) (2021) 4533–4564. (2020) 188–201. [21] X. Li, Y. Liang, S. Yan, L. Jia, Y. Li, A coordinated tiling and batching [39] A. Abdelfattah, A. Haidar, S. Tomov, J. Dongarra, Novel HPC techniques to batch framework for efficient GEMM on GPUs, in: Proceedings of the 24th Symposium execution of many variable size BLAS computations on GPUs, in: Proceedings of on Principles and Practice of Parallel Programming, 2019, pp. 229–241. the International Conference on Supercomputing, 2017, pp. 1–10. [22] P. Tillet, D. Cox, Input-aware auto-tuning of compute-bound HPC kernels, in: [40] A. Abdelfattah, A. Haidar, S. Tomov, J. Dongarra, Performance, design, and Proceedings of the International Conference for High Performance Computing, autotuning of batched GEMM for GPUs, in: High Performance Computing: 31st Networking, Storage and Analysis, 2017, pp. 1–12. International Conference, ISC High Performance 2016, Frankfurt, Germany, June [23] NVIDIA, CUDA templates for linear algebra subroutines, 2024, https://github. 19-23, 2016, Proceedings, Springer, 2016, pp. 21–38. com/NVIDIA/cutlass. [41] A. Li, G.-J. van den Braak, H. Corporaal, A. Kumar, Fine-grained synchronizations [24] J. Huang, C.D. Yu, R.A.v.d. Geijn, Strassen’s algorithm reloaded on GPUs, ACM and dataflow programming on GPUs, in: Proceedings of the 29th ACM on Trans. Math. Softw. 46 (1) (2020) 1–22. International Conference on Supercomputing, 2015, pp. 109–118. [25] B. Boyer, J.-G. Dumas, C. Pernet, W. Zhou, Memory efficient scheduling of [42] J. Li, H. Ye, S. Tian, X. Li, J. 
Zhang, A fine-grained prefetching scheme strassen-winograd’s matrix multiplication algorithm, in: Proceedings of the 2009 for DGEMM kernels on GPU with auto-tuning compatibility, in: 2022 IEEE International Symposium on Symbolic and Algebraic Computation, 2009, pp. International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2022, 55–62. pp. 863–874. [26] A. Fawzi, M. Balog, A. Huang, T. Hubert, B. Romera-Paredes, M. Barekatain, [43] Z. Yang, L. Lu, R. Wang, A batched GEMM optimization framework for deep A. Novikov, F.J. R Ruiz, J. Schrittwieser, G. Swirszcz, et al., Discovering faster learning, J. Supercomput. 78 (11) (2022) 13393–13408. matrix multiplication algorithms with reinforcement learning, Nature 610 (7930) [44] H. Mei, H. Qu, J. Sun, Y. Gao, H. Lin, G. Sun, GPU occupancy prediction of (2022) 47–53. deep learning models using graph neural network, in: 2023 IEEE International [27] G. Xiao, C. Yin, T. Zhou, X. Li, Y. Chen, K. Li, A survey of accelerating parallel Conference on Cluster Computing, CLUSTER, IEEE, 2023, pp. 318–329. sparse linear algebra, ACM Comput. Surv. 56 (1) (2023) 1–38. [45] I. Masliah, A. Abdelfattah, A. Haidar, S. Tomov, M. Baboulin, J. Falcou, J. [28] Y. Chen, G. Xiao, K. Li, F. Piccialli, A.Y. Zomaya, fgSpMSpV: A fine-grained Dongarra, Algorithms and optimization techniques for high-performance matrix- parallel SpMSpV framework on HPC platforms, ACM Trans. Parallel Comput. 9 matrix multiplications of very small matrices, Parallel Comput. 81 (2019) (2) (2022) 1–29. 1–21. [29] Y. Chen, G. Xiao, W. Yang, Optimizing partitioned CSR-based SpGEMM on the [46] G. Park, B. Park, M. Kim, S. Lee, J. Kim, B. Kwon, S.J. Kwon, B. Kim, Y. Lee, sunway TaihuLight, Neural Comput. Appl. 32 (10) (2020) 5571–5582. D. Lee, Lut-gemm: Quantized matrix multiplication based on luts for efficient [30] Y. Chen, K. Li, W. Yang, G. Xiao, X. Xie, T. 
Li, Performance-aware model for inference in large-scale generative language models, 2022, arXiv preprint arXiv: sparse matrix-matrix multiplication on the sunway taihulight supercomputer, 2206.09557. IEEE Trans. Parallel Distrib. Syst. 30 (4) (2018) 923–938. [47] B. Feng, Y. Wang, G. Chen, W. Zhang, Y. Xie, Y. Ding, EGEMM-TC: accelerating [31] G. Xiao, K. Li, Y. Chen, W. He, A.Y. Zomaya, T. Li, Caspmv: A customized scientific computing on tensor cores with extended precision, in: Proceedings and accelerative spmv framework for the sunway taihulight, IEEE Trans. Parallel of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Distrib. Syst. 32 (1) (2019) 131–146. Programming, 2021, pp. 278–291. [32] G. Xiao, C. Yin, Y. Chen, M. Duan, K. Li, Efficient utilization of multi-threading [48] G. Shobaki, A. Kerbow, S. Mekhanoshin, Optimizing occupancy and ILP on the parallelism on heterogeneous systems for sparse tensor contraction, IEEE Trans. GPU using a combinatorial approach, in: Proceedings of the 18th ACM/IEEE Parallel Distrib. Syst. (2024). International Symposium on Code Generation and Optimization, 2020, pp. [33] D.E. Tanner, Tensile: Auto-tuning gemm gpu assembly for all problem sizes, 133–144. in: 2018 IEEE International Parallel and Distributed Processing Symposium [49] A.B. Hayes, L. Li, D. Chavarría-Miranda, S.L. Song, E.Z. Zhang, Orion: A Workshops, IPDPSW, IEEE, 2018, pp. 1066–1075. framework for gpu occupancy tuning, in: Proceedings of the 17th International [34] S. Wang, FlexGEMM: A flexible micro-kernel generation framework, in: Proceed- Middleware Conference, 2016, pp. 1–13. ings of the 5th International Conference on Computer Information and Big Data [50] G. Huang, S. Liu, L. Van der Maaten, K.Q. Weinberger, Condensenet: An Applications, 2024, pp. 164–170. efficient densenet using learned group convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2752–2761. 14