Journal of Systems Architecture 160 (2025) 103341
Contents lists available at ScienceDirect
Journal of Systems Architecture
journal homepage: www.elsevier.com/locate/sysarc
A load-balanced acceleration method for small and irregular batch matrix
multiplication on GPU
Yu Zhang a, Lu Lu a,b,∗, Zhanyu Yang a, Zhihong Liang c,d, Siliang Suo c,d
a School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
b Peng Cheng Laboratory, Shenzhen 518055, China
c Electric Power Research Institute, CSG, Guangzhou, China
d Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou, China
ARTICLE INFO

Keywords:
Batch GEMM
Thread workload
Multi-thread kernel
Tiling algorithm

ABSTRACT

As an essential mathematical operation, GEneral Matrix Multiplication (GEMM) plays a vital role in many applications, such as high-performance computing, machine learning, etc. In practice, the performance of GEMM is limited by the dimensions of the matrices and the diversity of GPU hardware architectures. When dealing with batched, irregular, and small matrices, GEMM usually performs poorly. A common approach is to segment the matrix into multiple tiles and exploit the parallelism between workgroups on the GPU to compute the results. However, previous works consider only tile size and inter-workgroup parallelism, ignoring the low computational efficiency and poor hardware resource utilization caused by workload differences between wavefronts. To address these issues, we propose a load-balanced batch GEMM acceleration method consisting of a multi-thread kernel design and an efficient tiling algorithm. The multi-thread kernel design addresses the workload imbalance between wavefronts in different workgroups, and the efficient tiling algorithm chooses the optimal tiling scheme with a new thread-level parallelism calculation method to achieve load-balanced task allocation. Finally, comparative experiments were conducted on two GPU platforms, AMD and NVIDIA. Experimental results indicate that the proposed method outperforms previous methods.
1. Introduction

GEneral Matrix Multiplication (GEMM) is a standard computing kernel that plays an important role in high-performance computing [1], artificial intelligence [2], image processing [3], and other research fields. With the explosive growth of data volume and the emergence of various algorithms, the demand for high-performance GEMM computing is increasing [4,5]. Additional stream processors and memory are integrated into GPUs to cater to this trend, providing tremendous computational power for GEMM acceleration. To fully utilize this hardware acceleration capability, AMD and NVIDIA provide developers with parallel computing platforms based on the GPU (ROCm and CUDA). On top of these platforms, various optimization algorithms and acceleration libraries have been proposed and shown to be highly effective, such as rocBLAS [6], cuBLAS [7], MAGMA [8], etc. These methods achieve optimal computational task allocation through hardware resource scheduling and thread parallelism to accelerate matrix multiplication [9,10].

Many real-world applications, such as deep learning, involve irregular, small-size matrix multiplication operations in their computations [11]. For example, Convolutional Neural Networks (CNNs) [12-14] contain a large number of convolutional layers, and the convolution kernels tend to be small (e.g., 1*1 and 3*3). Convolution operations are converted to GEMM using the Im2col function, and the resulting matrix dimensions are typically less than 1000 [15,16]. These small GEMM computations prevent the GPU from fully exploiting its hardware computing potential. In this case, the scheduling overhead between batch GEMMs and the irregularity of the matrices pose challenges to computational performance [17,18]. For a single GEMM, tiling is the standard solution: the matrix is segmented into multiple tiles, and a thread block is responsible for computing each tile. Since tiles are independent, multiple tiles can be computed in parallel by multiple threads on the GPU, speeding up the GEMM computation. A larger tile dimension will increase the Thread-Level Parallelism (TLP) of a single tile and also will
Corresponding author at: School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China.
E-mail addresses: yuzhang0722@163.com (Y. Zhang), lul@scut.edu.cn (L. Lu), yangzhanyu@hotmail.com (Z. Yang), liangzh@csg.cn (Z. Liang),
suosl@csg.cn (S. Suo).
https://doi.org/10.1016/j.sysarc.2025.103341
Received 3 September 2024; Received in revised form 3 November 2024; Accepted 8 January 2025
Available online 23 January 2025
1383-7621/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
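The Im2col conversion mentioned in the introduction determines the GEMM shape that a convolution produces. The sketch below is our illustration, not code from the paper; the function name and the no-padding, stride-1 convention are our assumptions:

```python
# Sketch (our illustration): how Im2col turns a convolution into a GEMM.
# For an H x W input with C channels and a Kh x Kw kernel with F filters,
# the resulting GEMM has one row per filter, one column per output
# position, and one reduction element per unrolled receptive-field entry.

def im2col_gemm_dims(H, W, C, Kh, Kw, F, stride=1):
    """Return the (M, N, K) GEMM dimensions produced by Im2col
    (no padding, given stride)."""
    out_h = (H - Kh) // stride + 1
    out_w = (W - Kw) // stride + 1
    M = F              # one row per filter
    N = out_h * out_w  # one column per output position
    K = C * Kh * Kw    # unrolled receptive field
    return M, N, K

# A 3x3 convolution with 64 filters on a 28x28x32 feature map:
print(im2col_gemm_dims(28, 28, 32, 3, 3, 64))  # (64, 676, 288)
```

All three dimensions stay well below 1000 here, which is exactly the small, GPU-underutilizing regime the introduction describes.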
reduce the number of tiles, resulting in a failure to fully utilize the hardware resources of the GPU [19,20]. The Instruction-Level Parallelism (ILP) of a single thread is related to the K-dimension. Generally, a large enough matrix can fully use the GPU hardware resources and achieve higher TLP and ILP [21,22].

To improve computational efficiency, previous studies have proposed acceleration methods for matrix multiplication. For instance, rocBLAS [6] and cuBLAS [7] provide batch GEMM APIs (rocblasSgemmBatched and cublasSgemmBatched), which allow multiple GEMMs to be calculated simultaneously on GPUs. However, these APIs support only uniform matrix sizes, which considerably limits their applications. NVIDIA also provides a C++-style template library, CUTLASS [23], which utilizes built-in tile templates and sorting to accelerate matrix multiplication. In fact, matrix sizes are variable in many real-world applications [11]. To solve this issue, a Vbatch GEMM routine that supports batch GEMM of various sizes is designed and implemented by MAGMA (magmablas_sgemm_vbatched). It adapts to batch GEMMs with multiple tiling strategies, assigning an appropriate tile to each GEMM for large performance gains. Although variable sizes are supported in MAGMA, it still has some limitations. First, MAGMA only supports coarse-grained tiling strategies that are not appropriate for all GEMMs; coarse-grained tiling results in an unbalanced kernel workload and reduced GPU utilization. Second, the grid size is determined by the tiling of the largest matrix, which leads to idle threads and a waste of GPU computing power. Third, the lack of an evaluation criterion for tiling makes strategy choice inefficient.

To thoroughly support batch GEMM with variable sizes, it is essential to design a tiling algorithm that can adapt to all GEMMs and adaptively choose tile sizes, not limited to a single size. The optimal tiling for each GEMM is different, depending on the matrix dimensions (M, N, K). How to choose a suitable tile is a challenge for batch GEMM. At the same time, an evaluation criterion based on the current GPU hardware and tiling strategy is also essential. With such a criterion, an appropriate tiling for each GEMM can be chosen to fully utilize the GPU computing capabilities and achieve better computational performance. How to measure the effectiveness of a tiling algorithm on the GPU hardware is a challenging problem. Tiles of various sizes can lead to significant differences in computational effort within each workgroup, and further to an unbalanced distribution of computational tasks and excessive load differences between threads. Hence, for tiles of various sizes, balancing thread computation and data loading during computation is also a challenge for batch GEMM.

To address the above challenges, we propose a batch GEMM acceleration method with a multi-thread kernel design. Furthermore, an efficient tiling algorithm is proposed to achieve load balance and higher hardware resource utilization. Our contributions can be summarized as follows:

• A multi-thread kernel design scheme is proposed to balance thread computation and data loading in different workgroups to compute the various tiles.
• A novel TLP computation method is designed to select the optimal tiling algorithm by combining the kernel occupancy of the GPU and the tiling operation.
• An efficient tiling algorithm is implemented by considering the GPU hardware architecture and the batch GEMM workload.
• The proposed method can efficiently handle batches of irregular GEMMs and achieves state-of-the-art performance on AMD and NVIDIA GPU platforms.

The rest of the paper is organized as follows. Section 2 provides related work and motivation. Section 3 introduces background on batch GEMM, GPU architecture, and kernel occupancy. Section 4 presents the details of the multi-thread kernel design and load-balanced tiling algorithm. Section 5 demonstrates and evaluates the experimental results. Section 6 provides the conclusions of the paper and future work. The source code of this paper can be obtained from this repository link: https://github.com/zhangyu0722/BatchGEMM.git.

2. Related work and motivation

2.1. Related work

Several approaches have been proposed for batch GEMM computation, which mainly focus on algorithm-level optimization or architecture-level optimization. The former mainly explores lower bounds on the time complexity of GEMM operations at the mathematical level and reduces the computational effort. The latter is based on different GPU architecture features and uses corresponding optimization techniques to improve the computational efficiency of GEMM. In algorithm-level optimization, Strassen [24] proposed a novel GEMM algorithm based on the property that matrix addition is faster than matrix multiplication, using seven multiplications and multiple addition operations instead of eight multiplications. This approach mathematically reduced the time complexity of GEMM to O(n^2.81) for the first time. To reduce the extra memory requirement of Strassen's algorithm, three different methods were proposed in [25]: pre-additions, overwriting the input matrix, and recursive scheduling. At the same time, due to the powerful effect of deep neural networks in various domains, Fawzi et al. [26] transformed the search for the optimal complexity of matrix multiplication into a tensor decomposition problem and used reinforcement learning to explore lower bounds on the complexity of matrix multiplication. In particular, for a 4 × 4 matrix, the number of multiplications was as low as 47, better than the two-level Strassen algorithm, which involves 49 multiplications. Although the above approaches reduce the mathematical complexity of matrix multiplication, it is difficult to realize their performance benefits because they neglect computational scheduling strategies and the multi-level memory architecture of the GPU.

In architecture-level optimization, GPU vendors (NVIDIA and AMD) have designed and implemented computing libraries such as cuBLAS [6] and rocBLAS [7] based on their parallel computing platforms to improve GPU hardware utilization and parallelism. However, due to the restriction to uniform-sized matrices, their performance is poor when faced with small and irregular batch GEMMs. Although NVIDIA provides a C++-style template library, the small matrix sizes and the lack of assembly-level optimizations make it difficult for CUTLASS to fully exploit its performance advantages for irregular and small matrix multiplication [23]. These irregular and small-sized matrices often lead to unbalanced workloads among threads in different workgroups, which can reduce kernel performance. For Sparse GEneral Matrix-Matrix multiplication (SpGEMM), the matrix's sparsity leads to significant differences in thread workloads [27,28]. To address the unbalanced workload, Chen et al. [29] optimized the matrix segmentation by analyzing the distribution of floating-point calculations in CSR-based SpGEMM, which achieves load balance and performance improvement on Sunway TaihuLight. For the issue of workload imbalance between threads, it is necessary to conduct a detailed analysis of the computation process and hardware platform characteristics to design an efficient parallel framework [30,31]. Xiao et al. [32] introduce a fine-grained partitioning strategy to select appropriate segmentation dimensions, efficiently utilizing multi-thread parallelism and improving the performance of binary sparse tensor contractions. The diversity of matrix sizes makes it difficult to use a unified routine for calculations, resulting in some threads in the CU being idle [33,34]. Indeed, matrix sizes are variable and irregular in various scientific computing scenarios. To overcome the restriction to uniform matrix sizes, MAGMA [8] proposes a Vbatch routine to support batch GEMM of various sizes. It uses a 3D grid for the batch GEMM kernel design, where grid.z represents the batch size. Each GEMM corresponds to one of the 2D-grid planes, and the size of the two-dimensional plane (grid.x, grid.y) is determined by the largest GEMM. In the case of irregular GEMM, if the dimension
Fig. 1. GEMM and batch GEMM schematic diagram.
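As a concrete illustration of the algorithm-level line of work in Section 2.1, Strassen's seven-multiplication scheme for a 2 × 2 (block) product can be sketched as follows. This is our illustration, not code from the paper; scalars stand in for the matrix blocks on which the scheme recurses:

```python
# Strassen's scheme: a 2x2 product with 7 multiplications instead of 8.
# Applied recursively to matrix blocks, this yields O(n^2.81) complexity.

def strassen_2x2(A, B):
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    # Recombine the seven products with additions only.
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```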
difference between the largest GEMM and the rest is too large, a large number of threads and workgroups will be idle, resulting in a waste of GPU computing resources. For various parallel acceleration platforms, different hardware characteristics, such as register size and number of CUs, will affect the allocation of computing resources in the kernel. To ensure kernel performance, it is necessary to set parameters flexibly based on different matrix sizes and hardware architectures [9,35]. To solve this problem, a coordinated tiling and batching strategy is proposed in [21], where a different tiling strategy is used for each GEMM in the batch and appropriate batching is applied according to the tile size to improve the computational efficiency of the GPU. Wang et al. [36] proposed a sort-up algorithm based on the GEMM workload and split-down in the tiling process, which can segment large tiles into multiple smaller tiles. This approach can make better use of the CUs when the number of GEMMs is limited.

2.2. Motivation

Although the above-mentioned methods improve the parallel computing efficiency of batch GEMM on GPUs from various perspectives, two problems remain. One is that the workload of threads varies significantly across the kernel. In the above approaches, tiles of various sizes are designed, and each tile is handled by the corresponding kernel, where the number of threads is fixed. In general, larger tiles have better TLP. However, this also increases the workload of each thread for large tiles, and a thread responsible for computing a large tile requires more hardware resources (VGPR, SGPR, LDS) and computing time. The other problem is that differences between wavefronts in different workgroups are ignored in the TLP calculations. A workgroup is transformed into multiple wavefronts during GPU computation, which are executed in parallel on the CU. Each CU can run multiple wavefronts simultaneously, and the number of wavefronts depends on the hardware resources each wavefront requires. Thus, the TLP on the GPU should be determined by the number of threads in the wavefronts that can be executed in parallel on the CU.

To solve the above problems, we propose an efficient and load-balanced batch GEMM acceleration method, which consists of two parts: a multi-thread kernel design scheme and an efficient tiling algorithm. The multi-thread kernel design balances the amount of loading and computation in the threads corresponding to each tile; tiles of various sizes correspond to different numbers of threads. Although this is limited by the parallel programming interfaces of the CUDA and ROCm platforms, where the number of threads responsible for computing a tile is uniform, we use a corresponding filtering operation during kernel execution to effectively alleviate this problem. The efficient tiling algorithm chooses the optimal scheme based on different GEMMs and GPUs. To measure the effect of tiling, we propose a new way of computing the TLP based on wavefronts. The optimal tiling scheme is obtained by adjusting the tiling strategy according to the TLP. Finally, we obtain an efficient tiling algorithm based on the new TLP calculation method. Section 4 introduces the details of the proposed method.

3. Background

3.1. GEMM and batch GEMM

For a single GEMM, the accumulation routine is C = αAB + βC, where A ∈ R^(M×K), B ∈ R^(K×N) and C ∈ R^(M×N) are dense matrices, M, N, and K represent the matrix dimensions, and α and β are constant scalars. A common approach is to tile matrix C into multiple tiles [21,36], using the parallel threads of the GPU to calculate each tile and splicing the results together. As shown in Fig. 1(b), given a GEMM of size M × N × K, matrix C is segmented into multiple tiles of size T_m × T_n. Each workgroup is responsible for the calculation of one tile and needs to access a row section of matrix A of size T_m × K and a column section of matrix B of size K × T_n. However, the row section of A and the column section of B (shown in Fig. 1(b) as the gray parts of matrices A and B, respectively) are too large to store in shared memory and registers. Hence, the row section of A and the column section of B are split into multiple A tiles of size T_m × T_k and B tiles of size T_k × T_n, respectively. A partial result of C is obtained by multiplying an A tile with a B tile, and accumulating the partial results yields the final result.

To batch-run multiple GEMMs, a naive routine computes each GEMM individually. However, when the matrix size is small, a single GEMM does not fully utilize the GPU's computing power, leaving CUs idle [37,38]. To avoid this situation, batch GEMM methods design multiple kernels for the various GEMMs on the GPU [36,39]. Compared to a single GEMM, a batch GEMM is expressed as (M × N × K × B_size), where M, N and K represent the matrix dimensions and B_size represents the batch size. A batch GEMM maps to a 3D grid, where grid.z is the batch size, and grid.x and grid.y are the length and width of a two-dimensional plane, respectively [40]. To balance the workload of a batch GEMM, a variety of tile sizes are used for GEMM tiling. The two-dimensional grid size corresponds to matrix C and the tiling strategy. Each tile is handled by the corresponding workgroup. A workgroup is decomposed into multiple wavefronts that execute on the CU. The 3D grid of batch GEMM is shown in Fig. 1(a).

3.2. GPU architecture and kernel occupancy

With the improvement of hardware architectures and parallel computing programming platforms (such as ROCm¹ and CUDA²), GPUs are becoming the most popular hardware accelerators. The two most commonly used GPU families are from AMD and NVIDIA, widely used in various scientific computing platforms. However, some basic concepts are expressed differently in ROCm and CUDA. We chose AMD's official

1 https://rocm.docs.amd.com/en/latest/
2 https://docs.nvidia.com/cuda/
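The tiling of Section 3.1 and the MAGMA-style 3D grid sizing can be modeled as follows. This is an illustrative Python sketch of ours, not the paper's GPU kernel: the loop over (ti, tj) pairs stands in for the workgroups, and the tk loop for the accumulation of partial results:

```python
import math

# Sketch (our illustration): C = alpha*A@B + beta*C computed tile by tile.
# Each (Tm x Tn) tile of C steps through the K dimension in Tk-sized slices,
# as a workgroup would on the GPU.

def tiled_gemm(A, B, C, M, N, K, Tm, Tn, Tk, alpha=1.0, beta=0.0):
    out = [[beta * C[i][j] for j in range(N)] for i in range(M)]
    for ti in range(0, M, Tm):          # one (ti, tj) pair per workgroup
        for tj in range(0, N, Tn):
            for tk in range(0, K, Tk):  # accumulate partial results
                for i in range(ti, min(ti + Tm, M)):
                    for j in range(tj, min(tj + Tn, N)):
                        acc = sum(A[i][k] * B[k][j]
                                  for k in range(tk, min(tk + Tk, K)))
                        out[i][j] += alpha * acc
    return out

def vbatch_grid(sizes, Tm, Tn):
    """MAGMA-style 3D grid: (grid.x, grid.y) sized by the largest GEMM,
    grid.z equal to the batch size."""
    gx = max(math.ceil(M / Tm) for (M, N, K) in sizes)
    gy = max(math.ceil(N / Tn) for (M, N, K) in sizes)
    return gx, gy, len(sizes)

# Two irregular GEMMs with 16x16 tiles: the smaller one leaves most of
# the (grid.x, grid.y) plane idle, as discussed in Section 2.1.
print(vbatch_grid([(64, 64, 32), (16, 128, 8)], 16, 16))  # (4, 8, 2)
```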
Table 1
ROCm/CUDA terminology.

ROCm | CUDA | Description
Compute Unit (CU) | Streaming Multiprocessor (SM) | One of many parallel vector processors in a GPU that contains parallel ALUs. All waves in a workgroup are assigned to the same CU.
Kernel | Kernel | Function launched to the GPU that is executed by multiple parallel workers. Kernels can run in parallel with the CPU.
Wavefront | Warp | Collection of operations that execute in lockstep, run the same instructions, and follow the same control-flow path; individual lanes can be masked off. Think of this as a vector thread: a 64-wide wavefront is a 64-wide vector op.
Workgroup | Thread block | Group of wavefronts that execute on the same CU and can synchronize and communicate through local memory.
Work-item/Thread | Thread | GPU programming models can treat this as a separate thread of execution, though it does not necessarily make independent sub-wavefront forward progress.
Global Memory | Global Memory | DRAM memory accessible by the GPU that goes through some layers of cache.
Local Memory | Shared Memory | Scratchpad memory that allows communication between wavefronts in a workgroup.
Private Memory | Local Memory | Per-thread private memory, often mapped to registers.
terminology for this paper to provide precise specifications. To clarify some differences and relationships between ROCm and CUDA terms, a comparison of terminology is given in Table 1.

A GPU is composed of multiple Shader Engines (SEs) and a command processor. Each SE is integrated with multiple CUs and its own workload manager. Each CU contains a large number of Arithmetic and Logic Units (ALUs), a small number of control units, and caches. Hence, GPUs are suitable for large numbers of simple parallel computing tasks. A GPU kernel consists of one or multiple workgroups, the size of which is determined by the number of wavefronts and threads. In the memory hierarchy, the GPU has global memory, local memory, and private memory, ordered from slow to fast by access speed; local memory and private memory are much smaller than global memory [41,42].

Kernel occupancy represents the actual utilization of compute unit resources by a kernel function on the GPU, defined as the ratio of active wavefronts to the maximum number of wavefronts supported by a CU [35,43]. An active wavefront running on a CU requires resources such as Vector General-Purpose Registers (VGPR), Scalar General-Purpose Registers (SGPR), Local Data Share (LDS), etc. A wavefront can be activated and run on a CU when all required resources are available. When the utilization of CU resources is low, the number of active wavefronts is small, which wastes hardware resources and degrades the parallel performance of the kernel. On the other hand, when the number of active wavefronts on the CU increases, the resources used by each wavefront and the register storage available to each work-item in the wavefront decrease [44,45].

The number of active wavefronts on a CU is mainly limited by the following factors: the number of work-items in each workgroup and the sizes of the VGPR, SGPR, and LDS. For example, in AMD's MI100³ and MI210,⁴ a wavefront consists of 64 work-items. When the number of work-items in a workgroup is less than or equal to 64, only one wavefront is included. The VGPR, SGPR, and LDS sizes on the CU impose a corresponding upper bound for each work-item. According to the kernel design, the resources on the CU need to be allocated before executing each work-item. When the resource requirements of a work-item are satisfied, the wavefront can become active and run on the CU. Otherwise, it will not run until other wavefronts finish their tasks and release resources. In order to fully utilize the hardware resources of the GPU and improve the efficiency of parallel computing, the kernel occupancy should be raised as much as possible without data overflow [46,47]. In batch GEMM, an efficient kernel design should properly allocate the data loading and computation workload of each work-item in the wavefront, so that the memory space and computing power of the CU can be utilized more efficiently [48,49].

4. Overview

4.1. Multi-thread kernel design

Tile size and kernel design are closely related in batch GEMM algorithms, and there are two matrix tile design routes. The first is to design one tile to fit all GEMMs; the second is to design various tiles adapted to different GEMMs. Compared with the first method, the latter is more flexible and efficient in utilizing the computing resources of the GPU for irregular GEMM. For GEMMs of various shapes and sizes, using a single tile can easily increase the workload differences between threads in multiple workgroups, affecting the allocation of computing resources. In this paper, we perform a multi-thread kernel design for the second matrix segmentation method. Two different tile design strategies are shown in Fig. 2, which presents their effect on the occupancy of the 3D grid. For a batch GEMM, different tile sizes lead to different numbers of workgroups, resulting in different 3D grid occupancies.

For a single GEMM, matrix C is tiled into multiple tiles. The tile size can be flexibly designed, and each tile can be computed in parallel without data interference. Each tile is calculated by the corresponding workgroup and can be represented as a 2D grid as a whole. When the size and number of tiles are large enough, efficient parallel execution can usually be obtained. However, in real-world cases, the matrices in batch GEMM tend to be small and irregular, which leads to poor performance of traditional methods. Therefore, previous methods adopt a variety of tiles to fit the corresponding GEMMs, with each tile based on a uniform number of threads, which makes the workload of threads in large tiles much larger than in small tiles. This gap in thread workload results in unbalanced thread loading and reduces GPU parallel computing efficiency. Table 2 lists the detailed parameters for tiles of various sizes based on the same work-item design (the number of work-items in the kernel is 128). W_CP and W_DL represent the computation

3 https://www.amd.com/system/files/documents/instinct-mi100-brochure.pdf
4 https://www.amd.com/content/dam/amd/en/documents/instinct-business-docs/white-papers/amd-cdna2-white-paper.pdf
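The occupancy limits described in Section 3.2 can be modeled with a simple per-resource estimate. The sketch below is our illustration: the default resource limits are placeholders rather than exact MI100/MI210 values, and the SGPR bound is omitted for brevity:

```python
# Simplified occupancy estimate (our sketch): the number of active
# wavefronts per CU is the minimum admitted by each resource.

def kernel_occupancy(wg_threads, vgpr_per_thread, lds_per_wg,
                     wavefront_size=64, max_waves_per_cu=40,
                     vgprs_per_cu=65536, lds_per_cu=65536):
    waves_per_wg = -(-wg_threads // wavefront_size)  # ceil division
    # Wavefronts admitted by the register file.
    by_vgpr = vgprs_per_cu // (vgpr_per_thread * wavefront_size)
    # Workgroups admitted by LDS, converted to wavefronts.
    by_lds = ((lds_per_cu // lds_per_wg) * waves_per_wg
              if lds_per_wg else max_waves_per_cu)
    active = min(max_waves_per_cu, by_vgpr, by_lds)
    return active / max_waves_per_cu  # occupancy ratio

# A 256-thread workgroup using 64 VGPRs per thread and 16 KiB of LDS:
print(kernel_occupancy(256, 64, 16384))  # 0.4
```

Halving the per-thread VGPR or LDS demand raises the admitted wavefront count, which is exactly the lever the multi-thread kernel design of Section 4 exploits.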
Fig. 2. Two different tile design strategies for batch GEMMs. ((a) All GEMMs adopt the same tiling scheme, which is divided into multiple tiles of the same size. (b) Different
GEMMs adopt different tiling schemes and are divided into multiple tiles of different sizes.).
Table 2
The common kernel design scheme for batch GEMM (there are significant workload gaps between threads).

Tile    T_m  T_n  T_k   W_CP  W_DL
small   16   16   8/16  2     4/6
medium  32   32   8/16  8     12/16
large   64   64   8/16  32    40/48

amount and data loading amount of a work-item, respectively, and their calculation expressions are:

W_CP = (T_m × T_n) / W_num    (1)

W_DL = (T_m × T_n + T_m × T_k + T_k × T_n) / W_num    (2)

where W_num represents the number of work-items responsible for computing the tile.

For different tiles, there is a significant gap in workload between threads (W_CP ∈ [2, 32] and W_DL ∈ [4, 48]). The choice of T_k also has a certain impact on the data load of a work-item: each thread is responsible for more data loads when T_k is larger. For example, for the large tile, when T_k is set to 8 or 16, each work-item is responsible for loading 40 or 48 elements, respectively. The workload differences caused by these different tile sizes impact kernel performance.

To explore the impact of the number of work-items in the workgroup and the tile size on the performance of batch GEMM, some experiments are performed, whose results are given in Fig. 3. Under the condition that the number of GEMMs is large and M, N, and K are large enough, various thread kernels (with 64, 128, 256, and 512 threads) are used to compute multiple tiles (the nine tiles are shown in Fig. 3). The four thread kernels commonly used in previous work are selected as benchmarks [21,34,36]. We used these kernels to investigate their performance under various tiles in comparative experiments. Fig. 3 shows that each kernel's performance first increases and then decreases across the tiles. When the tile size is small, the thread workload is also tiny; in this case, threads in the kernel only compute a few elements, which leaves the threads' computing power underutilized. As the tile size increases, the number of elements that each thread needs to calculate and store also increases. As long as the register data does not overflow, the computing efficiency of the thread keeps improving. When the tile corresponding to a thread is too large, the register data overflows and is spilled to global memory. For example, for a 64-thread kernel computing 8*8 and 32*32 tiles, each thread needs to compute 1 and 32 elements of matrix C, respectively. Obviously, 32*32 requires more register memory. However, the register memory of each thread is precious: when its maximum limit is exceeded, the data is transferred to global memory for storage. Because the access speed of global memory is considerably lower than that of registers, the threads' data access efficiency decreases and the overall time consumption increases. At the same time, because thread workloads vary, when a thread with a heavy workload runs on the CU, the number of active wavefronts on the CU is smaller, so the CU's kernel occupancy (the ratio between the number of active wavefronts and the maximum number of supported wavefronts) is reduced. The CU stays in this low-occupancy state longer because of the longer work-item computation time.

To solve this problem, we propose a multi-thread kernel design, which ensures that the workload of each thread is balanced as much as possible. The experimental results in Fig. 3 show that the performance of the kernels varies when calculating the same tile. For example, the 128-thread kernel performs best when calculating a 32*32 tile. This performance gap is mainly due to the varying workloads of threads under different kernels, which affects overall performance. For the 128-thread kernel, when calculating a 32*32 tile, each thread needs to complete the calculation of 8 elements and the loading of 16 elements. When calculating a 64*64 tile, the workload of the threads is heavy: each thread needs to complete the calculation of 32 elements and the loading of 64 elements. When calculating larger tiles, the thread workload increases significantly. To avoid significant workload differences between threads, we use a multi-thread kernel to calculate the various tiles by considering the computation amount (W_CP) and data loading amount (W_DL) of the threads in the kernel. For larger tiles such as 32*64 and 64*64, a 256-thread kernel is used. In this way, increasing the number of threads reduces each thread's computation and data loading amounts, thereby narrowing the gaps between thread workloads and achieving load balance. There are five tiles and two kernel sizes (W_num) for small and irregular batch matrix multiplication, as shown in Table 3. Compared to Table 2, we balance the thread workload by setting the tile size and the number of kernel threads so that thread computation and data loading are as consistent as possible across different workgroups. In the GEMM calculation process, five tile types, from small to large, are designed for GEMMs of different sizes. To ensure that the computation and data loading amounts of the work-items responsible for different tiles are as equal as possible, the number of threads varies depending on the tile size. In Table 3, two different thread numbers (128 and 256) are used, and the computation amount (W_CP) and data loading amount (W_DL) of the work-item in each scheme are given. Although the current ROCm and CUDA programming interfaces only support kernels with a uniform thread number, we use a screening operation in the early stage of kernel execution to achieve the effect of a multi-thread kernel design. For example, in this paper, the number of kernel threads is set to 256. When the small, small-medium, and medium tiles are executed, the extra threads are terminated immediately and the corresponding computing resources are released, because these tiles only need
5
Y. Zhang et al. Journal of Systems Architecture 160 (2025) 103341
Fig. 3. Experimental results of multi-thread kernel.
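The balance targeted by this design can be checked with a short calculation. The sketch below reproduces the per-work-item columns of Table 3 under one consistent reading of them (our reconstruction, not a formula stated explicitly in the paper): each of the W_num work-items computes W_CP = T_m*T_n/W_num elements of C, and moves W_DL = (T_m*T_k + T_k*T_n)/W_num + W_CP values, i.e. its share of loading the A and B sub-tiles for one K step plus storing its C elements.

```python
# Reproduce the per-work-item workload columns of Table 3.
# Assumed model (our reconstruction): a workgroup of W_num threads computes a
# T_m x T_n tile of C, loading a T_m x T_k sub-tile of A and a T_k x T_n
# sub-tile of B per K step, and finally storing its share of C.
tiles = {
    "small":        (16, 16, 16, 128),
    "small-medium": (16, 32, 16, 128),
    "medium":       (32, 32, 16, 128),
    "medium-large": (32, 64, 16, 256),
    "large":        (64, 64, 16, 256),
}

def workload(t_m, t_n, t_k, w_num):
    w_cp = (t_m * t_n) // w_num                    # C elements per work-item
    w_dl = (t_m * t_k + t_k * t_n) // w_num + w_cp # loads of A, B plus store of C
    return w_cp, w_dl

for name, params in tiles.items():
    print(name, workload(*params))
```

Running this reproduces the (W_CP, W_DL) pairs of Table 3 — (2, 6), (4, 10), (8, 16), (8, 14), (16, 24) — and shows how doubling the thread count to 256 for the two largest tiles keeps the per-thread gap between tile classes small.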
Table 3
The multi-thread kernel design scheme with a more balanced workload.

Tile           T_m   T_n   T_k   W_num   W_CP   W_DL
small          16    16    16    128     2      6
small-medium   16    32    16    128     4      10
medium         32    32    16    128     8      16
medium-large   32    64    16    256     8      14
large          64    64    16    256     16     24

128 threads. Terminating threads early allows for a better allocation of computational resources to the threads responsible for computing other tiles. With this implementation, we can achieve the effect of a multi-threaded kernel. Even though the performance may be degraded in comparison with an actual multi-threaded kernel, the experimental results in Section 5 demonstrate the excellent performance of this method.

4.2. Tiling algorithm

4.2.1. Criteria for evaluation
Tiling can be seen as a re-assignment of the GEMM computation task. An efficient tiling algorithm can transform GEMM operations and improve hardware resource utilization. When various kernel designs are implemented, choosing an appropriate tiling scheme becomes a crucial issue. In general, for a GEMM, there will be better parallelism within the workgroup when the tile size is larger. However, a larger tile means that the number of tiles is reduced. If there are too few tiles, the CUs cannot be fully utilized, resulting in a waste of computing resources. Therefore, choosing a suitable tiling evaluation criterion is crucial. In previous studies, TLP was used to quantify the parallelism of tiling strategies on GPUs. Given a GEMM and a tiling strategy, its TLP can be calculated as follows:

TLP = Σ_i (M_i × N_i) / (T_mi × T_ni) × T_workgroup    (3)

where M_i and N_i are the dimension sizes of matrix C of the i-th GEMM, and T_mi and T_ni are the tile sizes chosen for matrix C. T_workgroup is the number of threads in a workgroup. However, the above formulation only considers TLP at the level of the workgroup. Indeed, during the computation of the GEMM, the workgroup needs to be further transformed into wavefronts and run on the CU in the form of wavefronts. The execution process of batch GEMM can be divided into four phases: segmentation, workgroup, wavefront, and execution. In the segmentation phase, the GEMM is tiled into tiles of various sizes, and each tile is computed by a workgroup. Workgroups are further transformed into wavefronts based on their hardware resource requirements and the number of work-items. Finally, these wavefronts are run in parallel on multiple CUs for batch GEMM calculations. Due to the difference between tile sizes, the computation amount and data loading amount of threads are not uniform across different wavefronts, which leads to unbalanced hardware resource requirements. The execution time of the wavefronts on the CUs is also different. The overall time of the batch GEMM is the maximum of all CU execution times. If the workload difference between wavefronts is too significant, the execution time of one wavefront will be excessive, increasing the overall calculation time consumption.

Therefore, Eq. (3) does not consider the workload gaps between wavefronts. To solve this problem, we propose a new TLP calculation method as follows:

TLP_new = φ(Σ_i (M_i × N_i) / (T_mi × T_ni)) × T_wavefront    (4)

where M_i, N_i, T_mi and T_ni have the same meaning as in Eq. (3), T_wavefront is the number of work-items in a wavefront, and φ represents the conversion process from workgroup to wavefront.

The conversion process mainly considers the following factors: the number of work-items in the workgroup; the sizes of the VGPR, SGPR, and LDS required by a work-item; and the maximum number of wavefronts supported in the CU. These factors are related to the GPU hardware architecture. Next, take AMD's MI210, which is based on the CDNA2.0 architecture, as an example. Under the limitation of the number of work-items in the workgroup, the number of wavefronts can be calculated as follows:

WF_wg = 16 × ceil(WI_wg / 64)    (5)

where WF_wg is the maximum number of wavefronts under the limit of the number of work-items in the workgroup, and WI_wg represents the number of work-items in the workgroup. Eq. (5) indicates that when the number of work-items is less than or equal to 64, a workgroup contains only one wavefront, and the number of workgroups is limited to 16 in the CU.

Limited by the sizes of the VGPR, SGPR, and LDS, the number of wavefronts can be calculated as follows:

WF_V = 4 × floor(VGPR_max / (VGPR_used × 64))    (6)

where WF_V is the maximum number of wavefronts under the limit of the size of the VGPR, VGPR_max is the size of the VGPR in the Single Instruction Multiple Data (SIMD) unit, and VGPR_used is the VGPR size used by a
work-item. In the CDNA2.0 hardware architecture, each CU consists of four SIMDs.

WF_S = floor(SGPR_max / SGPR_used)    (7)

where WF_S is the maximum number of wavefronts under the limit of the size of the SGPR, SGPR_max is the size of the SGPR in the CU, and SGPR_used is the size of the SGPR used by a wavefront.

WF_L = floor(LDS_max / LDS_used) × ceil(WI_wg / 64)    (8)

where WF_L is the maximum number of wavefronts under the limit of the size of the LDS, LDS_max is the size of the LDS available to a workgroup, LDS_used is the size of the LDS used by a workgroup, and WI_wg has the same meaning as in Eq. (5).

To sum up, the number of wavefronts should meet the limitations of all the above factors, and it is calculated as follows:

WF = min(WF_wg, WF_V, WF_S, WF_L, WF_C)    (9)

where WF is the number of activated wavefronts, and WF_C is the maximum number of wavefronts supported in the CU.

The number of wavefronts and the corresponding number of threads are introduced into Eq. (4) to compute the TLP more accurately and appropriately. Compared with Eq. (4), Eq. (3) only considers the workload at the workgroup level, which neglects the further conversion between workgroup and wavefront at runtime. Eq. (3) is valid only if the following two conditions are satisfied. One is that all thread computation and data load amounts are consistent. The other is that the hardware resources required by the activated wavefronts do not exceed the limit of the CU. Note that for GEMM with different precisions, threads have different requirements for computing resources (VGPR, SGPR, LDS) during the computation process. Therefore, for matrices with different precisions, the values of VGPR_used, SGPR_used, and LDS_used in Eqs. (6)-(8) above are different. This affects the number of activated wavefronts.

4.2.2. Tiling fine-tuning
For batch GEMM, an initial tiling scheme is first assigned to solve the problems of context switching and low hardware resource utilization caused by the matrices' variable scale. Then, the tiling scheme is adjusted according to the TLP estimation of the batch GEMM and the hardware architecture of the GPU, and finally, the best tiling scheme is obtained. In the first stage, the tile size chosen for each GEMM according to the dimensions of the matrix should meet the following conditions:

T_mi ≤ M_i and M_i mod T_mi = 0
T_ni ≤ N_i and N_i mod T_ni = 0    (10)
T_ki ≤ K_i and K_i mod T_ki = 0

where T_mi and T_ni represent the tile dimension sizes corresponding to the tiling scheme, and T_ki is the sub-tile size along the dimension of K. There are two issues. (1) After the first phase, the batch GEMM only has an initial scheme that cannot achieve optimal parallel computing efficiency. (2) Due to the variability of matrix sizes in batch GEMM, one or several of the B_size, M, N, and K values may be particularly small, which is called an extreme GEMM case. In this case, the initial scheme cannot produce enough tiles, which will leave some CUs idle, resulting in a waste of GPU computing power.

To solve these problems, the initial scheme is adjusted reasonably and efficiently in the second stage. For larger matrices, smaller tiles are used for segmentation, and the number of tiles is increased by reducing the tile size to avoid CUs being idle. The details are as follows: for a GEMM, given an appropriate initial scheme, to avoid wasting GPU hardware resources, some larger GEMMs are cut with smaller tiles to ensure that the number of tiles is sufficient. For example, tiles whose initial size is 64 * 64 are segmented with 32 * 32 tiles. As a result, the number of tiles increases as the tile size decreases. This fine-tuning approach ensures that the CU is not idle by increasing the utilization of hardware resources at the expense of intra-tile parallelism.

Algorithm 1 The Tiling algorithm.
1: Initialize TLP_threshold, TLP = 0, total_workgroup = 0, total_wavefront = 0;
2: for i = 0 to B_size - 1 do
3:    Calculate T_mi, T_ni according to Eq. (10);
4:    total_workgroup += (M_i × N_i) / (T_mi × T_ni);
5: end for
6: TLP_new = φ(total_workgroup) × T_wavefront;
7: Tile[·] takes values from "large" to "small";
8: while TLP_new < TLP_threshold do
9:    for j = 0 to B_size - 1 do
10:      if Tile[j] is "large" then
11:        Set Tile[j] to "medium-large";
12:      else if Tile[j] is "medium-large" then
13:        Set Tile[j] to "medium";
14:      else if Tile[j] is "medium" then
15:        Set Tile[j] to "small-medium";
16:      else if Tile[j] is "small-medium" then
17:        Set Tile[j] to "small";
18:      end if
19:      total_workgroup += (M_j × N_j) / (T_mj × T_nj);
20:   end for
21:   TLP_new = φ(total_workgroup) × T_wavefront;
22: end while

TLP_threshold is used as a threshold to ensure parallelism among the multiple tiles in the fine-tuning phase. Note that TLP_threshold has an important influence on the selection of the tiling scheme for different hardware architectures. As a measure, the TLP values of the batch GEMM vary according to the different tiling schemes. The setting of the TLP_threshold value is related to the architecture of the GPU because it uses the number of wavefronts and the number of threads in a wavefront to measure the parallelism of the tiling scheme. The hardware resources and the maximum number of wavefronts supported by each CU are diverse, so a corresponding TLP_threshold should be set for each GPU architecture.

The specific process of selecting a tiling scheme for batch GEMM is given in Algorithm 1: (1) When a batch GEMM is given, an initial scheme is obtained according to Eq. (10). (2) The TLP of this scheme is calculated according to the given batch GEMM and tiling scheme. (3) The TLP of the current tiling scheme is compared with TLP_threshold. If the threshold is not reached, the fine-tuning operation is performed, the current tiling scheme is changed, and the process returns to step (2). If the current TLP is greater than or equal to the threshold, go to step (4). (4) The batch GEMM is calculated according to the final tiling scheme. In the above procedure, the TLP is used as an evaluation criterion to measure the effectiveness of the tiling scheme on the batch GEMM. If the threshold is not reached, fine-tuning is used to adjust the scheme and improve the utilization of GPU hardware resources. The optimal tiling scheme can thus be obtained to ensure an optimal implementation at the GEMM and workgroup levels. After the final tiling scheme is determined, the multi-thread kernel is selected based on the tile size so that the wavefront and work-item levels can achieve a workload-balanced state.

The proposed method is implemented on both the AMD and NVIDIA GPU platforms. The hardware characteristics of the GPU platform can also significantly impact GEMM performance. For example, on the AMD and NVIDIA platforms, threads are grouped into wavefronts and warps as the basic execution units, containing 64 and 32 threads, respectively. The number of threads in the kernel needs to be an integer multiple of the number of threads in a wavefront or warp to improve kernel occupancy. Meanwhile, the sizes of registers and shared memory
Table 4
The configuration of platforms for evaluation.

Platform setup   AMD-platform    NVIDIA-platform
CPU              EPYC 7763       Platinum 8358
GPU              MI210           A800
OS               Ubuntu 20.04    Ubuntu 20.04
ROCm/CUDA        ROCm 5.6        CUDA 12.0

Table 5
The configuration of GPUs for evaluation.

Name          MI210                          A800
Architecture  CDNA 2.0                       Ampere
Core          1700 MHz                       1410 MHz
Caches        L1 16 KB (per CU), L2 16 MB    L1 192 KB (per SM), L2 40 MB
Memory        64 GB 3.2 Gbps HBM2            80 GB 2.4 Gbps HBM2
Bandwidth     1.6 TB/s                       2.04 TB/s

can affect parameter settings during implementation based on different hardware architectures. Based on this difference, the proposed method considers parallelism at the wavefront or warp level when performing matrix segmentation on the two GPU platforms. In this way, the proposed method can flexibly select tiling schemes based on the hardware characteristics of the GPU to achieve optimal performance, and it can avoid exceeding the maximum register limit and prevent data overflow, which improves its applicability to various hardware architectures.

5. Evaluation

5.1. Setup

Experiment platform and matrix generation. The overall configuration of the experimental platform and the details of the two GPUs are shown in Tables 4 and 5, respectively. To ensure the irregularity and variability of the input matrices, the GEMM size parameters M, N, and K are randomly generated within corresponding ranges ([Min, Max_M(N)] and [Min, Max_K]). Max_M, Max_N, and Max_K represent the upper bounds of M, N, and K, respectively. The lower bound for each experiment is denoted uniformly by Min. In this paper, the value of Min is set to 16. For example, Max_M(N) = 512 and Max_K = 128 indicate that the range of matrix dimensions is M ∈ [16, 512], N ∈ [16, 512] and K ∈ [16, 128]. Thus, multiple sets of matrix dimension ranges can be obtained, and the parameters needed for GEMM generation are chosen from the different value ranges by random selection.

Comparison method. First, for the two GPU experimental platforms, the default GEMM processing methods rocBLAS [6] and cuBLAS [7] provided by the respective GPU manufacturers are chosen as the basic comparison methods to demonstrate the effectiveness of the proposed method. Since these methods do not support batch invocation, in this paper rocBLAS and cuBLAS compute batch GEMM in a loop manner. No stream operations are used during the computation. Meanwhile, we also compare with CUTLASS [23], which supports batch GEMM based on sorting and built-in tiles. We then compare with MAGMA [8], maintained by the University of Tennessee ICL Lab, which only extends grid.z to support batch GEMM but does not have a fine-grained optimization strategy. The MAGMA comparison experiments were run on both GPU platforms. Meanwhile, to show the advancement of our proposed method, we compare with the state-of-the-art methods of Wang [36] and Li [21] on their respective platforms. All of the above methods perform a warm-up operation to eliminate the effect of the first kernel launch.

Evaluation criteria. In the following experiments, there are 12 sets of GEMM dimension ranges. The experiments with batch sizes 8, 16, 32, 64, 128, and 256 were run continuously for ten epochs under each set of value ranges. The experimental results are represented by the average value of GFLOPS (Giga Floating-point Operations Per Second), which is calculated as:

GFLOPS = Σ_{i=0}^{n-1} 2(M_i × N_i × K_i) / (total_time × 1.0e9)    (11)

where M_i, N_i and K_i represent the matrix dimensions of the i-th GEMM, total_time represents the running time on the GPU, and n represents the batch size. For simplicity, the experimental data are single-precision floating-point and stored in row-major format. The experimental results are averaged over 10 consecutive runs, and the final results are rounded to two decimal places.

5.2. Speed up

On the two platforms, we first compare with the default methods rocBLAS and cuBLAS. These two methods do not support batched irregular GEMMs; we convert batch GEMMs into multiple single GEMMs and compute the results. The specific experimental results are shown in Figs. 4-5, which show that the proposed method achieves 5.09× and 7.18× average speedups compared to rocBLAS and cuBLAS. This result is primarily due to the fact that these libraries do not support GEMMs of different scales when computing batch GEMMs, so they can only compute one GEMM at a time. When faced with small matrices, the computational resources of the GPU cannot be fully utilized due to the cost of context switching between multiple GEMMs. As the batch size gradually increases, the advantage of the proposed method becomes more evident. This shows that for batched and irregular GEMMs, rocBLAS and cuBLAS are at a disadvantage in terms of computational efficiency and switching between instances. Meanwhile, we also compare with CUTLASS, which handles batch GEMM using sorting to address the significant workload differences between multiple matrix multiplications. Fig. 5 shows that the proposed method has a 4.64× speedup, which is because CUTLASS's built-in tiles are unsuitable when the matrix dimensions are small. Therefore, the proposed method achieves better acceleration than CUTLASS for batched, irregular, and small-size matrix multiplication. We then perform a detailed comparison and analysis of the experimental performance based on MAGMA. The proposed method has 4.37× and 3.36× speed improvements compared to MAGMA. Figs. 4-5 show that the advantage of our method becomes more pronounced as the batch size increases. This is because MAGMA only uses the largest GEMM size in the batch GEMM to set grid.x. Due to the irregularity of the matrix sizes, a large number of computational resources in the grid will be idle. The proposed method, in this case, employs fine-grained filtering operations to ensure further efficient utilization of computational resources, which is more evident when the difference between matrix dimensions is significant.

As shown in Fig. 4, the proposed method achieves an average 1.88× speedup compared to Wang. It is noted that the advantage of the proposed method is more pronounced when Max_K and Max_M are small. For example, in the case of (Max_M(N) = 128, Max_K = 128), the average speedup can reach 1.95×. This is mainly due to the fact that when the matrix dimensions are small, there are not enough tiles to cover the time consumption of data loading in the wavefront, which is more pronounced in workgroups with heavy loads. The proposed method adjusts the wavefront workload corresponding to the tiles through a multi-thread kernel and ensures consistent computation and data loading across different workgroups. At the same time, it also shows that a state of load and computation balance between wavefronts is more conducive to improving the efficiency of GPU parallel computing. On the NVIDIA platform, Fig. 5 shows that the proposed method has an average 1.94× speedup compared to Li. The advantage of the proposed method becomes clearer as the batch size increases. There are two reasons for this speedup performance:
Fig. 4. The comparative results on MI210. (5.09×, 4.37×, 1.88× speedup over rocBLAS, MAGMA, Wang).
Fig. 5. The comparative results on A800. (7.18×, 4.64×, 3.63×, 1.94× speedup over cuBLAS, CUTLASS, MAGMA, Li).
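For concreteness, the average-GFLOPS metric of Eq. (11) used throughout Figs. 4-5 can be applied directly; the sketch below uses made-up matrix sizes and timing (not measurements from this paper) to show the mechanics of the metric.

```python
# Average GFLOPS over a batch, following Eq. (11):
# GFLOPS = sum_i 2 * M_i * N_i * K_i / (total_time * 1.0e9)
def batch_gflops(dims, total_time_s):
    """dims: list of (M, N, K) per GEMM; total_time_s: measured run time in seconds."""
    flops = sum(2 * m * n * k for (m, n, k) in dims)
    return flops / (total_time_s * 1.0e9)

# Hypothetical batch of three small, irregular GEMMs timed at 0.5 ms.
dims = [(64, 128, 32), (16, 512, 128), (256, 96, 64)]
print(round(batch_gflops(dims, 0.5e-3), 2))  # -> 11.53
```

Note that the metric divides the total floating-point work of the whole batch by one wall-clock time, so per-GEMM launch overheads and idle CUs directly depress the reported average.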
Fig. 6. The kernel occupancy on two GPU platforms.
Fig. 7. The time overhead of tiling algorithm.
(1) Li et al. used batching to balance the workload among different blocks but did not consider the difference between the workloads of threads in different tiles. (2) When selecting the tiling scheme, the TLP is calculated only by considering the block, and the fine-grained warp level is neglected, which leads to an inaccurate calculation of the TLP. The proposed method adjusts the wavefront workload corresponding to the tiles through a multi-thread kernel and ensures consistent computation and data loading across different workgroups. At the same time, it also shows that a state of load and computation balance between wavefronts is more conducive to improving the efficiency of GPU parallel computing.

5.3. Kernel occupancy

To explore the difference between the proposed method and the comparison methods in terms of GPU resource utilization, we present the kernel occupancy of the various methods on the two GPU platforms. The formula for kernel occupancy can be expressed as:

kernel occupancy = Num_actived / Num_total    (12)

To obtain more accurate performance metrics, we utilize Omniperf5 and Nsight6, profiling tools provided by AMD and NVIDIA, to evaluate the resource utilization of the kernel during the execution process. The kernel occupancy has distinct interpretations owing to the distinctions in GPU architecture between the AMD MI210 and the NVIDIA A800. On the AMD platform, Num_actived is the number of activated wavefronts and Num_total is the theoretical number of wavefronts that a CU can execute simultaneously. On the NVIDIA platform, Num_actived and Num_total represent the number of activated warps and the number of warps that are theoretically parallelizable simultaneously.

The results of the experiment are shown in Fig. 6. By comparison with rocBLAS and cuBLAS, it can be seen that the proposed method has a clear advantage in the case of batch GEMM. The proposed method is also in the best position compared to the other methods (CUTLASS, MAGMA, Wang, Li), showing high efficiency in terms of the utilization of GPU resources. As shown in Fig. 6, the proposed method consistently maintains the optimal kernel occupancy on both GPU platforms, which indicates that the proposed method can better exploit the computing power of the GPU.

5.4. The overhead of tiling algorithm

This section presents the proportion of the runtime that is taken up by the tiling algorithm when executing the proposed method on the two GPU platforms with various batch sizes. The experimental results are presented in Fig. 7. From Fig. 7, it is evident that the tiling algorithm's runtime percentage decreases as the batch size increases. When the batch size is 8, the runtime of the tiling algorithm on the two GPU platforms is 6.06% and 6.37%, respectively. As the batch size increases, more and more GEMMs are executed on the GPU, and the execution time of these GEMMs on the GPU side takes up most of the time, resulting in a smaller runtime portion for the tiling algorithm. For example, with a batch size of 1024, the tiling algorithm takes less than 1% of the runtime. The experimental results on the two GPUs indicate that the time overhead of the tiling algorithm in the batch GEMM execution process is negligible, especially when the batch size is large. In real-world scenarios such as deep learning, where a large number of

5 https://github.com/ROCm/omniperf
6 https://docs.nvidia.com/nsight-compute/NsightCompute/index.html
Fig. 8. The performance improvement of the proposed TLP on MI210. (1.077× average speedup).
GEMM operations are often required, the tiling algorithm will have even less overhead in the execution process.

5.5. The performance benefits of the proposed TLP

This section presents the comparative experimental results on the two GPU platforms to provide a more detailed evaluation of the proposed TLP. The detailed experimental results are shown in Figs. 8-9. From Figs. 8-9, it is clear that the proposed TLP performs better overall than the traditional TLP. The proposed method has speedups of 1.077× and 1.085× on MI210 and A800, respectively. From Fig. 8, the proposed method significantly improves performance when the batch size is larger. For example, on MI210, the proposed method has an average speedup of 1.04× when the batch size <= 16. When the batch size >= 32, the proposed method can improve performance by 1.10×. The performance improvement gap arises because when the batch size and matrix dimensions are small, it is difficult to utilize hardware resources fully. When there are a large number of tiles, the proposed TLP can more accurately evaluate the threads' workload and select the optimal tiling scheme. The same performance trend is also reflected on the A800 platform. On A800, the proposed TLP has performance improvements of 1.04× and 1.11× when the batch size <= 16 and >= 32, respectively. The effectiveness of the proposed TLP is further demonstrated by the comparative experiment results on the two GPU platforms.

5.6. The latency

This section compares kernel latency on the two GPU platforms to provide a more detailed evaluation of the proposed method. We measured kernel latency with different batch sizes in the comparative experiment. The detailed experimental results are shown in Fig. 10. On MI210, the proposed method has latency reductions of 3.87×, 4.53×, and 1.62× compared to rocBLAS, MAGMA, and Wang, respectively. The proposed method has the lowest latency on MI210, indicating higher computational efficiency, and can effectively reduce latency. On A800, the proposed method shows performance improvements of 3.02×, 2.59×, 2.45×, and 1.89× compared to cuBLAS, MAGMA, CUTLASS, and Li, respectively. Fig. 10 shows that as the batch size gradually increases, the kernel latency increases on both GPU platforms. rocBLAS and cuBLAS have the highest latency as the batch size increases. This phenomenon occurs because the traditional loop scheduling method significantly increases latency due to context switching between kernels when the batch size is large. From Fig. 10, it can be seen that some methods exhibit different latency performances at various batch sizes. For example, when the batch size <= 16, MAGMA has the highest latency on the two GPU platforms. When the batch size is large, its computational performance improves, indicating that MAGMA performs better when there are many matrices. The experimental results on the two platforms show that the proposed method has the lowest latency under various batch sizes, indicating better performance and broad applicability.

5.7. The improved performance on inception layers of CNN

Modern CNN model architectures often have multiple branches to capture features at different scales. The convolution operations of different scales in each branch can be represented as batch GEMM operations with various dimensions, e.g. in GoogleNet [13], DenseNet [50], SqueezeNet [12], etc. To demonstrate the effectiveness of the proposed method in real-world scenarios, we use various Inception modules as a typical application to perform the forward computation process on the two GPU platforms. The Inception module involves a large number of irregular, small-size GEMM operations. The deep learning frameworks
Fig. 9. The performance improvement of the proposed TLP on A800. (1.085× average speedup).
Fig. 10. The latency performance of the kernel on two GPU platforms.
MIOpen7 and cuDNN8 are used as benchmark implementations on both GPU platforms. In this section, we select several commonly used Inception modules to evaluate the proposed method's speedup performance. The GEMM sizes in the Inception modules are shown in Table 6. Fig. 11 shows the speedup performance of the proposed method in each Inception module. As shown in Fig. 11, the average speedups are 2.88× and 1.87×, respectively. The gray boxes represent the average speedup ratios of the different Inception modules in Fig. 11. The experimental results suggest that the Inception 8-9 series has the highest average speedup ratio (3.68× and 2.66×, respectively) among the Inception modules, because Inception 8-9 has more matrix shapes compared to the other Inception modules, and the dimensions of these matrices are smaller. Finally, the proposed method has been proven to significantly accelerate CNN models with various branch structures on two different GPU platforms, particularly in scenarios involving multiple branches, irregular shapes, and small dimensions.

6. Conclusion

In this paper, we propose a load-balanced batch GEMM acceleration method for the problem of low parallel computing efficiency and poor hardware resource utilization in batched, irregular, and variable matrix multiplication scenarios. The kernel occupancy and hardware resource utilization can be effectively improved by a multi-thread kernel design that balances the computation and data load across work-items. A novel approach to TLP computation is devised, where the parallelism of

7 https://github.com/ROCm/MIOpen
8 https://github.com/NVIDIA/cudnn-frontend
Fig. 11. The speedup performance on Inception layers.
Table 6
The size of GEMM in various Inception modules.
Inception module    GEMM size (M × N × K)
Inception-1 784 × 96 × 192, 784 × 64 × 192, 784 × 32 × 192, 784 × 16 × 192
Inception-2 784 × 64 × 192, 784 × 32 × 192, 784 × 128 × 192
Inception-3 196 × 192 × 192, 196 × 16 × 192, 196 × 96 × 192, 196 × 64 × 192
Inception-4 196 × 64 × 192, 196 × 24 × 192, 196 × 160 × 192
Inception-5 196 × 64 × 192, 196 × 128 × 192, 196 × 24 × 192
Inception-6 196 × 112 × 192, 196 × 144 × 192, 196 × 32 × 192, 196 × 64 × 192
Inception-7 196 × 256 × 192, 196 × 160 × 192, 196 × 128 × 192
Inception-8 49 × 160 × 192, 49 × 128 × 192, 49 × 256 × 192, 49 × 160 × 192, 49 × 32 × 192
Inception-9 49 × 192 × 192, 49 × 128 × 192, 49 × 384 × 192, 49 × 192 × 192, 49 × 48 × 192
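The GEMM sizes in Table 6 follow the usual im2col mapping of a convolution to a matrix multiplication: M is the number of output spatial positions, N the number of output channels, and K = C_in × k_h × k_w. A sketch of this mapping follows; the feature-map sizes 28×28, 14×14, 7×7 and the 192 input channels are our inference from the M and K columns (assuming 1×1 convolutions), not values stated in the text.

```python
# Map a convolution layer to the (M, N, K) of its im2col GEMM.
def conv_to_gemm(h_out, w_out, c_out, c_in, kh=1, kw=1):
    M = h_out * w_out   # one GEMM row per output pixel
    N = c_out           # one GEMM column per output channel
    K = c_in * kh * kw  # reduction over the receptive field
    return M, N, K

# 1x1 branches of an Inception-style module on a 28x28 map with 192 input channels:
for c_out in (96, 64, 32, 16):
    print(conv_to_gemm(28, 28, c_out, 192))
```

These are exactly the four GEMM sizes listed for Inception-1 in Table 6 (784 × {96, 64, 32, 16} × 192); the 196-row and 49-row modules correspond to 14×14 and 7×7 feature maps, since 14·14 = 196 and 7·7 = 49.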
the tiling scheme is measured by the number of activated wavefronts. This approach allows the optimal tiling scheme to be selected based on different GPU architectures. Experiments are conducted on two GPU platforms to validate the effectiveness and advantages of our proposed method.
Future work includes exploring batch GEMM under various precisions. With the development of Transformer-based models, many GEMM operations are involved in the training and inference of Large Language Models (LLMs), which often use lower precisions such as FP16 and FP8. For example, quantized LLMs often involve GEMM operations where the weight matrices and activation values have different precisions, e.g. W4A16 or W8A8. Such mixed precisions and storage formats pose further challenges to the performance of GEMM operations.
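The tiling-scheme selection described above ranks candidate tiles by how many wavefronts they keep active per compute unit. A minimal sketch of such occupancy-driven selection follows; the hardware limits, register counts, and candidate tiles are illustrative assumptions, not parameters taken from this paper:

```python
# Occupancy-driven tile selection sketch. The hardware limits below
# (LDS bytes, vector registers, wavefront slots per compute unit) are
# illustrative placeholders, not figures from the paper.
LDS_PER_CU = 64 * 1024        # bytes of local data share per CU (assumed)
VGPRS_PER_CU = 256 * 4 * 64   # vector registers per CU (assumed)
MAX_WAVES_PER_CU = 40         # hardware wavefront slots per CU (assumed)
WAVE_SIZE = 64

def active_waves(tile_m, tile_n, tile_k, vgprs_per_thread, threads=256):
    """Wavefronts per CU that a workgroup using this tile can keep active."""
    waves_per_wg = threads // WAVE_SIZE
    # Double-buffered A and B tiles held in LDS, fp32 elements.
    lds_bytes = 2 * (tile_m + tile_n) * tile_k * 4
    wgs_by_lds = LDS_PER_CU // lds_bytes
    wgs_by_regs = VGPRS_PER_CU // (vgprs_per_thread * threads)
    wgs = min(wgs_by_lds, wgs_by_regs)
    return min(wgs * waves_per_wg, MAX_WAVES_PER_CU)

# Pick the candidate (tile_m, tile_n, tile_k, vgprs_per_thread) that
# activates the most wavefronts on this (assumed) architecture.
candidates = [(128, 128, 8, 128), (64, 64, 16, 64), (32, 32, 32, 40)]
best = max(candidates, key=lambda c: active_waves(*c))
```

Because the limits differ per architecture, re-evaluating `active_waves` with each GPU's actual LDS, register, and wavefront budgets can yield a different best tile, which is the behavior the selection scheme relies on.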
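The mixed-precision setting mentioned in the future-work discussion (e.g. W4A16) can be illustrated with a toy dequantize-then-GEMM in NumPy. The symmetric per-column 4-bit scheme below is a simplification: production quantized-LLM kernels typically use group-wise scales, zero-points, and packed int4 storage:

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, N = 4, 8, 6

# FP16 activations and a weight matrix quantized to 4-bit integers with a
# per-column FP16 scale (a simplified W4A16 layout).
A = rng.standard_normal((M, K)).astype(np.float16)
W = rng.standard_normal((K, N)).astype(np.float16)

scale = np.abs(W).max(axis=0) / 7.0                 # symmetric 4-bit range [-7, 7]
W_q = np.clip(np.round(W / scale), -7, 7).astype(np.int8)

# W4A16 GEMM: dequantize the weights on the fly, accumulate in FP32.
C = A.astype(np.float32) @ (W_q.astype(np.float32) * scale.astype(np.float32))

# The quantized product stays close to the full-precision reference;
# the gap per weight is bounded by half a quantization step.
ref = A.astype(np.float32) @ W.astype(np.float32)
err = np.abs(C - ref).max()
```

Storing W as int4 shrinks weight traffic by 4x versus FP16, but the per-element dequantization inside the inner loop is exactly the kind of extra work that complicates high-performance GEMM kernels for these formats.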
CRediT authorship contribution statement

Yu Zhang: Writing review & editing, Writing original draft. Lu Lu: Writing review & editing, Supervision. Zhanyu Yang: Writing review & editing. Zhihong Liang: Supervision, Conceptualization. Siliang Suo: Supervision, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of Guangdong Province (2024A1515010204) and the Technological Research Project of Southern Power Grid Company (ZBKJXM20232483).

Data availability

No data was used for the research described in the article.