Journal of Systems Architecture 160 (2025) 103341

A load-balanced acceleration method for small and irregular batch matrix multiplication on GPU

Yu Zhang a, Lu Lu a,b,∗, Zhanyu Yang a, Zhihong Liang c,d, Siliang Suo c,d

a School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
b Peng Cheng Laboratory, Shenzhen 518055, China
c Electric Power Research Institute, CSG, Guangzhou, China
d Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou, China

Keywords: Batch GEMM; Thread workload; Multi-thread kernel; Tiling algorithm

ABSTRACT

As an essential mathematical operation, GEneral Matrix Multiplication (GEMM) plays a vital role in many applications, such as high-performance computing and machine learning. In practice, the performance of GEMM is limited by the dimensions of the matrices and the diversity of GPU hardware architectures; when dealing with batched, irregular, and small matrices, GEMM usually performs poorly. To this end, a common approach is to segment the matrix into multiple tiles and utilize the parallelism between workgroups in the GPU to compute the results. However, previous works only consider tile size and inter-workgroup parallelism, and ignore the low computational efficiency and hardware resource utilization caused by the difference in workloads between wavefronts. To address these issues, we propose a load-balanced batch GEMM acceleration method consisting of a multi-thread kernel design and an efficient tiling algorithm. The multi-thread kernel design addresses the workload unbalance between wavefronts in different workgroups, and the efficient tiling algorithm chooses the optimal tiling scheme with a new thread-level parallelism calculation method to achieve load-balanced task allocation.
Finally, various comparative experiments were conducted on two GPU platforms: AMD and NVIDIA. Experimental results indicate the proposed method outperforms previous methods.

∗ Corresponding author at: School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China.
E-mail addresses: yuzhang0722@163.com (Y. Zhang), lul@scut.edu.cn (L. Lu), yangzhanyu@hotmail.com (Z. Yang), liangzh@csg.cn (Z. Liang), suosl@csg.cn (S. Suo).
https://doi.org/10.1016/j.sysarc.2025.103341
Received 3 September 2024; Received in revised form 3 November 2024; Accepted 8 January 2025; Available online 23 January 2025
1383-7621/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

1. Introduction

GEneral Matrix Multiplication (GEMM) is a standard computing kernel that plays an important role in high-performance computing [1], artificial intelligence [2], image processing [3], and other research fields. With the explosive growth of data volume and the emergence of various algorithms, the demand for high-performance GEMM computing is increasing [4,5]. Additional stream processors and memory are integrated into GPUs to cater to this trend, providing tremendous computational power for GEMM acceleration. To fully utilize this hardware acceleration capability, AMD and NVIDIA provide developers with platforms for parallel computing on the GPU (ROCm and CUDA). Based on these parallel computing acceleration platforms, various optimization algorithms and acceleration libraries have been proposed and demonstrated to have powerful effects, such as rocBLAS [6], cuBLAS [7], MAGMA [8], etc. These methods achieve optimal computational task allocation through hardware resource scheduling and thread parallelism to accelerate the matrix multiplication operation [9,10].

Many real-world applications, such as deep learning, involve irregular, small-size matrix multiplication operations in their computations [11]. For example, in Convolutional Neural Networks (CNN) [12–14], the structure of these models contains a large number of convolutional layers, and the scale of the convolution kernel tends to be small (e.g., ''1*1'' and ''3*3''). Convolution operations are converted to GEMM using the Im2col function, and the dimension of the matrix is typically less than 1000 [15,16]. These small GEMM computations prevent the GPU from fully exploiting its hardware computing potential. In this case, the scheduling overhead between batch GEMMs and the irregularity of the matrices pose challenges to computational performance [17,18]. For a GEMM, tiling is the standard solution method: the matrix is segmented into multiple tiles, and a thread block is responsible for computing an individual tile. Since each tile is independent, multiple tiles can be computed in parallel by multiple threads in the GPU to speed up the computation process of GEMM. A larger tile dimension increases the Thread-Level Parallelism (TLP) of a single tile but also reduces the number of tiles, which can result in a failure to fully utilize the hardware resources of the GPU [19,20]. The Instruction-Level Parallelism (ILP) of a single thread is related to the K-dimension. Generally, a large enough matrix can fully use GPU hardware resources and achieve higher TLP and ILP [21,22].

To improve computational efficiency, previous studies have proposed several acceleration methods for matrix multiplication. For instance, rocBLAS [6] and cuBLAS [7] provide batch GEMM APIs (rocblas_sgemm_batched and cublasSgemmBatched), which support multiple GEMMs being calculated simultaneously on GPUs. However, these APIs support only uniform matrix sizes, which considerably limits their applications. NVIDIA also provides a C++-style template library, CUTLASS [23], which utilizes built-in tile templates and sorting to accelerate matrix multiplication operations. In fact, the size of matrices is variable in many real-world applications [11]. To address this issue, a Vbatch GEMM routine that supports batch GEMM in various sizes is designed and implemented by MAGMA (magmablas_sgemm_vbatched). It adapts to batch GEMMs with multiple tiling strategies, assigning an appropriate tile to each GEMM for large performance gains. Although variable sizes are supported in MAGMA, it still has some limitations. First, MAGMA only supports some coarse-grained tiling strategies that are not appropriate for all GEMMs; coarse-grained tiling results in an unbalanced kernel workload and reduced GPU utilization. Second, the grid size is determined by the tiling of the largest matrix, which leads to idle threads and a waste of GPU computing power. Third, the lack of an evaluation criterion for tiling leads to lower efficiency of strategy choice.

To thoroughly support batch GEMM with variable sizes, it is essential to design a tiling algorithm that can be adapted to all GEMMs and that adaptively chooses tile sizes, rather than being limited to a single size. The optimal tiling for each GEMM is different, depending on the matrix dimensions (M, N, K); how to choose a suitable tile is a challenge for batch GEMM. At the same time, an evaluation criterion based on the current GPU hardware and tiling strategy is also essential. With such a criterion, an appropriate tiling for each GEMM can be chosen to fully utilize the GPU computing capabilities and achieve better computational performance. How to measure the effectiveness of a tiling algorithm on the GPU hardware is a challenging problem. Tiles of various sizes can lead to significant differences in computational effort within each workgroup, and further to an unbalanced distribution of computational tasks and excessive load differences between threads. Hence, for tiles of various sizes, balancing thread computation and data loading during computation is also a challenge for batch GEMM.

To address the above challenges, we propose a batch GEMM acceleration method with a multi-thread kernel design. Furthermore, an efficient tiling algorithm is proposed to achieve load balance and higher hardware resource utilization. Our contributions can be summarized as follows:

• A multi-threaded kernel design scheme is proposed to balance thread computation and data loading in different workgroups to compute the various tiles.
• A novel TLP computation method is designed to select the optimal tiling algorithm by combining the kernel occupancy of the GPU and the tiling operation.
• An efficient tiling algorithm is implemented by considering the GPU hardware architecture and the batch GEMM workload.
• The proposed method can efficiently handle batches of irregular GEMMs and achieves state-of-the-art performance on AMD and NVIDIA GPU platforms.

The rest of the paper is organized as follows. Section 2 provides related work and motivation. Section 3 introduces background on batch GEMM, GPU architecture, and kernel occupancy. Section 4 presents the details of the multi-thread kernel design and load-balanced tiling algorithm. Section 5 demonstrates and evaluates the experimental results. Section 6 provides the conclusions of the paper and future work.

2. Related work and motivation

2.1. Related work

Several approaches have been proposed for batch GEMM computation, which mainly focus on algorithm-level or architecture-level optimization. The former mainly explores lower bounds on the time complexity of GEMM operations at the mathematical level and optimizes the computational effort. The latter is based on the features of different GPU architectures and uses corresponding optimization techniques to improve the computational efficiency of GEMM. In algorithm-level optimization, Strassen [24] proposed a novel GEMM algorithm based on the property that matrix addition is faster than matrix multiplication, using seven multiplications and multiple addition operations instead of eight multiplications. This approach mathematically reduced the time complexity of GEMM to O(n^2.81) for the first time. To reduce the extra memory space required by Strassen's algorithm, three different methods were proposed in [25]: pre-additions, overwriting the input matrix, and recursive scheduling. At the same time, due to the powerful effect of deep neural networks in various domains, Fawzi et al. [26] transformed the search for the optimal complexity of matrix multiplication into a tensor decomposition problem and used reinforcement learning to explore lower bounds on the complexity of matrix multiplication; in particular, for a 4 × 4 matrix, the number of multiplications was as low as 47, better than two-level Strassen, which involves 49 multiplications. Although the above approaches reduce the mathematical complexity of matrix multiplication, it is difficult to realize their performance benefits on GPUs because they neglect computational scheduling strategies and the multi-level memory architecture.

In architecture-level optimization, GPU vendors (NVIDIA and AMD) have designed and implemented computing libraries such as cuBLAS [7] and rocBLAS [6] based on their parallel computing platforms to improve GPU hardware utilization and parallelism. However, due to the restriction to uniform-sized matrices, their performance is poor when faced with small and irregular batch GEMMs. Although NVIDIA provides a C++-style template library, the small size of the matrices and the lack of assembly-level optimizations make it difficult for CUTLASS to fully exploit its performance advantages for irregular and small matrix multiplication [23]. These irregular and small-sized matrices often lead to unbalanced workloads among threads in different workgroups, which can reduce kernel performance. For Sparse GEneral Matrix–Matrix multiplication (SpGEMM), the matrix's sparsity leads to significant differences in thread workloads [27,28]. To address the unbalanced workload, Chen et al. [29] optimized matrix segmentation by analyzing the distribution of floating-point calculations in CSR-based SpGEMM, achieving load balance and performance improvement on Sunway TaihuLight. For the issue of workload unbalance in threads, it is necessary to conduct a detailed analysis of the computation process and hardware platform characteristics to design an efficient parallel framework implementation [30,31]. Xiao et al. [32] introduced a fine-grained partitioning strategy to select appropriate segmentation dimensions, efficiently utilizing multi-thread parallelism and improving the performance of binary sparse tensor contractions. The diversity of matrix sizes makes it difficult to use a unified routine for calculation, resulting in some threads being idle in the CU [33,34]. Indeed, the size of matrices is variable and irregular in various scientific computing scenarios. To overcome the restriction to uniform matrix sizes, MAGMA [8] proposes a Vbatch routine to support batch GEMM with various sizes. It uses a 3D grid for the batch GEMM kernel design, where grid.z represents the batch size.
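To make this grid construction concrete, the sketch below (our own Python illustration; MAGMA itself implements this in C) derives a Vbatch-style 3D grid from the largest GEMM in the batch and counts the workgroups that end up idle. The tile sizes and helper names are illustrative assumptions, not MAGMA's API.

```python
import math

def vbatch_grid(shapes, tile_m=32, tile_n=32):
    """Size a Vbatch-style 3D grid for a variable-size batch.

    shapes: list of (M, N, K) tuples, one per GEMM in the batch.
    The 2D plane (grid.x, grid.y) is sized by the LARGEST GEMM,
    so smaller GEMMs leave some of their workgroups idle.
    """
    max_m = max(m for m, _, _ in shapes)
    max_n = max(n for _, n, _ in shapes)
    grid_x = math.ceil(max_m / tile_m)
    grid_y = math.ceil(max_n / tile_n)
    grid_z = len(shapes)  # one 2D plane per GEMM
    return grid_x, grid_y, grid_z

def idle_workgroups(shapes, tile_m=32, tile_n=32):
    """Workgroups that do no useful work because their GEMM is
    smaller than the largest one in the batch."""
    gx, gy, gz = vbatch_grid(shapes, tile_m, tile_n)
    total = gx * gy * gz
    useful = sum(math.ceil(m / tile_m) * math.ceil(n / tile_n)
                 for m, n, _ in shapes)
    return total - useful

# An irregular batch: one 512x512 GEMM dominates three 64x64 GEMMs.
batch = [(512, 512, 64), (64, 64, 64), (64, 64, 64), (64, 64, 64)]
print(vbatch_grid(batch))      # -> (16, 16, 4)
print(idle_workgroups(batch))  # -> 756
```

For an irregular batch dominated by one large GEMM, most workgroups in the smaller planes are idle, which is exactly the waste of GPU computing resources discussed in the text.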
Each GEMM corresponds to one of the 2D-grid planes, and The source code of this paper can be obtained in this repository link: the size of the two-dimensional plane (grid.x, grid.y) is determined by https://github.com/zhangyu0722/BatchGEMM.git. the largest GEMM. In the case of irregular GEMM, if the dimension 2 Y. Zhang et al. Journal of Systems Architecture 160 (2025) 103341 Fig. 1. GEMM and batch GEMM schematic diagram. difference between the largest GEMM and the rest is too large, a large 3. Background number of threads and workgroups will be idle, resulting in a waste of GPU computing resources. For various parallel acceleration platforms, 3.1. GEMM and batch GEMM different hardware characteristics, such as register size and number of CUs, will affect the allocation of computing resources in the kernel. To For a single GEMM, its accumulation routine is 𝐶 = 𝛼 𝐴𝐵+𝛽 𝐶, where ensure kernel performance, it is necessary to flexibly set parameters 𝐴 ∈ 𝑅𝑀×𝐾 , 𝐵 ∈ 𝑅𝐾×𝑁 and 𝐶 ∈ 𝑅𝑀×𝑁 are dense matrices, 𝑀, 𝑁, and based on different matrix sizes and hardware architectures [9,35]. 𝐾 represent matrix dimensions, and 𝛼 and 𝛽 are constant scalars. A To solve this problem, a coordinated tiling and batching strategy is common approach is tiling matrix C into multiple tiles [21,36], which proposed in [21], where a different tiling strategy is used for each utilizes the parallel computing of thread in GPU to calculate each tile GEMM in batch GEMM and appropriate batching is used according and splices together the result. As shown in Fig. 1 (b), given a GEMM to the tile size to improve the computational efficiency of the GPU. with size 𝑀 × 𝑁 × 𝐾, the matrix C is segmented into multiple tiles with Wang et al. [36] proposed the sort-up algorithm based on the GEMM 𝑇𝑚 × 𝑇𝑛 . 
Each workgroup is responsible for the calculation of a tile and workload and split-down in the tiling process, which can segment large needs to access the row section of matrix A with size 𝑇𝑚 ×𝐾 and column tiles into multiple smaller tiles. This approach can make better use of section of matrix B with size 𝐾 × 𝑇𝑛 . However, the row cross-section CU utilization when the number of GEMM is limited. of A and the column cross-section of B (represented in Fig. 1 (b) by 2.2. Motivation the gray parts of matrices A and B, respectively) are too large to store in shared memory and registers. Hence, the row section of A and the Although the above-mentioned methods improve the parallel com- column section of B are segments of multiple A tiles with 𝑇𝑚 × 𝑇𝑘 and B puting efficiency of batch GEMM on GPU from various perspectives, tiles with 𝑇𝑘 × 𝑇𝑛 , respectively. The partial result of C can be obtained there are two problems. One is that the workload of threads varies by calculating with A tile and B tile, and accumulative partial results significantly across the kernel. In the above approach, tiles with various can obtain the final result. sizes are designed, and each tile is responsible for the corresponding To batch-run multiple GEMMs, a naive routine is computed for kernel, where the number of threads is fixed. In general, larger tiles each GEMM individually. However, when the matrix size is small, have better TLP. This will also increase the workload of each thread a single GEMM does not fully utilize the GPU’s computing power, for large-size tiles, and the thread responsible for computing large tiles leaving the CU idle [37,38]. To avoid this situation, a batch GEMM requires more hardware resources (VGPR, SGPR, LDS) and computing method is proposed to design multiple kernels for various GEMM in time. The other one is that differences between wavefronts within dif- the GPUs [36,39]. 
Compared to GEMM, batch GEMM is expressed in ferent workgroups are ignored in the TLP calculations. The workgroup (𝑀 × 𝑁 × 𝐾 × 𝐵𝑠𝑖𝑧𝑒 ), where 𝑀, 𝑁 and 𝐾 represent the dimensions of will be transformed into multiple wavefronts during GPU computation the matrix, and 𝐵𝑠𝑖𝑧𝑒 represents the batch size. A batch GEMM is 3D- and be executed in parallel on the CU. Each CU can run multiple dimension grid, where grid.z is batch sizes, and grid.x and grid.y are the wavefronts simultaneously, and the number of wavefronts depends on lengths and widths of a two-dimensional plane respectively [40]. To the hardware resources required by the wavefront. Thus, the TLP on the balance the workload of a batch GEMM, a variety of tile sizes are used GPU should be determined by the number of threads in the wavefront for GEMM tiling. The two-dimensional grid size has the corresponding that can be executed in parallel on the CU. matrix C and tiling strategy. Each tile is responsible for the correspond- To solve the above problems, we propose an efficient and load- balanced batch GEMM acceleration method, which consists of two ing workgroup. A workgroup is decomposed into multiple wavefronts parts: a multi-thread kernel design scheme and an efficient tiling algo- that execute on the CU. The 3D grid of batch GEMM is shown in Fig. 1 rithm. A multi-thread kernel design is proposed to balance the amount (a). of loading and computation in the thread corresponding to each tile. Tiles with various sizes correspond to the number of threads selected. 3.2. GPU architecture and kernel occupancy Although this is limited by the parallel programming interfaces of the CUDA and ROCm platforms, the number of threads responsible for With the improvement of hardware architecture and parallel com- computing a tile is uniform. 
To overcome this shortcoming, we use the puting programming platforms (such as ROCm1 and CUDA2 ), GPUs corresponding filtering operation in the kernel execution process to ef- are becoming the most popular hardware accelerator. The two most fectively alleviate this problem. An efficient tiling algorithm can choose commonly used GPUs are AMD and NVIDIA, widely used in various the optimal scheme based on different GEMMs and GPUs. To measure scientific computing platforms. However, some basic concepts of ex- the effect of tiling, we propose a new way of TLP computation based pression in ROCm and CUDA are different. We chose AMD’s official on wavefronts. The optimal tiling scheme is obtained by adjusting the tiling strategy according to the TLP. Finally, we obtain an efficient tiling algorithm based on the new TLP calculation method. In Section 4, the 1 https://rocm.docs.amd.com/en/latest/ 2 details of the proposed method are introduced. https://docs.nvidia.com/cuda/ 3 Y. Zhang et al. Journal of Systems Architecture 160 (2025) 103341 Table 1 ROCm/CUDA terminology. ROCm CUDA Description Compute Unit (CU) Streaming One of many parallel vector processors in a GPU that contains Multiprocessor (SM) parallel ALUs. All waves in a workgroup are assigned to the same CU. Kernel Kernel Functions launched to the GPU that are executed by multiple parallel workers on the GPU. Kernels can work in parallel with CPU. Wavefront Warp Collection of operations that execute in lockstep, run the same instructions, and follow the same control-flow path. Individual lanes can be masked off. Workgroup Thread block Think of this as a vector thread. A 64-wide wavefront is a 64-wide vector op. Work-item/Thread Thread GPU programming models can treat this as a separate thread of execution, though this does not necessarily get forward sub-wavefront progress. Global Memory Global Memory DRAM memory accessible by the GPU that goes through some layers cache. 
Local Memory Shared Memory Scratchpad that allows communication between wavefront in a workgroup. Private Memory Local Memory Per-thread private memory often mapped to registers. terminology for this paper to provide precise specifications. To clarify resources. In order to fully utilize the hardware resources of the GPU some differences and relationships between ROCm and CUDA terms, a and improve the efficiency of parallel computing, the kernel occupancy comparison of terminology is given in Table 1. should be improved as much as possible without data overflow [46,47]. A GPU is composed of multiple Shader Engines (SE) and a com- In batch GEMM, an efficient kernel design should properly allocate mand processor. Each SE has its own workload manager. One SE is the data loading and computation workload for each work-item in the integrated with multiple CU and workload manager. Each CU contains wavefront, so that the memory space and computing power on the CU an enormous amount of Arithmetic and Logic Units (ALUs), a small can be more efficiently utilized [48,49]. number of control units, and caches. Hence, GPUs are suitable for a large number of simple parallel computing tasks. A GPU kernel consists 4. Overview of one or multiple workgroups, the size of which is determined by the number of wavefronts and threads. On the memory hierarchy, the GPU 4.1. Multi-thread kernel design has global memory, local memory, and private memory from slow to fast according to memory access speed, and local memory and private Tile size and kernel design are closely related in the design of batch memory are much smaller than global memory [41,42]. GEMM algorithms, and there are two matrix tile design routes. The Kernel Occupancy represents the actual utilization of computing first way is to design a tile to adapt to all GEMMs, and the second unit resources by a kernel function on GPU, which is the ratio of is to design the various tiles to adapt to different GEMMs. 
Compared actived wavefront to the maximum wavefront supported by CU [35,43]. with the first method, for irregular GEMM, the latter method is more An active wavefront running on CU requires resources such as Vec- flexible and efficient to utilize the computing resources of GPU. For tor General-Purpose Register (VGPR), Scalar General-Purpose Registers GEMMs with various shapes and sizes, using a single tile can easily (SGPR), Local Data Share (LDS), etc. A wavefront can be activated lead to increased workload differences between threads in multiple and run on a CU when all required resources are available. When the workgroups, affecting the allocation of computing resources. In this utilization of CU resources is low, the number of active wavefronts paper, we perform a multi-thread kernel design for the second matrix is small, which leads to the waste of hardware resources and the segmentation method. Two different tile design strategies are shown degradation of the parallel performance of the kernel. On the other in Fig. 2. Here we present the effect of two different tile strategies on hand, when the number of active wavefronts in the CU increases, the the occupancy of the 3D grid. For the batch GEMM, different tile sizes resources used by each wavefront and the available register storage lead to different numbers of workgroups, resulting in different 3D grid space of each work-item in the wavefront decrease [44,45]. occupancies. The number of active wavefronts on a CU is mainly limited by the For a single GEMM, matrix C is tiled into multiple tiles. The tile following factors: the number of work-items in each workgroup and size can be flexibly designed, and each tile can be run in parallel the sizes of VGPR, SGPR, and LDS. For example, in AMD’s MI1003 without data interference. Each tile is calculated by the corresponding and MI210,4 a wavefront consists of 64 work-terms. When the number workgroup and can be represented by a 2D-grid as a whole. 
When the of work-items in a workgroup is less than or equal to 64, only one size and number of tiles is large enough, efficient parallel execution wavefront is included. The VGPR, SGPR, and LDS sizes on the CU have a efficiency can usually be obtained. However, in real-world cases, the corresponding upper bound for each work-item. According to the kernel size of matrices in batch GEMM tends to be small and irregular, design, the resources on the CU need to be allocated before executing which leads to poor performance of traditional methods. Therefore, the each work-item. When resource requirements of the work-item are previous method adopts a variety of tiles to adapt to the corresponding satisfied, the wavefront can be active and run on the CU. Otherwise, GEMM, and each tile is based on a unified number of threads, which it will not run until other wavefronts accomplish tasks and release will lead to the workload of threads in large-scale tiles being much larger than that of small tiles. This gap in the workload of threads 3 https://www.amd.com/system/files/documents/instinct-mi100- results in unbalanced thread loading and reduces GPU parallel com- brochure.pdf puting efficiency. Table 2 lists the detailed parameters for tiles with 4 https://www.amd.com/content/dam/amd/en/documents/instinct- various sizes based on the same work-item design (The number of work- business-docs/white-papers/amd-cdna2-white-paper.pdf items in the kernel is 128). 𝑊𝐶 𝑃 and 𝑊𝐷𝐿 represent the computation 4 Y. Zhang et al. Journal of Systems Architecture 160 (2025) 103341 Fig. 2. Two different tile design strategies for batch GEMMs. ((a) All GEMMs adopt the same tiling scheme, which is divided into multiple tiles of the same size. (b) Different GEMMs adopt different tiling schemes and are divided into multiple tiles of different sizes.). 
Table 2 speed of global memory is considerably lower than that of registers, The common kernel design scheme for batch GEMM (There are significant workload threads’ data access efficiency decreases, and overall time consumption gaps between threads). increases. At the same time, since the variety of thread workloads, Tile 𝑇𝑚 𝑇𝑛 𝑇𝑘 𝑊𝐶 𝑃 𝑊𝐷𝐿 when a thread with a heavy workload is run on the CU, the number small 16 16 8/16 2 4/6 of active wavefronts on the CU is less, resulting in the CU’s kernel medium 32 32 8/16 8 12/16 large 64 64 8/16 32 40/48 occupancy (The ratio between the number of active wavefronts and the maximum number of supported wavefronts) will be reduced. The state of the CU with low kernel occupancy will be longer due to the longer work-item computation time. amount and data loading amount of work-item, respectively, and their To solve this problem, we propose a multi-thread kernel design, calculation expressions are considered as: which ensures that the workload of each thread is balanced as much as 𝑇 × 𝑇𝑛 possible. The experimental results in Fig. 3 show that multiple kernels’ 𝑊𝐶 𝑃 = 𝑚 (1) 𝑊𝑛𝑢𝑚 performance varies when calculating the same tile. For example, the 𝑇𝑚 × 𝑇𝑛 + 𝑇𝑚 × 𝑇𝑘 + 𝑇𝑘 × 𝑇𝑛 128-thread kernel performs best when calculating a tile with ‘‘32*32’’, 𝑊𝐷 𝐿 = (2) as shown in Fig. 3. The performance gap mentioned above is mainly 𝑊𝑛𝑢𝑚 because of the varying workloads of threads under different kernels, where 𝑊𝑛𝑢𝑚 represents the number of work-items responsible for com- which affects the overall performance. For the 128-thread kernel, when puting the tile. calculating a tile with ‘‘32*32’’, each thread needs to complete the For different tiles, there is a significant gap in workload between calculation of 8 elements and the loading of 16 elements. When cal- threads (𝑊𝐶 𝑃 ∈ [2, 32] and 𝑊𝐷𝐿 ∈ [4, 48]). The choice of 𝑇𝑘 also has a culating a tile with ‘‘64*64’’, the workload of the threads is heavy, and certain impact on the data load of work-item. 
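The per-work-item workload expressions are easy to evaluate numerically. The helper below (our own illustrative code, not from the paper's repository) reproduces the W_CP/W_DL entries of the kernel-design tables and shows how doubling the thread count halves both quantities for the large tile.

```python
def workload(tm, tn, tk, wnum):
    """Per-work-item workload for a Tm x Tn tile with K-step Tk,
    computed by wnum work-items: Eq. (1) gives the number of C
    elements each work-item accumulates, Eq. (2) the elements it
    must move for the C tile plus one A tile and one B tile."""
    w_cp = tm * tn / wnum                        # Eq. (1)
    w_dl = (tm * tn + tm * tk + tk * tn) / wnum  # Eq. (2)
    return w_cp, w_dl

# 128-thread kernel on the "large" 64x64 tile with Tk = 16:
print(workload(64, 64, 16, 128))   # -> (32.0, 48.0), heavily loaded
# The same tile with a 256-thread kernel: both quantities halve.
print(workload(64, 64, 16, 256))   # -> (16.0, 24.0)
```

Running the helper over all tile configurations makes the workload gap explicit: with a fixed 128-thread kernel, W_CP spans 2 to 32, whereas varying the thread count with the tile size keeps the per-thread workload in a much narrower band.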
Each thread is responsible for more data loads when T_k is larger. For example, in a large tile, when the value of T_k is set to 8 or 16, each work-item is responsible for loading 40 and 48 elements, respectively. The workload differences caused by these different tile sizes impact kernel performance.

To explore the impact of the number of work-items per workgroup and of the tile size on the performance of batch GEMM, several experiments were performed; the results are given in Fig. 3. As shown in Fig. 3, under the condition that the number of GEMMs is large and M, N, and K are large enough, various thread kernels (with 64, 128, 256, and 512 threads) are used to compute multiple tiles (the nine tiles are shown in Fig. 3). The four thread kernels commonly used in previous work are selected as benchmarks [21,34,36], and we investigate their performance under the various tiles in comparative experiments. Fig. 3 shows that a kernel's performance first increases and then decreases across the different tiles. When the tile size is small, the thread's workload is also tiny: threads in the kernel compute only a few elements, so their computing power is not fully utilized. As the tile size increases, the number of elements that each thread must calculate and store also increases, and as long as the register data does not overflow, the computing efficiency of the thread keeps improving. When the tile assigned to a thread is too large, the register data overflows and is spilled to global memory. For example, for a 64-thread kernel computing "8*8" and "32*32" tiles, each thread needs to compute 1 and 32 elements of matrix C, respectively; obviously, "32*32" requires more register memory. However, the register memory of each thread is precious. When its maximum capacity is exceeded, data is transferred to global memory for storage, and because access to global memory is far slower than access to registers, performance suffers. In the latter case, each thread needs to complete the calculation of 32 elements and the loading of 64 elements; when calculating larger tiles, the workload of the thread increases significantly.

To avoid significant differences in workload between threads, we use a multi-thread kernel that computes the various tiles by considering the computation amount (W_CP) and data loading amount (W_DL) of the threads in the kernel. For larger tiles such as "32*64" and "64*64", a 256-thread kernel is used for computation. In this way, increasing the number of threads reduces each thread's computation amount and data loading amount, thereby narrowing the gaps between threads' workloads and achieving load balancing. There are five tiles and two kernels (W_num) for small and irregular batch matrix multiplication, as shown in Table 3. Compared with Table 2, we balance the thread workload by setting the tile size and the number of kernel threads so that thread computation and data loading are as consistent as possible across different workgroups.

In the calculation process of GEMM, five tile types are designed for GEMM calculations of different sizes, from "small" to "large". To ensure that the amounts of computation and data loading of the work-items responsible for different tiles are as equal as possible, the number of threads varies with the tile size. In Table 3, two thread numbers are used (128 and 256), and the computation amount (W_CP) and data loading amount (W_DL) of a work-item in each scheme are given. Although the current ROCm and CUDA programming interfaces only support kernels with a uniform thread number, we use a screening operation in the early stage of kernel execution to achieve the effect of a multi-thread kernel design. For example, in this paper the number of kernel threads is set to 256. When the "small", "small-medium", and "medium" tiles are executed, the extra threads are terminated immediately and the corresponding computing resources are released, because these tiles only need 128 threads. Terminating threads early allows a better allocation of computational resources to the threads responsible for computing other tiles. With this implementation, we achieve the effect of a multi-thread kernel. Even though performance may degrade compared with an actual multi-thread kernel, the experimental results in Section 5 demonstrate the excellent performance of this method.

Fig. 3. Experimental results of multi-thread kernel.

Table 3
The multi-thread kernel design scheme with a more balanced workload.

Tile           T_m   T_n   T_k   W_num   W_CP   W_DL
small           16    16    16     128      2      6
small-medium    16    32    16     128      4     10
medium          32    32    16     128      8     16
medium-large    32    64    16     256      8     14
large           64    64    16     256     16     24
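The W_CP and W_DL columns of Table 3 can be reproduced from the tile parameters alone. The sketch below is our illustration, not code from the paper; it assumes W_CP = (T_m × T_n)/W_num and that W_DL counts one A sub-tile, one B sub-tile, and the C tile per work-item, i.e. (T_m × T_k + T_k × T_n + T_m × T_n)/W_num — an interpretation that matches all five rows of Table 3:

```python
# Per-work-item workload for a tile, consistent with the W_CP and W_DL
# columns of Table 3. W_CP counts the C elements each thread computes;
# W_DL counts the A, B, and C elements each thread moves, divided by
# the kernel thread count W_num.
def workload(tm, tn, tk, w_num):
    w_cp = (tm * tn) // w_num
    w_dl = (tm * tk + tk * tn + tm * tn) // w_num
    return w_cp, w_dl

# The five tile types of Table 3: name -> (T_m, T_n, T_k, W_num)
tiles = {
    "small":        (16, 16, 16, 128),
    "small-medium": (16, 32, 16, 128),
    "medium":       (32, 32, 16, 128),
    "medium-large": (32, 64, 16, 256),
    "large":        (64, 64, 16, 256),
}

for name, (tm, tn, tk, w_num) in tiles.items():
    w_cp, w_dl = workload(tm, tn, tk, w_num)
    print(f"{name:12s}  W_CP={w_cp:2d}  W_DL={w_dl:2d}")
```

Under this reading, doubling the thread count for the two largest tiles is exactly what keeps W_CP and W_DL in the same range as the smaller tiles.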
4.2. Tiling algorithm

4.2.1. Criteria for evaluation

Tiling can be seen as a re-assignment of the GEMM computation task: an efficient tiling algorithm transforms GEMM operations and improves hardware resource utilization. Once the various kernel designs are implemented, choosing an appropriate tiling scheme becomes a crucial issue. In general, for a GEMM, there is better parallelism within the workgroup when the tile size is larger. However, a larger tile also means fewer tiles; if there are too few tiles, the CUs cannot be fully utilized, and computing resources are wasted. Therefore, choosing suitable tiling evaluation criteria is crucial. In previous studies, TLP (Thread-Level Parallelism) was used to quantify the parallelism of tiling strategies on GPUs. Given a GEMM and a tiling strategy, its TLP can be calculated as follows:

    TLP = Σ_i (M_i × N_i)/(T_{m_i} × T_{n_i}) × T_workgroup    (3)

where M_i and N_i are the dimensions of matrix C of the ith GEMM, T_{m_i} and T_{n_i} are the tile sizes chosen for matrix C, and T_workgroup is the number of threads in a workgroup. However, this formulation considers TLP only at the workgroup level. In fact, during the computation of a GEMM, each workgroup is further transformed into wavefronts that run on the CUs. The execution process of batch GEMM can be divided into four phases: segmentation, workgroup, wavefront, and execution. In the segmentation phase, the GEMMs are tiled into tiles of various sizes, and each tile is computed by a workgroup. Workgroups are then transformed into wavefronts based on their hardware resource requirements and their number of work-items. Finally, these wavefronts run in parallel on multiple CUs to perform the batch GEMM calculations. Because tile sizes differ, the computation and data loading amounts of threads are not uniform across wavefronts, which leads to unbalanced hardware resource requirements and different wavefront execution times on the CUs. The overall time of the batch GEMM is the maximum over all CU execution times; if the workload difference between wavefronts is too significant, the execution time of one wavefront becomes excessive and increases the overall calculation time. Eq. (3), however, does not consider the workload gaps between wavefronts. To solve this problem, we propose a new TLP calculation method:

    TLP_new = φ(Σ_i (M_i × N_i)/(T_{m_i} × T_{n_i})) × T_wavefront    (4)

where M_i, N_i, T_{m_i}, and T_{n_i} have the same meaning as in Eq. (3), T_wavefront is the number of work-items in a wavefront, and φ represents the conversion from workgroups to wavefronts.

The conversion mainly considers the following factors: the number of work-items in the workgroup; the VGPR, SGPR, and LDS required by a work-item; and the maximum number of wavefronts supported by the CU. These factors are tied to the GPU hardware architecture. Take AMD's MI210, based on the CDNA 2.0 architecture, as an example. Under the limit on the number of work-items in a workgroup, the number of wavefronts can be calculated as follows:

    WF_wg = 16 × ceil(WI_wg / 64)    (5)

where WF_wg is the maximum number of wavefronts under the work-item limit and WI_wg is the number of work-items in the workgroup. Eq. (5) indicates that when the number of work-items is at most 64, a workgroup contains only one wavefront, and the number of workgroups in a CU is limited to 16.

Limited by the sizes of the VGPR, SGPR, and LDS, the number of wavefronts can be calculated as follows:

    WF_V = 4 × floor(VGPR_max / (VGPR_used × 64))    (6)

where WF_V is the maximum number of wavefronts under the VGPR limit, VGPR_max is the VGPR size of a Single Instruction Multiple Data (SIMD) unit, and VGPR_used is the VGPR used by a work-item. In the CDNA 2.0 architecture, each CU consists of four SIMDs.

    WF_S = floor(SGPR_max / SGPR_used)    (7)

where WF_S is the maximum number of wavefronts under the SGPR limit, SGPR_max is the SGPR size of the CU, and SGPR_used is the SGPR used by a wavefront.

    WF_L = floor(LDS_max / LDS_used) × ceil(WI_wg / 64)    (8)

where WF_L is the maximum number of wavefronts under the LDS limit, LDS_max is the LDS available to a workgroup, LDS_used is the LDS used by a workgroup, and WI_wg has the same meaning as in Eq. (5).

To sum up, the number of wavefronts must meet all of the above limits:

    WF = min(WF_wg, WF_V, WF_S, WF_L, WF_C)    (9)

where WF is the number of activated wavefronts and WF_C is the maximum number of wavefronts supported by the CU.

The number of wavefronts and the corresponding number of threads are introduced into Eq. (4) to compute the TLP more accurately. Compared with Eq. (4), Eq. (3) considers the workload only at the workgroup level and neglects the runtime conversion from workgroups to wavefronts. Eq. (3) is valid only if two conditions are satisfied: first, all thread computation and data loading amounts are consistent; second, the hardware resources required by the activated wavefronts do not exceed the CU limits. Note that for GEMM at different precisions, threads have different requirements for computing resources (VGPR, SGPR, LDS) during computation. Therefore, for matrices of different precisions, the values of VGPR_used, SGPR_used, and LDS_used in Eqs. (6)–(8) differ, which affects the number of activated wavefronts.

4.2.2. Tiling fine-tuning

For batch GEMM, an initial tiling scheme is first assigned to address the context switching and low hardware resource utilization caused by the variable scale of the matrices. The tiling scheme is then adjusted according to the TLP estimate of the batch GEMM and the GPU hardware architecture, and finally the best tiling scheme is obtained. In the first stage, the tile size chosen for each GEMM according to the matrix dimensions should meet the following conditions:

    T_{m_i} ≤ M_i and M_i mod T_{m_i} = 0
    T_{n_i} ≤ N_i and N_i mod T_{n_i} = 0    (10)
    T_{k_i} ≤ K_i and K_i mod T_{k_i} = 0

where T_{m_i} and T_{n_i} are the tile dimensions of the tiling scheme and T_{k_i} is the sub-tile size along the K dimension. Two issues remain. (1) After the first phase, the batch GEMM has only an "initial scheme", which cannot achieve optimal parallel computing efficiency. (2) Due to the variability of matrix sizes in batch GEMM, one or several of B_size, M, N, and K may be particularly small, which we call an extreme GEMM case. In this case, the "initial scheme" cannot produce enough tiles, leaving some CUs idle and wasting GPU computing power.

To solve these problems, the "initial scheme" is adjusted reasonably and efficiently in the second stage. For larger matrices, smaller tiles are used for segmentation; reducing the tile size increases the number of tiles and keeps the CUs from idling. The details are as follows: for a GEMM with an appropriate "initial scheme", to avoid wasting GPU hardware resources, some larger GEMMs are cut with smaller tiles to ensure that the number of tiles is sufficient. For example, tiles whose initial size is "64*64" are re-segmented with "32*32" tiles, so the number of tiles increases as the tile size decreases. This fine-tuning approach ensures that the CUs are not idle, increasing hardware resource utilization at the expense of intra-tile parallelism.

Algorithm 1 The tiling algorithm.
 1: Initialize TLP_threshold, TLP = 0, total_workgroup = 0, total_wavefront = 0;
 2: for i = 0 to B_size − 1 do
 3:   Calculate T_{m_i}, T_{n_i} according to Eq. (10);
 4:   total_workgroup += (M_i * N_i) / (T_{m_i} * T_{n_i});
 5: end for
 6: TLP_new = φ(total_workgroup) × T_wavefront;
 7: Tile[] classes range from "large" down to "small";
 8: while TLP_new < TLP_threshold do
 9:   for j = 0 to B_size − 1 do
10:     if Tile[j] is "large" then
11:       Set Tile[j] to "medium-large";
12:     else if Tile[j] is "medium-large" then
13:       Set Tile[j] to "medium";
14:     else if Tile[j] is "medium" then
15:       Set Tile[j] to "small-medium";
16:     else if Tile[j] is "small-medium" then
17:       Set Tile[j] to "small";
18:     end if
19:     total_workgroup += (M_j * N_j) / (T_{m_j} * T_{n_j});
20:   end for
21:   TLP_new = φ(total_workgroup) × T_wavefront;
22: end while

TLP_threshold is used as a threshold to ensure parallelism among multiple tiles in the fine-tuning phase. Note that TLP_threshold has an important influence on the selection of the tiling scheme for different hardware architectures. As a measure, the TLP values of the batch GEMM vary with the tiling scheme. The setting of TLP_threshold is related to the GPU architecture because it uses the number of wavefronts and the number of threads per wavefront to measure the parallelism of a tiling scheme. The hardware resources and the maximum number of wavefronts supported by each CU are diverse, so a corresponding TLP_threshold should be set for each GPU architecture.

The specific process of selecting a tiling scheme for batch GEMM is given in Algorithm 1: (1) Given a batch GEMM, an "initial scheme" is obtained according to Eq. (10). (2) The TLP of this scheme is calculated from the given batch GEMM and tiling scheme. (3) The TLP of the current tiling scheme is compared with TLP_threshold. If the threshold is not reached, the fine-tuning operation is performed, the current tiling scheme is changed, and the procedure returns to step (2). If the current TLP is greater than or equal to the threshold, go to step (4). (4) The batch GEMM is computed according to the final tiling scheme. In this procedure, the TLP serves as the evaluation criterion for the effectiveness of a tiling scheme on the batch GEMM; if the threshold is not reached, fine-tuning adjusts the scheme and improves the utilization of GPU hardware resources. The optimal tiling scheme thus obtained ensures an optimal implementation at the GEMM and workgroup levels. Once the final tiling scheme is fixed, the multi-thread kernel computes according to the tile sizes, so that the wavefront and work-item levels reach a "workload balance" state.
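The selection procedure of Algorithm 1 can be sketched on the host side as follows. This is our illustration, not the authors' implementation: the tile classes come from Table 3, φ is simplified to the work-item conversion of Eq. (5) without the VGPR/SGPR/LDS caps of Eqs. (6)–(9), and the threshold and batch values are arbitrary examples:

```python
from math import ceil

# Ordered tile classes (Table 3): name -> (T_m, T_n)
TILES = ["small", "small-medium", "medium", "medium-large", "large"]
TILE_DIMS = {"small": (16, 16), "small-medium": (16, 32),
             "medium": (32, 32), "medium-large": (32, 64),
             "large": (64, 64)}
T_WAVEFRONT = 64  # work-items per wavefront on CDNA 2.0

def phi(total_workgroup, wi_per_wg=128):
    # Simplified workgroup-to-wavefront conversion: each workgroup of
    # wi_per_wg work-items yields ceil(wi_per_wg/64) wavefronts. A full
    # phi would also apply the resource limits of Eq. (9).
    return total_workgroup * ceil(wi_per_wg / 64)

def tlp_new(gemms, scheme):
    # Eq. (4): workgroup count per GEMM, converted to wavefronts.
    total_workgroup = sum((m * n) // (TILE_DIMS[s][0] * TILE_DIMS[s][1])
                          for (m, n, _), s in zip(gemms, scheme))
    return phi(total_workgroup) * T_WAVEFRONT

def fine_tune(gemms, scheme, threshold):
    # Algorithm 1 fine-tuning: shrink every tile one size class per pass
    # until the TLP estimate reaches the threshold (or nothing can shrink).
    while tlp_new(gemms, scheme) < threshold:
        if all(s == "small" for s in scheme):
            break
        scheme = [TILES[max(TILES.index(s) - 1, 0)] for s in scheme]
    return scheme

# A toy batch of two GEMMs (M, N, K), both initially tiled "large".
gemms = [(128, 128, 64), (64, 64, 64)]
print(fine_tune(gemms, ["large", "large"], threshold=4096))
# -> ['small-medium', 'small-medium']
```

Shrinking the tiles raises the workgroup (and thus wavefront) count, which is exactly how the fine-tuning phase trades intra-tile parallelism for CU utilization.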
The proposed method is implemented on the GPU platforms of both AMD and NVIDIA. The hardware characteristics of the GPU platform can also significantly impact GEMM performance. For example, on the AMD and NVIDIA platforms, threads are organized into wavefronts and warps as the basic execution units, containing 64 and 32 threads, respectively. The number of threads in the kernel needs to be an integer multiple of the wavefront or warp size to improve kernel occupancy. Meanwhile, the sizes of the registers and shared memory can affect parameter settings during implementation on different hardware architectures. Based on these differences, the proposed method considers parallelism at the wavefront or warp level when performing matrix segmentation on the two GPU platforms. In this way, it can flexibly select tiling schemes based on the hardware characteristics of the GPU to achieve optimal performance, avoid exceeding the maximum register limit, and prevent data overflow, which improves its applicability across hardware architectures.

5. Evaluation

5.1. Setup

Experiment platform and matrix generation. The overall configuration of the experimental platform and the details of the two GPUs are shown in Tables 4 and 5, respectively.

Table 4
The configuration of platforms for evaluation.

Platform setup   AMD-platform    NVIDIA-platform
CPU              EPYC 7763       Platinum 8358
GPU              MI210           A800
OS               Ubuntu 20.04    Ubuntu 20.04
ROCm/CUDA        ROCm 5.6        CUDA 12.0

Table 5
The configuration of GPUs for evaluation.

Name          MI210                         A800
Architecture  CDNA 2.0                      Ampere
Core          1700 MHz                      1410 MHz
Caches        L1 16 KB (per CU), L2 16 MB   L1 192 KB (per SM), L2 40 MB
Memory        64 GB 3.2 Gbps HBM2           80 GB 2.4 Gbps HBM2
Bandwidth     1.6 TB/s                      2.04 TB/s

To ensure the irregularity and variability of the input matrices, the GEMM size parameters M, N, and K are randomly generated within corresponding ranges ([Min, Max_M(N)] and [Min, Max_K]). Max_M, Max_N, and Max_K represent the upper bounds of M, N, and K, respectively; the lower bound of each experiment is denoted uniformly by Min and set to 16 in this paper. For example, Max_M(N) = 512 and Max_K = 128 indicate matrix dimension ranges M ∈ [16, 512], N ∈ [16, 512], and K ∈ [16, 128]. In this way, multiple sets of matrix dimension ranges are obtained, and the parameters needed for GEMM generation are selected at random from the different value ranges.

Comparison method. First, for the two GPU experimental platforms, the default GEMM processing methods rocBLAS [6] and cuBLAS [7], provided by the respective GPU manufacturers, are chosen as the basic comparison methods to demonstrate the effectiveness of the proposed method. Since these methods do not support batch invocation, rocBLAS and cuBLAS compute the batch GEMM in a loop; no stream operations are used during the computation. We also compare with CUTLASS [23], which supports batch GEMM based on sorting and built-in tiles. We then compare with MAGMA [8], supported by the University of Tennessee ICL Lab, which only extends grid.z to support batch GEMM and has no fine-grained optimization strategy; the MAGMA comparison experiments were run on both GPU platforms. Finally, to show the advancement of our proposed method, we compare with the state-of-the-art methods of Wang [36] and Li [21] on their respective platforms. All of the above methods perform a warm-up operation to eliminate the effect of the first kernel launch.

Evaluation criteria. In the following experiments, there are 12 sets of GEMM dimension ranges. The experiments with batch sizes 8, 16, 32, 64, 128, and 256 were run continuously for ten epochs under each set of value ranges. The experimental results are reported as the average GFLOPS (Giga Floating-point Operations Per Second), calculated as:

    GFLOPS = Σ_{i=0}^{n−1} 2(M_i × N_i × K_i) / (total_time × 1.0e9)    (11)

where M_i, N_i, and K_i are the matrix dimensions of the ith GEMM, total_time is the running time on the GPU, and n is the batch size. For simplicity, the experimental data are single-precision floating-point numbers stored in row-major format. The experimental results are averaged over 10 consecutive runs and rounded to two decimal places.

5.2. Speed up

On the two platforms, we first compare with the default methods rocBLAS and cuBLAS. Because these two methods do not support batched irregular GEMMs, we convert the batch GEMM into multiple single GEMMs and compute the results. The specific experimental results are shown in Figs. 4–5: the proposed method achieves 5.09× and 7.18× average speedups over rocBLAS and cuBLAS, respectively. This result arises primarily because these libraries do not support GEMMs of different scales within one batched call, so they can only compute one GEMM at a time. When faced with small matrices, the computational resources of the GPU cannot be fully utilized due to the cost of context switching between multiple GEMMs. As the batch size gradually increases, the advantage of the proposed method becomes more evident, showing that for batched, irregular GEMMs, rocBLAS and cuBLAS are at a disadvantage in computational efficiency and in switching between instances.

Meanwhile, we also compare with CUTLASS, which handles batch GEMM using sorting to mitigate the significant workload differences between multiple matrix multiplications. Fig. 5 shows that the proposed method has a 4.64× speedup, because CUTLASS's built-in tiles are unsuitable when the matrix dimensions are small. Therefore, the proposed method achieves better acceleration than CUTLASS for batched, irregular, small-size matrix multiplication. We then perform a detailed comparison and analysis based on MAGMA: the proposed method achieves 4.37× and 3.36× speed improvements over MAGMA. Figs. 4–5 show that the advantage of our method becomes more pronounced as the batch size increases. This is because MAGMA only uses the largest GEMM size in the batch to set grid.x; due to the irregularity of the matrix sizes, a large number of computational resources in the grid sit idle. The proposed method instead employs fine-grained filtering operations to ensure efficient utilization of computational resources, which is more evident when the differences between matrix dimensions are significant.

As shown in Fig. 4, the proposed method achieves an average 1.88× speedup compared to Wang. Notably, the advantage of the proposed method is more pronounced when Max_K and Max_M are small. For example, in the case of (Max_M(N) = 128, Max_K = 128), the average speedup reaches 1.95×. This is mainly because when the matrix dimensions are small, there are not enough tiles to cover the time consumption of data loading in the wavefront, which is more pronounced in workgroups with heavy loads. The proposed method adjusts the wavefront workload corresponding to the tiles through the multi-thread kernel and ensures consistent computation and data loading across workgroups. This also shows that load and computation balance between wavefronts is conducive to improving the efficiency of GPU parallel computing. On the NVIDIA platform, Fig. 5 shows that the proposed method has an average 1.94× speedup over Li, and the advantage becomes clearer as the batch size increases. There are two reasons for this speedup: (1) Li et al. used batching to balance the workload among different blocks but did not consider the difference between the workloads of threads in different tiles. (2) When selecting the tiling scheme, their TLP is calculated only at the block level, neglecting the fine-grained warp level, which leads to an inaccurate TLP estimate. The proposed method adjusts the wavefront workload corresponding to the tiles through the multi-thread kernel and ensures consistent computation and data loading across different workgroups.

Fig. 4. The comparative results on MI210 (5.09×, 4.37×, 1.88× speedup over rocBLAS, MAGMA, Wang).

Fig. 5. The comparative results on A800 (7.18×, 4.64×, 3.63×, 1.94× speedup over cuBLAS, CUTLASS, MAGMA, Li).

5.3. Kernel occupancy

To explore the difference between the proposed method and the comparison methods in GPU resource utilization, we present the kernel occupancy of the various methods on the two GPU platforms. Kernel occupancy is expressed as:

    kernel occupancy = Num_actived / Num_total    (12)

To obtain accurate performance metrics, we use the profiling tools provided by AMD and NVIDIA, Omniperf (https://github.com/ROCm/omniperf) and Nsight (https://docs.nvidia.com/nsight-compute/NsightCompute/index.html), to evaluate the resource utilization of the kernel during execution. Owing to the architectural differences between the AMD MI210 and the NVIDIA A800, kernel occupancy has distinct interpretations: on the AMD platform, Num_actived is the number of activated wavefronts and Num_total is the theoretical number of wavefronts that the CU can execute simultaneously; on the NVIDIA platform, Num_actived and Num_total represent the number of active warps and the number of warps that are theoretically parallelizable simultaneously. The results of the experiment are shown in Fig. 6.

Fig. 6. The kernel occupancy on two GPU platforms.
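As a concrete illustration of how Eqs. (5)–(9) feed into the occupancy of Eq. (12) on the AMD platform, the following sketch computes the per-CU wavefront cap for one workgroup configuration. The register and LDS figures are illustrative placeholders of our choosing, not measurements from the paper:

```python
from math import ceil, floor

def max_wavefronts(wi_wg, vgpr_max, vgpr_used, sgpr_max, sgpr_used,
                   lds_max, lds_used, wf_cu):
    # Wavefront limits per CU in the style of Eqs. (5)-(9).
    wf_wg = 16 * ceil(wi_wg / 64)                        # Eq. (5)
    wf_v = 4 * floor(vgpr_max / (vgpr_used * 64))        # Eq. (6)
    wf_s = floor(sgpr_max / sgpr_used)                   # Eq. (7)
    wf_l = floor(lds_max / lds_used) * ceil(wi_wg / 64)  # Eq. (8)
    return min(wf_wg, wf_v, wf_s, wf_l, wf_cu)           # Eq. (9)

def occupancy(num_actived, num_total):
    # Eq. (12): activated wavefronts over the theoretical maximum.
    return num_actived / num_total

# Illustrative 256-work-item workgroup: 64 VGPRs per work-item,
# 64 SGPRs per wavefront, 16 KB of LDS per workgroup, and a CU
# that supports at most 32 wavefronts.
wf = max_wavefronts(wi_wg=256, vgpr_max=32768, vgpr_used=64,
                    sgpr_max=2048, sgpr_used=64,
                    lds_max=65536, lds_used=16384, wf_cu=32)
print(wf, occupancy(wf, 32))  # prints: 16 0.5
```

With these numbers the LDS limit of Eq. (8) is the binding one, so the kernel runs at half the theoretical wavefront occupancy — exactly the kind of gap Fig. 6 visualizes.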
Comparing with rocBLAS and cuBLAS, the proposed method has a clear advantage in the case of batch GEMM. It is also in the best position relative to the other methods (CUTLASS, MAGMA, Wang, Li), showing high efficiency in the utilization of GPU resources. As shown in Fig. 6, the proposed method consistently maintains the optimal kernel occupancy on both GPU platforms, which indicates that it can better exploit the computing power of the GPU.

5.4. The overhead of the tiling algorithm

This section presents the proportion of the runtime taken up by the tiling algorithm when executing the proposed method on the two GPU platforms with various batch sizes. The experimental results are presented in Fig. 7. From Fig. 7, it is evident that the tiling algorithm's share of the runtime decreases as the batch size increases. When the batch size is 8, the runtime of the tiling algorithm on the two GPU platforms is 6.06% and 6.37%, respectively. As the batch size increases, more and more GEMMs are executed on the GPU, their execution takes up most of the time, and the runtime portion of the tiling algorithm shrinks; for example, with a batch size of 1024, the tiling algorithm takes less than 1% of the runtime. The experimental results on the two GPUs indicate that the time overhead of the tiling algorithm in the batch GEMM execution process is negligible, especially when the batch size is large. In real-world scenarios such as deep learning, where a large number of GEMM operations are often required, the tiling algorithm introduces even less overhead in the execution process.

Fig. 7. The time overhead of tiling algorithm.

5.5. The performance benefits of the proposed TLP

This section presents comparative experimental results on the two GPU platforms to provide a more detailed evaluation of the proposed TLP. The detailed results are shown in Figs. 8–9, from which it is clear that the proposed TLP performs better overall than the traditional TLP, with speedups of 1.077× on MI210 and 1.085× on A800. From Fig. 8, the proposed method improves performance significantly when the batch size is larger. For example, on MI210 the proposed method has an average speedup of 1.04× when batch size <= 16; when batch size >= 32, it improves performance by 1.10×. This gap exists because when the batch size and matrix dimensions are small, it is difficult to utilize the hardware resources fully; when there are a large number of tiles, the proposed TLP can more accurately evaluate the threads' workload and select the optimal tiling scheme. The same trend appears on the A800 platform, where the proposed TLP yields improvements of 1.04× and 1.11× for batch size <= 16 and batch size >= 32, respectively. The comparative results on the two GPU platforms further demonstrate the effectiveness of the proposed TLP.

Fig. 8. The performance improvement of the proposed TLP on MI210 (1.077× average speedup).

Fig. 9. The performance improvement of the proposed TLP on A800 (1.085× average speedup).

5.6. The latency

This section compares kernel latency on the two GPU platforms to provide a more detailed evaluation of the proposed method. We measured kernel latency with different batch sizes in the comparative experiments; the detailed results are shown in Fig. 10. On MI210, the proposed method reduces latency by 3.87×, 4.53×, and 1.62× compared to rocBLAS, MAGMA, and Wang, respectively. It has the lowest latency on MI210, indicating higher computational efficiency and an effective reduction of latency. On A800, the proposed method shows performance improvements of 3.02×, 2.59×, 2.45×, and 1.89× compared to cuBLAS, MAGMA, CUTLASS, and Li, respectively. Fig. 10 shows that kernel latency increases on both GPU platforms as the batch size grows, with rocBLAS and cuBLAS exhibiting the highest latency: the traditional loop scheduling method significantly increases latency through context switching between kernels when the batch size is large. Fig. 10 also shows that some methods exhibit different latency behavior at different batch sizes. For example, when batch size <= 16, MAGMA has the highest latency on the two GPU platforms; when the batch size is large, its computational performance improves, indicating that MAGMA performs better when there are many matrices. The experimental results on the two platforms show that the proposed method has the lowest latency at all batch sizes, indicating better performance and broad applicability.

Fig. 10. The latency performance of the kernel on two GPU platforms.

5.7. The improved performance on Inception layers of CNN

Modern CNN architectures often have multiple branches to capture features at different scales, and the convolution operations of different scales in each branch can be represented as batch GEMM operations with various dimensions, e.g., in GoogleNet [13], DenseNet [50], SqueezeNet [12], etc. To demonstrate the effectiveness of the proposed method in real-world scenarios, we use various Inception modules as a typical application and perform the forward computation process on the two GPU platforms. The Inception module involves a large number of irregular, small-size GEMM operations. The deep learning frameworks MIOpen (https://github.com/ROCm/MIOpen) and cuDNN (https://github.com/NVIDIA/cudnn-frontend) are used as the benchmark implementations on the GPU platforms. In this section, we select several commonly used Inception modules to evaluate the speedup of the proposed method; the GEMM sizes in the Inception modules are shown in Table 6. Fig. 11 shows the speedup of the proposed method in each Inception module: the average speedups are 2.88× and 1.87×, respectively, and the gray boxes in Fig. 11 represent the average speedup ratios of the different Inception modules. The experimental results suggest that the Inception 8–9 series has the highest average speedup ratios among the Inception modules (3.68× and 2.66×, respectively), because Inception 8–9 contains more matrix shapes than the other Inception modules, and the dimensions of these matrices are smaller than those of the former two. Finally, the proposed method has been shown to significantly accelerate CNN models with various branch structures on the two GPU platforms, particularly in scenarios involving multiple branches, irregular shapes, and small dimensions.

Fig. 11. The speedup performance on Inception layers.

Table 6
The size of GEMM in various Inception modules.

Inception module   GEMM size (M × N × K)
Inception-1        784 × 96 × 192, 784 × 64 × 192, 784 × 32 × 192, 784 × 16 × 192
Inception-2        784 × 64 × 192, 784 × 32 × 192, 784 × 128 × 192
Inception-3        196 × 192 × 192, 196 × 16 × 192, 196 × 96 × 192, 196 × 64 × 192
Inception-4        196 × 64 × 192, 196 × 24 × 192, 196 × 160 × 192
Inception-5        196 × 64 × 192, 196 × 128 × 192, 196 × 24 × 192
Inception-6        196 × 112 × 192, 196 × 144 × 192, 196 × 32 × 192, 196 × 64 × 192
Inception-7        196 × 256 × 192, 196 × 160 × 192, 196 × 128 × 192
Inception-8        49 × 160 × 192, 49 × 128 × 192, 49 × 256 × 192, 49 × 160 × 192, 49 × 32 × 192
Inception-9        49 × 192 × 192, 49 × 128 × 192, 49 × 384 × 192, 49 × 192 × 192, 49 × 48 × 192
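The Table 6 shapes are what the im2col mapping produces for the convolution branches: M = H·W, N = C_out, K = C_in·kh·kw. The sketch below is our illustration and reproduces the Inception-1 row under the assumption of 28 × 28 feature maps with 192 input channels and 1 × 1 kernels (consistent with the first Inception stage of GoogleNet):

```python
# Map a convolution (H x W x C_in feature map, C_out filters of size
# kh x kw) to the GEMM dimensions used in Table 6 via im2col:
# M = H*W spatial positions, N = C_out filters, K = C_in*kh*kw.
def conv_to_gemm(h, w, c_in, c_out, kh=1, kw=1):
    return (h * w, c_out, c_in * kh * kw)

# Four parallel 1x1-convolution branches over a 28x28x192 input
# (assumed shapes, chosen to match the Inception-1 row of Table 6).
branches = [conv_to_gemm(28, 28, 192, c_out) for c_out in (96, 64, 32, 16)]
print(branches)
# -> [(784, 96, 192), (784, 64, 192), (784, 32, 192), (784, 16, 192)]
```

Because every branch shares K = 192 but differs in N, the module naturally yields the batched, irregular, small GEMMs that the proposed method targets.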
6. Conclusion

In this paper, we propose a load-balanced batch GEMM acceleration method for the problems of low parallel computing efficiency and poor hardware resource utilization in batched, irregular, and variable matrix multiplication scenarios. Kernel occupancy and hardware resource utilization are effectively improved by a multi-thread kernel design that balances the computation and data loading of the work-items. A novel approach to TLP computation is devised, in which the parallelism of a tiling scheme is measured by the number of activated wavefronts; this allows the optimal tiling scheme to be selected for different GPU architectures. Experiments on two GPU platforms validate the effectiveness and progress of our proposed method.

Future work includes exploring batch GEMM at various precisions. With the development of Transformer-based models, many GEMM operations are involved in the training and inference of Large Language Models (LLMs), which often use lower precisions such as FP16, FP8, etc. For example, quantized LLMs often involve GEMM operations in which the weight matrices and activation values have different precisions, e.g., W4A16 and W8A8. More complex precisions and storage formats pose challenges to the performance of GEMM operations.

CRediT authorship contribution statement

Yu Zhang: Writing – review & editing, Writing – original draft. Lu Lu: Writing – review & editing, Supervision. Zhanyu Yang: Writing – review & editing. Zhihong Liang: Supervision, Conceptualization. Siliang Suo: Supervision, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of Guangdong Province (2024A1515010204) and the Technological Research Project of Southern Power Grid Company (ZBKJXM20232483).

Data availability

No data was used for the research described in the article.

References

[1] P. Valero-Lara, I. Jorquera, F. Lui, J. Vetter, Mixed-precision S/DGEMM using the TF32 and TF64 frameworks on low-precision AI tensor cores, in: Proceedings of the SC'23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 2023, pp. 179–186.
[2] H. Martínez, S. Catalán, A. Castelló, E.S. Quintana-Ortí, Parallel GEMM-based convolutions for deep learning on multicore ARM and RISC-V architectures, J. Syst. Archit. (2024) 103186.
[3] J. Fornt, P. Fontova-Musté, M. Caro, J. Abella, F. Moll, J. Altet, C. Studer, An energy-efficient GEMM-based convolution accelerator with on-the-fly im2col, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 31 (11) (2023) 1874–1878.
[4] H. Kim, W.J. Song, LAS: locality-aware scheduling for GEMM-accelerated convolutions in GPUs, IEEE Trans. Parallel Distrib. Syst. 34 (5) (2023) 1479–1494.
[5] W. Yang, J. Fang, D. Dong, X. Su, Z. Wang, Optimizing full-spectrum matrix multiplications on ARMv8 multi-core CPUs, IEEE Trans. Parallel Distrib. Syst. (2024).
[6] AMD, Next generation BLAS implementation for ROCm platform, 2024, https://github.com/ROCm/rocBLAS.
[7] B. Tuomanen, Hands-On GPU Programming with Python and CUDA: Explore High-Performance Parallel Computing with CUDA, Packt Publishing Ltd, 2018.
[8] ICL, Matrix algebra for GPU and multicore architectures, 2024, https://icl.utk.edu/magma/.
[9] T. Faingnaert, T. Besard, B. De Sutter, Flexible performant GEMM kernels on GPUs, IEEE Trans. Parallel Distrib. Syst. 33 (9) (2021) 2230–2248.
[10] W.S. Moses, I.R. Ivanov, J. Domke, T. Endo, J. Doerfert, O. Zinenko, High-performance GPU-to-CPU transpilation and optimization via high-level parallel constructs, in: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2023, pp. 119–134.
[11] H. Kim, H. Nam, W. Jung, J. Lee, Performance analysis of CNN frameworks for GPUs, in: 2017 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS, IEEE, 2017, pp. 55–64.
[12] F.N. Iandola, S. Han, M.W. Moskewicz, K. Ashraf, W.J. Dally, K. Keutzer, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size, 2016, arXiv preprint arXiv:1602.07360.
[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[14] G. Pant, D. Yadav, A. Gaur, ResNeXt convolution neural network topology-based deep learning model for identification and classification of pediastrum, Algal Res. 48 (2020) 101932.
[15] S. Barrachina, M.F. Dolz, P. San Juan, E.S. Quintana-Ortí, Efficient and portable GEMM-based convolution operators for deep neural network training on multicore processors, J. Parallel Distrib. Comput. 167 (2022) 240–254.
[16] S. Rajbhandari, Y. He, O. Ruwase, M. Carbin, T. Chilimbi, Optimizing CNNs on multicores for scalability, performance and goodput, ACM SIGARCH Comput. Archit. News 45 (1) (2017) 267–280.
[17] C. Rivera, J. Chen, N. Xiong, S.L. Song, D. Tao, ISM2: Optimizing irregular-shaped matrix-matrix multiplication on GPUs, 2020, arXiv preprint arXiv:2002.03258.
[18] K. Matsumoto, N. Nakasato, S.G. Sedukhin, Performance tuning of matrix multiplication in OpenCL on different GPUs and CPUs, in: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, IEEE, 2012, pp. 396–405.
[19] G.E. Moon, H. Kwon, G. Jeong, P. Chatarasi, S. Rajamanickam, T. Krishna, Evaluating spatial accelerator architectures with tiled matrix-matrix multiplication, IEEE Trans. Parallel Distrib. Syst. 33 (4) (2021) 1002–1014.
[35] G. Alaejos, A. Castelló, H. Martínez, P. Alonso-Jordá, F.D. Igual, E.S. Quintana-Ortí, Micro-kernels for portable and efficient matrix multiplication in deep learning, J. Supercomput. 79 (7) (2023) 8124–8147.
[36] R. Wang, Z. Yang, H. Xu, L. Lu, A high-performance batched matrix multiplication framework for GPUs under unbalanced input distribution, J. Supercomput. 78 (2) (2022) 1741–1758.
[37] Y. Zhang, Y. Wang, Z. Mo, Y. Zhou, T. Sun, G. Xu, C. Xing, L. Yang, Accelerating small matrix multiplications by adaptive batching strategy on GPU, in: 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application, HPCC/DSS/SmartCity/DependSys, IEEE, 2022, pp.
882–887. [20] Q. Han, H. Yang, M. Dun, Z. Luan, L. Gan, G. Yang, D. Qian, Towards [38] A. Abdelfattah, S. Tomov, J. Dongarra, Matrix multiplication on batches of small efficient tile low-rank GEMM computation on sunway many-core processors, J. matrices in half and half-complex precisions, J. Parallel Distrib. Comput. 145 Supercomput. 77 (5) (2021) 4533–4564. (2020) 188–201. [21] X. Li, Y. Liang, S. Yan, L. Jia, Y. Li, A coordinated tiling and batching [39] A. Abdelfattah, A. Haidar, S. Tomov, J. Dongarra, Novel HPC techniques to batch framework for efficient GEMM on GPUs, in: Proceedings of the 24th Symposium execution of many variable size BLAS computations on GPUs, in: Proceedings of on Principles and Practice of Parallel Programming, 2019, pp. 229–241. the International Conference on Supercomputing, 2017, pp. 1–10. [22] P. Tillet, D. Cox, Input-aware auto-tuning of compute-bound HPC kernels, in: [40] A. Abdelfattah, A. Haidar, S. Tomov, J. Dongarra, Performance, design, and Proceedings of the International Conference for High Performance Computing, autotuning of batched GEMM for GPUs, in: High Performance Computing: 31st Networking, Storage and Analysis, 2017, pp. 1–12. International Conference, ISC High Performance 2016, Frankfurt, Germany, June [23] NVIDIA, CUDA templates for linear algebra subroutines, 2024, https://github. 19-23, 2016, Proceedings, Springer, 2016, pp. 21–38. com/NVIDIA/cutlass. [41] A. Li, G.-J. van den Braak, H. Corporaal, A. Kumar, Fine-grained synchronizations [24] J. Huang, C.D. Yu, R.A.v.d. Geijn, Strassen’s algorithm reloaded on GPUs, ACM and dataflow programming on GPUs, in: Proceedings of the 29th ACM on Trans. Math. Softw. 46 (1) (2020) 1–22. International Conference on Supercomputing, 2015, pp. 109–118. [25] B. Boyer, J.-G. Dumas, C. Pernet, W. Zhou, Memory efficient scheduling of [42] J. Li, H. Ye, S. Tian, X. Li, J. 
Zhang, A fine-grained prefetching scheme strassen-winograd’s matrix multiplication algorithm, in: Proceedings of the 2009 for DGEMM kernels on GPU with auto-tuning compatibility, in: 2022 IEEE International Symposium on Symbolic and Algebraic Computation, 2009, pp. International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2022, 55–62. pp. 863–874. [26] A. Fawzi, M. Balog, A. Huang, T. Hubert, B. Romera-Paredes, M. Barekatain, [43] Z. Yang, L. Lu, R. Wang, A batched GEMM optimization framework for deep A. Novikov, F.J. R Ruiz, J. Schrittwieser, G. Swirszcz, et al., Discovering faster learning, J. Supercomput. 78 (11) (2022) 13393–13408. matrix multiplication algorithms with reinforcement learning, Nature 610 (7930) [44] H. Mei, H. Qu, J. Sun, Y. Gao, H. Lin, G. Sun, GPU occupancy prediction of (2022) 47–53. deep learning models using graph neural network, in: 2023 IEEE International [27] G. Xiao, C. Yin, T. Zhou, X. Li, Y. Chen, K. Li, A survey of accelerating parallel Conference on Cluster Computing, CLUSTER, IEEE, 2023, pp. 318–329. sparse linear algebra, ACM Comput. Surv. 56 (1) (2023) 1–38. [45] I. Masliah, A. Abdelfattah, A. Haidar, S. Tomov, M. Baboulin, J. Falcou, J. [28] Y. Chen, G. Xiao, K. Li, F. Piccialli, A.Y. Zomaya, fgSpMSpV: A fine-grained Dongarra, Algorithms and optimization techniques for high-performance matrix- parallel SpMSpV framework on HPC platforms, ACM Trans. Parallel Comput. 9 matrix multiplications of very small matrices, Parallel Comput. 81 (2019) (2) (2022) 1–29. 1–21. [29] Y. Chen, G. Xiao, W. Yang, Optimizing partitioned CSR-based SpGEMM on the [46] G. Park, B. Park, M. Kim, S. Lee, J. Kim, B. Kwon, S.J. Kwon, B. Kim, Y. Lee, sunway TaihuLight, Neural Comput. Appl. 32 (10) (2020) 5571–5582. D. Lee, Lut-gemm: Quantized matrix multiplication based on luts for efficient [30] Y. Chen, K. Li, W. Yang, G. Xiao, X. Xie, T. 
Li, Performance-aware model for inference in large-scale generative language models, 2022, arXiv preprint arXiv: sparse matrix-matrix multiplication on the sunway taihulight supercomputer, 2206.09557. IEEE Trans. Parallel Distrib. Syst. 30 (4) (2018) 923–938. [47] B. Feng, Y. Wang, G. Chen, W. Zhang, Y. Xie, Y. Ding, EGEMM-TC: accelerating [31] G. Xiao, K. Li, Y. Chen, W. He, A.Y. Zomaya, T. Li, Caspmv: A customized scientific computing on tensor cores with extended precision, in: Proceedings and accelerative spmv framework for the sunway taihulight, IEEE Trans. Parallel of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Distrib. Syst. 32 (1) (2019) 131–146. Programming, 2021, pp. 278–291. [32] G. Xiao, C. Yin, Y. Chen, M. Duan, K. Li, Efficient utilization of multi-threading [48] G. Shobaki, A. Kerbow, S. Mekhanoshin, Optimizing occupancy and ILP on the parallelism on heterogeneous systems for sparse tensor contraction, IEEE Trans. GPU using a combinatorial approach, in: Proceedings of the 18th ACM/IEEE Parallel Distrib. Syst. (2024). International Symposium on Code Generation and Optimization, 2020, pp. [33] D.E. Tanner, Tensile: Auto-tuning gemm gpu assembly for all problem sizes, 133–144. in: 2018 IEEE International Parallel and Distributed Processing Symposium [49] A.B. Hayes, L. Li, D. Chavarría-Miranda, S.L. Song, E.Z. Zhang, Orion: A Workshops, IPDPSW, IEEE, 2018, pp. 1066–1075. framework for gpu occupancy tuning, in: Proceedings of the 17th International [34] S. Wang, FlexGEMM: A flexible micro-kernel generation framework, in: Proceed- Middleware Conference, 2016, pp. 1–13. ings of the 5th International Conference on Computer Information and Big Data [50] G. Huang, S. Liu, L. Van der Maaten, K.Q. Weinberger, Condensenet: An Applications, 2024, pp. 164–170. efficient densenet using learned group convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2752–2761. 14