Journal of Systems Architecture 160 (2025) 103341

Contents lists available at ScienceDirect

Journal of Systems Architecture

journal homepage: www.elsevier.com/locate/sysarc

A load-balanced acceleration method for small and irregular batch matrix multiplication on GPU

Yu Zhang a, Lu Lu a,b,∗, Zhanyu Yang a, Zhihong Liang c,d, Siliang Suo c,d

a School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
b Peng Cheng Laboratory, Shenzhen, 518055, China
c Electric Power Research Institute, CSG, Guangzhou, China
d Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou, China
ARTICLE INFO

Keywords:
Batch GEMM
Thread workload
Multi-thread kernel
Tiling algorithm

ABSTRACT

As an essential mathematical operation, GEneral Matrix Multiplication (GEMM) plays a vital role in many applications, such as high-performance computing, machine learning, etc. In practice, the performance of GEMM is limited by the dimensions of the matrix and the diversity of GPU hardware architectures. When dealing with batched, irregular and small matrices, GEMM usually performs poorly. To this end, a common approach is to segment the matrix into multiple tiles and utilize parallelism between workgroups in the GPU to compute the results. However, previous works only consider tile size and inter-workgroup parallelism, and ignore the low computational efficiency and poor hardware resource utilization caused by the difference in workloads between wavefronts. To address these issues, we propose a load-balanced batch GEMM acceleration method, consisting of a multi-thread kernel design and an efficient tiling algorithm. The multi-thread kernel design addresses the workload imbalance between wavefronts in different workgroups, and the efficient tiling algorithm chooses the optimal tiling scheme with a new thread-level parallelism calculation method to achieve load-balanced task allocation. Finally, comparative experiments were conducted on two GPU platforms, AMD and NVIDIA. Experimental results indicate that the proposed method outperforms previous methods.
1. Introduction

GEneral Matrix Multiplication (GEMM) is a standard computing kernel that plays an important role in high-performance computing [1], artificial intelligence [2], image processing [3], and other research fields. With the explosive growth of data volume and the emergence of various algorithms, the demand for high-performance GEMM computing is increasing [4,5]. Additional stream processors and memory are integrated into GPUs to cater to this trend, providing tremendous computational power for GEMM acceleration. To fully utilize this hardware acceleration capability, AMD and NVIDIA provide developers with platforms for parallel computing on the GPU (ROCm and CUDA). Based on these parallel computing acceleration platforms, various optimization algorithms and acceleration libraries have been proposed and demonstrated to have powerful effects, such as rocBLAS [6], cuBLAS [7], MAGMA [8], etc. These methods achieve optimal computational task allocation through hardware resource scheduling and thread parallelism to accelerate the matrix multiplication operation [9,10].

Many real-world applications, such as deep learning, involve irregular, small-size matrix multiplication operations in their computations [11]. For example, in Convolutional Neural Networks (CNN) [12–14], the structure of these models contains a large number of convolutional layers. The scale of the convolution kernel tends to be small (e.g. ‘‘1*1’’ and ‘‘3*3’’). Convolution operations are converted to GEMM using the im2col function, and the dimensions of the matrix are typically less than 1000 [15,16]. These small GEMM computations prevent the GPU from fully exploiting its hardware computing potential. In this case, the scheduling overhead between batch GEMMs and the irregularity of the matrices pose challenges to computational performance [17,18].

For a GEMM, tiling is a standard solution. The matrix is segmented into multiple tiles, and a thread block is responsible for computing an individual tile. Since each tile is independent, multiple tiles can be computed in parallel by multiple threads on the GPU to speed up the computation of GEMM. A larger tile dimension will increase the Thread-Level Parallelism (TLP) of a single tile and also will
∗ Corresponding author at: School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China.
E-mail addresses: yuzhang0722@163.com (Y. Zhang), lul@scut.edu.cn (L. Lu), yangzhanyu@hotmail.com (Z. Yang), liangzh@csg.cn (Z. Liang), suosl@csg.cn (S. Suo).

https://doi.org/10.1016/j.sysarc.2025.103341
Received 3 September 2024; Received in revised form 3 November 2024; Accepted 8 January 2025
Available online 23 January 2025
1383-7621/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
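To make the sizes mentioned in the introduction concrete: im2col turns a convolution into a GEMM whose dimensions follow directly from the layer shape. The following is an illustrative sketch; the helper name `im2col_gemm_dims` and the layer sizes are examples chosen here, not values from the paper.

```python
# Sketch: GEMM dimensions produced by im2col for a convolution layer.
# Layer shapes below are illustrative examples, not from the paper.

def im2col_gemm_dims(in_channels, out_channels, kernel, out_h, out_w):
    """Return (M, N, K) of the GEMM that computes the convolution
    after an im2col transform: filters (M x K) times patches (K x N)."""
    M = out_channels                    # one output row per filter
    K = in_channels * kernel * kernel   # unrolled receptive field
    N = out_h * out_w                   # one column per output pixel
    return M, N, K

# A "3*3" convolution over a 28x28 feature map with 64 input and 128
# output channels yields a GEMM with every dimension well under 1000:
M, N, K = im2col_gemm_dims(64, 128, 3, 28, 28)
print(M, N, K)  # 128 784 576
```

Each such layer contributes one small GEMM per image, which is why batching them is attractive.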
reduce the number of tiles, resulting in failure to fully utilize the hardware resources of the GPU [19,20]. The Instruction-Level Parallelism (ILP) of a single thread is related to the K-dimension. Generally, a large enough matrix size can fully use GPU hardware resources and achieve higher TLP and ILP [21,22].

To improve computational efficiency, previous studies have proposed some acceleration methods for matrix multiplication. For instance, rocBLAS [6] and cuBLAS [7] provide batch GEMM APIs (rocblasSgemmBatched and cublasSgemmBatched), which support multiple GEMMs being calculated simultaneously on GPUs. However, these APIs support only uniform matrix sizes, which considerably limits their applications. NVIDIA also provides a C++-style template library, CUTLASS [23], which utilizes built-in tile templates and sorting to accelerate matrix multiplication operations. In fact, the size of matrices is variable in many real-world applications [11]. To solve this issue, a Vbatch GEMM routine that supports batch GEMM in various sizes is designed and implemented by MAGMA (magmablas_sgemm_vbatched). It adapts to batch GEMMs with multiple tiling strategies, assigning the appropriate tile to a single GEMM for large performance gains. Although variable sizes are supported in MAGMA, it still has some limitations. First, MAGMA only supports some coarse-grained tiling strategies that are not appropriate for all GEMMs. Coarse-grained tiling results in an unbalanced kernel workload and reduced GPU utilization. Second, the grid size is determined by the tiling of the largest matrix, which leads to idle threads and a waste of GPU computing power. Third, the lack of an evaluation criterion for tiling leads to lower efficiency of strategy choice.

To thoroughly support batch GEMM with variable sizes, it is essential to design a tiling algorithm that can be adapted to all GEMMs and adaptively choose tile sizes, not limited to a single size. The optimal tiling for each GEMM is different, depending on the size of the matrix dimensions (𝑀, 𝑁, 𝐾). How to choose a suitable tile is a challenge for batch GEMM. At the same time, an evaluation criterion based on the current GPU hardware and tiling strategy is also essential. With such a criterion, an appropriate tiling for each GEMM can be chosen to fully utilize the GPU computing capabilities and achieve better computational performance. How to measure the effectiveness of the tiling algorithm on the GPU hardware is a challenging problem. Tiles of various sizes can lead to significant differences in computational effort within each workgroup, and further to an unbalanced distribution of computational tasks and excessive load differences between threads. Hence, for tiles of various sizes, balancing thread computation and data loading during computation is also a challenge for batch GEMM.

To address the above challenges, we propose a batch GEMM acceleration method with a multi-thread kernel design. Furthermore, an efficient tiling algorithm is proposed to achieve load balance and higher hardware resource utilization. Our contributions can be summarized as follows:

• A multi-threaded kernel design scheme is proposed to balance thread computation and data loading in different workgroups to compute the various tiles.
• A novel TLP computation method is designed to select the optimal tiling algorithm by combining the kernel occupancy of the GPU and the tiling operation.
• An efficient tiling algorithm is implemented by considering the GPU hardware architecture and the batch GEMM workload.
• The proposed method can efficiently handle batch irregular GEMM and achieves state-of-the-art performance on AMD and NVIDIA GPU platforms.

The rest of the paper is organized as follows. Section 2 provides related work and motivation. Section 3 introduces background on batch GEMM, GPU architecture, and kernel occupancy. Section 4 presents the details of the multi-thread kernel design and load-balanced tiling algorithm. Section 5 demonstrates and evaluates the experimental results. Section 6 provides the conclusions of the paper and future work. The source code of this paper can be obtained in this repository: https://github.com/zhangyu0722/BatchGEMM.git.

2. Related work and motivation

2.1. Related work

Several approaches have been proposed for batch GEMM computation, which mainly focus on algorithm-level optimization or architecture-level optimization. The former mainly explores lower bounds on the time complexity of GEMM operations at the mathematical level and optimizes the computational effort. The latter is based on different GPU architecture features and uses corresponding optimization techniques to improve the computational efficiency of GEMM. In algorithm-level optimization, Strassen et al. [24] proposed a novel GEMM algorithm based on the property that matrix addition is faster than matrix multiplication to speed up the computational process, which uses seven multiplications and multiple addition operations instead of eight multiplications. This approach mathematically reduced the time complexity of GEMM to 𝑂(𝑛^2.81) for the first time. To reduce the requirement of Strassen’s algorithm for extra memory space, three different methods were proposed in [25]: pre-additions, overwriting the input matrix, and recursive scheduling. At the same time, due to the powerful effect of deep neural networks in various domains, Alhussein Fawzi et al. [26] transformed the process of finding the optimal complexity of matrix multiplication into a tensor decomposition problem and used reinforcement learning to explore lower bounds on the complexity of matrix multiplication. In particular, for a 4 × 4 matrix, the multiplication count was as low as 47 multiplications. This performance was better than the two-level Strassen’s algorithm, which involves 49 multiplications. Although the above approaches reduce the mathematical complexity of matrix multiplication operations, it is difficult to take advantage of their performance benefits due to the neglect of computational scheduling strategies and multi-level memory architecture features on the GPU.

In architecture-level optimization, GPU vendors (NVIDIA and AMD) have designed and implemented computing libraries such as cuBLAS [7] and rocBLAS [6] based on their parallel computing platforms to improve GPU hardware utilization and parallelism. However, due to the restriction of uniform-sized matrices, the performance is poor when faced with small and irregular batch GEMMs. Although NVIDIA provides a C++-style template library, the small size of the matrices and the lack of assembly-level optimizations make it difficult for CUTLASS to fully exploit its performance advantages for irregular and small matrix multiplication [23]. These irregular and small-sized matrices often lead to unbalanced workloads among threads in different workgroups, which can reduce kernel performance. For Sparse GEneral Matrix-Matrix multiplication (SpGEMM), the matrix’s sparsity leads to significant differences in thread workloads [27,28]. To address the unbalanced workload, Chen et al. [29] optimized the matrix segmentation by analyzing the distribution of the floating-point calculations of the CSR-based SpGEMM, which achieves load balance and performance improvement on Sunway TaihuLight. For the issue of workload unbalance in threads, it is necessary to conduct a detailed analysis of the computation process and hardware platform characteristics to design an efficient parallel framework implementation [30,31]. Xiao et al. [32] introduced a fine-grained partitioning strategy to select appropriate segmentation dimensions, efficiently utilizing multi-thread parallelism and improving the performance of binary sparse tensor contraction. The diversity of matrix sizes makes it difficult to utilize a unified routine for calculations, resulting in some threads being idle in the CU [33,34]. Indeed, the size of matrices is variable and irregular in various scientific computing scenarios. To overcome the restriction of uniform matrix sizes, MAGMA [8] proposes a Vbatch routine to support batch GEMM with various sizes. In this way, it uses a 3D grid for batch GEMM’s kernel design, where grid.z represents the batch size. Each GEMM corresponds to one of the 2D-grid planes, and the size of the two-dimensional plane (grid.x, grid.y) is determined by the largest GEMM. In the case of irregular GEMM, if the dimension
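The idle-thread cost of the MAGMA-style vbatch layout described above can be quantified: when one 2D grid plane is sized by the largest GEMM, every smaller GEMM in the batch leaves workgroups idle. The sketch below is illustrative; the function name, the matrix sizes, and the 32*32 tile are assumptions made here for the example.

```python
import math

# Sketch: fraction of idle workgroups when one grid plane is sized by
# the largest GEMM in the batch (the vbatch-style layout described
# above). Matrix sizes and the 32x32 tile are illustrative.

def idle_workgroup_fraction(c_shapes, tile_m, tile_n):
    """c_shapes: list of (M, N) output-matrix sizes in the batch."""
    grids = [(math.ceil(m / tile_m), math.ceil(n / tile_n))
             for m, n in c_shapes]
    plane_x = max(g[0] for g in grids)
    plane_y = max(g[1] for g in grids)
    launched = plane_x * plane_y * len(c_shapes)   # grid.z = batch size
    useful = sum(x * y for x, y in grids)
    return 1.0 - useful / launched

# One 512x512 GEMM batched with seven 64x64 GEMMs: most workgroups idle.
frac = idle_workgroup_fraction([(512, 512)] + [(64, 64)] * 7, 32, 32)
print(round(frac, 3))  # 0.861
```

With a large size gap in the batch, over 86% of the launched workgroups do no useful work in this example, which is the waste the proposed method targets.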
Fig. 1. GEMM and batch GEMM schematic diagram.
difference between the largest GEMM and the rest is too large, a large number of threads and workgroups will be idle, resulting in a waste of GPU computing resources. For various parallel acceleration platforms, different hardware characteristics, such as register size and the number of CUs, affect the allocation of computing resources in the kernel. To ensure kernel performance, it is necessary to flexibly set parameters based on different matrix sizes and hardware architectures [9,35]. To solve this problem, a coordinated tiling and batching strategy is proposed in [21], where a different tiling strategy is used for each GEMM in batch GEMM and appropriate batching is used according to the tile size to improve the computational efficiency of the GPU. Wang et al. [36] proposed a sort-up algorithm based on the GEMM workload and split-down in the tiling process, which can segment large tiles into multiple smaller tiles. This approach can make better use of CU utilization when the number of GEMMs is limited.

2.2. Motivation

Although the above-mentioned methods improve the parallel computing efficiency of batch GEMM on GPU from various perspectives, there are two problems. One is that the workload of threads varies significantly across the kernel. In the above approaches, tiles of various sizes are designed, and each tile is computed by the corresponding kernel, where the number of threads is fixed. In general, larger tiles have better TLP. This also increases the workload of each thread for large-size tiles, and a thread responsible for computing a large tile requires more hardware resources (VGPR, SGPR, LDS) and computing time. The other is that differences between wavefronts within different workgroups are ignored in the TLP calculations. The workgroup is transformed into multiple wavefronts during GPU computation, which are executed in parallel on the CU. Each CU can run multiple wavefronts simultaneously, and the number of wavefronts depends on the hardware resources required by each wavefront. Thus, the TLP on the GPU should be determined by the number of threads in the wavefronts that can be executed in parallel on the CU.

To solve the above problems, we propose an efficient and load-balanced batch GEMM acceleration method, which consists of two parts: a multi-thread kernel design scheme and an efficient tiling algorithm. A multi-thread kernel design is proposed to balance the amount of loading and computation in the threads corresponding to each tile. Tiles of various sizes correspond to different numbers of threads. Although the parallel programming interfaces of the CUDA and ROCm platforms require that the number of threads responsible for computing a tile is uniform, we use a corresponding filtering operation in the kernel execution process to effectively alleviate this limitation. An efficient tiling algorithm can choose the optimal scheme based on different GEMMs and GPUs. To measure the effect of tiling, we propose a new way of computing TLP based on wavefronts. The optimal tiling scheme is obtained by adjusting the tiling strategy according to the TLP. Finally, we obtain an efficient tiling algorithm based on the new TLP calculation method. In Section 4, the details of the proposed method are introduced.

3. Background

3.1. GEMM and batch GEMM

For a single GEMM, its accumulation routine is 𝐶 = 𝛼𝐴𝐵 + 𝛽𝐶, where 𝐴 ∈ 𝑅^(𝑀×𝐾), 𝐵 ∈ 𝑅^(𝐾×𝑁) and 𝐶 ∈ 𝑅^(𝑀×𝑁) are dense matrices, 𝑀, 𝑁, and 𝐾 represent matrix dimensions, and 𝛼 and 𝛽 are constant scalars. A common approach is tiling matrix C into multiple tiles [21,36], which utilizes the parallel computing of threads in the GPU to calculate each tile and splices together the results. As shown in Fig. 1(b), given a GEMM of size 𝑀 × 𝑁 × 𝐾, the matrix C is segmented into multiple tiles of size 𝑇𝑚 × 𝑇𝑛. Each workgroup is responsible for the calculation of one tile and needs to access the row section of matrix A of size 𝑇𝑚 × 𝐾 and the column section of matrix B of size 𝐾 × 𝑇𝑛. However, the row cross-section of A and the column cross-section of B (represented in Fig. 1(b) by the gray parts of matrices A and B, respectively) are too large to store in shared memory and registers. Hence, the row section of A and the column section of B are segmented into multiple A tiles of size 𝑇𝑚 × 𝑇𝑘 and B tiles of size 𝑇𝑘 × 𝑇𝑛, respectively. A partial result of C can be obtained by calculating with an A tile and a B tile, and accumulating the partial results yields the final result.

To batch-run multiple GEMMs, a naive routine computes each GEMM individually. However, when the matrix size is small, a single GEMM does not fully utilize the GPU’s computing power, leaving CUs idle [37,38]. To avoid this situation, a batch GEMM method is proposed to design multiple kernels for the various GEMMs on the GPU [36,39]. Compared to GEMM, batch GEMM is expressed as (𝑀 × 𝑁 × 𝐾 × 𝐵𝑠𝑖𝑧𝑒), where 𝑀, 𝑁 and 𝐾 represent the dimensions of the matrix, and 𝐵𝑠𝑖𝑧𝑒 represents the batch size. A batch GEMM uses a 3D grid, where grid.z is the batch size, and grid.x and grid.y are the length and width of a two-dimensional plane, respectively [40]. To balance the workload of a batch GEMM, a variety of tile sizes are used for GEMM tiling. The two-dimensional grid size corresponds to matrix C and the tiling strategy. Each tile is computed by the corresponding workgroup. A workgroup is decomposed into multiple wavefronts that execute on the CU. The 3D grid of batch GEMM is shown in Fig. 1(a).

3.2. GPU architecture and kernel occupancy

With the improvement of hardware architectures and parallel computing programming platforms (such as ROCm¹ and CUDA²), GPUs are becoming the most popular hardware accelerators. The two most commonly used GPUs are from AMD and NVIDIA, widely used in various scientific computing platforms. However, some basic concepts are expressed differently in ROCm and CUDA. We chose AMD’s official

1 https://rocm.docs.amd.com/en/latest/
2 https://docs.nvidia.com/cuda/
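The tile-by-tile accumulation of Section 3.1 can be sketched sequentially, with one (i0, j0) tile of C playing the role of one workgroup and the k-loop walking the 𝑇𝑚 × 𝑇𝑘 tiles of A and 𝑇𝑘 × 𝑇𝑛 tiles of B. This is an illustrative pure-Python sketch of the tiling scheme, not the GPU kernel itself.

```python
# Sketch of the tiled accumulation C = alpha*A@B + beta*C from
# Section 3.1: each (i0, j0) tile of C plays the role of one
# workgroup, and the k0-loop accumulates partial results from the
# T_m x T_k tiles of A and T_k x T_n tiles of B. Pure Python for
# clarity; tile sizes are illustrative.

def tiled_gemm(A, B, C, alpha, beta, Tm, Tn, Tk):
    M, K, N = len(A), len(B), len(B[0])
    out = [[beta * C[i][j] for j in range(N)] for i in range(M)]
    for i0 in range(0, M, Tm):            # tile row of C
        for j0 in range(0, N, Tn):        # tile column of C
            for k0 in range(0, K, Tk):    # accumulate partial results
                for i in range(i0, min(i0 + Tm, M)):
                    for j in range(j0, min(j0 + Tn, N)):
                        acc = sum(A[i][k] * B[k][j]
                                  for k in range(k0, min(k0 + Tk, K)))
                        out[i][j] += alpha * acc
    return out
```

On the GPU, the three outer loops become the workgroup grid and the inner accumulation is what each workgroup stages through shared memory and registers.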
Table 1
ROCm/CUDA terminology.

Compute Unit (CU) / Streaming Multiprocessor (SM): One of many parallel vector processors in a GPU that contains parallel ALUs. All waves in a workgroup are assigned to the same CU.
Kernel / Kernel: Functions launched to the GPU that are executed by multiple parallel workers on the GPU. Kernels can work in parallel with the CPU.
Wavefront / Warp: Collection of operations that execute in lockstep, run the same instructions, and follow the same control-flow path; individual lanes can be masked off. Think of this as a vector thread: a 64-wide wavefront is a 64-wide vector op.
Workgroup / Thread block: Group of wavefronts that are assigned to the same CU and can communicate through local memory.
Work-item/Thread / Thread: GPU programming models can treat this as a separate thread of execution, though this does not necessarily get forward sub-wavefront progress.
Global Memory / Global Memory: DRAM memory accessible by the GPU that goes through some layers of cache.
Local Memory / Shared Memory: Scratchpad that allows communication between wavefronts in a workgroup.
Private Memory / Local Memory: Per-thread private memory, often mapped to registers.
terminology for this paper to provide precise specifications. To clarify some differences and relationships between ROCm and CUDA terms, a comparison of terminology is given in Table 1.

A GPU is composed of multiple Shader Engines (SEs) and a command processor. Each SE integrates multiple CUs and its own workload manager. Each CU contains a large number of Arithmetic and Logic Units (ALUs), a small number of control units, and caches. Hence, GPUs are suitable for large numbers of simple parallel computing tasks. A GPU kernel consists of one or multiple workgroups, the size of which is determined by the number of wavefronts and threads. In the memory hierarchy, the GPU has global memory, local memory, and private memory, ordered from slow to fast by memory access speed; local memory and private memory are much smaller than global memory [41,42].

Kernel occupancy represents the actual utilization of compute unit resources by a kernel function on the GPU, which is the ratio of active wavefronts to the maximum number of wavefronts supported by a CU [35,43]. An active wavefront running on a CU requires resources such as Vector General-Purpose Registers (VGPR), Scalar General-Purpose Registers (SGPR), Local Data Share (LDS), etc. A wavefront can be activated and run on a CU when all required resources are available. When the utilization of CU resources is low, the number of active wavefronts is small, which leads to wasted hardware resources and degradation of the parallel performance of the kernel. On the other hand, when the number of active wavefronts on the CU increases, the resources available to each wavefront and the register storage space available to each work-item in the wavefront decrease [44,45].

The number of active wavefronts on a CU is mainly limited by the following factors: the number of work-items in each workgroup and the sizes of the VGPR, SGPR, and LDS. For example, in AMD’s MI100³ and MI210⁴, a wavefront consists of 64 work-items. When the number of work-items in a workgroup is less than or equal to 64, only one wavefront is included. The VGPR, SGPR, and LDS sizes on the CU impose a corresponding upper bound for each work-item. According to the kernel design, the resources on the CU need to be allocated before executing each work-item. When the resource requirements of the work-item are satisfied, the wavefront can become active and run on the CU. Otherwise, it will not run until other wavefronts accomplish their tasks and release resources. In order to fully utilize the hardware resources of the GPU and improve the efficiency of parallel computing, the kernel occupancy should be improved as much as possible without data overflow [46,47]. In batch GEMM, an efficient kernel design should properly allocate the data loading and computation workload for each work-item in the wavefront, so that the memory space and computing power on the CU can be utilized more efficiently [48,49].

4. Overview

4.1. Multi-thread kernel design

Tile size and kernel design are closely related in the design of batch GEMM algorithms, and there are two matrix tile design routes. The first is to design one tile to adapt to all GEMMs, and the second is to design various tiles to adapt to different GEMMs. Compared with the first method, for irregular GEMM the latter is more flexible and utilizes the computing resources of the GPU more efficiently. For GEMMs with various shapes and sizes, using a single tile can easily increase the workload differences between threads in multiple workgroups, affecting the allocation of computing resources. In this paper, we perform a multi-thread kernel design for the second matrix segmentation method. The two different tile design strategies are shown in Fig. 2, along with their effect on the occupancy of the 3D grid. For batch GEMM, different tile sizes lead to different numbers of workgroups, resulting in different 3D grid occupancies.

For a single GEMM, matrix C is tiled into multiple tiles. The tile size can be flexibly designed, and each tile can be computed in parallel without data interference. Each tile is calculated by the corresponding workgroup, and the tiles as a whole can be represented by a 2D grid. When the size and number of tiles are large enough, efficient parallel execution can usually be obtained. However, in real-world cases, the sizes of matrices in batch GEMM tend to be small and irregular, which leads to poor performance of traditional methods. Therefore, the previous method adopts a variety of tiles to adapt to the corresponding GEMMs, with each tile based on a unified number of threads, which leads to the workload of threads in large-scale tiles being much larger than that in small tiles. This gap in the workload of threads results in unbalanced thread loading and reduces GPU parallel computing efficiency. Table 2 lists the detailed parameters for tiles of various sizes based on the same work-item design (the number of work-items in the kernel is 128). 𝑊𝐶𝑃 and 𝑊𝐷𝐿 represent the computation

3 https://www.amd.com/system/files/documents/instinct-mi100-brochure.pdf
4 https://www.amd.com/content/dam/amd/en/documents/instinct-business-docs/white-papers/amd-cdna2-white-paper.pdf
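The resource limits on active wavefronts described in Section 3.2 amount to a min-over-limits calculation. The sketch below estimates occupancy under assumed per-CU limits; the limit values (maximum waves, VGPR and LDS capacities) are illustrative placeholders, not MI100/MI210 datasheet numbers.

```python
# Sketch: active wavefronts per CU = min over each resource limit,
# and kernel occupancy = active / maximum wavefronts (Section 3.2).
# The per-CU limits here are illustrative assumptions, not
# MI100/MI210 datasheet values.

def occupancy(wg_work_items, vgpr_per_item, lds_per_wg,
              wavefront_size=64, max_waves=40,
              vgpr_per_cu=65536, lds_per_cu=65536):
    waves_per_wg = -(-wg_work_items // wavefront_size)  # ceil division
    # Limit from vector registers: each work-item reserves its VGPRs.
    vgpr_limit = vgpr_per_cu // (vgpr_per_item * wavefront_size)
    # Limit from LDS: whole workgroups share one LDS allocation.
    wg_limit = lds_per_cu // lds_per_wg if lds_per_wg else max_waves
    lds_limit = wg_limit * waves_per_wg
    active = min(max_waves, vgpr_limit, lds_limit)
    return active, active / max_waves

active, occ = occupancy(wg_work_items=128, vgpr_per_item=64,
                        lds_per_wg=8192)
print(active, occ)  # 16 0.4
```

Raising a thread's register or LDS footprint lowers one of the limits, fewer wavefronts go active, and occupancy drops; this is the trade-off the workload-balancing design has to respect.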
Fig. 2. Two different tile design strategies for batch GEMMs. ((a) All GEMMs adopt the same tiling scheme, which is divided into multiple tiles of the same size. (b) Different GEMMs adopt different tiling schemes and are divided into multiple tiles of different sizes.)
Table 2
The common kernel design scheme for batch GEMM (there are significant workload gaps between threads).

Tile     𝑇𝑚   𝑇𝑛   𝑇𝑘     𝑊𝐶𝑃   𝑊𝐷𝐿
small    16   16   8/16   2     4/6
medium   32   32   8/16   8     12/16
large    64   64   8/16   32    40/48

amount and data loading amount of a work-item, respectively, and their calculation expressions are:

𝑊𝐶𝑃 = (𝑇𝑚 × 𝑇𝑛) / 𝑊𝑛𝑢𝑚    (1)

𝑊𝐷𝐿 = (𝑇𝑚 × 𝑇𝑛 + 𝑇𝑚 × 𝑇𝑘 + 𝑇𝑘 × 𝑇𝑛) / 𝑊𝑛𝑢𝑚    (2)

where 𝑊𝑛𝑢𝑚 represents the number of work-items responsible for computing the tile.

For different tiles, there is a significant gap in workload between threads (𝑊𝐶𝑃 ∈ [2, 32] and 𝑊𝐷𝐿 ∈ [4, 48]). The choice of 𝑇𝑘 also has a certain impact on the data load of a work-item: each thread is responsible for more data loads when 𝑇𝑘 is larger. For example, in the large tile, when the value of 𝑇𝑘 is set to 8 or 16, each work-item is responsible for loading 40 or 48 elements, respectively. The workload differences caused by these different tile sizes impact kernel performance.

To explore the impact of the number of work-items in the workgroup and the tile size on the performance of batch GEMM, some experiments were performed, whose results are given in Fig. 3. As shown in Fig. 3, under the condition that the number of GEMMs is large and 𝑀, 𝑁, and 𝐾 are large enough, various thread-kernels (with thread numbers 64, 128, 256, and 512) are used to compute multiple tiles (the nine tiles are shown in Fig. 3). In Fig. 3, four thread kernels commonly used in previous work are selected as benchmarks [21,34,36]. We used these kernels to investigate their performance under various tiles in comparative experiments. Fig. 3 shows that each kernel’s performance first increases and then decreases across the different tiles. When the tile size is small, the thread’s workload is also tiny. In this case, threads in the kernel only compute a few elements, so the threads’ computing power is not fully utilized. As the tile size increases, the number of elements that each thread needs to calculate and store also increases. Under the condition that the register data does not overflow, the computing efficiency of the thread continuously improves. When the tile corresponding to the thread is too large, the register data overflows, and the data will be transferred to the global memory. For example, for a 64-thread-kernel, when computing ‘‘8*8’’

speed of global memory is considerably lower than that of registers, threads’ data access efficiency decreases, and overall time consumption increases. At the same time, since thread workloads vary, when a thread with a heavy workload runs on the CU, fewer wavefronts are active on the CU, so the CU’s kernel occupancy (the ratio between the number of active wavefronts and the maximum number of supported wavefronts) is reduced. The CU remains in this low-occupancy state longer due to the longer work-item computation time.

To solve this problem, we propose a multi-thread kernel design, which ensures that the workload of each thread is balanced as much as possible. The experimental results in Fig. 3 show that performance varies across kernels when calculating the same tile. For example, the 128-thread kernel performs best when calculating a tile of ‘‘32*32’’, as shown in Fig. 3. The performance gap mentioned above is mainly because of the varying workloads of threads under different kernels, which affects the overall performance. For the 128-thread kernel, when calculating a tile of ‘‘32*32’’, each thread needs to complete the calculation of 8 elements and the loading of 16 elements. When calculating a tile of ‘‘64*64’’, the workload of the threads is heavy, and each thread needs to complete the calculation of 32 elements and the loading of 64 elements. When calculating larger tiles, the workload of the thread increases significantly. To avoid significant differences in workload between threads, we use a multi-thread kernel to calculate the various tiles by considering the computation amount (𝑊𝐶𝑃) and data loading amount (𝑊𝐷𝐿) of threads in the kernel. For larger tiles such as ‘‘32*64’’ and ‘‘64*64’’, a 256-thread kernel is used for computation. In this way, increasing the number of threads reduces each thread’s computation amount and data loading amount, thereby reducing the gaps between threads’ workloads and achieving load balancing. There are five tiles and two kernels (𝑊𝑛𝑢𝑚) for small and irregular batch matrix multiplication, as shown in Table 3. Compared to Table 2, we balance the thread workload by setting the tile size and number of kernel threads so that thread computation and data loading are as consistent as possible across different workgroups. In the calculation process of GEMM, five tile types are designed for GEMM calculations of different sizes, from ‘‘small’’ to ‘‘large’’. To ensure that the amounts of computation and data loading for the work-items responsible for computing different tiles are as equal as possible, the number of threads varies depending on the tile size. In Table 3, two different thread numbers are used (128 and 256), and the computation amount (𝑊𝐶𝑃) and data loading amount (𝑊𝐷𝐿) of the work-item in each scheme are given. Although the current ROCm and CUDA platform programming interfaces only support the kernel design of a uniform thread number, we use a screening operation in the early stage of kernel
|
||
and ‘‘32*32’’ tiles, respectively, each thread needs to compute 1 and 32 execution to achieve the effect of kernel design of multiple threads. For
|
||
elements in matrix C. It is obvious that ‘‘32*32’’ requires more register example, in this paper, the number of kernel threads is set to 256. When
|
||
memory. However, the register memory of each thread is precious. the tiles of ‘‘small’’, ‘‘small-medium’’ and ‘‘medium’’ are executed, the
|
||
When the maximum limit of the register memory is exceeded, the data extra threads will be terminated immediately and the corresponding
|
||
will be transferred to the global memory for storage. Because the access computing resources will be released because these tiles only need
|
||
|
||
5
|
||
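The per-work-item workload figures quoted above follow directly from the tile shapes. The sketch below is our own reading of Table 3, not code from the paper: it assumes each work-item computes T_m*T_n/W_num elements of C and that W_DL covers one T_m*T_k slice of A, one T_k*T_n slice of B, and the T_m*T_n tile of C, split evenly over the threads.

```python
# Sketch (our inference from Table 3, not the paper's code): per-work-item
# workload for a T_m x T_n tile computed by a W_num-thread kernel with
# inner-dimension step T_k.

def per_thread_workload(tm, tn, tk, w_num):
    w_cp = (tm * tn) // w_num                      # elements of C computed
    w_dl = (tm * tk + tk * tn + tm * tn) // w_num  # elements loaded/stored
    return w_cp, w_dl

# The five schemes of Table 3: (name, T_m, T_n, T_k, W_num).
for name, tm, tn, tk, w_num in [
    ("small", 16, 16, 16, 128), ("small-medium", 16, 32, 16, 128),
    ("medium", 32, 32, 16, 128), ("medium-large", 32, 64, 16, 256),
    ("large", 64, 64, 16, 256),
]:
    print(name, per_thread_workload(tm, tn, tk, w_num))
```

Under this reading the formula reproduces every (W_CP, W_DL) row of Table 3 — (2, 6), (4, 10), (8, 16), (8, 14), (16, 24) — as well as the 8-compute/16-load figures quoted above for the 128-thread kernel on a "32*32" tile.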
Y. Zhang et al. Journal of Systems Architecture 160 (2025) 103341
Fig. 3. Experimental results of multi-thread kernel.
Terminating threads early allows for a better allocation of computational resources to threads responsible for computing other tiles. With this implementation, we can achieve the effect of a multi-thread kernel. Even though the performance may be degraded in comparison with an actual multi-thread kernel, the experimental results in Section 5 demonstrate the excellent performance of this method.

Table 3
The multi-thread kernel design scheme with a more balanced workload.

Tile           T_m  T_n  T_k  W_num  W_CP  W_DL
small          16   16   16   128    2     6
small-medium   16   32   16   128    4     10
medium         32   32   16   128    8     16
medium-large   32   64   16   256    8     14
large          64   64   16   256    16    24

4.2. Tiling algorithm

4.2.1. Criteria for evaluation
The tiling can be seen as a re-assignment of the GEMM computation task. An efficient tiling algorithm can transform GEMM operations and improve hardware resource utilization. When various kernel designs are implemented, choosing an appropriate tiling scheme becomes a crucial issue. In general, for a GEMM, there is better parallelism within the workgroup when the tile size is larger. However, a larger tile means that the number of tiles is reduced; if there are too few tiles, the CUs cannot be fully utilized, resulting in a waste of computing resources. Therefore, choosing suitable tiling evaluation criteria is crucial. In previous studies, TLP was used to quantify the parallelism of tiling strategies on GPUs. Given a GEMM and a tiling strategy, its TLP can be calculated as follows:

$TLP = \sum_i \frac{M_i \times N_i}{T_{m_i} \times T_{n_i}} \times T_{workgroup}$   (3)

where M_i and N_i are the dimensions of matrix C of the i-th GEMM, T_mi and T_ni are the tile sizes chosen for matrix C, and T_workgroup is the number of threads in a workgroup. However, the above formulation only considers TLP at the workgroup level. Indeed, during the computation of a GEMM, the workgroup needs to be further transformed into wavefronts, which run on the CU. The execution process of batch GEMM can be divided into four phases: segmentation, workgroup, wavefront, and execution. In the segmentation phase, the GEMM is tiled into tiles of various sizes, and each tile is computed by a workgroup. Workgroups are further transformed into wavefronts based on their hardware resource requirements and the number of work-items. Finally, these wavefronts run in parallel on multiple CUs for the batch GEMM calculation. Due to the difference between tile sizes, the computation and data loading amounts of threads are not uniform across wavefronts, which leads to unbalanced hardware resource requirements. The execution time of the wavefronts on the CUs also differs, and the overall time of the batch GEMM is the maximum over all CU execution times. If the workload difference between wavefronts is too significant, the execution time of one wavefront will be excessive, increasing the overall calculation time.

Therefore, Eq. (3) does not consider the workload gaps between wavefronts. To solve this problem, we propose a new TLP calculation method as follows:

$TLP_{new} = \varphi\left(\sum_i \frac{M_i \times N_i}{T_{m_i} \times T_{n_i}}\right) \times T_{wavefront}$   (4)

where M_i, N_i, T_mi, and T_ni have the same meaning as in Eq. (3), T_wavefront is the number of work-items in a wavefront, and φ represents the conversion process from workgroups to wavefronts.

The conversion process mainly considers the following factors: the number of work-items in the workgroup; the sizes of the VGPR, SGPR, and LDS required by a work-item; and the maximum number of wavefronts supported in the CU. These factors are related to the GPU hardware architecture. Next, take AMD's MI210, which is based on the CDNA2.0 architecture, as an example. Under the limitation of the number of work-items in the workgroup, the number of wavefronts can be calculated as follows:

$WF_{wg} = 16 \times \mathrm{ceil}\left(\frac{WI_{wg}}{64}\right)$   (5)

where WF_wg is the maximum number of wavefronts under the limit of the number of work-items in the workgroup, and WI_wg represents the number of work-items in the workgroup. Eq. (5) indicates that when the number of work-items is less than or equal to 64, a workgroup contains only one wavefront, and the number of workgroups in the CU is limited to 16.

Limited by the sizes of the VGPR, SGPR, and LDS, the number of wavefronts can be calculated as follows:

$WF_V = 4 \times \mathrm{floor}\left(\frac{VGPR_{max}}{VGPR_{used} \times 64}\right)$   (6)

where WF_V is the maximum number of wavefronts under the limit of the VGPR size, VGPR_max is the size of the VGPR in a Single Instruction Multiple Data (SIMD) unit, and VGPR_used is the VGPR size used by a work-item.
In the CDNA2.0 hardware architecture, each CU consists of four SIMDs.

$WF_S = \mathrm{floor}\left(\frac{SGPR_{max}}{SGPR_{used}}\right)$   (7)

where WF_S is the maximum number of wavefronts under the limit of the SGPR size, SGPR_max is the size of the SGPR in the CU, and SGPR_used is the size of the SGPR used by a wavefront.

$WF_L = \mathrm{floor}\left(\frac{LDS_{max}}{LDS_{used}}\right) \times \mathrm{ceil}\left(\frac{WI_{wg}}{64}\right)$   (8)

where WF_L is the maximum number of wavefronts under the limit of the LDS size, LDS_max is the size of the LDS available to a workgroup, LDS_used is the size of the LDS used by a workgroup, and WI_wg has the same meaning as in Eq. (5).

To sum up, the number of wavefronts should meet all of the above limitations, and it is calculated as follows:

$WF = \min(WF_{wg}, WF_V, WF_S, WF_L, WF_C)$   (9)

where WF is the number of activated wavefronts, and WF_C is the maximum number of wavefronts supported in the CU.

The number of wavefronts and the corresponding number of threads are introduced into Eq. (4) to compute the TLP more accurately and appropriately. In contrast, Eq. (3) only considers the workload at the workgroup level and neglects the further conversion between workgroups and wavefronts at runtime. Eq. (3) is valid only if the following two conditions are satisfied: first, all thread computation and data loading amounts are consistent; second, the hardware resources required by the activated wavefronts do not exceed the limits of the CU. Note that for GEMMs of different precision, threads have different computing resource requirements (VGPR, SGPR, LDS) during the computation process. Therefore, for matrices of different precision, the values of VGPR_used, SGPR_used, and LDS_used in Eqs. (6)-(8) differ, which affects the number of activated wavefronts.

4.2.2. Tiling fine-tuning
For batch GEMM, an initial tiling scheme is first assigned to solve the problems of context switching and low hardware resource utilization caused by the variable scale of the matrices. Then, the tiling scheme is adjusted according to the TLP estimate of the batch GEMM and the hardware architecture of the GPU; finally, the best tiling scheme is obtained. In the first stage, the tile size chosen for each GEMM according to the dimensions of the matrix should meet the following conditions:

$T_{m_i} \le M_i \ \text{and} \ M_i \bmod T_{m_i} = 0$
$T_{n_i} \le N_i \ \text{and} \ N_i \bmod T_{n_i} = 0$   (10)
$T_{k_i} \le K_i \ \text{and} \ K_i \bmod T_{k_i} = 0$

where T_mi and T_ni represent the tile dimensions in the tiling scheme, and T_ki is the sub-tile size along the K dimension. There are two issues. (1) After the first phase, the batch GEMM only has an "initial scheme" that cannot achieve optimal parallel computing efficiency. (2) Due to the variability of matrix sizes in batch GEMM, one or several of the B_size, M, N, and K values may be particularly small, which is called an extreme GEMM case. In this case, the "initial scheme" cannot produce enough tiles, which leaves some CUs idle and wastes GPU computing power.

To solve these problems, the "initial scheme" is adjusted reasonably and efficiently in the second stage. Larger matrices are segmented with smaller tiles, and the number of tiles is increased by reducing the tile size to avoid idle CUs. The details are as follows: for a GEMM with an appropriate "initial scheme", to avoid wasting GPU hardware resources, some larger GEMMs are cut with smaller tiles to ensure that the number of tiles is sufficient. For example, tiles whose initial value is "64*64" are segmented with "32*32" tiles. As a result, the number of tiles increases as the tile size decreases. This fine-tuning approach ensures that the CUs are not idle, increasing the utilization of hardware resources at the expense of intra-tile parallelism.

Algorithm 1 The tiling algorithm.
1:  Initialize TLP_threshold, TLP_new = 0, total_workgroup = 0, total_wavefront = 0;
2:  for i = 0 to B_size - 1 do
3:      Calculate T_mi, T_ni according to Eq. (10);
4:      total_workgroup += (M_i * N_i) / (T_mi * T_ni);
5:  end for
6:  TLP_new = φ(total_workgroup) × T_wavefront;
7:  Tile[size] ranges from "large" to "small";
8:  while TLP_new < TLP_threshold do
9:      total_workgroup = 0;
10:     for j = 0 to B_size - 1 do
11:         if Tile[j] is "large" then
12:             Set Tile[j] to "medium-large";
13:         else if Tile[j] is "medium-large" then
14:             Set Tile[j] to "medium";
15:         else if Tile[j] is "medium" then
16:             Set Tile[j] to "small-medium";
17:         else if Tile[j] is "small-medium" then
18:             Set Tile[j] to "small";
19:         end if
20:         total_workgroup += (M_j * N_j) / (T_mj * T_nj);
21:     end for
22:     TLP_new = φ(total_workgroup) × T_wavefront;
23: end while

Note that the loop condition on line 8 is "TLP_new < TLP_threshold": fine-tuning continues only while the threshold has not been reached, and total_workgroup is reset before each re-evaluation.

TLP_threshold is used as a threshold to ensure parallelism among multiple tiles in the fine-tuning phase. Note that TLP_threshold has an important influence on the selection of the tiling scheme for different hardware architectures. As a measure, the TLP values of the batch GEMM vary according to the different tiling schemes. The setting of the TLP_threshold value is related to the architecture of the GPU, because it uses the number of wavefronts and the number of threads in a wavefront to measure the parallelism of the tiling scheme. The hardware resources and the maximum number of wavefronts supported by each CU are diverse, so a corresponding TLP_threshold should be set for each GPU architecture.

The specific process of selecting a tiling scheme for batch GEMM is given in Algorithm 1: (1) When a batch GEMM is given, an "initial scheme" is obtained according to Eq. (10). (2) The TLP of this scheme is calculated according to the given batch GEMM and tiling scheme. (3) The TLP of the current tiling scheme is compared with TLP_threshold. If the threshold is not reached, the fine-tuning operation is performed, the current tiling scheme is changed, and the process returns to step (2); if the current TLP is greater than or equal to the threshold, go to step (4). (4) The batch GEMM is calculated according to the final tiling scheme. In the above procedure, the TLP is used as an evaluation criterion to measure the effectiveness of the tiling scheme on the batch GEMM. If the threshold is not reached, fine-tuning is used to adjust the scheme and improve the utilization of GPU hardware resources. The optimal tiling scheme ensures an optimal implementation at the GEMM and workgroup levels. After the final tiling scheme is determined, the multi-thread kernel is configured based on the tile size so that the wavefront and work-item levels can reach a "workload balance" state.

The proposed method is implemented on both AMD and NVIDIA GPU platforms. The hardware characteristics of the GPU platform can also significantly impact GEMM performance. For example, on the AMD and NVIDIA platforms, threads are grouped into wavefronts and warps as the basic execution units, containing 64 and 32 threads, respectively. The number of threads in the kernel needs to be an integer multiple of the number of threads in a wavefront or warp to improve kernel occupancy. Meanwhile, the sizes of the registers and shared memory can affect parameter settings during implementation on different hardware architectures.
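The per-CU wavefront caps of Eqs. (5)-(9) are easy to evaluate numerically. The sketch below is our own illustration; the VGPR/SGPR/LDS figures passed in are hypothetical placeholders, not measured MI210 values.

```python
import math

# Sketch of Eqs. (5)-(9) (our illustration, hypothetical resource figures):
# the number of wavefronts a CU can keep active is the minimum over the
# workgroup, VGPR, SGPR, LDS, and hardware limits.

def active_wavefronts(wi_wg, vgpr_used, sgpr_used, lds_used,
                      vgpr_max, sgpr_max, lds_max, wf_cu_max):
    wf_wg = 16 * math.ceil(wi_wg / 64)                    # Eq. (5)
    wf_v = 4 * (vgpr_max // (vgpr_used * 64))             # Eq. (6), 4 SIMDs per CU
    wf_s = sgpr_max // sgpr_used                          # Eq. (7)
    wf_l = (lds_max // lds_used) * math.ceil(wi_wg / 64)  # Eq. (8)
    return min(wf_wg, wf_v, wf_s, wf_l, wf_cu_max)        # Eq. (9)

# Hypothetical usage figures for a 256-work-item workgroup:
print(active_wavefronts(wi_wg=256, vgpr_used=64, sgpr_used=128, lds_used=8192,
                        vgpr_max=65536, sgpr_max=3200, lds_max=65536,
                        wf_cu_max=32))
```

With these placeholder numbers the SGPR cap (Eq. (7)) happens to be the binding limit, which illustrates why all five terms must enter the min of Eq. (9).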
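Algorithm 1 can be sketched as host-side Python. This is our simplified rendering, not the paper's implementation: φ is reduced to one wavefront per workgroup, the tile ladder follows Table 3, and the initial tile is the largest one satisfying the divisibility conditions of Eq. (10).

```python
# Sketch of Algorithm 1 (our simplification; phi collapsed to one
# wavefront per workgroup). Tile names and shapes follow Table 3.
TILES = {
    "small": (16, 16), "small-medium": (16, 32), "medium": (32, 32),
    "medium-large": (32, 64), "large": (64, 64),
}
ORDER = list(TILES)  # small -> large

def initial_tile(m, n):
    """Eq. (10): the largest tile whose T_m, T_n evenly divide M, N."""
    for name in reversed(ORDER):
        tm, tn = TILES[name]
        if tm <= m and m % tm == 0 and tn <= n and n % tn == 0:
            return name
    return "small"

def tlp(gemms, chosen, wavefront=64):
    wg = sum((m * n) // (TILES[c][0] * TILES[c][1])
             for (m, n), c in zip(gemms, chosen))
    return wg * wavefront

def tiling_scheme(gemms, tlp_threshold):
    chosen = [initial_tile(m, n) for m, n in gemms]
    # Fine-tune: demote every tile one size step while TLP is short.
    while tlp(gemms, chosen) < tlp_threshold and any(c != "small" for c in chosen):
        chosen = [ORDER[max(ORDER.index(c) - 1, 0)] for c in chosen]
    return chosen

print(tiling_scheme([(64, 64), (128, 128)], tlp_threshold=2048))
```

For the two GEMMs above, the initial "large" tiles yield only 5 workgroups; the fine-tuning loop demotes the tiles until the estimated TLP clears the threshold.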
Table 4
The configuration of platforms for evaluation.

Platform setup   AMD-platform    NVIDIA-platform
CPU              EPYC 7763       Platinum 8358
GPU              MI210           A800
OS               Ubuntu 20.04    Ubuntu 20.04
ROCm/CUDA        ROCm 5.6        CUDA 12.0

Table 5
The configuration of GPUs for evaluation.

Name           MI210                          A800
Architecture   CDNA 2.0                       Ampere
Core           1700 MHz                       1410 MHz
Caches         L1 16 KB (per CU), L2 16 MB    L1 192 KB (per SM), L2 40 MB
Memory         64 GB 3.2 Gbps HBM2            80 GB 2.4 Gbps HBM2
Bandwidth      1.6 TB/s                       2.04 TB/s

Based on this difference, the proposed method considers parallelism at the wavefront or warp level when performing matrix segmentation on the two GPU platforms. In this way, it can flexibly select tiling schemes based on the hardware characteristics of the GPU to achieve optimal performance, avoid exceeding the maximum register limit, and prevent data overflow, which improves its applicability to various hardware architectures.

5. Evaluation

5.1. Setup

Experiment platform and matrix generation. The overall configuration of the experimental platform and the details of the two GPUs are shown in Tables 4 and 5, respectively. To ensure the irregularity and variability of the input matrices, the GEMM size parameters M, N, and K are randomly generated within corresponding ranges ([Min, Max_M(N)] and [Min, Max_K]). Max_M, Max_N, and Max_K represent the upper bounds of M, N, and K, respectively. The lower bound for each experiment is denoted uniformly by Min; in this paper, the value of Min is set to 16. For example, Max_M(N) = 512 and Max_K = 128 indicate that the range of matrix dimensions is M ∈ [16, 512], N ∈ [16, 512], and K ∈ [16, 128]. Thus, multiple sets of matrix dimension ranges can be obtained, and the parameters needed for GEMM generation are chosen from the different value ranges by random selection.

Comparison methods. First, for the two GPU experimental platforms, the default GEMM processing methods rocBLAS [6] and cuBLAS [7], provided by the respective GPU manufacturers, are chosen as the basic comparison methods to demonstrate the effectiveness of the proposed method. Since these methods do not support batch invocation, rocBLAS and cuBLAS compute batch GEMM in a loop; no stream operations are used during the computation. Meanwhile, we also compare with CUTLASS [23], which supports batch GEMM based on sorting and built-in tiles. We then compare with MAGMA [8], maintained by the University of Tennessee ICL Lab, which only extends grid.z to support batch GEMM and has no fine-grained optimization strategy; the MAGMA comparison experiments were run on both GPU platforms. Finally, to show the advancement of our proposed method, we compare with the state-of-the-art methods of Wang [36] and Li [21] on their respective platforms. All of the above methods perform a warm-up operation to eliminate the effect of the first kernel launch.

Evaluation criteria. In the following experiments, there are 12 sets of GEMM dimension ranges. The experiments with batch sizes 8, 16, 32, 64, 128, and 256 were run continuously for ten epochs under each set of value ranges. The experimental results are reported as the average GFLOPS (Giga Floating-point Operations Per Second), which is calculated as:

$GFLOPS = \frac{\sum_{i=0}^{n-1} 2(M_i \times N_i \times K_i)}{total\_time \times 1.0e9}$   (11)

where M_i, N_i, and K_i represent the matrix dimensions of the i-th GEMM, total_time represents the running time on the GPU, and n represents the batch size. For simplicity, the experimental data are single-precision floating-point values stored in row-major format. The experimental results are averaged over 10 consecutive runs and rounded to two decimal places.

5.2. Speed up

On the two platforms, we first compare with the default methods rocBLAS and cuBLAS. These two methods do not support batched irregular GEMMs, so we convert the batch GEMM into multiple single GEMMs and compute the results. The specific experimental results are shown in Figs. 4-5, which show that the proposed method achieves 5.09× and 7.18× average speedups compared to rocBLAS and cuBLAS. This result is primarily because these methods do not support GEMMs of different scales when computing batch GEMM, so they can only compute one GEMM at a time. When faced with small matrices, the computational resources of the GPU cannot be fully utilized due to the cost of context switching between multiple GEMMs. As the batch size gradually increases, the advantage of the proposed method becomes more evident. This shows that for batched and irregular GEMMs, rocBLAS and cuBLAS are at a disadvantage in terms of computational efficiency and switching between instances. Meanwhile, we also compare with CUTLASS, which handles batch GEMM using sorting to mitigate the significant workload differences between multiple matrix multiplications. Fig. 5 shows that the proposed method has a 4.64× speedup, because CUTLASS's built-in tiles are unsuitable when the matrix dimensions are small. Therefore, the proposed method provides better acceleration than CUTLASS for batched, irregular, and small-size matrix multiplication. We then perform a detailed comparison and analysis of the experimental performance against MAGMA. The proposed method has 4.37× and 3.36× speed improvements compared to MAGMA. Figs. 4-5 show that the advantage of our method becomes more pronounced as the batch size increases. This is because MAGMA only uses the largest GEMM size in the batch to set grid.x; due to the irregularity of the matrix sizes, a large number of computational resources in the grid are idle. In this case, the proposed method employs fine-grained filtering operations to ensure efficient utilization of computational resources, which is more evident when the difference between matrix dimensions is significant.

As shown in Fig. 4, the proposed method achieves an average 1.88× speedup compared to Wang. It is noted that the advantage of the proposed method is more pronounced when Max_K and Max_M are small. For example, in the case of (Max_M(N) = 128, Max_K = 128), the average speedup reaches 1.95×. This is mainly because, when the matrix dimensions are small, there are not enough tiles to cover the time consumption of data loading in the wavefront, which is more pronounced in workgroups with heavy loads. The proposed method adjusts the wavefront workload corresponding to the tiles through a multi-thread kernel and ensures consistent computation and data loading across different workgroups. It also shows that load and computation balancing between wavefronts is conducive to improving the efficiency of GPU parallel computing. On the NVIDIA platform, Fig. 5 shows that the proposed method has an average 1.94× speedup compared to Li, and its advantage becomes clearer as the batch size increases.
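The size-generation scheme of Section 5.1 and the throughput metric of Eq. (11) can be sketched together; this is our own illustration (the timing value is a placeholder, not a measurement).

```python
import random

MIN_DIM = 16  # lower bound "Min" used throughout Section 5.1

def random_batch(batch, max_mn, max_k, seed=0):
    """Draw (M, N, K) uniformly from [Min, Max_M(N)] and [Min, Max_K]."""
    rng = random.Random(seed)
    return [(rng.randint(MIN_DIM, max_mn), rng.randint(MIN_DIM, max_mn),
             rng.randint(MIN_DIM, max_k)) for _ in range(batch)]

def gflops(gemms, total_time_s):
    """Eq. (11): 2*M*N*K flops per GEMM, summed, over seconds * 1e9."""
    return sum(2 * m * n * k for m, n, k in gemms) / (total_time_s * 1.0e9)

# An illustrative batch of 8 irregular GEMMs and a placeholder 1 ms runtime:
batch = random_batch(8, max_mn=512, max_k=128)
print(round(gflops(batch, total_time_s=1e-3), 2))
```

The same `gflops` helper applies unchanged to the fixed GEMM shapes of Table 6.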
Fig. 4. The comparative results on MI210. (5.09×, 4.37×, 1.88× speedup over rocBLAS, MAGMA, Wang).
Fig. 5. The comparative results on A800. (7.18×, 4.64×, 3.63×, 1.94× speedup over cuBLAS, CUTLASS, MAGMA, Li).
Fig. 6. The kernel occupancy on two GPU platforms.
Fig. 7. The time overhead of tiling algorithm.
There are two reasons for this speedup. (1) Li et al. used batching to balance the workload among different blocks but did not consider the difference between the workloads of threads in different tiles. (2) When selecting the tiling scheme, the TLP is calculated only at the block level, and the fine-grained warp level is neglected, which leads to an inaccurate calculation of the TLP. The proposed method adjusts the wavefront workload corresponding to the tiles through a multi-thread kernel and ensures consistent computation and data loading across different workgroups. It also shows that load and computation balancing between wavefronts is conducive to improving the efficiency of GPU parallel computing.

5.3. Kernel occupancy

To explore the difference between the proposed method and the comparison methods in terms of GPU resource utilization, we present the kernel occupancy of the various methods on the two GPU platforms. The formula for kernel occupancy can be expressed as:

$kernel\ occupancy = \frac{Num\_activated}{Num\_total}$   (12)

To obtain more accurate performance metrics, we utilize Omniperf (https://github.com/ROCm/omniperf) and Nsight Compute (https://docs.nvidia.com/nsight-compute/NsightCompute/index.html), profiling tools provided by AMD and NVIDIA, to evaluate the resource utilization of the kernel during execution. Kernel occupancy has distinct interpretations owing to the differences in GPU architecture between the AMD MI210 and the NVIDIA A800. On the AMD platform, Num_activated is the number of activated wavefronts and Num_total is the theoretical number of wavefronts that a CU can execute simultaneously. On the NVIDIA platform, Num_activated and Num_total represent the number of activated warps and the number of warps that are theoretically parallelizable simultaneously.

The results of the experiment are shown in Fig. 6. By comparing with rocBLAS and cuBLAS, it can be seen that the proposed method has a clear advantage in the case of batch GEMM. The proposed method is also in the best position compared to the other methods (CUTLASS, MAGMA, Wang, Li), showing high efficiency in the utilization of GPU resources. As shown in Fig. 6, the proposed method consistently maintains the optimal kernel occupancy on both GPU platforms, which indicates that it can better exploit the computing power of the GPU.

5.4. The overhead of the tiling algorithm

This section presents the proportion of the runtime taken up by the tiling algorithm when executing the proposed method on the two GPU platforms with various batch sizes. The experimental results are presented in Fig. 7, from which it is evident that the tiling algorithm's share of the runtime decreases as the batch size increases. When the batch size is 8, the runtime share of the tiling algorithm on the two GPU platforms is 6.06% and 6.37%, respectively. As the batch size increases, more and more GEMMs are executed on the GPU, and the execution time of these GEMMs takes up most of the time, resulting in a smaller runtime share for the tiling algorithm. For example, with a batch size of 1024, the tiling algorithm takes less than 1% of the runtime. The experimental results on the two GPUs indicate that the time overhead of the tiling algorithm in the batch GEMM execution process is negligible, especially when the batch size is large. In real-world scenarios such as deep learning, where a large number of GEMM operations are often required, the tiling algorithm has little overhead in the execution process.
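Eq. (12) is a simple ratio; a minimal sketch (our illustration, with placeholder counts rather than profiler output):

```python
# Sketch of Eq. (12): occupancy as activated wavefronts (warps on NVIDIA)
# over the theoretical per-CU (per-SM) maximum.
def kernel_occupancy(num_activated, num_total):
    return num_activated / num_total

# e.g. 24 activated wavefronts against a hypothetical 32-wavefront ceiling:
print(kernel_occupancy(24, 32))  # -> 0.75
```

In practice, the two counts are read from Omniperf or Nsight Compute rather than computed by hand.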
Fig. 8. The performance improvement of the proposed TLP on MI210. (1.077× average speedup).
5.5. The performance benefits of the proposed TLP

This section presents the comparative experimental results on the two GPU platforms to provide a more detailed evaluation of the proposed TLP. The detailed experimental results are shown in Figs. 8-9, from which it is clear that the proposed TLP performs better overall than the traditional TLP, with speedups of 1.077× and 1.085× on the MI210 and A800, respectively. From Fig. 8, the proposed method improves performance significantly when the batch size is larger. For example, on the MI210, the proposed method has an average speedup of 1.04× when batch size <= 16; when batch size >= 32, it improves performance by 1.10×. This performance gap arises because, when the batch size and matrix dimensions are small, it is difficult to fully utilize the hardware resources. When there are a large number of tiles, the proposed TLP can more accurately evaluate the threads' workload and select the optimal tiling scheme. The same performance trend is also reflected on the A800 platform, where the proposed TLP yields performance improvements of 1.04× and 1.11× when batch size <= 16 and batch size >= 32, respectively. The effectiveness of the proposed TLP is thus further demonstrated by the comparative experiment results on the two GPU platforms.

5.6. The latency

This section compares kernel latency on the two GPU platforms to provide a more detailed evaluation of the proposed method. We measured kernel latency with different batch sizes in the comparative experiment; the detailed experimental results are shown in Fig. 10. On the MI210, the proposed method has latency reductions of 3.87×, 4.53×, and 1.62× compared to rocBLAS, MAGMA, and Wang, respectively. The proposed method has the lowest latency on the MI210, indicating higher computational efficiency and an effective reduction of latency. On the A800, the proposed method shows improvements of 3.02×, 2.59×, 2.45×, and 1.89× compared to cuBLAS, MAGMA, CUTLASS, and Li, respectively. Fig. 10 shows that as the batch size gradually increases, the kernel latency increases on both GPU platforms. rocBLAS and cuBLAS have the highest latency as the batch size increases; this is because the traditional loop scheduling method significantly increases latency due to context switching between kernels when the batch size is large. From Fig. 10, it can also be seen that some methods exhibit different latency behavior at various batch sizes. For example, when batch size <= 16, MAGMA has the highest latency on the two GPU platforms; when the batch size is large, its computational performance improves, indicating that MAGMA performs better when there are many matrices. The experimental results on the two platforms show that the proposed method has the lowest latency under various batch sizes, indicating better performance and broad applicability.

5.7. The improved performance on Inception layers of CNN

Modern CNN model architectures often have multiple branches to capture features at different scales. Convolution operations of different scales in each branch can be represented as batch GEMM operations with various dimensions, e.g. GoogleNet [13], DenseNet [50], SqueezeNet [12], etc. To demonstrate the effectiveness of the proposed method in real-world scenarios, we use various Inception modules as a typical application and perform the forward computation process on the two GPU platforms. The Inception module involves a large number of irregular, small-size GEMM operations.
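The conv-to-GEMM correspondence behind these shapes follows the standard im2col lowering: M = H_out × W_out, N = number of filters, K = C_in × k_h × k_w. The sketch below is our illustration; the 28×28 spatial size and 192 input channels are inferred from Table 6's Inception-1 row, not stated explicitly in the text.

```python
# Sketch: map an im2col-lowered convolution to GEMM dimensions.
# The 28x28 feature map and 192 input channels are our inference from
# Table 6 (Inception-1), used here only for illustration.

def conv_as_gemm(h_out, w_out, c_in, k_h, k_w, num_filters):
    """Return (M, N, K) of the GEMM equivalent to one convolution."""
    return (h_out * w_out, num_filters, c_in * k_h * k_w)

# Four parallel 1x1 branches over a 28x28, 192-channel feature map:
branches = [conv_as_gemm(28, 28, 192, 1, 1, f) for f in (96, 64, 32, 16)]
print(branches)
```

The resulting shapes, (784, 96, 192) through (784, 16, 192), match the Inception-1 row of Table 6, which is why one module's forward pass maps naturally onto a small, irregular batch GEMM.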
Fig. 9. The performance improvement of the proposed TLP on A800. (1.085× average speedup).
Fig. 10. The latency performance of the kernel on two GPU platforms.
MIOpen7 and cuDNN8 are used as benchmark implementations on the other Inception module, and the dimensions of these matrices are
|
||
both GPU platforms. In this section, we select several commonly used smaller than the former two. Finally, the proposed method has been
|
||
Inception modules to evaluate the proposed method’s speedup perfor- proven to significantly accelerate CNN models with various branch
|
||
mance. The GEMM sizes in Inception modules are shown in Table 6. structures on two different GPU platforms, particularly in scenarios
|
||
Fig. 11 shows the speedup performance of the proposed method in each involving multiple branches, irregular shapes, and small dimensions.
|
||
Inception module. As shown in Fig. 11, the average speedups are 2.88×
|
||
and 1.87× respectively. The gray boxes represent the average speedup 6. Conclusion
|
||
ratios of the different Inception modules in Fig. 11. The experimental
|
||
results suggest that the Inception 8–9 series has the highest average In this paper, we propose a load-balanced batch GEMM acceleration
|
||
speedup ratio (3.68× and 2.66× respectively) among the Inception method for the problem of low parallel computing efficiency and poor
|
||
modules, because Inception 8–9 has more matrix shapes compared to hardware resource utilization in batch, irregular, and variable matrix
|
||
multiplication scenarios. The kernel occupancy and hardware resource
|
||
utilization can be effectively improved by a multi-thread kernel design
|
||
7
|
||
https://github.com/ROCm/MIOpen that balances the computational and data load in the work-item. A
|
||
8
|
||
https://github.com/NVIDIA/cudnn-frontend novel approach to TLP computation is devised, where the parallelism of
|
||
|
||
12
Fig. 11. The speedup performance on Inception layers.
Table 6
The size of GEMM in various Inception modules.

Inception module   GEMM size (M × N × K)
Inception-1        784 × 96 × 192, 784 × 64 × 192, 784 × 32 × 192, 784 × 16 × 192
Inception-2        784 × 64 × 192, 784 × 32 × 192, 784 × 128 × 192
Inception-3        196 × 192 × 192, 196 × 16 × 192, 196 × 96 × 192, 196 × 64 × 192
Inception-4        196 × 64 × 192, 196 × 24 × 192, 196 × 160 × 192
Inception-5        196 × 64 × 192, 196 × 128 × 192, 196 × 24 × 192
Inception-6        196 × 112 × 192, 196 × 144 × 192, 196 × 32 × 192, 196 × 64 × 192
Inception-7        196 × 256 × 192, 196 × 160 × 192, 196 × 128 × 192
Inception-8        49 × 160 × 192, 49 × 128 × 192, 49 × 256 × 192, 49 × 160 × 192, 49 × 32 × 192
Inception-9        49 × 192 × 192, 49 × 128 × 192, 49 × 384 × 192, 49 × 192 × 192, 49 × 48 × 192
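To make the workload in Table 6 concrete, the sketch below issues one independent matrix product per branch of Inception-1. This is a plain NumPy reference, not the proposed GPU kernel; the helper name `batch_gemm` and the random inputs are illustrative assumptions. It shows the small, irregular shapes that a variable-size batch GEMM implementation must handle in a single launch.

```python
import numpy as np

# GEMM shapes (M, N, K) for the four branches of Inception-1 (Table 6);
# each branch is an independent product C = A @ B with A: M x K, B: K x N.
INCEPTION1_SHAPES = [(784, 96, 192), (784, 64, 192),
                     (784, 32, 192), (784, 16, 192)]

def batch_gemm(shapes, seed=0):
    """Naive reference for a variable-size batch GEMM: one GEMM per branch."""
    rng = np.random.default_rng(seed)
    outputs = []
    for m, n, k in shapes:
        a = rng.standard_normal((m, k), dtype=np.float32)  # im2col'd activations
        b = rng.standard_normal((k, n), dtype=np.float32)  # flattened filters
        outputs.append(a @ b)                              # C has shape (m, n)
    return outputs

outputs = batch_gemm(INCEPTION1_SHAPES)
```

Each product here costs only a few tens of MFLOPs, far too little to saturate a modern GPU on its own, which is why such irregular products are grouped into one batched kernel rather than launched one at a time.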
the tiling scheme is measured by the number of activated wavefronts. This approach allows the optimal tiling scheme to be selected based on different GPU architectures. Experiments are conducted on two GPU platforms to validate the effectiveness and progress of our proposed method.
Future work includes exploring batch GEMM under various precision formats. With the development of Transformer-based models, many GEMM operations are involved in the training and inference process of Large Language Models (LLMs), which often use lower-precision formats such as FP16, FP8, etc. For example, quantized LLMs often involve GEMM operations where the weight matrices and activation values have different precisions, e.g. W4A16, W8A8. More complex precisions and storage formats pose challenges to the performance of GEMM operations.

CRediT authorship contribution statement

Yu Zhang: Writing – review & editing, Writing – original draft. Lu Lu: Writing – review & editing, Supervision. Zhanyu Yang: Writing – review & editing. Zhihong Liang: Supervision, Conceptualization. Siliang Suo: Supervision, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of Guangdong Province (2024A1515010204) and the Technological Research Project of Southern Power Grid Company (ZBKJXM20232483).

Data availability

No data was used for the research described in the article.

References

[1] P. Valero-Lara, I. Jorquera, F. Lui, J. Vetter, Mixed-precision S/DGEMM using the TF32 and TF64 frameworks on low-precision AI tensor cores, in: Proceedings of the SC'23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 2023, pp. 179–186.
[2] H. Martínez, S. Catalán, A. Castelló, E.S. Quintana-Ortí, Parallel GEMM-based convolutions for deep learning on multicore ARM and RISC-V architectures, J. Syst. Archit. (2024) 103186.
[3] J. Fornt, P. Fontova-Musté, M. Caro, J. Abella, F. Moll, J. Altet, C. Studer, An energy-efficient GEMM-based convolution accelerator with on-the-fly im2col, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 31 (11) (2023) 1874–1878.
[4] H. Kim, W.J. Song, LAS: locality-aware scheduling for GEMM-accelerated convolutions in GPUs, IEEE Trans. Parallel Distrib. Syst. 34 (5) (2023) 1479–1494.
[5] W. Yang, J. Fang, D. Dong, X. Su, Z. Wang, Optimizing full-spectrum matrix multiplications on ARMv8 multi-core CPUs, IEEE Trans. Parallel Distrib. Syst. (2024).
[6] AMD, Next generation BLAS implementation for ROCm platform, 2024, https://github.com/ROCm/rocBLAS.
[7] B. Tuomanen, Hands-On GPU Programming with Python and CUDA: Explore High-Performance Parallel Computing with CUDA, Packt Publishing Ltd, 2018.
[8] ICL, Matrix algebra for GPU and multicore architectures, 2024, https://icl.utk.edu/magma/.
[9] T. Faingnaert, T. Besard, B. De Sutter, Flexible performant GEMM kernels on GPUs, IEEE Trans. Parallel Distrib. Syst. 33 (9) (2021) 2230–2248.
[10] W.S. Moses, I.R. Ivanov, J. Domke, T. Endo, J. Doerfert, O. Zinenko, High-performance GPU-to-CPU transpilation and optimization via high-level parallel constructs, in: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2023, pp. 119–134.
[11] H. Kim, H. Nam, W. Jung, J. Lee, Performance analysis of CNN frameworks for GPUs, in: 2017 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS, IEEE, 2017, pp. 55–64.
[12] F.N. Iandola, S. Han, M.W. Moskewicz, K. Ashraf, W.J. Dally, K. Keutzer, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size, 2016, arXiv preprint arXiv:1602.07360.
[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[14] G. Pant, D. Yadav, A. Gaur, ResNeXt convolution neural network topology-based deep learning model for identification and classification of pediastrum, Algal Res. 48 (2020) 101932.
[15] S. Barrachina, M.F. Dolz, P. San Juan, E.S. Quintana-Ortí, Efficient and portable GEMM-based convolution operators for deep neural network training on multicore processors, J. Parallel Distrib. Comput. 167 (2022) 240–254.
[16] S. Rajbhandari, Y. He, O. Ruwase, M. Carbin, T. Chilimbi, Optimizing CNNs on multicores for scalability, performance and goodput, ACM SIGARCH Comput. Archit. News 45 (1) (2017) 267–280.
[17] C. Rivera, J. Chen, N. Xiong, S.L. Song, D. Tao, ISM2: Optimizing irregular-shaped matrix-matrix multiplication on GPUs, 2020, arXiv preprint arXiv:2002.03258.
[18] K. Matsumoto, N. Nakasato, S.G. Sedukhin, Performance tuning of matrix multiplication in OpenCL on different GPUs and CPUs, in: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, IEEE, 2012, pp. 396–405.
[19] G.E. Moon, H. Kwon, G. Jeong, P. Chatarasi, S. Rajamanickam, T. Krishna, Evaluating spatial accelerator architectures with tiled matrix-matrix multiplication, IEEE Trans. Parallel Distrib. Syst. 33 (4) (2021) 1002–1014.
[20] Q. Han, H. Yang, M. Dun, Z. Luan, L. Gan, G. Yang, D. Qian, Towards efficient tile low-rank GEMM computation on Sunway many-core processors, J. Supercomput. 77 (5) (2021) 4533–4564.
[21] X. Li, Y. Liang, S. Yan, L. Jia, Y. Li, A coordinated tiling and batching framework for efficient GEMM on GPUs, in: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, 2019, pp. 229–241.
[22] P. Tillet, D. Cox, Input-aware auto-tuning of compute-bound HPC kernels, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, pp. 1–12.
[23] NVIDIA, CUDA templates for linear algebra subroutines, 2024, https://github.com/NVIDIA/cutlass.
[24] J. Huang, C.D. Yu, R.A.v.d. Geijn, Strassen's algorithm reloaded on GPUs, ACM Trans. Math. Softw. 46 (1) (2020) 1–22.
[25] B. Boyer, J.-G. Dumas, C. Pernet, W. Zhou, Memory efficient scheduling of Strassen–Winograd's matrix multiplication algorithm, in: Proceedings of the 2009 International Symposium on Symbolic and Algebraic Computation, 2009, pp. 55–62.
[26] A. Fawzi, M. Balog, A. Huang, T. Hubert, B. Romera-Paredes, M. Barekatain, A. Novikov, F.J. R Ruiz, J. Schrittwieser, G. Swirszcz, et al., Discovering faster matrix multiplication algorithms with reinforcement learning, Nature 610 (7930) (2022) 47–53.
[27] G. Xiao, C. Yin, T. Zhou, X. Li, Y. Chen, K. Li, A survey of accelerating parallel sparse linear algebra, ACM Comput. Surv. 56 (1) (2023) 1–38.
[28] Y. Chen, G. Xiao, K. Li, F. Piccialli, A.Y. Zomaya, fgSpMSpV: A fine-grained parallel SpMSpV framework on HPC platforms, ACM Trans. Parallel Comput. 9 (2) (2022) 1–29.
[29] Y. Chen, G. Xiao, W. Yang, Optimizing partitioned CSR-based SpGEMM on the Sunway TaihuLight, Neural Comput. Appl. 32 (10) (2020) 5571–5582.
[30] Y. Chen, K. Li, W. Yang, G. Xiao, X. Xie, T. Li, Performance-aware model for sparse matrix-matrix multiplication on the Sunway TaihuLight supercomputer, IEEE Trans. Parallel Distrib. Syst. 30 (4) (2018) 923–938.
[31] G. Xiao, K. Li, Y. Chen, W. He, A.Y. Zomaya, T. Li, CASpMV: A customized and accelerative SpMV framework for the Sunway TaihuLight, IEEE Trans. Parallel Distrib. Syst. 32 (1) (2019) 131–146.
[32] G. Xiao, C. Yin, Y. Chen, M. Duan, K. Li, Efficient utilization of multi-threading parallelism on heterogeneous systems for sparse tensor contraction, IEEE Trans. Parallel Distrib. Syst. (2024).
[33] D.E. Tanner, Tensile: Auto-tuning GEMM GPU assembly for all problem sizes, in: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW, IEEE, 2018, pp. 1066–1075.
[34] S. Wang, FlexGEMM: A flexible micro-kernel generation framework, in: Proceedings of the 5th International Conference on Computer Information and Big Data Applications, 2024, pp. 164–170.
[35] G. Alaejos, A. Castelló, H. Martínez, P. Alonso-Jordá, F.D. Igual, E.S. Quintana-Ortí, Micro-kernels for portable and efficient matrix multiplication in deep learning, J. Supercomput. 79 (7) (2023) 8124–8147.
[36] R. Wang, Z. Yang, H. Xu, L. Lu, A high-performance batched matrix multiplication framework for GPUs under unbalanced input distribution, J. Supercomput. 78 (2) (2022) 1741–1758.
[37] Y. Zhang, Y. Wang, Z. Mo, Y. Zhou, T. Sun, G. Xu, C. Xing, L. Yang, Accelerating small matrix multiplications by adaptive batching strategy on GPU, in: 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application, HPCC/DSS/SmartCity/DependSys, IEEE, 2022, pp. 882–887.
[38] A. Abdelfattah, S. Tomov, J. Dongarra, Matrix multiplication on batches of small matrices in half and half-complex precisions, J. Parallel Distrib. Comput. 145 (2020) 188–201.
[39] A. Abdelfattah, A. Haidar, S. Tomov, J. Dongarra, Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs, in: Proceedings of the International Conference on Supercomputing, 2017, pp. 1–10.
[40] A. Abdelfattah, A. Haidar, S. Tomov, J. Dongarra, Performance, design, and autotuning of batched GEMM for GPUs, in: High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings, Springer, 2016, pp. 21–38.
[41] A. Li, G.-J. van den Braak, H. Corporaal, A. Kumar, Fine-grained synchronizations and dataflow programming on GPUs, in: Proceedings of the 29th ACM on International Conference on Supercomputing, 2015, pp. 109–118.
[42] J. Li, H. Ye, S. Tian, X. Li, J. Zhang, A fine-grained prefetching scheme for DGEMM kernels on GPU with auto-tuning compatibility, in: 2022 IEEE International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2022, pp. 863–874.
[43] Z. Yang, L. Lu, R. Wang, A batched GEMM optimization framework for deep learning, J. Supercomput. 78 (11) (2022) 13393–13408.
[44] H. Mei, H. Qu, J. Sun, Y. Gao, H. Lin, G. Sun, GPU occupancy prediction of deep learning models using graph neural network, in: 2023 IEEE International Conference on Cluster Computing, CLUSTER, IEEE, 2023, pp. 318–329.
[45] I. Masliah, A. Abdelfattah, A. Haidar, S. Tomov, M. Baboulin, J. Falcou, J. Dongarra, Algorithms and optimization techniques for high-performance matrix-matrix multiplications of very small matrices, Parallel Comput. 81 (2019) 1–21.
[46] G. Park, B. Park, M. Kim, S. Lee, J. Kim, B. Kwon, S.J. Kwon, B. Kim, Y. Lee, D. Lee, LUT-GEMM: Quantized matrix multiplication based on LUTs for efficient inference in large-scale generative language models, 2022, arXiv preprint arXiv:2206.09557.
[47] B. Feng, Y. Wang, G. Chen, W. Zhang, Y. Xie, Y. Ding, EGEMM-TC: accelerating scientific computing on tensor cores with extended precision, in: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021, pp. 278–291.
[48] G. Shobaki, A. Kerbow, S. Mekhanoshin, Optimizing occupancy and ILP on the GPU using a combinatorial approach, in: Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization, 2020, pp. 133–144.
[49] A.B. Hayes, L. Li, D. Chavarría-Miranda, S.L. Song, E.Z. Zhang, Orion: A framework for GPU occupancy tuning, in: Proceedings of the 17th International Middleware Conference, 2016, pp. 1–13.
[50] G. Huang, S. Liu, L. Van der Maaten, K.Q. Weinberger, CondenseNet: An efficient DenseNet using learned group convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2752–2761.