Journal of Systems Architecture 160 (2025) 103336

Contents lists available at ScienceDirect

Journal of Systems Architecture

journal homepage: www.elsevier.com/locate/sysarc
LE-GEMM: A lightweight emulation-based GEMM with precision refinement on GPU

Yu Zhang a, Lu Lu a,b,c,∗, Zhanyu Yang a, Zhihong Liang d,e, Siliang Suo d,e

a School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
b Peng Cheng Laboratory, Shenzhen 518055, China
c Pazhou Laboratory, Guangzhou 510005, China
d Electric Power Research Institute, CSG, Guangzhou, Guangdong, China
e Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou, Guangdong, China
ARTICLE INFO

Keywords:
GEMM
Thread parallelism analytic model
Multi-level pipeline
Matrix core/Tensor core

ABSTRACT

Many special hardware units, such as Matrix Core and Tensor Core, have recently been designed and applied in various scientific computing scenarios. These units support tensor-level computation with different precisions on GPU. Previous studies have proposed methods for computing single-precision GEneral Matrix Multiplication (GEMM) with half-precision matrices. However, this routine often leads to some loss of accuracy, which limits its application. This paper proposes a Lightweight Emulation-based GEMM (LE-GEMM) on GPU that includes a lightweight emulation algorithm, a thread parallelism analytic model, and an efficient multi-level pipeline implementation to accelerate the computation process without compromising the accuracy requirements. First, we propose a lightweight emulation algorithm that includes a precision transformation process and a GEMM emulation calculation to achieve better computational accuracy and performance. Secondly, a thread parallelism analytic model is designed to analyze and guide the selection of the optimal tiling scheme based on various computing scenarios and hardware. Thirdly, an efficient multi-level pipeline is implemented, which can maximize instruction-level parallelism and latency hiding. Several comparison experiments were conducted on two commonly used GPU platforms: AMD-platform and NVIDIA-platform. The experimental results show that the proposed method outperforms the previous approaches in terms of computational accuracy and speed.
1. Introduction

GEneral Matrix Multiplication (GEMM) has been extensively involved in High-Performance Computing (HPC) [1], deep learning applications [2], and other scientific computing fields [3]. In these fields, some applications or scenarios in which GEMM is the primary building block are referred to as GEMM-based scientific computing [4,5]. These applications involve a large number of matrix multiplication workloads, and GEMM's computation time is a significant part of the overall process. Specifically, some typical applications, such as K-Means [6], KNN [7], Feedforward Neural Networks (FFN) [8], etc., have GEMM computation times exceeding 50% in popular implementations. To cater to this trend, specialized hardware accelerators (e.g., AMD's Matrix Core [9] and NVIDIA's Tensor Core [10]) on GPU have been designed and implemented to speed up GEMM, which employ low-precision inputs to achieve high performance. In contrast to the regular computation pattern, these mixed-precision cores can support tensor-level Fused Multiply-Add (FMA), dramatically speeding up the matrix computation process. For example, the half-precision performance with Matrix Core in the AMD MI210 achieves 8× higher throughput over ROCm Core [11]. Based on this feature, previous studies have proposed a way to exploit low-precision matrices to obtain high-precision results and achieve high performance on GPU [12–14].

Exploiting low precision to speed up GEMM is an effective route in GEMM-based scientific computing fields, where matrix multiplication operations can tolerate low precision [15,16]. The precision transformation process, which has a significant impact on the accuracy of the final calculated results, is a crucial part of this route [17]. Data value errors exist during the transition from high to low precision due to the loss in the mantissa representation. After several FMA calculations, the final result has some errors compared to the standard one. This phenomenon is also known as precision loss. It restricts the application of mixed-precision calculations significantly [18,19]. Some previous studies have utilized the calculation
∗ Corresponding author at: School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China.
E-mail addresses: yuzhang0722@163.com (Y. Zhang), lul@scut.edu.cn (L. Lu), yangzhanyu@hotmail.com (Z. Yang), liangzh@csg.cn (Z. Liang), suosl@csg.cn (S. Suo).

https://doi.org/10.1016/j.sysarc.2025.103336
Received 11 July 2024; Received in revised form 8 December 2024; Accepted 2 January 2025
Available online 17 January 2025
1383-7621/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
results of additional low-precision variables to reduce precision loss in the calculation results [20–23].

Low-precision variables and additional computations are utilized for precision refinement [10], which reduces the impact of precision loss on the computational results. The implementation of this approach on GPUs that feature mixed-precision arithmetic is still hindered by several issues. Primarily, a crucial factor is the data transformation process. Loss of the original data mantissa during the precision transformation process has an impact on the final outcome. As a consequence of the disparity in mantissa length, low-precision data cannot be saved beyond the mantissa length range. This leads to numerical errors after the precision transformation process. The effect of this error will be amplified in the final calculation result after multiple FMA operations, creating a gap between the calculated result and the standard result. The second is the acceleration capability of these specialized hardware units. These specialized hardware units provide GEMM with powerful acceleration capabilities, accompanied by special data layouts, which pose challenges to efficient kernel design. For the GEMM operation, tiling is a common way of assigning computational tasks, where the matrix is segmented into multiple tiles (smaller matrices), and a workgroup (thread block) is responsible for computing one or several tiles. When the tile size is too large, the workload of the threads becomes too heavy, reducing thread parallelism. Conversely, if the tile size is too small, there will be frequent workgroup switching, which increases scheduling overhead. The computation allocation of tiles also affects data reuse in a workgroup. In terms of data access, these specialized hardware units perform GEMM operations with fragments, which requires higher utilization of the GPU's multi-level memory. Ineffective organization and manipulation of data frequently negate the speed benefits of the tensor-level patterns, leading to memory-wall bottlenecks in the GPU.

To address the above issues, we propose LE-GEMM, which includes a lightweight emulation algorithm, a thread parallelism analytic model, and a multi-level pipeline implementation to achieve better computational accuracy and speed. Our contributions can be summarized as follows:

• A lightweight emulation algorithm is proposed that employs bit-cutting and scaling techniques to preserve the original values and reduce the loss of precision while removing redundant terms in the computation process to reduce the computational overhead.
• A thread parallelism analytic model based on the lightweight emulation GEMM is proposed, which allows the selection of the optimal tiling scheme through two aspects: the number of parallel threads and the single-thread performance.
• An efficient multi-level pipeline is implemented based on the access discrepancy between various levels of memory and the GEMM workflow, utilizing double-buffer and pre-fetch techniques to hide access latency and improve Instruction-Level Parallelism (ILP).
• The proposed single-precision GEMM acceleration method achieves state-of-the-art speed and accuracy performance on two GPU platforms.

The rest of the paper is organized as follows. Section 2 introduces the background of these specialized hardware units (Matrix Core and Tensor Core) and matrix multiplication based on the Ozaki scheme. Section 3 provides some related work. Section 4 presents the details of the proposed accelerated method for single-precision GEMM. Section 5 demonstrates and evaluates the experimental results. The conclusions of the paper are given in Section 6. Section 7 outlines potential areas for future work that can be further explored.

2. Background

2.1. Matrix core and tensor core

The amount of computation for GEMM has gradually increased with the development of deep learning. In response to this trend, AMD and NVIDIA are adding specialized hardware units and cores to their latest GPUs that support mixed-precision computing [24,25]. In the following, we will discuss the details of the specialized cores.

Computing pattern and memory layout. GPU cores can be classified into two types based on supported precision: same-precision cores (FP16 Core, FP32 Core, etc.) that only support same-precision computation, and mixed-precision cores (Tensor Core or Matrix Core) that support different-precision computation [26,27]. The GEMM supported by the mixed-precision core can be expressed as D = α × AB + β × C, where A and B can be FP16 or FP32 matrices, C and D can be configured as FP32 or double-precision (FP64) matrices, and α and β denote scalar constants [28]. This means that the mixed-precision core can utilize low-precision input to obtain high-precision calculation results [10,29]. The details are shown in Fig. 1(a). In terms of computing patterns, the mixed-precision core utilizes a new memory layout, the fragment, which stores matrix elements either in row-major or column-major order. Different from the same-precision-core matrix multiplication operation, the matrix multiplication operation is performed in the mixed-precision core using the fragment as a unit. The fragment size is ‘‘16 × 16’’, ‘‘32 × 32’’, or another fixed shape (Fig. 1(b) provides an example based on a ‘‘16 × 16’’ fragment). Note that different hardware architectures have varying support for fragment size. For example, for NVIDIA A800, the maximum supported fragment size is ‘‘16 × 16’’ [10]. In AMD MI210, rocWMMA can support fragments with a size of ‘‘16 × 32’’ or ‘‘32 × 32’’ [11]. Different fragment sizes have a particular impact on the workload allocation of GEMM operations.

Programming interface and implementation. Matrix Core and Tensor Core are hardware cores delivered by two different GPU vendors (AMD and NVIDIA). GPU vendors have implemented various approaches depending on their hardware configurations to facilitate their use by researchers and engineers. On the AMD platform, the first approach is rocWMMA (ROCm Wavefront-level Matrix Multiply and Accumulate), in which AMD's C++ interface leverages Matrix Core to accelerate GEMM operations. The usage and functionality of this interface are very similar to those of WMMA [28]. The difference between the two is that rocWMMA calls the API handle with the keyword rocwmma. The second approach is the compiler intrinsic [9], which follows a specific syntax ‘‘__builtin_amdgcn_wmma_’’. The third approach uses rocBLAS [30], which provides a functional interface and allows the direct operation of a mixed-precision core by supplying the matrix parameters.

There are multiple ways to call Tensor Core on the NVIDIA platform, such as WMMA (Warp-level Matrix Multiply and Accumulate) [10], CUTLASS [31], cuBLAS [32], PTX instructions [33], etc. WMMA provides several specific programming interfaces to load (wmma::load_matrix_sync), compute (wmma::mma_sync), and store (wmma::store_matrix_sync) matrices in fragments [10]. Based on this interface, the matrix is first tiled into multiple fragments, which can be shaped to fixed sizes, and then matrix multiplication is performed. CUTLASS (CUDA Templates for Linear Algebra Subroutines) [31] is a series of CUDA template header libraries for implementing high-performance GEMM and related computations at all scales and levels within CUDA. It provides a variety of GEMM computation interfaces (hgemm, sgemm, dgemm, and wgemm) to optimize data movement and multiply-accumulate operations in fragments to achieve high performance. cuBLAS [32] is a basic linear algebra subroutines library provided and implemented by NVIDIA. cuBLAS provides two interfaces (cublasGemmEx and cublasSgemm) that support GEMM on Tensor Core. In this approach, it is necessary to set the math mode of cuBLAS by using the cublasSetMathMode function. PTX instructions provide a series of programming APIs and instruction sets for general-purpose parallel programming. They are designed to be efficient on NVIDIA GPUs that support the defined computation functions. The terminological differences and relationships between ROCm and CUDA are given in Table 1.
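The fragment-based mixed-precision pattern described above (FP16 input fragments, an FP32 accumulator, and D = α × AB + β × C) can be illustrated with a small NumPy sketch. This only emulates the numerics of the pattern, not the hardware path; the function name `fragment_gemm` and the default 16 × 16 fragment size are illustrative assumptions, not code from the paper.

```python
import numpy as np

def fragment_gemm(A, B, C, alpha=1.0, beta=0.0, frag=16):
    """Sketch of D = alpha*A@B + beta*C with FP16 input fragments
    and FP32 accumulation, iterated fragment by fragment."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % frag == 0 and N % frag == 0 and K % frag == 0
    A16 = A.astype(np.float16)          # low-precision inputs
    B16 = B.astype(np.float16)
    D = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, frag):
        for j in range(0, N, frag):
            acc = np.zeros((frag, frag), dtype=np.float32)  # FP32 accumulator
            for k in range(0, K, frag):
                # one fragment product, widened to FP32 before accumulation,
                # mimicking a single tensor-level FMA instruction
                acc += A16[i:i+frag, k:k+frag].astype(np.float32) @ \
                       B16[k:k+frag, j:j+frag].astype(np.float32)
            D[i:i+frag, j:j+frag] = alpha * acc + beta * C[i:i+frag, j:j+frag]
    return D
```

Each `acc +=` step plays the role of one fragment-level FMA: the FP16 × FP16 products are exact once widened to FP32, and only the accumulation order distinguishes this loop nest from a monolithic multiply.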
Fig. 1. The precision profiling and the computing pattern of mixed-precision units (Fig. 1(a) shows an example of precision profiling, where the input data is FP16-precision, the product and accumulator are FP32-precision, and the output is FP32-precision. Fig. 1(b) illustrates a fragment-based GEMM operation with a ‘‘16 × 16’’ fragment).
Table 1
ROCm/CUDA terminology.

ROCm              CUDA                            Description
Compute Unit (CU) Streaming Multiprocessor (SM)   One among many parallel vector processors in a GPU with parallel ALUs. All wavefronts in a workgroup are assigned to the same CU.
Kernel            Kernel                          Functions initiated to the GPU that are executed by multiple parallel CUs or SMs on the GPU. Multiple kernels can work in parallel with the CPU.
Wavefront         Warp                            A collection of operations that execute in lockstep, run the same instructions, and follow the same control-flow path. Individual lanes can be masked.
Workgroup         Thread block                    Think of this as a vector thread. A wavefront is a 64-wide vector op on an AMD GPU; a warp is a 32-wide vector op on an NVIDIA GPU.
Work-item/Thread  Thread                          GPU programming models can handle this as a separate thread of execution, although it does not necessarily make forward sub-wavefront progress.
Global Memory     Global Memory                   The DRAM memory accessed by a GPU, traversing some layers of cache.
Local Memory      Shared Memory                   Scratchpad that allows communication between wavefronts within a workgroup.
Private Memory    Local Memory                    Per-thread private memory, often mapped to registers.
2.2. Matrix multiplication based on the Ozaki scheme

The Ozaki scheme is an error-free method for computing matrix multiplication [34]. It involves translating the original matrix multiplication into a sum of matrices that can be multiplied together. Fig. 2 is a schematic of the matrix multiplication based on the Ozaki scheme. The detailed procedure can be divided into three steps: splitting, computation, and summation. In the splitting step, matrix A and matrix B are first divided at the element level to obtain multiple submatrices (A_i, B_j). Then, in the computation step, matrix multiplication is performed on the submatrices of A (A_i) and the submatrices of B (B_j) to obtain the submatrices of C (C_i,j). Finally, all these submatrices are summed to obtain the result matrix C.

Based on the Ozaki scheme, a single-precision matrix multiplication calculation method can be described as follows. For simplicity, matrix multiplication with FP32 is used as an example. Firstly, in the element-level splitting step, the FP32-based matrix is converted to FP16 precision by the precision transformation function, while the error is preserved. The detailed expression is as follows:

A_FP16 ← toFP16(A_FP32)    (1)

R_A ← toFP16(A_FP32 − A_FP16)    (2)

where A_FP32 and A_FP16 denote the FP32-based and FP16-based matrix A, respectively. toFP16() is the transformation function that transforms matrix A from FP32 to FP16, and R_A represents the residual matrix between the different precisions. Note that this expression is analogous for matrix B. With the support of mixed-precision cores for floating-point arithmetic with different precisions, A_FP32 B_FP16 can be rewritten as follows:

A_FP32 B_FP16 = (A_FP16 + R_A) B_FP16 = A_FP16 B_FP16 + R_A B_FP16    (3)

Hence, the matrix multiplication A_FP32 B_FP32 can be expressed as:

A_FP32 B_FP32 = (A_FP16 + R_A)(B_FP16 + R_B)
             = A_FP16 B_FP16 + A_FP16 R_B + R_A B_FP16 + R_A R_B    (4)

where A_FP16, B_FP16, R_A, and R_B denote matrices with FP16, and A_FP32 and B_FP32 denote matrices with FP32.

3. Related work

The advent of Matrix Core and Tensor Core has greatly accelerated the computational speed of GEMM due to their specific data layout and precision requirements. This feature allows them to be studied and applied in many GEMM-based scientific computing fields [24–27]. Therefore, previous studies have optimized the scheduling methods of these special hardware units based on different application scenarios to accelerate the computation process [4,12,15,35]. For example, a novel approach for accelerated quantization of graph neural networks, based on the support of low-precision GEMM, was proposed by Wang et al. [36], which exploits the insensitivity of graph neural networks to loss of precision. To make Tensor Core execute more efficiently in deep learning applications, Dakkak et al. [26] optimized the reduction and scan terms for matrix multiplication, making their programming faster and more accessible. Specifically, Dakkak's method [26] was implemented on the Tesla V100 and achieved 89%–98% of peak memory bandwidth with reduced power consumption. Li et al. [37] proposed a half-precision Fast Fourier Transform (FFT) approach based on Tensor Core. This approach leverages the fragments in Tensor Core to support the special operations required for FFT and optimizes the data arrangement in GPU memory [38]. To further explore these hardware units' workflow and application prospects, Sakamoto et al. [39] improved energy efficiency by analyzing the memory and power consumption of low-precision data during computation. The above studies aim to optimize the use of these specialized hardware units to accelerate the computing process in specific scientific computing scenarios based on their computational characteristics [40,41]. These approaches maximize the acceleration capability of the matrix multiplication operation without changing the internal computation process of specialized hardware units. However, FMA operations between different precisions will inevitably result in a certain precision loss in the calculation results, which will limit the application scope of these
Fig. 2. The schematic of the matrix multiplication based on the Ozaki scheme (C = AB as an example).
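The split-computation-summation procedure of Fig. 2, specialized to the FP32/FP16 split of Eqs. (1)-(4), can be sketched in NumPy. This is a numerical sketch under the simplest two-way split, not the paper's GPU implementation; `ozaki_fp32_product` is a name chosen here for illustration.

```python
import numpy as np

def ozaki_fp32_product(A, B):
    """Eqs. (1)-(4): split each FP32 matrix into an FP16 head plus an FP16
    residual, multiply the pieces, and sum the four partial products."""
    A16 = A.astype(np.float16)                             # Eq. (1)
    RA = (A - A16.astype(np.float32)).astype(np.float16)   # Eq. (2)
    B16 = B.astype(np.float16)
    RB = (B - B16.astype(np.float32)).astype(np.float16)
    f32 = lambda M: M.astype(np.float32)  # widen, as the mixed-precision core does
    # Eq. (4): AB = A16*B16 + A16*RB + RA*B16 + RA*RB, accumulated in FP32
    return (f32(A16) @ f32(B16) + f32(A16) @ f32(RB)
            + f32(RA) @ f32(B16) + f32(RA) @ f32(RB))
```

Against a float64 reference, the summed partial products are markedly closer than a plain FP16 product, since the residual matrices recover roughly ten additional mantissa bits per element.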
hardware units [21,22,42].

To mitigate the loss of precision, several improvement approaches have been proposed. Markidis et al. [10] analyzed the programmability and precision of Tensor Core. They then proposed a precision refinement method for single-precision (FP32) matrix multiplication, which takes advantage of half-precision (FP16) to store residuals in transformations. Based on this research, Ootomo et al. [43] improved the rounding method for single-precision floating-point numbers. They reduced the redundancy in error-correction terms by considering the effect of the rounding method and the implicit bit. Others have studied the error analysis of matrix multiplication using Tensor Core and have given theoretical error bounds for matrix multiplication with different precisions [24,44,45]. To explore the efficiency of mixed-precision cores, Jia et al. [46] studied the arrangement of matrix-to-fragment data in memory and how assembly instructions split the matrix. To support the extended-precision computing pattern, several methods have been proposed on limited-precision hardware, combining multiple low-precision computation instructions to simulate extended-precision computing [47–49]. Inspired by this approach, Feng et al. [4] improved Markidis's method for single-precision matrix multiplication, modifying the rounding method and optimizing the computation of extended precision at the instruction level. In addition, Fasi et al. [21] investigated the application of the multiword methodology to improve the performance-accuracy trade-off of matrix multiplication with mixed-precision cores, focusing specifically on the Tensor Core available on NVIDIA GPUs. Following this research direction, Parikh et al. [20] presented insights and possibilities for implementing higher-precision matrix multiplication with Tensor Core, and an implementation of the FP64x2 GEMM scheme with variable FP64 inputs was provided. The common feature of these studies is using low-precision inputs to obtain high-precision computational results with some additional computation workload. In this way, these specialized hardware units (Matrix Core and Tensor Core) can perform high-performance matrix multiplication operations and reduce the impact of precision loss [44,50].

4. LE-GEMM

This section gives a detailed description of LE-GEMM, which includes the lightweight emulation algorithm, the thread parallelism analytic model, and the efficient multi-level pipeline design. Firstly, we introduce the lightweight emulation algorithm, which can provide more efficient and accurate calculation performance based on specialized hardware units. Then, a thread parallelism analytic model is proposed, which is used to analyze and select the optimal tiling scheme based on various computing scenarios and hardware. Finally, we present an efficient multi-level pipeline design for GEMM, which utilizes double-buffer and pre-fetch techniques to hide access latency and improve ILP.

4.1. Lightweight emulation algorithm

Traditional emulation algorithm design assumes that the hardware has the same input and output precision. Typically, a large number of low-precision instructions are used to complete the computation and memory access process, which can result in high computational overhead and reduce performance benefits [4,40,51,52]. However, specialized hardware units, such as Matrix Core and Tensor Core, have been introduced to accelerate the GEMM operation. Multiplication and addition are fused in special instruction primitives, and the process of accumulating intermediate results also supports high-precision accumulation to ensure accurate calculation results. This new hardware feature has motivated us to design a lightweight emulation algorithm that accelerates the GEMM computation process by achieving high computational throughput based on low-precision inputs. The lightweight emulation algorithm consists of the precision transformation process and the GEMM emulation calculation, as illustrated in Fig. 3. The precision transformation process is designed to make the low-precision data better preserve the value of the original data, thus achieving more accurate calculation results. The GEMM emulation calculation is designed to achieve high-precision results and speed up the calculation process by leveraging the specialized hardware units and low-precision input. Floating-point numbers of different precision types have exponents and mantissas of different lengths. For the single-precision format, 8 bits are allotted to the exponent and 23 bits to the mantissa. The half-precision representation has a smaller range than the single-precision one, with the exponent and mantissa being 5 and 10 bits, respectively. In the emulation algorithm, the first step is to split the original matrix, which affects the level of elements in the matrix. There are two different splitting routines. One splitting routine is to split the matrix elements without changing the precision, and the other is to split the matrix
Fig. 3. The workflow of LE-GEMM.
values and change the precision type. In this paper, we focus on the implementation of the second routine. However, the transformation process inevitably leads to precision loss. Therefore, an additional FP16 is used to preserve the residual and improve the accuracy of the calculated results. Rounding is a popular precision transformation technique that has previously been employed to transform from high precision to low precision [20,45]. The common rounding methods include RN, RZ, RU, and RD.¹ The rounding method has the advantage of taking into account the representation of the low-bit part of the mantissa, which is after the rounding bit. However, it is important to note that each rounding method will inevitably introduce some errors in the final value. The primary reason for this is that the mantissa involves rounding up or down, leading to numerical deviations. For instance, when transforming from FP32 to FP16 using the RN method, the final value considers the low bits of the mantissa (bits 0–12) to achieve the nearest rounding. A_FP16 and R_A are responsible for preserving the high-bit and low-bit mantissa of A_FP32, respectively. From a numerical perspective, A_FP16 has a larger effect than R_A on the computation results. Therefore, it is important to preserve the high bits of the mantissa during the rounding process while saving as many low bits as possible.

Compared to previous work, we adopted a bit-cut to preserve the high bits of FP32 during the precision transformation process, considering the impact of the high and low bits on errors. To achieve the precision transformation from A_FP32 to A_FP16, a bit-cut method is employed to preserve the high bits of the A_FP32 mantissa. In this way, the higher mantissa (bits 13–22 of A_FP32 in Fig. 4) is copied into A_FP16. For the residual R_A, introducing a scaling operation requires dividing it by a scaling factor (2^10 in Fig. 4) in the calculation process to ensure the correct results. Note that due to the difference in the length of the exponent part between single precision and half precision, additional processing is required to avoid errors in the transformation of the exponent part. Considering that the exponent biases of single precision and half precision are 127 and 15, it is necessary to subtract the bias of the exponent part to avoid errors in the transformed exponent bits. To avoid the impact of bias, we preserve the exponent part in an intermediate variable, which requires subtracting the bias value. Then, the corresponding bias is added when saving the intermediate variable to the half-precision exponent part. This approach offers the following benefits: compared to traditional rounding, this approach does not generate rounding-up or rounding-down, which can affect the stability of the high bits. This approach does not depend on the rounding API interface of the platform, which differs in the rounding regulations of different platforms. The bit-cut is based on the binary format of floating-point numbers, and it is compatible across platforms. Fig. 4 illustrates the details of the precision transformation process. For the computation of R_A, since the mantissa of FP16 is only 10 bits long, it cannot fully save all the remaining mantissa bits. Based on these two factors, a residual is calculated from A_FP32 to A_FP16. The residual is then multiplied by the scaling factor, 2^10, to preserve it. This does not alter the representation, but it maps the initial value to the larger range that can be represented by R_A. Note that since R_A was obtained by scaling, the processing used in subsequent calculations will return the original residual value. Finally, to preserve the mantissa to the maximum extent, the residuals are saved using rounding. The precision transformation process can be rewritten as follows:

A_FP16 = bit-cut(A_FP32)    (5)

R_A = round((A_FP32 − A_FP16) × 2^10)    (6)

B_FP16 = bit-cut(B_FP32)    (7)

R_B = round((B_FP32 − B_FP16) × 2^10)    (8)

where bit-cut() denotes executing the bit-cut operation, and round() represents RZ. This section presents the details of the precision transformation process involving bit-cut and scaling. In the scaling step, it is essential to divide the value by the scaling factor in the last step of the calculation to cancel the effect of the scaling factor. Therefore, the GEMM emulation calculation, which is based on the previous precision transformation process, can be described as follows:

A_FP32 B_FP32 = (A_FP16 + R_A/2^10)(B_FP16 + R_B/2^10)
             = A_FP16 B_FP16 + (A_FP16 R_B + R_A B_FP16)/2^10 + R_A R_B/2^20    (9)

The fourth calculation term in Eq. (9) is divided by 2^20; thus, it has little effect on the final computation result. Moreover, additional computation items will increase the overall computation effort. For this purpose, Eq. (9) is modified as follows:

A_FP32 B_FP32 = (A_FP16 + R_A/2^10)(B_FP16 + R_B/2^10)
             ≈ A_FP16 B_FP16 + (A_FP16 R_B + R_A B_FP16)/2^10    (10)

Hence, the lightweight emulation algorithm calculation can be described as follows:

D_FP32 = α × A_FP32 B_FP32 + β × C_FP32
       ≈ α × (A_FP16 B_FP16 + (A_FP16 R_B + R_A B_FP16)/2^10) + β × C_FP32    (11)

¹ RN, RZ, RU, and RD represent Round-to-Nearest-even, Round-towards-Zero, Round-Up, and Round-Down, respectively.
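Eqs. (5)-(8) and (11) can be emulated end to end in NumPy. The sketch below uses a bitmask for the bit-cut and NumPy's default round-to-nearest for the residual where the text specifies RZ, so it approximates rather than reproduces the GPU kernel; the helper names `bit_cut`, `le_split`, and `le_gemm` are chosen here for illustration.

```python
import numpy as np

def bit_cut(a_fp32):
    """Eqs. (5)/(7): zero the low 13 FP32 mantissa bits (keep bits 13-22)
    and store the truncated value in FP16 -- truncation, not rounding."""
    bits = a_fp32.view(np.uint32) & np.uint32(0xFFFFE000)
    return bits.view(np.float32).astype(np.float16)

def le_split(a_fp32):
    """Eqs. (5)-(6): FP16 head via bit-cut, plus the residual scaled by
    2**10 so it fits the FP16 range (cast here rounds to nearest)."""
    head = bit_cut(a_fp32)
    resid = ((a_fp32 - head.astype(np.float32)) * 2.0**10).astype(np.float16)
    return head, resid

def le_gemm(A, B, C, alpha=1.0, beta=0.0):
    """Eq. (11): D ~= alpha*(A16@B16 + (A16@RB + RA@B16)/2**10) + beta*C,
    dropping the RA@RB/2**20 term."""
    A16, RA = le_split(A)
    B16, RB = le_split(B)
    f32 = lambda M: M.astype(np.float32)
    AB = f32(A16) @ f32(B16) + (f32(A16) @ f32(RB) + f32(RA) @ f32(B16)) / 2.0**10
    return alpha * AB + beta * C
```

Dropping the R_A R_B/2^20 term is what makes the algorithm lightweight: three fragment products are issued instead of four, while the scaled residuals still correct the truncation error of the FP16 heads.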
Fig. 4. The bit-cut and scaling operation.
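The exponent re-biasing sketched in Fig. 4 (bias 127 for FP32, bias 15 for FP16) can be made concrete with a manual bit-level conversion. This sketch assumes values in the FP16 normal range and ignores subnormals, infinities, and overflow; `fp32_to_fp16_bitcut` is an illustrative name, not code from the paper.

```python
import numpy as np

def fp32_to_fp16_bitcut(x):
    """Manual FP32 -> FP16 bit-cut: copy the sign, re-bias the exponent
    from 127 to 15, and keep only mantissa bits 13-22 (truncation)."""
    b = np.asarray(x, dtype=np.float32).view(np.uint32)
    sign = ((b >> 31) & 0x1).astype(np.int64)
    exp = ((b >> 23) & 0xFF).astype(np.int64) - 127 + 15  # subtract 127, add 15
    mant = ((b >> 13) & 0x3FF).astype(np.int64)           # top 10 mantissa bits
    h = ((sign << 15) | (exp << 10) | mant).astype(np.uint16)
    return h.view(np.float16)
```

For values that are exactly representable in FP16, the result coincides with a normal cast; for values that are not, the low mantissa bits are cut off rather than rounded, so the magnitude never rounds up.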
where D_FP32 and C_FP32 denote FP32 matrices, A_FP16, R_A, B_FP16 and R_B have the same meaning as in Eq. (10), and α and β are scalar constants. This section introduces a precision transformation process consisting of a two-step bit-cut and rescaling. This process effectively preserves the original data mantissa and avoids the numerical fluctuations caused by rounding. At the same time, based on the above precision transformation process, we have designed a GEMM emulation calculation process, which removes the cumulative term in Eq. (9) and reduces the overall computation time of GEMM. Algorithm 1 provides an overview of the lightweight emulation algorithm. In Algorithm 1, bit-cut() denotes utilizing the bit-cut technique to preserve the high bits of the mantissa, which is detailed in Section 4.1; round() represents the use of rounding to preserve the low bits of the mantissa; mFma_sync() represents utilizing the specialized hardware units to calculate the multiplication of two matrices; and uniformFma_sync() represents the computation process of Eq. (11) to obtain the final calculation result D_FP32.

Algorithm 1 The lightweight emulation algorithm
 1: Function emulation GEMM operation(D_FP32, A_FP32, B_FP32, C_FP32, α, β)
 2:   /* The precision transformation process */
 3:   A_FP16 = bit-cut(A_FP32), R_A = round((A_FP32 - A_FP16) × 2^10);
 4:   B_FP16 = bit-cut(B_FP32), R_B = round((B_FP32 - B_FP16) × 2^10);
 5:   /* GEMM emulation computation process */
 6:   /* Initialize accumulators */
 7:   Initialize fragsAcc0, fragsAcc1;
 8:   /* Calculate each term in Eq. (11) */
 9:   fragsAcc0 = mFma_sync(A_FP16, B_FP16, fragsAcc0);
10:   fragsAcc1 = mFma_sync(A_FP16, R_B, fragsAcc1);
11:   fragsAcc1 = mFma_sync(B_FP16, R_A, fragsAcc1);
12:   /* Calculate the final result */
13:   D_FP32 = uniformFma_sync(α, fragsAcc0, fragsAcc1/2^10, β, C_FP32);
14: end Function

4.2. Thread parallelism analytic model

The lightweight emulation algorithm is designed to efficiently map GEMM-based scientific computing to specialized hardware cores, allowing high-precision outputs with low-precision inputs. To fully exploit the acceleration capabilities of specialized hardware units, an efficient tiling scheme that captures the impact of the allocation of computational tasks on overall performance is critical. The tiling scheme segments the matrix into multiple tiles and assigns each tile to a cooperative thread array (thread block) responsible for its computation. Each thread loads and computes the corresponding matrix elements. This process allocates the computational tasks based on the size of the matrices in the GEMM operation and on the available hardware resources.

Fig. 5 illustrates that computing a tile in matrix D requires loading sub-sections of matrices A and B (the gray area) and a tile of matrix C (the blue area) to perform the GEMM operation. This allows multiple tiles to be computed simultaneously on the GPU. To fully exploit the computing power of GPUs, our proposed thread parallelism analytic model analyzes and improves the performance of GEMM computations by considering both the number of parallel threads (Thread-Level Parallelism, TLP) and single-thread performance. Given the tile size, the computation of GEMM can be divided into multiple subparts, each with a set of threads responsible for the computation. This property allows thread parallelism to be computed for a given tiling strategy. Previous methods were based simply on the number of threads in a thread block. Because a thread block is further decomposed into multiple wavefronts (warps) for execution on a CU, such methods ignore the underlying computational pattern on the GPU. Therefore, we use the number of wavefronts to calculate thread parallelism, which provides a more accurate indication of how active the underlying threads are in practice. Based on the matrix's tiling scheme and the hardware configuration, TLP can be calculated as follows:

    T_TLP = φ × (M × N)/(T_M × T_N) × T_thread_num        (12)

where M and N represent the dimensions of matrix C, T_M and T_N are the tile sizes, φ is the transform process from workgroup to wavefront, and T_thread_num represents the number of threads in a wavefront. Note that T_thread_num has different values across GPU platforms; for example, on MI210 and A800, the values of T_thread_num are 64 and 32, respectively.

When considering TLP, it is also important to consider the performance of a single thread, which can be expressed as a compute-to-memory ratio. Comparing the clock cycles of memory access instructions and computation instructions in a thread, it is evident that data access in memory requires more time. Therefore, it is crucial to ensure that the data loaded by a thread is fully utilized. During each iteration, every thread block is required to perform two tasks.
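Eq. (12) is straightforward to evaluate. A small Python sketch (the tile and wavefront sizes are the Table 2 values; taking φ as a fixed number of wavefronts per workgroup is our reading of the "transform process", and φ = 4 is purely illustrative):

```python
def tlp(M: int, N: int, tile_m: int, tile_n: int,
        phi: float, threads_per_wavefront: int) -> float:
    """Eq. (12): thread-level parallelism for a given tiling scheme.
    (M*N)/(tile_m*tile_n) is the number of tiles (= thread blocks)."""
    return phi * (M * N) / (tile_m * tile_n) * threads_per_wavefront

M = N = 8192
# MI210: (T_M, T_N) = (128, 128), 64-thread wavefronts (Table 2)
print(tlp(M, N, 128, 128, phi=4, threads_per_wavefront=64))
# A800:  (T_M, T_N) = (128, 256), 32-thread warps
print(tlp(M, N, 128, 256, phi=4, threads_per_wavefront=32))
```

Smaller tiles yield more thread blocks and therefore a larger T_TLP, which is the trade-off the following compute-to-memory analysis balances.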
Fig. 5. The tiling scheme of a single GEMM.
The first step involves reading two matrices (A_FP16, R_A) of size (T_M, T_K), two matrices (B_FP16, R_B) of size (T_K, T_N), and matrix C of size (T_M, T_N). Since matrix C of size (T_M, T_N) is loaded only in the first iteration and does not need to be loaded again in subsequent steps, its loading overhead can be neglected. Thus, the memory access per iteration can be expressed as:

    T_memory = 2 × 2 × (T_M × T_K + T_K × T_N)        (13)

where T_memory represents the memory access during each iteration, and the last 2 means that half-precision data occupies two bytes. According to the lightweight emulation algorithm, the computation required for each iteration is as follows:

    T_flops = 3 × 2 × (T_M × T_N × T_K)        (14)

where T_flops represents the computation in each iteration, 2 × (T_M × T_N × T_K) represents the computation amount of a matrix multiplication between a matrix of size (T_M, T_K) and a matrix of size (T_K, T_N), and the constant 3 corresponds to the three terms in Eq. (11). To this end, the ratio of computation to memory access (T_flops/memory) during each iteration is as follows:

    T_flops/memory = (3 × 2 × (T_M × T_N × T_K)) / (2 × 2 × (T_M × T_K + T_K × T_N))
                   = 3(T_M × T_N) / (2(T_M + T_N))        (15)

While pursuing higher TLP and compute-to-memory ratios, it is also important to focus on the thread workload. The workload of a thread can be divided into two parts: the amount of data loaded and the amount of computation. To avoid an excessive workload reducing thread parallelism, it is necessary to properly set the tiling scheme and the number of tiles that each workgroup computes. To describe a thread's amount of data loaded (Work_DL) and amount of computation (Work_CP) more accurately, the following two formulas are given:

    Work_DL = (T_M × T_N + T_M × T_K + T_K × T_N) / Work_num        (16)

    Work_CP = (T_M × T_N) / Work_num        (17)

where Work_num represents the number of threads responsible for computing the tile, and T_M, T_N, and T_K represent the tile size, the details of which are shown in Fig. 5. Work_DL is the number of elements each thread is responsible for loading. To ensure efficient parallelism, it is necessary to allocate a different number of threads to tiles of various shapes; this ensures that the workload is consistent between thread blocks. Work_CP represents the number of registers loaded by each thread. When a thread loads too many elements, the data in its registers overflows, which increases the data access overhead. At the same time, we should also consider the correspondence between the thread block and the matrix tiles, which also affects the efficiency of data reuse. As shown in Fig. 5(a), the calculation of 3 tiles in matrix C involves loading 1 sub-section of matrix A and 3 sub-sections of matrix B. In Fig. 5(b), 4 tiles can be computed by loading two sub-sections each of matrices A and B. Comparing these two routines, it is clear that the larger the cross-sectional area (blue tiles in Fig. 5) between the sub-sections loaded from matrices A and B, the better the data reuse; Fig. 5 indicates that the pattern in (b) reuses data better.

Finally, the process of optimal tiling scheme selection with the thread parallelism analytic model is formally represented as maximizing T_TLP and T_flops/memory. According to Eqs. (12) and (15), it can be found that T_TLP and T_flops/memory show an inverse relationship given M, N, and T_thread. Generally, smaller tiles achieve better thread parallelism because the computational tasks are assigned to more threads. However, the number of activated threads is also limited by the number supported by the GPU hardware. When the tile is larger, a larger compute-to-memory ratio is often achieved, which represents better performance for a single thread. When choosing a tiling scheme, it is also necessary to consider the workload of the thread according to Eqs. (16) and (17). Eqs. (12)–(17) take into account the factors that affect the number of active threads and the performance of a single thread, allowing the optimal tiling scheme to be selected for various GPU platforms and computing scenarios. Our thread parallelism analytic model chooses the tiling scheme based on thread-level parallelism and single-thread performance, and can be solved analytically with existing optimization solvers [53,54]. It can also be used to provide a more profitable direction when selecting the best parameters by trial-and-error. Table 2 presents the choice of the parameters for LE-GEMM. In Table 2, (T_M, T_N, T_K) denotes the size of the matrix partitioned into tiles, (f_m, f_n, f_k) denotes the fragment size for GEMM computation, and the wavefront/warp size denotes the number of threads in the thread group of the two GPU platforms, which is determined by their hardware architectures. Due to the various sizes and shapes of the fragments supported by different architectures, the choice of the fragment is also limited on the two GPU platforms. At the same time, the wavefront/warp size also affects the thread workload allocation. Therefore, when choosing parameters, it is essential to consider the impact of hardware differences on performance.

4.3. An efficient multi-level pipeline for GEMM

In the GPU memory architecture, different memory spaces have different sizes and access speeds. Data access speeds can be classified from fast to slow as register memory, local memory, and global memory.
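Looking back at Section 4.2, the tile-selection trade-off of Eqs. (12) and (15) can be scripted as a quick check. The sketch below (Python; the candidate tiles and the waves-per-block factor are illustrative, not values from the paper) scores each candidate by its compute-to-memory ratio and its wavefront count:

```python
def compute_to_memory(tile_m: int, tile_n: int) -> float:
    """Eq. (15): per-iteration flops over bytes, 3*TM*TN / (2*(TM+TN))."""
    return 3 * (tile_m * tile_n) / (2 * (tile_m + tile_n))

def num_wavefronts(M, N, tile_m, tile_n, waves_per_block=4):
    """Wavefront count feeding Eq. (12)'s TLP term; waves_per_block
    plays the role of the workgroup-to-wavefront transform phi."""
    return (M * N) // (tile_m * tile_n) * waves_per_block

M = N = 8192
for tile_m, tile_n in [(64, 64), (128, 128), (128, 256)]:
    ratio = compute_to_memory(tile_m, tile_n)
    waves = num_wavefronts(M, N, tile_m, tile_n)
    print(f"tile {tile_m}x{tile_n}: ratio={ratio:.1f}, wavefronts={waves}")
```

Running this makes the inverse relationship visible: the 128x256 tile has the best per-thread ratio but the fewest wavefronts, while the 64x64 tile is the opposite, which is exactly the tension the analytic model resolves.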
Fig. 6. The workflow of GEMM.
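The double-buffered workflow of Fig. 6 can be modeled in a few lines of runnable Python (a toy: plain lists stand in for shared-memory buffers, and `load_tile`/`mma` are stand-ins for the global-memory copies and Matrix Core FMAs, not real APIs):

```python
# Toy model of the Fig. 6 pipeline: while tile k is being consumed
# from one buffer, tile k+1 is prefetched into the other.
def load_tile(k):          # stand-in for a global->shared memory copy
    return [k, k + 1]

def mma(acc, tile):        # stand-in for the Matrix Core FMA step
    return acc + sum(tile)

def pipelined_gemm(num_iters: int) -> int:
    buffers = [load_tile(0), None]   # buffer0 pre-filled before the loop
    acc = 0
    for k in range(num_iters):
        if k + 1 < num_iters:        # prefetch next tile into the idle buffer
            buffers[(k + 1) % 2] = load_tile(k + 1)
        acc = mma(acc, buffers[k % 2])   # compute on the current buffer
        # (a real kernel synchronizes threads here before reusing a buffer)
    return acc

print(pipelined_gemm(4))
```

On a GPU the prefetch and the FMA overlap in time; the toy only shows the buffer alternation and the result being accumulated across iterations.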
Table 2
The choice of the parameters on two GPUs.
    Name                               AMD MI210        NVIDIA A800
    Matrix tile size (T_M, T_N, T_K)   (128, 128, 32)   (128, 256, 32)
    Fragment size (f_m, f_n, f_k)      (32, 32, 16)     (16, 16, 16)
    Wavefront/Warp size                64               32

Table 3
The configuration of platforms for evaluation.
    Name        AMD-platform    NVIDIA-platform
    CPU         EPYC 7663       Platinum 8358
    GPU         MI210           A800
    OS          Ubuntu 20.04    Ubuntu 20.04
    ROCm/CUDA   ROCm 5.6        CUDA 12.0

The speed is inversely proportional to the size of the memory space. Direct access to data from global memory reduces transfer efficiency and does not maximize memory bandwidth. Therefore, in this paper, double buffers are used to alleviate the problem of the large differences in memory access speeds, and pre-fetch techniques are used to maximize memory bandwidth and thread instruction parallelism. The workflow for a tile in GEMM is shown in Fig. 6.

First, a double buffer (buffer0 and buffer1) is created in shared memory; the size of a single buffer is related to the cumulative span of matrices A and B along the K-dimension. As shown in Fig. 6, loading instructions are issued to fetch the matrix tiles A_FP16, R_A, B_FP16 and R_B in iteration #0, and the data is stored in buffer0. The fetched data correspond to the yellow tiles A_FP16 and R_A (T_M × T_K) and the green tiles B_FP16 and R_B (T_K × T_N) in Fig. 6. At the same time, the data to be used in iteration #1 is fetched into buffer1. This allows the registers to fetch data directly from shared memory, without waiting for the data to be loaded from global memory. A thread can load data directly from shared memory into its private registers, following a specified data layout, and then use the Matrix Core to complete the computation. When the intermediate result of iteration #0 is complete, the thread reads the data needed for the next iteration from buffer1. For efficient data pipelining and instruction-level parallelism, the data for the next iteration can be pre-fetched from global memory into shared memory while data transfers between shared memory and registers are being performed. Based on the above approach, we can implement a multi-level access pipeline between the different memory levels, so that access and computation instructions execute in parallel, improving the ILP performance of a thread. Note that a thread synchronization instruction is executed after each buffer store to ensure data consistency. Each iterative computation produces an intermediate result, which is combined with the accumulation matrix C_FP32 in an FMA operation. After several computation iterations, the final calculation result D_FP32 is obtained and written back to global memory.

The implementation of the multi-stage pipeline above must be aware of the sizes of the registers and shared memory supported by the specific GPU hardware. Once the maximum capacity limit is exceeded, data overflow will occur, which greatly increases access latency. Indeed, register and shared memory occupancy is related to T_M, T_N and T_K from the above description. When a tiling scheme is given, T_M and T_N are also determined. Therefore, data overflow can be prevented by setting T_K flexibly. For example, on the AMD platform, based on the fact that T_M and T_N are 128 and 256 respectively, T_K can be set to a smaller size such as 16 or 32 to prevent data overflow. Note that, to avoid invalid computations on the Matrix Core, T_K should preferably be an integer multiple of the fragment size.

5. Results

This section compares the proposed method with some state-of-the-art approaches and analyzes the experimental results on two GPU platforms. The flowchart of the comparative experiment is shown in Fig. 7. Note that the matrix sizes of A ∈ R^(M×K) and B ∈ R^(K×N) are represented as ``M*N*K'' in the following expressions for simplicity. The tested range of numbers was [−1, 1] in the comparative experiments.

5.1. Setup

Experiment platform. In this paper, we chose two of the most common and advanced GPUs (MI210 and A800). The configuration and details of the experimental platforms are given in Tables 3 and 4, respectively.

Comparison method. To evaluate the proposed method, first, for the two GPU platforms, the default single-precision GEMM methods, such as rocBLAS and cuBLAS, are provided by the respective GPU vendors. These methods do not use the mixed-precision core and directly utilize the FP32 input to compute the FP32 GEMM results. We also compare some methods, such as rocWMMA and WMMA, that use FP16 input to compute FP32 results on Matrix Core or Tensor Core
Fig. 7. The flowchart of the comparative experiment.
without precision refinement. Finally, the experiments compare Markidis's and Ootomo's methods, which calculate FP32 results based on FP16 input with precision refinement. In this paper, the experimental results are averaged over 10 experiments. The baseline methods are described in Table 5.

Table 4
The configuration of two GPUs.
    Name                   AMD MI210           NVIDIA A800
    Architecture           CDNA 2.0            Ampere
    Mixed-precision core   Matrix Core         Tensor Core
    Caches                 L1 16 KB (per CU)   L1 192 KB (per SM)
                           L2 16 MB            L2 40 MB
    Memory                 64 GB HBM2          80 GB HBM2
    Bandwidth              1.6 TB/s            2.04 TB/s

Table 5
The description of the baseline methods.
    Name           Input   Output   Description
    rocBLAS [30]   FP32    FP32     rocblasSgemm on ROCm Core.
    cuBLAS [32]    FP32    FP32     cublasSgemm on CUDA Core.
    rocWMMA [28]   FP16    FP32     rocWMMA implemented on Matrix Core without precision refinement.
    WMMA [55]      FP16    FP32     WMMA implemented on Tensor Core without precision refinement.
    Mark [10]      FP16    FP32     Markidis's method implemented on the two platforms.
    Ooto [43]      FP16    FP32     Ootomo's method implemented on the two platforms.

Evaluation criteria. To objectively evaluate the effectiveness of the proposed method, two criteria are used in this paper to evaluate speed and accuracy. In terms of measuring speed, the experimental results are reported as the average TFLOPS (Tera FLoating-point Operations Per Second), calculated as:

    TFLOPS = 2((M × N × K) + (M × N)) / (total_time × 1.0e12)        (18)

where M, N and K represent the matrix dimensions of the GEMM, and total_time represents the running time in seconds measured on the GPU. To evaluate the calculated results, the max relative error (Max_error) is utilized to evaluate the error compared to the CPU computation results, which is calculated as:

    Max_error = Max( |toDouble(V_p(i, j)) − toDouble(V_single(i, j))| / (|toDouble(V_p(i, j))| + |toDouble(V_single(i, j))|) ),
                i = 1, 2, …, M,  j = 1, 2, …, N        (19)

where V_p(i, j) denotes the computational result under the precision refinement procedure, which can achieve single-precision results with half-precision matrices in this paper; V_single(i, j) is the result of a standard single-precision calculation used to compute Max_error; (i, j) represents the element in the i-th row and j-th column of the matrix; toDouble() is a precision transformation function used to obtain more accurate errors; M and N represent the dimensions of the resulting matrix; and Max() chooses the maximum value among all errors. Meanwhile, the Mean Relative Error Distance (MRED) is also used as an error evaluation metric, which is calculated as follows:

    MRED = (1 / (M × N)) × Σ_{i=1}^{M} Σ_{j=1}^{N} |toDouble(V_single(i, j)) − toDouble(V_p(i, j))| / |toDouble(V_single(i, j))|        (20)

where M and N denote the size of the result matrix, and V_single(i, j) and V_p(i, j) have the same meaning as in Eq. (19).

5.2. The speed improvement with various shapes

To evaluate the speed advantages of our method, we conducted comparative experiments on two GPU platforms with multiple matrix shapes. The results are presented in Figs. 8–9. In Fig. 8(a)–(c), our method improves performance by 1.31× compared to rocBLAS, an arithmetic library commonly used for scientific computing and provided by AMD. We also compare rocWMMA, which directly uses half-precision inputs to produce a single-precision output without a precision refinement process. The proposed method achieves average speedups of 1.25×, 1.24×, and 1.45× in the (n*n*n), (n*4n*n), and (n*n*4n) cases, respectively. From Fig. 8(a)–(b), when n ≤ 4096, it can be seen that the performance of rocWMMA is superior to that of the proposed method. This phenomenon is mainly due to the following two reasons. First, rocWMMA is a straightforward computation that does not involve the accuracy refinement procedure; therefore, the workload of the calculation process is smaller. Second, the matrix dimension is small, so the proposed method cannot fully exploit the performance improvement brought by the multi-level pipeline design. As the matrix dimension gradually increases, the performance advantage of the proposed method becomes increasingly apparent, as seen in Fig. 8. Compared to Markidis's and Ootomo's methods, the proposed method has speedups of 3.05× and 3.18×, respectively. These two methods utilize precision refinement techniques but require more computation workload, and the data layout and instruction scheduling characteristics of the specialized hardware accelerators were not considered, leading to lower performance. The proposed method utilizes the thread parallelism analytic model to choose the tiling scheme used to segment the matrix. Additionally, an efficient multi-level pipeline is used to improve the efficiency of data access and ILP. This makes it an effective solution for addressing the issue of low computation speed in precision refinement. Fig. 8 indicates that the proposed method has performance benefits when the matrix dimension is large. This serves as further evidence that the proposed LE-GEMM has significantly improved performance for both square and skewed matrices.
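The two accuracy metrics of Eqs. (19)–(20) can be written down directly. A Python sketch over flat lists (in our notation `v_p` holds the refined results and `v_single` the single-precision reference):

```python
def max_error(v_p, v_single):
    """Eq. (19): max relative error with a symmetric denominator."""
    return max(abs(p - s) / (abs(p) + abs(s))
               for p, s in zip(v_p, v_single))

def mred(v_p, v_single):
    """Eq. (20): mean relative error distance w.r.t. the reference."""
    return sum(abs(s - p) / abs(s)
               for p, s in zip(v_p, v_single)) / len(v_single)

ref = [1.0, -2.0, 4.0]
out = [1.001, -2.0, 3.996]
print(max_error(out, ref))
print(mred(out, ref))
```

Note the two metrics use different denominators (the sum of magnitudes in Eq. (19), the reference magnitude in Eq. (20)), so they are not interchangeable; that is why the paper reports both.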
Fig. 8. The experimental results with various shapes on MI210.
Fig. 9. The experimental results with various shapes on A800.
The experimental results of the speed comparison on the NVIDIA A800 are shown in Fig. 9. In the square matrix case (n*n*n), the proposed method achieves average speedups of 1.37×, 1.40×, 2.78× and 2.74× compared with cuBLAS, WMMA, Mark and Ooto, respectively. As illustrated in Fig. 9, when the matrix size is small, the performance of LE-GEMM is similar to the compared methods. For example, when n ≤ 2048, Fig. 9(a) shows that the proposed method performs similarly to cuBLAS and WMMA. This phenomenon occurs because a smaller matrix reduces the number of tiles and the number of iterative accumulations within the tiles, reducing the performance gain brought by the multi-level pipeline. As the size of the matrix increases, the benefits of the proposed method become more apparent; when the matrix size is large, the proposed method significantly improves performance over the other methods. Experiments conducted on the two GPU platforms demonstrate that the performance of the Markidis and Ootomo methods is nearly identical. The reason is that, while Ootomo's method reduces the computational effort by 25%, it requires an additional accumulator to perform the accumulation, and the initialization and combination stages of this accumulator also require additional time. When the matrix is skewed, the proposed method outperforms the other baseline methods significantly. In Fig. 9(c), when n > 7168, the performance of the WMMA method decreases because it is a simple and direct implementation that ignores the data loading pipeline and the multi-level memory architecture during the computation process. This simple implementation results in lower efficiency when accumulating along the K-dimension, which is more pronounced when the matrix size is (n*n*4n), and it prevents WMMA from fully utilizing its hardware acceleration capability. Fig. 9(b) illustrates that the proposed method achieves speedups of 1.40×, 1.43×, 2.84× and 2.80× over the four baseline methods, respectively. As shown in Fig. 9, the proposed method performs better when the matrix size is large; this performance trend is consistent with the AMD platform. The detailed comparative experimental results on the two GPU platforms are presented in Table 6.

Table 6
The speedup on two GPU platforms.
    Baseline method    AMD MI210                        NVIDIA A800
                       (n*n*n)  (n*4n*n)  (n*n*4n)      (n*n*n)  (n*4n*n)  (n*n*4n)
    rocBLAS/cuBLAS     1.31×    1.31×     1.32×         1.37×    1.40×     1.36×
    rocWMMA/WMMA       1.26×    1.24×     1.45×         1.40×    1.43×     1.81×
    Mark               2.81×    2.99×     3.35×         2.77×    2.84×     4.67×
    Ooto               2.99×    3.10×     3.45×         2.74×    2.80×     3.76×

5.3. The precision error

This section compares several baseline methods in terms of their maximum error. The experimental results are shown in Fig. 10. First, we compare with rocWMMA, which computes FP32 GEMM directly from FP16 using the Matrix Core. In Fig. 10(a), the proposed method reduces the average Max_error by 377.33× compared to rocWMMA. The results indicate that FP16 cannot preserve the full mantissa of FP32 due to the limitation of its mantissa length. Additionally, this method lacks precision refinement techniques, resulting in a pronounced Max_error. The comparison shows that the proposed method can significantly reduce the Max_error compared to the default routine. The Mark and Ooto methods were then compared separately. These two methods utilize precision refinement techniques to decrease the Max_error and have improved computational accuracy compared to the default methods. Fig. 10(a) shows that the proposed method reduces the average Max_error by 74.82× compared to Mark. Although Markidis et al. [10] employed an additional FP16 execution precision refinement process to reduce the Max_error, the effect of the rounding method on FP16 was not considered, leading to certain limitations in improving accuracy. The proposed method reduces the Max_error by an average of 2.69× compared to Ootomo's method. However, the additional FP16 does not preserve the residuals between FP32 and FP16 efficiently due to the limitations of the numerical representation range. Rounding inevitably causes fluctuations in the values, leading to errors in the values preserved by FP16.
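The rounding-versus-truncation effect described above can be observed directly with Python's half-precision struct codes (a host-side sketch; the chosen value of `x` is ours, picked to sit just below a round-to-nearest decision point):

```python
import struct

def to_fp16_rn(x: float) -> float:
    """Round-to-nearest-even conversion to binary16 (what a plain
    FP32 -> FP16 cast does)."""
    return struct.unpack(">e", struct.pack(">e", x))[0]

def to_fp16_rz(x: float) -> float:
    """Truncation: zero the mantissa bits below FP16 precision in the
    binary32 encoding (the bit-cut idea), so no carry into high bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & ~0x1FFF))[0]

# x lies 90% of the way between two adjacent FP16 values near 1.0
x = 1.0009765625 + 2 ** -10 * 0.9
rn, rz = to_fp16_rn(x), to_fp16_rz(x)
print(rn > x, rz <= x)   # RN overshoots x; bit-cut never does
```

Rounding up changes the preserved high bits (and the sign of the residual), which is exactly the fluctuation the bit-cut avoids.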
Fig. 10. The Max_error with square matrix (n*n*n) on two GPU platforms.
Fig. 11. The Mean Relative Error Distance with square matrix (n*n*n) on two GPU platforms.
To alleviate this problem, the proposed method employs a bit-cut method for the high bits of the FP32 mantissa to avoid the value fluctuation caused by rounding. Fig. 10(a) shows that the Max_error increases with the matrix dimension (K-dimension). This is because FP16 cannot fully preserve the FP32 values, resulting in errors in the calculated results; as the number of computations increases, the error gradually becomes more apparent. The effectiveness and advancement of the proposed method can be demonstrated by comparing it with the default (rocWMMA) method and the latest (Mark and Ooto) methods on the AMD platform.

To demonstrate the proposed method's portability, we conducted corresponding experiments on the NVIDIA A800, another popular GPU platform. The experimental results for Max_error are shown in Fig. 10(b). To ensure an objective comparison, we applied Markidis's method and Ootomo's method on the NVIDIA platform using the WMMA interface; the interface name is the only difference between this implementation and the AMD-platform implementation. Fig. 10(b) illustrates that the proposed method significantly reduces the Max_error as the matrix size increases, which is consistent with the results obtained on the AMD platform. The proposed method also has a smaller Max_error compared to WMMA, Mark, and Ooto, reducing the average Max_error by 571.75×, 2.78× and 1.41×, respectively. The experimental results on the two platforms show the same trend as the results on the AMD platform. This indicates that the proposed method has a low Max_error on mainstream GPU platforms, further confirming its effectiveness in terms of computational accuracy.

In this section, we have also provided the MRED of several comparative methods, which allows for a more comprehensive analysis of the matrix error. The detailed experimental results are presented in Fig. 11. As illustrated in Fig. 11(a), the proposed method exhibits a smaller MRED than rocWMMA, Mark, and Ooto, with reductions of 636.92×, 178.93×, and 1.12× on MI210, respectively. By combining the experimental results in Figs. 10–11, it can be found that rocWMMA and WMMA have enormous Max_error and MRED compared to the other methods, which also shows the importance of precision refinement. On the A800, the proposed method reduces the MRED by 814.87×, 1.75×, and 1.28× compared to WMMA, Mark, and Ooto, respectively. The experimental results of Figs. 10–11 demonstrate a consistent trend: the Max_error and MRED are more significant when the cumulative dimension increases. As shown in Figs. 10–11, the proposed method has a smaller Max_error and MRED, which indicates that it has a smaller upper limit of error and a smaller mean relative error. The comparison of Max_error and MRED proves that the proposed method has better computational accuracy.

This section also provides error distribution histograms to demonstrate each method's error range. The error distribution histograms were measured on the two GPU platforms with a matrix size of 1024*1024*1024. The detailed error distribution results are shown in Figs. 12–13. As shown in Figs. 12–13, the error distribution of rocWMMA and WMMA is the widest, indicating that the error impact without an additional precision computation process is significant. Compared to the Mark method, the proposed method has obvious advantages in both the error range and the maximum error. This precision advantage is more evident on MI210. The above gap arises because, although Mark considers the impact of down-conversion, a simple rounding approach cannot effectively preserve the low bits. Moreover, the differences in the implementation of the rounding API across hardware platforms can also impact the calculation results. Compared to the Ooto method, the error distribution histograms of the two methods are similar. The error of the proposed method is more concentrated around 0 in Figs. 12 and 13, indicating that the proposed method is superior to Ooto in terms of error distribution.
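The growth of error with the accumulation dimension K, noted above, is easy to reproduce in miniature. The sketch below (Python; host doubles stand in for GPU arithmetic, and the bit-cut helper repeats the earlier truncation idea) compares a dot product over plain truncated inputs with the three-term emulation of Eq. (10) as K grows:

```python
import random
import struct

def bit_cut(x: float) -> float:
    """Truncate a binary32 mantissa to FP16 width (10 bits kept)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & ~0x1FFF))[0]

def dot_errors(k: int, seed: int = 0):
    """Relative error of a K-length dot product: plain bit-cut inputs
    versus the three-term emulation of Eq. (10)."""
    rng = random.Random(seed)
    a = [rng.uniform(-1, 1) for _ in range(k)]
    b = [rng.uniform(-1, 1) for _ in range(k)]
    exact = sum(x * y for x, y in zip(a, b))
    plain = sum(bit_cut(x) * bit_cut(y) for x, y in zip(a, b))
    res = lambda x: (x - bit_cut(x)) * 2**10
    emu = sum(bit_cut(x) * bit_cut(y)
              + (bit_cut(x) * res(y) + res(x) * bit_cut(y)) / 2**10
              for x, y in zip(a, b))
    return abs(plain - exact) / abs(exact), abs(emu - exact) / abs(exact)

for k in (64, 1024):
    plain_err, emu_err = dot_errors(k)
    print(k, plain_err, emu_err)
```

The emulated accumulation stays several orders of magnitude closer to the exact result, mirroring the gap between the refined and unrefined curves in Figs. 10–13.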
Fig. 12. Error distribution histogram on MI210.
Fig. 13. Error distribution histogram on A800.
|
||
|
||
|
||
both the error upper limit and the error distribution on both platforms, scientific computing scenarios: K-means [6], FFN layer in BERT [8], and
|
||
proving its accuracy advantage. Linear transformation in LLMs [58], which involve numerous GEMM
|
5.4. The empirical Roofline analysis

To further demonstrate the effectiveness of the proposed method, we present a comparison and analysis of the empirical Roofline performance on two GPU platforms. In the comparison experiments, GEMMs with larger dimensions are chosen for profiling, which avoids performance estimation bias due to fewer data elements in the multi-level cache. To measure performance metrics, we utilized Omniperf [56] and NCU [57], the profiling tools provided by AMD and NVIDIA, respectively. The empirical Roofline performance is typically composed of two regions, ‘‘memory-bound’’ and ‘‘compute-bound’’, which indicate that the current kernel performance is limited by memory access and by compute capacity, respectively. The specific experimental results are shown in Fig. 14. From Fig. 14, it can be seen that the proposed method lies in the ‘‘compute-bound’’ region, in contrast to Mark, Ooto, and rocWMMA, indicating that the proposed method achieves higher bandwidth utilization of the multi-level memory in the GPU. This is because these baseline approaches do not take into account the data layout of the specialized hardware units and the multi-level cache characteristics of GPUs, resulting in lower performance. To address this issue, the proposed method uses double buffering and pre-fetching techniques to establish an efficient multi-level pipeline design. This design empowers the kernel to optimize the use of data in the multi-level cache, leading to improved performance. Although rocBLAS and cuBLAS also fall into the ‘‘compute-bound’’ region, the performance of the proposed method is better, which also shows the computational effectiveness of the proposed method for single-precision GEMM. The comparison of the empirical Roofline performance on the two platforms further confirms the effectiveness and advancement of the proposed method.
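The ‘‘memory-bound’’/‘‘compute-bound’’ split follows the Roofline model: a kernel whose arithmetic intensity (FLOPs per byte of traffic) lies below the machine balance point (peak FLOP/s divided by peak bandwidth) is limited by memory, otherwise by compute. A back-of-the-envelope sketch, with illustrative peak numbers that are placeholders, not measured values for MI210 or A800:

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte for C = A @ B, counting one read of A and B and one
    write of C (each multiply-add counts as two FLOPs)."""
    flops = 2.0 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def bound_region(ai: float, peak_flops: float, peak_bw: float) -> str:
    """Below the machine balance point a kernel is memory-bound, above it
    compute-bound."""
    return "compute-bound" if ai >= peak_flops / peak_bw else "memory-bound"

# Illustrative peaks: 45 TFLOP/s FP32 and 1.6 TB/s of HBM bandwidth.
ai = gemm_arithmetic_intensity(8192, 8192, 8192)
region = bound_region(ai, peak_flops=45e12, peak_bw=1.6e12)
```

This also explains why the profiling above uses GEMMs with larger dimensions: arithmetic intensity grows with the matrix size, so small GEMMs sit in the memory-bound region regardless of the kernel implementation.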
5.5. The improved performance for GEMM-based scientific computing scenarios

To demonstrate the proposed method's acceleration performance in GEMM-based scientific computing, we selected three real-world scientific computing scenarios that involve numerous GEMM operations: K-means [6], the FFN layer in BERT [8], and linear transformation in LLMs [58]. Fig. 15 shows the speedup ratio of the proposed method compared to the default implementations (rocBLAS and cuBLAS) on the two GPU platforms. The average speedups of the proposed method in K-means over rocBLAS and cuBLAS are 1.16× and 1.33×, respectively, as shown on the left of Fig. 15. The GEMM operation's shape during computation is determined by altering the number of data points in the K-means comparison experiments. The proposed method's performance advantage increases gradually with the number of data points and then stabilizes, which suggests that the proposed method yields more significant performance gains in large-scale GEMM. Fig. 15 also illustrates the performance improvement of the FFN layer, which incorporates the proposed method to accelerate the computational process of the BERT layer. As illustrated in Fig. 15, the proposed method achieves 1.16× and 1.18× speedups compared to rocBLAS and cuBLAS. Experiments on various input sequence lengths show that the proposed method can significantly improve the speed of the FFN computation process on both GPU platforms. Meanwhile, we also apply the proposed method to the linear transformation in LLMs to accelerate their linear mapping process, which involves GEMM workloads of various sizes. During the linear transformation of LLMs, the input features are mapped to produce Q, K, and V, commonly known as the Query, Key, and Value matrices. The range of the matrix data from the LLM training process is [0.5, 2.5]. In Fig. 15, for the computation of the multiple-size weight matrices, the proposed method achieves speedups of 1.23× and 1.34× compared to the default implementations, respectively. Fig. 16 shows the Max_error and MRED of the proposed method in the three GEMM-based applications. As illustrated in Fig. 16, the magnitude of the error associated with the proposed method is primarily contingent upon the K-dimension of the GEMM: as the K-dimension increases, both Max_error and MRED of the proposed method also increase, because the computation in GEMMs with larger K-dimensions involves more FMA operations, leading to more significant errors.
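In K-means, the dominant cost is the pairwise point–centroid distance computation, whose cross term is exactly an (N points) × (K clusters) GEMM; this is why the GEMM shape tracks the number of data points in the experiment above. A sketch of the standard expansion ‖x − c‖² = ‖x‖² − 2 x·c + ‖c‖² (the helper name is ours):

```python
import numpy as np

def assign_clusters(points: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Nearest-centroid assignment. The cross term points @ centroids.T is the
    dominant GEMM; the squared-norm terms are cheap rank-1 corrections."""
    cross = points @ centroids.T                      # (N, K) GEMM
    p2 = np.sum(points ** 2, axis=1, keepdims=True)   # (N, 1)
    c2 = np.sum(centroids ** 2, axis=1)               # (K,)
    d2 = p2 - 2.0 * cross + c2                        # squared distances
    return np.argmin(d2, axis=1)

rng = np.random.default_rng(1)
pts = rng.normal(size=(1000, 8)).astype(np.float32)
cts = pts[:4].copy()                                  # seed centroids from data
labels = assign_clusters(pts, cts)
```

Replacing the `points @ centroids.T` product with an accelerated single-precision GEMM is where a method like LE-GEMM plugs in.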
Fig. 14. The empirical Roofline performance.

Fig. 15. The acceleration performance with LE-GEMM (note that rocBLAS and cuBLAS serve as the baselines for the performance comparison; hence, the speedup of these methods is 1).

Fig. 16. The Max_error and MRED with LE-GEMM.

5.6. The performance gain analysis

In this section, we evaluate the performance gains of the two techniques, the thread parallelism analytic model and the multi-level pipeline, for LE-GEMM performance. The detailed experimental results are shown in Fig. 17. In Fig. 17, ‘‘LE-GEMM manual parameter’’ represents the use of previous experience to select the appropriate parameters, and ‘‘LE-GEMM plain implementation’’ is the direct use of the API interface without the multi-level pipeline. As shown in Fig. 17, the thread parallelism analytic model brings 1.36× and 1.78× performance improvements on MI210 and A800. As the matrix size gradually increases, the performance gain becomes more obvious. For example, when the matrix size ≥ 4000, there is a significant performance gap between the proposed method and the manual parameter implementation. This is because different parameters have a significant performance impact for varying hardware architectures and matrix sizes.
The manual parameter approach relies solely on prior experience in selecting kernel parameters, and it often requires several rounds of trial and error to obtain a suitable parameter value. However, the lack of a detailed analysis of the hardware architecture and the computational workflow makes it difficult to obtain the optimal parameters with the manual selection approach. We also evaluate the LE-GEMM plain implementation, which ignores the multi-level memory architecture and adopts a simple, direct calling style. Fig. 17 illustrates that LE-GEMM exhibits performance speedups of 3.02× and 2.72× over the plain implementation on the two GPU platforms. In the LE-GEMM plain implementation, threads are required to wait for the next data load after completing a single iteration of the calculation, which significantly reduces performance. Compared to the plain implementation, LE-GEMM can efficiently utilize the various memory spaces for pipelined access operations, reducing the data loading overhead and improving CU utilization.

Fig. 17. The performance gain analysis of LE-GEMM (‘‘LE-GEMM manual parameter’’ indicates LE-GEMM without the thread parallelism analytic model and ‘‘LE-GEMM plain implementation’’ indicates LE-GEMM without the multi-level pipeline).
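The stall described for the plain implementation, and the double-buffer remedy, can be mirrored in host code: two tile buffers alternate so that the tiles for the next K-step are staged while the current ones are consumed. The following is a NumPy analogy of the pipeline structure, not the GPU kernel itself; on the device the staging step would be an asynchronous global-to-shared-memory copy:

```python
import numpy as np

def tiled_gemm_double_buffer(a: np.ndarray, b: np.ndarray, tile: int = 16) -> np.ndarray:
    """K-loop with two tile buffers: while the current pair of tiles feeds the
    multiply-accumulate, the pair for the next step is staged into the spare
    buffer, so compute never waits on a load it could have overlapped."""
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    buf = [(a[:, :tile], b[:tile, :]), None]   # buffer 0 is pre-loaded
    cur = 0
    for k0 in range(0, k, tile):
        nxt = 1 - cur
        if k0 + tile < k:                      # stage the next tiles early
            buf[nxt] = (a[:, k0 + tile:k0 + 2 * tile],
                        b[k0 + tile:k0 + 2 * tile, :])
        at, bt = buf[cur]                      # compute on the current buffer
        c += at @ bt
        cur = nxt
    return c
```

In the plain implementation the staging line would sit after the compute line, serializing the two; swapping their order is the whole point of the double buffer.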
5.7. The trade-off between performance and precision

To explore the trade-off between high performance and precision, comparative experiments with different precisions were conducted on the A800. We used the cublasGemmEx() function in the comparative experiments to implement two different computation modes (CUBLAS_COMPUTE_32F_FAST_16F and CUBLAS_COMPUTE_64F). The CUBLAS_COMPUTE_32F_FAST_16F mode means directly using FP16 to obtain single-precision calculation results, during which down-conversion is performed. The detailed experimental results are presented in Fig. 18. As shown in Fig. 18, CUBLAS_COMPUTE_32F_FAST_16F has the fastest computational performance because this mode directly calls the Tensor Cores for matrix operations without any precision refinement process. The calculation process of CUBLAS_COMPUTE_32F_FAST_16F is consistent with WMMA, which directly uses down-conversion without a precision refinement process; hence, the error of these two methods is also consistent. CUBLAS_COMPUTE_32F_FAST_16F often leads to significant errors, limiting its application range. The proposed method reduces the precision loss by using additional computational resources, which decreases the overall computational speed. Compared to CUBLAS_COMPUTE_32F_FAST_16F, the proposed method has better computational accuracy. CUBLAS_COMPUTE_64F represents a double-precision matrix multiplication operation, which has higher data precision and lower computational speed. The experimental results in Fig. 18 indicate that higher-precision data tends to have lower computational speed. The proposed LE-GEMM achieves single-precision computational results based on low-precision computational units, which is essentially a compromise between computational performance and accuracy.

Fig. 18. The trade-off between performance and precision.
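The accuracy gap between the fast down-conversion mode and a refinement-based scheme can be reproduced numerically. The sketch below models FP16 inputs with FP32 accumulation (the Tensor Core behaviour) and contrasts it with a residual-splitting correction in the spirit of the Ootomo emulation [43]; it is an illustration of the general idea, not the paper's exact LE-GEMM algorithm:

```python
import numpy as np

def fp16_gemm(a32: np.ndarray, b32: np.ndarray) -> np.ndarray:
    """FAST_16F-style path: round the inputs to FP16 once, then multiply with
    FP32 accumulation. The input rounding dominates the error."""
    ah = a32.astype(np.float16).astype(np.float32)
    bh = b32.astype(np.float16).astype(np.float32)
    return ah @ bh

def fp16_refined_gemm(a32: np.ndarray, b32: np.ndarray) -> np.ndarray:
    """Split each FP32 operand into its FP16 part plus a residual and
    accumulate the correction products; the tiny residual-times-residual
    term is dropped. Three low-precision products instead of one."""
    ah = a32.astype(np.float16).astype(np.float32)
    bh = b32.astype(np.float16).astype(np.float32)
    al = a32 - ah
    bl = b32 - bh
    return ah @ bh + ah @ bl + al @ bh

rng = np.random.default_rng(2)
a = rng.uniform(0.5, 2.5, (128, 128)).astype(np.float32)
b = rng.uniform(0.5, 2.5, (128, 128)).astype(np.float32)
ref = a.astype(np.float64) @ b.astype(np.float64)
err_plain = float(np.max(np.abs(ref - fp16_gemm(a, b))))
err_refined = float(np.max(np.abs(ref - fp16_refined_gemm(a, b))))
```

The refined path's maximum error is far below the plain down-conversion error, at the cost of extra products, which mirrors the speed-for-accuracy compromise discussed above.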
6. Conclusion

This paper proposed LE-GEMM for single-precision matrix multiplication, which contains a lightweight emulation algorithm, a thread parallelism analytic model, and an efficient multi-level pipeline implementation. Specifically, the lightweight emulation algorithm provides more accurate computational results and eliminates redundant computations. Then, a thread parallelism analytic model is designed to determine the optimal tiling scheme by considering TLP and single-thread performance. A multi-level pipeline design is implemented to achieve access latency hiding and higher ILP. Finally, experiments were conducted on two GPU platforms to validate the effectiveness and advancement of the proposed method.

7. Future work
Future work includes exploring the impact of different types of low-precision data on the computational results, allowing for better trade-offs between performance and accuracy. Recently, low-precision hardware accelerators have been increasingly applied to various GEMM-based scientific computing scenarios to improve computational speed. The differences in hardware support and implementation can significantly impact the calculation results. In future work, we plan to explore more low-precision formats (e.g., BF16, FP8, FP4) to speed up the computational process in single-precision matrix multiplication.

CRediT authorship contribution statement

Yu Zhang: Writing – review & editing, Writing – original draft. Lu Lu: Writing – review & editing. Zhanyu Yang: Writing – review & editing, Conceptualization. Zhihong Liang: Writing – review & editing. Siliang Suo: Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of Guangdong Province, China (2024A1515010204) and the Technological Research Project of Southern Power Grid Company, China (ZBKJXM20232483).

Data availability

No data was used for the research described in the article.

References

[1] M. Meyer, T. Kenter, C. Plessl, Multi-FPGA designs and scaling of HPC challenge benchmarks via MPI and circuit-switched inter-FPGA networks, ACM Trans. Reconfigurable Technol. Syst. 16 (2) (2023) 1–27.
[2] H. Martínez, S. Catalán, A. Castelló, E.S. Quintana-Ortí, Parallel GEMM-based convolutions for deep learning on multicore ARM and RISC-V architectures, J. Syst. Archit. (2024) 103186.
[3] Y.R. Choi, V. Stegailov, Multi-GPU GEMM algorithm performance analysis for NVIDIA and AMD GPUs connected by NVLink and PCIe, in: International Conference on Mathematical Modeling and Supercomputer Technologies, Springer, 2022, pp. 281–292.
[4] B. Feng, Y. Wang, G. Chen, W. Zhang, Y. Xie, Y. Ding, EGEMM-TC: accelerating scientific computing on tensor cores with extended precision, in: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021, pp. 278–291.
[5] N.S. Murthy, F. Catthoor, M. Verhelst, Optimization of block-scaled integer GeMMs for efficient DNN deployment on scalable in-order vector processors, J. Syst. Archit. (2024) 103236.
[6] M. Ahmed, R. Seraj, S.M.S. Islam, The K-means algorithm: A comprehensive survey and performance evaluation, Electronics 9 (8) (2020) 1295.
[7] G. Chen, Y. Ding, X. Shen, Sweet KNN: An efficient KNN on GPU through reconciliation between redundancy removal and regularity, in: 2017 IEEE 33rd International Conference on Data Engineering, ICDE, IEEE, 2017, pp. 621–632.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018, arXiv preprint arXiv:1810.04805.
[9] AMD, Matrix core, 2024, https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/.
[10] S. Markidis, S.W. Der Chien, E. Laure, I.B. Peng, J.S. Vetter, NVIDIA Tensor Core programmability, performance & precision, in: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW, IEEE, 2018, pp. 522–531.
[11] AMD, AMD CDNA 2.0 architecture, 2024, https://www.amd.com/content/dam/amd/en/documents/instinct-business-docs/white-papers/amd-cdna2-white-paper.pdf.
[12] W. Sun, A. Li, T. Geng, S. Stuijk, H. Corporaal, Dissecting Tensor Cores via microbenchmarks: Latency, throughput and numeric behaviors, IEEE Trans. Parallel Distrib. Syst. 34 (1) (2022) 246–261.
[13] G. Park, B. Park, M. Kim, S. Lee, J. Kim, B. Kwon, S.J. Kwon, B. Kim, Y. Lee, D. Lee, LUT-GEMM: Quantized matrix multiplication based on LUTs for efficient inference in large-scale generative language models, 2022, arXiv preprint arXiv:2206.09557.
[14] J. Liu, R. Gong, X. Wei, Z. Dong, J. Cai, B. Zhuang, QLLM: Accurate and efficient low-bitwidth quantization for large language models, 2023, arXiv preprint arXiv:2310.08041.
[15] Y. Wang, B. Feng, Z. Wang, G. Huang, Y. Ding, TC-GNN: Bridging sparse GNN computation and dense Tensor Cores on GPUs, in: 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 149–164.
[16] R. Fan, W. Wang, X. Chu, Fast sparse GPU kernels for accelerated training of graph neural networks, in: 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2023, pp. 501–511.
[17] T. Utsugiri, T. Kouya, Acceleration of multiple precision matrix multiplication using Ozaki scheme, 2023, arXiv preprint arXiv:2301.09960.
[18] M. Tatsumi, S.-I. Filip, C. White, O. Sentieys, G. Lemieux, Mixing low-precision formats in multiply-accumulate units for DNN training, in: 2022 International Conference on Field-Programmable Technology, ICFPT, IEEE, 2022, pp. 1–9.
[19] Y. Tortorella, L. Bertaccini, L. Benini, D. Rossi, F. Conti, RedMule: A mixed-precision matrix–matrix operation engine for flexible and energy-efficient on-chip linear algebra and TinyML training acceleration, Future Gener. Comput. Syst. 149 (2023) 122–135.
[20] D.N. Parikh, R.A. van de Geijn, G.M. Henry, Cascading GEMM: High precision from low precision, 2023, arXiv preprint arXiv:2303.04353.
[21] M. Fasi, N.J. Higham, F. Lopez, T. Mary, M. Mikaitis, Matrix multiplication in multiword arithmetic: Error analysis and application to GPU tensor cores, SIAM J. Sci. Comput. 45 (1) (2023) C1–C19.
[22] P. Valero-Lara, I. Jorquera, F. Lui, J. Vetter, Mixed-precision S/DGEMM using the TF32 and TF64 frameworks on low-precision AI tensor cores, in: Proceedings of the SC'23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 2023, pp. 179–186.
[23] X.-Y. Liu, Z. Zhang, Z. Wang, H. Lu, X. Wang, A. Walid, High-performance tensor learning primitives using GPU Tensor Cores, IEEE Trans. Comput. 72 (6) (2022) 1733–1746.
[24] A. Abdelfattah, H. Anzt, E.G. Boman, E. Carson, T. Cojean, J. Dongarra, A. Fox, M. Gates, N.J. Higham, X.S. Li, et al., A survey of numerical linear algebra methods utilizing mixed-precision arithmetic, Int. J. High Perform. Comput. Appl. 35 (4) (2021) 344–369.
[25] M.O. Lam, J.K. Hollingsworth, B.R. de Supinski, M.P. Legendre, Automatically adapting programs for mixed-precision floating-point computation, in: Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, 2013, pp. 369–378.
[26] A. Dakkak, C. Li, J. Xiong, I. Gelado, W.-m. Hwu, Accelerating reduction and scan using tensor core units, in: Proceedings of the ACM International Conference on Supercomputing, 2019, pp. 46–57.
[27] M. Fasi, N.J. Higham, M. Mikaitis, S. Pranesh, Numerical behavior of NVIDIA tensor cores, PeerJ Comput. Sci. 7 (2021) e330.
[28] AMD, RocWMMA: ROCm wavefront-level matrix multiply and accumulate, 2024, https://github.com/ROCmSoftwarePlatform/rocWMMA.
[29] D. Yan, W. Wang, X. Chu, Demystifying Tensor Cores to optimize half-precision matrix multiply, in: 2020 IEEE International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2020, pp. 634–643.
[30] AMD, Next generation BLAS implementation for ROCm platform, 2024, https://github.com/ROCm/rocBLAS.
[31] A. Kerr, D. Merrill, J. Demouth, J. Tran, CUTLASS: Fast linear algebra in CUDA C++, NVIDIA Dev. Blog (2017).
[32] B. Tuomanen, Hands-On GPU Programming with Python and CUDA: Explore high-performance parallel computing with CUDA, Packt Publishing Ltd, 2018.
[33] D. Lustig, S. Sahasrabuddhe, O. Giroux, A formal analysis of the NVIDIA PTX memory consistency model, in: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 257–270.
[34] D. Mukunoki, K. Ozaki, T. Ogita, T. Imamura, Accurate matrix multiplication on binary128 format accelerated by Ozaki scheme, in: Proceedings of the 50th International Conference on Parallel Processing, 2021, pp. 1–11.
[35] J. Huang, L. Rice, D.A. Matthews, R.A. van de Geijn, Generating families of practical fast matrix multiplication algorithms, in: 2017 IEEE International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2017, pp. 656–667.
[36] Y. Wang, B. Feng, Y. Ding, QGTC: accelerating quantized graph neural networks via GPU tensor core, in: Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2022, pp. 107–119.
[37] B. Li, S. Cheng, J. Lin, TcFFT: Accelerating half-precision FFT through tensor cores, 2021, arXiv preprint arXiv:2104.11471.
[38] F. Zhang, Y. Hu, H. Ding, Z. Yao, Z. Wei, X. Zhang, X. Du, Optimizing random access to hierarchically-compressed data on GPU, in: SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, 2022, pp. 1–15.
[39] R. Sakamoto, M. Kondo, K. Fujita, T. Ichimura, K. Nakajima, The effectiveness of low-precision floating arithmetic on numerical codes: a case study on power consumption, in: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2020, pp. 199–206.
[40] Y. Sun, L. Zheng, Q. Wang, X. Ye, Y. Huang, P. Yao, X. Liao, H. Jin, Accelerating sparse deep neural network inference using GPU Tensor Cores, in: 2022 IEEE High Performance Extreme Computing Conference, HPEC, IEEE, 2022, pp. 1–7.
[41] Z. Ma, H. Wang, G. Feng, C. Zhang, L. Xie, J. He, S. Chen, J. Zhai, Efficiently emulating high-bitwidth computation with low-bitwidth hardware, in: Proceedings of the 36th ACM International Conference on Supercomputing, 2022, pp. 1–12.
[42] Z. Lin, A. Sun, X. Zhang, Y. Lu, MixPert: Optimizing mixed-precision floating-point emulation on GPU integer tensor cores, in: Proceedings of the 25th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, 2024, pp. 34–45.
[43] H. Ootomo, R. Yokota, Recovering single precision accuracy from tensor cores while surpassing the FP32 theoretical peak performance, Int. J. High Perform. Comput. Appl. 36 (4) (2022) 475–491.
[44] P. Blanchard, N.J. Higham, F. Lopez, T. Mary, S. Pranesh, Mixed precision block fused multiply-add: Error analysis and application to GPU tensor cores, SIAM J. Sci. Comput. 42 (3) (2020) C124–C141.
[45] N.J. Higham, T. Mary, Mixed precision algorithms in numerical linear algebra, Acta Numer. 31 (2022) 347–414.
[46] Z. Jia, M. Maggioni, B. Staiger, D.P. Scarpazza, Dissecting the NVIDIA Volta GPU architecture via microbenchmarking, 2018, arXiv preprint arXiv:1804.06826.
[47] M. Lu, B. He, Q. Luo, Supporting extended precision on graphics processors, in: Proceedings of the Sixth International Workshop on Data Management on New Hardware, 2010, pp. 19–26.
[48] A. Thall, Extended-precision floating-point numbers for GPU computation, in: ACM SIGGRAPH 2006 Research Posters, 2006, pp. 52–es.
[49] B. Ferrarini, M. Waheed, S. Waheed, S. Ehsan, M.J. Milford, K.D. McDonald-Maier, Exploring performance bounds of visual place recognition using extended precision, IEEE Robot. Autom. Lett. 5 (2) (2020) 1688–1695.
[50] N.J. Higham, T. Mary, Sharper probabilistic backward error analysis for basic linear algebra kernels with random data, SIAM J. Sci. Comput. 42 (5) (2020) A3427–A3446.
[51] H. Ootomo, H. Manabe, K. Harada, R. Yokota, Quantum circuit simulation by SGEMM emulation on tensor cores and automatic precision selection, in: International Conference on High Performance Computing, Springer, 2023, pp. 259–276.
[52] N. Nakasato, A fast GEMM implementation on the Cypress GPU, ACM SIGMETRICS Perform. Eval. Rev. 38 (4) (2011) 50–55.
[53] M. De Santis, G. Eichfelder, J. Niebling, S. Rocktäschel, Solving multiobjective mixed integer convex optimization problems, SIAM J. Optim. 30 (4) (2020) 3122–3145.
[54] R.S. Burachik, C.Y. Kaya, M.M. Rizvi, Algorithms for generating Pareto fronts of multi-objective integer and mixed-integer programming problems, Eng. Optim. 54 (8) (2022) 1413–1425.
[55] NVIDIA, Programming tensor cores in CUDA 9, 2024, https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/.
[56] AMD, Advanced profiling and analytics for AMD hardware, 2024, https://github.com/ROCm/omniperf.
[57] NVIDIA, The user guide for Nsight Compute, 2024, https://docs.nvidia.com/nsight-compute/NsightCompute/index.html.
[58] G. Heo, S. Lee, J. Cho, H. Choi, S. Lee, H. Ham, G. Kim, D. Mahajan, J. Park, NeuPIMs: NPU-PIM heterogeneous acceleration for batched LLM inferencing, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol. 3, 2024, pp. 722–737.