Journal of Systems Architecture 160 (2025) 103336
LE-GEMM: A lightweight emulation-based GEMM with precision refinement
on GPU
Yu Zhang a, Lu Lu a,b,c, Zhanyu Yang a, Zhihong Liang d,e, Siliang Suo d,e

a School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
b Peng Cheng Laboratory, Shenzhen 518055, China
c Pazhou Laboratory, Guangzhou 510005, China
d Electric Power Research Institute, CSG, Guangzhou, Guangdong, China
e Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou, Guangdong, China
ARTICLE INFO

Keywords:
GEMM
Thread parallelism analytic
Multi-level pipeline
Matrix core/Tensor core

ABSTRACT

Many special hardware units, such as Matrix Core and Tensor Core, have recently been designed and applied in various scientific computing scenarios. These units support tensor-level computation with different precisions on GPU. Previous studies have proposed methods for computing single-precision GEneral Matrix Multiplication (GEMM) with half-precision matrices. However, this routine often leads to some loss of accuracy, which limits its application. This paper proposes a Lightweight Emulation-based GEMM (LE-GEMM) on GPU that includes a lightweight emulation algorithm, a thread parallelism analytic model, and an efficient multi-level pipeline implementation to accelerate the computation process without compromising the accuracy requirements. First, we propose a lightweight emulation algorithm that combines a precision transformation process with a GEMM emulation calculation to achieve better computational accuracy and performance. Second, a thread parallelism analytic model is designed to analyze and guide the selection of the optimal tiling scheme for various computing scenarios and hardware. Third, an efficient multi-level pipeline is implemented, which maximizes instruction-level parallelism and latency hiding. Several comparison experiments were conducted on two commonly used GPU platforms: an AMD platform and an NVIDIA platform. The experimental results show that the proposed method outperforms previous approaches in terms of computational accuracy and speed.
1. Introduction

GEneral Matrix Multiplication (GEMM) has been extensively involved in High-Performance Computing (HPC) [1], deep learning applications [2], and other scientific computing fields [3]. In these fields, applications or scenarios in which GEMM is the primary building block are referred to as GEMM-based scientific computing [4,5]. These applications involve a large number of matrix multiplication workloads, and GEMM's computation time is a significant part of the overall process. Specifically, some typical applications, such as K-Means [6], KNN [7], Feedforward Neural Networks (FFN) [8], etc., spend more than 50% of their computation time in GEMM in popular implementations. To cater to this trend, specialized hardware accelerators (e.g., AMD's Matrix Core [9] and NVIDIA's Tensor Core [10]) have been designed and implemented on GPUs to speed up GEMM by employing low-precision inputs to achieve high performance. In contrast to the regular computation pattern, these mixed-precision cores can support tensor-level Fused Multiply-Add (FMA), dramatically speeding up the matrix computation process. For example, the half-precision performance with Matrix Core in the AMD MI210 achieves 8x higher throughput over ROCm Core [11]. Based on this feature, previous studies have proposed ways to exploit low-precision matrices to obtain high-precision results and achieve high performance on GPU [12-14].

Exploiting low precision to speed up GEMM is an effective route in GEMM-based scientific computing fields, where matrix multiplication operations can tolerate low precision [15,16]. The precision transformation process, which has a significant impact on the accuracy of the final calculated results, is a crucial part of this route [17]. Data value errors arise during the transition from high to low precision due to the loss in the mantissa representation. After several FMA calculations, the final result has some error compared to the standard one. This phenomenon is also known as precision loss. It restricts the application of mixed-precision calculations significantly [18,19]. Some previous studies have utilized the calculation
Corresponding author at: School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China.
E-mail addresses: yuzhang0722@163.com (Y. Zhang), lul@scut.edu.cn (L. Lu), yangzhanyu@hotmail.com (Z. Yang), liangzh@csg.cn (Z. Liang),
suosl@csg.cn (S. Suo).
https://doi.org/10.1016/j.sysarc.2025.103336
Received 11 July 2024; Received in revised form 8 December 2024; Accepted 2 January 2025
Available online 17 January 2025
1383-7621/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
results of additional low-precision variables to reduce precision loss in the calculation results [20-23].

Low-precision variables and additional computations are utilized for precision refinement [10], which reduces the impact of precision loss on the computational results. However, the implementation of this approach on GPUs that feature mixed-precision arithmetic is still hindered by several issues. The first is the data transformation process. Loss of the original data mantissa during the precision transformation has an impact on the final outcome. As a consequence of the disparity in mantissa length, low-precision data cannot preserve bits beyond the mantissa length range, which leads to numerical errors after the precision transformation. The effect of this error is amplified in the final calculation result after multiple FMA operations, creating a gap between the calculated result and the standard result. The second is the acceleration capability of these specialized hardware units. They provide GEMM with powerful acceleration, accompanied by special data layouts, which pose challenges to efficient kernel design. For the GEMM operation, tiling is a common way of assigning computational tasks, where the matrix is segmented into multiple tiles (smaller matrices), and a workgroup (thread block) is responsible for computing one or several tiles. When the tile size is too large, the workload of the threads becomes too heavy, reducing thread parallelism. Conversely, if the tile size is too small, there will be frequent workgroup switching, which increases scheduling overhead. The computation allocation of tiles also affects data reuse within a workgroup. In terms of data access, these specialized hardware units perform GEMM operations with fragments, which requires higher utilization of the GPU's multi-level memory. Ineffective organization and manipulation of data frequently negate the speed benefits of the tensor-level patterns, leading to memory wall bottlenecks in the GPU.

To address the above issues, we propose LE-GEMM, which includes a lightweight emulation algorithm, a thread parallelism analytic model, and a multi-level pipeline implementation to achieve better computational accuracy and speed. Our contributions can be summarized as follows:

• A lightweight emulation algorithm is proposed that employs bit-cutting and scaling techniques to preserve the original values and reduce the loss of precision, while removing redundant terms in the computation process to reduce the computational overhead.
• A thread parallelism analytic model based on the lightweight emulation GEMM is proposed, which allows the selection of the optimal tiling scheme through two aspects: the number of parallel threads and the single-thread performance.
• An efficient multi-level pipeline is implemented based on the access discrepancy between various levels of memory and the GEMM workflow, utilizing double-buffer and pre-fetch techniques to hide access latency and improve Instruction-Level Parallelism (ILP).
• The proposed single-precision GEMM acceleration method achieves state-of-the-art speed and accuracy on two GPU platforms.

The rest of the paper is organized as follows. Section 2 introduces the background of these specialized hardware units (Matrix Core and Tensor Core) and matrix multiplication based on the Ozaki scheme. Section 3 provides related work. Section 4 presents the details of the proposed accelerated method for single-precision GEMM. Section 5 demonstrates and evaluates the experimental results. The conclusions of the paper are given in Section 6. Section 7 outlines potential areas for future work.

2. Background

2.1. Matrix core and tensor core

The amount of computation for GEMM has gradually increased with the development of deep learning. In response to this trend, AMD and NVIDIA are adding specialized hardware units and cores to their latest GPUs that support mixed-precision computing [24,25]. In the following, we discuss the details of the specialized cores.

Computing pattern and memory layout. GPU cores can be classified into two types based on supported precision: same-precision cores (FP16 Core, FP32 Core, etc.) that only support computation in a single precision, and mixed-precision cores (Tensor Core or Matrix Core) that support computation across different precisions [26,27]. The GEMM supported by the mixed-precision core can be expressed as D = α × AB + β × C, where A and B can be FP16 or FP32 matrices, C and D can be configured as FP32 or double-precision (FP64) matrices, and α and β denote scalar constants [28]. This means that the mixed-precision core can utilize low-precision inputs to obtain high-precision calculation results [10,29]. The details are shown in Fig. 1(a). In terms of computing patterns, the mixed-precision core utilizes a new memory layout, the fragment, that stores matrix elements either in row-major or column-major order. Different from the same-precision core, matrix multiplication is performed in the mixed-precision core using the fragment as a unit. The fragment size is 16 × 16, 32 × 32, or another fixed shape (Fig. 1(b) provides an example based on a 16 × 16 fragment). Note that different hardware architectures have varying support for fragment size. For example, for NVIDIA A800, the maximum supported fragment size is 16 × 16 [10]. In AMD MI210, rocWMMA can support fragments with a size of 16 × 32 or 32 × 32 [11]. Different fragment sizes have a particular impact on the workload allocation of GEMM operations.

Programming interface and implementation. Matrix Core and Tensor Core are hardware cores delivered by two different GPU vendors (AMD and NVIDIA). GPU vendors have implemented various approaches, depending on their hardware configurations, to facilitate their use by researchers and engineers. On the AMD platform, the first approach is rocWMMA (ROCm Wavefront-level Matrix Multiply and Accumulate), AMD's C++ interface that leverages Matrix Core to accelerate GEMM operations. The usage and functionality of this interface are very similar to those of WMMA [28]; the difference is that rocWMMA calls the API handle with the keyword rocwmma. The second approach is the compiler intrinsic [9], which follows a specific syntax __builtin_amdgcn_wmma_. The third approach uses rocBLAS [30], which provides a functional interface and allows the direct operation of a mixed-precision core by supplying the matrix parameters.

There are multiple ways to call Tensor Core on the NVIDIA platform, such as WMMA (Warp-level Matrix Multiply and Accumulate) [10], CUTLASS [31], cuBLAS [32], PTX instructions [33], etc. WMMA provides several specific programming interfaces to load (wmma::load_matrix_sync), compute (wmma::mma_sync), and store (wmma::store_matrix_sync) matrices in fragments [10]. Based on this interface, the matrix is first tiled into multiple fragments, which can be shaped to fixed sizes, and then matrix multiplication is performed. CUTLASS (CUDA Templates for Linear Algebra Subroutines) [31] is a series of CUDA template header libraries for implementing high-performance GEMM and related computations at all scales and levels within CUDA. It provides a variety of GEMM computation interfaces (hgemm, sgemm, dgemm, and wgemm) to optimize data movement and multiply-accumulate operations in fragments to achieve high performance. cuBLAS [32] is a basic linear algebra subroutine library provided and implemented by NVIDIA. cuBLAS provides two interfaces (cublasGemmEx and cublasSgemm) that support GEMM on Tensor Core; in this approach, it is necessary to set the math mode of cuBLAS by using the cublasSetMathMode function. PTX provides a series of programming APIs and instruction sets for general-purpose parallel programming and is designed to be efficient on NVIDIA GPUs that support the defined computation functions. The terminological differences and relationships between ROCm and CUDA are given in Table 1.
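To make the fragment-based computing pattern concrete, the following CPU-side sketch mimics the load/compute/store structure of the WMMA interface, with a plain triple loop standing in for wmma::mma_sync. The 16 × 16 fragment size matches Fig. 1(b), but the helper names (load_fragment, mma_fragment, fragment_gemm) are illustrative, not part of any vendor API.

```cpp
#include <array>
#include <vector>

constexpr int FRAG = 16; // fragment edge, as in a 16 x 16 WMMA tile

using Frag = std::array<float, FRAG * FRAG>; // row-major fragment

// Copy one FRAG x FRAG tile of a row-major matrix into a fragment
// (stand-in for wmma::load_matrix_sync).
Frag load_fragment(const std::vector<float>& m, int ld, int row, int col) {
    Frag f{};
    for (int i = 0; i < FRAG; ++i)
        for (int j = 0; j < FRAG; ++j)
            f[i * FRAG + j] = m[(row + i) * ld + (col + j)];
    return f;
}

// acc += a * b on whole fragments (stand-in for wmma::mma_sync).
void mma_fragment(const Frag& a, const Frag& b, Frag& acc) {
    for (int i = 0; i < FRAG; ++i)
        for (int k = 0; k < FRAG; ++k)
            for (int j = 0; j < FRAG; ++j)
                acc[i * FRAG + j] += a[i * FRAG + k] * b[k * FRAG + j];
}

// D = A * B with M, N, K multiples of FRAG: each output tile accumulates
// over K/FRAG fragment products, the unit-of-work pattern of Fig. 1(b).
std::vector<float> fragment_gemm(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 int M, int N, int K) {
    std::vector<float> D(M * N, 0.0f);
    for (int row = 0; row < M; row += FRAG)
        for (int col = 0; col < N; col += FRAG) {
            Frag acc{}; // accumulator fragment, zero-initialized
            for (int k = 0; k < K; k += FRAG) {
                Frag a = load_fragment(A, K, row, k);
                Frag b = load_fragment(B, N, k, col);
                mma_fragment(a, b, acc);
            }
            for (int i = 0; i < FRAG; ++i) // store_matrix_sync stand-in
                for (int j = 0; j < FRAG; ++j)
                    D[(row + i) * N + (col + j)] = acc[i * FRAG + j];
        }
    return D;
}
```

On real hardware the inner triple loop is a single tensor instruction executed cooperatively by a wavefront or warp, which is where the speedup comes from.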
2
Y. Zhang et al. Journal of Systems Architecture 160 (2025) 103336
Fig. 1. The precision profiling and the computing pattern of mixed-precision units (Fig. 1(a) shows an example of precision profiling, where the input data are FP16, the products and accumulator are FP32, and the output is FP32. Fig. 1(b) illustrates a fragment-based GEMM operation with a 16 × 16 fragment).
Table 1
ROCm/CUDA terminology.

ROCm | CUDA | Description
Compute Unit (CU) | Streaming Multiprocessor (SM) | One among many parallel vector processors in a GPU with parallel ALUs. All wavefronts in a workgroup are assigned to the same CU.
Kernel | Kernel | Functions launched to the GPU that are executed by multiple parallel CUs or SMs. Multiple kernels can work in parallel with the CPU.
Wavefront | Warp | A collection of operations that execute in lockstep, run the same instructions, and follow the same control-flow path; individual lanes can be masked. Think of this as a vector thread: a wavefront is a 64-wide vector op on AMD GPUs, and a warp is a 32-wide vector op on NVIDIA GPUs.
Workgroup | Thread block | A group of wavefronts/warps scheduled onto the same CU/SM that can cooperate through local (shared) memory.
Work-item/Thread | Thread | GPU programming models can handle this as a separate thread of execution, although work-items do not necessarily make forward sub-wavefront progress.
Global Memory | Global Memory | DRAM memory accessed by the GPU through some layers of cache.
Local Memory | Shared Memory | Scratchpad that allows communication between wavefronts within a workgroup.
Private Memory | Local Memory | Per-thread private memory, often mapped to registers.
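One practical consequence of the 64-wide wavefront versus 32-wide warp difference in Table 1 is that the same workgroup size decomposes into a different number of hardware vectors on each platform, which matters when counting wavefronts rather than threads for parallelism analysis. A minimal sketch (the helper name and the example sizes are illustrative):

```cpp
// Number of wavefronts (AMD) or warps (NVIDIA) needed to execute a
// workgroup/thread block, given the hardware vector width from Table 1.
constexpr int wavefronts_per_workgroup(int workgroup_size, int wave_width) {
    return (workgroup_size + wave_width - 1) / wave_width; // round up
}
```

A 256-work-item workgroup therefore runs as 4 wavefronts on an AMD GPU (width 64) but 8 warps on an NVIDIA GPU (width 32).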
2.2. Matrix multiplication based on the Ozaki scheme

The Ozaki scheme is an error-free method for computing matrix multiplication [34]. It translates the original matrix multiplication into a sum of matrix products. Fig. 2 is a schematic of matrix multiplication based on the Ozaki scheme. The procedure can be divided into three steps: splitting, computation, and summation. In the splitting step, matrices A and B are first divided at the element level to obtain multiple submatrices (A_i, B_j). Then, in the computation step, matrix multiplication is performed on each submatrix of A (A_i) and each submatrix of B (B_j) to obtain a submatrix of C (C_{i,j}). Finally, all these submatrices are summed to obtain the result matrix C.

Based on the Ozaki scheme, a single-precision matrix multiplication method can be described as follows, using FP32 matrix multiplication as an example. First, in the element-level splitting step, the FP32 matrix is converted to FP16 precision by a precision transformation function, while the error is preserved. The detailed expression is as follows:

A_FP16 ← toFP16(A_FP32)    (1)

R_A ← toFP16(A_FP32 − A_FP16)    (2)

where A_FP32 and A_FP16 denote the FP32-based and FP16-based matrix A, respectively, toFP16() is the transformation function that transforms matrix A from FP32 to FP16, and R_A represents the residual matrix between the two precisions. The expression for matrix B is analogous. With the support of mixed-precision cores for floating-point arithmetic with different precisions, A_FP32 B_FP16 can be rewritten as follows:

A_FP32 B_FP16 = (A_FP16 + R_A) B_FP16 = A_FP16 B_FP16 + R_A B_FP16    (3)

Hence, the matrix multiplication A_FP32 B_FP32 can be expressed as:

A_FP32 B_FP32 = (A_FP16 + R_A)(B_FP16 + R_B) = A_FP16 B_FP16 + A_FP16 R_B + R_A B_FP16 + R_A R_B    (4)

where A_FP16, B_FP16, R_A, and R_B denote FP16 matrices, and A_FP32 and B_FP32 denote FP32 matrices.

3. Related work

The advent of Matrix Core and Tensor Core has greatly accelerated the computation of GEMM through their specific data layouts and precision support. This allows them to be studied and applied in many GEMM-based scientific computing fields [24-27]. Therefore, previous studies have optimized the scheduling methods of these special hardware units for different application scenarios to accelerate the computation process [4,12,15,35]. For example, a novel approach for accelerated quantization of graph neural networks, based on the support of low-precision GEMM, was proposed by Wang et al. [36], which exploits the insensitivity of graph neural networks to loss of precision. To make Tensor Core execute more efficiently in deep learning applications, Dakkak et al. [26] optimized the reduction and scan primitives for matrix multiplication, making their programming faster and more accessible. Specifically, Dakkak's method [26] was implemented on the Tesla V100 and achieved 89%-98% of peak memory bandwidth with reduced power consumption. Li et al. [37] proposed a half-precision Fast Fourier Transform (FFT) approach based on Tensor Core. This approach leverages the fragments in Tensor Core to support the special operations required for FFT and optimizes the data arrangement in GPU memory [38]. To further explore these hardware units' workflow and application prospects, Sakamoto et al. [39] improved energy efficiency by analyzing memory and power consumption of low-precision data during computation. The above studies aim to optimize the use of these specialized hardware units to accelerate the computing process in specific scientific computing scenarios based on their computational characteristics [40,41]. These approaches maximize the acceleration capability of the matrix multiplication operation without changing the internal computation process of the specialized hardware units. However, FMA operations between different precisions will inevitably result in a certain precision loss in the calculation results, which will limit the application scope of these
Fig. 2. The schematic of the matrix multiplication based on the Ozaki scheme (𝐶 = 𝐴𝐵 as an example).
hardware units [21,22,42].

To mitigate the loss of precision, several improvement approaches have been proposed. Markidis et al. [10] analyzed the programmability and precision of Tensor Core, and then proposed a precision refinement method for single-precision (FP32) matrix multiplication, which takes advantage of half-precision (FP16) to store the residuals of the transformation. Based on this research, Ootomo et al. [43] improved the rounding method for single-precision floating-point numbers; they reduced the redundancy in the error-correction terms by considering the effect of the rounding method and the implicit bit. Others have studied the error analysis of matrix multiplication using Tensor Core and have given theoretical error bounds for matrix multiplication with different precisions [24,44,45]. To explore the efficiency of mixed-precision cores, Jia et al. [46] studied the arrangement of matrix-to-fragment data in memory and how assembly instructions split the matrix. To support the extended-precision computing pattern, several methods have been proposed for limited-precision hardware, combining multiple low-precision computation instructions to emulate extended-precision computation [47-49]. Inspired by this approach, Feng et al. [4] improved Markidis's method for single-precision matrix multiplication, modifying the rounding method and optimizing the extended-precision computation at the instruction level. In addition, Fasi et al. [21] investigated the application of the multiword methodology to improve the performance-accuracy trade-off of matrix multiplication with mixed-precision cores, focusing specifically on the Tensor Core available on NVIDIA GPUs. Following this research direction, Parikh et al. [20] presented insights and possibilities for implementing higher-precision matrix multiplication with Tensor Core, and provided an implementation of the FP64x2 GEMM scheme with variable FP64 inputs. The common feature of these studies is using low-precision inputs to obtain high-precision computational results at the cost of some additional computation workload. In this way, these specialized hardware units (Matrix Core and Tensor Core) can perform high-performance matrix multiplication operations while reducing the impact of precision loss [44,50].

4. LE-GEMM

This section gives a detailed description of LE-GEMM, which includes the lightweight emulation algorithm, the thread parallelism analytic model, and the efficient multi-level pipeline design. First, we introduce the lightweight emulation algorithm, which provides more efficient and accurate calculation based on specialized hardware units. Then, a thread parallelism analytic model is proposed, which is used to analyze and select the optimal tiling scheme for various computing scenarios and hardware. Finally, we present an efficient multi-level pipeline design for GEMM, which utilizes double-buffer and pre-fetch techniques to hide access latency and improve ILP.

4.1. Lightweight emulation algorithm

Traditional emulation algorithm design assumes that the hardware has the same input and output precision. Typically, a large number of low-precision instructions are used to complete the computation and memory access process, which can result in high computational overhead and reduce the performance benefit [4,40,51,52]. However, specialized hardware units, such as Matrix Core and Tensor Core, have been introduced to accelerate the GEMM operation. Multiplication and addition are fused in special instruction primitives, and the accumulation of intermediate results also supports high precision to ensure accurate calculation results. This hardware feature has motivated us to design a lightweight emulation algorithm that accelerates the GEMM computation process by achieving high computational throughput from low-precision inputs. The lightweight emulation algorithm consists of the precision transformation process and the GEMM emulation calculation, as illustrated in Fig. 3. The precision transformation process is designed to make the low-precision data better preserve the value of the original data, thus achieving more accurate calculation results. The GEMM emulation calculation is designed to achieve high-precision results and speed up the calculation process by leveraging the specialized hardware units and low-precision inputs.

Floating-point numbers of different precision types have exponents and mantissas of different lengths. For the single-precision format, 8 bits are allotted to the exponent and 23 bits to the mantissa. The half-precision representation has a smaller range than single precision, with the exponent and mantissa being 5 and 10 bits, respectively. In the emulation algorithm, the first step is to split the original matrix, which operates at the level of the matrix elements. There are two different splitting routines. One splitting routine is to split the matrix elements without changing the precision, and the other is to split the matrix
Fig. 3. The workflow of LE-GEMM.
values and change the precision type. In this paper, we focus on the implementation of the second routine. However, the transformation process inevitably leads to precision loss. Therefore, an additional FP16 value is used to preserve the residual and improve the accuracy of the calculated results. Rounding is a popular precision transformation technique that has previously been employed to transform from high precision to low precision [20,45]. The common rounding methods include RN, RZ, RU, and RD.¹ The rounding method has the advantage of taking into account the low-bit part of the mantissa, below the rounding bit. However, each rounding method inevitably introduces some error in the final value. The primary reason is that the mantissa is rounded up or down, leading to numerical deviations. For instance, when transforming from FP32 to FP16 using the RN method, the final value considers the low bits of the mantissa (bits 0-12) to achieve the nearest rounding. A_FP16 and R_A are responsible for preserving the high-bit and low-bit parts of the mantissa of A_FP32, respectively. From a numerical perspective, A_FP16 has a larger effect than R_A on the computation results. Therefore, it is important to keep the high bits of the mantissa stable during the transformation while saving as many low bits as possible.

Compared to previous work, we adopt a bit-cut to preserve the high bits of the FP32 mantissa during the precision transformation process, considering the different impact of high and low bits on the error. To achieve the precision transformation from A_FP32 to A_FP16, a bit-cut method is employed to preserve the high bits of the A_FP32 mantissa. In this way, the higher mantissa bits (bits 13-22 of A_FP32 in Fig. 4) are copied into A_FP16. For the residual R_A, introducing a scaling operation requires dividing by the scaling factor (2^10 in Fig. 4) during the calculation process to ensure correct results. Note that due to the difference in exponent length between single precision and half precision, additional processing is required to avoid errors when transforming the exponent part. Considering that the exponent biases of single precision and half precision are 127 and 15, it is necessary to subtract the bias from the exponent to avoid errors in the transformed exponent. To avoid the impact of the bias, we store the exponent in an intermediate variable after subtracting the bias value, and then add the corresponding half-precision bias when saving the intermediate variable to the half-precision exponent field. This approach offers the following benefits. Compared to traditional rounding, it does not generate rounding-up or rounding-down, which can affect the stability of the high bits. It does not depend on the rounding API of the platform, whose rounding rules differ between platforms. The bit-cut operates on the binary format of floating-point numbers and is therefore compatible across platforms. Fig. 4 illustrates the details of the precision transformation process. For the computation of R_A, since the mantissa of FP16 is 10 bits long, it cannot fully save all the remaining mantissa bits. Based on these two factors, a residual is calculated from A_FP32 to A_FP16. The residual is then multiplied by the scaling factor, 2^10, to preserve it. This does not alter the representation, but it maps the initial value into the larger range that can be represented by R_A. Note that since R_A was obtained by scaling, the processing used in subsequent calculations recovers the original residual value. Finally, to preserve the mantissa to the maximum extent, the residuals are saved using rounding. The precision transformation process can be written as follows:

A_FP16 = bit-cut(A_FP32)    (5)

R_A = round((A_FP32 − A_FP16) × 2^10)    (6)

B_FP16 = bit-cut(B_FP32)    (7)

R_B = round((B_FP32 − B_FP16) × 2^10)    (8)

where bit-cut() denotes the bit-cut operation and round() represents RZ. This completes the precision transformation process involving bit-cut and scaling. In the scaling step, it is essential to divide the value by the scaling factor in the last step of the calculation to cancel the effect of the scaling factor. Therefore, the GEMM emulation calculation, which is based on the above precision transformation process, can be described as follows:

A_FP32 B_FP32 = (A_FP16 + R_A × 2^-10)(B_FP16 + R_B × 2^-10) = A_FP16 B_FP16 + (A_FP16 R_B + R_A B_FP16) × 2^-10 + R_A R_B × 2^-20    (9)

The fourth term in Eq. (9) is divided by 2^20, so it has little effect on the final computation result, while the additional computation it requires would increase the overall computation effort. For this purpose, Eq. (9) is modified as follows:

A_FP32 B_FP32 = (A_FP16 + R_A × 2^-10)(B_FP16 + R_B × 2^-10) ≈ A_FP16 B_FP16 + (A_FP16 R_B + R_A B_FP16) × 2^-10    (10)

Hence, the lightweight emulation algorithm calculation can be described as follows:

D_FP32 = α × A_FP32 B_FP32 + β × C_FP32 ≈ α × (A_FP16 B_FP16 + (A_FP16 R_B + R_A B_FP16) × 2^-10) + β × C_FP32    (11)

¹ RN, RZ, RU, and RD represent Round-to-Nearest-even, Round-towards-Zero, Round-Up, and Round-Down, respectively.
Fig. 4. The bit-cut and scaling operation.
where D_FP32 and C_FP32 denote FP32 matrices, A_FP16, R_A, B_FP16 and R_B carry the same meaning as in Eq. (10), and α and β are scalar constants. This section introduces a precision transformation process consisting of a two-step bit-cut and rescaling. This process can effectively preserve the original data mantissa and avoid the numerical fluctuations caused by rounding. At the same time, based on the above precision transformation process, we have designed a GEMM emulation calculation process, which removes the cumulative term in Eq. (9) and reduces the overall computation time of GEMM. Algorithm 1 provides an overview of the lightweight emulation algorithm. In Algorithm 1, bit-cut() denotes using the bit-cut technique to preserve the high bits of the mantissa, which is detailed in Section 4.1; round() represents the use of rounding to preserve the low bits of the mantissa; mFma_sync() represents using the specialized hardware units to calculate the multiplication of two matrices; and uniformFma_sync() represents the computation of Eq. (11) that produces the final calculation result D_FP32.

Algorithm 1 The lightweight emulation algorithm
1: Function emulation GEMM operation(D_FP32, A_FP32, B_FP32, C_FP32, α, β)
2:   /* The precision transformation process */
3:   A_FP16 = bit-cut(A_FP32), R_A = round((A_FP32 - A_FP16) × 2^10);
4:   B_FP16 = bit-cut(B_FP32), R_B = round((B_FP32 - B_FP16) × 2^10);
5:   /* GEMM emulation computation process */
6:   /* Initialize accumulators */
7:   Initialize fragsAcc0, fragsAcc1;
8:   /* Calculate each term in Eq. (11) */
9:   fragsAcc0 = mFma_sync(A_FP16, B_FP16, fragsAcc0);
10:  fragsAcc1 = mFma_sync(A_FP16, R_B, fragsAcc1);
11:  fragsAcc1 = mFma_sync(B_FP16, R_A, fragsAcc1);
12:  /* Calculate the final result */
13:  D_FP32 = uniformFma_sync(α, fragsAcc0, fragsAcc1 / 2^10, β, C_FP32);
14: end Function

4.2. Thread parallelism analytic model

The lightweight emulation algorithm is designed to efficiently map GEMM-based scientific computing to specialized hardware cores, allowing for high-precision outputs with low-precision inputs. To fully exploit the acceleration capabilities of specialized hardware units, an efficient tiling scheme that captures the impact of the allocation of computational tasks on overall performance is critical. The tiling scheme segments the matrix into multiple tiles and assigns each tile to a cooperative thread array (thread block) responsible for its computation. Each thread loads and computes the corresponding matrix elements. This process allocates the computational tasks based on the size of the matrices in the GEMM operation and the available hardware resources.

Fig. 5 illustrates that computing a tile in matrix D requires loading sub-sections of matrices A and B (the gray area) and a tile in matrix C (the blue area) to perform the GEMM operation. This allows multiple tiles to be computed simultaneously on the GPU. To fully exploit the computing power of GPUs, our proposed thread parallelism analytic model analyzes and improves the performance of GEMM computations by considering both the number of parallel threads (Thread-Level Parallelism, TLP) and single-thread performance. Given the tile size, the computation of GEMM can be divided into multiple subparts, each with a set of threads responsible for the computation. This property allows thread parallelism to be computed for a given tiling strategy. Previous methods were based simply on the number of threads in a thread block. Because the thread block is further decomposed into multiple wavefronts (warps) for execution on the CU, those methods ignore the underlying computational pattern on the GPU. Therefore, we use the number of wavefronts to calculate thread parallelism, which provides a more accurate indication of how active the underlying threads are in practice. Based on the matrix's tiling scheme and the hardware configuration, TLP can be calculated as follows:

T_TLP = φ((M × N) / (T_M × T_N)) × T_thread_num (12)

where M and N represent the dimensions of matrix C, T_M and T_N are the tile size, φ is the transform process of workgroup to wavefront, and T_thread_num represents the number of threads in a wavefront. Note that T_thread_num has different values across GPU platforms; for example, on MI210 and A800, the values of T_thread_num are 64 and 32, respectively.

When considering TLP, it is also important to consider the performance of a single thread, which can be expressed as a compute-to-memory ratio. Comparing the clock cycles of memory access instructions and computation instructions in threads, it is evident that data access in memory requires more time. Therefore, it is crucial to ensure that the data loaded by a thread is fully utilized. During each iteration, every thread block is required to perform two tasks.
Fig. 5. The tiling scheme of a single GEMM.
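The tile-selection metrics used in this section (Eqs. (12) and (15)-(17)) reduce to simple arithmetic once the tile shape is fixed. The sketch below scores candidate tile sizes; the candidate list and the treatment of φ as a constant factor are illustrative assumptions, not values from the paper.

```python
def tile_metrics(M, N, T_M, T_N, T_K, work_num, thread_num=64, phi=1.0):
    """Score one (T_M, T_N, T_K) tiling scheme via Eqs. (12) and (15)-(17).

    phi (the workgroup-to-wavefront transform) is taken as a constant
    multiplier here for illustration.
    """
    tlp = phi * (M * N) / (T_M * T_N) * thread_num             # Eq. (12)
    flops_per_mem = 3 * (T_M * T_N) / (2 * (T_M + T_N))        # Eq. (15)
    work_dl = (T_M * T_N + T_M * T_K + T_K * T_N) / work_num   # Eq. (16)
    work_cp = (T_M * T_N) / work_num                           # Eq. (17)
    return tlp, flops_per_mem, work_dl, work_cp

# TLP and the compute-to-memory ratio move in opposite directions, so the
# model (or an optimization solver) has to balance them; a brute-force scan:
for tile in [(64, 64, 32), (128, 128, 32), (256, 256, 32)]:
    print(tile, tile_metrics(8192, 8192, *tile, work_num=256))
```

Running the scan makes the trade-off of the model concrete: shrinking the tile raises T_TLP but lowers the compute-to-memory ratio, and vice versa.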
The first step involves reading two matrices (A_FP16, R_A) of size (T_M, T_K), two matrices (B_FP16, R_B) of size (T_K, T_N), and matrix C of size (T_M, T_N). Since matrix C of size (T_M, T_N) is loaded only in the first iteration and does not need to be loaded again in subsequent steps, its loading overhead can be neglected. Thus, the memory access per iteration can be expressed as:

T_memory = 2 × 2 × (T_M × T_K + T_K × T_N) (13)

where T_memory represents the memory access during each iteration, and the last 2 reflects that half-precision data occupies two bytes. According to the lightweight emulation algorithm, the computation required for each iteration is as follows:

T_flops = 3 × 2 × (T_M × T_N × T_K) (14)

where T_flops represents the computation in each iteration, 2 × (T_M × T_N × T_K) represents the amount of computation for the matrix multiplication between a matrix of size (T_M, T_K) and a matrix of size (T_K, T_N), and the constant 3 corresponds to the three terms in Eq. (11). To this end, the ratio of computation to memory access (T_flops/memory) during each iteration is as follows:

T_flops/memory = (3 × 2 × (T_M × T_N × T_K)) / (2 × 2 × (T_M × T_K + T_K × T_N)) = 3(T_M × T_N) / (2(T_M + T_N)) (15)

While pursuing higher TLP and compute-to-memory ratios, it is also important to focus on the thread workload. The workload of a thread can be divided into two parts: the amount of data loaded and the amount of computation. To avoid an excessive workload reducing thread parallelism, it is necessary to properly set the tiling scheme and the number of tiles that the workgroup is responsible for computing. To describe a thread's amount of data load (Work_DL) and amount of computation (Work_CP) more accurately, the following two formulas are given:

Work_DL = (T_M × T_N + T_M × T_K + T_K × T_N) / Work_num (16)

Work_CP = (T_M × T_N) / Work_num (17)

where Work_num represents the number of threads responsible for computing the tile, and T_M, T_N, and T_K represent the tile size, the details of which are shown in Fig. 5. Work_DL is the number of elements each thread is responsible for loading. To ensure efficient parallelism, it is necessary to allocate a different number of threads to tiles of various shapes; this ensures that the workload is consistent between thread blocks. Work_CP reflects the number of result registers each thread occupies. When a thread handles too many elements, the data in the registers will overflow, which increases the data access overhead. At the same time, we should also consider the correspondence between the thread block and the matrix tiles, which likewise affects the efficiency of data reuse. As shown in Fig. 5(a), the calculation of 3 tiles in matrix C involves the loading of 1 sub-section of matrix A and 3 sub-sections of matrix B, respectively. In Fig. 5(b), 4 tiles can be computed by loading two sub-sections each of matrices A and B. Comparing these two routines, it is clear that the larger the cross-sectional area (blue tiles in Fig. 5) covered by the sub-sections loaded from matrices A and B, the better the data reuse; Fig. 5 indicates that the pattern in (b) reuses data more effectively.

Finally, the selection of the optimal tiling scheme with the thread parallelism analytic model is formally represented as maximizing T_TLP and T_flops/memory. According to Eqs. (12) and (15), T_TLP and T_flops/memory show an inverse relationship given M, N, and T_thread_num. Generally, smaller tiles achieve better thread parallelism because the computational tasks are distributed over more threads. However, the number of activated threads is also limited by the number supported by the GPU hardware. When the tile is larger, a larger compute-to-memory ratio is often achieved, which represents better performance for a single thread. When choosing a tiling scheme, it is also necessary to consider the thread workload according to Eqs. (16) and (17). Eqs. (12)-(17) take into account the factors that affect the number of active threads and the performance of a single thread, allowing the optimal tiling scheme to be selected for various GPU platforms and computing scenarios. Our thread parallelism analytic model chooses the tiling scheme based on thread-level parallelism and single-thread performance, which can be solved analytically with existing optimization solvers [53,54]. It can also be used to provide a more profitable direction when selecting the best parameters by trial-and-error. Table 2 presents the choice of the parameters for LE-GEMM. In Table 2, (T_M, T_N, T_K) denotes the size of the matrix partitioned into tiles, (f_m, f_n, f_k) denotes the fragment size for the GEMM computation, and the wavefront/warp size denotes the number of threads in the thread group of the two GPU platforms, which is determined by their hardware architectures. Due to the various sizes and shapes of the fragments supported by different architectures, the choice of fragment is also restricted on the two GPU platforms. At the same time, the wavefront/warp size also affects the thread workload allocation. Therefore, when choosing parameters, it is essential to consider the impact of hardware differences on performance.

4.3. An efficient multi-level pipeline for GEMM

In the GPU memory architecture, the various memory spaces have different sizes and access speeds. Data access speeds can be ordered from fast to slow as register memory, local memory, and global memory; the speed is inversely proportional to the size of the memory space.
Fig. 6. The workflow of GEMM.
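The double-buffered flow of Fig. 6 can be sketched as a loop structure. This is a Python stand-in for the device code: `load_tile` and `mma` are illustrative placeholders for the global-to-shared tile copies and the Matrix Core calls, and the two buffer slots model buffer0/buffer1 in shared memory.

```python
def pipelined_tile_gemm(num_iters, load_tile, mma):
    """Double-buffered accumulation along the K-dimension: while iteration i
    computes out of one buffer, iteration i+1's tiles are prefetched into the
    other, so memory and compute instructions can overlap."""
    buffers = [None, None]
    buffers[0] = load_tile(0)        # prologue: fill buffer0 for iteration #0
    acc = 0.0
    for i in range(num_iters):
        if i + 1 < num_iters:
            # Prefetch the next K-step into the buffer not in use.
            buffers[(i + 1) % 2] = load_tile(i + 1)
        # Consume the current buffer (on the GPU this is the Matrix Core MMA;
        # a block-wide synchronization follows each buffer store).
        acc = mma(buffers[i % 2], acc)
    return acc
```

For example, `pipelined_tile_gemm(4, lambda i: i + 1, lambda tile, acc: acc + tile)` accumulates four "tiles" while always keeping the next one staged, mirroring how the kernel hides global-memory latency behind computation.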
Table 2
The choice of the parameters on two GPUs.

Name                               AMD MI210        NVIDIA A800
Matrix tile size (T_M, T_N, T_K)   (128, 128, 32)   (128, 256, 32)
Fragment size (f_m, f_n, f_k)      (32, 32, 16)     (16, 16, 16)
Wavefront/Warp size                64               32

Table 3
The configuration of platforms for evaluation.

Name        AMD-platform    NVIDIA-platform
CPU         EPYC 7663       Platinum 8358
GPU         MI210           A800
OS          Ubuntu 20.04    Ubuntu 20.04
ROCm/CUDA   ROCm 5.6        CUDA 12.0

Direct access to data in global memory from registers reduces transfer efficiency and does not maximize memory bandwidth. Therefore, in this paper, double buffers are used to alleviate the large disparity in memory access speeds, and pre-fetch techniques are used to maximize memory bandwidth and thread instruction parallelism. The workflow for a tile in GEMM is shown in Fig. 6.

First, a double buffer (buffer0 and buffer1) is created in shared memory, and the size of a single buffer is related to the cumulative span of matrices A and B along the K-dimension. As shown in Fig. 6, load instructions fetch the matrix tiles A_FP16, R_A, B_FP16 and R_B in iteration #0, and the data is stored in buffer0. The fetched data corresponds to the yellow tiles A_FP16 and R_A (T_M × T_K) and the green tiles B_FP16 and R_B (T_K × T_N) in Fig. 6. At the same time, the data to be used in iteration #1 is fetched into buffer1. This allows registers to fetch data directly from shared memory, without waiting for the data to be loaded from global memory. Each thread can load data directly from shared memory into private registers following a specified data layout, and then utilize the Matrix Core to complete the computation. When the intermediate result of iteration #0 is complete, the thread reads the data needed for the next iteration from buffer1. For efficient data pipelining and instruction-level parallelism, data for the next iteration can be pre-fetched from global memory into shared memory while the data transfers between shared memory and registers are performed. Based on the above approach, we can implement a multi-level access pipeline between the different memory levels, and the access and computation instructions can be executed in parallel, improving the ILP performance of the thread. Note that a thread synchronization instruction is issued after the store each time a buffer is filled, to ensure data consistency. Each iteration produces an intermediate result, which is combined with the accumulation matrix C_FP32 in an FMA operation. After several iterations, the final result D_FP32 is obtained and written back to global memory.

The implementation of the multi-stage pipeline above must be aware of the size of the registers and shared memory supported by the specific GPU hardware. Once the maximum capacity limit is exceeded, data overflow will occur, greatly increasing the access latency. Indeed, from the above description, register and shared memory occupancy is related to T_M, T_N and T_K. When a tiling scheme is given, T_M and T_N are also determined. Therefore, data overflow can be prevented by setting T_K flexibly. For example, on the AMD platform, based on the fact that T_M and T_N are 128 and 256 respectively, T_K can be set to a smaller size such as 16 or 32 to prevent data overflow. Note that to avoid invalid computations in the Matrix Core, T_K should preferably be an integer multiple of the fragment size.

5. Results

This section compares the proposed method with some state-of-the-art approaches and analyzes the experimental results on two GPU platforms. The flowchart of the comparative experiment is shown in Fig. 7. Note that the matrix sizes of A ∈ R^(M×K) and B ∈ R^(K×N) are represented as M*N*K in the following expressions for simplicity. The tested range of numbers was [-1, 1] in the comparative experiments.

5.1. Setup

Experiment platform. In this paper, we chose two of the most common and advanced GPUs (MI210 and A800). The configuration and details of the experimental platforms are given in Tables 3 and 4, respectively.

Comparison method. To evaluate the proposed method, we first consider, for the two GPU platforms, the default single-precision GEMM methods, such as rocBLAS and cuBLAS, provided by the respective GPU vendors. These methods do not use the mixed-precision core and directly utilize the FP32 input to compute the FP32 GEMM results. We then compare methods, such as rocWMMA and WMMA, that use FP16 input to compute FP32 results on Matrix Core or Tensor Core without precision
Fig. 7. The flowchart of the comparative experiment.
Table 4
The configuration of two GPUs.

Name                   AMD MI210                      NVIDIA A800
Architecture           CDNA 2.0                       Ampere
Mixed-precision core   Matrix Core                    Tensor Core
Caches                 L1 16 KB (per CU), L2 16 MB    L1 192 KB (per SM), L2 40 MB
Memory                 64 GB HBM2                     80 GB HBM2
Bandwidth              1.6 TB/s                       2.04 TB/s

Table 5
The description of the baseline methods.

Name           Input   Output   Description
rocBLAS [30]   FP32    FP32     rocblasSgemm on ROCm Core.
cuBLAS [32]    FP32    FP32     cublasSgemm on CUDA Core.
rocWMMA [28]   FP16    FP32     rocWMMA implemented on Matrix Core without precision refinement.
WMMA [55]      FP16    FP32     WMMA implemented on Tensor Core without precision refinement.
Mark [10]      FP16    FP32     Markidis's method implemented on two platforms.
Ooto [43]      FP16    FP32     Ootomo's method implemented on two platforms.

refinement. Finally, the experiments compare Markidis's and Ootomo's methods, which calculate FP32 results based on FP16 input with precision refinement. In this paper, the experimental results are averaged over 10 runs. The baseline methods are described in Table 5.

Evaluation criteria. To objectively evaluate the effectiveness of the proposed method, two criteria are used in this paper to evaluate speed and accuracy. In terms of measuring speed, the experimental results are represented by the average value of TFLOPS (Tera FLoating-point Operations Per Second), which is calculated as:

TFLOPS = 2 × ((M × N × K) + (M × N)) / (total_time × 1.0e12) (18)

where M, N and K represent the matrix dimensions of the GEMM, and total_time represents the running time in seconds measured on the GPU. To evaluate the calculated results, the max relative error (Max_error) is utilized to evaluate the error compared to the CPU computation results, which is calculated as:

Max_error = Max( |toDouble(V_p(i, j)) - toDouble(V_single(i, j))| / (|toDouble(V_p(i, j))| + |toDouble(V_single(i, j))|) ),
            i = 1, 2, ..., M, j = 1, 2, ..., N. (19)

where V_p(i, j) denotes the computational result under the precision refinement procedure, which can achieve single-precision results with half-precision matrices in this paper; V_single(i, j) is the result of a standard single-precision calculation used to compute Max_error; (i, j) represents the element in the ith row and jth column of the matrix; toDouble() is a precision transformation function used to obtain more accurate errors; M and N represent the dimensions of the resulting matrix; and Max() chooses the maximum value over all errors. Meanwhile, the Mean Relative Error Distance (MRED) is also used as an error evaluation metric, which is calculated as follows:

MRED = (1 / (M × N)) × Σ_{i=1}^{M} Σ_{j=1}^{N} |(toDouble(V_single(i, j)) - toDouble(V_p(i, j))) / toDouble(V_single(i, j))| (20)

where M and N denote the size of the result matrix, and V_single(i, j) and V_p(i, j) have the same meaning as in Eq. (19).

5.2. The speed improvement with various shapes

To evaluate the speed advantages of our method, we conducted comparative experiments on two GPU platforms with multiple matrix shapes. The results are presented in Figs. 8-9. In Fig. 8(a)-(c), our method improves performance by 1.31× compared to rocBLAS, an arithmetic library commonly used for scientific computing and provided by AMD. We also compare against rocWMMA, which directly uses half-precision inputs to achieve a single-precision output without a precision refinement process. The proposed method achieves average speedups of 1.25×, 1.24×, and 1.45× for (n*n*n), (n*4n*n), and (n*n*4n), respectively. From Fig. 8(a)-(b), when n ≤ 4096, the performance of rocWMMA is superior to that of the proposed method. This phenomenon is mainly due to two reasons. First, rocWMMA is a straightforward computation that does not involve the accuracy refinement procedure; therefore, the workload of the calculation process is smaller. Second, the matrix dimension is small, so the proposed method cannot fully exploit the performance improvement brought by the multi-level pipeline design. As the matrix dimension gradually increases, the performance advantage of the proposed method becomes increasingly apparent, as seen in Fig. 8. Compared to Markidis's and Ootomo's methods, the proposed method achieves speedups of 3.05× and 3.18×, respectively. These two methods utilize precision refinement techniques but require more computation. The data layout and instruction scheduling characteristics of the specialized hardware accelerators were not considered, leading to lower performance. The proposed method utilizes the thread parallelism analytic model to choose the tiling scheme used to segment the matrix. Additionally, an efficient multi-level pipeline is used to improve the efficiency of data access and ILP. This makes it an effective solution for addressing the low computation speed of precision refinement. Fig. 8 indicates that the proposed method has performance benefits when the matrix dimension is large. This serves as further evidence that the suggested LE-GEMM has significantly improved performance for both square and skewed matrices.
Fig. 8. The experimental results with various shapes on MI210.
Fig. 9. The experimental results with various shapes on A800.
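Eqs. (18)-(20) translate directly into code. The NumPy sketch below implements the three evaluation metrics; the function names are illustrative, not the paper's benchmark harness.

```python
import numpy as np

def tflops(M, N, K, total_time):
    # Eq. (18): 2*(M*N*K + M*N) floating-point operations, reported in TFLOPS.
    return 2.0 * ((M * N * K) + (M * N)) / (total_time * 1.0e12)

def max_error(v_p, v_single):
    # Eq. (19): max relative error of the refined result against the
    # single-precision reference, both promoted to double (toDouble()).
    p, s = v_p.astype(np.float64), v_single.astype(np.float64)
    return np.max(np.abs(p - s) / (np.abs(p) + np.abs(s)))

def mred(v_p, v_single):
    # Eq. (20): mean relative error distance over all M*N entries.
    p, s = v_p.astype(np.float64), v_single.astype(np.float64)
    return np.mean(np.abs((s - p) / s))
```

Note the denominators: Max_error normalizes by the sum of magnitudes (so it is bounded by 1), while MRED normalizes by the single-precision reference alone.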
The experimental results of the speed comparison on NVIDIA A800 are shown in Fig. 9. In the square matrix case (n*n*n), the proposed method achieves average speedups of 1.37×, 1.40×, 2.78× and 2.74× compared with cuBLAS, WMMA, Mark and Ooto, respectively. As illustrated in Fig. 9, when the matrix size is small, the performance of LE-GEMM is similar to that of the compared methods. For example, when n ≤ 2048, Fig. 9(a) shows that the proposed method performs similarly to cuBLAS and WMMA. This phenomenon arises because a smaller matrix reduces the number of tiles and the number of iterative accumulations within each tile, reducing the performance gain brought by the multi-level pipeline. As the size of the matrix increases, the benefits of the proposed method become more apparent; when the matrix size is large, the proposed method significantly improves performance over the other methods. Experiments conducted on the two GPU platforms demonstrate that the performance of the Markidis and Ootomo methods is nearly identical. The reason is that, while Ootomo's method reduces the computational effort by 25%, it requires an additional accumulator to perform the accumulation, and the initiation and combination stages of this accumulator require additional time. When the matrix is skewed, the proposed method outperforms the other baseline methods significantly. In Fig. 9(c), when n > 7168, the performance of the WMMA method decreases because it is a simple, direct implementation that ignores the data loading pipeline and the multi-level memory architecture during the computation. This simple implementation results in lower efficiency when accumulating along the K-dimension, which is more pronounced when the matrix size is (n*n*4n), and makes WMMA unable to fully utilize its hardware acceleration capability. Fig. 9(b) illustrates that the proposed method achieves speedups of 1.40×, 1.43×, 2.84× and 2.80× over the four baseline methods, respectively. As shown in Fig. 9, the proposed method performs better when the matrix size is large; this performance trend is consistent with the AMD platform. The detailed comparative experimental results on the two GPU platforms are presented in Table 6.

Table 6
The speedup on two GPU platforms.

                  AMD MI210                         NVIDIA A800
Baseline method   (n*n*n)   (n*4n*n)   (n*n*4n)    (n*n*n)   (n*4n*n)   (n*n*4n)
rocBLAS/cuBLAS    1.31×     1.31×      1.32×       1.37×     1.40×      1.36×
rocWMMA/WMMA      1.26×     1.24×      1.45×       1.40×     1.43×      1.81×
Mark              2.81×     2.99×      3.35×       2.77×     2.84×      4.67×
Ooto              2.99×     3.10×      3.45×       2.74×     2.80×      3.76×

5.3. The precision error

This section compares several baseline methods in terms of their maximum error. The experimental results are shown in Fig. 10. First, we compare with rocWMMA, which computes FP32 GEMM directly from FP16 using the Matrix Core. In Fig. 10(a), the proposed method reduces the average Max_error by 377.33× compared to rocWMMA. The results indicate that FP16 cannot preserve all the mantissa bits of FP32 due to the limitation of its mantissa length. Additionally, this method lacks precision refinement techniques, resulting in a pronounced Max_error. The comparison shows that the proposed method can significantly reduce Max_error compared to the default routine. The Mark and Ooto methods were then compared separately. These two methods utilize precision refinement techniques to decrease Max_error and have improved computational accuracy compared to the default methods. Fig. 10(a) shows that the proposed method reduces the average Max_error by 74.82× compared to Mark. Although Markidis et al. [10] employed an additional FP16 term to execute the precision refinement process and reduce Max_error, the effect of the rounding method on FP16 was not considered, leading to certain limitations in improving accuracy. The proposed method reduces Max_error by an average of 2.69× compared to Ootomo's method. However, the additional FP16 value does not preserve the residual between FP32 and FP16 efficiently, due to the limitations of the numerical representation range. Rounding inevitably causes fluctuations in the values, leading to errors in the values preserved by FP16.
Fig. 10. The Max_error with square matrix (n*n*n) on two GPU platforms.
Fig. 11. The Mean Relative Error Distance with square matrix (n*n*n) on two GPU platforms.
To alleviate this problem, the proposed method employs a bit-cut method for the high bits of the FP32 mantissa to avoid the value fluctuation caused by rounding. Fig. 10(a) shows that the Max_error increases with the matrix dimension (K-dimension). This is due to the fact that FP16 cannot fully preserve the FP32 values, resulting in errors in the calculated results; as the number of computations increases, the error gradually becomes more apparent. The effectiveness and advancement of the proposed method can be demonstrated by comparing it with the default (rocWMMA) method and the latest (Mark and Ooto) methods on the AMD platform.

To demonstrate the proposed method's portability, we have conducted corresponding experiments on NVIDIA A800, another popular GPU platform. The experimental results for Max_error are shown in Fig. 10(b). To ensure an objective comparison, we applied Markidis's method and Ootomo's method on the NVIDIA platform using the WMMA interface; the interface name is the only difference between this implementation and the AMD-platform implementation. Fig. 10(b) illustrates that the proposed method significantly reduces Max_error as the matrix size increases, which is consistent with the results obtained on the AMD platform. The proposed method also has a smaller Max_error compared to WMMA, Mark, and Ooto, reducing the average Max_error by 571.75×, 2.78× and 1.41×, respectively. The experimental results on the two platforms show the same trend. This indicates that the proposed method has a low Max_error on mainstream GPU platforms, further confirming its effectiveness in terms of computational accuracy.

In this section, we have also provided the MRED of several comparative methods, which allows for a more comprehensive analysis of the matrix error. The detailed experimental results are presented in Fig. 11. As illustrated in Fig. 11(a), the proposed method exhibits a smaller MRED than rocWMMA, Mark, and Ooto, with reductions of 636.92×, 178.93×, and 1.12× on MI210, respectively. Combining the experimental results in Figs. 10-11, it can be found that rocWMMA and WMMA have an enormous Max_error and MRED compared to the other methods, which also shows the importance of precision refinement. On A800, the proposed method reduces the MRED by 814.87×, 1.75×, and 1.28× compared to WMMA, Mark, and Ooto, respectively. The experimental results of Figs. 10-11 demonstrate a consistent trend: the Max_error and MRED are more significant when the cumulative dimension increases. As shown in Figs. 10-11, the proposed method has a smaller Max_error and MRED, which indicates that it has a smaller upper limit of error and a smaller mean relative error. The comparison of Max_error and MRED proves that the proposed method has better computational accuracy.

This section also provides error distribution histograms to demonstrate each method's error range. The error distribution histograms are generated on the two GPU platforms with a matrix size of 1024*1024*1024. The detailed error distribution results are shown in Figs. 12-13. As shown in Figs. 12-13, the error distribution of rocWMMA and WMMA is the widest, indicating that the error impact without an additional precision computation process is significant. Compared to the Mark method, the proposed method has obvious advantages in both the error range and the maximum error. This precision advantage is more evident on MI210. The above gap arises because, although Mark considers the impact of down-conversion, a simple rounding approach cannot effectively preserve the low bits. Moreover, the differences in the implementation of the rounding API across hardware platforms can also impact the calculation results. Compared to the Ooto method, the error distribution histograms of the two methods are similar. The error of the proposed method is more concentrated around 0 in Figs. 12 and 13, indicating that the proposed method is superior to Ooto in terms of error distribution.
Fig. 12. Error distribution histogram on MI210.
Fig. 13. Error distribution histogram on A800.
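The behavior this comparison turns on, that round-to-nearest conversion produces a residual that fluctuates in sign while the bit-cut truncation does not, can be reproduced in a few lines. This NumPy sketch assumes positive inputs and the same 13-bit mantissa mask as the bit-cut; it is a numerical illustration, not the paper's histogram pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(0.5, 1.0, 1000).astype(np.float32)

# Round-to-nearest split: the residual a - fp16(a) fluctuates in sign,
# because RN rounds some values up and others down.
a16_rn = a.astype(np.float16)
res_rn = a - a16_rn.astype(np.float32)

# Bit-cut split: truncating the low 13 mantissa bits only removes magnitude,
# so for positive inputs the residual is always non-negative.
a16_bc = (a.view(np.uint32) & np.uint32(0xFFFFE000)).view(np.float32)
res_bc = a - a16_bc

print("RN residual changes sign:", bool(np.any(res_rn < 0) and np.any(res_rn > 0)))
print("bit-cut residual non-negative:", bool(np.all(res_bc >= 0)))
```

The sign-stable residual is what lets the bit-cut pipeline preserve the low mantissa bits consistently, whereas the rounding-based split inherits the fluctuation discussed above.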
The error histograms shown in Figs. 12 and 13 indicate that the proposed method outperforms the other methods regarding both the error upper limit and the error distribution on both platforms, proving its accuracy advantage.

5.4. The empirical Roofline analysis

To further demonstrate the effectiveness of the proposed method, we present a comparison and analysis of the empirical Roofline performance on the two GPU platforms. In the comparison experiments, GEMMs with larger dimensions are chosen for profiling, which avoids performance estimation bias due to having few data elements in the multi-level cache. To measure the performance metrics, we utilized the Omniperf [56] and NCU [57] commands, which are profiling tools provided by AMD and NVIDIA, respectively. The empirical Roofline is typically composed of two regions, memory-bound and compute-bound, which indicate that the kernel performance is limited by memory access and by compute capacity, respectively. The specific experimental results are shown in Fig. 14. From Fig. 14, it can be seen that the proposed method lies in the compute-bound region compared to Mark, Ooto, and rocWMMA, indicating that the proposed method has a higher bandwidth utilization of the multi-level memory in the GPU. This is because the above baseline approaches do not take into account the data layout of the specialized hardware units and the multi-level cache characteristics of GPUs, resulting in lower performance. To address this issue, the proposed method uses double buffers and pre-fetching techniques to establish an efficient multi-level pipeline design. This design empowers the kernel to optimize the use of data in the multi-level cache, leading to improved performance. For rocBLAS and cuBLAS, although both methods fall into the compute-bound region, the performance of the proposed method is better. This also shows the computational effectiveness of the proposed method for single-precision GEMM. The comparison of the empirical Roofline performance on the two platforms further confirms the effectiveness and advancement of the proposed method.

5.5. The improved performance for GEMM-based scientific computing scenarios

To demonstrate the proposed method's acceleration performance in GEMM-based scientific computing, we selected three real-world scientific computing scenarios: K-means [6], the FFN layer in BERT [8], and linear transformation in LLMs [58], all of which involve numerous GEMM operations. Fig. 15 shows the speedup ratio of the proposed method compared to the default implementations (rocBLAS and cuBLAS) on the two GPU platforms. The average speedups of the proposed method in K-means over rocBLAS and cuBLAS are 1.16× and 1.33×, respectively, as shown on the left of Fig. 15. The shape of the GEMM operations during computation is determined by altering the number of data points in the K-means comparison experiments. The proposed method's performance advantage increases gradually with the number of data points and then stabilizes. This suggests that the proposed method has more significant performance gains in large-scale GEMM. Fig. 15 also illustrates the performance improvement of the FFN layer, which incorporates the proposed method to accelerate the computational process of the BERT layer. As illustrated in Fig. 15, the proposed method achieves 1.16× and 1.18× speedups compared to rocBLAS and cuBLAS. Experiments on various input sequence lengths show that the proposed method can significantly improve the speed of the FFN computation on the two GPU platforms. Meanwhile, we also apply the proposed method to the linear transformation in LLMs to accelerate their linear mapping process, which involves GEMM workloads of various sizes. During the linear transformation process of LLMs, the input features are mapped to produce Q, K, and V, commonly known as the Query, Key, and Value matrices. The range of the matrix data from the LLMs training process is [0.5, 2.5]. In Fig. 15, for the computation of the multiple-size weight matrices, the proposed method has speedups of 1.23× and 1.34× compared to the default implementations, respectively. Fig. 16 shows the Max_error and MRED of the proposed method in the three GEMM-based applications. As illustrated in Fig. 16, the magnitude of the error associated with the proposed method is primarily contingent upon the K-dimension of the GEMM. As the K-dimension increases, both the Max_error and MRED of the proposed method also increase. The computation in GEMMs with larger K-dimensions involves more FMA operations, leading to more significant errors.

5.6. The performance gain analysis

In this section, we evaluate the performance gains of the two techniques, the thread parallelism analytic model and the multi-level pipeline, for LE-GEMM performance.
Fig. 14. The empirical Roofline performance.
Fig. 15. The acceleration performance with LE-GEMM (note that rocBLAS and cuBLAS serve as the baselines for performance comparison; hence, the speedup of these methods is 1).
Fig. 16. The Max_error and MRED with LE-GEMM.
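The K-dependence of the error discussed around Fig. 16 can be reproduced on the CPU with a few lines of Python. The sketch below is only an illustration of why error grows with the K-dimension, not the paper's GPU kernel: it emulates FP16 rounding with the standard library's binary16 codec and uses deterministic synthetic data drawn from the [0.5, 2.5] range mentioned in the text; the function names and sizes are our own.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE binary16 value."""
    return struct.unpack('e', struct.pack('e', x))[0]

def dot_fp16(a, b):
    """Dot product with FP16 operands, FP16 products, and an FP16 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(to_fp16(x) * to_fp16(y)))
    return acc

def rel_error(k: int) -> float:
    """Relative error of a length-k FP16 dot product vs. an FP64 reference."""
    # Deterministic synthetic inputs in [0.5, 2.5], the range cited for the LLM workload.
    a = [0.5 + 2.0 * ((i * 37) % 97) / 97.0 for i in range(k)]
    b = [0.5 + 2.0 * ((i * 53) % 97) / 97.0 for i in range(k)]
    exact = sum(x * y for x, y in zip(a, b))
    return abs(dot_fp16(a, b) - exact) / exact

# Each extra FMA adds another rounding step, so longer K means larger error.
errs = {k: rel_error(k) for k in (64, 512, 4096)}
print(errs)
```

Each accumulation step rounds the running sum to 11 bits of mantissa, so the accumulated error grows with the number of FMA operations, matching the trend reported for Max_error and MRED.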
LE-GEMM performance. The detailed experimental results are shown in Fig. 17. In Fig. 17, LE-GEMM manual parameter represents the use of previous experience to select the appropriate parameters, and LE-GEMM plain implementation is the direct use of the API interface without the multi-level pipeline. As shown in Fig. 17, the thread parallelism analytic model brings 1.36× and 1.78× performance improvements on MI210 and A800. When the matrix size gradually increases, the performance gain becomes more obvious. For example, when the matrix size is ≥ 4000, there is a significant performance gap between the proposed method and the manual parameter implementation. This is because different parameters have a significant performance impact on varying hardware architectures and matrix sizes. The manual parameter way
Fig. 17. The performance gain analysis of LE-GEMM (LE-GEMM manual parameter indicates LE-GEMM without the thread parallelism analytic model, and LE-GEMM plain implementation indicates LE-GEMM without the multi-level pipeline).
Fig. 18. The trade-off between performance and precision.
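The gap between direct down-conversion (the CUBLAS_COMPUTE_32F_FAST_16F path) and a precision-refined emulation can be illustrated on the CPU. The sketch below is a simplified model, not LE-GEMM itself: each single-precision input is split into an FP16 high part plus an FP16 residual, and two extra FP16 products per element recover most of the down-conversion error, in the spirit of the Markidis/Ootomo-style emulation the paper builds on. FP32-and-wider accumulation is modeled by Python floats via the standard library's binary16 codec; all names and data are illustrative.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE binary16 value."""
    return struct.unpack('e', struct.pack('e', x))[0]

def dot_downconvert(a, b):
    """FAST_16F-style path: down-convert inputs to FP16 once, accumulate wider."""
    return sum(to_fp16(x) * to_fp16(y) for x, y in zip(a, b))

def dot_refined(a, b):
    """Emulation path: x = hi + lo with two FP16 terms per input.
    Two extra products (hi*lo and lo*hi) recover most of the rounding
    error; the negligible lo*lo term is dropped."""
    acc = 0.0
    for x, y in zip(a, b):
        xh = to_fp16(x)
        xl = to_fp16(x - xh)   # FP16 residual of the down-conversion
        yh = to_fp16(y)
        yl = to_fp16(y - yh)
        acc += xh * yh + xh * yl + xl * yh
    return acc

# Deterministic synthetic data in the [0.5, 2.5] range cited in the text.
a = [0.5 + 2.0 * ((i * 37) % 97) / 97.0 for i in range(256)]
b = [0.5 + 2.0 * ((i * 53) % 97) / 97.0 for i in range(256)]
exact = sum(x * y for x, y in zip(a, b))
err_fast = abs(dot_downconvert(a, b) - exact) / exact
err_refined = abs(dot_refined(a, b) - exact) / exact
print(err_fast, err_refined)
```

The refined path costs three low-precision products per element instead of one, which mirrors the trade-off reported in Fig. 18: better accuracy than the FAST_16F mode in exchange for extra computation.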
relies solely on prior experience in selecting kernel parameters. This approach often requires several trial-and-error attempts to obtain a suitable parameter value. However, the lack of a detailed analysis of the hardware architecture and the computational workflow makes it difficult to obtain the optimal parameters with the manual selection approach. We also evaluate the LE-GEMM plain implementation, which ignores the multi-level memory architecture and adopts a simple, direct calling style. Fig. 17 illustrates that LE-GEMM exhibits performance speedups of 3.02× and 2.72× over the plain implementation on the two GPU platforms. In the LE-GEMM plain implementation, threads are required to wait for the next data load after completing a single iteration of the calculation, which significantly reduces performance. Compared to the plain implementation, LE-GEMM can efficiently utilize the various memory spaces for pipelined access operations, reducing data loading overhead and improving CU utilization.

5.7. The trade-off between performance and precision

To explore the trade-off between high performance and precision, comparative experiments with different precisions were conducted on the A800. We used the cublasGemmEx() function in the comparative experiments to implement two different computation modes (CUBLAS_COMPUTE_32F_FAST_16F and CUBLAS_COMPUTE_64F). The CUBLAS_COMPUTE_32F_FAST_16F mode directly uses FP16 to obtain single-precision calculation results, during which down-conversion is performed. The detailed experimental results are presented in Fig. 18. As shown in Fig. 18, CUBLAS_COMPUTE_32F_FAST_16F has the fastest computational performance because this method directly calls the Tensor Cores for matrix operations without any precision refinement process. The calculation process of CUBLAS_COMPUTE_32F_FAST_16F is consistent with WMMA, which directly uses down-conversion without a precision refinement process. Hence, the error of these two methods is also consistent. CUBLAS_COMPUTE_32F_FAST_16F often leads to significant errors, limiting its application range. The proposed method reduces the precision loss by using additional computational resources, which decreases the overall computational speed. Compared to CUBLAS_COMPUTE_32F_FAST_16F, the proposed method has better computational accuracy. CUBLAS_COMPUTE_64F represents double-precision matrix multiplication. It has higher data precision and lower computational speed. The experimental results in Fig. 18 indicate that higher-precision data tends to come with lower computational speed. The proposed LE-GEMM achieves single-precision computational results based on low-precision computational units, essentially a compromise between computational performance and accuracy.

6. Conclusion

This paper proposed LE-GEMM for single-precision matrix multiplication, which contains a lightweight emulation algorithm, a thread parallelism analytic model, and an efficient multi-level pipeline implementation. Specifically, the lightweight emulation algorithm provides more accurate computational results and eliminates redundant computations. Then, the thread parallelism analytic model is designed to determine the optimal tiling scheme by considering TLP and single-thread performance. The multi-level pipeline design is implemented to achieve access latency hiding and higher ILP. Finally, experiments were conducted on two GPU platforms to validate the effectiveness and advancement of the proposed method.

7. Future work

Future work includes exploring the impact of different types of low-precision data on computational results, allowing for better trade-offs
between performance and accuracy. Recently, low-precision hardware accelerators have been increasingly applied to various GEMM-based scientific computing scenarios to improve computational speed. The differences in hardware support and implementation can significantly impact the calculation results. In future work, we plan to explore more low-precision variables (e.g., BF16, FP8, FP4) to speed up the computational process in single-precision matrix multiplication.

CRediT authorship contribution statement

Yu Zhang: Writing - review & editing, Writing - original draft. Lu Lu: Writing - review & editing. Zhanyu Yang: Writing - review & editing, Conceptualization. Zhihong Liang: Writing - review & editing. Siliang Suo: Writing - review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of Guangdong Province, China (2024A1515010204) and the Technological Research Project of Southern Power Grid Company, China (ZBKJXM20232483).

Data availability

No data was used for the research described in the article.

References

[1] M. Meyer, T. Kenter, C. Plessl, Multi-FPGA designs and scaling of HPC challenge benchmarks via MPI and circuit-switched inter-FPGA networks, ACM Trans. Reconfigurable Technol. Syst. 16 (2) (2023) 1-27.
[2] H. Martínez, S. Catalán, A. Castelló, E.S. Quintana-Ortí, Parallel GEMM-based convolutions for deep learning on multicore ARM and RISC-V architectures, J. Syst. Archit. (2024) 103186.
[3] Y.R. Choi, V. Stegailov, Multi-GPU GEMM algorithm performance analysis for NVIDIA and AMD GPUs connected by NVLink and PCIe, in: International Conference on Mathematical Modeling and Supercomputer Technologies, Springer, 2022, pp. 281-292.
[4] B. Feng, Y. Wang, G. Chen, W. Zhang, Y. Xie, Y. Ding, EGEMM-TC: accelerating scientific computing on tensor cores with extended precision, in: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021, pp. 278-291.
[5] N.S. Murthy, F. Catthoor, M. Verhelst, Optimization of block-scaled integer GeMMs for efficient DNN deployment on scalable in-order vector processors, J. Syst. Archit. (2024) 103236.
[6] M. Ahmed, R. Seraj, S.M.S. Islam, The K-means algorithm: A comprehensive survey and performance evaluation, Electronics 9 (8) (2020) 1295.
[7] G. Chen, Y. Ding, X. Shen, Sweet KNN: An efficient KNN on GPU through reconciliation between redundancy removal and regularity, in: 2017 IEEE 33rd International Conference on Data Engineering, ICDE, IEEE, 2017, pp. 621-632.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, 2018, arXiv preprint arXiv:1810.04805.
[9] AMD, Matrix core, 2024, https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/.
[10] S. Markidis, S.W. Der Chien, E. Laure, I.B. Peng, J.S. Vetter, NVIDIA Tensor Core programmability, performance & precision, in: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW, IEEE, 2018, pp. 522-531.
[11] AMD, AMD CDNA 2.0 architecture, 2024, https://www.amd.com/content/dam/amd/en/documents/instinct-business-docs/white-papers/amd-cdna2-white-paper.pdf.
[12] W. Sun, A. Li, T. Geng, S. Stuijk, H. Corporaal, Dissecting Tensor Cores via microbenchmarks: Latency, throughput and numeric behaviors, IEEE Trans. Parallel Distrib. Syst. 34 (1) (2022) 246-261.
[13] G. Park, B. Park, M. Kim, S. Lee, J. Kim, B. Kwon, S.J. Kwon, B. Kim, Y. Lee, D. Lee, LUT-GEMM: Quantized matrix multiplication based on luts for efficient inference in large-scale generative language models, 2022, arXiv preprint arXiv:2206.09557.
[14] J. Liu, R. Gong, X. Wei, Z. Dong, J. Cai, B. Zhuang, Qllm: Accurate and efficient low-bitwidth quantization for large language models, 2023, arXiv preprint arXiv:2310.08041.
[15] Y. Wang, B. Feng, Z. Wang, G. Huang, Y. Ding, TC-GNN: Bridging sparse GNN computation and dense Tensor Cores on GPUs, in: 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 149-164.
[16] R. Fan, W. Wang, X. Chu, Fast sparse GPU kernels for accelerated training of graph neural networks, in: 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2023, pp. 501-511.
[17] T. Utsugiri, T. Kouya, Acceleration of multiple precision matrix multiplication using Ozaki scheme, 2023, arXiv preprint arXiv:2301.09960.
[18] M. Tatsumi, S.-I. Filip, C. White, O. Sentieys, G. Lemieux, Mixing low-precision formats in multiply-accumulate units for DNN training, in: 2022 International Conference on Field-Programmable Technology, ICFPT, IEEE, 2022, pp. 1-9.
[19] Y. Tortorella, L. Bertaccini, L. Benini, D. Rossi, F. Conti, RedMule: A mixed-precision matrix-matrix operation engine for flexible and energy-efficient on-chip linear algebra and TinyML training acceleration, Future Gener. Comput. Syst. 149 (2023) 122-135.
[20] D.N. Parikh, R.A. van de Geijn, G.M. Henry, Cascading GEMM: High precision from low precision, 2023, arXiv preprint arXiv:2303.04353.
[21] M. Fasi, N.J. Higham, F. Lopez, T. Mary, M. Mikaitis, Matrix multiplication in multiword arithmetic: Error analysis and application to GPU tensor cores, SIAM J. Sci. Comput. 45 (1) (2023) C1-C19.
[22] P. Valero-Lara, I. Jorquera, F. Lui, J. Vetter, Mixed-precision S/DGEMM using the TF32 and TF64 Frameworks on low-precision AI tensor cores, in: Proceedings of the SC23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 2023, pp. 179-186.
[23] X.-Y. Liu, Z. Zhang, Z. Wang, H. Lu, X. Wang, A. Walid, High-performance tensor learning primitives using GPU Tensor Cores, IEEE Trans. Comput. 72 (6) (2022) 1733-1746.
[24] A. Abdelfattah, H. Anzt, E.G. Boman, E. Carson, T. Cojean, J. Dongarra, A. Fox, M. Gates, N.J. Higham, X.S. Li, et al., A survey of numerical linear algebra methods utilizing mixed-precision arithmetic, Int. J. High Perform. Comput. Appl. 35 (4) (2021) 344-369.
[25] M.O. Lam, J.K. Hollingsworth, B.R. de Supinski, M.P. Legendre, Automatically adapting programs for mixed-precision floating-point computation, in: Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, 2013, pp. 369-378.
[26] A. Dakkak, C. Li, J. Xiong, I. Gelado, W.-m. Hwu, Accelerating reduction and scan using tensor core units, in: Proceedings of the ACM International Conference on Supercomputing, 2019, pp. 46-57.
[27] M. Fasi, N.J. Higham, M. Mikaitis, S. Pranesh, Numerical behavior of NVIDIA tensor cores, PeerJ Comput. Sci. 7 (2021) e330.
[28] AMD, RocWMMA: ROCm wavefront-level matrix multiply and accumulate, 2024, https://github.com/ROCmSoftwarePlatform/rocWMMA.
[29] D. Yan, W. Wang, X. Chu, Demystifying Tensor Cores to optimize half-precision matrix multiply, in: 2020 IEEE International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2020, pp. 634-643.
[30] AMD, Next generation BLAS implementation for ROCm platform, 2024, https://github.com/ROCm/rocBLAS.
[31] A. Kerr, D. Merrill, J. Demouth, J. Tran, Cutlass: Fast linear algebra in CUDA C++, NVIDIA Dev. Blog (2017).
[32] B. Tuomanen, Hands-On GPU Programming with Python and CUDA: Explore high-performance parallel computing with CUDA, Packt Publishing Ltd, 2018.
[33] D. Lustig, S. Sahasrabuddhe, O. Giroux, A formal analysis of the NVIDIA PTX memory consistency model, in: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 257-270.
[34] D. Mukunoki, K. Ozaki, T. Ogita, T. Imamura, Accurate matrix multiplication on binary128 format accelerated by Ozaki scheme, in: Proceedings of the 50th International Conference on Parallel Processing, 2021, pp. 1-11.
[35] J. Huang, L. Rice, D.A. Matthews, R.A. van de Geijn, Generating families of practical fast matrix multiplication algorithms, in: 2017 IEEE International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2017, pp. 656-667.
[36] Y. Wang, B. Feng, Y. Ding, QGTC: accelerating quantized graph neural networks via GPU tensor core, in: Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2022, pp. 107-119.
[37] B. Li, S. Cheng, J. Lin, Tcfft: Accelerating half-precision FFT through tensor cores, 2021, arXiv preprint arXiv:2104.11471.
[38] F. Zhang, Y. Hu, H. Ding, Z. Yao, Z. Wei, X. Zhang, X. Du, Optimizing random access to hierarchically-compressed data on GPU, in: SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, 2022, pp. 1-15.
[39] R. Sakamoto, M. Kondo, K. Fujita, T. Ichimura, K. Nakajima, The effectiveness of low-precision floating arithmetic on numerical codes: a case study on power consumption, in: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2020, pp. 199-206.
[40] Y. Sun, L. Zheng, Q. Wang, X. Ye, Y. Huang, P. Yao, X. Liao, H. Jin, Accelerating sparse deep neural network inference using GPU Tensor Cores, in: 2022 IEEE High Performance Extreme Computing Conference, HPEC, IEEE, 2022, pp. 1-7.
[41] Z. Ma, H. Wang, G. Feng, C. Zhang, L. Xie, J. He, S. Chen, J. Zhai, Efficiently emulating high-bitwidth computation with low-bitwidth hardware, in: Proceedings of the 36th ACM International Conference on Supercomputing, 2022, pp. 1-12.
[42] Z. Lin, A. Sun, X. Zhang, Y. Lu, MixPert: Optimizing mixed-precision floating-point emulation on GPU integer tensor cores, in: Proceedings of the 25th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, 2024, pp. 34-45.
[43] H. Ootomo, R. Yokota, Recovering single precision accuracy from tensor cores while surpassing the FP32 theoretical peak performance, Int. J. High Perform. Comput. Appl. 36 (4) (2022) 475-491.
[44] P. Blanchard, N.J. Higham, F. Lopez, T. Mary, S. Pranesh, Mixed precision block fused multiply-add: Error analysis and application to GPU tensor cores, SIAM J. Sci. Comput. 42 (3) (2020) C124-C141.
[45] N.J. Higham, T. Mary, Mixed precision algorithms in numerical linear algebra, Acta Numer. 31 (2022) 347-414.
[46] Z. Jia, M. Maggioni, B. Staiger, D.P. Scarpazza, Dissecting the NVIDIA volta GPU architecture via microbenchmarking, 2018, arXiv preprint arXiv:1804.06826.
[47] M. Lu, B. He, Q. Luo, Supporting extended precision on graphics processors, in: Proceedings of the Sixth International Workshop on Data Management on New Hardware, 2010, pp. 19-26.
[48] A. Thall, Extended-precision floating-point numbers for GPU computation, in: ACM SIGGRAPH 2006 Research Posters, 2006, pp. 52-es.
[49] B. Ferrarini, M. Waheed, S. Waheed, S. Ehsan, M.J. Milford, K.D. McDonald-Maier, Exploring performance bounds of visual place recognition using extended precision, IEEE Robot. Autom. Lett. 5 (2) (2020) 1688-1695.
[50] N.J. Higham, T. Mary, Sharper probabilistic backward error analysis for basic linear algebra kernels with random data, SIAM J. Sci. Comput. 42 (5) (2020) A3427-A3446.
[51] H. Ootomo, H. Manabe, K. Harada, R. Yokota, Quantum circuit simulation by SGEMM emulation on tensor cores and automatic precision selection, in: International Conference on High Performance Computing, Springer, 2023, pp. 259-276.
[52] N. Nakasato, A fast GEMM implementation on the cypress GPU, ACM SIGMETRICS Perform. Eval. Rev. 38 (4) (2011) 50-55.
[53] M. De Santis, G. Eichfelder, J. Niebling, S. Rocktäschel, Solving multiobjective mixed integer convex optimization problems, SIAM J. Optim. 30 (4) (2020) 3122-3145.
[54] R.S. Burachik, C.Y. Kaya, M.M. Rizvi, Algorithms for generating pareto fronts of multi-objective integer and mixed-integer programming problems, Eng. Optim. 54 (8) (2022) 1413-1425.
[55] NVIDIA, Programming tensor cores in CUDA 9, 2024, https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/.
[56] AMD, Advanced profiling and analytics for AMD hardware, 2024, https://github.com/ROCm/omniperf.
[57] NVIDIA, The user guide for nsight compute, 2024, https://docs.nvidia.com/nsight-compute/NsightCompute/index.html.
[58] G. Heo, S. Lee, J. Cho, H. Choi, S. Lee, H. Ham, G. Kim, D. Mahajan, J. Park, Neupims: Npu-pim heterogeneous acceleration for batched llm inferencing, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol. 3, 2024, pp. 722-737.