Journal of Systems Architecture 160 (2025) 103336
Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/sysarc

LE-GEMM: A lightweight emulation-based GEMM with precision refinement on GPU

Yu Zhang a, Lu Lu a,b,c,∗, Zhanyu Yang a, Zhihong Liang d,e, Siliang Suo d,e
a School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
b Peng Cheng Laboratory, Shenzhen 518055, China
c Pazhou Laboratory, Guangzhou 510005, China
d Electric Power Research Institute, CSG, Guangzhou, Guangdong, China
e Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou, Guangdong, China

Keywords: GEMM; Thread parallelism analytic model; Multi-level pipeline; Matrix Core/Tensor Core

Abstract: Many special hardware units, such as Matrix Core and Tensor Core, have recently been designed and applied in various scientific computing scenarios. These units support tensor-level computation with different precisions on GPU. Previous studies have proposed methods for computing single-precision GEneral Matrix Multiplication (GEMM) with half-precision matrices. However, this routine often leads to some loss of accuracy, which limits its application. This paper proposes a Lightweight Emulation-based GEMM (LE-GEMM) on GPU that includes a lightweight emulation algorithm, a thread parallelism analytic model, and an efficient multi-level pipeline implementation to accelerate the computation process without compromising accuracy requirements. First, we propose a lightweight emulation algorithm that combines a precision transformation process with a GEMM emulation calculation to achieve better computational accuracy and performance. Second, a thread parallelism analytic model is designed to analyze and guide the selection of the optimal tiling scheme based on various computing scenarios and hardware.
Third, an efficient multi-level pipeline is implemented, which maximizes instruction-level parallelism and latency hiding. Several comparison experiments were conducted on two commonly used GPU platforms: an AMD platform and an NVIDIA platform. The experimental results show that the proposed method outperforms previous approaches in terms of both computational accuracy and speed.

∗ Corresponding author at: School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China.
E-mail addresses: yuzhang0722@163.com (Y. Zhang), lul@scut.edu.cn (L. Lu), yangzhanyu@hotmail.com (Z. Yang), liangzh@csg.cn (Z. Liang), suosl@csg.cn (S. Suo).
https://doi.org/10.1016/j.sysarc.2025.103336
Received 11 July 2024; Received in revised form 8 December 2024; Accepted 2 January 2025; Available online 17 January 2025.
1383-7621/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

1. Introduction

GEneral Matrix Multiplication (GEMM) has been extensively involved in High-Performance Computing (HPC) [1], deep learning applications [2], and other scientific computing fields [3]. In these fields, applications or scenarios in which GEMM is the primary building block are referred to as GEMM-based scientific computing [4,5]. These applications involve a large number of matrix multiplication workloads, and GEMM's computation time is a significant part of the overall process. Specifically, typical applications such as K-Means [6], KNN [7], and Feedforward Neural Networks (FFN) [8] have GEMM computation times exceeding 50% in popular implementations. To cater to this trend, specialized hardware accelerators (e.g., AMD's Matrix Core [9] and NVIDIA's Tensor Core [10]) have been designed and implemented on GPUs to speed up GEMM; they employ low-precision inputs to achieve high performance. In contrast to the regular computation pattern, these mixed-precision cores support tensor-level Fused Multiply-Add (FMA), dramatically speeding up the matrix computation process. For example, half-precision computation with Matrix Core on the AMD MI210 achieves 8× higher throughput than the ROCm Core [11]. Based on this feature, previous studies have proposed ways to exploit low-precision matrices to obtain high-precision results and achieve high performance on GPUs [12–14].

Exploiting low precision to speed up GEMM is an effective route in GEMM-based scientific computing fields, where matrix multiplication operations can tolerate low precision [15,16]. The precision transformation process, which has a significant impact on the accuracy of the final calculated results, is a crucial part of this route [17]. Data value errors arise during the transition from high to low precision due to the loss in the mantissa representation. After several FMA calculations, the final result therefore has some error compared to the standard one. This phenomenon is known as precision loss, and it significantly restricts the application of mixed-precision calculations [18,19]. Some previous studies have utilized the calculation results of additional low-precision variables to reduce precision loss in the calculation results [20–23].

Low-precision variables and additional computations are utilized for precision refinement [10], which reduces the impact of precision loss on the computational results.
The implementation of this approach on GPUs that feature mixed-precision arithmetic is still hindered by several issues. The first, and a crucial factor, is the data transformation process. Loss of the original data mantissa during the precision transformation has an impact on the final outcome. As a consequence of the disparity in mantissa lengths, low-precision data cannot be saved beyond its mantissa length range, which leads to numerical errors after the precision transformation process. The effect of this error is amplified in the final calculation result after multiple FMA operations, opening a gap between the calculated result and the standard result. The second is the acceleration capability of these specialized hardware units. They provide GEMM with powerful acceleration capabilities, accompanied by special data layouts, which pose challenges to efficient kernel design. For the GEMM operation, tiling is a common way of assigning computational tasks: the matrix is segmented into multiple tiles (smaller matrices), and a workgroup (thread block) is responsible for computing one or several tiles. When the tile size is too large, the workload of the threads becomes too heavy, reducing thread parallelism. Conversely, if the tile size is too small, there is frequent workgroup switching, which increases scheduling overhead. The computation allocation of tiles also affects data reuse in a workgroup. In terms of data access, these specialized hardware units perform GEMM operations with fragments, which requires high utilization of the GPU's multi-level memory. Ineffective organization and manipulation of data frequently negate the speed benefits of the tensor-level patterns, leading to memory-wall bottlenecks on the GPU.

To address the above issues, we propose LE-GEMM, which includes a lightweight emulation algorithm, a thread parallelism analytic model, and a multi-level pipeline implementation to achieve better computational accuracy and speed. Our contributions can be summarized as follows:
• A lightweight emulation algorithm is proposed that employs bit-cutting and scaling techniques to preserve the original values and reduce the loss of precision, while removing redundant terms in the computation process to reduce the computational overhead.
• A thread parallelism analytic model based on the lightweight emulation GEMM is proposed, which allows the selection of the optimal tiling scheme through two aspects: the number of parallel threads and the single-thread performance.
• An efficient multi-level pipeline is implemented based on the access discrepancy between the various levels of memory and the GEMM workflow, utilizing double-buffer and pre-fetch techniques to hide access latency and improve Instruction-Level Parallelism (ILP).
• The proposed single-precision GEMM acceleration method achieves state-of-the-art speed and accuracy on two GPU platforms.

The rest of the paper is organized as follows. Section 2 introduces the background of the specialized hardware units (Matrix Core and Tensor Core) and matrix multiplication based on the Ozaki scheme. Section 3 provides related work. Section 4 presents the details of the proposed acceleration method for single-precision GEMM. Section 5 demonstrates and evaluates the experimental results. The conclusions of the paper are given in Section 6. Section 7 outlines potential areas for future work.

2. Background

2.1. Matrix core and tensor core

The amount of computation for GEMM has gradually increased with the development of deep learning. In response to this trend, AMD and NVIDIA are adding specialized hardware units and cores to their latest GPUs that support mixed-precision computing [24,25]. In the following, we discuss the details of the specialized cores.

Computing pattern and memory layout. The GPU cores can be classified into two types based on supported precision: same-precision cores (FP16 Core, FP32 Core, etc.), which only support computation in a single precision, and mixed-precision cores (Tensor Core or Matrix Core), which support computation across different precisions [26,27]. The GEMM supported by the mixed-precision core can be expressed as D = α × AB + β × C, where A and B can be FP16 or FP32 matrices, C and D can be configured as FP32 or double-precision (FP64) matrices, and α and β denote scalar constants [28]. This means that the mixed-precision core can utilize low-precision input to obtain high-precision calculation results [10,29]. The details are shown in Fig. 1(a). In terms of computing patterns, the mixed-precision core utilizes a new memory layout, the fragment, which stores matrix elements in either row-major or column-major order. Different from the same-precision core, the matrix multiplication operation is performed in the mixed-precision core using the fragment as a unit. The fragment size is ‘‘16 × 16’’, ‘‘32 × 32’’, or another fixed shape (Fig. 1(b) provides an example based on a ‘‘16 × 16’’ fragment). Note that different hardware architectures have varying support for fragment sizes. For example, on the NVIDIA A800, the maximum supported fragment size is ‘‘16 × 16’’ [10]; on the AMD MI210, rocWMMA can support fragments with a size of ‘‘16 × 32’’ or ‘‘32 × 32’’ [11]. Different fragment sizes have a particular impact on the workload allocation of GEMM operations.

Programming interface and implementation. Matrix Core and Tensor Core are hardware cores delivered by two different GPU vendors (AMD and NVIDIA). GPU vendors have implemented various approaches, depending on their hardware configurations, to facilitate their use by researchers and engineers. On the AMD platform, the first approach is rocWMMA (ROCm Wavefront-level Matrix Multiply and Accumulate), AMD's C++ interface that leverages Matrix Core to accelerate GEMM operations. The usage and functionality of this interface are very similar to those of WMMA [28]; the difference between the two is that rocWMMA calls the API handle with the keyword rocwmma. The second approach is the compiler instruction [9], which follows the specific syntax ‘‘__builtin_amdgcn_wmma_’’. The third approach uses rocBLAS [30], which provides a functional interface and allows direct operation of a mixed-precision core by supplying the matrix parameters.

There are multiple ways to call Tensor Core on the NVIDIA platform, such as WMMA (Warp-level Matrix Multiply and Accumulate) [10], CUTLASS [31], cuBLAS [32], and PTX instructions [33]. WMMA provides several specific programming interfaces to load (wmma::load_matrix_sync), compute (wmma::mma_sync), and store (wmma::store_matrix_sync) matrices in fragments [10]. Based on this interface, the matrix is first tiled into multiple fragments, which can be shaped to fixed sizes, and then matrix multiplication is performed. CUTLASS (CUDA Templates for Linear Algebra Subroutines) [31] is a series of CUDA template header libraries for implementing high-performance GEMM and related computations at all scales and levels within CUDA. It provides a variety of GEMM computation interfaces (hgemm, sgemm, dgemm, and wgemm) to optimize data movement and multiply-accumulate operations in fragments to achieve high performance. cuBLAS [32] is a basic linear algebra subroutines library provided and implemented by NVIDIA; it provides two interfaces (cublasGemmEx and cublasSgemm) that support GEMM on Tensor Core. In this approach, it is necessary to set the math mode of cuBLAS using the cublasSetMathMode function. PTX provides a series of programming APIs and instruction sets for general-purpose parallel programming and is designed to be efficient on NVIDIA GPUs that support the defined computation functions. The terminological differences and relationships between ROCm and CUDA are given in Table 1.

Fig. 1. The precision profiling and the computing pattern of mixed-precision units (Fig. 1(a) shows an example of precision profiling, where the input data is FP16, the product and accumulator are FP32, and the output is FP32. Fig. 1(b) illustrates a fragment-based GEMM operation with a ‘‘16 × 16’’ fragment).

Table 1
ROCm/CUDA terminology.
• Compute Unit (CU) / Streaming Multiprocessor (SM): One among many parallel vector processors in a GPU with parallel ALUs. All wavefronts in a workgroup are assigned to the same CU.
• Kernel / Kernel: Functions initiated to the GPU that are executed by multiple parallel CUs or SMs. Multiple kernels can work in parallel with the CPU.
• Wavefront / Warp: A collection of operations that execute in lockstep, run the same instructions, and follow the same control-flow path; individual lanes can be masked. Think of it as a vector thread: a wavefront is a 64-wide vector op on AMD GPUs, and a warp is a 32-wide vector op on NVIDIA GPUs.
• Workgroup / Thread block: A collection of wavefronts (warps); all wavefronts in a workgroup are assigned to the same CU (SM).
• Work-item/Thread / Thread: GPU programming models can handle this as a separate thread of execution, although it does not necessarily make forward sub-wavefront progress.
• Global Memory / Global Memory: The DRAM memory accessed by the GPU, traversing some layers of cache.
• Local Memory / Shared Memory: Scratchpad that allows communication between wavefronts within a workgroup.
• Private Memory / Local Memory: Per-thread private memory, often mapped to registers.

2.2. Matrix multiplication based on the Ozaki scheme

The Ozaki scheme is an error-free method for computing matrix multiplication [34]. It involves translating the original matrix multiplication into a sum of matrices that can be multiplied together. Fig. 2 is a schematic of matrix multiplication based on the Ozaki scheme. The detailed procedure can be divided into three steps: splitting, computation, and summation. In the splitting step, matrix A and matrix B are first divided at the element level to obtain multiple submatrices (A_i, B_j). Then, in the computation step, matrix multiplication is performed on the submatrices of A (A_i) and the submatrices of B (B_j) to obtain the submatrices of C (C_{i,j}). Finally, all these submatrices are summed to obtain the result matrix C.

Based on the Ozaki scheme, a single-precision matrix multiplication calculation method can be described as follows. For simplicity, matrix multiplication with FP32 is used as an example. First, in the element-level splitting step, the FP32-based matrix is converted to FP16 precision by the precision transformation function, while the error is preserved. The detailed expression is as follows:
Specifically, Dakkak’s 𝐴𝐹 𝑃 16 ← toFP16(𝐴𝐹 𝑃 32 ) (1) method [26] was implemented on the Tesla V100 and achieved 89%– 98% of peak memory bandwidth and reduced power consumption. 𝑅𝐴 ← toFP16(𝐴𝐹 𝑃 32 − 𝐴𝐹 𝑃 16 ) (2) Li et al. [37] proposed a half-precision Fast Fourier Transform (FFT) approach based on Tensor Core. This approach leverages the fragments where 𝐴𝐹 𝑃 32 and 𝐴𝐹 𝑃 16 denote the FP32-based and FP16-based matrix 𝐴, respectively. toFP16() is the transformation function that transforms in Tensor Core to support the special operations required for FFT matrix 𝐴 from FP32 to FP16, and 𝑅𝐴 represents the residual matrix and optimizes the data arrangement in GPU memory [38]. To further between different precision. Note that this expression is analogous to explore these hardware units’ workflow and application prospects, matrix 𝐵. With the support of mixed-precision cores for floating-point Sakamoto et al. [39] improve energy efficiency by analyzing memory arithmetic with different precision, 𝐴𝐹 𝑃 32 𝐵𝐹 𝑃 16 can be rewritten as and power consumption of low-precision data during computation. The follows: above studies aim to optimize the use of these specialized hardware units to accelerate the computing process in specific scientific com- 𝐴𝐹 𝑃 32 𝐵𝐹 𝑃 16 = (𝐴𝐹 𝑃 16 + 𝑅𝐴 )𝐵𝐹 𝑃 16 = 𝐴𝐹 𝑃 16 𝐵𝐹 𝑃 16 + 𝑅𝐴 𝐵𝐹 𝑃 16 (3) puting scenarios based on their computational characteristics [40,41]. These approaches maximize the acceleration capability of matrix multi- Hence, the matrix multiplication for 𝐴𝐹 𝑃 32 𝐵𝐹 𝑃 32 can be expressed plication operation without changing the internal computation process as: of specialized hardware units. However, FMA operations between dif- 𝐴𝐹 𝑃 32 𝐵𝐹 𝑃 32 = (𝐴𝐹 𝑃 16 + 𝑅𝐴 )(𝐵𝐹 𝑃 16 + 𝑅𝐵 ) ferent precisions will inevitably result in a certain precision loss in (4) = 𝐴𝐹 𝑃 16 𝐵𝐹 𝑃 16 + 𝐴𝐹 𝑃 16 𝑅𝐵 + 𝑅𝐴 𝐵𝐹 𝑃 16 + 𝑅𝐴 𝑅𝐵 the calculation results, which will limit the application scope of these 3 Y. Zhang et al. 
hardware units [21,22,42].

Fig. 2. The schematic of the matrix multiplication based on the Ozaki scheme (C = AB as an example).

To mitigate the loss of precision, several improvement approaches have been proposed. Markidis et al. [10] analyzed the programmability and precision of Tensor Core and proposed a precision refinement method for single-precision (FP32) matrix multiplication that takes advantage of half precision (FP16) to store residuals in transformations. Based on this research, Ootomo et al. [43] improved the rounding method for single-precision floating-point numbers; they reduced the redundancy in error-correction terms by considering the effect of the rounding method and the implicit bit. Others have studied the error analysis of matrix multiplication using Tensor Core and have given theoretical error bounds for matrix multiplication with different precisions [24,44,45]. To explore the efficiency of mixed-precision cores, Jia et al. [46] studied the arrangement of matrix-to-fragment data in memory and how assembly instructions split the matrix. To support the extended-precision computing pattern, several methods have been proposed for limited-precision hardware, combining multiple low-precision computation instructions to simulate extended-precision computation [47–49]. Inspired by this approach, Feng et al. [4] improved Markidis's method for single-precision matrix multiplication, modifying the rounding method and optimizing the extended-precision computation at the instruction level. In addition, Fasi et al. [21] investigated the application of the multiword methodology to improve the performance–accuracy trade-off of matrix multiplication with mixed-precision cores, focusing specifically on the Tensor Core available on NVIDIA GPUs. Following this research direction, Parikh et al. [20] presented insights and possibilities for implementing higher-precision matrix multiplication with Tensor Core, and provided an implementation of the FP64x2 GEMM scheme with variable FP64 inputs. The common feature of these studies is the use of low-precision inputs to obtain high-precision computational results at the cost of some additional computation. In this way, the specialized hardware units (Matrix Core and Tensor Core) can perform high-performance matrix multiplication while reducing the impact of precision loss [44,50].

4. LE-GEMM

This section gives a detailed description of LE-GEMM, which includes the lightweight emulation algorithm, the thread parallelism analytic model, and the efficient multi-level pipeline design. First, we introduce the lightweight emulation algorithm, which provides more efficient and accurate calculation performance based on specialized hardware units. Then, a thread parallelism analytic model is proposed, which is used to analyze and select the optimal tiling scheme for various computing scenarios and hardware. Finally, we present an efficient multi-level pipeline design for GEMM, which utilizes double-buffer and pre-fetch techniques to hide access latency and improve ILP.

Fig. 3. The workflow of LE-GEMM.

4.1. Lightweight emulation algorithm

Traditional emulation algorithm design assumes that the hardware has the same input and output precision. Typically, a large number of low-precision instructions are used to complete the computation and memory access process, which can result in high computational overhead and reduced performance benefits [4,40,51,52]. However, specialized hardware units, such as Matrix Core and Tensor Core, have been introduced to accelerate the GEMM operation. Multiplication and addition are fused in special instruction primitives, and the accumulation of intermediate results also supports high-precision accumulation to ensure accurate calculation results. This new hardware feature has motivated us to design a lightweight emulation algorithm that accelerates the GEMM computation process by achieving high computational throughput based on low-precision inputs. The lightweight emulation algorithm consists of the precision transformation process and the GEMM emulation calculation, as illustrated in Fig. 3. The precision transformation process is designed to make the low-precision data better preserve the value of the original data, thus achieving more accurate calculation results. The GEMM emulation calculation is designed to achieve high-precision results and speed up the calculation process by leveraging the specialized hardware units and low-precision input.

Floating-point numbers of different precision types have exponents and mantissas of different lengths. In the single-precision format, 8 bits are allotted to the exponent and 23 bits to the mantissa. The half-precision representation has a smaller range than single precision, with the exponent and mantissa being 5 and 10 bits, respectively. In an emulation algorithm, the first step is to split the original matrix, which affects the level of elements in the matrix. There are two different splitting routines: one splits the matrix elements without changing the precision, and the other splits the matrix values and changes the precision type. In this paper, we focus on the implementation of the second routine.
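The second routine — splitting each value into a low-precision head plus a low-precision residual, as in Eqs. (1)–(4) — can be sketched on scalars. The following is a minimal pure-Python illustration, not the paper's GPU implementation; to_fp16_value is a simplified stand-in for an FP32→FP16 cast that rounds to an 11-bit significand (FP16's 10 mantissa bits plus the implicit bit) and ignores FP16's exponent range and subnormals:

```python
import math
import struct

def to_fp16_value(x: float) -> float:
    """Round x to an 11-bit significand -- a simplified FP16 cast."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))  # exponent of the leading bit
    ulp = 2.0 ** (e - 10)              # spacing of an 11-bit significand
    return round(x / ulp) * ulp

# Split a and b as in Eqs. (1)-(2): low-precision head + residual.
a = struct.unpack('f', struct.pack('f', 1.2345678))[0]  # an FP32 value
b = struct.unpack('f', struct.pack('f', 2.7182818))[0]
a16 = to_fp16_value(a)
ra = to_fp16_value(a - a16)
b16 = to_fp16_value(b)
rb = to_fp16_value(b - b16)

# Eq. (4): the four-term expansion recovers most of the lost mantissa.
naive = a16 * b16
refined = a16 * b16 + a16 * rb + ra * b16 + ra * rb
print(abs(a * b - naive), abs(a * b - refined))
```

The head-only product carries the full FP16 truncation error, while the expanded form is accurate to roughly the residuals' own rounding error — the scalar analogue of the splitting step in Fig. 2.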
However, the transformation process inevitably leads to precision loss. Therefore, an additional FP16 value is used to preserve the residual and improve the accuracy of the calculated results. Rounding is a popular precision transformation technique that has previously been employed to transform from high precision to low precision [20,45]. The common rounding methods include RN, RZ, RU, and RD.¹ The rounding method has the advantage of taking into account the representation of the low-bit part of the mantissa, which lies after the rounding bit. However, it is important to note that each rounding method inevitably introduces some error in the final value. The primary reason is that the mantissa is rounded up or down, leading to numerical deviations. For instance, when transforming from FP32 to FP16 using the RN method, the final value at the 13th mantissa bit takes the low bits of the mantissa (bits 0–12) into account to achieve the nearest round. A_FP16 and R_A are responsible for preserving the high-bit and low-bit mantissa of A_FP32, respectively. From a numerical perspective, A_FP16 has a larger effect than R_A on the computation results. Therefore, it is important to preserve the high bits of the mantissa during the transformation while saving as many low bits as possible.

¹ RN, RZ, RU, and RD represent Round-to-nearest-even, Round-towards-Zero, Round-Up, and Round-Down, respectively.

Compared to previous work, we adopt a bit-cut to preserve the high bits of FP32 during the precision transformation process, considering the impact of the high and low bits on errors. To achieve the precision transformation from A_FP32 to A_FP16, a bit-cut method is employed to preserve the high mantissa bits of A_FP32. In this way, the higher mantissa (bits 13–22 of A_FP32 in Fig. 4) is copied into A_FP16. For the residual R_A, the introduced scaling operation requires dividing by the scaling factor (2^10 in Fig. 4) in the calculation process to ensure correct results. Note that, due to the difference in exponent length between single precision and half precision, additional processing is required to avoid errors in transforming the exponent part. Considering that the exponent biases of single precision and half precision are 127 and 15, it is necessary to subtract the bias from the exponent part to avoid errors in the transformed exponent bits. To avoid the impact of the bias, we preserve the exponent part in an intermediate variable, which requires subtracting the bias value; we then add the corresponding bias when saving the intermediate variable to the half-precision exponent part. This approach offers the following benefits: compared to traditional rounding, it does not generate rounding-up or rounding-down, which can affect the stability of the high bits; and it does not depend on the rounding API of the platform, since rounding regulations differ across platforms. The bit-cut is based on the binary format of floating-point numbers, and it is compatible across platforms.

Fig. 4 illustrates the details of the precision transformation process. For the computation of R_A, since the mantissa of FP16 is only 10 bits long, it cannot fully save all the remaining mantissa. Based on these two factors, a residual is calculated from A_FP32 to A_FP16. The residual is then multiplied by the scaling factor, 2^10, to preserve it. This does not alter the representation, but it maps the initial value into the larger range that can be represented by R_A. Note that, since R_A was obtained by scaling, the processing used in subsequent calculations must recover the original residual value. Finally, to preserve the mantissa to the maximum extent, the residuals are saved using rounding. The precision transformation process can be rewritten as follows:

  A_FP16 = bit-cut(A_FP32)    (5)
  R_A = round((A_FP32 − A_FP16) × 2^10)    (6)
  B_FP16 = bit-cut(B_FP32)    (7)
  R_B = round((B_FP32 − B_FP16) × 2^10)    (8)

where bit-cut() denotes the bit-cut operation and round() represents RZ.

Fig. 4. The bit-cut and scaling operation.

This section presents the details of the precision transformation process involving bit-cut and scaling. In the scaling step, it is essential to divide the value by the scaling factor in the last step of the calculation to cancel the effect of the scaling factor. Therefore, the GEMM emulation calculation, which is based on the above precision transformation process, can be described as follows:

  A_FP32 B_FP32 = (A_FP16 + R_A/2^10)(B_FP16 + R_B/2^10)
               = A_FP16 B_FP16 + (A_FP16 R_B + R_A B_FP16)/2^10 + R_A R_B/2^20    (9)

The fourth calculation term in Eq. (9) is divided by 2^20, so it has little effect on the final computation result; moreover, additional computation terms increase the overall computation effort. For this purpose, Eq. (9) is modified as follows:

  A_FP32 B_FP32 = (A_FP16 + R_A/2^10)(B_FP16 + R_B/2^10)
               ≈ A_FP16 B_FP16 + (A_FP16 R_B + R_A B_FP16)/2^10    (10)

Hence, the lightweight emulation algorithm calculation can be described as follows:

  D_FP32 = α × A_FP32 B_FP32 + β × C_FP32
        ≈ α × (A_FP16 B_FP16 + (A_FP16 R_B + R_A B_FP16)/2^10) + β × C_FP32    (11)

where D_FP32 and C_FP32 denote FP32 matrices, A_FP16, R_A, B_FP16, and R_B have the same meaning as in Eq. (10), and α and β are scalar constants.
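The bit-cut, scaling, and two-term reconstruction of Eqs. (5)–(11) can be illustrated on scalars with pure-Python bit manipulation. This is an illustrative sketch, not the GPU kernel: fp16_rz is a simplified RZ cast to an 11-bit significand that ignores FP16's exponent range, and the exponent re-biasing described above is not modeled.

```python
import math
import struct

def bit_cut(x: float) -> float:
    """Eq. (5): keep sign, exponent, and mantissa bits 13-22 of an FP32
    value; clear bits 0-12. Truncation, so no round-up/round-down jitter."""
    (bits,) = struct.unpack('<I', struct.pack('<f', x))
    (hi,) = struct.unpack('<f', struct.pack('<I', bits & 0xFFFFE000))
    return hi  # exactly representable in FP16 when the exponent fits

def fp16_rz(x: float) -> float:
    """Truncate x toward zero to an 11-bit significand (simplified RZ cast)."""
    if x == 0.0:
        return 0.0
    ulp = 2.0 ** (math.floor(math.log2(abs(x))) - 10)
    return math.trunc(x / ulp) * ulp

a = struct.unpack('f', struct.pack('f', 3.1415926))[0]
b = struct.unpack('f', struct.pack('f', 1.4142135))[0]
a16, b16 = bit_cut(a), bit_cut(b)
ra = fp16_rz((a - a16) * 2**10)  # Eq. (6): scaled residual fits in FP16
rb = fp16_rz((b - b16) * 2**10)  # Eq. (8)

# Eq. (10): two correction terms, un-scaled by 2**10 at the end.
approx = a16 * b16 + (a16 * rb + ra * b16) / 2**10
print(abs(a * b - a16 * b16), abs(a * b - approx))
```

Dropping the R_A R_B / 2^20 term, as in Eq. (10), leaves an error far below the head-only product's error, which is the trade-off the lightweight emulation algorithm exploits.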
This section introduces a precision transformation process consisting of a two-step bit-cut and rescaling. This process effectively preserves the original data mantissa and avoids the numerical fluctuations caused by rounding. At the same time, based on the above precision transformation process, we have designed a GEMM emulation calculation that removes the cumulative term in Eq. (9) and reduces the overall computation time of GEMM. Algorithm 1 provides an overview of the lightweight emulation algorithm. In Algorithm 1, bit-cut() denotes the bit-cut technique used to preserve the high bits of the mantissa, as detailed in Section 4.1; round() represents the use of rounding to preserve the low bits of the mantissa; mFma_sync() represents using the specialized hardware units to calculate the product of two matrices; and uniformFma_sync() represents the computation of Eq. (11) to obtain the final calculation result D_FP32.

Algorithm 1 The lightweight emulation algorithm
1: Function emulationGemmOperation(D_FP32, A_FP32, B_FP32, C_FP32, α, β)
2:   /* The precision transformation process */
3:   A_FP16 = bit-cut(A_FP32), R_A = round((A_FP32 − A_FP16) × 2^10);
4:   B_FP16 = bit-cut(B_FP32), R_B = round((B_FP32 − B_FP16) × 2^10);
5:   /* GEMM emulation computation process */
6:   /* Initialize accumulators */
7:   Initialize fragsAcc0, fragsAcc1;
8:   /* Calculate each term in Eq. (11) */
9:   fragsAcc0 = mFma_sync(A_FP16, B_FP16, fragsAcc0);
10:  fragsAcc1 = mFma_sync(A_FP16, R_B, fragsAcc1);
11:  fragsAcc1 = mFma_sync(B_FP16, R_A, fragsAcc1);
12:  /* Calculate the final result */
13:  D_FP32 = uniformFma_sync(α, fragsAcc0, fragsAcc1/2^10, β, C_FP32);
14: end Function

4.2. Thread parallelism analytic model

The lightweight emulation algorithm is designed to efficiently map GEMM-based scientific computing to specialized hardware cores, allowing high-precision outputs with low-precision inputs. To fully exploit the acceleration capabilities of the specialized hardware units, an efficient tiling scheme, which determines the impact of the allocation of computational tasks on overall performance, is critical. The tiling scheme segments the matrix into multiple tiles and assigns each tile to a cooperative thread array (thread block) responsible for its computation. Each thread loads and computes the corresponding matrix elements. This process allocates computational tasks based on the size of the matrices in the GEMM operation and the hardware resources.

Fig. 5 illustrates that computing a tile in matrix D requires loading sub-sections of matrices A and B (the gray area) and a tile of matrix C (the blue area) to perform the GEMM operation. This allows multiple tiles to be computed simultaneously on the GPU. To fully exploit the computing power of GPUs, our proposed thread parallelism analytic model analyzes and improves the performance of GEMM computations by considering the number of parallel threads (Thread-Level Parallelism, TLP) and the single-thread performance. Given the tile size, the computation of GEMM can be divided into multiple subparts, each with a set of threads responsible for its computation. This property allows thread parallelism to be computed for a given tiling strategy. Previous methods were based simply on the number of threads in a thread block. Because the thread block is further split into multiple wavefronts (warps) for execution on a CU, such methods ignore the underlying computational pattern on the GPU. Therefore, we use the number of wavefronts to calculate thread parallelism, which gives a more accurate indication of how active the underlying threads are in practice. Based on the matrix tiling scheme and the hardware configuration, TLP can be calculated as follows:

  T_TLP = (M × N)/(T_M × T_N) × φ × T_thread_num    (12)

where M and N represent the dimensions of matrix C, T_M and T_N are the tile sizes, φ is the workgroup-to-wavefront transform factor, and T_thread_num represents the number of threads in a wavefront. Note that T_thread_num has different values across GPU platforms; for example, on the MI210 and the A800, the values of T_thread_num are 64 and 32, respectively.
When considering TLP, it is important to consider the performance 4.2. Thread parallelism analytic model of a single thread, which can be expressed as a compute-to-memory ratio. Comparing the clock cycles of memory access instruction and The lightweight emulation algorithm is designed to efficiently map computation instruction in threads, it is evident that data access in GEMM-based scientific computing to specialized hardware cores, al- memory requires more time. Therefore, it is crucial to ensure that the lowing for high-precision outputs with low-precision inputs. To fully data loaded on the thread is fully utilized. During each iteration, every exploit the acceleration capabilities of specialized hardware units, an thread block is required to perform two tasks. The first step involves 6 Y. Zhang et al. Journal of Systems Architecture 160 (2025) 103336 Fig. 5. The tiling scheme of a single GEMM. reading two matrices (𝐴𝐹 𝑃 16 , 𝑅𝐴 ) of size (𝑇𝑀 , 𝑇𝐾 ), two matrices between the thread block and the matrix tiles, which will also affect the (𝐵𝐹 𝑃 16 , 𝑅𝐵 ) of size (𝑇𝐾 , 𝑇𝑁 ) and matrix C of size (𝑇𝑀 , 𝑇𝑁 ). Since matrix efficiency of data reuse. As shown in Fig. 5(a), the calculation of 3 tiles C of size (𝑇𝑀 , 𝑇𝑁 ) is loaded only in the first iteration and does not in matrix 𝐶 involves the loading of 1 sub-section of matrix 𝐴, and 3 sub- need to be loaded again in subsequent steps, its loading overhead can sections of matrix 𝐵, respectively. In Fig. 5(b), 4 tiles can be computed be neglected. Thus, the memory access instruction with each iteration by loading two sub-sections of the matrix 𝐴 and 𝐵. Comparing these can be expressed as: two routines, it is clear that the larger the cross-sectional area (blue tile in Fig. 5) between the sub-sections loaded by matrices 𝐴 and 𝐵. 𝑇𝑚𝑒𝑚𝑜𝑟𝑦 = 2 × 2 × (𝑇𝑀 × 𝑇𝐾 + 𝑇𝐾 × 𝑇𝑁 ) (13) Fig. 5 indicates that the pattern in (b) has better data reuse. 
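The splitting in lines 3–4 of Algorithm 1 and the emulated product of Eq. (11) can be illustrated with a minimal NumPy sketch. This is our own host-side illustration, not the authors' GPU kernel: a plain FP16 down-conversion stands in for the exact bit-cut, and the function names are ours.

```python
import numpy as np

def split_fp32(x):
    """Split an FP32 array into an FP16 high part and a scaled FP16 residual.

    A plain down-conversion stands in here for the paper's bit-cut; the
    residual is scaled by 2**10 so that it remains representable in FP16.
    """
    hi = x.astype(np.float16)
    res = ((x - hi.astype(np.float32)) * np.float32(2.0**10)).astype(np.float16)
    return hi, res

def emulated_gemm(A, B, C=None, alpha=1.0, beta=0.0):
    """FP32 GEMM emulated with FP16 factors, following Eq. (11)."""
    A16, RA = split_fp32(A)
    B16, RB = split_fp32(B)
    # Promote to FP32 for the products: Matrix/Tensor Core multiplies FP16
    # inputs but accumulates in FP32. The R_A*R_B/2**20 term is dropped,
    # exactly as in Eq. (10).
    a16, ra = A16.astype(np.float32), RA.astype(np.float32)
    b16, rb = B16.astype(np.float32), RB.astype(np.float32)
    acc0 = a16 @ b16                         # line 9 of Algorithm 1
    acc1 = a16 @ rb + ra @ b16               # lines 10-11 of Algorithm 1
    D = alpha * (acc0 + acc1 / np.float32(2.0**10))
    if C is not None:
        D = D + beta * C
    return D.astype(np.float32)

rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, (64, 64)).astype(np.float32)
B = rng.uniform(-1, 1, (64, 64)).astype(np.float32)
ref = (A.astype(np.float64) @ B.astype(np.float64)).astype(np.float32)
naive = A.astype(np.float16).astype(np.float32) @ B.astype(np.float16).astype(np.float32)
err_emu = float(np.abs(emulated_gemm(A, B) - ref).max())
err_naive = float(np.abs(naive - ref).max())
```

Even with a plain down-conversion in place of the bit-cut, the two residual terms recover most of the accuracy lost by a naive all-FP16 product, which is the effect the refinement terms of Eq. (11) are designed to achieve.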
According to the lightweight emulation algorithm, the computation required for each iteration is:

T_flops = 3 × 2 × (T_M × T_N × T_K)    (14)

where T_flops represents the computation per iteration, 2 × (T_M × T_N × T_K) is the amount of computation for multiplying a matrix of size (T_M, T_K) by a matrix of size (T_K, T_N), and the constant 3 corresponds to the three terms in Eq. (11). The ratio of computation to memory access (T_flops/memory) per iteration is therefore:

T_flops/memory = (3 × 2 × (T_M × T_N × T_K)) / (2 × 2 × (T_M × T_K + T_K × T_N)) = 3(T_M × T_N) / (2(T_M + T_N))    (15)

While pursuing higher TLP and compute-to-memory ratios, it is also important to consider the thread workload, which can be divided into two parts: the amount of data loaded and the amount of computation. To avoid an excessive workload reducing thread parallelism, the tiling scheme and the number of tiles each workgroup computes must be set properly. To describe a thread's amount of data loaded (Work_DL) and amount of computation (Work_CP) more precisely, the following two formulas are given:

Work_DL = (T_M × T_N + T_M × T_K + T_K × T_N) / Work_num    (16)

Work_CP = (T_M × T_N) / Work_num    (17)

where Work_num represents the number of threads responsible for computing the tile, and T_M, T_N, and T_K represent the tile sizes detailed in Fig. 5. Work_DL is the number of elements each thread loads, and Work_CP is the number of result elements each thread computes and holds in registers. To ensure efficient parallelism, different numbers of threads must be allocated to tiles of different shapes so that the workload is consistent across thread blocks. When a thread loads too many elements, the data overflows the registers, which increases the data access overhead.

Finally, the selection of the optimal tiling scheme with the thread parallelism analytic model is formally represented as maximizing T_TLP and T_flops/memory. According to Eqs. (12) and (15), T_TLP and T_flops/memory are inversely related for given M, N, and T_thread_num. Generally, smaller tiles achieve better thread parallelism because the computational tasks are distributed over more threads; however, the number of active threads is limited by what the GPU hardware supports. Larger tiles usually achieve a larger compute-to-memory ratio, which represents better single-thread performance. When choosing a tiling scheme, the thread workload must also be considered according to Eqs. (16) and (17). Eqs. (12)–(17) take into account the factors that affect both the number of active threads and single-thread performance, allowing the optimal tiling scheme to be selected for various GPU platforms and computing scenarios. Our thread parallelism analytic model chooses the tiling scheme based on thread-level parallelism and single-thread performance, and the selection can be solved analytically with existing optimization solvers [53,54]; it can also provide a more profitable direction when searching for the best parameters by trial-and-error.

Table 2 presents the chosen parameters for LE-GEMM. In Table 2, (T_M, T_N, T_K) denotes the size of the matrix tiles, (f_m, f_n, f_k) denotes the fragment size for the GEMM computation, and the wavefront/warp size denotes the number of threads in the thread group of each GPU platform, which is determined by its hardware architecture. Because different architectures support different fragment sizes and shapes, the choice of fragment is likewise constrained on the two GPU platforms, and the wavefront/warp size also affects the thread workload allocation. Therefore, when choosing parameters, it is essential to consider the impact of hardware differences on performance.

Table 2
The choice of the parameters on two GPUs.
Name                                AMD MI210        NVIDIA A800
Matrix tile size (T_M, T_N, T_K)    (128, 128, 32)   (128, 256, 32)
Fragment size (f_m, f_n, f_k)       (32, 32, 16)     (16, 16, 16)
Wavefront/Warp size                 64               32

Table 3
The configuration of platforms for evaluation.
Name        AMD-platform   NVIDIA-platform
CPU         EPYC 7663      Platinum 8358
GPU         MI210          A800
OS          Ubuntu 20.04   Ubuntu 20.04
ROCm/CUDA   ROCm 5.6       CUDA 12.0
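The selection rule of Eqs. (12)–(17) can be sketched as a small scoring routine. This is an illustrative sketch, not the authors' solver: the function name is ours, and φ is simplified here to rounding the per-tile thread count up to whole wavefronts.

```python
from math import ceil

def tiling_metrics(M, N, T_M, T_N, T_K, work_num, wavefront=64):
    """Score a candidate tiling with Eqs. (12)-(17).

    phi is modeled as rounding the per-tile thread count up to whole
    wavefronts -- an illustrative stand-in for the paper's
    workgroup-to-wavefront transformation.
    """
    workgroups = ceil(M / T_M) * ceil(N / T_N)                  # tiles of C
    tlp = workgroups * ceil(work_num / wavefront) * wavefront   # Eq. (12)
    t_memory = 2 * 2 * (T_M * T_K + T_K * T_N)                  # Eq. (13)
    t_flops = 3 * 2 * (T_M * T_N * T_K)                         # Eq. (14)
    ratio = t_flops / t_memory        # Eq. (15): 3*T_M*T_N / (2*(T_M + T_N))
    work_dl = (T_M * T_N + T_M * T_K + T_K * T_N) / work_num    # Eq. (16)
    work_cp = (T_M * T_N) / work_num                            # Eq. (17)
    return {"tlp": tlp, "ratio": ratio, "work_dl": work_dl, "work_cp": work_cp}

# Smaller tiles raise TLP but lower the compute-to-memory ratio, and vice
# versa -- the inverse relationship noted after Eq. (15).
small = tiling_metrics(4096, 4096, 128, 128, 32, work_num=256)
large = tiling_metrics(4096, 4096, 256, 128, 32, work_num=256)
```

For M = N = 4096, doubling T_M halves the number of tiles (lower TLP) but raises the ratio of Eq. (15) from 96 to 128, reproducing the trade-off the model resolves.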
4.3. An efficient multi-level pipeline for GEMM

In the GPU memory architecture, the various memory spaces have different sizes and access speeds. From fast to slow, they are register memory, local (shared) memory, and global memory; the speed is inversely proportional to the size of the memory space. Loading data from global memory directly into registers reduces transfer efficiency and does not maximize memory bandwidth. Therefore, in this paper, double buffers are used to alleviate the large differences in memory access speeds, and pre-fetch techniques are used to maximize memory bandwidth and thread instruction parallelism. The workflow for a tile in GEMM is shown in Fig. 6.

First, a double buffer (buffer0 and buffer1) is created in shared memory; the size of a single buffer is related to the accumulated span of matrices A and B along the K-dimension. As shown in Fig. 6, loading instructions fetch the matrix tiles A_FP16, R_A, B_FP16 and R_B in iteration #0, and the data is stored in buffer0. The fetched data corresponds to the yellow tiles A_FP16 and R_A (T_M × T_K) and the green tiles B_FP16 and R_B (T_K × T_N) in Fig. 6. At the same time, the data to be loaded in iteration #1 is fetched into buffer1. This allows the registers to fetch data directly from shared memory without waiting for data to be loaded from global memory. A thread loads data from shared memory into private registers following a specified data layout, and then uses the Matrix Core to complete the computation. When the intermediate result of iteration #0 is complete, the registers read the data needed for the next iteration from buffer1.

For efficient data pipelining and instruction-level parallelism, data for the next iteration can be pre-fetched from global memory into shared memory while data is transferred between shared memory and registers. With this approach, we implement a multi-level access pipeline between the different memory levels, so that access and computation instructions execute in parallel and the ILP of the threads improves. Note that a thread synchronization instruction is issued after each buffer store to ensure data consistency. Each iteration produces an intermediate result that is combined with the accumulation matrix C_FP32 in an FMA operation; after several iterations, the final result D_FP32 is obtained and written back to global memory.

Fig. 6. The workflow of GEMM.

The implementation of the multi-level pipeline must be aware of the register and shared memory capacities of the specific GPU hardware. Once the maximum capacity is exceeded, data overflow occurs, which greatly increases access latency. As described above, register and shared memory occupancy is related to T_M, T_N, and T_K. When a tiling scheme is given, T_M and T_N are determined, so data overflow can be prevented by setting T_K flexibly. For example, when T_M and T_N are 128 and 256 (the A800 configuration in Table 2), T_K can be set to a smaller value such as 16 or 32 to prevent overflow. Note that, to avoid invalid computations on the Matrix Core, T_K should preferably be an integer multiple of the fragment size.

5. Results

This section compares the proposed method with several state-of-the-art approaches and analyzes the experimental results on two GPU platforms. The flowchart of the comparative experiments is shown in Fig. 7. Note that for matrices A ∈ R^(M×K) and B ∈ R^(K×N), the problem size is written as ''M*N*K'' in the following expressions for simplicity. The tested numbers were drawn from [−1, 1] in the comparative experiments.

Fig. 7. The flowchart of the comparative experiment.

5.1. Setup

Experiment platform. We chose two widely used and advanced GPUs (MI210 and A800). The configurations of the experimental platforms and of the GPUs are given in Tables 3 and 4, respectively.

Table 4
The configuration of two GPUs.
Name                   AMD MI210            NVIDIA A800
Architecture           CDNA 2.0             Ampere
Mixed-precision core   Matrix Core          Tensor Core
Caches                 L1 16 KB (per CU)    L1 192 KB (per SM)
                       L2 16 MB             L2 40 MB
Memory                 64 GB HBM2           80 GB HBM2
Bandwidth              1.6 TB/s             2.04 TB/s

Comparison methods. To evaluate the proposed method, we first compare against the default single-precision GEMM routines provided by the respective GPU vendors, rocBLAS and cuBLAS; these do not use the mixed-precision cores and compute FP32 GEMM directly from FP32 inputs. We then compare rocWMMA and WMMA, which use FP16 inputs to compute FP32 results on the Matrix Core or Tensor Core without precision refinement. Finally, the experiments compare Markidis's and Ootomo's methods, which compute FP32 results from FP16 inputs with precision refinement. All experimental results are averaged over 10 runs. The baseline methods are described in Table 5.

Table 5
The description of the baseline methods.
Name           Input   Output   Description
rocBLAS [30]   FP32    FP32     rocblasSgemm on ROCm Core.
cuBLAS [32]    FP32    FP32     cublasSgemm on CUDA Core.
rocWMMA [28]   FP16    FP32     rocWMMA on Matrix Core without precision refinement.
WMMA [55]      FP16    FP32     WMMA on Tensor Core without precision refinement.
Mark [10]      FP16    FP32     Markidis's method implemented on both platforms.
Ooto [43]      FP16    FP32     Ootomo's method implemented on both platforms.

Evaluation criteria. To objectively evaluate the effectiveness of the proposed method, two criteria are used, covering speed and accuracy. For speed, the experimental results are reported as the average TFLOPS (Tera FLoating-point Operations Per Second), calculated as:

TFLOPS = 2((M × N × K) + (M × N)) / (total_time × 1.0e12)    (18)

where M, N and K represent the matrix dimensions of the GEMM and total_time represents the running time in seconds measured on the GPU. For accuracy, the maximum relative error (Max_error) with respect to the CPU computation results is used:

Max_error = Max( |toDouble(V_p(i,j)) − toDouble(V_single(i,j))| / (|toDouble(V_p(i,j))| + |toDouble(V_single(i,j))|) ),
            i = 1, 2, …, M, j = 1, 2, …, N    (19)

where V_p(i,j) denotes the computational result under the precision refinement procedure, which achieves single-precision results from half-precision matrices; V_single(i,j) is the result of a standard single-precision calculation; (i,j) indexes the i-th row and j-th column of the matrix; toDouble() is a precision transformation function used to obtain more accurate errors; M and N represent the dimensions of the result matrix; and Max() selects the maximum over all errors. Meanwhile, the Mean Relative Error Distance (MRED) is also used as an error evaluation metric, calculated as:

MRED = (1/(M × N)) Σ_{i=1..M} Σ_{j=1..N} | (toDouble(V_single(i,j)) − toDouble(V_p(i,j))) / toDouble(V_single(i,j)) |    (20)

where M and N denote the size of the result matrix, and V_single(i,j) and V_p(i,j) have the same meaning as in Eq. (19).
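Under the definitions above, both metrics are a few lines of NumPy. This is an offline check of Eqs. (19)–(20), not the paper's measurement harness:

```python
import numpy as np

def max_error(v_single, v_p):
    """Maximum relative error of Eq. (19); operands are promoted to FP64
    (the role of toDouble()) before the error is formed."""
    s = v_single.astype(np.float64)
    p = v_p.astype(np.float64)
    return float(np.max(np.abs(p - s) / (np.abs(p) + np.abs(s))))

def mred(v_single, v_p):
    """Mean Relative Error Distance of Eq. (20)."""
    s = v_single.astype(np.float64)
    p = v_p.astype(np.float64)
    return float(np.mean(np.abs((s - p) / s)))
```

Note that Eq. (19) normalizes by the sum of the two magnitudes, so a single corrupted entry is reported even when all other entries match exactly, while MRED averages the per-entry relative deviations.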
5.2. The speed improvement with various shapes

To evaluate the speed advantages of our method, we conducted comparative experiments on the two GPU platforms with multiple matrix shapes; the results are presented in Figs. 8–9. In Fig. 8(a)–(c), our method improves performance by 1.31× on average compared to rocBLAS, an arithmetic library commonly used for scientific computing and provided by AMD. We also compare against rocWMMA, which directly uses half-precision inputs to produce a single-precision output without a precision refinement process; the proposed method achieves average speedups of 1.25×, 1.24×, and 1.45× for the (n*n*n), (n*4n*n), and (n*n*4n) shapes, respectively. From Fig. 8(a)–(b), when n ≤ 4096 the performance of rocWMMA is superior to that of the proposed method, mainly for two reasons. First, rocWMMA is a straightforward computation that involves no accuracy refinement procedure, so its computational workload is smaller. Second, with small matrix dimensions the proposed method cannot fully exploit the performance improvement brought by the multi-level pipeline design. As the matrix dimension grows, the performance advantage of the proposed method becomes increasingly apparent, as seen in Fig. 8. Compared to Markidis's and Ootomo's methods, the proposed method achieves speedups of 3.05× and 3.18×, respectively. These two methods use precision refinement but require more computation, and they do not consider the data layout and instruction scheduling characteristics of the specialized hardware accelerators, leading to lower performance. The proposed method uses the thread parallelism analytic model to choose the tiling scheme that segments the matrix, and an efficient multi-level pipeline to improve the efficiency of data access and ILP, making it an effective solution to the low computation speed of precision refinement. Fig. 8 indicates that the proposed method has performance benefits when the matrix dimension is large, further evidencing that LE-GEMM significantly improves performance for both square and skewed matrices.

The speed comparison on the NVIDIA A800 is shown in Fig. 9. In the square matrix case (n*n*n), the proposed method achieves average speedups of 1.37×, 1.40×, 2.78×, and 2.74× over cuBLAS, WMMA, Mark, and Ooto, respectively. As illustrated in Fig. 9, when the matrix size is small, the performance of LE-GEMM is similar to that of the compared methods; for example, when n ≤ 2048, Fig. 9(a) shows that the proposed method performs similarly to cuBLAS and WMMA. This is because a smaller matrix reduces both the number of tiles and the number of iterative accumulations within each tile, diminishing the performance gain from the multi-level pipeline. As the matrix size increases, the benefits of the proposed method become more apparent, and for large matrices it significantly outperforms the other methods. The experiments on the two GPU platforms also show that the performance of the Markidis and Ootomo methods is nearly identical: although Ootomo's method reduces the computational effort by 25%, it requires an additional accumulator, whose initialization and combination stages cost extra time. When the matrix is skewed, the proposed method outperforms the other baseline methods significantly. In Fig. 9(c), when n > 7168 the performance of the WMMA method decreases, because it is a simple, direct implementation that ignores the data loading pipeline and the multi-level memory architecture during the computation; this results in lower efficiency when accumulating along the K-dimension, which is more pronounced for the (n*n*4n) shape, and prevents WMMA from fully utilizing its hardware acceleration capability. Fig. 9(b) illustrates that the proposed method achieves speedups of 1.40×, 1.43×, 2.84×, and 2.80× over the four baseline methods, respectively. As shown in Fig. 9, the proposed method performs better when the matrix size is large, a trend consistent with the AMD platform. The detailed comparative results for the two GPU platforms are presented in Table 6.

Fig. 8. The experimental results with various shapes on MI210.
Fig. 9. The experimental results with various shapes on A800.

Table 6
The speedup on two GPU platforms.
Baseline method     AMD MI210                      NVIDIA A800
                    (n*n*n)  (n*4n*n)  (n*n*4n)    (n*n*n)  (n*4n*n)  (n*n*4n)
rocBLAS/cuBLAS      1.31×    1.31×     1.32×       1.37×    1.40×     1.36×
rocWMMA/WMMA        1.26×    1.24×     1.45×       1.40×    1.43×     1.81×
Mark                2.81×    2.99×     3.35×       2.77×    2.84×     4.67×
Ooto                2.99×    3.10×     3.45×       2.74×    2.80×     3.76×
5.3. The precision error

This section compares the baseline methods in terms of their maximum error; the results are shown in Fig. 10. First, we compare with rocWMMA, which computes FP32 GEMM directly from FP16 on the Matrix Core. In Fig. 10(a), the proposed method reduces the average Max_error by 377.33× compared to rocWMMA. The results indicate that FP16 cannot preserve the full FP32 mantissa because of its limited mantissa length; in addition, rocWMMA lacks precision refinement, resulting in a pronounced Max_error. This comparison shows that the proposed method significantly reduces Max_error relative to the default routine.

The Mark and Ooto methods were then compared separately. Both utilize precision refinement techniques to decrease Max_error and thus improve computational accuracy over the default methods. Fig. 10(a) shows that the proposed method reduces the average Max_error by 74.82× compared to Mark. Although Markidis et al. [10] employ an additional FP16 value in the precision refinement process to reduce Max_error, the effect of the rounding method on FP16 is not considered, which limits the achievable accuracy improvement. The proposed method reduces Max_error by an average of 2.69× compared to Ootomo's method. In Ootomo's method, the additional FP16 value cannot preserve the residual between FP32 and FP16 efficiently due to the limited numerical representation range, and rounding inevitably causes fluctuations in the values preserved in FP16. To alleviate this problem, the proposed method employs a bit-cut of the high bits of the FP32 mantissa, avoiding the value fluctuations caused by rounding. Fig. 10(a) also shows that Max_error increases with the matrix dimension (the K-dimension): since FP16 cannot fully preserve the FP32 values, the error in the calculated results gradually becomes more apparent as the number of accumulated computations grows. The comparisons with the default (rocWMMA) method and the latest (Mark and Ooto) methods demonstrate the effectiveness and advancement of the proposed method on the AMD platform.

Fig. 10. The Max_error with square matrix (n*n*n) on two GPU platforms.

To demonstrate the proposed method's portability, we conducted corresponding experiments on the NVIDIA A800, another popular GPU platform; the Max_error results are shown in Fig. 10(b). To ensure an objective comparison, we implemented Markidis's and Ootomo's methods on the NVIDIA platform using the WMMA interface; the interface name is the only difference between these implementations and the AMD-platform implementations. Fig. 10(b) illustrates that the proposed method significantly reduces Max_error as the matrix size increases, consistent with the results obtained on the AMD platform. The proposed method also has a smaller Max_error than WMMA, Mark, and Ooto, reducing the average Max_error by 571.75×, 2.78×, and 1.41×, respectively. The results on the two platforms show the same trend, indicating that the proposed method achieves a low Max_error on mainstream GPU platforms and further confirming its effectiveness in terms of computational accuracy.

We also report the MRED of the comparative methods, which allows a more comprehensive analysis of the matrix error; the detailed results are presented in Fig. 11. As illustrated in Fig. 11(a), the proposed method exhibits a smaller MRED than rocWMMA, Mark, and Ooto, with reductions of 636.92×, 178.93×, and 1.12× on the MI210, respectively. Combining the experimental results in Figs. 10–11, rocWMMA and WMMA show enormous Max_error and MRED compared to the other methods, which again underscores the importance of precision refinement. On the A800, the proposed method reduces the MRED by 814.87×, 1.75×, and 1.28× compared to WMMA, Mark, and Ooto, respectively. The results in Figs. 10–11 show a consistent trend: Max_error and MRED become more significant as the accumulation dimension increases. As shown in Figs. 10–11, the proposed method has the smallest Max_error and MRED, indicating a smaller upper error limit and mean relative error; together, these comparisons prove that the proposed method has better computational accuracy.

Fig. 11. The Mean Relative Error Distance with square matrix (n*n*n) on two GPU platforms.
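The FP16 mantissa limitation underlying these error gaps can be checked directly. This quick host-side check is ours, not an experiment from the paper; it also shows how a scaled residual, as used by the refinement-based methods, recovers the lost low bits:

```python
import numpy as np

# FP32 carries 23 explicit mantissa bits, FP16 only 10, so a low-order
# increment such as 2**-12 does not survive a direct down-conversion.
x = np.float32(1.0 + 2.0 ** -12)
naive = np.float16(x)                     # rounds back to exactly 1.0
increment_lost = (np.float32(naive) == np.float32(1.0))

# Keeping a scaled FP16 residual, as the refinement-based methods do,
# recovers the lost low bits exactly in this case.
res = np.float16((x - np.float32(naive)) * np.float32(2.0 ** 10))
recovered = np.float32(naive) + np.float32(res) / np.float32(2.0 ** 10)
```

Here every intermediate value is exactly representable, so the round trip is lossless; for general data the residual still captures roughly ten additional mantissa bits.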
This section also provides error distribution histograms to demonstrate each method's error range. The histograms are measured on both GPU platforms with a matrix size of 1024*1024*1024; the detailed error distribution results are shown in Figs. 12–13. As shown in Figs. 12–13, the error distributions of rocWMMA and WMMA are the widest, indicating that the error is significant without an additional precision computation process. Compared to the Mark method, the proposed method has obvious advantages in both error range and maximum error, and this precision advantage is more evident on the MI210. The gap arises because, although Mark considers the impact of down-conversion, simple rounding cannot effectively preserve the low bits; moreover, differences in the implementation of the rounding API across hardware platforms can also affect the calculated results. Compared to the Ooto method, the error distribution histograms of the two methods are similar, but the error of the proposed method is more concentrated around 0 in Figs. 12 and 13, indicating that the proposed method is superior to Ooto in terms of error distribution. Overall, the error histograms shown in Figs. 12 and 13 indicate that the proposed method outperforms the other methods regarding both the error upper limit and the error distribution on both platforms, proving its accuracy advantage.

Fig. 12. Error distribution histogram on MI210.
Fig. 13. Error distribution histogram on A800.

5.4. The empirical Roofline analysis

To further demonstrate the effectiveness of the proposed method, we present a comparison and analysis of the empirical Roofline performance on the two GPU platforms. In these comparisons, GEMMs with larger dimensions are chosen for profiling, which avoids performance estimation bias caused by having too few data elements in the multi-level cache. To measure the performance metrics, we used Omniperf [56] and NCU [57], the profiling tools provided by AMD and NVIDIA, respectively. An empirical Roofline is typically composed of two regions, the ''memory-bound'' and the ''compute-bound'', indicating that the kernel's performance is limited by memory access or by compute capacity, respectively. The specific results are shown in Fig. 14. From Fig. 14, the proposed method lies in the ''compute-bound'' region, unlike Mark, Ooto, and rocWMMA, indicating that it achieves higher bandwidth utilization of the GPU's multi-level memory. The baseline approaches do not take into account the data layout of the specialized hardware units and the multi-level cache characteristics of GPUs, resulting in lower performance. To address this issue, the proposed method uses double buffers and pre-fetching techniques to establish an efficient multi-level pipeline, which empowers the kernel to make better use of the data in the multi-level cache, leading to improved performance. Although rocBLAS and cuBLAS also fall into the ''compute-bound'' region, the proposed method still performs better, which demonstrates its computational effectiveness for single-precision GEMM. The empirical Roofline comparison on the two platforms further confirms the effectiveness and advancement of the proposed method.

Fig. 14. The empirical Roofline performance.

5.5. The improved performance for GEMM-based scientific computing scenarios

To demonstrate the proposed method's acceleration performance in GEMM-based scientific computing, we selected three real-world scenarios that involve numerous GEMM operations: K-means [6], the FFN layer in BERT [8], and the linear transformation in LLMs [58]. Fig. 15 shows the speedup of the proposed method over the default implementations (rocBLAS and cuBLAS) on the two GPU platforms. As shown on the left of Fig. 15, the average speedups of the proposed method in K-means over rocBLAS and cuBLAS are 1.16× and 1.33×, respectively. In the K-means experiments, the shape of the GEMM operation is varied by altering the number of data points; the performance advantage of the proposed method increases gradually with the number of data points and then stabilizes, suggesting more significant gains for large-scale GEMM. Fig. 15 also illustrates the performance improvement of the FFN layer, into which the proposed method is incorporated to accelerate the computation of the BERT layer; here the proposed method achieves 1.16× and 1.18× speedups over rocBLAS and cuBLAS. Experiments with various input sequence lengths show that the proposed method significantly improves the speed of the FFN computation on both GPU platforms. Meanwhile, we also apply the proposed method to the linear transformation in LLMs to accelerate the linear mapping process, which involves GEMM workloads of various sizes. During the linear transformation, the input features are mapped to produce Q, K, and V, commonly known as the Query, Key, and Value matrices. The range of matrix data from the LLM training process is [0.5, 2.5]. In Fig. 15, for the computation of weight matrices of multiple sizes, the proposed method achieves speedups of 1.23× and 1.34× over the default implementations, respectively.

Fig. 15. The acceleration performance with LE-GEMM (note that rocBLAS and cuBLAS serve as the baselines for the performance comparison, hence their speedup is 1).

Fig. 16 shows the Max_error and MRED of the proposed method in the three GEMM-based applications. As illustrated in Fig. 16, the magnitude of the error of the proposed method is primarily contingent on the K-dimension of the GEMM: as the K-dimension increases, both Max_error and MRED increase, because GEMMs with larger K-dimensions involve more FMA operations, leading to more significant errors.

Fig. 16. The Max_error and MRED with LE-GEMM.

5.6. The performance gain analysis

In this section, we evaluate the performance gains contributed to LE-GEMM by its two techniques, the thread parallelism analytic model and the multi-level pipeline. The detailed experimental results are shown in Fig. 17.
5.6. The performance gain analysis

In this section, we evaluate the performance gains brought by the two techniques, the thread parallelism analytic model and the multi-level pipeline, for LE-GEMM. The detailed experimental results are shown in Fig. 17. In Fig. 17, ‘‘LE-GEMM manual parameter’’ represents using previous experience to select the appropriate parameters, and ‘‘LE-GEMM plain implementation’’ represents the direct use of the API interface without the multi-level pipeline. As shown in Fig. 17, the thread parallelism analytic model brings 1.36× and 1.78× performance improvements on the MI210 and A800, respectively. As the matrix size increases, the performance gain becomes more obvious. For example, when the matrix size is ≥ 4000, there is a significant performance gap between the proposed method and the manual-parameter implementation. This is because different parameters have a significant performance impact across hardware architectures and matrix sizes. The manual-parameter approach relies solely on prior experience in selecting kernel parameters and often requires several trial-and-error attempts to obtain a suitable parameter value. Moreover, without a detailed analysis of the hardware architecture and the computational workflow, it is difficult for manual selection to obtain the optimal parameters.

Fig. 17. The performance gain analysis of LE-GEMM (‘‘LE-GEMM manual parameter’’ indicates LE-GEMM without the thread parallelism analytic model and ‘‘LE-GEMM plain implementation’’ indicates LE-GEMM without the multi-level pipeline).
We also evaluated the LE-GEMM plain implementation, which ignores the multi-level memory architecture and adopts a simple, direct calling style. Fig. 17 illustrates that LE-GEMM exhibits performance speedups of 3.02× and 2.72× over the plain implementation on the two GPU platforms. In the plain implementation, threads must wait for the next data load after completing each iteration’s calculation, which significantly reduces performance. Compared to the plain implementation, LE-GEMM can efficiently utilize the various memory spaces for pipelined access operations, reducing the data loading overhead and improving CU utilization.
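The benefit of overlapping tile loads with computation can be illustrated with a toy cost model. The cycle counts below are assumptions chosen for illustration, not measurements of either GPU:

```python
# With a single buffer, every main-loop iteration serializes
# "load tile -> compute tile". With double buffers, the next tile is
# prefetched while the current tile is computed, so the steady-state
# cost per iteration drops from (load + compute) to max(load, compute).

def plain_loop_cycles(iters, load, compute):
    """Each iteration waits for its tile before computing."""
    return iters * (load + compute)

def double_buffered_cycles(iters, load, compute):
    """Only the first load is exposed; later loads hide behind compute."""
    return load + iters * max(load, compute)

ITERS, LOAD, COMPUTE = 64, 100, 120  # assumed cycle costs
speedup = (plain_loop_cycles(ITERS, LOAD, COMPUTE)
           / double_buffered_cycles(ITERS, LOAD, COMPUTE))
```

Under these assumed costs the pipelined loop is roughly 1.8× faster, qualitatively matching the gap observed between LE-GEMM and its plain implementation.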
5.7. The trade-off between performance and precision

To explore the trade-off between high performance and precision, comparative experiments with different precisions were conducted on the A800. We used the cublasGemmEx() function to implement two computation modes (CUBLAS_COMPUTE_32F_FAST_16F and CUBLAS_COMPUTE_64F). The CUBLAS_COMPUTE_32F_FAST_16F mode directly uses FP16 to obtain single-precision calculation results, during which down-conversion is performed. The detailed experimental results are presented in Fig. 18. As shown in Fig. 18, CUBLAS_COMPUTE_32F_FAST_16F has the fastest computational performance because it directly calls the Tensor Core for matrix operations without any precision refinement. Its calculation process is consistent with WMMA, which likewise applies down-conversion without a precision refinement process; hence, the error of these two methods is also consistent. CUBLAS_COMPUTE_32F_FAST_16F often leads to significant errors, which limits its application range. The proposed method reduces the precision loss by using additional computational resources, which decreases the overall computational speed; compared to CUBLAS_COMPUTE_32F_FAST_16F, it therefore has better computational accuracy. CUBLAS_COMPUTE_64F represents the double-precision matrix multiplication operation, with higher data precision and lower computational speed. The experimental results in Fig. 18 indicate that higher-precision data tends to come with lower computational speed. The proposed LE-GEMM achieves single-precision computational results on low-precision computational units and is essentially a compromise between computational performance and accuracy.

Fig. 18. The trade-off between performance and precision.
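The difference between plain down-conversion and precision refinement can be sketched with the classic two-term splitting used in earlier emulation work [10,43]. This scalar sketch mimics FP16 by rounding to an 11-bit significand and is only an illustration of the idea, not LE-GEMM’s actual kernel:

```python
import math

def to_fp16_like(x, bits=11):
    """Round x to an 11-bit significand, mimicking FP16 down-conversion."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)             # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** bits
    return math.ldexp(round(m * scale) / scale, e)

def split(x):
    """Split x into a low-precision head and a low-precision residual tail."""
    hi = to_fp16_like(x)
    lo = to_fp16_like(x - hi)
    return hi, lo

def refined_product(x, y):
    """x*y ~ hi*hi + hi*lo + lo*hi, with products accumulated in full
    precision (as a Tensor Core accumulates FP16 products in FP32)."""
    xh, xl = split(x)
    yh, yl = split(y)
    return xh * yh + xh * yl + xl * yh

x, y = 1.2345678, 2.3456789
plain_err = abs(to_fp16_like(x) * to_fp16_like(y) - x * y)   # FAST_16F-style
refined_err = abs(refined_product(x, y) - x * y)             # with refinement
```

The refined product recovers several additional significant digits at the cost of three low-precision multiplies instead of one, mirroring the speed-for-accuracy trade-off measured in Fig. 18.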
6. Conclusion

This paper proposed LE-GEMM for single-precision matrix multiplication, which comprises a lightweight emulation algorithm, a thread parallelism analytic model, and an efficient multi-level pipeline implementation. Specifically, the lightweight emulation algorithm provides more accurate computational results and eliminates redundant computations. The thread parallelism analytic model determines the optimal tiling scheme by considering TLP and single-thread performance. The multi-level pipeline design achieves access latency hiding and higher ILP. Finally, experiments were conducted on two GPU platforms to validate the effectiveness and advancement of the proposed method.

7. Future work

Future work includes exploring the impact of different types of low-precision data on computational results, allowing for better trade-offs between performance and accuracy. Recently, low-precision hardware accelerators have been increasingly applied in various GEMM-based scientific computing scenarios to improve computational speed. The differences in hardware support and implementation can significantly affect the calculated results. In future work, we plan to explore more low-precision formats (e.g., BF16, FP8, FP4) to speed up the computational process of single-precision matrix multiplication.

CRediT authorship contribution statement

Yu Zhang: Writing – review & editing, Writing – original draft. Lu Lu: Writing – review & editing. Zhanyu Yang: Writing – review & editing, Conceptualization. Zhihong Liang: Writing – review & editing. Siliang Suo: Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of Guangdong Province, China (2024A1515010204) and the Technological Research Project of Southern Power Grid Company, China (ZBKJXM20232483).

Data availability

No data was used for the research described in the article.

References

[1] M. Meyer, T. Kenter, C. Plessl, Multi-FPGA designs and scaling of HPC challenge benchmarks via MPI and circuit-switched inter-FPGA networks, ACM Trans. Reconfigurable Technol. Syst. 16 (2) (2023) 1–27.
[2] H. Martínez, S. Catalán, A. Castelló, E.S. Quintana-Ortí, Parallel GEMM-based convolutions for deep learning on multicore ARM and RISC-V architectures, J. Syst. Archit. (2024) 103186.
[3] Y.R. Choi, V. Stegailov, Multi-GPU GEMM algorithm performance analysis for NVIDIA and AMD GPUs connected by NVLink and PCIe, in: International Conference on Mathematical Modeling and Supercomputer Technologies, Springer, 2022, pp. 281–292.
[4] B. Feng, Y. Wang, G. Chen, W. Zhang, Y. Xie, Y. Ding, EGEMM-TC: accelerating scientific computing on tensor cores with extended precision, in: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021, pp. 278–291.
[5] N.S. Murthy, F. Catthoor, M. Verhelst, Optimization of block-scaled integer GeMMs for efficient DNN deployment on scalable in-order vector processors, J. Syst. Archit. (2024) 103236.
[6] M. Ahmed, R. Seraj, S.M.S. Islam, The K-means algorithm: A comprehensive survey and performance evaluation, Electronics 9 (8) (2020) 1295.
[7] G. Chen, Y. Ding, X. Shen, Sweet KNN: An efficient KNN on GPU through reconciliation between redundancy removal and regularity, in: 2017 IEEE 33rd International Conference on Data Engineering, ICDE, IEEE, 2017, pp. 621–632.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018, arXiv preprint arXiv:1810.04805.
[9] AMD, Matrix core, 2024, https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/.
[10] S. Markidis, S.W. Der Chien, E. Laure, I.B. Peng, J.S. Vetter, NVIDIA Tensor Core programmability, performance & precision, in: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW, IEEE, 2018, pp. 522–531.
[11] AMD, AMD CDNA 2.0 architecture, 2024, https://www.amd.com/content/dam/amd/en/documents/instinct-business-docs/white-papers/amd-cdna2-white-paper.pdf.
[12] W. Sun, A. Li, T. Geng, S. Stuijk, H. Corporaal, Dissecting Tensor Cores via microbenchmarks: Latency, throughput and numeric behaviors, IEEE Trans. Parallel Distrib. Syst. 34 (1) (2022) 246–261.
[13] G. Park, B. Park, M. Kim, S. Lee, J. Kim, B. Kwon, S.J. Kwon, B. Kim, Y. Lee, D. Lee, LUT-GEMM: Quantized matrix multiplication based on LUTs for efficient inference in large-scale generative language models, 2022, arXiv preprint arXiv:2206.09557.
[14] J. Liu, R. Gong, X. Wei, Z. Dong, J. Cai, B. Zhuang, QLLM: Accurate and efficient low-bitwidth quantization for large language models, 2023, arXiv preprint arXiv:2310.08041.
[15] Y. Wang, B. Feng, Z. Wang, G. Huang, Y. Ding, TC-GNN: Bridging sparse GNN computation and dense Tensor Cores on GPUs, in: 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 149–164.
[16] R. Fan, W. Wang, X. Chu, Fast sparse GPU kernels for accelerated training of graph neural networks, in: 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2023, pp. 501–511.
[17] T. Utsugiri, T. Kouya, Acceleration of multiple precision matrix multiplication using Ozaki scheme, 2023, arXiv preprint arXiv:2301.09960.
[18] M. Tatsumi, S.-I. Filip, C. White, O. Sentieys, G. Lemieux, Mixing low-precision formats in multiply-accumulate units for DNN training, in: 2022 International Conference on Field-Programmable Technology, ICFPT, IEEE, 2022, pp. 1–9.
[19] Y. Tortorella, L. Bertaccini, L. Benini, D. Rossi, F. Conti, RedMule: A mixed-precision matrix–matrix operation engine for flexible and energy-efficient on-chip linear algebra and TinyML training acceleration, Future Gener. Comput. Syst. 149 (2023) 122–135.
[20] D.N. Parikh, R.A. van de Geijn, G.M. Henry, Cascading GEMM: High precision from low precision, 2023, arXiv preprint arXiv:2303.04353.
[21] M. Fasi, N.J. Higham, F. Lopez, T. Mary, M. Mikaitis, Matrix multiplication in multiword arithmetic: Error analysis and application to GPU tensor cores, SIAM J. Sci. Comput. 45 (1) (2023) C1–C19.
[22] P. Valero-Lara, I. Jorquera, F. Lui, J. Vetter, Mixed-precision S/DGEMM using the TF32 and TF64 frameworks on low-precision AI tensor cores, in: Proceedings of the SC’23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 2023, pp. 179–186.
[23] X.-Y. Liu, Z. Zhang, Z. Wang, H. Lu, X. Wang, A. Walid, High-performance tensor learning primitives using GPU Tensor Cores, IEEE Trans. Comput. 72 (6) (2022) 1733–1746.
[24] A. Abdelfattah, H. Anzt, E.G. Boman, E. Carson, T. Cojean, J. Dongarra, A. Fox, M. Gates, N.J. Higham, X.S. Li, et al., A survey of numerical linear algebra methods utilizing mixed-precision arithmetic, Int. J. High Perform. Comput. Appl. 35 (4) (2021) 344–369.
[25] M.O. Lam, J.K. Hollingsworth, B.R. de Supinski, M.P. Legendre, Automatically adapting programs for mixed-precision floating-point computation, in: Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, 2013, pp. 369–378.
[26] A. Dakkak, C. Li, J. Xiong, I. Gelado, W.-m. Hwu, Accelerating reduction and scan using tensor core units, in: Proceedings of the ACM International Conference on Supercomputing, 2019, pp. 46–57.
[27] M. Fasi, N.J. Higham, M. Mikaitis, S. Pranesh, Numerical behavior of NVIDIA tensor cores, PeerJ Comput. Sci. 7 (2021) e330.
[28] AMD, RocWMMA: ROCm wavefront-level matrix multiply and accumulate, 2024, https://github.com/ROCmSoftwarePlatform/rocWMMA.
[29] D. Yan, W. Wang, X. Chu, Demystifying Tensor Cores to optimize half-precision matrix multiply, in: 2020 IEEE International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2020, pp. 634–643.
[30] AMD, Next generation BLAS implementation for ROCm platform, 2024, https://github.com/ROCm/rocBLAS.
[31] A. Kerr, D. Merrill, J. Demouth, J. Tran, CUTLASS: Fast linear algebra in CUDA C++, NVIDIA Dev. Blog (2017).
[32] B. Tuomanen, Hands-On GPU Programming with Python and CUDA: Explore high-performance parallel computing with CUDA, Packt Publishing Ltd, 2018.
[33] D. Lustig, S. Sahasrabuddhe, O. Giroux, A formal analysis of the NVIDIA PTX memory consistency model, in: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 257–270.
[34] D. Mukunoki, K. Ozaki, T. Ogita, T. Imamura, Accurate matrix multiplication on binary128 format accelerated by Ozaki scheme, in: Proceedings of the 50th International Conference on Parallel Processing, 2021, pp. 1–11.
[35] J. Huang, L. Rice, D.A. Matthews, R.A. van de Geijn, Generating families of practical fast matrix multiplication algorithms, in: 2017 IEEE International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2017, pp. 656–667.
[36] Y. Wang, B. Feng, Y. Ding, QGTC: accelerating quantized graph neural networks via GPU tensor core, in: Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2022, pp. 107–119.
[37] B. Li, S. Cheng, J. Lin, TcFFT: Accelerating half-precision FFT through tensor cores, 2021, arXiv preprint arXiv:2104.11471.
[38] F. Zhang, Y. Hu, H. Ding, Z. Yao, Z. Wei, X. Zhang, X. Du, Optimizing random access to hierarchically-compressed data on GPU, in: SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, 2022, pp. 1–15.
[39] R. Sakamoto, M. Kondo, K. Fujita, T. Ichimura, K. Nakajima, The effectiveness of low-precision floating arithmetic on numerical codes: a case study on power consumption, in: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2020, pp. 199–206.
[40] Y. Sun, L. Zheng, Q. Wang, X. Ye, Y. Huang, P. Yao, X. Liao, H. Jin, Accelerating sparse deep neural network inference using GPU Tensor Cores, in: 2022 IEEE High Performance Extreme Computing Conference, HPEC, IEEE, 2022, pp. 1–7.
[41] Z. Ma, H. Wang, G. Feng, C. Zhang, L. Xie, J. He, S. Chen, J. Zhai, Efficiently emulating high-bitwidth computation with low-bitwidth hardware, in: Proceedings of the 36th ACM International Conference on Supercomputing, 2022, pp. 1–12.
[42] Z. Lin, A. Sun, X. Zhang, Y. Lu, MixPert: Optimizing mixed-precision floating-point emulation on GPU integer tensor cores, in: Proceedings of the 25th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, 2024, pp. 34–45.
[43] H. Ootomo, R. Yokota, Recovering single precision accuracy from tensor cores while surpassing the FP32 theoretical peak performance, Int. J. High Perform. Comput. Appl. 36 (4) (2022) 475–491.
[44] P. Blanchard, N.J. Higham, F. Lopez, T. Mary, S. Pranesh, Mixed precision block fused multiply-add: Error analysis and application to GPU tensor cores, SIAM J. Sci. Comput. 42 (3) (2020) C124–C141.
[45] N.J. Higham, T. Mary, Mixed precision algorithms in numerical linear algebra, Acta Numer. 31 (2022) 347–414.
[46] Z. Jia, M. Maggioni, B. Staiger, D.P. Scarpazza, Dissecting the NVIDIA Volta GPU architecture via microbenchmarking, 2018, arXiv preprint arXiv:1804.06826.
[47] M. Lu, B. He, Q. Luo, Supporting extended precision on graphics processors, in: Proceedings of the Sixth International Workshop on Data Management on New Hardware, 2010, pp. 19–26.
[48] A. Thall, Extended-precision floating-point numbers for GPU computation, in: ACM SIGGRAPH 2006 Research Posters, 2006, pp. 52–es.
[49] B. Ferrarini, M. Waheed, S. Waheed, S. Ehsan, M.J. Milford, K.D. McDonald-Maier, Exploring performance bounds of visual place recognition using extended precision, IEEE Robot. Autom. Lett. 5 (2) (2020) 1688–1695.
[50] N.J. Higham, T. Mary, Sharper probabilistic backward error analysis for basic linear algebra kernels with random data, SIAM J. Sci. Comput. 42 (5) (2020) A3427–A3446.
[51] H. Ootomo, H. Manabe, K. Harada, R. Yokota, Quantum circuit simulation by SGEMM emulation on tensor cores and automatic precision selection, in: International Conference on High Performance Computing, Springer, 2023, pp. 259–276.
[52] N. Nakasato, A fast GEMM implementation on the Cypress GPU, ACM SIGMETRICS Perform. Eval. Rev. 38 (4) (2011) 50–55.
[53] M. De Santis, G. Eichfelder, J. Niebling, S. Rocktäschel, Solving multiobjective mixed integer convex optimization problems, SIAM J. Optim. 30 (4) (2020) 3122–3145.
[54] R.S. Burachik, C.Y. Kaya, M.M. Rizvi, Algorithms for generating Pareto fronts of multi-objective integer and mixed-integer programming problems, Eng. Optim. 54 (8) (2022) 1413–1425.
[55] NVIDIA, Programming tensor cores in CUDA 9, 2024, https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/.
[56] AMD, Advanced profiling and analytics for AMD hardware, 2024, https://github.com/ROCm/omniperf.
[57] NVIDIA, The user guide for Nsight Compute, 2024, https://docs.nvidia.com/nsight-compute/NsightCompute/index.html.
[58] G. Heo, S. Lee, J. Cho, H. Choi, S. Lee, H. Ham, G. Kim, D. Mahajan, J. Park, NeuPIMs: NPU-PIM heterogeneous acceleration for batched LLM inferencing, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol. 3, 2024, pp. 722–737.