Journal of Systems Architecture 160 (2025) 103336
Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/sysarc

LE-GEMM: A lightweight emulation-based GEMM with precision refinement on GPU

Yu Zhang a, Lu Lu a,b,c,∗, Zhanyu Yang a, Zhihong Liang d,e, Siliang Suo d,e
a School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
b Peng Cheng Laboratory, Shenzhen 518055, China
c Pazhou Laboratory, Guangzhou 510005, China
d Electric Power Research Institute, CSG, Guangzhou, Guangdong, China
e Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou, Guangdong, China

Keywords: GEMM; Thread parallelism analytic model; Multi-level pipeline; Matrix Core/Tensor Core

Abstract: Many special hardware units, such as Matrix Core and Tensor Core, have recently been designed and applied in various scientific computing scenarios. These units support tensor-level computation with different precisions on GPU. Previous studies have proposed methods for computing single-precision GEneral Matrix Multiplication (GEMM) with half-precision matrices. However, this routine often leads to some loss of accuracy, which limits its application. This paper proposes a Lightweight Emulation-based GEMM (LE-GEMM) on GPU that includes a lightweight emulation algorithm, a thread parallelism analytic model, and an efficient multi-level pipeline implementation to accelerate the computation process without compromising accuracy requirements. First, we propose a lightweight emulation algorithm that combines a precision transformation process with a GEMM emulation calculation to achieve better computational accuracy and performance. Second, a thread parallelism analytic model is designed to analyze and guide the selection of the optimal tiling scheme based on various computing scenarios and hardware.
Third, an efficient multi-level pipeline is implemented, which maximizes instruction-level parallelism and latency hiding. Several comparison experiments were conducted on two commonly used GPU platforms: an AMD platform and an NVIDIA platform. The experimental results show that the proposed method outperforms previous approaches in terms of both computational accuracy and speed.

∗ Corresponding author at: School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China.
E-mail addresses: yuzhang0722@163.com (Y. Zhang), lul@scut.edu.cn (L. Lu), yangzhanyu@hotmail.com (Z. Yang), liangzh@csg.cn (Z. Liang), suosl@csg.cn (S. Suo).
https://doi.org/10.1016/j.sysarc.2025.103336
Received 11 July 2024; Received in revised form 8 December 2024; Accepted 2 January 2025; Available online 17 January 2025.
1383-7621/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

1. Introduction

GEneral Matrix Multiplication (GEMM) has been extensively involved in High-Performance Computing (HPC) [1], deep learning applications [2], and other scientific computing fields [3]. In these fields, applications or scenarios in which GEMM is the primary building block are referred to as GEMM-based scientific computing [4,5]. These applications involve a large number of matrix multiplication workloads, and GEMM's computation time is a significant part of the overall process. Specifically, typical applications such as K-Means [6], KNN [7], and Feedforward Neural Networks (FFN) [8] have GEMM computation times exceeding 50% in popular implementations. To cater to this trend, specialized hardware accelerators (e.g., AMD's Matrix Core [9] and NVIDIA's Tensor Core [10]) have been designed and implemented on GPUs to speed up GEMM; they employ low-precision inputs to achieve high performance. In contrast to the regular computation pattern, these mixed-precision cores support tensor-level Fused Multiply-Add (FMA), dramatically speeding up the matrix computation process. For example, half-precision computation with Matrix Core on the AMD MI210 achieves 8× higher throughput than the ROCm Core [11]. Based on this feature, previous studies have proposed ways to exploit low-precision matrices to obtain high-precision results and achieve high performance on GPUs [12–14].

Exploiting low precision to speed up GEMM is an effective route in GEMM-based scientific computing fields, where matrix multiplication operations can tolerate low precision [15,16]. The precision transformation process, which has a significant impact on the accuracy of the final calculated results, is a crucial part of this route [17]. Data value errors arise during the transition from high to low precision due to the loss in the mantissa representation. After several FMA calculations, the final result therefore has some error compared to the standard one. This phenomenon is known as precision loss, and it significantly restricts the application of mixed-precision calculations [18,19]. Some previous studies have utilized the calculation results of additional low-precision variables to reduce precision loss in the calculation results [20–23].

Low-precision variables and additional computations are utilized for precision refinement [10], which reduces the impact of precision loss on the computational results.
The implementation of this approach on GPUs that feature mixed-precision arithmetic is still hindered by several issues. The first, and a crucial factor, is the data transformation process. Loss of the original data mantissa during the precision transformation has an impact on the final outcome. As a consequence of the disparity in mantissa lengths, low-precision data cannot be saved beyond its mantissa length range, which leads to numerical errors after the precision transformation process. The effect of this error is amplified in the final calculation result after multiple FMA operations, opening a gap between the calculated result and the standard result. The second is the acceleration capability of these specialized hardware units. They provide GEMM with powerful acceleration capabilities, accompanied by special data layouts, which pose challenges to efficient kernel design. For the GEMM operation, tiling is a common way of assigning computational tasks: the matrix is segmented into multiple tiles (smaller matrices), and a workgroup (thread block) is responsible for computing one or several tiles. When the tile size is too large, the workload of the threads becomes too heavy, reducing thread parallelism. Conversely, if the tile size is too small, there is frequent workgroup switching, which increases scheduling overhead. The computation allocation of tiles also affects data reuse in a workgroup. In terms of data access, these specialized hardware units perform GEMM operations with fragments, which requires high utilization of the GPU's multi-level memory. Ineffective organization and manipulation of data frequently negate the speed benefits of the tensor-level patterns, leading to memory-wall bottlenecks on the GPU.

To address the above issues, we propose LE-GEMM, which includes a lightweight emulation algorithm, a thread parallelism analytic model, and a multi-level pipeline implementation to achieve better computational accuracy and speed. Our contributions can be summarized as follows:
• A lightweight emulation algorithm is proposed that employs bit-cutting and scaling techniques to preserve the original values and reduce the loss of precision, while removing redundant terms in the computation process to reduce the computational overhead.
• A thread parallelism analytic model based on the lightweight emulation GEMM is proposed, which allows the selection of the optimal tiling scheme through two aspects: the number of parallel threads and the single-thread performance.
• An efficient multi-level pipeline is implemented based on the access discrepancy between the various levels of memory and the GEMM workflow, utilizing double-buffer and pre-fetch techniques to hide access latency and improve Instruction-Level Parallelism (ILP).
• The proposed single-precision GEMM acceleration method achieves state-of-the-art speed and accuracy on two GPU platforms.

The rest of the paper is organized as follows. Section 2 introduces the background of the specialized hardware units (Matrix Core and Tensor Core) and matrix multiplication based on the Ozaki scheme. Section 3 provides related work. Section 4 presents the details of the proposed acceleration method for single-precision GEMM. Section 5 demonstrates and evaluates the experimental results. The conclusions of the paper are given in Section 6. Section 7 outlines potential areas for future work.

2. Background

2.1. Matrix core and tensor core

The amount of computation for GEMM has gradually increased with the development of deep learning. In response to this trend, AMD and NVIDIA are adding specialized hardware units and cores to their latest GPUs that support mixed-precision computing [24,25]. In the following, we discuss the details of the specialized cores.

Computing pattern and memory layout. The GPU cores can be classified into two types based on supported precision: same-precision cores (FP16 Core, FP32 Core, etc.), which only support computation in a single precision, and mixed-precision cores (Tensor Core or Matrix Core), which support computation across different precisions [26,27]. The GEMM supported by the mixed-precision core can be expressed as D = α × AB + β × C, where A and B can be FP16 or FP32 matrices, C and D can be configured as FP32 or double-precision (FP64) matrices, and α and β denote scalar constants [28]. This means that the mixed-precision core can utilize low-precision input to obtain high-precision calculation results [10,29]. The details are shown in Fig. 1(a). In terms of computing patterns, the mixed-precision core utilizes a new memory layout, the fragment, which stores matrix elements in either row-major or column-major order. Different from the same-precision core, the matrix multiplication operation is performed in the mixed-precision core using the fragment as a unit. The fragment size is ‘‘16 × 16’’, ‘‘32 × 32’’, or another fixed shape (Fig. 1(b) provides an example based on a ‘‘16 × 16’’ fragment). Note that different hardware architectures have varying support for fragment sizes. For example, on the NVIDIA A800, the maximum supported fragment size is ‘‘16 × 16’’ [10]; on the AMD MI210, rocWMMA can support fragments with a size of ‘‘16 × 32’’ or ‘‘32 × 32’’ [11]. Different fragment sizes have a particular impact on the workload allocation of GEMM operations.

Programming interface and implementation. Matrix Core and Tensor Core are hardware cores delivered by two different GPU vendors (AMD and NVIDIA). GPU vendors have implemented various approaches, depending on their hardware configurations, to facilitate their use by researchers and engineers. On the AMD platform, the first approach is rocWMMA (ROCm Wavefront-level Matrix Multiply and Accumulate), AMD's C++ interface that leverages Matrix Core to accelerate GEMM operations. The usage and functionality of this interface are very similar to those of WMMA [28]; the difference between the two is that rocWMMA calls the API handle with the keyword rocwmma. The second approach is the compiler instruction [9], which follows the specific syntax ‘‘__builtin_amdgcn_wmma_’’. The third approach uses rocBLAS [30], which provides a functional interface and allows direct operation of a mixed-precision core by supplying the matrix parameters.

There are multiple ways to call Tensor Core on the NVIDIA platform, such as WMMA (Warp-level Matrix Multiply and Accumulate) [10], CUTLASS [31], cuBLAS [32], and PTX instructions [33]. WMMA provides several specific programming interfaces to load (wmma::load_matrix_sync), compute (wmma::mma_sync), and store (wmma::store_matrix_sync) matrices in fragments [10]. Based on this interface, the matrix is first tiled into multiple fragments, which can be shaped to fixed sizes, and then matrix multiplication is performed. CUTLASS (CUDA Templates for Linear Algebra Subroutines) [31] is a series of CUDA template header libraries for implementing high-performance GEMM and related computations at all scales and levels within CUDA. It provides a variety of GEMM computation interfaces (hgemm, sgemm, dgemm, and wgemm) to optimize data movement and multiply-accumulate operations in fragments to achieve high performance. cuBLAS [32] is a basic linear algebra subroutines library provided and implemented by NVIDIA; it provides two interfaces (cublasGemmEx and cublasSgemm) that support GEMM on Tensor Core. In this approach, it is necessary to set the math mode of cuBLAS using the cublasSetMathMode function. PTX provides a series of programming APIs and instruction sets for general-purpose parallel programming and is designed to be efficient on NVIDIA GPUs that support the defined computation functions. The terminological differences and relationships between ROCm and CUDA are given in Table 1.

Fig. 1. The precision profiling and the computing pattern of mixed-precision units (Fig. 1(a) shows an example of precision profiling, where the input data is FP16, the product and accumulator are FP32, and the output is FP32. Fig. 1(b) illustrates a fragment-based GEMM operation with a ‘‘16 × 16’’ fragment).

Table 1
ROCm/CUDA terminology.
• Compute Unit (CU) / Streaming Multiprocessor (SM): One among many parallel vector processors in a GPU with parallel ALUs. All wavefronts in a workgroup are assigned to the same CU.
• Kernel / Kernel: Functions initiated to the GPU that are executed by multiple parallel CUs or SMs. Multiple kernels can work in parallel with the CPU.
• Wavefront / Warp: A collection of operations that execute in lockstep, run the same instructions, and follow the same control-flow path; individual lanes can be masked. Think of it as a vector thread: a wavefront is a 64-wide vector op on AMD GPUs, and a warp is a 32-wide vector op on NVIDIA GPUs.
• Workgroup / Thread block: A collection of wavefronts (warps); all wavefronts in a workgroup are assigned to the same CU (SM).
• Work-item/Thread / Thread: GPU programming models can handle this as a separate thread of execution, although it does not necessarily make forward sub-wavefront progress.
• Global Memory / Global Memory: The DRAM memory accessed by the GPU, traversing some layers of cache.
• Local Memory / Shared Memory: Scratchpad that allows communication between wavefronts within a workgroup.
• Private Memory / Local Memory: Per-thread private memory, often mapped to registers.

2.2. Matrix multiplication based on the Ozaki scheme

The Ozaki scheme is an error-free method for computing matrix multiplication [34]. It involves translating the original matrix multiplication into a sum of matrices that can be multiplied together. Fig. 2 is a schematic of matrix multiplication based on the Ozaki scheme. The detailed procedure can be divided into three steps: splitting, computation, and summation. In the splitting step, matrix A and matrix B are first divided at the element level to obtain multiple submatrices (A_i, B_j). Then, in the computation step, matrix multiplication is performed on the submatrices of A (A_i) and the submatrices of B (B_j) to obtain the submatrices of C (C_{i,j}). Finally, all these submatrices are summed to obtain the result matrix C.

Based on the Ozaki scheme, a single-precision matrix multiplication calculation method can be described as follows. For simplicity, matrix multiplication with FP32 is used as an example. First, in the element-level splitting step, the FP32-based matrix is converted to FP16 precision by the precision transformation function, while the error is preserved. The detailed expression is as follows:
Specifically, Dakkak’s 𝐴𝐹 𝑃 16 ← toFP16(𝐴𝐹 𝑃 32 ) (1) method [26] was implemented on the Tesla V100 and achieved 89%– 98% of peak memory bandwidth and reduced power consumption. 𝑅𝐴 ← toFP16(𝐴𝐹 𝑃 32 − 𝐴𝐹 𝑃 16 ) (2) Li et al. [37] proposed a half-precision Fast Fourier Transform (FFT) approach based on Tensor Core. This approach leverages the fragments where 𝐴𝐹 𝑃 32 and 𝐴𝐹 𝑃 16 denote the FP32-based and FP16-based matrix 𝐴, respectively. toFP16() is the transformation function that transforms in Tensor Core to support the special operations required for FFT matrix 𝐴 from FP32 to FP16, and 𝑅𝐴 represents the residual matrix and optimizes the data arrangement in GPU memory [38]. To further between different precision. Note that this expression is analogous to explore these hardware units’ workflow and application prospects, matrix 𝐵. With the support of mixed-precision cores for floating-point Sakamoto et al. [39] improve energy efficiency by analyzing memory arithmetic with different precision, 𝐴𝐹 𝑃 32 𝐵𝐹 𝑃 16 can be rewritten as and power consumption of low-precision data during computation. The follows: above studies aim to optimize the use of these specialized hardware units to accelerate the computing process in specific scientific com- 𝐴𝐹 𝑃 32 𝐵𝐹 𝑃 16 = (𝐴𝐹 𝑃 16 + 𝑅𝐴 )𝐵𝐹 𝑃 16 = 𝐴𝐹 𝑃 16 𝐵𝐹 𝑃 16 + 𝑅𝐴 𝐵𝐹 𝑃 16 (3) puting scenarios based on their computational characteristics [40,41]. These approaches maximize the acceleration capability of matrix multi- Hence, the matrix multiplication for 𝐴𝐹 𝑃 32 𝐵𝐹 𝑃 32 can be expressed plication operation without changing the internal computation process as: of specialized hardware units. However, FMA operations between dif- 𝐴𝐹 𝑃 32 𝐵𝐹 𝑃 32 = (𝐴𝐹 𝑃 16 + 𝑅𝐴 )(𝐵𝐹 𝑃 16 + 𝑅𝐵 ) ferent precisions will inevitably result in a certain precision loss in (4) = 𝐴𝐹 𝑃 16 𝐵𝐹 𝑃 16 + 𝐴𝐹 𝑃 16 𝑅𝐵 + 𝑅𝐴 𝐵𝐹 𝑃 16 + 𝑅𝐴 𝑅𝐵 the calculation results, which will limit the application scope of these 3 Y. Zhang et al. 
hardware units [21,22,42].

Fig. 2. The schematic of the matrix multiplication based on the Ozaki scheme (C = AB as an example).

To mitigate the loss of precision, several improvement approaches have been proposed. Markidis et al. [10] analyzed the programmability and precision of Tensor Core and proposed a precision refinement method for single-precision (FP32) matrix multiplication that takes advantage of half precision (FP16) to store residuals in transformations. Based on this research, Ootomo et al. [43] improved the rounding method for single-precision floating-point numbers; they reduced the redundancy in error-correction terms by considering the effect of the rounding method and the implicit bit. Others have studied the error analysis of matrix multiplication using Tensor Core and have given theoretical error bounds for matrix multiplication with different precisions [24,44,45]. To explore the efficiency of mixed-precision cores, Jia et al. [46] studied the arrangement of matrix-to-fragment data in memory and how assembly instructions split the matrix. To support the extended-precision computing pattern, several methods have been proposed for limited-precision hardware, combining multiple low-precision computation instructions to simulate extended-precision computation [47–49]. Inspired by this approach, Feng et al. [4] improved Markidis's method for single-precision matrix multiplication, modifying the rounding method and optimizing the extended-precision computation at the instruction level. In addition, Fasi et al. [21] investigated the application of the multiword methodology to improve the performance–accuracy trade-off of matrix multiplication with mixed-precision cores, focusing specifically on the Tensor Core available on NVIDIA GPUs. Following this research direction, Parikh et al. [20] presented insights and possibilities for implementing higher-precision matrix multiplication with Tensor Core, and provided an implementation of the FP64x2 GEMM scheme with variable FP64 inputs. The common feature of these studies is the use of low-precision inputs to obtain high-precision computational results at the cost of some additional computation. In this way, the specialized hardware units (Matrix Core and Tensor Core) can perform high-performance matrix multiplication while reducing the impact of precision loss [44,50].

4. LE-GEMM

This section gives a detailed description of LE-GEMM, which includes the lightweight emulation algorithm, the thread parallelism analytic model, and the efficient multi-level pipeline design. First, we introduce the lightweight emulation algorithm, which provides more efficient and accurate calculation performance based on specialized hardware units. Then, a thread parallelism analytic model is proposed, which is used to analyze and select the optimal tiling scheme for various computing scenarios and hardware. Finally, we present an efficient multi-level pipeline design for GEMM, which utilizes double-buffer and pre-fetch techniques to hide access latency and improve ILP.

Fig. 3. The workflow of LE-GEMM.

4.1. Lightweight emulation algorithm

Traditional emulation algorithm design assumes that the hardware has the same input and output precision. Typically, a large number of low-precision instructions are used to complete the computation and memory access process, which can result in high computational overhead and reduced performance benefits [4,40,51,52]. However, specialized hardware units, such as Matrix Core and Tensor Core, have been introduced to accelerate the GEMM operation. Multiplication and addition are fused in special instruction primitives, and the accumulation of intermediate results also supports high-precision accumulation to ensure accurate calculation results. This new hardware feature has motivated us to design a lightweight emulation algorithm that accelerates the GEMM computation process by achieving high computational throughput based on low-precision inputs. The lightweight emulation algorithm consists of the precision transformation process and the GEMM emulation calculation, as illustrated in Fig. 3. The precision transformation process is designed to make the low-precision data better preserve the value of the original data, thus achieving more accurate calculation results. The GEMM emulation calculation is designed to achieve high-precision results and speed up the calculation process by leveraging the specialized hardware units and low-precision input.

Floating-point numbers of different precision types have exponents and mantissas of different lengths. In the single-precision format, 8 bits are allotted to the exponent and 23 bits to the mantissa. The half-precision representation has a smaller range than single precision, with the exponent and mantissa being 5 and 10 bits, respectively. In an emulation algorithm, the first step is to split the original matrix, which affects the level of elements in the matrix. There are two different splitting routines: one splits the matrix elements without changing the precision, and the other splits the matrix values and changes the precision type. In this paper, we focus on the implementation of the second routine.
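The second routine — splitting each value into a low-precision head plus a low-precision residual, as in Eqs. (1)–(4) — can be sketched on scalars. The following is a minimal pure-Python illustration, not the paper's GPU implementation; to_fp16_value is a simplified stand-in for an FP32→FP16 cast that rounds to an 11-bit significand (FP16's 10 mantissa bits plus the implicit bit) and ignores FP16's exponent range and subnormals:

```python
import math
import struct

def to_fp16_value(x: float) -> float:
    """Round x to an 11-bit significand -- a simplified FP16 cast."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))  # exponent of the leading bit
    ulp = 2.0 ** (e - 10)              # spacing of an 11-bit significand
    return round(x / ulp) * ulp

# Split a and b as in Eqs. (1)-(2): low-precision head + residual.
a = struct.unpack('f', struct.pack('f', 1.2345678))[0]  # an FP32 value
b = struct.unpack('f', struct.pack('f', 2.7182818))[0]
a16 = to_fp16_value(a)
ra = to_fp16_value(a - a16)
b16 = to_fp16_value(b)
rb = to_fp16_value(b - b16)

# Eq. (4): the four-term expansion recovers most of the lost mantissa.
naive = a16 * b16
refined = a16 * b16 + a16 * rb + ra * b16 + ra * rb
print(abs(a * b - naive), abs(a * b - refined))
```

The head-only product carries the full FP16 truncation error, while the expanded form is accurate to roughly the residuals' own rounding error — the scalar analogue of the splitting step in Fig. 2.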
However, the transformation process inevitably leads to precision loss. Therefore, an additional FP16 value is used to preserve the residual and improve the accuracy of the calculated results. Rounding is a popular precision transformation technique that has previously been employed to transform from high precision to low precision [20,45]. The common rounding methods include RN, RZ, RU, and RD.¹ The rounding method has the advantage of taking into account the representation of the low-bit part of the mantissa, which lies after the rounding bit. However, it is important to note that each rounding method inevitably introduces some error in the final value. The primary reason is that the mantissa is rounded up or down, leading to numerical deviations. For instance, when transforming from FP32 to FP16 using the RN method, the final value at the 13th mantissa bit takes the low bits of the mantissa (bits 0–12) into account to achieve the nearest round. A_FP16 and R_A are responsible for preserving the high-bit and low-bit mantissa of A_FP32, respectively. From a numerical perspective, A_FP16 has a larger effect than R_A on the computation results. Therefore, it is important to preserve the high bits of the mantissa during the transformation while saving as many low bits as possible.

¹ RN, RZ, RU, and RD represent Round-to-nearest-even, Round-towards-Zero, Round-Up, and Round-Down, respectively.

Compared to previous work, we adopt a bit-cut to preserve the high bits of FP32 during the precision transformation process, considering the impact of the high and low bits on errors. To achieve the precision transformation from A_FP32 to A_FP16, a bit-cut method is employed to preserve the high mantissa bits of A_FP32. In this way, the higher mantissa (bits 13–22 of A_FP32 in Fig. 4) is copied into A_FP16. For the residual R_A, the introduced scaling operation requires dividing by the scaling factor (2^10 in Fig. 4) in the calculation process to ensure correct results. Note that, due to the difference in exponent length between single precision and half precision, additional processing is required to avoid errors in transforming the exponent part. Considering that the exponent biases of single precision and half precision are 127 and 15, it is necessary to subtract the bias from the exponent part to avoid errors in the transformed exponent bits. To avoid the impact of the bias, we preserve the exponent part in an intermediate variable, which requires subtracting the bias value; we then add the corresponding bias when saving the intermediate variable to the half-precision exponent part. This approach offers the following benefits: compared to traditional rounding, it does not generate rounding-up or rounding-down, which can affect the stability of the high bits; and it does not depend on the rounding API of the platform, since rounding regulations differ across platforms. The bit-cut is based on the binary format of floating-point numbers, and it is compatible across platforms.

Fig. 4 illustrates the details of the precision transformation process. For the computation of R_A, since the mantissa of FP16 is only 10 bits long, it cannot fully save all the remaining mantissa. Based on these two factors, a residual is calculated from A_FP32 to A_FP16. The residual is then multiplied by the scaling factor, 2^10, to preserve it. This does not alter the representation, but it maps the initial value into the larger range that can be represented by R_A. Note that, since R_A was obtained by scaling, the processing used in subsequent calculations must recover the original residual value. Finally, to preserve the mantissa to the maximum extent, the residuals are saved using rounding. The precision transformation process can be rewritten as follows:

  A_FP16 = bit-cut(A_FP32)    (5)
  R_A = round((A_FP32 − A_FP16) × 2^10)    (6)
  B_FP16 = bit-cut(B_FP32)    (7)
  R_B = round((B_FP32 − B_FP16) × 2^10)    (8)

where bit-cut() denotes the bit-cut operation and round() represents RZ.

Fig. 4. The bit-cut and scaling operation.

This section presents the details of the precision transformation process involving bit-cut and scaling. In the scaling step, it is essential to divide the value by the scaling factor in the last step of the calculation to cancel the effect of the scaling factor. Therefore, the GEMM emulation calculation, which is based on the above precision transformation process, can be described as follows:

  A_FP32 B_FP32 = (A_FP16 + R_A/2^10)(B_FP16 + R_B/2^10)
               = A_FP16 B_FP16 + (A_FP16 R_B + R_A B_FP16)/2^10 + R_A R_B/2^20    (9)

The fourth calculation term in Eq. (9) is divided by 2^20, so it has little effect on the final computation result; moreover, additional computation terms increase the overall computation effort. For this purpose, Eq. (9) is modified as follows:

  A_FP32 B_FP32 = (A_FP16 + R_A/2^10)(B_FP16 + R_B/2^10)
               ≈ A_FP16 B_FP16 + (A_FP16 R_B + R_A B_FP16)/2^10    (10)

Hence, the lightweight emulation algorithm calculation can be described as follows:

  D_FP32 = α × A_FP32 B_FP32 + β × C_FP32
        ≈ α × (A_FP16 B_FP16 + (A_FP16 R_B + R_A B_FP16)/2^10) + β × C_FP32    (11)

where D_FP32 and C_FP32 denote FP32 matrices, A_FP16, R_A, B_FP16, and R_B have the same meaning as in Eq. (10), and α and β are scalar constants.
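The bit-cut, scaling, and two-term reconstruction of Eqs. (5)–(11) can be illustrated on scalars with pure-Python bit manipulation. This is an illustrative sketch, not the GPU kernel: fp16_rz is a simplified RZ cast to an 11-bit significand that ignores FP16's exponent range, and the exponent re-biasing described above is not modeled.

```python
import math
import struct

def bit_cut(x: float) -> float:
    """Eq. (5): keep sign, exponent, and mantissa bits 13-22 of an FP32
    value; clear bits 0-12. Truncation, so no round-up/round-down jitter."""
    (bits,) = struct.unpack('<I', struct.pack('<f', x))
    (hi,) = struct.unpack('<f', struct.pack('<I', bits & 0xFFFFE000))
    return hi  # exactly representable in FP16 when the exponent fits

def fp16_rz(x: float) -> float:
    """Truncate x toward zero to an 11-bit significand (simplified RZ cast)."""
    if x == 0.0:
        return 0.0
    ulp = 2.0 ** (math.floor(math.log2(abs(x))) - 10)
    return math.trunc(x / ulp) * ulp

a = struct.unpack('f', struct.pack('f', 3.1415926))[0]
b = struct.unpack('f', struct.pack('f', 1.4142135))[0]
a16, b16 = bit_cut(a), bit_cut(b)
ra = fp16_rz((a - a16) * 2**10)  # Eq. (6): scaled residual fits in FP16
rb = fp16_rz((b - b16) * 2**10)  # Eq. (8)

# Eq. (10): two correction terms, un-scaled by 2**10 at the end.
approx = a16 * b16 + (a16 * rb + ra * b16) / 2**10
print(abs(a * b - a16 * b16), abs(a * b - approx))
```

Dropping the R_A R_B / 2^20 term, as in Eq. (10), leaves an error far below the head-only product's error, which is the trade-off the lightweight emulation algorithm exploits.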
This section introduces a precision transformation process consisting of a two-step bit-cut and rescaling. This process effectively preserves the original data mantissa and avoids the numerical fluctuations caused by rounding. At the same time, based on the above precision transformation process, we have designed a GEMM emulation calculation that removes the cumulative term in Eq. (9) and reduces the overall computation time of GEMM. Algorithm 1 provides an overview of the lightweight emulation algorithm. In Algorithm 1, bit-cut() denotes the bit-cut technique used to preserve the high bits of the mantissa, as detailed in Section 4.1; round() represents the use of rounding to preserve the low bits of the mantissa; mFma_sync() represents using the specialized hardware units to calculate the product of two matrices; and uniformFma_sync() represents the computation of Eq. (11) to obtain the final calculation result D_FP32.

Algorithm 1 The lightweight emulation algorithm
1: Function emulationGemmOperation(D_FP32, A_FP32, B_FP32, C_FP32, α, β)
2:   /* The precision transformation process */
3:   A_FP16 = bit-cut(A_FP32), R_A = round((A_FP32 − A_FP16) × 2^10);
4:   B_FP16 = bit-cut(B_FP32), R_B = round((B_FP32 − B_FP16) × 2^10);
5:   /* GEMM emulation computation process */
6:   /* Initialize accumulators */
7:   Initialize fragsAcc0, fragsAcc1;
8:   /* Calculate each term in Eq. (11) */
9:   fragsAcc0 = mFma_sync(A_FP16, B_FP16, fragsAcc0);
10:  fragsAcc1 = mFma_sync(A_FP16, R_B, fragsAcc1);
11:  fragsAcc1 = mFma_sync(B_FP16, R_A, fragsAcc1);
12:  /* Calculate the final result */
13:  D_FP32 = uniformFma_sync(α, fragsAcc0, fragsAcc1/2^10, β, C_FP32);
14: end Function

4.2. Thread parallelism analytic model

The lightweight emulation algorithm is designed to efficiently map GEMM-based scientific computing to specialized hardware cores, allowing high-precision outputs with low-precision inputs. To fully exploit the acceleration capabilities of the specialized hardware units, an efficient tiling scheme, which determines the impact of the allocation of computational tasks on overall performance, is critical. The tiling scheme segments the matrix into multiple tiles and assigns each tile to a cooperative thread array (thread block) responsible for its computation. Each thread loads and computes the corresponding matrix elements. This process allocates computational tasks based on the size of the matrices in the GEMM operation and the hardware resources.

Fig. 5 illustrates that computing a tile in matrix D requires loading sub-sections of matrices A and B (the gray area) and a tile of matrix C (the blue area) to perform the GEMM operation. This allows multiple tiles to be computed simultaneously on the GPU. To fully exploit the computing power of GPUs, our proposed thread parallelism analytic model analyzes and improves the performance of GEMM computations by considering the number of parallel threads (Thread-Level Parallelism, TLP) and the single-thread performance. Given the tile size, the computation of GEMM can be divided into multiple subparts, each with a set of threads responsible for its computation. This property allows thread parallelism to be computed for a given tiling strategy. Previous methods were based simply on the number of threads in a thread block. Because the thread block is further split into multiple wavefronts (warps) for execution on a CU, such methods ignore the underlying computational pattern on the GPU. Therefore, we use the number of wavefronts to calculate thread parallelism, which gives a more accurate indication of how active the underlying threads are in practice. Based on the matrix tiling scheme and the hardware configuration, TLP can be calculated as follows:

  T_TLP = (M × N)/(T_M × T_N) × φ × T_thread_num    (12)

where M and N represent the dimensions of matrix C, T_M and T_N are the tile sizes, φ is the workgroup-to-wavefront transform factor, and T_thread_num represents the number of threads in a wavefront. Note that T_thread_num has different values across GPU platforms; for example, on the MI210 and the A800, the values of T_thread_num are 64 and 32, respectively.
When considering TLP, it is important to consider the performance 4.2. Thread parallelism analytic model of a single thread, which can be expressed as a compute-to-memory ratio. Comparing the clock cycles of memory access instruction and The lightweight emulation algorithm is designed to efficiently map computation instruction in threads, it is evident that data access in GEMM-based scientific computing to specialized hardware cores, al- memory requires more time. Therefore, it is crucial to ensure that the lowing for high-precision outputs with low-precision inputs. To fully data loaded on the thread is fully utilized. During each iteration, every exploit the acceleration capabilities of specialized hardware units, an thread block is required to perform two tasks. The first step involves 6 Y. Zhang et al. Journal of Systems Architecture 160 (2025) 103336 Fig. 5. The tiling scheme of a single GEMM. reading two matrices (𝐴𝐹 𝑃 16 , 𝑅𝐴 ) of size (𝑇𝑀 , 𝑇𝐾 ), two matrices between the thread block and the matrix tiles, which will also affect the (𝐵𝐹 𝑃 16 , 𝑅𝐵 ) of size (𝑇𝐾 , 𝑇𝑁 ) and matrix C of size (𝑇𝑀 , 𝑇𝑁 ). Since matrix efficiency of data reuse. As shown in Fig. 5(a), the calculation of 3 tiles C of size (𝑇𝑀 , 𝑇𝑁 ) is loaded only in the first iteration and does not in matrix 𝐶 involves the loading of 1 sub-section of matrix 𝐴, and 3 sub- need to be loaded again in subsequent steps, its loading overhead can sections of matrix 𝐵, respectively. In Fig. 5(b), 4 tiles can be computed be neglected. Thus, the memory access instruction with each iteration by loading two sub-sections of the matrix 𝐴 and 𝐵. Comparing these can be expressed as: two routines, it is clear that the larger the cross-sectional area (blue tile in Fig. 5) between the sub-sections loaded by matrices 𝐴 and 𝐵. 𝑇𝑚𝑒𝑚𝑜𝑟𝑦 = 2 × 2 × (𝑇𝑀 × 𝑇𝐾 + 𝑇𝐾 × 𝑇𝑁 ) (13) Fig. 5 indicates that the pattern in (b) has better data reuse. 
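The splitting in lines 3–4 of Algorithm 1 and the emulated product of Eq. (11) can be illustrated with a minimal NumPy sketch. This is our own host-side illustration, not the authors' GPU kernel: a plain FP16 down-conversion stands in for the exact bit-cut, and the function names are ours.

```python
import numpy as np

def split_fp32(x):
    """Split an FP32 array into an FP16 high part and a scaled FP16 residual.

    A plain down-conversion stands in here for the paper's bit-cut; the
    residual is scaled by 2**10 so that it remains representable in FP16.
    """
    hi = x.astype(np.float16)
    res = ((x - hi.astype(np.float32)) * np.float32(2.0**10)).astype(np.float16)
    return hi, res

def emulated_gemm(A, B, C=None, alpha=1.0, beta=0.0):
    """FP32 GEMM emulated with FP16 factors, following Eq. (11)."""
    A16, RA = split_fp32(A)
    B16, RB = split_fp32(B)
    # Promote to FP32 for the products: Matrix/Tensor Core multiplies FP16
    # inputs but accumulates in FP32. The R_A*R_B/2**20 term is dropped,
    # exactly as in Eq. (10).
    a16, ra = A16.astype(np.float32), RA.astype(np.float32)
    b16, rb = B16.astype(np.float32), RB.astype(np.float32)
    acc0 = a16 @ b16                         # line 9 of Algorithm 1
    acc1 = a16 @ rb + ra @ b16               # lines 10-11 of Algorithm 1
    D = alpha * (acc0 + acc1 / np.float32(2.0**10))
    if C is not None:
        D = D + beta * C
    return D.astype(np.float32)

rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, (64, 64)).astype(np.float32)
B = rng.uniform(-1, 1, (64, 64)).astype(np.float32)
ref = (A.astype(np.float64) @ B.astype(np.float64)).astype(np.float32)
naive = A.astype(np.float16).astype(np.float32) @ B.astype(np.float16).astype(np.float32)
err_emu = float(np.abs(emulated_gemm(A, B) - ref).max())
err_naive = float(np.abs(naive - ref).max())
```

Even with a plain down-conversion in place of the bit-cut, the two residual terms recover most of the accuracy lost by a naive all-FP16 product, which is the effect the refinement terms of Eq. (11) are designed to achieve.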
According to the lightweight emulation algorithm, the computation required for each iteration is:

T_flops = 3 × 2 × (T_M × T_N × T_K)    (14)

where T_flops represents the computation per iteration, 2 × (T_M × T_N × T_K) is the amount of computation for multiplying a matrix of size (T_M, T_K) by a matrix of size (T_K, T_N), and the constant 3 corresponds to the three terms in Eq. (11). The ratio of computation to memory access (T_flops/memory) per iteration is therefore:

T_flops/memory = (3 × 2 × (T_M × T_N × T_K)) / (2 × 2 × (T_M × T_K + T_K × T_N)) = 3(T_M × T_N) / (2(T_M + T_N))    (15)

While pursuing higher TLP and compute-to-memory ratios, it is also important to consider the thread workload, which can be divided into two parts: the amount of data loaded and the amount of computation. To avoid an excessive workload reducing thread parallelism, the tiling scheme and the number of tiles each workgroup computes must be set properly. To describe a thread's amount of data loaded (Work_DL) and amount of computation (Work_CP) more precisely, the following two formulas are given:

Work_DL = (T_M × T_N + T_M × T_K + T_K × T_N) / Work_num    (16)

Work_CP = (T_M × T_N) / Work_num    (17)

where Work_num represents the number of threads responsible for computing the tile, and T_M, T_N, and T_K represent the tile sizes detailed in Fig. 5. Work_DL is the number of elements each thread loads, and Work_CP is the number of result elements each thread computes and holds in registers. To ensure efficient parallelism, different numbers of threads must be allocated to tiles of different shapes so that the workload is consistent across thread blocks. When a thread loads too many elements, the data overflows the registers, which increases the data access overhead.

Finally, the selection of the optimal tiling scheme with the thread parallelism analytic model is formally represented as maximizing T_TLP and T_flops/memory. According to Eqs. (12) and (15), T_TLP and T_flops/memory are inversely related for given M, N, and T_thread_num. Generally, smaller tiles achieve better thread parallelism because the computational tasks are distributed over more threads; however, the number of active threads is limited by what the GPU hardware supports. Larger tiles usually achieve a larger compute-to-memory ratio, which represents better single-thread performance. When choosing a tiling scheme, the thread workload must also be considered according to Eqs. (16) and (17). Eqs. (12)–(17) take into account the factors that affect both the number of active threads and single-thread performance, allowing the optimal tiling scheme to be selected for various GPU platforms and computing scenarios. Our thread parallelism analytic model chooses the tiling scheme based on thread-level parallelism and single-thread performance, and the selection can be solved analytically with existing optimization solvers [53,54]; it can also provide a more profitable direction when searching for the best parameters by trial-and-error.

Table 2 presents the chosen parameters for LE-GEMM. In Table 2, (T_M, T_N, T_K) denotes the size of the matrix tiles, (f_m, f_n, f_k) denotes the fragment size for the GEMM computation, and the wavefront/warp size denotes the number of threads in the thread group of each GPU platform, which is determined by its hardware architecture. Because different architectures support different fragment sizes and shapes, the choice of fragment is likewise constrained on the two GPU platforms, and the wavefront/warp size also affects the thread workload allocation. Therefore, when choosing parameters, it is essential to consider the impact of hardware differences on performance.

Table 2
The choice of the parameters on two GPUs.
Name                                AMD MI210        NVIDIA A800
Matrix tile size (T_M, T_N, T_K)    (128, 128, 32)   (128, 256, 32)
Fragment size (f_m, f_n, f_k)       (32, 32, 16)     (16, 16, 16)
Wavefront/Warp size                 64               32

Table 3
The configuration of platforms for evaluation.
Name        AMD-platform   NVIDIA-platform
CPU         EPYC 7663      Platinum 8358
GPU         MI210          A800
OS          Ubuntu 20.04   Ubuntu 20.04
ROCm/CUDA   ROCm 5.6       CUDA 12.0
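The selection rule of Eqs. (12)–(17) can be sketched as a small scoring routine. This is an illustrative sketch, not the authors' solver: the function name is ours, and φ is simplified here to rounding the per-tile thread count up to whole wavefronts.

```python
from math import ceil

def tiling_metrics(M, N, T_M, T_N, T_K, work_num, wavefront=64):
    """Score a candidate tiling with Eqs. (12)-(17).

    phi is modeled as rounding the per-tile thread count up to whole
    wavefronts -- an illustrative stand-in for the paper's
    workgroup-to-wavefront transformation.
    """
    workgroups = ceil(M / T_M) * ceil(N / T_N)                  # tiles of C
    tlp = workgroups * ceil(work_num / wavefront) * wavefront   # Eq. (12)
    t_memory = 2 * 2 * (T_M * T_K + T_K * T_N)                  # Eq. (13)
    t_flops = 3 * 2 * (T_M * T_N * T_K)                         # Eq. (14)
    ratio = t_flops / t_memory        # Eq. (15): 3*T_M*T_N / (2*(T_M + T_N))
    work_dl = (T_M * T_N + T_M * T_K + T_K * T_N) / work_num    # Eq. (16)
    work_cp = (T_M * T_N) / work_num                            # Eq. (17)
    return {"tlp": tlp, "ratio": ratio, "work_dl": work_dl, "work_cp": work_cp}

# Smaller tiles raise TLP but lower the compute-to-memory ratio, and vice
# versa -- the inverse relationship noted after Eq. (15).
small = tiling_metrics(4096, 4096, 128, 128, 32, work_num=256)
large = tiling_metrics(4096, 4096, 256, 128, 32, work_num=256)
```

For M = N = 4096, doubling T_M halves the number of tiles (lower TLP) but raises the ratio of Eq. (15) from 96 to 128, reproducing the trade-off the model resolves.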
4.3. An efficient multi-level pipeline for GEMM

In the GPU memory architecture, the various memory spaces have different sizes and access speeds. From fast to slow, they are register memory, local (shared) memory, and global memory; the speed is inversely proportional to the size of the memory space. Loading data from global memory directly into registers reduces transfer efficiency and does not maximize memory bandwidth. Therefore, in this paper, double buffers are used to alleviate the large differences in memory access speeds, and pre-fetch techniques are used to maximize memory bandwidth and thread instruction parallelism. The workflow for a tile in GEMM is shown in Fig. 6.

First, a double buffer (buffer0 and buffer1) is created in shared memory; the size of a single buffer is related to the accumulated span of matrices A and B along the K-dimension. As shown in Fig. 6, loading instructions fetch the matrix tiles A_FP16, R_A, B_FP16 and R_B in iteration #0, and the data is stored in buffer0. The fetched data corresponds to the yellow tiles A_FP16 and R_A (T_M × T_K) and the green tiles B_FP16 and R_B (T_K × T_N) in Fig. 6. At the same time, the data to be loaded in iteration #1 is fetched into buffer1. This allows the registers to fetch data directly from shared memory without waiting for data to be loaded from global memory. A thread loads data from shared memory into private registers following a specified data layout, and then uses the Matrix Core to complete the computation. When the intermediate result of iteration #0 is complete, the registers read the data needed for the next iteration from buffer1.

For efficient data pipelining and instruction-level parallelism, data for the next iteration can be pre-fetched from global memory into shared memory while data is transferred between shared memory and registers. With this approach, we implement a multi-level access pipeline between the different memory levels, so that access and computation instructions execute in parallel and the ILP of the threads improves. Note that a thread synchronization instruction is issued after each buffer store to ensure data consistency. Each iteration produces an intermediate result that is combined with the accumulation matrix C_FP32 in an FMA operation; after several iterations, the final result D_FP32 is obtained and written back to global memory.

Fig. 6. The workflow of GEMM.

The implementation of the multi-level pipeline must be aware of the register and shared memory capacities of the specific GPU hardware. Once the maximum capacity is exceeded, data overflow occurs, which greatly increases access latency. As described above, register and shared memory occupancy is related to T_M, T_N, and T_K. When a tiling scheme is given, T_M and T_N are determined, so data overflow can be prevented by setting T_K flexibly. For example, when T_M and T_N are 128 and 256 (the A800 configuration in Table 2), T_K can be set to a smaller value such as 16 or 32 to prevent overflow. Note that, to avoid invalid computations on the Matrix Core, T_K should preferably be an integer multiple of the fragment size.

5. Results

This section compares the proposed method with several state-of-the-art approaches and analyzes the experimental results on two GPU platforms. The flowchart of the comparative experiments is shown in Fig. 7. Note that for matrices A ∈ R^(M×K) and B ∈ R^(K×N), the problem size is written as ''M*N*K'' in the following expressions for simplicity. The tested numbers were drawn from [−1, 1] in the comparative experiments.

Fig. 7. The flowchart of the comparative experiment.

5.1. Setup

Experiment platform. We chose two widely used and advanced GPUs (MI210 and A800). The configurations of the experimental platforms and of the GPUs are given in Tables 3 and 4, respectively.

Table 4
The configuration of two GPUs.
Name                   AMD MI210            NVIDIA A800
Architecture           CDNA 2.0             Ampere
Mixed-precision core   Matrix Core          Tensor Core
Caches                 L1 16 KB (per CU)    L1 192 KB (per SM)
                       L2 16 MB             L2 40 MB
Memory                 64 GB HBM2           80 GB HBM2
Bandwidth              1.6 TB/s             2.04 TB/s

Comparison methods. To evaluate the proposed method, we first compare against the default single-precision GEMM routines provided by the respective GPU vendors, rocBLAS and cuBLAS; these do not use the mixed-precision cores and compute FP32 GEMM directly from FP32 inputs. We then compare rocWMMA and WMMA, which use FP16 inputs to compute FP32 results on the Matrix Core or Tensor Core without precision refinement. Finally, the experiments compare Markidis's and Ootomo's methods, which compute FP32 results from FP16 inputs with precision refinement. All experimental results are averaged over 10 runs. The baseline methods are described in Table 5.

Table 5
The description of the baseline methods.
Name           Input   Output   Description
rocBLAS [30]   FP32    FP32     rocblasSgemm on ROCm Core.
cuBLAS [32]    FP32    FP32     cublasSgemm on CUDA Core.
rocWMMA [28]   FP16    FP32     rocWMMA on Matrix Core without precision refinement.
WMMA [55]      FP16    FP32     WMMA on Tensor Core without precision refinement.
Mark [10]      FP16    FP32     Markidis's method implemented on both platforms.
Ooto [43]      FP16    FP32     Ootomo's method implemented on both platforms.

Evaluation criteria. To objectively evaluate the effectiveness of the proposed method, two criteria are used, covering speed and accuracy. For speed, the experimental results are reported as the average TFLOPS (Tera FLoating-point Operations Per Second), calculated as:

TFLOPS = 2((M × N × K) + (M × N)) / (total_time × 1.0e12)    (18)

where M, N and K represent the matrix dimensions of the GEMM and total_time represents the running time in seconds measured on the GPU. For accuracy, the maximum relative error (Max_error) with respect to the CPU computation results is used:

Max_error = Max( |toDouble(V_p(i,j)) − toDouble(V_single(i,j))| / (|toDouble(V_p(i,j))| + |toDouble(V_single(i,j))|) ),
            i = 1, 2, …, M, j = 1, 2, …, N    (19)

where V_p(i,j) denotes the computational result under the precision refinement procedure, which achieves single-precision results from half-precision matrices; V_single(i,j) is the result of a standard single-precision calculation; (i,j) indexes the i-th row and j-th column of the matrix; toDouble() is a precision transformation function used to obtain more accurate errors; M and N represent the dimensions of the result matrix; and Max() selects the maximum over all errors. Meanwhile, the Mean Relative Error Distance (MRED) is also used as an error evaluation metric, calculated as:

MRED = (1/(M × N)) Σ_{i=1..M} Σ_{j=1..N} | (toDouble(V_single(i,j)) − toDouble(V_p(i,j))) / toDouble(V_single(i,j)) |    (20)

where M and N denote the size of the result matrix, and V_single(i,j) and V_p(i,j) have the same meaning as in Eq. (19).
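Under the definitions above, both metrics are a few lines of NumPy. This is an offline check of Eqs. (19)–(20), not the paper's measurement harness:

```python
import numpy as np

def max_error(v_single, v_p):
    """Maximum relative error of Eq. (19); operands are promoted to FP64
    (the role of toDouble()) before the error is formed."""
    s = v_single.astype(np.float64)
    p = v_p.astype(np.float64)
    return float(np.max(np.abs(p - s) / (np.abs(p) + np.abs(s))))

def mred(v_single, v_p):
    """Mean Relative Error Distance of Eq. (20)."""
    s = v_single.astype(np.float64)
    p = v_p.astype(np.float64)
    return float(np.mean(np.abs((s - p) / s)))
```

Note that Eq. (19) normalizes by the sum of the two magnitudes, so a single corrupted entry is reported even when all other entries match exactly, while MRED averages the per-entry relative deviations.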
5.2. The speed improvement with various shapes

To evaluate the speed advantages of our method, we conducted comparative experiments on the two GPU platforms with multiple matrix shapes; the results are presented in Figs. 8–9. In Fig. 8(a)–(c), our method improves performance by 1.31× on average compared to rocBLAS, an arithmetic library commonly used for scientific computing and provided by AMD. We also compare against rocWMMA, which directly uses half-precision inputs to produce a single-precision output without a precision refinement process; the proposed method achieves average speedups of 1.25×, 1.24×, and 1.45× for the (n*n*n), (n*4n*n), and (n*n*4n) shapes, respectively. From Fig. 8(a)–(b), when n ≤ 4096 the performance of rocWMMA is superior to that of the proposed method, mainly for two reasons. First, rocWMMA is a straightforward computation that involves no accuracy refinement procedure, so its computational workload is smaller. Second, with small matrix dimensions the proposed method cannot fully exploit the performance improvement brought by the multi-level pipeline design. As the matrix dimension grows, the performance advantage of the proposed method becomes increasingly apparent, as seen in Fig. 8. Compared to Markidis's and Ootomo's methods, the proposed method achieves speedups of 3.05× and 3.18×, respectively. These two methods use precision refinement but require more computation, and they do not consider the data layout and instruction scheduling characteristics of the specialized hardware accelerators, leading to lower performance. The proposed method uses the thread parallelism analytic model to choose the tiling scheme that segments the matrix, and an efficient multi-level pipeline to improve the efficiency of data access and ILP, making it an effective solution to the low computation speed of precision refinement. Fig. 8 indicates that the proposed method has performance benefits when the matrix dimension is large, further evidencing that LE-GEMM significantly improves performance for both square and skewed matrices.

The speed comparison on the NVIDIA A800 is shown in Fig. 9. In the square matrix case (n*n*n), the proposed method achieves average speedups of 1.37×, 1.40×, 2.78×, and 2.74× over cuBLAS, WMMA, Mark, and Ooto, respectively. As illustrated in Fig. 9, when the matrix size is small, the performance of LE-GEMM is similar to that of the compared methods; for example, when n ≤ 2048, Fig. 9(a) shows that the proposed method performs similarly to cuBLAS and WMMA. This is because a smaller matrix reduces both the number of tiles and the number of iterative accumulations within each tile, diminishing the performance gain from the multi-level pipeline. As the matrix size increases, the benefits of the proposed method become more apparent, and for large matrices it significantly outperforms the other methods. The experiments on the two GPU platforms also show that the performance of the Markidis and Ootomo methods is nearly identical: although Ootomo's method reduces the computational effort by 25%, it requires an additional accumulator, whose initialization and combination stages cost extra time. When the matrix is skewed, the proposed method outperforms the other baseline methods significantly. In Fig. 9(c), when n > 7168 the performance of the WMMA method decreases, because it is a simple, direct implementation that ignores the data loading pipeline and the multi-level memory architecture during the computation; this results in lower efficiency when accumulating along the K-dimension, which is more pronounced for the (n*n*4n) shape, and prevents WMMA from fully utilizing its hardware acceleration capability. Fig. 9(b) illustrates that the proposed method achieves speedups of 1.40×, 1.43×, 2.84×, and 2.80× over the four baseline methods, respectively. As shown in Fig. 9, the proposed method performs better when the matrix size is large, a trend consistent with the AMD platform. The detailed comparative results for the two GPU platforms are presented in Table 6.

Fig. 8. The experimental results with various shapes on MI210.
Fig. 9. The experimental results with various shapes on A800.

Table 6
The speedup on two GPU platforms.
Baseline method     AMD MI210                      NVIDIA A800
                    (n*n*n)  (n*4n*n)  (n*n*4n)    (n*n*n)  (n*4n*n)  (n*n*4n)
rocBLAS/cuBLAS      1.31×    1.31×     1.32×       1.37×    1.40×     1.36×
rocWMMA/WMMA        1.26×    1.24×     1.45×       1.40×    1.43×     1.81×
Mark                2.81×    2.99×     3.35×       2.77×    2.84×     4.67×
Ooto                2.99×    3.10×     3.45×       2.74×    2.80×     3.76×
5.3. The precision error

This section compares the baseline methods in terms of their maximum error; the results are shown in Fig. 10. First, we compare with rocWMMA, which computes FP32 GEMM directly from FP16 on the Matrix Core. In Fig. 10(a), the proposed method reduces the average Max_error by 377.33× compared to rocWMMA. The results indicate that FP16 cannot preserve the full FP32 mantissa because of its limited mantissa length; in addition, rocWMMA lacks precision refinement, resulting in a pronounced Max_error. This comparison shows that the proposed method significantly reduces Max_error relative to the default routine.

The Mark and Ooto methods were then compared separately. Both utilize precision refinement techniques to decrease Max_error and thus improve computational accuracy over the default methods. Fig. 10(a) shows that the proposed method reduces the average Max_error by 74.82× compared to Mark. Although Markidis et al. [10] employ an additional FP16 value in the precision refinement process to reduce Max_error, the effect of the rounding method on FP16 is not considered, which limits the achievable accuracy improvement. The proposed method reduces Max_error by an average of 2.69× compared to Ootomo's method. In Ootomo's method, the additional FP16 value cannot preserve the residual between FP32 and FP16 efficiently due to the limited numerical representation range, and rounding inevitably causes fluctuations in the values preserved in FP16. To alleviate this problem, the proposed method employs a bit-cut of the high bits of the FP32 mantissa, avoiding the value fluctuations caused by rounding. Fig. 10(a) also shows that Max_error increases with the matrix dimension (the K-dimension): since FP16 cannot fully preserve the FP32 values, the error in the calculated results gradually becomes more apparent as the number of accumulated computations grows. The comparisons with the default (rocWMMA) method and the latest (Mark and Ooto) methods demonstrate the effectiveness and advancement of the proposed method on the AMD platform.

Fig. 10. The Max_error with square matrix (n*n*n) on two GPU platforms.

To demonstrate the proposed method's portability, we conducted corresponding experiments on the NVIDIA A800, another popular GPU platform; the Max_error results are shown in Fig. 10(b). To ensure an objective comparison, we implemented Markidis's and Ootomo's methods on the NVIDIA platform using the WMMA interface; the interface name is the only difference between these implementations and the AMD-platform implementations. Fig. 10(b) illustrates that the proposed method significantly reduces Max_error as the matrix size increases, consistent with the results obtained on the AMD platform. The proposed method also has a smaller Max_error than WMMA, Mark, and Ooto, reducing the average Max_error by 571.75×, 2.78×, and 1.41×, respectively. The results on the two platforms show the same trend, indicating that the proposed method achieves a low Max_error on mainstream GPU platforms and further confirming its effectiveness in terms of computational accuracy.

We also report the MRED of the comparative methods, which allows a more comprehensive analysis of the matrix error; the detailed results are presented in Fig. 11. As illustrated in Fig. 11(a), the proposed method exhibits a smaller MRED than rocWMMA, Mark, and Ooto, with reductions of 636.92×, 178.93×, and 1.12× on the MI210, respectively. Combining the experimental results in Figs. 10–11, rocWMMA and WMMA show enormous Max_error and MRED compared to the other methods, which again underscores the importance of precision refinement. On the A800, the proposed method reduces the MRED by 814.87×, 1.75×, and 1.28× compared to WMMA, Mark, and Ooto, respectively. The results in Figs. 10–11 show a consistent trend: Max_error and MRED become more significant as the accumulation dimension increases. As shown in Figs. 10–11, the proposed method has the smallest Max_error and MRED, indicating a smaller upper error limit and mean relative error; together, these comparisons prove that the proposed method has better computational accuracy.

Fig. 11. The Mean Relative Error Distance with square matrix (n*n*n) on two GPU platforms.
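The FP16 mantissa limitation underlying these error gaps can be checked directly. This quick host-side check is ours, not an experiment from the paper; it also shows how a scaled residual, as used by the refinement-based methods, recovers the lost low bits:

```python
import numpy as np

# FP32 carries 23 explicit mantissa bits, FP16 only 10, so a low-order
# increment such as 2**-12 does not survive a direct down-conversion.
x = np.float32(1.0 + 2.0 ** -12)
naive = np.float16(x)                     # rounds back to exactly 1.0
increment_lost = (np.float32(naive) == np.float32(1.0))

# Keeping a scaled FP16 residual, as the refinement-based methods do,
# recovers the lost low bits exactly in this case.
res = np.float16((x - np.float32(naive)) * np.float32(2.0 ** 10))
recovered = np.float32(naive) + np.float32(res) / np.float32(2.0 ** 10)
```

Here every intermediate value is exactly representable, so the round trip is lossless; for general data the residual still captures roughly ten additional mantissa bits.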
This section also provides error distribution histograms to demonstrate each method's error range. The histograms are measured on both GPU platforms with a matrix size of 1024*1024*1024; the detailed error distribution results are shown in Figs. 12–13. As shown in Figs. 12–13, the error distributions of rocWMMA and WMMA are the widest, indicating that the error is significant without an additional precision computation process. Compared to the Mark method, the proposed method has obvious advantages in both error range and maximum error, and this precision advantage is more evident on the MI210. The gap arises because, although Mark considers the impact of down-conversion, simple rounding cannot effectively preserve the low bits; moreover, differences in the implementation of the rounding API across hardware platforms can also affect the calculated results. Compared to the Ooto method, the error distribution histograms of the two methods are similar, but the error of the proposed method is more concentrated around 0 in Figs. 12 and 13, indicating that the proposed method is superior to Ooto in terms of error distribution. Overall, the error histograms shown in Figs. 12 and 13 indicate that the proposed method outperforms the other methods regarding both the error upper limit and the error distribution on both platforms, proving its accuracy advantage.

Fig. 12. Error distribution histogram on MI210.
Fig. 13. Error distribution histogram on A800.

5.4. The empirical Roofline analysis

To further demonstrate the effectiveness of the proposed method, we present a comparison and analysis of the empirical Roofline performance on the two GPU platforms. In these comparisons, GEMMs with larger dimensions are chosen for profiling, which avoids performance estimation bias caused by having too few data elements in the multi-level cache. To measure the performance metrics, we used Omniperf [56] and NCU [57], the profiling tools provided by AMD and NVIDIA, respectively. An empirical Roofline is typically composed of two regions, the ''memory-bound'' and the ''compute-bound'', indicating that the kernel's performance is limited by memory access or by compute capacity, respectively. The specific results are shown in Fig. 14. From Fig. 14, the proposed method lies in the ''compute-bound'' region, unlike Mark, Ooto, and rocWMMA, indicating that it achieves higher bandwidth utilization of the GPU's multi-level memory. The baseline approaches do not take into account the data layout of the specialized hardware units and the multi-level cache characteristics of GPUs, resulting in lower performance. To address this issue, the proposed method uses double buffers and pre-fetching techniques to establish an efficient multi-level pipeline, which empowers the kernel to make better use of the data in the multi-level cache, leading to improved performance. Although rocBLAS and cuBLAS also fall into the ''compute-bound'' region, the proposed method still performs better, which demonstrates its computational effectiveness for single-precision GEMM. The empirical Roofline comparison on the two platforms further confirms the effectiveness and advancement of the proposed method.

Fig. 14. The empirical Roofline performance.

5.5. The improved performance for GEMM-based scientific computing scenarios

To demonstrate the proposed method's acceleration performance in GEMM-based scientific computing, we selected three real-world scenarios that involve numerous GEMM operations: K-means [6], the FFN layer in BERT [8], and the linear transformation in LLMs [58]. Fig. 15 shows the speedup of the proposed method over the default implementations (rocBLAS and cuBLAS) on the two GPU platforms. As shown on the left of Fig. 15, the average speedups of the proposed method in K-means over rocBLAS and cuBLAS are 1.16× and 1.33×, respectively. In the K-means experiments, the shape of the GEMM operation is varied by altering the number of data points; the performance advantage of the proposed method increases gradually with the number of data points and then stabilizes, suggesting more significant gains for large-scale GEMM. Fig. 15 also illustrates the performance improvement of the FFN layer, into which the proposed method is incorporated to accelerate the computation of the BERT layer; here the proposed method achieves 1.16× and 1.18× speedups over rocBLAS and cuBLAS. Experiments with various input sequence lengths show that the proposed method significantly improves the speed of the FFN computation on both GPU platforms. Meanwhile, we also apply the proposed method to the linear transformation in LLMs to accelerate the linear mapping process, which involves GEMM workloads of various sizes. During the linear transformation, the input features are mapped to produce Q, K, and V, commonly known as the Query, Key, and Value matrices. The range of matrix data from the LLM training process is [0.5, 2.5]. In Fig. 15, for the computation of weight matrices of multiple sizes, the proposed method achieves speedups of 1.23× and 1.34× over the default implementations, respectively.

Fig. 15. The acceleration performance with LE-GEMM (note that rocBLAS and cuBLAS serve as the baselines for the performance comparison, hence their speedup is 1).

Fig. 16 shows the Max_error and MRED of the proposed method in the three GEMM-based applications. As illustrated in Fig. 16, the magnitude of the error of the proposed method is primarily contingent on the K-dimension of the GEMM: as the K-dimension increases, both Max_error and MRED increase, because GEMMs with larger K-dimensions involve more FMA operations, leading to more significant errors.

Fig. 16. The Max_error and MRED with LE-GEMM.

5.6. The performance gain analysis

In this section, we evaluate the performance gains contributed to LE-GEMM by its two techniques, the thread parallelism analytic model and the multi-level pipeline. The detailed experimental results are shown in Fig. 17.
5.6. The performance gain analysis

In this section, we evaluate the performance gains brought by the two techniques, the thread parallelism analytic model and the multi-level pipeline, for LE-GEMM. The detailed experimental results are shown in Fig. 17. In Fig. 17, ‘‘LE-GEMM manual parameter’’ represents using previous experience to select the appropriate parameters, and ‘‘LE-GEMM plain implementation’’ represents the direct use of the API interface without the multi-level pipeline. As shown in Fig. 17, the thread parallelism analytic model brings 1.36× and 1.78× performance improvements on the MI210 and A800, respectively. As the matrix size increases, the performance gain becomes more obvious. For example, when the matrix size is ≥ 4000, there is a significant performance gap between the proposed method and the manual-parameter implementation. This is because different parameters have a significant performance impact across hardware architectures and matrix sizes. The manual-parameter approach relies solely on prior experience in selecting kernel parameters and often requires several trial-and-error attempts to obtain a suitable parameter value. Moreover, without a detailed analysis of the hardware architecture and the computational workflow, it is difficult for manual selection to obtain the optimal parameters.

Fig. 17. The performance gain analysis of LE-GEMM (‘‘LE-GEMM manual parameter’’ indicates LE-GEMM without the thread parallelism analytic model and ‘‘LE-GEMM plain implementation’’ indicates LE-GEMM without the multi-level pipeline).
We also evaluated the LE-GEMM plain implementation, which ignores the multi-level memory architecture and adopts a simple, direct calling style. Fig. 17 illustrates that LE-GEMM exhibits performance speedups of 3.02× and 2.72× over the plain implementation on the two GPU platforms. In the plain implementation, threads must wait for the next data load after completing each iteration’s calculation, which significantly reduces performance. Compared to the plain implementation, LE-GEMM can efficiently utilize the various memory spaces for pipelined access operations, reducing the data loading overhead and improving CU utilization.
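The benefit of overlapping tile loads with computation can be illustrated with a toy cost model. The cycle counts below are assumptions chosen for illustration, not measurements of either GPU:

```python
# With a single buffer, every main-loop iteration serializes
# "load tile -> compute tile". With double buffers, the next tile is
# prefetched while the current tile is computed, so the steady-state
# cost per iteration drops from (load + compute) to max(load, compute).

def plain_loop_cycles(iters, load, compute):
    """Each iteration waits for its tile before computing."""
    return iters * (load + compute)

def double_buffered_cycles(iters, load, compute):
    """Only the first load is exposed; later loads hide behind compute."""
    return load + iters * max(load, compute)

ITERS, LOAD, COMPUTE = 64, 100, 120  # assumed cycle costs
speedup = (plain_loop_cycles(ITERS, LOAD, COMPUTE)
           / double_buffered_cycles(ITERS, LOAD, COMPUTE))
```

Under these assumed costs the pipelined loop is roughly 1.8× faster, qualitatively matching the gap observed between LE-GEMM and its plain implementation.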
5.7. The trade-off between performance and precision

To explore the trade-off between high performance and precision, comparative experiments with different precisions were conducted on the A800. We used the cublasGemmEx() function to implement two computation modes (CUBLAS_COMPUTE_32F_FAST_16F and CUBLAS_COMPUTE_64F). The CUBLAS_COMPUTE_32F_FAST_16F mode directly uses FP16 to obtain single-precision calculation results, during which down-conversion is performed. The detailed experimental results are presented in Fig. 18. As shown in Fig. 18, CUBLAS_COMPUTE_32F_FAST_16F has the fastest computational performance because it directly calls the Tensor Core for matrix operations without any precision refinement. Its calculation process is consistent with WMMA, which likewise applies down-conversion without a precision refinement process; hence, the error of these two methods is also consistent. CUBLAS_COMPUTE_32F_FAST_16F often leads to significant errors, which limits its application range. The proposed method reduces the precision loss by using additional computational resources, which decreases the overall computational speed; compared to CUBLAS_COMPUTE_32F_FAST_16F, it therefore has better computational accuracy. CUBLAS_COMPUTE_64F represents the double-precision matrix multiplication operation, with higher data precision and lower computational speed. The experimental results in Fig. 18 indicate that higher-precision data tends to come with lower computational speed. The proposed LE-GEMM achieves single-precision computational results on low-precision computational units and is essentially a compromise between computational performance and accuracy.

Fig. 18. The trade-off between performance and precision.
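The difference between plain down-conversion and precision refinement can be sketched with the classic two-term splitting used in earlier emulation work [10,43]. This scalar sketch mimics FP16 by rounding to an 11-bit significand and is only an illustration of the idea, not LE-GEMM’s actual kernel:

```python
import math

def to_fp16_like(x, bits=11):
    """Round x to an 11-bit significand, mimicking FP16 down-conversion."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)             # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** bits
    return math.ldexp(round(m * scale) / scale, e)

def split(x):
    """Split x into a low-precision head and a low-precision residual tail."""
    hi = to_fp16_like(x)
    lo = to_fp16_like(x - hi)
    return hi, lo

def refined_product(x, y):
    """x*y ~ hi*hi + hi*lo + lo*hi, with products accumulated in full
    precision (as a Tensor Core accumulates FP16 products in FP32)."""
    xh, xl = split(x)
    yh, yl = split(y)
    return xh * yh + xh * yl + xl * yh

x, y = 1.2345678, 2.3456789
plain_err = abs(to_fp16_like(x) * to_fp16_like(y) - x * y)   # FAST_16F-style
refined_err = abs(refined_product(x, y) - x * y)             # with refinement
```

The refined product recovers several additional significant digits at the cost of three low-precision multiplies instead of one, mirroring the speed-for-accuracy trade-off measured in Fig. 18.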
6. Conclusion

This paper proposed LE-GEMM for single-precision matrix multiplication, which comprises a lightweight emulation algorithm, a thread parallelism analytic model, and an efficient multi-level pipeline implementation. Specifically, the lightweight emulation algorithm provides more accurate computational results and eliminates redundant computations. The thread parallelism analytic model determines the optimal tiling scheme by considering TLP and single-thread performance. The multi-level pipeline design achieves access latency hiding and higher ILP. Finally, experiments were conducted on two GPU platforms to validate the effectiveness and advancement of the proposed method.

7. Future work

Future work includes exploring the impact of different types of low-precision data on computational results, allowing for better trade-offs between performance and accuracy. Recently, low-precision hardware accelerators have been increasingly applied in various GEMM-based scientific computing scenarios to improve computational speed. The differences in hardware support and implementation can significantly affect the calculated results. In future work, we plan to explore more low-precision formats (e.g., BF16, FP8, FP4) to speed up the computational process of single-precision matrix multiplication.

CRediT authorship contribution statement

Yu Zhang: Writing – review & editing, Writing – original draft. Lu Lu: Writing – review & editing. Zhanyu Yang: Writing – review & editing, Conceptualization. Zhihong Liang: Writing – review & editing. Siliang Suo: Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of Guangdong Province, China (2024A1515010204) and the Technological Research Project of Southern Power Grid Company, China (ZBKJXM20232483).

Data availability

No data was used for the research described in the article.

References

[1] M. Meyer, T. Kenter, C. Plessl, Multi-FPGA designs and scaling of HPC challenge benchmarks via MPI and circuit-switched inter-FPGA networks, ACM Trans. Reconfigurable Technol. Syst. 16 (2) (2023) 1–27.
[2] H. Martínez, S. Catalán, A. Castelló, E.S. Quintana-Ortí, Parallel GEMM-based convolutions for deep learning on multicore ARM and RISC-V architectures, J. Syst. Archit. (2024) 103186.
[3] Y.R. Choi, V. Stegailov, Multi-GPU GEMM algorithm performance analysis for NVIDIA and AMD GPUs connected by NVLink and PCIe, in: International Conference on Mathematical Modeling and Supercomputer Technologies, Springer, 2022, pp. 281–292.
[4] B. Feng, Y. Wang, G. Chen, W. Zhang, Y. Xie, Y. Ding, EGEMM-TC: accelerating scientific computing on tensor cores with extended precision, in: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021, pp. 278–291.
[5] N.S. Murthy, F. Catthoor, M. Verhelst, Optimization of block-scaled integer GeMMs for efficient DNN deployment on scalable in-order vector processors, J. Syst. Archit. (2024) 103236.
[6] M. Ahmed, R. Seraj, S.M.S. Islam, The K-means algorithm: A comprehensive survey and performance evaluation, Electronics 9 (8) (2020) 1295.
[7] G. Chen, Y. Ding, X. Shen, Sweet KNN: An efficient KNN on GPU through reconciliation between redundancy removal and regularity, in: 2017 IEEE 33rd International Conference on Data Engineering, ICDE, IEEE, 2017, pp. 621–632.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018, arXiv preprint arXiv:1810.04805.
[9] AMD, Matrix core, 2024, https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/.
[10] S. Markidis, S.W. Der Chien, E. Laure, I.B. Peng, J.S. Vetter, NVIDIA Tensor Core programmability, performance & precision, in: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW, IEEE, 2018, pp. 522–531.
[11] AMD, AMD CDNA 2.0 architecture, 2024, https://www.amd.com/content/dam/amd/en/documents/instinct-business-docs/white-papers/amd-cdna2-white-paper.pdf.
[12] W. Sun, A. Li, T. Geng, S. Stuijk, H. Corporaal, Dissecting Tensor Cores via microbenchmarks: Latency, throughput and numeric behaviors, IEEE Trans. Parallel Distrib. Syst. 34 (1) (2022) 246–261.
[13] G. Park, B. Park, M. Kim, S. Lee, J. Kim, B. Kwon, S.J. Kwon, B. Kim, Y. Lee, D. Lee, LUT-GEMM: Quantized matrix multiplication based on LUTs for efficient inference in large-scale generative language models, 2022, arXiv preprint arXiv:2206.09557.
[14] J. Liu, R. Gong, X. Wei, Z. Dong, J. Cai, B. Zhuang, QLLM: Accurate and efficient low-bitwidth quantization for large language models, 2023, arXiv preprint arXiv:2310.08041.
[15] Y. Wang, B. Feng, Z. Wang, G. Huang, Y. Ding, TC-GNN: Bridging sparse GNN computation and dense Tensor Cores on GPUs, in: 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 149–164.
[16] R. Fan, W. Wang, X. Chu, Fast sparse GPU kernels for accelerated training of graph neural networks, in: 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2023, pp. 501–511.
[17] T. Utsugiri, T. Kouya, Acceleration of multiple precision matrix multiplication using Ozaki scheme, 2023, arXiv preprint arXiv:2301.09960.
[18] M. Tatsumi, S.-I. Filip, C. White, O. Sentieys, G. Lemieux, Mixing low-precision formats in multiply-accumulate units for DNN training, in: 2022 International Conference on Field-Programmable Technology, ICFPT, IEEE, 2022, pp. 1–9.
[19] Y. Tortorella, L. Bertaccini, L. Benini, D. Rossi, F. Conti, RedMule: A mixed-precision matrix–matrix operation engine for flexible and energy-efficient on-chip linear algebra and TinyML training acceleration, Future Gener. Comput. Syst. 149 (2023) 122–135.
[20] D.N. Parikh, R.A. van de Geijn, G.M. Henry, Cascading GEMM: High precision from low precision, 2023, arXiv preprint arXiv:2303.04353.
[21] M. Fasi, N.J. Higham, F. Lopez, T. Mary, M. Mikaitis, Matrix multiplication in multiword arithmetic: Error analysis and application to GPU tensor cores, SIAM J. Sci. Comput. 45 (1) (2023) C1–C19.
[22] P. Valero-Lara, I. Jorquera, F. Lui, J. Vetter, Mixed-precision S/DGEMM using the TF32 and TF64 frameworks on low-precision AI tensor cores, in: Proceedings of the SC’23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 2023, pp. 179–186.
[23] X.-Y. Liu, Z. Zhang, Z. Wang, H. Lu, X. Wang, A. Walid, High-performance tensor learning primitives using GPU Tensor Cores, IEEE Trans. Comput. 72 (6) (2022) 1733–1746.
[24] A. Abdelfattah, H. Anzt, E.G. Boman, E. Carson, T. Cojean, J. Dongarra, A. Fox, M. Gates, N.J. Higham, X.S. Li, et al., A survey of numerical linear algebra methods utilizing mixed-precision arithmetic, Int. J. High Perform. Comput. Appl. 35 (4) (2021) 344–369.
[25] M.O. Lam, J.K. Hollingsworth, B.R. de Supinski, M.P. Legendre, Automatically adapting programs for mixed-precision floating-point computation, in: Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, 2013, pp. 369–378.
[26] A. Dakkak, C. Li, J. Xiong, I. Gelado, W.-m. Hwu, Accelerating reduction and scan using tensor core units, in: Proceedings of the ACM International Conference on Supercomputing, 2019, pp. 46–57.
[27] M. Fasi, N.J. Higham, M. Mikaitis, S. Pranesh, Numerical behavior of NVIDIA tensor cores, PeerJ Comput. Sci. 7 (2021) e330.
[28] AMD, RocWMMA: ROCm wavefront-level matrix multiply and accumulate, 2024, https://github.com/ROCmSoftwarePlatform/rocWMMA.
[29] D. Yan, W. Wang, X. Chu, Demystifying Tensor Cores to optimize half-precision matrix multiply, in: 2020 IEEE International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2020, pp. 634–643.
[30] AMD, Next generation BLAS implementation for ROCm platform, 2024, https://github.com/ROCm/rocBLAS.
[31] A. Kerr, D. Merrill, J. Demouth, J. Tran, CUTLASS: Fast linear algebra in CUDA C++, NVIDIA Dev. Blog (2017).
[32] B. Tuomanen, Hands-On GPU Programming with Python and CUDA: Explore high-performance parallel computing with CUDA, Packt Publishing Ltd, 2018.
[33] D. Lustig, S. Sahasrabuddhe, O. Giroux, A formal analysis of the NVIDIA PTX memory consistency model, in: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 257–270.
[34] D. Mukunoki, K. Ozaki, T. Ogita, T. Imamura, Accurate matrix multiplication on binary128 format accelerated by Ozaki scheme, in: Proceedings of the 50th International Conference on Parallel Processing, 2021, pp. 1–11.
[35] J. Huang, L. Rice, D.A. Matthews, R.A. van de Geijn, Generating families of practical fast matrix multiplication algorithms, in: 2017 IEEE International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2017, pp. 656–667.
[36] Y. Wang, B. Feng, Y. Ding, QGTC: accelerating quantized graph neural networks via GPU tensor core, in: Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2022, pp. 107–119.
[37] B. Li, S. Cheng, J. Lin, TcFFT: Accelerating half-precision FFT through tensor cores, 2021, arXiv preprint arXiv:2104.11471.
[38] F. Zhang, Y. Hu, H. Ding, Z. Yao, Z. Wei, X. Zhang, X. Du, Optimizing random access to hierarchically-compressed data on GPU, in: SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, 2022, pp. 1–15.
[39] R. Sakamoto, M. Kondo, K. Fujita, T. Ichimura, K. Nakajima, The effectiveness of low-precision floating arithmetic on numerical codes: a case study on power consumption, in: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2020, pp. 199–206.
[40] Y. Sun, L. Zheng, Q. Wang, X. Ye, Y. Huang, P. Yao, X. Liao, H. Jin, Accelerating sparse deep neural network inference using GPU Tensor Cores, in: 2022 IEEE High Performance Extreme Computing Conference, HPEC, IEEE, 2022, pp. 1–7.
[41] Z. Ma, H. Wang, G. Feng, C. Zhang, L. Xie, J. He, S. Chen, J. Zhai, Efficiently emulating high-bitwidth computation with low-bitwidth hardware, in: Proceedings of the 36th ACM International Conference on Supercomputing, 2022, pp. 1–12.
[42] Z. Lin, A. Sun, X. Zhang, Y. Lu, MixPert: Optimizing mixed-precision floating-point emulation on GPU integer tensor cores, in: Proceedings of the 25th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, 2024, pp. 34–45.
[43] H. Ootomo, R. Yokota, Recovering single precision accuracy from tensor cores while surpassing the FP32 theoretical peak performance, Int. J. High Perform. Comput. Appl. 36 (4) (2022) 475–491.
[44] P. Blanchard, N.J. Higham, F. Lopez, T. Mary, S. Pranesh, Mixed precision block fused multiply-add: Error analysis and application to GPU tensor cores, SIAM J. Sci. Comput. 42 (3) (2020) C124–C141.
[45] N.J. Higham, T. Mary, Mixed precision algorithms in numerical linear algebra, Acta Numer. 31 (2022) 347–414.
[46] Z. Jia, M. Maggioni, B. Staiger, D.P. Scarpazza, Dissecting the NVIDIA Volta GPU architecture via microbenchmarking, 2018, arXiv preprint arXiv:1804.06826.
[47] M. Lu, B. He, Q. Luo, Supporting extended precision on graphics processors, in: Proceedings of the Sixth International Workshop on Data Management on New Hardware, 2010, pp. 19–26.
[48] A. Thall, Extended-precision floating-point numbers for GPU computation, in: ACM SIGGRAPH 2006 Research Posters, 2006, pp. 52–es.
[49] B. Ferrarini, M. Waheed, S. Waheed, S. Ehsan, M.J. Milford, K.D. McDonald-Maier, Exploring performance bounds of visual place recognition using extended precision, IEEE Robot. Autom. Lett. 5 (2) (2020) 1688–1695.
[50] N.J. Higham, T. Mary, Sharper probabilistic backward error analysis for basic linear algebra kernels with random data, SIAM J. Sci. Comput. 42 (5) (2020) A3427–A3446.
[51] H. Ootomo, H. Manabe, K. Harada, R. Yokota, Quantum circuit simulation by SGEMM emulation on tensor cores and automatic precision selection, in: International Conference on High Performance Computing, Springer, 2023, pp. 259–276.
[52] N. Nakasato, A fast GEMM implementation on the Cypress GPU, ACM SIGMETRICS Perform. Eval. Rev. 38 (4) (2011) 50–55.
[53] M. De Santis, G. Eichfelder, J. Niebling, S. Rocktäschel, Solving multiobjective mixed integer convex optimization problems, SIAM J. Optim. 30 (4) (2020) 3122–3145.
[54] R.S. Burachik, C.Y. Kaya, M.M. Rizvi, Algorithms for generating Pareto fronts of multi-objective integer and mixed-integer programming problems, Eng. Optim. 54 (8) (2022) 1413–1425.
[55] NVIDIA, Programming tensor cores in CUDA 9, 2024, https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/.
[56] AMD, Advanced profiling and analytics for AMD hardware, 2024, https://github.com/ROCm/omniperf.
[57] NVIDIA, The user guide for Nsight Compute, 2024, https://docs.nvidia.com/nsight-compute/NsightCompute/index.html.
[58] G. Heo, S. Lee, J. Cho, H. Choi, S. Lee, H. Ham, G. Kim, D. Mahajan, J. Park, NeuPIMs: NPU-PIM heterogeneous acceleration for batched LLM inferencing, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol. 3, 2024, pp. 722–737.