Journal of Systems Architecture 160 (2025) 103359

GTA: Generating high-performance tensorized program with dual-task scheduling

Anxing Xie a,1, Yonghua Hu a,∗, Yaohua Wang b, Zhe Li b,c, Yuxiang Gao a, Zenghua Cheng a

a School of Computer Science and Engineering, Hunan University of Science and Technology, Taoyuan Road, Xiangtan, 411201, Hunan, China
b School of Computer Science, National University of Defense Technology, Deya Road, Changsha, 410073, Hunan, China
c Tianjin Institute of Advanced Technology, Huixiang Road, 300459, Tianjin, China

Keywords: Mapping; Code generation; Compiler optimization; Tensor computation

Abstract: Generating high-performance tensorized programs for deep learning accelerators (DLAs) is crucial for ensuring the efficient execution of deep neural networks. However, producing such programs for different operators across various DLAs is notoriously challenging. Existing methods utilize hardware abstraction to represent acceleration intrinsics, enabling end-to-end automated exploration of the intrinsic mapping space. However, their limited search space and inefficient exploration strategies often result in suboptimal tensorized programs and significant search-time overhead. In this paper, we propose GTA, a framework designed to generate high-performance tensorized programs for DLAs. Unlike existing deep learning compilers, we first coordinate intrinsic-based mapping abstraction with a rule-based program generation strategy, and then apply resource-constrained rules to eliminate ineffective tensor program candidates from the search space. Second, we employ a dual-task scheduling strategy to allocate tuning resources across multiple subgraphs of deep neural networks and their mapping candidates.
As a result, GTA can find high-performance tensor programs that are outside the search space of existing state-of-the-art methods. Our experiments show that GTA achieves an average speedup of more than 1.88× over AMOS and 2.29× over Ansor on an NVIDIA GPU with Tensor Cores, as well as 1.49× over Ansor and 2.76× over PyTorch on a CPU with AVX512.

1. Introduction

Recently, the successful deployment of machine learning models has revolutionized diverse application domains, such as image recognition [1–3], natural language processing [4–6], and autonomous driving [7–9]. This rapid development has created a demand for generating high-performance tensor programs for deep learning accelerators (DLAs), such as Google TPUs [10], mobile devices [11–13], FPGAs [14–16], and more. To accelerate machine learning, hardware vendors have introduced domain-specific intrinsics for tensor computations, such as NVIDIA's Tensor Cores [17–19] and CPU's AVX512 [20]. This demand has led to the process known as tensorization [21], which involves transforming computations using these intrinsic instructions. However, hardware specialization complicates the task of generating high-performance tensorized programs.

To support hardware intrinsic instructions across different accelerators, existing methods [22–24] use unified hardware abstractions to enable end-to-end automatic mapping space exploration. These abstractions not only convert opaque intrinsics into an analyzable format but also bridge the gap between high-level tensor programs and low-level instructions, a process we refer to as tensorized program generation with automatic mapping optimization. However, generating high-performance tensorized programs for various DLAs remains challenging for several reasons.

Firstly, inefficient exploration of the intrinsic mapping space leads to substantial overhead in search time. For instance, mapping the 7 loops of a 2D convolution to the 3D loops of a Tensor Core intrinsic can involve 35 different ways [22]. Current strategies [22,23] treat each mapping candidate equally, generating a tensorized program for each and ultimately selecting the one with the best performance. This approach incurs significant time overhead and is inefficient, as it fails to prioritize more promising candidates during the exploration process. Our experiments reveal that many mapping candidates for a given subgraph ultimately fail to produce high-performance tensorized programs, indicating that a large portion of the explored mappings are ineffective in optimizing performance.

∗ Corresponding author. E-mail address: huyh@hnust.cn (Y. Hu).
1 Part of this work was done at National University of Defense Technology.
https://doi.org/10.1016/j.sysarc.2025.103359
Received 23 November 2024; Received in revised form 8 January 2025; Accepted 30 January 2025; Available online 7 February 2025.
1383-7621/© 2025 Published by Elsevier B.V.

Fig. 1. Comparison of different task scheduling strategies. Part (a): task scheduling with gradient descent. In round 1, all task_i are executed sequentially. In subsequent rounds, task_i are selectively executed based on the performance gradients calculated from the feedback of each task. Part (b): sequential execution of sub-tasks without dual-task scheduling. Part (c): the tuning time is sliced, and important subgraphs and intrinsic mapping candidates are prioritized, meaning that not all main-tasks and sub-tasks will be executed. For example, an intrinsic-enabled main-task_i may contain both retained and discarded mapping candidates. The former proceed to subsequent tensor program optimization and tuning, while the latter do not participate in further optimization unless they are selected in the next scheduling round.
(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Secondly, existing rule-based tensor program exploration methods [25] lack the ability to perform automatic tuning and optimization tailored to domain-specific intrinsics. As a result, these methods often fail in auto-tuning and produce suboptimal tensorized programs. To overcome these limitations, there is an urgent need for more efficient exploration of subgraph mapping spaces, along with auto-tuning strategies that can effectively support domain-specific intrinsics, enabling the automatic generation of high-performance tensorized programs.

In this paper, we introduce GTA, a new compiler framework designed to generate high-performance tensorized programs. GTA automatically generates an extensive search space optimized for hardware intrinsics, simultaneously increasing the likelihood of selecting the most efficient mapping configuration. To generate the search space, we employ rule-based strategies to construct a large scheduling search space and apply pruning techniques based on hardware cache resource limitations to eliminate invalid program candidates. Finally, as shown in Fig. 1, for the search strategy, we use a dual-task scheduling algorithm to allocate tuning resources across all subgraphs in the neural network (main-task_i, the blue box in Fig. 1) and their intrinsic mapping candidates (sub-task_i, the orange and gray boxes). This algorithm prioritizes subgraphs with greater potential for performance improvement, allocating them more tuning opportunities, while reducing tuning effort on less promising mapping candidates based on performance feedback, thereby minimizing overall tuning time. In summary, this paper makes the following contributions:

• We integrated intrinsic-based mapping abstraction with a rule-based program generation strategy to expand the search space significantly.
• We developed and implemented an efficient dual-task scheduling strategy for tensorized programs, effectively reducing tuning effort while enhancing performance.
• We propose a compilation framework called GTA, which supports the generation of high-performance tensorized programs at both the operator level and the full network level on NVIDIA GPUs and CPUs.
• We implemented and comprehensively evaluated the GTA system, demonstrating that the aforementioned techniques outperform state-of-the-art systems across various deep neural networks (DNNs).

2. Background and motivation

2.1. Deep learning compilers

Deep learning compilers [21–32] have emerged as essential tools for bridging the gap between deep learning models and diverse hardware backends. These compilers take model definitions, expressed in frameworks like PyTorch [33] or TensorFlow [34], as inputs and generate efficient code implementations for specific hardware platforms, such as CPUs and GPUs. The compilation process often adopts a progressive multi-layer optimization approach. It begins with the front-end, where neural network models serve as input, and proceeds through intermediate representation (IR) stages. These include graph-level IR [35–39] for structural optimizations and loop-level IR [40–42] for fine-grained transformations. Finally, the back-end generates hardware-specific executable code using traditional compiler techniques, ensuring efficient execution on the target platform.

A key innovation in deep learning compilers is the compute-schedule separation first introduced by Halide [43] and adopted by frameworks like TVM [21]. Compute represents the mathematical description of tensor operations, such as addition, convolution, or matrix multiplication, while schedule defines how these operations are executed on hardware. A schedule specifies program transformations, including loop tiling, vectorization, and unrolling, to optimize performance for specific hardware architectures. This decoupling simplifies the representation of tensor computations, enabling flexible optimization strategies tailored to different backends.

Recent advancements [22–24,44] in deep learning compilers focus on leveraging hardware intrinsics to further optimize tensor programs. By integrating intrinsic-specific mapping abstractions, these compilers can directly utilize the specialized instructions of DLAs, such as NVIDIA's Tensor Cores or CPU's AVX512, to achieve higher computational efficiency. These developments mark a shift from general-purpose optimizations to hardware-aware designs, laying the foundation for intrinsic-based mapping strategies.

2.2. Intrinsic-based mapping abstraction

The development of DLAs has led to the creation of specialized instructions [45–48], known as intrinsics, designed to enhance the computational efficiency of tensor operations. These instructions serve as essential interfaces between hardware and compilers, enabling optimized execution of key operations like matrix multiplication and data movement.

Intrinsics provide an efficient mechanism for managing kernel operations in tensor programs, typically categorized into compute intrinsics for performing computations and memory intrinsics for data handling [22]. For example, NVIDIA Tensor Cores [17–19] and CPU AVX512 [20] offer specialized intrinsics that accelerate matrix and vector operations, respectively, facilitating high-performance computation across various accelerators.

Intrinsic-based mapping abstraction further unifies tensor program optimization by representing diverse intrinsic behaviors in a common, analyzable form. Frameworks like AMOS [22] and TensorIR [23] leverage this approach to directly map software operations to hardware intrinsics, supporting automated generation and transformation of tensorized programs. This abstraction broadens the search space for high-performance configurations by identifying fundamental software-to-hardware mappings, thus enhancing optimization potential across different hardware backends.
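The compute-schedule separation discussed in Section 2.1 can be sketched in plain Python. This is an illustration of the idea only, not the Halide or TVM APIs: the "compute" declares what is calculated, while two interchangeable "schedules" decide how the loop nest executes, and both must produce identical results.

```python
def matmul_compute(i, j, A, B, K):
    # Compute definition: C[i, j] = sum over k of A[i, k] * B[k, j].
    return sum(A[i][k] * B[k][j] for k in range(K))

def schedule_naive(A, B, M, N, K):
    # Schedule 1: plain row-major loop nest.
    return [[matmul_compute(i, j, A, B, K) for j in range(N)] for i in range(M)]

def schedule_tiled(A, B, M, N, K, tile=2):
    # Schedule 2: loop tiling over i/j, the kind of transformation a compiler
    # applies for cache locality; the compute definition is unchanged.
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for i in range(i0, min(i0 + tile, M)):
                for j in range(j0, min(j0 + tile, N)):
                    C[i][j] = matmul_compute(i, j, A, B, K)
    return C

# Both schedules realize the same compute.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert schedule_naive(A, B, 2, 2, 2) == schedule_tiled(A, B, 2, 2, 2)
```

The point of the decoupling is exactly this: tiling, vectorization, or unrolling can be swapped per backend without touching the mathematical definition.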
ates efficient tensor programs through predefined scheduling primi- Name Mapping Method tives, streamlining tensor program creation without user-defined tem- ❶ plates. This approach leverages scheduling techniques like loop tiling, AutoTVM Hand-written templates + Tuning fusion, and vectorization, as demonstrated by frameworks like An- Triton Hand-written templates ❷ sor [25], which automatically create search spaces using these rules. Tiramisu Polyhedral model This method simplifies tensor program generation in deep learning AKG Polyhedral model + Templates applications. However, it also has limitations: users must ensure that ❸ the predefined rules align with the specific operators and hardware, or Ansor Generated rules + Tuning the generated programs may fail to achieve optimal performance. XLA Templates and rules Heron Constraint-based rules + Tuning ❹ Analyzable abstraction mapping: Analyzable abstraction map- MetaSchedule Generated rules + Tuning ping [22,23,44,58,59] unifies tensor program optimization by abstract- ❹ ing diverse hardware intrinsic behaviors into a common representation, UNIT Analyzable abstraction + Tuning facilitating efficient mapping and transformation of tensorized pro- ROLLER Tile abstraction + Construction policy AMOS Analyzable abstraction + Tuning grams. Examples like AMOS and TensorIR establish direct mappings TensorIR Analyzable abstraction and generated rules + Tuning between software and hardware, guiding the automated generation ❺ of tensorized programs. This approach broadens the scope of explo- Hidet Task-mapping + Post-scheduling fusion ration by identifying foundational software-to-hardware combinations, EINNET Derivation-based + Tuning increasing the potential for discovering optimized mappings. 
TensorMap Reinforcement learning + Tuning ❺ Other mapping: Other mapping methods [13,40,60,61] reformu- GTA Analyzable abstraction and generated rules + Tuning late deep learning optimization problems using strategies from other domains to enhance efficiency. For example, CoSA [56] and Heron [24] convert the scheduling space search into a constrained optimization leverage this approach to directly map software operations to hard- problem and leverage solvers to rapidly explore the space. Alterna- ware intrinsics, supporting automated generation and transformation tively, TLM [62] and Soter [63] treat tensor program exploration as of tensorized programs. This abstraction broadens the search space for a language model generation task, where tensor programs are rep- high-performance configurations by identifying fundamental software- resented as sequences and tunable parameters as language tokens. to-hardware mappings, thus enhancing optimization potential across Specifically, they leverage a large language model (LLM) to generate different hardware backends. these tokens for tunable parameters, enabling efficient exploration of mapping schemes and more effective optimization of tensor programs. 2.3. Tensor program generation strategy Building on this foundation, we reviewed five primary mapping approaches used for deep learning accelerators: hand-written, rule- In Table 1, we summarize state-of-the-art compiler mapping tech- based, polyhedral model, analyzable abstraction, and other mapping niques used to generate optimized tensor programs on hardware accel- methods. Each approach brings unique advantages—hand-written and erators. 
Most existing compilers leverage programmable intrinsics as rule-based mappings allow fine-tuned performance but require exten- part of their mapping strategy, enabling developers to focus on high- sive manual intervention or rigid predefined rules, while polyhedral level optimization while the compiler handles low-level architectural and analyzable abstraction mappings offer more automated solutions details. These mapping methods streamline tensor program generation but are challenged by complexity and limited applicability. Methods by abstracting hardware-specific operations, thereby enhancing both borrowing from other domains, such as optimization solvers and lan- efficiency and portability. guage models, open new directions but may lack consistency across Specifically, we categorize the state-of-the-art compilers/mappers diverse hardware. In summary, intrinsic-based mapping abstraction of- for DLAs into five main approaches: fers a unified framework for optimizing tensor programs across diverse ❶ Hand-written mapping: Hand-written mapping [29,49] requires hardware accelerators by abstracting hardware intrinsic behaviors into developers to manually define mappings for tensorized programs using a common representation. Systems like AMOS and TensorIR leverage compiler-provided tensorize interfaces. This approach enables fine- this approach to enable efficient and adaptable mappings for tensorized grained optimization, especially for specialized hardware like NVIDIA programs. Tensor Cores. However, it demands significant expertise and high Despite these advances, significant challenges remain in achieving development costs, as developers must continually rewrite templates flexible, high-performance mappings that are adaptable to new hard- to support new operators and accelerators [50–52]. 
While hand-written ware accelerators, such as the inefficiency of existing approaches in mapping can achieve high performance for specific workloads, its lack handling diverse architectural constraints and their inability to effec- of scalability and adaptability limits its effectiveness compared to more tively explore large and complex search spaces. To better illustrate our automated methods. motivation, we present an example to illustrate the specific challenges ❷ Polyhedral model mapping: Polyhedral model mapping [28,32, within existing analyzable abstraction mapping systems, motivating the 53–56] provides a powerful strategy for optimizing tensor programs by development of our approach. restructuring execution and managing complex memory dependencies. Mapping intrinsic instructions onto hardware accelerators poses In the realm of tensor program compilation, this approach plays a significant challenges due to the vast number of possible configurations critical role in handling intricate memory structures and optimizing ex- and their impact on performance. The process of selecting the optimal ecution. For example, AKG [32] leverages polyhedral scheduling to re- mapping for intrinsic instructions, such as those used in Tensor Cores, is structure execution order through new linear relationships, effectively complex, given the numerous potential mapping candidates. Each map- eliminating inter-loop dependencies. This method is particularly advan- ping choice can critically affect performance factors like data locality tageous for hardware like TPUs, where enhancing parallel computation and parallelism. For example, as shown in Table 2, AMOS identified 35 is essential. By exploring a broader range of affine transformations distinct ways to map the seven loops of a 2D convolution onto the 3D compared to methods such as TVM [21], polyhedral mapping optimizes loops of the Tensor Core. Exhaustively exploring all configurations is in- performance for diverse workloads. 
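To see why the intrinsic mapping space grows so quickly, consider a toy enumeration of candidate bindings. This is our own simplification for illustration, not AMOS's actual matching logic: each Tensor-Core-style index (i1, i2, r1) must be bound to a compatible software loop of a 2D convolution, and even one simple binding rule yields a multiplicative number of candidates.

```python
from itertools import product

space_loops = ["n", "k", "p", "q"]    # conv2d space loops (simplified subset)
reduction_loops = ["rc", "rr", "rs"]  # conv2d reduction loops

def enumerate_mappings():
    # Toy rule: i1 and i2 take distinct space loops; r1 takes a reduction loop.
    candidates = []
    for i1, i2 in product(space_loops, space_loops):
        if i1 == i2:
            continue
        for r1 in reduction_loops:
            candidates.append({"i1": i1, "i2": i2, "r1": r1})
    return candidates

cands = enumerate_mappings()
print(len(cands))  # 12 ordered space-loop pairs x 3 reduction loops = 36
```

Each of these candidates would need a full program-generation-and-measurement cycle under an exhaustive strategy, which is exactly the cost the dual-task scheduler tries to avoid.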
Table 2: Mapping candidate choices. This example maps 2D convolution indices to Tensor Core indices (type: float16). Space loops: n, k, p, q, i1, i2; reduction loops: rc, rr, rs, r1. The mapping choices can be categorized into basic mapping and complex mapping. Basic mapping selects only one choice at a time, while complex mapping combines multiple choices into a mixed mapping.

         mapping1  mapping2  mapping3  mapping4  mapping5  mapping6  mapping7
i1       n         n         n         p         p         q         q
i2       k         k         k         k         k         k         k
r1       rc        rr        rs        rc        rs        rc        rr
Choices  0/1       0/1       0/1       0/1       0/1       0/1       0/1

Fig. 2. The compilation flow of GTA. t_n denotes the nth non-intrinsic main-task (blue box), and t_nk denotes the kth mapping candidate of the nth intrinsic-enabled main-task (orange box). All mapping candidates are ranked and executed based on performance feedback. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

A second challenge lies in the scheduling of tensor programs, which often lacks consideration of DLA intrinsics. Existing systems do not sufficiently incorporate these intrinsics when generating the scheduling search space, limiting their ability to optimize tensorized programs for specialized hardware. To address this, a more comprehensive approach to scheduling is needed, integrating primitives like tiling, fusion, and vectorization that are tailored to the unique characteristics of DLAs. Without such a targeted approach, the scheduling search space cannot fully leverage the potential of available mappings, thereby constraining the system's capacity to produce high-performance programs.

3. GTA overview

To address the aforementioned issues, we propose GTA, a compilation framework designed to automatically generate high-performance tensorized programs for specialized hardware. As shown in Fig. 2, it takes deep neural networks (DNNs) as input, converting them into computation graphs represented as directed acyclic graphs (DAGs). In these graphs, each node corresponds to a tensor operation, and each edge denotes a producer–consumer relationship between operations. To handle the complexity of large computational graphs, GTA partitions the DNN's computation graph into smaller, manageable subgraphs using Relay's operator fusion algorithm, which has minimal performance impact due to the layer-by-layer structure of DNNs (t_1, t_2, …, t_n in Fig. 2).

To maximize performance across multiple subgraphs, GTA dynamically prioritizes the subgraphs and mapping candidates most likely to enhance end-to-end efficiency. It uses a dual-task scheduling approach (detailed in Section 4) that allocates tuning time at both the subgraph and mapping-candidate levels. By allocating varying amounts of time to different subgraphs and probabilistically discarding less efficient candidates based on performance feedback, dual-task scheduling avoids wasting tuning resources on low-impact mappings.

Additionally, resource-constrained rules (explained in Section 5) guide program generation on both DLAs and general-purpose accelerators. GTA designs these rules by abstracting common architectural characteristics across DLAs, such as coarse-grained hardware intrinsics (e.g., WMMA in Tensor Core) and dedicated scratchpad memory (e.g., the Unified Buffer in TPU). This design allows GTA to efficiently leverage hardware-specific features, optimizing tensorized programs to fully exploit the underlying hardware capabilities.

4. Dual-task scheduling

Most existing compiler frameworks adopt a performance-aware tuning strategy to fine-tune generated programs, a method proven effective by systems such as Ansor and AMOS. For example, Ansor refines its cost model by updating task weights based on feedback from each search iteration, while dynamically allocating subgraph trials. Building on this approach, when multiple intrinsic instruction mapping options are available, feeding the performance results of each mapping back into the front-end further enhances the framework by enabling seamless co-design between the front-end and back-end stages.

To optimize tuning resource allocation, a DNN can be decomposed into multiple independent subgraphs (e.g., conv2d + ReLU). For some subgraphs, spending time on tuning may not significantly improve overall network performance. This may occur when a subgraph is not a performance bottleneck, or when tuning yields only marginal gains. Similarly, a subgraph may have multiple intrinsic mapping candidates, but further tuning on certain mappings may not result in meaningful improvements. This is often because certain mapping schemes exhibit inefficient memory access patterns, limiting their ability to leverage the unique features of the underlying hardware and thereby restricting the potential for significant performance gains.

To illustrate the dual-task scheduling (DTS) process, we use ResNet18 as an example. After splitting ResNet18 into subgraphs, there are 24 unique subgraphs, most of which are convolution layers with varying shape configurations (e.g., input size, kernel size, stride). Following Ansor's task scheduling methodology, we define a task as the process of generating high-performance programs for each subgraph. Thus, optimizing a single DNN like ResNet18 requires completing multiple tasks (e.g., 24 tasks for ResNet18).

To efficiently allocate tuning resources across these tasks, GTA employs a DTS approach. This method dynamically assigns varying amounts of time to different subgraphs and probabilistically discards inefficient mapping candidates based on program performance feedback. DTS operates on two levels, the subgraph level and the mapping-candidate level, helping GTA focus tuning resources on the most impactful configurations and avoid spending time on low-impact mappings.

As shown in Fig. 1, DTS iteratively allocates tuning resources to different tasks. In each round, the first step selects a subgraph for program generation, and GTA generates a set of intrinsic-compatible mapping candidates for the intrinsic-enabled task_i. This effectively breaks the main-task into several sub-tasks (shown by the orange box in Fig. 1). The second step then generates a batch of promising programs for these sub-tasks and measures their performance on hardware. Each round is defined as one unit of time resource. When a time resource is allocated to a task, the task gains the opportunity to generate and measure new programs, increasing the chance of discovering better-performing ones. In the following sections, we introduce the formulation of the scheduling problem and our solution.
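The two-level round structure of dual-task scheduling can be sketched as a small driver loop. The function and field names below are hypothetical (not GTA's actual API); the selection and sampling policies are injected so this skeleton stays independent of any particular gradient or probability scheme.

```python
def run_rounds(tasks, trials, measure_num, pick_task, sample_mappings, measure):
    # tasks: list of {"id": ..., "maps": [...]} main-task records.
    # pick_task: chooses one main-task (subgraph) per round.
    # sample_mappings: returns the sub-tasks (mapping candidates) that
    #   survive probabilistic sampling for that main-task.
    # measure: compiles/runs one (task, mapping) pair and returns a latency.
    best = {t["id"]: float("inf") for t in tasks}
    used = 0
    while used < trials:
        task = pick_task(tasks, best)                # step 1: choose a main-task
        for mapping in sample_mappings(task, best):  # step 2: surviving sub-tasks
            latency = measure(task, mapping)
            best[task["id"]] = min(best[task["id"]], latency)
        used += measure_num                          # one round = one time unit
    return best

# Toy usage: two tasks, latency equals the mapping's cost, and the round
# always picks the currently worst (highest-latency) task.
tasks = [{"id": 0, "maps": [3.0, 1.0]}, {"id": 1, "maps": [2.0]}]
best = run_rounds(
    tasks, trials=4, measure_num=2,
    pick_task=lambda ts, b: max(ts, key=lambda t: b[t["id"]]),
    sample_mappings=lambda t, b: t["maps"],
    measure=lambda t, m: m,
)
```

In the toy run, round 1 tunes task 0 (finding latency 1.0) and round 2 tunes task 1, after which the trial budget is exhausted.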
4.1. Problem formulation

In defining the scheduling problem, we divide DTS into two types of tasks: main-tasks and sub-tasks. In this framework, a DNN can be split into several subgraphs (main-tasks). If the computation type, data type, and computation shape of a main-task meet the requirements for utilizing hardware intrinsic resources, multiple intrinsic mapping candidates are generated for that main-task. Each of these intrinsic mapping candidates is referred to as a sub-task. A main-task represents the process of generating high-performance programs for a subgraph, meaning that optimizing a single DNN requires completing dozens of main-tasks. The related notation used in this paper is shown in Table 3.

Table 3: Notations.

Notation    Description/Definition
Main-task   Subgraph process for generating high-performance programs
Sub-task    Intrinsic mapping candidate satisfying hardware constraints
Δt          Small backward window size
N_i         The set of tasks similar to task i
C_i         The number of floating-point operations in task i
V_k         The number of floating-point operations per second achievable in task k
B_latency   Best mapping latency set of the tasks
B_task      Best mapping task set of the DNN
C_samples   Samples selected from all mappings
C_trials    Current number of trials
C_mapping   Current mapping selection
G           Native neural network
m_i(t)      Minimum execution time for the ith task
m_ik(t)     Execution time of the kth mapping of task i
T_latency   Latency set of all tasks
M_candi     Set of all mapping candidates
α_k         Sampling probability of mapping k
β           Hyperparameter for increasing probability
ω_i         Number of appearances of task i in the network

We define m_i(t) as the minimum execution time required for the ith main-task at time t, and m_ik(t) as the execution time of the kth mapping scheme for the ith main-task. The optimal execution time for subgraph i is represented as min(m_i1(t), m_i2(t), …, m_ik(t)). The end-to-end execution time of the entire network, denoted by G(m_1(t), m_2(t), …, m_n(t)), represents the aggregate time across all main-tasks. Our objective is to minimize this function to achieve the lowest possible overall execution time for the DNN. Thus, the objective function is defined as:

f(G) = Σ_{i=1}^{n} ( ω_i × max(β(α_1·m_i1(t), α_2·m_i2(t), …, α_k·m_ik(t))) )    (1)

Let ω_i denote the number of appearances of main-task i in the network, where i is the main-task index. If a main-task has already met its latency requirement, no additional tuning resources are allocated to it. The variable α_k represents the sampling probability assigned to sub-task k. Unlike other frameworks, our approach introduces probabilistic allocation for intrinsic mapping candidates (sub-tasks). Once performance feedback for all mapping candidates of a subgraph is received, sampling probabilities are assigned based on time cost: candidates with lower time costs are assigned higher probabilities, while those with higher time costs receive lower probabilities. We also introduce a hyperparameter β to adjust the sampling probabilities of specific mapping candidates, helping to avoid convergence on locally optimal solutions.

4.2. Optimizing with gradient and probability

Inspired by the gradient descent-based task scheduling approach presented in [25], we propose a DTS algorithm (Algorithm 1) that combines gradient descent with probability-based selection to efficiently optimize the objective function. Starting from the current allocation t, the algorithm approximates the gradient of the objective function, ∂f/∂t_i, and identifies the primary task i by maximizing the absolute gradient, i = argmax_i |∂f/∂t_i|. This gradient approximation serves as the foundation for selecting the main-task with the highest potential impact.

∂f/∂t_i ≈ η · (Δm/Δt) + (1 − η) · min( −m_i(t_i)/t_i , θ · ( C_i / max_{k∈N_i} V_k − m_i(t_i) ) )    (2)

where Δm = m_i(t_i) − m_i(t_i − Δt) and the other variables are defined in Table 3. The parameters η and θ control how much weight is placed on each prediction.

Algorithm 1: Dual-Task Scheduling
Input: G: native deep learning neural network; target: target hardware platform; trials: total tuning count; MEASURE_NUM: number of measurements per round
Output: best_tasks: best-performing tasks
 1  Function dual_scheduling
 2    Initialize local variables B_latency, B_task, T_latency, C_task, C_samples;
 3    tasks = extract_tasks(G, target);
 4    while C_trials < trials do
 5      tid = gradient_scheduling(tasks, T_latency);
 6      M_candi = match_intrinsic(tasks[tid], target);
 7      if M_candi not NULL then
 8        for C_mapping in M_candi do
 9          if C_samples then
10            if C_mapping not in C_samples then
11              continue;
12            end
13          end
14          latency = tasks[tid].tune(C_mapping);
15          T_latency.append(latency);
16          if latency < B_latency[tid] then
17            B_latency[tid] = latency;
18            B_task[tid] = tasks[tid];
19          end
20        end
21        C_samples = probability_sample(T_latency);
22        C_trials += MEASURE_NUM;
23      end
24    end
25    return B_task;

GTA initializes the algorithm with t = 0 and begins with a round-robin warm-up phase, resulting in an initial allocation vector of t = {1, 1, …, 1}. After the warm-up, as shown in line 5 of Algorithm 1, the gradient for each main-task is computed, and the main-task with the maximum absolute gradient, i = argmax_i |∂f/∂t_i|, is selected. A tuning time unit is then allocated to this main-task, updating its allocation to t_i = t_i + 1. The optimization process continues until the tuning time budget is exhausted.

Afterward, GTA searches for a hardware intrinsic that matches the specified main-task. Once a suitable set of hardware intrinsics is identified, tensor programs are generated for all mapping candidates, serving as a warm-up for the sub-tasks. This warm-up allows GTA to select the most promising mapping candidates by assigning probabilities based on their performance feedback. In subsequent rounds, only mapping candidates prioritized by their previously assigned probabilities are executed. This selective exploration avoids spending time on inefficient candidates, enhancing tuning efficiency and allowing higher-potential candidates more opportunities for optimization.

Table 4: Resource-constrained rules and related conditions.

No.  Rule                Condition
R1   Multi-Level Tiling  HasDataReuse(R, i) & HasMultiLevelCache(R, i)
R2   Set Multi-Scope     HasDataReuse(R, i) & HasMultiScopeCache(R, i)
R3   Fuse Main Op        HasStagesFused(R)
R4   Fuse Output Op      HasStagesFused(R)
R5   AddMemLimit         HasDSM(R) [a]
...  Ansor-defined rules [b]

[a] DSM: dedicated scratchpad memory. [b] Ansor [25].
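The gradient approximation of Eq. (2) and the argmax selection of line 5 can be sketched in a few lines of Python. This is an illustrative sketch: the η and θ defaults and the per-task state format are our assumptions, not GTA's actual implementation.

```python
def approx_gradient(history, t_i, delta_t, C_i, V_max, eta=0.5, theta=1.0):
    # history: observed latencies m_i(t) at successive allocations for task i.
    m_now = history[-1]
    m_prev = history[-1 - delta_t] if len(history) > delta_t else m_now
    backward = (m_now - m_prev) / delta_t          # Δm/Δt: recent improvement
    forward = min(-m_now / t_i,                    # optimistic continued gain
                  theta * (C_i / V_max - m_now))   # bounded by peak throughput
    return eta * backward + (1 - eta) * forward

def select_main_task(task_states):
    # Pick the task with the maximum absolute approximate gradient.
    grads = {tid: abs(approx_gradient(**s)) for tid, s in task_states.items()}
    return max(grads, key=grads.get)

states = {
    "a": {"history": [10.0, 8.0], "t_i": 2, "delta_t": 1, "C_i": 100.0, "V_max": 100.0},
    "b": {"history": [5.0, 5.0], "t_i": 2, "delta_t": 1, "C_i": 100.0, "V_max": 100.0},
}
print(select_main_task(states))  # "a": still improving, so largest |gradient|
```

Task "a" is both slower and still improving between measurements, so its gradient magnitude dominates and it receives the next tuning time unit.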
This selective exploration avoids spending time on inefficient candidates, enhancing tuning efficiency and allowing higher-potential candidates more opportunities for optimization.

The probability_sample algorithm, called in line 21 of Algorithm 1, probabilistically selects mapping candidates for further analysis and optimization. We first introduce the notation: let R = {r_1, r_2, …, r_n} represent the set of all mapping results, where r_i denotes the i-th result with a performance value V(r_i). The total weight W is calculated from each result's inverse performance value, normalized with respect to the maximum in R, as follows:

    W = Σ_{r_i∈R} [ (1/V(r_i)) / max_{r_j∈R} (1/V(r_j)) ]

This ensures that weights are scaled relative to the most performant candidate in the result set R. Using this normalized total weight W, the initial probability assigned to each result r_i is given by:

    P(r_i) = [ (1/V(r_i)) / max_{r_j∈R} (1/V(r_j)) ] / W

To encourage exploration, the algorithm applies a probability increase factor β to selected results. The probability adjustment is defined by weighting the original probability P(r_i) with an exploration boost:

    P′(r_i) = (1 + β_i)·P(r_i) / Σ_{r_j∈R} (1 + β_j)·P(r_j)

Here, β_j is a task-specific exploration factor, applied selectively to candidates r_j, where β_j = β for selected candidates and β_j = 0 otherwise. The inclusion of the initial probability P(r_j), derived from each candidate's performance value V(r_j), serves as the foundation of the adjusted probabilities. This ensures that P′(r_i) retains the relative importance of each candidate while allowing selective exploration through β_j.

The normalization term, Σ_{r_j∈R} (1 + β_j)·P(r_j), ensures that the adjusted probabilities remain valid and sum to 1. By combining the task-specific exploration factor with the initial performance-weighted probability P(r_j), this formula balances exploitation of high-priority candidates with exploration of less performant options. Furthermore, P(r_j) prevents the adjustment from overly concentrating on a small subset of candidates, promoting diversity and fairness across the result set R.

Finally, the algorithm selects the top N results based on the adjusted probabilities P′(r_i). The selection process is expressed as:

    {r_i}_{i=1}^{N} = Top_N( P′(r_1), P′(r_2), …, P′(r_n) ),    (3)

where N is dynamically determined as a fraction of the total result set R, denoted by N = ⌈κ·|R|⌉, and κ ∈ (0, 1] is a user-defined parameter controlling the selection size.

5. Resource-constrained rules

Existing exploration-based methods face significant challenges in both performance and scalability, primarily due to two factors. First, although the design space is vast, it contains numerous inefficient kernels. For example, in the GEMM operation with dimensions 512 × 768 × 3072 (used in GPT-1 on Tensor Core), the kernel space size reaches O(10^16), with over 90% of the kernels being inefficient [63,64]. Second, current approaches are largely tailored to general-purpose processors and lack consideration for specific architectural constraints. This highlights the need to construct a high-quality kernel design space to effectively reduce inefficient exploration and improve overall performance.

To address these challenges, GTA's implementation of resource-constrained generation rules builds on existing open-source code for DLAs and general-purpose accelerators [22,25]. In particular, the DLA-specific rules are adapted to leverage hardware intrinsics and dedicated scratchpad memory (DSM) efficiently. From a programmer's perspective, DLAs, in contrast to general-purpose accelerators, feature coarse-grained hardware intrinsics (e.g., WMMA in Tensor Core) and user-programmable DSM (e.g., the Unified Buffer in TPU). Based on these existing implementations, we made targeted modifications to better align the rules with the search strategies and optimization methods proposed in this work. Table 4 summarizes five key generation rules that GTA employs to optimize data movement, operation fusion, and memory management in DLAs. Each rule addresses specific challenges to enhance computational efficiency and resource utilization.

Table 4
Resource-constrained rules and related conditions.

No. | Rule               | Condition
R1  | Multi-Level Tiling | HasDataReuse(R, i) & HasMultiLevelCache(R, i)
R2  | Set Multi-Scope    | HasDataReuse(R, i) & HasMultiScopeCache(R, i)
R3  | Fuse Main Op       | HasStagesFused(R)
R4  | Fuse Output Op     | HasStagesFused(R)
R5  | AddMemLimit        | HasDSM(R)^a
... | Ansor Defined Rule^b | ...

^a DSM: dedicated scratchpad memory.
^b Ansor [25].

The following is a detailed description of each rule:

Rule R1 generates multiple nodes for data movement between different levels of on-chip DSMs. To apply this rule, GTA first checks for data reuse opportunities and verifies whether the DLA has multiple DSM levels (e.g., Tensor Core provides two levels of DSMs for WMMA fragments and shared memory). If these conditions are met, GTA inserts cache_read primitives for the node and its producers to facilitate data movement.

Rule R2 marks the data storage scope for each operation within the DSM hierarchy. To apply this rule, GTA first checks for data reuse opportunities and verifies whether the DLA provides multiple DSM scopes for different data types. If these conditions are satisfied, GTA assigns cache_write primitives to the node and cache_read primitives to its producers, ensuring that data is efficiently stored and accessed within the appropriate DSM levels.

Rule R3 enables the fusion of main operations within a subgraph by identifying opportunities to combine operations with shared data dependencies. This reduces data movement overhead and improves computational efficiency. When multiple stages are fused, GTA inserts the appropriate primitives to implement the fusion, streamlining the execution flow.

Rule R4 focuses on fusing output operations within a computational graph. Similar to Rule R3, it targets operations that can be combined to minimize data transfer costs and enhance throughput. By analyzing the data flow between operations, GTA inserts the necessary primitives to achieve output fusion, resulting in a more compact and efficient execution structure.

Rule R5 constrains memory usage for operations that utilize dedicated scratchpad memory (DSM). By evaluating each operation and its memory requirements, GTA ensures that memory limits are respected, preventing allocations from exceeding hardware capacity, which could lead to inefficient execution. This rule helps maintain an efficient memory allocation strategy, optimizing overall resource utilization.

Fig. 3. An illustrative example of tensorized program generation for a GEMM-ReLU operator, demonstrating the transformation of the input program from a mathematical expression (p_0) to a tensor expression (p_1) written in a domain-specific language using TVM. The process further includes intrinsic matching based on the type and shape of the input operator to select and generate intrinsic mapping candidates, followed by the application of resource-constrained rules to guide the creation of a tensorized program sketch (p_2).

An example. Fig. 3 illustrates how resource-constrained rules are applied during tensorized program generation. Starting from the input program written as a mathematical expression (p_0), the process converts it into a tensor expression (p_1) using a domain-specific language (DSL) in TVM. The intrinsic matching step leverages the compute abstraction and memory abstraction proposed in AMOS [22] to complete the software-hardware mapping generation. This process selects and generates intrinsic mapping candidates by analyzing the operator's computation type, data type, and memory access patterns based on its shape and hardware-specific constraints. Subsequently, the resource-constrained rules play a critical role in guiding the generation of the tensorized program sketch, ensuring efficient utilization of hardware intrinsic functions while respecting memory and architectural constraints. Specifically, the derivation of the generated rules and the transformed program can be expressed as:

    input p_1 → M_cand_i → o(S_0, i = 3) →^{R2} o(S_1, i = 3) →^{R1} o(S_2, i = 2) →^{R3} ⋯ →^{R5} output p_2    (4)

We define the state as o = (S, i), where S represents the current partially generated sketch program for the DAG, and i denotes the index of the node currently being transformed. For each rule, if its application conditions are met, the rule is applied to o = (S, i), resulting in a new state o′ = (S′, i′), where i′ ≤ i. This ensures that the index i (indicating the transforming node) decreases monotonically. A state reaches the terminal condition when i = 0. During the enumeration process, multiple rules may be applicable to a single state, generating several succeeding states. Additionally, a single rule can produce multiple succeeding states under certain conditions.

6. Implementation

In this section, we delve into the technical details of our implementation. GTA extends TVM, an end-to-end deep learning compiler, to support loop scheduling and generate high-performance programs with intrinsic instructions.

Task Generation. To mitigate the issue of search space explosion, compilers typically divide the large computational graph of a DNN into smaller subgraphs. Notably, for some subgraphs, spending time on tuning may not significantly enhance the end-to-end performance of the DNN. In this work, we adopt TVM's subgraph partitioning strategy to divide the input DNN into multiple smaller subgraphs, referred to as main-tasks. A main-task is the process executed to generate high-performance programs for a subgraph. TVM categorizes operators into four types: injective (e.g., add operations), reduction (e.g., sum operations), complex-out-feasible (e.g., matrix multiplication, where element-wise mappings can fuse to the output), and opaque (e.g., sort operations, which cannot be fused). Subgraph fusion is then performed based on predefined generic rules.

Mapping Generation and Scheduling. At each iteration, based on the intrinsic mapping generation approach described in AMOS [22], main-tasks are classified into intrinsic-disabled and intrinsic-enabled tasks. For intrinsic-disabled main-tasks, we adopt Ansor's [25] compilation optimization to generate programs. In contrast, for intrinsic-enabled main-tasks, GTA optimizes task scheduling based on gradients and probabilities. This algorithm prioritizes subgraphs with higher potential for performance improvement, allocating them more tuning opportunities while reducing effort on less promising mapping candidates based on performance feedback. GTA slices the tuning time and prioritizes important subgraphs and intrinsic mapping candidates, meaning that not all main-tasks and sub-tasks will be executed. For example, an intrinsic-enabled main-task_i may contain both retained and discarded mapping candidates. The former proceed to subsequent tensor program optimization and tuning, while the latter do not participate in further optimization unless they are selected in the next scheduling round.

Search Space Exploration. Subsequently, GTA applies the resource-constrained rules and existing derivation rules (Table 4) to each subgraph under the guidance of a genetic algorithm [25]. During this process, tens of thousands of tensor programs are generated, and a cost model is employed to filter out the most promising candidates with near-optimal performance.
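The probability_sample selection described in Section 4.2 can be sketched directly from the formulas for W, P(r_i), P′(r_i), and Top_N. This is an illustrative reimplementation, not GTA's code; the function signature, the `selected` set, and the β/κ values are assumptions for the example, and performance values V are treated as latencies (lower is better).

```python
import math

def probability_sample(results, selected, beta=0.5, kappa=0.5):
    """Pick the top N = ceil(kappa * |R|) mapping candidates by boosted probability.

    results  : list of (candidate_id, V) pairs; V is the measured performance
               value, here interpreted as latency (lower is better)
    selected : set of candidate ids that receive the exploration boost beta
    """
    # Inverse performance values, normalized to the best (maximum 1/V) candidate.
    inv = {cid: 1.0 / v for cid, v in results}
    best = max(inv.values())
    weights = {cid: w / best for cid, w in inv.items()}

    # Initial probabilities P(r_i) from the normalized total weight W.
    W = sum(weights.values())
    P = {cid: w / W for cid, w in weights.items()}

    # Exploration boost: beta_j = beta for selected candidates, 0 otherwise,
    # then renormalize so the adjusted probabilities P'(r_i) sum to 1.
    boosted = {cid: (1.0 + (beta if cid in selected else 0.0)) * p
               for cid, p in P.items()}
    Z = sum(boosted.values())
    P_adj = {cid: b / Z for cid, b in boosted.items()}

    # Top-N selection, N = ceil(kappa * |R|).
    n = math.ceil(kappa * len(results))
    return sorted(P_adj, key=P_adj.get, reverse=True)[:n]

cands = [("m0", 2.0), ("m1", 4.0), ("m2", 8.0), ("m3", 3.0)]
keep = probability_sample(cands, selected={"m2"}, beta=0.5, kappa=0.5)
print(keep)  # keeps N = ceil(0.5 * 4) = 2 candidates
```

In this toy run the boost on the slowest candidate m2 raises its adjusted probability but not enough to displace the two fastest candidates, which is the intended behavior: the boost encourages exploration without overturning the performance-based ranking.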
These selected candidates are then executed on the target hardware to identify the tensor program with the best performance.

7. Evaluation

7.1. Evaluation platforms

Our experiments were conducted on two distinct hardware platforms to evaluate the performance of the proposed GTA framework:

• NVIDIA GPUs: We performed experiments on two NVIDIA GPUs, the RTX 3060 and the A100, both equipped with Tensor Cores optimized for deep learning tasks. The RTX 3060 represents a consumer-grade GPU, while the A100 is a data-center-grade GPU designed for high-performance computing.
• AMD CPU: We evaluated performance on an AMD Ryzen 7 7840H CPU,^2 which supports advanced SIMD (Single Instruction, Multiple Data) instructions, enabling efficient vectorized computations. This CPU platform provides a competitive environment for testing AVX512-like optimizations in general-purpose processors, allowing us to benchmark GTA's performance on non-GPU hardware.

^2 Intel CPUs also support AVX512 instructions and could be used for similar experiments.

7.2. Evaluated benchmarks

We evaluate the performance of GTA using both deep learning (DL) operators and complete neural network models.

• Operator-Level Evaluation: We select nine widely used operators: general matrix multiplication (GEMM), 1D convolution (C1D), 2D convolution (C2D), 3D convolution (C3D), transposed 2D convolution (T2D), dilated convolution (DIL), batch matrix multiplication (BMM), general matrix–vector multiplication (GEMV), and scan (SCAN). For each operator, we test 6–10 different shape configurations and report the geometric mean of speedups normalized to GTA. The shape configurations are consistent with those used in Ansor and AMOS to ensure a fair comparison.
• Network-Level Evaluation: We benchmark six commonly used neural network models: ResNet18 and ResNet50 [1], BERT (base configuration) [65], MI-LSTM [66], MobileNet-V1 [67], and ShuffleNet [11]. For each model, we evaluate performance with batch sizes of 1 and 16.

7.3. Comparison baselines

Our evaluation compares GTA against three state-of-the-art automatic generation methods (AutoTVM [49], Ansor [25] (v0.8), and AMOS [22] (commit: 0f39742)) as well as two vendor-optimized, hand-tuned libraries (cuDNN (v11.6) and PyTorch (v1.13.1, v2.0.1)):

• AutoTVM: This method uses hand-written templates to support all three selected platforms, demonstrating high performance across a range of baseline operators.
• AMOS: AMOS systematically explores various mappings of loop iterations to DLAs, representing the state of the art for operators with multiple feasible mappings, such as C1D and C2D.
• Ansor: As a leading method for GPU CUDA Core and CPU code generation, Ansor does not support DLAs such as Tensor Core due to architectural limitations. However, comparing GTA with Ansor highlights the benefits of leveraging DLA-specific features in tensor program generation.
• PyTorch: PyTorch, a widely used deep learning framework, serves as a strong baseline for evaluating GTA's ability to outperform standard hand-tuned implementations in practical deep learning applications. Our experiments include both PyTorch 1.13, which relies heavily on vendor-optimized libraries such as cuDNN and cuBLAS for high-performance computation, and PyTorch 2.0, which introduces the TorchInductor compiler.

For a fair comparison, we evaluate AutoTVM, Ansor, AMOS, and GTA with up to 200 measurement trials per test case and report the best performance achieved. For the vendor-optimized libraries on Tensor Core, we use PyTorch, which relies on hand-optimized libraries such as cuDNN to support various types of operators. These optimized libraries serve as strong baseline references for evaluating the performance of GTA.

7.4. Experimental results

We evaluate the performance of GTA on both operators and neural networks, comparing it against several baselines on two DLAs: GPU Tensor Cores and CPU AVX512. To further demonstrate the effectiveness of GTA, we analyze the quality of the generated search spaces and the efficiency of the exploration process. Finally, we highlight how the dual-task scheduling strategy significantly reduces compilation time by dynamically prioritizing subgraphs and mapping candidates, effectively cutting down unnecessary search effort.

7.5. Operator performance

Tensor Core. First, we compare GTA with PyTorch, which relies on hand-optimized libraries such as cuDNN to support various operators. Fig. 4 shows the results for all operators with batch size 1 on the NVIDIA RTX 3060. GTA consistently outperforms PyTorch across all operators, achieving an average 2.44× geometric mean speedup. The speedup is attributed to GTA's comprehensive software-hardware mapping exploration, which contrasts with PyTorch's use of fixed mappings from hand-optimized libraries, often leading to suboptimal performance.

Next, we evaluate performance on the NVIDIA A100 GPU for various operators. As shown in Fig. 9, GTA achieves 1.26×, 5.24×, and 1.93× geometric mean speedups over Ansor, PyTorch, and AMOS, respectively. The significant improvement is due to GTA's ability to effectively utilize the high-performance Tensor Core units through enhanced mapping and scheduling strategies.

We also compare GTA with state-of-the-art compilers on the RTX 3060 using C2D in NCHW layout. We test all convolution layers from ResNet18 (a total of 12 configurations, labeled C0–C11). These configurations are standard benchmarks from well-known networks. The results are shown in Figs. 4, 5, and 6. GTA achieves speedups of 1.85×, 1.76×, and 2.10× over Ansor, AMOS, and hand-tuned PyTorch, respectively. Compared to Ansor, GTA leverages high-performance Tensor Core units alongside efficient auto-scheduling strategies, resulting in better optimization. In contrast to AMOS, GTA employs DTS to efficiently explore the scheduling space, reducing search time while enhancing program performance. Moreover, AMOS cannot utilize resource-constrained rules for shared memory allocation, leading to the generation of some tensor programs that exceed hardware resource limits. This limitation reduces AMOS's ability to find higher-performing programs.

AVX512. On the AMD CPU platform, we utilize the hardware abstraction for AVX512 intrinsics (specifically for matrix–vector multiplication) and apply GTA to generate code for C2D. As shown in Fig. 7, GTA achieves 1.49× and 2.76× performance improvements over Ansor and PyTorch, respectively. GTA's advantage stems from combining high-performance AVX512 intrinsics with efficient auto-scheduling strategies, leading to superior program optimization compared to the baseline methods.

Fig. 4. Single operator performance comparison on NVIDIA RTX 3060.
Fig. 5. Performance comparison of C2D on NVIDIA RTX 3060 with batch size = 1, using all convolution layers from ResNet18 (12 configurations, labeled C0–C11).
Fig. 6. Performance comparison of C2D on NVIDIA RTX 3060 with batch size = 16.
Fig. 7. Performance on AMD Ryzen 7 7840H CPU relative to Ansor and PyTorch.
Fig. 8. Performance of different networks relative to GTA on Tensor Core.
Fig. 9. Performance comparison of GTA across multiple individual operators on the NVIDIA A100 GPU, compared with baseline methods.
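Throughout this section, per-configuration speedups are aggregated with the geometric mean, which, unlike the arithmetic mean, is symmetric under inverting the baseline/candidate ratio. A minimal sketch of the aggregation follows; the latency numbers are hypothetical placeholders, not measurements from the paper.

```python
import math

def geomean_speedup(baseline_ms, candidate_ms):
    """Geometric mean of per-configuration speedups baseline/candidate."""
    ratios = [b / c for b, c in zip(baseline_ms, candidate_ms)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical latencies (ms) for three shape configurations of one operator.
pytorch = [1.8, 4.0, 0.9]
gta     = [1.0, 2.0, 0.6]
print(round(geomean_speedup(pytorch, gta), 3))  # → 1.754
```

A useful property shown by this choice of mean: swapping which system is the baseline exactly inverts the reported speedup, so "speedups normalized to GTA" and "GTA's speedup over the baseline" are mutually consistent.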
Fig. 10. Compilation time overhead and corresponding performance variations under different sampling rates.

7.6. Network performance

Fig. 8 illustrates the performance of GTA on the six evaluated networks. On average, GTA achieves 1.75×, 1.42×, and 1.29× speedups over AMOS, PyTorch 1.13, and PyTorch 2.0 with TorchInductor, respectively. For ResNet18 and ResNet50, GTA finds better mappings for operators, enabling more extensive utilization of Tensor Cores compared to hand-tuned libraries and AMOS's optimized templates. GTA overcomes the limitations of these baselines by generating accurate search spaces that encompass most high-performance programs, along with an efficient search algorithm for finding optimal or near-optimal solutions. The results demonstrate GTA's capability to handle complex operators and effectively leverage Tensor Cores for high performance.

7.7. Compilation time

The search time overhead is a critical factor for practical deployment in deep learning frameworks, as reducing it can significantly enhance usability. To evaluate the efficiency of our dual-task scheduling strategy, we analyze the search time and corresponding performance variations under different sampling rates, specifically comparing GTA at sampling rates of 40% (GTA-0.4), 60% (GTA-0.6), and 100% (GTA-Raw). In this experiment, GTA operates at a sampling rate of 20% (GTA-0.2), representing a highly efficient configuration with minimal search overhead. The results, shown in Fig. 10, demonstrate that as the sampling rate decreases, the search time is significantly reduced while incurring less than a 5% performance degradation on average, thereby achieving an excellent balance between search efficiency and performance.

Additionally, we compare GTA's search time overhead and performance with AMOS, a state-of-the-art compiler designed for DLAs. Our findings reveal that GTA achieves an average performance improvement of 1.88× over AMOS while maintaining significantly lower search time. Specifically, AMOS's average compilation time is approximately five times that of GTA. This substantial reduction in search time underscores the effectiveness of GTA's dual-task scheduling strategy, which optimizes resource allocation during the search process and enables the rapid identification of high-performance tensor programs.

Unlike traditional methods that exhaustively explore all mapping candidates, GTA employs a dynamic prioritization strategy that adaptively allocates tuning resources based on performance feedback. This strategy ensures that the most promising subgraphs and intrinsic mapping candidates are prioritized, while less promising candidates receive fewer tuning opportunities. By combining this with a sampling-based approach, GTA minimizes unnecessary exploration while maintaining high-quality tensor programs. These results underscore GTA's suitability for real-world deployment scenarios, where both rapid code generation and performance optimization are critical. Furthermore, the ability to adjust sampling rates offers flexibility in balancing search time and performance, making GTA a robust solution for optimizing tensor programs across diverse workloads.

8. Related work

In addition to reviewing DLAs, we summarize related work on numeric precision and dynamic shape optimization for deep learning.

Deep learning accelerators. DLAs offer several significant advantages, making them essential for advancing DNN research and deployment. First, DLAs feature large memory capacities, which accommodate the rapidly growing number of parameters in modern models and facilitate efficient training. Second, they provide model-specific optimizations while maintaining a degree of flexibility, enabling tailored performance improvements for various architectures. Third, DLAs support a broad range of data formats, such as FP16, BF16, and INT8, which enhance computational efficiency and reduce memory usage. Finally, DLAs are equipped with a large number of computing units, enabling extensive parallelism to handle the computational demands of DNNs effectively. These characteristics position DLAs as a cornerstone technology for accelerating the training and inference of deep learning models. Following this trend, many emerging accelerators have been proposed, targeting specific algorithms or utilizing new technologies. In academia, the DianNao family [68–71] significantly improves DL computation performance by leveraging specialized functional units, memory hierarchies, and interconnects. Meanwhile, the expansion of DL applications in industry has led hardware vendors (e.g., NVIDIA Tensor Core [17–19] and Intel NNP [72]), internet giants (e.g., Tesla Dojo [73], Huawei Ascend [74], Google TPU [10] and Apple M4 [75,76]), and startups (e.g., Cambricon MLU [77] and Graphcore IPU [78]) to develop various DLAs. Both academic and industrial DLAs are fundamentally domain-specific rather than general-purpose accelerators, inevitably leading to complex and diverse architectural constraints.

Numeric precision optimization. Quantization [79,80], a pivotal technique in deep learning, reduces the numeric precision of weights and activations to enhance computational efficiency and lower resource requirements. By transitioning from high-precision formats such as FP32 to lower-precision formats like FP16, INT8, or even single-bit representations [81,82], quantization enables significant reductions in memory usage and power consumption [83,84]. The progression of hardware architectures aligns with the increasing demand for low-precision computation. For instance, NVIDIA's Turing and Ampere architectures incorporated INT8 and INT4 tensor cores to enhance efficiency, while the latest Hopper architecture has replaced INT4 support with FP8 tensor cores, prioritizing improved numerical precision. These advancements allow large-scale models, including large language models (LLMs) [85], to be deployed on resource-constrained devices like edge devices and DLAs without sacrificing performance. Compilers play a critical role in making quantization effective. Tools like AMOS [22], PreTuner [86] and LADDER [39] introduce advanced optimizations for low-precision data types, including hardware-aware scheduling, loop tiling, and fine-grained scaling strategies. Expanding on existing techniques, an automated approach [87] integrates bit-slicing into the scheduling phase, treating quantization as part of the schedule space. Coupled with program synthesis, this method efficiently generates hardware-specific kernels, supporting diverse quantization configurations and ensuring seamless adaptation to new hardware architectures.

Dynamic shape optimization. Dynamic-shape workloads are characteristic of DNN models whose tensor shapes vary at runtime based on input data, such as the sequence length in Transformer models. These workloads pose substantial challenges for existing autotuning frameworks like TVM, which primarily rely on static input shapes to construct search spaces and cost models. For instance, TVM's second-generation IR, Relay [35], lacks the capability to represent dynamic tensors. While its third-generation IR, Relax [88], introduces symbolic shapes to support dynamic workloads, Relax still depends on hand-written templates for tensor program generation and lacks automatic tuning support. To address these limitations, recent works such as Nimble [89], DietCode [90], FTuner [91], and MIKPOLY [92] have introduced innovative techniques. These approaches construct shape-agnostic search spaces and cost models to optimize dynamic-shape workloads. For example, DietCode effectively groups kernels with varying shapes into unified workloads, enabling efficient tuning as a single entity and significantly reducing overall tuning time. FTuner introduces a uKernel-based approach for dynamic tensors, leveraging hardware-aware constraints to generate high-performance kernel programs and combining uKernels at runtime to optimize padding and execution efficiency. While these advancements mark significant progress, further research is needed to fully exploit the potential of dynamic-shape DNNs on modern hardware accelerators.

9. Conclusion

We propose GTA, a novel compilation framework for high-performance tensor program generation on DLAs. GTA expands the search space by coordinating intrinsic-based automatic mapping abstraction with a rule-based tensor program generation strategy, and applies pruning rules to eliminate ineffective program candidates. Additionally, GTA employs a dual-task scheduling strategy for tensorized programs, effectively reducing tuning effort while enhancing performance. Experimental results on three DLAs show that GTA outperforms state-of-the-art automatic generation approaches and vendor-provided hand-tuned libraries by 1.88× and 2.29×, respectively.

CRediT authorship contribution statement

Anxing Xie: Writing – original draft, Software, Resources, Project administration, Methodology, Investigation, Data curation. Yonghua Hu: Writing – review & editing, Supervision, Investigation, Funding acquisition. Yaohua Wang: Writing – review & editing, Supervision, Methodology, Investigation, Funding acquisition, Formal analysis. Zhe Li: Writing – review & editing, Supervision, Investigation, Formal analysis. Yuxiang Gao: Investigation. Zenghua Cheng: Investigation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable suggestions. This work is supported by the National Key R&D Program of China (No. 2022ZD0119003), Hunan Provincial Natural Science Foundation (No. 2023JJ50019), the Postgraduate Scientific Research Innovation Project of Hunan Province (No. CX20231019) and the National Natural Science Foundation of China (No. 62272477).

Data availability

Data will be made available on request.

References

[1] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[2] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, et al., Evolving deep neural networks, in: Artificial Intelligence in the Age of Neural Networks and Brain Computing, Elsevier, 2024, pp. 269–287.
[3] C.-Y. Wang, I.-H. Yeh, H.-Y. Mark Liao, Yolov9: Learning what you want to learn using programmable gradient information, in: European Conference on Computer Vision, Springer, 2025, pp. 1–21.
[4] A. Vaswani, Attention is all you need, Adv. Neural Inf. Process. Syst. (2017).
[5] P.P. Ray, ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope, Internet Things Cyber-Phys. Syst. 3 (2023) 121–154.
[6] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The llama 3 herd of models, 2024, arXiv preprint arXiv:2407.21783.
[7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
[8] D. Fu, X. Li, L. Wen, M. Dou, P. Cai, B. Shi, Y. Qiao, Drive like a human: Rethinking autonomous driving with large language models, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 910–919.
[9] C. Cui, Y. Ma, X. Cao, W. Ye, Y. Zhou, K. Liang, J. Chen, J. Lu, Z. Yang, K.-D. Liao, et al., A survey on multimodal large language models for autonomous driving, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 958–979.
[10] N.P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., In-datacenter performance analysis of a tensor processing unit, in: Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 1–12.
[11] X. Zhang, X. Zhou, M. Lin, J. Sun, Shufflenet: An extremely efficient convolutional neural network for mobile devices, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
[12] C.-Y. Wang, A. Bochkovskiy, H.-Y.M. Liao, YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7464–7475.
[30] C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, O. Zinenko, MLIR: A compiler infrastructure for the end of Moore's law, 2020, arXiv preprint arXiv:2002.11054.
[31] L. Ma, Z. Xie, Z. Yang, J. Xue, Y. Miao, W. Cui, W. Hu, F. Yang, L. Zhang, L. Zhou, Rammer: Enabling holistic deep learning compiler optimizations with rtasks, in: 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 20, 2020, pp. 881–897.
Cheng, sets new state-of-the-art for real-time object detectors, in: Proceedings of the Z. Li, et al., AKG: automatic kernel generation for neural processing units IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. using polyhedral transformations, in: Proceedings of the 42nd ACM SIGPLAN 7464–7475. International Conference on Programming Language Design and Implementation, [13] Z. Xu, W. Wang, H. Dai, Y. Xu, XFC: Enabling automatic and fast operator 2021, pp. 1233–1248. synthesis for mobile deep learning compilation, J. Syst. Archit. 142 (2023) [33] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, 102921. N. Gimelshein, L. Antiga, et al., Pytorch: An imperative style, high-performance deep learning library, Adv. Neural Inf. Process. Syst. 32 (2019). [14] C. Hao, X. Zhang, Y. Li, S. Huang, J. Xiong, K. Rupnow, W.-m. Hwu, D. Chen, [34] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. FPGA/DNN co-design: An efficient design methodology for IoT intelligence on Ghemawat, G. Irving, M. Isard, et al., {TensorFlow}: a system for {large-scale} the edge, in: Proceedings of the 56th Annual Design Automation Conference machine learning, in: 12th USENIX Symposium on Operating Systems Design and 2019, 2019, pp. 1–6. Implementation, OSDI 16, 2016, pp. 265–283. [15] W. Jiang, L. Yang, E.H.-M. Sha, Q. Zhuge, S. Gu, S. Dasgupta, Y. Shi, J. Hu, Hard- [35] J. Roesch, S. Lyubomirsky, M. Kirisame, L. Weber, J. Pollock, L. Vega, Z. Jiang, ware/software co-exploration of neural architectures, IEEE Trans. Comput.-Aided T. Chen, T. Moreau, Z. Tatlock, Relay: A high-level compiler for deep learning, Des. Integr. Circuits Syst. 39 (12) (2020) 4805–4815. 2019, arXiv preprint arXiv:1904.08368. [16] Z. Xie, M. Emani, X. Yu, D. Tao, X. He, P. Su, K. Zhou, V. Vishwanath, [36] J. Zhao, X. Gao, R. Xia, Z. Zhang, D. Chen, L. Chen, R. Zhang, Z. Geng, B. 
Cheng, Centimani: Enabling fast {AI} accelerator selection for {dNN} training with a X. Jin, Apollo: Automatic partition-based operator fusion through layer by layer novel performance predictor, in: 2024 USENIX Annual Technical Conference, optimization., in: MLSys, 2022. USENIX ATC 24, 2024, pp. 1203–1221. [37] Y. Shi, Z. Yang, J. Xue, L. Ma, Y. Xia, Z. Miao, Y. Guo, F. Yang, L. Zhou, [17] Nvidia, Ampere architecture white paper, 2022, URL: https://www.nvidia. Welder: Scheduling deep learning memory access via tile-graph, in: 17th USENIX com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture- Symposium on Operating Systems Design and Implementation, OSDI 23, 2023, whitepaper.pdf Online (Accessed 13 November 2024). pp. 701–718. [18] Nvidia, Turing architecture white paper, 2022, URL: https://www.nvidia. [38] C. Xia, J. Zhao, Q. Sun, Z. Wang, Y. Wen, T. Yu, X. Feng, H. Cui, Optimizing com/content/dam/en-zz/Solutions/design-visualization/technologies/turing- deep learning inference via global analysis and tensor expressions, in: Proceed- architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Online (Accessed 13 ings of the 29th ACM International Conference on Architectural Support for November 2024). Programming Languages and Operating Systems, Volume 1, 2024, pp. 286–301. [19] Nvidia, Volta architecture white paper, 2022, URL: https://images.nvidia. [39] L. Wang, L. Ma, S. Cao, Q. Zhang, J. Xue, Y. Shi, N. Zheng, Z. Miao, F. Yang, T. com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Online Cao, et al., Ladder: Enabling efficient {low-precision} deep learning computing (Accessed 13 November 2024). through hardware-aware tensor transformation, in: 18th USENIX Symposium on [20] K. Troester, R. Bhargava, AMD next generation ‘‘Zen 4’’ core and 4th Gen AMD Operating Systems Design and Implementation, OSDI 24, 2024, pp. 307–323. EPYC™ 9004 server CPU, in: 2023 IEEE Hot Chips 35 Symposium, HCS, IEEE [40] F. Wang, M. Shen, Y. Lu, N. 
Xiao, TensorMap: A deep RL-based tensor mapping Computer Society, 2023, pp. 1–25. framework for spatial accelerators, IEEE Trans. Comput. (2024). [21] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. [41] Y. Zhao, H. Sharif, V. Adve, S. Misailovic, Felix: Optimizing tensor programs Hu, L. Ceze, et al., {TVM}: An automated {end-to-end} optimizing compiler for with gradient descent, in: Proceedings of the 29th ACM International Conference deep learning, in: 13th USENIX Symposium on Operating Systems Design and on Architectural Support for Programming Languages and Operating Systems, Implementation, OSDI 18, 2018, pp. 578–594. Volume 3, 2024, pp. 367–381. [22] S. Zheng, R. Chen, A. Wei, Y. Jin, Q. Han, L. Lu, B. Wu, X. Li, S. Yan, Y. [42] Q. Zhao, R. Wang, Y. Liu, H. Yang, Z. Luan, D. Qian, Sifter: An efficient operator Liang, AMOS: enabling automatic mapping for tensor computations on spatial auto-tuner with speculative design space exploration for deep learning compiler, accelerators with hardware abstraction, in: Proceedings of the 49th Annual IEEE Trans. Comput. (2024). International Symposium on Computer Architecture, 2022, pp. 874–887. [43] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, S. Amarasinghe, [23] S. Feng, B. Hou, H. Jin, W. Lin, J. Shao, R. Lai, Z. Ye, L. Zheng, C.H. Yu, Y. Yu, Halide: a language and compiler for optimizing parallelism, locality, and re- et al., Tensorir: An abstraction for automatic tensorized program optimization, computation in image processing pipelines, Acm Sigplan Not. 48 (6) (2013) in: Proceedings of the 28th ACM International Conference on Architectural 519–530. Support for Programming Languages and Operating Systems, Volume 2, 2023, [44] Y. Bai, X. Yao, Q. Sun, W. Zhao, S. Chen, Z. Wang, B. Yu, Gtco: Graph and pp. 804–817. tensor co-design for transformer-based image recognition on tensor cores, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. (2023). [24] J. Bi, Q. Guo, X. 
Li, Y. Zhao, Y. Wen, Y. Guo, E. Zhou, X. Hu, Z. Du, L. Li, et al., [45] H. Kwon, P. Chatarasi, V. Sarkar, T. Krishna, M. Pellauer, A. Parashar, Maestro: Heron: Automatically constrained high-performance library generation for deep A data-centric approach to understand reuse, performance, and hardware cost of learning accelerators, in: Proceedings of the 28th ACM International Conference dnn mappings, IEEE Micro 40 (3) (2020) 20–29. on Architectural Support for Programming Languages and Operating Systems, [46] L. Lu, N. Guan, Y. Wang, L. Jia, Z. Luo, J. Yin, J. Cong, Y. Liang, Tenet: A frame- Volume 3, 2023, pp. 314–328. work for modeling tensor dataflow based on relation-centric notation, in: 2021 [25] L. Zheng, C. Jia, M. Sun, Z. Wu, C.H. Yu, A. Haj-Ali, Y. Wang, J. Yang, D. ACM/IEEE 48th Annual International Symposium on Computer Architecture, Zhuo, K. Sen, et al., Ansor: Generating {high-performance} tensor programs for ISCA, IEEE, 2021, pp. 720–733. deep learning, in: 14th USENIX Symposium on Operating Systems Design and [47] A. Parashar, P. Raina, Y.S. Shao, Y.-H. Chen, V.A. Ying, A. Mukkara, R. Venkate- Implementation, OSDI 20, 2020, pp. 863–879. san, B. Khailany, S.W. Keckler, J. Emer, Timeloop: A systematic approach to dnn [26] S. Zheng, Y. Liang, S. Wang, R. Chen, K. Sheng, Flextensor: An automatic accelerator evaluation, in: 2019 IEEE International Symposium on Performance schedule exploration and optimization framework for tensor computation on het- Analysis of Systems and Software, ISPASS, IEEE, 2019, pp. 304–315. erogeneous system, in: Proceedings of the Twenty-Fifth International Conference [48] X. Yang, M. Gao, Q. Liu, J. Setter, J. Pu, A. Nayak, S. Bell, K. Cao, H. Ha, on Architectural Support for Programming Languages and Operating Systems, P. Raina, et al., Interstellar: Using halide’s scheduling language to analyze dnn 2020, pp. 859–873. accelerators, in: Proceedings of the Twenty-Fifth International Conference on [27] A. 
Sabne, Xla: Compiling machine learning for peak performance, Google Res Architectural Support for Programming Languages and Operating Systems, 2020, (2020). pp. 369–383. [28] N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W.S. Moses, S. [49] T. Chen, L. Zheng, E. Yan, Z. Jiang, T. Moreau, L. Ceze, C. Guestrin, A. Verdoolaege, A. Adams, A. Cohen, Tensor comprehensions: Framework-agnostic Krishnamurthy, Learning to optimize tensor programs, Adv. Neural Inf. Process. high-performance machine learning abstractions, 2018, arXiv preprint arXiv: Syst. 31 (2018). 1802.04730. [50] J. Appleyard, S. Yokim, NVIDIA developer technical blog, 2017, [29] P. Tillet, H.-T. Kung, D. Cox, Triton: an intermediate language and compiler for URL:https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9 Online tiled neural network computations, in: Proceedings of the 3rd ACM SIGPLAN (Accessed 13 November 2024). International Workshop on Machine Learning and Programming Languages, [51] NVIDIA, Basic linear algebra on NVIDIA GPUs, 2024, URL: https://developer. 2019, pp. 10–19. nvidia.com/cublas Online (Accessed 13 November 2024) n.d. 12 A. Xie et al. Journal of Systems Architecture 160 (2025) 103359 [52] A. Kerr, H. Wu, M. Gupta, D. Blasig, P. Ramini, D. Merrill, A. Shivam, P. [75] Apple, Apple introduces M4 chip, 2024, URL: https://www.apple.com/sg/ Majcher, P. Springer, M. Hohnerbach, J. Wang, M. Nicely, CUTLASS, 2022, URL: newsroom/2024/05/apple-introduces-m4-chip/ Online (Accessed 13 November https://github.com/NVIDIA/cutlass Online (Accessed 13 November 2024). 2024). [53] T. Zerrell, J. Bruestle, Stripe: Tensor compilation via the nested polyhedral [76] Apple, Apple introduces M4 pro and M4 max, 2024, URL: https://www.apple. model, 2019, arXiv preprint arXiv:1903.06498. com/sg/newsroom/2024/10/apple-introduces-m4-pro-and-m4-max/ Online (Ac- [54] R. Baghdadi, J. Ray, M.B. Romdhane, E. Del Sozzo, A. Akkas, Y. Zhang, cessed 13 November 2024). P. Suriana, S. 
Kamil, S. Amarasinghe, Tiramisu: A polyhedral compiler for [77] Cambricon, Cambricon MLU, 2024, URL: https://www.cambricon.com/ Online expressing fast and portable code, in: 2019 IEEE/ACM International Symposium (Accessed 13 November 2024) n.d.. on Code Generation and Optimization, CGO, IEEE, 2019, pp. 193–205. [78] Z. Jia, B. Tillman, M. Maggioni, D.P. Scarpazza, Dissecting the graphcore ipu [55] S. Tavarageri, A. Heinecke, S. Avancha, B. Kaul, G. Goyal, R. Upadrasta, Polydl: architecture via microbenchmarking, 2019, arXiv preprint arXiv:1912.03413. Polyhedral optimizations for creation of high-performance dl primitives, ACM [79] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Quantized neural Trans. Archit. Code Optim. ( TACO) 18 (1) (2021) 1–27. networks: Training neural networks with low precision weights and activations, [56] Q. Huang, M. Kang, G. Dinh, T. Norell, A. Kalaiah, J. Demmel, J. Wawrzynek, J. Mach. Learn. Res. 18 (187) (2018) 1–30. Y.S. Shao, Cosa: Scheduling by constrained optimization for spatial accelera- [80] T. Liang, J. Glossner, L. Wang, S. Shi, X. Zhang, Pruning and quantization for tors, in: 2021 ACM/IEEE 48th Annual International Symposium on Computer deep neural network acceleration: A survey, Neurocomput. 461 (2021) 370–403. Architecture, ISCA, IEEE, 2021, pp. 554–566. [81] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized [57] M. Sotoudeh, A. Venkat, M. Anderson, E. Georganas, A. Heinecke, J. Knight, neural networks: Training deep neural networks with weights and activations ISA mapper: a compute and hardware agnostic deep learning compiler, in: constrained to+ 1 or-1, 2016, arXiv preprint arXiv:1602.02830. Proceedings of the 16th ACM International Conference on Computing Frontiers, [82] M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, Xnor-net: Imagenet classifi- 2019, pp. 164–173. cation using binary convolutional neural networks, in: European Conference on [58] J. Weng, A. Jain, J. Wang, L. 
Wang, Y. Wang, T. Nowatzki, UNIT: Unifying Computer Vision, Springer, 2016, pp. 525–542. tensorized instruction compilation, in: 2021 IEEE/ACM International Symposium [83] C.-C. Yang, Y.-R. Chen, H.-H. Liao, Y.-M. Chang, J.-K. Lee, Auto-tuning fixed- on Code Generation and Optimization, CGO, IEEE, 2021, pp. 77–89. point precision with TVM on RISC-v packed SIMD extension, ACM Trans. Des. [59] H. Zhu, R. Wu, Y. Diao, S. Ke, H. Li, C. Zhang, J. Xue, L. Ma, Y. Xia, W. Cui, Autom. Electron. Syst. 28 (3) (2023) 1–21. et al., {RollER}: Fast and efficient tensor compilation for deep learning, in: 16th [84] D. Diamantopoulos, B. Ringlein, M. Purandare, G. Singh, C. Hagleitner, Agile USENIX Symposium on Operating Systems Design and Implementation, OSDI 22, autotuning of a transprecision tensor accelerator overlay for TVM compiler 2022, pp. 233–248. stack, in: 2020 30th International Conference on Field-Programmable Logic and [60] Y. Ding, C.H. Yu, B. Zheng, Y. Liu, Y. Wang, G. Pekhimenko, Hidet: Task-mapping Applications, FPL, IEEE, 2020, pp. 310–316. programming paradigm for deep learning tensor programs, in: Proceedings of the [85] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, H. Jin, T. Chen, Z. Jia, Towards efficient 28th ACM International Conference on Architectural Support for Programming generative large language model serving: A survey from algorithms to systems, Languages and Operating Systems, Volume 2, 2023, pp. 370–384. 2023, arXiv preprint ArXiv:2312.15234. [61] L. Zheng, H. Wang, J. Zhai, M. Hu, Z. Ma, T. Wang, S. Huang, X. Miao, S. Tang, [86] J. Xu, G. Song, B. Zhou, F. Li, J. Hao, J. Zhao, A holistic approach to automatic K. Huang, et al., {EINNET}: Optimizing tensor programs with {derivation-based} mixed-precision code generation and tuning for affine programs, in: Proceedings transformations, in: 17th USENIX Symposium on Operating Systems Design and of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Implementation, OSDI 23, 2023, pp. 739–755. 
Parallel Programming, 2024, pp. 55–67. [62] Y. Zhai, S. Yang, K. Pan, R. Zhang, S. Liu, C. Liu, Z. Ye, J. Ji, J. Zhao, Y. Zhang, et [87] M. Cowan, T. Moreau, T. Chen, J. Bornholt, L. Ceze, Automatic generation of al., Enabling tensor language model to assist in generating {High-Performance} high-performance quantized machine learning kernels, in: Proceedings of the tensor programs for deep learning, in: 18th USENIX Symposium on Operating 18th ACM/IEEE International Symposium on Code Generation and Optimization, Systems Design and Implementation, OSDI 24, 2024, pp. 289–305. 2020, pp. 305–316. [63] F. Wang, M. Shen, Y. Ding, N. Xiao, Soter: Analytical tensor-architecture [88] R. Lai, J. Shao, S. Feng, S.S. Lyubomirsky, B. Hou, W. Lin, Z. Ye, H. Jin, Y. Jin, modeling and automatic tensor program tuning for spatial accelerators, in: 2024 J. Liu, et al., Relax: Composable abstractions for end-to-end dynamic machine ACM/IEEE 51st Annual International Symposium on Computer Architecture, learning, 2023, arXiv preprint arXiv:2311.02103. ISCA, IEEE, 2024, pp. 991–1004. [89] H. Shen, J. Roesch, Z. Chen, W. Chen, Y. Wu, M. Li, V. Sharma, Z. Tatlock, [64] F. Wang, M. Shen, Automatic kernel generation for large language models on Y. Wang, Nimble: Efficiently compiling dynamic neural networks for model deep learning accelerators, in: 2023 IEEE/ACM International Conference on inference, Proc. Mach. Learn. Syst. 3 (2021) 208–222. Computer Aided Design, ICCAD, IEEE, 2023, pp. 1–9. [90] B. Zheng, Z. Jiang, C.H. Yu, H. Shen, J. Fromm, Y. Liu, Y. Wang, L. Ceze, [65] J. Devlin, Bert: Pre-training of deep bidirectional transformers for language T. Chen, G. Pekhimenko, DietCode: Automatic optimization for dynamic tensor understanding, 2018, arXiv preprint arXiv:1810.04805. programs, Proc. Mach. Learn. Syst. 4 (2022) 848–863. [66] Y. Wu, S. Zhang, Y. Zhang, Y. Bengio, R.R. Salakhutdinov, On multiplicative [91] P. Mu, L. Wei, Y. Liu, R. 
Wang, FTuner: A fast dynamic shape tensors program integration with recurrent neural networks, Adv. Neural Inf. Process. Syst. 29 auto-tuner for deep learning compilers, 2024, arXiv preprint arXiv:2407.21418. (2016). [92] F. Yu, G. Li, J. Zhao, H. Cui, X. Feng, J. Xue, Optimizing dynamic-shape [67] A.G. Howard, Mobilenets: Efficient convolutional neural networks for mobile neural networks on accelerators via on-the-fly micro-kernel polymerization, vision applications, 2017, arXiv preprint arXiv:1704.04861. in: Proceedings of the 29th ACM International Conference on Architectural [68] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, O. Temam, Diannao: Support for Programming Languages and Operating Systems, Volume 2, 2024, A small-footprint high-throughput accelerator for ubiquitous machine-learning, pp. 797–812. ACM SIGARCH Comput. Archit. News 42 (1) (2014) 269–284. [69] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., Dadiannao: A machine-learning supercomputer, in: 2014 47th Anxing Xie is currently working toward a Ph.D. degree in the School of Computer Science and Engineering, Hu- Annual IEEE/ACM International Symposium on Microarchitecture, IEEE, 2014, nan University of Science and Technology, China. He is pp. 609–622. currently working on deep learning automatic compilation [70] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou, Y. Chen, optimization and high-performance computation. His re- Pudiannao: A polyvalent machine learning accelerator, ACM SIGARCH Comput. search interests include compiler optimization, and parallel Archit. News 43 (1) (2015) 369–381. computing. [71] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, O. Temam, ShiDianNao: Shifting vision processing closer to the sensor, in: Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015, pp. 92–104. [72] B. Hickmann, J. Chen, M. Rotzin, A. Yang, M. Urbanski, S. 
Avancha, Intel nervana neural network processor-t (nnp-t) fused floating point many-term dot Yonghua Hu is a professor in School of Computer Sci- product, in: 2020 IEEE 27th Symposium on Computer Arithmetic, ARITH, IEEE, ence and Engineering, Hunan University of Science and 2020, pp. 133–136. Technology, China. He received the Ph.D degree in Com- [73] E. Talpes, D. Williams, D.D. Sarma, Dojo: The microarchitecture of tesla’s exa- puter Application Technology from Hunan University, in scale computer, in: 2022 IEEE Hot Chips 34 Symposium, HCS, IEEE Computer 2008. He went to University at Buffalo SUNY as a visiting Society, 2022, pp. 1–28. scholar in 2019. His research interests include compilation [74] H. Liao, J. Tu, J. Xia, H. Liu, X. Zhou, H. Yuan, Y. Hu, Ascend: a scalable optimization, artificial intelligence and parallel computing. and unified architecture for ubiquitous deep neural network computing: Industry track paper, in: 2021 IEEE International Symposium on High-Performance Computer Architecture, HPCA, IEEE, 2021, pp. 789–801. 13 A. Xie et al. Journal of Systems Architecture 160 (2025) 103359 Yaohua Wang is currently a professor with the College Yuxiang Gao is currently working toward an M.S. degree of Computer Science, National University of Defense Tech- in the School of Computer Science and Engineering, Hunan nology. His research interest is in computer architecture, University of Science and Technology, China. He is currently machine learning and security. His work spans and stretches working on code optimization and compilation technol- the boundaries of computer architecture. He is especially ex- ogy. His research interests include automatic compilation cited about novel, fundamentally-efficient computation, and optimization and code generation. memory/storage paradigms, applied to emerging machine learning applications. Zhe Li received the Ph.D. degree in Computer Science Zenghua Cheng is currently working toward an M.S. 
degree from Jilin University in 2022. He is currently working at in the School of Computer Science and Engineering, Hunan Tianjin Advanced Technology Institute. His research inter- University of Science and Technology, China. He is currently ests include deep learning compilation and combinatorial working on code optimization and compilation technol- optimization. ogy. His research interests include automatic compilation optimization and Web security. 14