Journal of Systems Architecture 160 (2025) 103359
Contents lists available at ScienceDirect
Journal of Systems Architecture
journal homepage: www.elsevier.com/locate/sysarc
GTA: Generating high-performance tensorized program with dual-task
scheduling
Anxing Xie a,1, Yonghua Hu a,*, Yaohua Wang b, Zhe Li b,c, Yuxiang Gao a, Zenghua Cheng a
a School of Computer Science and Engineering, Hunan University of Science and Technology, Taoyuan Road, Xiangtan, 411201, Hunan, China
b School of Computer Science, National University of Defense Technology, Deya Road, Changsha, 410073, Hunan, China
c Tianjin Institute of Advanced Technology, Huixiang Road, 300459, Tianjin, China
ARTICLE INFO

Keywords: Mapping, Code generation, Compiler optimization, Tensor computation

ABSTRACT

Generating high-performance tensorized programs for deep learning accelerators (DLAs) is crucial for ensuring the efficient execution of deep neural networks. However, producing such programs for different operators across various DLAs is notoriously challenging. Existing methods utilize hardware abstraction to represent acceleration intrinsics, enabling end-to-end automated exploration of the intrinsics mapping space. However, their limited search space and inefficient exploration strategies often result in suboptimal tensorized programs and significant search time overhead.
In this paper, we propose GTA, a framework designed to generate high-performance tensorized programs
for DLAs. Unlike existing deep learning compilers, we first coordinate intrinsic-based mapping abstraction with
rule-based program generation strategy, followed by the application of resource-constrained rules to eliminate
ineffective tensor program candidates from the search space. Second, we employ a dual-task scheduling strategy
to allocate tuning resources across multiple subgraphs of deep neural networks and their mapping candidates.
As a result, GTA can find high-performance tensor programs that are outside the search space of existing
state-of-the-art methods. Our experiments show that GTA achieves an average speedup of more than 1.88×
over AMOS and 2.29× over Ansor on NVIDIA GPU with Tensor Core, as well as 1.49× over Ansor and 2.76×
over PyTorch on CPU with AVX512.
1. Introduction

Recently, the successful deployment of machine learning models has revolutionized diverse application domains, such as image recognition [1-3], natural language processing [4-6], and autonomous driving [7-9]. This rapid development has created a demand for generating high-performance tensor programs for deep learning accelerators (DLAs), such as Google TPUs [10], mobile devices [11-13], FPGAs [14-16], and more. To accelerate machine learning, hardware vendors have introduced domain-specific intrinsics for tensor computations, such as NVIDIA's Tensor Cores [17-19] and CPU's AVX512 [20]. This demand has led to the process known as tensorization [21], which involves transforming computations using these intrinsic instructions. However, hardware specialization complicates the task of generating high-performance tensorized programs.

To support hardware intrinsic instructions across different accelerators, existing methods [22-24] use unified hardware abstractions to enable end-to-end automatic mapping space exploration. These abstractions not only convert opaque intrinsics into an analyzable format but also bridge the gap between high-level tensor programs and low-level instructions, a process we refer to as tensorized program generation with automatic mapping optimization. However, generating high-performance tensorized programs for various DLAs remains challenging for several reasons.

Firstly, inefficient exploration of the intrinsic mapping space leads to substantial overhead in search time. For instance, mapping the 7 loops of a 2D convolution to the 3D loops of the Tensor Core can involve 35 different ways [22]. Current strategies [22,23] treat each mapping candidate equally, generating a tensorized program for each and ultimately selecting the one with the best performance. This approach incurs significant time overhead and is inefficient, as it fails to prioritize more promising candidates during the exploration process. Our experiments reveal that many mapping candidates for a given subgraph ultimately fail to produce high-performance tensorized programs, indicating that a large portion of the explored mappings are ineffective in optimizing performance.
* Corresponding author.
E-mail address: huyh@hnust.cn (Y. Hu).
1 Part of this work was done at National University of Defense Technology.
https://doi.org/10.1016/j.sysarc.2025.103359
Received 23 November 2024; Received in revised form 8 January 2025; Accepted 30 January 2025
Available online 7 February 2025
1383-7621/© 2025 Published by Elsevier B.V.
Fig. 1. Comparison of different task scheduling strategies. Part (a): task scheduling with gradient descent. In round 1, all 𝑡𝑎𝑠𝑘𝑠𝑖 are executed sequentially. In subsequent rounds, 𝑡𝑎𝑠𝑘𝑠𝑖 are selectively executed based on the performance gradients calculated from the feedback of each task. Part (b): sequential execution of sub-tasks without dual-task scheduling. Part (c): time is sliced and important subgraphs and intrinsic mapping candidates are prioritized, meaning that not all main-tasks and sub-tasks will be executed. For example, an intrinsic-enabled 𝑚𝑎𝑖𝑛-𝑡𝑎𝑠𝑘𝑖 may contain both retained and discarded mapping candidates. The former will proceed to subsequent tensor program optimization and tuning, while the latter will not participate in further optimization unless they are selected in the next scheduling round. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Secondly, existing rule-based tensor program exploration methods [25] lack the ability to perform automatic tuning and optimization tailored to domain-specific intrinsics. As a result, these methods often fail in auto-tuning and produce suboptimal tensorized programs. To overcome these limitations, there is an urgent need for more efficient exploration of subgraph mapping spaces, along with auto-tuning strategies that can effectively support domain-specific intrinsics, enabling the automatic generation of high-performance tensorized programs.

In this paper, we introduce GTA, a new compiler framework designed to generate high-performance tensorized programs. GTA automatically generates an extensive search space optimized for hardware intrinsics, simultaneously increasing the likelihood of selecting the most efficient mapping configuration. For generating the search space, we employ rule-based strategies to construct a large scheduling search space and apply pruning techniques based on hardware cache resource limitations to eliminate invalid program candidates. Finally, as shown in Fig. 1, for search strategy implementation, we use a dual-task scheduling algorithm to allocate tuning resources across all subgraphs (𝑚𝑎𝑖𝑛-𝑡𝑎𝑠𝑘𝑖, as shown by the blue box in Fig. 1) in the neural network and their intrinsic mapping candidates (𝑠𝑢𝑏-𝑡𝑎𝑠𝑘𝑖, as shown by the orange box and gray box). This algorithm prioritizes subgraphs with greater potential for performance improvement, allocating them more tuning opportunities, while reducing tuning efforts on less promising mapping candidates based on performance feedback, thereby minimizing overall tuning time. In summary, this paper makes the following contributions:

• We integrated intrinsic-based mapping abstraction with a rule-based program generation strategy to expand the search space significantly.
• We developed and implemented an efficient dual-task scheduling strategy for tensorized programs, effectively reducing tuning efforts while enhancing performance.
• We propose a compilation framework called GTA, which supports the generation of high-performance tensorized programs at both the operator level and the full network level on NVIDIA GPUs and CPUs.
• We implemented and comprehensively evaluated the GTA system, demonstrating that the aforementioned techniques outperform state-of-the-art systems across various deep neural networks (DNNs).

2. Background and motivation

2.1. Deep learning compilers

Deep learning compilers [21-32] have emerged as essential tools for bridging the gap between deep learning models and diverse hardware backends. These compilers take model definitions, expressed in frameworks like PyTorch [33] or TensorFlow [34], as inputs and generate efficient code implementations for specific hardware platforms, such as CPUs and GPUs. The compilation process often adopts a progressive multi-layer optimization approach. It begins with the front-end, where neural network models serve as input, and proceeds through intermediate representation (IR) stages. These include graph-level IR [35-39] for structural optimizations and loop-level IR [40-42] for fine-grained transformations. Finally, the back-end generates hardware-specific executable code using traditional compiler techniques, ensuring efficient execution on the target platform.

A key innovation in deep learning compilers is the compute-schedule separation first introduced by Halide [43] and adopted by frameworks like TVM [21]. Compute represents the mathematical description of tensor operations, such as addition, convolution, or matrix multiplication, while schedule defines how these operations are executed on hardware. Schedule specifies program transformations, including loop tiling, vectorization, and unrolling, to optimize performance for specific hardware architectures. This decoupling simplifies the representation of tensor computations, enabling flexible optimization strategies tailored to different backends.

Recent advancements [22-24,44] in deep learning compilers focus on leveraging hardware intrinsics to further optimize tensor programs. By integrating intrinsic-specific mapping abstractions, these compilers can directly utilize the specialized instructions of DLAs, such as NVIDIA's Tensor Cores or CPU's AVX512, to achieve higher computational efficiency. These developments mark a shift from general-purpose optimizations to hardware-aware designs, laying the foundation for intrinsic-based mapping strategies.

2.2. Intrinsic-based mapping abstraction

The development of DLAs has led to the creation of specialized instructions [45-48], known as intrinsics, designed to enhance the computational efficiency of tensor operations. These instructions serve as essential interfaces between hardware and compilers, enabling optimized execution of key operations like matrix multiplication and data movement.

Intrinsics provide an efficient mechanism for managing kernel operations in tensor programs, typically categorized into compute intrinsics for performing computations and memory intrinsics for data handling [22]. For example, NVIDIA Tensor Cores [17-19] and CPU AVX512 [20] offer specialized intrinsics that allow accelerated matrix and vector operations, respectively, facilitating high-performance computation across various accelerators.

Intrinsic-based mapping abstraction further unifies tensor program optimization by representing diverse intrinsic behaviors in a common, analyzable form. Frameworks like AMOS [22] and TensorIR [23] leverage this approach to directly map software operations to hardware intrinsics, supporting automated generation and transformation of tensorized programs. This abstraction broadens the search space for high-performance configurations by identifying fundamental software-to-hardware mappings, thus enhancing optimization potential across different hardware backends.

Table 1
State-of-the-art compilers' mappings for hardware accelerators.

Name         | Mapping Method
❶
AutoTVM      | Hand-written templates + Tuning
Triton       | Hand-written templates
❷
Tiramisu     | Polyhedral model
AKG          | Polyhedral model + Templates
❸
Ansor        | Generated rules + Tuning
XLA          | Templates and rules
Heron        | Constraint-based rules + Tuning
MetaSchedule | Generated rules + Tuning
❹
UNIT         | Analyzable abstraction + Tuning
ROLLER       | Tile abstraction + Construction policy
AMOS         | Analyzable abstraction + Tuning
TensorIR     | Analyzable abstraction and generated rules + Tuning
❺
Hidet        | Task-mapping + Post-scheduling fusion
EINNET       | Derivation-based + Tuning
TensorMap    | Reinforcement learning + Tuning

GTA          | Analyzable abstraction and generated rules + Tuning

2.3. Tensor program generation strategy

In Table 1, we summarize state-of-the-art compiler mapping techniques used to generate optimized tensor programs on hardware accelerators. Most existing compilers leverage programmable intrinsics as part of their mapping strategy, enabling developers to focus on high-level optimization while the compiler handles low-level architectural details. These mapping methods streamline tensor program generation by abstracting hardware-specific operations, thereby enhancing both efficiency and portability.

Specifically, we categorize the state-of-the-art compilers/mappers for DLAs into five main approaches:

❶ Hand-written mapping: Hand-written mapping [29,49] requires developers to manually define mappings for tensorized programs using compiler-provided tensorize interfaces. This approach enables fine-grained optimization, especially for specialized hardware like NVIDIA Tensor Cores. However, it demands significant expertise and high development costs, as developers must continually rewrite templates to support new operators and accelerators [50-52]. While hand-written mapping can achieve high performance for specific workloads, its lack of scalability and adaptability limits its effectiveness compared to more automated methods.

❷ Polyhedral model mapping: Polyhedral model mapping [28,32,53-56] provides a powerful strategy for optimizing tensor programs by restructuring execution and managing complex memory dependencies. In the realm of tensor program compilation, this approach plays a critical role in handling intricate memory structures and optimizing execution. For example, AKG [32] leverages polyhedral scheduling to restructure execution order through new linear relationships, effectively eliminating inter-loop dependencies. This method is particularly advantageous for hardware like TPUs, where enhancing parallel computation is essential. By exploring a broader range of affine transformations compared to methods such as TVM [21], polyhedral mapping optimizes performance for diverse workloads. However, the model's inherent complexity limits its general applicability, making it less feasible for simpler or less resource-intensive tasks.

❸ Rule-based mapping: Rule-based mapping [24-27,57] generates efficient tensor programs through predefined scheduling primitives, streamlining tensor program creation without user-defined templates. This approach leverages scheduling techniques like loop tiling, fusion, and vectorization, as demonstrated by frameworks like Ansor [25], which automatically create search spaces using these rules. This method simplifies tensor program generation in deep learning applications. However, it also has limitations: users must ensure that the predefined rules align with the specific operators and hardware, or the generated programs may fail to achieve optimal performance.

❹ Analyzable abstraction mapping: Analyzable abstraction mapping [22,23,44,58,59] unifies tensor program optimization by abstracting diverse hardware intrinsic behaviors into a common representation, facilitating efficient mapping and transformation of tensorized programs. Examples like AMOS and TensorIR establish direct mappings between software and hardware, guiding the automated generation of tensorized programs. This approach broadens the scope of exploration by identifying foundational software-to-hardware combinations, increasing the potential for discovering optimized mappings.

❺ Other mapping: Other mapping methods [13,40,60,61] reformulate deep learning optimization problems using strategies from other domains to enhance efficiency. For example, CoSA [56] and Heron [24] convert the scheduling space search into a constrained optimization problem and leverage solvers to rapidly explore the space. Alternatively, TLM [62] and Soter [63] treat tensor program exploration as a language model generation task, where tensor programs are represented as sequences and tunable parameters as language tokens. Specifically, they leverage a large language model (LLM) to generate these tokens for tunable parameters, enabling efficient exploration of mapping schemes and more effective optimization of tensor programs.

Building on this foundation, we reviewed five primary mapping approaches used for deep learning accelerators: hand-written, rule-based, polyhedral model, analyzable abstraction, and other mapping methods. Each approach brings unique advantages: hand-written and rule-based mappings allow fine-tuned performance but require extensive manual intervention or rigid predefined rules, while polyhedral and analyzable abstraction mappings offer more automated solutions but are challenged by complexity and limited applicability. Methods borrowing from other domains, such as optimization solvers and language models, open new directions but may lack consistency across diverse hardware. In summary, intrinsic-based mapping abstraction offers a unified framework for optimizing tensor programs across diverse hardware accelerators by abstracting hardware intrinsic behaviors into a common representation. Systems like AMOS and TensorIR leverage this approach to enable efficient and adaptable mappings for tensorized programs.

Despite these advances, significant challenges remain in achieving flexible, high-performance mappings that are adaptable to new hardware accelerators, such as the inefficiency of existing approaches in handling diverse architectural constraints and their inability to effectively explore large and complex search spaces. To better illustrate our motivation, we present an example of the specific challenges within existing analyzable abstraction mapping systems, motivating the development of our approach.

Mapping intrinsic instructions onto hardware accelerators poses significant challenges due to the vast number of possible configurations and their impact on performance. The process of selecting the optimal mapping for intrinsic instructions, such as those used in Tensor Cores, is complex, given the numerous potential mapping candidates. Each mapping choice can critically affect performance factors like data locality and parallelism. For example, as shown in Table 2, AMOS identified 35 distinct ways to map the seven loops of a 2D convolution onto the 3D loops of the Tensor Core. Exhaustively exploring all configurations is inefficient and rarely yields substantial performance gains. Thus, a more efficient approach is required, one that prioritizes the most promising mappings to reduce search overhead and maximize performance.
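To make the size of this mapping space concrete, the following toy enumeration sketches how candidates multiply once loop fusion is allowed. It is our own simplified illustration (loop and index names follow Table 2), not AMOS's actual enumeration or validity analysis, so its counts differ from the 35 legal mappings reported above.

```python
from itertools import combinations

# Loops of a 2D convolution (Table 2): space loops that may feed the
# Tensor Core space indices, and reduction loops that may feed r1.
# In this toy model, i2 is always bound to the channel loop k.
space_candidates = ["n", "p", "q"]
reduce_candidates = ["rc", "rr", "rs"]

def nonempty_subsets(loops):
    """All non-empty subsets of `loops`; a subset with more than one loop
    models fusing several software loops into one intrinsic index."""
    return [list(c) for r in range(1, len(loops) + 1)
            for c in combinations(loops, r)]

# Basic mapping: exactly one software loop per intrinsic index.
basic = [{"i1": [s], "i2": ["k"], "r1": [r]}
         for s in space_candidates for r in reduce_candidates]

# Complex mapping: any non-empty combination of loops per index.
complex_ = [{"i1": ss, "i2": ["k"], "r1": rs}
            for ss in nonempty_subsets(space_candidates)
            for rs in nonempty_subsets(reduce_candidates)]

print(len(basic))     # 9 basic candidates in this toy model
print(len(complex_))  # 49 once fusion is allowed
```

Even this stripped-down model grows multiplicatively with the fusion choices, which is why exhaustive exploration quickly becomes impractical and prioritizing promising candidates matters.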
Table 2
Mapping candidate choices. This example maps 2D convolution indices to Tensor Core indices (type: float16). Space loops: n, k, p, q, 𝑖1, 𝑖2; reduction loops: rc, rr, rs, 𝑟1. The mapping choices can be categorized into basic mapping and complex mapping. Basic mapping means selecting only one choice at a time, while complex mapping allows multiple choices to be combined for mixed mapping.

        mapping1  mapping2  mapping3  mapping4  mapping5  mapping6  mapping7
i1      n         n         n         p         p         q         q
i2      k         k         k         k         k         k         k
r1      rc        rr        rs        rc        rs        rc        rr
Choices 0/1       0/1       0/1       0/1       0/1       0/1       0/1
Fig. 2. The compilation flow of GTA. 𝑡𝑛 denotes the 𝑛th non-intrinsic main-task (blue box), and 𝑡𝑛𝑘 denotes the 𝑘th mapping candidate of the 𝑛th intrinsic-enabled main-task (orange box). All mapping candidates are ranked and executed based on performance feedback. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
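The task hierarchy of Fig. 2 can be sketched as plain data structures: one object per subgraph, holding one record per intrinsic mapping candidate. The class and field names below are our own illustration of this structure, not GTA's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SubTask:
    """One intrinsic mapping candidate t_nk of an intrinsic-enabled main-task."""
    mapping_id: int
    best_latency: Optional[float] = None  # filled in by hardware measurement
    sample_prob: float = 1.0              # sampling probability, updated by feedback

@dataclass
class MainTask:
    """One subgraph t_n extracted from the DNN's computation graph."""
    name: str
    uses_intrinsic: bool
    sub_tasks: List[SubTask] = field(default_factory=list)

    def best_latency(self) -> float:
        """Latency of the best measured mapping candidate so far."""
        measured = [s.best_latency for s in self.sub_tasks
                    if s.best_latency is not None]
        return min(measured) if measured else float("inf")

# A non-intrinsic main-task carries no mapping candidates; an intrinsic-enabled
# one carries a SubTask per candidate (orange/gray boxes in Fig. 2).
t1 = MainTask("conv2d+relu", True,
              [SubTask(0, 1.8), SubTask(1, 2.4), SubTask(2, None)])
print(t1.best_latency())  # 1.8
```

Unmeasured candidates (latency still `None`) simply do not contribute to the task's best latency until a scheduling round selects them.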
A second challenge lies in the scheduling of tensor programs, which often lacks consideration for DLAs' intrinsics. Existing systems do not sufficiently incorporate these intrinsics when generating the scheduling search space, limiting their ability to optimize tensorized programs for specialized hardware. To address this, a more comprehensive approach to scheduling is needed, integrating primitives like tiling, fusion, and vectorization that are tailored to the unique characteristics of DLAs. Without such a targeted approach, the scheduling search space cannot fully leverage the potential of available mappings, thereby constraining the system's capacity to produce high-performance programs.

3. GTA overview

To address the aforementioned issues, we propose GTA, a compilation framework designed to automatically generate high-performance tensorized programs for specialized hardware. As shown in Fig. 2, it takes deep neural networks (DNNs) as input, converting them into computation graphs represented as directed acyclic graphs (DAGs). In these graphs, each node corresponds to a tensor operation, and each edge denotes a producer-consumer relationship between operations. To handle the complexity of large computational graphs, GTA partitions the DNN's computation graph into smaller, manageable subgraphs using Relay's operator fusion algorithm, which has minimal performance impact due to the layer-by-layer structure of DNNs (𝑡1, 𝑡2, …, 𝑡𝑛 in Fig. 2).

To maximize performance across multiple subgraphs, GTA dynamically prioritizes the subgraphs and mapping candidates most likely to enhance end-to-end efficiency. It uses a dual-task scheduling approach (detailed in Section 4) that allocates tuning time across both the subgraph and mapping candidate levels. By allocating varying amounts of time to different subgraphs and probabilistically discarding less efficient candidates based on performance feedback, dual-task scheduling helps avoid wasting tuning resources on low-impact mappings.

Additionally, resource-constrained rules (explained in Section 5) guide program generation on both DLAs and general-purpose accelerators. GTA designs these rules by abstracting common architectural characteristics across DLAs, such as coarse-grained hardware intrinsics (e.g., WMMA in Tensor Core) and dedicated scratchpad memory (e.g., Unified Buffer in TPU). This design allows GTA to efficiently leverage hardware-specific features, optimizing tensorized programs to fully exploit the underlying hardware capabilities.

4. Dual-task scheduling

Most existing compiler frameworks adopt a performance-aware tuning strategy to fine-tune generated programs, a method proven effective by systems such as Ansor and AMOS. For example, Ansor refines its cost model by updating task weights based on feedback from each search iteration, while dynamically allocating subgraph trials. Building on this approach, when multiple intrinsic instruction mapping options are available, feeding the performance results of each mapping back into the front-end further enhances the framework by enabling seamless co-design between the front-end and back-end stages.

To optimize tuning resource allocation, a DNN can be decomposed into multiple independent subgraphs (e.g., conv2d + ReLU). For some subgraphs, spending time on tuning may not significantly improve the overall network performance. This may occur when a subgraph is not a performance bottleneck, or when any tuning yields only marginal gains. Similarly, a subgraph may have multiple intrinsic mapping candidates, but further tuning on certain mappings may not result in meaningful improvements. This is often because certain mapping schemes exhibit inefficient memory access patterns, limiting their ability to leverage the unique features of the underlying hardware and thereby restricting the potential for significant performance gains.

To illustrate the dual-task scheduling (DTS) process, we use ResNet18 as an example. After splitting ResNet18 into subgraphs, there are 24 unique subgraphs, most of which are convolution layers with varying shape configurations (e.g., input size, kernel size, stride). Following Ansor's task scheduling methodology, we define a task as the process of generating high-performance programs for each subgraph. Thus, optimizing a single DNN like ResNet18 requires completing multiple tasks (e.g., 24 tasks for ResNet18).

To efficiently allocate tuning resources across these tasks, GTA employs a DTS approach. This method dynamically assigns varying amounts of time to different subgraphs and probabilistically discards inefficient mapping candidates based on program performance feedback. DTS operates on two levels: the subgraph level and the mapping candidate level, helping GTA focus tuning resources on the most impactful configurations and avoid spending time on low-impact mappings.

As shown in Fig. 1, DTS iteratively allocates tuning resources to different tasks. In each round, the first step selects a subgraph for program generation, and GTA generates a set of intrinsic-compatible mapping candidates for the intrinsic-enabled 𝑡𝑎𝑠𝑘𝑖. This effectively breaks the main-task into several sub-tasks (as shown by the orange box in Fig. 1). The second step then generates a batch of promising programs for these sub-tasks and measures their performance on hardware. Each round is defined as one unit of time resource. When a time resource is allocated to a task, the task gains the opportunity to generate and measure new programs, increasing the chance of discovering better-performing ones.

In the following section, we introduce the formulation of the scheduling problem and our solution.

4.1. Problem formulation

In defining the scheduling problem, we divide DTS into two types of tasks: main-tasks and sub-tasks. In this framework, a DNN can be split into several subgraphs (main-tasks). If the computation type, data type, and computation shape of a main-task meet the limitations required for utilizing hardware intrinsic resources, multiple intrinsic mapping candidates will be generated for the main-task. Each of these intrinsic mapping candidates is referred to as a sub-task. A main-task represents a process performed to generate high-performance programs for a subgraph, meaning that optimizing a single DNN requires completing dozens of main-tasks. The related notations used in this paper are shown in Table 3.

Table 3
Notations.

Notation    | Description/Definition
Main-task   | Subgraph process for generating high-performance programs
Sub-task    | Intrinsic mapping candidate satisfying hardware constraints
Δ𝑡          | Small backward window size
𝑁𝑖          | The set of similar tasks of task 𝑖
𝐶𝑖          | The number of floating point operations in task 𝑖
𝑉𝑘          | The number of floating point operations per second we can achieve in task 𝑘
𝐵𝑙𝑎𝑡𝑒𝑛𝑐𝑦    | Best mapping latency set of tasks
𝐵𝑡𝑎𝑠𝑘       | Best mapping tasks set of the DNN
𝐶𝑠𝑎𝑚𝑝𝑙𝑒     | Samples selected from all mappings
𝐶𝑡𝑟𝑖𝑎𝑙𝑠     | Current number of trials
𝐶𝑚𝑎𝑝𝑝𝑖𝑛𝑔    | Current mapping selection
𝐺           | Native neural network
𝑚𝑖(𝑡)       | Minimum execution time for the 𝑖th task
𝑚𝑖𝑘(𝑡)      | Execution time of the 𝑘th mapping for 𝑚𝑖(𝑡)
𝑇𝑙𝑎𝑡𝑒𝑛𝑐𝑦    | Latency set of all tasks
𝑀𝑐𝑎𝑛𝑑𝑖      | Set of all mapping candidates
𝛼𝑘          | Sampling probability of mapping 𝑘
𝛽           | Hyperparameter for increasing probability
𝜔𝑖          | Number of appearances of task 𝑖 in the network

We define 𝑚𝑖(𝑡) as the minimum execution time required for the 𝑖th main-task at time 𝑡, and 𝑚𝑖𝑘(𝑡) as the execution time of the 𝑘th mapping scheme for the 𝑖th main-task. The optimal execution time for subgraph 𝑖 is represented as min(𝑚𝑖1(𝑡), 𝑚𝑖2(𝑡), …, 𝑚𝑖𝑘(𝑡)). The end-to-end execution time of the entire network, denoted by 𝐺(𝑚1(𝑡), 𝑚2(𝑡), …, 𝑚𝑛(𝑡)), represents the aggregate time across all main-tasks. Our objective is to minimize this function to achieve the lowest possible overall execution time for the DNN. Thus, the objective function is defined as:

f(G) = \sum_{i=1}^{n} \omega_i \times \max\big( \beta(\alpha_1 \cdot m_{i1}(t), \alpha_2 \cdot m_{i2}(t), \ldots, \alpha_k \cdot m_{ik}(t)) \big)    (1)

Let 𝜔𝑖 denote the number of appearances of main-task 𝑖 in the network, where 𝑖 is the main-task index. If a main-task has already met its latency requirement, no additional tuning resources are allocated to it. The variable 𝛼𝑘 represents the sampling probability assigned to sub-task 𝑘. Unlike other frameworks, our approach introduces probabilistic allocation for intrinsic mapping candidates (sub-tasks). Once performance feedback for all mapping candidates of a subgraph is received, sampling probabilities are assigned based on time cost: candidates with lower time costs are assigned higher probabilities, while those with higher time costs receive lower probabilities. We also introduce a hyperparameter 𝛽 to adjust the sampling probabilities of specific mapping candidates, helping to avoid convergence on locally optimal solutions.

4.2. Optimizing with gradient and probability

Inspired by the gradient descent-based task scheduling approach presented in [25], we propose a DTS algorithm (Algorithm 1) that combines gradient descent with probability-based selection to efficiently optimize the objective function. Starting from the current allocation t, the algorithm approximates the gradient of the objective function, \partial f / \partial t_i, and identifies the primary task i by maximizing the absolute gradient, defined as i = \arg\max_i |\partial f / \partial t_i|. This gradient approximation serves as the foundation for selecting the main-task with the highest potential impact.

Algorithm 1: Dual-Task Scheduling
Input:
    𝐺: native deep learning neural network
    target: target hardware platform
    trials: total tuning counts
    MEASURE_NUM: number of measures per round
Output: 𝑏𝑒𝑠𝑡_𝑡𝑎𝑠𝑘𝑠: best performance tasks
1  Function dual_scheduling
2      Initialize local variables 𝐵𝑙𝑎𝑡𝑒𝑛𝑐𝑦, 𝐵𝑡𝑎𝑠𝑘, 𝑇𝑙𝑎𝑡𝑒𝑛𝑐𝑦, 𝐶𝑡𝑎𝑠𝑘, 𝐶𝑠𝑎𝑚𝑝𝑙𝑒𝑠;
3      tasks = extract_tasks(𝐺, target);
4      while 𝐶𝑡𝑟𝑖𝑎𝑙𝑠 < trials do
5          tid = gradient_scheduling(tasks, 𝑇𝑙𝑎𝑡𝑒𝑛𝑐𝑦);
6          𝑀𝑐𝑎𝑛𝑑𝑖 = match_intrinsic(tasks[tid], target);
7          if 𝑀𝑐𝑎𝑛𝑑𝑖 not NULL then
8              for 𝐶𝑚𝑎𝑝𝑝𝑖𝑛𝑔 in 𝑀𝑐𝑎𝑛𝑑𝑖 do
9                  if 𝐶𝑠𝑎𝑚𝑝𝑙𝑒𝑠 then
10                     if 𝐶𝑚𝑎𝑝𝑝𝑖𝑛𝑔 not in 𝐶𝑠𝑎𝑚𝑝𝑙𝑒𝑠 then
11                         continue;
12                     end
13                 end
14                 latency = tasks[tid].tune(𝐶𝑚𝑎𝑝𝑝𝑖𝑛𝑔);
15                 𝑇𝑙𝑎𝑡𝑒𝑛𝑐𝑦.append(latency);
16                 if latency < 𝐵𝑙𝑎𝑡𝑒𝑛𝑐𝑦[tid] then
17                     𝐵𝑙𝑎𝑡𝑒𝑛𝑐𝑦[tid] = latency;
18                     𝐵𝑡𝑎𝑠𝑘[tid] = tasks[tid];
19                 end
20             end
21             𝐶𝑠𝑎𝑚𝑝𝑙𝑒 = probability_sample(𝑇𝑙𝑎𝑡𝑒𝑛𝑐𝑦);
22             𝐶𝑡𝑟𝑖𝑎𝑙𝑠 += MEASURE_NUM;
23         end
24     end
25 return 𝐵𝑡𝑎𝑠𝑘;

\frac{\partial f}{\partial t_i} \approx \frac{\partial f}{\partial m_i} \left( \alpha \frac{\Delta m}{\Delta t} + (1 - \eta) \min\left( -\frac{m_i(t_i)}{t_i},\ \theta \frac{C_i}{\max_{k \in N(i)} V_k} - m_i(t_i) \right) \right)    (2)

where \Delta m = m_i(t_i) - m_i(t_i - \Delta t) and the other variables are defined in Table 3. The parameters \eta and \theta control how much weight is placed on these predictions.

GTA initializes the algorithm with t = 0 and begins with a round-robin warm-up phase, resulting in an initial allocation vector of 𝑡 = {1, 1, …, 1}. After the warm-up, as shown in line 5 of Algorithm 1, the gradient for each main-task is computed, and the main-task with the maximum absolute gradient, i = \arg\max_i |\partial f / \partial t_i|, is selected. A tuning time unit is then allocated to this main-task, updating its allocation to 𝑡𝑖 = 𝑡𝑖 + 1. The optimization process continues until the tuning time budget is exhausted.
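As a rough illustration of how gradient-based main-task selection and probability-based sub-task selection fit together, the sketch below approximates the gradient with only the backward finite difference Δm/Δt (dropping the η/θ prediction term of Eq. (2)) and samples mapping candidates with inverse-latency weights. All function names here are our own; this is a simplified sketch, not GTA's implementation.

```python
import random

def pick_main_task(history, delta_t=1):
    """Pick the main-task with the largest |gradient|, approximated by the
    backward finite difference Δm/Δt. `history[i]` holds the best latency of
    task i observed after each past round."""
    def grad(lat):
        if len(lat) <= delta_t:      # not enough feedback yet: force a trial
            return float("inf")
        return abs(lat[-1] - lat[-1 - delta_t]) / delta_t
    return max(range(len(history)), key=lambda i: grad(history[i]))

def sample_mappings(latencies, beta=1.0, k=2, rng=random):
    """Probabilistically retain k mapping candidates: a lower measured latency
    yields a proportionally higher sampling weight (inverse-latency weights);
    beta sharpens or flattens the distribution to escape local optima."""
    weights = [(1.0 / t) ** beta for t in latencies]
    idx = list(range(len(latencies)))
    chosen = set()
    while len(chosen) < min(k, len(idx)):
        chosen.add(rng.choices(idx, weights=weights, k=1)[0])
    return sorted(chosen)

# Toy round: task 1 improved the most in the last round, so it receives the
# next unit of tuning time.
history = [[5.0, 4.9, 4.9], [9.0, 8.0, 6.5], [3.0, 3.0, 3.0]]
print(pick_main_task(history))  # 1

random.seed(0)
print(sample_mappings([2.0, 8.0, 1.0, 9.0], k=2))
```

Because discarded candidates keep a nonzero sampling probability, they can still be re-selected in a later round, matching the behavior described for Fig. 1(c).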
Afterward, GTA searches for a hardware intrinsic that matches the specified main-task. Once a suitable set of hardware intrinsics is identified, tensor programs are generated for all mapping candidates, serving as a warm-up for the sub-tasks. This warm-up allows GTA to select the most promising mapping candidates by assigning probabilities based on their performance feedback. In subsequent rounds, only mapping candidates prioritized by their previously assigned probabilities are executed. This selective exploration avoids spending time on inefficient candidates, enhancing tuning efficiency and allowing higher-potential candidates more opportunities for optimization.

The probability_sample algorithm, called in line 21 of Algorithm 1, is designed to probabilistically select mapping candidates for further analysis and optimization. We first introduce the notation: let R = {r_1, r_2, ..., r_n} represent the set of all mapping results, where r_i denotes the i-th result with a performance value V(r_i).

The total weight W is calculated from each result's inverse performance value, normalized with respect to the maximum inverse performance in R, as follows:

W = Σ_{r_i ∈ R} (1/V(r_i)) · (1 / max_{r_j ∈ R} (1/V(r_j)))

This ensures that weights are scaled relative to the most performant candidate in the result set R. Using this normalized total weight W, the initial probability assigned to each result r_i is given by:

P(r_i) = ( (1/V(r_i)) · (1 / max_{r_j ∈ R} (1/V(r_j))) ) / W

To encourage exploration, the algorithm applies a probability increase factor β to selected results. The probability adjustment is defined by weighting the original probability P(r_i) with an exploration boost:

P′(r_i) = ( (1 + β) · P(r_i) ) / ( Σ_{r_j ∈ R} (1 + β_j) · P(r_j) )

Here, β_j is a task-specific exploration factor, applied selectively to candidates r_j, where β_j = β for selected candidates and β_j = 0 otherwise. The inclusion of the initial probability P(r_j), derived from each candidate's performance value V(r_j), serves as the foundation of the adjusted probabilities. This ensures that P′(r_i) retains the relative importance of each candidate while allowing selective exploration through β_j.

The normalization term, Σ_{r_j ∈ R} (1 + β_j) · P(r_j), ensures that the adjusted probabilities remain valid and sum to 1. By combining the task-specific exploration factor with the initial performance-weighted probability P(r_j), this formula balances exploitation of high-priority candidates with exploration of less performant options. Furthermore, P(r_j) prevents the adjustment from overly concentrating on a small subset of candidates, promoting diversity and fairness across the result set R.

Finally, the algorithm selects the top N results based on the adjusted probabilities P′(r_i). The selection process is expressed as:

{r_i}_{i=1}^{N} = Top_N( P′(r_1), P′(r_2), ..., P′(r_n) ),    (3)

where N is dynamically determined as a fraction of the total result set R, denoted by N = ⌈κ · |R|⌉, and κ ∈ (0, 1] is a user-defined parameter controlling the selection size.

5. Resource-constrained rules

Existing exploration-based methods face significant challenges in both performance and scalability, primarily due to two factors. First, although the design space is vast, it contains numerous inefficient kernels. For example, in the GEMM operation with dimensions 512 × 768 × 3072 (used in GPT-1 on Tensor Core), the kernel space size reaches O(10^16), with over 90% of the kernels being inefficient [63,64]. Second, current approaches are largely tailored to general-purpose processors and lack consideration for specific architectural constraints. This highlights the need to construct a high-quality kernel design space to effectively reduce inefficient exploration and improve overall performance.

To address these challenges, GTA's implementation of resource-constrained generation rules is based on existing open-source code for DLAs and general-purpose accelerators [22,25]. In particular, the DLA-specific rules are adapted to leverage hardware intrinsics and dedicated scratchpad memory (DSM) efficiently. From a programmer's perspective, DLAs, in contrast to general-purpose accelerators, feature coarse-grained hardware intrinsics (e.g., WMMA in Tensor Core) and user-programmable DSM (e.g., Unified Buffer in TPU). Based on these existing implementations, we made targeted modifications to better align the rules with the search strategies and optimization methods proposed in this work. Table 4 summarizes five key generation rules that GTA employs to optimize data movement, operation fusion, and memory management in DLAs. Each rule addresses specific challenges to enhance computational efficiency and resource utilization.

Table 4
Resource-constrained rules and related conditions.

No. | Rule | Condition
R1 | Multi-Level Tiling | HasDataReuse(R, i) & HasMultiLevelCache(R, i)
R2 | Set Multi-Scope | HasDataReuse(R, i) & HasMultiScopeCache(R, i)
R3 | Fuse Main Op | HasStagesFused(R)
R4 | Fuse Output Op | HasStagesFused(R)
R5 | AddMemLimit | HasDSM(R) (a)
... | Ansor Defined Rule (b) | ...

(a) DSM: dedicated scratchpad memory. (b) Ansor [25].

The following is a detailed description of each rule:

Rule-R1 generates multiple nodes for data movement between different levels of on-chip DSMs. To apply this rule, GTA first checks for data reuse opportunities and verifies whether the DLA has multiple DSM levels (e.g., Tensor Core provides two levels of DSMs for WMMA fragments and shared memory). If these conditions are met, GTA inserts cache_read primitives for the node and its producers to facilitate data movement.

Rule-R2 marks the data storage scope for each operation within the DSM hierarchy. To apply this rule, GTA first checks for data reuse opportunities and verifies whether the DLA provides multiple DSM scopes for different data types. If these conditions are satisfied, GTA assigns cache_write primitives to the node and cache_read primitives to its producers, ensuring that data is efficiently stored and accessed within the appropriate DSM levels.

Rule-R3 enables the fusion of main operations within a subgraph by identifying opportunities to combine operations with shared data dependencies. This reduces data movement overhead and improves computational efficiency. When multiple stages are fused, GTA inserts the appropriate primitives to implement the fusion, streamlining the execution flow.

Rule-R4 focuses on fusing output operations within a computational graph. Similar to Rule-R3, it targets operations that can be combined to minimize data transfer costs and enhance throughput. By analyzing data flow between operations, GTA inserts the necessary primitives to achieve output fusion, resulting in a more compact and efficient execution structure.

Rule-R5 constrains memory usage for operations that utilize DSM. By evaluating each operation and its memory requirements, GTA ensures memory limits are respected, preventing allocations from exceeding hardware capacity, which could lead to inefficient execution. This rule helps maintain an efficient memory allocation strategy, optimizing overall resource utilization.
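The probability_sample selection described earlier in this page (inverse-performance weights, exploration boost β, and top-N cut with N = ⌈κ·|R|⌉) can be sketched as follows. The function and variable names are illustrative, not GTA's actual API; V(r_i) is treated as an execution time, so lower is better.

```python
import math

def probability_sample(values, selected, beta=0.5, kappa=0.5):
    """Pick the top N = ceil(kappa * |R|) candidates by adjusted probability.

    values[i] is V(r_i), treated as execution time (lower is better);
    `selected` holds indices that receive the exploration boost beta.
    """
    # Inverse-performance weights, normalised by the best inverse value,
    # so the most performant candidate gets weight 1.
    inv_max = max(1.0 / v for v in values)
    w = [(1.0 / v) / inv_max for v in values]
    W = sum(w)
    p = [wi / W for wi in w]                    # initial probabilities P(r_i)

    # Exploration boost: beta_j = beta for selected candidates, 0 otherwise.
    boosted = [(1.0 + (beta if j in selected else 0.0)) * pj
               for j, pj in enumerate(p)]
    z = sum(boosted)
    p_adj = [b / z for b in boosted]            # renormalised, sums to 1

    n = math.ceil(kappa * len(values))
    return sorted(range(len(values)), key=lambda j: -p_adj[j])[:n]

# Four candidates with execution times 2, 4, 8, 1; candidate 2 gets the boost.
top = probability_sample([2.0, 4.0, 8.0, 1.0], selected={2})
```

In this toy run the fastest candidates (indices 3 and 0) survive the cut; the boost raises candidate 2's probability but, with these values, not enough to displace them — it matters only for borderline rankings.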
Fig. 3. An illustrative example of tensorized program generation for a GEMM-ReLU operator, demonstrating the transformation of the input program from a mathematical expression
(𝑝0 ) to a tensor expression (𝑝1 ) written in a domain-specific language using TVM. The process further includes intrinsic matching based on the type and shape of the input operator
to select and generate intrinsic mapping candidates, followed by the application of resource-constrained rules to guide the creation of a tensorized program sketch (𝑝2 ).
An example. Fig. 3 illustrates how resource-constrained rules are applied during tensorized program generation. Starting from the input program written as a mathematical expression (p0), the process converts it into a tensor expression (p1) using a domain-specific language (DSL) in TVM. The intrinsic matching step leverages compute abstraction and memory abstraction, as proposed in AMOS [22], to complete the software-hardware mapping generation. This process selects and generates intrinsic mapping candidates by analyzing the operator's computation type, data type, and memory access patterns based on its shape and hardware-specific constraints. Subsequently, resource-constrained rules play a critical role in guiding the generation of the tensorized program sketch, ensuring efficient utilization of hardware intrinsic functions while respecting memory and architectural constraints. Specifically, the derivation for the generated rules and the transformed program can be expressed as:

input p1 → M_cand ∧ σ(S_0, i = 3) →[R2] σ(S_1, i = 3) →[R1] σ(S_2, i = 2) →[R3] ... →[R5] output p2    (4)

We define the state as σ = (S, i), where S represents the current partially generated sketch program for the DAG, and i denotes the index of the node currently being transformed. For each rule, if the application conditions are met, the rule is applied to σ = (S, i), resulting in a new state σ′ = (S′, i′), where i′ ≤ i. This ensures that the index i (indicating the transforming node) decreases monotonically. A state reaches a terminal condition when i = 0. During the enumeration process, multiple rules may be applicable to a single state, generating several succeeding states. Additionally, a single rule can produce multiple succeeding states under certain conditions.

6. Implementation

In this section, we delve into the technical details of our implementation. GTA extends TVM, an end-to-end deep learning compiler, to support loop scheduling and generate high-performance programs with intrinsic instructions.

Task Generation. To mitigate the issue of search space explosion, compilers typically divide the large computational graph of a DNN into smaller subgraphs. Notably, for some subgraphs, spending time on tuning may not significantly enhance the end-to-end performance of the DNN. In this work, we adopt TVM's subgraph partitioning strategy to divide the input DNN into multiple smaller subgraphs, referred to as main-tasks. A main-task is the process executed to generate high-performance programs for a subgraph. TVM categorizes operators into four types: injective (e.g., add operations), reduction (e.g., sum operations), complex-out-fusible (e.g., matrix multiplication, where element-wise operations can fuse to the output), and opaque (e.g., sort operations, which cannot be fused). Subgraph fusion is then performed based on predefined generic rules.

Mapping Generation and Scheduling. At each iteration, based on the intrinsic mapping generation approach described in AMOS [22], main-tasks are classified into intrinsic-disabled and intrinsic-enabled tasks. For intrinsic-disabled main-tasks, we adopt Ansor's [25] compilation optimization to generate programs. In contrast, for intrinsic-enabled main-tasks, GTA optimizes task scheduling based on gradients and probabilities. This algorithm prioritizes subgraphs with higher potential for performance improvement, allocating them more tuning opportunities while reducing effort on less promising mapping candidates based on performance feedback. GTA slices the tuning time and prioritizes important subgraphs and intrinsic mapping candidates, meaning that not all main-tasks and sub-tasks will be executed. For example, an intrinsic-enabled main-task_i may contain both retained and discarded mapping candidates. The former proceed to subsequent tensor program optimization and tuning, while the latter do not participate in further optimization unless they are selected in the next scheduling round.

Search Space Exploration. Subsequently, GTA applies resource-constrained rules and existing derivation rules (Table 4) to each subgraph under the guidance of a genetic algorithm [25]. During this process, tens of thousands of tensor programs are generated, and a cost model is employed to filter out the most promising candidates with near-optimal performance. These selected candidates are then executed on the target hardware to identify the tensor program with the best performance.
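The state-enumeration process behind Eq. (4) can be sketched with toy rules. Everything here is an illustrative placeholder (the two rules, their conditions, and the tuple-based state encoding are invented, not GTA's Table 4 rules); it only demonstrates the mechanics: states σ = (S, i), condition-gated rule application, i′ ≤ i, and termination at i = 0.

```python
# Toy enumeration of sketch states sigma = (S, i); rules are illustrative.

def enumerate_sketches(num_nodes, rules):
    """rules: list of (condition(S, i), apply(S, i) -> list of (S', i'))."""
    done, stack = [], [((), num_nodes)]   # start at the last node index
    while stack:
        S, i = stack.pop()
        if i == 0:                        # terminal condition: all nodes done
            done.append(S)
            continue
        # Every applicable rule spawns one or more succeeding states.
        for cond, apply in rules:
            if cond(S, i):
                for S2, i2 in apply(S, i):
                    assert i2 <= i        # index decreases monotonically
                    stack.append((S2, i2))
    return done

# Two toy rules: "tile" consumes a node; "fuse" annotates without consuming
# (its condition forbids re-application, so enumeration terminates).
tile = (lambda S, i: True,
        lambda S, i: [(S + (f"tile@{i}",), i - 1)])
fuse = (lambda S, i: i >= 2 and not any(s.startswith("fuse") for s in S),
        lambda S, i: [(S + (f"fuse@{i}",), i)])
sketches = enumerate_sketches(2, [tile, fuse])
```

Starting from a two-node DAG, the enumeration yields two sketches: one where the fuse rule fired before tiling, and one that only tiled — illustrating how a single state can branch into several succeeding states.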
7. Evaluation

7.1. Evaluation platforms

Our experiments were conducted on two distinct hardware platforms to evaluate the performance of the proposed GTA framework:

• NVIDIA GPUs: We performed experiments on two NVIDIA GPUs, the RTX 3060 and the A100, both equipped with Tensor Cores optimized for deep learning tasks. The RTX 3060 represents a consumer-grade GPU, while the A100 is a data-center-grade GPU designed for high-performance computing.
• AMD CPU: We evaluated performance on an AMD Ryzen 7 7840H CPU,(2) which supports advanced SIMD (Single Instruction, Multiple Data) instructions, enabling efficient vectorized computations. This CPU platform provides a competitive environment for testing AVX512-like optimizations in general-purpose processors, allowing us to benchmark GTA's performance on non-GPU hardware.

7.2. Evaluated benchmarks

We evaluate the performance of GTA using both deep learning (DL) operators and complete neural network models.

• Operator-Level Evaluation: We select nine widely-used operators for this evaluation: General Matrix Multiplication (GEMM), 1D convolution (C1D), 2D convolution (C2D), 3D convolution (C3D), transposed 2D convolution (T2D), dilated convolution (DIL), batch matrix multiplication (BMM), General Matrix-Vector multiplication (GEMV), and scan (SCAN). For each operator, we test 6-10 different shape configurations and report the geometric mean of speedups normalized to GTA. The shape configurations are consistent with those used in Ansor and AMOS to ensure a fair comparison.
• Network-Level Evaluation: We benchmark six commonly-used neural network models: ResNet18 and ResNet50 [1], BERT (base configuration) [65], MI-LSTM [66], MobileNet-V1 [67], and ShuffleNet [11]. For each model, we evaluate performance with batch sizes of 1 and 16.

7.3. Comparison baselines

Our evaluation compares GTA against three state-of-the-art automatic generation methods (AutoTVM [49], Ansor [25] (v0.8), and AMOS [22] (commit: 0f39742)) as well as two vendor-optimized, hand-tuned libraries (cuDNN (v11.6) and PyTorch (v1.13.1, v2.0.1)):

• AutoTVM: This method uses hand-written templates to support all three selected platforms, demonstrating high performance across a range of baseline operators.
• AMOS: AMOS systematically explores various mappings of loop iterations to DLAs, representing the state of the art for operators with multiple feasible mappings, such as C1D and C2D.
• Ansor: As a leading method for GPU CUDA Core and CPU code generation, Ansor does not support DLAs like Tensor Core due to architectural limitations. However, comparing GTA with Ansor highlights the benefits of leveraging DLA-specific features in tensor program generation.
• PyTorch: PyTorch, a widely-used deep learning framework, serves as a strong baseline for evaluating GTA's ability to outperform standard hand-tuned implementations in practical deep learning applications. Our experiments include both PyTorch 1.13, which relies heavily on vendor-optimized libraries such as cuDNN and cuBLAS for high-performance computations, and PyTorch 2.0, which introduces the TorchInductor compiler.

For a fair comparison, we evaluate AutoTVM, Ansor, AMOS, and GTA with up to 200 measurement trials per test case and report the best performance achieved. For the vendor-optimized libraries on Tensor Core, we use PyTorch, which relies on hand-optimized libraries such as cuDNN to support various types of operators. These optimized libraries serve as strong baseline references for evaluating the performance of GTA.

7.4. Experimental results

We evaluate the performance of GTA on both operators and neural networks, comparing it against several baselines on two DLAs: GPU Tensor Cores and CPU AVX512. To further demonstrate the effectiveness of GTA, we analyze the quality of the generated search spaces and the efficiency of the exploration process. Finally, we highlight how the dual-task scheduling strategy significantly reduces compilation time by dynamically prioritizing subgraphs and mapping candidates, effectively cutting down unnecessary search effort.

7.5. Operator performance

Tensor Core. First, we compare GTA with PyTorch, which relies on hand-optimized libraries such as cuDNN to support various operators. Fig. 4 shows the results for all operators with batch size 1 on the NVIDIA RTX 3060. GTA consistently outperforms PyTorch across all operators, achieving a 2.44× geometric mean speedup. The speedup is attributed to GTA's comprehensive software-hardware mapping exploration, which contrasts with PyTorch's use of fixed mappings from hand-optimized libraries, often leading to suboptimal performance.

Next, we evaluate performance on the NVIDIA A100 GPU for various operators. As shown in Fig. 9, GTA achieves 1.26×, 5.24×, and 1.93× geometric mean speedups over Ansor, PyTorch, and AMOS, respectively. The significant improvement is due to GTA's ability to effectively utilize the high-performance Tensor Core units through enhanced mapping and scheduling strategies.

We also compare GTA with state-of-the-art compilers on the RTX 3060 using C2D in NCHW layout. We test all convolution layers from ResNet18 (a total of 12 configurations, labeled C0-C11). These configurations are standard benchmarks from well-known networks. The results are shown in Figs. 4, 5, and 6. GTA achieves speedups of 1.85×, 1.76×, and 2.10× over Ansor, AMOS, and hand-tuned PyTorch, respectively. Compared to Ansor, GTA leverages high-performance Tensor Core units alongside efficient auto-scheduling strategies, resulting in better optimization. In contrast to AMOS, GTA employs dual-task scheduling (DTS) to efficiently explore the scheduling space, reducing search time while enhancing program performance. Moreover, AMOS cannot utilize resource-constrained rules for shared memory allocation, leading to the generation of some tensor programs that exceed hardware resource limits. This limitation reduces AMOS's capability to achieve higher-performing programs.

AVX512. On the AMD CPU platform, we utilize hardware abstraction for AVX512 intrinsics (specifically for matrix-vector multiplication) and apply GTA to generate code for C2D. As shown in Fig. 7, GTA achieves 1.49× and 2.76× performance improvements over Ansor and PyTorch, respectively. GTA's advantage stems from combining high-performance AVX512 intrinsics with efficient auto-scheduling strategies, leading to superior program optimization compared to baseline methods.

(2) Intel CPUs also support AVX512 instructions and could be used for similar experiments.
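For reference, the "geometric mean of speedups" reported throughout this section is computed as below. The latency numbers in this sketch are invented for illustration only, not measured results.

```python
import math

def geomean_speedup(baseline_ms, candidate_ms):
    """Geometric mean of per-configuration speedups baseline/candidate."""
    ratios = [b / c for b, c in zip(baseline_ms, candidate_ms)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Three hypothetical shape configurations with speedups 2.0x, 3.0x, 1.0x:
# the geometric mean is (2 * 3 * 1)^(1/3), not the arithmetic mean 2.0.
speedup = geomean_speedup([4.0, 9.0, 2.0], [2.0, 3.0, 2.0])
```

The geometric mean is used rather than the arithmetic mean so that a single outlier configuration cannot dominate the reported number.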
Fig. 4. Single operator performance comparison on NVIDIA RTX 3060.
Fig. 5. Performance comparison of C2D on NVIDIA RTX 3060 with batch size = 1, using all convolution layers from ResNet18 (12 configurations, labeled C0-C11).
Fig. 6. Performance comparison of C2D on NVIDIA RTX 3060 with batch size = 16.
Fig. 7. Performance on AMD Ryzen 7 7840H CPU relative to Ansor and PyTorch.
Fig. 8. Performance of different networks relative to GTA on Tensor Core.
Fig. 9. Performance comparison of GTA across multiple individual operators on the NVIDIA A100 GPU, compared with baseline methods.
Fig. 10. Compilation time overhead and corresponding performance variations under different sampling rates.
7.6. Network performance

Fig. 8 illustrates the performance of GTA on six evaluated networks. On average, GTA achieves 1.75×, 1.42×, and 1.29× speedups over AMOS, PyTorch 1.13, and PyTorch 2.0 with TorchInductor, respectively. For ResNet18 and ResNet50, GTA finds better mappings for operators, enabling more extensive utilization of Tensor Cores compared to hand-tuned libraries and AMOS's optimized templates. GTA overcomes the limitations of these baselines by generating accurate search spaces that encompass most high-performance programs, along with an efficient search algorithm for finding optimal or near-optimal solutions. The results demonstrate GTA's capability to handle complex operators and effectively leverage Tensor Cores for high performance.

7.7. Compilation time

The search time overhead is a critical factor for practical deployment in deep learning frameworks, as reducing it can significantly enhance usability. To evaluate the efficiency of our dual-task scheduling strategy, we analyze the search time and corresponding performance variations under different sampling rates, specifically comparing GTA at sampling rates of 40% (GTA-0.4), 60% (GTA-0.6), and 100% (GTA-Raw). In this experiment, GTA operates at a sampling rate of 20% (GTA-0.2), representing a highly efficient configuration with minimal search overhead. The results, shown in Fig. 10, demonstrate that as the sampling rate decreases, the search time is significantly reduced while incurring less than a 5% performance degradation on average, thereby achieving an excellent balance between search efficiency and performance.

Additionally, we compare GTA's search time overhead and performance with AMOS, a state-of-the-art compiler designed for DLAs. Our findings reveal that GTA achieves an average performance improvement of 1.88× over AMOS while maintaining significantly lower search time. Specifically, AMOS's average compilation time is approximately five times that of GTA. This substantial reduction in search time underscores the effectiveness of GTA's dual-task scheduling strategy, which optimizes resource allocation during the search process and enables the rapid identification of high-performance tensor programs.

Unlike traditional methods that exhaustively explore all mapping candidates, GTA employs a dynamic prioritization strategy that adaptively allocates tuning resources based on performance feedback. This strategy ensures that the most promising subgraphs and intrinsic mapping candidates are prioritized, while less promising candidates receive fewer tuning opportunities. By combining this with a sampling-based approach, GTA minimizes unnecessary exploration while maintaining high-quality tensor programs. These results underscore GTA's suitability for real-world deployment scenarios, where both rapid code generation and performance optimization are critical. Furthermore, the ability to adjust sampling rates offers flexibility in balancing search time and performance, making GTA a robust solution for optimizing tensor programs across diverse workloads.

8. Related work

In addition to reviewing DLAs, we summarize related work on numeric precision and dynamic-shape optimization for deep learning.

Deep learning accelerators. DLAs offer several significant advantages, making them essential for advancing DNN research and deployment. First, DLAs feature large memory capacities, which accommodate the rapidly growing number of parameters in modern models and facilitate efficient training processes. Second, they provide model-specific optimizations while maintaining a degree of flexibility, enabling tailored performance improvements for various architectures; they also support a broader range of data formats, such as FP16, BF16, and INT8, which enhance computational efficiency and reduce memory usage. Third, DLAs are equipped with a high number of computing units, enabling extensive parallelism to handle the computational demands of DNNs effectively. These characteristics position DLAs as a cornerstone technology for accelerating the training and inference of deep learning models. Following this trend, many emerging accelerators have been proposed, targeting specific algorithms or utilizing new technologies. In academia, the DianNao
family [68-71] significantly improves DL computation performance by leveraging specialized functional units, memory hierarchy, and interconnects. Meanwhile, the expansion of DL applications in industry has led hardware vendors (e.g., NVIDIA Tensor Core [17-19] and Intel NNP [72]), internet giants (e.g., Tesla Dojo [73], Huawei Ascend [74], Google TPU [10] and Apple M4 [75,76]), and startups (e.g., Cambricon MLU [77] and Graphcore IPU [78]) to develop various DLAs. Both academic and industry DLAs are fundamentally domain-specific, rather than general-purpose accelerators, inevitably leading to complex and diverse architectural constraints.

Numeric precision optimization. Quantization [79,80], a pivotal technique in deep learning, reduces the numeric precision of weights and activations to enhance computational efficiency and lower resource requirements. By transitioning from high-precision formats such as FP32 to lower-precision formats like FP16, INT8, or even single-bit representations [81,82], quantization enables significant reductions in memory usage and power consumption [83,84]. The progression of hardware architectures aligns with the increasing demands for low-precision computations. For instance, NVIDIA's recent developments, such as the Turing and Ampere architectures, incorporated INT8 and INT4 tensor cores to enhance efficiency. Meanwhile, the latest Hopper architecture has shifted focus by replacing INT4 support with FP8 tensor cores, prioritizing improved numerical precision. These advancements allow large-scale models, including Large Language Models (LLMs) [85], to be deployed on resource-constrained devices like edge devices and DLAs without sacrificing performance. Compilers play a critical role in making quantization effective. Tools like AMOS [22], PreTuner [86] and LADDER [39] introduce advanced optimizations for low-precision data types, including hardware-aware scheduling, loop tiling, and fine-grained scaling strategies. Expanding on existing techniques, an automated approach [87] integrates bit-slicing into the scheduling phase, treating quantization as part of the schedule space. Coupled with program synthesis, this method efficiently generates hardware-specific kernels, supporting diverse quantization configurations and ensuring seamless adaptation to new hardware architectures.

Dynamic shape optimization. Dynamic-shape workloads are characteristic of DNN models where tensor shapes vary at runtime based on input data, such as the sequence length in Transformer models. These workloads pose substantial challenges for existing autotuning frameworks like TVM, which primarily rely on static input shapes to construct search spaces and cost models. For instance, TVM's second-generation IR, Relay [35], lacks the capability to represent dynamic tensors. While its third-generation IR, Relax [88], introduces symbolic shapes to support dynamic workloads, Relax still depends on hand-written templates for tensor program generation and lacks automatic tuning support. To address these limitations, recent works such as Nimble [89], DietCode [90], FTuner [91], and MIKPOLY [92] have introduced innovative techniques. These approaches construct shape-agnostic search spaces and cost models to optimize dynamic-shape workloads. For example, DietCode effectively groups kernels with varying shapes into unified workloads, enabling efficient tuning as a single entity and significantly reducing overall tuning time. FTuner introduces a uKernel-based approach for dynamic tensors, leveraging hardware-aware constraints to generate high-performance kernel programs and combining uKernels during runtime to optimize padding and execution efficiency. While these advancements mark significant progress, further research is needed to fully exploit the potential of dynamic-shape DNNs on modern hardware accelerators.

9. Conclusion

We propose GTA, a novel compilation framework for high-performance tensor program generation on DLAs. GTA expands the search space by coordinating intrinsic-based automatic mapping abstraction with a rule-based tensor program generation strategy and applies pruning rules to eliminate ineffective program candidates. Additionally, GTA employs a dual-task scheduling strategy for tensorized programs, effectively reducing tuning efforts while enhancing performance. Experimental results on three DLAs show that GTA outperforms state-of-the-art automatic generation approaches and vendor-provided hand-tuned libraries by 1.88× and 2.29×, respectively.

CRediT authorship contribution statement

Anxing Xie: Writing - original draft, Software, Resources, Project administration, Methodology, Investigation, Data curation. Yonghua Hu: Writing - review & editing, Supervision, Investigation, Funding acquisition. Yaohua Wang: Writing - review & editing, Supervision, Methodology, Investigation, Funding acquisition, Formal analysis. Zhe Li: Writing - review & editing, Supervision, Investigation, Formal analysis. Yuxiang Gao: Investigation. Zenghua Cheng: Investigation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable suggestions. This work is supported by the National Key R&D Program of China (No. 2022ZD0119003), Hunan Provincial Natural Science Foundation (No. 2023JJ50019), the Postgraduate Scientific Research Innovation Project of Hunan Province (No. CX20231019) and the National Natural Science Foundation of China (No. 62272477).

Data availability

Data will be made available on request.

References

[1] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[2] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, et al., Evolving deep neural networks, in: Artificial Intelligence in the Age of Neural Networks and Brain Computing, Elsevier, 2024, pp. 269-287.
[3] C.-Y. Wang, I.-H. Yeh, H.-Y. Mark Liao, Yolov9: Learning what you want to learn using programmable gradient information, in: European Conference on Computer Vision, Springer, 2025, pp. 1-21.
[4] A. Vaswani, Attention is all you need, Adv. Neural Inf. Process. Syst. (2017).
[5] P.P. Ray, ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope, Internet Things Cyber-Phys. Syst. 3 (2023) 121-154.
[6] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The llama 3 herd of models, 2024, arXiv preprint arXiv:2407.21783.
[7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213-3223.
[8] D. Fu, X. Li, L. Wen, M. Dou, P. Cai, B. Shi, Y. Qiao, Drive like a human: Rethinking autonomous driving with large language models, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 910-919.
[9] C. Cui, Y. Ma, X. Cao, W. Ye, Y. Zhou, K. Liang, J. Chen, J. Lu, Z. Yang, K.-D. Liao, et al., A survey on multimodal large language models for autonomous driving, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 958-979.
[10] N.P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., In-datacenter performance analysis of a tensor processing unit, in: Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 1–12.
[11] X. Zhang, X. Zhou, M. Lin, J. Sun, Shufflenet: An extremely efficient convolutional neural network for mobile devices, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
[12] C.-Y. Wang, A. Bochkovskiy, H.-Y.M. Liao, YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7464–7475.
[13] Z. Xu, W. Wang, H. Dai, Y. Xu, XFC: Enabling automatic and fast operator synthesis for mobile deep learning compilation, J. Syst. Archit. 142 (2023) 102921.
[14] C. Hao, X. Zhang, Y. Li, S. Huang, J. Xiong, K. Rupnow, W.-m. Hwu, D. Chen, FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge, in: Proceedings of the 56th Annual Design Automation Conference 2019, 2019, pp. 1–6.
[15] W. Jiang, L. Yang, E.H.-M. Sha, Q. Zhuge, S. Gu, S. Dasgupta, Y. Shi, J. Hu, Hardware/software co-exploration of neural architectures, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 39 (12) (2020) 4805–4815.
[16] Z. Xie, M. Emani, X. Yu, D. Tao, X. He, P. Su, K. Zhou, V. Vishwanath, Centimani: Enabling fast AI accelerator selection for DNN training with a novel performance predictor, in: 2024 USENIX Annual Technical Conference, USENIX ATC 24, 2024, pp. 1203–1221.
[17] Nvidia, Ampere architecture white paper, 2022, URL: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Online (Accessed 13 November 2024).
[18] Nvidia, Turing architecture white paper, 2022, URL: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Online (Accessed 13 November 2024).
[19] Nvidia, Volta architecture white paper, 2022, URL: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Online (Accessed 13 November 2024).
[20] K. Troester, R. Bhargava, AMD next generation Zen 4 core and 4th Gen AMD EPYC™ 9004 server CPU, in: 2023 IEEE Hot Chips 35 Symposium, HCS, IEEE Computer Society, 2023, pp. 1–25.
[21] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, et al., TVM: An automated end-to-end optimizing compiler for deep learning, in: 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 18, 2018, pp. 578–594.
[22] S. Zheng, R. Chen, A. Wei, Y. Jin, Q. Han, L. Lu, B. Wu, X. Li, S. Yan, Y. Liang, AMOS: enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction, in: Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022, pp. 874–887.
[23] S. Feng, B. Hou, H. Jin, W. Lin, J. Shao, R. Lai, Z. Ye, L. Zheng, C.H. Yu, Y. Yu, et al., Tensorir: An abstraction for automatic tensorized program optimization, in: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2023, pp. 804–817.
[24] J. Bi, Q. Guo, X. Li, Y. Zhao, Y. Wen, Y. Guo, E. Zhou, X. Hu, Z. Du, L. Li, et al., Heron: Automatically constrained high-performance library generation for deep learning accelerators, in: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2023, pp. 314–328.
[25] L. Zheng, C. Jia, M. Sun, Z. Wu, C.H. Yu, A. Haj-Ali, Y. Wang, J. Yang, D. Zhuo, K. Sen, et al., Ansor: Generating high-performance tensor programs for deep learning, in: 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 20, 2020, pp. 863–879.
[26] S. Zheng, Y. Liang, S. Wang, R. Chen, K. Sheng, Flextensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system, in: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 859–873.
[27] A. Sabne, Xla: Compiling machine learning for peak performance, Google Res (2020).
[28] N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W.S. Moses, S. Verdoolaege, A. Adams, A. Cohen, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, 2018, arXiv preprint arXiv:1802.04730.
[29] P. Tillet, H.-T. Kung, D. Cox, Triton: an intermediate language and compiler for tiled neural network computations, in: Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019, pp. 10–19.
[30] C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, O. Zinenko, MLIR: A compiler infrastructure for the end of Moore's law, 2020, arXiv preprint arXiv:2002.11054.
[31] L. Ma, Z. Xie, Z. Yang, J. Xue, Y. Miao, W. Cui, W. Hu, F. Yang, L. Zhang, L. Zhou, Rammer: Enabling holistic deep learning compiler optimizations with rtasks, in: 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 20, 2020, pp. 881–897.
[32] J. Zhao, B. Li, W. Nie, Z. Geng, R. Zhang, X. Gao, B. Cheng, C. Wu, Y. Cheng, Z. Li, et al., AKG: automatic kernel generation for neural processing units using polyhedral transformations, in: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021, pp. 1233–1248.
[33] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., Pytorch: An imperative style, high-performance deep learning library, Adv. Neural Inf. Process. Syst. 32 (2019).
[34] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., TensorFlow: a system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 16, 2016, pp. 265–283.
[35] J. Roesch, S. Lyubomirsky, M. Kirisame, L. Weber, J. Pollock, L. Vega, Z. Jiang, T. Chen, T. Moreau, Z. Tatlock, Relay: A high-level compiler for deep learning, 2019, arXiv preprint arXiv:1904.08368.
[36] J. Zhao, X. Gao, R. Xia, Z. Zhang, D. Chen, L. Chen, R. Zhang, Z. Geng, B. Cheng, X. Jin, Apollo: Automatic partition-based operator fusion through layer by layer optimization, in: MLSys, 2022.
[37] Y. Shi, Z. Yang, J. Xue, L. Ma, Y. Xia, Z. Miao, Y. Guo, F. Yang, L. Zhou, Welder: Scheduling deep learning memory access via tile-graph, in: 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 23, 2023, pp. 701–718.
[38] C. Xia, J. Zhao, Q. Sun, Z. Wang, Y. Wen, T. Yu, X. Feng, H. Cui, Optimizing deep learning inference via global analysis and tensor expressions, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2024, pp. 286–301.
[39] L. Wang, L. Ma, S. Cao, Q. Zhang, J. Xue, Y. Shi, N. Zheng, Z. Miao, F. Yang, T. Cao, et al., Ladder: Enabling efficient low-precision deep learning computing through hardware-aware tensor transformation, in: 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 24, 2024, pp. 307–323.
[40] F. Wang, M. Shen, Y. Lu, N. Xiao, TensorMap: A deep RL-based tensor mapping framework for spatial accelerators, IEEE Trans. Comput. (2024).
[41] Y. Zhao, H. Sharif, V. Adve, S. Misailovic, Felix: Optimizing tensor programs with gradient descent, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024, pp. 367–381.
[42] Q. Zhao, R. Wang, Y. Liu, H. Yang, Z. Luan, D. Qian, Sifter: An efficient operator auto-tuner with speculative design space exploration for deep learning compiler, IEEE Trans. Comput. (2024).
[43] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, S. Amarasinghe, Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, Acm Sigplan Not. 48 (6) (2013) 519–530.
[44] Y. Bai, X. Yao, Q. Sun, W. Zhao, S. Chen, Z. Wang, B. Yu, Gtco: Graph and tensor co-design for transformer-based image recognition on tensor cores, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. (2023).
[45] H. Kwon, P. Chatarasi, V. Sarkar, T. Krishna, M. Pellauer, A. Parashar, Maestro: A data-centric approach to understand reuse, performance, and hardware cost of dnn mappings, IEEE Micro 40 (3) (2020) 20–29.
[46] L. Lu, N. Guan, Y. Wang, L. Jia, Z. Luo, J. Yin, J. Cong, Y. Liang, Tenet: A framework for modeling tensor dataflow based on relation-centric notation, in: 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture, ISCA, IEEE, 2021, pp. 720–733.
[47] A. Parashar, P. Raina, Y.S. Shao, Y.-H. Chen, V.A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S.W. Keckler, J. Emer, Timeloop: A systematic approach to dnn accelerator evaluation, in: 2019 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS, IEEE, 2019, pp. 304–315.
[48] X. Yang, M. Gao, Q. Liu, J. Setter, J. Pu, A. Nayak, S. Bell, K. Cao, H. Ha, P. Raina, et al., Interstellar: Using halide's scheduling language to analyze dnn accelerators, in: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 369–383.
[49] T. Chen, L. Zheng, E. Yan, Z. Jiang, T. Moreau, L. Ceze, C. Guestrin, A. Krishnamurthy, Learning to optimize tensor programs, Adv. Neural Inf. Process. Syst. 31 (2018).
[50] J. Appleyard, S. Yokim, NVIDIA developer technical blog, 2017, URL: https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9 Online (Accessed 13 November 2024).
[51] NVIDIA, Basic linear algebra on NVIDIA GPUs, 2024, URL: https://developer.nvidia.com/cublas Online (Accessed 13 November 2024).
[52] A. Kerr, H. Wu, M. Gupta, D. Blasig, P. Ramini, D. Merrill, A. Shivam, P. Majcher, P. Springer, M. Hohnerbach, J. Wang, M. Nicely, CUTLASS, 2022, URL: https://github.com/NVIDIA/cutlass Online (Accessed 13 November 2024).
[53] T. Zerrell, J. Bruestle, Stripe: Tensor compilation via the nested polyhedral model, 2019, arXiv preprint arXiv:1903.06498.
[54] R. Baghdadi, J. Ray, M.B. Romdhane, E. Del Sozzo, A. Akkas, Y. Zhang, P. Suriana, S. Kamil, S. Amarasinghe, Tiramisu: A polyhedral compiler for expressing fast and portable code, in: 2019 IEEE/ACM International Symposium on Code Generation and Optimization, CGO, IEEE, 2019, pp. 193–205.
[55] S. Tavarageri, A. Heinecke, S. Avancha, B. Kaul, G. Goyal, R. Upadrasta, Polydl: Polyhedral optimizations for creation of high-performance dl primitives, ACM Trans. Archit. Code Optim. (TACO) 18 (1) (2021) 1–27.
[56] Q. Huang, M. Kang, G. Dinh, T. Norell, A. Kalaiah, J. Demmel, J. Wawrzynek, Y.S. Shao, Cosa: Scheduling by constrained optimization for spatial accelerators, in: 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture, ISCA, IEEE, 2021, pp. 554–566.
[57] M. Sotoudeh, A. Venkat, M. Anderson, E. Georganas, A. Heinecke, J. Knight, ISA mapper: a compute and hardware agnostic deep learning compiler, in: Proceedings of the 16th ACM International Conference on Computing Frontiers, 2019, pp. 164–173.
[58] J. Weng, A. Jain, J. Wang, L. Wang, Y. Wang, T. Nowatzki, UNIT: Unifying tensorized instruction compilation, in: 2021 IEEE/ACM International Symposium on Code Generation and Optimization, CGO, IEEE, 2021, pp. 77–89.
[59] H. Zhu, R. Wu, Y. Diao, S. Ke, H. Li, C. Zhang, J. Xue, L. Ma, Y. Xia, W. Cui, et al., Roller: Fast and efficient tensor compilation for deep learning, in: 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 22, 2022, pp. 233–248.
[60] Y. Ding, C.H. Yu, B. Zheng, Y. Liu, Y. Wang, G. Pekhimenko, Hidet: Task-mapping programming paradigm for deep learning tensor programs, in: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2023, pp. 370–384.
[61] L. Zheng, H. Wang, J. Zhai, M. Hu, Z. Ma, T. Wang, S. Huang, X. Miao, S. Tang, K. Huang, et al., EINNET: Optimizing tensor programs with derivation-based transformations, in: 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 23, 2023, pp. 739–755.
[62] Y. Zhai, S. Yang, K. Pan, R. Zhang, S. Liu, C. Liu, Z. Ye, J. Ji, J. Zhao, Y. Zhang, et al., Enabling tensor language model to assist in generating high-performance tensor programs for deep learning, in: 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 24, 2024, pp. 289–305.
[63] F. Wang, M. Shen, Y. Ding, N. Xiao, Soter: Analytical tensor-architecture modeling and automatic tensor program tuning for spatial accelerators, in: 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture, ISCA, IEEE, 2024, pp. 991–1004.
[64] F. Wang, M. Shen, Automatic kernel generation for large language models on deep learning accelerators, in: 2023 IEEE/ACM International Conference on Computer Aided Design, ICCAD, IEEE, 2023, pp. 1–9.
[65] J. Devlin, Bert: Pre-training of deep bidirectional transformers for language understanding, 2018, arXiv preprint arXiv:1810.04805.
[66] Y. Wu, S. Zhang, Y. Zhang, Y. Bengio, R.R. Salakhutdinov, On multiplicative integration with recurrent neural networks, Adv. Neural Inf. Process. Syst. 29 (2016).
[67] A.G. Howard, Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017, arXiv preprint arXiv:1704.04861.
[68] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, O. Temam, Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning, ACM SIGARCH Comput. Archit. News 42 (1) (2014) 269–284.
[69] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., Dadiannao: A machine-learning supercomputer, in: 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, IEEE, 2014, pp. 609–622.
[70] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, Y. Chen, Pudiannao: A polyvalent machine learning accelerator, ACM SIGARCH Comput. Archit. News 43 (1) (2015) 369–381.
[71] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, O. Temam, ShiDianNao: Shifting vision processing closer to the sensor, in: Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015, pp. 92–104.
[72] B. Hickmann, J. Chen, M. Rotzin, A. Yang, M. Urbanski, S. Avancha, Intel nervana neural network processor-t (nnp-t) fused floating point many-term dot product, in: 2020 IEEE 27th Symposium on Computer Arithmetic, ARITH, IEEE, 2020, pp. 133–136.
[73] E. Talpes, D. Williams, D.D. Sarma, Dojo: The microarchitecture of Tesla's exa-scale computer, in: 2022 IEEE Hot Chips 34 Symposium, HCS, IEEE Computer Society, 2022, pp. 1–28.
[74] H. Liao, J. Tu, J. Xia, H. Liu, X. Zhou, H. Yuan, Y. Hu, Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper, in: 2021 IEEE International Symposium on High-Performance Computer Architecture, HPCA, IEEE, 2021, pp. 789–801.
[75] Apple, Apple introduces M4 chip, 2024, URL: https://www.apple.com/sg/newsroom/2024/05/apple-introduces-m4-chip/ Online (Accessed 13 November 2024).
[76] Apple, Apple introduces M4 pro and M4 max, 2024, URL: https://www.apple.com/sg/newsroom/2024/10/apple-introduces-m4-pro-and-m4-max/ Online (Accessed 13 November 2024).
[77] Cambricon, Cambricon MLU, 2024, URL: https://www.cambricon.com/ Online (Accessed 13 November 2024).
[78] Z. Jia, B. Tillman, M. Maggioni, D.P. Scarpazza, Dissecting the graphcore ipu architecture via microbenchmarking, 2019, arXiv preprint arXiv:1912.03413.
[79] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Quantized neural networks: Training neural networks with low precision weights and activations, J. Mach. Learn. Res. 18 (187) (2018) 1–30.
[80] T. Liang, J. Glossner, L. Wang, S. Shi, X. Zhang, Pruning and quantization for deep neural network acceleration: A survey, Neurocomput. 461 (2021) 370–403.
[81] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1, 2016, arXiv preprint arXiv:1602.02830.
[82] M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, Xnor-net: Imagenet classification using binary convolutional neural networks, in: European Conference on Computer Vision, Springer, 2016, pp. 525–542.
[83] C.-C. Yang, Y.-R. Chen, H.-H. Liao, Y.-M. Chang, J.-K. Lee, Auto-tuning fixed-point precision with TVM on RISC-v packed SIMD extension, ACM Trans. Des. Autom. Electron. Syst. 28 (3) (2023) 1–21.
[84] D. Diamantopoulos, B. Ringlein, M. Purandare, G. Singh, C. Hagleitner, Agile autotuning of a transprecision tensor accelerator overlay for TVM compiler stack, in: 2020 30th International Conference on Field-Programmable Logic and Applications, FPL, IEEE, 2020, pp. 310–316.
[85] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, H. Jin, T. Chen, Z. Jia, Towards efficient generative large language model serving: A survey from algorithms to systems, 2023, arXiv preprint arXiv:2312.15234.
[86] J. Xu, G. Song, B. Zhou, F. Li, J. Hao, J. Zhao, A holistic approach to automatic mixed-precision code generation and tuning for affine programs, in: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024, pp. 55–67.
[87] M. Cowan, T. Moreau, T. Chen, J. Bornholt, L. Ceze, Automatic generation of high-performance quantized machine learning kernels, in: Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization, 2020, pp. 305–316.
[88] R. Lai, J. Shao, S. Feng, S.S. Lyubomirsky, B. Hou, W. Lin, Z. Ye, H. Jin, Y. Jin, J. Liu, et al., Relax: Composable abstractions for end-to-end dynamic machine learning, 2023, arXiv preprint arXiv:2311.02103.
[89] H. Shen, J. Roesch, Z. Chen, W. Chen, Y. Wu, M. Li, V. Sharma, Z. Tatlock, Y. Wang, Nimble: Efficiently compiling dynamic neural networks for model inference, Proc. Mach. Learn. Syst. 3 (2021) 208–222.
[90] B. Zheng, Z. Jiang, C.H. Yu, H. Shen, J. Fromm, Y. Liu, Y. Wang, L. Ceze, T. Chen, G. Pekhimenko, DietCode: Automatic optimization for dynamic tensor programs, Proc. Mach. Learn. Syst. 4 (2022) 848–863.
[91] P. Mu, L. Wei, Y. Liu, R. Wang, FTuner: A fast dynamic shape tensors program auto-tuner for deep learning compilers, 2024, arXiv preprint arXiv:2407.21418.
[92] F. Yu, G. Li, J. Zhao, H. Cui, X. Feng, J. Xue, Optimizing dynamic-shape neural networks on accelerators via on-the-fly micro-kernel polymerization, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, pp. 797–812.

Anxing Xie is currently working toward a Ph.D. degree in the School of Computer Science and Engineering, Hunan University of Science and Technology, China. He is currently working on deep learning automatic compilation optimization and high-performance computation. His research interests include compiler optimization and parallel computing.

Yonghua Hu is a professor in the School of Computer Science and Engineering, Hunan University of Science and Technology, China. He received the Ph.D. degree in Computer Application Technology from Hunan University in 2008. He went to University at Buffalo SUNY as a visiting scholar in 2019. His research interests include compilation optimization, artificial intelligence and parallel computing.
Yaohua Wang is currently a professor with the College of Computer Science, National University of Defense Technology. His research interest is in computer architecture, machine learning and security. His work spans and stretches the boundaries of computer architecture. He is especially excited about novel, fundamentally-efficient computation and memory/storage paradigms, applied to emerging machine learning applications.

Zhe Li received the Ph.D. degree in Computer Science from Jilin University in 2022. He is currently working at Tianjin Advanced Technology Institute. His research interests include deep learning compilation and combinatorial optimization.

Yuxiang Gao is currently working toward an M.S. degree in the School of Computer Science and Engineering, Hunan University of Science and Technology, China. He is currently working on code optimization and compilation technology. His research interests include automatic compilation optimization and code generation.

Zenghua Cheng is currently working toward an M.S. degree in the School of Computer Science and Engineering, Hunan University of Science and Technology, China. He is currently working on code optimization and compilation technology. His research interests include automatic compilation optimization and Web security.