Journal of Systems Architecture 160 (2025) 103359
Contents lists available at ScienceDirect
Journal of Systems Architecture
journal homepage: www.elsevier.com/locate/sysarc
GTA: Generating high-performance tensorized program with dual-task
scheduling
Anxing Xie a,1, Yonghua Hu a,*, Yaohua Wang b, Zhe Li b,c, Yuxiang Gao a, Zenghua Cheng a
a School of Computer Science and Engineering, Hunan University of Science and Technology, Taoyuan Road, Xiangtan, 411201, Hunan, China
b School of Computer Science, National University of Defense Technology, Deya Road, Changsha, 410073, Hunan, China
c Tianjin Institute of Advanced Technology, Huixiang Road, 300459, Tianjin, China
ARTICLE INFO

Keywords: Mapping, Code generation, Compiler optimization, Tensor computation

ABSTRACT

Generating high-performance tensorized programs for deep learning accelerators (DLAs) is crucial for ensuring the efficient execution of deep neural networks. However, producing such programs for different operators across various DLAs is notoriously challenging. Existing methods utilize hardware abstraction to represent acceleration intrinsics, enabling end-to-end automated exploration of the intrinsics mapping space. However, their limited search space and inefficient exploration strategies often result in suboptimal tensorized programs and significant search time overhead.
In this paper, we propose GTA, a framework designed to generate high-performance tensorized programs
for DLAs. Unlike existing deep learning compilers, we first coordinate intrinsic-based mapping abstraction with
rule-based program generation strategy, followed by the application of resource-constrained rules to eliminate
ineffective tensor program candidates from the search space. Second, we employ a dual-task scheduling strategy
to allocate tuning resources across multiple subgraphs of deep neural networks and their mapping candidates.
As a result, GTA can find high-performance tensor programs that are outside the search space of existing
state-of-the-art methods. Our experiments show that GTA achieves an average speedup of more than 1.88×
over AMOS and 2.29× over Ansor on NVIDIA GPU with Tensor Core, as well as 1.49× over Ansor and 2.76×
over PyTorch on CPU with AVX512.
1. Introduction

Recently, the successful deployment of machine learning models has revolutionized diverse application domains, such as image recognition [1-3], natural language processing [4-6], and autonomous driving [7-9]. This rapid development has created a demand for generating high-performance tensor programs for deep learning accelerators (DLAs), such as Google TPUs [10], mobile devices [11-13], FPGAs [14-16], and more. To accelerate machine learning, hardware vendors have introduced domain-specific intrinsics for tensor computations, such as NVIDIA's Tensor Cores [17-19] and CPU's AVX512 [20]. This demand has led to the process known as tensorization [21], which involves transforming computations using these intrinsic instructions. However, hardware specialization complicates the task of generating high-performance tensorized programs.

To support hardware intrinsic instructions across different accelerators, existing methods [22-24] use unified hardware abstractions to enable end-to-end automatic mapping space exploration. These abstractions not only convert opaque intrinsics into an analyzable format but also bridge the gap between high-level tensor programs and low-level instructions, a process we refer to as tensorized program generation with automatic mapping optimization. However, generating high-performance tensorized programs for various DLAs remains challenging for several reasons.

Firstly, inefficient exploration of the intrinsic mapping space leads to substantial overhead in search time. For instance, mapping the 7 loops of a 2D convolution to the 3D loops of the Tensor Core can involve 35 different ways [22]. Current strategies [22,23] treat each mapping candidate equally, generating a tensorized program for each and ultimately selecting the one with the best performance. This approach incurs significant time overhead and is inefficient, as it fails to prioritize more promising candidates during the exploration process. Our experiments reveal that many mapping candidates for a given subgraph ultimately fail to produce high-performance tensorized programs, indicating that a large portion of the explored mappings are ineffective in optimizing performance.
* Corresponding author.
E-mail address: huyh@hnust.cn (Y. Hu).
1 Part of this work was done at National University of Defense Technology.
https://doi.org/10.1016/j.sysarc.2025.103359
Received 23 November 2024; Received in revised form 8 January 2025; Accepted 30 January 2025
Available online 7 February 2025
1383-7621/© 2025 Published by Elsevier B.V.
Fig. 1. Comparison of different task scheduling strategies. Part (a): task scheduling with gradient descent. In round 1, all 𝑡𝑎𝑠𝑘𝑠𝑖 are executed sequentially. In subsequent rounds, 𝑡𝑎𝑠𝑘𝑠𝑖 are selectively executed based on the performance gradients calculated from the feedback of each task. Part (b): sequential execution of sub-tasks without dual-task scheduling. Part (c): time is sliced and important subgraphs and intrinsic mapping candidates are prioritized, meaning that not all main-tasks and sub-tasks will be executed. For example, an intrinsic-enabled 𝑚𝑎𝑖𝑛-𝑡𝑎𝑠𝑘𝑖 may contain both retained and discarded mapping candidates. The former will proceed to subsequent tensor program optimization and tuning, while the latter will not participate in further optimization unless they are selected in the next scheduling round. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Secondly, existing rule-based tensor program exploration methods [25] lack the ability to perform automatic tuning and optimization tailored to domain-specific intrinsics. As a result, these methods often fail in auto-tuning and produce suboptimal tensorized programs. To overcome these limitations, there is an urgent need for more efficient exploration of subgraph mapping spaces, along with auto-tuning strategies that can effectively support domain-specific intrinsics, enabling the automatic generation of high-performance tensorized programs.

In this paper, we introduce GTA, a new compiler framework designed to generate high-performance tensorized programs. GTA automatically generates an extensive search space optimized for hardware intrinsics, simultaneously increasing the likelihood of selecting the most efficient mapping configuration. For generating the search space, we employ rule-based strategies to construct a large scheduling search space and apply pruning techniques based on hardware cache resource limitations to eliminate invalid program candidates. Finally, as shown in Fig. 1, for search strategy implementation, we use a dual-task scheduling algorithm to allocate tuning resources across all subgraphs (𝑚𝑎𝑖𝑛-𝑡𝑎𝑠𝑘𝑖, as shown by the blue box in Fig. 1) in the neural network and their intrinsic mapping candidates (𝑠𝑢𝑏-𝑡𝑎𝑠𝑘𝑖, as shown by the orange box and gray box). This algorithm prioritizes subgraphs with greater potential for performance improvement, allocating them more tuning opportunities, while reducing tuning efforts on less promising mapping candidates based on performance feedback, thereby minimizing overall tuning time. In summary, this paper makes the following contributions:

• We integrated intrinsic-based mapping abstraction with a rule-based program generation strategy to expand the search space significantly.
• We developed and implemented an efficient dual-task scheduling strategy for tensorized programs, effectively reducing tuning efforts while enhancing performance.
• We propose a compilation framework called GTA, which supports the generation of high-performance tensorized programs at both the operator level and the full network level on NVIDIA GPUs and CPUs.
• We implemented and comprehensively evaluated the GTA system, demonstrating that the aforementioned techniques outperform state-of-the-art systems across various deep neural networks (DNNs).

2. Background and motivation

2.1. Deep learning compilers

Deep learning compilers [21-32] have emerged as essential tools for bridging the gap between deep learning models and diverse hardware backends. These compilers take model definitions, expressed in frameworks like PyTorch [33] or TensorFlow [34], as inputs and generate efficient code implementations for specific hardware platforms, such as CPUs and GPUs. The compilation process often adopts a progressive multi-layer optimization approach. It begins with the front-end, where neural network models serve as input, and proceeds through intermediate representation (IR) stages. These include graph-level IR [35-39] for structural optimizations and loop-level IR [40-42] for fine-grained transformations. Finally, the back-end generates hardware-specific executable code using traditional compiler techniques, ensuring efficient execution on the target platform.

A key innovation in deep learning compilers is the compute-schedule separation first introduced by Halide [43] and adopted by frameworks like TVM [21]. Compute represents the mathematical description of tensor operations, such as addition, convolution, or matrix multiplication, while schedule defines how these operations are executed on hardware. Schedule specifies program transformations, including loop tiling, vectorization, and unrolling, to optimize performance for specific hardware architectures. This decoupling simplifies the representation of tensor computations, enabling flexible optimization strategies tailored to different backends.

Recent advancements [22-24,44] in deep learning compilers focus on leveraging hardware intrinsics to further optimize tensor programs. By integrating intrinsic-specific mapping abstractions, these compilers can directly utilize the specialized instructions of DLAs, such as NVIDIA's Tensor Cores or CPU's AVX512, to achieve higher computational efficiency. These developments mark a shift from general-purpose optimizations to hardware-aware designs, laying the foundation for intrinsic-based mapping strategies.

2.2. Intrinsic-based mapping abstraction

The development of DLAs has led to the creation of specialized instructions [45-48], known as intrinsics, designed to enhance the computational efficiency of tensor operations. These instructions serve as essential interfaces between hardware and compilers, enabling optimized execution of key operations like matrix multiplication and data movement.

Intrinsics provide an efficient mechanism for managing kernel operations in tensor programs, typically categorized into compute intrinsics for performing computations and memory intrinsics for data handling [22]. For example, NVIDIA Tensor Cores [17-19] and CPU AVX512 [20] offer specialized intrinsics that allow accelerated matrix and vector operations, respectively, facilitating high-performance computation across various accelerators.

Intrinsic-based mapping abstraction further unifies tensor program optimization by representing diverse intrinsic behaviors in a common, analyzable form. Frameworks like AMOS [22] and TensorIR [23] leverage this approach to directly map software operations to hardware intrinsics, supporting automated generation and transformation of tensorized programs. This abstraction broadens the search space for high-performance configurations by identifying fundamental software-to-hardware mappings, thus enhancing optimization potential across different hardware backends.

Table 1
State-of-the-art compilers' mappings for hardware accelerators.

Name         | Mapping Method
❶
AutoTVM      | Hand-written templates + Tuning
Triton       | Hand-written templates
❷
Tiramisu     | Polyhedral model
AKG          | Polyhedral model + Templates
❸
Ansor        | Generated rules + Tuning
XLA          | Templates and rules
Heron        | Constraint-based rules + Tuning
MetaSchedule | Generated rules + Tuning
❹
UNIT         | Analyzable abstraction + Tuning
ROLLER       | Tile abstraction + Construction policy
AMOS         | Analyzable abstraction + Tuning
TensorIR     | Analyzable abstraction and generated rules + Tuning
❺
Hidet        | Task-mapping + Post-scheduling fusion
EINNET       | Derivation-based + Tuning
TensorMap    | Reinforcement learning + Tuning

GTA          | Analyzable abstraction and generated rules + Tuning

2.3. Tensor program generation strategy

In Table 1, we summarize state-of-the-art compiler mapping techniques used to generate optimized tensor programs on hardware accelerators. Most existing compilers leverage programmable intrinsics as part of their mapping strategy, enabling developers to focus on high-level optimization while the compiler handles low-level architectural details. These mapping methods streamline tensor program generation by abstracting hardware-specific operations, thereby enhancing both efficiency and portability.

Specifically, we categorize the state-of-the-art compilers/mappers for DLAs into five main approaches:

❶ Hand-written mapping: Hand-written mapping [29,49] requires developers to manually define mappings for tensorized programs using compiler-provided tensorize interfaces. This approach enables fine-grained optimization, especially for specialized hardware like NVIDIA Tensor Cores. However, it demands significant expertise and high development costs, as developers must continually rewrite templates to support new operators and accelerators [50-52]. While hand-written mapping can achieve high performance for specific workloads, its lack of scalability and adaptability limits its effectiveness compared to more automated methods.

❷ Polyhedral model mapping: Polyhedral model mapping [28,32,53-56] provides a powerful strategy for optimizing tensor programs by restructuring execution and managing complex memory dependencies. In the realm of tensor program compilation, this approach plays a critical role in handling intricate memory structures and optimizing execution. For example, AKG [32] leverages polyhedral scheduling to restructure execution order through new linear relationships, effectively eliminating inter-loop dependencies. This method is particularly advantageous for hardware like TPUs, where enhancing parallel computation is essential. By exploring a broader range of affine transformations compared to methods such as TVM [21], polyhedral mapping optimizes performance for diverse workloads. However, the model's inherent complexity limits its general applicability, making it less feasible for simpler or less resource-intensive tasks.

❸ Rule-based mapping: Rule-based mapping [24-27,57] generates efficient tensor programs through predefined scheduling primitives, streamlining tensor program creation without user-defined templates. This approach leverages scheduling techniques like loop tiling, fusion, and vectorization, as demonstrated by frameworks like Ansor [25], which automatically create search spaces using these rules. This method simplifies tensor program generation in deep learning applications. However, it also has limitations: users must ensure that the predefined rules align with the specific operators and hardware, or the generated programs may fail to achieve optimal performance.

❹ Analyzable abstraction mapping: Analyzable abstraction mapping [22,23,44,58,59] unifies tensor program optimization by abstracting diverse hardware intrinsic behaviors into a common representation, facilitating efficient mapping and transformation of tensorized programs. Examples like AMOS and TensorIR establish direct mappings between software and hardware, guiding the automated generation of tensorized programs. This approach broadens the scope of exploration by identifying foundational software-to-hardware combinations, increasing the potential for discovering optimized mappings.

❺ Other mapping: Other mapping methods [13,40,60,61] reformulate deep learning optimization problems using strategies from other domains to enhance efficiency. For example, CoSA [56] and Heron [24] convert the scheduling space search into a constrained optimization problem and leverage solvers to rapidly explore the space. Alternatively, TLM [62] and Soter [63] treat tensor program exploration as a language model generation task, where tensor programs are represented as sequences and tunable parameters as language tokens. Specifically, they leverage a large language model (LLM) to generate these tokens for tunable parameters, enabling efficient exploration of mapping schemes and more effective optimization of tensor programs.

Building on this foundation, we reviewed five primary mapping approaches used for deep learning accelerators: hand-written, rule-based, polyhedral model, analyzable abstraction, and other mapping methods. Each approach brings unique advantages: hand-written and rule-based mappings allow fine-tuned performance but require extensive manual intervention or rigid predefined rules, while polyhedral and analyzable abstraction mappings offer more automated solutions but are challenged by complexity and limited applicability. Methods borrowing from other domains, such as optimization solvers and language models, open new directions but may lack consistency across diverse hardware. In summary, intrinsic-based mapping abstraction offers a unified framework for optimizing tensor programs across diverse hardware accelerators by abstracting hardware intrinsic behaviors into a common representation. Systems like AMOS and TensorIR leverage this approach to enable efficient and adaptable mappings for tensorized programs.

Despite these advances, significant challenges remain in achieving flexible, high-performance mappings that are adaptable to new hardware accelerators, such as the inefficiency of existing approaches in handling diverse architectural constraints and their inability to effectively explore large and complex search spaces. To better illustrate our motivation, we present an example of the specific challenges within existing analyzable abstraction mapping systems, motivating the development of our approach.

Mapping intrinsic instructions onto hardware accelerators poses significant challenges due to the vast number of possible configurations and their impact on performance. The process of selecting the optimal mapping for intrinsic instructions, such as those used in Tensor Cores, is complex, given the numerous potential mapping candidates. Each mapping choice can critically affect performance factors like data locality and parallelism. For example, as shown in Table 2, AMOS identified 35 distinct ways to map the seven loops of a 2D convolution onto the 3D loops of the Tensor Core. Exhaustively exploring all configurations is inefficient and rarely yields substantial performance gains. Thus, a more efficient approach is required, one that prioritizes the most promising mappings to reduce search overhead and maximize performance.
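To make the size of this mapping space concrete, the following toy enumeration sketches how candidates multiply once loop fusion is allowed. It is our own simplified illustration (loop and index names follow Table 2), not AMOS's actual enumeration or validity analysis, so its counts differ from the 35 legal mappings reported above.

```python
from itertools import combinations

# Loops of a 2D convolution (Table 2): space loops that may feed the
# Tensor Core space indices, and reduction loops that may feed r1.
# In this toy model, i2 is always bound to the channel loop k.
space_candidates = ["n", "p", "q"]
reduce_candidates = ["rc", "rr", "rs"]

def nonempty_subsets(loops):
    """All non-empty subsets of `loops`; a subset with more than one loop
    models fusing several software loops into one intrinsic index."""
    return [list(c) for r in range(1, len(loops) + 1)
            for c in combinations(loops, r)]

# Basic mapping: exactly one software loop per intrinsic index.
basic = [{"i1": [s], "i2": ["k"], "r1": [r]}
         for s in space_candidates for r in reduce_candidates]

# Complex mapping: any non-empty combination of loops per index.
complex_ = [{"i1": ss, "i2": ["k"], "r1": rs}
            for ss in nonempty_subsets(space_candidates)
            for rs in nonempty_subsets(reduce_candidates)]

print(len(basic))     # 9 basic candidates in this toy model
print(len(complex_))  # 49 once fusion is allowed
```

Even this stripped-down model grows multiplicatively with the fusion choices, which is why exhaustive exploration quickly becomes impractical and prioritizing promising candidates matters.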
Table 2
Mapping candidate choices. This example maps 2D convolution indices to Tensor Core indices (type: float16). Space loops: n, k, p, q, 𝑖1, 𝑖2; reduction loops: rc, rr, rs, 𝑟1. The mapping choices can be categorized into basic mapping and complex mapping. Basic mapping means selecting only one choice at a time, while complex mapping allows multiple choices to be combined for mixed mapping.

        mapping1  mapping2  mapping3  mapping4  mapping5  mapping6  mapping7
i1      n         n         n         p         p         q         q
i2      k         k         k         k         k         k         k
r1      rc        rr        rs        rc        rs        rc        rr
Choices 0/1       0/1       0/1       0/1       0/1       0/1       0/1
Fig. 2. The compilation flow of GTA. 𝑡𝑛 denotes the 𝑛th non-intrinsic main-task (blue box), and 𝑡𝑛𝑘 denotes the 𝑘th mapping candidate of the 𝑛th intrinsic-enabled main-task (orange box). All mapping candidates are ranked and executed based on performance feedback. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
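The task hierarchy of Fig. 2 can be sketched as plain data structures: one object per subgraph, holding one record per intrinsic mapping candidate. The class and field names below are our own illustration of this structure, not GTA's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SubTask:
    """One intrinsic mapping candidate t_nk of an intrinsic-enabled main-task."""
    mapping_id: int
    best_latency: Optional[float] = None  # filled in by hardware measurement
    sample_prob: float = 1.0              # sampling probability, updated by feedback

@dataclass
class MainTask:
    """One subgraph t_n extracted from the DNN's computation graph."""
    name: str
    uses_intrinsic: bool
    sub_tasks: List[SubTask] = field(default_factory=list)

    def best_latency(self) -> float:
        """Latency of the best measured mapping candidate so far."""
        measured = [s.best_latency for s in self.sub_tasks
                    if s.best_latency is not None]
        return min(measured) if measured else float("inf")

# A non-intrinsic main-task carries no mapping candidates; an intrinsic-enabled
# one carries a SubTask per candidate (orange/gray boxes in Fig. 2).
t1 = MainTask("conv2d+relu", True,
              [SubTask(0, 1.8), SubTask(1, 2.4), SubTask(2, None)])
print(t1.best_latency())  # 1.8
```

Unmeasured candidates (latency still `None`) simply do not contribute to the task's best latency until a scheduling round selects them.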
A second challenge lies in the scheduling of tensor programs, which often lacks consideration for DLAs' intrinsics. Existing systems do not sufficiently incorporate these intrinsics when generating the scheduling search space, limiting their ability to optimize tensorized programs for specialized hardware. To address this, a more comprehensive approach to scheduling is needed, integrating primitives like tiling, fusion, and vectorization that are tailored to the unique characteristics of DLAs. Without such a targeted approach, the scheduling search space cannot fully leverage the potential of available mappings, thereby constraining the system's capacity to produce high-performance programs.

3. GTA overview

To address the aforementioned issues, we propose GTA, a compilation framework designed to automatically generate high-performance tensorized programs for specialized hardware. As shown in Fig. 2, it takes deep neural networks (DNNs) as input, converting them into computation graphs represented as directed acyclic graphs (DAGs). In these graphs, each node corresponds to a tensor operation, and each edge denotes a producer-consumer relationship between operations. To handle the complexity of large computational graphs, GTA partitions the DNN's computation graph into smaller, manageable subgraphs using Relay's operator fusion algorithm, which has minimal performance impact due to the layer-by-layer structure of DNNs (𝑡1, 𝑡2, …, 𝑡𝑛 in Fig. 2).

To maximize performance across multiple subgraphs, GTA dynamically prioritizes the subgraphs and mapping candidates most likely to enhance end-to-end efficiency. It uses a dual-task scheduling approach (detailed in Section 4) that allocates tuning time across both the subgraph and mapping candidate levels. By allocating varying amounts of time to different subgraphs and probabilistically discarding less efficient candidates based on performance feedback, dual-task scheduling helps avoid wasting tuning resources on low-impact mappings.

Additionally, resource-constrained rules (explained in Section 5) guide program generation on both DLAs and general-purpose accelerators. GTA designs these rules by abstracting common architectural characteristics across DLAs, such as coarse-grained hardware intrinsics (e.g., WMMA in Tensor Core) and dedicated scratchpad memory (e.g., Unified Buffer in TPU). This design allows GTA to efficiently leverage hardware-specific features, optimizing tensorized programs to fully exploit the underlying hardware capabilities.

4. Dual-task scheduling

Most existing compiler frameworks adopt a performance-aware tuning strategy to fine-tune generated programs, a method proven effective by systems such as Ansor and AMOS. For example, Ansor refines its cost model by updating task weights based on feedback from each search iteration, while dynamically allocating subgraph trials. Building on this approach, when multiple intrinsic instruction mapping options are available, feeding the performance results of each mapping back into the front-end further enhances the framework by enabling seamless co-design between the front-end and back-end stages.

To optimize tuning resource allocation, a DNN can be decomposed into multiple independent subgraphs (e.g., conv2d + ReLU). For some subgraphs, spending time on tuning may not significantly improve the overall network performance. This may occur when a subgraph is not a performance bottleneck, or when any tuning yields only marginal gains. Similarly, a subgraph may have multiple intrinsic mapping candidates, but further tuning on certain mappings may not result in meaningful improvements. This is often because certain mapping schemes exhibit inefficient memory access patterns, limiting their ability to leverage the unique features of the underlying hardware and thereby restricting the potential for significant performance gains.

To illustrate the dual-task scheduling (DTS) process, we use ResNet18 as an example. After splitting ResNet18 into subgraphs, there are 24 unique subgraphs, most of which are convolution layers with varying shape configurations (e.g., input size, kernel size, stride). Following Ansor's task scheduling methodology, we define a task as the process of generating high-performance programs for each subgraph. Thus, optimizing a single DNN like ResNet18 requires completing multiple tasks (e.g., 24 tasks for ResNet18).

To efficiently allocate tuning resources across these tasks, GTA employs a DTS approach. This method dynamically assigns varying amounts of time to different subgraphs and probabilistically discards inefficient mapping candidates based on program performance feedback. DTS operates on two levels: the subgraph level and the mapping candidate level, helping GTA focus tuning resources on the most impactful configurations and avoid spending time on low-impact mappings.

As shown in Fig. 1, DTS iteratively allocates tuning resources to different tasks. In each round, the first step selects a subgraph for program generation, and GTA generates a set of intrinsic-compatible mapping candidates for the intrinsic-enabled 𝑡𝑎𝑠𝑘𝑖. This effectively breaks the main-task into several sub-tasks (as shown by the orange box in Fig. 1). The second step then generates a batch of promising programs for these sub-tasks and measures their performance on hardware. Each round is defined as one unit of time resource. When a time resource is allocated to a task, the task gains the opportunity to generate and measure new programs, increasing the chance of discovering better-performing ones.

In the following section, we introduce the formulation of the scheduling problem and our solution.

4.1. Problem formulation

In defining the scheduling problem, we divide DTS into two types of tasks: main-tasks and sub-tasks. In this framework, a DNN can be split into several subgraphs (main-tasks). If the computation type, data type, and computation shape of a main-task meet the limitations required for utilizing hardware intrinsic resources, multiple intrinsic mapping candidates will be generated for the main-task. Each of these intrinsic mapping candidates is referred to as a sub-task. A main-task represents a process performed to generate high-performance programs for a subgraph, meaning that optimizing a single DNN requires completing dozens of main-tasks. The related notations used in this paper are shown in Table 3.

Table 3
Notations.

Notation    | Description/Definition
Main-task   | Subgraph process for generating high-performance programs
Sub-task    | Intrinsic mapping candidate satisfying hardware constraints
Δ𝑡          | Small backward window size
𝑁𝑖          | The set of similar tasks of task 𝑖
𝐶𝑖          | The number of floating point operations in task 𝑖
𝑉𝑘          | The number of floating point operations per second we can achieve in task 𝑘
𝐵𝑙𝑎𝑡𝑒𝑛𝑐𝑦    | Best mapping latency set of tasks
𝐵𝑡𝑎𝑠𝑘       | Best mapping tasks set of the DNN
𝐶𝑠𝑎𝑚𝑝𝑙𝑒     | Samples selected from all mappings
𝐶𝑡𝑟𝑖𝑎𝑙𝑠     | Current number of trials
𝐶𝑚𝑎𝑝𝑝𝑖𝑛𝑔    | Current mapping selection
𝐺           | Native neural network
𝑚𝑖(𝑡)       | Minimum execution time for the 𝑖th task
𝑚𝑖𝑘(𝑡)      | Execution time of the 𝑘th mapping for 𝑚𝑖(𝑡)
𝑇𝑙𝑎𝑡𝑒𝑛𝑐𝑦    | Latency set of all tasks
𝑀𝑐𝑎𝑛𝑑𝑖      | Set of all mapping candidates
𝛼𝑘          | Sampling probability of mapping 𝑘
𝛽           | Hyperparameter for increasing probability
𝜔𝑖          | Number of appearances of task 𝑖 in the network

We define 𝑚𝑖(𝑡) as the minimum execution time required for the 𝑖th main-task at time 𝑡, and 𝑚𝑖𝑘(𝑡) as the execution time of the 𝑘th mapping scheme for the 𝑖th main-task. The optimal execution time for subgraph 𝑖 is represented as min(𝑚𝑖1(𝑡), 𝑚𝑖2(𝑡), …, 𝑚𝑖𝑘(𝑡)). The end-to-end execution time of the entire network, denoted by 𝐺(𝑚1(𝑡), 𝑚2(𝑡), …, 𝑚𝑛(𝑡)), represents the aggregate time across all main-tasks. Our objective is to minimize this function to achieve the lowest possible overall execution time for the DNN. Thus, the objective function is defined as:

f(G) = \sum_{i=1}^{n} \omega_i \times \max\big( \beta(\alpha_1 \cdot m_{i1}(t), \alpha_2 \cdot m_{i2}(t), \ldots, \alpha_k \cdot m_{ik}(t)) \big)    (1)

Let 𝜔𝑖 denote the number of appearances of main-task 𝑖 in the network, where 𝑖 is the main-task index. If a main-task has already met its latency requirement, no additional tuning resources are allocated to it. The variable 𝛼𝑘 represents the sampling probability assigned to sub-task 𝑘. Unlike other frameworks, our approach introduces probabilistic allocation for intrinsic mapping candidates (sub-tasks). Once performance feedback for all mapping candidates of a subgraph is received, sampling probabilities are assigned based on time cost: candidates with lower time costs are assigned higher probabilities, while those with higher time costs receive lower probabilities. We also introduce a hyperparameter 𝛽 to adjust the sampling probabilities of specific mapping candidates, helping to avoid convergence on locally optimal solutions.

4.2. Optimizing with gradient and probability

Inspired by the gradient descent-based task scheduling approach presented in [25], we propose a DTS algorithm (Algorithm 1) that combines gradient descent with probability-based selection to efficiently optimize the objective function. Starting from the current allocation t, the algorithm approximates the gradient of the objective function, \partial f / \partial t_i, and identifies the primary task i by maximizing the absolute gradient, defined as i = \arg\max_i |\partial f / \partial t_i|. This gradient approximation serves as the foundation for selecting the main-task with the highest potential impact.

Algorithm 1: Dual-Task Scheduling
Input:
    𝐺: native deep learning neural network
    target: target hardware platform
    trials: total tuning counts
    MEASURE_NUM: number of measures per round
Output: 𝑏𝑒𝑠𝑡_𝑡𝑎𝑠𝑘𝑠: best performance tasks
1  Function dual_scheduling
2      Initialize local variables 𝐵𝑙𝑎𝑡𝑒𝑛𝑐𝑦, 𝐵𝑡𝑎𝑠𝑘, 𝑇𝑙𝑎𝑡𝑒𝑛𝑐𝑦, 𝐶𝑡𝑎𝑠𝑘, 𝐶𝑠𝑎𝑚𝑝𝑙𝑒𝑠;
3      tasks = extract_tasks(𝐺, target);
4      while 𝐶𝑡𝑟𝑖𝑎𝑙𝑠 < trials do
5          tid = gradient_scheduling(tasks, 𝑇𝑙𝑎𝑡𝑒𝑛𝑐𝑦);
6          𝑀𝑐𝑎𝑛𝑑𝑖 = match_intrinsic(tasks[tid], target);
7          if 𝑀𝑐𝑎𝑛𝑑𝑖 not NULL then
8              for 𝐶𝑚𝑎𝑝𝑝𝑖𝑛𝑔 in 𝑀𝑐𝑎𝑛𝑑𝑖 do
9                  if 𝐶𝑠𝑎𝑚𝑝𝑙𝑒𝑠 then
10                     if 𝐶𝑚𝑎𝑝𝑝𝑖𝑛𝑔 not in 𝐶𝑠𝑎𝑚𝑝𝑙𝑒𝑠 then
11                         continue;
12                     end
13                 end
14                 latency = tasks[tid].tune(𝐶𝑚𝑎𝑝𝑝𝑖𝑛𝑔);
15                 𝑇𝑙𝑎𝑡𝑒𝑛𝑐𝑦.append(latency);
16                 if latency < 𝐵𝑙𝑎𝑡𝑒𝑛𝑐𝑦[tid] then
17                     𝐵𝑙𝑎𝑡𝑒𝑛𝑐𝑦[tid] = latency;
18                     𝐵𝑡𝑎𝑠𝑘[tid] = tasks[tid];
19                 end
20             end
21             𝐶𝑠𝑎𝑚𝑝𝑙𝑒 = probability_sample(𝑇𝑙𝑎𝑡𝑒𝑛𝑐𝑦);
22             𝐶𝑡𝑟𝑖𝑎𝑙𝑠 += MEASURE_NUM;
23         end
24     end
25 return 𝐵𝑡𝑎𝑠𝑘;

\frac{\partial f}{\partial t_i} \approx \frac{\partial f}{\partial m_i} \left( \alpha \frac{\Delta m}{\Delta t} + (1 - \eta) \min\left( -\frac{m_i(t_i)}{t_i},\ \theta \frac{C_i}{\max_{k \in N(i)} V_k} - m_i(t_i) \right) \right)    (2)

where \Delta m = m_i(t_i) - m_i(t_i - \Delta t) and the other variables are defined in Table 3. The parameters \eta and \theta control how much weight is placed on these predictions.

GTA initializes the algorithm with t = 0 and begins with a round-robin warm-up phase, resulting in an initial allocation vector of 𝑡 = {1, 1, …, 1}. After the warm-up, as shown in line 5 of Algorithm 1, the gradient for each main-task is computed, and the main-task with the maximum absolute gradient, i = \arg\max_i |\partial f / \partial t_i|, is selected. A tuning time unit is then allocated to this main-task, updating its allocation to 𝑡𝑖 = 𝑡𝑖 + 1. The optimization process continues until the tuning time budget is exhausted.
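As a rough illustration of how gradient-based main-task selection and probability-based sub-task selection fit together, the sketch below approximates the gradient with only the backward finite difference Δm/Δt (dropping the η/θ prediction term of Eq. (2)) and samples mapping candidates with inverse-latency weights. All function names here are our own; this is a simplified sketch, not GTA's implementation.

```python
import random

def pick_main_task(history, delta_t=1):
    """Pick the main-task with the largest |gradient|, approximated by the
    backward finite difference Δm/Δt. `history[i]` holds the best latency of
    task i observed after each past round."""
    def grad(lat):
        if len(lat) <= delta_t:      # not enough feedback yet: force a trial
            return float("inf")
        return abs(lat[-1] - lat[-1 - delta_t]) / delta_t
    return max(range(len(history)), key=lambda i: grad(history[i]))

def sample_mappings(latencies, beta=1.0, k=2, rng=random):
    """Probabilistically retain k mapping candidates: a lower measured latency
    yields a proportionally higher sampling weight (inverse-latency weights);
    beta sharpens or flattens the distribution to escape local optima."""
    weights = [(1.0 / t) ** beta for t in latencies]
    idx = list(range(len(latencies)))
    chosen = set()
    while len(chosen) < min(k, len(idx)):
        chosen.add(rng.choices(idx, weights=weights, k=1)[0])
    return sorted(chosen)

# Toy round: task 1 improved the most in the last round, so it receives the
# next unit of tuning time.
history = [[5.0, 4.9, 4.9], [9.0, 8.0, 6.5], [3.0, 3.0, 3.0]]
print(pick_main_task(history))  # 1

random.seed(0)
print(sample_mappings([2.0, 8.0, 1.0, 9.0], k=2))
```

Because discarded candidates keep a nonzero sampling probability, they can still be re-selected in a later round, matching the behavior described for Fig. 1(c).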
Afterward, GTA searches for a hardware intrinsic that matches the specified main-task. Once a suitable set of hardware intrinsics is identified, tensor programs are generated for all mapping candidates, serving as a warm-up for the sub-tasks. This warm-up allows GTA to select the most promising mapping candidates by assigning probabilities based on their performance feedback. In subsequent rounds, only mapping candidates prioritized by their previously assigned probabilities are executed. This selective exploration avoids spending time on inefficient candidates, enhancing tuning efficiency and allowing higher-potential candidates more opportunities for optimization.

The probability_sample algorithm, called in line 21 of Algorithm 1, is designed to probabilistically select mapping candidates for further analysis and optimization. We first introduce the notation: let R = {r_1, r_2, ..., r_n} represent the set of all mapping results, where r_i denotes the i-th result with a performance value V(r_i).

The total weight W is calculated from each result's inverse performance value, normalized with respect to the maximum inverse performance in R, as follows:

W = Σ_{r_i ∈ R} (1/V(r_i)) · (1 / max_{r_j ∈ R} (1/V(r_j)))

This ensures that weights are scaled relative to the most performant candidate in the result set R. Using this normalized total weight W, the initial probability assigned to each result r_i is given by:

P(r_i) = ( (1/V(r_i)) · (1 / max_{r_j ∈ R} (1/V(r_j))) ) / W

To encourage exploration, the algorithm applies a probability increase factor β to selected results. The probability adjustment is defined by weighting the original probability P(r_i) with an exploration boost:

P′(r_i) = ( (1 + β) · P(r_i) ) / ( Σ_{r_j ∈ R} (1 + β_j) · P(r_j) )

Here, β_j is a task-specific exploration factor, applied selectively to candidates r_j, where β_j = β for selected candidates and β_j = 0 otherwise. The inclusion of the initial probability P(r_j), derived from each candidate's performance value V(r_j), serves as the foundation of the adjusted probabilities. This ensures that P′(r_i) retains the relative importance of each candidate while allowing selective exploration through β_j.

The normalization term, Σ_{r_j ∈ R} (1 + β_j) · P(r_j), ensures that the adjusted probabilities remain valid and sum to 1. By combining the task-specific exploration factor with the initial performance-weighted probability P(r_j), this formula balances exploitation of high-priority candidates with exploration of less performant options. Furthermore, P(r_j) prevents the adjustment from overly concentrating on a small subset of candidates, promoting diversity and fairness across the result set R.

Finally, the algorithm selects the top N results based on the adjusted probabilities P′(r_i). The selection process is expressed as:

{r_i}_{i=1}^{N} = Top_N( P′(r_1), P′(r_2), ..., P′(r_n) ),    (3)

where N is dynamically determined as a fraction of the total result set R, denoted by N = ⌈κ · |R|⌉, and κ ∈ (0, 1] is a user-defined parameter controlling the selection size.

5. Resource-constrained rules

Existing exploration-based methods face significant challenges in both performance and scalability, primarily due to two factors. First, although the design space is vast, it contains numerous inefficient kernels. For example, in the GEMM operation with dimensions 512 × 768 × 3072 (used in GPT-1 on Tensor Core), the kernel space size reaches O(10^16), with over 90% of the kernels being inefficient [63,64]. Second, current approaches are largely tailored to general-purpose processors and lack consideration for specific architectural constraints. This highlights the need to construct a high-quality kernel design space to effectively reduce inefficient exploration and improve overall performance.

To address these challenges, GTA's implementation of resource-constrained generation rules is based on existing open-source code for DLAs and general-purpose accelerators [22,25]. In particular, the DLA-specific rules are adapted to leverage hardware intrinsics and dedicated scratchpad memory (DSM) efficiently. From a programmer's perspective, DLAs, in contrast to general-purpose accelerators, feature coarse-grained hardware intrinsics (e.g., WMMA in Tensor Core) and user-programmable DSM (e.g., Unified Buffer in TPU). Based on these existing implementations, we made targeted modifications to better align the rules with the search strategies and optimization methods proposed in this work. Table 4 summarizes five key generation rules that GTA employs to optimize data movement, operation fusion, and memory management in DLAs. Each rule addresses specific challenges to enhance computational efficiency and resource utilization.

Table 4
Resource-constrained rules and related conditions.

No. | Rule | Condition
R1 | Multi-Level Tiling | HasDataReuse(R, i) & HasMultiLevelCache(R, i)
R2 | Set Multi-Scope | HasDataReuse(R, i) & HasMultiScopeCache(R, i)
R3 | Fuse Main Op | HasStagesFused(R)
R4 | Fuse Output Op | HasStagesFused(R)
R5 | AddMemLimit | HasDSM(R) (a)
... | Ansor Defined Rule (b) | ...

(a) DSM: dedicated scratchpad memory. (b) Ansor [25].

The following is a detailed description of each rule:

Rule-R1 generates multiple nodes for data movement between different levels of on-chip DSMs. To apply this rule, GTA first checks for data reuse opportunities and verifies whether the DLA has multiple DSM levels (e.g., Tensor Core provides two levels of DSMs for WMMA fragments and shared memory). If these conditions are met, GTA inserts cache_read primitives for the node and its producers to facilitate data movement.

Rule-R2 marks the data storage scope for each operation within the DSM hierarchy. To apply this rule, GTA first checks for data reuse opportunities and verifies whether the DLA provides multiple DSM scopes for different data types. If these conditions are satisfied, GTA assigns cache_write primitives to the node and cache_read primitives to its producers, ensuring that data is efficiently stored and accessed within the appropriate DSM levels.

Rule-R3 enables the fusion of main operations within a subgraph by identifying opportunities to combine operations with shared data dependencies. This reduces data movement overhead and improves computational efficiency. When multiple stages are fused, GTA inserts the appropriate primitives to implement the fusion, streamlining the execution flow.

Rule-R4 focuses on fusing output operations within a computational graph. Similar to Rule-R3, it targets operations that can be combined to minimize data transfer costs and enhance throughput. By analyzing data flow between operations, GTA inserts the necessary primitives to achieve output fusion, resulting in a more compact and efficient execution structure.

Rule-R5 constrains memory usage for operations that utilize DSM. By evaluating each operation and its memory requirements, GTA ensures memory limits are respected, preventing allocations from exceeding hardware capacity, which could lead to inefficient execution. This rule helps maintain an efficient memory allocation strategy, optimizing overall resource utilization.
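The probability_sample selection described earlier in this page (inverse-performance weights, exploration boost β, and top-N cut with N = ⌈κ·|R|⌉) can be sketched as follows. The function and variable names are illustrative, not GTA's actual API; V(r_i) is treated as an execution time, so lower is better.

```python
import math

def probability_sample(values, selected, beta=0.5, kappa=0.5):
    """Pick the top N = ceil(kappa * |R|) candidates by adjusted probability.

    values[i] is V(r_i), treated as execution time (lower is better);
    `selected` holds indices that receive the exploration boost beta.
    """
    # Inverse-performance weights, normalised by the best inverse value,
    # so the most performant candidate gets weight 1.
    inv_max = max(1.0 / v for v in values)
    w = [(1.0 / v) / inv_max for v in values]
    W = sum(w)
    p = [wi / W for wi in w]                    # initial probabilities P(r_i)

    # Exploration boost: beta_j = beta for selected candidates, 0 otherwise.
    boosted = [(1.0 + (beta if j in selected else 0.0)) * pj
               for j, pj in enumerate(p)]
    z = sum(boosted)
    p_adj = [b / z for b in boosted]            # renormalised, sums to 1

    n = math.ceil(kappa * len(values))
    return sorted(range(len(values)), key=lambda j: -p_adj[j])[:n]

# Four candidates with execution times 2, 4, 8, 1; candidate 2 gets the boost.
top = probability_sample([2.0, 4.0, 8.0, 1.0], selected={2})
```

In this toy run the fastest candidates (indices 3 and 0) survive the cut; the boost raises candidate 2's probability but, with these values, not enough to displace them — it matters only for borderline rankings.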
Fig. 3. An illustrative example of tensorized program generation for a GEMM-ReLU operator, demonstrating the transformation of the input program from a mathematical expression
(𝑝0 ) to a tensor expression (𝑝1 ) written in a domain-specific language using TVM. The process further includes intrinsic matching based on the type and shape of the input operator
to select and generate intrinsic mapping candidates, followed by the application of resource-constrained rules to guide the creation of a tensorized program sketch (𝑝2 ).
An example. Fig. 3 illustrates how resource-constrained rules are applied during tensorized program generation. Starting from the input program written as a mathematical expression (p0), the process converts it into a tensor expression (p1) using a domain-specific language (DSL) in TVM. The intrinsic matching step leverages compute abstraction and memory abstraction, as proposed in AMOS [22], to complete the software-hardware mapping generation. This process selects and generates intrinsic mapping candidates by analyzing the operator's computation type, data type, and memory access patterns based on its shape and hardware-specific constraints. Subsequently, resource-constrained rules play a critical role in guiding the generation of the tensorized program sketch, ensuring efficient utilization of hardware intrinsic functions while respecting memory and architectural constraints. Specifically, the derivation for the generated rules and the transformed program can be expressed as:

input p1 → M_cand ∧ σ(S_0, i = 3) →[R2] σ(S_1, i = 3) →[R1] σ(S_2, i = 2) →[R3] ... →[R5] output p2    (4)

We define the state as σ = (S, i), where S represents the current partially generated sketch program for the DAG, and i denotes the index of the node currently being transformed. For each rule, if the application conditions are met, the rule is applied to σ = (S, i), resulting in a new state σ′ = (S′, i′), where i′ ≤ i. This ensures that the index i (indicating the transforming node) decreases monotonically. A state reaches a terminal condition when i = 0. During the enumeration process, multiple rules may be applicable to a single state, generating several succeeding states. Additionally, a single rule can produce multiple succeeding states under certain conditions.

6. Implementation

In this section, we delve into the technical details of our implementation. GTA extends TVM, an end-to-end deep learning compiler, to support loop scheduling and generate high-performance programs with intrinsic instructions.

Task Generation. To mitigate the issue of search space explosion, compilers typically divide the large computational graph of a DNN into smaller subgraphs. Notably, for some subgraphs, spending time on tuning may not significantly enhance the end-to-end performance of the DNN. In this work, we adopt TVM's subgraph partitioning strategy to divide the input DNN into multiple smaller subgraphs, referred to as main-tasks. A main-task is the process executed to generate high-performance programs for a subgraph. TVM categorizes operators into four types: injective (e.g., add operations), reduction (e.g., sum operations), complex-out-fusible (e.g., matrix multiplication, where element-wise operations can fuse to the output), and opaque (e.g., sort operations, which cannot be fused). Subgraph fusion is then performed based on predefined generic rules.

Mapping Generation and Scheduling. At each iteration, based on the intrinsic mapping generation approach described in AMOS [22], main-tasks are classified into intrinsic-disabled and intrinsic-enabled tasks. For intrinsic-disabled main-tasks, we adopt Ansor's [25] compilation optimization to generate programs. In contrast, for intrinsic-enabled main-tasks, GTA optimizes task scheduling based on gradients and probabilities. This algorithm prioritizes subgraphs with higher potential for performance improvement, allocating them more tuning opportunities while reducing effort on less promising mapping candidates based on performance feedback. GTA slices the tuning time and prioritizes important subgraphs and intrinsic mapping candidates, meaning that not all main-tasks and sub-tasks will be executed. For example, an intrinsic-enabled main-task_i may contain both retained and discarded mapping candidates. The former proceed to subsequent tensor program optimization and tuning, while the latter do not participate in further optimization unless they are selected in the next scheduling round.

Search Space Exploration. Subsequently, GTA applies resource-constrained rules and existing derivation rules (Table 4) to each subgraph under the guidance of a genetic algorithm [25]. During this process, tens of thousands of tensor programs are generated, and a cost model is employed to filter out the most promising candidates with near-optimal performance. These selected candidates are then executed on the target hardware to identify the tensor program with the best performance.
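The state-enumeration process behind Eq. (4) can be sketched with toy rules. Everything here is an illustrative placeholder (the two rules, their conditions, and the tuple-based state encoding are invented, not GTA's Table 4 rules); it only demonstrates the mechanics: states σ = (S, i), condition-gated rule application, i′ ≤ i, and termination at i = 0.

```python
# Toy enumeration of sketch states sigma = (S, i); rules are illustrative.

def enumerate_sketches(num_nodes, rules):
    """rules: list of (condition(S, i), apply(S, i) -> list of (S', i'))."""
    done, stack = [], [((), num_nodes)]   # start at the last node index
    while stack:
        S, i = stack.pop()
        if i == 0:                        # terminal condition: all nodes done
            done.append(S)
            continue
        # Every applicable rule spawns one or more succeeding states.
        for cond, apply in rules:
            if cond(S, i):
                for S2, i2 in apply(S, i):
                    assert i2 <= i        # index decreases monotonically
                    stack.append((S2, i2))
    return done

# Two toy rules: "tile" consumes a node; "fuse" annotates without consuming
# (its condition forbids re-application, so enumeration terminates).
tile = (lambda S, i: True,
        lambda S, i: [(S + (f"tile@{i}",), i - 1)])
fuse = (lambda S, i: i >= 2 and not any(s.startswith("fuse") for s in S),
        lambda S, i: [(S + (f"fuse@{i}",), i)])
sketches = enumerate_sketches(2, [tile, fuse])
```

Starting from a two-node DAG, the enumeration yields two sketches: one where the fuse rule fired before tiling, and one that only tiled — illustrating how a single state can branch into several succeeding states.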
7. Evaluation

7.1. Evaluation platforms

Our experiments were conducted on two distinct hardware platforms to evaluate the performance of the proposed GTA framework:

• NVIDIA GPUs: We performed experiments on two NVIDIA GPUs, the RTX 3060 and the A100, both equipped with Tensor Cores optimized for deep learning tasks. The RTX 3060 represents a consumer-grade GPU, while the A100 is a data-center-grade GPU designed for high-performance computing.
• AMD CPU: We evaluated performance on an AMD Ryzen 7 7840H CPU,(2) which supports advanced SIMD (Single Instruction, Multiple Data) instructions, enabling efficient vectorized computations. This CPU platform provides a competitive environment for testing AVX512-like optimizations in general-purpose processors, allowing us to benchmark GTA's performance on non-GPU hardware.

7.2. Evaluated benchmarks

We evaluate the performance of GTA using both deep learning (DL) operators and complete neural network models.

• Operator-Level Evaluation: We select nine widely-used operators for this evaluation: General Matrix Multiplication (GEMM), 1D convolution (C1D), 2D convolution (C2D), 3D convolution (C3D), transposed 2D convolution (T2D), dilated convolution (DIL), batch matrix multiplication (BMM), General Matrix-Vector multiplication (GEMV), and scan (SCAN). For each operator, we test 6-10 different shape configurations and report the geometric mean of speedups normalized to GTA. The shape configurations are consistent with those used in Ansor and AMOS to ensure a fair comparison.
• Network-Level Evaluation: We benchmark six commonly-used neural network models: ResNet18 and ResNet50 [1], BERT (base configuration) [65], MI-LSTM [66], MobileNet-V1 [67], and ShuffleNet [11]. For each model, we evaluate performance with batch sizes of 1 and 16.

7.3. Comparison baselines

Our evaluation compares GTA against three state-of-the-art automatic generation methods (AutoTVM [49], Ansor [25] (v0.8), and AMOS [22] (commit: 0f39742)) as well as two vendor-optimized, hand-tuned libraries (cuDNN (v11.6) and PyTorch (v1.13.1, v2.0.1)):

• AutoTVM: This method uses hand-written templates to support all three selected platforms, demonstrating high performance across a range of baseline operators.
• AMOS: AMOS systematically explores various mappings of loop iterations to DLAs, representing the state of the art for operators with multiple feasible mappings, such as C1D and C2D.
• Ansor: As a leading method for GPU CUDA Core and CPU code generation, Ansor does not support DLAs like Tensor Core due to architectural limitations. However, comparing GTA with Ansor highlights the benefits of leveraging DLA-specific features in tensor program generation.
• PyTorch: PyTorch, a widely-used deep learning framework, serves as a strong baseline for evaluating GTA's ability to outperform standard hand-tuned implementations in practical deep learning applications. Our experiments include both PyTorch 1.13, which relies heavily on vendor-optimized libraries such as cuDNN and cuBLAS for high-performance computations, and PyTorch 2.0, which introduces the TorchInductor compiler.

For a fair comparison, we evaluate AutoTVM, Ansor, AMOS, and GTA with up to 200 measurement trials per test case and report the best performance achieved. For the vendor-optimized libraries on Tensor Core, we use PyTorch, which relies on hand-optimized libraries such as cuDNN to support various types of operators. These optimized libraries serve as strong baseline references for evaluating the performance of GTA.

7.4. Experimental results

We evaluate the performance of GTA on both operators and neural networks, comparing it against several baselines on two DLAs: GPU Tensor Cores and CPU AVX512. To further demonstrate the effectiveness of GTA, we analyze the quality of the generated search spaces and the efficiency of the exploration process. Finally, we highlight how the dual-task scheduling strategy significantly reduces compilation time by dynamically prioritizing subgraphs and mapping candidates, effectively cutting down unnecessary search effort.

7.5. Operator performance

Tensor Core. First, we compare GTA with PyTorch, which relies on hand-optimized libraries such as cuDNN to support various operators. Fig. 4 shows the results for all operators with batch size 1 on the NVIDIA RTX 3060. GTA consistently outperforms PyTorch across all operators, achieving a 2.44× geometric mean speedup. The speedup is attributed to GTA's comprehensive software-hardware mapping exploration, which contrasts with PyTorch's use of fixed mappings from hand-optimized libraries, often leading to suboptimal performance.

Next, we evaluate performance on the NVIDIA A100 GPU for various operators. As shown in Fig. 9, GTA achieves 1.26×, 5.24×, and 1.93× geometric mean speedups over Ansor, PyTorch, and AMOS, respectively. The significant improvement is due to GTA's ability to effectively utilize the high-performance Tensor Core units through enhanced mapping and scheduling strategies.

We also compare GTA with state-of-the-art compilers on the RTX 3060 using C2D in NCHW layout. We test all convolution layers from ResNet18 (a total of 12 configurations, labeled C0-C11). These configurations are standard benchmarks from well-known networks. The results are shown in Figs. 4, 5, and 6. GTA achieves speedups of 1.85×, 1.76×, and 2.10× over Ansor, AMOS, and hand-tuned PyTorch, respectively. Compared to Ansor, GTA leverages high-performance Tensor Core units alongside efficient auto-scheduling strategies, resulting in better optimization. In contrast to AMOS, GTA employs dual-task scheduling (DTS) to efficiently explore the scheduling space, reducing search time while enhancing program performance. Moreover, AMOS cannot utilize resource-constrained rules for shared memory allocation, leading to the generation of some tensor programs that exceed hardware resource limits. This limitation reduces AMOS's capability to achieve higher-performing programs.

AVX512. On the AMD CPU platform, we utilize hardware abstraction for AVX512 intrinsics (specifically for matrix-vector multiplication) and apply GTA to generate code for C2D. As shown in Fig. 7, GTA achieves 1.49× and 2.76× performance improvements over Ansor and PyTorch, respectively. GTA's advantage stems from combining high-performance AVX512 intrinsics with efficient auto-scheduling strategies, leading to superior program optimization compared to baseline methods.

(2) Intel CPUs also support AVX512 instructions and could be used for similar experiments.
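For reference, the "geometric mean of speedups" reported throughout this section is computed as below. The latency numbers in this sketch are invented for illustration only, not measured results.

```python
import math

def geomean_speedup(baseline_ms, candidate_ms):
    """Geometric mean of per-configuration speedups baseline/candidate."""
    ratios = [b / c for b, c in zip(baseline_ms, candidate_ms)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Three hypothetical shape configurations with speedups 2.0x, 3.0x, 1.0x:
# the geometric mean is (2 * 3 * 1)^(1/3), not the arithmetic mean 2.0.
speedup = geomean_speedup([4.0, 9.0, 2.0], [2.0, 3.0, 2.0])
```

The geometric mean is used rather than the arithmetic mean so that a single outlier configuration cannot dominate the reported number.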
Fig. 4. Single operator performance comparison on NVIDIA RTX 3060.
Fig. 5. Performance comparison of C2D on NVIDIA RTX 3060 with batch size = 1, using all convolution layers from ResNet18 (12 configurations, labeled C0-C11).
Fig. 6. Performance comparison of C2D on NVIDIA RTX 3060 with batch size = 16.
Fig. 7. Performance on AMD Ryzen 7 7840H CPU relative to Ansor and PyTorch.
Fig. 8. Performance of different networks relative to GTA on Tensor Core.
Fig. 9. Performance comparison of GTA across multiple individual operators on the NVIDIA A100 GPU, compared with baseline methods.
Fig. 10. Compilation time overhead and corresponding performance variations under different sampling rates.
7.6. Network performance

Fig. 8 illustrates the performance of GTA on six evaluated networks. On average, GTA achieves 1.75×, 1.42×, and 1.29× speedups over AMOS, PyTorch 1.13, and PyTorch 2.0 with TorchInductor, respectively. For ResNet18 and ResNet50, GTA finds better mappings for operators, enabling more extensive utilization of Tensor Cores compared to hand-tuned libraries and AMOS's optimized templates. GTA overcomes the limitations of these baselines by generating accurate search spaces that encompass most high-performance programs, along with an efficient search algorithm for finding optimal or near-optimal solutions. The results demonstrate GTA's capability to handle complex operators and effectively leverage Tensor Cores for high performance.

7.7. Compilation time

The search time overhead is a critical factor for practical deployment in deep learning frameworks, as reducing it can significantly enhance usability. To evaluate the efficiency of our dual-task scheduling strategy, we analyze the search time and corresponding performance variations under different sampling rates, specifically comparing GTA at sampling rates of 40% (GTA-0.4), 60% (GTA-0.6), and 100% (GTA-Raw). In this experiment, GTA operates at a sampling rate of 20% (GTA-0.2), representing a highly efficient configuration with minimal search overhead. The results, shown in Fig. 10, demonstrate that as the sampling rate decreases, the search time is significantly reduced while incurring less than a 5% performance degradation on average, thereby achieving an excellent balance between search efficiency and performance.

Additionally, we compare GTA's search time overhead and performance with AMOS, a state-of-the-art compiler designed for DLAs. Our findings reveal that GTA achieves an average performance improvement of 1.88× over AMOS while maintaining significantly lower search time. Specifically, AMOS's average compilation time is approximately five times that of GTA. This substantial reduction in search time underscores the effectiveness of GTA's dual-task scheduling strategy, which optimizes resource allocation during the search process and enables the rapid identification of high-performance tensor programs.

Unlike traditional methods that exhaustively explore all mapping candidates, GTA employs a dynamic prioritization strategy that adaptively allocates tuning resources based on performance feedback. This strategy ensures that the most promising subgraphs and intrinsic mapping candidates are prioritized, while less promising candidates receive fewer tuning opportunities. By combining this with a sampling-based approach, GTA minimizes unnecessary exploration while maintaining high-quality tensor programs. These results underscore GTA's suitability for real-world deployment scenarios, where both rapid code generation and performance optimization are critical. Furthermore, the ability to adjust sampling rates offers flexibility in balancing search time and performance, making GTA a robust solution for optimizing tensor programs across diverse workloads.

8. Related work

In addition to reviewing DLAs, we summarize related work on numeric precision and dynamic-shape optimization for deep learning.

Deep learning accelerators. DLAs offer several significant advantages, making them essential for advancing DNN research and deployment. First, DLAs feature large memory capacities, which accommodate the rapidly growing number of parameters in modern models and facilitate efficient training processes. Second, they provide model-specific optimizations while maintaining a degree of flexibility, enabling tailored performance improvements for various architectures; they also support a broader range of data formats, such as FP16, BF16, and INT8, which enhance computational efficiency and reduce memory usage. Third, DLAs are equipped with a high number of computing units, enabling extensive parallelism to handle the computational demands of DNNs effectively. These characteristics position DLAs as a cornerstone technology for accelerating the training and inference of deep learning models. Following this trend, many emerging accelerators have been proposed, targeting specific algorithms or utilizing new technologies. In academia, the DianNao
family [68-71] significantly improves DL computation performance by leveraging specialized functional units, memory hierarchy, and interconnects. Meanwhile, the expansion of DL applications in industry has led hardware vendors (e.g., NVIDIA Tensor Core [17-19] and Intel NNP [72]), internet giants (e.g., Tesla Dojo [73], Huawei Ascend [74], Google TPU [10] and Apple M4 [75,76]), and startups (e.g., Cambricon MLU [77] and Graphcore IPU [78]) to develop various DLAs. Both academic and industry DLAs are fundamentally domain-specific, rather than general-purpose accelerators, inevitably leading to complex and diverse architectural constraints.

Numeric precision optimization. Quantization [79,80], a pivotal technique in deep learning, reduces the numeric precision of weights and activations to enhance computational efficiency and lower resource requirements. By transitioning from high-precision formats such as FP32 to lower-precision formats like FP16, INT8, or even single-bit representations [81,82], quantization enables significant reductions in memory usage and power consumption [83,84]. The progression of hardware architectures aligns with the increasing demands for low-precision computations. For instance, NVIDIA's recent developments, such as the Turing and Ampere architectures, incorporated INT8 and INT4 tensor cores to enhance efficiency. Meanwhile, the latest Hopper architecture has shifted focus by replacing INT4 support with FP8 tensor cores, prioritizing improved numerical precision. These advancements allow large-scale models, including Large Language Models (LLMs) [85], to be deployed on resource-constrained devices like edge devices and DLAs without sacrificing performance. Compilers play a critical role in making quantization effective. Tools like AMOS [22], PreTuner [86] and LADDER [39] introduce advanced optimizations for low-precision data types, including hardware-aware scheduling, loop tiling, and fine-grained scaling strategies. Expanding on existing techniques, an automated approach [87] integrates bit-slicing into the scheduling phase, treating quantization as part of the schedule space. Coupled with program synthesis, this method efficiently generates hardware-specific kernels, supporting diverse quantization configurations and ensuring seamless adaptation to new hardware architectures.

Dynamic shape optimization. Dynamic-shape workloads are characteristic of DNN models where tensor shapes vary at runtime based on input data, such as the sequence length in Transformer models. These workloads pose substantial challenges for existing autotuning frameworks like TVM, which primarily rely on static input shapes to construct search spaces and cost models. For instance, TVM's second-generation IR, Relay [35], lacks the capability to represent dynamic tensors. While its third-generation IR, Relax [88], introduces symbolic shapes to support dynamic workloads, Relax still depends on hand-written templates for tensor program generation and lacks automatic tuning support. To address these limitations, recent works such as Nimble [89], DietCode [90], FTuner [91], and MIKPOLY [92] have introduced innovative techniques. These approaches construct shape-agnostic search spaces and cost models to optimize dynamic-shape workloads. For example, DietCode effectively groups kernels with varying shapes into unified workloads, enabling efficient tuning as a single entity and significantly reducing overall tuning time. FTuner introduces a uKernel-based approach for dynamic tensors, leveraging hardware-aware constraints to generate high-performance kernel programs and combining uKernels during runtime to optimize padding and execution efficiency. While these advancements mark significant progress, further research is needed to fully exploit the potential of dynamic-shape DNNs on modern hardware accelerators.

9. Conclusion

We propose GTA, a novel compilation framework for high-performance tensor program generation on DLAs. GTA expands the search space by coordinating intrinsic-based automatic mapping abstraction with a rule-based tensor program generation strategy and applies pruning rules to eliminate ineffective program candidates. Additionally, GTA employs a dual-task scheduling strategy for tensorized programs, effectively reducing tuning efforts while enhancing performance. Experimental results on three DLAs show that GTA outperforms state-of-the-art automatic generation approaches and vendor-provided hand-tuned libraries by 1.88× and 2.29×, respectively.

CRediT authorship contribution statement

Anxing Xie: Writing - original draft, Software, Resources, Project administration, Methodology, Investigation, Data curation. Yonghua Hu: Writing - review & editing, Supervision, Investigation, Funding acquisition. Yaohua Wang: Writing - review & editing, Supervision, Methodology, Investigation, Funding acquisition, Formal analysis. Zhe Li: Writing - review & editing, Supervision, Investigation, Formal analysis. Yuxiang Gao: Investigation. Zenghua Cheng: Investigation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable suggestions. This work is supported by the National Key R&D Program of China (No. 2022ZD0119003), Hunan Provincial Natural Science Foundation (No. 2023JJ50019), the Postgraduate Scientific Research Innovation Project of Hunan Province (No. CX20231019) and the National Natural Science Foundation of China (No. 62272477).

Data availability

Data will be made available on request.

References

[1] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[2] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, et al., Evolving deep neural networks, in: Artificial Intelligence in the Age of Neural Networks and Brain Computing, Elsevier, 2024, pp. 269-287.
[3] C.-Y. Wang, I.-H. Yeh, H.-Y. Mark Liao, Yolov9: Learning what you want to learn using programmable gradient information, in: European Conference on Computer Vision, Springer, 2025, pp. 1-21.
[4] A. Vaswani, Attention is all you need, Adv. Neural Inf. Process. Syst. (2017).
[5] P.P. Ray, ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope, Internet Things Cyber-Phys. Syst. 3 (2023) 121-154.
[6] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The llama 3 herd of models, 2024, arXiv preprint arXiv:2407.21783.
[7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213-3223.
[8] D. Fu, X. Li, L. Wen, M. Dou, P. Cai, B. Shi, Y. Qiao, Drive like a human: Rethinking autonomous driving with large language models, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 910-919.
[9] C. Cui, Y. Ma, X. Cao, W. Ye, Y. Zhou, K. Liang, J. Chen, J. Lu, Z. Yang, K.-D. Liao, et al., A survey on multimodal large language models for autonomous driving, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 958-979.
[10] N.P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., In-datacenter performance analysis of a tensor processing unit, in: Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 1–12.
[11] X. Zhang, X. Zhou, M. Lin, J. Sun, Shufflenet: An extremely efficient convolutional neural network for mobile devices, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
[12] C.-Y. Wang, A. Bochkovskiy, H.-Y.M. Liao, YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7464–7475.
[13] Z. Xu, W. Wang, H. Dai, Y. Xu, XFC: Enabling automatic and fast operator synthesis for mobile deep learning compilation, J. Syst. Archit. 142 (2023) 102921.
[14] C. Hao, X. Zhang, Y. Li, S. Huang, J. Xiong, K. Rupnow, W.-m. Hwu, D. Chen, FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge, in: Proceedings of the 56th Annual Design Automation Conference 2019, 2019, pp. 1–6.
[15] W. Jiang, L. Yang, E.H.-M. Sha, Q. Zhuge, S. Gu, S. Dasgupta, Y. Shi, J. Hu, Hardware/software co-exploration of neural architectures, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 39 (12) (2020) 4805–4815.
[16] Z. Xie, M. Emani, X. Yu, D. Tao, X. He, P. Su, K. Zhou, V. Vishwanath, Centimani: Enabling fast AI accelerator selection for DNN training with a novel performance predictor, in: 2024 USENIX Annual Technical Conference, USENIX ATC 24, 2024, pp. 1203–1221.
[17] Nvidia, Ampere architecture white paper, 2022, URL: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Online (Accessed 13 November 2024).
[18] Nvidia, Turing architecture white paper, 2022, URL: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Online (Accessed 13 November 2024).
[19] Nvidia, Volta architecture white paper, 2022, URL: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Online (Accessed 13 November 2024).
[20] K. Troester, R. Bhargava, AMD next generation Zen 4 core and 4th Gen AMD EPYC™ 9004 server CPU, in: 2023 IEEE Hot Chips 35 Symposium, HCS, IEEE Computer Society, 2023, pp. 1–25.
[21] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, et al., TVM: An automated end-to-end optimizing compiler for deep learning, in: 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 18, 2018, pp. 578–594.
[22] S. Zheng, R. Chen, A. Wei, Y. Jin, Q. Han, L. Lu, B. Wu, X. Li, S. Yan, Y. Liang, AMOS: enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction, in: Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022, pp. 874–887.
[23] S. Feng, B. Hou, H. Jin, W. Lin, J. Shao, R. Lai, Z. Ye, L. Zheng, C.H. Yu, Y. Yu, et al., Tensorir: An abstraction for automatic tensorized program optimization, in: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2023, pp. 804–817.
[24] J. Bi, Q. Guo, X. Li, Y. Zhao, Y. Wen, Y. Guo, E. Zhou, X. Hu, Z. Du, L. Li, et al., Heron: Automatically constrained high-performance library generation for deep learning accelerators, in: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2023, pp. 314–328.
[25] L. Zheng, C. Jia, M. Sun, Z. Wu, C.H. Yu, A. Haj-Ali, Y. Wang, J. Yang, D. Zhuo, K. Sen, et al., Ansor: Generating high-performance tensor programs for deep learning, in: 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 20, 2020, pp. 863–879.
[26] S. Zheng, Y. Liang, S. Wang, R. Chen, K. Sheng, Flextensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system, in: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 859–873.
[27] A. Sabne, Xla: Compiling machine learning for peak performance, Google Res (2020).
[28] N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W.S. Moses, S. Verdoolaege, A. Adams, A. Cohen, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, 2018, arXiv preprint arXiv:1802.04730.
[29] P. Tillet, H.-T. Kung, D. Cox, Triton: an intermediate language and compiler for tiled neural network computations, in: Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019, pp. 10–19.
[30] C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, O. Zinenko, MLIR: A compiler infrastructure for the end of Moore's law, 2020, arXiv preprint arXiv:2002.11054.
[31] L. Ma, Z. Xie, Z. Yang, J. Xue, Y. Miao, W. Cui, W. Hu, F. Yang, L. Zhang, L. Zhou, Rammer: Enabling holistic deep learning compiler optimizations with rtasks, in: 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 20, 2020, pp. 881–897.
[32] J. Zhao, B. Li, W. Nie, Z. Geng, R. Zhang, X. Gao, B. Cheng, C. Wu, Y. Cheng, Z. Li, et al., AKG: automatic kernel generation for neural processing units using polyhedral transformations, in: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021, pp. 1233–1248.
[33] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., Pytorch: An imperative style, high-performance deep learning library, Adv. Neural Inf. Process. Syst. 32 (2019).
[34] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., TensorFlow: a system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 16, 2016, pp. 265–283.
[35] J. Roesch, S. Lyubomirsky, M. Kirisame, L. Weber, J. Pollock, L. Vega, Z. Jiang, T. Chen, T. Moreau, Z. Tatlock, Relay: A high-level compiler for deep learning, 2019, arXiv preprint arXiv:1904.08368.
[36] J. Zhao, X. Gao, R. Xia, Z. Zhang, D. Chen, L. Chen, R. Zhang, Z. Geng, B. Cheng, X. Jin, Apollo: Automatic partition-based operator fusion through layer by layer optimization, in: MLSys, 2022.
[37] Y. Shi, Z. Yang, J. Xue, L. Ma, Y. Xia, Z. Miao, Y. Guo, F. Yang, L. Zhou, Welder: Scheduling deep learning memory access via tile-graph, in: 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 23, 2023, pp. 701–718.
[38] C. Xia, J. Zhao, Q. Sun, Z. Wang, Y. Wen, T. Yu, X. Feng, H. Cui, Optimizing deep learning inference via global analysis and tensor expressions, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2024, pp. 286–301.
[39] L. Wang, L. Ma, S. Cao, Q. Zhang, J. Xue, Y. Shi, N. Zheng, Z. Miao, F. Yang, T. Cao, et al., Ladder: Enabling efficient low-precision deep learning computing through hardware-aware tensor transformation, in: 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 24, 2024, pp. 307–323.
[40] F. Wang, M. Shen, Y. Lu, N. Xiao, TensorMap: A deep RL-based tensor mapping framework for spatial accelerators, IEEE Trans. Comput. (2024).
[41] Y. Zhao, H. Sharif, V. Adve, S. Misailovic, Felix: Optimizing tensor programs with gradient descent, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024, pp. 367–381.
[42] Q. Zhao, R. Wang, Y. Liu, H. Yang, Z. Luan, D. Qian, Sifter: An efficient operator auto-tuner with speculative design space exploration for deep learning compiler, IEEE Trans. Comput. (2024).
[43] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, S. Amarasinghe, Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, Acm Sigplan Not. 48 (6) (2013) 519–530.
[44] Y. Bai, X. Yao, Q. Sun, W. Zhao, S. Chen, Z. Wang, B. Yu, Gtco: Graph and tensor co-design for transformer-based image recognition on tensor cores, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. (2023).
[45] H. Kwon, P. Chatarasi, V. Sarkar, T. Krishna, M. Pellauer, A. Parashar, Maestro: A data-centric approach to understand reuse, performance, and hardware cost of dnn mappings, IEEE Micro 40 (3) (2020) 20–29.
[46] L. Lu, N. Guan, Y. Wang, L. Jia, Z. Luo, J. Yin, J. Cong, Y. Liang, Tenet: A framework for modeling tensor dataflow based on relation-centric notation, in: 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture, ISCA, IEEE, 2021, pp. 720–733.
[47] A. Parashar, P. Raina, Y.S. Shao, Y.-H. Chen, V.A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S.W. Keckler, J. Emer, Timeloop: A systematic approach to dnn accelerator evaluation, in: 2019 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS, IEEE, 2019, pp. 304–315.
[48] X. Yang, M. Gao, Q. Liu, J. Setter, J. Pu, A. Nayak, S. Bell, K. Cao, H. Ha, P. Raina, et al., Interstellar: Using halide's scheduling language to analyze dnn accelerators, in: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 369–383.
[49] T. Chen, L. Zheng, E. Yan, Z. Jiang, T. Moreau, L. Ceze, C. Guestrin, A. Krishnamurthy, Learning to optimize tensor programs, Adv. Neural Inf. Process. Syst. 31 (2018).
[50] J. Appleyard, S. Yokim, NVIDIA developer technical blog, 2017, URL: https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9 Online (Accessed 13 November 2024).
[51] NVIDIA, Basic linear algebra on NVIDIA GPUs, 2024, URL: https://developer.nvidia.com/cublas Online (Accessed 13 November 2024).
[52] A. Kerr, H. Wu, M. Gupta, D. Blasig, P. Ramini, D. Merrill, A. Shivam, P. Majcher, P. Springer, M. Hohnerbach, J. Wang, M. Nicely, CUTLASS, 2022, URL: https://github.com/NVIDIA/cutlass Online (Accessed 13 November 2024).
[53] T. Zerrell, J. Bruestle, Stripe: Tensor compilation via the nested polyhedral model, 2019, arXiv preprint arXiv:1903.06498.
[54] R. Baghdadi, J. Ray, M.B. Romdhane, E. Del Sozzo, A. Akkas, Y. Zhang, P. Suriana, S. Kamil, S. Amarasinghe, Tiramisu: A polyhedral compiler for expressing fast and portable code, in: 2019 IEEE/ACM International Symposium on Code Generation and Optimization, CGO, IEEE, 2019, pp. 193–205.
[55] S. Tavarageri, A. Heinecke, S. Avancha, B. Kaul, G. Goyal, R. Upadrasta, Polydl: Polyhedral optimizations for creation of high-performance dl primitives, ACM Trans. Archit. Code Optim. (TACO) 18 (1) (2021) 1–27.
[56] Q. Huang, M. Kang, G. Dinh, T. Norell, A. Kalaiah, J. Demmel, J. Wawrzynek, Y.S. Shao, Cosa: Scheduling by constrained optimization for spatial accelerators, in: 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture, ISCA, IEEE, 2021, pp. 554–566.
[57] M. Sotoudeh, A. Venkat, M. Anderson, E. Georganas, A. Heinecke, J. Knight, ISA mapper: a compute and hardware agnostic deep learning compiler, in: Proceedings of the 16th ACM International Conference on Computing Frontiers, 2019, pp. 164–173.
[58] J. Weng, A. Jain, J. Wang, L. Wang, Y. Wang, T. Nowatzki, UNIT: Unifying tensorized instruction compilation, in: 2021 IEEE/ACM International Symposium on Code Generation and Optimization, CGO, IEEE, 2021, pp. 77–89.
[59] H. Zhu, R. Wu, Y. Diao, S. Ke, H. Li, C. Zhang, J. Xue, L. Ma, Y. Xia, W. Cui, et al., Roller: Fast and efficient tensor compilation for deep learning, in: 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 22, 2022, pp. 233–248.
[60] Y. Ding, C.H. Yu, B. Zheng, Y. Liu, Y. Wang, G. Pekhimenko, Hidet: Task-mapping programming paradigm for deep learning tensor programs, in: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2023, pp. 370–384.
[61] L. Zheng, H. Wang, J. Zhai, M. Hu, Z. Ma, T. Wang, S. Huang, X. Miao, S. Tang, K. Huang, et al., EINNET: Optimizing tensor programs with derivation-based transformations, in: 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 23, 2023, pp. 739–755.
[62] Y. Zhai, S. Yang, K. Pan, R. Zhang, S. Liu, C. Liu, Z. Ye, J. Ji, J. Zhao, Y. Zhang, et al., Enabling tensor language model to assist in generating high-performance tensor programs for deep learning, in: 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 24, 2024, pp. 289–305.
[63] F. Wang, M. Shen, Y. Ding, N. Xiao, Soter: Analytical tensor-architecture modeling and automatic tensor program tuning for spatial accelerators, in: 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture, ISCA, IEEE, 2024, pp. 991–1004.
[64] F. Wang, M. Shen, Automatic kernel generation for large language models on deep learning accelerators, in: 2023 IEEE/ACM International Conference on Computer Aided Design, ICCAD, IEEE, 2023, pp. 1–9.
[65] J. Devlin, Bert: Pre-training of deep bidirectional transformers for language understanding, 2018, arXiv preprint arXiv:1810.04805.
[66] Y. Wu, S. Zhang, Y. Zhang, Y. Bengio, R.R. Salakhutdinov, On multiplicative integration with recurrent neural networks, Adv. Neural Inf. Process. Syst. 29 (2016).
[67] A.G. Howard, Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017, arXiv preprint arXiv:1704.04861.
[68] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, O. Temam, Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning, ACM SIGARCH Comput. Archit. News 42 (1) (2014) 269–284.
[69] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., Dadiannao: A machine-learning supercomputer, in: 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, IEEE, 2014, pp. 609–622.
[70] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, Y. Chen, Pudiannao: A polyvalent machine learning accelerator, ACM SIGARCH Comput. Archit. News 43 (1) (2015) 369–381.
[71] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, O. Temam, ShiDianNao: Shifting vision processing closer to the sensor, in: Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015, pp. 92–104.
[72] B. Hickmann, J. Chen, M. Rotzin, A. Yang, M. Urbanski, S. Avancha, Intel nervana neural network processor-t (nnp-t) fused floating point many-term dot product, in: 2020 IEEE 27th Symposium on Computer Arithmetic, ARITH, IEEE, 2020, pp. 133–136.
[73] E. Talpes, D. Williams, D.D. Sarma, Dojo: The microarchitecture of Tesla's exa-scale computer, in: 2022 IEEE Hot Chips 34 Symposium, HCS, IEEE Computer Society, 2022, pp. 1–28.
[74] H. Liao, J. Tu, J. Xia, H. Liu, X. Zhou, H. Yuan, Y. Hu, Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper, in: 2021 IEEE International Symposium on High-Performance Computer Architecture, HPCA, IEEE, 2021, pp. 789–801.
[75] Apple, Apple introduces M4 chip, 2024, URL: https://www.apple.com/sg/newsroom/2024/05/apple-introduces-m4-chip/ Online (Accessed 13 November 2024).
[76] Apple, Apple introduces M4 pro and M4 max, 2024, URL: https://www.apple.com/sg/newsroom/2024/10/apple-introduces-m4-pro-and-m4-max/ Online (Accessed 13 November 2024).
[77] Cambricon, Cambricon MLU, 2024, URL: https://www.cambricon.com/ Online (Accessed 13 November 2024).
[78] Z. Jia, B. Tillman, M. Maggioni, D.P. Scarpazza, Dissecting the graphcore ipu architecture via microbenchmarking, 2019, arXiv preprint arXiv:1912.03413.
[79] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Quantized neural networks: Training neural networks with low precision weights and activations, J. Mach. Learn. Res. 18 (187) (2018) 1–30.
[80] T. Liang, J. Glossner, L. Wang, S. Shi, X. Zhang, Pruning and quantization for deep neural network acceleration: A survey, Neurocomput. 461 (2021) 370–403.
[81] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1, 2016, arXiv preprint arXiv:1602.02830.
[82] M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, Xnor-net: Imagenet classification using binary convolutional neural networks, in: European Conference on Computer Vision, Springer, 2016, pp. 525–542.
[83] C.-C. Yang, Y.-R. Chen, H.-H. Liao, Y.-M. Chang, J.-K. Lee, Auto-tuning fixed-point precision with TVM on RISC-v packed SIMD extension, ACM Trans. Des. Autom. Electron. Syst. 28 (3) (2023) 1–21.
[84] D. Diamantopoulos, B. Ringlein, M. Purandare, G. Singh, C. Hagleitner, Agile autotuning of a transprecision tensor accelerator overlay for TVM compiler stack, in: 2020 30th International Conference on Field-Programmable Logic and Applications, FPL, IEEE, 2020, pp. 310–316.
[85] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, H. Jin, T. Chen, Z. Jia, Towards efficient generative large language model serving: A survey from algorithms to systems, 2023, arXiv preprint arXiv:2312.15234.
[86] J. Xu, G. Song, B. Zhou, F. Li, J. Hao, J. Zhao, A holistic approach to automatic mixed-precision code generation and tuning for affine programs, in: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024, pp. 55–67.
[87] M. Cowan, T. Moreau, T. Chen, J. Bornholt, L. Ceze, Automatic generation of high-performance quantized machine learning kernels, in: Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization, 2020, pp. 305–316.
[88] R. Lai, J. Shao, S. Feng, S.S. Lyubomirsky, B. Hou, W. Lin, Z. Ye, H. Jin, Y. Jin, J. Liu, et al., Relax: Composable abstractions for end-to-end dynamic machine learning, 2023, arXiv preprint arXiv:2311.02103.
[89] H. Shen, J. Roesch, Z. Chen, W. Chen, Y. Wu, M. Li, V. Sharma, Z. Tatlock, Y. Wang, Nimble: Efficiently compiling dynamic neural networks for model inference, Proc. Mach. Learn. Syst. 3 (2021) 208–222.
[90] B. Zheng, Z. Jiang, C.H. Yu, H. Shen, J. Fromm, Y. Liu, Y. Wang, L. Ceze, T. Chen, G. Pekhimenko, DietCode: Automatic optimization for dynamic tensor programs, Proc. Mach. Learn. Syst. 4 (2022) 848–863.
[91] P. Mu, L. Wei, Y. Liu, R. Wang, FTuner: A fast dynamic shape tensors program auto-tuner for deep learning compilers, 2024, arXiv preprint arXiv:2407.21418.
[92] F. Yu, G. Li, J. Zhao, H. Cui, X. Feng, J. Xue, Optimizing dynamic-shape neural networks on accelerators via on-the-fly micro-kernel polymerization, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, pp. 797–812.

Anxing Xie is currently working toward a Ph.D. degree in the School of Computer Science and Engineering, Hunan University of Science and Technology, China. He is currently working on deep learning automatic compilation optimization and high-performance computation. His research interests include compiler optimization and parallel computing.

Yonghua Hu is a professor in the School of Computer Science and Engineering, Hunan University of Science and Technology, China. He received the Ph.D. degree in Computer Application Technology from Hunan University in 2008. He went to University at Buffalo SUNY as a visiting scholar in 2019. His research interests include compilation optimization, artificial intelligence and parallel computing.
Yaohua Wang is currently a professor with the College of Computer Science, National University of Defense Technology. His research interest is in computer architecture, machine learning and security. His work spans and stretches the boundaries of computer architecture. He is especially excited about novel, fundamentally-efficient computation and memory/storage paradigms, applied to emerging machine learning applications.

Zhe Li received the Ph.D. degree in Computer Science from Jilin University in 2022. He is currently working at Tianjin Advanced Technology Institute. His research interests include deep learning compilation and combinatorial optimization.

Yuxiang Gao is currently working toward an M.S. degree in the School of Computer Science and Engineering, Hunan University of Science and Technology, China. He is currently working on code optimization and compilation technology. His research interests include automatic compilation optimization and code generation.

Zenghua Cheng is currently working toward an M.S. degree in the School of Computer Science and Engineering, Hunan University of Science and Technology, China. He is currently working on code optimization and compilation technology. His research interests include automatic compilation optimization and Web security.