Journal of Systems Architecture 160 (2025) 103359
Contents lists available at ScienceDirect
Journal of Systems Architecture
journal homepage: www.elsevier.com/locate/sysarc
GTA: Generating high-performance tensorized program with dual-task scheduling

Anxing Xie a,1, Yonghua Hu a,∗, Yaohua Wang b, Zhe Li b,c, Yuxiang Gao a, Zenghua Cheng a

a School of Computer Science and Engineering, Hunan University of Science and Technology, Taoyuan Road, Xiangtan, 411201, Hunan, China
b School of Computer Science, National University of Defense Technology, Deya Road, Changsha, 410073, Hunan, China
c Tianjin Institute of Advanced Technology, Huixiang Road, 300459, Tianjin, China
ARTICLE INFO

Keywords:
Mapping
Code generation
Compiler optimization
Tensor computation

ABSTRACT

Generating high-performance tensorized programs for deep learning accelerators (DLAs) is crucial for ensuring the efficient execution of deep neural networks. However, producing such programs for different operators across various DLAs is notoriously challenging. Existing methods utilize hardware abstraction to represent acceleration intrinsics, enabling end-to-end automated exploration of the intrinsic mapping space. However, their limited search space and inefficient exploration strategies often result in suboptimal tensorized programs and significant search time overhead.

In this paper, we propose GTA, a framework designed to generate high-performance tensorized programs for DLAs. Unlike existing deep learning compilers, we first coordinate intrinsic-based mapping abstraction with a rule-based program generation strategy, then apply resource-constrained rules to eliminate ineffective tensor program candidates from the search space. Second, we employ a dual-task scheduling strategy to allocate tuning resources across multiple subgraphs of deep neural networks and their mapping candidates. As a result, GTA can find high-performance tensor programs that are outside the search space of existing state-of-the-art methods. Our experiments show that GTA achieves an average speedup of more than 1.88× over AMOS and 2.29× over Ansor on NVIDIA GPU with Tensor Core, as well as 1.49× over Ansor and 2.76× over PyTorch on CPU with AVX512.
1. Introduction

Recently, the successful deployment of machine learning models has revolutionized diverse application domains, such as image recognition [1–3], natural language processing [4–6], and autonomous driving [7–9]. This rapid development has created a demand for generating high-performance tensor programs for deep learning accelerators (DLAs), such as Google TPUs [10], mobile devices [11–13], FPGAs [14–16], and more. To accelerate machine learning, hardware vendors have introduced domain-specific intrinsics for tensor computations, such as NVIDIA's Tensor Cores [17–19] and CPU's AVX512 [20]. This demand has led to the process known as tensorization [21], which involves transforming computations using these intrinsic instructions. However, hardware specialization complicates the task of generating high-performance tensorized programs.

To support hardware intrinsic instructions across different accelerators, existing methods [22–24] use unified hardware abstractions to enable end-to-end automatic mapping space exploration. These abstractions not only convert opaque intrinsics into an analyzable format but also bridge the gap between high-level tensor programs and low-level instructions, a process we refer to as tensorized program generation with automatic mapping optimization. However, generating high-performance tensorized programs for various DLAs remains challenging for several reasons.

Firstly, inefficient exploration of the intrinsic mapping space leads to substantial overhead in search time. For instance, mapping the 7 loops of a 2D convolution to the 3D loops of the Tensor Core can be done in 35 different ways [22]. Current strategies [22,23] treat each mapping candidate equally, generating a tensorized program for each and ultimately selecting the one with the best performance. This approach incurs significant time overhead and is inefficient, as it fails to prioritize more promising candidates during the exploration process. Our experiments reveal that many mapping candidates for a given subgraph ultimately fail to produce high-performance tensorized programs, indicating that a large portion of the explored mappings are ineffective in optimizing performance.
∗ Corresponding author.
E-mail address: huyh@hnust.cn (Y. Hu).
1 Part of this work was done at National University of Defense Technology.

https://doi.org/10.1016/j.sysarc.2025.103359
Received 23 November 2024; Received in revised form 8 January 2025; Accepted 30 January 2025
Available online 7 February 2025
1383-7621/© 2025 Published by Elsevier B.V.
Fig. 1. Comparison of different task scheduling strategies. Part (a): task scheduling with gradient descent. In round 1, all tasks_i are executed sequentially. In subsequent rounds, tasks_i are selectively executed based on the performance gradients calculated from the feedback of each task. Part (b): sequential execution of sub-tasks without dual-task scheduling. Part (c): slice the time and prioritize important subgraphs and intrinsic mapping candidates, meaning that not all main-tasks and sub-tasks will be executed. For example, an intrinsic-enabled main-task_i may contain both retained and discarded mapping candidates. The former will proceed to subsequent tensor program optimization and tuning, while the latter will not participate in further optimization unless they are selected in the next scheduling round. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Secondly, existing rule-based tensor program exploration methods [25] lack the ability to perform automatic tuning and optimization tailored to domain-specific intrinsics. As a result, these methods often fail in auto-tuning and produce suboptimal tensorized programs. To overcome these limitations, there is an urgent need for more efficient exploration of subgraph mapping spaces, along with auto-tuning strategies that can effectively support domain-specific intrinsics, enabling the automatic generation of high-performance tensorized programs.

In this paper, we introduce GTA, a new compiler framework designed to generate high-performance tensorized programs. GTA automatically generates an extensive search space optimized for hardware intrinsics, simultaneously increasing the likelihood of selecting the most efficient mapping configuration. To generate the search space, we employ rule-based strategies to construct a large scheduling search space and apply pruning techniques based on hardware cache resource limitations to eliminate invalid program candidates. Finally, as shown in Fig. 1, for the search strategy, we use a dual-task scheduling algorithm to allocate tuning resources across all subgraphs (main-task_i, shown by the blue box in Fig. 1) in the neural network and their intrinsic mapping candidates (sub-task_i, shown by the orange box and gray box). This algorithm prioritizes subgraphs with greater potential for performance improvement, allocating them more tuning opportunities, while reducing tuning effort on less promising mapping candidates based on performance feedback, thereby minimizing overall tuning time. In summary, this paper makes the following contributions:

• We integrate intrinsic-based mapping abstraction with a rule-based program generation strategy to expand the search space significantly.
• We develop and implement an efficient dual-task scheduling strategy for tensorized programs, effectively reducing tuning effort while enhancing performance.
• We propose a compilation framework called GTA, which supports the generation of high-performance tensorized programs at both the operator level and the full network level on NVIDIA GPUs and CPUs.
• We implement and comprehensively evaluate the GTA system, demonstrating that the aforementioned techniques outperform state-of-the-art systems across various deep neural networks (DNNs).

2. Background and motivation

2.1. Deep learning compilers

Deep learning compilers [21–32] have emerged as essential tools for bridging the gap between deep learning models and diverse hardware backends. These compilers take model definitions, expressed in frameworks like PyTorch [33] or TensorFlow [34], as inputs and generate efficient code implementations for specific hardware platforms, such as CPUs and GPUs. The compilation process often adopts a progressive multi-layer optimization approach. It begins with the front-end, where neural network models serve as input, and proceeds through intermediate representation (IR) stages. These include graph-level IR [35–39] for structural optimizations and loop-level IR [40–42] for fine-grained transformations. Finally, the back-end generates hardware-specific executable code using traditional compiler techniques, ensuring efficient execution on the target platform.

A key innovation in deep learning compilers is the compute-schedule separation first introduced by Halide [43] and adopted by frameworks like TVM [21]. Compute represents the mathematical description of tensor operations, such as addition, convolution, or matrix multiplication, while schedule defines how these operations are executed on hardware. A schedule specifies program transformations, including loop tiling, vectorization, and unrolling, to optimize performance for specific hardware architectures. This decoupling simplifies the representation of tensor computations, enabling flexible optimization strategies tailored to different backends.

Recent advancements [22–24,44] in deep learning compilers focus on leveraging hardware intrinsics to further optimize tensor programs. By integrating intrinsic-specific mapping abstractions, these compilers can directly utilize the specialized instructions of DLAs, such as NVIDIA's Tensor Cores or CPU's AVX512, to achieve higher computational efficiency. These developments mark a shift from general-purpose optimizations to hardware-aware designs, laying the foundation for intrinsic-based mapping strategies.

2.2. Intrinsic-based mapping abstraction

The development of DLAs has led to the creation of specialized instructions [45–48], known as intrinsics, designed to enhance the computational efficiency of tensor operations. These instructions serve as essential interfaces between hardware and compilers, enabling optimized execution of key operations like matrix multiplication and data movement.

Intrinsics provide an efficient mechanism for managing kernel operations in tensor programs, typically categorized into compute intrinsics for performing computations and memory intrinsics for data handling [22]. For example, NVIDIA Tensor Cores [17–19] and CPU AVX512 [20] offer specialized intrinsics that allow accelerated matrix and vector operations, respectively, facilitating high-performance computation across various accelerators.

Intrinsic-based mapping abstraction further unifies tensor program optimization by representing diverse intrinsic behaviors in a common, analyzable form. Frameworks like AMOS [22] and TensorIR [23]
leverage this approach to directly map software operations to hardware intrinsics, supporting automated generation and transformation of tensorized programs. This abstraction broadens the search space for high-performance configurations by identifying fundamental software-to-hardware mappings, thus enhancing optimization potential across different hardware backends.

2.3. Tensor program generation strategy

Table 1
State-of-the-art compilers/mappings for hardware accelerators.

     Name          Mapping Method
❶    AutoTVM       Hand-written templates + Tuning
     Triton        Hand-written templates
❷    Tiramisu      Polyhedral model
     AKG           Polyhedral model + Templates
❸    Ansor         Generated rules + Tuning
     XLA           Templates and rules
     Heron         Constraint-based rules + Tuning
     MetaSchedule  Generated rules + Tuning
❹    UNIT          Analyzable abstraction + Tuning
     ROLLER        Tile abstraction + Construction policy
     AMOS          Analyzable abstraction + Tuning
     TensorIR      Analyzable abstraction and generated rules + Tuning
❺    Hidet         Task-mapping + Post-scheduling fusion
     EINNET        Derivation-based + Tuning
     TensorMap     Reinforcement learning + Tuning
     GTA           Analyzable abstraction and generated rules + Tuning

In Table 1, we summarize state-of-the-art compiler mapping techniques used to generate optimized tensor programs on hardware accelerators. Most existing compilers leverage programmable intrinsics as part of their mapping strategy, enabling developers to focus on high-level optimization while the compiler handles low-level architectural details. These mapping methods streamline tensor program generation by abstracting hardware-specific operations, thereby enhancing both efficiency and portability.

Specifically, we categorize the state-of-the-art compilers/mappers for DLAs into five main approaches:

❶ Hand-written mapping: Hand-written mapping [29,49] requires developers to manually define mappings for tensorized programs using compiler-provided tensorize interfaces. This approach enables fine-grained optimization, especially for specialized hardware like NVIDIA Tensor Cores. However, it demands significant expertise and high development costs, as developers must continually rewrite templates to support new operators and accelerators [50–52]. While hand-written mapping can achieve high performance for specific workloads, its lack of scalability and adaptability limits its effectiveness compared to more automated methods.

❷ Polyhedral model mapping: Polyhedral model mapping [28,32,53–56] provides a powerful strategy for optimizing tensor programs by restructuring execution and managing complex memory dependencies. In the realm of tensor program compilation, this approach plays a critical role in handling intricate memory structures and optimizing execution. For example, AKG [32] leverages polyhedral scheduling to restructure execution order through new linear relationships, effectively eliminating inter-loop dependencies. This method is particularly advantageous for hardware like TPUs, where enhancing parallel computation is essential. By exploring a broader range of affine transformations compared to methods such as TVM [21], polyhedral mapping optimizes performance for diverse workloads. However, the model's inherent complexity limits its general applicability, making it less feasible for simpler or less resource-intensive tasks.

❸ Rule-based mapping: Rule-based mapping [24–27,57] generates efficient tensor programs through predefined scheduling primitives, streamlining tensor program creation without user-defined templates. This approach leverages scheduling techniques like loop tiling, fusion, and vectorization, as demonstrated by frameworks like Ansor [25], which automatically create search spaces using these rules. This method simplifies tensor program generation in deep learning applications. However, it also has limitations: users must ensure that the predefined rules align with the specific operators and hardware, or the generated programs may fail to achieve optimal performance.

❹ Analyzable abstraction mapping: Analyzable abstraction mapping [22,23,44,58,59] unifies tensor program optimization by abstracting diverse hardware intrinsic behaviors into a common representation, facilitating efficient mapping and transformation of tensorized programs. Examples like AMOS and TensorIR establish direct mappings between software and hardware, guiding the automated generation of tensorized programs. This approach broadens the scope of exploration by identifying foundational software-to-hardware combinations, increasing the potential for discovering optimized mappings.

❺ Other mapping: Other mapping methods [13,40,60,61] reformulate deep learning optimization problems using strategies from other domains to enhance efficiency. For example, CoSA [56] and Heron [24] convert the scheduling space search into a constrained optimization problem and leverage solvers to rapidly explore the space. Alternatively, TLM [62] and Soter [63] treat tensor program exploration as a language model generation task, where tensor programs are represented as sequences and tunable parameters as language tokens. Specifically, they leverage a large language model (LLM) to generate these tokens for tunable parameters, enabling efficient exploration of mapping schemes and more effective optimization of tensor programs.

Building on this foundation, we reviewed five primary mapping approaches used for deep learning accelerators: hand-written, rule-based, polyhedral model, analyzable abstraction, and other mapping methods. Each approach brings unique advantages: hand-written and rule-based mappings allow fine-tuned performance but require extensive manual intervention or rigid predefined rules, while polyhedral and analyzable abstraction mappings offer more automated solutions but are challenged by complexity and limited applicability. Methods borrowing from other domains, such as optimization solvers and language models, open new directions but may lack consistency across diverse hardware. In summary, intrinsic-based mapping abstraction offers a unified framework for optimizing tensor programs across diverse hardware accelerators by abstracting hardware intrinsic behaviors into a common representation. Systems like AMOS and TensorIR leverage this approach to enable efficient and adaptable mappings for tensorized programs.

Despite these advances, significant challenges remain in achieving flexible, high-performance mappings that are adaptable to new hardware accelerators, such as the inefficiency of existing approaches in handling diverse architectural constraints and their inability to effectively explore large and complex search spaces. To better illustrate our motivation, we present an example of the specific challenges within existing analyzable abstraction mapping systems that motivated the development of our approach.

Mapping intrinsic instructions onto hardware accelerators poses significant challenges due to the vast number of possible configurations and their impact on performance. The process of selecting the optimal mapping for intrinsic instructions, such as those used in Tensor Cores, is complex, given the numerous potential mapping candidates. Each mapping choice can critically affect performance factors like data locality and parallelism. For example, as shown in Table 2, AMOS identified 35 distinct ways to map the seven loops of a 2D convolution onto the 3D loops of the Tensor Core. Exhaustively exploring all configurations is inefficient and rarely yields substantial performance gains. Thus, a more efficient approach is required, one that prioritizes the most promising mappings to reduce search overhead and maximize performance.
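The combinatorial structure behind these mapping candidates can be sketched in a few lines of Python. This is an illustrative enumeration under assumed loop names (following Table 2), not AMOS's actual implementation; the counts it prints come from this toy setup, and a real system prunes many raw combinations with shape and validity checks.

```python
from itertools import combinations

# Toy enumeration of conv2d -> Tensor Core loop assignments (loop
# names as in Table 2). A real system like AMOS filters out invalid
# pairs; this sketch only counts the raw combinations.
space_choices = ["n", "p", "q"]      # candidate loops for the i1 axis
reduce_choices = ["rc", "rr", "rs"]  # candidate loops for the r1 axis

# Basic mappings: one loop per Tensor Core axis; i2 is always the
# output-channel loop k for a 2D convolution.
basic = [(s, "k", r) for s in space_choices for r in reduce_choices]

# Complex mappings: any non-empty subset of the 7 valid basic mappings
# in Table 2 (the 0/1 "Choices" row) can be combined into a mixed one.
complex_count = sum(1 for r in range(1, 8)
                    for _ in combinations(range(7), r))

print(len(basic), complex_count)  # 9 raw basic pairs, 127 subsets
```

Note that Table 2 lists only 7 of the 9 raw (i1, r1) pairs; pairs such as (p, rr) are excluded by validity analysis, which is exactly the kind of pruning that shrinks the raw combinatorial space to the 35 mappings AMOS reports.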
Table 2
Mapping candidate choices. This example maps a 2D convolution index to a Tensor Core index (type: float16). Space loops: n, k, p, q, i1, i2; Reduction loops: rc, rr, rs, r1. The mapping choices can be categorized into basic mapping and complex mapping. Basic mapping means selecting only one choice at a time, while complex mapping allows multiple choices to be combined for mixed mapping.

         mapping1  mapping2  mapping3  mapping4  mapping5  mapping6  mapping7
i1       n         n         n         p         p         q         q
i2       k         k         k         k         k         k         k
r1       rc        rr        rs        rc        rs        rc        rr
Choices  0/1       0/1       0/1       0/1       0/1       0/1       0/1
Fig. 2. The compilation flow of GTA. t_n denotes the nth non-intrinsic main-task (blue box), and t_nk denotes the kth mapping candidate of the nth intrinsic-enabled main-task (orange box). All mapping candidates are ranked and executed based on performance feedback. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
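As a toy illustration of the graph-partitioning step shown in Fig. 2, the sketch below splits an operator sequence into fused subgraphs (the main-tasks t_1, …, t_n) by grouping each compute-heavy operator with the elementwise operators that follow it. The fusion rule is a deliberate simplification of Relay's operator fusion, and the operator names are assumptions for illustration only.

```python
# Simplified fusion rule (assumption): elementwise ops attach to the
# preceding compute-heavy op, forming one subgraph (main-task) each.
ELEMENTWISE = {"relu", "add", "bias_add"}

def partition(ops):
    tasks, current = [], []
    for op in ops:
        if op not in ELEMENTWISE and current:
            tasks.append(current)   # close the previous subgraph task
            current = []
        current.append(op)
    if current:
        tasks.append(current)
    return tasks

ops = ["conv2d", "bias_add", "relu", "conv2d", "relu", "dense"]
print(partition(ops))
# [['conv2d', 'bias_add', 'relu'], ['conv2d', 'relu'], ['dense']]
```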
A second challenge lies in the scheduling of tensor programs, which often lacks consideration for DLA intrinsics. Existing systems do not sufficiently incorporate these intrinsics when generating the scheduling search space, limiting their ability to optimize tensorized programs for specialized hardware. To address this, a more comprehensive approach to scheduling is needed, integrating primitives like tiling, fusion, and vectorization that are tailored to the unique characteristics of DLAs. Without such a targeted approach, the scheduling search space cannot fully leverage the potential of the available mappings, thereby constraining the system's capacity to produce high-performance programs.

3. GTA overview

To address the aforementioned issues, we propose GTA, a compilation framework designed to automatically generate high-performance tensorized programs for specialized hardware. As shown in Fig. 2, it takes deep neural networks (DNNs) as input, converting them into computation graphs represented as directed acyclic graphs (DAGs). In these graphs, each node corresponds to a tensor operation, and each edge denotes a producer–consumer relationship between operations. To handle the complexity of large computational graphs, GTA partitions the DNN's computation graph into smaller, manageable subgraphs using Relay's operator fusion algorithm, which has minimal performance impact due to the layer-by-layer structure of DNNs (t_1, t_2, …, t_n in Fig. 2).

To maximize performance across multiple subgraphs, GTA dynamically prioritizes the subgraphs and mapping candidates most likely to enhance end-to-end efficiency. It uses a dual-task scheduling approach (detailed in Section 4) that allocates tuning time at both the subgraph and mapping candidate levels. By allocating varying amounts of time to different subgraphs and probabilistically discarding less efficient candidates based on performance feedback, dual-task scheduling helps avoid wasting tuning resources on low-impact mappings.

Additionally, resource-constrained rules (explained in Section 5) guide program generation on both DLAs and general-purpose accelerators. GTA designs these rules by abstracting common architectural characteristics across DLAs, such as coarse-grained hardware intrinsics (e.g., WMMA in Tensor Core) and dedicated scratchpad memory (e.g., the Unified Buffer in TPU). This design allows GTA to efficiently leverage hardware-specific features, optimizing tensorized programs to fully exploit the underlying hardware capabilities.

4. Dual-task scheduling

Most existing compiler frameworks adopt a performance-aware tuning strategy to fine-tune generated programs, a method proven effective by systems such as Ansor and AMOS. For example, Ansor refines its cost model by updating task weights based on feedback from each search iteration, while dynamically allocating subgraph trials. Building on this approach, when multiple intrinsic instruction mapping options are available, feeding the performance results of each mapping back into the front-end further enhances the framework by enabling seamless co-design between the front-end and back-end stages.

To optimize tuning resource allocation, a DNN can be decomposed into multiple independent subgraphs (e.g., conv2d + ReLU). For some subgraphs, spending time on tuning may not significantly improve the overall network performance. This may occur when a subgraph is not a performance bottleneck, or when any tuning yields only marginal gains. Similarly, a subgraph may have multiple intrinsic mapping candidates, but further tuning on certain mappings may not result in meaningful improvements. This is often because certain mapping schemes exhibit inefficient memory access patterns, limiting their ability to leverage the unique features of the underlying hardware and thereby restricting the potential for significant performance gains.

To illustrate the dual-task scheduling (DTS) process, we use ResNet18 as an example. After splitting ResNet18 into subgraphs, there are 24 unique subgraphs, most of which are convolution layers with varying shape configurations (e.g., input size, kernel size, stride). Following Ansor's task scheduling methodology, we define a task as the process of generating high-performance programs for each subgraph. Thus, optimizing a single DNN like ResNet18 requires completing multiple tasks (e.g., 24 tasks for ResNet18).

To efficiently allocate tuning resources across these tasks, GTA employs a DTS approach. This method dynamically assigns varying amounts of time to different subgraphs and probabilistically discards inefficient mapping candidates based on program performance feedback. DTS operates on two levels, the subgraph level and the mapping candidate level, helping GTA focus tuning resources on the most impactful configurations and avoid spending time on low-impact mappings.

As shown in Fig. 1, DTS iteratively allocates tuning resources to different tasks. In each round, the first step selects a subgraph for program generation, and GTA generates a set of intrinsic-compatible mapping
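The sub-task side of this feedback loop can be sketched as latency-weighted sampling: mapping candidates whose measured programs are fast keep receiving tuning trials, while slow ones are usually skipped. This is a minimal sketch under assumed names; the β-style weighting from Section 4.1 is approximated here as a softmax temperature, which is an assumption rather than GTA's exact formula.

```python
import math
import random

# Minimal sketch (assumed names): turn the measured latencies of a
# subgraph's mapping candidates into sampling probabilities, so that
# low-latency candidates are tuned further while high-latency ones
# are probabilistically discarded.
def sampling_probabilities(latencies_ms, beta=1.0):
    # Softmax over negated latency: lower latency -> higher weight.
    weights = [math.exp(-beta * t) for t in latencies_ms]
    total = sum(weights)
    return [w / total for w in weights]

latencies = [1.2, 0.8, 3.5]  # one measured latency per mapping candidate
probs = sampling_probabilities(latencies)
chosen = random.choices(range(len(latencies)), weights=probs, k=1)[0]
```

Raising beta sharpens the distribution (greedier selection), while a small beta keeps slow candidates alive, which is one way to avoid converging on a locally optimal mapping.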
candidates for the intrinsic-enabled task_i. This effectively breaks the main-task into several sub-tasks (as shown by the orange box in Fig. 1). The second step then generates a batch of promising programs for these sub-tasks and measures their performance on hardware. Each round is defined as one unit of time resource. When a time resource is allocated to a task, the task gains the opportunity to generate and measure new programs, increasing the chance of discovering better-performing ones.

In the following sections, we introduce the formulation of the scheduling problem and our solution.

4.1. Problem formulation

In defining the scheduling problem, we divide DTS into two types of tasks: main-tasks and sub-tasks. In this framework, a DNN can be split into several subgraphs (main-tasks). If the computation type, data type, and computation shape of a main-task meet the limitations required for utilizing hardware intrinsic resources, multiple intrinsic mapping candidates will be generated for the main-task. Each of these intrinsic mapping candidates is referred to as a sub-task. A main-task represents a process performed to generate high-performance programs for a subgraph, meaning that optimizing a single DNN requires completing dozens of main-tasks. The related notations used in this paper are shown in Table 3.

Table 3
Notations.

Notation    Description/Definition
Main-task   Subgraph process for generating high-performance programs
Sub-task    Intrinsic mapping candidate satisfying hardware constraints
Δt          Small backward window size
N_i         The set of tasks similar to task i
C_i         The number of floating point operations in task i
V_k         The number of floating point operations per second we can achieve in task k
B_latency   Best mapping latency set of tasks
B_task      Best mapping task set of the DNN
C_sample    Samples selected from all mappings
C_trials    Current number of trials
C_mapping   Current mapping selection
G           Native neural network
m_i(t)      Minimum execution time for the ith task
m_ik(t)     Execution time of the kth mapping for m_i(t)
T_latency   Latency set of all tasks
M_candi     Set of all mapping candidates
α_k         Sampling probability of mapping k
β           Hyperparameter for increasing probability
ω_i         Number of appearances of task i in the network

We define m_i(t) as the minimum execution time required for the ith main-task at time t, and m_ik(t) as the execution time of the kth mapping scheme for the ith main-task. The optimal execution time for subgraph i is represented as min(m_i1(t), m_i2(t), …, m_ik(t)). The end-to-end execution time of the entire network, denoted by G(m_1(t), m_2(t), …, m_n(t)), represents the aggregate time across all main-tasks. Our objective is to minimize this function to achieve the lowest possible overall execution time for the DNN. Thus, the objective function is defined as:

f(G) = Σ_{i=1}^{n} ω_i × max(β(α_1·m_i1(t), α_2·m_i2(t), …, α_k·m_ik(t)))    (1)

Let ω_i denote the number of appearances of main-task i in the network, where i is the main-task index. If a main-task has already met its latency requirement, no additional tuning resources are allocated to it. The variable α_k represents the sampling probability assigned to sub-task k. Unlike other frameworks, our approach introduces probabilistic allocation for intrinsic mapping candidates (sub-tasks). Once performance feedback for all mapping candidates of a subgraph is received, sampling probabilities are assigned based on time cost: candidates with lower time costs are assigned higher probabilities, while those with higher time costs receive lower probabilities. We also introduce a hyperparameter β to adjust the sampling probabilities of specific mapping candidates, helping to avoid convergence on locally optimal solutions.

4.2. Optimizing with gradient and probability

Inspired by the gradient descent-based task scheduling approach presented in [25], we propose a DTS algorithm (Algorithm 1) that combines gradient descent with probability-based selection to efficiently optimize the objective function. Starting from the current allocation t, the algorithm approximates the gradient of the objective function, ∂f/∂t_i, and identifies the primary task i by maximizing the absolute gradient, defined as i = argmax_i |∂f/∂t_i|. This gradient approximation serves as the foundation for selecting the main-task with the highest potential impact.

∂f/∂t_i ≈ ∂f/∂m_i · ( α·(Δm/Δt) + (1 − η)·min( −m_i(t_i)/t_i , θ·( C_i / max_{k∈N(i)} V_k − m_i(t_i) ) ) )    (2)

where Δm = m_i(t_i) − m_i(t_i − Δt) and the other variables are defined in Table 3. The parameters η and θ control how much weight is placed on each prediction.

Algorithm 1: Dual-Task Scheduling
Input:
    G: native deep learning neural network
    target: target hardware platform
    trials: total tuning count
    MEASURE_NUM: number of measurements per round
Output: best_tasks: best performance tasks
 1 Function dual_scheduling
 2     Initialize local variables B_latency, B_task, T_latency, C_task, C_samples;
 3     tasks = extract_tasks(G, target);
 4     while C_trials < trials do
 5         tid = gradient_scheduling(tasks, T_latency);
 6         M_candi = match_intrinsic(tasks[tid], target);
 7         if M_candi not NULL then
 8             for C_mapping in M_candi do
 9                 if C_samples then
10                     if C_mapping not in C_samples then
11                         continue;
12                     end
13                 end
14                 latency = tasks[tid].tune(C_mapping);
15                 T_latency.append(latency);
16                 if latency < B_latency[tid] then
17                     B_latency[tid] = latency;
18                     B_task[tid] = tasks[tid];
19                 end
20             end
21             C_samples = probability_sample(T_latency);
22             C_trials += MEASURE_NUM;
23         end
24     end
25 return B_task;

GTA initializes the algorithm with t = 0 and begins with a round-robin warm-up phase, resulting in an initial allocation vector of t = {1, 1, …, 1}. After the warm-up, as shown in line 5 of Algorithm 1, the gradient for each main-task is computed, and the main-task with the maximum absolute gradient, i = argmax_i |∂f/∂t_i|, is selected. A tuning time unit is then allocated to this main-task, updating its allocation to t_i = t_i + 1. The optimization process continues until the tuning time budget is exhausted.
A. Xie et al. Journal of Systems Architecture 160 (2025) 103359
|
||
|
||
|
||
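The gradient step of Algorithm 1 can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function names, the default values of α, η, θ, and the treatment of ∂f/∂m_i as 1 (a plain-sum objective) are all assumptions.

```python
# Illustrative sketch of the DTS gradient step (Eq. (2)): all names and
# defaults are assumed, and df/dm_i is taken as 1 (plain-sum objective).

def approx_gradient(hist, t_i, C_i, V_neighbors,
                    alpha=0.5, eta=0.5, theta=1.0, dt=1):
    """hist[k] is the best latency m_i after k+1 allocated time units."""
    m_now = hist[t_i - 1]
    m_prev = hist[t_i - 1 - dt] if t_i - 1 - dt >= 0 else m_now
    backward = alpha * (m_now - m_prev) / dt            # measured-history term
    forward = (1 - eta) * min(-m_now / t_i,             # optimistic prediction
                              theta * C_i / max(V_neighbors) - m_now)
    return backward + forward

def pick_main_task(tasks):
    """tasks: id -> (hist, t_i, C_i, V_neighbors); returns argmax_i |grad_i|."""
    return max(tasks, key=lambda i: abs(approx_gradient(*tasks[i])))
```

A tuning round would then increment t_i for the chosen main-task, re-measure its programs, and repeat until the time budget is exhausted.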
Afterward, GTA searches for a hardware intrinsic that matches the specified main-task. Once a suitable set of hardware intrinsics is identified, tensor programs are generated for all mapping candidates, serving as a warm-up for the sub-task. This warm-up allows GTA to select the most promising mapping candidates by assigning probabilities based on their performance feedback. In subsequent rounds, only mapping candidates prioritized by their previously assigned probabilities are executed. This selective exploration avoids spending time on inefficient candidates, enhancing tuning efficiency and allowing higher-potential candidates more opportunities for optimization.

The probability_sample algorithm, as called in line 21 of Algorithm 1, is designed to probabilistically select mapping candidates for further analysis and optimization. We first introduce the notation: let R = {r_1, r_2, …, r_n} represent the set of all mapping results, where r_i denotes the i-th result with a performance value V(r_i).

The total weight W is calculated by considering each result's inverse performance value, normalized with respect to the maximum performance in R, as follows:

W = Σ_{r_i∈R} (1/V(r_i)) · (1 / max_{r_j∈R} (1/V(r_j)))

This ensures that weights are scaled relative to the most performant candidate in the result set R. Using this normalized total weight W, the initial probability assigned to each result r_i is given by:

P(r_i) = ((1/V(r_i)) · (1 / max_{r_j∈R} (1/V(r_j)))) / W

To encourage exploration, the algorithm applies a probability increase factor β to selected results. The probability adjustment is defined by weighting the original probability P(r_i) with an exploration boost:

P′(r_i) = ((1 + β) · P(r_i)) / (Σ_{r_j∈R} (1 + β_j) · P(r_j))

Here, β_j is a task-specific exploration factor, applied selectively to candidates r_j, where β_j = β for selected candidates and β_j = 0 otherwise. The inclusion of the initial probability P(r_j), derived from each candidate's performance value V(r_j), serves as the foundation of the adjusted probabilities. This ensures that P′(r_i) retains the relative importance of each candidate while allowing selective exploration through β_j.

The normalization term, Σ_{r_j∈R} (1 + β_j) · P(r_j), ensures that the adjusted probabilities remain valid and sum to 1. By combining the task-specific exploration factor with the initial performance-weighted probability P(r_j), this formula balances exploitation of high-priority candidates with exploration of less performant options. Furthermore, P(r_j) prevents the adjustment from overly concentrating on a small subset of candidates, promoting diversity and fairness across the result set R.

Finally, the algorithm selects the top N results based on the adjusted probabilities P′(r_i). The selection process is expressed as:

{r_i}_{i=1}^{N} = Top_N(P′(r_1), P′(r_2), …, P′(r_n)),    (3)

where N is dynamically determined based on a fraction of the total result set R, denoted by N = ⌈κ · |R|⌉, and κ ∈ (0, 1] is a user-defined parameter controlling selection size.

5. Resource-constrained rules

Existing exploration-based methods face significant challenges in both performance and scalability, primarily due to two factors. First, although the design space is vast, it contains numerous inefficient kernels. For example, in the GEMM operation with dimensions 512 × 768 × 3072 (used in GPT-1 on Tensor Core), the kernel space size reaches O(10^16), with over 90% of the kernels being inefficient [63,64].

Table 4
Resource-constrained rules and related conditions.

No.  Rule                    Condition
R1   Multi-Level Tiling      HasDataReuse(R, i) & HasMultiLevelCache(R, i)
R2   Set Multi-Scope         HasDataReuse(R, i) & HasMultiScopeCache(R, i)
R3   Fuse Main Op            HasStagesFused(R)
R4   Fuse Output Op          HasStagesFused(R)
R5   AddMemLimit             HasDSM(R) [a]
...  Ansor Defined Rule [b]  ...

[a] DSM: dedicated scratchpad memory.
[b] Ansor [25].

Second, current approaches are largely tailored to general-purpose processors and lack consideration for specific architectural constraints. This highlights the need to construct a high-quality kernel design space to effectively reduce inefficient exploration and improve overall performance.

To address these challenges, GTA's implementation of resource-constrained generation rules is based on existing open-source code for DLAs and general-purpose accelerators [22,25]. In particular, the DLA-specific rules are adapted to leverage hardware intrinsics and dedicated scratchpad memory (DSM) efficiently. From a programmer's perspective, DLAs, in contrast to general-purpose accelerators, feature coarse-grained hardware intrinsics (e.g., WMMA in Tensor Core) and user-programmable dedicated DSM (e.g., Unified Buffer in TPU). Based on these existing implementations, we made targeted modifications to better align the rules with the search strategies and optimization methods proposed in this work. Table 4 summarizes five key generation rules that GTA employs to optimize data movement, operation fusion, and memory management in DLAs. Each rule addresses specific challenges to enhance computational efficiency and resource utilization. The following is a detailed description of each rule:

Rule-R1 generates multiple nodes for data movement between different levels of on-chip DSMs. To apply this rule, GTA first checks for data reuse opportunities and verifies if the DLA has multiple DSM levels (e.g., Tensor Core provides two levels of DSMs for WMMA fragments and shared memory). If these conditions are met, GTA inserts cache_read primitives for the node and its producers to facilitate data movement.

Rule-R2 marks the data storage scope for each operation within the DSM hierarchy. To apply this rule, GTA first checks for data reuse opportunities and verifies whether the DLA provides multiple DSM scopes for different data types. If these conditions are satisfied, GTA assigns cache_write primitives to the node and cache_read primitives to its producers, ensuring that data is efficiently stored and accessed within the appropriate DSM levels.

Rule-R3 enables the fusion of main operations within a subgraph by identifying opportunities to combine operations with shared data dependencies. This reduces data movement overhead and improves computational efficiency. When multiple stages are fused, GTA inserts the appropriate primitives to implement the fusion, streamlining the execution flow.

Rule-R4 focuses on fusing output operations within a computational graph. Similar to Rule-R3, it targets operations that can be combined to minimize data transfer costs and enhance throughput. By analyzing data flow between operations, GTA inserts the necessary primitives to achieve output fusion, resulting in a more compact and efficient execution structure.

Rule-R5 constrains memory usage for operations that utilize dedicated scratchpad memory (DSM). By evaluating each operation and its memory requirements, GTA ensures memory limits are respected, preventing allocations from exceeding hardware capacity, which could lead to inefficient execution. This rule helps maintain an efficient memory allocation strategy, optimizing overall resource utilization.
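The probability_sample step described above can be sketched end-to-end: inverse-latency weights normalized to the best candidate, a (1 + β) boost for selected candidates, and a top-N cutoff with N = ⌈κ · |R|⌉. The function name, default parameters, and the convention that V(r_i) is a latency (lower is better) are assumptions for illustration.

```python
import math

# Illustrative sketch of probability_sample: V(r_i) is treated as a latency
# (lower is better); names and default parameters are assumed.

def probability_sample(values, boosted, beta=0.5, kappa=0.5):
    """values: V(r_i) per candidate; boosted: indices given the (1+beta) boost.
    Returns the indices of the top-N candidates by adjusted probability P'."""
    best = min(values)                        # 1 / max_j (1/V(r_j)) == min_j V(r_j)
    w = [best / v for v in values]            # weight relative to the best candidate
    W = sum(w)
    P = [wi / W for wi in w]                  # initial probabilities (sum to 1)
    boost = [1 + beta if i in boosted else 1.0 for i in range(len(values))]
    Z = sum(b * p for b, p in zip(boost, P))  # normalization term
    P_adj = [b * p / Z for b, p in zip(boost, P)]
    N = math.ceil(kappa * len(values))        # N = ceil(kappa * |R|)
    return sorted(range(len(values)), key=lambda i: -P_adj[i])[:N]
```

Because P_adj is renormalized by Z, the adjusted probabilities still sum to 1, matching the normalization term in the text.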
Fig. 3. An illustrative example of tensorized program generation for a GEMM-ReLU operator, demonstrating the transformation of the input program from a mathematical expression (p0) to a tensor expression (p1) written in a domain-specific language using TVM. The process further includes intrinsic matching based on the type and shape of the input operator to select and generate intrinsic mapping candidates, followed by the application of resource-constrained rules to guide the creation of a tensorized program sketch (p2).

An example. Fig. 3 illustrates how resource-constrained rules are applied during tensorized program generation. Starting from the input program written as a mathematical expression (p0), the process converts it into a tensor expression (p1) using a domain-specific language (DSL) in TVM. The intrinsic matching step leverages compute abstraction and memory abstraction, as proposed in AMOS [22], to complete the software-hardware mapping generation. This process selects and generates intrinsic mapping candidates by analyzing the operator's computation type, data type, and memory access patterns based on its shape and hardware-specific constraints. Subsequently, resource-constrained rules play a critical role in guiding the generation of the tensorized program sketch, ensuring efficient utilization of hardware intrinsic functions while respecting memory and architectural constraints. Specifically, the derivation for generated rules and the transformed program can be expressed as:

input p1 → M_cand_i → o(S_0, i = 3) −R2→ o(S_1, i = 3) −R1→ o(S_2, i = 2) −R3→ ⋯ −R5→ output p2    (4)

We define the state as o = (S, i), where S represents the current partially generated sketch program for the DAG, and i denotes the index of the node currently being transformed. For each rule, if the application conditions are met, the rule is applied to o = (S, i), resulting in a new state o′ = (S′, i′), where i′ ≤ i. This ensures that the index i (indicating the transforming node) decreases monotonically. A state reaches a terminal condition when i = 0. During the enumeration process, multiple rules may be applicable to a single state, generating several succeeding states. Additionally, a single rule can produce multiple succeeding states under certain conditions.

6. Implementation

In this section, we delve into the technical details of our implementation. GTA extends TVM, an end-to-end deep learning compiler, to support loop scheduling and generate high-performance programs with intrinsic instructions.

Task Generation. To mitigate the issue of search space explosion, compilers typically divide the large computational graph of a DNN into smaller subgraphs. Notably, for some subgraphs, spending time on tuning may not significantly enhance the end-to-end performance of the DNN. In this work, we adopt TVM's subgraph partitioning strategy to divide the input DNN into multiple smaller subgraphs, referred to as main-tasks. A main-task is considered a process executed to generate high-performance programs for a subgraph. TVM categorizes operators into four types: injective (e.g., add operations), reduction (e.g., sum operations), complex-out-fusable (e.g., matrix multiplication, where element-wise mappings can fuse to the output), and opaque (e.g., sort operations, which cannot be fused). Subgraph fusion is then performed based on predefined generic rules.

Mapping Generation and Scheduling. At each iteration, based on the intrinsic mapping generation approach described in AMOS [22], main-tasks can be classified into intrinsic-disabled and intrinsic-enabled tasks. For intrinsic-disabled main-tasks, we adopt Ansor's [25] compilation optimization to generate programs. In contrast, for intrinsic-enabled main-tasks, GTA optimizes task scheduling based on gradients and probabilities. This algorithm prioritizes subgraphs with higher potential for performance improvement, allocating them more tuning opportunities while reducing efforts on less promising mapping candidates based on performance feedback. Tuning time is sliced and allocated preferentially to important subgraphs and intrinsic mapping candidates, meaning that not all main-tasks and sub-tasks will be executed. For example, an intrinsic-enabled main-task i may contain both retained and discarded mapping candidates. The former will proceed to subsequent tensor program optimization and tuning, while the latter will not participate in further optimization unless they are selected in the next scheduling round.

Search Space Exploration. Subsequently, GTA applies resource-constrained rules and existing derivation rules (Table 4) to each subgraph under the guidance of a genetic algorithm [25]. During this process, tens of thousands of tensor programs are generated, and a cost model is employed to filter out the most promising candidates with near-optimal performance. These selected candidates are then executed on the target hardware to identify the tensor program with the best performance.
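The rule-driven derivation above (states o = (S, i), rule application conditions, the monotonically decreasing index, and the terminal condition i = 0) amounts to a small search over partially generated sketches. A minimal sketch follows, with rules modeled as (condition, apply) pairs; the data structures are assumptions for illustration, not GTA's actual representation.

```python
# Illustrative enumeration of sketch states o = (S, i): every applicable rule
# expands a state, the node index never increases, and i == 0 is terminal.
# Rule and state representations are assumed for illustration.

def enumerate_sketches(init_sketch, n_nodes, rules):
    """rules: list of (condition(S, i) -> bool, apply(S, i) -> [(S', i')])."""
    done, stack = [], [(init_sketch, n_nodes)]
    while stack:
        S, i = stack.pop()
        if i == 0:                  # terminal state: a fully derived sketch
            done.append(S)
            continue
        for cond, apply_rule in rules:
            if cond(S, i):
                for S2, i2 in apply_rule(S, i):
                    assert i2 <= i  # the transforming index decreases monotonically
                    stack.append((S2, i2))
    return done
```

A rule's apply function may return several (S′, i′) pairs, which is how a single rule yields multiple succeeding states; multiple applicable rules likewise branch the enumeration.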
7. Evaluation

7.1. Evaluation platforms

Our experiments were conducted on two distinct hardware platforms to evaluate the performance of the proposed GTA framework:

• NVIDIA GPUs: We performed experiments on two NVIDIA GPUs, specifically the RTX 3060 and A100, which are equipped with Tensor Cores optimized for deep learning tasks. The RTX 3060 represents a consumer-grade GPU, while the A100 is a data center-grade GPU designed for high-performance computing.
• AMD CPU: We evaluated the performance on an AMD Ryzen 7 7840H CPU,² which supports advanced SIMD (Single Instruction, Multiple Data) instructions, enabling efficient vectorized computations. This CPU platform provides a competitive environment for testing AVX512-like optimizations in general-purpose processors, allowing us to benchmark GTA's performance on non-GPU hardware.

² Intel CPUs also support AVX512 instructions and could be used for similar experiments.

7.2. Evaluated benchmarks

We evaluate the performance of GTA using both deep learning (DL) operators and complete neural network models.

• Operator-Level Evaluation: We select nine widely-used operators for this evaluation: General Matrix Multiplication (GEMM), 1D convolution (C1D), 2D convolution (C2D), 3D convolution (C3D), transposed 2D convolution (T2D), dilated convolution (DIL), batch matrix multiplication (BMM), General Matrix–Vector multiplication (GEMV), and scan (SCAN). For each operator, we test 6–10 different shape configurations and report the geometric mean of speedups normalized to GTA. The shape configurations are consistent with those used in Ansor and AMOS to ensure a fair comparison.
• Network-Level Evaluation: We benchmark six commonly-used neural network models: ResNet18 and ResNet50 [1], BERT (base configuration) [65], MI-LSTM [66], MobileNet-V1 [67], and ShuffleNet [11]. For each model, we evaluate the performance with batch sizes of 1 and 16.

7.3. Comparison baselines

Our evaluation compares GTA against three state-of-the-art automatic generation methods (AutoTVM [49], Ansor [25] (v0.8), and AMOS [22] (commit: 0f39742)) as well as two vendor-optimized, hand-tuned libraries (cuDNN (v11.6) and PyTorch (v1.13.1, v2.0.1)):

• AutoTVM: This method uses hand-written templates to support all three selected platforms, demonstrating high performance across a range of baseline operators.
• AMOS: AMOS systematically explores various mappings of loop iterations to DLAs, representing the state-of-the-art for operators with multiple feasible mappings, such as C1D and C2D.
• Ansor: As a leading method for GPU CUDA Core and CPU code generation, Ansor does not support DLAs like Tensor Core due to architectural limitations. However, comparing GTA with Ansor highlights the benefits of leveraging DLA-specific features in tensor program generation.
• PyTorch: PyTorch, a widely-used deep learning framework, serves as a strong baseline for evaluating GTA's ability to outperform standard hand-tuned implementations in practical deep learning applications. Our experiments include both PyTorch 1.13, which relies heavily on vendor-optimized libraries such as cuDNN and cuBLAS for high-performance computations, and PyTorch 2.0, which introduces the TorchInductor compiler.

For a fair comparison, we evaluate AutoTVM, Ansor, AMOS, and GTA with up to 200 measurement trials per test case and report the best performance achieved. For the vendor-optimized libraries on Tensor Core, we use PyTorch, which relies on hand-optimized libraries such as cuDNN to support various types of operators. These optimized libraries serve as strong baseline references for evaluating the performance of GTA.

7.4. Experimental results

We evaluate the performance of GTA on both operators and neural networks, comparing it against several baselines on two DLAs: GPU Tensor Cores and CPU AVX512. To further demonstrate the effectiveness of GTA, we analyze the quality of the generated search spaces and the efficiency of the exploration process. Finally, we highlight how the dual-task scheduling strategy significantly reduces compilation time by dynamically prioritizing subgraphs and mapping candidates, effectively cutting down unnecessary search efforts.

7.5. Operator performance

Tensor Core. First, we compare GTA with PyTorch, which relies on hand-optimized libraries such as cuDNN to support various operators. Fig. 4 shows the results for all operators with batch size 1 on the NVIDIA RTX 3060. GTA consistently outperforms PyTorch across all operators, achieving a 2.44× geometric mean speedup. The speedup is attributed to GTA's comprehensive software-hardware mapping exploration, which contrasts with PyTorch's use of fixed mappings from hand-optimized libraries, often leading to suboptimal performance.

Next, we evaluate the performance on the NVIDIA A100 GPU for various operators. As shown in Fig. 9, GTA achieves 1.26×, 5.24×, and 1.93× geometric mean speedups over Ansor, PyTorch, and AMOS, respectively. The significant improvement is due to GTA's ability to effectively utilize the high-performance Tensor Core units through enhanced mapping and scheduling strategies.

We also compare GTA with state-of-the-art compilers on the RTX 3060 using C2D in NCHW layout. We test all convolution layers from ResNet18 (a total of 12 configurations, labeled as C0–C11). These configurations are standard benchmarks from well-known networks. The results are shown in Figs. 4, 5, and 6. GTA achieves speedups of 1.85×, 1.76×, and 2.10× over Ansor, AMOS, and hand-tuned PyTorch, respectively. Compared to Ansor, GTA leverages high-performance Tensor Core units alongside efficient auto-scheduling strategies, resulting in better optimization. In contrast to AMOS, GTA employs DTS to efficiently explore the scheduling space, reducing search time while enhancing program performance. Moreover, AMOS cannot utilize resource-constrained rules for shared memory allocation, leading to the generation of some tensor programs that exceed hardware resource limits. This limitation reduces AMOS's capability to achieve higher-performing programs.

AVX512. On the AMD CPU platform, we utilize hardware abstraction for AVX512 intrinsics (specifically for matrix–vector multiplication) and apply GTA to generate code for C2D. As shown in Fig. 7, GTA achieves 1.49× and 2.76× performance improvements over Ansor and PyTorch, respectively. GTA's advantage stems from combining high-performance AVX512 intrinsics with efficient auto-scheduling strategies, leading to superior program optimization compared to baseline methods.
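The per-operator numbers above are aggregated as the geometric mean of per-shape speedups. For reference, that aggregation is simply (the sample inputs are made up for illustration, not measured results):

```python
import math

# Geometric-mean aggregation as used for the operator-level speedups.
# The sample inputs below are hypothetical, not measured results.

def geomean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

per_shape_speedups = [2.0, 4.0, 1.0]      # hypothetical speedups over a baseline
aggregate = geomean(per_shape_speedups)   # exp((ln 2 + ln 4 + ln 1) / 3) == 2.0
```

The geometric mean is preferred over the arithmetic mean for ratios such as speedups, since it is symmetric under inversion (a 2× speedup and a 0.5× slowdown average to 1×).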
Fig. 4. Single-operator performance comparison on NVIDIA RTX 3060.

Fig. 5. Performance comparison of C2D on NVIDIA RTX 3060 with batch size = 1, using all convolution layers from ResNet18 (12 configurations, labeled as C0–C11).

Fig. 6. Performance comparison of C2D on NVIDIA RTX 3060 with batch size = 16.

Fig. 7. Performance on AMD Ryzen 7 7840H CPU relative to Ansor and PyTorch.

Fig. 8. Performance of different networks relative to GTA on Tensor Core.
Fig. 9. Performance comparison of GTA across multiple individual operators on the NVIDIA A100 GPU, compared with baseline methods.

Fig. 10. Compilation time overhead and corresponding performance variations under different sampling rates.

7.6. Network performance

Fig. 8 illustrates the performance of GTA on six evaluated networks. On average, GTA achieves 1.75×, 1.42×, and 1.29× speedups over AMOS, PyTorch 1.13, and PyTorch 2.0 with TorchInductor, respectively. For ResNet18 and ResNet50, GTA finds better mappings for operators, enabling more extensive utilization of Tensor Cores compared to hand-tuned libraries and AMOS's optimized templates. GTA overcomes the limitations of these baselines by generating accurate search spaces that encompass most high-performance programs, along with an efficient search algorithm for finding optimal or near-optimal solutions. The results demonstrate GTA's capability to handle complex operators and effectively leverage Tensor Cores for high performance.

7.7. Compilation time

The search time overhead is a critical factor for practical deployment in deep learning frameworks, as reducing it can significantly enhance usability. To evaluate the efficiency of our dual-task scheduling strategy, we analyze the search time and corresponding performance variations under different sampling rates, specifically comparing GTA at sampling rates of 40% (GTA-0.4), 60% (GTA-0.6), and 100% (GTA-Raw). In this experiment, GTA operates at a sampling rate of 20% (GTA-0.2), representing a highly efficient configuration with minimal search overhead. The results, as shown in Fig. 10, demonstrate that as the sampling rate decreases, the search time is significantly reduced while maintaining less than a 5% performance degradation on average, thereby achieving an excellent balance between search efficiency and performance.

Additionally, we compare GTA's search time overhead and performance with AMOS, a state-of-the-art compiler designed for DLAs. Our findings reveal that GTA achieves an average performance improvement of 1.88× over AMOS while maintaining significantly lower search time. Specifically, AMOS's average compilation time is approximately five times that of GTA. This substantial reduction in search time underscores the effectiveness of GTA's dual-task scheduling strategy, which optimizes resource allocation during the search process and enables the rapid identification of high-performance tensor programs.

Unlike traditional methods that exhaustively explore all mapping candidates, GTA employs a dynamic prioritization strategy that adaptively allocates tuning resources based on performance feedback. This strategy ensures that the most promising subgraphs and intrinsic mapping candidates are prioritized, while less promising candidates receive fewer tuning opportunities. By combining this with a sampling-based approach, GTA minimizes unnecessary exploration while maintaining high-quality tensor programs. These results underscore GTA's suitability for real-world deployment scenarios, where both rapid code generation and performance optimization are critical. Furthermore, the ability to adjust sampling rates offers flexibility in balancing search time and performance, making GTA a robust solution for optimizing tensor programs across diverse workloads.

8. Related work

In addition to reviewing DLAs, we summarize related work on numeric precision and dynamic shape optimization for deep learning.

Deep learning accelerators. DLAs offer several significant advantages, making them essential for advancing DNN research and deployment. First, DLAs feature large memory capacities, which accommodate the rapidly growing number of parameters in modern models and facilitate efficient training processes. Second, they provide model-specific optimizations while maintaining a degree of flexibility, enabling tailored performance improvements for various architectures. Third, DLAs support a broader range of data formats, such as FP16, BF16, and INT8, which enhance computational efficiency and reduce memory usage. Finally, DLAs are equipped with a large number of computing units, enabling extensive parallelism to handle the computational demands of DNNs effectively. These characteristics position DLAs as a cornerstone technology for accelerating the training and inference of deep learning models. Following this trend, many emerging accelerators have been proposed, targeting specific algorithms or utilizing new technologies. In academia, the DianNao
family [68–71] significantly improves DL computation performance by leveraging specialized functional units, memory hierarchy, and interconnects. Meanwhile, the expansion of DL applications in industry has led hardware vendors (e.g., NVIDIA Tensor Core [17–19] and Intel NNP [72]), internet giants (e.g., Tesla Dojo [73], Huawei Ascend [74], Google TPU [10] and Apple M4 [75,76]), and startups (e.g., Cambricon MLU [77] and Graphcore IPU [78]) to develop various DLAs. Both academic and industry DLAs are fundamentally domain-specific, rather than general-purpose accelerators, inevitably leading to complex and diverse architectural constraints.

Numeric precision optimization. Quantization [79,80], a pivotal technique in deep learning, reduces the numeric precision of weights and activations to enhance computational efficiency and lower resource requirements. By transitioning from high-precision formats such as FP32 to lower-precision formats like FP16, INT8, or even single-bit representations [81,82], quantization enables significant reductions in memory usage and power consumption [83,84]. The progression of hardware architectures aligns with the increasing demands for low-precision computations. For instance, NVIDIA's recent developments, such as the Turing and Ampere architectures, incorporated INT8 and INT4 tensor cores to enhance efficiency. Meanwhile, the latest Hopper architecture has shifted focus by replacing INT4 support with FP8 tensor cores, prioritizing improved numerical precision. These advancements allow large-scale models, including Large Language Models (LLMs) [85], to be deployed on resource-constrained devices like edge devices and DLAs without sacrificing performance. Compilers play a critical role in making quantization effective. Tools like AMOS [22], PreTuner [86] and LADDER [39] introduce advanced optimizations for low-precision data types, including hardware-aware scheduling, loop tiling, and fine-grained scaling strategies. Expanding on existing techniques, an automated approach [87] integrates bit-slicing into the scheduling phase, treating quantization as part of the schedule space. Coupled with program synthesis, this method efficiently generates hardware-specific kernels, supporting diverse quantization configurations and ensuring seamless adaptation to new hardware architectures.

Dynamic shape optimization. Dynamic-shape workloads are characteristic of DNN models where tensor shapes vary at runtime based on input data, such as the sequence length in Transformer models. These workloads pose substantial challenges for existing autotuning frameworks like TVM, which primarily rely on static input shapes to construct search spaces and cost models. For instance, TVM's second-generation IR, Relay [35], lacks the capability to represent dynamic tensors. While its third-generation IR, Relax [88], introduces symbolic shapes to support dynamic workloads, Relax still depends on hand-written templates for tensor program generation and lacks automatic tuning support. To address these limitations, recent works such as Nimble [89], DietCode [90], FTuner [91], and MIKPOLY [92] have introduced innovative techniques. These approaches construct shape-agnostic search spaces and cost models to optimize dynamic-shape workloads. For example, DietCode effectively groups kernels with varying shapes into unified workloads, enabling efficient tuning as a single entity and significantly reducing overall tuning time. FTuner introduces a uKernel-based approach for dynamic tensors, leveraging hardware-aware constraints to generate high-performance kernel programs and

search space by coordinating intrinsic-based automatic mapping abstraction with a rule-based tensor program generation strategy and applies pruning rules to eliminate ineffective program candidates. Additionally, GTA employs a dual-task scheduling strategy for tensorized programs, effectively reducing tuning efforts while enhancing performance. Experimental results on three DLAs show that GTA outperforms state-of-the-art automatic generation approaches and vendor-provided hand-tuned libraries by 1.88× and 2.29×, respectively.

CRediT authorship contribution statement

Anxing Xie: Writing – original draft, Software, Resources, Project administration, Methodology, Investigation, Data curation. Yonghua Hu: Writing – review & editing, Supervision, Investigation, Funding acquisition. Yaohua Wang: Writing – review & editing, Supervision, Methodology, Investigation, Funding acquisition, Formal analysis. Zhe Li: Writing – review & editing, Supervision, Investigation, Formal analysis. Yuxiang Gao: Investigation. Zenghua Cheng: Investigation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable suggestions. This work is supported by the National Key R&D Program of China (No. 2022ZD0119003), Hunan Provincial Natural Science Foundation (No. 2023JJ50019), the Postgraduate Scientific Research Innovation Project of Hunan Province (No. CX20231019) and the National Natural Science Foundation of China (No. 62272477).

Data availability

Data will be made available on request.

References

[1] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[2] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, et al., Evolving deep neural networks, in: Artificial Intelligence in the Age of Neural Networks and Brain Computing, Elsevier, 2024, pp. 269–287.
[3] C.-Y. Wang, I.-H. Yeh, H.-Y. Mark Liao, Yolov9: Learning what you want to learn using programmable gradient information, in: European Conference on Computer Vision, Springer, 2025, pp. 1–21.
[4] A. Vaswani, Attention is all you need, Adv. Neural Inf. Process. Syst. (2017).
[5] P.P. Ray, ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope, Internet Things Cyber-Phys. Syst. 3 (2023) 121–154.
[6] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The llama 3 herd of models, 2024, arXiv preprint arXiv:2407.21783.
[7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U.
||
combining uKernels during runtime to optimize padding and execution Franke, S. Roth, B. Schiele, The cityscapes dataset for semantic urban scene
|
||
efficiency. While these advancements mark significant progress, further understanding, in: Proceedings of the IEEE Conference on Computer Vision and
|
||
Pattern Recognition, 2016, pp. 3213–3223.
|
||
research is needed to fully exploit the potential of dynamic-shape DNNs
|
||
[8] D. Fu, X. Li, L. Wen, M. Dou, P. Cai, B. Shi, Y. Qiao, Drive like a human:
|
||
on modern hardware accelerators. Rethinking autonomous driving with large language models, in: Proceedings of
|
||
the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp.
|
||
910–919.
|
||
9. Conclusion
|
||
[9] C. Cui, Y. Ma, X. Cao, W. Ye, Y. Zhou, K. Liang, J. Chen, J. Lu, Z. Yang, K.-D.
|
||
Liao, et al., A survey on multimodal large language models for autonomous
|
||
We propose GTA, a novel compilation framework for high- driving, in: Proceedings of the IEEE/CVF Winter Conference on Applications of
|
||
performance tensor program generation on DLAs. GTA expands the Computer Vision, 2024, pp. 958–979.
|
||
|
||
|
||
11
|
||
[10] N.P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., In-datacenter performance analysis of a tensor processing unit, in: Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 1–12.
[11] X. Zhang, X. Zhou, M. Lin, J. Sun, ShuffleNet: An extremely efficient convolutional neural network for mobile devices, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
[12] C.-Y. Wang, A. Bochkovskiy, H.-Y.M. Liao, YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7464–7475.
[13] Z. Xu, W. Wang, H. Dai, Y. Xu, XFC: Enabling automatic and fast operator synthesis for mobile deep learning compilation, J. Syst. Archit. 142 (2023) 102921.
[14] C. Hao, X. Zhang, Y. Li, S. Huang, J. Xiong, K. Rupnow, W.-m. Hwu, D. Chen, FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge, in: Proceedings of the 56th Annual Design Automation Conference 2019, 2019, pp. 1–6.
[15] W. Jiang, L. Yang, E.H.-M. Sha, Q. Zhuge, S. Gu, S. Dasgupta, Y. Shi, J. Hu, Hardware/software co-exploration of neural architectures, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 39 (12) (2020) 4805–4815.
[16] Z. Xie, M. Emani, X. Yu, D. Tao, X. He, P. Su, K. Zhou, V. Vishwanath, Centimani: Enabling fast AI accelerator selection for DNN training with a novel performance predictor, in: 2024 USENIX Annual Technical Conference, USENIX ATC 24, 2024, pp. 1203–1221.
[17] Nvidia, Ampere architecture white paper, 2022, URL: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Online (Accessed 13 November 2024).
[18] Nvidia, Turing architecture white paper, 2022, URL: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Online (Accessed 13 November 2024).
[19] Nvidia, Volta architecture white paper, 2022, URL: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Online (Accessed 13 November 2024).
[20] K. Troester, R. Bhargava, AMD next generation ‘‘Zen 4’’ core and 4th Gen AMD EPYC™ 9004 server CPU, in: 2023 IEEE Hot Chips 35 Symposium, HCS, IEEE Computer Society, 2023, pp. 1–25.
[21] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, et al., TVM: An automated end-to-end optimizing compiler for deep learning, in: 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 18, 2018, pp. 578–594.
[22] S. Zheng, R. Chen, A. Wei, Y. Jin, Q. Han, L. Lu, B. Wu, X. Li, S. Yan, Y. Liang, AMOS: enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction, in: Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022, pp. 874–887.
[23] S. Feng, B. Hou, H. Jin, W. Lin, J. Shao, R. Lai, Z. Ye, L. Zheng, C.H. Yu, Y. Yu, et al., TensorIR: An abstraction for automatic tensorized program optimization, in: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2023, pp. 804–817.
[24] J. Bi, Q. Guo, X. Li, Y. Zhao, Y. Wen, Y. Guo, E. Zhou, X. Hu, Z. Du, L. Li, et al., Heron: Automatically constrained high-performance library generation for deep learning accelerators, in: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2023, pp. 314–328.
[25] L. Zheng, C. Jia, M. Sun, Z. Wu, C.H. Yu, A. Haj-Ali, Y. Wang, J. Yang, D. Zhuo, K. Sen, et al., Ansor: Generating high-performance tensor programs for deep learning, in: 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 20, 2020, pp. 863–879.
[26] S. Zheng, Y. Liang, S. Wang, R. Chen, K. Sheng, FlexTensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system, in: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 859–873.
[27] A. Sabne, XLA: Compiling machine learning for peak performance, Google Res. (2020).
[28] N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W.S. Moses, S. Verdoolaege, A. Adams, A. Cohen, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, 2018, arXiv preprint arXiv:1802.04730.
[29] P. Tillet, H.-T. Kung, D. Cox, Triton: an intermediate language and compiler for tiled neural network computations, in: Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019, pp. 10–19.
[30] C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, O. Zinenko, MLIR: A compiler infrastructure for the end of Moore’s law, 2020, arXiv preprint arXiv:2002.11054.
[31] L. Ma, Z. Xie, Z. Yang, J. Xue, Y. Miao, W. Cui, W. Hu, F. Yang, L. Zhang, L. Zhou, Rammer: Enabling holistic deep learning compiler optimizations with rTasks, in: 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 20, 2020, pp. 881–897.
[32] J. Zhao, B. Li, W. Nie, Z. Geng, R. Zhang, X. Gao, B. Cheng, C. Wu, Y. Cheng, Z. Li, et al., AKG: automatic kernel generation for neural processing units using polyhedral transformations, in: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021, pp. 1233–1248.
[33] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, Adv. Neural Inf. Process. Syst. 32 (2019).
[34] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., TensorFlow: a system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 16, 2016, pp. 265–283.
[35] J. Roesch, S. Lyubomirsky, M. Kirisame, L. Weber, J. Pollock, L. Vega, Z. Jiang, T. Chen, T. Moreau, Z. Tatlock, Relay: A high-level compiler for deep learning, 2019, arXiv preprint arXiv:1904.08368.
[36] J. Zhao, X. Gao, R. Xia, Z. Zhang, D. Chen, L. Chen, R. Zhang, Z. Geng, B. Cheng, X. Jin, Apollo: Automatic partition-based operator fusion through layer by layer optimization, in: MLSys, 2022.
[37] Y. Shi, Z. Yang, J. Xue, L. Ma, Y. Xia, Z. Miao, Y. Guo, F. Yang, L. Zhou, Welder: Scheduling deep learning memory access via tile-graph, in: 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 23, 2023, pp. 701–718.
[38] C. Xia, J. Zhao, Q. Sun, Z. Wang, Y. Wen, T. Yu, X. Feng, H. Cui, Optimizing deep learning inference via global analysis and tensor expressions, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2024, pp. 286–301.
[39] L. Wang, L. Ma, S. Cao, Q. Zhang, J. Xue, Y. Shi, N. Zheng, Z. Miao, F. Yang, T. Cao, et al., Ladder: Enabling efficient low-precision deep learning computing through hardware-aware tensor transformation, in: 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 24, 2024, pp. 307–323.
[40] F. Wang, M. Shen, Y. Lu, N. Xiao, TensorMap: A deep RL-based tensor mapping framework for spatial accelerators, IEEE Trans. Comput. (2024).
[41] Y. Zhao, H. Sharif, V. Adve, S. Misailovic, Felix: Optimizing tensor programs with gradient descent, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024, pp. 367–381.
[42] Q. Zhao, R. Wang, Y. Liu, H. Yang, Z. Luan, D. Qian, Sifter: An efficient operator auto-tuner with speculative design space exploration for deep learning compiler, IEEE Trans. Comput. (2024).
[43] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, S. Amarasinghe, Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, ACM SIGPLAN Not. 48 (6) (2013) 519–530.
[44] Y. Bai, X. Yao, Q. Sun, W. Zhao, S. Chen, Z. Wang, B. Yu, GTCO: Graph and tensor co-design for transformer-based image recognition on tensor cores, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. (2023).
[45] H. Kwon, P. Chatarasi, V. Sarkar, T. Krishna, M. Pellauer, A. Parashar, MAESTRO: A data-centric approach to understand reuse, performance, and hardware cost of DNN mappings, IEEE Micro 40 (3) (2020) 20–29.
[46] L. Lu, N. Guan, Y. Wang, L. Jia, Z. Luo, J. Yin, J. Cong, Y. Liang, TENET: A framework for modeling tensor dataflow based on relation-centric notation, in: 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture, ISCA, IEEE, 2021, pp. 720–733.
[47] A. Parashar, P. Raina, Y.S. Shao, Y.-H. Chen, V.A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S.W. Keckler, J. Emer, Timeloop: A systematic approach to DNN accelerator evaluation, in: 2019 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS, IEEE, 2019, pp. 304–315.
[48] X. Yang, M. Gao, Q. Liu, J. Setter, J. Pu, A. Nayak, S. Bell, K. Cao, H. Ha, P. Raina, et al., Interstellar: Using Halide’s scheduling language to analyze DNN accelerators, in: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 369–383.
[49] T. Chen, L. Zheng, E. Yan, Z. Jiang, T. Moreau, L. Ceze, C. Guestrin, A. Krishnamurthy, Learning to optimize tensor programs, Adv. Neural Inf. Process. Syst. 31 (2018).
[50] J. Appleyard, S. Yokim, NVIDIA developer technical blog, 2017, URL: https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9 Online (Accessed 13 November 2024).
[51] NVIDIA, Basic linear algebra on NVIDIA GPUs, 2024, URL: https://developer.nvidia.com/cublas Online (Accessed 13 November 2024).
[52] A. Kerr, H. Wu, M. Gupta, D. Blasig, P. Ramini, D. Merrill, A. Shivam, P. Majcher, P. Springer, M. Hohnerbach, J. Wang, M. Nicely, CUTLASS, 2022, URL: https://github.com/NVIDIA/cutlass Online (Accessed 13 November 2024).
[53] T. Zerrell, J. Bruestle, Stripe: Tensor compilation via the nested polyhedral model, 2019, arXiv preprint arXiv:1903.06498.
[54] R. Baghdadi, J. Ray, M.B. Romdhane, E. Del Sozzo, A. Akkas, Y. Zhang, P. Suriana, S. Kamil, S. Amarasinghe, Tiramisu: A polyhedral compiler for expressing fast and portable code, in: 2019 IEEE/ACM International Symposium on Code Generation and Optimization, CGO, IEEE, 2019, pp. 193–205.
[55] S. Tavarageri, A. Heinecke, S. Avancha, B. Kaul, G. Goyal, R. Upadrasta, PolyDL: Polyhedral optimizations for creation of high-performance DL primitives, ACM Trans. Archit. Code Optim. (TACO) 18 (1) (2021) 1–27.
[56] Q. Huang, M. Kang, G. Dinh, T. Norell, A. Kalaiah, J. Demmel, J. Wawrzynek, Y.S. Shao, CoSA: Scheduling by constrained optimization for spatial accelerators, in: 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture, ISCA, IEEE, 2021, pp. 554–566.
[57] M. Sotoudeh, A. Venkat, M. Anderson, E. Georganas, A. Heinecke, J. Knight, ISA mapper: a compute and hardware agnostic deep learning compiler, in: Proceedings of the 16th ACM International Conference on Computing Frontiers, 2019, pp. 164–173.
[58] J. Weng, A. Jain, J. Wang, L. Wang, Y. Wang, T. Nowatzki, UNIT: Unifying tensorized instruction compilation, in: 2021 IEEE/ACM International Symposium on Code Generation and Optimization, CGO, IEEE, 2021, pp. 77–89.
[59] H. Zhu, R. Wu, Y. Diao, S. Ke, H. Li, C. Zhang, J. Xue, L. Ma, Y. Xia, W. Cui, et al., Roller: Fast and efficient tensor compilation for deep learning, in: 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 22, 2022, pp. 233–248.
[60] Y. Ding, C.H. Yu, B. Zheng, Y. Liu, Y. Wang, G. Pekhimenko, Hidet: Task-mapping programming paradigm for deep learning tensor programs, in: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2023, pp. 370–384.
[61] L. Zheng, H. Wang, J. Zhai, M. Hu, Z. Ma, T. Wang, S. Huang, X. Miao, S. Tang, K. Huang, et al., EINNET: Optimizing tensor programs with derivation-based transformations, in: 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 23, 2023, pp. 739–755.
[62] Y. Zhai, S. Yang, K. Pan, R. Zhang, S. Liu, C. Liu, Z. Ye, J. Ji, J. Zhao, Y. Zhang, et al., Enabling tensor language model to assist in generating high-performance tensor programs for deep learning, in: 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 24, 2024, pp. 289–305.
[63] F. Wang, M. Shen, Y. Ding, N. Xiao, Soter: Analytical tensor-architecture modeling and automatic tensor program tuning for spatial accelerators, in: 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture, ISCA, IEEE, 2024, pp. 991–1004.
[64] F. Wang, M. Shen, Automatic kernel generation for large language models on deep learning accelerators, in: 2023 IEEE/ACM International Conference on Computer Aided Design, ICCAD, IEEE, 2023, pp. 1–9.
[65] J. Devlin, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018, arXiv preprint arXiv:1810.04805.
[66] Y. Wu, S. Zhang, Y. Zhang, Y. Bengio, R.R. Salakhutdinov, On multiplicative integration with recurrent neural networks, Adv. Neural Inf. Process. Syst. 29 (2016).
[67] A.G. Howard, MobileNets: Efficient convolutional neural networks for mobile vision applications, 2017, arXiv preprint arXiv:1704.04861.
[68] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, O. Temam, DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning, ACM SIGARCH Comput. Archit. News 42 (1) (2014) 269–284.
[69] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., DaDianNao: A machine-learning supercomputer, in: 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, IEEE, 2014, pp. 609–622.
[70] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, Y. Chen, PuDianNao: A polyvalent machine learning accelerator, ACM SIGARCH Comput. Archit. News 43 (1) (2015) 369–381.
[71] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, O. Temam, ShiDianNao: Shifting vision processing closer to the sensor, in: Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015, pp. 92–104.
[72] B. Hickmann, J. Chen, M. Rotzin, A. Yang, M. Urbanski, S. Avancha, Intel Nervana neural network processor-T (NNP-T) fused floating point many-term dot product, in: 2020 IEEE 27th Symposium on Computer Arithmetic, ARITH, IEEE, 2020, pp. 133–136.
[73] E. Talpes, D. Williams, D.D. Sarma, Dojo: The microarchitecture of Tesla’s exascale computer, in: 2022 IEEE Hot Chips 34 Symposium, HCS, IEEE Computer Society, 2022, pp. 1–28.
[74] H. Liao, J. Tu, J. Xia, H. Liu, X. Zhou, H. Yuan, Y. Hu, Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper, in: 2021 IEEE International Symposium on High-Performance Computer Architecture, HPCA, IEEE, 2021, pp. 789–801.
[75] Apple, Apple introduces M4 chip, 2024, URL: https://www.apple.com/sg/newsroom/2024/05/apple-introduces-m4-chip/ Online (Accessed 13 November 2024).
[76] Apple, Apple introduces M4 Pro and M4 Max, 2024, URL: https://www.apple.com/sg/newsroom/2024/10/apple-introduces-m4-pro-and-m4-max/ Online (Accessed 13 November 2024).
[77] Cambricon, Cambricon MLU, 2024, URL: https://www.cambricon.com/ Online (Accessed 13 November 2024).
[78] Z. Jia, B. Tillman, M. Maggioni, D.P. Scarpazza, Dissecting the Graphcore IPU architecture via microbenchmarking, 2019, arXiv preprint arXiv:1912.03413.
[79] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Quantized neural networks: Training neural networks with low precision weights and activations, J. Mach. Learn. Res. 18 (187) (2018) 1–30.
[80] T. Liang, J. Glossner, L. Wang, S. Shi, X. Zhang, Pruning and quantization for deep neural network acceleration: A survey, Neurocomput. 461 (2021) 370–403.
[81] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1, 2016, arXiv preprint arXiv:1602.02830.
[82] M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, XNOR-Net: ImageNet classification using binary convolutional neural networks, in: European Conference on Computer Vision, Springer, 2016, pp. 525–542.
[83] C.-C. Yang, Y.-R. Chen, H.-H. Liao, Y.-M. Chang, J.-K. Lee, Auto-tuning fixed-point precision with TVM on RISC-V packed SIMD extension, ACM Trans. Des. Autom. Electron. Syst. 28 (3) (2023) 1–21.
[84] D. Diamantopoulos, B. Ringlein, M. Purandare, G. Singh, C. Hagleitner, Agile autotuning of a transprecision tensor accelerator overlay for TVM compiler stack, in: 2020 30th International Conference on Field-Programmable Logic and Applications, FPL, IEEE, 2020, pp. 310–316.
[85] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, H. Jin, T. Chen, Z. Jia, Towards efficient generative large language model serving: A survey from algorithms to systems, 2023, arXiv preprint arXiv:2312.15234.
[86] J. Xu, G. Song, B. Zhou, F. Li, J. Hao, J. Zhao, A holistic approach to automatic mixed-precision code generation and tuning for affine programs, in: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024, pp. 55–67.
[87] M. Cowan, T. Moreau, T. Chen, J. Bornholt, L. Ceze, Automatic generation of high-performance quantized machine learning kernels, in: Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization, 2020, pp. 305–316.
[88] R. Lai, J. Shao, S. Feng, S.S. Lyubomirsky, B. Hou, W. Lin, Z. Ye, H. Jin, Y. Jin, J. Liu, et al., Relax: Composable abstractions for end-to-end dynamic machine learning, 2023, arXiv preprint arXiv:2311.02103.
[89] H. Shen, J. Roesch, Z. Chen, W. Chen, Y. Wu, M. Li, V. Sharma, Z. Tatlock, Y. Wang, Nimble: Efficiently compiling dynamic neural networks for model inference, Proc. Mach. Learn. Syst. 3 (2021) 208–222.
[90] B. Zheng, Z. Jiang, C.H. Yu, H. Shen, J. Fromm, Y. Liu, Y. Wang, L. Ceze, T. Chen, G. Pekhimenko, DietCode: Automatic optimization for dynamic tensor programs, Proc. Mach. Learn. Syst. 4 (2022) 848–863.
[91] P. Mu, L. Wei, Y. Liu, R. Wang, FTuner: A fast dynamic shape tensors program auto-tuner for deep learning compilers, 2024, arXiv preprint arXiv:2407.21418.
[92] F. Yu, G. Li, J. Zhao, H. Cui, X. Feng, J. Xue, Optimizing dynamic-shape neural networks on accelerators via on-the-fly micro-kernel polymerization, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, pp. 797–812.


Anxing Xie is currently working toward a Ph.D. degree in the School of Computer Science and Engineering, Hunan University of Science and Technology, China. He is currently working on deep learning automatic compilation optimization and high-performance computation. His research interests include compiler optimization and parallel computing.

Yonghua Hu is a professor in the School of Computer Science and Engineering, Hunan University of Science and Technology, China. He received the Ph.D. degree in Computer Application Technology from Hunan University in 2008. He went to University at Buffalo, SUNY as a visiting scholar in 2019. His research interests include compilation optimization, artificial intelligence and parallel computing.
Yaohua Wang is currently a professor with the College of Computer Science, National University of Defense Technology. His research interests are in computer architecture, machine learning and security. His work spans and stretches the boundaries of computer architecture. He is especially excited about novel, fundamentally efficient computation and memory/storage paradigms, applied to emerging machine learning applications.

Zhe Li received the Ph.D. degree in Computer Science from Jilin University in 2022. He is currently working at the Tianjin Institute of Advanced Technology. His research interests include deep learning compilation and combinatorial optimization.

Yuxiang Gao is currently working toward an M.S. degree in the School of Computer Science and Engineering, Hunan University of Science and Technology, China. He is currently working on code optimization and compilation technology. His research interests include automatic compilation optimization and code generation.

Zenghua Cheng is currently working toward an M.S. degree in the School of Computer Science and Engineering, Hunan University of Science and Technology, China. He is currently working on code optimization and compilation technology. His research interests include automatic compilation optimization and Web security.