Computer Standards & Interfaces 97 (2026) 104120
Journal homepage: www.elsevier.com/locate/csi

Energy consumption assessment in embedded AI: Metrological improvements of benchmarks for edge devices

Andrea Apicella b, Pasquale Arpaia a,∗, Luigi Capobianco d, Francesco Caputo a, Antonella Cioffi d, Antonio Esposito a, Francesco Isgrò a, Rosanna Manzo c, Nicola Moccaldi a, Danilo Pau e, Ettore Toscano d

a Dipartimento di Ingegneria Elettrica e delle Tecnologie dell’Informazione, Università degli Studi di Napoli Federico II, Naples, Italy
b Dipartimento di Ingegneria dell’Informazione ed Elettrica e Matematica applicata (DIEM), Università degli Studi di Salerno, Fisciano, Italy
c Dipartimento di Sanità Pubblica e Medicina Preventiva, Università degli Studi di Napoli Federico II, Naples, Italy
d Software Design Center, STMicroelectronics, Marcianise, Italy
e System Research and Applications, STMicroelectronics, Agrate Brianza, Italy

Keywords: Energy assessment; Embedded AI; Tiny-ML; Uncertainty analysis; Edge device benchmark

ABSTRACT

This manuscript proposes a new method to improve the MLCommons protocol for measuring power consumption on Microcontroller Units (MCUs) when running edge Artificial Intelligence (AI). In particular, the proposed approach (i) selectively measures the power consumption attributable to the inferences (namely, the predictions performed by Artificial Neural Networks, ANNs), preventing the impact of other operations, (ii) accurately identifies the time window for acquiring the current samples thanks to the simultaneous measurement of power consumption and inference duration, and (iii) precisely synchronizes the measurement windows and the inferences.
The method is validated on three use cases: (i) Rockchip RV1106, a neural MCU that implements ANNs in hardware through a dedicated neural processing unit accelerator, (ii) STM32 H7, and (iii) STM32 U5, a high-performance and an ultra-low-power general-purpose microcontroller, respectively. The proposed method returns higher power consumption than the MLCommons approach for two of the three devices (Rockchip RV1106 and STM32 H7). This result is compatible with an improvement of selectivity and accuracy. Furthermore, the method reduces measurement uncertainty on the Rockchip RV1106 and STM32 boards by factors of 6 and 12, respectively.

∗ Corresponding author.
E-mail addresses: andapicella@unisa.it (A. Apicella), pasquale.arpaia@unina.it (P. Arpaia), luigi.capobianco@st.com (L. Capobianco), francesco.caputo3@unina.it (F. Caputo), antonella.cioffi@st.com (A. Cioffi), antonio.esposito9@unina.it (A. Esposito), francesco.isgro@unina.it (F. Isgrò), rosanna.manzo@unina.it (R. Manzo), nicola.moccaldi@unina.it (N. Moccaldi), danilo.pau@st.com (D. Pau), ettore.toscano@st.com (E. Toscano).
https://doi.org/10.1016/j.csi.2025.104120
Received 10 January 2025; Received in revised form 2 September 2025; Accepted 21 December 2025; Available online 22 December 2025
0920-5489/© 2025 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

The rapid expansion of Internet of Things (IoT) devices has ushered in a new era of connected intelligence at the edge, where data processing, low latency, and real-time decision making can take place directly at the edge [1]. These IoT devices cover a variety of applications, from smart home sensors [2], to industrial automation [3], and health monitoring systems [4], where low-latency responses and energy efficiency are essential.

Extending computation to more peripheral network nodes enhances all key aspects of edge computing, including energy efficiency, carbon footprint reduction, security, latency, privacy, offline functionality, and data management costs [5]. However, deploying intelligence at the end nodes requires careful consideration of the inherent limitations of IoT devices, such as memory and computational resources, which affect time performance, and energy constraints. For Microcontroller Units (MCUs), widely used in IoT, this is particularly true. Many IoT applications, such as autonomous driving [6], demand low-latency responses to be effectively reactive. Moreover, several IoT devices often operate under very limited power sources. Promising energy-efficient strategies aim to minimize consumption. For instance, index modulation [7,8] is a transmission technique that conveys additional information through the indices of available resources such as antennas, subcarriers, or time slots, and it can significantly reduce energy usage while maintaining data throughput. Nevertheless, even with advanced optimization strategies, the repetitive and frequent processing required by many applications can rapidly deplete power resources, thereby limiting device lifetime.

In recent years, Machine Learning (ML) methods [9], particularly Artificial Neural Networks (ANNs), have been increasingly deployed on IoT devices to enhance localized data processing capabilities and reduce dependency on cloud infrastructures [10,11]. It is common to refer to these devices as tiny devices [12] and embedded ML as tiny machine learning or tiny ML [5].
Consequently, assessing the inference time provided by the IoT hardware for a specific ANN model is crucial to ensure that the embedded system can satisfy real-time processing requirements. In this context, inference refers to the process of an ANN generating outputs based on its trained model parameters and given inputs.

Therefore, tailored energy consumption metrics are essential to ensure the alignment between the ANN implementation and the energy constraints of the targeted IoT application. To this aim, Neural MCUs are new edge devices embedding ANN accelerators, specifically designed to manage the trade-off between reliability, latency, cost, and power consumption [13]. Therefore, adopting standardized metrics and procedures is essential for assessing the actual performance gains achieved by neural MCUs in the context of embedded AI. Although several frameworks and tools have been proposed to facilitate the benchmarking of tinyML models [14–16], no standardized metrics and procedures are currently defined.

Among the proposed benchmarking protocols, MLPerf Tiny Benchmark (MLPTB) [17] is developed by the MLCommons Association, the largest and most authoritative community aimed at improving the industrialization standardization process of machine learning [18]. MLPTB provides protocols and AI components, namely datasets and pre-trained ML models. These can act as metrological references when implemented on different hardware to assess their performance, such as the inference time and the power consumption under real-world conditions. However, the MLPTB protocols exhibit some metrological weaknesses: (i) the assessment of both time performance and energy consumption is carried out without computing the measurement uncertainty, (ii) the energy consumption analysis is based on an approximate estimate of the average inference duration, and (iii) the impact on consumption caused by inferences is not isolated with respect to other processes.

In this paper, a new method is proposed and validated to improve MLPTB protocols to measure power consumption in MCUs running ANNs, in a rigorous metrological framework. Specifically, in Section 2 the MLPTB framework is reported, then the proposed method is presented in Section 3. Experiments and results are reported in Section 4 and discussed in Section 5.

2. Background

Several frameworks and tools have been introduced to support the benchmarking of tinyML models [14–16]. Among the available benchmarking protocols, the MLPerf Tiny Benchmark (MLPTB) [17], developed by the MLCommons Association [18], emerges as a key initiative.

MLPTB proposes two modalities of assessment: (i) Performance and (ii) Energy. The former measures latency (inferences per second, IPS) and accuracy (the percentage of correct predictions over all predictions) through a direct USB connection between a Device Under Test (DUT) and a host computer, while the latter measures energy (microjoules per inference). In the remainder of this section, the energy configuration mode is detailed, as it represents the central focus of this study. In the energy configuration mode (Fig. 1), an Energy Monitor is proposed to supply power to the DUT while measuring the current consumption. An Input/Output Manager is introduced to interface the Host Computer with the DUT, serving as an electrical-isolation proxy. Furthermore, MLPTB requires level shifters to adapt the power supply in input to the DUT (not reported in Fig. 1 to simplify the schematic, as they are not essential to the discussion).

Fig. 1. Energy measurement setup proposed by MLPerf Tiny Benchmark [17,19]. The DUT is powered by the Energy Monitor. The IO manager serves as an electrical-isolation proxy.

In addition to defining assessment procedures, MLPTB provides some firmware and software [19] for ML tasks on the DUT. In particular, the provided firmware to be loaded onto the DUT ensures the following functionalities: (i) sending a trigger signal, (ii) enabling UART communication, (iii) generating and feeding random input data to the ANN, (iv) performing inferences, and (v) printing the prediction results. The software includes a graphical user interface that can be run on the Host Computer, allowing the initiation of the measurement and monitoring of input data. It is important to emphasize that in phase (iii) random data are generated to feed the ANN. This operation, however, does not reflect real-world applications, where the network processes sensor data in real time. Although this step is not an intrinsic part of ANN inference, MLPTB includes it in the performance and energy measurements. Throughout this paper, phase (iii) is explicitly distinguished from phase (iv) (i.e., inference) and is referred to as the pre-inference phase.

The energy per inference ($E_{inf}$) is calculated using latency information determined in the Performance phase. Specifically, the IPS is determined by taking the median value across five experiments. In each experiment, input data is provided for a duration of at least 10 s, and the number of inferences is recorded via a direct connection between the Host Computer and the DUT. Given the IPS, $E_{inf}$ is computed as:

$$E_{inf} = \frac{I_m \times V_n}{\tau \times IPS} \tag{1}$$

where $V_n$ is the nominal voltage and $I_m$ is the current averaged over the fixed period $\tau$.

3. Proposed method

The MLCommons pre-inference phase generates random numbers as input to the ANN in order to perform inference (in addition to memory operations needed to provide the input to the network). However, random number generation is hardly reproducible across different devices under test, since both the libraries and the hardware resources available on the microcontrollers for random number generation vary. In contrast, the proposed work selectively excludes the pre-inference phase from the performance and energy measurements, ensuring greater reproducibility while also providing a closer adherence to the actual operation of the device in real-world scenarios. In the remainder of this section, the proposed method is described. In Section 3.1 the circuit solution for the joint measurement of time and energy consumption is described. In Sections 3.2 and 3.3 the expected impact of the method on selectivity, accuracy, and uncertainty during the energy measurement is highlighted.

3.1. Circuit diagram and measurement procedure

The proposed method uses an ammeter that does not require powering the DUT to measure the absorbed current. The ammeter is connected in series to the microprocessor on the MCU powered by the Host Computer through the USB port (Fig. 2). This approach allows the Host Computer to perform both latency and energy measurements simultaneously. Indeed, the firmware provided by MLPTB enables the DUT to update the Host Computer on the number of completed inferences through the USB connection. Instead of computing the energy per inference as the ratio between the total energy measured in a specific time window and the number of inferences (MLPTB method), the proposed method computes the energy for each inference without considering the impact of the pre-inference phase. This is obtained by modifying the firmware provided by MLPTB: the trigger is replaced by a logic signal (inference status) that goes high during an ongoing inference and returns low otherwise. The inference status signal output from the device under test is sampled by the Measurement Board (ammeter) in parallel with the current (Fig. 3(a)). Two vectors of synchronously sampled data (current and inference status signal) are sent to the Host Computer. The current samples are processed, and the energy consumption is calculated only when the inference status samples indicate a high logic signal. Additionally, before and after each inference, the DUT reads the values of the Clock and Reset Management Unit (CRMU) and transmits them to the Host Computer to determine the duration of the inference. Finally, the software on the Host Computer computes the mean value of $N$ inferences with associated uncertainty. In this work, $N$ is set to 100. Similar to the MLPTB, the proposed firmware runs as the sole program on the MCU, with fully sequential execution and no concurrency or interrupts. Furthermore, in the proposed method, the inference status signal is set high immediately after the pre-inference phase, and the CRMU is queried right before the inference execution. As soon as the inference completes, the CRMU is queried again, and finally the inference status is set low to signal the ammeter that the inference has finished. In Fig. 4, a flowchart describing the customized firmware behavior is reported.

Fig. 2. Proposed energy measurement setup. The Host Computer powers the DUT and an ammeter is connected in series along the power line on the DUT (e.g., an MCU).

Fig. 3. Comparison between the block diagram of the proposed method (a) and the MLCommons-Tiny approach (b) for energy consumption measurement. The added blocks and signals are reported in red. In the proposed method, the Device Under Test stops the power consumption computation after each inference. Differently, in the MLCommons-Tiny approach, the Host Computer stops the acquisition of current samples after a fixed time window, without distinguishing between pre-inference and inference phases. Furthermore, it computes the energy consumption (μJ per inference) based on the Inferences per Second measured in the Performance mode (see Section 2). The Counter and the Time Calculator blocks are used for the measurement of the duration of each inference, while an Inference Status ADC minimizes the latency between the inference start and current sample consideration.

Fig. 4. Flow chart of the proposed firmware. The pre-inference phase (in red) is excluded from both time (CRMU timestamp read) and energy assessment ("Inference Status" digital signal setting and unsetting).

3.2. Accuracy improvements

In the MLPTB, the number of inferences during the measurement time in energy mode is calculated using the IPS obtained from the previous latency measurement. This approach introduces accuracy issues because an estimator is used instead of the actual time of each inference. Furthermore, it is assumed with a non-negligible degree of approximation that the inferences are executed consecutively by the MCU, disregarding the impact of inter-inference operations that are still present. Finally, the delays in the transmission of the command for starting the measurement have a further impact on the accuracy, albeit to a very small extent. Specifically, this refers to the time taken by the CPU on the DUT to generate the trigger signal and by the Measurement Board to handle the interrupt triggered at its input pin (see Fig. 3).

In the proposed method, limiting the observation to a single inference at a time eliminates the approximation inherent in MLPTB, where the inference duration is estimated through the average of multiple successive inferences executed within a known time window. Specifically, the proposed method allows the exclusion of all energy contributions unrelated to the inference itself (e.g., data transfer operations to memory during the pre-inference phase). However, in the proposed method, the repetition of the measurement for each inference amplifies the impact of inaccuracies caused by the delay in transmitting the status signal. In contrast, the MLPTB approach mitigates this effect because the delay only occurs at the start of the measurement for multiple inferences. To address this issue, the inference duration ($\Delta t$) measurement is also performed. In the firmware for the DUT, the onboard counter is read immediately before and after the inference execution. $\Delta t$ is used to appropriately resize the current sample vector acquired while the inference status signal is active. The current sample vector is trimmed at both ends by a number of elements ($N_{trim}$), calculated as follows:

$$N_{trim} = \frac{f_c}{2}\left(\frac{N_{cs}}{f_c} - \Delta t\right) \tag{2}$$

where $f_c$ is the sampling frequency of the ammeter, $N_{cs}$ is the number of current samples acquired when the inference status signal is high, and $\Delta t$ is the inference duration.

3.3. Uncertainty improvements

Two distinct phases should be addressed in the evaluation of uncertainty: (i) the inference time measurement, and (ii) the energy consumption assessment. In particular, an important source of uncertainty in MLPTB is due to the counting of inferences during the IPS measurement, which affects the inference time measurement and, consequently, also the energy consumption assessment. More specifically, the measurement window is not an integer multiple of the inference period; therefore, there is no synchronization between the end of the last inference and the end of the measurement window. This contribution can be modeled by a uniform random variable whose domain has a width equal to the central value of the inference duration $\Delta t_m$, with a standard deviation $\sigma_{1cont}$ computed as:

$$\sigma_{1cont} = u_{t1} = \frac{\Delta t_m}{2\sqrt{3}} \tag{3}$$

The uncertainty of the MLPTB method is assessed by assuming the median inference duration approximately equal to the mean. Differently, in the proposed method the counting uncertainty is determined by the fact that the inference duration is not an integer multiple of the counter period ($T_c$). Again, a random variable with uniform probability distribution effectively describes this aspect. The standard deviation $\sigma_{2cont}$ is computed as:

$$\sigma_{2cont} = u_{t2} = \frac{T_c}{2\sqrt{3}} \tag{4}$$

Assuming that $\Delta t_m \gg T_c$, it follows that $u_{t1} \gg u_{t2}$ and the proposed method improves the measurement uncertainty due to counting.

Then there is the uncertainty due to the variability of the duration of the processes between the inferences (pre-inference phase). The proposed method is not affected by this source of uncertainty because it excludes from the energy measurement all the processes outside the inference. Finally, both methods are exposed to the uncertainty of the stability of the DUT (jitter) and ammeter precision, as well as to the uncertainty of the signal transmission times between the devices involved in the measurement process. For the calculation of the measurement uncertainty, the combined standard uncertainty $u_c$ is adopted, where the contribution from the type A evaluation ($u_A$) is integrated with the $K$ contributions from the type B evaluations ($u_{B_k}$), according to the following formula [20]:

$$u_c = \sqrt{u_A^2 + u_{B_1}^2 + u_{B_2}^2 + \cdots + u_{B_K}^2} \tag{5}$$

4. Experiments and results

In this section, a comparison between the application of the proposed and MLPTB methods is presented. In Section 4.1 the experimental procedure is described. The DUTs and the ammeter are presented in Section 4.2. Results are reported in Section 4.3.

4.1. Experimental procedure

The MLPTB method was implemented using two different circuit configurations for measuring inference duration and energy per inference, as described in [17]. Instead, in the proposed method the two measurements were realized with the same circuit solution shown in Fig. 2. The firmware used for the MLPTB measurement was modified to allow the measurement of the single inference as described in Section 3.1. The four MLPerf benchmarks were retained: (i) Anomaly Detection, (ii) Keyword Spotting, (iii) Image Classification, (iv) Visual Wake Words. Each benchmark targets a specific use case and specifies a dataset, a model, and a quality target [17].

4.2. Experimental setup

Both methods are applied on three different MCUs: STMicroelectronics STM32-H7 (clock frequency = 280 MHz), STMicroelectronics STM32-U5 (clock frequency = 160 MHz), and Rockchip RV1106 (clock frequency = 1200 MHz). The STM32H7 and the STM32U5 are general-purpose microcontrollers, the former designed for high-performance applications and the latter for ultra-low-power operation, both produced by STMicroelectronics. These devices do not have any dedicated Neural Processing Unit (NPU) hardware for ANN computation, so this part is carried out by firmware running on the main Central Processing Unit (CPU). The firmware is automatically deployed using ST EdgeAI Core Technology and compiled through the STM32CubeIDE [21] compiler, implementing all needed tools to convert, optimize, and implement ANN models on the DUT.

Fig. 5. Hardware components used in the experiments: (a) H7 board with STM32H7 MCU, (b) Luckfox Pico Pro Max with Rockchip RV1106 SoC, (c) B-U585I-IOT02A Discovery Kit with STM32U5 MCU, and (d) Power Profiler Kit II ammeter.

The evaluation boards of the STMicroelectronics Nucleo-STM32H7 with STM32H7 microcontroller and B-U585I-IOT02A Discovery Kit with STM32U5 microcontroller were chosen for the experimental setup (Figs. 5(a), 5(c)).
They include a connector in series to the MCU’s power supply line, allowing an ammeter to be inserted to assess the power consumption of the DUT under operating conditions.

The RV1106 is a System on Chip (SoC) produced by Rockchip Electronics. This device has dedicated NPU hardware, so the computation of ANN models is performed in hardware, and the software only has to allocate the necessary data into a dedicated memory area. While STM32 microcontrollers operate without an operating system, the RV1106 requires one given its CPU architecture. Ubuntu 22.04 RT [22] was therefore installed to minimize execution timing uncertainties.

The software is deployed using the RKNN Toolkit compiler, which implements all needed tools to convert, optimize, and implement ANN models on the device. The evaluation board with Rockchip RV1106 chosen for the experimental setup is the Luckfox Pico Pro Max (Fig. 5(b)). The ammeter is inserted between the USB-C main supply and the SoC’s power supply line in order to assess the power consumption of the device under operating conditions.

The measurement board used for the power assessment is the Power Profiler Kit II (PPKII) produced by Nordic Semiconductor (Fig. 5(d)). This device is composed of an ammeter and an 8-bit digital sampler synchronized with the same time base. It can work in two different modes that affect only the ammeter component:

• Source Meter: in this mode, the internal ammeter is linked to a power supply generator that can be used to power the DUT. This mode was adopted for the MLPTB implementation.
• Ammeter Mode: in this mode, the instrument works as a pure ammeter and the power supply of the DUT can be provided externally. This mode was used in the application of the proposed method.

For both modes, the device was metrologically characterized under operating conditions of 20–30 °C (the same conditions used for all experiments), exhibiting an uncertainty of less than 2%.

4.3. Results

For the proposed method, a characterization of the CRMU query latency was carried out on all devices. A modified version of the same firmware used for the energy consumption assessment was employed. Specifically, an additional CRMU query was appended directly after the preceding one, making it consecutive to the two already present. The CRMU query latency was measured as the difference between the counter values returned by two consecutive CRMU readings. On each board, 30 experiments were performed, each providing two latency values. For each board, the mean value and type A uncertainty were computed. In the worst case, namely the Rockchip, the latency was found to be 7 ± 4 CPU clock cycles (2 ± 1 for the other two boards), which corresponds to only a few nanoseconds.

Tables 1, 2, and 3 present the results of the inference duration ($\Delta t$) assessments conducted using both the MLPTB and the proposed methods. The results are reported for the Rockchip RV1106, STM32H7, and STM32U5, respectively, with varying ANN models. Concerning uncertainty computation, the MLPTB method does not provide strategies for calculating measurement uncertainty and, in this work, it was computed by referring to the sole contribution of inference counting (Eq. (3)). In the proposed method, since the Clock and Reset Management Unit (CRMU) of the MCUs is employed for inference time measurement, the type A uncertainty is combined with type B contributions arising from counting uncertainty, system clock stability (jitter), and the response time required by the CRMU to be queried and to return a value. For all the considered microcontrollers, the type B contribution was found to be dominated by the counting uncertainty, computed using Eq. (4), and equal to 289 ns. The jitter contribution is at least three orders of magnitude smaller at room temperature (between 20 °C and 30 °C) [23–25]. Similarly, the uncertainty related to the CRMU response time, characterized in this work for all three microcontrollers, was found to be equal to 1 CPU clock cycle. In the worst case, i.e., considering the STM32U5 device with the lowest CPU clock frequency, this contribution was on the order of nanoseconds. Therefore, the overall evaluated uncertainty corresponds to the joint contribution of type A and type B, with the latter coinciding with the counting uncertainty, according to:

$$u_t = \sqrt{u_A^2 + u_B^2} \tag{6}$$

To propagate the measurement uncertainty of $\Delta t$ to the energy per inference ($E_{inf}$) measurement, a constant power $P$ is assumed during the inference time, obtaining the following propagation formula:

$$E_{inf} = P\,\Delta t \;\Rightarrow\; u_e = P\,u_t \tag{7}$$

where $u_e$ is the energy per inference measurement uncertainty. With respect to the energy consumption estimation, an additional uncertainty source arises from the measuring instrument, i.e., the ammeter employed. For both methods, an instrumental uncertainty of 2% was considered, after a metrological characterization performed under operational conditions at room temperature (between 20 °C and 30 °C).

Table 1. Comparison of central value ($m_t$) and uncertainty (a) ($u_t$) of inference duration (expressed in ms) assessed by the MLCommons and proposed methods on the Rockchip RV1106 for the different neural models.

Method     Visual Wake Words    Image Classification    Keyword Spotting    Anomaly Detection
           m_t      u_t         m_t      u_t            m_t      u_t        m_t      u_t
Proposed   0.820    0.006       0.415    0.012          0.400    0.008      0.558    0.033
MLPTB      0.815    0.235       0.414    0.120          0.371    0.107      0.350    0.101

(a) In MLPTB, the counting uncertainty was taken into account.
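The MLPTB uncertainties in Table 1 and the 289 ns counting uncertainty quoted for the proposed method can be checked numerically from Eqs. (3), (4), and (6). The following is a minimal sketch, not the authors' software: the helper names `uniform_std` and `combined` are hypothetical, and the counter period of 1 µs is an assumption inferred from the 289 ns figure, since the text does not state $T_c$ explicitly.

```python
import math


def uniform_std(width: float) -> float:
    """Standard deviation of a uniform distribution of total width `width`
    (the form used in Eqs. (3) and (4))."""
    return width / (2.0 * math.sqrt(3.0))


def combined(*contributions: float) -> float:
    """Root-sum-of-squares combination of standard uncertainties (Eqs. (5), (6))."""
    return math.sqrt(sum(u * u for u in contributions))


# MLPTB central inference durations on the Rockchip RV1106, from Table 1 (ms).
mlptb_ms = {
    "Visual Wake Words": 0.815,
    "Image Classification": 0.414,
    "Keyword Spotting": 0.371,
    "Anomaly Detection": 0.350,
}

# Eq. (3): counting uncertainty of the MLPTB method, rounded as in Table 1.
u_t1 = {model: round(uniform_std(dt), 3) for model, dt in mlptb_ms.items()}

# Eq. (4): counting uncertainty of the proposed method.
T_C_S = 1e-6  # ASSUMPTION: 1 us counter period, inferred from the quoted 289 ns
u_t2_ns = uniform_std(T_C_S) * 1e9

print(u_t1)
print(u_t2_ns)
```

Under these assumptions, the gap between the two methods in Table 1 follows directly from the ratio $\Delta t_m / T_c$: the MLPTB term scales with the inference duration itself, while the proposed term scales only with the counter resolution.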
Table 2. Comparison of central value ($m_t$) and uncertainty (a) ($u_t$) of inference duration (expressed in ms) assessed by the MLCommons and proposed methods on the STM32H7 microcontroller for the different neural models.

Method     Visual Wake Words    Image Classification    Keyword Spotting    Anomaly Detection
           m_t      u_t         m_t       u_t           m_t      u_t        m_t      u_t
Proposed   29.656   0.003       49.941    0.001         14.860   0.001      1.690    0.002
MLPTB      29.600   8.545       51.900    14.982        15.400   4.446      1.800    0.520

(a) In MLPTB, the counting uncertainty was taken into account.

Table 3. Comparison of central value ($m_t$) and uncertainty (a) ($u_t$) of inference duration (expressed in ms) assessed by the MLCommons and proposed methods on the STM32U5 microcontroller for the different neural models.

Method     Visual Wake Words    Image Classification    Keyword Spotting    Anomaly Detection
           m_t      u_t         m_t       u_t           m_t      u_t        m_t      u_t
Proposed   78.447   0.002       133.280   0.002         48.060   0.001      4.910    0.002
MLPTB      71.600   20.669      128.200   37.008        38.600   11.143     4.800    1.386

(a) In MLPTB, the counting uncertainty was taken into account.

Table 4. Comparison of central value ($m_t$) and uncertainty (a) ($u_e$) of energy (expressed in μJ) assessed by the MLCommons and proposed methods on the Rockchip RV1106 for the different neural models.

Method     Visual Wake Words    Image Classification    Keyword Spotting    Anomaly Detection
           m_t      u_e         m_t      u_e            m_t      u_e        m_t      u_e
Proposed   380      13          193      15             165      9          222      11
MLPTB      373      108         183      53             159      46         148      43

(a) In MLPTB, the counting uncertainty was propagated into the energy measurements.

Table 5. Comparison of central value ($m_t$) and uncertainty (a) ($u_e$) of energy (expressed in μJ) assessed by the MLCommons and proposed methods on the STM32H7 microcontroller for the different neural models.

Method     Visual Wake Words    Image Classification    Keyword Spotting    Anomaly Detection
           m_t      u_e         m_t      u_e            m_t      u_e        m_t      u_e
Proposed   4386     88          7536     151            2202     44         236      6
MLPTB      3699     1068        6311     1822           1870     540        221      64

(a) In MLPTB, the counting uncertainty was propagated into the energy measurements.
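The "Proposed" rows in the energy tables rely on the gating and trimming procedure of Sections 3.1 and 3.2: only current samples taken while the inference status signal is high are kept, and each gated window is trimmed at both ends according to Eq. (2). The sketch below illustrates that procedure on synthetic data. It is not the authors' host software: the function names are hypothetical, rounding $N_{trim}$ to the nearest integer is an assumption, and so is the rectangle-rule integration $E = V_n \sum I / f_c$, since the text does not specify how the samples are integrated.

```python
from typing import List, Sequence, Tuple


def high_runs(status: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (start, stop) index pairs of contiguous runs where status is high."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, s in enumerate(status):
        if s and start is None:
            start = i                      # rising edge: inference begins
        elif not s and start is not None:
            runs.append((start, i))        # falling edge: inference ends
            start = None
    if start is not None:                  # record ends while status is still high
        runs.append((start, len(status)))
    return runs


def n_trim(n_cs: int, f_c: float, dt: float) -> int:
    """Eq. (2), rearranged as (N_cs - f_c * dt) / 2.
    ASSUMPTION: rounded to the nearest integer and clamped at zero."""
    return max(0, round(0.5 * (n_cs - f_c * dt)))


def energy_per_inference(current: Sequence[float], status: Sequence[int],
                         durations: Sequence[float], f_c: float,
                         v_n: float) -> List[float]:
    """Energy (J) of each gated inference window.
    ASSUMPTION: rectangle-rule integration at nominal voltage v_n."""
    energies = []
    for (start, stop), dt in zip(high_runs(status), durations):
        k = n_trim(stop - start, f_c, dt)
        window = current[start + k: stop - k]   # drop k samples at each end
        energies.append(v_n * sum(window) / f_c)
    return energies


# Synthetic demo: 1 kHz sampling, one 10 ms inference gated by a 12-sample pulse.
status = [0, 0] + [1] * 12 + [0, 0]
current = [0.1] * len(status)                   # constant 100 mA
energies = energy_per_inference(current, status, [0.010], f_c=1000.0, v_n=3.3)
print(energies)
```

In this toy case the status pulse is two samples longer than the counter-measured duration, so one sample is dropped at each end before integrating, which is the role Eq. (2) plays in compensating for status-signal latency.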
The final uncertainty was thus obtained by applying the following formula:

$$u_e = \sqrt{u_{t_p}^2 + u_s^2} \tag{8}$$

where $u_{t_p}$ denotes the inference time measurement uncertainty $u_t$ propagated through the functional relation used for energy computation (see Eq. (7)), and $u_s$ represents the instrumental uncertainty of the ammeter. The measurement uncertainty obtained for the proposed method appears, for all tested devices, to be very low compared to the uncertainty of the MLPTB method.

Tables 4, 5, and 6 report a comparison between the results of the energy per inference assessment by the MLPTB and proposed methods for the three DUTs. On the Rockchip RV1106, the proposed method measures an inference energy value that is, on average, 15% higher than that obtained with MLPTB, while improving the uncertainty by a factor of 6. For the STM32H7, the inference energy assessment grows by 16% while the uncertainty improves by a factor of 12. Notably, the inference energy assessment on the STM32U5 shows contrasting trends: for two networks, the measured consumption is higher with the proposed method, while for the other two networks it is higher with MLCommons. Regarding the uncertainty, the proposed method reduces it by a factor of 12.

5. Discussion

The contrasting trends from the energy assessment on the STM32U5 provide an opportunity to discuss the relationship between the two methods in terms of metrological accuracy. The MLCommons method extracts a central Inferences Per Second value based on five experiments, whereas our method computes a central value as the mean over 100 acquisitions. Given the large uncertainty of the MLPTB method and the limited number of experiments, the calculated central value is unlikely to be a reliable estimator of the true value of the measured quantity [26]. The comparison of mean values obtained with the two methods is limited by the large difference in their associated uncertainties. The less precise method exhibits an uncertainty up to two orders
of magnitude higher than the other, rendering direct statistical comparisons of the means largely insignificant. Observed differences may therefore primarily reflect the inherent variability of the less accurate method rather than genuine differences in the measured phenomenon. However, it is important to note that the proposed method provides greater selectivity by excluding the pre-inference phase (characterized by low energy consumption) from the calculation (Fig. 6). This prevents underestimation of the actual energy consumption, which may occur when using the MLPTB method.

Table 6. Comparison of central value ($m_t$) and uncertainty (a) ($u_e$) of energy (expressed in μJ) assessed by the MLCommons and proposed methods on the STM32U5 microcontroller for the different neural models.

Method     Visual Wake Words    Image Classification    Keyword Spotting    Anomaly Detection
           m_t      u_e         m_t      u_e            m_t      u_e        m_t      u_e
Proposed   2362     47          3249     65             1184     27         116      3
MLPTB      1921     556         3384     980            1004     291        121      35

(a) In MLPTB, the counting uncertainty was propagated into the energy measurements.

Fig. 6. Temporal diagram of current values acquired from the MCU during ANN operations. Orange traces represent (a) the inference status signal in the proposed method and (b) the trigger signal in the MLPTB method. The windows used for energy consumption estimation are highlighted in light blue. Specifically, the proposed method (a) considers only the current samples acquired during each neural network inference phase, whereas the MLPTB method (b) also includes the energy contribution of pre-inference phases (light yellow window).

Finally, Figs. 7, 8, and 9 present the histograms of the energy per inference assessment with the two methods on Rockchip RV1106, STM32H7, and STM32U5, respectively. The orange bars (proposed method) are generally higher than the green bars (MLPTB). However, comparing the mean values measured by the two methods is challenging due to the large uncertainty intervals (error bars) associated with MLPTB. Nevertheless, the differences in error bar lengths confirm the improved precision of the proposed method.

Fig. 7. Comparison between the proposed method (orange) and MLPTB (green) in the energy per inference assessment on the Rockchip RV1106, for the different models provided by MLCommons.

Fig. 8. Comparison between the proposed method (orange) and MLPTB (green) in the energy per inference assessment on the STM32 H7, for the different models provided by MLCommons.

Fig. 9. Comparison between the proposed method (orange) and MLPTB (green) in the energy per inference assessment on the STM32 U5, for the different models provided by MLCommons.

The metrological improvements introduced in this work have direct consequences for the practical adoption of embedded AI. First, more accurate and reproducible energy assessments enhance the reliability of benchmarking, enabling fair comparisons among devices and supporting informed selection of hardware for battery-powered applications, where autonomy is a critical design constraint. Second, the improved accuracy in energy characterization facilitates more precise sizing of power supply components, which is essential for ensuring efficiency, stability, and cost-effectiveness in embedded deployments. Finally, the refined timing characterization allows designers to better estimate inference latency, a key parameter for real-time and safety-critical applications.

6. Conclusions

A new method for assessing power consumption of edge devices such as MCUs running ANNs is presented, claiming metrological improvements over the MLPerf Tiny Benchmark. Unlike MLPTB, the proposed method calculates the duration and energy consumption of each individual inference performed by the Device Under Test. Through an appropriate circuit and firmware design, the method measures only the energy consumed by the inference, excluding other operations from the computation. This approach not only enhances the selectivity and accuracy of the measurement process but also reduces measurement uncertainty. Instead of counting the number of inferences over a fixed interval, as MLPTB does, the proposed method counts the number of ticks from the counter of the DUT during a single inference execution. On an NPU-powered microcontroller, the proposed method improves measurement uncertainty by a factor of 6. In the case of two general-purpose microcontrollers (high-performance and ultra-low-power), the measurement uncertainty improves by a factor of 12.

CRediT authorship contribution statement

Andrea Apicella: Writing – review & editing, Methodology, Conceptualization.

[6] M. Cunneen, M. Mullins, F. Murphy, Autonomous vehicles and embedded artificial intelligence: The challenges of framing machine driving decisions, Appl. Artif. Intell. 33 (8) (2019) 706–731.
[7] J. Li, S. Dang, M. Wen, Q. Li, Y. Chen, Y. Huang, W. Shang, Index modulation
Pasquale Arpaia: Writing – review & editing, Method- multiple access for 6G communications: Principles, applications, and challenges, ology, Conceptualization. Luigi Capobianco: Writing – review & edit- IEEE Netw. 37 (1) (2023) 52–60. ing, Methodology, Conceptualization. Francesco Caputo: Writing – re- [8] M. Wen, B. Zheng, K.J. Kim, M. Di Renzo, T.A. Tsiftsis, K.-C. Chen, N. view & editing, Writing – original draft, Visualization, Validation, Soft- Al-Dhahir, A survey on spatial modulation in emerging wireless systems: Re- search progresses and applications, IEEE J. Sel. Areas Commun. 37 (9) (2019) ware, Methodology, Investigation, Formal analysis, Data curation, Con- 1949–1972. ceptualization. Antonella Cioffi: Writing – review & editing, Methodol- [9] M.I. Jordan, T.M. Mitchell, Machine learning: Trends, perspectives, and ogy, Conceptualization. Antonio Esposito: Writing – review & editing, prospects, Science 349 (6245) (2015) 255–260. Methodology, Conceptualization. Francesco Isgrò: Writing – review [10] S. Mishra, J. Manda, Improving real-time analytics through the internet of things and data processing at the network edge, J. AI Assist. Sci. Discov. 4 (1) (2024) & editing, Methodology, Conceptualization. Rosanna Manzo: Writ- 184–206. ing – review & editing, Methodology, Conceptualization. Nicola Moc- [11] M. De Donno, K. Tange, N. Dragoni, Foundations and evolution of mod- caldi: Writing – review & editing, Methodology, Conceptualization. ern computing paradigms: Cloud, IoT, edge, and fog, IEEE Access 7 (2019) Danilo Pau: Writing – review & editing, Methodology, Conceptual- 150936–150948. ization. Ettore Toscano: Writing – review & editing, Methodology, [12] D.P. Pau, P.K. Ambrose, F.M. Aymone, A quantitative review of automated neural search and on-device learning for tiny devices, Chips 2 (2) (2023) 130–141. Conceptualization. [13] C.-T. Lin, P.X. Huang, J. Oh, D. Wang, M. 
Seok, iMCU: A 102-𝜇J, 61-ms digital in-memory computing-based microcontroller unit for edge TinyML, in: 2023 IEEE Declaration of competing interest Custom Integrated Circuits Conference, CICC, IEEE, 2023, pp. 1–2. [14] S. Gal-On, M. Levy, Exploring coremark a benchmark maximizing simplicity and efficacy, Embed. Microprocess. Benchmark Consortium (2012). The authors declare that they have no known competing finan- [15] P. Torelli, M. Bangale, Measuring Inference Performance of Machine-Learning cial interests or personal relationships that could have appeared to Frameworks on Edge-Class Devices with the Mlmark Benchmark, Techincal Re- influence the work reported in this paper. port, 2021, Available Online: https://www.eembc.org/techlit/articles/MLMARK- WHITEPAPERFINAL-1.pdf. (Accessed on 5 April 2021). Acknowledgments [16] B. Sudharsan, S. Salerno, D.-D. Nguyen, M. Yahya, A. Wahid, P. Yadav, J.G. Breslin, M.I. Ali, Tinyml benchmark: Executing fully connected neural networks on commodity microcontrollers, in: 2021 IEEE 7th World Forum on Internet of This work was carried out within the DHEAL-COM project (ID: PNC- Things, WF-IoT, IEEE, 2021, pp. 883–884. E3-2022-23683267 PNC – HLS – DH; CUP: E63C22003790001), which [17] C. Banbury, V.J. Reddi, P. Torelli, J. Holleman, N. Jeffries, C. Kiraly, P. Montino, was financially supported by the Italian Ministry of Health through D. Kanter, S. Ahmed, D. Pau, et al., Mlperf tiny benchmark, 2021, arXiv preprint arXiv:2106.07597. the Complementary National Plan (CNP) to the PNRR. This publication [18] MLCommons, 2024, URL: https://mlcommons.org/benchmarks/inference-tiny/. reflects only the authors’ view and the Italian Ministry of Health is not [19] Performance mode vs. Energy mode, 2022, URL: https://github.com/eembc/ responsible for any use that may be made of the information it contains. energyrunner?tab=readme-ov-file#performance-mode-vs-energy-mode. [20] B.N. Taylor, C.E. 
Kuyatt, Guidelines for Evaluating and Expressing the Un- Data availability certainty of NIST Measurement Results, NIST Technical Note 1297, National Institute of Standards and Technology (NIST), Gaithersburg, MD, 2020, http: //dx.doi.org/10.6028/NIST.TN.1297-2020. Data will be made available on request. [21] STMCubeIDE, 2022, URL: https://stm32ai.st.com/stm32-cube-ai/. [22] Ubuntu 12 RT, 2012, Real-time variant of Ubuntu 12, Canonical Ltd. https: //ubuntu.com/real-time. Canonical Ltd. References [23] STMicroelectronics, STM32H753xI - 32-bit Arm® Cortex® -M7 480MHz MCUs, 2MB flash, 1MB RAM, 46 com. and Analog Interfaces, Crypto - Datasheet - [1] R. Chataut, A. Phoummalayvane, R. Akl, Unleashing the power of IoT: A Production Data, Datasheet DS12117 Rev 9, STMicroelectronics, 2023, p. 358, comprehensive review of IoT applications and future prospects in healthcare, URL: https://www.st.com/resource/en/datasheet/stm32h753vi.pdf. (Accessed 21 agriculture, smart homes, smart cities, and industry 4.0, Sensors 23 (16) (2023) August 2025). 7194. [24] STMicroelectronics, STM32U575xx - Ultra-low-power Arm® Cortex® -M33 32-bit [2] Q. Ma, H. Tan, T. Zhou, Mutual authentication scheme for smart devices in MCU+TrustZone® +FPU, 240 DMIPS, up to 2 MB Flash memory, 786 KB SRAM - IoT-enabled smart home systems, Comput. Stand. Interfaces 86 (2023) 103743. Datasheet - production data, Datasheet DS13737 Rev 10, STMicroelectronics, [3] C.-W. Shih, C.-H. Wang, Integrating wireless sensor networks with statistical 2024, p. 346, URL: https://www.st.com/resource/en/datasheet/stm32u575ag. quality control to develop a cold chain system in food industries, Comput. Stand. pdf. (Accessed 21 August 2025). Interfaces 45 (2016) 62–78. [25] UEC Electronics, AR4236–AR4237 Luckfox Pico Pro/Max Datasheet, [4] S.B. Baker, W. Xiang, I. 
Atkinson, Internet of things for smart healthcare: Datasheet, UEC Electronics, 2024, URL: https://uelectronics.com/wp- Technologies, challenges, and opportunities, IEEE Access 5 (2017) 26521–26544. content/uploads/2024/07/AR4236-AR4237-Luckfox-Pico-Pro-Max-Datasheet.pdf. [5] Y. Abadade, A. Temouden, H. Bamoumen, N. Benamar, Y. Chtouki, A.S. Hafid, (Accessed 21 August 2025). A comprehensive survey on tinyml, IEEE Access (2023). [26] I. BIPM, I. IFCC, I. ISO, O. IUPAP, Evaluation of measurement data—guide to the expression of uncertainty in measurement, JCGM 100: 2008 GUM 1995 with minor corrections, Jt. Comm. Guides Metrol. 98 (2008). 9