## Compute and Energy Consumption Trends in Deep Learning Inference
## Radosvet Desislavov
VRAIN, Universitat Politècnica de València, Spain. radegeo@inf.upv.es
## Fernando Martínez-Plumed
European Commission, Joint Research Centre. fernando.martinez-plumed@ec.europa.eu
VRAIN, Universitat Politècnica de València, Spain. fmartinez@dsic.upv.es
## José Hernández-Orallo
VRAIN, Universitat Politècnica de València, Spain. jorallo@upv.es
## Abstract
The progress of some AI paradigms, such as deep learning, is said to be linked to an exponential growth in the number of parameters. There are many studies corroborating these trends, but does this translate into an exponential increase in energy consumption? In order to answer this question we focus on inference costs rather than training costs, as the former account for most of the computing effort, solely because of the multiplicative factors. Also, apart from algorithmic innovations, we account for more specific and powerful hardware (leading to higher FLOPS) that is usually accompanied by important energy efficiency optimisations. We also move the focus from the first implementation of a breakthrough paper towards the consolidated versions of the techniques one or two years later. Under this distinctive and comprehensive perspective, we study relevant models in the areas of computer vision and natural language processing: for a sustained increase in performance we see a much softer growth in energy consumption than previously anticipated. The only caveat is, yet again, the multiplicative factor, as AI increases its penetration and becomes more pervasive in the future.
## Introduction
As Deep Neural Networks (DNNs) become more widespread in all kinds of devices and situations, what is the associated energy cost? In this work we explore the evolution of different metrics of deep learning models, paying particular attention to inference computational cost and its associated energy consumption. The full impact, and its final carbon footprint, not only depends on the internalities (hardware and software directly involved in their operation) but also on the externalities (all social and economic activities around it). From the AI research community, we have more to say and do about the former. Accordingly, more effort is needed, within AI, to better account for the internalities, as we do in this paper.
For the revised and published version, refer to:
Desislavov, Radosvet, Fernando Martínez-Plumed, and José Hernández-Orallo. 'Trends in AI inference energy consumption: Beyond the performance-vs-parameter laws of deep learning'. Sustainable Computing: Informatics and Systems, Volume 38, April 2023. (DOI: https://doi.org/10.1016/j.suscom.2023.100857)
In our study, we differentiate between training and inference. At first glance it seems that training cost is higher. However, for deployed systems, inference costs exceed training costs, because of the multiplicative factor of using the system many times [Martinez-Plumed et al., 2018]. Training, even if it involves many repetitions, is done once, but inference is done repeatedly. It is estimated that inference accounts for up to 90% of the costs [Thomas, 2020]. There are several studies about training computation and its environmental impact [Amodei and Hernandez, 2018, Gholami et al., 2021a, Canziani et al., 2017, Li et al., 2016, Anthony et al., 2020, Thompson et al., 2020] but very few focus on inference costs and their associated energy consumption.
DNNs are deployed almost everywhere [Balas et al., 2019], from smartphones to automobiles, each with its own compute, temperature and battery limitations. Precisely because of this, there has been pressure to build DNNs that are less resource demanding, even if larger DNNs usually outperform smaller ones. As an alternative to this on-device use, many larger DNNs run on data centres, with people accessing them repeatedly in a transparent way, e.g., when using social networks [Park et al., 2018]. Millions of requests imply millions of inferences over the same DNN.
Many studies report that the size of neural networks is growing exponentially [Xu et al., 2018, Bianco et al., 2018]. However, this does not necessarily imply that cost is also growing exponentially: more weights can be implemented with the same amount of energy, mostly due to hardware specialisation and the decreasing energy consumption per unit of compute. There is also the question of whether the changing costs of energy and their carbon footprint [EEA, 2021] should be added to the equation. Finally, many studies focus on the state-of-the-art (SOTA) or cutting-edge methods according to a given performance metric, but many algorithmic improvements come in the months or years after a new technique is introduced, in the form of general-use implementations achieving similar results with much lower compute requirements. All these elements have been studied separately, but a more comprehensive and integrated analysis is necessary to properly evaluate whether the impact of AI on energy consumption and its carbon footprint is alarming or simply worrying, in order to calibrate the measures to be taken in the following years and estimate the effect in the future.
For conducting our analysis we chose two representative domains: Computer Vision (CV) and Natural Language Processing (NLP). For CV we analysed image classification models, and ImageNet [Russakovsky et al., 2015] more specifically, because there is a great quantity of historical data in this area and many advances in this domain are normally brought to other computer vision tasks, such as object detection, semantic segmentation, action recognition, or video classification, among others. For NLP we analysed results for the General Language Understanding Evaluation (GLUE) benchmark [Wang et al., 2019], since language understanding is a core task in NLP.
We focus our analysis on the inference FLOPs (Floating Point Operations) required to process one input item (an image or a text fragment). We collect inference FLOPs for many different DNN architectures following a comprehensive literature review. Hardware manufacturers have been working on chips specific to DNNs, since adapting the hardware to a particular use case leads to performance and efficiency improvements. We collect hardware data over recent years, and estimate how many FLOPs can be obtained from one Joule with each chip. Having all this data, we finally estimate how much energy is needed to perform one inference step with a given DNN. Our main objective is to study the evolution of the energy required for one prediction over the years.
The main findings and contributions of this paper are to (1) showcase that better results for DNN models are partly attributable to algorithmic improvements and not only to more computing power; (2) determine how much hardware improvements and specialisation are decreasing DNN energy consumption; (3) report that, while energy consumption is still increasing exponentially for new cutting-edge models, DNN inference energy consumption can be kept low for increasing performance if the efficient models that come relatively soon after each breakthrough are selected.
We provide all collected data and performed estimations as a data set, publicly available in the appendices and as a GitHub repository 1 . The rest of the paper covers the background, introduces the methodology, presents the analysis of hardware and energy consumption of DNN models, and expounds on some forecasts. Discussion and future work close the paper.
1 Temporary copy in: https://bit.ly/3DTHvFC
## Background
In line with other areas of computer science, there is some previous work that analyses compute and its cost for AI, and DNNs more specifically. Recently, OpenAI carried out a detailed analysis about AI efficiency [Hernandez and Brown, 2020], focusing on the amount of compute used to train models with the ImageNet dataset. They show that 44 times less compute was required in 2020 to train a network with the performance AlexNet achieved seven years before.
However, as the demand for better task performance is linked with more complex DNNs and larger volumes of data to be processed, the demand for AI compute is still growing fast. [Thompson et al., 2020] report the computational demands of several deep learning applications, showing that progress in them is strongly reliant on increases in computing power. AI models have doubled the compute used every 3.4 months since 2012 [Amodei and Hernandez, 2018]. [Gholami et al., 2021a] declare similar scaling rates for AI training compute to [Amodei and Hernandez, 2018] and forecast that DNN memory requirements will soon become a problem. This exponential trend seems to impose a limit on how far we can improve performance in the future without a paradigm change.
Compared to training costs, there are fewer studies on inference costs, even though inference accounts for a far larger share of compute and energy. Canziani et al. (2017) study accuracy, memory footprint, parameters, operation counts, inference time and power consumption of 14 ImageNet models. To measure power consumption they execute the DNNs on an NVIDIA Jetson TX1 board. A similar study [Li et al., 2016] measures energy efficiency, in Joules per image, for a single forward and backward propagation iteration (a training step). This study benchmarks 4 Convolutional Neural Networks (CNNs) on CPUs and GPUs with different frameworks, showing that GPUs are more efficient than CPUs for the CNNs analysed. Both publications analyse model efficiency, but only for a few specific cases. We analyse a greater number of DNNs and hardware components over a longer time frame.
These and other papers are key in helping society and AI researchers realise the issues around efficiency and energy consumption. Strubell et al. (2019) estimate the energy consumption, cost and CO2 emissions of training several of the most popular NLP models. Henderson et al. (2020) perform a systematic reporting of the energy and carbon footprints of reinforcement learning algorithms. Bommasani et al. (2021) (section 5.3) seek to identify assumptions that shape the calculus of environmental impact for foundation models. Schwartz et al. (2019) analyse training costs and propose that researchers pay more attention to efficiency and always report the number of FLOPs. These studies contribute to a better assessment of the problem and more incentives for its solution. For instance, new algorithms and architectures such as EfficientNet [Tan and Le, 2020] and EfficientNetV2 [Tan and Le, 2021] have aimed at this reduction in compute.
When dealing with computing effort and computing speed (hardware performance), terminology is often confusing. For instance, the term 'compute' is used ambiguously, sometimes referring to the number of operations and sometimes to the number of operations per second. It is therefore important to clarify which kind of operations are meant and which acronyms denote them. In this regard, we will use the acronym FLOPS to measure hardware performance, referring to the number of floating point operations per second, as standardised in the industry, while FLOPs will refer to the amount of computation for a given task (e.g., a prediction or inference pass), i.e., the number of operations, counting a multiply-add operation pair as two operations. An extended discussion can be found in the appendix.
## Methodology
We collect most of our information directly from research papers that report results, compute and other data for one or more newly introduced techniques for the benchmarks and metrics we cover in this work. We manually read and inspected the original paper and frequently explored the official GitHub repository, if one exists. However, there is often missing information in these sources, so we need to get the data elsewhere, namely:
- Related papers : usually the authors of another paper that introduces a new model compare it with previously existing models, providing further information.
- Model implementations : PyTorch [Paszke et al., 2016] contains many (pre-trained) models, and their performance is reported. Other projects do the same (see, e.g., [Cadene, 2016, S´ emery, 2019]).
- Existing data compilations : there are some projects and public databases collecting information about deep learning architectures and their benchmarks, e.g., [Albanie, 2016, Coleman et al., 2017, Mattson et al., 2020, Gholami et al., 2021b, Stojnic and Taylor, 2021].
- Measuring tools : when no other source was available or reliable, we used the ptflops library [Sovrasov, 2020] or similar tools to calculate the model's FLOPs and parameters (when the implementation is available).
Given this general methodology, we now discuss in more detail how we made the selection of CV and NLP models, and the information about hardware.
## CV Models Data Compilation
There is a huge number of models for image classification, so we selected models based on two criteria: popularity and accuracy. For popularity we looked at the number of times the paper presenting the model is cited on Google Scholar and whether the model is mentioned in other papers (e.g., for comparative analyses). We also focused on accuracy because having the most accurate models per year is necessary for analysing progress. To achieve this we used existing compilations [Stojnic and Taylor, 2021] and filtered by year and accuracy. For our selection, accuracy weighed more than popularity for recent models, as they have had less time to accumulate citations than older ones. Once we selected the sources for image classification models, we collected the following information: Top-1 accuracy on ImageNet, number of parameters, FLOPs per forward pass, release date and training dataset. Further details about model selection, FLOPs estimation, image cropping [Krizhevsky et al., 2012] and resolution [Simonyan and Zisserman, 2015, Zhai et al., 2021] can be found in the Appendix (and Table 2).
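To make the notion of 'FLOPs per forward pass' concrete, the cost of a single convolutional layer can be computed in closed form. The sketch below is a minimal, hypothetical counter (not the ptflops tool used in this work), counting each multiply-add pair as two operations, as done throughout the paper:

```python
def conv2d_flops(h_in, w_in, c_in, c_out, k, stride=1, padding=0):
    """FLOPs of one Conv2d forward pass, counting each multiply-add
    pair as two operations (the convention used in this paper)."""
    h_out = (h_in + 2 * padding - k) // stride + 1
    w_out = (w_in + 2 * padding - k) // stride + 1
    macs = h_out * w_out * c_out * (c_in * k * k)  # multiply-adds
    return 2 * macs

# Stylized version of AlexNet's first layer: 96 filters of 11x11x3
# with stride 4 on a 224x224 RGB input (boundary handling simplified).
print(f"{conv2d_flops(224, 224, 3, 96, 11, stride=4) / 1e9:.3f} GFLOPs")
```

Summing this quantity over all layers (plus the comparatively small fully-connected and activation costs) gives roughly the kind of per-model figures collected here.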
## NLP Models Data Compilation
For NLP models we noted that there is much less information about inference (e.g., FLOPs), and the number of models for which we can get the required information is smaller than for CV. We chose GLUE for being sufficiently representative and because its value has been determined for a good number of architectures. To keep the numbers high we included all the models since 2017 for which we found an inference compute estimation [Clark et al., 2020]. Further details about FLOPs estimation and counting can be found in the Appendix (selected models in Table 7).
## Hardware Data Compilation
Regarding hardware evolution, we collected data for Nvidia GPUs 2 . We chose Nvidia GPUs because they represent one of the most efficient hardware platforms for DNNs 3 and they have been used for deep learning over the last 10 years, giving us a good temporal window for exploration. In particular, we collected data for Nvidia GPUs from 2010 to 2021: FLOPS, memory size, power consumption (reported as Thermal Design Power, TDP) and launch date. As explained before, FLOPS is a measure of computer performance. From the FLOPS and power consumption we calculate the efficiency, dividing FLOPS by Watts. We use TDP and the reported peak FLOPS to calculate efficiency, which means we are considering the efficiency (FLOPS/Watt) when the GPU is at full utilisation. In practice the efficiency may vary depending on the workload, but we consider this estimate ('peak FLOPS'/TDP) accurate enough for analysing the trends and for approximating energy consumption. Our compilation contains both desktop GPUs and server GPUs. We pay special attention to server GPUs released in recent years, because they are more common for AI, and DNNs in particular. A discussion about discrepancies between theoretical and real FLOPS, as well as issues regarding Floating Point (FP) precision, can be found in the Appendix.
2 https://developer.nvidia.com/deep-learning
3 We considered Google's TPUs (https://cloud.google.com/tpu?hl=en) for the analysis but there is not enough public information about them, as they are not sold but only available as a service.
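The efficiency and energy estimates just described can be sketched in a few lines. The GPU figures below are illustrative round numbers, not the specifications of any particular card:

```python
def gflops_per_watt(peak_tflops, tdp_watts):
    """Efficiency at full utilisation: peak FLOPS divided by TDP."""
    return peak_tflops * 1e3 / tdp_watts

def joules_per_inference(model_gflops, peak_tflops, tdp_watts):
    """Energy of one forward pass: model FLOPs divided by the FLOPs
    the chip delivers per Joule (peak FLOPS / TDP)."""
    flops_per_joule = gflops_per_watt(peak_tflops, tdp_watts) * 1e9
    return model_gflops * 1e9 / flops_per_joule

# Illustrative round numbers (not any card's datasheet): a GPU with
# 15 TFLOPS FP32 peak at 300 W TDP running a 1.42 GFLOPs model.
eff = gflops_per_watt(15, 300)                 # 50.0 GFLOPS/W
energy = joules_per_inference(1.42, 15, 300)   # ~0.028 J per image
print(f"{eff:.0f} GFLOPS/W, {energy * 1e3:.1f} mJ per inference")
```

As in the paper, this is a full-utilisation estimate; real workloads rarely reach peak FLOPS, so it should be read as a lower bound on energy per inference.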
## Computer Vision Analysis
In this section, we analyse the evolution of ImageNet models [Deng et al., 2009] in terms of performance and inference compute (one forward pass). Further details can be found in the Appendix.
## Number of Parameters and FLOPs
The number of parameters is usually reported, but it is not directly proportional to compute. For instance, in CNNs, convolution operations dominate the computation: if d, w and r represent the network's depth, width and input resolution, the FLOPs grow following the relation [Tan and Le, 2020]:
$$\mathit{FLOPs} \propto d \cdot w^2 \cdot r^2$$
This means that FLOPs do not directly depend on the number of parameters. Parameters affect network depth ( d ) or width ( w ), but distributing the same number of parameters in different ways will result in different numbers of FLOPs. Moreover, the resolution ( r ) does not depend on the number of parameters directly, because the input resolution can be increased without increasing network size.
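Under the relation from [Tan and Le, 2020], depth contributes linearly while width and resolution contribute quadratically. A tiny hypothetical helper makes the asymmetry explicit (assuming the proportionality holds exactly):

```python
def relative_flops(d_mult, w_mult, r_mult):
    """Relative change in FLOPs when depth, width and resolution are
    scaled by the given factors (assuming FLOPs ~ d * w^2 * r^2)."""
    return d_mult * w_mult ** 2 * r_mult ** 2

print(relative_flops(2, 1, 1))  # doubling depth: 2x the FLOPs
print(relative_flops(1, 2, 1))  # doubling width: 4x the FLOPs
print(relative_flops(1, 1, 2))  # doubling resolution: 4x the FLOPs
```

This is why two networks with the same parameter count can have very different inference costs.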
Figure 1: Relation between the number of parameters and FLOPs (both axes are logarithmic).
Despite this, Fig. 1 shows a roughly linear relation between FLOPs and parameters (on logarithmic axes). We attribute this to the balanced scaling of w, d and r: these dimensions are usually scaled together, with bigger CNNs using higher resolutions. Note that recent transformer models [Vaswani et al., 2017] do not follow the growth relation presented above. Nevertheless, the correlation between the number of parameters and FLOPs is 0.772 for CNNs and 0.994 for transformers (Fig. 1), suggesting that in both architectures parameters and FLOPs usually scale in tandem. We will use FLOPs, as they allow us to estimate the energy needed by relating hardware FLOPS with the FLOPs required by a model [Hollemans, 2018, Clark et al., 2020].
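For reference, the correlations reported here are plain Pearson coefficients over (parameters, FLOPs) pairs. The self-contained sketch below uses made-up illustrative values, not the models of Fig. 1:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    cov = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - xm) ** 2 for x in xs))
    sy = math.sqrt(sum((y - ym) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical (parameters in M, forward-pass GFLOPs) pairs --
# illustrative values only, not the models plotted in Fig. 1.
params = [5, 25, 60, 120, 300]
gflops = [0.5, 4.0, 12.0, 20.0, 60.0]
print(f"correlation: {pearson(params, gflops):.3f}")
```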
## Performance and Compute
There has been very significant progress on ImageNet. In 2012, AlexNet achieved 56% Top-1 accuracy (single model, one crop). In 2021, Meta Pseudo Labels (EfficientNet-L2) achieved 90.2% Top-1 accuracy (single model, one crop). However, this increase in accuracy comes with an increase in the FLOPs required for a forward pass: 1.42 GFLOPs for AlexNet versus 1040 GFLOPs for EfficientNet-L2 (details in the appendix).
Fig. 2 shows the evolution of ImageNet accuracy from 2012 to 2021 (with the size of the bubbles representing the FLOPs of one forward pass). In recent papers some researchers began using more training data than available in ImageNet1k. However, extra data only affects training FLOPs; it does not affect the computational cost of inferring each classification (forward pass).
If we only look at models with the best accuracy for each year we can see an exponential growth in compute (measured in FLOPs). This can be observed clearly in Fig. 3: the dashed line represents an exponential growth (shown as a linear fit since the y -axis is logarithmic). The line is fitted with
Figure 2: Accuracy evolution over the years. The size of the balls represents the GFLOPs of one forward pass.
Figure 3: GFLOPs over the years. The dashed line is a linear fit (note the logarithmic y -axis) for the models with highest accuracy per year. The solid line includes all points.
the models with the highest accuracy for each year. However, not all models released in the latest years need so much compute. This is reflected by the solid line, which includes all points. We also see that, for the same number of FLOPs, models with increasing accuracy appear as time goes by.
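Since exponential growth is a straight line on a logarithmic y-axis, the slope of a least-squares fit of log2(GFLOPs) against the year directly gives a doubling time. The sketch below illustrates the procedure on hypothetical (year, GFLOPs) points, not the paper's fitted data:

```python
import math

# Hypothetical (year, GFLOPs) points for the most accurate model per
# year -- illustrative values, not the paper's actual data or fit.
years = [2012, 2014, 2016, 2018, 2020]
gflops = [1.4, 5.0, 16.0, 60.0, 500.0]

# Least-squares slope of log2(GFLOPs) against the year; the inverse
# slope is the time it takes compute to double.
n = len(years)
x_mean = sum(years) / n
y = [math.log2(g) for g in gflops]
y_mean = sum(y) / n
slope = (sum((x - x_mean) * (yi - y_mean) for x, yi in zip(years, y))
         / sum((x - x_mean) ** 2 for x in years))
print(f"doubling time: {1 / slope:.2f} years")
```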
Table 1 lists models with a similar number of FLOPs to AlexNet. In 2019 we have a model (EfficientNet-B1) with the same number of operations as AlexNet achieving a Top-1 accuracy of 79.1% without extra data, and a model (NoisyStudent-B1) achieving 81.5% with extra data. In a period of seven years, we obtain models with similar computation but much higher accuracy. We observe that when a SOTA model is released it usually has a huge number of FLOPs, and therefore consumes a large amount of energy, but within a couple of years a model appears with similar accuracy and a much lower number of FLOPs. These models are usually the ones that become popular in industry applications. This confirms that better results for DNN models of general use are partly attributable to algorithmic improvements and not only to more computing power.
Finally, Fig. 4 shows that the Pareto frontier (in grey) is composed of new models (in yellow and green), whereas old models (in purple and dark blue) fall below it. As expected, the models using extra data are normally those forming the Pareto frontier. Note again that extra training data does not affect inference GFLOPs.
| Model | Top-1 Accuracy | GFLOPs | Year |
|----------------------------------------|------------------|----------|--------|
| AlexNet [Krizhevsky et al., 2012] | 56.52 | 1.42 | 2012 |
| ZFNet [Zeiler and Fergus, 2013] | 60.21 | 2.34 | 2013 |
| GoogLeNet [Szegedy et al., 2014] | 69.77 | 3 | 2014 |
| MobileNet [Howard et al., 2017] | 70.6 | 1.14 | 2017 |
| MobileNetV2 1.4 [Sandler et al., 2019] | 74.7 | 1.18 | 2018 |
| EfficientNet-B1 [Tan and Le, 2020] | 79.1 | 1.4 | 2019 |
| NoisyStudent-B1 [Xie et al., 2020] | 81.5 | 1.4 | 2019 |
Table 1: Results for several DNNs with a similar number of FLOPs as AlexNet.
Figure 4: Relation between accuracy and GFLOPs.
## Natural Language Analysis
In this section, we analyse the trends in performance and inference compute for NLP models. To analyse performance we use GLUE, a popular benchmark for natural language understanding, a key task in NLP. The GLUE benchmark⁴ is composed of nine sentence-understanding tasks covering a broad range of domains. A description of each task can be found in [Wang et al., 2019].
## Performance and Compute
We represent the improvement in GLUE score in relation to GFLOPs over the years in Fig. 5 (and in Fig. 15 in the Appendix). GFLOPs are for a single input of length 128, a reasonable sequence length for many use cases, as it can fit text messages or short emails. We observe an evolution very similar to the one observed for ImageNet: SOTA models require a large number of FLOPs, but in a short period of time other models appear that require far fewer FLOPs to reach the same score. Many models focus on being efficient instead of reaching the highest score, and this is reflected in their names (e.g., MobileBERT [Sun et al., 2020] and SqueezeBERT [Iandola et al., 2020]). We note that older models become inefficient (lower score at a higher number of GFLOPs) compared to the new ones, as happens with CV models.
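The pattern described above, where later models match a given score at far fewer FLOPs, can be checked mechanically by extracting the efficient frontier from a list of (model, GFLOPs, score) records. A minimal sketch; the numeric values are rough readings from Fig. 5, assumed for illustration rather than exact measurements:

```python
# Approximate (assumed) values read off Fig. 5; not exact measurements.
models = [
    ("BERT-Large",   50.0, 82.0),
    ("ELECTRA-Base", 20.0, 83.0),
    ("MobileBERT",    5.0, 79.0),
    ("SqueezeBERT",   7.0, 78.0),
    ("GPT-1",        30.0, 75.0),
]

def efficient_frontier(records):
    """Models not dominated by any cheaper-or-equal model with a higher score."""
    frontier = []
    for name, gflops, score in records:
        dominated = any(g <= gflops and s > score for _, g, s in records)
        if not dominated:
            frontier.append(name)
    return frontier

print(efficient_frontier(models))  # → ['ELECTRA-Base', 'MobileBERT']
```

With these illustrative numbers, BERT-Large and GPT-1 fall off the frontier because ELECTRA-Base reaches a higher score at fewer GFLOPs, mirroring the "old models become inefficient" observation.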
## Compute Trend
In Fig. 6 we include all models (regardless of whether they have performance results) for which we found an inference FLOPs estimation. The dashed line fits the models with the highest GFLOPs (models that, when released, became the most demanding) and the solid line fits all NLP models. We indicate the input sequence length in this plot because it contains models with different input sequence lengths. We observe a similar trend as in CV: the GFLOPs of the most cutting-edge models show a clear exponential growth, while the general trend, i.e., considering all models, does not scale so aggressively. In fact, there is a sizeable pocket of low-compute models in the last year.
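An exponential compute trend like the dashed line in Fig. 6 is typically quantified by a linear fit of log(GFLOPs) against time; the slope then translates into a doubling time. A sketch with synthetic data (the yearly values below assume a 10x-per-year growth purely for illustration, not the paper's measurements):

```python
import math

# Synthetic example: compute growing 10x per year (assumed, for illustration).
years  = [2017, 2018, 2019, 2020, 2021]
gflops = [1e1, 1e2, 1e3, 1e4, 1e5]

# Ordinary least squares on log2(GFLOPs) vs. year: slope = doublings per year.
n = len(years)
xs = [y - years[0] for y in years]
ys = [math.log2(g) for g in gflops]
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

doubling_time_months = 12 / slope
print(f"doubling time ~ {doubling_time_months:.1f} months")
```

For 10x-per-year growth the slope is log2(10) doublings per year, i.e. a doubling time of roughly 3.6 months; the same fit applied to the "all models" series would yield a much longer doubling time.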
⁴ Many recent models are evaluated on SuperGLUE, but we choose GLUE to have a longer temporal window for our analysis.
Figure 5: Relation between GLUE score and GFLOPs for NLP models.
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Scatter Plot: GLUE Score vs. GFLOPs for NLP Models
### Overview
The image is a scatter plot comparing the performance (GLUE score) of various natural language processing (NLP) models against their computational cost (GFLOPs). Models are color-coded by release date, with a legend indicating temporal progression from 2018 to 2020.
### Components/Axes
- **X-axis (GFLOPs)**: Ranges from 3 to 70, labeled "GFLOPs" in bold black text.
- **Y-axis (GLUE score)**: Ranges from 75 to 85, labeled "GLUE score" in bold black text.
- **Legend**: Vertical color gradient on the right, with dates (2018-01 to 2020-07) and corresponding colors (purple to yellow). Each model is annotated with its name and release date.
- **Data Points**: Labeled with model names (e.g., "ELECTRA-Large," "BERT-Large") and positioned according to their GFLOPs and GLUE scores.
### Detailed Analysis
1. **ELECTRA-Large** (2020-07, yellow):
- GFLOPs: ~70
- GLUE score: ~88
- Position: Top-right corner, highest GFLOPs and GLUE score.
2. **BERT-Large** (2018-01, dark blue):
- GFLOPs: ~50
- GLUE score: ~82
- Position: Mid-right, second-highest GLUE score.
3. **ELECTRA-Base** (2020-01, light green):
- GFLOPs: ~20
- GLUE score: ~83
- Position: Mid-right, third-highest GLUE score.
4. **MobileBERT** (2020-01, light green):
- GFLOPs: ~5
- GLUE score: ~79
- Position: Mid-left, moderate performance.
5. **SqueezeBERT** (2020-01, light green):
- GFLOPs: ~7
- GLUE score: ~78
- Position: Mid-left, lower than MobileBERT.
6. **MobileBERT tiny** (2020-01, light green):
- GFLOPs: ~3
- GLUE score: ~76
- Position: Bottom-left, lowest GFLOPs and score.
7. **Theseus 6/768** (2020-01, light green):
- GFLOPs: ~10
- GLUE score: ~77
- Position: Mid-left, slightly better than SqueezeBERT.
8. **GPT-1** (2018-01, purple):
- GFLOPs: ~30
- GLUE score: ~75
- Position: Mid-right, low score despite high GFLOPs.
9. **ELMo** (2018-01, purple):
- GFLOPs: ~25
- GLUE score: ~74
- Position: Bottom-left, lowest score overall.
### Key Observations
- **Temporal Trend**: Newer models (2020) generally achieve higher GLUE scores but require more GFLOPs.
- **Efficiency Outliers**:
- **SqueezeBERT** (2020-01) achieves a GLUE score of ~78 with only ~7 GFLOPs, outperforming older models like GPT-1 (30 GFLOPs, 75 score).
- **ELMo** (2018-01) has the lowest score (~74) despite moderate GFLOPs (~25).
- **Performance vs. Cost**: ELECTRA-Large (70 GFLOPs, 88 score) dominates in both metrics, while MobileBERT tiny (3 GFLOPs, 76 score) shows minimal computational cost but limited performance.
### Interpretation
The plot highlights a trade-off between model size (GFLOPs) and performance (GLUE score). Newer models (2020) like ELECTRA-Large and BERT-Large achieve state-of-the-art results but demand significantly more computational resources. However, some 2020 models (e.g., SqueezeBERT) demonstrate efficiency by balancing performance and cost. Older models like ELMo and GPT-1 lag in performance despite higher GFLOPs, suggesting architectural improvements in newer designs. This underscores the importance of optimizing model efficiency alongside performance in NLP development.
</details>
Figure 6: Inference GFLOPs trend over time for NLP models.
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Line Chart: GFLOPs vs. Date for Different Token Counts and DNN Models
### Overview
The chart visualizes the growth of GFLOPs (billions of floating-point operations per inference) over time (2017–2021) for deep neural networks (DNNs), with data points categorized by token counts (128, 512, 1024, 2048). Two trend lines are included: one for "All DNNs" (solid black) and another for "DNNs with higher GFLOPs" (dashed black). The y-axis uses a logarithmic scale (1e+01 to 1e+06), emphasizing exponential growth patterns.
### Components/Axes
- **X-axis (Date)**: Years 2017–2021, labeled at yearly intervals.
- **Y-axis (GFLOPs)**: Logarithmic scale from 1e+01 to 1e+06, with ticks at 1e+01, 1e+02, 1e+03, 1e+04, 1e+05, 1e+06.
- **Legend**:
- Solid black line: "All DNNs"
- Dashed black line: "DNNs with higher GFLOPs"
- **Data Points**:
- Pink dots: 128 tokens
- Purple dots: 512 tokens
- Blue dots: 1024 tokens
- Cyan dots: 2048 tokens
### Detailed Analysis
- **Trend Lines**:
- **Solid Line ("All DNNs")**: A nearly linear upward slope, starting near 1e+01 GFLOPs in 2017 and reaching ~1e+02 by 2021.
- **Dashed Line ("DNNs with higher GFLOPs")**: A steeper upward slope, starting near 1e+01 in 2017 and reaching ~1e+05 by 2021.
- **Data Points**:
- **128 Tokens (Pink)**: Scattered below the solid line, with values ranging from ~1e+01 (2017) to ~1e+02 (2021). Most points cluster between 1e+01 and 1e+02.
- **512 Tokens (Purple)**: Two points near the dashed line (~1e+02 in 2018 and ~1e+03 in 2020).
- **1024 Tokens (Blue)**: Three points aligned with the dashed line (~1e+03 in 2019, ~1e+04 in 2020, ~1e+05 in 2021).
- **2048 Tokens (Cyan)**: One point at ~1e+06 GFLOPs in 2021, far above all other data and trend lines.
### Key Observations
1. **Exponential Growth**: The dashed line ("DNNs with higher GFLOPs") shows a much steeper increase than the solid line, indicating faster growth in computational power for high-performance models.
2. **Token Count Correlation**: Higher token counts (2048) correlate with significantly higher GFLOPs (1e+06 in 2021), while lower counts (128) remain near the baseline.
3. **Temporal Trends**:
- 128-token models show minimal growth after 2018.
- 1024- and 2048-token models dominate later years, with 2048 tokens achieving orders-of-magnitude higher performance by 2021.
4. **Outliers**: The 2048-token data point in 2021 is an extreme outlier, far exceeding the dashed line’s projection.
### Interpretation
The chart demonstrates a clear relationship between token count, computational power (GFLOPs), and model performance over time. The dashed line represents high-efficiency DNNs that scale more effectively with increasing token counts, while the solid line reflects the average growth of all DNNs. The 2048-token model’s performance in 2021 (1e+06 GFLOPs) suggests a breakthrough in computational efficiency, likely driven by architectural advancements or specialized hardware. Conversely, the stagnation of 128-token models highlights a performance gap, possibly due to limitations in model complexity or resource allocation. This aligns with trends in AI research, where larger models (e.g., transformers) require exponentially more compute to achieve state-of-the-art results.
</details>
## Hardware Progress
We use FLOPS as a measure of hardware performance and FLOPS/Watt as a measure of hardware efficiency. We collected performance figures for different precision formats and tensor cores for a wide range of GPUs. The results are shown in Fig. 7. Note that the y-axis is in logarithmic scale. Theoretical FLOPS for tensor cores are very high in the plot. However, the actual performance for inference using tensor cores is not as high if we follow a more realistic estimation for the Nvidia GPUs (V100, A100 and T4⁵). The details of this estimation are shown in Table 3 in the appendix.
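The efficiency figures in Fig. 7 follow directly from the published specifications: peak FLOPS divided by TDP. A sketch using approximate publicly quoted numbers for a few Nvidia GPUs; treat these values as illustrative assumptions, not as the table used in the paper:

```python
# Approximate public specs (peak GFLOPS at the given precision, TDP in Watts).
# Values are illustrative assumptions; see the Nvidia datasheets for exact figures.
gpus = {
    # name: (peak_gflops, tdp_watts)
    "V100 FP32":        (14_000, 300),
    "V100 FP16 Tensor": (112_000, 300),
    "T4 FP16 Tensor":   (65_000, 70),
}

for name, (gflops, watts) in gpus.items():
    # Hardware efficiency as defined in the text: FLOPS divided by power.
    print(f"{name}: {gflops / watts:.0f} GFLOPS/Watt")
```

This also illustrates the T4 observation below Fig. 8: despite a lower peak FLOPS than the V100, its very low TDP gives it a far higher GFLOPS/Watt for inference.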
Figure 7: Theoretical Nvidia GPUs GFLOPS per Watt. Data in Table 8 in the appendix.
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Line Chart: GFLOPs/Watt Performance Trends (2011-2021)
### Overview
The chart visualizes the evolution of computational efficiency (GFLOPs/Watt) across three precision formats (FP16, FP16/FP32 Tensor, FP32) over a decade. Data points are plotted annually, with distinct color-coded series for each precision type.
### Components/Axes
- **X-axis (Date)**: Years from 2011 to 2021, marked at 1-year intervals.
- **Y-axis (GFLOPs/Watt)**: Logarithmic scale from 7 to 1000, with intervals at 7, 20, 50, 100, 200, 500, and 1000.
- **Legend**: Located in the top-left corner, associating:
- **Black dots**: FP16
- **Blue dots**: FP16/FP32 Tensor
- **Yellow dots**: FP32
### Detailed Analysis
1. **FP32 (Yellow)**:
- **Trend**: Gradual, linear increase from ~7 GFLOPs/Watt in 2011 to ~60 GFLOPs/Watt in 2021.
- **Key Data Points**:
- 2011: ~7 GFLOPs/Watt
- 2015: ~25 GFLOPs/Watt
- 2020: ~50 GFLOPs/Watt
2. **FP16 (Black)**:
- **Trend**: Steeper growth starting in 2016, reaching ~200 GFLOPs/Watt by 2021.
- **Key Data Points**:
- 2016: ~70 GFLOPs/Watt
- 2018: ~100 GFLOPs/Watt
- 2021: ~200 GFLOPs/Watt
3. **FP16/FP32 Tensor (Blue)**:
- **Trend**: Sharp exponential rise beginning in 2018, peaking at ~1000 GFLOPs/Watt in 2021.
- **Key Data Points**:
- 2018: ~200 GFLOPs/Watt
- 2020: ~700 GFLOPs/Watt
- 2021: ~1000 GFLOPs/Watt
### Key Observations
- **FP32 Baseline**: Consistent but slow improvement, reflecting legacy hardware limitations.
- **FP16 Acceleration**: Doubles efficiency every ~3 years post-2016, aligning with GPU advancements (e.g., NVIDIA Volta/Ampere architectures).
- **Tensor Leap**: The FP16/FP32 Tensor series dominates post-2018, suggesting specialized hardware (e.g., tensor cores) for AI/ML workloads.
- **Anomalies**: No data points for FP16/FP32 Tensor before 2018, indicating its emergence as a novel technology.
### Interpretation
The data underscores a paradigm shift in computational efficiency driven by precision optimization and specialized hardware. The FP16/FP32 Tensor series’ exponential growth (2018–2021) likely reflects innovations like NVIDIA’s Tensor Cores, which accelerate matrix operations critical for deep learning. FP16’s rise highlights the industry’s pivot toward lower-precision computing for performance gains, while FP32 remains a stable but outdated benchmark. The absence of pre-2018 Tensor data suggests its adoption coincided with the rise of AI-driven workloads, marking a turning point in hardware design priorities.
</details>
⁵ Specifications in: https://www.nvidia.com/en-us/data-center/.
With these estimations we obtained good linear fits (with the y-axis in logarithmic scale) to each data set, one for CV and another for NLP, as shown by the solid lines in Fig. 8. Notice that there is a particular point in Fig. 8 for year 2018 that stands out from the others by a large margin. This corresponds to the T4 using mixed precision; the T4 is a GPU specifically designed for inference, which is why it is so efficient for this task.
Figure 8: Nvidia GPU GFLOPS per Watt adapted for CV (CNNs) and NLP models. Data in Table 9 in the appendix.
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Line Graphs: GFLOPS per Watt Estimation for CNN and NLP Models
### Overview
The image contains two line graphs comparing the efficiency of computational hardware over time, measured in GFLOPS per Watt (GFLOPS/Watt). The left graph focuses on Convolutional Neural Networks (CNNs), while the right graph focuses on Natural Language Processing (NLP) models. Both graphs track three precision types: FP32 (dark blue), Mixed (light blue), and TF32 (yellow), plotted against time.
</details>
## Energy Consumption Analysis
Once we have estimated the inference FLOPs for a range of models and the GFLOPS per Watt for different GPUs, we can estimate the energy (in Joules) consumed in one inference. We do this by dividing the FLOPs for the model by the FLOPS per Watt of the GPU. But how can we choose the FLOPS per Watt that corresponds to the model? We use the fits presented in Fig. 8 to obtain an estimation of GFLOPS per Watt at the model's release date. In this regard, Henderson et al. (2020) report that FLOPs for DNNs can sometimes be misleading, due to underlying optimisations at the firmware, framework, memory and hardware levels that can influence energy efficiency. They show that energy and FLOPs are highly correlated for the same architecture, but the correlation decreases when different architectures are mixed. We consider that this lower correlation does not affect our estimations significantly, as we analyse trends through the years and fit on a logarithmic scale, where dispersion is reduced. A more precise analysis would require measuring power consumption for each network with the original hardware and software, as unfortunately the energy required per (one) inference is rarely reported.
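The estimation procedure just described reduces to a single division, using the equivalence between FLOPS/Watt and FLOPs/Joule from Eq. 1. A minimal sketch; the example numbers are assumptions for illustration:

```python
def joules_per_inference(fwd_pass_gflops: float, gflops_per_watt: float) -> float:
    """Energy of one forward pass, using GFLOPS/Watt == GFLOPs/Joule (Eq. 1)."""
    return fwd_pass_gflops / gflops_per_watt

# Example: a CNN needing 4 GFLOPs per forward pass, run on hardware delivering
# 100 GFLOPS/Watt at the model's release date (both numbers assumed).
print(joules_per_inference(4.0, 100.0))  # 0.04 J per inference
```

Because GFLOPs (numerator) and GFLOPS/Watt (denominator) share the same giga prefix, the result is directly in Joules with no further unit conversion.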
<details>
<summary>Image 9 Details</summary>

### Visual Description
## Line Chart: Joules Over Time by Model Performance
### Overview
The chart visualizes the relationship between energy consumption (Joules) and model performance (Top-1 Accuracy) across three categories: "All DNNs," "DNNs," and "Best." Data spans from 2012 to 2021, with a logarithmic y-axis (Joules) and categorical markers for "Extra Data" (Yes/No). The chart includes a color gradient representing Top-1 Accuracy (60–90) and trend lines for each model category.
### Components/Axes
- **X-axis (Date)**: Years 2012–2021, labeled at 1-year intervals.
- **Y-axis (Joules)**: Logarithmic scale from 0.003 to 30,000.
- **Legend**:
- **All DNNs**: Solid black line.
- **DNNs**: Dashed black line.
- **Best**: Dotted black line.
- **Extra Data**: Circles (No), Triangles (Yes).
- **Top-1 Accuracy**: Color gradient (purple = 60, yellow = 90).
### Detailed Analysis
1. **All DNNs (Solid Black Line)**:
- Starts at ~0.1 Joules in 2012.
- Gradually increases to ~0.3 Joules by 2021.
- Trend: Steady upward slope with minimal fluctuation.
2. **DNNs (Dashed Black Line)**:
- Begins at ~0.03 Joules in 2012.
- Peaks at ~0.1 Joules in 2014.
- Declines sharply to ~0.01 Joules by 2021.
- Trend: Initial rise followed by a steep drop.
3. **Best (Dotted Black Line)**:
- Starts at ~0.01 Joules in 2012.
- Rises to ~0.3 Joules in 2014.
- Plateaus near ~0.3 Joules from 2015–2021.
- Trend: Sharp early growth, then stabilization.
4. **Color Gradient (Top-1 Accuracy)**:
- Data points transition from purple (60) to yellow (90).
- Higher accuracy correlates with higher Joules, especially in later years (2018–2021).
5. **Extra Data Markers**:
- **Triangles (Yes)**: Clustered in the upper-right quadrant (2018–2021), indicating higher Joules and Top-1 Accuracy.
- **Circles (No)**: Scattered across lower Joules values (2012–2017), with mixed accuracy levels.
### Key Observations
- **Best Model Dominance**: The "Best" line consistently outperforms others in Joules after 2014, suggesting superior efficiency or optimization.
- **DNNs Decline**: The "DNNs" category shows a significant drop post-2014, possibly due to model retirement or shifts in data strategy.
- **Energy-Accuracy Tradeoff**: Higher Top-1 Accuracy (yellow points) aligns with increased Joules, particularly in 2020–2021.
- **Extra Data Impact**: "Yes" markers (triangles) dominate high-performance regions, implying additional data enhances results but increases energy use.
### Interpretation
The chart highlights a tradeoff between model performance and energy consumption. The "Best" category’s plateau suggests a stabilization of optimal models, while the decline in "DNNs" may reflect obsolescence or inefficiency. The correlation between Top-1 Accuracy and Joules indicates that advanced models (e.g., those with extra data) require more computational resources. The 2014 peak for "DNNs" and "Best" could mark a technological milestone, after which energy efficiency became a priority. The dominance of "Yes" markers in later years underscores the growing reliance on supplemental data to achieve higher accuracy, albeit at the cost of increased energy use.
</details>
Figure 9: Estimated Joules of a forward pass (CV). The dashed line is a linear fit (logarithmic y-axis) for the models with the highest accuracy per year. The solid line fits all models.
We can express the efficiency metric FLOPS per Watt as FLOPs per Joule, as shown in Eq. 1. With this equivalence, we can divide the FLOPs needed for a forward pass by it and obtain the required Joules, see Eq. 2. This operation gives the consumed energy in Joules.
Figure 10: Estimated Joules of a forward pass (NLP). Same interpretation as in Fig. 9.
<details>
<summary>Image 10 Details</summary>

### Visual Description
## Line Chart: Growth in GFLOPs vs. Joules Over Time
### Overview
The chart illustrates the relationship between computational growth (measured in GFLOPs) and energy consumption (measured in Joules) for AI models over time, from 2017 to 2021. Two trend lines and scattered data points represent different model configurations.
### Components/Axes
- **X-axis (Date)**: Spans 2017 to 2021, with annual increments.
- **Y-axis (Joules)**: Logarithmic scale ranging from 1e-01 to 1e+04.
- **Legend**: Located in the top-left corner, with four color-coded categories:
- Pink: 128 tokens
- Purple: 512 tokens
- Blue: 1024 tokens
- Cyan: 2048 tokens
- **Lines**:
- Solid black: "Growth GFLOPs all models"
- Dashed black: "Growth GFLOPs of models with higher GFLOPs"
### Detailed Analysis
- **Solid Black Line (All Models)**:
- Starts near 1e+00 Joules in 2017.
- Gradually increases to ~1e+01 Joules by 2021.
- Represents the average growth trajectory across all models.
- **Dashed Black Line (Higher GFLOPs Models)**:
- Begins near 1e+01 Joules in 2017.
- Rises sharply to ~1e+04 Joules by 2021.
- Indicates faster growth for models with higher GFLOPs.
- **Data Points**:
- **Pink (128 tokens)**: Clustered below the solid line, mostly between 1e-01 and 1e+00 Joules (2018–2021).
- **Purple (512 tokens)**: Two points near the solid line (~1e+00 Joules) in 2017–2018, one at ~1e+01 Joules in 2019.
- **Blue (1024 tokens)**: Follows the dashed line closely, with points at ~1e+02 (2019), ~1e+03 (2020), and ~1e+04 (2021).
- **Cyan (2048 tokens)**: Single point at ~1e+04 Joules in 2021, aligning with the dashed line.
### Key Observations
1. **Divergence in Growth Rates**: Models with higher token counts (blue/cyan) exhibit significantly faster GFLOPs growth compared to lower token models (pink/purple).
2. **Energy Consumption Correlation**: Higher GFLOPs growth correlates with exponentially greater energy use (Joules), especially for models with 1024+ tokens.
3. **Temporal Trends**: The steepest growth occurs post-2019, with the largest models (2048 tokens) dominating by 2021.
### Interpretation
The chart demonstrates that computational efficiency (GFLOPs) and energy consumption (Joules) are strongly linked, with larger models driving disproportionate increases in both metrics. The dashed line highlights that models optimized for higher GFLOPs (likely newer architectures) outpace older, smaller models in growth rate. This suggests a trend toward increasingly resource-intensive AI development, raising concerns about sustainability and accessibility. Outliers like the 2048-token model in 2021 indicate rapid advancements in model scale, potentially reflecting breakthroughs in hardware or algorithmic efficiency.
</details>
$$\text{Efficiency} = \frac{\text{HW Perf.}}{\text{Power}} \quad \text{in units: } \frac{\text{FLOPS}}{\text{Watt}} = \frac{\text{FLOPs}/s}{\text{Joules}/s} = \frac{\text{FLOPs}}{\text{Joule}} \qquad (1)$$
$$\text{Energy} = \frac{\text{Fwd. Pass}}{\text{Efficiency}} \quad \text{in units: } \frac{\text{FLOPs}}{\text{FLOPs}/\text{Joule}} = \text{Joules} \qquad (2)$$
Applying this calculation to all collected models, we obtain Fig. 9 for CV. The dashed line represents an exponential trend (a linear fit, as the y-axis is logarithmic) adjusted to the models with the highest accuracy for each year, as in Fig. 2, and the dotted line represents the average Joules for each year. Comparing both plots, we see that hardware progress softens the growth observed for FLOPs, but the growth is still clearly exponential for the high-accuracy models. The solid line is almost horizontal, but on a logarithmic scale this may be interpreted either as exponential growth with a small base or as a linear fit on the semi-log plot that is affected by the extreme points. In Fig. 10 we do the same for NLP models and see a similar picture.
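The "almost horizontal" caveat can be made concrete: on a log10 y-axis, a fitted slope s (in decades per year) corresponds to a yearly growth factor of 10^s, so even a visually flat line can hide slow exponential growth. A sketch with assumed slopes, for illustration only:

```python
# Convert a slope on a log10 y-axis (decades per year) into a yearly growth factor.
def yearly_growth_factor(slope_decades_per_year: float) -> float:
    return 10 ** slope_decades_per_year

# A visually steep dashed-line slope vs. an almost-flat solid-line slope
# (both slopes are assumed values, not the paper's fitted coefficients).
print(yearly_growth_factor(0.5))   # ~3.16x per year: clearly exponential
print(yearly_growth_factor(0.02))  # ~1.05x per year: looks flat, still exponential
```

This is why the near-horizontal solid line is compatible with exponential growth with a small base: its growth factor per year is close to, but above, 1.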
Fig. 11 shows the relation between Top-1 Accuracy and Joules. Joules are calculated in the same way as in Fig. 9. The relation is similar to the one observed in Fig. 4, but in Fig. 11 the older models are not only positioned further down on the y-axis (performance): they also tend to cluster on the bottom-right part of the plot (high Joules), so their position on the x-axis is worse than in Fig. 4 due to the evolution in hardware. This is even clearer for NLP, as seen in Fig. 12.
Figure 11: Relation between Joules and Top-1 Accuracy over the years (CV, ImageNet).
<details>
<summary>Image 11 Details</summary>

### Visual Description
## Scatter Plot: Top-1 Accuracy vs Joules with Date and Extra Data Indicators
### Overview
The image is a scatter plot visualizing the relationship between **Top-1 Accuracy** (y-axis) and **Joules** (x-axis) across different years (2013–2021). Data points are color-coded by year and marked with either circles (no extra data) or triangles (extra data). The plot emphasizes trends in model performance over time and the impact of additional data.
---
### Components/Axes
- **X-axis (Joules)**: Logarithmic scale ranging from **0.003** to **30**, with gridlines at intervals of 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, and 30.
- **Y-axis (Top-1 Accuracy)**: Linear scale from **50** to **90**, with gridlines at 10-unit intervals.
- **Legend**:
- **Color Gradient**: Represents the year of data collection, transitioning from **purple (2013)** to **yellow (2021)**.
- **Markers**:
- **Circles**: No extra data.
- **Triangles**: Extra data included.
- **Placement**: Right-aligned, vertically stacked with color gradient above marker symbols.
---
### Detailed Analysis
1. **Data Distribution**:
- **Low Joules (0.003–0.1)**:
- Accuracy ranges from **65–75%**.
- Dominated by older years (2013–2015, purple/blue points).
- Fewer data points with extra data (triangles).
- **Mid Joules (0.1–1)**:
- Accuracy improves to **75–85%**.
- Mix of years (2015–2019, blue/green/yellow points).
- More triangles (extra data) appear here.
- **High Joules (1–30)**:
- Accuracy peaks at **85–90%**.
- Newer years (2019–2021, green/yellow points) dominate.
- Triangles (extra data) are prevalent, especially at higher Joules.
2. **Trends**:
- **Positive Correlation**: Higher Joules generally correspond to higher Top-1 Accuracy.
- **Extra Data Impact**: Triangles (extra data) consistently outperform circles (no extra data) across all Joules ranges.
- **Yearly Progression**: Newer years (2019–2021) show higher accuracy for similar Joules values compared to older years.
3. **Notable Outliers**:
- A single purple circle (2013, no extra data) at **Joules = 0.1** and **Accuracy = 55%** is the lowest-performing point.
- A yellow triangle (2021, extra data) at **Joules = 30** achieves **90% accuracy**, the highest point.
---
### Key Observations
- **Extra Data Significance**: The presence of extra data (triangles) consistently improves accuracy, especially at higher Joules values.
- **Yearly Improvements**: Models from 2019–2021 outperform earlier years, even at lower Joules.
- **Diminishing Returns**: Accuracy plateaus near **90%** for Joules > 10, suggesting limited gains from further resource investment.
---
### Interpretation
The plot demonstrates that **model performance improves with increased computational resources (Joules)** and **additional data**. The color gradient highlights advancements over time, with newer years achieving higher accuracy for comparable Joules. The dominance of triangles (extra data) underscores its critical role in enhancing results. The outlier at 55% accuracy (2013, no extra data) emphasizes the baseline performance without these enhancements. This suggests that optimizing both resource allocation and data quality is essential for maximizing Top-1 Accuracy.
</details>
## Forecasting and Multiplicative Effect
In our analysis we see that both DNNs and hardware are improving their efficiency and show no signs of stagnation. This is consistent with most studies in the literature: performance will
Figure 12: Relation between Joules and GLUE score over the years (NLP, GLUE).
<details>
<summary>Image 12 Details</summary>

### Visual Description
## Scatter Plot: Model Performance vs. Computational Cost
### Overview
The image is a scatter plot comparing machine learning models based on two metrics: **GLUE score** (y-axis) and **Joules** (x-axis). Models are color-coded by release time period (2018–2020), with annotations for model names and sizes. The plot highlights trade-offs between performance and computational efficiency.
---
### Components/Axes
- **X-axis (Joules)**: Ranges from 0.03 to 1.00, representing energy consumption during inference.
- **Y-axis (GLUE)**: Ranges from 75 to 85, representing performance on the General Language Understanding Evaluation (GLUE) benchmark.
- **Legend**: Located in the top-left corner, mapping colors to time periods:
- Purple: 2018–07
- Blue: 2019–01
- Green: 2019–07
- Yellow: 2020–01
- **Annotations**: Model names and sizes (e.g., "ELECTRA-Base," "BERT-Large") are labeled near their respective data points.
---
### Detailed Analysis
1. **Model Placement**:
- **MobileBERT** (~0.03 Joules, ~77 GLUE): Green (2019–01), efficient but mid-tier performance.
- **MobileBERT tiny** (~0.03 Joules, ~75 GLUE): Green (2019–01), lowest performance among annotated models.
- **SqueezeBERT** (~0.05 Joules, ~78 GLUE): Green (2019–01), slightly better than MobileBERT.
- **Theseus 6/768** (~0.10 Joules, ~77 GLUE): Green (2019–01), similar to MobileBERT.
- **ELECTRA-Small** (~0.05 Joules, ~79 GLUE): Green (2019–01), better performance than SqueezeBERT.
- **ELECTRA-Base** (~0.10 Joules, ~83 GLUE): Green (2019–01), significant performance jump.
- **ELECTRA Large** (~0.50 Joules, ~85 GLUE): Yellow (2020–01), highest performance but highest cost.
- **BERT-Base** (~0.30 Joules, ~80 GLUE): Blue (2019–01), mid-tier performance.
- **BERT Large** (~1.00 Joules, ~82 GLUE): Blue (2019–01), highest cost among BERT variants.
- **GPT-1** (~0.30 Joules, ~75 GLUE): Purple (2018–07), lowest performance.
- **ELMo** (~0.50 Joules, ~73 GLUE): Purple (2018–07), worst performance.
2. **Trends**:
- **Performance vs. Efficiency**: Newer models (2020–01, yellow) generally achieve higher GLUE scores but require more Joules (e.g., ELECTRA Large).
- **Efficiency Leaders**: MobileBERT variants (2019–01) cluster at the bottom-left, indicating low cost but moderate performance.
- **Outliers**:
- **ELECTRA-Base** (2019–01) achieves 83 GLUE at 0.10 Joules, outperforming older models at similar costs.
- **GPT-1** (2018–07) and **ELMo** (2018–07) are the least efficient performers, occupying the bottom-right quadrant.
---
### Key Observations
- **Temporal Progression**: Models from 2020–01 (yellow) dominate the top-right quadrant, suggesting advancements in both performance and efficiency.
- **ELECTRA Dominance**: ELECTRA variants (Small, Base, Large) show a clear trend of increasing GLUE scores with higher Joules, outperforming BERT and GPT-1.
- **BERT Limitations**: BERT-Base and BERT-Large (2019–01) lag behind ELECTRA in performance despite similar or higher computational costs.
- **Legacy Models**: GPT-1 and ELMo (2018–07) are outperformed by newer models even at comparable Joules.
---
### Interpretation
The plot demonstrates a clear **performance-computational cost trade-off** in NLP models. Newer architectures like ELECTRA (2020–01) achieve state-of-the-art GLUE scores but require significantly more energy. Conversely, older models like GPT-1 and ELMo (2018–07) are less efficient and perform poorly. The MobileBERT family (2019–01) represents a pragmatic balance for edge devices, while ELECTRA-Base (2019–01) offers a "sweet spot" of high performance at moderate cost. The data underscores the importance of architectural innovation (e.g., ELECTRA's pre-training strategy) in driving efficiency gains.
</details>
Figure 13: Estimated Joules per forward pass (e.g., one prediction) compared to human energy consumption in 1s (CV).
<details>
<summary>Image 13 Details</summary>

### Visual Description
## Line Chart: Energy Consumption Trends (DNNs vs Human)
### Overview
The chart visualizes energy consumption trends (in Joules) over time (2012–2021) for two categories: Deep Neural Networks (DNNs) and human energy use. It includes two human-related metrics ("external energy" and "internal consumption") and two DNN-related metrics ("Best DNNs" and "All DNNs"). The y-axis uses a logarithmic scale (1e-02 to 1e+04 Joules), while the x-axis spans 2012 to 2021.
### Components/Axes
- **X-axis (Date)**: 2012 to 2021, labeled annually.
- **Y-axis (Joules)**: Logarithmic scale from 1e-02 to 1e+04.
- **Legend**: Located at the top-right, with four entries:
- **Best DNNs**: Dashed red line.
- **All DNNs**: Solid pink line.
- **Human external energy**: Solid blue line.
- **Human internal consumption**: Solid cyan line.
- **Data Points**: Black dots scattered across the chart, primarily clustered near the "All DNNs" and "Best DNNs" lines.
### Detailed Analysis
1. **Best DNNs (Dashed Red Line)**:
- Starts near 1e+01 Joules in 2012.
- Rises steadily to ~1e+03 Joules by 2021.
- Slope: ~100x increase over the decade.
2. **All DNNs (Solid Pink Line)**:
- Begins at ~1e-01 Joules in 2012.
- Grows to ~1e+02 Joules by 2021.
- Slope: ~1000x increase over the decade.
3. **Human External Energy (Solid Blue Line)**:
- Flat line at ~1e+04 Joules throughout 2012–2021.
4. **Human Internal Consumption (Solid Cyan Line)**:
- Flat line at ~1e+02 Joules throughout 2012–2021.
5. **Data Points**:
- Black dots align closely with the "All DNNs" and "Best DNNs" lines.
- Outliers: A few points deviate slightly (e.g., 2017, 2020), but most follow the trend.
### Key Observations
- **Exponential Growth**: Both DNN categories show exponential growth, with "Best DNNs" outpacing "All DNNs" by ~10x.
- **Human Stability**: Human energy metrics remain constant, suggesting no significant change in consumption patterns.
- **Scaling Disparity**: By 2021, "Best DNNs" approach human external energy levels (~1e+03 vs. 1e+04), while "All DNNs" reach human internal consumption levels (~1e+02).
- **Outliers**: Minor deviations in data points (e.g., 2017, 2020) may reflect measurement noise or transient events.
### Interpretation
The chart highlights the rapid growth in per-inference energy of DNNs, particularly the best-performing models ("Best DNNs"), which are closing the gap with human energy use. By 2021, typical DNNs reach human internal consumption per second, signalling their increasing computational footprint. The stability of the human energy metrics contrasts sharply with the DNN trends, emphasising technology's role in driving energy demand. Outliers warrant further investigation to determine whether they represent anomalies or intentional design choices in specific models. This trend raises concerns about sustainability and the environmental impact of AI infrastructure.
</details>
continue growing as compute grows, but at the same time efficiency is increasing. However, this is the first work that analyses whether these two effects cancel out, especially for inference rather than training. Our conclusion is that they do not cancel out for the cutting-edge models of each moment, but this is less clear for the regular models in general use by industries and individuals.
However, since we are focusing on inference costs, we need to consider the multiplicative factor. How many inferences are performed per capita? This has increased very significantly with the spread of smart devices, the Internet of Things and many other devices around us that incorporate DNN-based services. However, how many inference passes per capita do we have at this moment, and how fast is this growing? This is very difficult to estimate, and we leave it for future work. It is nevertheless interesting to analyse possible hypotheses: assume there is one inference pass of a neural network application per second per capita. What would this imply in terms of energy consumption?
In order to put this inference energy consumption in context, we calculate the average human body energy consumption (which we will refer to as somatic or internal consumption) in one second, and the average energy that a human being consumes in one second including all their commodities (which we will refer to as external consumption). The internal consumption is calculated assuming 2,000 kcal per person per day, and converting this to Joules/s gives approximately 100 Joules/s. The external consumption is the sum of total energy consumption, including electricity, transport and heating, using the USA as a reference [Ritchie and Roser, 2020]. This suggests 79,897 kWh/year in 2019, which is approximately 10,000 Joules every second. The comparison of these two references with the trends can be seen in Fig. 13 (CV). As we see, the energy consumed by one inference of the best models approaches the energy consumed by the human body in one second, but is still far from the external energy consumed in one second. If each human made an AI-based decision implying a forward pass every second, day and night, this would still be well below their
Figure 14: Estimated Joules per forward pass (e.g., one prediction) compared to human consumption in 1s (NLP).
<details>
<summary>Image 14 Details</summary>

### Visual Description
## Line Chart: Energy per Forward Pass, DNNs vs Human References (2017-2021)
### Overview
The chart visualizes the estimated energy per forward pass (in Joules) of NLP models (Best DNNs and All DNNs) alongside two constant human reference levels (external energy and internal consumption) over five years. The y-axis uses a logarithmic scale (Joules), while the x-axis spans 2017–2021. Key trends include rapid growth for the best models, slower growth for models overall, and stable human reference lines.
### Components/Axes
- **X-axis (Date)**: 2017 to 2021 (annual intervals).
- **Y-axis (Joules)**: Logarithmic scale from 1e-1 to 1e4.
- **Legend**: Located in the top-right corner, with four entries:
- **Best DNNs**: Dashed red line.
- **All DNNs**: Solid red line.
- **Human external energy**: Solid blue line.
- **Human internal consumption**: Solid cyan line.
### Detailed Analysis
1. **Best DNNs (Dashed Red Line)**:
- Starts at ~1e0 Joules in 2017.
- Exponential growth to ~1e4 Joules by 2021.
- Crosses the **All DNNs** line (~1e3 Joules in 2020) and surpasses it by 2021.
2. **All DNNs (Solid Red Line)**:
- Begins at ~1e0 Joules in 2017.
- Gradual linear increase to ~1e3 Joules by 2021.
- Data points (black dots) align closely with the line, showing consistent growth.
3. **Human External Energy (Solid Blue Line)**:
- Flat line at ~1e4 Joules throughout 2017–2021.
- No variation observed.
4. **Human Internal Consumption (Solid Cyan Line)**:
- Flat line at ~1e2 Joules throughout 2017–2021.
- Remains significantly lower than external energy.
### Key Observations
- **Best DNNs** exhibit exponential growth, outpacing **All DNNs** by 2020.
- **Human external energy** (1e4 Joules) is 100x higher than **internal consumption** (1e2 Joules), with no change over time.
- **All DNNs** growth is linear, contrasting with the exponential trajectory of **Best DNNs**.
### Interpretation
The chart highlights a divergence between cutting-edge and general-use NLP models:
- **Best DNNs** (dashed red) represent the state-of-the-art model at each point in time; their energy per forward pass grows exponentially, reflecting rapid scaling of model size.
- **All DNNs** (solid red) reflect the broader population of models, whose per-inference cost grows much more slowly, as algorithmic improvements and hardware efficiency partly offset parameter growth.
- **Human energy metrics** (blue and cyan) are constant reference levels: external energy (e.g., grid power, transport, heating) sits about 100x above internal (metabolic) consumption.
By 2021 the best models approach the human external-consumption reference, while typical models remain near the internal-consumption line, underscoring that per-inference energy is growing much faster for state-of-the-art models than for the models in everyday use.
</details>
internal consumption. However, AI-based decisions are becoming more ubiquitous. For instance, a self-driving car or a surveillance camera may be making many forward passes per second. For NLP, the trends are similar, but the best models are growing much faster, as we see in Fig. 14, while the regular models may even decrease. Here, too, the number of decisions made per second is hard to determine. For instance, a language model interfaced by a human does not require more than the basic 128-token window per second, but many applications of language models can process data without human interaction at much higher speeds.
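The two human reference levels above, and the one-inference-per-second scenario, reduce to simple unit conversions. The following sketch reproduces them; the 1 J per inference figure is a hypothetical placeholder for illustration, not a measured value:

```python
# Unit conversions behind the human-energy baselines used in the text.
KCAL_TO_J = 4184                 # 1 kcal = 4184 J
KWH_TO_J = 3.6e6                 # 1 kWh = 3.6e6 J
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

# Internal (somatic) consumption: 2,000 kcal per person per day.
internal_j_per_s = 2000 * KCAL_TO_J / SECONDS_PER_DAY     # ~97 J/s, i.e. ~100

# External consumption: 79,897 kWh per person per year (USA, 2019).
external_j_per_s = 79_897 * KWH_TO_J / SECONDS_PER_YEAR   # ~9,100 J/s, i.e. ~10,000

# Hypothetical scenario: one forward pass per second per capita, at an
# assumed 1 J per inference (a placeholder value for illustration only).
joules_per_inference = 1.0
daily_ai_energy = joules_per_inference * SECONDS_PER_DAY  # 86,400 J/day per capita

print(f"internal: {internal_j_per_s:.1f} J/s, external: {external_j_per_s:.0f} J/s")
```

Even under this always-on scenario, the per-capita AI energy (about 86 kJ/day) stays well below the roughly 8.4 MJ/day of somatic consumption, matching the comparison drawn above.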
## Discussion and Future Work
In this work we have combined the analysis of several elements about AI, compute and energy consumption that allow us to have a different and more comprehensive perspective about the energy impact of AI. The most distinctive element of our analysis is that we focus on inference cost, which is usually lower than the training cost when both are reported in research papers, but because of multiplicative factors, it is much higher overall. Many DNN models are trained once and applied millions of times (forward passes).
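The "trained once, applied millions of times" argument can be made concrete with a back-of-the-envelope break-even calculation; all figures below are illustrative assumptions, not measurements from our dataset:

```python
# Back-of-the-envelope: when does cumulative inference energy overtake
# a one-off training cost? All figures are illustrative assumptions.
train_energy_j = 1e9           # assumed one-off training cost
inference_energy_j = 10.0      # assumed cost of one forward pass

def total_energy(n_inferences: float) -> float:
    """Lifetime energy: one training run plus n forward passes."""
    return train_energy_j + n_inferences * inference_energy_j

# Inference dominates once n exceeds train / inference.
break_even = train_energy_j / inference_energy_j   # 1e8 forward passes

# At a billion forward passes, inference is ~10x the training cost.
n = 1_000_000_000
share = n * inference_energy_j / total_energy(n)
print(f"break-even at {break_even:.0e} passes; inference share: {share:.1%}")
```

Under these placeholder numbers, inference accounts for over 90% of the lifetime energy after a billion forward passes, which is why the multiplicative factor dominates.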
Our findings are very different from the unbridled exponential growth that is usually reported when just looking at the number of parameters of new deep learning models [Hestness et al., 2017, Kaplan et al., 2020, Henighan et al., 2020]. When we focus on the inference costs of these networks, the associated energy is not growing as fast, because several factors partially compensate for the growth, such as algorithmic improvements, hardware specialisation and hardware consumption efficiency. The gap narrows further when we analyse the models that settle, i.e., those whose implementations become very popular one or two years after the breakthrough algorithm was introduced. These general-use models can achieve systematic growth in performance at an almost constant energy consumption. The main conclusion is that even if the energy used by AI were kept constant, the improvement in performance could be sustained through algorithmic improvements and a fast increase in the number of parameters.
This conclusion has an important limitation. It assumes a constant multiplicative factor. As more and more devices use AI (locally or remotely) the energy consumption can escalate just by means of increased penetration, in the same way that cars have become more efficient in the past two decades but there are many more cars in the world today.
We hope this paper contributes to the increasing debate about AI and energy consumption by analysing the inference costs. As these are dominated by multiplicative factors, this should encourage not only AI researchers but economists and social scientists to participate in this analysis. Future studies would be enriched by socio-economic indicators about the use of AI (the degree of penetration), the cost of energy and devices as well as the carbon footprint per Joule [EEA, 2021]. Similarly, comparing energy consumption by AI and trends in human salaries could help determine where automation [Tolan et al., 2021] becomes cost effective in economic terms.
Finally, this paper has many limitations that originate from the limited information reported in scientific papers. Many papers include the number of parameters, but it is less common to find complete information about FLOPs and energy consumption, and rarer still for inference costs. This information is not only necessary for the transparency of the field but is of utmost relevance for producing studies such as the one presented here, with a larger number of benchmarks and models. It is also important that new techniques are reported on old benchmarks as well as new ones, so that we have larger temporal windows over which to analyse the evolution of the field. We hope that future studies can build on this one and on better publishing practices.
## References
- S. Albanie. Convnet burden: Estimates of memory consumption and flop counts for various convolutional neural networks., 2016. https://github.com/albanie/convnet-burden.
- D. Amodei and D. Hernandez. Ai and compute. https://openai.com/blog/ai-and-compute/, 2018.
- L. F. W. Anthony, B. Kanding, and R. Selvan. Carbontracker: Tracking and predicting the carbon footprint of training deep learning models. arXiv preprint arXiv:2007.03051 , 2020.
- V. E. Balas, S. S. Roy, D. Sharma, and P. Samui. Handbook of deep learning applications , volume 136. Springer, 2019.
- S. Bianco, R. Cadene, L. Celona, and P. Napoletano. Benchmark analysis of representative deep neural network architectures. IEEE Access , 6:64270-64277, 2018.
- R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models, 2021.
- A. Brock, S. De, S. L. Smith, and K. Simonyan. High-performance large-scale image recognition without normalization, 2021.
- T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners, 2020.
- R. Cadene. Pretrained models for Pytorch, 2016. https://github.com/Cadene/pretrained-models.pytorch#torchvision.
- A. Canziani, A. Paszke, and E. Culurciello. An analysis of deep neural network models for practical applications, 2017.
- C.-F. Chen, Q. Fan, and R. Panda. Crossvit: Cross-attention multi-scale vision transformer for image classification, 2021.
- Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks, 2017.
- F. Chollet. Keras applications, 2015. https://keras.io/api/applications/.
- F. Chollet. Xception: Deep learning with depthwise separable convolutions, 2017.
- K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning. Electra: Pre-training text encoders as discriminators rather than generators, 2020.
- C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia. Dawnbench: An end-to-end deep learning benchmark and competition. Training, 100(101):102, 2017.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition , pages 248-255. Ieee, 2009.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
- EEA. Greenhouse gas emission intensity of electricity generation in europe. https://www.eea.europa.eu/data-and-maps/indicators/overview-of-the-electricity-production3/assessment-1, 2021.
- A. Gholami, Z. Yao, S. Kim, M. W. Mahoney, and K. Keutzer. Ai and memory wall. RiseLab Medium Post , 2021a.
- A. Gholami, Z. Yao, S. Kim, M. W. Mahoney, and K. Keutzer. Ai and memory wall. RiseLab Medium Post , 2021b.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual networks github, 2015a. https://github.com/ KaimingHe/deep-residual-networks.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition, 2015b.
- P. Henderson, J. Hu, J. Romoff, E. Brunskill, D. Jurafsky, and J. Pineau. Towards the systematic reporting of the energy and carbon footprints of machine learning. Journal of Machine Learning Research , 21(248):1-43, 2020.
- T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701 , 2020.
- D. Hernandez and T. B. Brown. Measuring the algorithmic efficiency of neural networks, 2020.
- J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. Patwary, M. Ali, Y. Yang, and Y. Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409 , 2017.
- M. Hollemans. How fast is my model?, 2018. https://machinethink.net/blog/how-fast-is-my-model/.
- A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017.
- J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu. Squeeze-and-excitation networks, 2019.
- G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks, 2018.
- F. N. Iandola, A. E. Shaw, R. Krishna, and K. W. Keutzer. Squeezebert: What can computer vision teach nlp about efficient neural networks?, 2020.
- S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.
- J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 , 2020.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems , 25:1097-1105, 2012.
- Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. Albert: A lite bert for self-supervised learning of language representations, 2020.
- C. Li. Openai's gpt-3 language model: A technical overview. https://lambdalabs.com/blog/ demystifying-gpt-3, 2020.
- D. Li, X. Chen, M. Becchi, and Z. Zong. Evaluating the energy efficiency of deep convolutional neural networks on cpus and gpus. In 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom) , pages 477-484, 2016. doi: 10.1109/BDCloud-SocialCom-SustainCom.2016.76.
- C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search, 2018.
- Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows, 2021.
- N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design, 2018.
- D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining, 2018.
- F. Martínez-Plumed, S. Avin, M. Brundage, A. Dafoe, S. Ó. hÉigeartaigh, and J. Hernández-Orallo. Accounting for the neglected dimensions of ai progress. arXiv preprint arXiv:1806.00610, 2018.
- P. Mattson, V. J. Reddi, C. Cheng, C. Coleman, G. Diamos, D. Kanter, P. Micikevicius, D. Patterson, G. Schmuelling, H. Tang, et al. Mlperf: An industry standard benchmark suite for machine learning performance. IEEE Micro , 40(2):8-16, 2020.
- NVIDIA. Achieved FLOPs, 2015. https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedflops.htm.
- NVIDIA. NVIDIA Tesla V100 GPU architecture, 2017. https://images.nvidia.com/content/voltaarchitecture/pdf/volta-architecture-whitepaper.pdf.
- NVIDIA. Training with mixed precision, 2018. https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html.
- J. Park, M. Naumov, P. Basu, S. Deng, A. Kalaiah, D. Khudia, J. Law, P. Malani, A. Malevich, S. Nadathur, J. Pino, M. Schatz, A. Sidorov, V. Sivakumar, A. Tulloch, X. Wang, Y. Wu, H. Yuen, U. Diril, D. Dzhulgakov, K. Hazelwood, B. Jia, Y. Jia, L. Qiao, V. Rao, N. Rotem, S. Yoo, and M. Smelyanskiy. Deep learning inference in facebook data centers: Characterization, performance optimizations and hardware implications, 2018.
- A. Paszke, S. Gross, S. Chintala, and G. Chanan. Torchvision models, 2016. https://pytorch.org/ vision/stable/models.html.
- M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations, 2018.
- H. Pham, Z. Dai, Q. Xie, M.-T. Luong, and Q. V. Le. Meta pseudo labels, 2021.
- A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. 2018.
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.
- E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search, 2019.
- H. Ritchie and M. Roser. Energy. Our World in Data , 2020. https://ourworldindata.org/energy.
- C. Rosset. Turing-nlg: A 17-billion-parameter language model by microsoft, 2020. https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-languagemodel-by-microsoft/.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge, 2015.
- M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks, 2019.
- R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni. Green ai, 2019.
- M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020.
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition, 2015.
- V. Sovrasov. Flops counter for convolutional networks in pytorch framework , 2020. https://github. com/sovrasov/flops-counter.pytorch.
- A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani. Bottleneck transformers for visual recognition, 2021.
- R. Stojnic and R. Taylor. Papers with code imagenet benchmark (image classification), 2021. https: //paperswithcode.com/sota/image-classification-on-imagenet.
- E. Strubell, A. Ganesh, and A. McCallum. Energy and policy considerations for deep learning in nlp, 2019.
- Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, and D. Zhou. Mobilebert: a compact task-agnostic bert for resource-limited devices, 2020.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions, 2014.
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision, 2015.
- C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning, 2016.
- O. Sémery. Computer vision models on pytorch, 2019. https://pypi.org/project/pytorchcv/.
- M. Tan and Q. V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks, 2020.
- M. Tan and Q. V. Le. Efficientnetv2: Smaller models and faster training, 2021.
- D. Thomas. Reducing machine learning inference cost for pytorch models - aws online tech talks. https://www.youtube.com/watch?v=ET2KVe2du3Y, 2020.
- N. C. Thompson, K. Greenewald, K. Lee, and G. F. Manso. The computational limits of deep learning. arXiv preprint arXiv:2007.05558 , 2020.
- S. Tolan, A. Pesole, F. Martínez-Plumed, E. Fernández-Macías, J. Hernández-Orallo, and E. Gómez. Measuring the occupational impact of ai: tasks, cognitive abilities and ai benchmarks. Journal of Artificial Intelligence Research, 71:191-236, 2021.
- H. Touvron, A. Vedaldi, M. Douze, and H. Jégou. Fixing the train-test resolution discrepancy: Fixefficientnet, 2020.
- H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou. Deit: Data-efficient image transformers github, 2021a. https://github.com/facebookresearch/deit.
- H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou. Training data-efficient image transformers & distillation through attention, 2021b.
- H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou. Going deeper with image transformers, 2021c.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems , pages 5998-6008, 2017.
- A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. 2019. In the Proceedings of ICLR.
- Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves imagenet classification, 2020.
- S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks, 2017.
- C. Xu, W. Zhou, T. Ge, F. Wei, and M. Zhou. Bert-of-theseus: Compressing bert by progressive module replacing, 2020.
- X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi. Scaling for edge inference of deep neural networks. Nature Electronics , 1(4):216-222, 2018.
- I. Z. Yalniz, H. Jégou, K. Chen, M. Paluri, and D. Mahajan. Semi-supervised and semi-weakly supervised imagenet models github, 2019a. https://github.com/facebookresearch/semi-supervisedImageNet1K-models.
- I. Z. Yalniz, H. Jégou, K. Chen, M. Paluri, and D. Mahajan. Billion-scale semi-supervised learning for image classification, 2019b.
- L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. Tay, J. Feng, and S. Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet, 2021.
- M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks, 2013.
- X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers, 2021.
- H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, Z. Zhang, Y. Sun, T. He, J. Mueller, R. Manmatha, M. Li, and A. Smola. Resnest: Split-attention networks, 2020.
- X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices, 2017.
- B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition, 2018.
## Appendix
In this technical appendix we include supplementary material giving detailed information about: 1) the difference between FLOPs and FLOPS; 2) methodological details for the CV and NLP models used in our analyses; 3) the benchmarks addressed; 4) hardware specifics regarding precision; 5) further analysis of performance and compute in NLP tasks; 6) FLOPS estimation procedures; 7) results for the GLUE benchmark; and 8) GPU consumption data.
## FLOPs vs FLOPS
When dealing with computing effort and computing speed (hardware performance), terminology is often confusing. The term 'compute' is usually ambiguous, sometimes referring to a number of operations and sometimes to a number of operations per second. It is therefore important to clarify what kind of operations are counted and which acronyms denote them. In this regard, we use the acronym FLOPS to measure hardware performance, referring to the number of floating point operations per second, as standardised in the industry, while FLOPs is applied to the amount of computation for a given task (e.g., a prediction or inference pass), referring to the number of operations, counting a multiply-add operation pair as two operations.
The acronym FLOP itself can be misleading. By FLOP we mean one floating point operation, a measure of the amount of compute (computing effort), and by FLOPS we mean floating point operations per second, i.e., FLOPS = FLOP/s. However, many papers, especially in CV, use FLOPs and FLOPS interchangeably to refer to the number of operations; we only use FLOPs as the plural of FLOP, never as FLOPS. Then there is the question of what a FLOP is. For DNNs, compute is usually associated with the number of multiply-add operations, even though other types of operations are involved when executing a DNN; this is usually a good estimation [Hollemans, 2018, Clark et al., 2020]. More specifically, we count one fused multiply-add operation as 2 FLOPs (note the lowercase 's'). Hardware manufacturers count them in this manner [NVIDIA, 2015], since there are in fact two mathematical operations. CV research papers, however, often count a multiply-add as a single operation; in those cases we multiply the reported number of operations by 2.
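These conventions can be captured in a small helper; the 100 TFLOPS peak and the 30% utilization in the example are arbitrary illustrative values, not figures from our dataset:

```python
def madds_to_flops(madds: float) -> float:
    """One fused multiply-add counts as 2 FLOPs (the convention used here)."""
    return 2.0 * madds

def inference_time_s(task_flops: float, hardware_flops_per_s: float,
                     utilization: float = 1.0) -> float:
    """Lower-bound inference time: task FLOPs divided by sustained FLOPS.
    Real utilization is well below 1.0 (memory-bound layers, launch overhead)."""
    return task_flops / (hardware_flops_per_s * utilization)

# A CV paper reporting 4 G multiply-adds means 8 GFLOPs under our convention.
flops = madds_to_flops(4e9)
# On hardware with an (arbitrary) 100 TFLOPS peak at 30% assumed utilization:
t = inference_time_s(flops, 100e12, utilization=0.3)
print(f"{flops:.0e} FLOPs -> {t * 1e3:.3f} ms per forward pass")
```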
## Methodology Details for CV Models
Accuracy and FLOPs metrics were collected carefully, taking into account that there are different sampling techniques to reach a given accuracy. For instance, in the AlexNet paper [Krizhevsky et al., 2012], a single image is classified by making 10 predictions: 10 different crops 6 are taken from the original image and the 10 predictions are averaged to obtain the final one. While this is a useful trick, it is not fair to compare the accuracy of a model achieved with 10 crops against one achieved with 1 crop. Furthermore, the use of several crops or other kinds of repetitions is problematic, as papers usually report the number of FLOPs for one forward pass 7 (if 10 forward passes are needed to make a single prediction, then the FLOPs should be multiplied by 10). For these reasons we only report 1-crop accuracy for all models, to make a meaningful comparison.
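A minimal sketch of the multi-crop accounting described above (the model itself is stubbed out as per-crop class probabilities; only the averaging and the FLOPs multiplication matter here):

```python
def flops_per_prediction(flops_per_forward_pass: float, n_crops: int) -> float:
    """Multi-crop evaluation runs one forward pass per crop."""
    return n_crops * flops_per_forward_pass

def average_predictions(per_crop_probs):
    """Final prediction: element-wise mean of the per-crop class probabilities."""
    n = len(per_crop_probs)
    return [sum(col) / n for col in zip(*per_crop_probs)]

# Two crops over three classes (AlexNet used ten crops):
probs = average_predictions([[0.5, 0.25, 0.25],
                             [0.75, 0.125, 0.125]])
print(probs)                                # [0.625, 0.1875, 0.1875]
# The reported per-pass FLOPs must be scaled by the number of crops:
print(flops_per_prediction(1.42e9, 10))    # 10x the single-pass FLOPs
```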
Note that the FLOPs also depend on the input image resolution: the higher the resolution, the more operations (FLOPs) are required to process the image. Some researchers report results with different image resolutions [Simonyan and Zisserman, 2015, Zhai et al., 2021], and sometimes it is not clear which resolution the results correspond to. In those cases, we investigated further until the information was found. In sum, all the FLOPs collected in this work are for a forward pass at the resolution used for inference. The selected models and their values are shown in Table 2.
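The resolution dependence follows directly from the per-layer cost of a convolution; this sketch uses the standard cost formula for a stride-1, same-padded layer, with the multiply-add pair counted as 2 FLOPs as elsewhere in the paper:

```python
def conv2d_flops(h: int, w: int, c_in: int, c_out: int, k: int) -> float:
    """FLOPs of one stride-1, same-padded convolution layer:
    each of the h*w*c_out outputs needs c_in*k*k multiply-adds,
    and each multiply-add pair counts as 2 FLOPs."""
    madds = h * w * c_out * (c_in * k * k)
    return 2.0 * madds

# The same 3x3, 64->64 layer at two input resolutions:
low = conv2d_flops(224, 224, 64, 64, 3)
high = conv2d_flops(448, 448, 64, 64, 3)
print(f"{high / low:.0f}x")   # doubling H and W quadruples the FLOPs
```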
6 Cropping is a common image manipulation process: cropping the middle square from input images (and down-sampling it) is a common practice for data preparation, while random cropping is a common practice for training-data augmentation.
7 A 'forward pass' refers to the calculation of the output-layer values from the input data, traversing all neurons from the first layer to the last. A loss function can then be computed from the output values.
Table 2: CV models data set. A citation next to a given value means that this value is extracted from that source, otherwise the values are from the paper (cited in model column). The symbol † means that this value was obtained or checked from a model implementation using model analysis tools, and the symbol ∗ means that we estimated the value.
| Model | Top-1 Acc. | Params (M) | GFLOPs | Extra Data | Date | Architecture |
|----------------------------------------------------------------------------------------|----------------------------------------------------|----------------------------------------------------|---------------------------------------------------------|---------------------------------|----------------------------------|-------------------------------------|
| AlexNet [Krizhevsky et al., 2012] | 56.52 [Paszke et al., 2016] | 61.00 † | 1.42 † | No | 01/06/2012 | CNN |
| ZFNet-b [Zeiler and Fergus, 2013] | 63.63 [Sémery, 2019] | 107.63 [Sémery, 2019] | 4.96 [Sémery, 2019] | No | 11/11/2013 | CNN |
| ZFNet [Zeiler and Fergus, 2013] | 60.21 [Sémery, 2019] | 62.36 [Sémery, 2019] | 2.34 [Sémery, 2019] | No | 12/11/2013 | CNN |
| VGG-19 [Simonyan and Zisserman, 2015] | 72.37 [Paszke et al., 2016] | 144.00 | 39.34 † | No | 04/09/2014 | CNN |
| VGG-16 [Simonyan and Zisserman, 2015] | 71.59 [Paszke et al., 2016] | 138.00 | 31.00 † | No | 04/09/2014 | CNN |
| Inception V1/GoogLeNet [Szegedy et al., 2014] | 69.77 [Paszke et al., 2016] | 6.80 | 3.00 | No | 17/09/2014 | CNN |
| Inception V2/Inception BN [Ioffe and Szegedy, 2015] | 74.80 | 11.29 [Sémery, 2019] | 4.10 [Sémery, 2019] | No | 11/02/2015 | CNN |
| Inception V3 [Szegedy et al., 2015] | 78.80 | 23.83 | 11.48 | No | 02/12/2015 | CNN |
| ResNet-50 [He et al., 2015b] | 75.30 [He et al., 2015a] | 26.00 [Chollet, 2015] | 7.60 | No | 10/12/2015 | CNN |
| ResNet-101 [He et al., 2015b] | 76.40 [He et al., 2015a] | 45.00 [Chollet, 2015] | 15.20 | No | 10/12/2015 | CNN |
| ResNet-152 [He et al., 2015b] | 77.00 [He et al., 2015a] | 60.00 [Chollet, 2015] | 22.60 | No | 10/12/2015 | CNN |
| Inception V4 [Szegedy et al., 2016] | 80.00 | 42.68 [Sémery, 2019] | 24.60 [Sémery, 2019] | No | 23/02/2016 | CNN |
| Inception ResNet V2 [Szegedy et al., 2016] | 80.10 | 55.84 [Sémery, 2019] | 26.38 [Sémery, 2019] | No | 23/02/2016 | CNN |
| DenseNet-121 [Huang et al., 2018] | 74.98 | 7.98 [Sémery, 2019] | 5.74 [Sémery, 2019] | No | 25/08/2016 | CNN |
| DenseNet-169 [Huang et al., 2018] | 76.20 | 14.15 [Sémery, 2019] | 6.80 [Sémery, 2019] | No | 25/08/2016 | CNN |
| DenseNet-201 [Huang et al., 2018] | 77.42 | 20.01 [Sémery, 2019] | 8.68 [Sémery, 2019] | No | 25/08/2016 | CNN |
| Xception [Chollet, 2017] | 79.00 | 22.86 | 16.80 [Sémery, 2019] | No | 07/10/2016 | CNN |
| ResNeXt-50 (32x4d) [Xie et al., 2017] | 77.80 | 25.00 | 8.40 | No | 16/11/2016 | CNN |
| ResNeXt-101 (64x4d) [Xie et al., 2017] | 79.60 | 83.46 | 31.20 † | No | 16/11/2016 | CNN |
| MobileNet [Howard et al., 2017] | 70.60 | 4.20 | 1.14 | No | 17/04/2017 | CNN |
| ShuffleNet x1.0 (g=8) [Zhang et al., 2017] | 67.60 | 2.43 [Sémery, 2019] | 0.28 | No | 04/07/2017 | CNN |
| DPN-131 (40 × 4d) [Chen et al., 2017] | 80.07 | 79.50 | 32.00 | No | 06/07/2017 | CNN |
| DPN-98 (40 × 4d) [Chen et al., 2017] | 79.80 | 61.70 | 23.40 | No | 06/07/2017 | CNN |
| DPN-92 (32 × 3d) [Chen et al., 2017] | 79.30 | 37.80 | 13.00 | No | 06/07/2017 | CNN |
| NASNet-A (6 @4032) [Zoph et al., 2018] | 82.70 | 88.90 | 47.60 | No | 21/07/2017 | CNN |
| NASNet-A (7 @1920) [Zoph et al., 2018] | 80.80 | 22.60 | 9.86 | No | 21/07/2017 | CNN |
| SENet-154 [Hu et al., 2019] | 81.32 | 115.09 [Sémery, 2019] | 41.50 [Sémery, 2019] | No | 05/09/2017 | CNN |
| PNASNet-5 (N=4, F=216) [Liu et al., 2018] | 82.90 | 86.10 | 50.00 | No | 02/12/2017 | CNN |
| PNASNet-5 (N=3, F=54) [Liu et al., 2018] | 74.20 | 5.10 | 1.18 | No | 02/12/2017 | CNN |
| MobileNetV2 [Sandler et al., 2019] | 72.00 | 3.40 | 0.60 | No | 13/01/2018 | CNN |
| MobileNetV2 1.4 [Sandler et al., 2019] | 74.70 | 6.90 | 1.18 | No | 13/01/2018 | CNN |
| AmoebaNet-A (N=6, F=190) [Real et al., 2019] | 82.80 | 86.70 | 46.20 | No | 05/02/2018 | CNN |
| AmoebaNet-A (N=6, F=448) [Real et al., 2019] | 83.90 | 469.00 | 208.00 | No | 05/02/2018 | CNN |
| ResNeXt-101 32×32d [Mahajan et al., 2018] | 85.10 | 466.00 | 174.00 | Instagram 940M | 02/05/2018 | CNN |
| ResNeXt-101 32×48d [Mahajan et al., 2018] | 85.40 | 829.00 | 306.00 | Instagram 940M | 02/05/2018 | CNN |
| ShuffleNetV2 x1.0 [Ma et al., 2018] | 69.40 | 2.28 [Sémery, 2019] | 0.30 | No | 30/07/2018 | CNN |
| ResNeXt-101 32x16d [Yalniz et al., 2019b,a] | 84.80 | 193.00 | 72.00 | Custom 940M | 02/05/2019 | CNN |
| ResNeXt-101 32x8d [Yalniz et al., 2019b,a] | 84.30 | 88.00 | 32.00 | Custom 940M | 02/05/2019 | CNN |
| ResNeXt-50 32x4d [Yalniz et al., 2019b,a] | 82.20 | 25.00 | 8.00 | Custom 940M | 02/05/2019 | CNN |
| EfficientNet-B0 [Tan and Le, 2020] | 77.10 | 5.30 | 0.78 | No | 28/05/2019 | CNN |
| EfficientNet-B1 [Tan and Le, 2020] | 79.10 | 7.80 | 1.40 | No | 28/05/2019 | CNN |
| EfficientNet-B2 [Tan and Le, 2020] | 80.10 | 9.20 | 2.00 | No | 28/05/2019 | CNN |
| EfficientNet-B3 [Tan and Le, 2020] | 81.60 | 12.00 | 3.60 | No | 28/05/2019 | CNN |
| EfficientNet-B4 [Tan and Le, 2020] | 82.90 | 19.00 | 8.40 | No | 28/05/2019 | CNN |
| EfficientNet-B5 [Tan and Le, 2020] | 83.60 | 30.00 | 19.80 | No | 28/05/2019 | CNN |
| EfficientNet-B6 [Tan and Le, 2020] | 84.00 | 43.00 | 38.00 | No | 28/05/2019 | CNN |
| EfficientNet-B7 [Tan and Le, 2020] | 84.30 | 66.00 | 74.00 | No | 28/05/2019 | CNN |
| NoisyStudent-B0 [Xie et al., 2020] | 78.80 | 5.30 | 0.78 | JFT 300M | 11/11/2019 | CNN |
| NoisyStudent-B1 [Xie et al., 2020] | 81.50 | 7.80 | 1.40 | JFT 300M | 11/11/2019 | CNN |
| NoisyStudent-B2 [Xie et al., 2020] | 82.40 | 9.20 | 2.00 | JFT 300M | 11/11/2019 | CNN |
| NoisyStudent-B3 [Xie et al., 2020] | 84.10 | 12.00 | 3.60 | JFT 300M | 11/11/2019 | CNN |
| NoisyStudent-B4 [Xie et al., 2020] | 85.30 | 19.00 | 8.40 | JFT 300M | 11/11/2019 | CNN |
| NoisyStudent-B5 [Xie et al., 2020] | 86.10 | 30.00 | 19.80 | JFT 300M | 11/11/2019 | CNN |
| NoisyStudent-B6 [Xie et al., 2020] | 86.40 | 43.00 | 38.00 | JFT 300M | 11/11/2019 | CNN |
| NoisyStudent-B7 [Xie et al., 2020] | 86.90 | 66.00 | 74.00 | JFT 300M | 11/11/2019 | CNN |
| NoisyStudent-L2 [Xie et al., 2020] | 88.40 | 480.00 | 1040.00 ∗ | JFT 300M | 11/11/2019 | CNN |
| FixEfficientNet-L2 [Touvron et al., 2020] | 88.50 | 480.00 | 585.00 ∗ | JFT 300M | 18/03/2020 | CNN |
| FixEfficientNet-B7 [Touvron et al., 2020] | 85.30 | 66.00 | 82.00 ∗ | No | 18/03/2020 | CNN |
| FixEfficientNet-B0 [Touvron et al., 2020] | 79.30 | 5.30 | 1.60 ∗ | No | 18/03/2020 | CNN |
| Meta Pseudo Labels L2 [Pham et al., 2021] | 90.20 | 480.00 | 1040.00 ∗ | JFT 300M | 23/03/2020 | CNN |
| ResNeSt-269 [Zhang et al., 2020] | 84.50 | 111.00 | 155.80 † | No | 19/04/2020 | CNN |
| ResNeSt-200 [Zhang et al., 2020] | 83.90 | 70.00 | 71.56 † | No | 19/04/2020 | CNN |
| ResNeSt-50 [Zhang et al., 2020] | 81.13 | 27.50 | 10.78 | No | 19/04/2020 | CNN |
| ViT-L/16 [Dosovitskiy et al., 2021] | 85.30 | 304.00 [Tan and Le, 2021] | 384.00 [Tan and Le, 2021] | ImageNet 21k | 22/10/2020 | Transformer |
| ViT-L/16 [Dosovitskiy et al., 2021] | 87.12 | 304.00 [Tan and Le, 2021] | 384.00 [Tan and Le, 2021] | JFT 300M | 22/10/2020 | Transformer |
| ViT-B/16 [Dosovitskiy et al., 2021] | 84.60 [Tan and Le, 2021] | 87.00 [Tan and Le, 2021] | 112.00 [Tan and Le, 2021] | ImageNet 21k | 22/10/2020 | Transformer |
| DeiT-small [Touvron et al., 2021b,a] | 79.90 | 22.00 | 9.20 [Yuan et al., 2021] | No | 23/12/2020 | Transformer |
| DeiT-small-Distilled [Touvron et al., 2021b,a] | 81.20 | 22.00 | 9.40 [Yuan et al., 2021] | No | 23/12/2020 | Transformer |
| DeiT-base [Touvron et al., 2021b,a] | 81.80 | 86.00 | 36.00 [Tan and Le, 2021] | No | 23/12/2020 | Transformer |
| DeiT-base-384 [Touvron et al., 2021b,a] | 82.90 | 86.00 | 112.00 [Tan and Le, 2021] | No | 23/12/2020 | Transformer |
| BotNet-T7 [Srinivas et al., 2021] | 84.70 | 75.00 | 92.00 | No | 27/01/2021 | Hybrid |
| BotNet-T5 [Srinivas et al., 2021] | 83.50 | 75.10 | 38.60 | No | 27/01/2021 | Hybrid |
| T2T-ViTt-14 [Yuan et al., 2021] | 81.70 | 21.50 | 12.20 | No | 28/01/2021 | Transformer |
| T2T-ViTt-19 [Yuan et al., 2021] | 82.20 | 39.20 | 19.60 | No | 28/01/2021 | Transformer |
| T2T-ViTt-24 [Yuan et al., 2021] | 82.60 | 64.10 | 30.00 | No | 28/01/2021 | Transformer |
| NFNet-F4+ [Brock et al., 2021] | 89.20 | 527.00 | 734.00 | JFT 300M | 11/02/2021 | CNN |
| NFNet-F0 [Brock et al., 2021] | 83.60 | 71.50 | 24.76 | No | 11/02/2021 | CNN |
| NFNet-F6+SAM [Brock et al., 2021] | 86.50 | 438.40 | 754.56 | No | 11/02/2021 | CNN |
| Swin-B 224 [Liu et al., 2021] | 85.20 | 88.00 | 30.80 | ImageNet 21k | 25/03/2021 | Transformer |
| Swin-B 384 [Liu et al., 2021] | 86.00 | 88.00 | 94.00 | ImageNet 21k | 25/03/2021 | Transformer |
| Swin-L [Liu et al., 2021] | 86.40 | 197.00 | 207.80 | ImageNet 21k | 25/03/2021 | Transformer |
| CrossViT-15 [Chen et al., 2021] | 81.50 | 27.40 | 11.60 | No | 27/03/2021 | Transformer |
| CrossViT-18 [Chen et al., 2021] | 82.50 | 43.30 | 18.06 | No | 27/03/2021 | Transformer |
| CaiT-S36 [Touvron et al., 2021c] | 83.30 | 68.00 | 27.80 | No | 31/03/2021 | Transformer |
| CaiT-S36 dist [Touvron et al., 2021c] | 84.00 | 68.00 | 27.80 | No | 31/03/2021 | Transformer |
| CaiT-S24-384 dist [Touvron et al., 2021c] | 85.10 | 46.90 | 64.40 | No | 31/03/2021 | Transformer |
| CaiT-M48-448 dist [Touvron et al., 2021c] | 86.50 | 356.00 | 659.20 | No | 31/03/2021 | Transformer |
| EfficientNetV2-S [Tan and Le, 2021] | 83.90 | 24.00 | 17.60 | No | 01/04/2021 | CNN |
| EfficientNetV2-M [Tan and Le, 2021] | 85.10 | 55.00 | 48.00 | No | 01/04/2021 | CNN |
| EfficientNetV2-L [Tan and Le, 2021] | 85.70 | 121.00 | 106.00 | No | 01/04/2021 | CNN |
| EfficientNetV2-S [Tan and Le, 2021] | 85.00 | 24.00 | 17.60 | ImageNet 21k | 01/04/2021 | CNN |
| EfficientNetV2-M [Tan and Le, 2021] | 86.10 | 55.00 | 48.00 | ImageNet 21k | 01/04/2021 | CNN |
| EfficientNetV2-L [Tan and Le, 2021] | 86.80 | 121.00 | 106.00 | ImageNet 21k | 01/04/2021 | CNN |
| ViT-G/14 [Zhai et al., 2021] | 90.45 | 1843.00 | 5270.00 ∗ | JFT 3B | 08/06/2021 | Transformer |
## Methodology Details for NLP Models
As previously stated, for NLP models we simply included all the models since 2017 for which we found an inference compute estimation. Many papers do not explain how they count FLOPs (as single mathematical operations or as single hardware instructions), but we ultimately found this information explained in [Clark et al., 2020]. We compared the numbers presented there with estimations in other publications (for repeated and similar models) and found them to be very similar, so we assume that the other authors follow the same procedure to count FLOPs. In NLP, FLOPs are counted as single mathematical operations and not as single hardware instructions (as is common in CV). The important thing is that we use the same approach for all the NLP models, as the comparison and analysis will be intra-domain and never inter-domain.
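As a toy illustration of the two counting conventions (the layer sizes below are hypothetical), the same matrix product can be reported either as fused multiply-add operations or as single mathematical operations:

```python
def matmul_macs(m, k, n):
    # multiply-accumulate count for an (m x k) @ (k x n) product
    return m * k * n

def matmul_flops(m, k, n):
    # counting multiplies and adds separately, as in the NLP convention
    return 2 * matmul_macs(m, k, n)

# e.g. one projection for a 128-token input
# (hypothetical sizes: hidden width 768, output width 3072)
macs = matmul_macs(128, 768, 3072)
flops = matmul_flops(128, 768, 3072)  # exactly twice the MAC count
```

The factor of two is the entire difference between the conventions, which is why consistency within a domain matters more than the convention itself.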
## Datasets
## ImageNet
ImageNet has been the most used dataset of the last decade for training and evaluating CV models. The full dataset consists of 14,197,122 images distributed in 21,841 classes. Researchers refer to this dataset as ImageNet21k or ImageNet22k. However, researchers commonly use a subset of the full ImageNet dataset. This subset consists of 1.2 million images for training and 50,000 images for validation, distributed in 1,000 classes. It was released for the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) and is usually referred to as ImageNet1k or just ImageNet. In 2012 the AlexNet model [Krizhevsky et al., 2012] won the ILSVRC2012 image classification task with an impressive result, outperforming the other models by a large margin. AlexNet was the first DNN to win this competition. Since then many other DNNs have been created for image classification.
## GLUE
The General Language Understanding Evaluation (GLUE) benchmark [Wang et al., 2019] is a collection of resources for evaluating and analysing the performance of models across a diverse range of existing NLP tasks, with the goal of driving 'research in the development of general and robust natural language understanding systems'. The collection in GLUE consists of nine 'difficult and diverse' tasks, mostly adopted from existing datasets. The tasks involve sentiment analysis, acceptability, paraphrasing, natural language inference and coreference resolution. GLUE is model-agnostic, but it incentivises sharing knowledge across tasks (using parameter sharing or other transfer learning techniques) due to the limited training data for certain tasks.
## Hardware data compilation: floating point precision details
At the end of 2017 Nvidia launched GPUs with new features for AI acceleration (improved lower-precision performance and tensor cores, which accelerate low-precision calculations) [NVIDIA, 2017]. For instance, many new GPUs accelerate FP16 operations through tensor cores (a DNN can operate at low precision in many calculations without problems) and combine them with FP32 operations when necessary. In this way we benefit from higher performance while maintaining calculation precision. Nvidia specifies different FLOPS for FP16 and for tensor cores. Nowadays, frameworks such as PyTorch and TensorFlow allow training and inference with mixed precision, i.e., taking advantage of the tensor cores, easily and with practically no significant reduction in accuracy. Because of all this, we consider it necessary to include the performance achieved with tensor cores in our analysis.
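Why FP16 alone is not enough can be illustrated with a CPU-only sketch that emulates IEEE 754 half precision through Python's `struct` module (this mimics the numerical effect, not the actual GPU tensor-core path):

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack('e', struct.pack('e', x))[0]

# pure FP16: once the running sum reaches 2048, adding 1.0 rounds
# back down, because consecutive integers above 2048 are not
# representable in half precision
s_fp16 = 0.0
for _ in range(4096):
    s_fp16 = to_fp16(s_fp16 + 1.0)

# mixed style: FP16 operands, higher-precision accumulator
s_mixed = 0.0
for _ in range(4096):
    s_mixed += to_fp16(1.0)
```

Keeping the accumulator in higher precision, as mixed-precision frameworks do, avoids this stagnation while still benefiting from fast FP16 arithmetic.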
Theoretical FLOPS using tensor cores are very high, but this increase does not correspond to the gains seen in practice for deep learning applications (gaming may be different), because not all operations can use tensor cores. To reconcile the theoretical tensor-core FLOPS with their real utilisation, we calculate the speed-up achieved when DNN inference is done with mixed precision. We looked for experimental results to adjust the tensor-core FP16/FP32 FLOPS to the real performance improvement; the inference benchmarks we use are available in the Nvidia NGC Catalog 8 . The collected data can be found in Table 3.
8 https://ngc.nvidia.com/catalog/resources
Table 3: Throughput measures for V100, A100 and T4 GPUs on different models. The 'Speed-up' column is the speed-up achieved with respect to FP32 throughput using different precision formats. The A100 speed-up is calculated with respect to V100 FP32 throughput. The data is obtained from the NVIDIA NGC catalog (https://ngc.nvidia.com/catalog/resources).
| Task | Model | Framework | Batch size | GPU | Precision | Throughput | Speed-up |
|--------|-----------------------------------------|-----------------------|--------------|---------------------|-------------|---------------|------------|
| | efficientnet-b0 | PyTorch | 256 | V100 16GB | FP32 | 2968 | 1.00 |
| | efficientnet-b0 | PyTorch | 256 | V100 16GB | Mixed | 6176 | 2.08 |
| | efficientnet-b0 | PyTorch | 256 | A100 80GB | TF32 | 5154 | 1.74 |
| | efficientnet-b0 | PyTorch | 256 | A100 80GB | Mixed | 10239 | 3.45 |
| | efficientnet-b4 | PyTorch | 128 | V100 16GB | FP32 | 376 | 1.00 |
| | efficientnet-b4 | PyTorch | 128 | V100 16GB | Mixed | 843 | 2.24 |
| | efficientnet-b4 | PyTorch | 128 | A100 80GB | TF32 | 700 | 1.86 |
| | efficientnet-b4 | PyTorch | 128 | A100 80GB | Mixed | 1418 | 3.77 |
| | ResNeXt101-32x4d | PyTorch | 256 | V100 16GB | FP32 | 533 | 1.00 |
| | ResNeXt101-32x4d | PyTorch | 256 | V100 16GB | Mixed | 1746 | 3.28 |
| | ResNeXt101-32x4d | PyTorch | 256 | T4 16GB | FP32 | 161 | 1.00 |
| | ResNeXt101-32x4d | PyTorch | 256 | T4 16GB | Mixed | 598 | 3.71 |
| | ResNet v1.5 | PyTorch | 256 | V100 16GB | FP32 | 1261 | 1.00 |
| | ResNet v1.5 | PyTorch | 256 | V100 16GB | Mixed | 3382 | 2.68 |
| | ResNet v1.5 | PyTorch | 256 | T4 16GB | FP32 | 415 | 1.00 |
| | ResNet v1.5 | PyTorch | 256 | T4 16GB | Mixed | 1198 | 2.89 |
| | ResNet v1.5 | TensorFlow | 256 | V100 16GB | FP32 | 1348.52 | 1.00 |
| | ResNet v1.5 | TensorFlow | 256 | V100 16GB | Mixed | 2742.14 | 2.03 |
| CV | ResNet v1.5 | TensorFlow | 256 | A100 40GB | TF32 | 1911.96 | 1.42 |
| | ResNet v1.5 | TensorFlow | 256 | A100 40GB | Mixed | 3229.32 | 2.39 |
| | ResNet v1.5 | TensorFlow | 256 | T4 16GB | FP32 | 425.72 | 1.00 |
| | ResNet v1.5 | TensorFlow | 256 | T4 16GB | Mixed | 993.39 | 2.33 |
| | SSD v1.1 | PyTorch | 32 | V100 16GB | FP32 | 271.73 | 1.00 |
| | SSD v1.1 | PyTorch | 32 | V100 16GB | Mixed | 438.85 | 1.62 |
| | SSD v1.1 | PyTorch | 32 | A100 40GB | TF32 | 548.75 | 2.02 |
| | SSD v1.1 | PyTorch | 32 | A100 40GB | Mixed | 910.17 | 3.35 |
| | UNet Industrial | TensorFlow | 16 | V100 16GB | FP32 | 250.23 | 1.00 |
| | UNet Industrial | TensorFlow | 16 | V100 16GB | Mixed | 469.27 | 1.88 |
| | UNet Industrial | TensorFlow | 16 | A100 40GB | TF32 | 424.57 | 1.70 |
| | UNet Industrial | TensorFlow | 16 | A100 40GB | Mixed | 823.46 | 3.29 |
| | SE-ResNeXt101-32x4d | TensorFlow | 128 | V100 16GB | FP32 | 460.82 | 1.00 |
| | SE-ResNeXt101-32x4d | TensorFlow | 128 | V100 16GB | Mixed | 1102 | 2.39 |
| | SE-ResNeXt101-32x4d | TensorFlow | 128 | A100 40GB | TF32 | 802.64 | 1.74 |
| | SE-ResNeXt101-32x4d | TensorFlow | 128 | A100 40GB | Mixed | 1728.27 | 3.75 |
| | SE-ResNeXt101-32x4d | TensorFlow | 128 | T4 16GB | FP32 | 105.16 | 1.00 |
| | SE-ResNeXt101-32x4d | TensorFlow | 128 | T4 16GB | Mixed | 195.17 | 1.86 |
| | BERT-LARGE | TensorFlow | 8 | V100 16GB | FP32 | 44.03 | 1.00 |
| | BERT-LARGE | TensorFlow | 8 | V100 16GB | Mixed | 168.34 | 3.82 |
| | BERT-LARGE | TensorFlow | 8 | A100 80GB | TF32 | 241.68 | 5.49 |
| | BERT-LARGE | TensorFlow | 8 | A100 80GB | Mixed | 342.22 | 7.77 |
| | BERT-LARGE | TensorFlow | 8 | T4 16GB | FP32 | 16.04 | 1.00 |
| | BERT-LARGE | TensorFlow | 8 | T4 16GB | Mixed | 62.99 | 3.93 |
| | BERT-Base | TensorFlow | 8 | V100 16GB | FP32 | 146.15 | 1.00 |
| | BERT-Base | TensorFlow | 8 | V100 16GB | Mixed | 504.24 | 3.45 |
| | BERT-Base | TensorFlow | 8 | A100 80GB | TF32 | 645.88 | 4.42 |
| | BERT-Base | TensorFlow | 8 | A100 80GB | Mixed | 846.81 | 5.79 |
| NLP | BERT-Base | TensorFlow | 8 | T4 16GB | FP32 | 51.33 | 1.00 |
| | BERT-Base | TensorFlow | 8 | T4 16GB | Mixed | 192.61 | 3.75 |
| | Transformer-XL | TensorFlow | 32 | V100 16GB | FP32 | 8555.6 | 1.00 |
| | Transformer-XL | TensorFlow | 32 | V100 16GB | Mixed | 11215.5 | 1.31 |
| | Transformer-XL | TensorFlow | 32 | A100 40GB | TF32 | 19434.5 | 2.27 |
| | Transformer-XL | TensorFlow | 32 | A100 40GB | Mixed | 21854.7 | 2.55 |
| | Transformer-XL | TensorFlow | 32 | T4 16GB | FP32 | 3439.1 | 1.00 |
| | Transformer-XL | TensorFlow | 32 | T4 16GB | Mixed | 6174.3 | 1.80 |
| | Transformer | PyTorch | 10240 | V100 16GB | FP32 | 3782 | 1.00 |
| | Transformer | PyTorch | 10240 | V100 16GB | Mixed | 7464 | 1.97 |
| | Transformer | PyTorch | 10240 | A100 40GB | TF32 | 7755 | 2.05 |
| | Transformer | PyTorch | 10240 | A100 40GB | Mixed | 9653 | 2.55 |
We do not include estimated mixed-precision performance for all GPUs that support it, because we have not found sufficient benchmarks for all GPUs to carry out an estimation. Also, we do not consider the INT8 precision format because in many cases using this format leads to a performance downgrade, and therefore the accuracy metric of the models would have to be adapted for a fair analysis. We perform different estimations for CV and for NLP networks because these two kinds of networks operate in different ways and take different advantage of mixed precision. During training, the speed-up from mixed precision in comparison to FP32 is usually around 2x for image models, and up to 4x for language models [Li, 2020]. This is also corroborated by benchmarks reported on Nvidia blogs [NVIDIA, 2018].
## Hardware mixed precision speed-ups
As we have discussed, theoretical FLOPS for tensor cores are very high, as can be seen in Fig. 7 in the main text. However, the performance achieved for inference using tensor cores is not as high. For this reason we propose an estimation for the Nvidia V100, A100 and T4 GPUs, for CV models and for NLP models. For these calculations we collected inference data from the NVIDIA NGC catalog. The estimations for the A100 are relative to the V100 because there is no FP32 data for the A100 (FP32 is substituted by TF32 9 , a precision format in between FP32 and FP16), so we estimated the speed-up with respect to V100 FP32 FLOPS.
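Assuming the aggregate speed-ups in Table 4 are simple arithmetic means over the per-model measurements in Table 3, the V100 row can be reproduced as follows:

```python
# Mixed-precision speed-ups on V100 taken from Table 3
# (our assumption: Table 4 aggregates them with a simple mean)
cv_v100_mixed = [2.08, 2.24, 3.28, 2.68, 2.03, 1.62, 1.88, 2.39]
nlp_v100_mixed = [3.82, 3.45, 1.31, 1.97]

def mean(xs):
    return sum(xs) / len(xs)

cv_speedup = mean(cv_v100_mixed)    # ~2.27 (Table 4, CV models)
nlp_speedup = mean(nlp_v100_mixed)  # ~2.64 (Table 4, NLP models)
```

The same averaging over the T4 rows, and over the A100 rows relative to V100 FP32, yields the remaining entries of Table 4.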
Table 4: Mixed precision speed ups from experimental results for inference.
| GPU | Precision speed up | CV models | NLP models |
|-------|--------------------------------------------------------------------|-------------|--------------|
| V100 | Mixed speed up ratio to V100 FP32 | 2.27 | 2.64 |
| A100 | TF32 speed up ratio to V100 FP32 | 1.75 | 3.56 |
| A100 | Mixed speed up ratio to V100 FP32 | 3.33 | 4.67 |
| T4 | Mixed speed up ratio to T4 FP32 | 2.7 | 3.16 |
## Performance and compute (NLP)
We represent the improvement of the GLUE score over the years, as well as the models' inference GFLOPs (bubble size), in Fig. 15. GFLOPs are for a single input of length 128, a reasonable sequence length for many use cases, able to fit text messages or short emails. We observe an evolution very similar to the one seen for ImageNet: SOTA models require a large number of FLOPs, but within a short period other models appear that require much fewer FLOPs to reach the same score.
Figure 15: GFLOPs per token analysis for NLP models.
<details>
<summary>Image 15 Details</summary>
Scatter plot of GLUE score (y-axis, 70 to 90) against date (x-axis, 2018-07 to 2020-07), with bubble size encoding inference GFLOPs (legend sizes: 4, 8, 16, 32, 64). Scores climb steeply up to mid-2019; from 2020 onwards, models with far fewer GFLOPs reach comparable scores.
</details>
## FLOPS estimation for CV models
## EfficientNet-Based Models FLOPs Estimation
There are many EfficientNet variations, mostly using different input resolutions or scalings. For these modifications, FLOPs are not always reported. In this work, we estimate them following the relation presented in Equation 3
$$\mathit{FLOPs} \propto d \cdot w^{2} \cdot r^{2} \quad (3)$$
for the following models:
- NoisyStudent-L2 : Given the scale factors of the networks (Table 5), we estimate the NoisyStudent-L2 FLOPs as shown in Equation 4
9 https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
## NLP data
Researchers often report the GLUE score without the score on the WNLI task, because this task is problematic. We have marked which scores are reported without this task. Since there are 9 tasks in total, we consider that excluding one of them is not problematic for our analysis.
We did not find inference GFLOPs for the BERT-Large model, but we have the ELECTRA-Large GFLOPs, and this is actually the same model following a different training strategy. In this
Table 5: EfficientNet models architecture specifications obtained from [Xie et al., 2020].
| Model | w | d | Test Resolution |
|-----------------|-----|-----|-------------------|
| EfficientNet-B7 | 2 | 3.1 | 600 × 600 |
| EfficientNet-L2 | 4.3 | 5.3 | 800 × 800 |
$$\text{NoisyStudent-L2 FLOPs} = \text{EfficientNet-B7 FLOPs} \cdot d_{\sigma} \cdot w_{\sigma}^{2} \cdot r_{\sigma}^{2} \quad (4)$$
where $d_{\sigma}$, $w_{\sigma}$ and $r_{\sigma}$ are the scale factors for, respectively, the network depth, width and input resolution. Using the values from Table 5, $d_{\sigma} = 5.3/3.1 = 1.7097$, $w_{\sigma} = 4.3/2 = 2.15$ and $r_{\sigma} = 800/600 = 1.3333$. Knowing that the GFLOPs for EfficientNet-B7 are 74, substituting in Equation 4 we obtain the estimation of $74 \cdot 1.7097 \cdot 2.15^{2} \cdot 1.3333^{2} \approx 1040$ GFLOPs for NoisyStudent-L2.
- Meta Pseudo Labels L2 : We use the estimation of NoisyStudent-L2 FLOPs for Meta Pseudo Labels L2, because it is the same model and only changes the training strategy.
- FixEfficientNet-L2 : FixEfficientNet-L2 uses a resolution of 600 × 600 for testing, so the estimation is the same as for NoisyStudent-L2 but without the resolution scaling ($r_{\sigma}$). The estimated GFLOPs are $74 \cdot 1.7097 \cdot 2.15^{2} \approx 585$ GFLOPs.
- FixEfficientNet-B7 : This model is the same as EfficientNet-B7 but uses a slightly different resolution (632 × 632). Therefore $r_{\sigma} = 632/600 = 1.0534$ and we estimate $74 \cdot 1.0534^{2} \approx 82$ GFLOPs.
- FixEfficientNet-B0 : This model is the same as EfficientNet-B0 but uses a higher resolution (320 × 320). Therefore $r_{\sigma} = 320/224 = 1.4286$ and we estimate $0.78 \cdot 1.4286^{2} \approx 1.6$ GFLOPs.
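The estimates in the bullet points above reduce to one line of arithmetic each; a small sketch reproducing the NoisyStudent-L2 and FixEfficientNet-L2 figures from the Table 5 scale factors:

```python
def scaled_gflops(base_gflops, d_sigma=1.0, w_sigma=1.0, r_sigma=1.0):
    """Apply the FLOPs proportional to d * w^2 * r^2 relation (Equation 3)."""
    return base_gflops * d_sigma * w_sigma**2 * r_sigma**2

# EfficientNet-B7 -> NoisyStudent-L2 (scale factors from Table 5)
noisystudent_l2 = scaled_gflops(74, d_sigma=5.3 / 3.1,
                                w_sigma=4.3 / 2, r_sigma=800 / 600)

# FixEfficientNet-L2: same depth/width scaling, no resolution change
fix_l2 = scaled_gflops(74, d_sigma=5.3 / 3.1, w_sigma=4.3 / 2)
```

Rounding the two results recovers the 1040 and 585 GFLOPs reported above.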
## ViT-G/14 FLOPs Estimation
In the paper introducing the model [Zhai et al., 2021], the authors provide the GFLOPs for the 224 × 224 and 384 × 384 resolutions (see Table 6), but they use a 518 × 518 resolution for ViT-G fine-tuning, so we assume they use the same resolution for testing too. ViT-G/14 is a vision transformer, so the scaling relation presented in Equation 3 does not apply to this kind of model. However, knowing the GFLOPs for 224 × 224 and 384 × 384, we can calculate how GFLOPs scale with resolution (given that $r_{\sigma}^{2} = (384/224)^{2} = 2.9388$). We calculate the GFLOPs ratio as $2859.9/965.3 = 2.9627$ and observe that GFLOPs scale quadratically with resolution. Note that this paper reports 'real' FLOPs and not multiply-add operations. Therefore, we take $r_{\sigma} = 518/384 = 1.3490$ and multiply the GFLOPs for the 384 × 384 resolution by this scale factor squared, estimating $\approx 5270$ GFLOPs for the ViT-G/14 model.
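The quadratic-scaling check in the paragraph above can be written out explicitly with the Table 6 values:

```python
# GFLOPs for ViT-G/14 at two resolutions, from Zhai et al. [2021]
gflops_224, gflops_384 = 965.3, 2859.9

observed_ratio = gflops_384 / gflops_224  # ~2.9627
quadratic_ratio = (384 / 224) ** 2        # ~2.9388

# the observed growth matches the r^2 prediction to within ~1%,
# supporting the quadratic extrapolation to 518 x 518
relative_error = abs(observed_ratio - quadratic_ratio) / quadratic_ratio
```

Given this agreement, scaling the 384 × 384 figure by $(518/384)^{2}$ is a reasonable extrapolation.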
Table 6: ViT-G/14 GFLOPs from [Zhai et al., 2021].
| Model | GFLOPs (224 × 224) | GFLOPs (384 × 384) |
|----------|-----------|-----------|
| ViT-G/14 | 965.3 | 2859.9 |
sense, we take the ELECTRA-Large GFLOPs as the BERT-Large GFLOPs. For ELMo we take the GLUE dev-set score because we did not find the test-set score (we assume this score should be close to the test-set one). Values are shown in Table 7.
Table 7: NLP models data set. A citation next to the GFLOPs value means that the GFLOPs and Input Tokens values are extracted from that source; otherwise the values are from the paper (cited in the 'Model' column). The symbol ♠ means that the GLUE score was calculated without the WNLI task; the symbol ∗ means that we estimated the value; and ♣ means that the GLUE score is for the GLUE dev set instead of the test set.
| Model | Input Tokens | GFLOPs | Params (M) | Date | GLUE test set |
|------------------------------------|----------------|--------------------------------|--------------|------------|------------------------------|
| Transformer [Vaswani et al., 2017] | 512 | 54 [Gholami et al., 2021b] | 65 | 12/06/2017 | - |
| ELMo [Peters et al., 2018] | 128 | 26 [Clark et al., 2020] | 96 | 15/02/2018 | 71.2 [Clark et al., 2020] ♣ |
| GPT-1 [Radford et al., 2018] | 128 | 30 [Clark et al., 2020] | 117 | 11/06/2018 | 75.1 [Devlin et al., 2019] ♠ |
| BERT Large [Devlin et al., 2019] | 128 | 79 | 335 ∗ | 11/10/2018 | 82.1 ♠ |
| BERT-Small [Devlin et al., 2019] | 128 | 3.7 [Clark et al., 2020] | 14 | 11/10/2018 | - |
| BERT-Base [Devlin et al., 2019] | 128 | 29 [Clark et al., 2020] | 110 | 11/10/2018 | 79.6 ♠ |
| GPT-2 [Radford et al., 2019] | 1024 | 3400 [Gholami et al., 2021b] | 1500 | 14/02/2019 | - |
| Megatron [Shoeybi et al., 2020] | 1024 | 18000 [Gholami et al., 2021b] | 8300 | 17/09/2019 | - |
| ALBERT-xxl [Lan et al., 2020] | 512 | 2500 [Gholami et al., 2021b] | 235 | 26/09/2019 | - |
| ALBERT-base [Lan et al., 2020] | 128 | 22.5 [Iandola et al., 2020] | 12 | 26/09/2019 | - |
| Theseus 6/768 [Xu et al., 2020] | 128 | 11.3 [Iandola et al., 2020] | 66 | 07/02/2020 | 77.1 [Iandola et al., 2020] |
| Microsoft T-NLG [Rosset, 2020] | 1024 | 36000 [Gholami et al., 2021b] | 17000 | 13/02/2020 | - |
| ELECTRA Large [Clark et al., 2020] | 128 | 79 [Gholami et al., 2021b] | 335 | 23/03/2020 | 88.6 ♠ |
| ELECTRA-Small [Clark et al., 2020] | 128 | 3.7 | 14 | 23/03/2020 | 78 ♠ |
| ELECTRA-Base [Clark et al., 2020] | 128 | 29 | 110 | 23/03/2020 | 83.5 ♠ |
| MobileBERT [Sun et al., 2020] | 128 | 5.36 | 25.3 | 06/04/2020 | 78.5 ♠ |
| MobileBERT tiny [Sun et al., 2020] | 128 | 3.1 | 15.1 | 06/04/2020 | 75.8 ♠ |
| GPT-3 [Brown et al., 2020] | 2048 | 740000 [Gholami et al., 2021b] | 175000 | 28/05/2020 | - |
| SqueezeBERT [Iandola et al., 2020] | 128 | 7.42 | 51.1 | 19/06/2020 | 78.1 |
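A common rule of thumb estimates forward-pass compute as roughly 2 FLOPs per parameter per input token. This approximation is our illustration here, not the source of the table's figures (those are cited per row), but it reproduces the GPT-3 entry to within a few percent:

```python
# Rough forward-pass cost: ~2 FLOPs per parameter per token.
# The 2*N*T rule is an illustrative approximation; the table's
# values come from the cited sources, not from this formula.

def forward_gflops(params_millions: float, tokens: int) -> float:
    """Estimate inference GFLOPs as 2 * parameters * tokens."""
    return 2 * params_millions * 1e6 * tokens / 1e9

# GPT-3 row: 175,000M parameters, 2048 input tokens.
est = forward_gflops(175_000, 2048)
print(f"estimated {est:.0f} GFLOPs vs 740000 GFLOPs in the table")
```

The estimate (716,800 GFLOPs) lands within about 3% of the tabulated 740,000 GFLOPs, which is why parameter counts alone are often used as a proxy for inference cost.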
## GPU consumption data
Tables 8 and 9 show further technical details regarding, respectively, the GPUs' theoretical characteristics (compiled from the manufacturers' specification sheets and reference manuals) and their throughput and power consumption, 'adapted' where necessary to the specifics of CV or NLP tasks.
Table 8: Nvidia GPU theoretical data compilation.
| GPU | Precision | TFLOPS | Watts | Launch date | Type | GFLOPS/Watt |
|-------------------------|------------------|--------|-------|-------------|---------|-------------|
| GeForce GTX 580 | FP32 | 1.58 | 244 | 09/11/2010 | Desktop | 6.48 |
| GeForce GTX 590 | FP32 | 2.49 | 365 | 24/03/2011 | Desktop | 6.82 |
| GeForce GTX 680 | FP32 | 3.09 | 195 | 22/03/2012 | Desktop | 15.85 |
| GeForce GTX 690 | FP32 | 5.62 | 300 | 29/04/2012 | Desktop | 18.73 |
| GeForce GTX 780 | FP32 | 4.16 | 250 | 23/04/2013 | Desktop | 16.62 |
| GeForce GTX 780 TI | FP32 | 5.35 | 250 | 07/11/2013 | Desktop | 21.38 |
| GeForce GTX Titan Black | FP32 | 5.65 | 250 | 18/02/2014 | Desktop | 22.58 |
| GeForce GTX Titan Z | FP32 | 8.12 | 375 | 28/05/2014 | Desktop | 21.66 |
| GeForce GTX 980 | FP32 | 4.98 | 165 | 18/09/2014 | Desktop | 30.19 |
| GeForce GTX 980 Ti | FP32 | 6.06 | 250 | 02/06/2015 | Desktop | 24.24 |
| GeForce GTX TITAN X | FP32 | 6.69 | 250 | 17/03/2015 | Desktop | 26.76 |
| GeForce GTX 1080 | FP32 | 8.87 | 180 | 26/05/2016 | Desktop | 49.29 |
| GeForce GTX 1080 Ti | FP32 | 11.34 | 250 | 10/03/2017 | Desktop | 45.36 |
| TITAN X Pascal | FP32 | 10.97 | 250 | 02/08/2016 | Desktop | 43.88 |
| TITAN XP | FP32 | 12.15 | 250 | 06/04/2017 | Desktop | 48.6 |
| GeForce RTX 2080 | FP32 | 10.07 | 215 | 20/09/2018 | Desktop | 46.84 |
| GeForce RTX 2080 Ti | FP32 | 13.45 | 250 | 20/09/2018 | Desktop | 53.8 |
| Nvidia Titan RTX | FP32 | 16.31 | 280 | 18/12/2018 | Desktop | 58.26 |
| GeForce RTX 3080 | FP32 | 29.8 | 320 | 01/09/2020 | Desktop | 93.13 |
| GeForce RTX 3090 | FP32 | 35.6 | 350 | 01/09/2020 | Desktop | 101.71 |
| GeForce RTX 2080 | FP16 | 20.14 | 215 | 20/09/2018 | Desktop | 93.67 |
| GeForce RTX 2080 Ti | FP16 | 26.9 | 250 | 20/09/2018 | Desktop | 107.6 |
| Nvidia Titan RTX | FP16 | 32.62 | 280 | 18/12/2018 | Desktop | 116.5 |
| GeForce RTX 3080 | FP16 | 29.8 | 320 | 01/09/2020 | Desktop | 93.13 |
| GeForce RTX 3090 | FP16 | 35.6 | 350 | 01/09/2020 | Desktop | 101.71 |
| GeForce RTX 2080 | FP16/FP32 Tensor | 40.3 | 215 | 20/09/2018 | Desktop | 187.44 |
| GeForce RTX 2080 Ti | FP16/FP32 Tensor | 56.9 | 250 | 20/09/2018 | Desktop | 227.6 |
| Nvidia Titan RTX | FP16/FP32 Tensor | 130.5 | 280 | 18/12/2018 | Desktop | 466.07 |
| GeForce RTX 3080 | FP16/FP32 Tensor | 59.5 | 320 | 01/09/2020 | Desktop | 185.94 |
| GeForce RTX 3090 | FP16/FP32 Tensor | 71 | 350 | 01/09/2020 | Desktop | 202.86 |
| Tesla K10 | FP32 | 4.58 | 225 | 01/05/2012 | Server | 20.36 |
| Tesla K20x | FP32 | 3.94 | 235 | 12/11/2012 | Server | 16.74 |
| Tesla K40 | FP32 | 5.04 | 235 | 08/10/2013 | Server | 21.45 |
| Tesla K80 | FP32 | 8.22 | 300 | 17/10/2014 | Server | 27.4 |
| Tesla M40 | FP32 | 6.84 | 250 | 10/10/2015 | Server | 27.36 |
| Tesla M60 | FP32 | 9.65 | 300 | 30/08/2015 | Server | 32.17 |
| Tesla P100 | FP16 | 21.2 | 300 | 20/05/2016 | Server | 70.67 |
| Tesla V100 | FP16 | 31.4 | 300 | 27/03/2018 | Server | 104.67 |
| A100 | FP16 | 78 | 400 | 14/04/2020 | Server | 195 |
| Tesla P100 | FP32 | 10.6 | 300 | 20/05/2016 | Server | 35.33 |
| Tesla V100 | FP32 | 15.7 | 300 | 27/03/2018 | Server | 52.33 |
| A100 | FP32 | 19.5 | 400 | 14/04/2020 | Server | 48.75 |
| A30 | FP32 | 10.3 | 165 | 12/04/2021 | Server | 62.42 |
| Tesla V100 | FP16/FP32 Tensor | 125 | 300 | 27/03/2018 | Server | 416.67 |
| A100 | FP16/FP32 Tensor | 312 | 400 | 14/04/2020 | Server | 780 |
| A30 | FP16/FP32 Tensor | 165 | 165 | 12/04/2021 | Server | 1000 |
| T4 | FP32 | 8.1 | 70 | 13/09/2018 | Server | 115.71 |
| T4 | FP16/FP32 Tensor | 65 | 70 | 13/09/2018 | Server | 928.57 |
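The last column of Table 8 is derived directly from the two before it: GFLOPS/Watt = (TFLOPS × 1000) / Watts. A minimal check against two rows of the table (values copied from the rows above):

```python
def gflops_per_watt(tflops: float, watts: float) -> float:
    """Efficiency in GFLOPS per watt from peak TFLOPS and board power (TDP)."""
    return tflops * 1000 / watts

# Rows from Table 8:
print(round(gflops_per_watt(65, 70), 2))  # T4, FP16/FP32 Tensor -> 928.57
print(round(gflops_per_watt(312, 400)))   # A100, FP16/FP32 Tensor -> 780
```

Note that these use peak theoretical throughput and rated board power, so they are upper bounds on real-world efficiency.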
Table 9: GPU throughput and power consumption data compilation.
| Adapted | GPU | Precision | TFLOPS | Watts | Launch date | Type | GFLOPS/Watt |
|-----------|-------------------------|-------------|--------|-------|-------------|---------|-------------|
| | GeForce GTX 580 | FP32 | 1.58 | 244 | 09/11/2010 | Desktop | 6.48 |
| | GeForce GTX 590 | FP32 | 2.49 | 365 | 24/03/2011 | Desktop | 6.82 |
| | GeForce GTX 680 | FP32 | 3.09 | 195 | 22/03/2012 | Desktop | 15.85 |
| | GeForce GTX 690 | FP32 | 5.62 | 300 | 29/04/2012 | Desktop | 18.73 |
| | Tesla K10 | FP32 | 4.58 | 225 | 01/05/2012 | Server | 20.36 |
| | Tesla K20x | FP32 | 3.94 | 235 | 12/11/2012 | Server | 16.77 |
| | GeForce GTX 780 | FP32 | 4.16 | 250 | 23/04/2013 | Desktop | 16.64 |
| | Tesla K40 | FP32 | 5.04 | 235 | 08/10/2013 | Server | 21.45 |
| | GeForce GTX 780 TI | FP32 | 5.35 | 250 | 07/11/2013 | Desktop | 21.4 |
| | GeForce GTX Titan Black | FP32 | 5.65 | 250 | 18/02/2014 | Desktop | 22.6 |
| | GeForce GTX Titan Z | FP32 | 8.12 | 375 | 28/05/2014 | Desktop | 21.65 |
| | GeForce GTX 980 | FP32 | 4.98 | 165 | 18/09/2014 | Desktop | 30.18 |
| | Tesla K80 | FP32 | 8.22 | 300 | 17/10/2014 | Server | 27.4 |
| No | GeForce GTX TITAN X | FP32 | 6.69 | 250 | 17/03/2015 | Desktop | 26.76 |
| No | GeForce GTX 980 Ti | FP32 | 6.06 | 250 | 02/06/2015 | Desktop | 24.24 |
| No | Tesla M60 | FP32 | 9.65 | 300 | 30/08/2015 | Server | 32.17 |
| No | Tesla M40 | FP32 | 6.84 | 250 | 10/10/2015 | Server | 27.36 |
| No | GeForce GTX 1080 | FP32 | 8.87 | 180 | 26/05/2016 | Desktop | 49.28 |
| No | TITAN X Pascal | FP32 | 10.97 | 250 | 02/08/2016 | Desktop | 43.88 |
| No | GeForce GTX 1080 Ti | FP32 | 11.34 | 250 | 10/03/2017 | Desktop | 45.36 |
| No | TITAN XP | FP32 | 12.15 | 250 | 06/04/2017 | Desktop | 48.6 |
| No | Tesla V100 | FP32 | 15.7 | 300 | 27/03/2018 | Server | 52.33 |
| No | Tesla T4 | FP32 | 8.1 | 70 | 13/09/2018 | Server | 115.71 |
| No | GeForce RTX 2080 | FP32 | 10.07 | 215 | 20/09/2018 | Desktop | 46.84 |
| No | GeForce RTX 2080 Ti | FP32 | 13.45 | 250 | 20/09/2018 | Desktop | 53.8 |
| No | Nvidia Titan RTX | FP32 | 16.31 | 280 | 18/12/2018 | Desktop | 58.25 |
| No | GeForce RTX 3080 | FP32 | 29.8 | 320 | 01/09/2020 | Desktop | 93.13 |
| No | GeForce RTX 3090 | FP32 | 35.6 | 350 | 01/09/2020 | Desktop | 101.71 |
| For CNN | Tesla V100 | Mixed | 35.71 | 300 | 27/03/2018 | Server | 119.03 |
| For CNN | Tesla T4 | Mixed | 21.85 | 70 | 13/09/2018 | Server | 312.15 |
| For CNN | A100 | TF32 | 27.41 | 400 | 14/04/2020 | Server | 68.52 |
| For CNN | A100 | Mixed | 52.35 | 400 | 14/04/2020 | Server | 130.88 |
| For NLP | Tesla V100 | Mixed | 41.44 | 300 | 27/03/2018 | Server | 138.13 |
| For NLP | Tesla T4 | Mixed | 25.58 | 70 | 13/09/2018 | Server | 365.46 |
| For NLP | A100 | TF32 | 55.85 | 400 | 14/04/2020 | Server | 139.64 |
| For NLP | A100 | Mixed | 73.29 | 400 | 14/04/2020 | Server | 183.23 |
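Combining the model table with these adapted figures gives a back-of-the-envelope energy cost per inference: joules ≈ model GFLOPs / (GPU GFLOPS/Watt), since 1 GFLOPS/Watt equals 1 GFLOP per joule. The pairing of BERT-Base with an NLP-adapted V100 below is our illustrative choice, assuming the GPU sustains its adapted throughput at rated power:

```python
def joules_per_inference(model_gflops: float, gpu_gflops_per_watt: float) -> float:
    """Energy per forward pass: FLOPs divided by FLOPs-per-joule.
    (1 GFLOPS/Watt = 1 GFLOP per joule.)"""
    return model_gflops / gpu_gflops_per_watt

# BERT-Base (29 GFLOPs, NLP model table) on a Tesla V100 adapted
# for NLP (Mixed precision, 138.13 GFLOPS/Watt, Table 9):
e = joules_per_inference(29, 138.13)
print(f"{e:.2f} J per inference")
```

Under these assumptions a single BERT-Base forward pass costs on the order of 0.2 J, which is small per query but, as the paper stresses, is multiplied by the number of inferences served.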