## Compute and Energy Consumption Trends in Deep Learning Inference
## Radosvet Desislavov
VRAIN, Universitat Politècnica de València, Spain. radegeo@inf.upv.es
## Fernando Martínez-Plumed
European Commission, Joint Research Centre. fernando.martinez-plumed@ec.europa.eu
VRAIN, Universitat Politècnica de València, Spain. fmartinez@dsic.upv.es
## José Hernández-Orallo
VRAIN, Universitat Politècnica de València, Spain. jorallo@upv.es
## Abstract
The progress of some AI paradigms, such as deep learning, is said to be linked to an exponential growth in the number of parameters. There are many studies corroborating these trends, but does this translate into an exponential increase in energy consumption? To answer this question we focus on inference costs rather than training costs, as the former account for most of the computing effort, solely because of the multiplicative factors. Also, apart from algorithmic innovations, we account for more specific and powerful hardware (leading to higher FLOPS) that usually comes with important energy efficiency optimisations. We also move the focus from the first implementation of a breakthrough paper to the consolidated versions of the technique one or two years later. Under this distinctive and comprehensive perspective, we study relevant models in the areas of computer vision and natural language processing: for a sustained increase in performance we see a much softer growth in energy consumption than previously anticipated. The only caveat is, yet again, the multiplicative factor, as future AI increases its penetration and becomes more pervasive.
## Introduction
As Deep Neural Networks (DNNs) become more widespread in all kinds of devices and situations, what is the associated energy cost? In this work we explore the evolution of different metrics of deep learning models, paying particular attention to inference computational cost and its associated energy consumption. The full impact, and its final carbon footprint, not only depends on the internalities (hardware and software directly involved in their operation) but also on the externalities (all social and economic activities around it). From the AI research community, we have more to say and do about the former. Accordingly, more effort is needed, within AI, to better account for the internalities, as we do in this paper.
For a revised version and its published version refer to:
Desislavov, Radosvet, Fernando Martínez-Plumed, and José Hernández-Orallo. 'Trends in AI inference energy consumption: Beyond the performance-vs-parameter laws of deep learning'. Sustainable Computing: Informatics and Systems, Volume 38, April 2023. (DOI: https://doi.org/10.1016/j.suscom.2023.100857)
In our study, we differentiate between training and inference. At first glance, training seems to be the more costly phase. However, for deployed systems, inference costs exceed training costs because of the multiplicative factor of using the system many times [Martinez-Plumed et al., 2018]. Training, even if it involves repetitions, is done once, but inference is done repeatedly. It is estimated that inference accounts for up to 90% of the costs [Thomas, 2020]. There are several studies about training computation and its environmental impact [Amodei and Hernandez, 2018, Gholami et al., 2021a, Canziani et al., 2017, Li et al., 2016, Anthony et al., 2020, Thompson et al., 2020], but very few focus on inference costs and their associated energy consumption.
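The multiplicative factor can be made concrete with a back-of-the-envelope sketch: the lifetime compute of a deployed model is a one-off training cost plus a per-inference cost multiplied by the number of requests served. All the numbers below are made up for illustration, not taken from the studies cited above.

```python
# Sketch (illustrative, not from the paper's data): fraction of a
# deployed model's lifetime compute that goes to inference.

def inference_share(training_flops: float,
                    inference_flops: float,
                    n_inferences: int) -> float:
    """Fraction of total lifetime FLOPs spent on inference."""
    total_inference = inference_flops * n_inferences
    return total_inference / (training_flops + total_inference)

# Hypothetical model: training costs 10^9 times one forward pass.
# After 10^10 served requests, inference dominates lifetime compute.
share = inference_share(training_flops=1e9, inference_flops=1.0,
                        n_inferences=10_000_000_000)
print(f"{share:.0%}")  # 91%
```

Under these (assumed) magnitudes, inference passes the 90% mark cited above once the number of requests exceeds the training cost by roughly an order of magnitude.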
DNNs are deployed almost everywhere [Balas et al., 2019], from smartphones to automobiles, each with its own compute, temperature and battery limitations. Precisely because of this, there has been pressure to build DNNs that are less resource-demanding, even if larger DNNs usually outperform smaller ones. As an alternative to this on-device use, many larger DNNs run in data centres, with people accessing them repeatedly and transparently, e.g., when using social networks [Park et al., 2018]. Millions of requests imply millions of inferences over the same DNN.
Many studies report that the size of neural networks is growing exponentially [Xu et al., 2018, Bianco et al., 2018]. However, this does not necessarily imply that the cost is also growing exponentially: more weights can be implemented with the same amount of energy, partly due to hardware specialisation and especially because the energy consumption per unit of compute is decreasing. There is also the question of whether the changing costs of energy and their carbon footprint [EEA, 2021] should be added to the equation. Finally, many studies focus on the state-of-the-art (SOTA) or cutting-edge methods according to a given performance metric, but many algorithmic improvements usually arrive in the months or few years after a new technique is introduced, in the form of general-use implementations achieving similar results with much lower compute requirements. All these elements have been studied separately, but a more comprehensive and integrated analysis is necessary to properly evaluate whether the impact of AI on energy consumption and its carbon footprint is alarming or simply worrying, in order to calibrate the measures to be taken in the following years and estimate the effect in the future.
For conducting our analysis we chose two representative domains: Computer Vision (CV) and Natural Language Processing (NLP). For CV we analysed image classification models, and ImageNet [Russakovsky et al., 2015] more specifically, because there is a great quantity of historical data in this area and many advances in this domain are normally brought to other computer vision tasks, such as object detection, semantic segmentation, action recognition, or video classification, among others. For NLP we analysed results for the General Language Understanding Evaluation (GLUE) benchmark [Wang et al., 2019], since language understanding is a core task in NLP.
We focus our analysis on inference FLOPs (floating point operations) required to process one input item (an image or text fragment). We collect inference FLOPs for many different DNN architectures following a comprehensive literature review. Since hardware manufacturers have been working on specific chips for DNNs, adapting the hardware to a specific use case leads to performance and efficiency improvements. We collect hardware data over recent years and estimate how many FLOPs can be obtained with one Joule on each chip. With all this data we finally estimate how much energy is needed to perform one inference step with a given DNN. Our main objective is to study the evolution of the energy required for one prediction over the years.
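The estimation pipeline just described can be sketched in a few lines: a model's per-inference FLOPs divided by the hardware's FLOPs-per-Joule gives an energy per prediction. The GPU figures below are illustrative assumptions, not measured values.

```python
# Sketch of the energy estimate: per-inference FLOPs combined with a
# chip's efficiency (peak FLOPS / TDP). Assumes full utilisation, so
# this is an optimistic bound, as discussed later in the paper.

def joules_per_inference(model_gflops: float,
                         gpu_peak_gflops: float,
                         gpu_tdp_watts: float) -> float:
    """Energy (J) for one forward pass at full GPU utilisation."""
    gflops_per_joule = gpu_peak_gflops / gpu_tdp_watts  # GFLOPS/W = GFLOPs/J
    return model_gflops / gflops_per_joule

# Hypothetical case: a 1.42 GFLOPs forward pass (AlexNet-sized) on a
# GPU with 10,000 peak GFLOPS and a 250 W TDP (assumed specs).
energy = joules_per_inference(1.42, 10_000, 250)
print(round(energy, 4))  # 0.0355 (Joules per prediction)
```

The same function applied across years of models and GPUs is what drives the trend analysis in the following sections.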
The main findings and contributions of this paper are to (1) show that better results for DNN models are partly attributable to algorithmic improvements and not only to more computing power; (2) determine how much hardware improvement and specialisation are decreasing DNN energy consumption; and (3) report that, while energy consumption is still increasing exponentially for new cutting-edge models, DNN inference energy consumption could be kept low for increasing performance if the efficient models that come relatively soon after each breakthrough are selected.
We provide all collected data and estimations as a data set, publicly available in the appendices and as a GitHub repository 1 . The rest of the paper covers the background, introduces the methodology, presents the analysis of hardware and energy consumption of DNN models, and expounds on some forecasts. Discussion and future work close the paper.
1 Temporary copy in: https://bit.ly/3DTHvFC
## Background
In line with other areas of computer science, there is some previous work that analyses compute and its cost for AI, and DNNs more specifically. Recently, OpenAI carried out a detailed analysis about AI efficiency [Hernandez and Brown, 2020], focusing on the amount of compute used to train models with the ImageNet dataset. They show that 44 times less compute was required in 2020 to train a network with the performance AlexNet achieved seven years before.
However, given the demand for better task performance, linked with more complex DNNs and larger volumes of data to be processed, the demand for AI compute is still growing fast. [Thompson et al., 2020] report the computational demands of several deep learning applications, showing that progress in them is strongly reliant on increases in computing power. The compute used by AI models has doubled every 3.4 months since 2012 [Amodei and Hernandez, 2018]. The study [Gholami et al., 2021a] declares similar scaling rates for AI training compute to [Amodei and Hernandez, 2018] and forecasts that DNN memory requirements will soon become a problem. This exponential trend seems to impose a limit on how far we can improve performance in the future without a paradigm change.
Compared to training costs, there are fewer studies on inference costs, despite inference accounting for a far larger share of compute and energy. Canziani et al. (2017) study the accuracy, memory footprint, parameters, operation counts, inference time and power consumption of 14 ImageNet models. To measure power consumption they execute the DNNs on an NVIDIA Jetson TX1 board. A similar study [Li et al., 2016] measures energy efficiency, in Joules per image, for a single forward and backward propagation iteration (a training step). This study benchmarks 4 Convolutional Neural Networks (CNNs) on CPUs and GPUs under different frameworks, showing that GPUs are more efficient than CPUs for the CNNs analysed. Both publications analyse model efficiency, but for very specific cases. We analyse a greater number of DNNs and hardware components over a longer time frame.
These and other papers are key in helping society and AI researchers realise the issues around efficiency and energy consumption. Strubell et al. (2019) estimate the energy consumption, cost and CO2 emissions of training several of the most popular NLP models. Henderson et al. (2020) perform a systematic reporting of the energy and carbon footprints of reinforcement learning algorithms. Bommasani et al. (2021) (section 5.3) seek to identify the assumptions that shape the calculus of environmental impact for foundation models. Schwartz et al. (2019) analyse training costs and propose that researchers should pay more attention to efficiency and always report the number of FLOPs. These studies contribute to a better assessment of the problem and more incentives for its solution. For instance, new algorithms and architectures such as EfficientNet [Tan and Le, 2020] and EfficientNetV2 [Tan and Le, 2021] have aimed at this reduction in compute.
When dealing with computing effort and computing speed (hardware performance), terminology is often confusing. For instance, the term 'compute' is used ambiguously, sometimes referring to the number of operations and sometimes to the number of operations per second. It is therefore important to clarify what kind of operations are meant and the acronyms for them. We will use the acronym FLOPS to measure hardware performance, referring to the number of floating point operations per second, as standardised in the industry, while FLOPs will denote the amount of computation for a given task (e.g., a prediction or inference pass), referring to the number of operations, counting a multiply-add operation pair as two operations. An extended discussion can be found in the appendix.
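The FLOPs convention just described (a multiply-add pair counted as two operations) can be sketched for two common layer types. The layer shapes below are illustrative, loosely based on AlexNet-sized layers, and are our own worked example rather than the paper's counting code.

```python
# Sketch of per-layer FLOPs counting, with a multiply-add (MAC) pair
# counted as two floating point operations.

def dense_layer_flops(in_features: int, out_features: int) -> int:
    """FLOPs for one pass through a fully connected layer: each output
    needs `in_features` multiplies and `in_features` adds."""
    macs = in_features * out_features
    return 2 * macs  # MAC pair counted as two operations

def conv2d_flops(h_out: int, w_out: int, c_in: int, c_out: int, k: int) -> int:
    """FLOPs for a k x k convolution producing an h_out x w_out x c_out map."""
    macs = h_out * w_out * c_out * (k * k * c_in)
    return 2 * macs

# AlexNet-like shapes (assumed for illustration):
print(dense_layer_flops(4096, 1000))    # 8192000  (~0.008 GFLOPs)
print(conv2d_flops(55, 55, 3, 96, 11))  # 210830400 (~0.21 GFLOPs)
```

Note that some sources report MACs (or "multiply-accumulate operations") directly, which halves these figures; this is one of the discrepancies the appendix discusses.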
## Methodology
We collect most of our information directly from research papers that report results, compute and other data for one or more newly introduced techniques for the benchmarks and metrics we cover in this work. We manually read and inspected each original paper and frequently explored the official GitHub repository, if it exists. However, information is often missing from these sources, so we need to get the data from others, namely:
- Related papers : usually the authors of another paper that introduces a new model compare it with previously existing models, providing further information.
- Model implementations : PyTorch [Paszke et al., 2016] contains many (pre-trained) models, and their performance is reported. Other projects do the same (see, e.g., [Cadene, 2016, Sémery, 2019]).
- Existing data compilations : there are some projects and public databases collecting information about deep learning architectures and their benchmarks, e.g., [Albanie, 2016, Coleman et al., 2017, Mattson et al., 2020, Gholami et al., 2021b, Stojnic and Taylor, 2021].
- Measuring tools : when no other source was available or reliable, we used the ptflops library [Sovrasov, 2020] or similar tools to calculate the model's FLOPs and parameters (when the implementation is available).
Given this general methodology, we now discuss in more detail how we made the selection of CV and NLP models, and the information about hardware.
## CV Models Data Compilation
There is a huge number of models for image classification, so we selected models based on two criteria: popularity and accuracy. For popularity we looked at the number of times the paper presenting the model is cited on Google Scholar and whether the model is mentioned in other papers (e.g., for comparative analyses). We also focused on accuracy, because having the best models per year in terms of accuracy is necessary for analysing progress. To achieve this we used existing compilations [Stojnic and Taylor, 2021] and filtered by year and accuracy. For our selection, accuracy was more important than popularity for recent models, as they are less cited than older ones simply because they have been published for a shorter time. Once we selected the image classification models, we collected the following information: Top-1 accuracy on ImageNet, number of parameters, FLOPs per forward pass, release date and training dataset. Further details about model selection, FLOPs estimation, image cropping [Krizhevsky et al., 2012] and resolution [Simonyan and Zisserman, 2015, Zhai et al., 2021] can be found in the Appendix (and Table 2).
## NLP Models Data Compilation
For NLP models we noted that there is much less information about inference (e.g., FLOPs), and the number of models for which we can get the required information is smaller than for CV. We chose GLUE for being sufficiently representative and having its value determined for a good number of architectures. To keep the numbers high we included all the models since 2017 for which we found an inference compute estimation [Clark et al., 2020]. Further details about FLOPs estimation and counting can be found in the Appendix (selected models in Table 7).
## Hardware Data Compilation
Regarding hardware evolution, we collected data for Nvidia GPUs 2 . We chose Nvidia GPUs because they represent one of the most efficient hardware platforms for DNNs 3 and they have been used for deep learning over the last 10 years, giving us a good temporal window for exploration. In particular, we collected data for Nvidia GPUs from 2010 to 2021: FLOPS, memory size, power consumption (reported as Thermal Design Power, TDP) and launch date. As explained before, FLOPS is a measure of computer performance. From the FLOPS and power consumption we calculate the efficiency, dividing peak FLOPS by TDP. This means we are considering the efficiency (FLOPS/Watt) when the GPU is at full utilisation. In practice the efficiency may vary depending on the workload, but we consider this estimate ('peak FLOPS'/TDP) accurate enough for analysing trends and approximating energy consumption. Our compilation includes desktop GPUs and server GPUs. We pay special attention to server GPUs released in recent years, because they are more common for AI, and DNNs in particular. A discussion about discrepancies between theoretical and real FLOPS, as well as issues regarding floating point (FP) precision, can be found in the Appendix.
2 https://developer.nvidia.com/deep-learning
3 We considered Google's TPUs (https://cloud.google.com/tpu?hl=en) for the analysis but there is not enough public information about them, as they are not sold but only available as a service.
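The efficiency figure used throughout ('peak FLOPS'/TDP) is a one-line computation. The FP32 specs below are approximate publicly reported figures for a few well-known GPUs, included only to illustrate the calculation, not as part of the paper's compilation.

```python
# Sketch of the GFLOPS/Watt efficiency estimate at full utilisation.
# GPU specs are approximate FP32 figures, for illustration only.
gpus = {
    # name: (approx. peak FP32 GFLOPS, TDP in watts)
    "GTX 580 (2010)":    (1_581, 244),
    "Tesla V100 (2017)": (15_700, 300),
    "A100 (2020)":       (19_500, 400),
}

efficiency = {name: gflops / tdp for name, (gflops, tdp) in gpus.items()}
for name, eff in efficiency.items():
    print(f"{name}: {eff:.1f} GFLOPS/W at full utilisation")
```

Even under these rough FP32 numbers the efficiency gain over a decade is close to an order of magnitude; gains at lower precisions and with tensor cores, discussed in the Appendix, are larger still.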
## Computer Vision Analysis
In this section we analyse the evolution of ImageNet [Deng et al., 2009] models in terms of performance and compute (one inference pass). Further details can be found in the Appendix.
## Number of Parameters and FLOPs
The number of parameters is usually reported, but it is not directly proportional to compute. For instance, in CNNs, convolution operations dominate the computation: if d , w and r represent the network's depth, width and input resolution, the FLOPs grow following the relation [Tan and Le, 2020]:
$$\mathit{FLOPs} \propto d \cdot w^{2} \cdot r^{2}$$
This means that FLOPs do not directly depend on the number of parameters. Parameters affect network depth ( d ) or width ( w ), but distributing the same number of parameters in different ways will result in different numbers of FLOPs. Moreover, the resolution ( r ) does not depend on the number of parameters directly, because the input resolution can be increased without increasing network size.
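A minimal numeric check of this relation, assuming the multiplicative form from [Tan and Le, 2020], shows why depth, width and resolution affect compute differently:

```python
# Illustrative check of FLOPs ∝ d * w^2 * r^2 (depth d, width w,
# input resolution r), relative to a baseline with d = w = r = 1.

def relative_flops(d: float, w: float, r: float) -> float:
    return d * w**2 * r**2

base = relative_flops(1, 1, 1)
print(relative_flops(2, 1, 1) / base)  # doubling depth doubles FLOPs: 2.0
print(relative_flops(1, 2, 1) / base)  # doubling width quadruples them: 4.0
print(relative_flops(1, 1, 2) / base)  # doubling resolution likewise: 4.0
```

In particular, two networks with the same parameter count can differ in FLOPs: parameters scale roughly with d and w^2 but not with r, so raising the input resolution increases FLOPs at constant network size.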
Figure 1: Relation between the number of parameters and FLOPs (both axes are logarithmic).
Despite this, Fig. 1 shows a linear relation between FLOPs and parameters. We attribute this to the balanced scaling of w , d and r : these dimensions are usually scaled together, with bigger CNNs using higher resolutions. Note that recent transformer models [Vaswani et al., 2017] do not follow the growth relation presented above. However, the correlation between the number of parameters and FLOPs is 0.772 for CNNs and 0.994 for transformers (Fig. 1). This suggests that in both architectures parameters and FLOPs usually scale in tandem. We will use FLOPs, as they allow us to estimate the energy needed by relating hardware FLOPS with the FLOPs required by a model [Hollemans, 2018, Clark et al., 2020].
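A correlation of this kind is a plain Pearson computation on log-transformed values (matching the logarithmic axes of Fig. 1). The small data set below is synthetic, for illustration only; the paper's actual model data is in the repository.

```python
# Sketch of the parameters-vs-FLOPs correlation on log-log axes.
# The data points are synthetic stand-ins, not the collected models.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

params_m = [2, 5, 10, 25, 60, 150, 400]       # millions of parameters
gflops   = [0.3, 1.0, 2.5, 7.0, 20, 60, 180]  # GFLOPs per forward pass

r = pearson([math.log(p) for p in params_m],
            [math.log(g) for g in gflops])
print(round(r, 3))
```

For this synthetic data the correlation comes out close to 1, as it does for the transformers in Fig. 1; the looser 0.772 figure for CNNs reflects more varied trade-offs between d, w and r.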
## Performance and Compute
There has been very significant progress on ImageNet. In 2012, AlexNet achieved 56% Top-1 accuracy (single model, one crop). In 2021, Meta Pseudo Labels (EfficientNet-L2) achieved 90.2% Top-1 accuracy (single model, one crop). However, this increase in accuracy comes with an increase in the FLOPs required for a forward pass: 1.42 GFLOPs for AlexNet versus 1040 GFLOPs for EfficientNet-L2 (details in the appendix).
Fig. 2 shows the evolution of ImageNet accuracy from 2012 to 2021 (with the size of the bubbles representing the FLOPs of one forward pass). In recent papers, some researchers have begun using more data than is available in ImageNet-1k for training. However, extra data only affects training FLOPs; it does not affect the computational cost of inferring each classification (the forward pass).
If we only look at models with the best accuracy for each year we can see an exponential growth in compute (measured in FLOPs). This can be observed clearly in Fig. 3: the dashed line represents an exponential growth (shown as a linear fit since the y -axis is logarithmic). The line is fitted with
Figure 2: Accuracy evolution over the years. The size of the bubbles represents the GFLOPs of one forward pass.
Figure 3: GFLOPs over the years. The dashed line is a linear fit (note the logarithmic y -axis) for the models with highest accuracy per year. The solid line includes all points.
the models with highest accuracy for each year. However, not all models released in recent years require so much compute, as reflected by the solid line, which includes all points. We also see that, for the same number of FLOPs, models achieve increasing accuracy as time goes by.
Table 1 lists models with a similar number of FLOPs to AlexNet. In 2019 we have a model (EfficientNet-B1) with the same number of operations as AlexNet achieving a Top-1 accuracy of 79.1% without using extra data, and a model (NoisyStudent-B1) achieving a Top-1 accuracy of 81.5% using extra data. In a period of seven years, models with similar computational cost have reached much higher accuracy. We observe that when a SOTA model is released it usually has a huge number of FLOPs, and therefore consumes a large amount of energy, but within a couple of years a model appears with similar accuracy and a much lower number of FLOPs. These are usually the models that become popular in many industry applications. This observation confirms that better results for DNN models of general use are partly attributable to algorithmic improvements and not only to the use of more computing power.
Finally, Fig. 4 shows that the Pareto frontier (in grey) is composed of new models (in yellow and green), whereas old models (in purple and dark blue) are relegated below the frontier. As expected, the models that use extra data are normally those forming the Pareto frontier. Note again that extra training data does not affect inference GFLOPs.
| Model | Top-1 Accuracy | GFLOPs | Year |
|----------------------------------------|------------------|----------|--------|
| AlexNet [Krizhevsky et al., 2012] | 56.52 | 1.42 | 2012 |
| ZFNet [Zeiler and Fergus, 2013] | 60.21 | 2.34 | 2013 |
| GoogLeNet [Szegedy et al., 2014] | 69.77 | 3 | 2014 |
| MobileNet [Howard et al., 2017] | 70.6 | 1.14 | 2017 |
| MobileNetV2 1.4 [Sandler et al., 2019] | 74.7 | 1.18 | 2018 |
| EfficientNet-B1 [Tan and Le, 2020] | 79.1 | 1.4 | 2019 |
| NoisyStudent-B1 [Xie et al., 2020] | 81.5 | 1.4 | 2019 |
Table 1: Results for several DNNs with a similar number of FLOPs as AlexNet.
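As a quick illustration of the efficiency gains discussed above, the following sketch (plain Python; the numbers are copied from Table 1) computes Top-1 accuracy points per inference GFLOP for each model:

```python
# Accuracy per inference GFLOP for the models in Table 1.
# (name, year, Top-1 accuracy %, inference GFLOPs)
models = [
    ("AlexNet",         2012, 56.52, 1.42),
    ("ZFNet",           2013, 60.21, 2.34),
    ("GoogLeNet",       2014, 69.77, 3.00),
    ("MobileNet",       2017, 70.60, 1.14),
    ("MobileNetV2 1.4", 2018, 74.70, 1.18),
    ("EfficientNet-B1", 2019, 79.10, 1.40),
    ("NoisyStudent-B1", 2019, 81.50, 1.40),
]

ratios = {name: top1 / gflops for name, _, top1, gflops in models}
for name, year, top1, gflops in models:
    print(f"{year}  {name:<16} {ratios[name]:5.1f} accuracy points per GFLOP")
```

For the ~1.4-GFLOP models the ratio grows from about 39.8 (AlexNet, 2012) to about 58.2 (NoisyStudent-B1, 2019), matching the observation that accuracy improves at near-constant inference cost.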
<details>
<summary>Image 4 Details</summary>

### Visual Description
## Scatter Plot: Top-1 Accuracy vs. GFLOPS
### Overview
This image presents a scatter plot illustrating the relationship between GFLOPs (billions of floating-point operations per inference) and Top-1 Accuracy (in percentage) for machine learning models across different years. The data points are color-coded by year and shaped by whether or not they include "extra data". A grey line represents a trendline through the data.
### Components/Axes
* **X-axis:** GFLOPs, ranging from approximately 0.5 to 1000, on a logarithmic scale. Axis label: "GFLOPs".
* **Y-axis:** Top-1 Accuracy (%), ranging from approximately 65% to 92%. Axis label: "Top-1 Accuracy (%)".
* **Color Legend (Top-Right):** Represents the year of the data point.
* 2021: Yellow
* 2019: Light Green
* 2017: Green
* 2015: Teal
* 2013: Purple
* **Shape Legend (Bottom-Right):** Indicates whether the data point includes "extra data".
* No: Circle
* Yes: Triangle
### Detailed Analysis
The plot shows a general trend of increasing Top-1 Accuracy with increasing GFLOPS. The grey trendline confirms this, sloping upwards from the bottom-left to the top-right.
**Data Point Analysis (Approximate values based on visual estimation):**
* **2013 (Purple):**
* Around 1 GFLOPS: ~70% Accuracy
* Around 10 GFLOPS: ~74% Accuracy
* Around 100 GFLOPS: ~78% Accuracy
* **2015 (Teal):**
* Around 1 GFLOPS: ~72% Accuracy
* Around 10 GFLOPS: ~78% Accuracy
* Around 100 GFLOPS: ~82% Accuracy
* **2017 (Green):**
* Around 1 GFLOPS: ~75% Accuracy
* Around 10 GFLOPS: ~80% Accuracy
* Around 100 GFLOPS: ~84% Accuracy
* **2019 (Light Green):**
* Around 1 GFLOPS: ~78% Accuracy
* Around 10 GFLOPS: ~82% Accuracy
* Around 100 GFLOPS: ~86% Accuracy
* **2021 (Yellow):**
* Around 10 GFLOPS: ~84% Accuracy
* Around 100 GFLOPS: ~88% Accuracy
* Around 1000 GFLOPS: ~91% Accuracy
**Shape Analysis:**
* **Circles (No Extra Data):** Predominantly represent data from earlier years (2013-2019). There is a cluster of circles around 10 GFLOPS with accuracy ranging from 74% to 82%.
* **Triangles (Yes Extra Data):** Primarily represent data from later years (2019-2021). Triangles generally exhibit higher accuracy for a given GFLOPS value compared to circles.
### Key Observations
* The trendline suggests diminishing returns: the increase in accuracy slows down as GFLOPS increase.
* Models with "extra data" (triangles) consistently achieve higher accuracy than those without (circles) for the same computational cost (GFLOPS).
* Accuracy has improved significantly over time, even for models with the same GFLOPS.
* There is a noticeable gap in data points between approximately 100 and 1000 GFLOPS, particularly for earlier years.
### Interpretation
The data demonstrates a clear correlation between computational power (GFLOPS) and model accuracy. However, the diminishing returns observed at higher GFLOPS values suggest that simply increasing computational resources is not a sustainable path to continuous improvement. The inclusion of "extra data" appears to be a significant factor in boosting accuracy, indicating the importance of data quality and quantity. The temporal trend shows that advancements in algorithms and model architectures, alongside increased computational power, have led to substantial gains in accuracy over the years. The gap in data points at higher GFLOPS values could indicate a limitation in the availability or cost of running models at that scale, or a point of diminishing returns where further increases in GFLOPS yield only marginal improvements in accuracy. The visualization suggests that the field is approaching a point where algorithmic innovation and data optimization are becoming more crucial than simply scaling up computational resources.
</details>
Figure 4: Relation between accuracy and GFLOPs.
## Natural Language Analysis
In this section, we analyse the trends in performance and inference compute for NLP models. To analyse performance we use GLUE, a popular benchmark for natural language understanding, one of the key tasks in NLP. The GLUE benchmark⁴ is composed of nine sentence-understanding tasks covering a broad range of domains. A description of each task can be found in [Wang et al., 2019].
## Performance and Compute
We represent the improvement on the GLUE score in relation to GFLOPs over the years in Fig. 5 (and in Fig. 15 in the Appendix). GFLOPs are computed for a single input of length 128, a reasonable sequence length for many use cases, as it can fit text messages or short emails. We observe a very similar evolution to that seen in ImageNet: SOTA models require a large number of FLOPs, but in a short period of time other models appear that require much fewer FLOPs to reach the same score. Many models focus on being efficient instead of reaching the highest score, and this is reflected in their names (e.g., MobileBERT [Sun et al., 2020] and SqueezeBERT [Iandola et al., 2020]). We note that old models become inefficient (lower score with a higher number of GFLOPs) compared to the new ones, as happens with CV models.
## Compute Trend
In Fig. 6 we include all models for which we found inference FLOPs estimations (regardless of whether they have performance results). The dashed line adjusts to the models with higher GFLOPs (models that, when released, become the most demanding), and the solid line to all NLP models. In this plot we indicate the input sequence length, since we represent models with different sequence lengths. We observe a similar trend as in CV: the GFLOPs of the most cutting-edge models show a clear exponential growth, while the general trend, i.e., considering all models, does not scale so aggressively. In fact, there is a good pocket of low-compute models in the last year.
⁴ Many recent models are evaluated on SuperGLUE, but we choose GLUE because it provides a longer temporal window for our analysis.
Figure 5: Relation between GLUE score and GFLOPs for NLP models.
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Scatter Plot: GLUE Score vs. GFLOPs for Language Models
### Overview
This image presents a scatter plot comparing the performance (GLUE score) of various language models against their computational cost (GFLOPs, billions of floating-point operations per inference). Each point on the plot represents a specific language model. The color of each point indicates the date of the model's release.
### Components/Axes
* **X-axis:** GFLOPs (ranging from approximately 3 to 70).
* **Y-axis:** GLUE score (ranging from approximately 74 to 86).
* **Data Points:** Represent individual language models.
* **Legend:** Located in the bottom-right corner, color-coded by release date:
* 2020-07 (Yellow)
* 2020-01 (Light Green)
* 2019-07 (Green)
* 2019-01 (Blue-Green)
* 2018-07 (Blue)
* 2018-01 (Dark Purple)
* **Models Labeled:** ELECTRA Large, ELECTRA-Base, BERT Large, BERT-Base, GPT-1, ELMo, Theseus 6/768, SqueezeBERT, MobileBERT, ELECTRA-Small, MobileBERT tiny.
### Detailed Analysis
The plot shows a general trend of higher GLUE scores correlating with higher GFLOPs, but with significant variation.
Here's a breakdown of the approximate data points, cross-referencing with the legend for color accuracy:
* **MobileBERT tiny:** Approximately (4 GFLOPs, 75 GLUE score) - Yellow (2020-07)
* **MobileBERT:** Approximately (5 GFLOPs, 79 GLUE score) - Light Green (2020-01)
* **ELECTRA-Small:** Approximately (6 GFLOPs, 77 GLUE score) - Yellow (2020-07)
* **SqueezeBERT:** Approximately (7 GFLOPs, 80 GLUE score) - Yellow (2020-07)
* **Theseus 6/768:** Approximately (10 GFLOPs, 79 GLUE score) - Light Green (2020-01)
* **ELECTRA-Base:** Approximately (22 GFLOPs, 83 GLUE score) - Green (2019-07)
* **BERT-Base:** Approximately (30 GFLOPs, 82 GLUE score) - Blue-Green (2019-01)
* **GPT-1:** Approximately (30 GFLOPs, 78 GLUE score) - Blue-Green (2019-01)
* **ELMo:** Approximately (30 GFLOPs, 75 GLUE score) - Blue (2018-07)
* **BERT Large:** Approximately (70 GFLOPs, 82 GLUE score) - Blue (2018-07)
* **ELECTRA Large:** Approximately (50 GFLOPs, 85 GLUE score) - Yellow (2020-07)
The trend for ELECTRA models is generally upward as the model size increases (tiny -> small -> base -> large). BERT models also show an increase in GLUE score with increased GFLOPs (Base -> Large).
### Key Observations
* **Release Date Correlation:** Newer models (released in 2020) tend to achieve higher GLUE scores for a given number of GFLOPs, suggesting improvements in model architecture or training techniques.
* **ELMo Outlier:** ELMo, released in 2018, has a relatively low GLUE score compared to models released in later years with similar GFLOPs.
* **ELECTRA Large Performance:** ELECTRA Large achieves the highest GLUE score in the dataset.
* **GPT-1 Performance:** GPT-1 has a relatively low GLUE score compared to other models with similar GFLOPs.
### Interpretation
The data suggests a trade-off between model performance (GLUE score) and computational cost (GFLOPs). While increasing the number of GFLOPs generally leads to higher performance, the efficiency of models has improved over time. Newer models, like ELECTRA, achieve better performance with fewer GFLOPs than older models like ELMo. This indicates advancements in model design and training methodologies.
The plot highlights the importance of considering both performance and efficiency when selecting a language model for a specific application. The release date provides a temporal context, showing the evolution of language models and the progress made in the field. The outliers, such as ELMo and GPT-1, suggest that factors beyond GFLOPs influence performance, such as model architecture and training data. The clustering of models around certain GFLOP ranges suggests potential sweet spots for performance-cost trade-offs.
</details>
Figure 6: GFLOPs per token analysis for NLP models.
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Scatter Plot: GFLOPS vs. Date for DNN Models
### Overview
This image presents a scatter plot illustrating the relationship between GFLOPs (billions of floating-point operations per inference) and Date for different Deep Neural Network (DNN) models, categorized by input sequence length in tokens. Two trend lines are included: one for all DNNs and another for DNNs with higher GFLOPs.
### Components/Axes
* **X-axis:** Date, ranging from approximately 2017 to 2021.
* **Y-axis:** GFLOPS, displayed on a logarithmic scale from 1e+00 (1) to 1e+06 (1,000,000).
* **Legend:** Located in the top-right corner, categorizes data points by the number of tokens:
* 128 (Pink)
* 512 (Magenta)
* 1024 (Blue)
* 2048 (Cyan)
* **Lines:**
* "All DNNs" - Solid black line.
* "DNNs with higher GFLOPS" - Dashed grey line.
### Detailed Analysis
The plot shows scattered data points representing individual DNN models. The data points are color-coded based on the number of tokens used.
**Data Point Analysis (Approximate Values):**
* **128 Tokens (Pink):**
* 2017: ~10 GFLOPS
* 2018: ~20 GFLOPS
* 2019: ~10 GFLOPS
* 2020: ~5 GFLOPS, ~10 GFLOPS, ~20 GFLOPS
* **512 Tokens (Magenta):**
* 2018: ~50 GFLOPS
* 2019: ~100 GFLOPS
* 2020: ~50 GFLOPS
* **1024 Tokens (Blue):**
* 2019: ~1000 GFLOPS
* 2020: ~2000 GFLOPS, ~5000 GFLOPS
* **2048 Tokens (Cyan):**
* 2020: ~100000 GFLOPS
**Trend Line Analysis:**
* **"All DNNs" (Black Line):** The line exhibits a slight upward slope, indicating a gradual increase in GFLOPS over time. It starts at approximately 10 GFLOPS in 2017 and ends at approximately 100 GFLOPS in 2021.
* **"DNNs with higher GFLOPS" (Grey Dashed Line):** This line shows a much steeper upward slope, indicating a rapid increase in GFLOPS over time. It starts at approximately 10 GFLOPS in 2017 and ends at approximately 100000 GFLOPS in 2021.
### Key Observations
* There's a clear positive correlation between date and GFLOPS, especially for models with higher GFLOPS.
* The number of tokens appears to be related to GFLOPS, with models using more tokens generally exhibiting higher GFLOPS.
* The spread of data points for 128 tokens is wider than for other token counts, suggesting more variability in GFLOPS for these models.
* The "DNNs with higher GFLOPS" trend line significantly outpaces the "All DNNs" trend line, indicating that the most powerful models are growing in computational demand at a faster rate.
### Interpretation
The data suggests a trend of increasing computational requirements for DNN models over time. The steeper slope of the "DNNs with higher GFLOPS" line indicates that the most advanced models are driving this trend. The correlation between tokens and GFLOPS suggests that model size (as measured by the number of tokens) is a key factor in determining computational demand. The variability in GFLOPS for models with 128 tokens could be due to differences in architecture, training data, or other factors.
The plot highlights the growing need for more powerful hardware to support the development and deployment of increasingly complex DNN models. The divergence between the two trend lines suggests that the gap between the computational requirements of standard models and cutting-edge models is widening, potentially creating challenges for researchers and developers. The logarithmic scale on the Y-axis emphasizes the exponential growth in GFLOPS, particularly for the higher-performing models.
</details>
## Hardware Progress
We use FLOPS as a measure of hardware performance and FLOPS/Watt as a measure of hardware efficiency. We collected performance figures for different precision formats and tensor cores for a wide range of GPUs. The results are shown in Fig. 7. Note that the y-axis is in logarithmic scale. Theoretical FLOPS for tensor cores are very high in the plot. However, the actual performance for inference using tensor cores is not so high if we follow a more realistic estimation for the Nvidia GPUs (V100, A100 and T4⁵). The details of this estimation are shown in Table 3 in the appendix.
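Theoretical efficiency figures of this kind are obtained by dividing peak throughput by board power (TDP). A minimal sketch, using approximate public specifications for two of the GPUs mentioned (V100 and T4); the numbers are illustrative and not taken from the paper's tables:

```python
# Theoretical efficiency = peak throughput / board power (TDP).
# Approximate public specifications, for illustration only.
gpus = {
    #  name:            (peak GFLOPS, TDP in Watts)
    "V100 FP32":        (14_000, 300),
    "V100 FP16 Tensor": (125_000, 300),
    "T4 FP32":          (8_100, 70),
    "T4 FP16 Tensor":   (65_000, 70),
}

eff = {name: gflops / watts for name, (gflops, watts) in gpus.items()}
for name, value in eff.items():
    print(f"{name:<16} {value:7.1f} GFLOPS/Watt")
```

Under these figures the T4 reaches roughly twice the tensor-core GFLOPS/Watt of the V100, consistent with it being a GPU designed for inference.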
Figure 7: Theoretical GFLOPS per Watt of Nvidia GPUs. Data in Table 8 in the appendix.
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Scatter Plot: Performance Over Time by Precision
### Overview
This image presents a scatter plot illustrating the performance (GFLOPS/Watt) of computing hardware over time (2011-2021) for different precision levels: FP16, FP16/FP32 Tensor, and FP32. The plot shows how performance has evolved for each precision type over the decade.
### Components/Axes
* **X-axis:** Date, ranging from 2011 to 2021. The axis is labeled "Date".
* **Y-axis:** GFLOPS/Watt, ranging from 0 to 1000. The axis is labeled "GFLOPS/Watt".
* **Legend:** Located in the top-left corner, defining the color-coding for each precision type:
* Black: FP16
* Light Blue: FP16/FP32 Tensor
* Yellow: FP32
### Detailed Analysis
The plot contains data points for each precision type across the specified date range.
**FP32 (Yellow):**
The FP32 data series shows a generally upward trend, starting at approximately 5 GFLOPS/Watt in 2011 and reaching around 70 GFLOPS/Watt by 2021. There is some fluctuation, but the overall trend is positive.
* 2011: ~5 GFLOPS/Watt
* 2012: ~15 GFLOPS/Watt
* 2013: ~20 GFLOPS/Watt
* 2014: ~25 GFLOPS/Watt
* 2015: ~30 GFLOPS/Watt
* 2016: ~40 GFLOPS/Watt
* 2017: ~45 GFLOPS/Watt
* 2018: ~50 GFLOPS/Watt
* 2019: ~60 GFLOPS/Watt
* 2020: ~65 GFLOPS/Watt
* 2021: ~70 GFLOPS/Watt
**FP16/FP32 Tensor (Light Blue):**
This series exhibits the most significant performance gains. It starts at a lower value than FP32 around 2018, but quickly surpasses it.
* 2018: ~100 GFLOPS/Watt
* 2019: ~500 GFLOPS/Watt
* 2020: ~700 GFLOPS/Watt
* 2021: ~600 GFLOPS/Watt
**FP16 (Black):**
The FP16 data series appears later in the timeline, starting around 2016. It shows a more erratic pattern, with some significant jumps in performance.
* 2016: ~75 GFLOPS/Watt
* 2017: ~100 GFLOPS/Watt
* 2018: ~100 GFLOPS/Watt
* 2019: ~80 GFLOPS/Watt
* 2020: ~200 GFLOPS/Watt
* 2021: ~80 GFLOPS/Watt
### Key Observations
* FP16/FP32 Tensor precision demonstrates the most substantial performance improvement over the decade, significantly outpacing FP32 and FP16.
* FP32 shows a steady, but less dramatic, increase in performance.
* FP16 performance is variable, with a large jump in 2020, followed by a decrease in 2021.
* The data suggests a trend towards higher performance with lower precision (FP16/FP32 Tensor).
### Interpretation
The data illustrates the evolution of computing performance across different precision levels. The dominance of FP16/FP32 Tensor in recent years suggests a shift towards utilizing these precision levels for improved efficiency and performance, likely driven by the demands of machine learning and AI workloads. The relatively stable growth of FP32 indicates its continued relevance, while the fluctuating performance of FP16 might be due to variations in hardware implementations or specific application optimizations. The plot highlights the trade-offs between precision and performance, and how advancements in hardware and software are enabling higher performance at lower precision levels. The decrease in FP16 performance in 2021 could indicate a temporary setback or a change in focus for hardware manufacturers.
</details>
⁵ Specifications in: https://www.nvidia.com/en-us/data-center/.
With these estimations we obtained good linear fits (with the y-axis in logarithmic scale) for each data set, one for CV and another for NLP, as shown by the solid lines in Fig. 8. Notice that there is a particular point in Fig. 8 for year 2018 that stands out from the others by a large margin. It corresponds to the T4 using mixed precision, a GPU specifically designed for inference, which is the reason it is so efficient for this task.
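A linear fit with the y-axis in logarithmic scale amounts to ordinary least squares on log-transformed values. A minimal sketch with numpy; the data points below are placeholders, not the series used in the paper:

```python
import numpy as np

# Fit y ≈ 10**(a*year + b), i.e. a straight line through log10(y).
years = np.array([2012, 2014, 2016, 2018, 2020])
gflops_per_watt = np.array([15.0, 25.0, 40.0, 55.0, 90.0])  # placeholder data

a, b = np.polyfit(years, np.log10(gflops_per_watt), 1)
doubling_time = np.log10(2) / a  # years for the fitted trend to double

print(f"slope = {a:.4f} dex/year, doubling time ≈ {doubling_time:.1f} years")
```

The doubling time of the fitted exponential follows directly from the slope: the trend doubles whenever the fitted log10 value increases by log10(2).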
Figure 8: Nvidia GPU GFLOPS per Watt adapted for CV (CNNs) and NLP models. Data in Table 9 in the appendix.
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Scatter Plots: GFLOPS per Watt Estimation for CNN and NLP Models
### Overview
The image presents two scatter plots, side-by-side. The left plot shows GFLOPS per Watt estimation for Convolutional Neural Networks (CNN), and the right plot shows the same for Natural Language Processing (NLP) models. Both plots display data points over time (Date) and use different colors to represent different precision levels (FP32, Mixed, TF32). A black line represents a trendline for each plot.
### Components/Axes
Both plots share the following components:
* **X-axis:** Date, ranging from approximately 2010 to 2022, with markers at 2011, 2013, 2015, 2017, 2019, and 2021.
* **Y-axis:** GFLOPS/Watt, ranging from approximately 7 to 300 (left plot) and 7 to 400 (right plot).
* **Legend (Top-Left of each plot):**
* Precision: FP32 (Dark Gray)
* Precision: Mixed (Light Blue)
* Precision: TF32 (Yellow)
* **Trendline:** A black solid line representing the general trend of the data.
### Detailed Analysis or Content Details
**Left Plot (CNN):**
* **FP32 (Dark Gray):** The data points generally follow an upward trend.
* Approximate data points: (2011, 8), (2012, 12), (2013, 16), (2014, 20), (2015, 25), (2016, 30), (2017, 40), (2018, 50), (2019, 60), (2020, 80), (2021, 100).
* **Mixed (Light Blue):** Fewer data points are present.
* Approximate data points: (2017, 150), (2019, 200), (2021, 300).
* **TF32 (Yellow):** Only one data point is visible.
* Approximate data point: (2018, 45).
**Right Plot (NLP):**
* **FP32 (Dark Gray):** The data points generally follow an upward trend.
* Approximate data points: (2011, 7), (2012, 12), (2013, 18), (2014, 24), (2015, 30), (2016, 40), (2017, 50), (2018, 60), (2019, 70), (2020, 90), (2021, 110).
* **Mixed (Light Blue):** Fewer data points are present.
* Approximate data points: (2017, 180), (2019, 220), (2021, 320).
* **TF32 (Yellow):** Only one data point is visible.
* Approximate data point: (2018, 80).
In both plots, the trendlines slope upwards, indicating an increasing trend in GFLOPS per Watt over time.
### Key Observations
* Both CNN and NLP models show a consistent increase in GFLOPS per Watt over the period from 2011 to 2021.
* Mixed precision consistently outperforms FP32 precision in both CNN and NLP models, showing significantly higher GFLOPS/Watt.
* TF32 precision has limited data points, but appears to offer performance between FP32 and Mixed precision.
* The rate of improvement appears to be accelerating in recent years (2019-2021) for all precision levels.
### Interpretation
The data suggests a significant improvement in the efficiency of both CNN and NLP models over the past decade. This improvement is likely due to advancements in hardware (e.g., GPUs, TPUs) and software (e.g., model architectures, optimization techniques). The superior performance of Mixed precision indicates that utilizing lower precision data types can substantially increase computational throughput without significant loss of accuracy. The limited data for TF32 suggests it is a relatively newer precision format, but shows promise. The accelerating trend in recent years suggests that the pace of innovation in this field is increasing. The difference between the two charts could be due to the inherent computational demands of CNNs versus NLP models, or the different optimization strategies employed for each. The plots demonstrate the ongoing drive for more efficient machine learning models, which is crucial for reducing energy consumption and enabling deployment on resource-constrained devices.
</details>
## Energy Consumption Analysis
Once we have estimated the inference FLOPs for a range of models and the GFLOPS per Watt for different GPUs, we can estimate the energy (in Joules) consumed in one inference. We do this by dividing the FLOPs of the model by the FLOPS per Watt of the GPU. But how can we choose the FLOPS per Watt that corresponds to the model? We use the fits presented in Fig. 8 to obtain an estimation of GFLOPS per Watt for the model's release date. In this regard, Henderson et al. (2020) report that FLOPs for DNNs can sometimes be misleading, due to underlying optimisations in the firmware, frameworks, memory and hardware that can influence energy efficiency. They show that energy and FLOPs are highly correlated for the same architecture, but the correlation decreases when different architectures are mixed. We consider that this lower correlation does not affect our estimations significantly, as we analyse trends through the years and fit on a logarithmic scale, where dispersion is reduced. A more precise analysis would require measuring power consumption for each network with the original hardware and software, as unfortunately the energy required per (single) inference is rarely reported.
<details>
<summary>Image 9 Details</summary>

### Visual Description
## Scatter Plot: Energy Consumption of DNN Models Over Time
### Overview
This image presents a scatter plot illustrating the energy consumption (in Joules) of different Deep Neural Network (DNN) models over time (from 2012 to 2021). The plot differentiates models based on their Top-1 Accuracy and whether they were trained with "Extra Data". Three lines represent average energy consumption trends: "All DNNs", "DNNs", and "Best" models. The y-axis is on a logarithmic scale.
### Components/Axes
* **X-axis:** Date (ranging from 2012 to 2021).
* **Y-axis:** Joules (logarithmic scale, ranging from 0.003 to 30.000).
* **Legend:**
* **Model:**
* All DNNs (dotted line)
* DNNs (solid line)
* Best (dashed line)
* **Extra Data:**
* No (triangle markers)
* Yes (circle markers)
* **Color Scale:** Top-1 Accuracy (ranging from 60 to 90, with a gradient from blue to yellow).
* **Markers:** Triangle and Circle markers are used to indicate whether extra data was used.
### Detailed Analysis
The plot shows three trend lines representing the average energy consumption of different model types. The data points are colored based on Top-1 Accuracy, with blue representing lower accuracy and yellow representing higher accuracy.
**All DNNs (dotted line):** This line shows an upward trend, increasing from approximately 0.03 Joules in 2012 to around 10 Joules in 2021.
* 2012: ~0.03 Joules
* 2014: ~0.1 Joules
* 2016: ~0.3 Joules
* 2018: ~1.0 Joules
* 2020: ~3.0 Joules
* 2021: ~10 Joules
**DNNs (solid line):** This line is relatively flat, with a slight upward trend. It starts at approximately 0.05 Joules in 2012 and increases to around 0.3 Joules in 2021.
* 2012: ~0.05 Joules
* 2014: ~0.08 Joules
* 2016: ~0.15 Joules
* 2018: ~0.2 Joules
* 2020: ~0.25 Joules
* 2021: ~0.3 Joules
**Best (dashed line):** This line shows a significant upward trend, starting at approximately 0.05 Joules in 2012 and increasing to around 20 Joules in 2021.
* 2012: ~0.05 Joules
* 2014: ~0.2 Joules
* 2016: ~0.7 Joules
* 2018: ~2.0 Joules
* 2020: ~7.0 Joules
* 2021: ~20 Joules
**Data Points:**
* **Extra Data = No (Triangles):** These points are scattered throughout the plot, with a concentration in the lower energy consumption range (below 1 Joule) in earlier years (2012-2017). In later years (2018-2021), they are more spread out, with some points reaching higher energy consumption levels (up to 10 Joules). The color of the triangles varies from blue (lower accuracy) to yellow (higher accuracy).
* **Extra Data = Yes (Circles):** These points are also scattered, but generally show a higher concentration in the higher energy consumption range (above 1 Joule) in later years (2018-2021). The color of the circles also varies from blue to yellow.
### Key Observations
* The energy consumption of the "Best" models has increased dramatically over time, far exceeding the energy consumption of "All DNNs" and "DNNs".
* Models trained with "Extra Data" tend to consume more energy, particularly in recent years.
* There is a positive correlation between Top-1 Accuracy and energy consumption, as indicated by the color gradient. Higher accuracy models generally consume more energy.
* The "DNNs" line remains relatively stable, suggesting that the average energy consumption of these models has not increased significantly over time.
### Interpretation
The data suggests that achieving higher accuracy in DNN models requires significantly more energy, especially for the "Best" performing models. The use of "Extra Data" also contributes to increased energy consumption. This trend raises concerns about the environmental impact of increasingly complex AI models. The relatively stable energy consumption of "DNNs" might indicate that these models have reached a plateau in terms of performance gains per unit of energy. The logarithmic scale of the y-axis emphasizes the exponential growth in energy consumption for the "Best" models. The plot highlights the trade-off between model accuracy and energy efficiency, and the need for research into more energy-efficient AI algorithms and hardware. The data points show a wide range of energy consumption within each category, indicating that model architecture, training methods, and other factors also play a significant role.
</details>
Figure 9: Estimated Joules of a forward pass (CV). The dashed line is a linear fit (logarithmic y-axis) for the models with highest accuracy per year. The solid line fits all models.
We can express the efficiency metric FLOPS per Watt as FLOPs per Joule, as shown in Eq. 1. With this equivalence, we divide the FLOPs needed for a forward pass by the efficiency to obtain the required energy in Joules, as shown in Eq. 2.
Figure 10: Estimated Joules of a forward pass (NLP). Same interpretation as in Fig. 9.
<details>
<summary>Image 10 Details</summary>

### Visual Description
## Scatter Plot: Energy Consumption vs. Model Size Over Time
### Overview
Scatter plot of inference energy (Joules, log scale, roughly 1e-01 to 1e+04) against date (2017–2021) for NLP models, with points colour-coded by input length (128, 512, 1024 and 2048 tokens). A solid black line shows the GFLOPs growth trend for all models and a dashed grey line the trend for the models with higher GFLOPs; the latter rises considerably more steeply. Energy generally increases with the number of tokens.
</details>
$$\text{Efficiency} = \frac{\text{HW Perf.}}{\text{Power}} \quad \text{in units: } \frac{\text{FLOPS}}{\text{Watt}} = \frac{\text{FLOPs/s}}{\text{Joules/s}} = \frac{\text{FLOPs}}{\text{Joule}} \qquad (1)$$

$$\text{Energy} = \frac{\text{Fwd. Pass}}{\text{Efficiency}} \quad \text{in units: } \frac{\text{FLOPs}}{\text{FLOPs/Joule}} = \text{Joules} \qquad (2)$$
Applying this calculation to all collected models we obtain Fig. 9 for CV. The dashed line represents an exponential trend (a linear fit, as the y-axis is logarithmic) adjusted to the models with highest accuracy for each year, as in Fig. 2, and the dotted line represents the average Joules for each year. Comparing both plots, we see that hardware progress softens the growth observed for FLOPs, but the growth is still clearly exponential for the models with high accuracy. The dotted line is almost horizontal, but on a logarithmic scale this may be interpreted as exponential growth with a small base, or as a linear fit on the semi-log plot that is affected by the extreme points. In Fig. 10 we do the same for NLP models and see a similar picture.
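As a worked example of Eq. 1 and Eq. 2, the sketch below estimates the energy of a single forward pass from its FLOP count and the hardware's peak performance and power. The concrete numbers are illustrative assumptions (roughly a ResNet-50-sized model on a V100-class accelerator), not values from our dataset.

```python
def inference_energy_joules(fwd_pass_flops: float,
                            hw_peak_flops: float,
                            hw_power_watts: float) -> float:
    """Energy per forward pass, following Eqs. 1-2:
    efficiency = peak FLOPS / power, in FLOPs per Joule;
    energy     = forward-pass FLOPs / efficiency, in Joules."""
    efficiency_flops_per_joule = hw_peak_flops / hw_power_watts
    return fwd_pass_flops / efficiency_flops_per_joule

# Illustrative (assumed) numbers: ~4 GFLOPs per forward pass,
# ~125 TFLOPS peak at ~300 W for a V100-class GPU.
energy = inference_energy_joules(4e9, 125e12, 300.0)
print(f"{energy:.4f} J per forward pass")  # ~0.0096 J
```

Note that the peak FLOPS figure is an upper bound; achieved throughput is usually lower, so this estimate is optimistic by the same factor.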
Fig. 11 shows the relation between Top-1 Accuracy and Joules, with Joules calculated in the same way as in Fig. 9. The relation is similar to that observed in Fig. 4, but in Fig. 11 the older models are not only positioned further down the y-axis (performance) but also tend to cluster in the bottom-right part of the plot (high Joules), so their position on the x-axis is worse than in Fig. 4 due to the evolution in hardware. This is even clearer for NLP, as seen in Fig. 12.
Figure 11: Relation between Joules and Top-1 Accuracy over the years (CV, ImageNet).
<details>
<summary>Image 11 Details</summary>

Scatter plot of Top-1 Accuracy (linear, roughly 60–90) against Joules (log scale, roughly 0.003–30), with points coloured by date (2013–2021) and shaped by the use of extra data (circle: no; triangle: yes). Accuracy rises with energy and over time; models using extra data tend to reach higher accuracy for a given energy budget, with diminishing returns at the high end.
</details>
## Forecasting and Multiplicative Effect
In our analysis we see that both DNNs and hardware are improving in efficiency and show no signs of stagnation. This is consistent with most studies in the literature: performance will
Figure 12: Relation between Joules and GLUE score over the years (NLP, GLUE).
<details>
<summary>Image 12 Details</summary>

Scatter plot of GLUE score (roughly 74–86) against Joules (log scale, roughly 0.03–1.00) for named NLP models (e.g., ELECTRA, BERT, MobileBERT, SqueezeBERT, GPT-1, ELMo), coloured by release date (2018-07 to 2020-01). Newer models tend to achieve higher GLUE scores at a given energy, and efficiency-oriented models such as MobileBERT reach competitive scores at very low energy.
</details>
Figure 13: Estimated Joules per forward pass (e.g., one prediction) compared to human energy consumption in 1s (CV).
<details>
<summary>Image 13 Details</summary>

Line chart of energy per forward pass (Joules, log scale, 1e-02 to 1e+04) against date (2012–2021) for CV models, with trend lines for the best DNNs (dashed) and all DNNs (solid), plotted against two flat reference lines for human external energy and human internal (somatic) consumption. DNN energy per inference rises over the period while the human references stay constant.
</details>
continue growing as compute grows, but at the same time efficiency is increasing. However, this is the first work that analyses whether these two effects cancel out, especially for inference rather than training. Our conclusion is that they do not cancel out for the cutting-edge models of each moment, but this is less clear for the regular models in general use by industries and individuals.
However, since we are focusing on inference costs, we need to consider the multiplicative factor. How many inferences are performed per capita? This has certainly increased very significantly with the spread of smart devices, the Internet of Things and many other devices around us, which are incorporating DNN-based services. But how many inference passes per capita do we have at this moment, and how fast is this growing? This is very difficult to estimate, and we leave it for future work. It is nevertheless interesting to analyse possible hypotheses: assume there is one inference pass of a neural network application per second per capita. What would this imply in terms of energy consumption?
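To make the thought experiment concrete, this sketch computes the aggregate power implied by one forward pass per second per capita. The per-inference energy and population figures are assumptions for illustration only.

```python
def aggregate_power_watts(joules_per_inference: float,
                          inferences_per_second_per_capita: float,
                          population: float) -> float:
    """Total power (Joules/s = Watts) if every person triggers
    inferences continuously at the given rate."""
    return joules_per_inference * inferences_per_second_per_capita * population

# Assumed inputs: 1 J per inference, 1 inference per second per capita,
# and a world population of ~8 billion.
power = aggregate_power_watts(1.0, 1.0, 8e9)
print(f"{power / 1e9:.1f} GW")  # 8.0 GW
```

Even under these modest assumptions the aggregate demand is on the scale of several large power plants, which illustrates why the multiplicative factor dominates the inference-side analysis.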
In order to put this inference energy consumption in context, we calculate the average human body energy consumption (which we will refer to as somatic or internal consumption) in one second, and the average energy that a human being consumes in one second with all their commodities (which we will refer to as external consumption). The internal consumption is calculated assuming 2,000 kcal per person per day, which converts to approximately 100 Joules/s. The external consumption is the sum of total energy consumption, including electricity, transport and heating, using the USA as a reference [Ritchie and Roser, 2020]. This gives 79,897 kWh/year in 2019, which is approximately 10,000 Joules every second. The comparison of these two references with the trends can be seen in Fig. 13 (CV). As we see, the energy consumed for one inference of the best models approaches the energy consumed by the human body in one second, but is still far from the external energy consumed in one second. If each human made an AI-based decision implying a forward pass every second during the whole day (and night), this would still be well below their
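The two baselines above follow from a few unit conversions; the constants (2,000 kcal/day and 79,897 kWh/year) are the ones used in the text.

```python
# Human energy baselines in Joules per second (i.e., Watts).
KCAL_TO_J = 4184.0          # 1 kcal = 4184 J
KWH_TO_J = 3.6e6            # 1 kWh = 3.6 MJ
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

# Internal (somatic): 2,000 kcal per person per day.
internal_j_per_s = 2000 * KCAL_TO_J / SECONDS_PER_DAY
# External: 79,897 kWh per person per year (USA, 2019).
external_j_per_s = 79_897 * KWH_TO_J / SECONDS_PER_YEAR

print(f"internal: {internal_j_per_s:.0f} J/s")  # ~97 J/s
print(f"external: {external_j_per_s:.0f} J/s")  # ~9121 J/s
```

Both values round to the orders of magnitude quoted in the text: roughly 100 J/s internal and roughly 10,000 J/s external.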
Figure 14: Estimated Joules per forward pass (e.g., one prediction) compared to human consumption in 1s (NLP).
<details>
<summary>Image 14 Details</summary>

Line chart of energy per forward pass (Joules, log scale, 1e+00 to 1e+04) against date (2017–2021) for NLP models, with trend lines for the best DNNs (dashed, rising steeply) and all DNNs (roughly flat), plotted against flat reference lines for human external energy (~10,000 J/s) and human internal consumption (~100 J/s).
</details>
internal consumption. However, AI-based decisions are becoming more ubiquitous. For instance, a self-driving car or a surveillance camera may be making many forward passes per second. For NLP, the trends are similar, but the best models are growing much faster, as we see in Fig. 14, while the regular models may even decrease. Here, the number of decisions made per second is also hard to determine. For instance, a language model interfaced by a human does not require more than the basic 128-token window per second. However, many applications of language models can process data without interacting with humans, at a much higher speed.
## Discussion and Future Work
In this work we have combined the analysis of several elements about AI, compute and energy consumption, which allows us to take a different and more comprehensive perspective on the energy impact of AI. The most distinctive element of our analysis is that we focus on inference cost, which is usually lower than the training cost when both are reported in research papers but, because of multiplicative factors, is much higher overall. Many DNN models are trained once and applied millions of times (forward passes).
Our findings are very different from the unbridled exponential growth that is usually reported when just looking at the number of parameters of new deep learning models [Hestness et al., 2017, Kaplan et al., 2020, Henighan et al., 2020]. When we focus on the inference costs of these networks, the associated energy is not growing as fast, because of several factors that partially compensate the growth, such as algorithmic improvements, hardware specialisation and hardware consumption efficiency. The gap narrows further when we analyse those models that settle, i.e., those whose implementations become very popular one or two years after the breakthrough algorithm was introduced. These general-use models can achieve systematic growth in performance at an almost constant energy consumption. The main conclusion is that, even if the energy used by AI were kept constant, the improvement in performance could be sustained through algorithmic improvements and a fast increase in the number of parameters.
This conclusion has an important limitation: it assumes a constant multiplicative factor. As more and more devices use AI (locally or remotely), the energy consumption can escalate through increased penetration alone, in the same way that cars have become more efficient over the past two decades, yet there are many more cars in the world today.
We hope this paper contributes to the increasing debate about AI and energy consumption by analysing the inference costs. As these are dominated by multiplicative factors, this should encourage not only AI researchers but economists and social scientists to participate in this analysis. Future studies would be enriched by socio-economic indicators about the use of AI (the degree of penetration), the cost of energy and devices as well as the carbon footprint per Joule [EEA, 2021]. Similarly, comparing energy consumption by AI and trends in human salaries could help determine where automation [Tolan et al., 2021] becomes cost effective in economic terms.
Finally, this paper has many limitations that originate from the limited information reported in scientific papers. Many papers include the number of parameters, but it is less common to have complete information about FLOPs and energy consumption, and rarer still for inference costs. This information is not only necessary for the transparency of the field but is of utmost relevance for producing studies such as the one presented here, with a larger number of benchmarks and models. It is also important that new techniques are reported on old benchmarks as well as new ones, so that we have larger temporal windows in which to analyse the evolution of the field. We hope that future studies can build on this one and on better publishing practices.
## References
- S. Albanie. Convnet burden: Estimates of memory consumption and flop counts for various convolutional neural networks., 2016. https://github.com/albanie/convnet-burden.
- D. Amodei and D. Hernandez. Ai and compute. https://openai.com/blog/ai-and-compute/, 2018.
- L. F. W. Anthony, B. Kanding, and R. Selvan. Carbontracker: Tracking and predicting the carbon footprint of training deep learning models. arXiv preprint arXiv:2007.03051 , 2020.
- V. E. Balas, S. S. Roy, D. Sharma, and P. Samui. Handbook of deep learning applications , volume 136. Springer, 2019.
- S. Bianco, R. Cadene, L. Celona, and P. Napoletano. Benchmark analysis of representative deep neural network architectures. IEEE Access , 6:64270-64277, 2018.
- R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models, 2021.
- A. Brock, S. De, S. L. Smith, and K. Simonyan. High-performance large-scale image recognition without normalization, 2021.
- T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners, 2020.
- R. Cadene. Pretrained models for Pytorch, 2016. https://github.com/Cadene/pretrained-models.pytorch#torchvision.
- A. Canziani, A. Paszke, and E. Culurciello. An analysis of deep neural network models for practical applications, 2017.
- C.-F. Chen, Q. Fan, and R. Panda. Crossvit: Cross-attention multi-scale vision transformer for image classification, 2021.
- Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks, 2017.
- F. Chollet. Keras applications, 2015. https://keras.io/api/applications/.
- F. Chollet. Xception: Deep learning with depthwise separable convolutions, 2017.
- K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning. Electra: Pre-training text encoders as discriminators rather than generators, 2020.
- C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia. Dawnbench: An end-to-end deep learning benchmark and competition. Training, 100(101):102, 2017.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248-255. IEEE, 2009.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
- EEA. Greenhouse gas emission intensity of electricity generation in Europe. https://www.eea.europa.eu/data-and-maps/indicators/overview-of-the-electricity-production3/assessment-1, 2021.
- A. Gholami, Z. Yao, S. Kim, M. W. Mahoney, and K. Keutzer. Ai and memory wall. RiseLab Medium Post, 2021.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual networks github, 2015a. https://github.com/KaimingHe/deep-residual-networks.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition, 2015b.
- P. Henderson, J. Hu, J. Romoff, E. Brunskill, D. Jurafsky, and J. Pineau. Towards the systematic reporting of the energy and carbon footprints of machine learning. Journal of Machine Learning Research , 21(248):1-43, 2020.
- T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701 , 2020.
- D. Hernandez and T. B. Brown. Measuring the algorithmic efficiency of neural networks, 2020.
- J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. Patwary, M. Ali, Y. Yang, and Y. Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409 , 2017.
- M. Hollemans. How fast is my model?, 2018. https://machinethink.net/blog/how-fast-is-my-model/.
- A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017.
- J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu. Squeeze-and-excitation networks, 2019.
- G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks, 2018.
- F. N. Iandola, A. E. Shaw, R. Krishna, and K. W. Keutzer. Squeezebert: What can computer vision teach nlp about efficient neural networks?, 2020.
- S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.
- J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 , 2020.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems , 25:1097-1105, 2012.
- Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. Albert: A lite bert for self-supervised learning of language representations, 2020.
- C. Li. Openai's gpt-3 language model: A technical overview. https://lambdalabs.com/blog/demystifying-gpt-3, 2020.
- D. Li, X. Chen, M. Becchi, and Z. Zong. Evaluating the energy efficiency of deep convolutional neural networks on cpus and gpus. In 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom) , pages 477-484, 2016. doi: 10.1109/BDCloud-SocialCom-SustainCom.2016.76.
- C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search, 2018.
- Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows, 2021.
- N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design, 2018.
- D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining, 2018.
- F. Martínez-Plumed, S. Avin, M. Brundage, A. Dafoe, S. Ó. hÉigeartaigh, and J. Hernández-Orallo. Accounting for the neglected dimensions of ai progress. arXiv preprint arXiv:1806.00610, 2018.
- P. Mattson, V. J. Reddi, C. Cheng, C. Coleman, G. Diamos, D. Kanter, P. Micikevicius, D. Patterson, G. Schmuelling, H. Tang, et al. Mlperf: An industry standard benchmark suite for machine learning performance. IEEE Micro , 40(2):8-16, 2020.
- C. NVIDIA. Achieved FLOPs, 2015. https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedflops.htm.
- C. NVIDIA. Nvidia tesla v100 gpu architecture, 2017. https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
- C. NVIDIA. Training with mixed precision, 2018. https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html.
- J. Park, M. Naumov, P. Basu, S. Deng, A. Kalaiah, D. Khudia, J. Law, P. Malani, A. Malevich, S. Nadathur, J. Pino, M. Schatz, A. Sidorov, V. Sivakumar, A. Tulloch, X. Wang, Y. Wu, H. Yuen, U. Diril, D. Dzhulgakov, K. Hazelwood, B. Jia, Y. Jia, L. Qiao, V. Rao, N. Rotem, S. Yoo, and M. Smelyanskiy. Deep learning inference in facebook data centers: Characterization, performance optimizations and hardware implications, 2018.
- A. Paszke, S. Gross, S. Chintala, and G. Chanan. Torchvision models, 2016. https://pytorch.org/vision/stable/models.html.
- M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations, 2018.
- H. Pham, Z. Dai, Q. Xie, M.-T. Luong, and Q. V. Le. Meta pseudo labels, 2021.
- A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. 2018.
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.
- E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search, 2019.
- H. Ritchie and M. Roser. Energy. Our World in Data , 2020. https://ourworldindata.org/energy.
- C. Rosset. Turing-nlg: A 17-billion-parameter language model by microsoft, 2020. https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge, 2015.
- M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks, 2019.
- R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni. Green ai, 2019.
- M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020.
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition, 2015.
- V. Sovrasov. Flops counter for convolutional networks in pytorch framework, 2020. https://github.com/sovrasov/flops-counter.pytorch.
- A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani. Bottleneck transformers for visual recognition, 2021.
- R. Stojnic and R. Taylor. Papers with code imagenet benchmark (image classification), 2021. https://paperswithcode.com/sota/image-classification-on-imagenet.
- E. Strubell, A. Ganesh, and A. McCallum. Energy and policy considerations for deep learning in nlp, 2019.
- Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, and D. Zhou. Mobilebert: a compact task-agnostic bert for resource-limited devices, 2020.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions, 2014.
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision, 2015.
- C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning, 2016.
- O. Sémery. Computer vision models on pytorch, 2019. https://pypi.org/project/pytorchcv/.
- M. Tan and Q. V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks, 2020.
- M. Tan and Q. V. Le. Efficientnetv2: Smaller models and faster training, 2021.
- D. Thomas. Reducing machine learning inference cost for pytorch models - aws online tech talks. https://www.youtube.com/watch?v=ET2KVe2du3Y, 2020.
- N. C. Thompson, K. Greenewald, K. Lee, and G. F. Manso. The computational limits of deep learning. arXiv preprint arXiv:2007.05558 , 2020.
- S. Tolan, A. Pesole, F. Martínez-Plumed, E. Fernández-Macías, J. Hernández-Orallo, and E. Gómez. Measuring the occupational impact of ai: tasks, cognitive abilities and ai benchmarks. Journal of Artificial Intelligence Research, 71:191-236, 2021.
- H. Touvron, A. Vedaldi, M. Douze, and H. Jégou. Fixing the train-test resolution discrepancy: Fixefficientnet, 2020.
- H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou. Deit: Data-efficient image transformers github, 2021a. https://github.com/facebookresearch/deit.
- H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou. Training data-efficient image transformers & distillation through attention, 2021b.
- H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou. Going deeper with image transformers, 2021c.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems , pages 5998-6008, 2017.
- A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR, 2019.
- Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves imagenet classification, 2020.
- S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks, 2017.
- C. Xu, W. Zhou, T. Ge, F. Wei, and M. Zhou. Bert-of-theseus: Compressing bert by progressive module replacing, 2020.
- X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi. Scaling for edge inference of deep neural networks. Nature Electronics , 1(4):216-222, 2018.
- I. Z. Yalniz, H. Jégou, K. Chen, M. Paluri, and D. Mahajan. Semi-supervised and semi-weakly supervised imagenet models github, 2019a. https://github.com/facebookresearch/semi-supervised-ImageNet1K-models.
- I. Z. Yalniz, H. Jégou, K. Chen, M. Paluri, and D. Mahajan. Billion-scale semi-supervised learning for image classification, 2019b.
- L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. Tay, J. Feng, and S. Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet, 2021.
- M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks, 2013.
- X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers, 2021.
- H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, Z. Zhang, Y. Sun, T. He, J. Mueller, R. Manmatha, M. Li, and A. Smola. Resnest: Split-attention networks, 2020.
- X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices, 2017.
- B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition, 2018.
## Appendix
In this technical appendix we include supplementary material giving detailed information about: 1) the difference between FLOPs and FLOPS; 2) methodological details for the CV and NLP models used in our analyses; 3) the benchmarks addressed; 4) hardware specifics regarding floating point precision; 5) further analysis of performance and compute in NLP tasks; 6) FLOPS estimation procedures; 7) results for the GLUE benchmark; and 8) GPU consumption data.
## FLOPs vs FLOPS
When dealing with computing effort and computing speed (hardware performance), terminology is often confusing. The term 'compute' is ambiguous: it is sometimes applied to a number of operations and sometimes to a number of operations per second. It is therefore important to clarify what kind of operations are counted and which acronyms denote them. We use the acronym FLOPS to measure hardware performance, referring to the number of floating point operations per second, as standardised in the industry, while FLOPs denotes the amount of computation for a given task (e.g., a prediction or inference pass), referring to the number of operations, counting a multiply-add operation pair as two operations.

In other words, by FLOP we mean one floating point operation, a measure of the amount of compute (computing effort), and by FLOPS we mean floating point operations per second, i.e., FLOPS = FLOP/s. Many papers, especially CV papers, use the terms FLOPs and FLOPS interchangeably to refer to the number of operations; we only use FLOPs as the plural of FLOP, never as a synonym of FLOPS. Then there is the question of what a FLOP is. When dealing with DNNs, the count is usually restricted to the multiply-add operations, even though other types of operations are involved when executing a DNN; this is usually a good estimate [Hollemans, 2018, Clark et al., 2020]. More specifically, we count one fused multiply-add operation as 2 FLOPs (note the lowercase 's'), as hardware manufacturers do [NVIDIA, 2015], since there are in fact two mathematical operations. CV research papers, however, usually count a multiply-add operation as a single operation; in those cases, we multiply the reported number of operations by 2.
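The two conventions above can be reconciled with a few lines of code. This is a minimal sketch, not part of our analysis: the utilisation factor and the peak-FLOPS figure are illustrative assumptions.

```python
# Convert a reported multiply-add (MAC) count into FLOPs, and derive an
# idealised lower bound on inference latency. Both the utilisation value
# and the peak-FLOPS figure below are illustrative assumptions.

def macs_to_flops(macs: float) -> float:
    """One fused multiply-add pair counts as 2 FLOPs (hardware convention)."""
    return 2.0 * macs

def min_inference_seconds(flops: float, hw_flops_per_s: float,
                          utilisation: float = 0.3) -> float:
    """Idealised latency: compute / (peak speed * achievable utilisation)."""
    return flops / (hw_flops_per_s * utilisation)

# Example: a CV paper reports 4.1 GMACs for one forward pass.
flops = macs_to_flops(4.1e9)                 # -> 8.2e9 FLOPs (8.2 GFLOPs)
t = min_inference_seconds(flops, 15.7e12)    # assuming ~15.7 TFLOPS peak
```

Note that real latency is also bound by memory bandwidth and kernel launch overheads, which is precisely why we later calibrate theoretical FLOPS against measured throughput.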
## Methodology Details for CV Models
Accuracy and FLOPs metrics were collected carefully, taking into account that there are different sampling techniques to reach a given accuracy. For instance, in the AlexNet paper [Krizhevsky et al., 2012], ten predictions are made to classify a single image: ten different crops 6 are taken from the original image and the ten predictions are averaged to obtain the final one. While this is a useful trick, it is not fair to compare the accuracy of a model obtained with 10 crops against one obtained with a single crop. Furthermore, the use of several crops or other kinds of repetitions is problematic, as papers usually report the number of FLOPs for one forward pass 7 (if 10 forward passes are needed to make a single prediction, the FLOPs should be multiplied by 10). For these reasons we only report 1-crop accuracy for all models, to make comparisons meaningful.
Note that FLOPs also depend on the input image resolution: the higher the image resolution, the more operations (FLOPs) are required to process it. Some researchers report results at different image resolutions [Simonyan and Zisserman, 2015, Zhai et al., 2021], and sometimes it is not clear which resolution the results correspond to; in those cases we investigated until we found that information. In sum, all FLOPs collected in this work are for a single forward pass at the resolution used for inference. The selected models and their values are shown in Table 2.
6 Cropping is a common image manipulation process: cropping the middle square from input images (down-sampling) is a good practice for data preparation, while random cropping is a good practice for training-data augmentation.
7 A 'forward pass' refers to the calculation of the values of the output layer from the input data, traversing all neurons from the first to the last layer. A loss function is then computed from the output values.
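The quadratic dependence of FLOPs on input resolution can be sketched for a single convolutional layer. The layer shape below is an illustrative assumption; one multiply-add pair is counted as 2 FLOPs, as explained above.

```python
# FLOPs of one 2D convolution layer, counting only multiply-adds and
# counting each multiply-add pair as 2 FLOPs. Padding is ignored for
# simplicity; the layer shape is illustrative, not from our dataset.

def conv2d_flops(h: int, w: int, c_in: int, c_out: int,
                 k: int, stride: int = 1) -> int:
    h_out, w_out = h // stride, w // stride
    macs = h_out * w_out * c_out * c_in * k * k
    return 2 * macs

f224 = conv2d_flops(224, 224, 3, 64, 7, stride=2)
f384 = conv2d_flops(384, 384, 3, 64, 7, stride=2)
# Going from 224 to 384 pixels multiplies FLOPs by (384/224)^2, i.e. ~2.94x,
# which is why the inference resolution must accompany any reported FLOPs.
```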
Table 2: CV models data set. A citation next to a given value means that this value is extracted from that source, otherwise the values are from the paper (cited in model column). The symbol † means that this value was obtained or checked from a model implementation using model analysis tools, and the symbol ∗ means that we estimated the value.
| Model | Top-1 Acc. | Params (M) | GFLOPs | Extra Data | Date | Architecture |
|----------------------------------------------------------------------------------------|----------------------------------------------------|----------------------------------------------------|---------------------------------------------------------|---------------------------------|----------------------------------|-------------------------------------|
| AlexNet [Krizhevsky et al., 2012] | 56.52 [Paszke et al., 2016] | 61.00 † | 1.42 † | No | 01/06/2012 | CNN |
| ZFNet-b [Zeiler and Fergus, 2013] | 63.63 [Sémery, 2019] | 107.63 [Sémery, 2019] | 4.96 [Sémery, 2019] | No | 11/11/2013 | CNN |
| ZFNet [Zeiler and Fergus, 2013] | 60.21 [Sémery, 2019] | 62.36 [Sémery, 2019] | 2.34 [Sémery, 2019] | No | 12/11/2013 | CNN |
| VGG-19 [Simonyan and Zisserman, 2015] | 72.37 [Paszke et al., 2016] | 144.00 | 39.34 † | No | 04/09/2014 | CNN |
| VGG-16 [Simonyan and Zisserman, 2015] | 71.59 [Paszke et al., 2016] | 138.00 | 31.00 † | No | 04/09/2014 | CNN |
| Inception V1/GoogLeNet [Szegedy et al., 2014] | 69.77 [Paszke et al., 2016] | 6.80 | 3.00 | No | 17/09/2014 | CNN |
| Inception V2/Inception BN [Ioffe and Szegedy, 2015] | 74.80 | 11.29 [Sémery, 2019] | 4.10 [Sémery, 2019] | No | 11/02/2015 | CNN |
| Inception V3 [Szegedy et al., 2015] | 78.80 | 23.83 | 11.48 | No | 02/12/2015 | CNN |
| ResNet-50 [He et al., 2015b] | 75.30 [He et al., 2015a] | 26.00 [Chollet, 2015] | 7.60 | No | 10/12/2015 | CNN |
| ResNet-101 [He et al., 2015b] | 76.40 [He et al., 2015a] | 45.00 [Chollet, 2015] | 15.20 | No | 10/12/2015 | CNN |
| ResNet-152 [He et al., 2015b] | 77.00 [He et al., 2015a] | 60.00 [Chollet, 2015] | 22.60 | No | 10/12/2015 | CNN |
| Inception V4 [Szegedy et al., 2016] | 80.00 | 42.68 [Sémery, 2019] | 24.60 [Sémery, 2019] | No | 23/02/2016 | CNN |
| Inception ResNet V2 [Szegedy et al., 2016] | 80.10 | 55.84 [Sémery, 2019] | 26.38 [Sémery, 2019] | No | 23/02/2016 | CNN |
| Densenet-121 [Huang et al., 2018] | 74.98 | 7.98 [Sémery, 2019] | 5.74 [Sémery, 2019] | No | 25/08/2016 | CNN |
| Densenet-169 [Huang et al., 2018] | 76.20 | 14.15 [Sémery, 2019] | 6.80 [Sémery, 2019] | No | 25/08/2016 | CNN |
| Densenet-201 [Huang et al., 2018] | 77.42 | 20.01 [Sémery, 2019] | 8.68 [Sémery, 2019] | No | 25/08/2016 | CNN |
| Xception [Chollet, 2017] | 79.00 | 22.86 | 16.80 [Sémery, 2019] | No | 07/10/2016 | CNN |
| ResNeXt-50 (32x4d) [Xie et al., 2017] | 77.80 | 25.00 | 8.40 | No | 16/11/2016 | CNN |
| ResNeXt-101 (64x4d) [Xie et al., 2017] | 79.60 | 83.46 | 31.20 † | No | 16/11/2016 | CNN |
| MobileNet [Howard et al., 2017] | 70.60 | 4.20 | 1.14 | No | 17/04/2017 | CNN |
| ShuffleNet x1.0 (g=8) [Zhang et al., 2017] | 67.60 | 2.43 [Sémery, 2019] | 0.28 | No | 04/07/2017 | CNN |
| DPN-131 (40 × 4d) [Chen et al., 2017] | 80.07 | 79.50 | 32.00 | No | 06/07/2017 | CNN |
| DPN-98 (40 × 4d) [Chen et al., 2017] | 79.80 | 61.70 | 23.40 | No | 06/07/2017 | CNN |
| DPN-92 (32 × 3d) [Chen et al., 2017] | 79.30 | 37.80 | 13.00 | No | 06/07/2017 | CNN |
| NASNet-A (6 @4032) [Zoph et al., 2018] | 82.70 | 88.90 | 47.60 | No | 21/07/2017 | CNN |
| NASNet-A (7 @1920) [Zoph et al., 2018] | 80.80 | 22.60 | 9.86 | No | 21/07/2017 | CNN |
| SENet-154 [Hu et al., 2019] | 81.32 [Sémery, 2019] | 115.09 [Sémery, 2019] | 41.50 [Sémery, 2019] | No | 05/09/2017 | CNN |
| PNASNet-5 (N = 4, F = 216) [Liu et al., 2018] | 82.90 | 86.10 | 50.00 | No | 02/12/2017 | CNN |
| PNASNet-5 (N = 3, F = 54) [Liu et al., 2018] | 74.20 | 5.10 | 1.18 | No | 02/12/2017 | CNN |
| MobileNetV2 [Sandler et al., 2019] | 72.00 | 3.40 | 0.60 | No | 13/01/2018 | CNN |
| MobileNetV2 1.4 [Sandler et al., 2019] | 74.70 | 6.90 | 1.18 | No | 13/01/2018 | CNN |
| AmoebaNet-A (N=6, F=190) [Real et al., 2019] | 82.80 | 86.70 | 46.20 | No | 05/02/2018 | CNN |
| AmoebaNet-A (N=6, F=448) [Real et al., 2019] | 83.90 | 469.00 | 208.00 | No | 05/02/2018 | CNN |
| ResNeXt-101 32×32d [Mahajan et al., 2018] | 85.10 | 466.00 | 174.00 | Instagram 940M | 02/05/2018 | CNN |
| ResNeXt-101 32×48d [Mahajan et al., 2018] | 85.40 | 829.00 | 306.00 | Instagram 940M | 02/05/2018 | CNN |
| ShuffleNetV2 x1.0 [Ma et al., 2018] | 69.40 | 2.28 [Sémery, 2019] | 0.30 | No | 30/07/2018 | CNN |
| ResNeXt-101 32x16d [Yalniz et al., 2019b,a] | 84.80 | 193.00 | 72.00 | Custom 940M | 02/05/2019 | CNN |
| ResNeXt-101 32x8d [Yalniz et al., 2019b,a] | 84.30 | 88.00 | 32.00 | Custom 940M | 02/05/2019 | CNN |
| ResNeXt-50 32x4d [Yalniz et al., 2019b,a] | 82.20 | 25.00 | 8.00 | Custom 940M | 02/05/2019 | CNN |
| EfficientNet-B0 [Tan and Le, 2020] | 77.10 | 5.30 | 0.78 | No | 28/05/2019 | CNN |
| EfficientNet-B1 [Tan and Le, 2020] | 79.10 | 7.80 | 1.40 | No | 28/05/2019 | CNN |
| EfficientNet-B2 [Tan and Le, 2020] | 80.10 | 9.20 | 2.00 | No | 28/05/2019 | CNN |
| EfficientNet-B3 [Tan and Le, 2020] | 81.60 | 12.00 | 3.60 | No | 28/05/2019 | CNN |
| EfficientNet-B4 [Tan and Le, 2020] | 82.90 | 19.00 | 8.40 | No | 28/05/2019 | CNN |
| EfficientNet-B5 [Tan and Le, 2020] | 83.60 | 30.00 | 19.80 | No | 28/05/2019 | CNN |
| EfficientNet-B6 [Tan and Le, 2020] | 84.00 | 43.00 | 38.00 | No | 28/05/2019 | CNN |
| EfficientNet-B7 [Tan and Le, 2020] | 84.30 | 66.00 | 74.00 | No | 28/05/2019 | CNN |
| NoisyStudent-B0 [Xie et al., 2020] | 78.80 | 5.30 | 0.78 | JFT 300M | 11/11/2019 | CNN |
| NoisyStudent-B1 [Xie et al., 2020] | 81.50 | 7.80 | 1.40 | JFT 300M | 11/11/2019 | CNN |
| NoisyStudent-B2 [Xie et al., 2020] | 82.40 | 9.20 | 2.00 | JFT 300M | 11/11/2019 | CNN |
| NoisyStudent-B3 [Xie et al., 2020] | 84.10 | 12.00 | 3.60 | JFT 300M | 11/11/2019 | CNN |
| NoisyStudent-B4 [Xie et al., 2020] | 85.30 | 19.00 | 8.40 | JFT 300M | 11/11/2019 | CNN |
| NoisyStudent-B5 [Xie et al., 2020] | 86.10 | 30.00 | 19.80 | JFT 300M | 11/11/2019 | CNN |
| NoisyStudent-B6 [Xie et al., 2020] | 86.40 | 43.00 | 38.00 | JFT 300M | 11/11/2019 | CNN |
| NoisyStudent-B7 [Xie et al., 2020] | 86.90 | 66.00 | 74.00 | JFT 300M | 11/11/2019 | CNN |
| NoisyStudent-L2 [Xie et al., 2020] | 88.40 | 480.00 | 1040.00 | JFT 300M | 11/11/2019 | CNN |
| FixEfficientNet-L2 [Touvron et al., 2020] | 88.50 | 480.00 | 585.00 ∗ | JFT 300M | 18/03/2020 | CNN |
| FixEfficientNet-B7 [Touvron et al., 2020] | 85.30 | 66.00 | 82.00 ∗ | No | 18/03/2020 | CNN |
| FixEfficientNet-B0 [Touvron et al., 2020] | 79.30 | 5.30 | 1.60 ∗ | No | 18/03/2020 | CNN |
| Meta Pseudo Labels L2 [Pham et al., 2021] | 90.20 | 480.00 | 1040.00 ∗ | JFT 300M | 23/03/2020 | CNN |
| ResNeSt-269 [Zhang et al., 2020] | 84.50 | 111.00 | 155.8 † | No | 19/04/2020 | CNN |
| ResNeSt-200 [Zhang et al., 2020] | 83.90 | 70.00 | 71.56 † | No | 19/04/2020 | CNN |
| ResNeSt-50 [Zhang et al., 2020] | 81.13 | 27.50 | 10.78 | No | 19/04/2020 | CNN |
| ViT-L/16 [Dosovitskiy et al., 2021] | 85.30 | 304.00 [Tan and Le, 2021] | 384.00 [Tan and Le, 2021] | ImageNet 21k | 22/10/2020 | Transformer |
| ViT-L/16 [Dosovitskiy et al., 2021] | 87.12 | 304.00 [Tan and Le, 2021] | 384.00 [Tan and Le, 2021] | JFT 300M | 22/10/2020 | Transformer |
| ViT-B/16 [Dosovitskiy et al., 2021] | 84.60 [Tan and Le, 2021] | 87.00 [Tan and Le, 2021] | 112.00 [Tan and Le, 2021] | ImageNet 21k | 22/10/2020 | Transformer |
| DeiT-small [Touvron et al., 2021b,a] | 79.90 | 22.00 | 9.20 [Yuan et al., 2021] | No | 23/12/2020 | Transformer |
| DeiT-small-Distilled [Touvron et al., 2021b,a] | 81.20 | 22.00 | 9.40 [Yuan et al., 2021] | No | 23/12/2020 | Transformer |
| DeiT-base [Touvron et al., 2021b,a] | 81.80 | 86.00 | 36.00 [Tan and Le, 2021] | No | 23/12/2020 | Transformer |
| DeiT-base-384 [Touvron et al., 2021b,a] | 82.90 | 86.00 | 112.00 [Tan and Le, 2021] | No | 23/12/2020 | Transformer |
| BotNet-T7 [Srinivas et al., 2021] | 84.70 | 75.00 | 92.00 | No | 27/01/2021 | Hybrid |
| BotNet-T5 [Srinivas et al., 2021] | 83.50 | 75.10 | 38.60 | No | 27/01/2021 | Hybrid |
| T2T-ViTt-14 [Yuan et al., 2021] | 81.70 | 21.50 | 12.20 | No | 28/01/2021 | Transformer |
| T2T-ViTt-19 [Yuan et al., 2021] | 82.20 | 39.20 | 19.60 | No | 28/01/2021 | Transformer |
| T2T-ViTt-24 [Yuan et al., 2021] | 82.60 | 64.10 | 30.00 | No | 28/01/2021 | Transformer |
| NFNet-F4+ [Brock et al., 2021] | 89.20 | 527.00 | 734.00 | JFT 300M | 11/02/2021 | CNN |
| NFNet-F0 [Brock et al., 2021] | 83.60 | 71.50 | 24.76 | No | 11/02/2021 | CNN |
| NFNet-F6+SAM [Brock et al., 2021] | 86.50 | 438.40 | 754.56 | No | 11/02/2021 | CNN |
| Swin-B 224 [Liu et al., 2021] | 85.20 | 88.00 | 30.80 | ImageNet 21k | 25/03/2021 | Transformer |
| Swin-B 384 [Liu et al., 2021] | 86.00 | 88.00 | 94.00 | ImageNet 21k | 25/03/2021 | Transformer |
| Swin-L [Liu et al., 2021] | 86.40 | 197.00 | 207.80 | ImageNet 21k | 25/03/2021 | Transformer |
| CrossViT-15 [Chen et al., 2021] | 81.50 | 27.40 | 11.60 | No | 27/03/2021 | Transformer |
| CrossViT-18 [Chen et al., 2021] | 82.50 | 43.30 | 18.06 | No | 27/03/2021 | Transformer |
| CaiT-S36 [Touvron et al., 2021c] | 83.30 | 68.00 | 27.80 | No | 31/03/2021 | Transformer |
| CaiT-S36 dist [Touvron et al., 2021c] | 84.00 | 68.00 | 27.80 | No | 31/03/2021 | Transformer |
| CaiT-S24-384 dist [Touvron et al., 2021c] | 85.10 | 46.90 | 64.40 | No | 31/03/2021 | Transformer |
| CaiT-M48-448 dist [Touvron et al., 2021c] | 86.50 | 356.00 | 659.20 | No | 31/03/2021 | Transformer |
| EfficientNetV2-S [Tan and Le, 2021] | 83.90 | 24.00 | 17.60 | No | 01/04/2021 | CNN |
| EfficientNetV2-M [Tan and Le, 2021] | 85.10 | 55.00 | 48.00 | No | 01/04/2021 | CNN |
| EfficientNetV2-L [Tan and Le, 2021] | 85.70 | 121.00 | 106.00 | No | 01/04/2021 | CNN |
| EfficientNetV2-S [Tan and Le, 2021] | 85.00 | 24.00 | 17.60 | ImageNet 21k | 01/04/2021 | CNN |
| EfficientNetV2-M [Tan and Le, 2021] | 86.10 | 55.00 | 48.00 | ImageNet 21k | 01/04/2021 | CNN |
| EfficientNetV2-L [Tan and Le, 2021] | 86.80 | 121.00 | 106.00 | ImageNet 21k | 01/04/2021 | CNN |
| ViT-G/14 [Zhai et al., 2021] | 90.45 | 1843.00 | 5270.00 ∗ | JFT 3B | 08/06/2021 | Transformer |
## Methodology Details for NLP Models
As previously stated, for NLP models we included all models since 2017 for which we could find an inference compute estimate. Many papers do not explain how they count FLOPs (as single mathematical operations or as single hardware instructions), but we ultimately found this information explained in [Clark et al., 2020]. We compared the numbers presented there with estimates in other publications (for repeated and similar models) and found them very similar, so we assume the other authors follow the same standard procedure to count FLOPs. In NLP, FLOPs are counted as single mathematical operations and not as single hardware instructions (unlike in CV). What matters is that we use the same approach for all NLP models, as comparisons and analyses are always intra-domain, never inter-domain.
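As a rough illustration of the magnitudes involved, inference FLOPs for a Transformer can be approximated as about 2 FLOPs per parameter per processed token, in line with the scaling-laws literature [Kaplan et al., 2020]. This is a sketch under that approximation; the model size below is an illustrative assumption, not a value from our dataset.

```python
# Approximate Transformer inference FLOPs: ~2 FLOPs (one multiply-add)
# per parameter per token, ignoring the attention term, which is small
# for short sequences. Illustrative only.

def transformer_inference_flops(n_params: float, seq_len: int) -> float:
    return 2.0 * n_params * seq_len

# e.g. a BERT-Base-sized model (~110M parameters) on a 128-token input:
flops = transformer_inference_flops(110e6, 128)   # -> 2.816e10, ~28 GFLOPs
```

This back-of-the-envelope figure lands in the same order of magnitude as published per-inference estimates for models of this size, which is why we found the cross-paper numbers consistent.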
## Datasets
## ImageNet
ImageNet has been the most widely used dataset of the last decade for training and evaluating CV models. The full dataset consists of 14,197,122 images distributed in 21,841 classes; researchers refer to it as ImageNet21k or ImageNet22k. However, researchers commonly use a subset of the full ImageNet dataset, consisting of 1.2 million images for training and 50,000 images for validation, distributed in 1,000 classes. This subset was released for the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) and is usually referred to as ImageNet1k or simply ImageNet. In 2012 the AlexNet model [Krizhevsky et al., 2012] won the ILSVRC 2012 image classification challenge with an impressive result, outperforming the other models by a large margin. AlexNet was the first DNN to win this competition; since then, many other DNNs have been created for image classification.
## GLUE
The General Language Understanding Evaluation (GLUE) benchmark [Wang et al., 2019] is a collection of resources for evaluating and analysing the performance of models across a diverse range of existing NLP tasks, with the goal of driving 'research in the development of general and robust natural language understanding systems'. The collection in GLUE consists of nine 'difficult and diverse' tasks, mostly adopted from existing datasets. The tasks involve sentiment analysis, acceptability, paraphrasing, natural language inference and coreference resolution. GLUE is model-agnostic, but it incentivises sharing knowledge across tasks (using parameter sharing or other transfer learning techniques) due to the limited training data for certain tasks.
## Hardware data compilation: floating point precision details
At the end of 2017 Nvidia launched GPUs with new features for AI acceleration: improved lower-precision performance and tensor cores, which accelerate low-precision calculations [NVIDIA, 2017]. For instance, many new GPUs accelerate FP16 operations through tensor cores (DNNs can operate at low precision in many calculations without problems) and combine them with FP32 operations when necessary. In this way, higher performance is obtained while maintaining calculation precision. Nvidia specifies different FLOPS figures for FP16 and for tensor cores. Nowadays, frameworks such as PyTorch and TensorFlow make it easy to train and run inference with mixed precision, i.e., taking advantage of the tensor cores, with practically no significant reduction in accuracy. For all these reasons, we consider it necessary to include the performance achieved with tensor cores in our analysis.
Theoretical FLOPS using tensor cores are very high, but this increase does not correspond to the gain seen in practice for deep learning applications (gaming may be different), because not all operations can use the tensor cores. To resolve the discrepancy between tensor-core FLOPS and their real utilisation, we calculate the speed-up achieved for DNNs when inference is done with mixed precision. We looked for experimental results to adjust the tensor FP16/FP32 FLOPS to the real performance improvement; the inference experimental results we use are available in the Nvidia NGC Catalog 8 . The collected data can be found in Table 3.
8 https://ngc.nvidia.com/catalog/resources
Table 3: Throughput measures for V100, A100 and T4 GPUs on different models. The 'speed-up' column is the speed-up achieved with respect to FP32 throughput using different precision formats. The A100 speed-up is calculated with respect to V100 FP32 throughput. Data obtained from the NVIDIA NGC catalog (https://ngc.nvidia.com/catalog/resources).
| Task | Model | Framework | Batch size | GPU | Precision | Throughput | Speed-up |
|--------|-----------------------------------------|-----------------------|--------------|---------------------|-------------|---------------|------------|
| | efficientnet-b0 | PyTorch | 256 | V100 16GB | FP32 | 2968 | 1.00 |
| | efficientnet-b0 | PyTorch | 256 | V100 16GB | Mixed | 6176 | 2.08 |
| | efficientnet-b0 | PyTorch | 256 | A100 80GB | TF32 | 5154 | 1.74 |
| | efficientnet-b0 | PyTorch | 256 | A100 80GB | Mixed | 10239 | 3.45 |
| | efficientnet-b4 | PyTorch | 128 | V100 16GB | FP32 | 376 | 1.00 |
| | efficientnet-b4 | PyTorch | 128 | V100 16GB | Mixed | 843 | 2.24 |
| | efficientnet-b4 | PyTorch | 128 | A100 80GB | TF32 | 700 | 1.86 |
| | efficientnet-b4 | PyTorch | 128 | A100 80GB | Mixed | 1418 | 3.77 |
| | ResNeXt101-32x4d | PyTorch | 256 | V100 16GB | FP32 | 533 | 1.00 |
| | ResNeXt101-32x4d | PyTorch | 256 | V100 16GB | Mixed | 1746 | 3.28 |
| | ResNeXt101-32x4d | PyTorch | 256 | T4 16GB | FP32 | 161 | 1.00 |
| | ResNeXt101-32x4d | PyTorch | 256 | T4 16GB | Mixed | 598 | 3.71 |
| | ResNet v1.5 | PyTorch | 256 | V100 16GB | FP32 | 1261 | 1.00 |
| | ResNet v1.5 | PyTorch | 256 | V100 16GB | Mixed | 3382 | 2.68 |
| | ResNet v1.5 | PyTorch | 256 | T4 16GB | FP32 | 415 | 1.00 |
| | ResNet v1.5 | PyTorch | 256 | T4 16GB | Mixed | 1198 | 2.89 |
| | ResNet v1.5 | TensorFlow | 256 | V100 16GB | FP32 | 1348.52 | 1.00 |
| | ResNet v1.5 | TensorFlow | 256 | V100 16GB | Mixed | 2742.14 | 2.03 |
| CV | ResNet v1.5 | TensorFlow | 256 | A100 40GB | TF32 | 1911.96 | 1.42 |
| | ResNet v1.5 | TensorFlow | 256 | A100 40GB | Mixed | 3229.32 | 2.39 |
| | ResNet v1.5 | TensorFlow | 256 | T4 16GB | FP32 | 425.72 | 1.00 |
| | ResNet v1.5 | TensorFlow | 256 | T4 16GB | Mixed | 993.39 | 2.33 |
| | SSD v1.1 | PyTorch | 32 | V100 16GB | FP32 | 271.73 | 1.00 |
| | SSD v1.1 | PyTorch | 32 | V100 16GB | Mixed | 438.85 | 1.62 |
| | SSD v1.1 | PyTorch | 32 | A100 40GB | TF32 | 548.75 | 2.02 |
| | SSD v1.1 | PyTorch | 32 | A100 40GB | Mixed | 910.17 | 3.35 |
| | UNet Industrial | TensorFlow | 16 | V100 16GB | FP32 | 250.23 | 1.00 |
| | UNet Industrial | TensorFlow | 16 | V100 16GB | Mixed | 469.27 | 1.88 |
| | UNet Industrial | TensorFlow | 16 | A100 40GB | TF32 | 424.57 | 1.70 |
| | UNet Industrial | TensorFlow | 16 | A100 40GB | Mixed | 823.46 | 3.29 |
| | SE-ResNeXt101-32x4d | TensorFlow | 128 | V100 16GB | FP32 | 460.82 | 1.00 |
| | SE-ResNeXt101-32x4d | TensorFlow | 128 | V100 16GB | Mixed | 1102 | 2.39 |
| | SE-ResNeXt101-32x4d | TensorFlow | 128 | A100 40GB | TF32 | 802.64 | 1.74 |
| | SE-ResNeXt101-32x4d | TensorFlow | 128 | A100 40GB | Mixed | 1728.27 | 3.75 |
| | SE-ResNeXt101-32x4d | TensorFlow | 128 | T4 16GB | FP32 | 105.16 | 1.00 |
| | SE-ResNeXt101-32x4d | TensorFlow | 128 | T4 16GB | Mixed | 195.17 | 1.86 |
| | BERT-LARGE | TensorFlow | 8 | V100 16GB | FP32 | 44.03 | 1.00 |
| | BERT-LARGE | TensorFlow | 8 | V100 16GB | Mixed | 168.34 | 3.82 |
| | BERT-LARGE | TensorFlow | 8 | A100 80GB | TF32 | 241.68 | 5.49 |
| | BERT-LARGE | TensorFlow | 8 | A100 80GB | Mixed | 342.22 | 7.77 |
| | BERT-LARGE | TensorFlow | 8 | T4 16GB | FP32 | 16.04 | 1.00 |
| | BERT-LARGE | TensorFlow | 8 | T4 16GB | Mixed | 62.99 | 3.93 |
| | BERT-Base | TensorFlow | 8 | V100 16GB | FP32 | 146.15 | 1.00 |
| | BERT-Base | TensorFlow | 8 | V100 16GB | Mixed | 504.24 | 3.45 |
| | BERT-Base | TensorFlow | 8 | A100 80GB | TF32 | 645.88 | 4.42 |
| | BERT-Base | TensorFlow | 8 | A100 80GB | Mixed | 846.81 | 5.79 |
| NLP | BERT-Base | TensorFlow | 8 | T4 16GB | FP32 | 51.33 | 1.00 |
| | BERT-Base | TensorFlow | 8 | T4 16GB | Mixed | 192.61 | 3.75 |
| | Transformer-XL | TensorFlow | 32 | V100 16GB | FP32 | 8555.6 | 1.00 |
| | Transformer-XL | TensorFlow | 32 | V100 16GB | Mixed | 11215.5 | 1.31 |
| | Transformer-XL | TensorFlow | 32 | A100 40GB | TF32 | 19434.5 | 2.27 |
| | Transformer-XL | TensorFlow | 32 | A100 40GB | Mixed | 21854.7 | 2.55 |
| | Transformer-XL | TensorFlow | 32 | T4 16GB | FP32 | 3439.1 | 1.00 |
| | Transformer-XL | TensorFlow | 32 | T4 16GB | Mixed | 6174.3 | 1.80 |
| | Transformer | PyTorch | 10240 | V100 16GB | FP32 | 3782 | 1.00 |
| | Transformer | PyTorch | 10240 | V100 16GB | Mixed | 7464 | 1.97 |
| | Transformer | PyTorch | 10240 | A100 40GB | TF32 | 7755 | 2.05 |
| | Transformer | PyTorch | 10240 | A100 40GB | Mixed | 9653 | 2.55 |
We do not include estimated mixed-precision performance for all GPUs that support it, because we have not found sufficient benchmarks for all of them to carry out an estimation. We also do not consider the INT8 precision format, because in many cases this format degrades model quality, so the accuracy metric of the models would need to be adapted for a fair analysis. We perform separate estimations for CV and NLP networks, because these two kinds of networks operate in different ways and benefit differently from mixed precision. During training, the speed-up from mixed precision compared to FP32 is usually around 2x for image models and up to 4x for language models [Li, 2020]; this is corroborated by benchmarks reported on Nvidia blogs [NVIDIA, 2018].
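For clarity, the speed-up column in Table 3 is simply the ratio of measured throughput (items per second) to the FP32 baseline on the same GPU, with the A100 referred to the V100 baseline since it lacks a plain FP32 mode. A minimal sketch using the efficientnet-b0 rows of Table 3:

```python
# Derive the 'speed-up' column of Table 3: throughput under a given
# precision divided by the FP32 baseline throughput.

def speed_up(throughput: float, fp32_baseline: float) -> float:
    return throughput / fp32_baseline

# efficientnet-b0, PyTorch, batch 256 (values copied from Table 3):
v100_fp32, v100_mixed = 2968.0, 6176.0
a100_tf32, a100_mixed = 5154.0, 10239.0

assert round(speed_up(v100_mixed, v100_fp32), 2) == 2.08
# A100 ratios are taken against the V100 FP32 baseline:
assert round(speed_up(a100_tf32, v100_fp32), 2) == 1.74
assert round(speed_up(a100_mixed, v100_fp32), 2) == 3.45
```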
## Hardware mixed precision speed-ups
As we have discussed, the theoretical FLOPS of tensor cores are very high, as we can see in Fig. 7 in the main text. However, inference performance using tensor cores is not as high. For this reason we propose an estimation for the Nvidia GPUs V100, A100 and T4, separately for CV models and for NLP models. For these calculations we collected inference data from NVIDIA NGC. The estimations for the A100 are relative to the V100, because there is no FP32 data for the A100 (FP32 is substituted by TF32 9 , a precision format in between FP32 and FP16), so we estimated the speed-up with respect to V100 FP32 FLOPS.
Table 4: Mixed precision speed ups from experimental results for inference.
| GPU | Precision speed up | CV models | NLP models |
|-------|--------------------------------------------------------------------|-------------|--------------|
| V100 | Mixed speed up ratio to V100 FP32 | 2.27 | 2.64 |
| A100 | TF32 speed up ratio to V100 FP32 | 1.75 | 3.56 |
| A100 | Mixed speed up ratio to V100 FP32 | 3.33 | 4.67 |
| T4 | Mixed speed up ratio to T4 FP32 | 2.7 | 3.16 |
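The 'adapted' throughputs in Table 9 follow directly from these ratios: each is the FP32 peak of the baseline GPU multiplied by the empirical speed-up. A minimal sketch (variable and function names are ours; baseline TFLOPS are the V100 and T4 FP32 peaks from Table 8):

```python
# Effective TFLOPS = FP32 baseline peak x empirical speed-up ratio (Table 4).
V100_FP32_TFLOPS = 15.7  # Tesla V100 FP32 peak (Table 8)
T4_FP32_TFLOPS = 8.1     # Tesla T4 FP32 peak (Table 8)

def effective_tflops(baseline_tflops, speedup_ratio):
    """Effective throughput implied by a measured speed-up over FP32."""
    return baseline_tflops * speedup_ratio

# (GPU, precision, baseline TFLOPS, CV ratio, NLP ratio) per Table 4
rows = [
    ("V100", "Mixed", V100_FP32_TFLOPS, 2.27, 2.64),
    ("A100", "TF32",  V100_FP32_TFLOPS, 1.75, 3.56),
    ("A100", "Mixed", V100_FP32_TFLOPS, 3.33, 4.67),
    ("T4",   "Mixed", T4_FP32_TFLOPS,   2.70, 3.16),
]
for gpu, prec, base, cv, nlp in rows:
    print(f"{gpu} {prec}: CV {effective_tflops(base, cv):.2f} TFLOPS, "
          f"NLP {effective_tflops(base, nlp):.2f} TFLOPS")
```

For example, the V100 mixed-precision entries reproduce the 35.71 (CV) and 41.44 (NLP) TFLOPS values listed for it in Table 9, up to rounding.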
## Performance and compute (NLP)
Fig. 15 shows the improvement in GLUE score over the years, with bubble size representing each model's inference GFLOPs. GFLOPs are for a single input of length 128, a reasonable sequence length for many use cases, able to fit text messages or short emails. We observe an evolution very similar to the one observed in ImageNet: SOTA models require a large number of FLOPs, but in a short period of time other models appear that reach the same score with far fewer FLOPs.
Figure 15: GFLOPs per token analysis for NLP models.
## FLOPS estimation for CV models
## EfficientNet-Based Models FLOPs Estimation
There are many EfficientNet variations, mostly using different input resolutions or scaling factors. For these modified models, FLOPs are not always reported. In this work, we estimate them following the relation presented in Equation 3
$$\mathit{FLOPs} \propto d \cdot w^{2} \cdot r^{2} \quad (3)$$
for the following models:
- NoisyStudent-L2 : Having the scale factors of the networks (Table 5) we estimate NoisyStudent-L2 FLOPs as shown in Equation 4
9 https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
## NLP data
Researchers often report the GLUE score without the score on the WNLI task, because this task is problematic. We have marked which scores are reported without this task. Since there are 9 tasks in total, we consider that excluding one of them is not a problem for our analysis.
We did not find inference GFLOPs for the model BERT-Large, but we have the GFLOPs of ELECTRA-Large, which is actually the same model trained with a different strategy. In this
Table 5: EfficientNet models architecture specifications obtained from [Xie et al., 2020].
| Model | w | d | Test Resolution |
|-----------------|-----|-----|-------------------|
| EfficientNet-B7 | 2 | 3.1 | 600 × 600 |
| EfficientNet-L2 | 4.3 | 5.3 | 800 × 800 |
$$\text{NoisyStudent-L2 FLOPs} = \text{EfficientNet-B7 FLOPs} \cdot d_{\sigma} \cdot w_{\sigma}^{2} \cdot r_{\sigma}^{2} \quad (4)$$
where $d_{\sigma}$, $w_{\sigma}$ and $r_{\sigma}$ are the scale factors for, respectively, the network depth, width and input resolution. Using the values from Table 5, $d_{\sigma} = 5.3/3.1 = 1.7097$, $w_{\sigma} = 4.3/2 = 2.15$ and $r_{\sigma} = 800/600 = 1.3334$. Knowing that the GFLOPs for EfficientNet-B7 are 74, substituting in Equation 4 we obtain the estimation of $74 \text{ GFLOPs} \cdot 1.7097 \cdot 2.15^{2} \cdot 1.3334^{2} \approx 1040$ GFLOPs for NoisyStudent-L2.
- Meta Pseudo Labels L2 : We use the NoisyStudent-L2 FLOPs estimation for Meta Pseudo Labels L2, because it is the same model and only the training strategy changes.
- FixEfficientNet-L2 : FixEfficientNet-L2 uses a resolution of $600 \times 600$ for testing, so the estimation is the same as for NoisyStudent-L2 but without the resolution scaling ($r_{\sigma}$). The estimated GFLOPs are then $74 \text{ GFLOPs} \cdot 1.7097 \cdot 2.15^{2} \approx 585$ GFLOPs.
- FixEfficientNet-B7 : This model is the same as EfficientNet-B7 but uses a slightly different resolution ($632 \times 632$). Therefore, $r_{\sigma} = 632/600 = 1.0534$ and we estimate $74 \text{ GFLOPs} \cdot 1.0534^{2} \approx 82$ GFLOPs.
- FixEfficientNet-B0 : This model is the same as EfficientNet-B0 but uses a higher resolution ($320 \times 320$). Therefore, $r_{\sigma} = 320/224 = 1.4286$ and we estimate $0.78 \text{ GFLOPs} \cdot 1.4286^{2} \approx 1.6$ GFLOPs.
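The four estimations above all instantiate the Equation 3 scaling rule with different ratios. A minimal sketch (function and variable names are ours; base values from Table 5):

```python
# Equation 3/4 scaling rule: FLOPs scale with d * w^2 * r^2 relative to a
# reference EfficientNet model.
def scaled_flops(base_flops, d_ratio=1.0, w_ratio=1.0, r_ratio=1.0):
    """Estimate FLOPs of a rescaled EfficientNet from a reference model."""
    return base_flops * d_ratio * w_ratio**2 * r_ratio**2

B7_GFLOPS = 74    # EfficientNet-B7 reference
B0_GFLOPS = 0.78  # EfficientNet-B0 reference

# NoisyStudent-L2 / Meta Pseudo Labels L2: depth, width and resolution scaled
ns_l2 = scaled_flops(B7_GFLOPS, d_ratio=5.3/3.1, w_ratio=4.3/2, r_ratio=800/600)
# FixEfficientNet-L2: same scaling, but evaluated at the B7 resolution
fix_l2 = scaled_flops(B7_GFLOPS, d_ratio=5.3/3.1, w_ratio=4.3/2)
# FixEfficientNet-B7: only the test resolution changes (632 vs 600)
fix_b7 = scaled_flops(B7_GFLOPS, r_ratio=632/600)
# FixEfficientNet-B0: EfficientNet-B0 evaluated at 320 instead of 224
fix_b0 = scaled_flops(B0_GFLOPS, r_ratio=320/224)

print(round(ns_l2), round(fix_l2), round(fix_b7), round(fix_b0, 1))
```

Running this reproduces the four estimates above: roughly 1040, 585, 82 and 1.6 GFLOPs.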
## ViT-G/14 FLOPs Estimation
In the paper introducing the model [Zhai et al., 2021], the authors provide the GFLOPs for the $224 \times 224$ and $384 \times 384$ resolutions (see Table 6), but they also use a $518 \times 518$ resolution for ViT-G finetuning, so we assume they use the same resolution for testing too. ViT-G/14 is a vision transformer, so the scaling relation presented in Equation 3 does not apply to this kind of model. However, knowing the GFLOPs for $224 \times 224$ and $384 \times 384$ (and given that $r_{\sigma}^{2} = (384/224)^{2} = 2.9388$), we can calculate how GFLOPs scale with resolution: the GFLOPs ratio is $2859.9/965.3 = 2.9627$, so GFLOPs scale approximately quadratically with resolution. Note that this paper reports 'real' FLOPs rather than multiply-add operations. Therefore, we take $r_{\sigma} = 518/384 = 1.3490$ and multiply the GFLOPs for the $384 \times 384$ resolution by this scale factor squared, estimating $2859.9 \text{ GFLOPs} \cdot 1.3490^{2} \approx 5270$ GFLOPs for the ViT-G/14 model.
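The two steps (verifying the quadratic scaling, then extrapolating to the finetuning resolution) can be sketched as follows (variable names are ours; input values from Table 6):

```python
# ViT-G/14 GFLOPs at the two resolutions reported in Table 6
g224, g384 = 965.3, 2859.9

# Step 1: compare the observed GFLOPs ratio with pure quadratic scaling.
observed_ratio = g384 / g224        # ~2.9627
quadratic_ratio = (384 / 224) ** 2  # ~2.9388 -> scaling is ~quadratic

# Step 2: extrapolate quadratically from 384x384 to the 518x518
# finetuning/test resolution.
g518 = g384 * (518 / 384) ** 2
print(f"observed {observed_ratio:.4f} vs quadratic {quadratic_ratio:.4f}; "
      f"estimate at 518x518: {g518:.0f} GFLOPs")
```

The extrapolated value lands in the low-5000s of GFLOPs, consistent with the estimate above up to rounding of the scale factor.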
Table 6: ViT-G/14 GFLOPs from [Zhai et al., 2021].
| Model | GFLOPs (224 × 224) | GFLOPs (384 × 384) |
|----------|--------------------|--------------------|
| ViT-G/14 | 965.3 | 2859.9 |
sense, we take the ELECTRA-Large GFLOPs as the BERT-Large GFLOPs. For ELMo we take the GLUE dev-set score because we could not find the score on the test set (we assume this score should be close to the test-set score). Values are shown in Table 7.
Table 7: NLP models data set. A citation next to a GFLOPs value means that the GFLOPs and Input Tokens values are extracted from that source; otherwise the values are from the model's paper (cited in the 'Model' column). The symbol ♠ means that the GLUE score was calculated without the score on the WNLI task; the symbol ∗ means that we estimated the value; and ♣ means that the GLUE score is for the GLUE dev set instead of the test set.
| Model | Input Tokens | GFLOPs | Params (M) | Date | GLUE test set |
|------------------------------------|----------------|--------------------------------|--------------|------------|------------------------------|
| Transformer [Vaswani et al., 2017] | 512 | 54 [Gholami et al., 2021b] | 65 | 12/06/2017 | - |
| ELMo [Peters et al., 2018] | 128 | 26 [Clark et al., 2020] | 96 | 15/02/2018 | 71.2 [Clark et al., 2020] ♣ |
| GPT-1 [Radford et al., 2018] | 128 | 30 [Clark et al., 2020] | 117 | 11/06/2018 | 75.1 [Devlin et al., 2019] ♠ |
| BERT Large [Devlin et al., 2019] | 128 | 79 | 335 ∗ | 11/10/2018 | 82.1 ♠ |
| BERT-Small [Devlin et al., 2019] | 128 | 3.7 [Clark et al., 2020] | 14 | 11/10/2018 | - |
| BERT-Base [Devlin et al., 2019] | 128 | 29 [Clark et al., 2020] | 110 | 11/10/2018 | 79.6 ♠ |
| GPT-2 [Radford et al., 2019] | 1024 | 3400 [Gholami et al., 2021b] | 1500 | 14/02/2019 | - |
| Megatron [Shoeybi et al., 2020] | 1024 | 18000 [Gholami et al., 2021b] | 8300 | 17/09/2019 | - |
| ALBERT-xxl [Lan et al., 2020] | 512 | 2500 [Gholami et al., 2021b] | 235 | 26/09/2019 | - |
| ALBERT-base [Lan et al., 2020] | 128 | 22.5 [Iandola et al., 2020] | 12 | 26/09/2019 | - |
| Theseus 6/768 [Xu et al., 2020] | 128 | 11.3 [Iandola et al., 2020] | 66 | 07/02/2020 | 77.1 [Iandola et al., 2020] |
| Microsoft T-NLG [Rosset, 2020] | 1024 | 36000 [Gholami et al., 2021b] | 17000 | 13/02/2020 | - |
| ELECTRA Large [Clark et al., 2020] | 128 | 79 [Gholami et al., 2021b] | 335 | 23/03/2020 | 88.6 ♠ |
| ELECTRA-Small [Clark et al., 2020] | 128 | 3.7 | 14 | 23/03/2020 | 78 ♠ |
| ELECTRA-Base [Clark et al., 2020] | 128 | 29 | 110 | 23/03/2020 | 83.5 ♠ |
| MobileBERT [Sun et al., 2020] | 128 | 5.36 | 25.3 | 06/04/2020 | 78.5 ♠ |
| MobileBERT tiny [Sun et al., 2020] | 128 | 3.1 | 15.1 | 06/04/2020 | 75.8 ♠ |
| GPT-3 [Brown et al., 2020] | 2048 | 740000 [Gholami et al., 2021b] | 175000 | 28/05/2020 | - |
| SqueezeBERT [Iandola et al., 2020] | 128 | 7.42 | 51.1 | 19/06/2020 | 78.1 |
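As a sanity check on Table 7 (our addition, not from the paper), a common rule of thumb approximates transformer inference compute as about 2 FLOPs per parameter per input token. A sketch, using a few rows from the table:

```python
# Rough cross-check (an assumption on our part): inference compute for a
# transformer is approximately 2 * parameters * tokens FLOPs.
def approx_gflops(params_millions, tokens):
    """Estimated inference GFLOPs from parameter count (M) and input length."""
    return 2 * params_millions * 1e6 * tokens / 1e9

# (model, params in M, input tokens, GFLOPs as reported in Table 7)
rows = [
    ("BERT-Base", 110, 128, 29),
    ("GPT-2", 1500, 1024, 3400),
    ("GPT-3", 175000, 2048, 740000),
]
for name, params, tokens, reported in rows:
    est = approx_gflops(params, tokens)
    print(f"{name}: estimated {est:.0f} GFLOPs vs reported {reported}")
```

For these three models the rule of thumb lands within about 10% of the reported values, which supports the consistency of the compiled GFLOPs figures.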
## GPU consumption data
Tables 8 and 9 provide further technical details on, respectively, the GPUs' theoretical characteristics (compiled from the manufacturers' specification sheets and reference manuals), and their throughput and power consumption, 'adapted', where necessary, to the specifics of CV or NLP tasks.
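The efficiency column in both tables is derived directly from the other two: GFLOPS/Watt is peak throughput (in TFLOPS, converted to GFLOPS) divided by the TDP. A minimal sketch (function name is ours):

```python
# Efficiency metric used in Tables 8 and 9: peak GFLOPS per Watt of TDP.
def gflops_per_watt(tflops, watts):
    """Convert peak TFLOPS and power draw (W) into GFLOPS/Watt."""
    return tflops * 1000 / watts

# A few rows from Table 8, reproduced up to rounding:
print(round(gflops_per_watt(1.58, 244), 2))  # GeForce GTX 580, FP32
print(round(gflops_per_watt(8.1, 70), 2))    # T4, FP32
print(round(gflops_per_watt(312, 400), 2))   # A100, FP16/FP32 Tensor
```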
Table 8: Nvidia GPUs theoretical data compilation.
| GPU | Precision | TFLOPS | Watts | Launch date | Type | GFLOPS/Watt |
|-------------------------|------------------|--------|---------|---------------|---------|---------------|
| GeForce GTX 580 | FP32 | 1.58 | 244 | 09/11/2010 | Desktop | 6.48 |
| GeForce GTX 590 | FP32 | 2.49 | 365 | 24/03/2011 | Desktop | 6.82 |
| GeForce GTX 680 | FP32 | 3.09 | 195 | 22/03/2012 | Desktop | 15.85 |
| GeForce GTX 690 | FP32 | 5.62 | 300 | 29/04/2012 | Desktop | 18.73 |
| GeForce GTX 780 | FP32 | 4.16 | 250 | 23/04/2013 | Desktop | 16.62 |
| GeForce GTX 780 TI | FP32 | 5.35 | 250 | 07/11/2013 | Desktop | 21.38 |
| GeForce GTX Titan Black | FP32 | 5.65 | 250 | 18/02/2014 | Desktop | 22.58 |
| GeForce GTX Titan Z | FP32 | 8.12 | 375 | 28/05/2014 | Desktop | 21.66 |
| GeForce GTX 980 | FP32 | 4.98 | 165 | 18/09/2014 | Desktop | 30.19 |
| GeForce GTX 980 Ti | FP32 | 6.06 | 250 | 02/06/2015 | Desktop | 24.24 |
| GeForce GTX TITAN X | FP32 | 6.69 | 250 | 17/03/2015 | Desktop | 26.76 |
| GeForce GTX 1080 | FP32 | 8.87 | 180 | 26/05/2016 | Desktop | 49.29 |
| GeForce GTX 1080 Ti | FP32 | 11.34 | 250 | 10/03/2017 | Desktop | 45.36 |
| TITAN X Pascal | FP32 | 10.97 | 250 | 02/08/2016 | Desktop | 43.88 |
| TITAN XP | FP32 | 12.15 | 250 | 06/04/2017 | Desktop | 48.6 |
| GeForce RTX 2080 | FP32 | 10.07 | 215 | 20/09/2018 | Desktop | 46.84 |
| GeForce RTX 2080 Ti | FP32 | 13.45 | 250 | 20/09/2018 | Desktop | 53.8 |
| Nvidia Titan RTX | FP32 | 16.31 | 280 | 18/12/2018 | Desktop | 58.26 |
| GeForce RTX 3080 | FP32 | 29.8 | 320 | 01/09/2020 | Desktop | 93.13 |
| GeForce RTX 3090 | FP32 | 35.6 | 350 | 01/09/2020 | Desktop | 101.71 |
| GeForce RTX 2080 | FP16 | 20.14 | 215 | 20/09/2018 | Desktop | 93.67 |
| GeForce RTX 2080 Ti | FP16 | 26.9 | 250 | 20/09/2018 | Desktop | 107.6 |
| Nvidia Titan RTX | FP16 | 32.62 | 280 | 18/12/2018 | Desktop | 116.5 |
| GeForce RTX 3080 | FP16 | 29.8 | 320 | 01/09/2020 | Desktop | 93.13 |
| GeForce RTX 3090 | FP16 | 35.6 | 350 | 01/09/2020 | Desktop | 101.71 |
| GeForce RTX 2080 | FP16/FP32 Tensor | 40.3 | 215 | 20/09/2018 | Desktop | 187.44 |
| GeForce RTX 2080 Ti | FP16/FP32 Tensor | 56.9 | 250 | 20/09/2018 | Desktop | 227.6 |
| Nvidia Titan RTX | FP16/FP32 Tensor | 130.5 | 280 | 18/12/2018 | Desktop | 466.07 |
| GeForce RTX 3080 | FP16/FP32 Tensor | 59.5 | 320 | 01/09/2020 | Desktop | 185.94 |
| GeForce RTX 3090 | FP16/FP32 Tensor | 71 | 350 | 01/09/2020 | Desktop | 202.86 |
| Tesla K10 | FP32 | 4.58 | 225 | 01/05/2012 | Server | 20.36 |
| Tesla K20x | FP32 | 3.94 | 235 | 12/11/2012 | Server | 16.74 |
| Tesla K40 | FP32 | 5.04 | 235 | 08/10/2013 | Server | 21.45 |
| Tesla K80 | FP32 | 8.22 | 300 | 17/10/2014 | Server | 27.4 |
| Tesla M40 | FP32 | 6.84 | 250 | 10/10/2015 | Server | 27.36 |
| Tesla M60 | FP32 | 9.65 | 300 | 30/08/2015 | Server | 32.17 |
| Tesla P100 | FP16 | 21.2 | 300 | 20/05/2016 | Server | 70.67 |
| Tesla V100 | FP16 | 31.4 | 300 | 27/03/2018 | Server | 104.67 |
| A100 | FP16 | 78 | 400 | 14/04/2020 | Server | 195 |
| Tesla P100 | FP32 | 10.6 | 300 | 20/05/2016 | Server | 35.33 |
| Tesla V100 | FP32 | 15.7 | 300 | 27/03/2018 | Server | 52.33 |
| A100 | FP32 | 19.5 | 400 | 14/04/2020 | Server | 48.75 |
| A30 | FP32 | 10.3 | 165 | 12/04/2021 | Server | 62.42 |
| Tesla V100 | FP16/FP32 Tensor | 125 | 300 | 27/03/2018 | Server | 416.67 |
| A100 | FP16/FP32 Tensor | 312 | 400 | 14/04/2020 | Server | 780 |
| A30 | FP16/FP32 Tensor | 165 | 165 | 12/04/2021 | Server | 1000 |
| T4 | FP32 | 8.1 | 70 | 13/09/2018 | Server | 115.71 |
| T4 | FP16/FP32 Tensor | 65 | 70 | 13/09/2018 | Server | 928.57 |
Table 9: GPUs throughput and power consumption data compilation.
| Adapted | GPU | Precision | TFLOPS | Watts | Launch date | Type | GFLOPS/Watt |
|-----------|-------------------------|-------------|--------|-----|---------------|---------|---------------|
| | GeForce GTX 580 | FP32 | 1.58 | 244 | 09/11/2010 | Desktop | 6.48 |
| | GeForce GTX 590 | FP32 | 2.49 | 365 | 24/03/2011 | Desktop | 6.82 |
| | GeForce GTX 680 | FP32 | 3.09 | 195 | 22/03/2012 | Desktop | 15.85 |
| | GeForce GTX 690 | FP32 | 5.62 | 300 | 29/04/2012 | Desktop | 18.73 |
| | Tesla K10 | FP32 | 4.58 | 225 | 01/05/2012 | Server | 20.36 |
| | Tesla K20x | FP32 | 3.94 | 235 | 12/11/2012 | Server | 16.77 |
| | GeForce GTX 780 | FP32 | 4.16 | 250 | 23/04/2013 | Desktop | 16.64 |
| | Tesla K40 | FP32 | 5.04 | 235 | 08/10/2013 | Server | 21.45 |
| | GeForce GTX 780 TI | FP32 | 5.35 | 250 | 07/11/2013 | Desktop | 21.4 |
| | GeForce GTX Titan Black | FP32 | 5.65 | 250 | 18/02/2014 | Desktop | 22.6 |
| | GeForce GTX Titan Z | FP32 | 8.12 | 375 | 28/05/2014 | Desktop | 21.65 |
| | GeForce GTX 980 | FP32 | 4.98 | 165 | 18/09/2014 | Desktop | 30.18 |
| | Tesla K80 | FP32 | 8.22 | 300 | 17/10/2014 | Server | 27.4 |
| No | GeForce GTX TITAN X | FP32 | 6.69 | 250 | 17/03/2015 | Desktop | 26.76 |
| No | GeForce GTX 980 Ti | FP32 | 6.06 | 250 | 02/06/2015 | Desktop | 24.24 |
| No | Tesla M60 | FP32 | 9.65 | 300 | 30/08/2015 | Server | 32.17 |
| No | Tesla M40 | FP32 | 6.84 | 250 | 10/10/2015 | Server | 27.36 |
| No | GeForce GTX 1080 | FP32 | 8.87 | 180 | 26/05/2016 | Desktop | 49.28 |
| No | TITAN X Pascal | FP32 | 10.97 | 250 | 02/08/2016 | Desktop | 43.88 |
| No | GeForce GTX 1080 Ti | FP32 | 11.34 | 250 | 10/03/2017 | Desktop | 45.36 |
| No | TITAN XP | FP32 | 12.15 | 250 | 06/04/2017 | Desktop | 48.6 |
| No | Tesla V100 | FP32 | 15.7 | 300 | 27/03/2018 | Server | 52.33 |
| No | Tesla T4 | FP32 | 8.1 | 70 | 13/09/2018 | Server | 115.71 |
| No | GeForce RTX 2080 | FP32 | 10.07 | 215 | 20/09/2018 | Desktop | 46.84 |
| No | GeForce RTX 2080 Ti | FP32 | 13.45 | 250 | 20/09/2018 | Desktop | 53.8 |
| No | Nvidia Titan RTX | FP32 | 16.31 | 280 | 18/12/2018 | Desktop | 58.25 |
| No | GeForce RTX 3080 | FP32 | 29.8 | 320 | 01/09/2020 | Desktop | 93.13 |
| No | GeForce RTX 3090 | FP32 | 35.6 | 350 | 01/09/2020 | Desktop | 101.71 |
| For CNN | Tesla V100 | Mixed | 35.71 | 300 | 27/03/2018 | Server | 119.03 |
| For CNN | Tesla T4 | Mixed | 21.85 | 70 | 13/09/2018 | Server | 312.15 |
| For CNN | A100 | TF32 | 27.41 | 400 | 14/04/2020 | Server | 68.52 |
| For CNN | A100 | Mixed | 52.35 | 400 | 14/04/2020 | Server | 130.88 |
| For NLP | Tesla V100 | Mixed | 41.44 | 300 | 27/03/2018 | Server | 138.13 |
| For NLP | Tesla T4 | Mixed | 25.58 | 70 | 13/09/2018 | Server | 365.46 |
| For NLP | A100 | TF32 | 55.85 | 400 | 14/04/2020 | Server | 139.64 |
| For NLP | A100 | Mixed | 73.29 | 400 | 14/04/2020 | Server | 183.23 |