On the Slow Death of Scaling

Mike's Notes

These data centres are very inefficient power users. The AI models are wasteful. As a result, I can see working people's power bills rising, causing more suffering.

This is a great article.

Resources

References

  • Hooker, Sara, On the Slow Death of Scaling (December 06, 2025). Available at SSRN: https://ssrn.com/abstract=5877662 or http://dx.doi.org/10.2139/ssrn.5877662

Repository

  • Home > Ajabbi Research > Library >
  • Home > Handbook > 

Last Updated

04/02/2026

On the Slow Death of Scaling

By: Sara Hooker
SSRN: 6/12/2025

Adaption Research Scientist.

Abstract

For the last decade, it has been hard to stray off the beaten path of accepted wisdom for what drives innovation. We have been held hostage to a painfully simple formula: scale model size and training data. A pervasive belief in scaling has resulted in a massive windfall in capital for industry labs and fundamentally reshaped the culture of conducting science in our field. Academia has been marginalized from meaningfully participating in AI progress and industry labs have stopped publishing. Yet, this essay will posit that the relationship between training compute and performance is highly uncertain and rapidly changing. Relying on scaling alone misses a critical shift that is underway, and ignores more interesting levers of progress. All this suggests that key disruptions lie ahead.

1 How we got here.

Figure 1: Estimated training cost of select AI models, 2016–23 (Maslej et al., 2024). Training cost in U.S. dollars (log scale) plotted against publication date. Source: Epoch, 2023; chart: 2024 AI Index report. The last decade has been characterized by an explosion in the size of models and the cost of participating at the frontier of research.

Many inventions are re-purposed for means unintended by their designers. Initially, the magnetron tube was developed for radar technology during World War II. In 1945, a self-taught American engineer, Percy Spencer, noticed that a chocolate bar melted in his pocket whenever he was close to a radar set. This innocuous discovery resulted in the patent for the first microwave (Zhang, 2017). In a similar vein, deep neural networks only began to work when an existing technology was unexpectedly re-purposed.

A graphics processing unit (GPU) was originally introduced in the 1970s as a specialized accelerator for video games and for developing graphics for movies and animation. In the 2000s, like the magnetron tube, GPUs were re-purposed for an entirely unimagined use case – to train deep neural networks (Chellapilla et al., 2006; Hooker, 2021b; Oh & Jung, 2004; Payne et al., 2005).

GPUs had one critical advantage over CPUs - they were far better at parallelizing matrix multiplication (Brodtkorb et al., 2013; Dettmers, 2023), a mathematical operation which dominates the definition of deep neural network layers (Fawzi et al., 2022; Davies et al., 2024). This higher number of floating-point operations per second (FLOP/s) combined with the clever distribution of training between GPUs unblocked the training of deeper networks. The depth of the network turned out to be critical. Performance on ImageNet jumped with ever deeper networks in 2011 (Ciresan et al., 2011), 2012 (Krizhevsky et al., 2012) and 2015 (Szegedy et al., 2014). A striking example of this jump in compute is a comparison of the now famous 2012 Google paper which used 16,000 CPU cores to classify cats (Le et al., 2012) to a paper published a mere year later that solved the same task with only two CPU cores and four GPUs (Coates et al., 2013).
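
A minimal sketch of this advantage (my illustration, not from the essay): dense matrix multiplication parallelizes far better on a GPU than on a CPU. The GPU timing only runs if a CUDA device is available, and the absolute numbers depend entirely on the hardware used.

```python
import time
import torch

def time_matmul(device: str, n: int = 4096, repeats: int = 10) -> float:
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()          # make sure timing captures GPU work
    start = time.perf_counter()
    for _ in range(repeats):
        _ = a @ b                         # the FLOP-heavy core of a dense layer
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

print(f"CPU: {time_matmul('cpu'):.4f} s per 4096x4096 matmul")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.4f} s per 4096x4096 matmul")
```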

This would ignite a rush for compute which has led to a bigger-is-better race in the number of model parameters over the last decade (Canziani et al., 2016; Strubell et al., 2019b; Rae et al., 2021; Raffel et al., 2020; Bommasani et al., 2021; Bender et al., 2021). The computer scientist Ken Thompson famously said “When in doubt, use brute force.” This was formalized as the “bitter lesson” by Rich Sutton who posited that computer science history tells us that throwing more compute at a problem has consistently outperformed all attempts to leverage human knowledge of a domain to teach a model (Sutton, 2019). In a punch to the ego of every computer scientist out there, what Sutton is saying is that symbolic methods that codify human knowledge have not worked as well as letting a model learn patterns for itself coupled with ever-vaster amounts of compute.

This essay will ask: is bigger always better? For the last decade, computer science progress has been held captive by our own Moore’s law (Schaller, 1997): a painfully simple formula for innovation of adding more model parameters and data to training. Yet, this essay will posit that it is far from clear that future innovation or large gains in performance will come from training compute alone. As we will see in the next section, the relationship between compute and performance is far from straightforward. Compute is changing rapidly, as fast as the technology it serves.

Figure 2: Number of notable machine learning models by geographic area, 2003–23 (sum) (Maslej et al., 2024). Source: Epoch, 2023; chart: 2024 AI Index report. The explosion in both necessary compute and associated cost has compounded the concentration of breakthroughs in a few regions of the world.

Why does answering this question matter? The pervasive belief that compute drives progress has fundamentally reshaped the culture of conducting science in our field. Academia has been left unable to participate in breakthroughs because of a lack of access to compute (Maslej et al., 2024). Compute access disparities persist by region, concentrating participation in the West and China (Longpre et al., 2023; Maslej et al., 2024; Singh et al., 2024a). The large capital investment required for compute hungry workloads has also led to a changing publication culture. Our historically very open field is increasingly closed. The pervasive view in industry labs is to limit publishing to preserve important commercial advantages (Morris & Heikkilä, 2025).

The implication that scale is the key lever for progress also pervades responsible scaling policies released by key industry players like Anthropic (Anthropic, 2023) and OpenAI (OpenAI, 2023). These frameworks implicitly assume scaling is inevitable – with the only open question being how to do it responsibly. The assumption that scaling is the only reliable marker of progress extends to government interventions in policing AI. The introduction of compute thresholds as part of the EU AI Act and other legislation rests on the assumption that models will always trend bigger (Hooker, 2024; The White House, 2023; European Union, 2024; Linghan et al., 2024; Senate, 2024) and that access to compute or hardware (U.S. House Committee on Foreign Affairs, 2024; Reuters, 2024; Peppin et al., 2025) is the best indicator of increased capabilities and risk. This all makes probing the assumption that scaling is inevitable even more critical. We have re-oriented our entire field and culture of discovery around bigger is better. Is it always?

2 A shift in the relationship between compute and performance.

Figure 3: Left: Open LLM Leaderboard scores for small models (<13B) over time – a plot of the best daily 13B-or-smaller model submitted to the leaderboard. Even amongst comparably small models, performance has been growing rapidly. Right: Large models (>13B) that perform worse than small models (<13B). The best small models (under 13B) submitted to the Open LLM Leaderboard easily outperform far larger models, and over time a growing number of larger models have been outperformed by small <13B models.

It is controversial in many circles to state that scaling is dying. This is largely because all the evidence from the last decade suggests it is sensible to keep scaling. Scaling compute unlocks larger model sizes or datasets. It is a widely favored formula because it has provided persuasive gains in overall performance. As the computer scientist Michael Jordan quipped, “Today we can’t think without holding a piece of metal.” Increasing compute also conveniently fits into the cadence of quarterly industry planning; it is less risky to propose training a larger model than it is to propose an alternative optimization technique.

Relying on compute alone misses a critical shift that is underway in the relationship between scaling and performance. It is not always the case that bigger models result in better performance. The bitter lesson doesn’t explain why Falcon 180B (Almazrouei et al., 2023) is easily outperformed by far smaller open weights models such as Llama-3 8B (AI@Meta, 2024), Command R 35B (Cohere & Team, 2024), and Gemma 3 27B (Team, 2024). It also doesn’t explain why Aya 23 8B (Aryabumi et al., 2024) and Aya Expanse 8B (Dang et al., 2024b) both easily outperform BLOOM 176B (Workshop et al., 2023) despite each having only 4.5% of the parameters. These are not isolated examples, but part of a systematic trend where there is no guarantee that larger models consistently outperform smaller models. In Figure 3b, we plot the scores of models submitted to the Open LLM Leaderboard (Beeching et al., 2023) over the last two years. The trend is striking: over time there has been a surge in the number of small compact models that outperform far larger ones.

To understand why this is the case, we must understand what key variables have been driving gains in performance over the last decade. In an era where there are diminishing returns for the amount of compute available (Lohn & Jackson, 2022; Thompson et al., 2020), optimization and architecture breakthroughs define the rate of return for a given unit of compute. It is this rate of return which is most critical to the pace of progress and to the level of risk incurred by additional compute.

3 What influences the rate of return for compute?

In complex systems, it is challenging to manipulate one variable in isolation and foresee all implications. Throughout the 20th century doctors recommended removing tonsils in response to swelling or infection, but research has recently shown the removal may lead to higher incidence of throat cancer (Liang et al., 2023). Early televised drug prevention advertisements in the 2000s led to increased drug use rather than curbing abuse of drugs as intended (Terry-McElrath et al., 2011). In a similar vein, the belief that more compute equates with predictable gains in capabilities belies a far more complex picture. Below we explore some of the core contradictions.

3.1 Diminishing returns of increasing model size

Why do we even need extra weights in the first place? Model size is often quantified by the number of trainable weights or parameters. This metric has exploded over the last decade. Some of the first widely adopted deep neural networks like Inception (Szegedy et al., 2016) had 23 million weights. In contrast, recent releases like Qwen3-235B-A22B (Team, 2025) have 235 billion parameters. While this startling growth in model size has been driven by empirical gains from larger models, a key limitation of simply throwing more weights at a task is that the relationship between additional trainable weights and generalization remains poorly understood. A deep neural network learns and adjusts model weights during training to improve performance. When we scale the size of models, typically we are adding to the number of total weights that are learned over training. However, it is unclear why we need so many additional weights. What is particularly puzzling is that we also observe that we can get rid of most of these weights after we reach the end of training with minimal loss to overall performance. For example, it is well accepted you can completely remove the majority of trained weights (Gale et al., 2019; Li et al., 2020; Hou et al., 2020; Chen et al., 2021; Bai et al., 2020; Han et al., 2015; Evci et al., 2019; Denil et al., 2013; Ahmadian et al., 2023) in a network after training while not incurring sizable performance degradation. However, if you start training without these weights active it is impossible to reach the same end performance. If we can get rid of them afterwards, why do we need them in the first place?
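
The observation that most trained weights can be removed has a simple illustration. The sketch below is a toy model and dataset of my own construction, not an experiment from the cited works: it trains a small over-parameterized network and then applies global magnitude pruning, zeroing the 90% of weights with the smallest magnitudes, and reports accuracy before and after.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy task: classify 2-D points by which side of a fixed line they fall on.
x = torch.randn(4000, 2)
y = (x[:, 0] + 0.5 * x[:, 1] > 0).long()

# Deliberately over-parameterized for such a simple task.
model = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for _ in range(300):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

def accuracy() -> float:
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

acc_dense = accuracy()

# Global magnitude pruning: zero the 90% of weights with the smallest magnitude.
with torch.no_grad():
    weights = [m.weight for m in model if isinstance(m, nn.Linear)]
    threshold = torch.cat([w.abs().flatten() for w in weights]).quantile(0.9)
    for w in weights:
        w.mul_((w.abs() >= threshold).float())

print(f"accuracy: {acc_dense:.3f} dense vs {accuracy():.3f} with 90% of weights removed")
```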

Denil et al. (2014) find that a small set of weights can be used to predict 95% of the weights in the network. This suggests that many weights are highly correlated and that there is considerable redundancy both in the learned feature space and in the size of the network. This may have more to do with the inefficiency of our learning techniques for deep neural networks, and how unstable optimization is if we start with a smaller network. If we had better learning techniques, we would probably need far smaller networks.

Increasing model size is a very costly way to learn the long tail. Deep neural networks are incredibly inefficient learners. Although deep neural networks learn common and frequent features efficiently and early in training (Agarwal & Hooker, 2020; Paul et al., 2021; Mangalam & Prabhu, 2019; Siddiqui et al., 2022; Abbe et al., 2021), these architectures require an incredible amount of compute and training time to learn infrequent features. This is because all modern networks are trained based upon minimization of average error (Goodfellow et al., 2016). Our typical training regime requires that all examples are shown the same number of times during training (Xue et al., 2023), hence the signal of infrequent attributes is diluted in batch updates (Achille et al., 2017; Jiang et al., 2020; Mangalam & Prabhu, 2019; Faghri et al., 2020; Frankle et al., 2020; Arpit et al., 2017; Hooker et al., 2020; Hooker, 2021a). Most attributes in the real world are infrequent; part of what makes human intelligence unique is our ability to pattern match and process long tail and previously unseen instances efficiently. This is exactly where deep neural networks struggle the most. The bulk of compute during training is spent memorizing the long tail in a prohibitively costly way. It is akin to building a ladder to the moon.
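
The dilution of infrequent attributes in batch updates can be seen directly in a toy logistic-regression setup. In the sketch below (my illustration with synthetic numbers), a rare subgroup contributes roughly its frequency share of the batch gradient under standard average-loss training; the explicit upweighting at the end is an illustrative counterfactual, not a fix prescribed by the essay.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# 1% of examples carry a rare attribute (a different feature direction and label).
n_common, n_rare = 990, 10
x = torch.cat([torch.randn(n_common, 2) + torch.tensor([2.0, 0.0]),
               torch.randn(n_rare, 2) + torch.tensor([0.0, 2.0])])
y = torch.cat([torch.zeros(n_common), torch.ones(n_rare)])

def group_grad_norm(sample_weights, lo, hi):
    """Gradient-norm contribution of examples [lo, hi) to the batch-mean loss."""
    w = torch.zeros(2, requires_grad=True)          # fresh linear model
    per_example = F.binary_cross_entropy_with_logits(x @ w, y, reduction="none")
    loss = (sample_weights * per_example)[lo:hi].sum() / len(x)
    loss.backward()
    return w.grad.norm().item()

uniform = torch.ones(len(x))                        # standard average-loss training
rebalanced = torch.where(y == 1, torch.tensor(n_common / n_rare), torch.tensor(1.0))

for name, sw in [("uniform", uniform), ("rebalanced", rebalanced)]:
    common = group_grad_norm(sw, 0, n_common)
    rare = group_grad_norm(sw, n_common, n_common + n_rare)
    print(f"{name:10s} gradient share of rare group: {rare / (rare + common):.2%}")
```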

3.2 Data quality reduces reliance on compute.

Models trained on better data do not require as much compute. A large body of work has emerged which shows that efforts to better curate the training corpus, including de-duping (Taylor et al., 2022; Kocetkov et al., 2022), data pruning (Marion et al., 2023; Singh et al., 2024b; Sorscher et al., 2023; Albalak et al., 2024; Tirumala et al., 2023; Chimoto et al., 2024) or data prioritization (Boubdir et al., 2023; Thakkar et al., 2023), can compensate for larger models. This suggests that the number of learnable parameters is not definitively the constraint on improving performance; investments in better data quality mitigate the need for more weights (Singh et al., 2024b; Penedo et al., 2023; Raffel et al., 2020; Lee et al., 2022; D’souza et al., 2025). If the size of a training dataset can be reduced without impacting performance (Marion et al., 2023), training time is reduced, which directly lowers the amount of compute needed.
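
A minimal sketch of the simplest of these curation steps, exact de-duplication by content hashing. This is my illustration of the general idea only; the near-duplicate detection used in the cited work (for example, MinHash-based approaches) is considerably more involved.

```python
import hashlib

def dedupe(docs):
    """Drop verbatim repeats; whitespace is normalized so trivial copies collide."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = [
    "Scaling compute unlocks larger models.",
    "Scaling   compute unlocks larger models.",   # whitespace-only duplicate
    "Data quality reduces reliance on compute.",
]
print(f"{len(corpus)} documents -> {len(dedupe(corpus))} after de-duplication")
```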

3.3 New algorithmic techniques compensate for compute.

Progress over the last few years has been as much due to algorithmic improvements as it has been due to compute. This includes extending pre-training with instruction finetuning to teach models instruction following (Singh et al., 2024a), model distillation using synthetic data from larger, more performant "teachers" to train highly capable, smaller "students" (Team et al., 2024b; Aryabumi et al., 2024), chain-of-thought reasoning (Wei et al., 2023; Hsieh et al., 2023), increased context-length (Xiong et al., 2023), retrieval augmented generation (Pozzobon et al., 2023; Lewis et al., 2020), and preference training to align models with human feedback (Dang et al., 2024a; Ahmadian et al., 2024; Ouyang et al., 2022; Bai et al., 2022; Lee et al., 2023; Tunstall et al., 2023; Khalifa et al., 2021; Rafailov et al., 2023; Azar et al., 2023). All of these techniques compensate for the need for heavy weights or expensive prolonged training (Ho et al., 2024b). All things equal, these have been shown to dramatically improve model performance relative to a model trained without these optimization tricks given the same level of compute (Davidson et al., 2023; Hernandez & Brown, 2020; Erdil & Besiroglu, 2023; METR Team, 2023; Liu et al., 2024). We are doing significantly more with the same amount of resources.
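
As one concrete example from this list, the sketch below shows sequence-level distillation with synthetic data: a larger "teacher" generates completions and a smaller "student" takes a gradient step on them. The GPT-2 models and the two-prompt "dataset" are placeholders for illustration only, not the recipe of any system cited above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # shared GPT-2 vocabulary
tok.pad_token = tok.eos_token
teacher = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()
student = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["Explain why GPUs sped up deep learning:",
           "Summarize the bitter lesson in one sentence:"]

# 1) The teacher produces synthetic training targets, one prompt at a time.
synthetic = []
with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        gen = teacher.generate(ids, max_new_tokens=40, do_sample=True,
                               top_p=0.9, pad_token_id=tok.eos_token_id)
        synthetic.append(tok.decode(gen[0], skip_special_tokens=True))

# 2) The student takes a gradient step on the teacher's outputs.
batch = tok(synthetic, return_tensors="pt", padding=True, truncation=True)
labels = batch.input_ids.masked_fill(batch.attention_mask == 0, -100)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
loss = student(**batch, labels=labels).loss
loss.backward()
optimizer.step()
print(f"distillation step loss: {loss.item():.3f}")
```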

3.4 Architecture plays a significant role in determining scalability

Architecture plays an enormous role in determining the overall rate of return in performance given a unit of compute. It also plays a crucial role in determining the ceiling of progress. The introduction of a new architecture design can fundamentally change the relationship between compute and performance (Tay et al., 2022; Sevilla et al., 2022; Ho et al., 2024a) and render any existing scaling law irrelevant. For example, the key breakthroughs in AI adoption around the world were the introduction of architectures like convolutional neural networks (CNNs) for vision (Ciresan et al., 2011; Krizhevsky et al., 2012; Szegedy et al., 2014) and Transformers for language modeling (Vaswani et al., 2023).

4 The limits of scaling laws.

Warren Buffett once said, 

Don’t ask the barber if you need a haircut.

In the same vein, don’t ask a computer scientist or economist whether they can predict the future. The temptation to say yes often overrides a necessary humility about what can and cannot be predicted accurately. One such area where hubris has overridden common sense is attempts to predict the relationship between scale and performance in the form of scaling laws (Kaplan et al., 2020; Hernandez et al., 2021; Dhariwal et al., 2021), which either try to predict how a model’s pre-training loss scales (Bowman, 2023) or how downstream properties emerge with scale.

Figure 4: Byte Magazine Cover, Volume 5, 1980. Compute is rarely the only determinant of progress. Data quality, instruction finetuning, preference training, retrieval augmented generation, enabled tool use, chain-of-thought reasoning, and increased context-length are all algorithmic techniques which add little or no training compute but result in significant gains in performance.

Scaling laws emerged as a symptom of our extreme trust in compute being one of the primary catalysts of progress. The term has entered the mainstream discussion as a catchall phrase to justify everything from massive capital investments in AI startups to policy decisions about compute thresholds. It is easy to understand why scaling laws are enticing: if you can predict how capabilities change with the amount of compute, you can justify capital expenditures on that compute. However, while performance typically increases with scaling, our track record of predicting exactly how much it does is surprisingly lacking. This means that it is difficult to scientifically determine a rate of return for a given level of compute.

One of the biggest limitations of scaling laws is that they have only been shown to hold when predicting a model’s pretraining test loss (Bowman, 2023), which measures the model’s ability to correctly predict how an incomplete piece of text will be continued. Indeed, when actual performance on downstream tasks is used, the results are often murky or inconsistent (Ganguli et al., 2022; Schaeffer et al., 2023; Anwar et al., 2024a; Schaeffer et al., 2024; Hu et al., 2024). Ironically, the term emergent properties is often used to describe this discrepancy (Wei et al., 2022; Srivastava et al., 2023): a property that appears “suddenly” as the complexity of the system increases and cannot be predicted. Somewhat humorously, the acceptance that there are emergent properties which appear out of nowhere is another way of saying our scaling laws don’t actually equip us to know what is coming.

Even when limited to predicting test loss, there have been issues with replicability of scaling results under slightly different assumptions about the distribution (Besiroglu et al., 2024; Anwar et al., 2024b). Research has also increasingly found that many downstream capabilities display irregular scaling curves (Srivastava et al., 2023) or non power-law scaling (Caballero et al., 2023). For complex systems that require projecting into the future, small errors end up accumulating due to the time step dependencies being modeled. This makes accurate predictions of when risks will emerge inherently difficult, which is compounded by the small sample sizes that are often available for analysis. Each data point is a model, and computation cost means scaling “laws” are frequently based upon analysis of fewer than 100 data points (Ruan et al., 2024). This means that many reported power law relationships can lack statistical support and power (Stumpf & Porter, 2012). The reliability of the scaling laws varies considerably by domain. For example, code-generation has shown fairly predictable power law scaling across 10 orders of magnitude of compute (Hu et al., 2024; Anwar et al., 2024a). However, other capabilities have been shown to scale far more erratically (Srivastava et al., 2023; Caballero et al., 2023).
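
To make the small-sample fragility concrete, here is a minimal sketch of how such a fit is typically performed: a handful of (compute, loss) points, one per trained model, regressed as a power law L(C) = a * C^(-alpha) + c. The numbers below are invented for illustration and are not data from any cited study.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(log10_compute, a, alpha, c):
    # Parameterized in log10(FLOPs) so the optimizer is well conditioned.
    return a * 10.0 ** (-alpha * log10_compute) + c

log10_compute = np.array([19.0, 20.0, 21.0, 22.0, 23.0])   # training FLOPs (log10)
loss = np.array([2.97, 2.78, 2.61, 2.45, 2.36])            # pre-training test loss

params, cov = curve_fit(power_law, log10_compute, loss,
                        p0=[20.0, 0.05, 1.0], maxfev=10000)
a, alpha, c = params
alpha_err = np.sqrt(np.diag(cov))[1]
print(f"fitted exponent alpha = {alpha:.3f} +/- {alpha_err:.3f}")

# Extrapolating an order of magnitude past the fitted range is exactly where
# such fits have proven unreliable.
print(f"predicted loss at 1e24 FLOPs: {power_law(24.0, *params):.2f}")
```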

Scaling laws may be useful for planning training runs because they hold well when architecture, optimization and data quality stay the same. These tend to be short term changes in a controlled regime. However, scaling laws have not stood up to rigor when extrapolated over even medium time horizons (Stumpf & Porter, 2012). The failure of scaling laws supports the takeaway that scaling compute is far from a straightforward axis of progress. Indeed, frontier AI companies which place disproportionate emphasis on scaling laws are likely under-investing in other directions of innovation which will unlock future gains.

5 The way forward.

In folklore, the silver bullet was one of the few techniques that was an effective defense not only against werewolves, but also a protection against vampires and witches. This led to the term silver bullet to describe an intervention that solves many things at once. In computer science, we have treated compute as our silver bullet.

We are observing a bifurcation in compute trends. On the one hand, at least in the short term, models are likely to continue to get bigger as we attempt to squeeze more out of our dying architecture. On the other hand, the relationship between compute and performance is increasingly strained and hard to predict (Niu et al., 2024).

Figure 5: It is a fun time to be a computer scientist as our levers for teaching models how to think are changing rapidly. These new techniques are often far more efficient and many don’t even require gradient updates. Many of these spaces are also under-explored and will evolve rapidly over the next few years.

So where should we go next? The frontier labs which will lead in innovation will not bet on compute alone. Indeed, the most interesting axes of progress are due to fundamental paradigm shifts in the optimization spaces available to computer scientists. One key aspect that makes this era different is the expanded set of optimization tools computer scientists now have at their disposal. This will change a great deal about where computer scientists spend time and the nature of discovery itself. I will include some thoughts below on the most exciting directions to explore.

5.1 New Optimization spaces

Gradient free exploration Increasingly, a lot of computation is spent outside of training to improve the performance of a model. Traditionally, if you wanted higher performance from a machine learning model, you paid for it with more training or data or parameters. A key departure from this is the recent emphasis on scaling up compute at inference time rather than at training time (Khairi et al., 2025; Hooker, 2024; Wei et al., 2023; Hsieh et al., 2023; Wang et al., 2023; Mora et al., 2025). These strategies, which include search, tool use, agentic swarms and adaptive compute, allow for improvements in performance by spending more compute without any alterations to the model itself. In a radical departure from the last 30 years of AI progress, many of these techniques are gradient free, involving no updates to the parameters in order to induce changes in performance. The limited work to date which has evaluated a subset of “inference-time compute” improvements estimates these can impart gains of between 5x and 20x over base post-training performance (Davidson et al., 2023). Relative to the large volumes of compute needed for pre-training, these techniques have a minimal footprint (Villalobos & Atkinson, 2023; Huang et al., 2022; D’souza et al., 2025).
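
A minimal sketch of one of the simplest such strategies, best-of-N sampling: draw several candidate completions and keep the one preferred by a scoring function, improving output quality with extra inference compute and no gradient updates. The GPT-2 model and the toy heuristic scorer are placeholders of my own choosing; real systems use learned reward models or task-specific verifiers.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def score(text: str) -> float:
    # Placeholder scorer: prefer longer, less repetitive completions.
    words = text.split()
    return len(set(words)) / max(len(words), 1) * len(words)

def best_of_n(prompt: str, n: int = 8, max_new_tokens: int = 40) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    candidates = []
    with torch.no_grad():
        for _ in range(n):                       # n independent samples
            out = model.generate(ids, do_sample=True, top_p=0.95,
                                 temperature=0.8, max_new_tokens=max_new_tokens,
                                 pad_token_id=tok.eos_token_id)
            candidates.append(tok.decode(out[0], skip_special_tokens=True))
    return max(candidates, key=score)            # keep the highest-scoring sample

print(best_of_n("The main reason small models can beat large ones is"))
```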

A malleable data space Historically, high-quality labeled data has been costly to curate due to, amongst other factors, scarcity of available data (Singh et al., 2024a; 2025) and financial cost (Gilardi et al., 2023; Boubdir et al., 2023). This high cost has precluded adapting training sets “on-the-fly” to increase coverage or task diversity. As a result, researchers have often treated datasets as static representations of the world, far from the rich, ever-evolving environment we navigate as humans. These frozen snapshots in time like MNIST (Deng, 2012), ImageNet (Deng et al., 2009), SQuAD (Rajpurkar et al., 2016) were the foundation upon which progress in AI has been built.

The cost of having historically static and rigid training datasets is extremely high. Models perform better on the distribution they are trained to mimic (Schwartz et al., 2022; Vashishtha et al., 2023; Khondaker et al., 2023). At inference time, data points are not equally relevant, but it is often prohibitively expensive to go back and change the training distribution for each individual inference request. Hence, there is a mismatch between the distributions at training and inference time: the training-time distribution is often determined by ease of access to prior data collections and data augmentation efforts, while at inference time, new use cases might be underrepresented in the data but highly relevant to the user.

A fundamental revolution is underway where the cost of generating synthetic data is now low enough that we can treat the data space as malleable and something which can be optimized. We can steer synthetic data towards desirable properties (Shimabucoro et al., 2024; Dash et al., 2025), and make previously invisible worlds with limited data coverage more visible (Aryabumi et al., 2024; Üstün et al., 2024; Team et al., 2024a; Mora et al., 2025). The importance of this is hard to overstate: with a malleable data space it is possible to target parts of the distribution that are less frequent. It is also a radical departure from assumptions that have guided machine learning fundamentals, such as assuming IID (independent and identically distributed) samples. We are now able to intentionally skew the distribution towards what we hope to represent, rather than accepting a random sample of the world as it is.
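
A minimal sketch of this kind of intentional skewing: rejection-sample synthetic examples from a generator and keep only those that land in an under-represented slice of the distribution. The trivial templated generator and the "rare slice" below are stand-ins for illustration; in practice the generator would be a large model prompted toward the target slice.

```python
import random
from typing import Callable, List

random.seed(0)

TOPICS = ["sports", "finance", "radar", "cooking", "weather"]
RARE_SLICE = {"radar"}                            # slice we want more coverage of

def toy_generator() -> str:
    topic = random.choice(TOPICS)
    return f"A short synthetic sentence about {topic}."

def targeted_sample(generate: Callable[[], str], in_slice: Callable[[str], bool],
                    want: int, max_tries: int = 10_000) -> List[str]:
    kept = []
    for _ in range(max_tries):
        text = generate()
        if in_slice(text):                        # intentionally skew the mix
            kept.append(text)
        if len(kept) >= want:
            break
    return kept

rare_examples = targeted_sample(
    toy_generator, lambda t: any(topic in t for topic in RARE_SLICE), want=20)
print(f"collected {len(rare_examples)} synthetic examples for the rare slice")
```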

The role of design and interface The most intelligent system will increasingly be defined by building an algorithm that can interact with the world. This means that, for the first time, researchers who care about intelligence also need to be obsessed with how a model interacts. What was previously the narrow purview of UX designers, artists and human computer interaction specialists should now be of great interest to all computer scientists. Increasingly, progress at the frontier will require building a system involving multiple components rather than a single algorithm to rule them all.

5.2 Future odds of a return to scaling

Does this mean we will never return to scaling? As long as we are stuck with transformers as an architecture, it doesn’t make sense to keep scaling compute. Our current architecture shows all the signs of plateauing in returns from additional compute. While progress has revolved around deep neural networks for the last decade, there is much to suggest that the next significant step forward will require an entirely different architecture. As our models interact with the world, we need new ways to mitigate catastrophic forgetting, where performance deteriorates on the original task because new information interferes with previously learned behavior (Mcclelland et al., 1995; Pozzobon et al., 2023). Deep neural networks are particularly poor at continual learning because of our reliance on global updates, which leads to more stable training but doesn’t allow knowledge to specialize in the way that distinct regions of the brain do.
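
A minimal sketch of catastrophic forgetting on toy data of my own construction: a small network is trained on task A, then fine-tuned on an interfering task B with ordinary global gradient updates, and its accuracy on task A collapses back toward chance.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(rotation_deg: float, n: int = 2000):
    # Both tasks: classify which side of a line a 2-D point falls on;
    # task B rotates the decision boundary so the tasks interfere.
    x = torch.randn(n, 2)
    theta = math.radians(rotation_deg)
    normal = torch.tensor([math.cos(theta), math.sin(theta)])
    y = (x @ normal > 0).long()
    return x, y

def accuracy(model, x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

def train(model, x, y, steps=300):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
xa, ya = make_task(0.0)      # task A
xb, yb = make_task(90.0)     # task B, an interfering objective

train(model, xa, ya)
acc_before = accuracy(model, xa, ya)
train(model, xb, yb)         # sequential fine-tuning, no replay of task A data
acc_after = accuracy(model, xa, ya)
print(f"task A accuracy: {acc_before:.2f} before vs {acc_after:.2f} after training on task B")
```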

What does the slow death of scaling training compute mean for the environmental impact of AI? It is important to make a distinction between the shifting trends between compute and performance, and the overall computational overhead of AI as a whole. While we will see ever smaller, more performant models, AI workloads will also be deployed in many more settings. This means that this essay should not be taken as a position that the overall environmental impact and energy cost of AI is not a formidable problem. This caveat is important to make, because the majority of the energy requirements of AI workloads is not in training, but instead the cost to productionize an ML workload and serve it to billions of users. Even if model size is trending smaller, the widespread adoption of AI means overall energy requirements will likely continue to rise and are far from negligible (Strubell et al., 2019a; Schwartz et al., 2020; Derczynski, 2020; Patterson et al., 2021; Luccioni et al., 2025; Wu et al., 2022; Treviso et al., 2023).

5.3 Parting thoughts

It is meaningful that the statement that we can’t rely solely on compute is gaining recognition in the mainstream conversation. This essay brings together previous writing on several topics that are timely: the changing relationship we have with compute, our expanded optimization spaces, and how scaling has irrevocably changed our research culture. One thing is certain: less reliable gains from compute make our purview as computer scientists interesting again. We can now stray from the beaten path of boring, predictable gains from throwing compute at the problem. It is fitting to conclude with a quote from Alan Turing:

“We can only see a short distance ahead, but we can see plenty there that needs to be done.”

5.4 Acknowledgments

This work draws upon reflections I have shared in talks over the last few years with some of my existing writing on the topic (Hooker, 2020; 2024). Decreasing returns to scaling has started to gain prominence in the wider conversation, which has resulted in renewed interest in these works. A warm thank you to Sudip Roy, Hugo Larochelle and John Dang who read and provided feedback on a version of this draft. Thanks to Thomas Euyang who helped with an earlier version of the design for Figure 5.

References

Emmanuel Abbe, Enric Boix-Adsera, Matthew Brennan, Guy Bresler, and Dheeraj Nagaraj. The staircase property: How hierarchical structure can guide deep learning, 2021. URL https://arxiv.org/abs/2108.10573.

Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep neural networks. ArXiv, abs/1711.08856, 2017.

Chirag Agarwal and Sara Hooker. Estimating example difficulty using variance of gradients, 2020.

Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Zhen Stephen Gou, Phil Blunsom, Ahmet Üstün, and Sara Hooker. Intriguing properties of quantization at scale. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 34278–34294. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/6c0ff499edc529c7d8c9f05c7c0ccb82-Paper-Conference.pdf.

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms, 2024.

AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.

Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, and William Yang Wang. A survey on data selection for language models, 2024.

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. The falcon series of open language models, 2023. URL https://arxiv.org/abs/2311.16867.

Anthropic. Responsible scaling of ai, 2023. URL https://www-cdn.anthropic.com/1adf000c8f675958c2ee23805d91aaade1cd4613/responsible-scaling-policy.pdf.

Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Yoshua Bengio, Danqi Chen, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, and David Krueger. Foundational challenges in assuring alignment and safety of large language models, 2024a.

Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Yoshua Bengio, Danqi Chen, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, and David Krueger. Foundational challenges in assuring alignment and safety of large language models, 2024b. URL https://arxiv.org/abs/2404.09932.

Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 233–242. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/arpit17a.html.

Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Kelly Marchisio, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Nick Frosst, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. Aya 23: Open weight releases to further multilingual progress, 2024.

Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. A general theoretical paradigm to understand learning from human preferences, 2023.

Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jing Jin, X. Jiang, Qun Liu, Michael R. Lyu, and Irwin King. BinaryBERT: Pushing the Limit of BERT Quantization. ArXiv, abs/2012.15701, 2020.

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional ai: Harmlessness from ai feedback, 2022.

Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard, 2023.

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pp. 610–623, New York, NY, USA, 2021. Association for Computing Machinery. URL https://doi.org/10.1145/3442188.3445922.

Tamay Besiroglu, Ege Erdil, Matthew Barnett, and Josh You. Chinchilla scaling: A replication attempt, 2024.

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the Opportunities and Risks of Foundation Models, 2021.

Meriem Boubdir, Edward Kim, Beyza Ermis, Marzieh Fadaee, and Sara Hooker. Which prompts make the difference? data prioritization for efficient human llm evaluation, 2023.

Samuel R. Bowman. Eight things to know about large language models, 2023.

André R. Brodtkorb, Trond R. Hagen, and Martin L. Sætra. Graphics processing unit (gpu) programming strategies and trends in gpu computing. Journal of Parallel and Distributed Computing, 73(1):4–13, 2013. ISSN 0743-7315. doi: https://doi.org/10.1016/j.jpdc.2012.04.003. URL http://www.sciencedirect.com/science/article/pii/S0743731512000998. Metaheuristics on GPUs.

Ethan Caballero, Kshitij Gupta, Irina Rish, and David Krueger. Broken neural scaling laws, 2023.

Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An Analysis of Deep Neural Network Models for Practical Applications. arXiv e-prints, pp. arXiv:1605.07678, May 2016.

Kumar Chellapilla, Sidd Puri, and Patrice Simard. High performance convolutional neural networks for document processing, 10 2006.

Xiaohan Chen, Yu Cheng, Shuohang Wang, Zhe Gan, Zhangyang Wang, and Jingjing Liu. EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets. ArXiv, abs/2101.00063, 2021.

Everlyn Asiko Chimoto, Jay Gala, Orevaoghene Ahia, Julia Kreutzer, Bruce A. Bassett, and Sara Hooker. Critical learning periods: Leveraging early training dynamics for efficient data pruning, 2024.

Dan Ciresan, Ueli Meier, Jonathan Masci, Luca Maria Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification., 07 2011.

Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Andrew Ng. Deep learning with cots hpc systems. In Sanjoy Dasgupta and David McAllester (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 1337–1345, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/coates13.html.

Cohere and Cohere For AI Team. C4ai command r+, 2024. URL https://huggingface.co/CohereForAI/c4ai-command-r-plus. Accessed: 2024-06.

John Dang, Arash Ahmadian, Kelly Marchisio, Julia Kreutzer, Ahmet Üstün, and Sara Hooker. Rlhf can speak many languages: Unlocking multilingual preference optimization for llms, 2024a. URL https://arxiv.org/abs/2407.02552.

John Dang, Shivalika Singh, Daniel D’souza, Arash Ahmadian, Alejandro Salamanca, Madeline Smith, Aidan Peppin, Sungjin Hong, Manoj Govindassamy, Terrence Zhao, Sandra Kublik, Meor Amer, Viraat Aryabumi, Jon Ander Campos, Yi-Chern Tan, Tom Kocmi, Florian Strub, Nathan Grinsztajn, Yannis Flet-Berliac, Acyr Locatelli, Hangyu Lin, Dwarak Talupuru, Bharat Venkitesh, David Cairuz, Bowen Yang, Tim Chung, Wei-Yin Ko, Sylvie Shang Shi, Amir Shukayev, Sammie Bae, Aleksandra Piktus, Roman Castagné, Felipe Cruz-Salinas, Eddie Kim, Lucas Crawhall-Stein, Adrien Morisot, Sudip Roy, Phil Blunsom, Ivan Zhang, Aidan Gomez, Nick Frosst, Marzieh Fadaee, Beyza Ermis, Ahmet Üstün, and Sara Hooker. Aya expanse: Combining research breakthroughs for a new multilingual frontier, 2024b. URL https://arxiv.org/abs/2412.04261.

Saurabh Dash, Yiyang Nan, John Dang, Arash Ahmadian, Shivalika Singh, Madeline Smith, Bharat Venkitesh, Vlad Shmyhlo, Viraat Aryabumi, Walter Beller-Morales, Jeremy Pekmez, Jason Ozuzu, Pierre Richemond, Acyr Locatelli, Nick Frosst, Phil Blunsom, Aidan Gomez, Ivan Zhang, Marzieh Fadaee, Manoj Govindassamy, Sudip Roy, Matthias Gallé, Beyza Ermis, Ahmet Üstün, and Sara Hooker. Aya vision: Advancing the frontier of multilingual multimodality, 2025. URL https://arxiv.org/abs/2505.08751.

Tom Davidson, Jean-Stanislas Denain, Pablo Villalobos, and Guillem Bas. Ai capabilities can be significantly improved without expensive retraining, 2023.

Michael Davies, Ian McDougall, Selvaraj Anandaraj, Deep Machchhar, Rithik Jain, and Karthikeyan Sankaralingam. A journey of a 1,000 kernels begins with a single step: A retrospective of deep learning on gpus. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’24, pp. 20–36, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400703850. doi: 10.1145/3620665.3640367. URL https://doi.org/10.1145/3620665.3640367.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.

Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012. doi: 10.1109/MSP.2012.2211477.

Misha Denil, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ranzato, and Nando de Freitas. Predicting Parameters in Deep Learning. arXiv e-prints, pp. arXiv:1306.0543, Jun 2013. URL https://arxiv.org/abs/1306.0543.

Misha Denil, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ranzato, and Nando de Freitas. Predicting parameters in deep learning, 2014.

Leon Derczynski. Power Consumption Variation over Activation Functions. arXiv preprint arXiv:2006.07237v1, 2020. URL https://arxiv.org/abs/2006.07237v1.

Tim Dettmers. Which gpu for deep learning in 2023?, 2023. URL https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/.

Prafulla Dhariwal, Girish Sastry, Mark Chen, Dan I. Moldovan, Alex Beutel, and Jonathan Deaton. Data and parameter scaling laws for neural machine translation, 2021. URL https://api.semanticscholar.org/CorpusID:235415752.

Daniel D’souza, Julia Kreutzer, Adrien Morisot, Ahmet Üstün, and Sara Hooker. Treasure hunt: Real-time targeting of the long tail using training-time markers, 2025. URL https://arxiv.org/abs/2506.14702.

Ege Erdil and Tamay Besiroglu. Algorithmic progress in computer vision, 2023.

European Union. Eu artificial intelligence act, 2024. URL https://artificialintelligenceact.eu/the-act/. Accessed: 2024-06-30.

Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners, 2019.

Fartash Faghri, David Duvenaud, David J. Fleet, and Jimmy Ba. A Study of Gradient Variance in Deep Learning. arXiv e-prints, art. arXiv:2007.04532, July 2020.

Ali Fawzi, Miklos Balog, Alex Huang, Ziwei Song, Yang Song, and Oriol Vinyals. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610:47–53, 2022.

Jonathan Frankle, David J. Schwab, and Ari S. Morcos. The early phase of neural network training. CoRR, abs/2002.10365, 2020. URL https://arxiv.org/abs/2002.10365.

Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks, 2019.

Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova Dassarma, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Scott Johnston, Andy Jones, Nicholas Joseph, Jackson Kernion, Shauna Kravec, Ben Mann, Neel Nanda, Kamal Ndousse, Catherine Olsson, Daniela Amodei, Tom Brown, Jared Kaplan, Sam McCandlish, Christopher Olah, Dario Amodei, and Jack Clark. Predictability and surprise in large generative models. In 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22. ACM, June 2022. doi: 10.1145/3531146.3533229. URL http://dx.doi.org/10.1145/3531146.3533229.

Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. Chatgpt outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30):e2305016120, 2023.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both Weights and Connections for Efficient Neural Network. In NeurIPS, pp. 1135–1143, 2015.

Danny Hernandez and Tom B. Brown. Measuring the algorithmic efficiency of neural networks. CoRR, abs/2005.04305, 2020. URL https://arxiv.org/abs/2005.04305.

Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer, 2021.

Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan Carl Guo, David Atkinson, Neil Thompson, and Jaime Sevilla. Algorithmic progress in language models, 2024a.

Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan Carl Guo, David Atkinson, Neil Thompson, and Jaime Sevilla. Algorithmic progress in language models, 2024b. URL https://arxiv.org/abs/2403.05812.

Sara Hooker. The Hardware Lottery, 2020.

Sara Hooker. Moving beyond “algorithmic bias is a data problem”. Patterns, 2(4):100241, 2021a. ISSN 2666-3899. doi: https://doi.org/10.1016/j.patter.2021.100241. URL https://www.sciencedirect.com/science/article/pii/S2666389921000611.

Sara Hooker. The hardware lottery. Commun. ACM, 64(12):58–65, nov 2021b. ISSN 0001-0782. doi: 10.1145/3467017. URL https://doi.org/10.1145/3467017.

Sara Hooker. On the limitations of compute thresholds as a governance strategy, 2024. URL https://arxiv.org/abs/2407.05694.

Sara Hooker, Nyalleng Moorosi, Gregory Clark, Samy Bengio, and Emily Denton. Characterising Bias in Compressed Models, 2020.

Lu Hou, Lifeng Shang, X. Jiang, and Qun Liu. DynaBERT: Dynamic BERT with Adaptive Width and Depth. ArXiv, abs/2004.04037, 2020.

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes, 2023.

Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, Weilin Zhao, Yankai Lin, Ning Ding, Zebin Ou, Guoyang Zeng, Zhiyuan Liu, and Maosong Sun. Predicting emergent abilities with infinite resolution evaluation, 2024. URL https://arxiv.org/abs/2310.03262.

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve, 2022.

Ziheng Jiang, Chiyuan Zhang, Kunal Talwar, and Michael C Mozer. Exploring the memorization-generalization continuum in deep learning. arXiv preprint arXiv:2002.03206, 2020.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020.

Ammar Khairi, Daniel D’souza, Ye Shen, Julia Kreutzer, and Sara Hooker. When life gives you samples: The benefits of scaling up inference compute for multilingual LLMs. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 27547–27571, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.1402. URL https://aclanthology.org/2025.emnlp-main.1402/.

Muhammad Khalifa, Hady Elsahar, and Marc Dymetman. A distributional approach to controlled text generation, 2021.

Md Tawkat Islam Khondaker, Abdul Waheed, El Moatez Billah Nagoudi, and Muhammad Abdul-Mageed. Gptaraeval: A comprehensive evaluation of ChatGPT on Arabic NLP, 2023.

Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. The stack: 3 tb of permissively licensed source code, 2022.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2012. doi: 10.1145/3091627. URL https://doi.org/10.1145/3091627.

Quoc V. Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, and Andrew Y. Ng. Building high-level features using large scale unsupervised learning, 2012.

Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. Rlaif: Scaling reinforcement learning from human feedback with ai feedback, 2023.

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better, 2022.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 9459–9474. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf.

Bei Li, Ziyang Wang, H. Liu, Quan Du, Tong Xiao, Chunliang Zhang, and Jingbo Zhu. Learning Light-Weight Translation Models from Deep Transformer. ArXiv, abs/2012.13866, 2020.

Jinfeng Liang, Yi Huang, Li Yin, Fatemeh Sadeghi, Yanping Yang, Xue Xiao, Hans-Olov Adami, Weimin Ye, Zhe Zhang, and Fang Fang. Cancer risk following surgical removal of tonsils and adenoids — a population-based, sibling-controlled cohort study in Sweden. BMC Medicine, 21, 05 2023. doi: 10.1186/s12916-023-02902-x.

Zhang Linghan, Yang Jianjun, Cheng Ying, Zhao Jingwu, Han Xuzhi, Zheng Zhifeng, and Xu Xiaoben. Artificial intelligence law of the people’s republic of china (draft for suggestions from scholars), 2024. URL https://cset.georgetown.edu/publication/china-ai-law-draft/.

Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training, 2024.

Andrew J. Lohn and Krystal A. Jackson. Will AI make cyber swords or shields?, August 2022.

Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, and Sara Hooker. The data provenance initiative: A large scale audit of dataset licensing & attribution in AI, 2023. URL http://arxiv.org/abs/2310.16787.

Sasha Luccioni, Boris Gamazaychikov, Sara Hooker, Régis Pierrard, Emma Strubell, Yacine Jernite, and Carole-Jean Wu. Light bulbs have energy ratings – so why can’t ai chatbots? Nature, 632(8026):736–738, may 2025. doi: 10.1038/d41586-024-02680-3. URL https://www.nature.com/articles/d41586-024-02680-3.

Karttikeya Mangalam and Vinay Uday Prabhu. Do deep neural networks learn shallow learnable examples first, 2019.

Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, and Sara Hooker. When less is more: Investigating data pruning for pretraining llms at scale, 2023.

Nestor Maslej, Loredana Fattorini, Raymond Perrault, Vanessa Parli, Anka Reuel, Erik Brynjolfsson, John Etchemendy, Katrina Ligett, Terah Lyons, James Manyika, Juan Carlos Niebles, Yoav Shoham, Russell Wald, and Jack Clark. Artificial intelligence index report 2024, 2024. URL https://arxiv.org/abs/2405.19522.

James Mcclelland, Bruce Mcnaughton, and Randall O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102:419–57, 08 1995. doi: 10.1037/0033-295X.102.3.419.

METR Team. Elicitation gap, 2023. URL https://metr.github.io/autonomy-evals-guide/elicitation-gap/.

David Mora, Viraat Aryabumi, Wei-Yin Ko, Sara Hooker, Julia Kreutzer, and Marzieh Fadaee. The art of asking: Multilingual prompt optimization for synthetic data, 2025. URL https://arxiv.org/abs/2510.19806.

Stephen Morris and Melissa Heikkilä. Deepmind slows down research releases to keep competitive edge in ai race, Apr 2025. URL https://www.ft.com/content/2ee1ffde-008e-4ea4-861b-24f15b25cf54.

Xueyan Niu, Bo Bai, Lei Deng, and Wei Han. Beyond scaling laws: Understanding transformer performance with associative memory, 2024.

Kyoung-Su Oh and Keechul Jung. Gpu implementation of neural networks. Pattern Recognition, 37(6):1311–1314, 2004. ISSN 0031-3203. doi: https://doi.org/10.1016/j.patcog.2004.01.013. URL http://www.sciencedirect.com/science/article/pii/S0031320304000524.

U.S. House Committee on Foreign Affairs. H.R. 8315 - Enhancing national frameworks for overseas restriction of critical exports act or ENFORCE Act, 2024. URL https://foreignaffairs.house.gov/wp-content/uploads/2024/05/HR-8315.pdf.

OpenAI. Our approach to frontier risk, 2023. URL https://openai.com/global-affairs/our-approach-to-frontier-risk/.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022.

David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon Emissions and Large Neural Network Training, 2021.

Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training, 2021.

Bryson R. Payne, Saeid O. Belkasim, G. Scott Owen, Michael C. Weeks, and Ying Zhu. Accelerated 2d image processing on gpus. In Vaidy S. Sunderam, Geert Dick van Albada, Peter M. A. Sloot, and Jack J. Dongarra (eds.), Computational Science – ICCS 2005, pp. 256–264, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. ISBN 978-3-540-32114-9.

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023.

Aidan Peppin, Anka Reuel, Stephen Casper, Elliot Jones, Andrew Strait, Usman Anwar, Anurag Agrawal, Sayash Kapoor, Sanmi Koyejo, Marie Pellat, Rishi Bommasani, Nick Frosst, and Sara Hooker. The reality of ai and biorisk. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’25, pp. 763–771, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400714825. doi: 10.1145/3715275.3732048. URL https://doi.org/10.1145/3715275.3732048.

Luiza Pozzobon, Beyza Ermis, Patrick Lewis, and Sara Hooker. Goodtriever: Adaptive toxicity mitigation with retrieval-augmented models, 2023.

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. Scaling Language Models: Methods, Analysis & Insights from Training Gopher, 2021.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2023.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2020.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Jian Su, Kevin Duh, and Xavier Carreras (eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL https://aclanthology.org/D16-1264/.

Reuters. U.S. eyes curbs on China’s access to AI software behind apps like ChatGPT, 2024. URL https://www.reuters.com/technology/us-eyes-curbs-chinas-access-ai-software-behind-apps-like-chatgpt-2024-05-08/.

Yangjun Ruan, Chris J. Maddison, and Tatsunori Hashimoto. Observational scaling laws and the predictability of language model performance, 2024. URL https://arxiv.org/abs/2405.10938.

Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage?, 2023.

Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, and Sanmi Koyejo. Why has predicting downstream capabilities of frontier ai models with scale remained elusive?, 2024. URL https://arxiv.org/abs/2406.04391.

R.R. Schaller. Moore’s law: past, present and future. IEEE Spectrum, 34(6):52–59, 1997. doi: 10.1109/6.591665.

Reva Schwartz, Apostol Vassilev, Kristen Greene, Lori Perine, Andrew Burt, Patrick Hall, et al. Towards a standard for identifying and managing bias in artificial intelligence. NIST special publication, 1270(10.6028), 2022.

Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. Green AI. Communications of the ACM (CACM), 63(12):54–63, November 2020. ISSN 0001-0782. doi: 10.1145/3381831. URL https://doi.org/10.1145/3381831.

California Senate. Senate Bill 1047: Safe and Secure Innovation for Frontier Artificial Intelligence Models Act, 2024. URL https://legiscan.com/CA/text/SB1047/id/2999979.

Jaime Sevilla, Lennart Heim, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, and Pablo Villalobos. Compute trends across three eras of machine learning. In 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, July 2022. doi: 10.1109/ijcnn55064.2022.9891914. URL http://dx.doi.org/10.1109/IJCNN55064.2022.9891914.

Luísa Shimabucoro, Sebastian Ruder, Julia Kreutzer, Marzieh Fadaee, and Sara Hooker. Llm see, llm do: Guiding data generation to target non-differentiable objectives, 2024. URL https://arxiv.org/abs/2407.01490.

Shoaib Ahmed Siddiqui, Nitarshan Rajkumar, Tegan Maharaj, David Krueger, and Sara Hooker. Metadata archaeology: Unearthing data subsets by leveraging training dynamics, 2022.

Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Souza Moura, Dominik Krzemiński, Hakimeh Fadaei, Irem Ergün, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu Minh Chien, Sebastian Ruder, Surya Guthikonda, Emad A. Alghamdi, Sebastian Gehrmann, Niklas Muennighoff, Max Bartolo, Julia Kreutzer, Ahmet Üstün, Marzieh Fadaee, and Sara Hooker. Aya dataset: An open-access collection for multilingual instruction tuning, 2024a.

Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Souza Moura, Dominik Krzemiński, Hakimeh Fadaei, Irem Ergün, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu Minh Chien, Sebastian Ruder, Surya Guthikonda, Emad A. Alghamdi, Sebastian Gehrmann, Niklas Muennighoff, Max Bartolo, Julia Kreutzer, Ahmet Üstün, Marzieh Fadaee, and Sara Hooker. Aya dataset: An open-access collection for multilingual instruction tuning. arXiv preprint arXiv:2402.06619, 2024b.

Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David Ifeoluwa Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Sebastian Ruder, Wei-Yin Ko, Antoine Bosselut, Alice Oh, Andre Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadaee, Beyza Ermis, and Sara Hooker. Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 18761–18799, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.919. URL https://aclanthology.org/2025.acl-long.919/.

Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari S. Morcos. Beyond neural scaling laws: beating power law scaling via data pruning, 2023.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, and Abubakar Abid et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3645–3650, Florence, Italy, July 2019a. Association for Computational Linguistics. doi: 10.18653/v1/P19-1355. URL https://aclanthology.org/P19-1355.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in nlp, 2019b.

Michael P. H. Stumpf and Mason A. Porter. Critical truths about power laws. Science, 335(6069):665–666, 2012. doi: 10.1126/science.1216142. URL https://www.science.org/doi/abs/10.1126/science.1216142.

Richard Sutton. The bitter lesson, 2019. URL http://www.incompleteideas.net/IncIdeas/BitterLesson.html.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions, 2014.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Q. Tran, Dani Yogatama, and Donald Metzler. Scaling laws vs model architectures: How does inductive bias influence scaling?, 2022.

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science, 2022.

Gemini Team, Rohan Anil, Sebastian Borgeaud, and Jean-Baptiste Alayrac et al. Gemini: A family of highly capable multimodal models, 2024a.

Gemma Team. Gemma, 2024. URL https://www.kaggle.com/m/3301.

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy. Gemma: Open models based on gemini research and technology, 2024b.

Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.

Yvonne M. Terry-McElrath, Sherry Emery, Gery Szczypka, and Lloyd D. Johnston. Potential exposure to anti-drug advertising and drug-related attitudes, beliefs, and behaviors among united states youth, 1995-2006. Addictive Behaviors, 36 (1-2):116–124, 2011. doi: 10.1016/j.addbeh.2010.09.005.

Megh Thakkar, Tolga Bolukbasi, Sriram Ganapathy, Shikhar Vashishth, Sarath Chandar, and Partha Talukdar. Self-influence guided data reweighting for language model pre-training, 2023.

The White House. Executive order on the safe, secure, and trustworthy development and use of artificial intelligence, 2023. URL https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/.

Neil C. Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F. Manso. The Computational Limits of Deep Learning. arXiv e-prints, art. arXiv:2007.05558, July 2020.

Kushal Tirumala, Daniel Simig, Armen Aghajanyan, and Ari S. Morcos. D4: Improving llm pretraining via document de-duplication and diversification, 2023.

Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Colin Raffel, Pedro H. Martins, André F. T. Martins, Jessica Zosa Forde, Peter Milder, Edwin Simpson, Noam Slonim, Jesse Dodge, Emma Strubell, Niranjan Balasubramanian, Leon Derczynski, Iryna Gurevych, and Roy Schwartz. Efficient Methods for Natural Language Processing: A Survey. Transactions of the Association for Computational Linguistics, 11:826–860, 07 2023. ISSN 2307-387X. doi: 10.1162/tacl_a_00577. URL https://doi.org/10.1162/tacl_a_00577.

Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. Zephyr: Direct distillation of lm alignment, 2023.

Aniket Vashishtha, Kabir Ahuja, and Sunayana Sitaram. On evaluating and mitigating gender biases in multilingual settings, 2023.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.

Pablo Villalobos and David Atkinson. Trading off compute in training and inference, 2023. URL https://epochai.org/blog/trading-off-compute-in-training-and-inference. Accessed: 2024-05-28.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent Abilities of Large Language Models, 2022. URL https://arxiv.org/abs/2206.07682.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.

BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, and Suzana Ilić et al. Bloom: A 176b-parameter open-access multilingual language model, 2023. URL https://arxiv.org/abs/2211.05100.

Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga, Jinshi Huang, Charles Bai, Michael Gschwind, Anurag Gupta, Myle Ott, Anastasia Melnikov, Salvatore Candido, David Brooks, Geeta Chauhan, Benjamin Lee, Hsien-Hsin Lee, Bugra Akyildiz, Maximilian Balandat, Joe Spisak, Ravi Jain, Mike Rabbat, and Kim Hazelwood. Sustainable AI: Environmental Implications, Challenges and Opportunities. In D. Marculescu, Y. Chi, and C. Wu (eds.), Proceedings of Machine Learning and Systems, volume 4, pp. 795–813, 2022. URL https://proceedings.mlsys.org/paper/2022/file/ed3d2c21991e3bef5e069713af9fa6ca-Paper.pdf.

Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation models, 2023.

Fuzhao Xue, Valerii Likhosherstov, Anurag Arnab, Neil Houlsby, Mostafa Dehghani, and Yang You. Adaptive computation with elastic input sequence, 2023.

Hua Zhang. The History of Microwave Heating, January 2017.

Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, and Sara Hooker. Aya model: An instruction finetuned open-access multilingual language model, 2024.
