Jan 9 2025
Introduction
I recently came across a very interesting paper from Meta, Pagnoni et al. (2024). Its title, Byte Latent Transformer: Patches Scale Better Than Tokens, carries a rather interesting assumption: that it may be more desirable to scale better than to simply perform better.
The main result of the paper is, unsurprisingly, that their new transformer, BLT, scales better than previous SOTA techniques:
![[Pasted image 20250109011752.png]] Figure 1, Pagnoni et al. (2024).
This is a remarkable result: they train models up to 8B parameters, and if the scaling trend were to continue up to, say, 405B (the size of the largest Llama 3 model, which is said to be on par with GPT-4), then their technique would very likely yield a better model.
The central idea behind their proposed technique is to replace the fixed tokenizer with a learned patching strategy, thereby allowing the model to dynamically allocate compute during inference. Only a few parameters are added, and the architecture scales much better.
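To make the dynamic-allocation idea concrete, here is a minimal sketch of entropy-based patching in the spirit of the paper: a small byte-level model scores how predictable the next byte is, and a new patch begins wherever that entropy crosses a threshold, so hard-to-predict regions get more (shorter) patches and easy regions are folded into long ones. The function name, the threshold value, and the random logits are my own illustrative choices, not the paper's implementation.

```python
import torch

def entropy_patch_boundaries(next_byte_logits: torch.Tensor, threshold: float) -> list[int]:
    """Given per-position next-byte logits from a small byte-level LM
    (shape [seq_len, 256]), start a new patch wherever the predictive
    entropy exceeds `threshold`. Returns the start index of each patch."""
    log_probs = torch.log_softmax(next_byte_logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # [seq_len]
    boundaries = [0]  # the first byte always opens a patch
    for i in range(1, entropy.shape[0]):
        if entropy[i] > threshold:  # hard-to-predict byte -> open a new patch
            boundaries.append(i)
    return boundaries

# Toy example with random logits: in real text, low-entropy stretches
# (e.g. the middle of a common word) would merge into long patches.
logits = torch.randn(16, 256)
print(entropy_patch_boundaries(logits, threshold=4.0))
```

In the full architecture, these patches feed a large latent transformer, while lightweight byte-level encoder and decoder modules handle the bytes within each patch.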
In this blog post, I argue that their results mirror what we've seen across many domains, where large labs have benefited significantly from pivoting to more scalable architectures rather than more efficient ones.