49 Comments
Dec 27, 2023 · edited Dec 27, 2023 · Liked by Dwarkesh Patel

"Presumably, animal drawings in LaTeX are not part of GPT-4’s training corpus."

Why presume? If you simply search for "drawing animals with latex", you find a huge pre-existing literature on how to make animal drawings with raw LaTeX code or libraries like TikZ. LaTeX art is a well-established thing, and I fooled around with it as an undergrad long ago.

Never underestimate how much leakage there may be between training and test sets, and how much memorization can be happening! "Text from the internet" contains much weirder stuff than you think.

But the deeper point is that it's completely impossible to evaluate LLM performance without knowing the training set, which the big companies all refuse to reveal. My favorite paper on this is "Pretraining on the Test Set Is All You Need" https://arxiv.org/abs/2309.08632 where the author shows that you can beat all the big LLMs on the benchmarks with a tiny LLM if you train on the test set. It's a brilliant parody, but it has a point: how do we know the big LLMs aren't also doing this accidentally? I wouldn't update my beliefs much on how well LLMs can generalize, until the big companies reveal much more about how their models were built.

author

Hmm, good point, though I think the other examples in the Sparks of AGI paper together paint a pretty conclusive picture, even if any one of them alone could be explained by cherry-picking


Assuming it'll be impossible to prove the contents of an LLM's pre-training dataset, I'm curious what you would consider convincing enough to significantly update your beliefs that LLMs can "generalize"?

Would it be something like ARC (https://github.com/fchollet/ARC)? Or even if GPT-N gets a high ARC score, one could say that maybe OpenAI just trained it on the same or a similar reasoning dataset. There's also GAIA (https://arxiv.org/abs/2311.12983), but one could apply the same reasoning and say that there's something in the training dataset that gives it a non-trivial advantage.

Anecdotally, my personal experience using them to solve work problems (at a human level) on proprietary datasets and tasks has strongly convinced me that GPT-4 is an amazing generalist. Not AGI (by DeepMind's definition), but I would argue close enough that I'm a "believer".

Dec 28, 2023 · edited Dec 28, 2023

Well, ARC or GAIA won't work now, since they have been put on the Internet. That has been the problem with machine learning benchmarks throughout my career. Even if their test sets don't get accidentally incorporated into pre-training corpora, there will tend to be generalization error in model selection and hyperparameter tuning. After all, people will only select, train and publish models that do well on the benchmarks.

I think it's too pessimistic to assume we can't know or understand the contents of LLM pre-training datasets. They can be published and studied. With the training sets, we can run ablation studies or statistical studies, train probing models and so forth to investigate the extent to which the LLM produced its answers by memorization or generalization. The main barrier is the refusal of (most of) the big AI companies to open their datasets. This isn't just a problem with LLMs. For example, Google still keeps its computer vision datasets like JFT-3B secret, even as it publishes research based on it. (What's the point of that? No one can replicate the research. Why boast about their neat toy?) Dataset secrecy is a miserable practice that needs to be condemned and rejected.
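To make "statistical studies" concrete: the crudest version is just a verbatim n-gram overlap check between a benchmark and a candidate training corpus. A minimal sketch (assuming whitespace tokenization and the common 13-gram heuristic; real audits handle tokenization, near-duplicates, and thresholds far more carefully):

```python
def ngrams(text, n=13):
    # 13-gram overlap is a common heuristic for flagging verbatim test-set leakage
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items, training_docs, n=13):
    # Fraction of benchmark items that share at least one n-gram with the training corpus
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & train_grams)
    return flagged / max(len(benchmark_items), 1)
```

Of course, you can only run even this crude check if the training corpus is available, which is exactly my point.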

I have no doubt that the various LLMs are useful, and I'm glad you are getting value from them. But useful isn't the same as intelligent. I think that most people actually just want a "super search engine" rather than an actually intelligent system. They don't really care or need to care if the system is drawing its response from existing text somewhere out there, or if it's really thinking, understanding and generating something truly new using real intelligence (whatever that is!). Memorization isn't a barrier to usefulness. But it is a barrier to AGI, and if it is happening too much it means that scaling won't work.


I'm always curious about the use case when I read these comments where folks problem-solve at work. Can you be specific about which work problems you are actually solving? Are those problems actually novel, or just novel to you?


Email security. Input: an email plus metadata (e.g. domain scores); output: a classification plus an explanation.

I'm sure there are plenty of email spam classification examples in pre-training, but my expectation is that only gets you up to x%, and only for the more obvious cases. GPT-4's performance on novel zero-days (very convincing malicious spam that notably gets past most existing filters) was very high. Higher than some (but not the more senior) trained security analysts.

Some related marketing content: https://abnormalsecurity.com/blog/how-abnormal-trains-llms
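To make the input/output shape concrete, here is a minimal sketch of that kind of setup (the model name, labels, and prompt are illustrative placeholders, not our production pipeline):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify_email(email_text: str, metadata: dict) -> str:
    # Ask the model for a verdict plus a short explanation, given the raw email and metadata.
    prompt = (
        "You are an email security analyst. Classify the email as BENIGN, SPAM, or PHISHING, "
        "then explain your reasoning in two sentences.\n\n"
        f"Metadata: {metadata}\n\nEmail:\n{email_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; any capable chat model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# e.g. classify_email("Your account is locked, click here...", {"domain_score": 0.12})
```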


Just a spectacular piece of writing. Thank you Dwarkesh for being a continuing source of enlightenment.


"Even taking handwavy scaling curves seriously implies that we’ll need 1e35 FLOPs for an AI that is reliable and smart enough to write a scientific paper (that’s table stakes for the abilities an AI would need to automate further AI research and continue progress once scaling becomes infeasible)"

This is almost the opposite of the truth, and I'm confused why Tamay and Matthew didn't correct this. They say in the report "The Direct Approach yields an upper bound on the training compute to automate a task because there may be more efficient ways of getting a model to automate some task than to train the model to emulate humans performing that task directly."

So your sentence should instead read: "even taking handwavy scaling curves seriously implies that, without any additional algorithmic progress, simply scaling up existing architectures with more data should result in AI that is reliable and smart enough to write academic papers by 1e35 FLOP." Notice the huge difference between "we need at least X FLOP" and "we need at most X FLOP."

Realistically we'll need far less than 1e35 FLOP because, as they say, there are way more efficient ways to get AI to be good at some task than the way implicitly assumed in this model.

(i.e., train the AI on human demonstrations until it is so damn good at simulating human demonstrations that it can autoregressively generate tens of thousands of tokens before anyone can tell the difference. This is SO UNNECESSARILY HARD. Imagine if you, a human, were being trained to solve hard math problems, but the way we did it was by showing you completed proofs with chunks of the ending cut off and asking you to finish them, with a timer so you couldn't stop to think; you just had to keep typing as fast as possible, and you couldn't undo mistakes either. Grok how much harder it would be to learn to prove difficult math theorems this way! "Harder" is an understatement; more like "impossible"!)
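For anyone who wants the objective spelled out: "emulate human demonstrations" just means next-token prediction with teacher forcing, roughly this (a schematic PyTorch sketch; `model` is assumed to map token ids to per-position logits):

```python
import torch.nn.functional as F

def imitation_loss(model, tokens):
    # tokens: (batch, seq_len) ids of a human-written demonstration, e.g. a complete proof.
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab_size)
    # Every position must predict the exact next token of the human text:
    # no pausing to think, no backtracking, no credit for a different-but-valid step.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```

The loss never asks whether the generated proof would have been valid, only whether it matched the human text token for token.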

Jan 3 · Liked by Dwarkesh Patel

>Even taking handwavy scaling curves seriously implies that we’ll need 1e35 FLOPs for an AI that is reliable and smart enough to write a scientific paper

I interpreted this as an estimate, not a lower-bound.

I also read Dwarkesh as incorporating algorithmic progress (e.g. in his discussion of data efficiency), though this is not explicitly reflected by e.g. shaving this off the required physical compute budget.

Using the 1e35 estimate of effective compute seems fine given that the topic of discussion is whether LLM-like models with something like on-trend algorithmic progress will deliver the ability to do science. I think this number is pretty made up. It seems plausible, though not very likely, that more than that is needed for an LLM+ to emulate what humans do when they do science.


Good point, I probably just misinterpreted it as a lower bound. I guess I mostly retract my criticism and replace it with: "your current phrasing might be read by some readers as a lower bound instead of an upper bound or central estimate; IMO it should be more of a soft upper bound, but if you want to treat it as a central estimate, fair enough; anyhow, please clarify."

author

But it's also possible that it takes more than 1e35 FLOPs, for normal Hofstadter's law type reasons - especially given that the scaling laws from which we've gotten the 1e35 number are based on the entropy of internet data, not necessarily the entropy of scientific reasoning.

author

So settling on 1e35 as a central estimate doesn't seem unreasonable, and may indeed be an underestimate.


If you had said 1e35 as a central estimate, I wouldn't have complained so much. Instead you set it as a lower bound!

I do agree that it's possible it'll take more than 1e35, but I think the balance of considerations pushing it lower is a lot stronger than the balance of considerations pushing it higher.

Also, Hofstadter's law doesn't apply here IMO. This is an important point and a huge crux between me and other people apparently; I briefly explain my reasoning in this discussion: https://www.lesswrong.com/posts/K2D45BNxnZjdpSX2j/ai-timelines#Hofstadter_s_law_in_AGI_forecasting


Good post by the way! I forgot to say that earlier.


It seems highly unreasonable to me - references to Hofstadter aside, the “direct approach” does not seem compelling as a tight bound on the compute needed *at all*. I am very curious what other academics would think about this limit.


Not sure why you changed your tune in replies - the “Direct Approach” does not tell us much at all about when we will reach certain capabilities and the 1e35 estimate is likely a very very high overestimate. There is no such thing as an ‘ideal discriminator’ for the distribution of all possible human utterances.


I mean I think I agree with those claims but didn't feel confident enough to argue for them succinctly here; I did say I think we'll probably need far less than 1e35.

The direct approach is just one way of reading the tea leaves among many; it's not my favorite way, but it is IMO at least as plausible as (if not more plausible than) a common alternative, which is to compare model size to human brain size and then either stop there or additionally assume that models will be data-inefficient and trained using Chinchilla scaling laws.

Dec 26, 2023 · Liked by Dwarkesh Patel

re: synthetic data

My understanding is that to the extent LLMs have world models, it is by establishing statistical relationships between concepts, as represented by each concept's constituent components (tokens). Unless the synthetic data provides the model with new information about the relationships between various concepts, I don't understand the mechanism by which it can make LLMs more intelligent.

I think you would need an LLM to "observe" more phenomena about the world and create its own additional training data (akin to Whisper) rather than create completely artificial training data.

Dec 27, 2023 · Liked by Dwarkesh Patel

Daniel: the goal is to make models which are more reliable reasoners. Not more knowledgeable, but better able to come to reliable deductions.

Synthetic data which teaches them how to “think step by step” might eliminate a tremendous number of errors in reasoning that we observe.
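A toy example of what I mean, where the generating program knows the right answer, so every worked step in the synthetic text is guaranteed correct (just a sketch; real pipelines target much richer reasoning than arithmetic):

```python
import random

def synthetic_example(rng=random):
    # Build a small multi-step arithmetic problem together with a fully worked solution.
    a, b, c = rng.randint(2, 99), rng.randint(2, 99), rng.randint(2, 9)
    question = f"Compute ({a} + {b}) * {c}."
    steps = [
        f"Step 1: {a} + {b} = {a + b}.",
        f"Step 2: {a + b} * {c} = {(a + b) * c}.",
        f"Answer: {(a + b) * c}",
    ]
    return question + "\n" + "\n".join(steps)

# Each call yields a fresh (question, step-by-step solution) training document.
print(synthetic_example())
```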

Dec 27, 2023 · Liked by Dwarkesh Patel

Great post! You should write more in this style


Great writing, insightful. I enjoyed reading that.

Jan 19 · Liked by Dwarkesh Patel

You should put this on the podcast feed

author

done


The point about technological development preceding theory is key - in that it typifies the misconception that anchors most current thought around AI. This is not a novel technology, but a novel science - and science does not develop in the absence of supporting theory or a large body of empirical evidence drawn from systematic experimentation.

The key terms in AI still lack consensus definitions - weighing comparative architectural success in ‘reasoning’ is meaningless when there is no consensus on what reasoning denotes in the context of AI and, crucially, a complete disinterest in the organic processes that the principals are seeking to emulate. The fact that clarificatory notions like mechanistic interpretability derive from an unfancied stream of the overall AI epistemology, and that the computer scientists in question are satisfied with the obscurity of even these present systems, which are unfit to emulate the functions that are the stuff of ASI dreams, demonstrates that the appetite to grasp the foundations of this stuff is just not there (yet). I imagine this is substantially a function of the money to be made in overestimating the near-term attainability of transformative versions of this technology, and partly one of profound idealistic self-deception on the part of CS majors who have decisively limited conceptions of how organic intelligence works.

Once one bridges to notions like consciousness, the idea that extending present methods will lead to success becomes completely fanciful - to credit scaling is to credit the notion that one of the most analysis-impervious concepts in all of the knowledge estate (i.e. consciousness), which we cannot even faintly model in its mechanics or in the mathematical formalisms by which it must presumably abide, will be achieved in ignorance even of the precise outline of the unknown areas within it, by comparatively straightforward efforts of iterate-and-scale. It's unimaginably unlikely.


Interesting, please unpack.

I would not hitch too many wagons to the problem of consciousness, which I suspect is more of a DV (dependent variable) than an IV (independent variable). A child learns the meaning of the word apple, first spoken, then written. As the individual mind fills up with such meaning via learning (from the apple example to the meaning of sorrow…), consciousness takes real form. See:

https://www.sciencedirect.com/science/article/pii/S0149763422002615

The problem for human-like intelligence lies with meaning-making, including whether emergent insight meaning is reducible to training. The mechanism of meaning-making can be operationalized, trained (perhaps), and tested for.


Very interesting study, thank you. I hold firmly to the position that neurobiology will play a vastly more central role in the achievement of AI than it presently does at most AI companies. Language will assuredly play a part in this, but the assertion of language's particular centrality to consciousness seems overextended in the paper.

"Indeed, consciousness is considered to be widespread in the animal kingdom, including cephalopods, birds, and mammals (Birch et al., 2020). While these animals (including our pets) exhibit primary consciousness, humans have higher-order consciousness that might be the result of a lifetime’s exposure to more than three billion words."

Ha, I mean, yeah, it 'might' be the result of that, but that's so contingent I wouldn't even know where to begin to interrogate it. It's not even so much a shaky first principle as a middle principle occurring somewhere in a chain of reasoning whose initial and subsequent principles we aren't entreated to know.

I would suggest that 'the problem of "the problem of consciousness"' is very much epitomised within that paper, and also within your first in-earnest paragraph - there are several concepts within those three sentences ("consciousness", "learns", "meaning") that, while consensus-defined in the vernacular (and even then only by necessity), cannot at all be defined to consensus mechanically, nor formally (i.e. in such a way as would allow them to be assayed analytically, or expressed programmatically). We know what some of the output characteristics of 'learning' are but we have few means, beyond comparison of those output characteristics, of determining how similar the process of 'learning' is between an organic agent and a synthetic one. This is one of the epistemological lacunae into which we've ventured using OI shorthands for AI - whereby we confuse the shorthand ("this AI is 'learning'!") for the function-in-earnest ("this AI is learning!"), the reproduction for the production.

You can see this in the paper, which can't get beyond conflation of different concepts - concepts that are, again, defined only in the scarcity of proof, and not to an analytic standard - through comparison of output characteristics.

"To summarise, there is a fundamental relationship between language, inner speech, and higher-order consciousness that roughly seems to be associated with the separation of words (as felt and heard) from our other ‘thoughts’ (as in primary consciousness) that allow us to describe the world and ourselves in a manner that both creates and extends the self in time."

There is not much we can see 'in the engine' that demonstrates this. Yes, the fact that these three things are only known to reside in a single brain suggests plausible interrelation, but faintly, in my view, and doesn't do anything to illuminate the potential dynamics of their relationship in forming consciousness. The data in the paper is suggestive of language's power as a meaning-retrieval mechanism (unsurprising) but is not further allusive.

A few years ago I could be heard suggesting that linguistics researchers would become the most important scientists of the 21st century, given their presumed centrality to the development of artificial intelligence and consciousness. Ironically, it's actually the arrival, output, and limitations of LLMs that have convinced me conclusively that I was wrong.


This article is very interesting, but I'm slightly struggling with the logic.

The first part lays out some detailed arguments on why we might or might not expect scaling to continue. This is a very interesting debate. The data issue still looks to me like a big problem. Feeding LLM output into LLM input feels like it should break, there's no extra information coming in here, just reinforcing what the LLM already thinks. But that's just an intuition.

The second part (the conclusion) basically throws all this reasoning out of the window and just looks at track records. That's a totally reasonable argument, but it makes the first part seem a bit pointless. I'd be interested to see specific predictions systematically evaluated to show how well the hypothesis has held up over the past decade.


"The data issue still looks to me like a big problem. Feeding LLM output into LLM input feels like it should break, there's no extra information coming in here, just reinforcing what the LLM already thinks."

There are plenty of ways an LLM's own output can increase its own signal, and I think reinforcement learning (specifically policy optimization) is already an entire domain of models learning from their own outputs. Granted, you do need *something* external (a preference, a hint, a reward, or some other type of feedback), but I would argue that these are much less limiting than the internet itself as a token source.

Cool example: STaR: Bootstrapping Reasoning With Reasoning (https://arxiv.org/abs/2203.14465)
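The loop in that paper is roughly this shape (a schematic sketch; generate_rationale, answer_of, and finetune are placeholders standing in for the paper's actual components):

```python
def star_iteration(model, problems, answers, generate_rationale, answer_of, finetune):
    # One round of STaR-style bootstrapping: sample rationales, keep only the ones
    # that reach the known correct answer, then fine-tune on the kept examples.
    kept = []
    for problem, gold in zip(problems, answers):
        rationale = generate_rationale(model, problem)
        if answer_of(rationale) == gold:
            kept.append((problem, rationale, gold))
    # The answer check is the external signal mentioned above.
    return finetune(model, kept)
```

(The paper also adds a "rationalization" step for problems the model gets wrong, which I've left out here.)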


What does self-play actually mean? Getting AI to create text and using another AI to label it as good or bad, then training AI on that data.

What if AI were trained on captcha-like data? Sending tweets and other forms of communication to humans and scoring for intelligible responses?
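To answer my own first question in code form, this is roughly the loop I have in mind (all of these objects and functions are hypothetical stand-ins):

```python
def self_play_round(generator, judge, prompts, finetune, threshold=0.8):
    # The generator writes candidate text; a separate judge model labels it good or bad;
    # only the candidates the judge likes are kept as new training data.
    new_data = []
    for prompt in prompts:
        candidate = generator.generate(prompt)
        score = judge.score(prompt, candidate)
        if score >= threshold:
            new_data.append((prompt, candidate))
    return finetune(generator, new_data)
```

The captcha idea would just swap the judge model for scattered human feedback.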


I'll register my LLM-scaling-to-ASI skepticism right here and now. A brilliant neural net outsmarting humanity would be like a brilliant neuron outsmarting a brain. It can only work if a part emulates the whole, e.g. via a virtual civilization, perhaps needing physical feedback/friction.

GenAI will be economically and politically transformative, but will not recursively self-improve and become ASI.


Why? Why can't it instead be that a network of brilliant neural nets outsmarting humanity is like a brilliant neural net outsmarting a brain? In your analogy, an LLM is like a single neuron, which I would have expected to be difficult to accept without some more reasoning for how the analogy holds.

And I don't see how this analogy has anything to do with the question of whether AI can recursively self-improve.


The smartest thing in your skull is your entire brain, not any one neuron. To outsmart the smartest thing in your skull, it would not suffice for an artificial neuron to be smarter than your smartest neuron.

The smartest thing on our planet is the planetary human civilization, not any one human brain. To outsmart the smartest thing on our planet, it would not suffice for an LLM to be smarter than Earth's smartest human brain.

The human brains who are impressed at LLMs outsmarting them are not smart enough to recognize that human brains are not the smartest thing on our planet. :-)

My statement about recursive self-improvement is a separate prediction, not strictly dependent on whether my foregoing analogy is valid. I'm here to get on the record, not persuade people. The reality will be very clear in 5-10 years.


I would be interested in breaking the question down more. Sometimes you say “automate most cognitive labor” as if that is a fixed and singular goal. But there are so many ways to “automate most cognitive labor”, and they would transform the world to different degrees. Would lawyers be automated away? Doctors? Would we still have all the same jobs, but everyone has decent assistants? Will physical jobs like folding laundry just be harder to handle than software engineering? Sales? Marketing? Finance?

I know we don’t have the answers to these questions, but the whole post is about grappling with a question that we don’t have the answer to. Would it be easier or harder to answer a question like “how will the advance of AI affect medical practice”?


Nice post, this was a good read!

> “Oh we have 5x less data than we need - we just need a couple of 2x improvements in data efficiency, and we’re golden”

Out of curiosity, did you manage to find any numbers on what the data efficiency doubling rate is?


That 1e35 FLOP number is an *upper bound* and is basically a complete shot in the dark. We have no clue what abilities are unlocked at a given perplexity.


We still have a lot of signal we can wring out of the data. We have barely scratched the surface of long-sequence modeling.


On data bottlenecks, I suspect people will discover new stores of data now that they are valuable to language models. Scarcity creates high prices, which in turn alleviate scarcity.

Can new data come from synthetic data? I think no (except for specific areas) unless we have teams of people filtering text generated by language models. See my discussion with Gwern here:

https://www.lesswrong.com/posts/vh4Cq6gwBAcPSj8u2/bootstrapping-language-models

The straightforward solution is to steadily increase data quality and quantity using more and more people. Teams of people (with AI assistants) can produce high-quality text and training datasets for the LMs, putting increasing amounts of effort and resources into data quality as the models scale. We can repeat this process for every conceivable task, putting in ever more work to weed out errors and make models robust.
