Dwarkesh (host of a popular AI podcast) has asked many guests the following question:
As a scientist yourself, what should we make of the fact that despite having basically every known fact about the world memorized, these models haven’t, as far as I know, made a single new discovery? Even a moderately intelligent person who has so much stuff memorized would make all kinds of new connections (connect fact x, and y, and the logical implication is new discovery z).
Of course, scientists might be using LLMs privately to generate ideas without saying so, but the models clearly haven’t revolutionized scientific discovery yet.
The question assumes that new discoveries naturally emerge from gathering enough facts. That’s essentially Francis Bacon’s method:
first, a description of facts; second, a tabulation, or classification, of those facts into three categories…; third, the rejection of whatever appears, in the light of these tables, not to be connected with the phenomenon under investigation and the determination of what is connected with it.
Francis Bacon was right about the importance of empiricism, but he was wrong that scientific progress emerges from exhaustive data collection. Data alone is noisy; the scientific method that eventually developed depended on intelligent people proposing insightful hypotheses that could then be tested. Similarly, an AI trained only on next-token prediction may develop very broad knowledge of the world, but there’s little reason to expect it to start thinking like a scientist or to come up with insightful hypotheses.
Instead, the LLM needs to be combined with other tools. For example, Google DeepMind recently announced AlphaEvolve, which pairs LLMs with automated evaluators and an evolutionary loop that refines prompts and evolves better solutions over time. AlphaEvolve has discovered many new algorithms, including a faster matrix multiplication procedure and optimizations to the very infrastructure used to train the models that power it.
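To make that loop concrete, here’s a minimal sketch of the pattern in Python: an evolutionary search in which a mutation step stands in for the LLM proposing edits and a scoring function stands in for the automated evaluator. Everything here, from the toy objective to the population sizes, is my own illustration rather than AlphaEvolve’s actual code.

```python
import random

# Illustrative sketch of an AlphaEvolve-style loop (not DeepMind's code).
# The real system asks an LLM to propose program edits and benchmarks the
# results; here a random perturbation and a toy objective play those roles
# so the example runs on its own.

def evaluate(candidate):
    """Score a candidate. In the real system this would compile and
    benchmark a program (e.g. count multiplications in a matmul routine)."""
    x, y = candidate
    return -((x - 3.0) ** 2 + (y + 1.0) ** 2)  # toy objective: maximize

def propose_variant(parent):
    """Stand-in for the LLM mutation step: nudge the parent solution.
    A real implementation would prompt a model with the parent program,
    its score, and instructions to suggest an improved version."""
    x, y = parent
    return (x + random.gauss(0, 0.5), y + random.gauss(0, 0.5))

def evolve(generations=200, population_size=20):
    population = [(random.uniform(-10, 10), random.uniform(-10, 10))
                  for _ in range(population_size)]
    for _ in range(generations):
        scored = sorted(population, key=evaluate, reverse=True)
        parents = scored[: population_size // 4]           # keep the best
        children = [propose_variant(random.choice(parents))
                    for _ in range(population_size - len(parents))]
        population = parents + children                    # next generation
    return max(population, key=evaluate)

if __name__ == "__main__":
    best = evolve()
    print(f"best candidate: {best}, score: {evaluate(best):.4f}")
```

The key design point is that the LLM only has to propose plausible variations; the evaluator, not the model, decides what counts as progress.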
Algorithms are straightforward to evaluate quickly and automatically, but most other areas of science are not. Google has also been working on a more general AI co-scientist, a multi-agent system in which AIs "generate, debate, and evolve" ideas, and it has already produced original and useful scientific hypotheses.
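A rough sketch of that "generate, debate, and evolve" pattern is below, with stub functions standing in for the LLM agents. The agent roles, the scoring scheme, and the topic string are assumptions for illustration, not the co-scientist’s actual design.

```python
from dataclasses import dataclass
import random

# Illustrative "generate, debate, evolve" loop in the spirit of the AI
# co-scientist. Each stub below would be an LLM call in a real system.

@dataclass
class Hypothesis:
    text: str
    score: float = 0.0

def generate(topic, n=4):
    """Generation agent: would prompt an LLM for candidate hypotheses."""
    return [Hypothesis(f"{topic}: candidate mechanism #{i}") for i in range(n)]

def debate(hypotheses):
    """Debate/review agents: would have LLMs argue for and against each
    hypothesis and assign a plausibility score. Random scores stand in here."""
    for h in hypotheses:
        h.score = random.random()
    return hypotheses

def evolve(hypotheses, keep=2):
    """Evolution step: keep the strongest hypotheses and refine them."""
    survivors = sorted(hypotheses, key=lambda h: h.score, reverse=True)[:keep]
    return [Hypothesis(h.text + " (refined)") for h in survivors]

def co_scientist(topic, rounds=3):
    pool = generate(topic)
    for _ in range(rounds):
        pool = evolve(debate(pool)) + generate(topic, n=2)  # mix in fresh ideas
    return sorted(debate(pool), key=lambda h: h.score, reverse=True)[0]

print(co_scientist("hypothetical research topic").text)
```

The debate step is what substitutes for the fast, objective evaluator that AlphaEvolve enjoys: when ground truth is expensive, the system has to rely on models critiquing each other before any hypothesis reaches a human.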
LLMs based on next-token prediction alone aren’t well suited to generating original insights, but combined with other tools they can be. In the future, labs will likely fine-tune the base models themselves to improve their ability to generate scientific hypotheses. Perhaps they’ll first need to understand how humans do it…