Discussion about this post

Marcus Seldon

I agree with your vent where you critique the "stochastic parrot" crowd; AI is clearly a big deal. I myself pay for Claude and use it almost daily. But I think you are overconfident in your projections of how rapid AI progress will be and how quickly jobs will be automated. "AI is a big deal" =/= "AI will replace all white collar work in 5 years". It's certainly possible that AI will replace most cognitive work in the near future, and I wouldn't be shocked if it did, but there are plenty of good reasons for skepticism too.

You talk about scaling laws, but we are running out of non-synthetic data to train models on. We have seen rapid progress in coding and math because it's easy to generate large amounts of synthetic data and verifiable problems, but it's not clear we will find ways to do this for other kinds of tasks. Compute can scale, but if training data ceases to scale then it's unclear if progress will continue at the same rate.

Claude Code is awesome and will require us to rethink how many white collar jobs function, but it's still far from replacing researchers or programmers. Look no further than the fact that Anthropic themselves have over 100 engineering job postings on their website. It will certainly accelerate and augment human engineers and researchers, but it seems unlikely a generalist could get Claude Code to create publishable academic papers at a high rate of reliability.

"But they keep getting better at a rapid rate". Again, though, we're running low on training data, and many of the recent improvements seem to be in introducing tooling and "harnesses" for the AIs rather than underlying model improvements. The gaps between where AI is now and a human researcher seem unlikely to be solved merely with tooling. It's things like high context knowledge, creativity, taste, continuous learning, episodic memory, more robust generalization, reliable intuitions about the physical world, and so on.

One last point I'll make: I feel like whenever progress in AI is made, people freak out and overestimate how powerful the AIs are, but over time, as they use them more and more, they see more of the flaws and limitations. It seems like we're at the beginning of another one of those cycles with coding agents.

Remember when GPT-4 first came out, and people were saying it would immediately replace countless jobs? That didn't happen, because it turned out to be more limited than it first appeared. I remember similar things when Deep Research capabilities came out. At first they seemed miraculous, but now they seem more like a tool a researcher uses rather than a replacement. I've found that Deep Research tools have lots of limitations and are just a supplement for actually searching for and reading things myself. Don't get me wrong, they're incredibly useful, but not a replacement for humans. And I'm just an amateur researching things for fun.

Terragrafia

My job is in science, basically earth and ecosystem science with satellite and other data. Reading some of the recent stuff about how quickly agentic AI is advancing, I thought, ok, let me try just telling it a project and see if it can develop the code to do the analysis. How much input from me will it need?

The answer is: yes, it can write code related to the project, but with huge caveats that necessitate having an expert spend a lot of time overseeing it.

So for example, it generated a script for an analysis that I prompted it to do. But then the first issue: there was a bug in the code, and when I gave it the error output, it latched on to one theory of what was causing the error and went down a path of rewriting everything according to this idea, which ended up being wrong. Not only that, but after being corrected, it still kept fixing future bugs according to that old theory, even after being reminded several times that it was wrong. Without me catching this, the code would have ended up with a bunch of pointless processes in it.

Second, it chose technically correct datasets, but when I really looked at the data I realized it wasn’t the right data to answer the question. It was one processing level too high and had been processed using assumptions that would have essentially messed up the analysis. We needed a different dataset, but without being able to look at the data visually, the model wouldn’t have caught that.

Next, there was a factor I would need to correct for, or else the analysis would be pretty useless. The AI didn’t catch this on its own, so I had to suggest doing the correction. Ok, it accepts the need to do so, but of course the data used to do the correction has its own biases that may mess up the analysis. So I ask the AI: what are some approaches?

I’ll get specific here and say the factor we’re trying to correct for is the seasonal change in leaf biomass in tropical forests. The AI suggests, well, since tropical forests don’t change their biomass much seasonally, we can just go without it.

I know that’s wrong. I tell it: no, go do research and tell me how much tropical forest canopy biomass varies seasonally. It comes back: oh, canopy biomass varies by as much as 25% over the seasons, so the seasonal correction is ESSENTIAL to do. (The all caps was the model’s.) It wouldn’t have caught this without me knowing that its answers were incorrect. It took my guidance to go from “we can go without doing this because x” to “it’s ESSENTIAL to do this”.
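(For illustration only, here’s a toy sketch of what a multiplicative seasonal adjustment could look like in Python. It’s not my actual pipeline; the sinusoidal shape, the peak month, and the ~25% swing are made-up assumptions, not values from our data.)

```python
import numpy as np

# Toy sketch only: normalize canopy-biomass estimates from different months
# to a common (annual-mean) reference. The ~25% peak-to-trough seasonal
# swing, the sinusoidal shape, and the peak month are illustrative
# assumptions, not values taken from any particular dataset.

def seasonal_factor(month, amplitude=0.25, peak_month=9):
    """Canopy biomass relative to the annual mean for a month in 1-12."""
    phase = 2.0 * np.pi * (month - peak_month) / 12.0
    return 1.0 + (amplitude / 2.0) * np.cos(phase)

def correct_to_annual_mean(biomass_obs, month):
    """Divide out the seasonal swing so observations are comparable."""
    return biomass_obs / seasonal_factor(month)

# Hypothetical 310 t/ha observation taken in the peak month:
print(correct_to_annual_mean(310.0, month=9))  # ~275.6 t/ha after correction
```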

Continuing on, it got stuck in a loop of improperly aligning pixels: proposing one fix, the fix being wrong, proposing another fix, that fix being wrong, and all the while the interpretations of what’s wrong and the proposed solutions getting more and more complicated. (A loop that LLMs seem quite prone to.) Several times I had to tell it: no! We’re getting too complicated. Simplify this down to the most basic solution. But even after saying this, it still kept elements of the failed approaches in the code, which started to cause problems.
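(For illustration only: one standard, simple way to avoid this kind of alignment loop is to warp every dataset onto a single reference grid up front, e.g. with rasterio, rather than patching misaligned pixels downstream. A rough sketch with made-up file names; not our actual code.)

```python
import numpy as np
import rasterio
from rasterio.warp import reproject, Resampling

# Toy sketch: align a dataset to one reference grid before any analysis.
# File names are placeholders; the resampling choice depends on the variable
# (bilinear for continuous fields, nearest for categorical data).
with rasterio.open("reference_grid.tif") as ref:
    dst_crs = ref.crs
    dst_transform = ref.transform
    dst_shape = (ref.height, ref.width)

with rasterio.open("other_dataset.tif") as src:
    aligned = np.full(dst_shape, np.nan, dtype="float32")
    reproject(
        source=rasterio.band(src, 1),
        destination=aligned,
        src_transform=src.transform,
        src_crs=src.crs,
        dst_transform=dst_transform,
        dst_crs=dst_crs,
        resampling=Resampling.bilinear,
    )

# `aligned` now shares the reference grid exactly, so later steps can assume
# pixel-for-pixel correspondence instead of accumulating ad hoc fixes.
```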

In the end my conclusion is: LLMs are good at doing basic and common things that are well represented in their training data. But they struggle when a task requires precise domain-level knowledge that is not well represented in the training data. They are also prone to overconfidence in their interpretations, for example of a bug, and can get stuck in a cycle of rising complexity that they can really struggle to step back from or escape, sometimes even with guidance.

Some of this like the precise domain knowledge challenge may sound specific to my own obscure science problem. But I think most applications also rely on some level of quite specific domain knowledge that won’t necessarily be well represented in LLM training data, if at all.

So can it do my job while I go sip coffee? No. It needs an expert sitting there checking every step and every assumption.

In the end it might save some work for me (and it really has been amazing for me in other ways), but the need to check its work and assumptions and correct its misunderstandings in many cases ends up taking just as long as writing the script myself would have.
