Okay, whenever I see an article saying "LLMs are bad at X task", I always comment something to the tune of "not surprised, those clankers are shit at everything", but for this one I genuinely am surprised, because if there has been one thing LLMs have seemed consistently good at, it is generating summaries of passages of text. They were never perfect at it, but they seemed to actually be a suitable tool for the job. Apparently not.
As the post says:
Which makes sense, as they're statistical text prediction engines with no notion of what is or isn't important in a text, so unlike humans they don't treat different parts differently depending on how important those parts are in that domain.
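To make that concrete, here's a rough sketch (plain Python, with made-up probabilities) of the standard next-token training objective: the loss is just the average cross-entropy over all token positions, weighted equally, so botching a critical figure counts exactly the same as botching a filler word.

```python
import math

# Toy illustration: the standard language-model objective averages
# cross-entropy over every token position with equal weight.
# The probabilities below are invented for illustration.

# (token, model's predicted probability for the correct next token)
predictions = [
    ("The",     0.95),  # filler
    ("melting", 0.90),  # filler
    ("point",   0.92),  # filler
    ("is",      0.97),  # filler
    ("1064",    0.40),  # the critical figure in a STEM text
    ("degrees", 0.93),  # filler
]

# Uniform average: nothing in the objective says "1064" matters more
# than "The" -- importance simply isn't part of the loss.
loss = sum(-math.log(p) for _, p in predictions) / len(predictions)
print(f"average cross-entropy: {loss:.3f}")
```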
In STEM fields, accuracy is paramount and there are things which simply cannot be dropped when summarizing a text, but LLMs are totally unable to guarantee that.
It's the same reason why code generated by LLMs almost never works without being reviewed and corrected: the LLM drops essential elements, so the code doesn't do what it should or won't even compile. At least with code, the compiler validates some of that text against a set of fixed rules at compile time, and the person reviewing it knows the intention for the code upfront - i.e. what it was supposed to do - and can use that as a guide for spotting problems in the generated output.
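As a sketch of that asymmetry, generated code can at least be mechanically checked against the language's fixed rules (here using Python's own py_compile as the "compiler"); there is no equivalent tool for checking whether a generated summary kept the important parts.

```python
import py_compile
import tempfile

# A compiler validates generated code against a fixed set of rules.
# A generated summary has no equivalent automatic check: no tool can
# tell whether "The paper proposes a new method." dropped the key result.
generated_code = "def area(r):\n    return 3.14159 * r ** 2\n"

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(generated_code)
    path = f.name

try:
    py_compile.compile(path, doraise=True)
    print("compiles: the fixed rules are satisfied (semantics still need a human)")
except py_compile.PyCompileError as err:
    print(f"compiler caught a problem: {err}")
```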
It's one thing to summarize a management meeting where most of what is said is vague waffle, with things often repeated and where nobody really expects precision and accuracy (sometimes quite the opposite), so a loss of precision is generally fine; it's a whole different thing to summarize a text where at least some parts must be precise and accurate.
Personally, I totally expect LLMs to fail miserably in areas requiring precision and accuracy, where a statistical text engine with a pretty much uniform error probability in terms of the gravity of the error (i.e. just as likely to make a critical mistake as a minor one) will, when summarizing, just as easily mangle or drop elements in critical areas requiring accuracy as in areas which are just padding.
An LLM can probably be trained to distinguish what humans regard as "important" using an evolutionary training strategy.
If that were the case, why hasn't it been done yet?
I see three problems there:
1. Adversarial training doesn't work here, because there is no NN that can recognize what a domain expert considers properly summarized data to train the generator with, so a human has to review the product of the LLM's training and feed the results back. You end up with LLMs trained at human speed instead of machine speed.
2. You don't really know how much more training is needed to push a model beyond its current level of "importance" encoding, and prospects aren't good: the improvement in output quality per amount of training input has fallen steeply over time, so the input-to-quality ratio is not linear but grows very quickly, and we're already at the steep part of the curve, needing tons more input data to yield small improvements.
3. You would need to train an LLM for each expert domain you want to support, because expert-level awareness of the importance of certain elements in one domain does not carry over to other domains. And even in the domain which seems to have attracted the most investment in domain-specific LLMs - Software Development - their capabilities are stuck at the level of a quite junior Junior Developer.
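To make the first problem concrete, here's a toy sketch of what such an evolutionary loop would look like. Everything here is a hypothetical placeholder (mutate, summarize, and especially human_rating); the point is that the fitness signal is a human judgment, so every generation advances at human speed, not machine speed.

```python
import random

def mutate(params):
    """Hypothetical: perturb candidate parameters to get a variant."""
    return [p + random.gauss(0, 0.1) for p in params]

def summarize(params, text):
    """Hypothetical: run the candidate model on the text."""
    return f"summary-by-{params[0]:.2f}"  # stand-in output

def human_rating(summary):
    """The bottleneck: a domain expert scores whether the summary
    kept what actually matters. There is no NN stand-in for this,
    so every fitness evaluation costs human time."""
    return random.random()  # placeholder for a slow human judgment

population = [[random.random() for _ in range(4)] for _ in range(8)]
document = "some STEM text where specific details must survive"

for generation in range(3):
    # Each candidate's fitness requires a human review of its output.
    scored = [(human_rating(summarize(p, document)), p) for p in population]
    scored.sort(key=lambda s: s[0], reverse=True)
    survivors = [p for _, p in scored[: len(scored) // 2]]
    population = survivors + [mutate(p) for p in survivors]
    print(f"generation {generation}: best human rating {scored[0][0]:.2f}")
```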
It's my understanding that this is one of the ways DeepSeek really shines: instead of having a general one-size-fits-all model and trying to turn LLMs into general-purpose AI, they use a multitude of smaller models that can be hot-swapped for different tasks in different contexts. The kind of summary you want for a news article is vastly different from the kind you want for an academic paper, and being able to recognize when to use different models for different use cases is very powerful.
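A toy sketch of that routing idea (this is not DeepSeek's actual architecture; the model names and the keyword heuristic are invented for illustration): classify the document type, then dispatch to a specialized summarizer instead of one general model.

```python
# Toy router: pick a specialized model per document type.
# Model names and the keyword heuristic are made up; a real system
# would use a learned classifier and actual model backends.

SPECIALISTS = {
    "news":     "news-summarizer-small",
    "academic": "paper-summarizer-precise",
    "code":     "code-summarizer",
}

def classify(text: str) -> str:
    if "abstract" in text.lower() or "et al." in text:
        return "academic"
    if "def " in text or "{" in text:
        return "code"
    return "news"

def summarize(text: str) -> str:
    model = SPECIALISTS[classify(text)]
    # placeholder for actually invoking the chosen model
    return f"[{model}] summary of {len(text)} chars"

print(summarize("Abstract: we present a new method, following Smith et al."))
print(summarize("Breaking: local council votes on new bike lanes."))
```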