Okay, whenever I see an article saying "LLMs are bad at X task", I always comment something to the tune of "not surprised, those clankers are shit at everything", but for this one I genuinely am surprised, because if there has been one thing LLMs have seemed consistently good at, it is generating summaries of passages of text. They were never perfect at it, but they seemed to actually be a suitable tool for the job. Apparently not.
As the post says:
Which makes sense, as they're statistical text prediction engines with no notion of what is or isn't important in a text, so unlike humans they don't treat different parts differently depending on how important those parts are in that domain.
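To make that concrete, here's a rough sketch (plain Python, with made-up probabilities) of the standard next-token training objective: the loss is just the average cross-entropy over all token positions, weighted equally, so botching a critical figure counts exactly the same as botching a filler word.

```python
import math

# Toy illustration: the standard language-model objective averages
# cross-entropy over every token position with equal weight.
# The probabilities below are invented for illustration.

# (token, model's predicted probability for the correct next token)
predictions = [
    ("The",     0.95),  # filler
    ("melting", 0.90),  # filler
    ("point",   0.92),  # filler
    ("is",      0.97),  # filler
    ("1064",    0.40),  # the critical figure in a STEM text
    ("degrees", 0.93),  # filler
]

# Uniform average: nothing in the objective says "1064" matters more
# than "The" -- importance simply isn't part of the loss.
loss = sum(-math.log(p) for _, p in predictions) / len(predictions)
print(f"average cross-entropy: {loss:.3f}")
```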
In STEM fields, accuracy is paramount and there are things which simply cannot be dropped when summarizing a text, but LLMs are totally unable to guarantee that.
It's the same reason why code generated by LLMs almost never works without being reviewed and corrected: the LLM drops essential elements, so the code doesn't do what it should or won't even compile. At least with code, the compiler validates some of that text against a set of fixed rules at compile time, and the person reviewing it knows the intention for the code upfront - i.e. what it was supposed to do - and can use that as a guide for spotting problems in the generated output.
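As a sketch of that asymmetry, generated code can at least be mechanically checked against the language's fixed rules (here using Python's own py_compile as the "compiler"); there is no equivalent tool for checking whether a generated summary kept the important parts.

```python
import py_compile
import tempfile

# A compiler validates generated code against a fixed set of rules.
# A generated summary has no equivalent automatic check: no tool can
# tell whether "The paper proposes a new method." dropped the key result.
generated_code = "def area(r):\n    return 3.14159 * r ** 2\n"

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(generated_code)
    path = f.name

try:
    py_compile.compile(path, doraise=True)
    print("compiles: the fixed rules are satisfied (semantics still need a human)")
except py_compile.PyCompileError as err:
    print(f"compiler caught a problem: {err}")
```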
It's one thing to summarize a management meeting where most of what is said is vague waffle, with things often repeated and where nobody really expects precision and accuracy (sometimes quite the opposite), so a loss of precision is generally fine; it's a whole different thing to summarize a text where at least some parts must be precise and accurate.
Personally, I totally expect LLMs to fail miserably in areas requiring precision and accuracy, where a statistical text engine with a pretty much uniform error probability in terms of the gravity of the error (i.e. just as likely to make a critical mistake as a minor one) will, when summarizing, just as easily mangle or drop elements in critical areas requiring accuracy as in areas which are just padding.
An LLM can probably be trained to distinguish what humans regard as "important" using an evolutionary training strategy.
If that were the case, why hasn't it been done yet?
I see three problems there:
1. Adversarial training doesn't work here, because there is no NN that can recognize what a domain expert considers properly summarized data to train the generator with, so a human has to review the product of the LLM's training and feed the results back. You end up with LLMs trained at human speed instead of machine speed.
2. You don't really know how much more training is needed to push a model beyond its current level of "importance" encoding, and prospects aren't good: the improvement in output quality per amount of training input has fallen steeply over time, so the input-to-quality ratio is not linear but grows very quickly, and we're already at the steep part of the curve, needing tons more input data to yield small improvements.
3. You would need to train an LLM for each expert domain you want to support, because expert-level awareness of the importance of certain elements in one domain does not carry over to other domains. And even in the domain which seems to have attracted the most investment in domain-specific LLMs - Software Development - their capabilities are stuck at the level of a quite junior Junior Developer.
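To make the first problem concrete, here's a toy sketch of what such an evolutionary loop would look like. Everything here is a hypothetical placeholder (mutate, summarize, and especially human_rating); the point is that the fitness signal is a human judgment, so every generation advances at human speed, not machine speed.

```python
import random

def mutate(params):
    """Hypothetical: perturb candidate parameters to get a variant."""
    return [p + random.gauss(0, 0.1) for p in params]

def summarize(params, text):
    """Hypothetical: run the candidate model on the text."""
    return f"summary-by-{params[0]:.2f}"  # stand-in output

def human_rating(summary):
    """The bottleneck: a domain expert scores whether the summary
    kept what actually matters. There is no NN stand-in for this,
    so every fitness evaluation costs human time."""
    return random.random()  # placeholder for a slow human judgment

population = [[random.random() for _ in range(4)] for _ in range(8)]
document = "some STEM text where specific details must survive"

for generation in range(3):
    # Each candidate's fitness requires a human review of its output.
    scored = [(human_rating(summarize(p, document)), p) for p in population]
    scored.sort(key=lambda s: s[0], reverse=True)
    survivors = [p for _, p in scored[: len(scored) // 2]]
    population = survivors + [mutate(p) for p in survivors]
    print(f"generation {generation}: best human rating {scored[0][0]:.2f}")
```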
It's my understanding that this is one of the ways DeepSeek really shines: instead of having a general one-size-fits-all model and trying to turn LLMs into general-purpose AI, they use a multitude of smaller models that can be hot-swapped for different tasks in different contexts. The kind of summary you want for a news article is vastly different from the kind you want for an academic paper, and being able to recognize when to use different models for different use cases is very powerful.
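A toy sketch of that routing idea (this is not DeepSeek's actual architecture; the model names and the keyword heuristic are invented for illustration): classify the document type, then dispatch to a specialized summarizer instead of one general model.

```python
# Toy router: pick a specialized model per document type.
# Model names and the keyword heuristic are made up; a real system
# would use a learned classifier and actual model backends.

SPECIALISTS = {
    "news":     "news-summarizer-small",
    "academic": "paper-summarizer-precise",
    "code":     "code-summarizer",
}

def classify(text: str) -> str:
    if "abstract" in text.lower() or "et al." in text:
        return "academic"
    if "def " in text or "{" in text:
        return "code"
    return "news"

def summarize(text: str) -> str:
    model = SPECIALISTS[classify(text)]
    # placeholder for actually invoking the chosen model
    return f"[{model}] summary of {len(text)} chars"

print(summarize("Abstract: we present a new method, following Smith et al."))
print(summarize("Breaking: local council votes on new bike lanes."))
```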