Surely you jest :]
"We did it, Patrick! We made a technological breakthrough!"
If only scientific papers had some kind of summary. You know, like something at the front of the paper that abstracts the entire thing down to less than a page.
To be fair, abstracts are extremely dense and not really meant to be read by anyone outside that particular field.
As someone who works in communications in a very science-heavy field: in fairness, journalists are also typically terrible at summarizing scientific papers.
Do you recall any examples of science journalists doing it well in your field? I've been increasingly interested in science journalism, and have been on the lookout for examples of it done right.
I don't, really. But my field is also kinda niche (it's not a popular field like genetics or infectious diseases), so there aren't many journalists covering us at all yet. I work in marketing for an industrial exosuit company (think practical, assistive, biomechanical wearables). Most of the journalists covering us at this point are used to covering news about forklifts or warehouse automation, so they aren't used to reading peer-reviewed scientific publications at all. Their exposure to papers on biomechanics and injury risk factors is rarer still, and those papers might as well be in Latin (well, sometimes they do have a lot of Latin).
But it's also something of a joke. When I was back taking journalism classes for my communications and marketing degree, the professors would joke about how journalists summarizing legal or scientific material would say that 1 + 2 = 5 all the time, leaving out important details that were critical to the conclusions because they weren't interesting. I think the scientists put up with it because as long as the conclusion is correct, they're just happy to have anyone paying attention.
This is super interesting insight, thanks for sharing.
" I think the scientists put up with it because as long as the conclusion is correct, they're just happy to have anyone paying attention."
I think you've hit the nail on the head with this
Okay, whenever I see an article that says that "LLMs are bad at doing X task", I always comment something to the tune of "not surprised, those clankers are shit at everything", but for this one, I genuinely am surprised, because if there has been one thing that LLMs have seemed to be consistently good at, it is generating summaries for passages of text. They were never perfect at that task, but they seemed to actually be the suitable tool for that task. Apparently not.
As the post says:
LLM “tended to sacrifice accuracy for simplicity” when writing news briefs.
Which makes sense, as they're a statistical text prediction engine and have no notion of what is important or not in a text, so, unlike humans, they don't treat different parts differently depending on how important they are in that domain.
In STEM fields accuracy is paramount and there are things which simply cannot be dropped from the text when summarizing it, but LLMs are totally unable to guarantee that.
It's the same reason why code generated by LLMs almost never works without being reviewed and corrected: the LLM will drop essential elements, so the code doesn't do what it should or won't even compile. But at least with code, the compiler validates some of the accuracy of that text against a set of fixed rules, whilst the person reviewing the code knows upfront the intention for that code - i.e. what it was supposed to do - and can use that as a guide for spotting problems in the generated code.
Summarizing a management meeting, where most of what's said is vague waffle, things are often repeated, and nobody really expects precision or accuracy (sometimes quite the opposite), is one thing - a loss of precision there is generally fine. Summarizing material where at least some parts must be precise and accurate is a whole different thing.
Personally, I totally expect LLMs to fail miserably in areas requiring precision and accuracy, where a statistical text engine with a pretty much uniform error probability in terms of the gravity of the error (i.e. just as likely to make a critical mistake as a minor one) will, when summarizing, mangle or drop elements in the critical areas requiring accuracy just as easily as it does in areas which are just padding.
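To make the "statistical text prediction engine" point concrete, here is a minimal sketch (my own illustration, not something from the article) of how a summary falls out of a causal language model: one most-probable token at a time, with nothing in the loop that knows which parts of the source are scientifically critical.

```python
# Minimal greedy-decoding sketch with an off-the-shelf causal LM.
# The prompt and token budget are arbitrary placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; any causal LM is driven the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Summarize the following paper:\n<paper text here>\nSummary:"
ids = tok(prompt, return_tensors="pt").input_ids

for _ in range(60):  # build the "summary" one token at a time
    with torch.no_grad():
        logits = model(ids).logits[0, -1]         # scores for the next token only
    next_id = torch.argmax(logits).reshape(1, 1)  # greedy pick: the single most probable token
    ids = torch.cat([ids, next_id], dim=1)        # no notion of "critical detail" vs "padding"

print(tok.decode(ids[0]))
```

Every token is chosen from the learned distribution over what text usually looks like; there is no separate signal marking which sentences of the source carry the load-bearing caveats.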
An LLM can probably be trained to distinguish what humans regard as "important" using an evolutionary training strategy.
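If someone did try that, one plausible shape for it - purely a sketch on my part, not an existing system; the vocabulary, example pairs and bag-of-words featurizer are made-up placeholders - would be to learn a reward model from pairwise human judgments about which summary kept the important details, RLHF-style:

```python
# Toy reward model trained on pairwise "which summary kept the critical details?"
# judgments. Everything here (vocabulary, data, featurizer) is an illustrative placeholder.
import torch
import torch.nn as nn

VOCAB = ["dose", "significant", "sample", "mice", "increase", "cool", "stuff", "maybe"]

def featurize(text: str) -> torch.Tensor:
    # Bag-of-words stand-in; a real system would use the LLM's own representations.
    words = text.lower().split()
    return torch.tensor([float(words.count(w)) for w in VOCAB])

# Hypothetical human preference pairs: (summary that kept the key detail, summary that dropped it).
pairs = [
    ("significant increase at the highest dose in mice",
     "the mice did some cool stuff maybe"),
    ("sample size was small but the increase was significant",
     "a cool study with stuff about mice"),
]

reward_model = nn.Sequential(nn.Linear(len(VOCAB), 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

for _ in range(200):
    loss = torch.tensor(0.0)
    for preferred, rejected in pairs:
        r_good = reward_model(featurize(preferred))
        r_bad = reward_model(featurize(rejected))
        # Pairwise (Bradley-Terry) loss: push the preferred summary's score above the rejected one's.
        loss = loss - torch.nn.functional.logsigmoid(r_good - r_bad).squeeze()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(reward_model(featurize("significant increase at the highest dose")).item())  # higher score
print(reward_model(featurize("cool stuff about mice maybe")).item())               # lower score
```

The catch, of course, is where those human judgments come from and what they cost - which is exactly what the replies below get into.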
If that was the case, why hasn't it been done yet?
I see three problems there:
- The first is the simplest to explain: the advantage of an LLM over a human is that it can distil "knowledge" (actually an N-dimensional space of probabilities of tokens appearing relative to other tokens, rather than what we would think of as knowledge) from millions or billions of pieces of input text in a tiny fraction of the time it would take a human. That breaks if you use an evolutionary training strategy: the LLM now needs to generate summaries, have a human validate them, and have the corrections fed back in some way, so its training speed drops to the speed of humans reviewing the texts, at which point the LLM isn't any faster (there's a rough back-of-envelope sketch of this below). You could theoretically throw lots of humans at it to make it go faster, but how many mycology researchers, for example, are you willing to pay, at the rates they will demand, to do this work? The strategy used so far is to train LLMs on freely available data, but how much freely available data is out there at the domain-expert level, encoding the importance of high-level concerns? From my experience in the handful of domains I'm an expert in: almost none. Have a go at trying to learn a domain from information on the Internet and you eventually reach the conclusion that most stuff out there is "gifted amateur" level, and the actual experts seldom take the time to explain their understanding of things at their level of expertise - at most, the few who do take time out of their work to share their knowledge tend to teach the domain at a far lower level than their own, because that's where most pupils are. When learning a domain from freely available information, even humans quickly hit a ceiling where more advanced knowledge simply isn't to be found in freely available sources, at which point it's down to learning from experience and a handful of input documents, neither of which LLMs can do.
- This links to the second problem: whilst LLMs do encode information about the importance of tokens in a given context, they do so at a low level of the information hierarchy (i.e. data structuring - for example, that in an English-language text certain words should be in certain places if certain other words are present in certain relative locations), and the higher we go up the information hierarchy (i.e. into domain expertise), the more that breaks down. Even the most specialized LLMs - those for Software Development - can't get right some Junior Developer concepts (like "thou shalt not authenticate with a clear-text username and password"), much less higher-experience-level things (say, which requirements in a Requirements Specification for a project are more important because of their broader implementation implications - something it would need to "know" to properly summarize such a document). Whilst we don't know how much training would actually be needed to push that encoding of importance up the knowledge hierarchy, all the indications from the last couple of years of improving LLMs by training them with more data are that the returns have been diminishing steeply (and, as I said, they're still stuck at the Junior expertise level), so proper domain-expert awareness of importance probably requires insane amounts of training and may be unachievable.
- Third is simply the same problem that humans face: hyperspecialization. For example, knowing all about the implications of certain Requirements for a technical solution in Software Development doesn't in any way, shape or form allow a person to know what's relevant in a mycology paper. This implies that LLMs would have to be trained specifically for each knowledge domain, same as humans, which, combined with the points I made above, means that even if it's possible, it's financially unfeasible to do it and get an actual profit from it.
In summary: you end up with LLMs trained at human speed instead of machine speed, because a human has to review the product of the LLM's training and feed corrections back (adversarial training doesn't work here, because you don't have an NN that can recognize what a domain expert considers properly summarized data, to train the generator against). You don't really know how much more training is needed to push an LLM beyond its current level of "importance" encoding, and the prospects aren't good: the improvement in output quality per amount of training input has fallen steeply over time, which means the ratio of input quantity to output quality isn't linear, and we're already on the steep part of the curve, needing tons more input data to yield small improvements. And last, you would need to train an LLM for each expert domain you want to support, because expert-level awareness of what's important in one domain does not work in other domains - and even in the domain that seems to have attracted the most investment in domain-specific LLMs, Software Development, their capabilities are stuck at the level of a quite junior Junior Developer.
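A back-of-envelope version of that first bottleneck - every number here is an invented placeholder, not data - shows why expert review speed, rather than compute, ends up setting the pace:

```python
# Invented placeholder figures, purely to show the shape of the problem:
# once every training example needs an expert reviewer, throughput is capped
# by the humans, not by the hardware.
papers_per_reviewer_per_day = 20      # assumed expert review rate
reviewers = 50                        # assumed pool of domain experts
examples_needed = 1_000_000           # assumed number of reviewed summaries required

days = examples_needed / (papers_per_reviewer_per_day * reviewers)
print(f"{days:,.0f} days of labelling for this team")  # 1,000 days - roughly three years
```

Change the assumptions however you like; the training loop still runs at the reviewers' pace.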
It's my understanding that this is one of the ways DeepSeek really shines - instead of having a general one-size-fits-all model and trying to make LLMs into a general-purpose AI, they use a multitude of smaller models that can be hot-swapped for different tasks in different contexts. The kind of summary you want for a news article is vastly different from the kind of summary you want for an academic paper, and being able to recognize when to use different models for different use cases is very powerful.
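As a toy illustration of that routing idea - this is only a sketch of task-based model selection in general, with hypothetical model names, not a claim about DeepSeek's actual internals:

```python
# Hypothetical router: pick a specialized summarizer per document type instead of
# sending everything to one general model. Names and the selection rule are made up.
from dataclasses import dataclass

@dataclass
class SummaryRequest:
    text: str
    doc_type: str  # e.g. "news", "academic", "meeting_notes"

MODEL_REGISTRY = {
    "news": "summarizer-news-small",            # brevity over completeness is fine here
    "academic": "summarizer-scientific-large",  # must preserve caveats, doses, statistics
    "meeting_notes": "summarizer-generic-small",
}

def pick_model(request: SummaryRequest) -> str:
    """Return the model to use for this request, falling back to a generic summarizer."""
    return MODEL_REGISTRY.get(request.doc_type, "summarizer-generic-small")

print(pick_model(SummaryRequest("Results: p < 0.05 at the highest dose...", "academic")))
# -> summarizer-scientific-large
```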
Hydrologists find water is wet
Shocker: LLMs are only built to predict the next most likely word in a sentence and nothing else. Glad we test all the things they're not good for individually, though.
ChatGPT summary: ChatGPT is really good at summarizing scientific papers and these scientists are judgmental meanies.
Four words too many in that title.
Paper says robots could never take my job and I am totally not an old man yelling at a cloud.
Summaries and searches with imprecise terms are actually my main use case for AI at work. My main use case overall, however, is having it generate custom short stories for my 4-year-old about our two cats, a third imaginary one, and a princess having adventures and learning important lessons. That is kind of it.