Technology

75338 readers

3334 users here now

This is a most excellent place for technology news and articles.

Our Rules

Follow the lemmy.world rules.
Only tech related news or articles.
Be excellent to each other!
Mod approved content bots can post up to 10 articles per day.
Threads asking for personal tech support may be deleted.
Politics threads may be removed.
No memes allowed as posts, OK to post as comments.
Only approved bots from the list below, this includes using AI responses and summaries. To ask if your bot can be added please contact a mod.
Check for duplicates before posting, duplicates may be removed
Accounts 7 days and younger will have their posts automatically removed.

Approved Bots

founded 2 years ago

MODERATORS

L3s@lemmy.world

enu@lemmy.world

technopagan@lemmy.world

L4s@lemmy.world

L3s@hackingne.ws

L4s@hackingne.ws

785

Elon Musk wants to rewrite "the entire corpus of human knowledge" with Grok (programming.dev)

submitted 3 months ago by Pro@programming.dev to c/technology@lemmy.world

196 comments fedilink hide all child comments

We will use Grok 3.5 (maybe we should call it 4), which has advanced reasoning, to rewrite the entire corpus of human knowledge, adding missing information and deleting errors.

Then retrain on that.

Far too much garbage in any foundation model trained on uncorrected data.

Source.

More Context

Source.

you are viewing a single comment's thread
view the rest of the comments

[–] Elgenzay@lemmy.ml 21 points 3 months ago (4 children)

Aren't you not supposed to train LLMs on LLM-generated content?

Also he should call it Grok 5; so powerful that it skips over 4. That would be very characteristic of him

[–] Voroxpete@sh.itjust.works 19 points 3 months ago* (last edited 3 months ago) (1 children)

There are, as I understand it, ways that you can train on AI generated material without inviting model collapse, but that's more to do with distilling the output of a model. What Musk is describing is absolutely wholesale confabulation being fed back into the next generation of their model, which would be very bad. It's also a total pipe dream. Getting an AI to rewrite something like the total training data set to your exact requirements, and verifying that it had done so satisfactorily would be an absolutely monumental undertaking. The compute time alone would be staggering and the human labour (to check the output) many times higher than that.

But the whiny little piss baby is mad that his own AI keeps fact checking him, and his engineers have already explained that coding it to lie doesn't really work because the training data tends to outweigh the initial prompt, so this is the best theory he can come up with for how he can "fix" his AI expressing reality's well known liberal bias.

[–] Deflated0ne@lemmy.world 3 points 3 months ago

Model collapse is the ideal.

[–] hansolo@lemmy.today 7 points 3 months ago* (last edited 3 months ago)

Musk probably heard about "synthetic data" training, which is where you use machine learning to create thousands of things that are typical-enough to be good training data. Microsoft uses it to take documents users upload to Office365, train the ML model, and then use that ML output to train an LLM so they can technically say "no, your data wasn't used to train an LLM." Because it trained the thing that trained the LLM.

However, you can't do that with LLM output and stuff like... History. WTF evidence and documents are the basis for the crap he wants to add? The hallucinations will just compound because who's going to cross-check this other than Grok anyway?

[–] brucethemoose@lemmy.world 5 points 3 months ago* (last edited 3 months ago)

There’s some nuance.

Using LLMs to augment data, especially for fine tuning (not training the base model), is a sound method. The Deepseek paper using, for instance, generated reasoning traces is famous for it.

Another is using LLMs to generate logprobs of text, and train not just on the text itself but on the *probability a frontier LLM sees in every ‘word.’ This is called distillation, though there’s some variation and complication. This is also great because it’s more power/time efficient. Look up Arcee models and their distillation training kit for more on this, and code to see how it works.

There are some papers on “self play” that can indeed help LLMs.

But yes, the “dumb” way, aka putting data into a text box and asking an LLM to correct it, is dumb and dumber, because:

You introduce some combination of sampling errors and repetition/overused word issues, depending on the sampling settings. There’s no way around this with old autoregressive LLMs.
You possibly pollute your dataset with “filler”
In Musk's specific proposition, it doesn’t even fill knowledge gaps the old Grok has.

In other words, Musk has no idea WTF he’s talking about. It’s the most boomer, AI Bro, not techy ChatGPT user thing he could propose.

[–] the_q@lemmy.zip 4 points 3 months ago (1 children)

Watch the documentary "Multiplicity".

[–] dzsimbo@lemm.ee 3 points 3 months ago

I rented that multiple times when it came out!