this post was submitted on 29 Aug 2025
443 points (99.8% liked)


"The surveillance, theft and death machine recommends more surveillance to balance out the death."

[–] Techlos@lemmy.dbzer0.com 7 points 1 month ago (1 children)

i'm running 2x 3060 (12GB cards; VRAM usually matters more than clock speed), and if you have the patience to run them, the 30B Qwen models are honestly pretty decent. If you have the ability and data to fine-tune them or train a LoRA for the task you're doing, they can sometimes exceed zero-shot performance from SOTA subscription models.

the real performance gains come from agentifying the language model. With access to wikipedia, arxiv, and a rolling embedding database of prior conversations, the quality of the output shoots way up.

[–] chicken@lemmy.dbzer0.com 1 points 1 month ago (1 children)

agentifying the language model

Any recommendations on setups for this?

[–] Techlos@lemmy.dbzer0.com 6 points 1 month ago* (last edited 1 month ago) (1 children)

this is a good guide on a standard approach to agentic tool use for LLMs

For conversation history:

this one is a bit more arbitrary, because it seems whenever someone finds a good history model they found a startup around it, so i had to make it up as i went along.

First, BERT is used to create an embedding for each input/reply pair, and a timestamp is attached to give it some temporal info.
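A minimal sketch of that first step, assuming a record with the pair text, a timestamp, and one embedding per pair. The names `HistoryEntry`, `make_entry`, and `toy_embed` are made up for illustration, and `toy_embed` is only a hashed bag-of-words stand-in so the example runs without a model; a real setup would pass a BERT-based embedding function as `embed` instead.

```python
import hashlib
import math
import time
from dataclasses import dataclass

DIM = 64  # toy dimension, chosen arbitrarily for this sketch


def toy_embed(text, dim=DIM):
    """Stand-in for a BERT embedding: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class HistoryEntry:
    user_input: str
    reply: str
    timestamp: float  # seconds since the epoch, gives the entry temporal info
    embedding: list


def make_entry(user_input, reply, embed=toy_embed, now=None):
    """Embed one input/reply pair as a single text and attach a timestamp."""
    ts = time.time() if now is None else now
    return HistoryEntry(user_input, reply, ts, embed(f"{user_input}\n{reply}"))
```

Passing `now` explicitly is only there to make the record reproducible in tests; in normal use it would be left out.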

A buffer keeps the last 10 or so input/reply pairs beyond a token limit, and periodically summarises the short-term buffer into paragraph summaries that are then embedded; i used an overlap of 2 pairs between each summary. An additional buffer does the same thing with the summaries themselves, creating megasummaries. These get put into a hierarchical database where each summary links to whatever it was generated from. Depending on how patient i'm feeling, i can configure the queries to use either the megasummary embedding or the summary embedding for searching the database.
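The two-level buffering could look roughly like the sketch below. It assumes a summary is triggered whenever the buffer reaches a fixed item count (the exact trigger is not spelled out above), and `placeholder_summarise` stands in for the LLM summarisation call. `RollingSummariser`, `Summary`, and the megasummary window of 4 are hypothetical choices for the example.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    text: str
    sources: list  # the pairs (or lower-level summaries) this was built from


def placeholder_summarise(items):
    """Stand-in for an LLM summarisation call; just joins truncated text."""
    return " | ".join(str(item)[:40] for item in items)


class RollingSummariser:
    def __init__(self, window=10, overlap=2, summarise=placeholder_summarise):
        self.window = window
        self.overlap = overlap
        self.summarise = summarise
        self.buffer = []
        self.summaries = []

    def add(self, item):
        """Buffer an item; emit a Summary once the buffer is full."""
        self.buffer.append(item)
        if len(self.buffer) < self.window:
            return None
        summary = Summary(self.summarise(self.buffer), list(self.buffer))
        self.summaries.append(summary)
        # Carry the last `overlap` items into the next summary window.
        self.buffer = self.buffer[len(self.buffer) - self.overlap:]
        return summary


pairs = RollingSummariser(window=10, overlap=2)
megas = RollingSummariser(window=4, overlap=0)  # summaries of summaries
for i in range(40):
    finished = pairs.add((f"input {i}", f"reply {i}"))
    if finished is not None:
        megas.add(finished)
```

Each `Summary` keeps its `sources`, so a hit on a megasummary can be followed down to the summaries and then to the original pairs.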

During conversation, the input as well as the last input/reply get fed into a different instruction prompt that says "Create three short sentences that accurately describe the topic, content, and if mentioned the date and time that are present in the current conversation". The sentences, along with the input + last input/reply, are embedded and used to find the most relevant items in the chat history based on a weighted metric of 0.9*(1-cosine_distance(input+reply embedding, history embedding)) + 0.1*(input.norm(2)-reply.norm(2))^2 (the magnitude term worked well when i was doing unsupervised learning, completely arbitrary here). The matches get added to the system prompt with some instructions on how to treat the included text as prior conversation history with timestamps. I found two approaches that worked well.

For lightweight, predictable behaviour, take the average of the embeddings and randomly select N entries from the database, weighted by how closely they match, using softmax + temperature to control memory exploration.
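A rough sketch of this first approach, combining the weighted metric with softmax sampling. Two assumptions are baked in: the magnitude term is read as the squared difference between the query norm and the history norm (the description leaves this open), and entries are drawn without replacement. All function names are made up for the example.

```python
import math
import random


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine_distance(a, b):
    denom = l2_norm(a) * l2_norm(b)
    if denom == 0.0:
        return 1.0  # treat a zero vector as unrelated
    return 1.0 - sum(x * y for x, y in zip(a, b)) / denom


def relevance(query, history, w_cos=0.9, w_mag=0.1):
    """0.9 * cosine similarity + 0.1 * squared norm difference (assumed reading)."""
    similarity = 1.0 - cosine_distance(query, history)
    magnitude = (l2_norm(query) - l2_norm(history)) ** 2
    return w_cos * similarity + w_mag * magnitude


def mean_embedding(embeddings):
    return [sum(col) / len(embeddings) for col in zip(*embeddings)]


def softmax(scores, temperature=1.0):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_memories(query_embeddings, history_embeddings, n=4,
                    temperature=0.1, rng=None):
    """Pick up to n distinct history indices, favouring high relevance."""
    if rng is None:
        rng = random.Random()
    query = mean_embedding(query_embeddings)
    scores = [relevance(query, h) for h in history_embeddings]
    probs = softmax(scores, temperature)
    remaining = list(range(len(scores)))
    chosen = []
    while remaining and len(chosen) < n:
        weights = [probs[i] for i in remaining]
        if sum(weights) <= 0.0:
            # Everything left underflowed to zero; fall back to the best score.
            pick = max(remaining, key=lambda i: scores[i])
        else:
            pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen
```

A low temperature makes the draw close to a plain top-N lookup; raising it lets less similar entries through, which is the "memory exploration" knob.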

For more varied and exploratory recall, use N randomly weighted averages of the embeddings and find the closest matches for each. It's a bit slower because you have way more database hits, but it tends to perform much better at tying together relevant information from otherwise unrelated conversations. I prefer the first method for doing research or literature review; the second method is great when you're rubber ducking an idea with the LLM.
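The exploratory variant might be sketched like this. It assumes each probe is a random convex combination of the query embeddings and, to keep the example short, ranks by plain cosine similarity and keeps only the single closest entry per probe. `exploratory_recall` and its helpers are hypothetical names.

```python
import math
import random


def cosine_similarity(a, b):
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0


def random_weighted_average(embeddings, rng):
    """Blend the embeddings with random non-negative weights that sum to 1."""
    weights = [rng.random() for _ in embeddings]
    total = sum(weights) or 1.0
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) / total
            for d in range(dim)]


def exploratory_recall(query_embeddings, history_embeddings, n=4, rng=None):
    """Run n randomly blended probes; return distinct best-match indices."""
    if rng is None:
        rng = random.Random()
    hits = []
    for _ in range(n):
        probe = random_weighted_average(query_embeddings, rng)
        best = max(range(len(history_embeddings)),
                   key=lambda i: cosine_similarity(probe, history_embeddings[i]))
        if best not in hits:
            hits.append(best)
    return hits
```

Each probe costs one database search, which is where the extra latency comes from compared with a single averaged query.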

An N of 3~5 works pretty well without taking up too much of the context window. Including the most relevant summary as well as the input/reply pairs gives the best general behaviour and strikes a good balance between broad recall and detail recall. The stratified summary approach also lets me prune input/reply entries if they don't get accessed much (whenever the db goes above a size limit, a script prunes a few dozen entries based on retrieval counts), while leaving the summary to retain some of the information.
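The pruning step could be as simple as the sketch below, assuming the pair store is a dict keyed by id and every entry tracks how often it was retrieved. `StoredPair`, `prune_least_used`, and the batch size of 24 are illustrative guesses at "a few dozen"; summaries are assumed to live in a separate store that is never touched here.

```python
from dataclasses import dataclass


@dataclass
class StoredPair:
    pair_id: int
    text: str
    retrievals: int = 0  # bumped every time the entry is recalled


def prune_least_used(db, max_size, batch=24):
    """Drop the `batch` least-retrieved pairs once the db exceeds max_size."""
    if len(db) <= max_size:
        return []
    victims = sorted(db.values(), key=lambda entry: entry.retrievals)[:batch]
    for entry in victims:
        del db[entry.pair_id]
    return [entry.pair_id for entry in victims]
```

Because only the raw pairs are removed, the summary that covered them still carries a compressed version of what was said.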

It works better if you don't use the full context window that the model is capable of. Transformer models just aren't that great at needle in a haystack problems, and i've found a context window of 8~10k is the sweet spot for the model both paying attention to the conversation recall as well as the current conversation for the qwen models.
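One way to hold the prompt to that kind of budget, sketched under the assumption that recalled memory is always kept and the current conversation is trimmed from the oldest turn first. The whitespace word count is a crude stand-in for a real tokenizer, and `fit_to_budget` is a made-up helper.

```python
def fit_to_budget(recalled, recent_turns, budget_tokens=8000,
                  count_tokens=lambda text: len(text.split())):
    """Keep all recalled memory, then as many of the newest turns as fit."""
    used = sum(count_tokens(text) for text in recalled)
    kept = []
    for turn in reversed(recent_turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget_tokens:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return recalled + kept
```

Swapping `count_tokens` for the model's own tokenizer would make the budget accurate rather than approximate.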

A ¿fun? side effect of this method is that using different system prompts for LLM behaviour will affect how the LLM summarises and recalls information, and it's actually quite hard to keep the LLM from developing a "personality", for lack of a better word. My first implementation included the phrase "a vulcan-like attention to formal logic", and within a few days of testing each conversation summary started with "Ship's log", and the model developed a habit of calling me captain, probably a side effect of summaries ending up in the system prompt. It was pretty funny, but not exactly what i was aiming for.

apologies if this was a bit rambling, just played a huge set last night and i'm on a molly comedown.

[–] chicken@lemmy.dbzer0.com 1 points 1 month ago

Nah that's some great info, though sounds like a pretty big project to set it up that way. Would definitely want to hear about it if you ever decide to put your work on a git repo somewhere.