submitted on 04 Nov 2023 (edited) by rufus@discuss.tchncs.de to c/localllama@sh.itjust.works

"This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing."

This means a major speed increase for people like me who rely on (slow) CPU inference or on big models. Consider a chatbot scenario with a long chat, where old lines of dialogue have to be evicted from the context to stay within the (4096-token) context size. Previously, the context had to be re-computed starting from the first changed/now-missing token. This feature detects that case, deletes the affected tokens from the KV cache, and shifts the subsequent tokens within the cache so it can be re-used, avoiding a computationally expensive re-calculation.
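
To make the mechanics concrete, here is a toy sketch in Python (my own illustration, not the actual llama.cpp/KoboldCPP code): evicted entries are dropped and the positions of all later entries are decremented, instead of recomputing them. In the real implementation the shifted keys also have to be re-rotated (RoPE) by the position delta; this sketch only tracks positions.

```python
from dataclasses import dataclass

# Toy model of a KV cache: one entry per processed token.
# (Hypothetical structure; the actual keys/values are omitted.)
@dataclass
class CacheEntry:
    pos: int    # position the entry was computed at
    token: int  # token id

def shift_cache(cache, start, end):
    """Evict entries with start <= pos < end, then shift the rest left.

    The real implementation must also re-rotate the shifted keys (RoPE)
    by the position delta; here we only adjust the stored positions.
    """
    delta = end - start
    kept = [e for e in cache if not (start <= e.pos < end)]
    for e in kept:
        if e.pos >= end:
            e.pos -= delta  # shifted and re-used instead of recomputed
    return kept

# A full 4096-token chat: drop tokens 128..1127 (old dialogue lines
# right after a 128-token system prompt) to make room for new ones.
cache = [CacheEntry(pos=i, token=i) for i in range(4096)]
cache = shift_cache(cache, start=128, end=1128)
assert len(cache) == 3096
assert cache[128].pos == 128  # was position 1128; shifted, not recomputed
```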

This is probably related to recent advancements like Streaming-LLM.

This won't help once text gets inserted "in the middle" or the prompt is changed in some other way. But I managed to connect KoboldCPP as a backend for SillyTavern/Oobabooga, and now I'm able to have unlimited-length conversations without waiting excessively once the chat history hits max tokens and the frontend starts dropping old text.
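
Roughly speaking, the shift only helps when the new context is the old context with one contiguous block of tokens removed. A toy check of my own (not the actual detection logic in either project) shows why a mid-prompt edit forces reprocessing instead:

```python
def removed_block(old, new):
    """If `new` equals `old` with one contiguous block deleted, return
    that block's (start, end) in `old`; otherwise return None.
    Only in the first case can the KV cache be shifted and re-used."""
    if len(new) >= len(old):
        return None
    p = 0  # longest shared prefix
    while p < len(new) and old[p] == new[p]:
        p += 1
    s = 0  # longest shared suffix (not overlapping the prefix)
    while s < len(new) - p and old[-1 - s] == new[-1 - s]:
        s += 1
    return (p, len(old) - s) if p + s == len(new) else None

old     = list(range(10))
trimmed = old[:2] + old[6:]         # old dialogue lines evicted
edited  = old[:3] + [99] + old[6:]  # text changed "in the middle"

print(removed_block(old, trimmed))  # (2, 6): shift and re-use the cache
print(removed_block(old, edited))   # None: reprocess from first mismatch
```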

It's just a clever way to re-use the KV cache in one specific case. But I've wished for this for quite some time.

top 2 comments
breakingcups@lemmy.world 3 points 7 months ago

Cool stuff! Smarter than smart contexts.

rufus@discuss.tchncs.de 2 points 7 months ago (edited)

I wasn't able to get good use out of the old 'SmartContext' anyway, and it seems other people had the same problem. To me, this is a huge improvement. And it doesn't even need extra memory or anything.

I really like how the KoboldCPP dev(s) and the llama.cpp community constantly implement all the crazy stuff.
