"Wowie look at my tool call format": this sucks and is getting me nowhere.
always on HF (huggingface), always pulling the latest llama-cpp to test, quantize and run the latest smol text predictors and image segmenters.
my brain is mush, my mood on the subject is low and my interest in scaffolding and harnessing is decreasing at a rapid pace.
real agents need real models: large ones. deepseek V4 flash, minimax M3, Kimi.
i dont want to pay for those, that defeats the "fun playing" part of it and makes me think of money (yes, I know its cheap, I'd be paying like 2โฌ a month, but still!)
if i make a godot agent, nobody cares about it, its "efficiency", how its animated, how easy and fun it is to set up and how fluid the text parsing and tool call previews are. there wont be a youtuber person saying "ooh thats smooth.... how are other agents so much worse?" about it.
only vibe-coding claude-suckling losers are gonna use it, complain about no free tier and leave.
the only one who cares is me, and i should spend that effort elsewhere and stop caring about this stupid hypie bs.
large-scale stuff is fun, whatever, but its out of my interest pool, so i dont care.
smol local models suck, which is where i live, so:
my experience sucks as much as the local models do.
why is it so difficult to distance myself from this stuff?
is it
- fear?
- actual interest?
- weird investment thoughts?
- the idea of making something good with it?
- the wish of local LMs to become viable? (yes it is)
its not like im profiting in any way, nobody cares about what i do and i also stop caring.
but what do i follow instead?
... watch.... gameplay videos? naw
... work on Bloom. yea fair. whatever I do, I should always be working on Bloom instead. (its a game about some butterfly character exploring a 3D mountain in the Austrian alps and-- collecting nectar from flowers and such. I still only have concept art.)
im sitting alone on the balcony, while peeps are celebrating in the hall and I dont want to celebrate with them.
they are partially celebrating our replacement.
this is the softest "overthrow the world" scenario ive heard of so far.
its not all overblown. its not all bs. its not all cash-grabby CEO sloppery. its not all useless chatbots in places you dont want them. its not all "optimize the human away" agendas. its not all just there to be there.
but oh boy ๐ฆ it sure seems so from the outside.
i believe that language models in particular have actual good uses.
they can simplify computer navigation, help or come up with scientific / mathmatical breakthroughs and they can help with understanding subjects on a deeper level, without having to pay a personal tutor or have smart friends (which would obviously be preferred).
is that worth training on the entire internet? im not sure.
if chatgpt ++ pro max and claude opus mythos plus times 10 xhigh 3x speed were free to use and run locally, maybe we (or I at least) would think differently about this. because.... let's be real: ai is not "eating all of our water". most data centers use basic fans or closed-loop water cooling. just like any water-cooled PC at home does and the main environmental problems stem from the power consumption, which is an energy-source problem and less of an LM-specific problem.
if THIS was whats possible with local language models: that would be great!
no more man-in-the-middle company, no bad feels about continuous compute waste, no more feeling of being sold back all of digital humanities labour with no compensation.
but thats not where we are. these capabilities are behind 200โฌ/month paywalls, proprietary models and... awfully excited CEOs selling their text predictors as the second coming of jesus Christ (our all lord and saviour), god (whos less interesting) and mosis (I think he can split oceans. pretty cool).
i need a permanent vacation from local language models.
if anyone reads this, give us your thoughts! or give them to me at least, if no one else is around, I will read them.
this was human-written in case anyone thought otherwise.
I saw your post! Fair enough if you're burning out on it. I'm about two months into my own LLM experience and while I haven't burned out yet, I can definitely feel the fatigue from how much effort I've poured into trying to make local models viable.
I have actually gotten a fair bit of utility out for the effort I've put in. Actually good OCR without the ton of preprocessing I used to need to do for Tesseract. "Good enough" dictionary look-up, and "good enough" language translation for individual words or small bits of text. Brainstorming assistance. UI prototyping. (They're terrible at writing maintainable code, but getting something "good enough" for quick hands-on proof-of-concept checks has been nice.) "Catch my stupid mistake" code debugging. Structured data extraction from natural language for some tedious recurring work tasks I need to do. Plus a lot of just silly things for my own amusement.
But yeah... Dealing with the rough edges in the ecosystem esp. churn in llama.cpp result in old advice leading me down rabbit holes of wrong paths for hours/days, and crappy/no documentation for things like the proper User/Agent/Tool Call/Tool Response interaction sequence has been draining lately. I think I've burned two weekends at this point trying to just make my custom harness stop reliably when I need to interrupt it without taking the whole program down and bringing it back up. (I have something that mostly works at this point, but it's got enough annoying issues that I'm still not satisfied...)
the "interrupting it, send a message and continue" problem is generally pretty easy to implement, at least in code (ive done it before).
are you vibecoding your thing or actually stuck on something? or is it more of a UI problem rather than a logical messages array problem?
I've been using ollama for most of my experiments so far (slowly switching over to using llama.cpp directly now). My UI is in the browser and I have a custom Tornado server in the middle (which handles local file access, etc. for tool calls). The UI talks to my server via a websocket, and the server talks to ollama via httpx. I wrote both parts using async -- which seemed like a natural choice at the time, but was not a good one in retrospect...
Detangling which parts can block each other in my code from other issues like misunderstanding the sequencing of tool calls and assistant messages (e.g. LLMs can send assistant messages back before making a tool call -- not just during thinking -- and I also implemented tool call execution immediately upon receiving the call... which was a mistake) and running head-first into unexpected limitations of ollama/llama.cpp like the fact that I can't actually run two instances of the same MoE model concurrently in ollama (but can run multiple other models concurrently like Qwen3.6 + Gemma4 or two versions of Qwen that have separate modelfiles). I wanted some tool calls to be able to use other LLMs with different configuration... (It also seems like ollama just completely ignores temperature settings from HTTP requests... which sent me on a wild goose chase for a long time.) Add into all that mess the fact that I want to store the entire structure of the conversation history and be able to both (1) refresh the client page and reload the conversation from the server, and (2) export the conversation to a file and reload it from the client -- which I got working early on with the wrong history structure... -- and the whole thing just ends up making me want to tear my hair out.
Interrupting the easy cases was easy by dropping the connection. Interrupting it in the middle of tool use with all my mistakes and cutting myself on the sharp edges in the ecosystem? Not so much...
I know what I need to do now to get to where I want to be -- I'm not stuck -- but it's been a slog to untangle the whole thing.
Edit: Oh, I also forgot to mention that I stream the responses (including thinking) to the client as they arrive. That was one more knot in the puzzle.
yyyyes, highly recommend switching to llama cpp.
the serving command is easy:
and you can quantize the context for much better speed and use Qwen 3.5 MTP and such. that can look like this:
it surprising howuvh more speed you get over long contexts with those context quantizations with percievably low quality loss.