this post was submitted on 23 Feb 2026

Slop.


Thanks capitalism for doing the stupidest implementation of this technology possible

[–] LaughingLion@hexbear.net 27 points 14 hours ago (4 children)

the main problem is if you have a good computer (i.e. the average, run-of-the-mill gaming rig these days) you can run a model that will fill the "ai assistant" role about 90% as well as the best paid saas models, for free, on the computer you already have, with the bonus that your local model is abliterated (jailbroken) so it will talk about restricted shit. all you need is like 32gb of ram and 8gb of vram minimum, which the average pc gamer is running these days.

if you are a developer and have 64gb of ram and 16gb+ of vram then you can run a claude-level local ai as well. i've not fucked with image or video generation but those models are available too

if anyone needs a tutorial for the technically uninclined i can write one up, because the only real barrier is that it's all written in techbro and dev-speak

[–] RamenJunkie@midwest.social 2 points 5 hours ago (1 children)

I would love to know the best way to get jailbroken ChatGPT functionality/experience locally. Including the random image generation and adjustments.

I have tried a few things and images always seem to be separate. And when I tried a chatbot with stories it was terrible at keeping track of characters and what was going on.

[–] LaughingLion@hexbear.net 1 points 4 hours ago

If you are talking about multi-modal stuff then locally the best way to do it is still separately.

If you just want to run a local adventure bot "game master" then Koboldcpp plus SillyTavern is the way to go. I'm a tabletop gaming nerd, so this is what I use it for when I'm sitting on the couch sometimes.

Simple guide for Windows:

  1. Go grab Koboldcpp: the normal exe for NVidia, or the NOCUDA build if you have AMD and want a smaller download.
  2. Go download Nodejs and install that.
  3. Download SillyTavern {Go to the green "Code" box in the top-right and click it then select download.}
  4. Extract SillyTavern into a folder that has no spaces (Spaces in the folder name will break their install script)
  5. Browse to the /SillyTavern-Launcher/ folder and run the Installer.bat file and run through the simple install.
  6. Go download a jailbroken model. If your RAM situation looks desperate like 16/6 then get something like Rocinante or if you have a little more RAM like 32/12 then go with something like Goetia
  7. Run Koboldcpp.exe. Load the model you downloaded. Press Launch. Wait for it to say it's waiting for a connection at the endpoint. By default it'll give you the local web launch page and you can actually test things there.
  8. If everything looks good and SillyTavern isn't already loaded, then launch that from the Launcher.bat.
  9. In SillyTavern you click the plug button at the top and make sure you can connect to the endpoint. You might have to change the port from 5000 to 5001.
  10. Technically, you are done here and it's just a matter of finetuning your settings and setting up a character card to be a game master.

Finetuning:

  1. First rule of performance is the more of the model you can fit into VRAM the better. ALL is best.
  2. The second rule is to offload ffn_ tensors. Down, then up, then if you are desperate, gate.
  3. Third rule, if you are REALLY desperate, is to offload the KV cache. At this point your model is running SLOW.
  4. What I would do is download the model. Load it in Kobold. Set your GPU layers to 99 and your context to 16384 (Minimum for text roleplayers)
  5. Click the left Hardware tab. Check Use MMAP, Use mlock, High Priority. Up the threads to one lower than your logical CPU core count. Turn off Launch Browser if you want.
  6. Click the left Context tab. Check No BOS Token (SillyTavern already does this).
  7. Save that config and let the model load. Then use Task Manager to look at your GPU VRAM usage in the performance tab. Are you using all of it?
  8. If "no" then close it and give yourself more context. The ultimate goal for text roleplay geeks like us is 32k. You can also up the Batch Size in the Hardware tab to 1024 or 2048, but realize this grows the KV cache quickly and hogs VRAM. More KV cache helps your prompt load quicker, which starts to become a problem with larger context sizes. Your speeds on a fully VRAM-loaded model should be like 50tps or better (tokens per second).
  9. If "yes" then run some prompts through the assistant or Seraphina default character cards and check your speeds. They are probably slow.
  10. Close the Koboldcpp window and relaunch it. Load your previous config (you saved it didn't you?) and plug in a tensor offload command (bottom of my comment) into the Override Tensors field in the Hardware Tab. Then return to step 7. Keep doing this until you get speeds of at least 5tps and can have a decent context size (16k+).
  11. If you still can't get enough speed because of VRAM constraints try a 3bit quant of the model you got or go to a less complex, small model. Sorry bubs, welcome to the club. We VRAM poor in this house.
  12. If you are REALLY desperate you can try offloading the KV cache in the Hardware tab but the slowdown is MASSIVE.
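To see why big context sizes eat VRAM so fast, note that the KV cache grows linearly with context length. Here's a rough sketch in Python, using hypothetical dimensions for a ~24B model (40 layers, 8 KV heads, head dim 128, fp16 cache; the real values for your model live in its GGUF metadata):

```python
# Rough KV-cache size: 2 tensors (K and V) per layer, per context position.
# All dimensions below are hypothetical; check your model's GGUF metadata.
def kv_cache_bytes(n_layers, ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

for ctx in (16384, 32768):
    gib = kv_cache_bytes(40, ctx, 8, 128) / 2**30
    print(f"{ctx} context -> {gib:.1f} GiB of KV cache")
# 16384 context -> 2.5 GiB of KV cache
# 32768 context -> 5.0 GiB of KV cache
```

On an 8 GB card that 32k cache plus the weights you've offloaded is most of your VRAM, which is why step 8 tells you to trade context size against layers.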

Setting up a "character" card and "persona" in SillyTavern and some other settings

  1. Use this guide to start
  2. Don't use his character template, it's kind of bad. Instead consider using a guide like the character foundry template. I'll share a character I made to format characters this way below. Works pretty well. Feed it as much information as you want about the character and it makes a nice entry for the Lorebook.
  3. Adjustments I made: reduce the context to anywhere from 160-220 depending on the model. I don't restrict TopK (leave at zero for off). I don't use the XTC settings over the repetition setting because the latter works better on longer adventures (like 500+ responses).
  4. Learn to love the LoreBook. I wrap my supporting "characters" in the lorebook with {} and that seems to help models track them. There are also extensions to help with other lorebook entries, like WREC, though I use an inline summarizer, which I like.

Character Creator:

You will create a character card considering the context provided by {{user}}. Adhere to the following format strictly, restructuring the context to fit this format:

# {The Character name}
- **Age:** {age in years}
- **Gender:** {male, female, nonbinary, or transgendered}
- **Height:** {height in feet and inches}
- **Race/Species:** {the race or species of a character such as human or elf or other fantasy species}

### Appearance
{Describe the person's basic physical appearance.} {Describe briefly how their body has shaped their life, if it gives them the attention they like or dislike and if they have insecurities about their body.} {Write a sentence or two about their clothing preferences and what kind of image they are trying to project about themselves.}

### Background
{Write a sentence describing where this character grew up and how that culture shaped them.} {Write a few sentences about experiences they had growing up that may have shaped who they are.}

### Personality
{A sentence or two detailing where this character feels they are in their life and if they are satisfied or not with their current situation.} {Describe a few things the character enjoys such as music, movies, games, or other activities.} {Describe briefly the character's innermost desire.} {Describe briefly the character's innermost fear.} {Describe a lie the character believes about themselves.} {Describe one irrational thing the character does or believes.}

### Speech
[{Write about how this person talks considering their accent and the slang and idioms they use. Give examples.}]

### Relationships
{Write a sentence or two about how this character expresses attraction or fondness for others.} {Write a sentence or two about boundaries this character has in relationships.} {Create one issue that makes relationships or intimacy difficult for this character.}

Some basic tensor offloads. Try the top one first, the second if you still need VRAM space; the last one is a big slowdown:

```
blk\.\d+\.ffn_(down*)=CPU
blk\.\d+\.((ffn_down*)|(ffn_up*))=CPU
blk\.\d+\.((ffn_down*)|(ffn_up*)|(ffn_gate*))=CPU
```
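For the curious, these are ordinary regexes matched against tensor names; the `=CPU` suffix just says where matching tensors go. A quick Python sketch of what each pattern catches, using made-up tensor names in llama.cpp's usual naming style:

```python
import re

# The three override patterns above, minus the "=CPU" device suffix.
down_only    = r"blk\.\d+\.ffn_(down*)"
down_up      = r"blk\.\d+\.((ffn_down*)|(ffn_up*))"
down_up_gate = r"blk\.\d+\.((ffn_down*)|(ffn_up*)|(ffn_gate*))"

# Made-up example tensor names in llama.cpp's naming style.
tensors = ["blk.12.ffn_down.weight", "blk.12.ffn_up.weight",
           "blk.12.ffn_gate.weight", "blk.12.attn_q.weight"]

for pat in (down_only, down_up, down_up_gate):
    print([t for t in tensors if re.search(pat, t)])
# Each successive pattern catches one more ffn_ tensor type;
# attn_q never matches, so attention always stays on the GPU.
```

That's the whole trick: each step down the list pushes one more class of feed-forward tensor to the CPU, freeing VRAM at the cost of speed.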
[–] NominatedNemesis@reddthat.com 2 points 6 hours ago (1 children)

I would like to agree with you, but in my experience I can't. I usually use local models on my work computer, and I have access to pro models paid for by the company. There is a great difference.

I have to use AI. It's in my KPIs and my salary raise depends on it... so stupid. I just got a mail that we cannot replace our computers for the foreseeable future because of the ram and ssd shortage... Meanwhile I fight with "developers" that are generating code... which does not even work!

So I use local AI because I am forced to use AI and I am in the terminal anyway. Also f×ck the big companies pushing their bullsh×t; they already demonstrated that they will use the data they get from paying companies as well.

AI has a use case, but LLMs are not the sentient sh×t they want us to believe. I want to go back to before the hype...

I actually programmed an AI and trained it to do repetitive but not well-definable tasks for me in c++ with openCV and some tensor library; after a week it worked better than any human. I also helped a research group optimize an image recognition AI to help doctors identify cancerous cells.

[–] LaughingLion@hexbear.net 2 points 6 hours ago (1 children)

You're right. Getting a 24B model locally isn't going to be as powerful as a 600B model, for sure. You're also right that they don't think. They absolutely don't.

But the local ones are pretty powerful and can do a lot more than most people think. Even some simple vibe coding can be done with local AI. I think for the average gamer type who wants to mess with it, local is more than enough tbh.

[–] NominatedNemesis@reddthat.com 2 points 5 hours ago

To be fair, local is better in many ways than cloud solutions: it keeps the data private, and it does not lock you into a vendor (which they desperately want). Also LoRA is an option for fine-tuning, but that's way advanced for an average user.

[–] MeetMeAtTheMovies@hexbear.net 8 points 11 hours ago (4 children)

Do you need open weight models for this? Any recommendations for which one(s)? Do you need to download a huggingface client or something? I’m familiar with AI stuff, but not running locally.

[–] corgiwithalaptop@hexbear.net 5 points 8 hours ago* (last edited 8 hours ago) (1 children)

huggingface client

This is what billionaires want us to take seriously. Let me just fire up the slimslam bazooper and have it connect to the sillynilly butterball API

[–] LaughingLion@hexbear.net 1 points 3 hours ago

You'd download ollama and then get the model via `ollama pull`, I believe. Although the 90% as good mark is pushing it, as open weight models below 32b parameters (what you could reasonably run on those machines) benchmark around 40% lower than Opus 4.6 for software, and the difference is night and day for general reasoning.

[–] LaughingLion@hexbear.net 1 points 6 hours ago

If you are running at home as a hobby then just use Koboldcpp, and maybe SillyTavern if you want extra functionality. In the former you can offload down and potentially up tensors to save VRAM space if needed. For models it depends on need.

A 24-31B model is generally more than fine for most @home use cases, and they are quite "smart", though that doesn't mean anything rigorous with AI. It's a vibe, basically. A 32GB RAM / 8GB VRAM machine can use a 24B model to generate about 5 tokens per second, which is fine for an agent designed to give you short replies to answer questions.

You'll most likely want to grab a GGUF quantization from Hugging Face, yes. Any 4-bit quant is fine, really. The merges are all quasi-abliterated models for people who want slutty AI girlfriends/boyfriends. The models directly from companies, like GLM7 or Kimi or whatever, are more standard and generally run more efficiently.

People in development are likely going to want a 70-100B model. Claude, I think, is a 100B model. You can run those on about 64gb of ram and 32gb of VRAM.
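As a sanity check on those RAM/VRAM numbers: quantized weight size is roughly params × bits / 8. A back-of-envelope sketch in Python (the parameter counts here, including the Claude guess, are rough assumptions, and this ignores KV cache and runtime overhead):

```python
# Back-of-envelope weight size for a quantized model, ignoring KV cache
# and runtime overhead: one billion params is ~1 GB at 8-bit.
def weights_gb(params_billion, bits=4):
    return params_billion * bits / 8

print(weights_gb(24))   # 12.0 -> a 24B 4-bit quant splits across 8 GB VRAM + system RAM
print(weights_gb(100))  # 50.0 -> a ~100B model wants the 64 GB RAM / 32 GB VRAM class
```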

If you want settings for Koboldcpp I can give you the rundown on how to optimize.

[–] BountifulEggnog@hexbear.net 2 points 8 hours ago (1 children)

It depends massively on what hardware you have. I've heard good things about glm 4.7 flash and it's easy enough to run. Also depends on what you want to use it for.

[–] LaughingLion@hexbear.net 2 points 3 hours ago

glm flash 4.7 is really powerful for its size and is easy to fit on smaller graphics cards, i can attest to that

[–] plinky@hexbear.net 1 points 7 hours ago

local image generation is more involved as well. there is some fun in contorting models with controlling networks that don't make sense for a prompt and having them produce something bizarre in response, but i got bored after a month.

but it's kinda mechanical fun, like playing with frequencies in some audio software.