Selfhosted

Do you host your own AI? (aussie.zone)

submitted 1 week ago by SuspiciousCarrot78@aussie.zone to c/selfhosted@lemmy.world

199 comments

Do you host your own ML / AI / LLM? What do you use, and what do you use it for?

[–] brucethemoose@lemmy.world 3 points 1 week ago* (last edited 1 week ago) (1 children)

CPU offloading is too slow unless you use a hybrid MoE model, specifically with llama.cpp's --n-cpu-moe parameter.

This only offloads the "sparse" parts of the model (the expert weights) to the CPU, which take up a lot of RAM but are very compute-light to run. In practice, that's most of the size of modern MoE LLMs.
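To make the trade-off concrete, here is a rough sketch of how one might estimate a starting value for --n-cpu-moe. The layer count, per-layer expert size, dense size and VRAM budget are all made-up numbers, and the helper name is hypothetical. It assumes expert weights are spread evenly across layers and that the VRAM budget already excludes KV cache and compute buffers, so treat the result as a first guess to tune by hand.

```python
import math


def min_cpu_moe_layers(n_layers, expert_gib_per_layer, dense_gib, vram_budget_gib):
    """Smallest number of layers whose expert weights must stay in system RAM
    so that the remaining weights fit in the VRAM budget (hypothetical helper)."""
    if dense_gib > vram_budget_gib:
        raise ValueError("dense weights alone exceed the VRAM budget")
    # Size of the model if every layer's experts lived on the GPU.
    total_on_gpu = dense_gib + n_layers * expert_gib_per_layer
    overflow = total_on_gpu - vram_budget_gib
    if overflow <= 0:
        return 0  # everything fits, no expert offload needed
    # Each layer moved to the CPU frees one layer's worth of expert weights.
    return min(n_layers, math.ceil(overflow / expert_gib_per_layer))


# Made-up example: 48 layers, 1.5 GiB of experts per layer,
# 6 GiB of dense weights, 20 GiB of usable VRAM.
print(min_cpu_moe_layers(48, 1.5, 6.0, 20.0))
```

In this made-up case the experts account for 72 of 78 GiB, which is why keeping only them in system RAM frees so much VRAM while the dense layers stay on the GPU.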

[–] robber@lemmy.ml 0 points 1 week ago (1 children)

Since the --fit parameter and its relatives were implemented, and --fit on became the default, llama.cpp decides on its own what to offload. For me, that made --n-cpu-moe obsolete.

[–] brucethemoose@lemmy.world 1 points 1 week ago (1 children)

Mostly, yeah.

Sometimes it’s better to “cut it close,” for instance with a 27B model that nearly OOMs your VRAM when fully offloaded.

In my case, MiMo 2.5 fills both my system RAM and my VRAM almost completely, so it’s best to set a static value so that I neither swap nor OOM on the GPU.

[–] robber@lemmy.ml 1 points 1 week ago

You can control how much context should be fitted with --fit-ctx and how much space the algorithm should leave unallocated (even on a per-GPU basis) with --fit-target.
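As a sketch of how the flags mentioned above could be combined, the snippet below assembles a command line. The binary name llama-server, the -m flag, the model filename and both numeric values are assumptions for illustration. The unit expected by --fit-target and the syntax for per-GPU values are not stated in this thread, so check --help on your build before relying on them.

```python
import shlex


def llama_fit_args(model_path, fit_ctx, fit_target):
    """Build an argument list using the --fit flags named in this thread.

    The value formats are assumptions; verify them against your build's --help.
    """
    return [
        "llama-server",                   # assumed binary name
        "-m", model_path,                 # assumed model flag
        "--fit", "on",                    # described above as the default
        "--fit-ctx", str(fit_ctx),        # how much context should be fitted
        "--fit-target", str(fit_target),  # space to leave unallocated
    ]


# Placeholder values, not recommendations.
args = llama_fit_args("model.gguf", 16384, 1024)
print(shlex.join(args))
```

Building the arguments as a list keeps quoting safe if the model path contains spaces, and the same list can be passed to subprocess.run directly.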