[–] brucethemoose@lemmy.world 1 points 2 days ago* (last edited 2 days ago)

vLLM is a bit better with parallelization. All the KV cache sits in a single “pool”, and it uses as many slots as will fit. If it gets a bunch of short requests, it runs many of them in parallel; if it gets a long-context request, it pretty much just runs that one.

You still have to specify a maximum context length, though, and it’s best to set it as low as possible.
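A minimal sketch of what that looks like with vLLM’s Python API; the model name and numbers here are placeholders for illustration, not a recommendation:

```python
# A minimal sketch using the vLLM Python API; model and values are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # hypothetical model
    max_model_len=8192,                # keep this as low as your longest request allows
    gpu_memory_utilization=0.90,       # fraction of VRAM handed to the shared KV-cache pool
)

params = SamplingParams(max_tokens=256)
# Short requests submitted together get batched into the shared KV-cache pool.
outputs = llm.generate(["Hello!", "Summarize paged attention in one line."], params)
```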

…The catch is that it’s quite VRAM-inefficient. But it can split over multiple cards reasonably well, better than llama.cpp can, depending on your PCIe speeds.
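Splitting over cards is just a flag in vLLM; a short sketch, assuming two GPUs are visible and the model name is again a placeholder:

```python
# Sketch of sharding across two GPUs with vLLM's tensor parallelism.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",  # hypothetical model
    tensor_parallel_size=2,             # shard weights and KV cache across 2 cards
    max_model_len=8192,
)
```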

You might try TabbyAPI with exl2 quants as well. It’s very good with parallel calls, though I’m not sure how well it supports MI50s.
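For the “parallel calls” part, a rough sketch of firing many requests at once against an OpenAI-compatible endpoint (TabbyAPI, vLLM, and the llama.cpp server all expose one); the URL, port, and model name are assumptions for illustration:

```python
# Rough sketch: many concurrent requests against a local OpenAI-compatible server.
# base_url, port, and model name are assumptions; adjust for your setup.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:5000/v1", api_key="unused")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="local-model",  # placeholder; most local servers ignore or map this
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main():
    # Many short requests in flight at once -- this is where the backend's
    # continuous batching pays off.
    answers = await asyncio.gather(*(ask(f"Question {i}?") for i in range(8)))
    print(answers)

asyncio.run(main())
```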


Another thing to tweak is batch size. If you are actually making a bunch of 47K-context calls, you can increase the prompt-processing batch size a lot to load the MI50 better and get it through the prompt faster.
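If you’re on vLLM, the knob for this is the batched-token budget; a sketch, where the numbers are guesses to illustrate the idea rather than tuned values for an MI50:

```python
# Sketch of raising the prompt-processing batch budget in vLLM.
# Values are illustrative guesses, not tuned for an MI50.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",   # hypothetical model
    max_model_len=49152,                # room for ~47K-token prompts
    enable_chunked_prefill=True,        # prefill long prompts in chunks
    max_num_batched_tokens=16384,       # bigger prefill chunks keep the GPU busier
)
```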


EDIT: Also, now that I think about it, I’m pretty sure Ollama is really dumb about parallelization. Does it even support paged-attention batching?

The llama.cpp server should be much better, e.g. using less VRAM for each of the “slots” it can run.
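For reference, a rough sketch of launching the llama.cpp server with explicit slots and batch sizes (wrapped in Python just to keep the examples in one language); the flags reflect my understanding of recent llama-server builds and the values are illustrative, not tuned:

```python
# Rough sketch; assumes llama.cpp's llama-server is on PATH and the model path
# is a placeholder. Flag names reflect my understanding of recent builds.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "model.gguf",        # placeholder GGUF path
    "-c", "32768",             # total KV cache, shared across slots
    "--parallel", "4",         # 4 slots -> roughly 8K tokens of context each
    "-b", "2048",              # logical batch size for prompt processing
    "-ub", "1024",             # physical batch actually sent to the GPU
    "--port", "8080",
])
```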