this post was submitted on 25 Jul 2024



The Rule (lemmy.ml)

submitted 2 years ago by roon@lemmy.ml to c/196@lemmy.blahaj.zone


[–] jlh@lemmy.jlh.name 87 points 2 years ago (8 children)

I doubt this person actually had a computer that could run the 405b model. You need over 200 GB of RAM, let alone having enough VRAM to run it with GPU acceleration.

[–] AVincentInSpace@pawb.social 90 points 2 years ago* (last edited 2 years ago) (3 children)

simple, just create 200GB of swap space and convince yourself that you really are patient enough to spend 3 days unable to use your computer while it uses its entire CPU and disk bandwidth to run ollama (and hate your SSD enough to let it spend 3 days constantly swapping)

[–] zurohki@aussie.zone 31 points 2 years ago (1 children)

Reminds me of the time I compiled Qt on a 1GB Raspberry Pi.

[–] AVincentInSpace@pawb.social 11 points 2 years ago

All I can think to say is 'ouch'.

[–] wander1236@sh.itjust.works 13 points 2 years ago (1 children)

SSD, huh? Real AI enthusiasts swap with an HDD.

[–] AnUnusualRelic@lemmy.world 7 points 2 years ago

I don't have any spare HDs but I can swap on a rewritable optical disc.

[–] AeonFelis@lemmy.world 5 points 2 years ago

Also invite some friends for BBQ. You don't even need to remember where you put your old grill - you won't be using it.

[–] Gormadt@lemmy.blahaj.zone 27 points 2 years ago* (last edited 2 years ago) (2 children)

In terms of RAM it's not impossible, my current little server has 192GB of RAM installed.

Pic from TrueNAS

The VRAM would be quite the hurdle though, I'm curious about its VRAM requirements.

Edit: Moving data in anticipation of a hardware migration ATM so basically none of the services are running.

[–] dwindling7373@feddit.it 14 points 2 years ago (2 children)

That's not a little server.

[–] Gormadt@lemmy.blahaj.zone 11 points 2 years ago (1 children)

It's pretty old hardware to say the least, it's also really proprietary. (Old Dell PowerEdge T610)

My hardware migration I'm currently in the midst of is going to bring it more in line with my typical use case for it.

Basically taking it down from 192 GB of ECC DDR3 to around 32 GB (maybe 64 GB) of DDR4 RAM. Also down to a single CPU rather than dual socket.

[–] jlh@lemmy.jlh.name 3 points 2 years ago (1 children)

Old Epyc boards are super cheap on eBay. 8 channels of ddr4 and 80-100 lanes of pcie for nvme on an ATX mobo. You pay for the idle power consumption, but it's pretty cheap overall.

[–] Gormadt@lemmy.blahaj.zone 2 points 2 years ago

I'm just going with a Ryzen 1600x system because I have one on hand

My current system has a pair of 12 thread Xeon CPUs and I really don't need them, plus I'm wanting to go with normal consumer hardware for the new system for repairability reasons

[–] desktop_user@lemmy.blahaj.zone 3 points 2 years ago (2 children)

You can have that much RAM with consumer ddr5.

[–] dwindling7373@feddit.it 4 points 2 years ago

Yes but you can't call it a little amount.

[–] jlh@lemmy.jlh.name 1 points 2 years ago

4x64gb udimms would cost over $1000.

[–] Mikina@programming.dev 6 points 2 years ago (1 children)

VRAM would be 810 GB/403 GB/203 GB for FP16/FP8/INT4 for inference, according to their website.

[–] Gormadt@lemmy.blahaj.zone 4 points 2 years ago

Hot damn that's a lot! They ain't messing around with that requirement.

My current server has 32 MB of VRAM. Yes, MB not GB. Once I finish the hardware migration it's going to 8GB but that's not even a drop in the bucket compared to that requirement.

[–] PriorityMotif@lemmy.world 6 points 2 years ago (2 children)

You can probably find a used workstation/server capable of using 256 GB of RAM for a few hundred bucks and get at least a few GPUs in there. You'll probably spend a few hundred on top of that to max out the RAM. Performance doesn't go up much past 4 GPUs because the CPU will have a difficult time dealing with the traffic. So for a ghetto build you're looking at $2k unless you have a cheap/free local source.

[–] areyouevenreal@lemm.ee 3 points 2 years ago

Without sufficient VRAM it probably couldn't be GPU accelerated effectively. Regular RAM is for CPU use. You can swap data between both pools, and I think some AI engines do this to run larger models, but it's a slow process and you probably wouldn't gain much from it without using huge GPUs with lots of VRAM. PCIe just isn't as fast as local RAM or VRAM. This means it would still run on the CPU, just very slowly.

[–] AdrianTheFrog@lemmy.world 1 points 2 years ago

PCIe will probably be the bottleneck way before the number of GPUs is, if you're planning on storing the model in ram. Probably better to get a high end server CPU.
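To put a rough number on the PCIe concern, here is a back-of-the-envelope sketch. It assumes the worst case where the weights live in system RAM and every weight has to cross the bus once per generated token, and it assumes about 32 GB/s, which is roughly the theoretical peak of a PCIe 4.0 x16 link. Both are assumptions for illustration, not measurements from anyone in this thread.

```python
# Worst-case estimate: all weights streamed from system RAM to the GPU per token.
weights_gb = 203       # INT4 figure quoted in this thread
pcie_gb_per_s = 32     # assumption: roughly PCIe 4.0 x16 theoretical peak

seconds_per_token = weights_gb / pcie_gb_per_s
tokens_per_minute = 60 / seconds_per_token

print(f"{seconds_per_token:.1f} s per token, {tokens_per_minute:.1f} tokens per minute")
```

Even under those generous assumptions the bus alone costs several seconds per token, before any compute happens.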

[–] Mikina@programming.dev 4 points 2 years ago (1 children)

I'm not sure what "FP16/FP8/INT4" means, or where an RTX 4090 would fall in those categories, but the VRAM required is respectively 810 GB/403 GB/203 GB. I guess the 4090 would fall under INT4?

[–] technohacker@programming.dev 7 points 2 years ago* (last edited 2 years ago)

They stand for 16-bit floating point, 8-bit floating point and 4-bit integer respectively. Normal floating point numbers are generally 32 or 64 bits in size, so if you're willing to sacrifice some range and precision, you can save a lot of space used by the model. Oh, and it's about the model rather than the GPU.
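The figures quoted in this thread follow from simple arithmetic: parameter count times bits per weight. A minimal sketch, assuming 405 billion parameters (taken from the model name) and counting only the weights, so the KV cache, activations and runtime overhead are ignored:

```python
# Rough weight-only memory estimate: parameters x bytes per parameter.
# Real usage is higher because of the KV cache and runtime overhead.
PARAMS = 405e9  # assumption: parameter count taken from the "405b" model name

BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "INT4": 4}

def weight_memory_gb(params: float, bits: int) -> float:
    """Return the size of the weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_memory_gb(PARAMS, bits):.1f} GB")
```

This gives roughly 810, 405 and 202.5 GB, which is close to the 810/403/203 figures quoted above; the small differences presumably come from rounding or the exact parameter count.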

[–] Sabata11792@ani.social 4 points 2 years ago* (last edited 2 years ago) (1 children)

Some apps allow you to offload to both GPU and CPU while loading only the active part of the model. I have an old SSD that gives me 500 GB of "usable" RAM set up as swap.

It is horrendously slow and pointless but you can do it. I got about 2 tokens in 10 minutes before I gave up on a 70b model on a 1080 ti.

[–] AeonFelis@lemmy.world 4 points 2 years ago (1 children)

Even if they used more powerful hardware than you, the model they ran is still almost 6 times bigger - so if you got two tokens in 10 minutes, one token in 30 minutes for them sounds plausible.
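The estimate can be checked with a quick calculation. This sketch assumes time per token scales linearly with model size on the same hardware, which is a simplification (swap behaviour is not that predictable):

```python
# Scale the observed 70b speed up to the 405b model.
observed_tokens = 2
observed_minutes = 10
small_model_b = 70     # billions of parameters
large_model_b = 405

size_ratio = large_model_b / small_model_b                       # how much bigger
minutes_per_token_small = observed_minutes / observed_tokens
minutes_per_token_large = minutes_per_token_small * size_ratio   # linear scaling assumption

print(f"{size_ratio:.2f}x bigger, about {minutes_per_token_large:.0f} minutes per token")
```

That lands at a bit under 6x and a bit under 30 minutes per token on identical hardware, so one token in 30 minutes is in the right ballpark.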

[–] Sabata11792@ani.social 4 points 2 years ago (1 children)

I would have to use an entire 1tb drive for swap but I'm sure I could manage 1 token before the heat death of the universe.

[–] AeonFelis@lemmy.world 2 points 2 years ago

I'd worry less about the heat death of the universe and more about your hardware's heat from all that load.

[–] josefo@leminal.space 3 points 2 years ago (2 children)

Are there other options that consume less RAM?

[–] PumpkinEscobar@lemmy.world 8 points 2 years ago (1 children)

There's quantization which basically compresses the model to use a smaller data type for each weight. Reduces memory requirements by half or even more.
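As a rough illustration of the idea, here is a toy symmetric 8-bit quantizer in plain Python. It is a simplified sketch, not what any particular LLM runtime does (real quantizers typically use per-block scales and packed storage), and the function names are made up for this example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 0.0, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Storing one byte per weight instead of two is where the "half" comes from when starting from 16-bit weights; going to 4 bits halves it again at the cost of more rounding error.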

There's also airllm which loads a part of the model into RAM, runs those calculations, unloads that part, loads the next part, etc... It's a nice option but the performance of all that loading/unloading is never going to be great, especially on a huge model like llama 405b
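The load, compute, unload loop can be sketched in a few lines. This is not airllm's actual API, just a toy illustration of the pattern: each "layer" here is a scale and an offset stored in its own file, and all names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def save_layers(directory: Path, layers):
    """Write each toy layer (a scale and an offset) to its own file."""
    for i, layer in enumerate(layers):
        (directory / f"layer_{i}.json").write_text(json.dumps(layer))

def run_streaming(directory: Path, num_layers: int, x: float) -> float:
    """Apply layers in order while holding only one layer in memory at a time."""
    for i in range(num_layers):
        layer = json.loads((directory / f"layer_{i}.json").read_text())  # load from disk
        x = layer["scale"] * x + layer["offset"]                         # compute
        del layer                                                        # unload before the next one
    return x

layers = [
    {"scale": 2.0, "offset": 1.0},
    {"scale": 0.5, "offset": -1.0},
    {"scale": 3.0, "offset": 0.0},
]
with tempfile.TemporaryDirectory() as tmp:
    save_layers(Path(tmp), layers)
    result = run_streaming(Path(tmp), len(layers), 1.0)

print(result)
```

Peak memory is one layer instead of the whole model, but every token pays for reading all layers from disk again, which is why the performance is never going to be great.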

Then there are some neat projects to distribute models across multiple computers like exo and petals. They're more targeted at a p2p-style random collection of computers. I've run petals in a small cluster and it works reasonably well.

[–] AdrianTheFrog@lemmy.world 1 points 2 years ago

Yes, but 200 GB is probably already with 4-bit quantization; the weights in FP16 would be more like 800 GB. IDK if it's even possible to quantize more. If it is, you're probably better off going with a smaller model anyway.

[–] theneverfox@pawb.social 5 points 2 years ago

Why, of course! People on here saying it's impossible, smh

Let me introduce you to the wonderful world of thrashing. What is thrashing? It's when you run out of ram. Luckily, most computers these days do something like swap space - they just treat your SSD as extra slow extra RAM.

Your computer still locks up when it genuinely doesn't have enough RAM, though: it unloads some RAM to disk, puts what it needs right now back into RAM, executes a bit of processing, then the program tells it that it actually needs some of what got shelved on disk. And it does this super fast, so it's dropping the thing it needs hundreds of times a second - technology is truly remarkable

Depending on how the software handles it, it might just crash... But instead it might just take literal hours
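Here is a toy simulation of why that gets so bad. It assumes a least-recently-used eviction policy and a program that sweeps through all of its pages in order on every pass (roughly what reading every model weight for each token looks like); real kernels are more sophisticated, so treat this as a sketch.

```python
from collections import OrderedDict

def count_page_faults(accesses, ram_pages):
    """Simulate LRU paging and return how many accesses had to go to disk."""
    ram = OrderedDict()
    faults = 0
    for page in accesses:
        if page in ram:
            ram.move_to_end(page)            # recently used, keep it in RAM
        else:
            faults += 1                      # not resident: read it back from swap
            if len(ram) >= ram_pages:
                ram.popitem(last=False)      # evict the least recently used page
            ram[page] = True
    return faults

model_pages = 10
passes = 5
accesses = [p for _ in range(passes) for p in range(model_pages)]

fits = count_page_faults(accesses, ram_pages=10)      # everything fits in RAM
thrashes = count_page_faults(accesses, ram_pages=9)   # one page short

print(fits, thrashes)
```

With enough RAM only the first pass touches the disk (10 faults). One page short, LRU always evicts exactly the page that is needed next, so all 50 accesses fault.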

[–] Iheartcheese@lemmy.world 3 points 2 years ago

I want this to be real though :(

[–] AdrianTheFrog@lemmy.world 2 points 2 years ago

Also worth noting that the 200 GB is for 4-bit weights; FP16 would be more like 800 GB.