Isn't that super slow? I mean, it could be slower than using llama.cpp on CPU. (If you're always transferring layers from SSD to RAM and over the PCIe bus into the GPU...)
I expect so, but as we start to get more agents capable of doing jobs without hand-holding, there are some jobs where time isn't as important as ability. You could potentially run a very powerful model on a GPU with 24GB of memory.
OK but the article implies that this approach saves money. I don't think it does that at all.
You know what's cheaper than a GPU with 120GB of RAM? Renting one, for a split second. You can do that for like 1 cent.
Yeah, I'm not sure how they get that, but maybe: if you want to run a model in-house, as many people would prefer, you can run much more capable models on consumer-grade hardware and save money there compared to buying more expensive kit. Many people already have decent hardware, and this extends what they can run before needing to fork out for something new.
I know, I'm guessing.
Or just don't bother with the GPU at all: get a much cheaper computer/cloud instance without one, and do it on the CPU if you're going to pipe everything through RAM anyway. Tests with llama.cpp (at least) have shown that it's bound by RAM bandwidth (bus width and speed). Even my old 4-core Xeon can do the matrix multiplications faster than it can get the numbers in. So the extra step of sending the data to the GPU and doing the computations there seems superfluous, unless I'm missing something. Sure, I use quantized weights, and my computer is old with DDR4 memory (and fewer memory channels than a proper, modern server), so the story could be a little different in other circumstances. But I'd be surprised if this changed fundamentally.
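The memory-bandwidth argument above can be sketched as a back-of-envelope calculation: if generating one token requires streaming roughly all the model's weights through memory once (approximately true for batch-size-1 inference), then tokens per second is bounded by bandwidth divided by model size. The bandwidth and model-size figures below are illustrative assumptions, not measurements:

```python
# Rough upper bound on token throughput when LLM inference is
# memory-bandwidth bound: each token streams ~all weights once.

def tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-limited ceiling on tokens/s (ignores compute entirely)."""
    return bandwidth_gb_s / model_size_gb

# Assumption: a ~70B-parameter model at ~4-bit quantization is roughly
# 40 GB of weights. All bandwidth numbers are ballpark, not measured.
model_gb = 40.0

scenarios = [
    ("dual-channel DDR4 desktop", 50.0),   # GB/s
    ("multi-channel DDR5 server", 350.0),  # GB/s
    ("PCIe 4.0 x16 link",         32.0),   # GB/s, if weights must cross it
]

for name, bw in scenarios:
    print(f"{name}: ~{tokens_per_second(model_gb, bw):.1f} tokens/s ceiling")
```

Note how the PCIe link's ceiling is below even the desktop RAM figure, which is why streaming layers into the GPU on every pass can end up slower than just staying on the CPU.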
I'm not sure renting vs. buying makes a difference, though. That depends on how much you use your GPU, and how. Sure, if it's idle most of the time, or sits under your table switched off at night, you'd be better off renting a cloud instance. But then you're just using it wrong. It's the same as buying a car and only driving it twice a year, versus driving it to work every single weekday.
@tinwhiskers: You're right. I kind of forgot that we also do work that isn't fed back to the user immediately. I can imagine slower inference being useful for indexing or summarizing things overnight. Or it could work in conjunction with a smaller model, maybe fact-checking the smaller model's output with its increased capability.
Free Open-Source Artificial Intelligence
Welcome to Free Open-Source Artificial Intelligence!
We are a community dedicated to forwarding the availability and access to:
Free Open Source Artificial Intelligence (F.O.S.A.I.)
No idea where to begin with AI/LLMs? Try visiting our Lemmy Crash Course for Free Open-Source AI. When you're done with that, head over to FOSAI ▲ XYZ or check out the FOSAI LLM Guide for more info.
Monthly Roadmap
October 2023
Add new guides and fine-tuning resources.
Build and collect community datasets.
Publish a FOSAI model on Hugging Face?
More AI Communities
AI Resources
Learn
- FOSAI ▲ XYZ
- Create an LLM from Scratch with Python
- Fine-Tuning Mistral (brev)
- Fine-Tune Llama 2 (brev)
- Run ControlNet on Stable Diffusion WebUI (brev)
- What you need to know about CUDA to get things done on Nvidia GPUs
Build
Serve
Fediverse / FOSAI
- The Internet is Healing
- General Resources
- FOSAI Welcome Message
- FOSAI Crash Course
- FOSAI Nexus Resource Hub
- FOSAI LLM Guide
LLM Leaderboards
LLM Search Tools
LLM Evaluations
- Holistic Evaluation of Language Models (HELM)
- TextSynth
- The Curious Case of LLM Evaluations
- Mosaic Benchmarks