this post was submitted on 20 Oct 2025
53 points (100.0% liked)

Technology


A tech news sub for communists

founded 3 years ago
[–] CriticalResist8@lemmygrad.ml 13 points 6 days ago (1 children)

The article is short on details, but if you could deploy this to a single GPU you could instantly jump from a 12B model to potentially 80B or more, depending on how much space the model takes up.

It's pretty clear China is leading in (if not the only country interested in) model optimization. I also saw previously that Huawei managed to quantize models down to very low bit widths while retaining similar performance, so they take up a lot less VRAM to run.
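For a rough sense of why quantization matters for VRAM, here is a back-of-the-envelope sketch. It only counts the weights (parameters times bits per weight) and ignores the KV cache, activations, and runtime overhead, so real usage will be higher. The function name and the bit widths are illustrative assumptions, not figures from the article or from Huawei.

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in decimal gigabytes.

    Assumption: every parameter is stored at the same bit width.
    KV cache, activations, and framework overhead are NOT included.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical comparison: the same parameter counts at different bit widths.
    for params, bits in [(12, 16), (12, 4), (80, 16), (80, 4)]:
        print(f"{params}B at {bits}-bit: ~{weight_vram_gb(params, bits):.0f} GB of weights")
```

By this estimate a 12B model at 16-bit needs about 24 GB for weights alone, while an 80B model at 4-bit needs about 40 GB, which is why aggressive quantization changes what fits on a given card.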

[–] PoY@lemmygrad.ml 7 points 5 days ago (1 children)

I thought this was more about running multiple models at a time on the same hardware than about fitting larger models. I might be mistaken, but it seemed like it was finding a way to use the same hardware to run more things at the same time, or to schedule queues better, so that GPU work is allocated per token instead of pinning a model to a particular GPU.

[–] CriticalResist8@lemmygrad.ml 8 points 5 days ago

The article mentions "Packing multiple models per GPU", but also "using a token-level autoscaler to dynamically allocate compute as output is generated, rather than reserving resources at the request level". I'm not sure what that means, but it may hint that there are ways to scale this down.

If not Alibaba then other researchers will eventually get to it.
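One possible reading of the "token-level autoscaler" quote, as a toy sketch and not a description of Alibaba's actual system: if each model gets its own pinned GPUs, every model has to be provisioned for its own peak, whereas if GPUs are shared and work is scheduled per token, the pool only has to cover the combined peak. The demand numbers, the capacity constant, and the function names below are all made up for illustration, and the sketch ignores model-switching cost and memory limits entirely.

```python
import math

# Hypothetical tokens requested per time step for three rarely-busy models.
DEMAND = {
    "model_a": [40, 0, 0, 10, 0, 0],
    "model_b": [0, 30, 0, 0, 0, 20],
    "model_c": [0, 0, 50, 0, 5, 0],
}

# Assumed number of tokens one GPU can generate per time step.
GPU_TOKENS_PER_STEP = 50


def gpus_pinned(demand: dict[str, list[int]], capacity: int) -> int:
    """GPUs needed when each model is pinned to its own GPUs,
    sized for that model's own peak demand."""
    return sum(math.ceil(max(steps) / capacity) for steps in demand.values())


def gpus_pooled(demand: dict[str, list[int]], capacity: int) -> int:
    """GPUs needed when all models share one pool and tokens are
    scheduled step by step, sized for the combined peak demand."""
    totals = [sum(step) for step in zip(*demand.values())]
    return math.ceil(max(totals) / capacity)


if __name__ == "__main__":
    print("pinned:", gpus_pinned(DEMAND, GPU_TOKENS_PER_STEP))
    print("pooled:", gpus_pooled(DEMAND, GPU_TOKENS_PER_STEP))
```

With these made-up numbers the pinned setup needs three GPUs and the pooled one needs a single GPU, because the three models are never busy at the same time. That is the "run more things on the same hardware" interpretation rather than the "fit a bigger model" one.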