[–] PoY@lemmygrad.ml 7 points 6 days ago (1 children)

I thought this was more about running multiple models at a time on the same hardware, rather than increased model sizes and whatnot. I might be mistaken, but it seemed like they found a way to use the same hardware to run more things at once, or to schedule the queues better, letting a GPU process work per token instead of pinning a model to a particular GPU.
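
Roughly how I picture it, as a toy sketch (all the names here are made up for illustration, not taken from the article): one GPU cycling through the queues of several models one decode step at a time, instead of sitting pinned to one model until its queue is drained.

```python
from collections import deque

# Toy sketch of per-token scheduling across models sharing one GPU --
# everything here is invented for illustration, not from the article.

def decode_one_token(model, request_id):
    """Stand-in for a single forward pass / decode step on the GPU."""
    pass

def pinned(model, requests):
    """GPU dedicated to one model: it drains that model's queue and
    nothing else runs on the GPU in the meantime."""
    for request_id, tokens_left in requests:
        for _ in range(tokens_left):
            decode_one_token(model, request_id)

def interleaved(queues):
    """One GPU cycles through several models one decode step at a time,
    so the scheduling decision happens per token rather than per model."""
    # queues: {model_name: deque of (request_id, tokens_left)}
    active = deque(queues.items())
    while active:
        model, pending = active.popleft()
        if pending:
            request_id, tokens_left = pending.popleft()
            decode_one_token(model, request_id)
            if tokens_left > 1:
                pending.append((request_id, tokens_left - 1))
        if pending:
            active.append((model, pending))

# e.g. two small models sharing one GPU, three requests between them
interleaved({
    "model_a": deque([("a1", 3), ("a2", 2)]),
    "model_b": deque([("b1", 4)]),
})
```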

[–] CriticalResist8@lemmygrad.ml 8 points 6 days ago

The article mentions "Packing multiple models per GPU", but also "using a token-level autoscaler to dynamically allocate compute as output is generated, rather than reserving resources at the request level". I'm not sure exactly what that means, but it may hint that there are ways to scale this down.
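
If I had to guess at what the token-level part means, maybe something like this (invented names, just a sketch, not their actual system): instead of holding a fixed slice of GPU for each request until it finishes, compute gets re-divided at every decode step among only the requests that actually need a token generated right now.

```python
# Just my guess at what a "token-level autoscaler" might mean -- a sketch with
# invented names, not Alibaba's actual system.

def request_level_reservation(all_requests, gpu_slices):
    """Request-level: every admitted request holds a fixed slice from admission
    to completion, even while it sits idle between tokens."""
    per_request = gpu_slices / len(all_requests)
    return {req: per_request for req in all_requests}

def token_level_allocation(requests_decoding_now, gpu_slices):
    """Token-level: at each decode step, divide compute only among the requests
    that actually need a token generated right now."""
    if not requests_decoding_now:
        return {}
    per_request = gpu_slices / len(requests_decoding_now)
    return {req: per_request for req in requests_decoding_now}

# Example: 8 requests admitted, but only 3 of them are decoding at this instant.
print(request_level_reservation(range(8), gpu_slices=4))        # 0.5 of a slice each, held by all 8
print(token_level_allocation(["r0", "r4", "r7"], gpu_slices=4))  # ~1.33 slices each, only 3 requests
```

If that reading is right, idle requests stop holding GPU capacity between tokens, which would presumably be where the savings come from.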

If not Alibaba then other researchers will eventually get to it.