[–] CriticalResist8@lemmygrad.ml 8 points 1 week ago

The article mentions "Packing multiple models per GPU", but also "using a token-level autoscaler to dynamically allocate compute as output is generated, rather than reserving resources at the request level". I'm not sure exactly what that means in practice, but it may hint that there are ways to scale this down.
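If I understand it right, request-level reservation holds a chunk of the GPU for a request's whole lifetime, while token-level autoscaling grabs compute one decode step at a time, which is what would let many models share a GPU between steps. A toy sketch of the difference (everything here is made up for illustration, not Alibaba's actual scheduler):

```python
# Purely illustrative sketch contrasting request-level reservation with
# token-level allocation. All names are hypothetical.
import random

class GpuPool:
    """Toy pool of interchangeable compute 'slices' on one GPU."""
    def __init__(self, total_slices: int):
        self.free = total_slices

    def reserve(self, n: int) -> int:
        if self.free < n:
            raise RuntimeError("pool exhausted")
        self.free -= n
        return n

    def release(self, n: int) -> None:
        self.free += n

def decode_request_level(pool: GpuPool, max_tokens: int) -> list[str]:
    # Capacity is reserved up front and held for the whole request,
    # even between decode steps, so fewer models fit per GPU.
    held = pool.reserve(max_tokens)  # crude worst-case reservation
    try:
        return [f"tok{i}" for i in range(random.randint(1, max_tokens))]
    finally:
        pool.release(held)

def decode_token_level(pool: GpuPool, max_tokens: int) -> list[str]:
    # Capacity is acquired one decode step at a time, so other models
    # packed on the same GPU can use the pool between our steps.
    out = []
    for i in range(max_tokens):
        held = pool.reserve(1)       # one slice, one token
        out.append(f"tok{i}")
        pool.release(held)
        if random.random() < 0.2:    # pretend we hit end-of-sequence
            break
    return out

if __name__ == "__main__":
    pool = GpuPool(total_slices=8)
    print(decode_request_level(pool, max_tokens=4))
    print(decode_token_level(pool, max_tokens=4))
```

The point being: in the second version the pool is free between tokens, so the same slices could be serving other models' decode steps, which is how you'd pack multiple models onto one GPU.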

If not Alibaba, then other researchers will eventually get to it.