this post was submitted on 20 Oct 2025
93 points (100.0% liked)

technology

LLMs totally choke on long context because of that O(n^2) scaling nightmare. It's the core scaling problem for almost all modern LLMs because of their self-attention mechanism.

In simple terms, for every single token in the input, the attention mechanism has to look at and calculate a score against every other single token in that same input.

So, if you have a sequence with n tokens, the first token compares itself to all n tokens. The second token also compares itself to all n tokens... and so on. This means you end up doing n*n, or n^2, calculations.

This is a nightmare because the cost doesn't grow nicely. If you double your context length, you're not doing 2x the work; you're doing 2^2=4x the work. If you 10x the context, you're doing 10^2=100x the work. This explodes the amount of computation and, more importantly, the GPU memory needed to store all those scores. This is the fundamental bottleneck that stops you from just feeding a whole book into a model.
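To make that concrete, here's a minimal numpy sketch of textbook single-head self-attention (not any particular model's implementation); the (n, n) score matrix is exactly the thing that quadruples when you double the context:

```python
# Minimal sketch of naive single-head self-attention, just to show where the
# n^2 comes from; this is not any particular model's implementation.
import numpy as np

def naive_attention(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv                # each (n, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (n, n): every token scored against every token
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # (n, d)

d = 64
rng = np.random.default_rng(0)
wq, wk, wv = [rng.standard_normal((d, d)) for _ in range(3)]
out = naive_attention(rng.standard_normal((128, d)), wq, wk, wv)  # fine at small n

for n in (1_000, 2_000, 10_000):
    print(f"{n:6d} tokens -> {n * n:>11,d} pairwise scores to compute and store")
# doubling n quadruples the score matrix; 10x the tokens means 100x the scores
```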

Well, DeepSeek came up with a novel solution to just stop feeding the model text tokens. Instead, you render the text as an image and feed the model the picture. It sounds wild, but the whole point is that a huge wall of text can be "optically compressed" into way, way fewer vision tokens.
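Here's a back-of-the-envelope sketch of the idea, not DeepSeek's actual pipeline; the patch size and the 16x compressor factor below are illustrative placeholders, and the word count is a crude stand-in for a real tokenizer:

```python
# Back-of-the-envelope sketch of "optical compression": render text to an
# image with Pillow, then compare a crude text-token count with the number of
# vision tokens the image would cost. All numbers are illustrative.
from PIL import Image, ImageDraw

text = ("long document text " * 400).strip()
n_text_tokens = len(text.split())                  # crude stand-in for a real tokenizer

img = Image.new("RGB", (1024, 1024), "white")
draw = ImageDraw.Draw(img)
for i in range(0, len(text), 160):                 # naive line wrapping
    draw.text((8, 8 + (i // 160) * 14), text[i:i + 160], fill="black")

patch = 16
raw_patches = (1024 // patch) ** 2                 # 4096 patches off the raw image
vision_tokens = raw_patches // 16                  # after a 16x token compressor: 256

print(f"~{n_text_tokens} text tokens vs ~{vision_tokens} vision tokens")
```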

To do this, they built a new encoder called DeepEncoder. It's a clever stack that uses a SAM-base encoder for local perception, then a 16x convolutional compressor to crush the token count, and then a CLIP model to capture the global meaning. This whole pipeline means it can handle high-res images without activation memory blowing up on the GPU.
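A shape-level toy of that layout, assuming the three stages described above (the module choices and sizes here are placeholders of mine, not the paper's architecture): a local stage runs over thousands of high-res patch tokens, a stride-4 convolution over the 2D token grid cuts the count 16x, and only then does a global-attention stage run:

```python
# Shape-level toy of the DeepEncoder layout described above; the modules and
# sizes are placeholders, not the paper's actual architecture.
import torch
import torch.nn as nn

class ToyDeepEncoder(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # stand-in for the SAM-style local perception stage (windowed attention
        # in the real model, which is what keeps this stage cheap at high res)
        self.local = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        # 16x token compressor: stride-4 conv over the 2D grid of patch tokens
        self.compress = nn.Conv2d(dim, dim, kernel_size=4, stride=4)
        # stand-in for the CLIP-style global attention stage
        self.global_stage = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, patch_tokens, grid):         # patch_tokens: (B, H*W, dim)
        b, n, d = patch_tokens.shape
        h, w = grid
        x = self.local(patch_tokens)                # (B, H*W, dim)
        x = x.transpose(1, 2).reshape(b, d, h, w)   # back onto the 2D grid
        x = self.compress(x)                        # (B, dim, H/4, W/4): 16x fewer tokens
        x = x.flatten(2).transpose(1, 2)            # (B, H*W/16, dim)
        return self.global_stage(x)                 # global attention on the small set

enc = ToyDeepEncoder()
patches = torch.randn(1, 32 * 32, 768)              # 1024 high-res patch tokens
print(enc(patches, grid=(32, 32)).shape)             # torch.Size([1, 64, 768])
```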

And the results are pretty insane. At a 10x compression ratio, the model can look at the image and "decompress" the original text with about 97% precision. It still gets around 60% accuracy even at a crazy 20x compression. As a bonus, this thing is now a SOTA OCR model: it beats models like MinerU2.0 while using fewer than 800 vision tokens where MinerU needs almost 7,000. It can also parse charts into HTML, read chemical formulas, and handle around 100 languages.

The real kicker is what this means for the future. The authors are basically proposing this as an LLM forgetting mechanism. You could have a super long chat where the recent messages are crystal clear, but older messages get rendered into blurrier, lower-token images. It's a path to unlimited context by letting the model's memory fade, just like a human's.
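A hand-wavy sketch of what that fading memory could look like, just to illustrate the token economics (the halving schedule and every number here are made up, not from the paper): older turns get re-rendered at lower resolution, so they cost fewer vision tokens and keep less detail:

```python
# Made-up illustration of "optical forgetting": older turns are rendered at
# lower resolution, so they occupy fewer vision tokens in the context.
def vision_token_budget(turn_age, base_side=1024, patch=16, compress=16):
    # halve the rendered image's side length every 3 turns of age,
    # flooring at a tiny 64px thumbnail (all numbers illustrative)
    side = max(base_side // (2 ** (turn_age // 3)), 64)
    return (side // patch) ** 2 // compress

for age in range(12):
    print(f"turn age {age:2d}: ~{vision_token_budget(age):3d} vision tokens")
# age 0-2 -> 256 tokens, 3-5 -> 64, 6-8 -> 16, 9-11 -> 4
```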

all 36 comments
[–] SmithrunHills@hexbear.net 47 points 2 months ago (1 children)

Welp, time to watch the AI stocks crash once again

[–] 7bicycles@hexbear.net 21 points 2 months ago (1 children)

I don't see DeepSeek really having much sway over the western AI bubble in the short term. The initial hit was like "oh shit, the backwater hellhole China can do this?" and that shakes investors, but then every government scrambled to just ban its usage because the Chinese are going to steal all your data, and that's that.

See also: Chinese EVs (including, but not only, cars)

[–] peeonyou@hexbear.net 7 points 2 months ago* (last edited 2 months ago) (1 children)

Tariffs are pretty effective against everything China does right now in the US, but once the rest of the world is lapping the US with cheaper, more effective tools and products, that's no longer sustainable.

[–] 7bicycles@hexbear.net 3 points 2 months ago (1 children)

define "rest of the world" here

[–] 7bicycles@hexbear.net 37 points 2 months ago (2 children)

Well, DeepSeek came up with a novel solution to just stop feeding the model text tokens. Instead, you render the text as an image and feed the model the picture. It sounds wild, but the whole point is that a huge wall of text can be "optically compressed" into way, way fewer vision tokens.

This is bullshit, man. This is computer alchemy. I detest this, it should not work.

[–] yogthos@lemmygrad.ml 17 points 2 months ago
[–] SkingradGuard@hexbear.net 3 points 2 months ago

Information is information, I guess

[–] hello_hello@hexbear.net 31 points 2 months ago* (last edited 2 months ago) (3 children)

Well, DeepSeek came up with a novel solution to just stop feeding the model text tokens. Instead, you render the text as an image and feed the model the picture. It sounds wild, but the whole point is that a huge wall of text can be "optically compressed" into way, way fewer vision tokens.

I am impressed that this actually does work. Was this ever done with western models, or is DeepSeek the first to really pioneer it?

Also, this means the DeepSeek service would become even cheaper then; wouldn't this be a death knell for the western AI business model?

[–] yogthos@lemmygrad.ml 23 points 2 months ago

As far as I know this is a completely novel approach, and yeah, this should make DeepSeek cheaper and able to work on large documents or code projects, which is currently a problem for most models. I do expect that western companies will start implementing this idea as well to keep up.

[–] gay_king_prince_charles@hexbear.net 14 points 2 months ago (1 children)

Deepseek is already 40x cheaper than Claude right now. I don't think there is a tipping point here.

[–] hello_hello@hexbear.net 5 points 2 months ago (3 children)

If you can answer, I wonder how far I can go with just $20? Is that like months' worth of constant use? I want to put the price in perspective because it's hard for me to wrap my mind around it.

I bought $2 of tokens in July and use it fairly heavily for my Obsidian setup, nvim, and opencode. Still haven't run out yet.

[–] BountifulEggnog@hexbear.net 5 points 2 months ago* (last edited 2 months ago)

Chat.deepseek.com is free. No paid tier at all, running their best model. API pricing, eh, it depends on use. That's what the 40x refers to.

I don't think it's worth bothering with the 1650 Super. 4GB of VRAM is very little; you could run 4B models, but they are not good for standard use.

edit: API pricing is $0.28 per 1M tokens in and $0.42 per 1M out. Assuming requests are 10k tokens in and 5k out (a lot imo), you'd get about 200 per dollar.

[–] aanes_appreciator@hexbear.net 1 points 2 months ago

Not necessarily cheaper, when we've seen these things just balloon to meet demand. Besides, from my own experience DeepSeek's free models are sorta middle-of-the-road nowadays, but I don't really use LLMs more than is ABSOLUTELY necessary to navigate the slop left behind by other LLMs.

[–] JoeByeThen@hexbear.net 24 points 2 months ago (1 children)

Oh shit, I thought about trying something like that with RNNs years ago when I learned that there were folks doing audio and brainwave processing networks with CNNs. My life blew up and I never got to try it. Nifty!

[–] yogthos@lemmygrad.ml 13 points 2 months ago (1 children)

yeah, it's a really clever trick, always neat when you think of something and then get validated :)

[–] JoeByeThen@hexbear.net 14 points 2 months ago

Lol in my head right now:

skeleton-guns-akimbo I'M WICKED SMAHT!!!

[–] BadTakesHaver@hexbear.net 19 points 2 months ago (1 children)

AI shut down the Amazon servers earlier today, I knew it

[–] cricbuzz@hexbear.net 3 points 2 months ago

they're trying to save us from ourselves. please let the whole internet officially die next time

[–] LangleyDominos@hexbear.net 16 points 2 months ago (1 children)

So this works because a picture is worth a thousand words?

[–] yogthos@lemmygrad.ml 12 points 2 months ago

turns out that's no longer just a metaphor :)

[–] Moidialectica@hexbear.net 12 points 2 months ago (1 children)

I wonder if it can be used with RAG to capture the chunks connected the closest with more clarity, and those with lower scores with less clarity. It wouldn't matter much since a good dataset will make it so the RAG retrieval is almost always accurate, but with worse models it could allow it to pick out the retrievals that are certain, and still keep those that are just a 'maybe'.

[–] yogthos@lemmygrad.ml 8 points 2 months ago (1 children)

Yeah, I imagine it would be relatively easy to track the original text, and then you could use the image encoding to zero in on the concrete part of the context you want to recall. Even if it's fuzzy, it would cut down the amount of search you have to do on retrieval.

[–] Moidialectica@hexbear.net 9 points 2 months ago (1 children)

For me, it's really good, especially the compression. Even Gemini models struggle with 200 thousand tokens, but with DeepSeek OCR it should be possible to input 500k tokens and have it function like it's 50k. This is gonna be helpful once it's properly ready.

[–] yogthos@lemmygrad.ml 7 points 2 months ago

Indeed, I think it'll be really handy for coding tasks as well. It'll be able to load large projects into context and find things in them much more easily now.

[–] NuraShiny@hexbear.net 5 points 2 months ago (4 children)

I really need to just block this sub, because the stupid hype for LLMs is so disgusting it makes my skin crawl.

[–] yogthos@lemmygrad.ml 41 points 2 months ago (1 children)

Why do people feel the need to announce that they're going to block a sub because they don't like what other people are interested in? Just do what you need to do, and let the rest of us enjoy things.

[–] peeonyou@hexbear.net 14 points 2 months ago* (last edited 2 months ago) (1 children)

The constant moaning and whining from people not liking things other people like never gets old.

[–] yogthos@lemmygrad.ml 13 points 2 months ago* (last edited 2 months ago)

truly, it's like people have a protagonist complex

[–] MidnightPocket@hexbear.net 22 points 2 months ago

Hey now this is about more than LLM hype.

This is about DeepSeek crashing the nvidia stocks and causing the AI bubble to pop.

omori-manic

[–] thetaT@hexbear.net 11 points 2 months ago

this is not an airport, no need to announce your departure