Sorry for the slow reply, but I'll piggyback on this thread to say that I tend to target models a little bit smaller than my total VRAM to leave room for a larger context window, without any offloading to RAM.
As an example, with 24 GB of VRAM (NVIDIA RTX 4090) I can typically run a 32B-parameter model with 4-bit quantization and a 40,000-token context entirely on the GPU at around 40 tokens/sec.
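
If it helps anyone reproduce that budget, here's a rough back-of-envelope sketch of why that combination fits in 24 GB. The architecture numbers (layer count, KV heads, head dimension) are my own assumptions for a Qwen-class 32B model with grouped-query attention, not specifics from the comment above, and the quantization overheads are approximations:

```python
# Back-of-envelope VRAM estimate: quantized weights + KV cache.
# All architecture numbers are assumptions for a hypothetical 32B model
# with grouped-query attention; adjust them for your actual model.

def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: float) -> float:
    """Approximate KV-cache size in GiB (keys + values for every layer)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 2**30

if __name__ == "__main__":
    # ~4-bit quant; 4.5 bits/weight approximates scale/zero-point overhead
    w = weights_gib(n_params=32e9, bits_per_weight=4.5)
    # Assumed GQA layout (64 layers, 8 KV heads, head_dim 128), 8-bit KV cache
    kv = kv_cache_gib(n_layers=64, n_kv_heads=8, head_dim=128,
                      context_len=40_000, bytes_per_elem=1)
    print(f"weights ~{w:.1f} GiB, KV cache ~{kv:.1f} GiB, "
          f"total ~{w + kv:.1f} GiB")
```

Under those assumptions you get roughly 17 GiB of weights plus about 5 GiB of KV cache, i.e. around 22 GiB total, which leaves a bit of headroom on a 24 GB card. With an fp16 KV cache instead of 8-bit, the cache alone would roughly double, which is why KV-cache quantization (or a smaller context) matters at this size.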