I might be wrong, but this sounds like a quick way to make the web worse: it puts a huge computational load on your machine, all for the sake of privacy in customer service chatbots that nobody wants. Please correct me if I’m wrong.
WebLLM is a high-performance in-browser LLM inference engine that brings language model inference directly onto web browsers with hardware acceleration. Everything runs inside the browser with no server support and is accelerated with WebGPU.
WebLLM is fully compatible with the OpenAI API. That is, you can use the same OpenAI API with any open-source model running locally, with functionality including streaming, JSON mode, function calling (WIP), etc.
This opens up a lot of fun opportunities to build AI assistants for everyone, preserving privacy while still benefiting from GPU acceleration.
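For anyone wondering what "the same OpenAI API" means in practice, here is a rough sketch of the streaming call shape. It is not taken from the WebLLM docs: the `ChatEngine` interface, `buildRequest`, and `streamReply` are hypothetical names I made up to mirror the OpenAI-style `chat.completions.create` call, and the types are a simplified assumption about the chunk format.

```typescript
// Simplified, assumed shapes mirroring the OpenAI chat-completions streaming format.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type ChatRequest = { messages: ChatMessage[]; stream: true };
type ChatChunk = { choices: { delta: { content?: string } }[] };

// Hypothetical interface: anything exposing an OpenAI-style streaming call.
interface ChatEngine {
  chat: {
    completions: {
      create(req: ChatRequest): Promise<AsyncIterable<ChatChunk>>;
    };
  };
}

// Build a single-turn streaming request for the given prompt.
function buildRequest(prompt: string): ChatRequest {
  return { messages: [{ role: "user", content: prompt }], stream: true };
}

// Consume the stream and concatenate the text deltas into one reply.
async function streamReply(engine: ChatEngine, prompt: string): Promise<string> {
  const chunks = await engine.chat.completions.create(buildRequest(prompt));
  let text = "";
  for await (const chunk of chunks) {
    text += chunk.choices[0]?.delta.content ?? "";
  }
  return text;
}

// In a WebGPU-capable browser, the engine would presumably come from the
// library itself, along these lines (check the WebLLM README before relying on it):
//   import { CreateMLCEngine } from "@mlc-ai/web-llm";
//   const engine = await CreateMLCEngine("<model-id>");
//   const reply = await streamReply(engine as unknown as ChatEngine, "Hello!");
```

The point is that the calling code looks the same as it would against a hosted OpenAI endpoint; the only difference is that the model weights are downloaded and run on the user's GPU, which is exactly the load the comment at the top is worried about.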