From centralized web servers to Content Delivery Networks, the digital landscape has shifted towards a more distributed approach. Next.js and Vercel Edge exemplify this, showing how web services can be delivered globally with remarkable speed. The essence? Databases, files, and servers are no longer confined to a single data center; they're replicated close to you, making web page responses feel nearly instantaneous.
Yet AI, especially large language models (LLMs) like ChatGPT, presents a stark contrast. These models, hungry for vast computing power, are tucked away in massive centralized server farms, bringing unavoidable network latency and heavy energy consumption.
In the age of smart devices, the demand for instant AI responses continues to grow. The solution? Smaller LLMs, stationed close to users. A range of diverse efforts is pioneering AI at the edge:
- Apple's MLX framework lets you run large models efficiently on Apple Silicon chips (see the sketch after this list).
- Microsoft's Phi-2 showed surprisingly strong performance despite being only a 2.7-billion-parameter model.
- The Mamba architecture replaces the Transformer's quadratic attention with a selective state-space model whose cost scales linearly with sequence length.
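To make the first point concrete, here is a minimal sketch of running a toy model with the MLX Python package on Apple Silicon. The model, layer sizes, and names are invented for illustration; the calls assume MLX's `mlx.core` and `mlx.nn` modules and are a sketch, not official example code.

```python
# A minimal sketch: a tiny two-layer MLP evaluated lazily with MLX
# on Apple Silicon's unified memory. Sizes and names are illustrative.
import mlx.core as mx
import mlx.nn as nn


class TinyMLP(nn.Module):
    """Toy stand-in for a small on-device model."""

    def __init__(self, dim: int = 64, hidden: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def __call__(self, x):
        return self.fc2(nn.relu(self.fc1(x)))


model = TinyMLP()
x = mx.random.normal((8, 64))   # small batch of dummy inputs
y = model(x)                    # MLX builds a lazy compute graph
mx.eval(y)                      # materialize the result on-device
print(y.shape)
```

The point of interest is the last two lines: MLX evaluates lazily and keeps arrays in memory shared between CPU and GPU, which is part of what makes local inference on a laptop practical.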
Inference optimization techniques like quantization, distillation, and Mixture of Experts (MoE) routing are also shrinking the compute and memory these models need at inference time.
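As a rough illustration of the first of these, here is a minimal sketch of symmetric 8-bit weight quantization in plain NumPy. The single per-tensor scale and the helper names are simplifications for illustration, not taken from any particular library.

```python
# A minimal sketch of symmetric int8 weight quantization (illustrative only).
import numpy as np


def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 plus a single scale factor."""
    scale = np.max(np.abs(weights)) / 127.0              # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for computation."""
    return q.astype(np.float32) * scale


w = np.random.randn(4, 4).astype(np.float32)             # dummy layer weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.max(np.abs(w - w_hat)))       # small reconstruction error

# Storing q (1 byte per weight) instead of w (4 bytes per weight) cuts memory
# roughly 4x, which is what lets larger models fit on phones and laptops.
```

Distillation and MoE work on different axes: distillation trains a smaller model to imitate a larger one, while MoE activates only a subset of parameters per token, so the cost of a forward pass is far below the model's total size.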
This shift mirrors the computing revolution that began in the 1960s, when room-sized machines gave way to the devices we now carry, built on 3 nm chips. Today's bulky GPU datacenters may follow the same path, paving the way for running sophisticated models directly on your smartphone.
We're at the dawn of a new era for AI, one in which LLMs become not only integral to our daily lives but also physically closer, making technology more personal, responsive, and energy-efficient. Welcome to the future of AI at the edge: a fusion of immediacy, innovation, and intimacy.