The “Edge” of AI

Clément Thiriet February 24, 2024

From centralized web servers to Content Delivery Networks (CDNs), the digital landscape has shifted towards a more distributed approach. NextJS and Vercel Edge are great examples of this, showcasing how web services can be delivered globally with unparalleled speed. How? Databases, files, and servers aren't centralized; they're local to you.

Yet Large Language Models (LLMs) like ChatGPT present a stark contrast. These models are hungry for computing power and run in huge centralized server farms, leading to inevitable latency and heavy energy consumption.

In the age of smart devices, the demand for instant AI responses keeps growing. The solution? Smaller LLMs, served close to users. Diverse efforts are already pioneering AI at the edge:

  • Apple's MLX framework lets you run large models efficiently on Apple Silicon chips (see the sketch after this list).
  • Microsoft's Phi-2 showcased surprising performance despite having only 2.7 billion parameters.
  • The Mamba architecture does away with the quadratic complexity of Transformer attention, using a state-space model that scales linearly with sequence length.
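
To give a taste of how close this already is, here is a minimal sketch using the mlx-lm helper package that builds on MLX. The model name is illustrative (any MLX-converted checkpoint from the community hub would do); after the weights are downloaded once, everything runs on-device on a Mac:

```python
# A minimal sketch, assuming `pip install mlx-lm` on an Apple Silicon Mac.
# "mlx-community/phi-2" is an illustrative model name; substitute any
# model converted to MLX format.
from mlx_lm import load, generate

# Fetches the weights once, then inference runs entirely on-device.
model, tokenizer = load("mlx-community/phi-2")

response = generate(
    model,
    tokenizer,
    prompt="Why run language models at the edge?",
    max_tokens=100,
)
print(response)
```

No round trip to a data center: the tokens are produced by the laptop's own GPU, which is exactly the latency and privacy story edge AI is after.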

Inference optimization techniques are also shrinking the footprint of these models: quantization stores weights in fewer bits, distillation trains a small model to mimic a large one, and Mixture of Experts (MoE) activates only a fraction of the parameters for each token.
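
To make the quantization idea concrete, here is a toy sketch of symmetric 8-bit quantization in NumPy. Real systems use more elaborate schemes (per-group scales, 4-bit formats), but the principle is the same: trade a little precision for a much smaller model.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 weights to int8 plus a single scale factor."""
    scale = float(np.abs(w).max()) / 127.0     # largest weight maps to +/-127
    q = np.round(w / scale).astype(np.int8)    # 4x smaller than float32
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate float weights recovered at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)   # one layer's weights
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"{w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB, "
      f"mean abs error {err:.5f}")
```

A single matrix drops from 64 MiB to 16 MiB with only a small reconstruction error; applied across every layer, that is often the difference between a model that fits on a phone and one that doesn't.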

This shift mirrors the computing revolution that began in the 1960s, when room-sized mainframes evolved into the 3nm chips in today's devices. Today's bulky GPU data centers are headed the same way, making room for advanced models that run right on your smartphone.

We're entering a new era of AI in which LLMs become an essential part of our lives and sit physically closer to us, making technology more personal, faster, and greener. Welcome to the future of AI at the edge: a mix of speed, innovation, and privacy.