The “Edge” of AI

Clément Thiriet February 24, 2024

From centralized web servers to Content Delivery Networks (CDNs), the digital world has shifted towards a more distributed approach. Next.js and Vercel Edge are great examples of this, showcasing how web services can be delivered globally with speed. How? Databases, files and servers aren't centralized; they're replicated close to you.

However, Large Language Models (LLMs) like ChatGPT are hungry for computing power and run in huge data centers, leading to inevitable latency and energy consumption.

In the age of smart devices, the demand for instant AI responses keeps growing. The solution? Smaller LLMs, served close to users.

Lots of diverse efforts are pioneering AI at the edge:

  • Apple's MLX framework lets you run large models efficiently on Apple Silicon chips.
  • Microsoft's Phi-2 showed surprising performance despite being only a 2.7-billion-parameter model.
  • The Mamba architecture attempts to get rid of the quadratic complexity of Transformer attention.

Inference optimization techniques like quantization, distillation, and Mixture of Experts (MoE) make this shift possible.
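Quantization, for instance, trades a little precision for a large memory saving: storing weights as 8-bit integers instead of 32-bit floats shrinks a model by 4x. Here is a minimal sketch of symmetric per-tensor int8 quantization in NumPy (the helper names are illustrative, not from any particular library):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights symmetrically onto the int8 range [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 values and the scale."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# int8 storage is 4x smaller than float32, and the rounding
# error per weight is bounded by half the quantization step.
print(np.abs(weights - restored).max() <= scale / 2 + 1e-6)
```

Production schemes (e.g. per-channel scales, 4-bit formats) are more elaborate, but the core idea is the same: fewer bits per weight means less memory and faster inference on small devices.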

This shift mirrors the computing revolution that began in the 1960s, when room-sized computers evolved into today's 3nm chips. We're entering a new era of AI where LLMs sit physically closer to us, making technology more personal and faster.