Apple and NVIDIA Collaborate on ‘ReDrafter’ Technique to Boost Large Language Model Speed
Apple has teamed up with NVIDIA in a partnership aimed at speeding up Large Language Models (LLMs). Apple developed and open-sourced a technique called Recurrent Drafter (ReDrafter) that significantly accelerates text generation while reducing latency and power usage, and the two companies have now brought it to NVIDIA GPUs.
ReDrafter: A Breakthrough in LLM Inference
ReDrafter is a speculative decoding technique built on two core ideas: a small recurrent draft model proposes candidate token sequences using beam search, and the main model verifies those candidates in parallel using tree attention, accepting several tokens per forward pass. After rigorous internal testing, Apple collaborated with NVIDIA to integrate ReDrafter into TensorRT-LLM, NVIDIA's inference acceleration framework, making the technique available to developers running LLMs on NVIDIA GPUs.
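To make the drafting half of that loop concrete, here is a minimal, illustrative PyTorch sketch. Everything in it (TinyDraftRNN, draft_beams, the vocabulary and beam sizes) is a hypothetical stand-in rather than Apple's or NVIDIA's actual code: a tiny recurrent model cheaply proposes a beam of candidate continuations, which the large target model would then verify in a single pass.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN, BEAMS, DRAFT_LEN = 1000, 64, 4, 5

class TinyDraftRNN(nn.Module):
    """A small recurrent head that drafts tokens cheaply between calls
    to the large target model (a stand-in for ReDrafter's draft model)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.cell = nn.GRUCell(HIDDEN, HIDDEN)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def step(self, token, h):
        h = self.cell(self.embed(token), h)          # one cheap RNN step
        return torch.log_softmax(self.out(h), -1), h

def draft_beams(drafter, last_token, h0):
    """Beam search over the drafter: keep the BEAMS highest-scoring
    candidate continuations of length DRAFT_LEN."""
    beams = [(0.0, [], last_token, h0)]              # (score, tokens, tok, state)
    for _ in range(DRAFT_LEN):
        expanded = []
        for score, seq, tok, h in beams:
            logp, h2 = drafter.step(tok, h)
            top_lp, top_ix = logp[0].topk(BEAMS)
            for lp, ix in zip(top_lp, top_ix):
                expanded.append((score + lp.item(), seq + [ix.item()],
                                 ix.view(1), h2))
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:BEAMS]
    return [seq for _, seq, _, _ in beams]           # BEAMS drafts to verify

drafter = TinyDraftRNN()
drafts = draft_beams(drafter, torch.tensor([42]), torch.zeros(1, HIDDEN))
print(drafts)  # e.g. 4 candidate continuations of 5 tokens each
```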
According to Apple's blog post:
"ReDrafter achieves state-of-the-art performance, delivering up to a 2.7x speed-up in token generation per second for greedy decoding in production models. This results in reduced latency, lower GPU usage, and significant power savings, making it an ideal choice for scaling LLM deployments."
Features of ReDrafter
Speed: Up to 2.7x more tokens generated per second under greedy decoding.
Efficiency: Fewer GPUs and less power needed for the same workload.
Versatility: Integrated into NVIDIA's TensorRT-LLM framework, so developers serving LLMs on NVIDIA GPUs can adopt it directly.
Implications of the Collaboration
This unexpected collaboration underscores a shared goal between Apple and NVIDIA: pushing the boundaries of AI inference performance. While Apple traditionally relies on its custom silicon for AI tasks, this joint effort shows how even a short-term partnership can accelerate innovation. Given the historically strained relationship between the two companies, however, a long-term collaboration remains unlikely.
For now, ReDrafter is a promising advancement for both researchers and developers working with large-scale AI models, particularly those looking to optimize performance on NVIDIA's GPU platforms.