Since the release of BERT by Google in 2018, the Transformer architecture has taken over machine learning thanks to its 𝗮𝘁𝘁𝗲𝗻𝘁𝗶𝗼𝗻 𝗺𝗲𝗰𝗵𝗮𝗻𝗶𝘀𝗺, which lets the model focus on the important parts of the input. But 𝙖𝙩𝙩𝙚𝙣𝙩𝙞𝙤𝙣 𝙘𝙤𝙢𝙥𝙪𝙩𝙖𝙩𝙞𝙤𝙣 𝙞𝙨 𝙦𝙪𝙖𝙙𝙧𝙖𝙩𝙞𝙘 𝙞𝙣 𝙩𝙝𝙚 𝙞𝙣𝙥𝙪𝙩 𝙡𝙚𝙣𝙜𝙩𝙝.
💫 The Mamba paper, published in December 2023, announced the return of RNNs: the architecture has no attention, but integrates a selection mechanism that should reproduce attention's ability to "focus", with compute requirements that 𝗴𝗿𝗼𝘄 𝗼𝗻𝗹𝘆 𝗹𝗶𝗻𝗲𝗮𝗿𝗹𝘆 𝗶𝗻 𝗶𝗻𝗽𝘂𝘁 𝗹𝗲𝗻𝗴𝘁𝗵! 🤔 Would this work? Until now, no large Mamba model had matched the performance of attention-based Transformers.
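To make the scaling contrast concrete, here is a toy sketch in plain NumPy (with made-up scalar state dynamics, not Mamba's actual selective-scan kernel): self-attention materializes an n×n score matrix, while a recurrent model only carries a fixed-size state through the sequence.

```python
import numpy as np

def attention(q, k, v):
    # q, k, v: (n, d). The score matrix is (n, n), so time and memory grow as O(n^2).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def recurrent_scan(x, a=0.9, b=0.1, c=1.0):
    # x: (n, d). A fixed-size hidden state is updated once per token,
    # so time grows as O(n) and the state memory stays constant.
    h = np.zeros(x.shape[-1])
    out = np.empty_like(x)
    for t, x_t in enumerate(x):
        h = a * h + b * x_t   # state update
        out[t] = c * h        # readout
    return out

x = np.random.randn(16, 4)
_ = attention(x, x, x)
_ = recurrent_scan(x)
```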
💥 But now it's done! A (Mamba + Transformers) hybrid just beat Transformers!
The AI21 Labs team just released Jamba. They insert a few Transformer layers to inject some attention into a big stack of Mamba layers, thus getting the best of both worlds.
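If you want to try it, here is a minimal loading sketch with Hugging Face transformers, assuming the checkpoint is the one published on the Hub as "ai21labs/Jamba-v0.1" (and glossing over the hardware needed to host a 52B-parameter model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"  # assumed Hub ID of the released checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # shard across available GPUs (requires accelerate)
    trust_remote_code=True,     # may be needed if your transformers version has no native Jamba support
)

inputs = tokenizer("Mamba and Transformers differ in that", return_tensors="pt").to(model.device)
outputs = model.generate(inputs["input_ids"], max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```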
𝙏𝙇;𝘿𝙍:
🏗️ 𝗡𝗲𝘄 𝗠𝗼𝗘 𝗮𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲: 4 Jamba blocks, each made of 7 Mamba layers for every 1 Transformer layer (see the layout sketch after this list).
🏋️ 𝟱𝟮𝗕 𝗽𝗮𝗿𝗮𝗺𝗲𝘁𝗲𝗿𝘀, 𝟭𝟮𝗕 𝗮𝗰𝘁𝗶𝘃𝗲 𝗮𝘁 𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲: this reduction comes from Mixture of Experts, similar to Mixtral (47B parameters, 13B active).
🏎️ 𝗦𝗽𝗲𝗲𝗱: 𝘅𝟯 𝘁𝗵𝗿𝗼𝘂𝗴𝗵𝗽𝘂𝘁. Jamba is much faster than similar-sized Transformer models on long contexts.
📏 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗹𝗲𝗻𝗴𝘁𝗵: 𝟭𝟰𝟬𝗞 𝘁𝗼𝗸𝗲𝗻𝘀 fit on a single 80GB A100!
💪 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲: 𝘀𝘁𝗮𝘁𝗲-𝗼𝗳-𝘁𝗵𝗲-𝗮𝗿𝘁 𝗳𝗼𝗿 𝘁𝗵𝗶𝘀 𝘀𝗶𝘇𝗲. The small injection of attention seems sufficient: Jamba beats the open-source reference Mixtral-8x7B on many benchmarks!
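To visualize how little attention is actually involved, here is a tiny schematic of the stack implied by these numbers. It is only a sketch, not AI21's code: the position of the attention layer inside a block and the "MoE every other layer" placement are assumptions for illustration.

```python
# 4 Jamba blocks x 8 layers per block, at a 1:7 attention-to-Mamba ratio.
NUM_BLOCKS = 4
LAYERS_PER_BLOCK = 8          # 1 attention layer + 7 Mamba layers
ATTENTION_INDEX = 3           # illustrative placement within the block (assumption)
MOE_EVERY = 2                 # assume MoE replaces the dense MLP in every other layer

def jamba_layer_plan():
    plan = []
    for _ in range(NUM_BLOCKS):
        for i in range(LAYERS_PER_BLOCK):
            mixer = "attention" if i == ATTENTION_INDEX else "mamba"
            mlp = "moe" if i % MOE_EVERY == 1 else "dense_mlp"
            plan.append((mixer, mlp))
    return plan

plan = jamba_layer_plan()
print(len(plan))                                        # 32 layers in total
print(sum(mixer == "attention" for mixer, _ in plan))   # only 4 attention layers
```

Under these numbers, only 4 of the 32 layers use attention, which is the "small injection of attention" mentioned above.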