IBM Think 2025: Download a Sneak Peek of the Next Gen Granite Models
At IBM Think 2025, IBM announced Granite 4.0 Tiny Preview, a preliminary version of the smallest model in the upcoming Granite 4.0 family of language models, to the open source community. IBM Granite models are a series of AI foundation models. Initially intended for use in IBM's cloud-based data and generative AI platform watsonx alongside other models, some of the Granite code models have since been open-sourced by IBM. IBM Granite models are trained on datasets curated from the Internet, academic publications, code datasets, and legal and finance documents. The following is based on the IBM Think news announcement.

At FP8 precision, Granite 4.0 Tiny Preview is extremely compact and compute-efficient: several concurrent sessions performing long-context (128K) tasks can run on consumer-grade GPUs.

Though the model is only partially trained (it has seen only 2.5T of a planned 15T or more training tokens), it already offers performance rivaling that of IBM Granite 3.3 2B Instruct despite fewer active parameters and a roughly 72% reduction in memory requirements. IBM anticipates Granite 4.0 Tiny's performance to be on par with Granite 3.3 8B Instruct by the time it has completed training and post-training.

Granite 4.0 Tiny performance compared to Granite 3.3 2B Instruct. (Source: IBM)

As its name suggests, Granite 4.0 Tiny will be among the smallest offerings in the Granite 4.0 model family. It will be officially released this summer as part of a model lineup that also includes Granite 4.0 Small and Granite 4.0 Medium. Granite 4.0 continues IBM's commitment to making efficiency and practicality the cornerstone of its enterprise LLM development.

This preliminary version of Granite 4.0 Tiny is now available on Hugging Face under a standard Apache 2.0 license. IBM intends to allow GPU-poor developers to experiment and tinker with the model on consumer-grade GPUs. The model's novel architecture is pending support in Hugging Face transformers and vLLM, which IBM anticipates will be completed shortly for both projects. Official support to run this model locally through platform partners, including Ollama and LMStudio, is expected in time for the full model release later this summer.

Enterprise Performance on Consumer Hardware

IBM also notes that LLM memory requirements are often provided, literally and figuratively, without proper context. It's not enough to know that a model can be successfully loaded into your GPU(s): you need to know that your hardware can handle the model at the context lengths your use case requires.

Furthermore, many enterprise use cases entail not a lone model deployment but batch inferencing of multiple concurrent instances. Therefore, IBM endeavors to measure and report memory requirements with long context and concurrent sessions in mind.

In that respect, IBM believes Granite 4.0 Tiny is one of today's most memory-efficient language models. Even at very long contexts, several concurrent instances of Granite 4.0 Tiny can easily run on a modest consumer GPU.

Granite 4.0 Tiny memory requirements vs. other popular models. (Source: IBM)
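To make that framing concrete, here is a rough back-of-the-envelope estimate of the KV-cache a conventional transformer accumulates at a given context length and number of concurrent sessions. This is an illustrative sketch only, not IBM's benchmarking methodology, and the layer counts and dimensions below are placeholder assumptions rather than Granite specifications.

```python
# Rough KV-cache estimate for a conventional transformer at serving time.
# Illustrative only: the dimensions below are placeholder assumptions, not
# Granite 4.0 specifications, and real inference servers add further overhead.

def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 context_len: int, concurrent_sessions: int,
                 bytes_per_value: int = 1) -> float:
    """Bytes = 2 (K and V) x layers x KV heads x head dim x tokens x sessions x precision."""
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_len * concurrent_sessions * bytes_per_value)
    return total_bytes / 1024 ** 3

# A hypothetical mid-size transformer served at FP8 (1 byte per cached value):
# 32 layers, 8 KV heads of dimension 128, 128K context, 4 concurrent sessions.
print(f"{kv_cache_gib(32, 8, 128, 128_000, 4):.2f} GiB of KV-cache alone")  # ~31.25 GiB
```

Even a model that loads comfortably at short context can overwhelm a consumer GPU once context length and concurrency multiply together, which is exactly the regime IBM's memory figures target.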
An All-new Hybrid MoE Architecture

Whereas prior generations of Granite LLMs used a conventional transformer architecture, all models in the Granite 4.0 family use a new hybrid Mamba-2/transformer architecture, marrying the speed and efficiency of Mamba with the precision of transformer-based self-attention. Granite 4.0 Tiny Preview is a fine-grained hybrid mixture of experts (MoE) model with 7B total parameters and only 1B active parameters at inference time.

Many of the innovations informing the Granite 4.0 architecture arose from IBM Research's collaboration with the original Mamba creators on Bamba, an experimental open-source hybrid model whose successor (Bamba v2) was released earlier this week.

A Brief History of Mamba Models

Mamba is a type of state space model (SSM) introduced in 2023, about six years after the debut of transformers in 2017.

SSMs are conceptually similar to the recurrent neural networks (RNNs) that dominated natural language processing (NLP) in the pre-transformer era. They were originally designed to predict the next state of a continuous sequence (like an electrical signal) using only information from the current state, the previous state, and the range of possibilities (the state space). Though they have been used across several domains for decades, SSMs share certain shortcomings with RNNs that, until recently, limited their potential for language modeling.

Unlike the self-attention mechanism of transformers, conventional SSMs have no inherent ability to selectively focus on or ignore specific pieces of contextual information. So in 2023, Carnegie Mellon's Albert Gu and Princeton's Tri Dao introduced a type of structured state space sequence (S4) neural network that adds a selection mechanism and a scan method (for computational efficiency), abbreviated as an S6 model, and achieved language modeling results competitive with transformers. They nicknamed their model Mamba because, among other reasons, all of those S's sound like a snake's hiss.

In 2024, Gu and Dao released Mamba-2, a simplified and optimized implementation of the Mamba architecture. Equally importantly, their technical paper fleshed out the compatibility between SSMs and self-attention.

Mamba-2 vs. Transformers

Mamba's major advantages over transformer-based models center on efficiency and speed.

Transformers have a crucial weakness: the computational requirements of self-attention scale quadratically with context. In other words, each time your context length doubles, the attention mechanism doesn't just use double the resources; it uses quadruple the resources. This quadratic bottleneck increasingly throttles speed and performance as the context window (and corresponding KV-cache) grows.

Conversely, Mamba's computational needs scale linearly: if you double the length of an input sequence, Mamba uses only double the resources. Whereas self-attention must repeatedly compute the relevance of every previous token to each new token, Mamba simply maintains a condensed, fixed-size summary of the prior context. As the model reads each new token, it determines that token's relevance, then updates (or doesn't update) the summary accordingly. Essentially, whereas self-attention retains every bit of information and then weights the influence of each piece based on its relevance, Mamba selectively retains only the relevant information.

While transformers are more memory-intensive and computationally redundant, their approach has its own advantages. For instance, research has shown that transformers still outpace Mamba and Mamba-2 on tasks requiring in-context learning (such as few-shot prompting), copying, or long-context reasoning.
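The contrast between the two scaling behaviors can be illustrated with a toy recurrence: a fixed-size summary that is selectively updated once per token, versus attention that revisits every prior token. This is a deliberately simplified sketch of the idea, not the actual Mamba-2 selective-scan implementation; the gating rule and dimensions are invented purely for illustration.

```python
import numpy as np

# Toy illustration of a fixed-size, selectively updated state (linear in
# sequence length). NOT the real Mamba-2 selective scan; the gate below is
# invented to show the idea of "keep or fold in" each new token.

d_state = 16                                    # size of the condensed summary (fixed)
rng = np.random.default_rng(0)
tokens = rng.standard_normal((1000, d_state))   # a hypothetical token stream

state = np.zeros(d_state)
for x in tokens:                                # one pass: O(sequence length) work
    gate = 1.0 / (1.0 + np.exp(-x.mean()))      # "how relevant is this token?"
    state = (1.0 - gate) * state + gate * x     # selectively update the summary

# Self-attention, by contrast, scores every prior token for each new token,
# so compute grows with sequence_length ** 2 and the KV-cache grows with length.
print(state.shape)                              # stays (16,) no matter how long the input is
```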
The Best of Both Worlds

Fortunately, the respective strengths of transformers and Mamba are not mutually exclusive. In the original Mamba-2 paper, authors Dao and Gu suggest that a hybrid model could exceed the performance of a pure transformer or SSM, a notion validated by Nvidia research from last year. To explore this further, IBM Research collaborated with Dao and Gu themselves, along with the University of Illinois Urbana-Champaign (UIUC)'s Minjia Zhang, on Bamba and Bamba V2. Bamba, in turn, informed many of the architectural elements of Granite 4.0.

The Granite 4.0 MoE architecture employs nine Mamba blocks for every one transformer block. In essence, the selectivity mechanisms of the Mamba blocks efficiently capture global context, which is then passed to transformer blocks that enable a more nuanced parsing of local context. The result is a dramatic reduction in memory usage and latency with no apparent tradeoff in performance.

Granite 4.0 Tiny doubles down on these efficiency gains by implementing them within a compact, fine-grained mixture of experts (MoE) framework comprising 7B total parameters and 64 experts, yielding 1B active parameters at inference time. Further details are available in Granite 4.0 Tiny Preview's Hugging Face model card.

Unconstrained Context Length

One of the more tantalizing aspects of SSM-based language models is their theoretical ability to handle infinitely long sequences. However, due to practical constraints, the word "theoretical" typically does a lot of heavy lifting.

One of those constraints, especially for hybrid SSM models, comes from the positional encoding (PE) used to represent information about the order of words. PE adds computational steps, and research has shown that models using PE techniques such as rotary positional encoding (RoPE) struggle to generalize to sequences longer than those they have seen in training.

The Granite 4.0 architecture uses no positional encoding (NoPE). IBM's testing convincingly demonstrates that this has had no adverse effect on long-context performance.

At present, IBM has already validated Tiny Preview's long-context performance for at least 128K tokens and expects to validate similar performance on significantly longer context lengths by the time the model has completed training and post-training. It's worth noting that a key challenge in definitively validating performance on tasks in the neighborhood of 1M-token context is the scarcity of suitable datasets.

The other practical constraint on Mamba context length is compute. Linear scaling is better than quadratic scaling, but it still adds up eventually. Here again, Granite 4.0 Tiny has two key advantages. Unlike PE, NoPE adds no additional computational burden to the attention mechanism in the model's transformer layers. And Granite 4.0 Tiny is extremely compact and efficient, leaving plenty of hardware headroom for linear scaling.

Put simply, the Granite 4.0 MoE architecture itself does not constrain context length. It can go as far as your hardware resources allow.
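For readers who want to experiment with the preview, a minimal loading sketch might look like the following once Hugging Face transformers support for the new architecture lands. The repository ID and settings shown here are assumptions for illustration; check the model card on Hugging Face for the exact identifier and required transformers version, since the article notes that support is still pending.

```python
# Minimal sketch of loading the preview with Hugging Face transformers.
# Assumptions: the repo ID and settings below are illustrative; per the article,
# transformers/vLLM support for the new hybrid architecture is still pending,
# so a sufficiently recent (or pre-release) transformers build may be required.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-tiny-preview"  # assumed repo ID; verify on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "Summarize the key ideas behind hybrid Mamba/transformer language models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```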
What's Happening Next

IBM expressed its excitement about continuing pre-training Granite 4.0 Tiny, given such promising results so early in the process. It is also excited to apply the lessons learned from post-training Granite 3.3, particularly regarding reasoning capabilities and complex instruction following, to the new models.

More information about new developments in the Granite series was presented at IBM Think 2025, with more to follow in the coming weeks and months. You can find Granite 4.0 Tiny Preview on Hugging Face.

This article is based on the IBM Think news announcement authored by Kate Soule, Director, Technical Product Management, Granite, and Dave Bergmann, Senior Writer, AI Models at IBM.