Better Attention is All You Need

Large language models (LLMs) have continued to impress over the last few years, especially in the last year or so. While performance, model size, architecture, and data sets have all seen massive improvements, one thing has remained relatively unchanged during this time: the maximum context length. Most models right now sit around 16,385 tokens (GPT-4-turbo-preview / gpt-3.5-turbo), some go up to around 21K (with Google facing serious criticism over this for Gemini Pro), and very few exceed that limit.
Initially, when large language models first emerged, the context length was much smaller, often around 512 tokens. However, while models and training techniques have advanced rapidly since then, context length has grown comparatively little. While 16K-ish tokens is substantial, it can still feel limiting for tasks that require processing a large amount of data without fine-tuning the model.
For tasks like responding to emails or tweets, writing code, or parsing research papers, the 16,385-token limit, or even the 21K limit, can be restrictive. Techniques like summarizing and vectorizing information can help work around the limitation, but a truly effective solution has yet to appear.
Some models, such as the open-source MPT family, like MPT-30B (up to 16,000 tokens) or MPT StoryWriter (65,000+ tokens), offer larger context windows. However, not all context windows are created equal, and there are reasons why even advanced models like ChatGPT are released with limits of up to 16K tokens.
As attention is stretched out in models, three major issues arise. The first is whether the attention can fit into available GPU memory at all. The answer is binary: either it fits or it doesn't. As the context size increases, GPU memory is taxed further, making it crucial to balance context size against available memory.
The attention computation itself is the biggest contributor here: its memory footprint grows rapidly with context length, so the larger the context, the larger the GPU memory requirement and the more significant the issue becomes.
The second issue is whether you are prepared for the quadratically increasing processing time that comes with more tokens in context. Self-attention compares every token with every other token, so compute grows with the square of the context length, and that directly affects both training and inference time, which in turn affects the model's usability in production.
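To make the scaling concrete, here is a back-of-envelope sketch in Python. The head count and fp16 precision are illustrative assumptions rather than figures for any specific model, and fused kernels like FlashAttention avoid materializing the score matrix entirely, but the quadratic compute remains either way.

```python
# Rough, made-up numbers to illustrate quadratic growth: memory for the raw
# (n x n) attention score matrices in fp16, across all heads of one layer,
# if they were materialized naively.

BYTES_PER_SCORE = 2  # fp16

def score_matrix_gib(context_len, num_heads=32):
    """Naive per-layer memory for the attention score matrices across all heads."""
    return context_len ** 2 * BYTES_PER_SCORE * num_heads / 1024 ** 3

for n in (2_048, 4_096, 8_192, 16_384, 65_536):
    print(f"{n:>6} tokens -> ~{score_matrix_gib(n):8.2f} GiB per layer")
```

Doubling the context quadruples both the score memory and the compute, which is why each step from 2K to 4K to 8K to 16K hurts more than the last.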
When dealing with large context windows, the quality of the model's output is a crucial consideration. In experiments with MPT-30B and a context window of up to 16,000 tokens, it became questionable how intelligently the model was actually using that context. It is essential to ensure that the model can comprehend and process the information it is given, especially when working with larger contexts.
One way to make a model effectively smarter is to use the context window to provide relevant information for upcoming prompts. Context acts a bit like short-term memory: whatever is placed in it is available for the model to draw on, so it is more likely to retain the important details.
For example, passing the entire LongNet paper into MPT-30B and asking specific questions about it is a simple way to probe how well the model handles a long context.
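A minimal sketch of that kind of experiment is below, assuming the Hugging Face `transformers` library. The checkpoint name, the local text file of the paper, and the prompt template are my own placeholders, not the author's exact setup; MPT checkpoints on the Hub typically need `trust_remote_code=True`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mosaicml/mpt-30b-chat"  # assumed checkpoint; swap in whatever you actually use
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
)

paper_text = open("longnet_paper.txt").read()  # hypothetical local copy of the paper
question = "What complexity does dilated attention claim with respect to sequence length?"

prompt = f"{paper_text}\n\nQuestion: {question}\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print("prompt length in tokens:", inputs.input_ids.shape[1])

output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```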
While a larger context can improve the model's capabilities, it also presents challenges in terms of GPU memory. Even with optimizations like running the model at 8-bit or 4-bit precision, fitting a 5,000-token context window on an 80 GB GPU can be challenging.
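A sketch of that kind of quantized loading is below, assuming `transformers` with `bitsandbytes` installed and the same placeholder checkpoint as above; whether a given custom-code checkpoint plays nicely with 4-bit loading is itself an assumption to verify.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # weights stored in 4-bit
    bnb_4bit_quant_type="nf4",              # NF4 quantization scheme
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-30b-chat",                # assumed checkpoint, as above
    quantization_config=quant_config,
    trust_remote_code=True,
    device_map="auto",
)
# Note: quantization shrinks the weights, but the attention scores and KV cache for a
# long context are activations, so they are largely unaffected by 4-bit weights.
```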
Research from Stanford indicates that information in the middle of a large context window is recalled less reliably than information at the beginning or end, producing a U-shaped pattern of recall quality: the "lost in the middle" problem. (A small harness for reproducing this effect is sketched after the quote below.)
The tendency of Large Language Models (LLMs) to skim over the middle of a text or context provided to them is called "lost in the middle". Information from the beginning and end of a long context is picked up and processed, but information from the middle is effectively missing. This phenomenon is one of the biggest problems of large language models, alongside hallucinations, i.e. made-up information. So far, there is no real solution. Microsoft's researchers, together with scientists from Peking University, have at least come up with an approach that could minimize the problem; however, it requires the model to undergo a kind of second training.
Source: Heise.de
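For readers who want to see the U-shape themselves, here is a tiny, model-agnostic harness: plant one known fact ("needle") at different depths of a long filler context and check whether the model retrieves it. The `ask_model` callable is a placeholder for whatever completion API you use; nothing here comes from the Stanford paper's actual code.

```python
def build_prompt(filler_paragraphs, needle, depth_fraction, question):
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the context."""
    position = int(len(filler_paragraphs) * depth_fraction)
    docs = filler_paragraphs[:position] + [needle] + filler_paragraphs[position:]
    return "\n\n".join(docs) + f"\n\nQuestion: {question}\nAnswer:"

def run_sweep(ask_model, filler_paragraphs, needle, question, expected):
    """Sweep the needle position and record whether the expected answer comes back."""
    results = {}
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_prompt(filler_paragraphs, needle, depth, question)
        answer = ask_model(prompt)
        results[depth] = expected.lower() in answer.lower()
    return results  # the U-shape predicts more True at 0.0 and 1.0 than at 0.5
```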
Better attention is therefore essential for getting the most out of large context windows: GPU memory requirements, processing time, and model quality all have to be addressed before long contexts translate into genuinely more capable and usable models.
From my writing -> A Comprehensive Guide to ‘Attention Is All You Need’ in Transformers
The need for better Attention
The focus here is on the need for better attention when processing large amounts of data. The U-shaped behavior suggests that the core issue lies in the attention mechanism itself rather than in any particular model. Attention works well up to 2K tokens, can be extended to 4K, stretched to 8K, and pushed to its limit at 16K. Beyond that point, quality shows diminishing returns, and the processing speed and memory requirements of handling such large contexts become significant challenges in their own right.
One model that aims to address these challenges is LongNet from Microsoft Research. LongNet is designed to scale up to a billion tokens and potentially encompass the entire internet in its context. While LongNet is not the first of its kind, it represents ongoing efforts to overcome the limitations posed by processing large contexts efficiently.
The LongNet paper from Microsoft makes bold claims about its capabilities, including accommodating a billion-token context length with a comparatively modest computational budget. By utilizing dilated attention and segmenting the context, LongNet keeps the cost of attention close to linear in sequence length, so it can handle much larger contexts at processing speeds comparable to a standard Transformer on far shorter ones. Attention is calculated in parallel within the segments, and the segment outputs are then combined to produce the final result. The ability to adjust segment size and sparsity (the dilation rate) offers a customizable trade-off between coverage and cost.
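To make the mechanism more concrete, here is a toy, single-configuration sketch of dilated attention as I read it: segment the sequence, keep every r-th token within each segment, run ordinary attention on each thinned segment in parallel, and scatter the results back. The real LongNet mixes several segment/dilation configurations and distributes segments across devices; nothing below comes from Microsoft's implementation.

```python
import torch

def dilated_attention(q, k, v, segment_len=4096, dilation=4):
    """Toy dilated attention: q, k, v are (batch, seq_len, dim)."""
    _, n, d = q.shape
    out = torch.zeros_like(q)
    for start in range(0, n, segment_len):            # segments are independent, so this loop parallelizes
        idx = torch.arange(start, min(start + segment_len, n), dilation)
        qs, ks, vs = q[:, idx], k[:, idx], v[:, idx]   # keep every `dilation`-th token of the segment
        scores = qs @ ks.transpose(-2, -1) / d ** 0.5  # dense attention on the thinned segment
        out[:, idx] = torch.softmax(scores, dim=-1) @ vs
    return out  # tokens skipped by the dilation get no update in this toy version

q = k = v = torch.randn(1, 16_384, 64)
print(dilated_attention(q, k, v).shape)  # -> torch.Size([1, 16384, 64])
```

Even in this toy form you can see where the savings come from: each segment only ever attends over `segment_len / dilation` tokens, so the quadratic term never touches the full sequence length.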
Despite the advancements LongNet makes on computational requirements and processing speed, the question remains: does it actually deliver on its promises in terms of functionality and performance? Microsoft compares the perplexity scores of LongNet's dilated attention against a sparse Transformer, and in each case LongNet performs slightly better as the context is increased. Still, this comparison raises two major questions.
Firstly, one common criticism of the paper is that Microsoft only compares against typical Transformers up to 32,000 tokens. Testing with a larger token count would have been valuable, but the challenge lies in finding a suitable baseline model without building a new one solely for that purpose. The practicality of large-context attention, in terms of memory requirements, processing time, and quality, remains a concern. Comparing LongNet against dense attention at truly enormous contexts could provide insight, but extrapolating to extremes like a million tokens seems unrealistic given the current limitations of attention mechanisms.
Secondly, the lost-in-the-middle problem looms when processing a billion tokens. Even with segmentation and optimization, maintaining quality becomes a challenge. Dilated attention, while showing slight improvements in perplexity compared to typical attention, still degrades over relatively small ranges, such as 8K to 32K tokens. If deterioration is already noticeable within this range, achieving quality at the scale of a billion, or even a million, tokens seems questionable.
The concept of dilated attention, which amounts to segmented parallel calculations followed by attention within each segment, raises doubts about its scalability. Attending only to every nth token, or falling back on convolutions, may offer partial solutions, but whether such approaches can genuinely handle a billion tokens remains uncertain. As we contemplate fitting a billion tokens into context, the practicality and effectiveness of dilated attention at that scale warrant further exploration.

Better Attention is All You Need
Consider attention computed over every nth token, where n is on the order of a million. The approach may seem acceptable, but it discards an enormous amount of data, so the information between attended tokens would need to be compressed in some way. It is evident that current attention mechanisms must evolve before any real improvement can be made to the context windows of Transformer-based large language models.
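To put a number on that worry, here is a trivial back-of-envelope calculation (my own arithmetic, not from any paper) of how much of a billion-token context survives simple every-nth-token thinning:

```python
# How much of a billion-token context remains if attention only touches every nth token.
context_len = 1_000_000_000
for stride in (1_000, 100_000, 1_000_000):
    kept = context_len // stride
    print(f"stride {stride:>9}: {kept:>9} tokens kept, {100 * (1 - 1 / stride):.4f}% discarded")
```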
One crucial aspect to consider is the impact of attention on the overall performance and effectiveness of these models. By reevaluating and refining the attention mechanism, it is possible to enhance the model’s ability to process and retain essential information within a given context.
As advancements continue to be made in the field of artificial intelligence and natural language processing, the optimization of attention mechanisms remains a key area of focus. By improving attention strategies, researchers and developers can unlock new possibilities for enhancing the capabilities of large language models.
Thank you for reading. Share your thoughts in the comments below and stay tuned for more insightful content. See you in the next article.