Some of you have asked what the difference is between Danube1 and Danube2 🤔 What's new in version 2 👇

Improved accuracy: the average score on the #HuggingFace Open LLM Leaderboard is up 9 percentage points, making it the top-performing model under 2 billion parameters. On top of the overall gains, users will see:

1️⃣ Improved long-context behavior: By removing the sliding-window attention, we effectively improve retrieval over long contexts. H2O-Danube2 models handle up to 8K tokens of input and output combined (see the config sketch after this list).

2️⃣ Mistral tokenizer: The choice of tokenizer is a crucial aspect of a large language model; it transforms and compresses the input text into the token representations the model consumes. We found that switching to the Mistral tokenizer improves downstream performance while keeping the same vocabulary size of 32,000 (tokenizer sketch below).

3️⃣ Improved filtering: Better filtering and deduplication of the training data, using advanced heuristics as well as machine learning models (GBM and BERT) that predict text quality, with the predictions driving the filtering (filtering sketch below).

4️⃣ Data curation improvements: Significant improvements in the underlying data curation led to a three-stage training of H2O-Danube2. At each stage, we gradually decrease the share of noisy web data in favor of higher-quality data (mixture sketch below).

Learn more: https://lnkd.in/ggV8fw5n

#h2oai #huggingface #LLM #SLM #Mistral
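To see the long-context setup from 1️⃣ for yourself, here is a minimal sketch that inspects the published config. It assumes the model ships a Mistral-style config with a `sliding_window` field, and the model id `h2oai/h2o-danube2-1.8b-base` is an assumption; check the model card for the exact name.

```python
from transformers import AutoConfig

# Assumed Hugging Face id; see the model card for the exact name.
config = AutoConfig.from_pretrained("h2oai/h2o-danube2-1.8b-base")

# Mistral-style configs expose a `sliding_window` field; since Danube2 drops
# the sliding-window approach, we expect it to be unset (full attention).
print("sliding_window:", getattr(config, "sliding_window", None))

# The total context (input + output combined) should cover 8K tokens.
print("max_position_embeddings:", config.max_position_embeddings)
```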
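And the tokenizer point from 2️⃣ in a sketch: loading the tokenizer and confirming the 32,000-entry vocabulary. The same model id assumption applies.

```python
from transformers import AutoTokenizer

# Assumed Hugging Face id; see the model card for the exact name.
tok = AutoTokenizer.from_pretrained("h2oai/h2o-danube2-1.8b-base")

print("vocab size:", tok.vocab_size)  # expected: 32000, unchanged by the switch

# The tokenizer compresses raw text into the token ids the model consumes.
ids = tok("H2O-Danube2 is a small language model.")["input_ids"]
print(len(ids), "tokens:", ids)
```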
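The quality-filtering idea from 3️⃣, sketched with a hypothetical pre-trained classifier. The post does not publish the actual models or cutoff, so `quality_model` and the 0.5 threshold are illustrative only.

```python
# Illustrative only: the actual quality models and threshold are not public.
# Assume `quality_model` is a trained classifier pipeline (e.g. a GBM over
# text features, or a fine-tuned BERT) that accepts raw text and returns the
# probability that a document is high-quality.

def filter_corpus(documents, quality_model, threshold=0.5):
    """Keep only documents the quality model scores above the threshold."""
    kept = []
    for doc in documents:
        score = quality_model.predict_proba([doc])[0][1]  # P(high quality)
        if score >= threshold:
            kept.append(doc)
    return kept
```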
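Finally, the three-stage schedule from 4️⃣ as a sketch. Only the direction comes from the post (noisy web data shrinking each stage in favor of higher-quality data); the mixture weights here are made up for illustration.

```python
import random

# Hypothetical mixture weights: the trend (less noisy web data per stage)
# is from the post, the exact numbers are illustrative.
STAGE_MIX = {
    1: {"web": 0.8, "curated": 0.2},
    2: {"web": 0.5, "curated": 0.5},
    3: {"web": 0.2, "curated": 0.8},
}

def sample_source(stage: int) -> str:
    """Pick a data source for the next batch according to the stage's mix."""
    mix = STAGE_MIX[stage]
    return random.choices(list(mix), weights=list(mix.values()))[0]
```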
H2O.ai has the #1 foundation model on the Hugging Face Open LLM Leaderboard in the <2B range. We are proud to announce that our latest open-weights (#Apache v2.0) small language model, H2O-Danube2-1.8B, has topped the charts on the #HuggingFace Open LLM Leaderboard for the <2B range, surpassing models from Microsoft, Alibaba.com, and Google's Gemma-2B 📈

Open-source H2O-Danube2-1.8B is available worldwide on Hugging Face: https://lnkd.in/dRsWirvn (quick usage sketch below)

https://lnkd.in/dhPmPrFU

#h2oDanube2 #opensource #huggingface #LLM #SLM #AI
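For anyone who wants to try it right away, a minimal generation sketch with 🤗 Transformers. The chat-variant id `h2oai/h2o-danube2-1.8b-chat` is an assumption (check the Hugging Face page linked above for the exact names), and passing a message list to the pipeline requires a recent transformers version with chat-template support.

```python
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="h2oai/h2o-danube2-1.8b-chat",  # assumed id; check the model card
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Recent transformers versions apply the model's chat template to message lists.
messages = [{"role": "user", "content": "Why do small language models matter?"}]
out = generator(messages, max_new_tokens=128)
print(out[0]["generated_text"])
```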