Beating Llama 2 and rivaling GPT-3.5: Stability AI's new models top the open-source large model rankings

Original Source: Heart of the Machine


In the blink of an eye, open-source large models have improved yet again. Do Google and OpenAI really have no moat?

"I just took a 30-minute lunch break, and our field has changed again?" After seeing the latest open source large model rankings, an entrepreneur in the AI field asked his soul.

Leaderboard link:

The "rookies" in the red box above are two large models from Stability AI and CarperAI lab: FreeWilly 1 and FreeWilly 2. Just now, they surpassed the Llama-2-70b-hf released by Meta three days ago, and successfully reached the top of HuggingFace's Open LLM leaderboard.

More strikingly, FreeWilly 2 also beat ChatGPT (GPT-3.5) on several benchmarks, making it the first open-source model that can genuinely compete with GPT-3.5, something even Llama 2 did not achieve.

FreeWilly 1 is built on the original LLaMA 65B base model and was carefully fine-tuned with supervised fine-tuning (SFT) on new synthetic datasets in the standard Alpaca format. FreeWilly 2 is based on the newer LLaMA 2 70B base model.
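The "standard Alpaca format" mentioned here is a fixed prompt template from Stanford's stanford_alpaca project. A minimal sketch of rendering one training record into that template; the sample record below is made up for illustration:

```python
# Standard Alpaca prompt templates (from the stanford_alpaca project); one variant
# for records that carry an "input" field, one for records that do not.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{response}"
)
ALPACA_TEMPLATE_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)

def format_alpaca(record: dict) -> str:
    """Render one synthetic SFT example into an Alpaca-style prompt string."""
    if record.get("input"):
        return ALPACA_TEMPLATE.format(**record)
    return ALPACA_TEMPLATE_NO_INPUT.format(
        instruction=record["instruction"], response=record["response"]
    )

# Hypothetical sample record, not from the actual FreeWilly dataset.
sample = {
    "instruction": "Summarize the following sentence in five words.",
    "input": "FreeWilly 2 topped the Open LLM leaderboard this week.",
    "response": "FreeWilly 2 tops open leaderboard.",
}
print(format_alpaca(sample))
```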

From the blog published by Stability AI, we can see some details of these two new models:

Data Sources

The FreeWilly models' training method draws directly on the approach Microsoft pioneered in its paper "Orca: Progressive Learning from Complex Explanation Traces of GPT-4". FreeWilly's data generation process is similar, but its data sources differ.

FreeWilly's dataset contains 600,000 data points (roughly 10% of the dataset used in the original Orca paper), generated by prompting language models with instructions drawn from the following high-quality instruction datasets created by Enrico Shippole:

  • COT Submix Original
  • NIV2 Submix Original
  • FLAN 2021 Submix Original
  • T0 Submix Original

Using this approach, the researchers generated 500,000 examples with a simpler LLM and another 100,000 with a more capable LLM. To ensure fair comparisons, they carefully screened these datasets and removed examples derived from the evaluation benchmarks. Although the number of training samples is only a tenth of that in the original Orca paper (which greatly reduces the cost and carbon footprint of training), the resulting FreeWilly models perform well on a range of benchmarks, validating the effectiveness of synthetic datasets.
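The decontamination step described above, dropping generated examples that overlap with evaluation benchmarks, can be sketched as a simple n-gram filter. The 8-gram window and the sample data are illustrative assumptions, not Stability AI's actual settings:

```python
# Sketch of benchmark decontamination: drop any generated example that shares
# a long n-gram with an evaluation benchmark prompt. Window size is an assumption.
def ngrams(text: str, n: int = 8) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(examples, benchmark_texts, n: int = 8):
    """Keep only the examples that share no n-gram with any benchmark text."""
    contaminated = set()
    for t in benchmark_texts:
        contaminated |= ngrams(t, n)
    return [ex for ex in examples if not (ngrams(ex, n) & contaminated)]

# Made-up benchmark prompt and generated examples for illustration.
benchmark = ["What is the boiling point of water at sea level in degrees Celsius?"]
generated = [
    "What is the boiling point of water at sea level in degrees Celsius? It is 100.",
    "Explain why the sky appears blue during the day.",
]
clean = decontaminate(generated, benchmark)
print(clean)  # the first example overlaps the benchmark and is removed
```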

Performance Data

For internal evaluation of these models, the researchers used EleutherAI's lm-eval-harness benchmark suite, supplemented with AGIEval.

The lm-eval-harness was created by EleutherAI, a non-profit AI research lab, and it also powers the aforementioned Hugging Face Open LLM Leaderboard.
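For context, the Open LLM Leaderboard ranks models by the unweighted average of their scores on a fixed set of lm-eval-harness tasks (at the time of writing: ARC, HellaSwag, MMLU, and TruthfulQA). A sketch of that ranking metric, with entirely made-up scores:

```python
# Sketch of the Open LLM Leaderboard's ranking metric: an unweighted mean over
# its benchmark tasks. Task set reflects the leaderboard at the time; scores are invented.
def leaderboard_average(scores: dict) -> float:
    """Unweighted mean of per-task scores, in percent."""
    return sum(scores.values()) / len(scores)

hypothetical = {"arc": 71.1, "hellaswag": 86.4, "mmlu": 68.8, "truthfulqa": 59.4}
print(leaderboard_average(hypothetical))
```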

AGIEval was created by Microsoft to evaluate base models' performance on "human-centric" standardized tests, such as math competitions and bar exams.

Both FreeWilly models perform very well on many fronts, including complex reasoning, understanding the subtleties of language, and answering complex questions involving specialized domains such as legal and mathematical questions.

The two models' results on the lm-eval-harness benchmarks are as follows (these FreeWilly results were evaluated by Stability AI researchers):

Their performance on the AGIEval benchmark is as follows (all 0-shot):

Additionally, they tested the two models on the GPT4All benchmark (all 0-shot):

Overall, both models perform very well, further narrowing the gap with top AI models such as ChatGPT. Readers who want to try the models can follow the links below.

FreeWilly 1:

FreeWilly 2:

Judging from the reactions, the FreeWilly models have given everyone a bit of a shock, because they arrived so fast. After all, Llama 2 launched only three days ago and had barely warmed its seat at the top of the rankings. One researcher said he had recently had eye surgery and avoided the news for a week, yet felt as if he had been in a coma for a year. In other words, this is a period when you can't afford to blink.

However, it is important to note that while both models are open access, unlike Llama 2 they are released under a non-commercial license for research purposes only.

However, this approach has drawn skepticism from some netizens.

In response, Stability AI researchers replied that the research-only restriction is temporary, and that in the future FreeWilly is expected to allow commercial use, as Llama 2 does.

In addition, some have questioned the benchmarks used in the evaluation:

This remains a genuinely difficult problem. Earlier, the Falcon model's apparent rout of Llama on the Hugging Face leaderboard caused controversy, and the story was later completely reversed: it turned out Llama had not been crushed by Falcon at all, and Hugging Face rewrote the leaderboard code as a result. As large models keep emerging, how to evaluate them effectively is still an open question. It is therefore worth treating these chart-topping models with caution and waiting for more evaluation results.

Reference link:
