Infinite Context for Large Models and the Art of Dataset Composition
Source | Latent Space
Compilation | OneFlow
Translation | Jia Chuan, Yang Ting, Wan Zilin
Context length used to be one of GPT-3's biggest limitations: the model accepts at most 4,000 tokens (roughly 3,000 words, or 6 pages), and anything longer triggers an error. Handling long documents and prompts therefore required retrieval tooling such as LangChain. In early May, however, MosaicML (since acquired by Databricks for about $1.3 billion) released MPT-7B, whose long-context variant handled up to 84,000 tokens (about 63,000 words, or 126 pages), greatly expanding the range of text that can be processed. Soon afterward, Anthropic extended the context length of its Claude model to 100,000 tokens.
MosaicML also released three fine-tuned variants built on the base MPT-7B: MPT-7B-Instruct, MPT-7B-Chat, and MPT-7B-StoryWriter-65k+.
MPT-7B-Instruct is fine-tuned on dolly_hhrlhf, a dataset built on top of Databricks' Dolly dataset.
MPT-7B-Chat is fine-tuned on the ShareGPT-Vicuna, HC3, Alpaca, Helpful and Harmless, and Evol-Instruct datasets.
MPT-7B-StoryWriter-65k+ is fine-tuned on a filtered subset of fiction from books3 with a 65k-token context length. Although the advertised context is 65k tokens, the team was able to get responses of 84k tokens when running on a single node of A100-80GB GPUs; the key technology behind this is ALiBi. The Great Gatsby is only about 68k tokens, so the team used MPT-7B-StoryWriter-65k+ to write a new ending for the novel.
MosaicML Chief Scientist Jonathan Frankle and Research Scientist Abhinav Venigalla led the training of MPT-7B. On a recent episode of the Latent Space podcast, host Swyx and Decibel Partners partner Alessio discussed with them the innovations in the MPT-7B training process, why composing an LLM training dataset is an important and mysterious art, and why some traditional multiple-choice benchmarks may not be very helpful for the technology being built.
(The following content is compiled and published by OneFlow with authorization. Source: https://
Building the MPT-7B model
**Swyx: Why did you develop the MPT-7B?**
Abhinav: The MPT-7B project took about 6-12 months in total. We started working on language models last summer and published a blog post analyzing them, finding that training costs might actually be much lower than people think. Since then, inspired by Meta AI's LLaMA model and many other open-source efforts, we set out to build a really good 7-billion-parameter model, and that is how MPT came about.
Alessio: You said on a previous podcast that Mosaic had no plans to build and release models, yet in the end you released one anyway. What made you change your mind?
Jonathan: I think there were several factors. We still lacked a first-class model of our own. Unlike OpenAI, our business revolves around customers creating their own models; we primarily provide them with the tools, and for those tools to be effective, we first had to create our own models.
We had to show that if our clients can do great things with our tools, so can we. A lot of people on Twitter questioned the numbers Mosaic showed; Ross Wightman, for example, said, "Let's see the actual results." To which I would say: Ross, how do you think we did it? We developed the model in 9.5 days at a cost of $200,000, so you can do it too.
**Swyx:** Referring to the figures you released last year: you initially estimated the cost of training GPT-3 at under $450,000 and then brought it down to $100,000, and the cost of Stable Diffusion likewise dropped from $160,000 to under $50,000.
Jonathan: I'm still very cautious about the $100,000 figure. It’s not there yet, but we’re headed in that direction, and that’s a big challenge for Abhi.
Swyx: There are three variants of the MPT-7B model, one of which achieves SOTA in terms of context length, what is the training process for these models?
Abhinav: Our base model is a re-creation of LLaMA-7B: 7 billion parameters trained on 1 trillion tokens, giving fine-tuners an efficient starting point without requiring too much intervention. The fine-tuned models are also very interesting; MPT-7B-StoryWriter-65k+, for example, can be used for story writing with a 65,000-token context window, and it can continue writing based on existing content.
Of course, this is just one of the directions we have in mind. You can use the MPT-7B base model to build custom models for different needs, such as long-context code models or models for specific languages. So on top of the base model we built three variants, MPT-7B-Instruct, MPT-7B-Chat, and MPT-7B-StoryWriter-65k+, for following short instructions, chat dialogue, and story writing respectively.
Alessio: How do you decide how many tokens and parameters to use when training the model? 7 billion and 3 billion model parameters seem to be two magic numbers that are currently in vogue.
Abhinav: For training models, scaling laws tell you how to make the most efficient use of your training compute. For example, if the budget is $200,000, scaling laws can give you the most effective training configuration for that budget.
The one we follow most often is the Chinchilla scaling law. For the MPT-7B model and its variants, we did not follow it strictly: we wanted the models to be suitable for personal use and to have good inference performance, so we overtrained them well beyond the Chinchilla point (the compute-optimal number of training tokens). Some people on the internet jokingly call these models "Llongboi" because their training runs are so long. For a 7B model, the Chinchilla point is roughly 140 billion tokens, but we actually trained on 1 trillion tokens, so the training run was almost 7 times longer than usual.
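As a rough illustration of those numbers, here is a back-of-envelope sketch (not MosaicML's actual planning code) using the common rule of thumb of about 20 training tokens per parameter as an approximation of the Chinchilla result:

```python
# Back-of-envelope sketch of the "Chinchilla point" vs. MPT-7B's actual token budget.
params = 7e9                       # MPT-7B parameter count
chinchilla_tokens = 20 * params    # ~140 billion tokens, roughly the compute-optimal budget
actual_tokens = 1e12               # MPT-7B was trained on 1 trillion tokens

print(f"Chinchilla-optimal: {chinchilla_tokens / 1e9:.0f}B tokens")
print(f"Actually trained:   {actual_tokens / 1e9:.0f}B tokens "
      f"({actual_tokens / chinchilla_tokens:.1f}x the Chinchilla point)")
```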
**Swyx: Is Llongboi referring to a training method?**
Jonathan: Llongboi is just an inside joke for a training run that uses far more tokens than the Chinchilla law prescribes. You'll notice it starts with two L's, a nod to LLaMA. Our CEO made the name public on Twitter, calling the model "Llongboi." Sometimes I wish I could take away his Twitter password so things don't leak out early, but now the whole world knows the name.
Architecture, ALiBi, and context
**Alessio:** Flash Attention and FasterTransformer are two core elements of your model. What are their advantages?
**Abhinav:** Flash Attention is a faster implementation of full attention, developed by Stanford's Hazy Research lab. We integrated Flash Attention into our library last September, and it has played a big role in training and inference speed. Compared with other Hugging Face models, ours is special in that it can switch between standard Torch attention and Flash Attention, which is specially optimized for the GPU; this makes training roughly 2x faster and inference 50%-100% faster.
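As a minimal sketch of the idea (this is not MosaicML's actual code, and the `use_flash` flag is a hypothetical stand-in for the real configuration switch), swapping attention implementations behind one interface might look like this in PyTorch:

```python
# Illustrative toggle between a plain attention implementation and a fused kernel.
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, use_flash: bool = True):
    """q, k, v: (batch, heads, seq_len, head_dim)."""
    if use_flash:
        # PyTorch >= 2.0 dispatches to a fused FlashAttention-style kernel when available.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    # Plain "torch attention": materializes the full (seq_len x seq_len) score matrix.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    causal = torch.triu(torch.ones(q.size(-2), k.size(-2), device=q.device), diagonal=1).bool()
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 128, 64)
print(torch.allclose(attention(q, k, v, True), attention(q, k, v, False), atol=1e-5))
```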
**Swyx: What motivated you to choose ALiBi positional encoding?**
Abhinav: We combined ALiBi positional encoding, Flash Attention, and training stability in an interesting way. ALiBi removes the need for positional embeddings. Previously, if a token sat at position 1, you had to add a specific positional embedding for it, and you couldn't go beyond the maximum position (usually 2,000). With ALiBi that problem goes away: instead of embeddings, we add a bias to the attention map, which is essentially a slope, and if inference needs a longer range of positions, the slope simply extends to more positions. The approach works because the slope is continuous and interpretable.
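For reference, a minimal ALiBi sketch (following the recipe in the ALiBi paper rather than MPT's exact implementation) looks like this; each head gets a fixed slope, and a linear distance penalty is added to the attention scores before the softmax:

```python
# Minimal ALiBi bias construction, per the ALiBi paper (not MPT's exact code).
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Geometric slopes 2^(-8/n), 2^(-16/n), ... (for power-of-two head counts).
    slopes = torch.tensor([2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]        # distance[i, j] = j - i (<= 0 for past tokens)
    # Shape (n_heads, seq_len, seq_len); added to the attention scores before softmax.
    return slopes[:, None, None] * distance[None, :, :]

# At inference, the same slopes simply extend to longer sequences, which is what
# lets a model trained at 65k tokens be run at 84k.
print(alibi_bias(n_heads=8, seq_len=6)[0])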
Interestingly, because Flash Attention saves a lot of memory and improves performance, last year we started running performance tests on models with very long contexts (up to 65k), but training stably at those lengths was very difficult. Later we integrated ALiBi into the model and stability improved significantly. We can now train story-writing models stably on very long contexts and use them efficiently.
Jonathan: The context length is technically unlimited: as long as enough memory is available, a dialogue can continue indefinitely. We consider 84k the longest context the model handles comfortably in practice, and about the longest a person would comfortably work with, but we have also tried context lengths beyond 84k and can handle longer ones.
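To make "enough memory" concrete, here is a rough estimate of just the attention key/value cache at inference time, assuming MPT-7B-like dimensions (32 layers, model width 4,096) and an fp16 cache; the numbers are illustrative arithmetic, not measurements:

```python
# Rough KV-cache sizing for long-context inference (illustrative only).
n_layers, d_model, bytes_per_value = 32, 4096, 2   # assumed MPT-7B-like dims, fp16 cache
for seq_len in (4_000, 65_000, 84_000):
    # Keys and values: two (seq_len x d_model) tensors per layer.
    kv_bytes = 2 * n_layers * seq_len * d_model * bytes_per_value
    print(f"{seq_len:>6} tokens -> ~{kv_bytes / 2**30:.1f} GiB of KV cache")
```

At 84k tokens that is roughly 41 GiB of cache on top of about 13 GiB of fp16 weights, which is consistent with the 84k run fitting on a single node of A100-80GB GPUs.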
**Swyx:** For example, you can feed the novel The Great Gatsby into the model, have it continue the story from that input, and the model ends up producing quite exciting content.
Jonathan: There are a lot of really good versions of the end of the story within Mosaic. One version describes Gatsby's funeral, Nick starts talking to Gatsby's ghost, Gatsby's father also shows up, and then he and Tom show up at the police station. This version puts a lot of emphasis on the plot, describing what happens next. Also, many versions have very Fitzgerald-esque endings, and they are beautifully written. So it's exciting to see that the model does seem to be processing the input and producing meaningful output. We can do a lot with this context length.
Alessio: Memory starts to become one of the constraints of the model, so how should parameter size and context length be chosen?
Jonathan: Long-context research has attracted a lot of attention recently, and a wave of related papers has appeared. However, these papers are not entirely rigorous, especially around attention mechanisms: they tend to gloss over the trade-offs between non-quadratic attention (approximate or hierarchical attention) and exact, explicit quadratic attention. I'm bullish on approximation methods, so I can't wait to dig into these papers.
Writing and reading papers has taught me an important lesson: don't trust any numbers until you've reproduced them yourself. At Mosaic, implementations disappointed us many times; papers that looked promising at first turned out, once implemented, to have massaged their data. So I'm always skeptical of reported results and don't trust anything until it has been re-implemented and validated. Overall, that practice has paid off; many times, the theory didn't work as well in practice as advertised.
Features of MPT-7B
**Swyx: What are the specific features of the MPT-7B?**
Abhinav: I would break this into two parts, the first being training stability, which itself has three pieces. First, the model needs to avoid loss spikes during training; that is our first line of defense. In my opinion, loss spikes are not a big problem at the 7-billion-parameter scale, but avoiding them gets harder as training runs grow longer. We spent a long time figuring out how to tune initialization, optimizers, architecture, and so on to prevent loss spikes. Even in our training runs, if you look carefully you can still find small intermittent spikes, but they return to normal within a few hundred steps, which is a rather magical phenomenon: the model naturally recovers from these spikes on its own.
Determinism and smart recovery strategies are our second line of defense. In the event of a catastrophic error, we can quickly resume training and apply interventions to the few batches before the failure. We prepared for all kinds of possible problems; as it turned out, we didn't need any of these backup measures during the MPT-7B run, which was a bit of luck.
The right training infrastructure is the third line of defense. Training a model on hundreds of GPUs means frequent hardware failures: on a large cluster of 512 GPUs, training fails roughly every two days, often due to something like a network failure.
Typically, people set up 24/7 on-call teams to deal with these failures: when something breaks, the team checks the cluster, removes broken nodes, restarts, and so on, which is very tedious work. We used to spend months manually chasing these errors, but we have now built a platform that automates the handling of every node failure during training.
When there is a problem with a model run, our automated monitoring system stops the job, tests and checks for broken nodes, and restarts. Because of the deterministic and fast recovery capabilities of our software, the model continues to run just fine. As a result, we can sometimes see in the model logs that after a model fails at 2am, it is back up and running within minutes without manual intervention by a team member.
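An illustrative sketch of that "stop, check, resume" loop is below. None of these function names come from MosaicML's platform; they are hypothetical stand-ins for a real trainer, health check, and checkpoint store:

```python
# Hypothetical sketch of automated failure handling around a training loop.
import time

def run_with_auto_resume(train_step, save_ckpt, load_ckpt, cluster_healthy, max_steps):
    step = load_ckpt()                      # deterministic resume from the last checkpoint
    while step < max_steps:
        try:
            train_step(step)
            if step % 1000 == 0:
                save_ckpt(step)             # frequent checkpoints bound the work lost to a failure
            step += 1
        except RuntimeError as err:         # e.g. a NCCL/network error from a broken node
            print(f"step {step} failed ({err}); checking nodes and restarting")
            while not cluster_healthy():    # wait until broken nodes are cordoned or replaced
                time.sleep(60)
            step = load_ckpt()              # deterministic data order makes the rerun reproducible
```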
Jonathan: Getting here was really not easy. If a hardware failure had hit a run a few months ago, a team member would have had to get up at two in the morning to find the failed node and restart the job. Previously, even at the 7-billion-parameter scale, we often ran into catastrophic loss spikes, and those problems seriously affected training.
We have now addressed these issues through incremental improvements. As Abhinav said, we can now sit in an office while training multiple models without worrying about the model failing and interrupting the training.
Data selection, repetition, and the challenges of evaluating LLMs
**Swyx: Data selection is a focus of yours. Can you expand on that?**
Jonathan: Abhi almost killed me when I tried to use all of our GPUs for data processing instead of actually training the model. We know that training a model requires a lot of data, but there are also many uncertainties.
One is which data sources matter; another is the importance of duplication. The duplication question can be broken down further into a quality-versus-quantity trade-off: suppose I have the best 10 billion tokens of data in the world, is it better to train on them a hundred times, or to train once on 1 trillion lower-quality tokens? There may be a compromise point, but how to identify high-quality data is itself an open problem with no clear answer yet. If I went back to academia now, I would definitely write a paper on this, because I still know nothing about it.
Swyx: I haven't seen any research papers on this so far.
Jonathan: The central research question would be: what combination of datasets should be used?
While building the model, I went back to Georgetown Law School, where I used to teach, and sat down with a group of law students to discuss it. I gave them a set of high-quality datasets and a token budget, asked them how they would mix the data, and let them create the best dataset for their model.
They knew nothing about LLMs other than that the input data affects the model's behavior. I told them to create a mixture that covers all the different trade-offs. At first you may want a large amount of English text, which you can get from the web; if you want a multilingual model, the share of English shrinks considerably; and then there is the question of whether to include code at all.
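As a toy sketch of what such a mixture looks like in practice, the source names and weights below are purely hypothetical, not MPT-7B's actual recipe:

```python
# Toy sampler that draws training documents according to a data-mix specification.
import random

data_mix = {                  # hypothetical fraction of training tokens per source
    "web_english":  0.60,
    "code":         0.15,
    "scientific":   0.10,
    "wikipedia":    0.05,
    "multilingual": 0.10,
}

def sample_source(mix, rng=random):
    sources, weights = zip(*mix.items())
    return rng.choices(sources, weights=weights, k=1)[0]

counts = {source: 0 for source in data_mix}
for _ in range(10_000):
    counts[sample_source(data_mix)] += 1
print(counts)   # empirical counts should roughly track the mixture weights
```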
Some people think code makes a model better at logical reasoning, but I have never seen evidence to support that. We have indeed built an excellent code model, but whether a code model leads to better chain-of-thought reasoning needs further study.
A version of GPT-3 is said to have been trained on fiction such as The Da Vinci Code, and some people think that kind of data may help the model's training, but again there is a lack of evidence.
So we experimented with many different data mixes and found that some performed better or worse than others. For example, "The Pile" is a very solid data mix, but according to the evaluation metrics there are better ones. I'll also touch on evaluation shortly, because it is very important.
The T5 model was originally trained on the C4 dataset, and it performed exceptionally well. Others, including EleutherAI's Stella Biderman, mentioned this when I tweeted about it. The preprocessing of C4 described in the original T5 paper looks odd: the authors removed everything containing the word "Javascript" because they didn't want the "enable JavaScript" warnings, and they removed anything containing curly braces because they didn't want to pull in code.
They also checked a list of bad words and removed content containing them. The problem is that the bad-words list includes some words that aren't actually bad, like "gay." Yet despite this ad-hoc cleaning process, the resulting dataset seems unrivaled. Which just shows how little we understand about data.
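A simplified paraphrase of those C4-style heuristics might look like the following. This is an illustration of the kind of filtering described above, not the T5 team's actual preprocessing code, and `BAD_WORDS` is a placeholder for the published word list:

```python
# Illustrative C4-style page filtering (simplified paraphrase, not the real pipeline).
from typing import Optional

BAD_WORDS = {"placeholder_word"}   # stand-in for the "bad words" list mentioned above

def c4_style_filter(page: str) -> Optional[str]:
    """Return a cleaned page, or None if the whole page should be dropped."""
    if "{" in page or "}" in page:                          # drop pages containing curly braces (code)
        return None
    if any(word in BAD_WORDS for word in page.lower().split()):
        return None                                         # drop pages containing listed words
    # Drop individual lines mentioning javascript (the "please enable JavaScript" warnings).
    kept = [line for line in page.splitlines() if "javascript" not in line.lower()]
    cleaned = "\n".join(kept)
    return cleaned if cleaned.strip() else None
```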
In fact, we also used a dataset called mC4. mC4 uses the same preprocessing as C4 but includes many more web crawls, yet its English portion is much worse than C4's, for reasons unknown.
For this, I set two criteria:
First, the English portion had to be at least as good as mC4's; compared with the other datasets available, mC4's English portion is better. Second, go all out on data diversity and make sure the dataset includes things like code, scientific papers, and Wikipedia, because people will use the model for a wide variety of tasks.
But I think, most importantly, a model is only as good as its evaluation metrics, though Abhi may disagree with me on that. We do not know how to accurately evaluate generative models on the specific tasks people ask of them. In some cases, we have to admit that our own evaluations don't even measure what we really care about, so we can only make reasonable choices.
Swyx: Do you think evaluation methods such as MMLU (Massive Multitask Language Understanding) and BIG-bench are not convincing enough?
Jonathan: These benchmarks essentially cover two kinds of tasks. One is multiple choice, where there is a single correct answer: the model is given options such as A, B, C, or D, and you pick the one the model is most likely to produce by computing the perplexity of each possible answer. But what we actually ask models to do is not multiple choice; it's the second kind, open-ended generation such as summarization. Comparing summaries with metrics like BLEU and ROUGE is not accurate enough, because there are many excellent ways to write an abstract or an open-ended generation. Human evaluation is a more reliable standard, but it is very time-consuming and labor-intensive and can't keep up with the model in real time; perhaps that will become possible in the future.
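For the multiple-choice half, the perplexity-based scoring Jonathan describes can be sketched roughly as below. This is illustrative only; real harnesses such as lm-evaluation-harness handle length normalization and batching much more carefully, and `gpt2` is just a stand-in model:

```python
# Minimal sketch of perplexity-style multiple-choice scoring with a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                               # stand-in; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def option_loss(question: str, option: str) -> float:
    ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)              # mean token cross-entropy over the sequence
    return out.loss.item()

question = "The capital of France is"
options = ["Paris", "Berlin", "Madrid", "Rome"]
scores = {opt: option_loss(question, opt) for opt in options}
print(min(scores, key=scores.get))                # pick the lowest-loss (lowest-perplexity) option
```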
Abhinav: We have a great evaluation team that is helping us build new metrics.
Jonathan: But LLMs are hard to evaluate, and I don't think any of these metrics really reflect what we would expect from a model in practice.
Reducing the cost and increasing the efficiency of model training
Swyx: Now it takes people three to ten days to train a model, how long do you want to shorten that time?
Abhinav: This year is probably one of the most exciting years for raw model-training efficiency, with upgrades on both the hardware and the software side. First there is Nvidia's new generation of hardware, the H100, which on its own improves performance by at least 2x. Second, there is the new FP8 floating-point format, which delivers a similar improvement on its own.
A few years ago everyone trained in 32-bit precision; then Nvidia introduced 16-bit, and after several years of work, as the demands kept growing, we gradually mastered 16-bit training.
With FP8 this year we can double throughput, which means we can cut costs substantially. At the same time, we have started profiling LLM training with FP8 on the H100, and progress has been rapid. So hardware improvements alone can reduce costs a great deal.
Beyond that, there is a lot of research on architectures. We are exploring ways to introduce some sparsity, though not completely random sparsity: is there a gating mechanism or an MoE-style architecture that can achieve this?
Our original goal was to reduce the cost of training the GPT-J model from $500,000 to $100,000, and if we can achieve that by the end of the year, that would be a great achievement.
Jonathan: This idea is not a castle in the air. Although that stage has not been reached yet, this goal is likely to be reached by 2023.
Statistics on training and inference costs are scarce. Google's David Patterson published an analysis of Google's energy usage for machine learning: over the past three years, Google spent three-fifths of its machine-learning resources on inference and two-fifths on training. And that is Google, which serves models to billions of users.
Google probably has the largest inference load in the world, and even there the split is three-fifths inference to two-fifths training. Training hardware may be more expensive and its network fabric more complex, so in cost terms the split between training and inference may be closer to half and half. That is Google's ratio; for other companies, training may carry an even higher weight.
The importance of openness for AI research
Alessio: Training used to be very expensive, which kept people from running enough experiments, and that led to many problems in choosing datasets and so on.
Jonathan: In grad school, I used to be jealous of my friends because they had GPUs and I didn't have one on my laptop, so I couldn't train any models. I fantasized about winning the lottery so I could own a K80 GPU.
Deep down, I'm still that eager student of science. I strongly believe that if we want to do real scientific research on these systems, to understand how to make them work well and to understand their behavior, safety, and reliability, we have to drive the cost of training down so that we can actually do the science. Take biology: we need many cell cultures and experiments before we are sure a drug works; a great deal of research is necessary before we really understand anything.
**Abhinav:** MosaicML has many customers who are trying to train models, so the company has an incentive to devote a lot of resources and time to scientific research. Only by truly understanding how models should be trained can we help more people, so for us this aggregation of experience is very important.
I remember a Google paper that investigated batch sizes, among other things. That paper probably cost millions of dollars to produce, and it has been hugely beneficial to the community as a whole: now we can all learn from it without spending that money ourselves. In the same way, Mosaic's experimental research has given us deep insights into data, pre-training architectures, and so on, and that is why customers choose us.
Jonathan: Openness is very important to the AI community. In a sense, we have no reason to be closed. We earn income by helping customers train models. There is no loss for us to share the results with the community. After all, we have to earn income through customized models and excellent infrastructure. And bringing these aspects together is why we named our company MosaicML.
We have always kept an open attitude and do not hide our results. But now I find that we have become one of the largest open-source labs in the industry, which is a sad fact, because MosaicML is not that big in the scheme of the industry (we only have about 15 researchers), while many other labs have become closed and no longer publish much. MosaicML will keep communicating and sharing with the community and try its best to be a pioneer of open research. While our scale and volume of research cannot match that of a large lab, we will continue to share what we learn and try to create resources for the community.
When I discuss the AI ecosystem with policymakers, the same concern always comes up: that a lack of openness will hinder the pace of innovation. I have been raising this issue for years, and now it is becoming a reality. I advocate open source, but I don't expect everyone to share their work. We used to take open source for granted; that is no longer the case.
I think this is going to slow our progress. In many cases, each lab develops its own monoculture, and the exchange of ideas is an important driver of scientific progress. So openness is not only indispensable to the open-source community and academia, it is also critical to the advancement of the technology. We need a vibrant open research community.
Future trends
Swyx: You mentioned that a lot of things don't last long and are easily replaced, but Transformer is here to stay.
Jonathan: Transformers will be around for a long time. Convolutional neural networks (CNNs) are still in use today, and vision Transformers have not displaced them. Look at the recurrent neural network (RNN): it has existed for decades and is still active in many fields. Major infrastructure-level changes are simply hard to pull off.
Abhinav: I think your bet depends a lot on how you define attention. If an operation such as the QK matrix multiplication were replaced by something similar, what would that do to the result?
Jonathan: In the end, a Transformer is just fully connected feed-forward layers plus a simple attention mechanism. So things may change, but we continue to use the Transformer much as Ashish Vaswani (one of the Transformer's authors) envisioned it six years ago, and perhaps we will keep doing so.
Abhinav: I think it will end up a bit like the MLP (multilayer perceptron), which is essentially the only option we have at the moment, because the architecture has been simplified so much that only linear layers, residual connections, attention, and dot-product operations remain.
Jonathan: Your assumption is that the architecture will become simpler, but the reality may be the opposite, and the architecture may become more complex.
Swyx: What are your thoughts on the recent debate about "emergent phenomena"?
Abhinav: I've seen similar papers, and these phenomena are probably just by-products of how we evaluate: log-scale plots, the choice of evaluation metrics, and the fact that what we typically measure is exact-match accuracy, a strictly binary judgment that classifies each output as right or wrong without capturing finer-grained, continuous differences.
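A toy illustration of that argument: a smooth improvement in per-token likelihood can look like a sudden "emergent" jump when scored with an all-or-nothing exact-match metric. The numbers below are made up for illustration:

```python
# Made-up numbers showing how a binary metric can manufacture an apparent "emergence".
import math

answer_len = 10   # suppose the task needs a 10-token answer with every token exactly right
for params, p_token in [(1e8, 0.50), (1e9, 0.70), (1e10, 0.85), (1e11, 0.95), (1e12, 0.99)]:
    exact_match = p_token ** answer_len           # all-or-nothing score jumps late and sharply
    per_token_logprob = math.log(p_token)         # the underlying smooth, continuous metric
    print(f"{params:.0e} params: per-token p={p_token:.2f}, "
          f"log p={per_token_logprob:+.2f}, exact-match={exact_match:.3f}")
```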
But, similar to Jonathan's point about evaluation, we also have a problem with the diversity of evaluation metrics: when we release these models, even the chat and instruct models, people use them for a huge variety of tasks. We can hardly measure and evaluate every dimension precisely beforehand, and even at the 7-billion-parameter scale these models still perform poorly on some very difficult MMLU tasks, sometimes scoring barely above random chance.
Some of these problems may become more tractable as we pursue higher-quality models. But we developed MPT-7B somewhat blindly, because we didn't fully understand how the model would ultimately behave: we could only develop against a small set of common-sense reasoning tasks and judge performance by comparing those metrics with other open-source models.
Alessio: I think fast inference and training is one of the goals, so there is a trade-off between solving the most difficult tasks and being fast on other tasks.
Abhinav: Yes. Even at the 7-billion-parameter scale, people will try to run the model on a CPU at home or port it to their phone. Small-scale applications are what will drive adoption of this technology, and that's an important trend right now.
Alessio: What are some things in AI that are moving much faster than expected?
Jonathan: I remember when GPT-2 was released; I wasn't very excited, even though it already had 1.5 billion parameters. I thought performance couldn't keep improving just by scaling models up. Then GPT-3 came out and I thought it was only a little better at generating text, and I kept being wrong: scaling up models that simply predict the next token turns out to yield extremely useful models.
To be fair, pretty much everyone got this wrong, so we can't blame ourselves too much; otherwise Google, Facebook, and Microsoft Research would have released a killer large language model long before I had a chance to act. I did make one very strange bet that turned out to be right: diffusion models, while somewhat dumb, produce stunningly beautiful images.
Abhinav: On chatbots at scale: I thought it would be a long time before hundreds of millions of people were having conversations with AI models. Now, with so many startups and businesses using not just ChatGPT but character-style chatbot products as well, it's amazing how many people are forming real emotional connections with these AI models. I don't think I would have predicted that last September or October; the inflection point of the past six months has been truly unexpected.
Swyx: What do you think they'll be used for, like emotional support?
Abhinav: Some use them for emotional support, or just as friends. Loneliness and mental health are hot topics; if you go to those communities' subreddits, people are talking about their AI friends and these characters. It's like something out of science fiction, and I never expected it to happen.
Swyx: What is the most interesting unsolved problem in AI?
Abhinav: I'm interested in how far we can push numerical precision, below formats like BF16 and FP16.
I wonder whether these problems become more tractable as models get bigger. Related papers suggest that quantization and pruning may become easier with scale, so, as a natural consequence of scaling up over the next few years, we might move toward four-bit or two-bit or even binary weights.
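For a sense of what "four-bit or two-bit weights" means concretely, here is a toy round-to-nearest quantizer; real methods are considerably more sophisticated, and this is purely an illustration:

```python
# Toy symmetric round-to-nearest weight quantization (illustrative only).
import torch

def quantize_rtn(w: torch.Tensor, bits: int):
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit, 1 for 2-bit
    scale = w.abs().max() / qmax                  # one scale per tensor (per-channel is better)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

w = torch.randn(4096, 4096)
for bits in (8, 4, 2):
    q, scale = quantize_rtn(w, bits)
    err = (q.float() * scale - w).abs().mean()
    print(f"{bits}-bit: mean abs reconstruction error {err:.4f}")
```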
Jonathan: I want to look at it from the other direction: how small a model can we get away with, and how efficiently can we build a model of equivalent capability? This is the question I worked on throughout my Ph.D., and in a sense it's what I work on at Mosaic. OpenAI has shown us one route to this incredible capability, namely scaling, but I hope it's not the only one. I hope there are many other ways to get there as well, through better modeling methods, better algorithms, and so on.
While I'm not a fan of neuroscience analogies, in a sense our own existence and our brains prove that there is at least one other way to reach this incredible capability without trillions of parameters or astronomical capital investment. So I'm genuinely curious: how small a model can we achieve? Is there another path to these capabilities that doesn't have to follow the current one? I hope to find the answer at Mosaic, if it exists.
Swyx: Exactly, one of the things I'm most interested in is the fact that the human brain consumes only 30 watts of power, and the model is orders of magnitude away from that.
Abhinav: I don't think there is a way to achieve this with a single GPU or other tools alone.
Alessio: There's a lot of noise out there right now. How should people think about artificial intelligence, and what should they focus on?
Jonathan: Stay calm. Some people take the hype too seriously; others are very pessimistic, reacting strongly or dismissing it altogether. Keep a level head and recognize that we have built a very useful tool.
But we haven't built general intelligence, and personally I think we're nowhere near that goal. So it's important to stay level-headed and follow the science, which is what we strive for at Mosaic. We try to focus on things that are useful to people, hopefully making the world a better place. We will do our best, but most importantly we will follow the science, be guided by the data, and get there through real results rather than rhetoric.
Abhinav: I think there is nothing like doing research in an open community, where large numbers of people not only pay attention to your model but also point out its problems and suggest improvements. This kind of open research will be the way forward, both for keeping our models safe and for understanding the real-world impact and consequences of these AI models.