
Google trained PaLM 2 on nearly five times as much text data as the first generation

On May 17th, it was reported that Google's latest large language model, PaLM 2, unveiled at last week's 2023 I/O developer conference, was trained on nearly five times as much text data as its 2022 predecessor, according to internal company documents.

The newly released PaLM 2 can handle more advanced programming, math, and creative-writing tasks. According to the internal documents, PaLM 2 was trained on 3.6 trillion tokens.

A token is a string of characters: the sentences and paragraphs in the training text are segmented into these strings, each of which is called a token. Tokens are a key ingredient in training large language models, since they are what the model learns from when it is taught to predict which word will appear next in a sequence.
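As a rough illustration, the sketch below splits text into tokens and frames next-token prediction as the training objective. This is a toy example for intuition only; production models like PaLM 2 use learned subword tokenizers (such as SentencePiece), not whitespace splitting.

```python
# Toy illustration of tokenization and the next-token training objective.
# Real systems use learned subword tokenizers, not whitespace splitting;
# this sketch only shows the idea.

def tokenize(text: str) -> list[str]:
    """Split text into a sequence of tokens (here: naive whitespace split)."""
    return text.split()

corpus = "the model learns to predict the next token in the sequence"
tokens = tokenize(corpus)
print(tokens)       # ['the', 'model', 'learns', 'to', 'predict', ...]
print(len(tokens))  # corpus size measured in tokens (11 here)

# Each training example pairs a context with the token that follows it;
# the model is trained to predict the right-hand side from the left.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
print(examples[2])  # (['the', 'model', 'learns'], 'to')
```

Counting a corpus in tokens rather than words or bytes is what makes figures like "3.6 trillion tokens" comparable across models.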

PaLM, the previous-generation large language model Google released in 2022, was trained on 780 billion tokens; 3.6 trillion is roughly 4.6 times that figure, hence "nearly five times."

Although Google has been eager to showcase its strength in artificial intelligence, explaining how it embeds AI into search, email, word processing, and spreadsheets, it has been reluctant to disclose the size or other details of its training data. OpenAI, backed by Microsoft, likewise keeps the details of its newly released GPT-4 large language model confidential.

Both companies say they withhold this information because of fierce competition in the artificial intelligence industry: Google and OpenAI are each courting users who would rather search for information with chatbots than with traditional search engines.

But as competition in artificial intelligence intensifies, the research community is demanding greater transparency.

Since launching PaLM 2, Google has said the new model is smaller than its earlier large language models, meaning the company's technology has become more efficient even as it takes on more complex tasks. Parameter count is a common measure of a language model's complexity. According to the internal documents, PaLM 2 has 340 billion parameters, while the first-generation PaLM had 540 billion.

Google did not immediately comment.

In a blog post about PaLM 2, Google said the new model uses a "new technique" called "compute-optimal scaling," which makes PaLM 2 "more efficient, with better overall performance, such as faster inference, fewer parameters to serve, and lower serving costs."
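Google does not detail what "compute-optimal scaling" means in practice, but the reported numbers (fewer parameters than the first PaLM, far more training tokens) are directionally consistent with DeepMind's publicly known Chinchilla result (Hoffmann et al., 2022), which suggests roughly 20 training tokens per parameter for a fixed compute budget. The sketch below applies that heuristic; it is an illustrative assumption, not Google's disclosed recipe.

```python
# Hypothetical sketch of a Chinchilla-style "compute-optimal" heuristic
# (Hoffmann et al., 2022). This is an assumption for illustration, NOT
# Google's disclosed method for PaLM 2.

TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb (assumed)

def compute_optimal_split(compute_budget_flops: float) -> tuple[float, float]:
    """Split a training FLOP budget into (parameters, tokens).

    Uses the standard approximation that training cost is about
    6 * parameters * tokens FLOPs, plus the 20-tokens-per-parameter
    heuristic, so: budget ~= 6 * N * (20 * N)  =>  N = sqrt(budget / 120).
    """
    params = (compute_budget_flops / (6 * TOKENS_PER_PARAM)) ** 0.5
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

# Reported figures: PaLM 2 (340e9 params, 3.6e12 tokens) is ~10.6 tokens
# per parameter; the first PaLM (540e9 params, 780e9 tokens) is ~1.4.
# The new model sits far closer to the compute-optimal heuristic.
print(3.6e12 / 340e9)  # ~10.6
print(780e9 / 540e9)   # ~1.4
```

Under this reading, "smaller but better" is exactly what the heuristic predicts: shifting compute from parameters to training tokens yields a model that is cheaper to serve without sacrificing capability.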

When it released PaLM 2, Google said the new model had been trained in 100 languages and can handle a wide range of tasks. Twenty-five features and products, including Google's experimental chatbot Bard, already use PaLM 2. PaLM 2 comes in four versions, from smallest to largest by parameter count: Gecko, Otter, Bison, and Unicorn.

According to Google's public disclosures, PaLM 2 is more powerful than any of its existing models. Facebook announced its LLaMA large language model this February, which was trained on 1.4 trillion tokens. OpenAI disclosed its training scale when it released GPT-3, saying at the time that the model had been trained on 300 billion tokens. In March of this year, OpenAI released GPT-4 and said it demonstrated "human-level performance" on many professional tests.

According to the latest documents, the language model Google launched two years ago was trained on 1.5 trillion tokens.

As new generative AI applications rapidly go mainstream across the technology industry, controversy over the underlying technology is growing fiercer.

In February this year, El Mahdi El Mhamdi, a senior scientist at Google Research, resigned over the company's lack of transparency. On Tuesday, OpenAI CEO Sam Altman testified at a hearing on privacy and technology before the US Senate Judiciary Subcommittee, where he agreed with lawmakers that a new system is needed to deal with artificial intelligence.

"For a very new technology, we need a new framework," Altman said. "Of course, companies like ours have a lot of responsibility for the tools they launch."
