AI in 2023: A Review

2024-02-19 Source: Sohu Fashion

JUNLING HU

January 11, 2024

We are starting the first issue of the AI Frontiers newsletter this week. This is a perfect time to look back at 2023 and see how far we have come. (See resources and announcements at the end of this article.)

The year 2023 is a watershed year for AI. For the first time, AI has entered the public realm, touching every aspect of our lives. It is starting to replace search engines, becoming our go-to place to ask questions. AI is poised to disrupt many industries: from education to marketing, IT support, and medicine. Here, I want to summarize the progress of AI in seven major areas. They are definitely not exhaustive. I've chosen these areas for their importance and the potential to disrupt the future. Feel free to drop a comment here if you are observing other interesting developments.

1. The breakthrough in AI capabilities

Today's AI systems can easily pass the Turing test, and we no longer debate whether AI is feasible. If AI was perceived as a toddler before 2023, it has matured into a teenager in 2023, though it is not yet an adult. An adult AI system should be capable of thinking and reasoning like a human adult, which means passing college exams or completing similarly difficult tasks. In 2023, both GPT-4 and Google's Gemini have made significant progress toward that goal.

Both GPT-4 and Gemini are very large, because an LLM tends to become more capable as its size grows. GPT-4 is estimated [1] to have approximately 1.8 trillion parameters across around 120 layers, using a Mixture of Experts within the model. Google has not released the size of Gemini, but it is likely significantly larger than PaLM 2, which had 340 billion parameters [2]. According to the Gemini report [3], training Gemini required significantly more resources than PaLM 2, likely three times as much. If parameter count scales roughly with training resources, this places Gemini in the range of 1 trillion parameters. The architecture of Gemini is likely similar to that of GPT-4: a decoder-only transformer model with a mixture of experts.
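The size estimate above is simple arithmetic, sketched below. It rests on the same assumption as the text, namely that parameter count scales roughly linearly with training resources; that is a simplification, not something Google has confirmed, and the names in the snippet are illustrative.

```python
# Back-of-the-envelope estimate of Gemini's size from the figures quoted above.
# Assumption (from the text, not from Google): parameter count scales roughly
# linearly with training resources.
PALM2_PARAMS = 340e9        # PaLM 2 parameter count cited in the text
RESOURCE_MULTIPLIER = 3     # "likely three times as much"


def estimate_params(base_params: float, multiplier: float) -> float:
    """Scale a known parameter count by a resource multiplier."""
    return base_params * multiplier


gemini_estimate = estimate_params(PALM2_PARAMS, RESOURCE_MULTIPLIER)
print(f"Estimated Gemini size: {gemini_estimate / 1e12:.2f} trillion parameters")
```

This gives roughly 1.02 trillion, which is where the "range of 1 trillion parameters" comes from.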

Today's large language models (LLMs) demonstrate remarkable intelligence, as evidenced by their performance on a range of challenging datasets [4].

In commonsense reasoning (HellaSwag), GPT-4 has achieved a 95% accuracy rate, equivalent to human performance. In grade school mathematics (GSM8K), both LLMs achieved around 95% accuracy. In college exams covering 57 subjects (MMLU), both LLMs achieved over 90% accuracy, surpassing human performance (89%). For the problems that used to cause LLMs to stumble (BIG-Bench-Hard), GPT-4 achieved an 89% accuracy rate, while Gemini reached 83%. It appears that LLMs are overcoming their shortcomings. In coding problems (HumanEval), GPT-4 reached an 88% success rate. For reading comprehension with numerical reasoning (DROP), both LLMs achieved around 83% accuracy. The only area in which these LLMs performed poorly was mathematical competition questions (MATH). In summary, our large foundation models outperformed humans in 3 out of 7 tasks, approached human performance in 3 others, and performed poorly in only one. AI is approaching adult human intelligence.

What we can anticipate for 2024 is a continuous improvement in the performance of large foundation models. By the end of 2024, I expect that the best LLMs will surpass humans in almost all datasets. By then, we may declare that AI has reached adulthood, with the capability of reasoning and understanding equivalent to an adult human.

Open-source foundation models

The large foundation models are all closed-source and owned by a couple of companies. Many companies are concerned about their dependence on these models because there is no visibility into their inner workings. This concern has led to the emergence of many open-source models.

Meta released Llama in February, and LIMA was released in May. However, most of these early open-source models did not deliver satisfactory performance compared to the state-of-the-art OpenAI model (GPT-3.5 at that time).

Meta's Llama 2 and Mistral's Mixtral 8x7B model are among the best-performing ones. They have generated excitement because they approached the GPT-3.5 level, but they are still far behind GPT-4. Here is the newest performance chart:

On average, open-source models score 20% below the best GPT-4 model. This raises questions about deploying them, because commercial products demand high accuracy. Most companies will therefore stick with OpenAI or Google for foundation models. For this reason, we will see OpenAI continue to rise this year, with more companies building products on GPT-4 through the OpenAI API. Google will also be an active player: with its existing GCP and the high-performing Gemini, it could become an AI provider to enterprises.

We have not solved the hallucination problem. In fact, hallucination may be an inherent property of large language models, as some research has shown. Remedies include limiting answers to existing documents and using external search to check the validity of an answer. Another is to require chain-of-thought reasoning in the response, which researchers found significantly reduced wrong answers. Since hallucination is a big problem in many practical applications, we will see more research on solving it in 2024.
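The first remedy above, limiting answers to existing documents, can be illustrated with a toy sketch. It uses plain keyword overlap in place of a real retriever and returns the supporting passage instead of calling an LLM; the function names and the `min_overlap` threshold are hypothetical choices made for illustration only.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounded_answer(question: str, documents: list, min_overlap: int = 2) -> str:
    """Return the best-matching passage, or abstain when nothing supports an answer."""
    q_tokens = tokenize(question)
    best_doc, best_score = None, 0
    for doc in documents:
        score = len(q_tokens & tokenize(doc))
        if score > best_score:
            best_doc, best_score = doc, score
    # Abstain instead of guessing when no document overlaps enough.
    if best_doc is None or best_score < min_overlap:
        return "I don't know."
    return best_doc


docs = [
    "Gemini was released by Google in December 2023.",
    "GPT-4V was released by OpenAI in September 2023.",
]
print(grounded_answer("When was Gemini released?", docs))
print(grounded_answer("Who won the 1998 World Cup?", docs))
```

The point of the design is the abstention branch: when no document supports the question, the system declines to answer.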

2. Multi-modal AI

Another significant advancement is the maturation of multi-modal LLMs. Bard allowed image uploading in July, enabling users to ask questions based on images. OpenAI released GPT-4V in September 2023, which is capable of understanding text, images, and speech. Google released Gemini in December 2023, which can process text, images, audio, and video simultaneously. We now have fully multi-modal LLMs, which are also called LMMs (Large Multi-modal Models).

The emerging trend of 2023 is the integration of all these modalities into a single model. Such a model uses a transformer as its core architecture and transforms every type of input into tokens that can be processed by the transformer. Not only can we process different modalities, but we can also generate different modalities from such a model.

The achievement of multimodal capabilities is the result of the widespread adoption of transformers in all AI fields, allowing for a unified architecture to handle text, images, audio, and video. Vision transformers and video transformers have proven to be superior to CNN models, and speech transformer models outperformed CNN-based speech recognition models. Today, we only need a single transformer model to process these input formats, with the only extra work being the generation of image tokens or speech tokens.
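To make "generating image tokens" concrete, here is a minimal sketch of the patch-splitting step used by Vision Transformer style models: the image is cut into fixed-size patches, and each flattened patch becomes one token. Real systems then apply a learned projection, or a discrete tokenizer, to each patch; that part is omitted here, and the function name is illustrative.

```python
def patchify(image, patch_size):
    """Split a 2D grayscale image (list of rows) into flattened square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch row by row into a single vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 toy "image" whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch_size=2)
print(len(tokens))   # number of patch tokens
print(tokens[0])     # first patch, flattened row by row
```

Once every modality is reduced to a sequence of such vectors, the same transformer can process them alongside text tokens.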

The newly released VideoPoet [5] demonstrates such a model, combining multimodal processing with multimodal generation. VideoPoet uses a decoder-only transformer that processes multimodal inputs, including images, videos, text, and audio.

VideoPoet achieved state-of-the-art zero-shot video generation, and can generate high-fidelity video.

3. The explosion of Generative AI

Using AI to generate images, music, and videos became the biggest advancement in 2023. Text-to-image generation achieved remarkable fidelity in terms of image quality and realism. Here is a summary of the major generative models in 2023.

In the image domain, Meta released the Segment Anything Model (SAM) in April, capable of zero-shot segmentation of any picture. In October, OpenAI released Dall-E 3, which has the best image generation quality along with deep language understanding.

In text-to-video generation, Meta released Emu Video in November [6]. This model simplified video generation into two steps, allowing it to generate a 4-second video from text and an image. In human evaluation, Emu Video outperformed all previous models, including MAV, Google's Imagen, AYL, PYOCO, R&D, Cog, Gen2, and Pika, and was preferred over each of them more than 90% of the time.

The most exciting achievement of 2023 occurred at the end of the year. AudioBox [7] was released in December, enabling AI to generate any sound based on text. This followed Lyria [8], which can generate music in the style of an artist based on a text prompt.

VideoPoet was also released in December, ushering in a new paradigm of video generation that does without the diffusion model and integrates generation into an LLM.

AlphaCode 2 was announced on the same day as Gemini. It uses Gemini as a foundation model and performed better than an estimated 85% of human participants in coding competitions. Magicoder was also released, and it is the best open-source code generator.

The year 2023 marked the triumph of the diffusion model, as many image generators were based on it, including Emu Video. However, alternatives to the diffusion model have emerged. OpenAI's Dall-E 3 employs a consistency model [9] that does not rely on the diffusion model. Google's VideoPoet uses transformers directly, also avoiding the diffusion model in its generation. In other words, the two largest AI companies are shifting away from the diffusion model for image generation. My prediction is that the diffusion model will decline in 2024. The drive to move away from it comes from the pursuit of a single transformer model for all tasks. We expect to see more research results on transformer-generated images in 2024.

4. The rise of AI agents

In 2023, we started to see the "agent," an AI system that can take action on our behalf. Such actions can include sending an email, calling a restaurant, retrieving information from a database, or generating a chart. Once actions are introduced, the AI assistant can become more powerful. This action model is seamlessly integrated into the LLM; therefore, it is learnable and tunable.
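The action mechanism described above can be sketched as a small dispatch loop: the model emits a structured action, and the surrounding code runs the matching tool. In this toy version the model is a hard-coded stand-in (`fake_model`), and both tools are hypothetical stubs; no real email or database is touched.

```python
import json


def send_email(to: str, subject: str) -> str:
    """Hypothetical tool: pretend to send an email and report what was done."""
    return f"email sent to {to} with subject '{subject}'"


def lookup_order(order_id: str) -> str:
    """Hypothetical tool: pretend to query a database for an order."""
    fake_db = {"A100": "shipped", "A101": "processing"}
    return fake_db.get(order_id, "not found")


TOOLS = {"send_email": send_email, "lookup_order": lookup_order}


def fake_model(user_request: str) -> str:
    """Stand-in for an LLM: returns a JSON-encoded action for the request."""
    if "order" in user_request.lower():
        return json.dumps({"tool": "lookup_order", "args": {"order_id": "A100"}})
    return json.dumps({"tool": "send_email",
                       "args": {"to": "bob@example.com", "subject": "Hello"}})


def run_agent(user_request: str) -> str:
    """Ask the model for an action, then execute the matching tool."""
    action = json.loads(fake_model(user_request))
    tool = TOOLS.get(action["tool"])
    if tool is None:
        raise ValueError(f"unknown tool: {action['tool']}")
    return tool(**action["args"])


print(run_agent("Where is my order A100?"))
```

Keeping the tools in a registry means new actions can be added without changing the loop itself.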

One application of agents is in data analytics. In the future, analyzing data will no longer be a human job but will be delegated to AI. If an executive is interested in customer trends, they can simply ask a question in natural language, and the answer and chart will be generated automatically. There is no need for data scientists to write elaborate SQL code to retrieve data. This suggests that text-to-SQL and chart generation will be significant applications in 2024. There are also other applications for accessing a database to serve customer needs.
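A text-to-SQL flow of the kind described above might look like the following sketch. The translation step is faked with a lookup table (`fake_text_to_sql`) so the example runs without any model; in practice an LLM would produce the SQL. The table and column names are invented for illustration.

```python
import sqlite3


def fake_text_to_sql(question: str) -> str:
    """Stand-in for an LLM that translates a question into SQL (hypothetical)."""
    canned = {
        "How many customers signed up each year?":
            "SELECT signup_year, COUNT(*) FROM customers "
            "GROUP BY signup_year ORDER BY signup_year",
    }
    return canned[question]


def answer(question: str, conn: sqlite3.Connection) -> list:
    """Translate the question to SQL and run it against the database."""
    sql = fake_text_to_sql(question)
    return conn.execute(sql).fetchall()


# Build a tiny in-memory database to query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, signup_year INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ann", 2022), ("Bo", 2023), ("Cy", 2023)],
)
rows = answer("How many customers signed up each year?", conn)
print(rows)
```

The returned rows are exactly what a charting step would consume next. A production system would also need to validate generated SQL before executing it.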

OpenAI is supporting the AI agent paradigm by offering the Assistants API. It links your code to external tools, making it potentially powerful. However, assistants require a lot of context, and that context is appended to the token count on every call, which makes the API very expensive. Additionally, it is not easy to integrate the Assistant with other tools. In 2024, the AI assistant remains an open field for competition. A flexible assistant API or a low-cost solution could be attractive. LangChain has gained a lot of traction, but it is not perfect. AutoGen seems much easier to use. AutoGPT was a good try but falls short in many key functions. We may see new companies that deploy good agent solutions. This is where startup innovation can happen.

Even though OpenAI and Google lead in foundation models, good prompt engineering and agent actions could generate many interesting applications. We expect to see some specialized agents, such as a travel assistant, research assistant, price negotiation agent, and so on. Each of these assistants can leverage specialized tools and deliver value to the customers.

5. Better ways to finetune LLMs

The success of ChatGPT brought a lot of attention to the method called RLHF (Reinforcement Learning from Human Feedback). This method gave a significant boost to the original GPT-3 model and led to the successful deployment of GPT-3.5, which powered ChatGPT. RLHF is also used to enhance the performance of GPT-4, Google’s PaLM 2, and Meta’s Llama 2 model. Thus, it is the most widely used fine-tuning method for LLMs today.

Since RLHF has been so successful and is used with all foundation models, people are looking for ways to improve it, mainly by simplifying its steps. RLHF involves three steps: 1. Supervised fine-tuning: human-created data is used to train the current model. 2. Reward model training: user preferences for AI-generated outputs are collected, each output is given a score, and a scoring model (the reward model) is trained on them. 3. Reinforcement learning: the reward model is used to train the large language model through reinforcement learning.
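For step 2, one common formulation (used, for example, in InstructGPT) trains the reward model on pairs of outputs rather than on absolute scores: the loss is small when the preferred output scores higher than the rejected one. The sketch below shows only that loss on plain numbers; the function name is illustrative, and a real reward model would produce the scores with a neural network.

```python
import math


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Compute -log(sigmoid(score_chosen - score_rejected)) in a stable way."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss shrinks as the reward model ranks the preferred output higher.
for chosen, rejected in [(0.0, 2.0), (0.0, 0.0), (2.0, 0.0)]:
    print(chosen, rejected, round(pairwise_reward_loss(chosen, rejected), 4))
```

When the two scores are equal, the loss is log 2, which is the value for a model that cannot tell the outputs apart.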

(1) DPO

One improvement on RLHF is replacing the reinforcement learning step. Researchers from Stanford University proposed a method called DPO (Direct Preference Optimization) [10]. Instead of training a reward model and then running reinforcement learning, DPO simply uses the preference data directly to train the LLM. Therefore, DPO reduces two steps (reward function learning and RL) to one single step.
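The single step can be written as one loss over a preference pair. The sketch below follows the published DPO objective: it compares how much more the policy favors the chosen response over the rejected one, relative to a frozen reference model. Inputs are sequence log-probabilities given as plain floats, and the default `beta=0.1` is an illustrative value, not a recommendation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt."""
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# Policy identical to the reference: no preference learned yet.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(baseline > improved)
```

No reward model appears anywhere in the computation; the preference pair itself supplies the training signal.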

The authors show that DPO outperforms the reinforcement learning approach. Today, DPO has gained traction among practitioners for fine-tuning their models. This trend will continue in 2024.

(2) RLAIF

Another way to improve RLHF is by removing the bottleneck of data gathering. One of the key steps in RLHF is gathering human feedback data, which is expensive because it requires hiring people to provide answers. The human data gathering process is also time-consuming. Instead of relying on humans, we can use an LLM such as GPT-4 to provide feedback. RLAIF (Reinforcement Learning from AI Feedback) [11] employs GPT-4 to generate preference data, and its authors demonstrate that RLAIF boosts a model about as much as RLHF does. By using AI for feedback, we eliminate the bottleneck associated with collecting data from humans.
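The data-gathering step in RLAIF can be pictured as follows: an AI judge compares two candidate responses and the result is stored as a preference record. The judge here (`fake_ai_judge`) is a crude word-overlap heuristic standing in for an LLM call, and the record layout with `prompt`, `chosen`, and `rejected` keys is an assumed, commonly used format.

```python
def fake_ai_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Stand-in for an LLM judge (hypothetical): prefers the response that
    shares more words with the prompt. A real system would query an LLM."""
    prompt_words = set(prompt.lower().split())

    def overlap(response: str) -> int:
        return len(prompt_words & set(response.lower().split()))

    return "a" if overlap(response_a) >= overlap(response_b) else "b"


def label_pair(prompt: str, response_a: str, response_b: str) -> dict:
    """Turn the judge's verdict into a preference record for training."""
    winner = fake_ai_judge(prompt, response_a, response_b)
    if winner == "a":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


record = label_pair(
    "Explain photosynthesis briefly",
    "I like turtles.",
    "Photosynthesis lets plants turn light into chemical energy.",
)
print(record["chosen"])
```

Records in this shape can feed either a reward model or a direct preference method, with no human labeler in the loop.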

It appears that we are moving toward the use of AI for generating evaluation data, not only for preference data but also for other supervised training tasks.

(3) Weak-to-Strong Alignment

A third important development is investigating whether RLHF will continue to be useful in the future. There is an implicit assumption that RLHF will always improve a model's performance because humans know better. However, this assumption may no longer hold. This year or next, we will see AI grow into superhuman intelligence. This means it will beat humans in almost all tasks, from writing a good email to solving a math problem. When we force an LLM to conform to a human's way of writing or speaking, we could degrade the LLM’s performance on other tasks. In other words, training with RLHF could make an LLM less capable. This is very different from classical supervised training, where humans are always smarter. This situation is shown in the center picture of the following figure, where a person is trying to teach a superhuman AI.

Researchers from OpenAI have investigated this problem and have made the first attempt to simulate it [12]. They used a weak LLM (GPT-2) to teach a strong LLM (GPT-4) and confirmed that the performance of GPT-4 indeed degraded. This suggests that RLHF may not work well in the future. OpenAI researchers have proposed a remedy: adding an auxiliary confidence loss. With it, the fine-tuned GPT-4 recovers to the GPT-3.5 level, though it still remains below the original GPT-4 level. This paper represents the first attempt to understand the effect of applying a weak model to train a strong model. They dubbed this method weak-to-strong generalization, and we expect to see more results on this from OpenAI in 2024.
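The auxiliary confidence loss can be sketched for a binary classifier as a blend of two cross-entropy terms: one pulling the strong model toward the weak teacher's label, and one pulling it toward its own hardened prediction. This is a simplified reading of the paper's idea; the fixed `threshold=0.5` and `alpha=0.5` are assumptions made here for illustration, and the function names are hypothetical.

```python
import math


def cross_entropy(pred_prob: float, target: float) -> float:
    """Binary cross-entropy between a predicted probability and a 0/1 target."""
    eps = 1e-12
    p = min(max(pred_prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def confidence_loss(strong_prob: float, weak_label: float,
                    alpha: float = 0.5, threshold: float = 0.5) -> float:
    """Blend imitation of the weak label with confidence in the strong model."""
    # Harden the strong model's own prediction into a 0/1 pseudo-label.
    hardened = 1.0 if strong_prob > threshold else 0.0
    weak_term = cross_entropy(strong_prob, weak_label)
    self_term = cross_entropy(strong_prob, hardened)
    return (1.0 - alpha) * weak_term + alpha * self_term


# The strong model is confident (0.9) but the weak teacher says 0.
plain = confidence_loss(0.9, 0.0, alpha=0.0)     # pure imitation of the teacher
with_aux = confidence_loss(0.9, 0.0, alpha=0.5)  # with the auxiliary term
print(plain > with_aux)
```

With the auxiliary term, the strong model is penalized less for confidently disagreeing with a weak teacher, which is the intuition behind the remedy.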

6. The exciting development of robotics

As LLMs continue to mature and become more powerful, the frontier of AI has shifted from building digital models to physical ones. The next stage of AI development will be in the field of robotics.

The progress of robotics in 2023 is exciting, though not as rapid as that of LLMs. This is primarily due to the inherent challenges in building and testing physical components. An exciting achievement in this domain is Tesla Optimus 2, capable of delicately picking up and placing an egg without breaking it. Such precise handling marks a significant breakthrough for robots entering households.

Another noteworthy breakthrough is the transformer-based robotic architecture RT-2 [13]. It introduced a vision-language-action model that encodes robot actions as tokens to be processed by a transformer. The transformer can then generate such action tokens for the robot to act on. The architecture looks like this:

The transformer model can accept text and images as inputs and then generate corresponding actions. This architecture will enable today’s robots to use an LLM as their core model. Such a robot can have listening, seeing, and speaking capabilities in addition to moving and grasping.
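Encoding actions as tokens can be illustrated with uniform binning: each continuous action dimension is mapped to an integer bin, and the bin index serves as the token. The sketch assumes 256 bins per dimension and a normalized range of -1 to 1; both values, and the function names, are assumptions made for this example.

```python
NUM_BINS = 256  # assumption: 256 uniform bins per action dimension


def action_to_tokens(action, low=-1.0, high=1.0):
    """Map each continuous action value to the index of its nearest bin."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        # Scale to [0, NUM_BINS - 1] and round to the nearest integer.
        bin_index = int((clipped - low) / (high - low) * (NUM_BINS - 1) + 0.5)
        tokens.append(bin_index)
    return tokens


def tokens_to_action(tokens, low=-1.0, high=1.0):
    """Map bin indices back to approximate continuous action values."""
    return [low + t / (NUM_BINS - 1) * (high - low) for t in tokens]


action = [0.0, -1.0, 1.0, 0.25]   # e.g. normalized arm displacements
tokens = action_to_tokens(action)
recovered = tokens_to_action(tokens)
print(tokens)
```

The round trip loses at most half a bin width of precision, which is the price of letting a language model emit actions as ordinary tokens.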

In October, Google researchers released the Open X-Embodiment dataset [14]. Collected from 22 different robots through a collaboration between 21 institutions, it contains 527 skills. This dataset can help robots jump-start their learning and leverage the "pre-training" in other skills to boost their performance. As a result, it will accelerate robotics development.

7. Detecting brain activities

When we measure a person's brain signals, can we actually detect what the person is hearing or seeing? Another astounding achievement in 2023 involves real-time image reconstruction based on brain signals recorded by MEG [15]. The level of accuracy it achieves is truly astonishing.

It appears that we can recover not only the correct shape and color but also very specific details from the brain signals. This work, conducted by Meta researchers, builds upon the earlier work of detecting speech from brain signals and image reconstruction from fMRI recordings.

In the near future, we may be able to apply these techniques to a person while they are sleeping and monitor their dreams. Could it be possible that one day we can project a person's dream onto a big screen like a movie? Research in image recovery is expected to continue in 2024, likely yielding much better performance.

Reposted from: AI in 2023: A Review - by Junling Hu - AI Frontiers (substack.com)
