ChatGPT has broken the Turing test: it's time to find new ways to evaluate AI

**Source:** AI Frontline

**Author | Celeste Biever**

Translator | Nucle-Cola

Planning | Dongmei

Image source: generated by the Unbounded AI tool, general model (paper-cut)

Large language models are remarkably good at simulating human language, but scientists remain divided over how well they can reason.

On July 25, "Nature" stated in an article that ChatGPT has broken the Turing test, and it is time to enable other new methods to evaluate artificial intelligence technology.

The world's strongest artificial intelligence (AI) systems can pass rigorous exams, write convincing essays, and chat so fluently that many people cannot tell their output from a human's. Is there anything they can't do? There is, and the problems involved are surprisingly simple.

The tests present a series of brightly coloured blocks arranged on a screen, and most people can quickly work out the answer to this kind of visual logic puzzle. But GPT-4, the technology behind the chatbot ChatGPT and the search engine Bing, and the current pinnacle of AI, clearly struggles. A study in May of this year showed that GPT-4 answered only about one third of one category of pattern puzzles correctly, and a measly 3% of another category.

The research team behind the logic puzzles hopes the test will provide a better benchmark for AI systems and help address an inherent shortcoming of large language models such as GPT-4. To sum up: on language tests, large language models easily accomplish intellectual feats once regarded as milestones; but on visual logic tests their performance is weak, they have obvious blind spots, and they cannot make inferences from abstract concepts.

"Practitioners in the AI field are grappling with the difficult problem of evaluating large language model systems," says Melanie Mitchell, a computer scientist at the Santa Fe Research Institute in New Mexico. To that end, her team has put together this set of logical problems.

In the past two or three years, large language models have thoroughly outclassed earlier AI systems in their ability to handle many kinds of tasks. Their working principle is uncomplicated: based on the billions of sentences of online text they are exposed to during training, they learn the statistical correlations between words and then generate a plausible next word for a given input text. For chatbots built on top of large language models, there is an additional element: human trainers provide extensive feedback, which fine-tunes how the bot responds.
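To make the idea of statistical next-word generation concrete, here is a minimal sketch in Python, assuming nothing more than a toy corpus and simple bigram counts. Real large language models use transformer neural networks trained on vastly more text and predict over subword tokens; this toy only illustrates the principle of picking a plausible continuation from observed statistics.

```python
from collections import Counter, defaultdict

# A toy "training corpus"; real models are trained on billions of sentences.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the chair",
    "the dog chased the cat",
]

# Count how often each word follows each other word (bigram statistics).
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the continuation most frequently seen after `word` in training."""
    candidates = following.get(word)
    if not candidates:
        return "<unknown>"
    return candidates.most_common(1)[0][0]

print(predict_next("cat"))  # -> 'sat', the statistically likeliest next word
```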

It is worth noting that algorithms trained on such massive human language corpora, with properties similar to autocompletion, have demonstrated a wide range of problem-solving capabilities. Legacy AI systems may beat large language models on a specific task, but they must be trained on large amounts of problem-specific data, and their abilities cannot be quickly transferred from one task to another.

Broadly speaking, researchers fall into two camps with diametrically opposed views on how large language models work under the hood, says Tomer Ullman, a cognitive scientist at Harvard University. Some attribute the algorithms' achievements to genuine reasoning or comprehension, while others (including Ullman himself and researchers such as Mitchell) are more cautious.

According to Ullman, people on both sides of this debate are brilliant and at the top of their field. The root cause of the disagreement is the lack of hard evidence to support either view. "After all, there is no stable and reliable intelligence detector, like a Geiger counter, that can clearly answer 'intelligent' or 'not intelligent'."

Researchers on both sides of the debate say that tests such as the logic puzzles, which reveal differences in capability between humans and AI systems, are a step in the right direction. Brenden Lake, a computational cognitive scientist at New York University, says such benchmarks can also help reveal capabilities that today's machine-learning systems lack, and clarify what exactly human intelligence is made of.

Testing large language models and benchmarking their abilities also has practical significance. Mitchell points out that if you want to apply large language models to real-world settings such as medicine and law, you must first clarify where the boundaries of their capabilities lie. "We have to figure out what it can and can't do before we can judge how to use it safely."

Is the Turing test obsolete?

In the field of testing machine intelligence, the most famous scheme has long been the Turing test, proposed by the British mathematician and computing pioneer Alan Turing in 1950, when computers were in their infancy. Turing proposed an evaluation method he called the "imitation game": a human judge holds short text conversations with a computer and with a human, both hidden behind a screen, and tries to work out which is the machine. Turing believed this could answer the question "Can machines think?"

Mitchell points out that Turing did not specify many details of the scenario, so there are no exact rules to follow. According to François Chollet, a software engineer at Google, "the Turing test is not a concrete test that can actually be run on a machine; it's more of a thought experiment."

But the idea of using language to test whether a machine can think has been deeply ingrained in the technology world. For decades, the businessman and philanthropist Hugh Loebner funded an annual Turing test event known as the Loebner Prize. But the computer scientist Rob Wortham says the competition stopped after 2019 because its funding ran out following Loebner's death. Wortham is co-director of the UK's Society for the Study of Artificial Intelligence and Simulation of Behaviour, which has run the competition on Loebner's behalf since 2014. Large language models, he explains, now essentially have the ability to fool humans, so it is a darkly humorous twist that the Loebner Prize was forced to stop on the eve of their full take-off.

Other researchers also believe that large language models such as GPT-4 are now capable of passing the Turing test, at least in short conversations, where most people would struggle to tell who is a human and who is a large model. In May, researchers at AI21 Labs in Tel Aviv, Israel, reported that more than 1.5 million people had played an online game based on the Turing test. Players engaged in a two-minute chat with either another player or a large language model that, following prompts from the researchers, posed as a real person. Players correctly identified the bot only 60% of the time, not much better than random guessing [3].

However, researchers who are more familiar with large language models can still pick out chatbots from various details. Chollet notes that he finds it easy to detect a large language model simply by exploiting the system's known weaknesses. "If I were put to the test of deciding whether I was talking to a large language model, I would definitely get the right answer."

The key is to take the large language model out of its comfort zone. His trick is to present it with scenarios that differ from those common in its training data. In most cases, the model outputs the words most likely given its training data, rather than genuinely giving the correct answer for the new scenario.

Moreover, Chollet and others are sceptical of testing based on deceptive performance. "It exists, quite obviously, to deceive human referees." Such tests only encourage developers to instil more camouflage skills into AI, and do not inspire more useful or interesting capabilities.

Benchmarks are unreliable

Researchers typically evaluate AI systems with benchmarks that assess specific abilities, such as language, common-sense reasoning and mathematics, and technology teams are increasingly turning to academic and professional exams designed for humans.

When GPT-4 was first released in March, the San Francisco, California-based company OpenAI evaluated the new model's performance on a series of benchmarks designed for machines, including reading comprehension, maths and coding. As OpenAI reported, GPT-4 performed well on most of them [4]. The company also set GPT-4 around 30 exams, including: a variety of Advanced Placement exams taken by American high-school students; an exam assessing the clinical knowledge of American doctors; and the Graduate Record Examination (GRE), used in the selection of American graduate students. GPT-4 managed to score in the top 10% on the Uniform Bar Examination, which forms part of the bar exam in several US states.

AI system performance on human exams (excerpt of results). Source: OpenAI, reference [4]. The percentile shown is the position that a human candidate achieving the same score would occupy among all test-takers.

Mitchell acknowledges that "quite a few language models do well on these benchmarks. But in most cases it's not that they outperform humans in general ability; rather, the benchmarks themselves have limitations." One strong concern raised by researchers is that, because the models are trained on enormous amounts of text, they are likely to have already seen similar problems in their training data. Benchmark conclusions drawn under these circumstances suffer from what is known as "contamination" and are obviously not credible.

OpenAI says it checked for this by looking for similar strings in the test problems and the training data. Testing the large language model before and after removing such strings showed little change in performance, suggesting that the extremely high scores had nothing to do with contamination, but some researchers have questioned whether the check was rigorous enough.
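OpenAI's exact procedure is not spelled out here, so the sketch below is only an assumption-laden illustration of the kind of check described: look for test questions whose text overlaps verbatim with the training data, set them aside, and compare the model's scores on the "clean" and "possibly seen" subsets. The 50-character window and the in-memory lists are simplifications; a real check would run over terabytes of data.

```python
def has_overlap(question: str, training_docs: list[str], window: int = 50) -> bool:
    """Flag a test question as possibly contaminated if any fixed-length
    substring of it appears verbatim in a training document."""
    normalized = " ".join(question.split()).lower()
    chunks = {
        normalized[i:i + window]
        for i in range(0, max(1, len(normalized) - window + 1), window)
    }
    return any(chunk in doc.lower() for doc in training_docs for chunk in chunks)

def split_benchmark(questions: list[str], training_docs: list[str]):
    """Split a benchmark into clean and possibly-seen questions, so the model
    can be scored separately on each subset and the scores compared."""
    clean, flagged = [], []
    for question in questions:
        (flagged if has_overlap(question, training_docs) else clean).append(question)
    return clean, flagged
```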

Sam Bowman is a language-technology scientist at New York University who also works at Anthropic, an AI company in San Francisco. He warns against simply writing off GPT-4's test scores as the result of "having seen similar problems" and denying its ability. In his view, "the contamination question does complicate the situation a bit, but I don't think it really affects the bigger picture."

Researchers also point out that the ability of large language models to score highly on exams is fragile and may not translate into the ability to make correct judgements in the real world. According to Mitchell, just a small tweak to an exam question can trip a large model up. For example, she took a question from an MBA exam that ChatGPT had passed and changed it slightly; a human could easily adjust their answer to match the change, but ChatGPT failed miserably.

There is another, deeper problem in deciphering what benchmark results imply. For a human, a high score on these tests generally reflects strong general intelligence. In fact, intelligence is itself a vague concept, largely reflected in the ability to adapt to different environments across a range of tasks. In other words, a high test score demonstrates that the person has good cognitive abilities and a good command of certain abstract concepts. But this is not the case for large language models. Mitchell emphasises that large models arrive at answers in a very different way from humans: "In most cases, AI systems are not doing reasoning in a way that humans are familiar with."

This may be because large language models can only learn from language; lacking any channel connecting them to the real world, they cannot experience the links between language and objects, attributes and emotions the way humans do. "It's clear that they don't understand words the way humans do," says Lake. In his view, the current evidence suggests that large language models "can use language very fluently without actually understanding what they're saying."

On the other hand, large language models have also shown abilities that humans do not have, such as knowing the relationships between almost all the words that humans have ever written down. Mitchell says this may mean the models are relying on certain characteristics of language, or on other proxies, to solve problems, without needing broader reasoning ability.

Nick Ryder, a researcher at OpenAI, agrees with this judgement: the performance of an AI on a single test is not enough to prove general ability the way it would for a human subject. "I don't think people should directly compare human scores with the scores of large language models." The scores released by OpenAI "do not describe the human-like ability or human-like reasoning level of large language models; they simply show how well these models perform on these tasks."

In addition to traditional machine benchmarks and human professional exams, researchers have also probed large language models more broadly. In March of this year, Sébastien Bubeck of Microsoft Research and his colleagues released a preprint entitled "Sparks of Artificial General Intelligence: Early experiments with GPT-4", which caused heated discussion in the industry [5]. Using an early version of GPT-4, they documented a surprising set of capabilities, many of which were not directly or obviously linked to language. One noteworthy finding was that it passed tests used to evaluate theory of mind, the core human ability to predict and reason about the mental states of others. "Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system," they wrote in the paper.

But Bubeck himself later clarified, emphasising that "GPT-4 certainly doesn't think like a human, and it has its own unique and different way of implementing whatever capabilities it exhibits."

Mitchell believes that although the report is quite radical, it does not systematically explore the capabilities of large language models: "It's more like an anthropological study." Ullman adds that to prove a machine has mastered theory of mind, one must at least provide evidence of a corresponding underlying cognitive process, rather than making a crude assertion based simply on the machine producing the same answers as a human.

AI researchers believe that broader and more rigorous scrutiny is needed to understand the strengths and weaknesses of large language models. The colourful logic puzzles may be an important part of that.

Fresh puzzles

In 2019, just before the explosion of large language models, Chollet released online a new set of logic tests compiled specifically for AI systems, called the Abstraction and Reasoning Corpus (ARC). Solvers are shown a visual demonstration in which several square grids transform into other patterns, and must show that they have understood the rule of transformation by indicating how the next grid will change. "It's a test of our ability to adapt to things we haven't seen before," says Chollet, who believes this ability to find patterns is the essence of intelligence.
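To give a sense of the task format, here is a toy, made-up example in Python. The publicly released ARC tasks are stored as JSON objects with "train" demonstration pairs and a "test" input, each grid being a small matrix of colour indices; the grids and the deliberately trivial hidden rule below (mirror each row left to right) are invented for illustration and are far simpler than real ARC puzzles.

```python
# An ARC-style task: a few demonstration input->output grids plus a test input.
# Each grid is a small matrix of colour indices (0 = black/background).
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0, 0], [0, 4, 0]], "output": [[0, 0, 3], [0, 4, 0]]},
    ],
    "test": [
        {"input": [[5, 0, 0], [6, 0, 0]]},  # the solver must infer the rule
    ],
}

def mirror_left_right(grid):
    """The hidden rule in this toy task: reflect each row horizontally."""
    return [list(reversed(row)) for row in grid]

# Check the rule against the demonstrations, then apply it to the test input.
assert all(mirror_left_right(pair["input"]) == pair["output"] for pair in task["train"])
print(mirror_left_right(task["test"][0]["input"]))  # [[0, 0, 5], [0, 0, 6]]
```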

According to Lake, ARC captures "the hallmark of human intelligence": abstracting from everyday knowledge and applying it to never-before-seen problems.

Chollet organised a competition for ARC-solving bots in 2020, before large language models gained widespread traction. The winning AI system had been trained specifically to be good at ARC-like tasks; unlike large language models it has no general capabilities, and it answered only 21% of the problems correctly. By comparison, humans solve ARC problems correctly 80% of the time [7]. Multiple research teams are now using ARC to test the capabilities of large language models, and none has come close to human performance.

Mitchell and her colleagues developed a new set of puzzles, called ConceptARC, inspired by ARC but differing in two main ways. ConceptARC is easier: Mitchell's team wanted a benchmark that would register advances in machine capability, even small ones. Second, the team selected specific concepts to test and then created a series of themed puzzle variations around each concept.

For example, to test the concept of sameness, one puzzle requires the solver to keep objects of the same shape in place, while another requires the solver to align objects of the same shape along an axis. The idea is to reduce the chance of an AI system passing the test without actually grasping the concept.

What does poor performance mean?

The researchers gave the ConceptARC tasks to GPT-4 and to 400 recruited human subjects. On average, the humans scored 91% across all concept groups (97% on the best group); GPT-4 scored 33% on its best group and under 30% on every other group.

"We showed that the machine still falls short of human-level intelligence," Mitchell said. "But surprisingly, it was able to solve some of these problems despite never being trained on them."

The team also tested the bots that won Chollet's competition. These are not general-purpose systems like large language models; they were trained specifically for vision problems such as ARC. Overall they performed better than GPT-4, but still worse than humans, scoring 77% on the best concept group and below 60% on most of the others [1].

However, Bowman believes that GPT-4's poor showing on ConceptARC does not prove that it lacks underlying abstract-reasoning capabilities. In his view there is a mismatch between ConceptARC and GPT-4, since it is, after all, a visual test. "Even if these models are really good at this kind of conceptual reasoning, it's unlikely that they would score well on such tests the first time around."

Limitations of the test procedure may also have contributed to GPT-4's poor performance. The public version of the large language model accepts only text input, so the researchers submitted arrays of numbers describing the images. (For example, a blank pixel might be represented by a 0 and a coloured square by a corresponding number.) The human subjects, by contrast, saw the images directly. Mitchell also admits, "We're comparing a pure language system with humans, and humans have a highly developed visual system, so I'm afraid the comparison isn't entirely fair."
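The article does not give the exact prompt format Mitchell's team used, so the snippet below is only a guess at the kind of conversion described: a grid of coloured cells is serialised as rows of integers (0 for a blank pixel, other numbers for colours) so that a text-only model can read it.

```python
def grid_to_text(grid: list[list[int]]) -> str:
    """Serialize a colour grid as plain text for a text-only language model.
    0 denotes a blank pixel; other integers denote colours."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

puzzle_input = [
    [0, 3, 0],
    [3, 3, 3],
    [0, 3, 0],
]

prompt = (
    "Here is the input grid of a puzzle, written as numbers:\n"
    + grid_to_text(puzzle_input)
    + "\nWhat should the output grid be?"
)
print(prompt)
```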

OpenAI has built a "multimodal" version of GPT-4 that can accept image input directly. Mitchell's team is waiting for that technology to be made publicly available so that it can run another round of ConceptARC, though she doesn't expect the multimodal GPT-4 to do much better. "I don't think these systems yet have a level of abstraction and reasoning comparable to humans."

Sam Acquaviva, a computational cognitive scientist at the Massachusetts Institute of Technology, agrees. He points to a version of the test in which the patterns are confined to a single row rather than a grid [8]. That should remove some of the unfairness, but Acquaviva observes that although GPT-4's performance improved, it was still not enough to demonstrate that large language models reliably understand and reason about rules.

Arguments about reasoning

Bowman also points to a number of other experiments whose combined results suggest that large language models have at least acquired a basic ability to reason about abstract concepts. In one case, Harvard computer scientist Kenneth Li and his colleagues used a digital version of Reversi (Othello), in which players place black and white discs on an 8 × 8 grid. They wanted to assess whether large language models merely rely on memorised statistical relationships in language to generate text, or whether they can, like humans, build internal representations of the world.

After being fed a training set of human players' moves, the large language model quickly learned to choose plausible next moves. The researchers argue that this shows the model had grasped the situation on the board and was suggesting moves based on the current position, clearly going beyond the constraints of mere text [9].
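Li and colleagues' actual setup, a GPT-style network trained on Othello move sequences whose hidden activations are then probed, is more involved than can be reproduced here. The sketch below only illustrates the probing idea under stated assumptions: the hidden-state vectors are random stand-ins, and scikit-learn's logistic regression plays the role of a simple probe that tries to read the contents of one board square out of the model's internal state.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: `hidden_states` are activation vectors taken from a language model
# after it reads a sequence of game moves, and `square_label` records whether a
# chosen board square is empty (0), black (1) or white (2) at that point.
# Here both are random stand-ins, just to make the sketch runnable.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 64))    # 1000 positions, 64-dim states
square_label = rng.integers(0, 3, size=1000)   # ground-truth board contents

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, square_label, test_size=0.2, random_state=0
)

# A simple "probe": if a classifier can recover the square's contents from the
# hidden state, the model plausibly encodes the board state internally.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```

On these random stand-ins the probe scores near chance; the argument in the study rests on probes doing far better than chance on the real model's activations.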

Bowman admits that the reasoning abilities of large language models are patchy overall and do not reach the heights of human reasoning. But he thinks the ability does exist, and it seems to improve with model size; in other words, future large language models will perform better and better. "These systems aren't as reliable or general as we'd like them to be, and they're completely confused about certain kinds of abstract reasoning. But I think their fundamental reasoning abilities are objectively there."

Researchers such as Bowman and Mitchell also agree that how best to test large language models for abstract reasoning and other markers of intelligence remains an open question. Michael Frank, a cognitive scientist at Stanford University, believes there will be no single, all-encompassing test to replace the Turing test; instead, researchers need to devise a wide range of tests to quantify the strengths and weaknesses of different systems. "These agents are great, they're just flawed in many ways, so the most important thing is to explore this systematically."

Wortham advises those approaching AI systems to resist the temptation to anthropomorphise. "We always try to understand anything that shows intelligence as if it were human, and that's really unnecessary."

"It's even something of a curse, in that we can't imagine any form of intelligence showing clear goal-directed behaviour other than our own. We keep wishfully assuming that it does so with the same deep style of thinking that we have."

References:

1. Moskvichev, A., Odouard, V. V. & Mitchell, M. Preprint (2023).
2. Turing, A. M. Mind LIX, 433–460 (1950).
3. Jannai, D., Meron, A., Lenz, B., Levine, Y. & Shoham, Y. Preprint (2023).
4. OpenAI. Preprint (2023).
5. Bubeck, S. et al. Preprint (2023).
6. Chollet, F. Preprint (2019).
7. Johnson, A., Vong, W. K., Lake, B. M. & Gureckis, T. M. Preprint (2021).
8. Xu, Y., Li, W., Vaezipoor, P., Sanner, S. & Khalil, E. B. Preprint (2023).
9. Li, K. et al. Proc. Eleventh Int. Conf. Learn. Represent. (2023).
