CVApr 25, 2022
A very preliminary analysis of DALL-E 2Gary Marcus, Ernest Davis, Scott Aaronson
The DALL-E 2 system generates original synthetic images corresponding to an input text as caption. We report here on the outcome of fourteen tests of this system designed to assess its common sense, reasoning and ability to understand complex texts. All of our prompts were intentionally much more challenging than the typical ones that have been showcased in recent weeks. Nevertheless, for 5 out of the 14 prompts, at least one of the ten images fully satisfied our requests. On the other hand, on no prompt did all of the ten images satisfy our requests.
AIAug 10, 2023
Testing GPT-4 with Wolfram Alpha and Code Interpreter plug-ins on math and science problemsErnest Davis, Scott Aaronson
This report describes a test of the large language model GPT-4 with the Wolfram Alpha and the Code Interpreter plug-ins on 105 original problems in science and math, at the high school and college levels, carried out in June-August 2023. Our tests suggest that the plug-ins significantly enhance GPT's ability to solve these problems. Having said that, there are still often "interface" failures; that is, GPT often has trouble formulating problems in a way that elicits useful answers from the plug-ins. Fixing these interface failures seems like a central challenge in making GPT a reliable tool for college-level calculation problems.
AIFeb 9, 2023
Benchmarks for Automated Commonsense Reasoning: A SurveyErnest Davis
More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of artificial intelligence (AI) systems. However, these benchmarks are often flawed and many aspects of common sense remain untested. Consequently, we do not currently have any reliable way of measuring to what extent existing AI systems have achieved these abilities. This paper surveys the development and uses of AI commonsense benchmarks. We discuss the nature of common sense; the role of common sense in AI; the goals served by constructing commonsense benchmarks; and desirable features of commonsense benchmarks. We analyze the common flaws in benchmarks, and we argue that it is worthwhile to invest the work needed ensure that benchmark examples are consistently high quality. We survey the various methods of constructing commonsense benchmarks. We enumerate 139 commonsense benchmarks that have been developed: 102 text-based, 18 image-based, 12 video based, and 7 simulated physical environments. We discuss the gaps in the existing benchmarks and aspects of commonsense reasoning that are not addressed in any existing benchmark. We conclude with a number of recommendations for future development of commonsense AI benchmarks.
AIJan 23, 2023
Mathematics, word problems, common sense, and artificial intelligenceErnest Davis
The paper discusses the capacities and limitations of current artificial intelligence (AI) technology to solve word problems that combine elementary knowledge with commonsense reasoning. No existing AI systems can solve these reliably. We review three approaches that have been developed, using AI natural language technology: outputting the answer directly, outputting a computer program that solves the problem, and outputting a formalized representation that can be input to an automated theorem verifier. We review some benchmarks that have been developed to evaluate these systems and some experimental studies. We discuss the limitations of the existing technology at solving these kinds of problems. We argue that it is not clear whether these kinds of limitations will be important in developing AI technology for pure mathematical research, but that they will be important in applications of mathematics, and may well be important in developing programs capable of reading and understanding mathematical content written by humans.
AIAug 14, 2022
Limits of an AI program for solving college math problemsErnest Davis
Drori et al. (2022) report that "A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level ... [It] automatically answers 81\% of university-level mathematics problems." The system they describe is indeed impressive; however, the above description is very much overstated. The work of solving the problems is done, not by a neural network, but by the symbolic algebra package Sympy. Problems of various formats are excluded from consideration. The so-called "explanations" are just rewordings of lines of code. Answers are marked as correct that are not in the form specified in the problem. Most seriously, it seems that in many cases the system uses the correct answer given in the test corpus to guide its path to solving the problem.
CLApr 3, 2022
Pragmatic constraints and pronoun reference disambiguation: the possible and the impossibleErnest Davis
Pronoun disambiguation in understanding text and discourse often requires the application of both general pragmatic knowledge and context-specific information. In AI and linguistics research, this has mostly been studied in cases where the referent is explicitly stated in the preceding text nearby. However, pronouns in natural text often refer to entities, collections, or events that are only implicitly mentioned previously; in those cases the need to use pragmatic knowledge to disambiguate becomes much more acute and the characterization of the knowledge becomes much more difficult. Extended literary texts at times employ both extremely complex patterns of reference and extremely rich and subtle forms of knowledge. Indeed, it is occasionally possible to have a pronoun that is far separated from its referent in a text. In the opposite direction, pronoun use is affected by considerations of focus of attention and by formal constraints such as a preference for parallel syntactic structures; these can be so strong that no pragmatic knowledge suffices to overrule them.
CYOct 11, 2024
Testing GPT-4-o1-preview on math and science problems: A follow-up studyErnest Davis
In August 2023, Scott Aaronson and I reported the results of testing GPT4 with the Wolfram Alpha and Code Interpreter plug-ins over a collection of 105 original high-school level and college-level science and math problems (Davis and Aaronson, 2023). In September 2024, I tested the recently released model GPT-4o1-preview on the same collection. Overall I found that performance had significantly improved, but was still considerably short of perfect. In particular, problems that involve spatial reasoning are often stumbling blocks.
AIJan 22, 2022
Physical Reasoning in an Open WorldZhuoran Zeng, Ernest Davis
Most work on physical reasoning, both in artificial intelligence and in cognitive science, has focused on closed-world reasoning, in which it is assumed that the problem specification specifies all relevant objects and substance, all their relations in an initial situation, and all exogenous events. However, in many situations, it is important to do open-world reasoning; that is, making valid conclusions from very incomplete information. We have implemented in Prolog an open-world reasoner for a toy microworld of containers that can be loaded, unloaded, sealed, unsealed, carried, and dumped.
CLJan 7, 2022
The Defeat of the Winograd Schema ChallengeVid Kocijan, Ernest Davis, Thomas Lukasiewicz et al.
The Winograd Schema Challenge - a set of twin sentences involving pronoun reference disambiguation that seem to require the use of commonsense knowledge - was proposed by Hector Levesque in 2011. By 2019, a number of AI systems, based on large pre-trained transformer-based language models and fine-tuned on these kinds of problems, achieved better than 90% accuracy. In this paper, we review the history of the Winograd Schema Challenge and discuss the lasting contributions of the flurry of research that has taken place on the WSC in the last decade. We discuss the significance of various datasets developed for WSC, and the research community's deeper understanding of the role of surrogate tasks in assessing the intelligence of an AI system.
LGDec 8, 2021
Deep Learning and Mathematical Intuition: A Review of (Davies et al. 2021)Ernest Davis
A recent paper by Davies et al (2021) describes how deep learning (DL) technology was used to find plausible hypotheses that have led to two original mathematical results: one in knot theory, one in representation theory. I argue here that the significance and novelty of this application of DL technology to mathematics is significantly overstated in the paper under review and has been wildly overstated in some of the accounts in the popular science press. In the knot theory result, the role of DL was small, and a conventional statistical analysis would probably have sufficed. In the representation theory result, the role of DL is much larger; however, it is not very different in kind from what has been done in experimental mathematics for decades. Moreover, it is not clear whether the distinctive features of DL that make it useful here will apply across a wide range of mathematical problems. Finally, I argue that the DL here "guides human intuition" is unhelpful and misleading; what the DL does primarily does is to mark many possible conjectures as false and a few others as possibly worthy of study. Certainly the representation theory result represents an original and interesting application of DL to mathematical research, but its larger significance is uncertain.
AIMay 24, 2021
A Flawed Dataset for Symbolic Equation VerificationErnest Davis
Arabshahi, Singh, and Anandkumar (2018) propose a method for creating a dataset of symbolic mathematical equations for the tasks of symbolic equation verification and equation completion. Unfortunately, a dataset constructed using the method they propose will suffer from two serious flaws. First, the class of true equations that the procedure can generate will be very limited. Second, because true and false equations are generated in completely different ways, there are likely to be artifactual features that allow easy discrimination. Moreover, over the class of equations they consider, there is an extremely simple probabilistic procedure that solves the problem of equation verification with extremely high reliability. The usefulness of this problem in general as a testbed for AI systems is therefore doubtful.
CVJan 25, 2021
Unanswerable Questions about Images and TextsErnest Davis
Questions about a text or an image that cannot be answered raise distinctive issues for an AI. This note discusses the problem of unanswerable questions in VQA (visual question answering), in QA (visual question answering), and in AI generally.
CLAug 1, 2020
The test set for the TransCoder systemErnest Davis
The TransCoder system translates source code between Java, C++, and Python 3. The test set that was used to evaluate its quality is missing important features of Java, including the ability to define and use classes and the ability to call user-defined functions other than recursively. Therefore, the accuracy of TransCoder over programs with those features remains unknown.
CLApr 23, 2020
A Review of Winograd Schema Challenge Datasets and ApproachesVid Kocijan, Thomas Lukasiewicz, Ernest Davis et al.
The Winograd Schema Challenge is both a commonsense reasoning and natural language understanding challenge, introduced as an alternative to the Turing test. A Winograd schema is a pair of sentences differing in one or two words with a highly ambiguous pronoun, resolved differently in the two sentences, that appears to require commonsense knowledge to be resolved correctly. The examples were designed to be easily solvable by humans but difficult for machines, in principle requiring a deep understanding of the content of the text and the situation it describes. This paper reviews existing Winograd Schema Challenge benchmark datasets and approaches that have been published since its introduction.
LGDec 12, 2019
The Use of Deep Learning for Symbolic Integration: A Review of (Lample and Charton, 2019)Ernest Davis
Lample and Charton (2019) describe a system that uses deep learning technology to compute symbolic, indefinite integrals, and to find symbolic solutions to first- and second-order ordinary differential equations, when the solutions are elementary functions. They found that, over a particular test set, the system could find solutions more successfully than sophisticated packages for symbolic mathematics such as Mathematica run with a long time-out. This is an impressive accomplishment, as far as it goes. However, the system can handle only a quite limited subset of the problems that Mathematica deals with, and the test set has significant built-in biases. Therefore the claim that this outperforms Mathematica on symbolic integration needs to be very much qualified.
AIAug 5, 2016
Winograd Schemas and Machine TranslationErnest Davis
A Winograd schema is a pair of sentences that differ in a single word and that contain an ambiguous pronoun whose referent is different in the two sentences and requires the use of commonsense knowledge or world knowledge to disambiguate. This paper discusses how Winograd schemas and other sentence pairs could be used as challenges for machine translation using distinctions between pronouns, such as gender, that appear in the target language but not in the source.
AIJun 16, 2015
The Scope and Limits of Simulation in Cognitive ModelsErnest Davis, Gary Marcus
It has been proposed that human physical reasoning consists largely of running "physics engines in the head" in which the future trajectory of the physical system under consideration is computed precisely using accurate scientific theories. In such models, uncertainty and incomplete knowledge is dealt with by sampling probabilistically over the space of possible trajectories ("Monte Carlo simulation"). We argue that such simulation-based models are too weak, in that there are many important aspects of human physical reasoning that cannot be carried out this way, or can only be carried out very inefficiently; and too strong, in that humans make large systematic errors that the models cannot account for. We conclude that simulation-based reasoning makes up at most a small part of a larger system that encompasses a wide range of additional cognitive processes.
AINov 6, 2014
The Limitations of Standardized Science Tests as Benchmarks for Artificial Intelligence Research: Position PaperErnest Davis
In this position paper, I argue that standardized tests for elementary science such as SAT or Regents tests are not very good benchmarks for measuring the progress of artificial intelligence systems in understanding basic science. The primary problem is that these tests are designed to test aspects of knowledge and ability that are challenging for people; the aspects that are challenging for AI systems are very different. In particular, standardized tests do not test knowledge that is obvious for people; none of this knowledge can be assumed in AI systems. Individual standardized tests also have specific features that are not necessarily appropriate for an AI benchmark. I analyze the Physics subject SAT in some detail and the New York State Regents Science test more briefly. I also argue that the apparent advantages offered by using standardized tests are mostly either minor or illusory. The one major real advantage is that the significance is easily explained to the public; but I argue that even this is a somewhat mixed blessing. I conclude by arguing that, first, more appropriate collections of exam style problems could be assembled, and second, that there are better kinds of benchmarks than exam-style problems. In an appendix I present a collection of sample exam-style problems that test kinds of knowledge missing from the standardized tests.
AIOct 4, 2013
The Relevance of Proofs of the Rationality of Probability Theory to Automated Reasoning and Cognitive ModelsErnest Davis
A number of well-known theorems, such as Cox's theorem and de Finetti's theorem. prove that any model of reasoning with uncertain information that satisfies specified conditions of "rationality" must satisfy the axioms of probability theory. I argue here that these theorems do not in themselves demonstrate that probabilistic models are in fact suitable for any specific task in automated reasoning or plausible for cognitive models. First, the theorems only establish that there exists some probabilistic model; they do not establish that there exists a useful probabilistic model, i.e. one with a tractably small number of numerical parameters and a large number of independence assumptions. Second, there are in general many different probabilistic models for a given situation, many of which may be far more irrational, in the usual sense of the term, than a model that violates the axioms of probability theory. I illustrate this second point with an extended examples of two tasks of induction, of a similar structure, where the reasonable probabilistic models are very different.