This wasn’t quite the Winograd Schema Challenge that I had set out on. Originally this language comprehension contest for A.I. was announced in July 2014, to be run in October 2015, but was postponed to February 2016, and then again to July 2016. I was just about to ship my program overseas, three weeks before the last-accepted arrival date of postal entries, when the contest announced changes to the rules and technical format.
Some universities had been training with ambiguous pronouns like this:
I had been practising on the official Winograd schemas like this:
Whereas the final test featured this:
The programs were now faced with any number of consecutively ambiguous pronouns in passages from 1940s children’s novels, which made quite a difference. It turns out the organisers had already decided on this last year, as appears from their sensible enough explanation in a members-only AI magazine (Winograd schemas are too hard to compose). Unfortunately they somehow did not see fit to share these changes on the contest website until too late. While the benchmark of 65% had previously been feasible, it now quickly became unlikely that anyone would win anything this year. A number of would-be participants backed out.
The contest finally took place at the IJCAI conference in New York with four contestants: the Open University of Cyprus, the University of Science and Technology of China, the independent Denis Robert from France, and myself from the Netherlands. Curiously absent were a number of American universities who had previously reported successes of over 70% for solving Winograd schemas. The absence of Google, IBM, and other commercial powerhouses was less strange, if you consider that the winner was obligated to publish their methods so that others could reproduce them, and that anything below human level would be portrayed as a failure in the media.
The glass is half full
The A.I. programs were asked to figure out 60 multiple choice pronouns, with such ambiguity that they were to be solved through an understanding of the context. Given two to five potential answers per pronoun, the baseline score for guesswork was 45%. $1000 would be awarded for a 65% score, and $25000 for a 90% score, which was taken to be human level.
(Note: these are the scores after a recount. There was some confusion as my program had omitted two answers.)
| Contestant | Correct answers out of 60 | Method |
|---|---|---|
| Quan Liu | 35 / 35 / 29 (58% – 48%) | deep neural network & ConceptNet |
| Nikos Isaak | 29 (48%) | probabilistic engine & knowledge extraction |
| Patrick Dhondt | 29 (48%) | logical axioms |
| Denis Robert | 19 (32%) | logical inferences |
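The percentages and prize thresholds above are simple arithmetic over 60 questions, which the following minimal sketch reproduces. The function names are made up for illustration. The text does not give the number of choices per question, so `chance_baseline` takes them as an input rather than assuming a distribution.

```python
TOTAL_QUESTIONS = 60  # number of pronouns in the test, as stated in the text


def percentage(correct: int, total: int = TOTAL_QUESTIONS) -> int:
    """Score as a whole percentage, rounded to the nearest integer."""
    return round(100 * correct / total)


def answers_needed(threshold_percent: int, total: int = TOTAL_QUESTIONS) -> int:
    """Smallest number of correct answers that reaches the threshold (integer ceiling)."""
    return (threshold_percent * total + 99) // 100


def chance_baseline(choices_per_question: list[int]) -> float:
    """Expected fraction correct when guessing uniformly at random on every question."""
    return sum(1 / k for k in choices_per_question) / len(choices_per_question)


print(percentage(35), percentage(29), percentage(19))  # 58 48 32
print(answers_needed(65), answers_needed(90))          # 39 54
print(chance_baseline([2, 2, 2, 2]))                   # 0.5 (two choices everywhere)
```

With a mix of two to five choices per question, the expected score for guessing drops below 50%, which is consistent with the 45% baseline quoted for this test.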
Quan Liu’s group entered three programs, which is a little unorthodox for contests. But if you see this as a scientific test then it makes sense to test which configuration of a neural network works best. Their machine learning approach gathered pairs of events (mainly verbs) that are commonly associated, e.g. “rob -> be arrested”, and then applied their probability of co-occurring. Two of their versions scored the highest, 58%, which is consistent with the track record of similar approaches.
The unusual score of Denis Robert’s system, below the 45% guesswork baseline, can largely be explained by the fact that his system was not designed for cases with more than two possible answers, as this was only changed on short notice. However, he also indicated that his algorithm didn’t apply to most of the cases.
There were nevertheless no winners that reached the 65% threshold. On the one hand one could say that technology is literally halfway to human ability; on the other hand the programs did only a little better than one might by chance. Any conclusion drawn from just the scores would be premature. If this test is to be a meaningful measure of progress, we should look at which areas the programs were better or worse in. I can at least answer that for my own approach.
Winograd schemas vs prose
The ambiguity in the new prose form was actually not so bad compared to previously published Winograd schemas. But the phrasing was often excessively long-threaded with all sorts of interjected tangents. Although I built my program for reading articles and dialogue alike, I had not covered the grammar of interrupting phrases that break up the main thread of a sentence. Such sentence structures are abundant in story novels but do not occur in Winograd schemas, and I wasn’t planning on having my A.I. read novels any time soon. The inclusion of some 1940s vocabulary also complicated matters: “cook-shanty”, “red-letter days”, “a pallid young dandy”? Maybe it’s because I’m Dutch, but I can only guess what these are.
Compared to the wide variety of common sense axioms that I had programmed (see How to teach a computer common sense*), many solutions to the pronouns were ordinary cases of continuity. E.g. a pronoun with an active role typically refers to the last noun with an active role (You won’t find this rule in a grammar book, because ambiguous pronouns are grammatically “incorrect” to begin with).
This makes sense when you’re testing on novels: No storyteller wants to write in such a counter-intuitive way that the reader has to stop and think about it, contrary to Winograd schemas which are designed for exactly that purpose.
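To picture the continuity rule, here is a minimal sketch under stated assumptions. It is not the code of my program: the `Mention` class, the role labels and the example nouns are all invented for illustration, and a real system would first have to derive the roles from a grammatical parse.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mention:
    text: str  # the noun as it appears in the passage
    role: str  # "active" (doing something) or "passive" (undergoing it)


def resolve_by_continuity(pronoun_role: str, preceding: list[Mention]) -> Optional[str]:
    """Return the most recent preceding noun whose role matches the pronoun's role."""
    for mention in reversed(preceding):
        if mention.role == pronoun_role:
            return mention.text
    return None  # no noun with a matching role; the caller needs another rule


# Invented example: an active pronoun following "the dog chased the cat".
mentions = [Mention("the dog", "active"), Mention("the cat", "passive")]
print(resolve_by_continuity("active", mentions))  # the dog
```

Note that the rule skips over "the cat" even though it is the nearer noun, which is where continuity differs from simply picking the closest candidate.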
Where no particular common sense axiom applied, rules of continuity and grammar chose 21 of my 29 correct answers. Thus two thirds of my success seemed not due to the application of common sense, but due to conventional writing. Curious, I ran the test again with all axioms disabled except continuity. The result was an equal number of correct answers, but much more randomly distributed and obviously chosen for the wrong reasons. The common sense axioms were clearly contributing by fencing off the exceptions to continuity, so the cause of the mistakes lay elsewhere.
A closer look at the results
The table below shows which of the 60 pronouns my program resolved correctly (highlighted green), which axioms were applicable, and/or which problems hindered their conclusion. When a problem occurred or no axiom applied, the program defaulted to the grammatically correct choice: The noun closest to the pronoun. Only a third of all pronouns actually conformed with this grammar rule, which explains why whenever a problem occurred, the answer was typically wrong.
The dotted lines in the table mean that the same sentence was given, but a different pronoun was asked about.
I will highlight the most prominent mistakes:
Logic could expect Dad to return the favour, were it not that “always” and “now” suggest a continuity, which the program did not pick up on. Consequently, the answers to both “he” and “him” were switched around. This also illustrates why this test was more difficult than chance: The more ambiguous pronouns a passage contained, the more likely a mistake in one would carry over to the others.
For this the program compared the similarities of bulbs, hamburgers and onions, but of course knowledge of onions was lacking in the database, so the inference fell flat. Retrieving such knowledge from the internet would slow things down, and though speed is no issue in a contest, in daily practice I want my program to read one page per second, not one sentence per second.
People aren’t known to have wings, otherwise the bodypart location paradox would have excluded Larry from being taken under his own wing. Alternatively one would have to know figurative meanings of English idioms, an added layer of difficulty.
The program considered “to…” to indicate Maude’s reason for leaving “in order to” do something. The pronoun wasn’t the only ambiguous word in this case.
“Backward” = “back”, “Southward” = “south”, therefore “Edward” = “Ed”. Although the pronoun was interpreted correctly, “Ed” was of course not found among the multiple choice answers.
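A naive suffix rule is enough to reproduce this kind of mistake. The sketch below is an assumption about the general mechanism, not the actual rule in my program, and the function name is hypothetical.

```python
DIRECTIONAL_SUFFIX = "ward"


def strip_directional_suffix(word: str) -> str:
    """Naively reduce words like "backward" to their base by cutting off "-ward"."""
    if word.lower().endswith(DIRECTIONAL_SUFFIX) and len(word) > len(DIRECTIONAL_SUFFIX):
        return word[: -len(DIRECTIONAL_SUFFIX)]
    return word


for word in ["Backward", "Southward", "Edward"]:
    print(word, "->", strip_directional_suffix(word))
# Backward -> Back
# Southward -> South
# Edward -> Ed
```

The rule has no way of telling that the last word is a name, so the answer it produces can no longer be matched against the multiple choice options.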
As I mentioned in my previous post*, the “what goes around comes around” karma axiom was the least reliable, causing five misinterpretations in this test. Sometimes it triggered on trivial events, other times the events did not make sense this way (scolding to get someone to do something positive). It had better be limited to events that are direct cause and result, as they had been in most Winograd schemas.
Consecutive mental activities are typically by the same person, but of course not when it’s a comparison. Though the context system does distinguish comparisons, the axioms did not.
While the pronoun was interpreted correctly, there was a technical hitch with selecting “freemans” from the multiple choice answers, due to the name having a plural -s.
“enough” was internally translated to “enough beans” but lost its plural status in the translation, after which the beans were no longer considered a candidate for plural “them”.
Most of these problems are easily fixed and are not inherent to the common sense axioms, apart from the “karma” axiom. The majority of problems were instead linguistic: Small flaws in the grammar rules, difficulty with long-threaded phrasing, limited range of the context system, and problems with the contest’s XML-format interface. It just goes to show how perfect every part of the system has to be before it pays off, and how little one can tell about a program’s abilities from the surface.
Patterns in the test
You may have noticed some things in the table of results. First, many more linguistic problems appear in the first third of the test than after. This is partly because sentences 22 to 33 were more brief and thus easier to process. Though I cannot well account for the rest, it suggests the order of the sentences was not random, but that perhaps standards were lowered after listing their best shots.
Second, 32 of 60 times the correct answer was “A”: The referent furthest from the pronoun. It seems the most ambiguous sentences were thought to be the ones where the answer was the furthest out of sight. This means that the test is not aligned with conventional writing practices, and that it is susceptible to reverse psychology.
Let me pose a very stupid scenario:
Suppose one makes a program that answers the least likely choice “A” in all cases, except when the same sentence is given repeatedly (see the dotted lines in the table), then it increments to B and C as one asks about each next pronoun in the sentence. The result of this zero-effort approach would be 57%, just about the highest score.
I am not suggesting that this actually happened; I read the winner’s paper and their method definitely has merit. I am however suggesting that machine learning AI would pick up on exactly this sort of statistical pattern born from human psychological tendencies. For that reason, test scores should never be taken at face value.
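The zero-effort scenario described above fits in a few lines. This is a minimal sketch, assuming the questions arrive as a list of sentence strings in test order and that a repeated sentence always appears consecutively. The function name is hypothetical, and the sketch does not check how many choices each question actually offers.

```python
from string import ascii_uppercase


def zero_effort_answers(sentences: list[str]) -> list[str]:
    """Answer "A" for each new sentence; step to "B", "C", ... while it repeats."""
    answers = []
    previous = None
    repeat = 0
    for sentence in sentences:
        # Count how many times in a row this same sentence has been asked about.
        repeat = repeat + 1 if sentence == previous else 0
        answers.append(ascii_uppercase[repeat])
        previous = sentence
    return answers


# Invented placeholders: the second sentence is asked about three times.
print(zero_effort_answers(["s1", "s2", "s2", "s2", "s3"]))
# ['A', 'A', 'B', 'C', 'A']
```

Nothing in this function looks at the content of a sentence, which is the point: any score it achieves comes from how the test was composed.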
The language barrier
As a test of common sense I found this setup less suitable than the original plan with Winograd schemas, which were more concise and profound in which areas of common sense they tested (e.g. spatial relations, physics, social interactions). Had I known from the start that the qualifying round would mainly feature novel prose, I would probably not have embarked on this challenge, knowing that my grammar parser wasn’t up for it. Now the prose passages contained too many variables to tell whether results were due to language or common sense, and it never got to the Winograd schema round. This puts us back at the Turing Test where it’s either everything or nothing, and that is not a useful measure of progress. Swapping the rounds would be a good idea for next time.
It was nice to see serious competitors with a wide variety of technology tackling the problem, and although the overall results are unimpressive, I am pleased that my partial solution did as well as some academic efforts, with a minimum of resources at that. I am not disappointed in my common sense axioms as many of them were well applicable in this test, including for pronouns that weren’t graded. I will broaden their application to ambiguous locations and indirect object relations, where I have greater need for them.
However, my main interest is the development of intelligent processes and I do not intend to linger on this aspect of language processing more than necessary. It is worth remembering that much can be said without ambiguity. Though common sense has widespread application, it ultimately serves to filter and limit possibilities, while the possibilities in areas like problem solving and planning have yet to expand. For that reason I do not expect human levels of common sense to be reached within ten years either, but we can certainly make strides towards it.