This wasn’t quite the Winograd Schema Challenge that I had set out on. Originally this language comprehension contest for A.I. was announced in July 2014, to be run in October 2015, but was postponed to February 2016, and then again to July 2016. I was just about to ship my program overseas, three weeks before the last-accepted arrival date of postal entries, when the contest announced changes to the rules and technical format.
Some universities had been training with ambiguous pronouns like this:
I had been practising on the official Winograd schemas like this:
Whereas the final test featured this:
The programs were now faced with any number of consecutively ambiguous pronouns in passages from 1940s children’s novels, which made quite a difference. It turns out the organisers had already decided on this last year, as appears from their sensible enough explanation in a members-only AI magazine (Winograd schemas are too hard to compose). Unfortunately they somehow did not see fit to share these changes on the contest website until too late. While the benchmark of 65% had previously been feasible, it now quickly became unlikely that anyone would win anything this year. A number of would-be participants backed out.
The contest finally took place at the IJCAI conference in New York with four contestants: the Open University of Cyprus, the University of Science and Technology of China, the independent Denis Robert from France, and myself from the Netherlands. Curiously absent were a number of American universities that had previously reported successes of over 70% for solving Winograd schemas. The absence of Google, IBM, and other commercial powerhouses was less strange, if you consider that the winner was obligated to publish their methods so that others could reproduce them, and that anything below human level would be considered a failure in the media.
The glass is half full
The programs were asked to figure out 60 multiple choice pronouns, with such ambiguity that they were to be solved through an understanding of the context. With two to five potential answers per pronoun, the baseline score for guesswork was 45%. $1000 would be awarded for a 65% score, and $25000 for a 90% score, human level.
(Note: these are the scores after a recount. There was some confusion, as my program had omitted two answers.)
| Contestant | Correct answers out of 60 | Method |
|---|---|---|
| Quan Liu | 35 / 35 / 29 (58% – 48%) | deep neural network & ConceptNet |
| Nikos Isaak | 29 (48%) | probabilistic engine & knowledge extraction |
| Patrick Dhondt | 29 (48%) | logical axioms |
| Denis Robert | 19 (32%) | logical inferences |
Quan Liu’s group entered three programs, which is a little unorthodox for contests. But if you see this as a scientific test then it makes sense to test which configuration of a neural network works best. Their machine learning approach gathered pairs of events (mainly verbs) that are commonly associated, e.g. “rob -> be arrested”, and then applied their probability of co-occurring. Two of their versions scored the highest, 58%, which is consistent with the track record of similar approaches.
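To give a rough idea of what scoring candidates by event co-occurrence can look like, here is a minimal Python sketch. It is not the group's actual system, which used a deep neural network and ConceptNet: the counts, the function names, and the example candidates are all invented for illustration.

```python
from collections import Counter

# Hypothetical counts of how often one event was followed by another
# in some corpus. These numbers are made up for the example.
PAIR_COUNTS = Counter({
    ("rob", "be arrested"): 8,
    ("rob", "escape"): 2,
    ("report", "be arrested"): 1,
    ("report", "be thanked"): 9,
})


def cooccurrence_probability(first, second, counts):
    """Estimate P(second | first) from raw pair counts."""
    total = sum(n for (a, _), n in counts.items() if a == first)
    return counts[(first, second)] / total if total else 0.0


def pick_antecedent(pronoun_event, candidate_events, counts):
    """Choose the candidate whose own event best predicts the pronoun's event."""
    return max(
        candidate_events,
        key=lambda name: cooccurrence_probability(
            candidate_events[name], pronoun_event, counts
        ),
    )


# "He" was arrested: was it the one who robbed, or the one who reported?
candidates = {"the thief": "rob", "the witness": "report"}
print(pick_antecedent("be arrested", candidates, PAIR_COUNTS))
```

With these toy counts, "rob" predicts "be arrested" far more strongly than "report" does, so the pronoun is assigned to the candidate who did the robbing.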
There were nevertheless no winners that reached the 65% threshold. On the one hand one could say that technology is literally halfway to human ability; on the other hand the programs did only a little better than one might by chance. Any conclusion drawn from just the scores is premature. If this test is to be a meaningful measure of progress, we should look at which areas the programs were better or worse in. This I can at least answer for my own program.
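The comparison with chance can be put into numbers with a few lines of Python. This is only a sketch of the arithmetic: the scores, the 45% baseline and the prize thresholds are the figures quoted in this post, while the function names are my own. The exact number of candidates per pronoun is not listed here, so `chance_accuracy` is a general helper rather than a reconstruction of the 45%.

```python
TOTAL_PRONOUNS = 60
BASELINE_PERCENT = 45  # guesswork baseline quoted for the contest
PRIZE_THRESHOLDS = {"$1000": 65, "$25000": 90}

# Best score per contestant, taken from the results table.
BEST_SCORES = {
    "Quan Liu": 35,
    "Nikos Isaak": 29,
    "Patrick Dhondt": 29,
    "Denis Robert": 19,
}


def answers_needed(percent, total=TOTAL_PRONOUNS):
    """Smallest number of correct answers that reaches `percent` of `total`."""
    return -(-percent * total // 100)  # integer ceiling division


def chance_accuracy(candidate_counts):
    """Expected accuracy when guessing uniformly among each pronoun's candidates."""
    return sum(1 / k for k in candidate_counts) / len(candidate_counts)


baseline = answers_needed(BASELINE_PERCENT)
for name, correct in BEST_SCORES.items():
    print(f"{name}: {correct}/{TOTAL_PRONOUNS}, {correct - baseline:+d} versus guessing")
for prize, percent in PRIZE_THRESHOLDS.items():
    print(f"{prize} prize needs {answers_needed(percent)} correct answers")
```

If every pronoun had only two candidates, `chance_accuracy` would return 0.5, and with five candidates each it would drop to about 0.2, so a 45% baseline implies that most pronouns had just two candidates.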
Winograd schemas vs prose
The ambiguity in the new prose form was actually not so bad compared to previously published Winograd schemas. But the phrasing was often excessively long-threaded with all sorts of interjected tangents. Although I built my program for reading articles and dialogue alike, I had not covered the grammar of interrupting phrases that break up the main thread of a sentence. Such sentence structures are abundant in novels but do not occur in Winograd schemas, and I wasn’t planning on having my A.I. read novels any time soon. The inclusion of some 1940s vocabulary also complicated matters: “cook-shanty”, “red-letter days”, “a pallid young dandy”? Maybe it’s because I’m Dutch, but I can only guess what these are.
Compared to the wide variety of common sense axioms that I had programmed (see How to teach a computer common sense*), many solutions to the pronouns were ordinary cases of continuity. E.g. a pronoun with an active role typically refers to the last active-role noun (You won’t find this rule in a grammar book, because ambiguous pronouns are grammatically “incorrect” to begin with).
This makes sense when you’re testing on novels: No storyteller wants to write in such a counter-intuitive way that the reader has to stop and think about it, contrary to Winograd schemas which are designed for exactly that purpose.
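The continuity rule described above can be sketched in a few lines of Python. This is a simplified illustration, not my program's actual code: the `Mention` structure, the role labels and the example sentence are assumptions made for the sake of the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Mention:
    text: str
    role: str      # "active" (the doer) or "passive" (the undergoer)
    position: int  # word index in the passage


def resolve_by_continuity(
    pronoun_role: str, pronoun_position: int, mentions: Sequence[Mention]
) -> Optional[Mention]:
    """Return the last earlier mention that has the same role as the pronoun."""
    earlier = [
        m for m in mentions
        if m.position < pronoun_position and m.role == pronoun_role
    ]
    return max(earlier, key=lambda m: m.position, default=None)


# Illustrative sentence: "Tom thanked Bill, and then he left."
mentions = [Mention("Tom", "active", 0), Mention("Bill", "passive", 2)]
antecedent = resolve_by_continuity("active", 5, mentions)
print(antecedent.text if antecedent else "no candidate")
```

Here "he" has an active role (he left), so the rule skips the nearer "Bill" and picks "Tom", the last noun that was also in an active role.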
Where no particular common sense axiom applied, rules of continuity and grammar chose 21 of my 29 correct answers. Thus the majority of my success seemed not due to the application of common sense, but due to conventional writing. Curious, I ran the test again with all axioms disabled except continuity. The result was an equal number of correct answers, but much more randomly distributed and obviously chosen for the wrong reasons. The common sense axioms clearly contributed by fencing off the exceptions to continuity, so the cause of the mistakes lay elsewhere.
A closer look at the results
The results below show which of the 60 pronouns my program got correct, which axioms were applicable, and/or which problems hindered their conclusion. Where no axiom applied or a problem occurred, the program defaulted to the grammatically correct choice: the candidate closest to the pronoun. Only a third of all pronouns actually conformed to this grammar rule, which explains why the answer was typically wrong whenever a problem occurred.
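The fallback behaviour can be summarised as "try the axioms, otherwise take the nearest candidate". The Python sketch below shows only that control flow, under my own simplifying assumptions: candidates are represented as (text, word position) pairs, an axiom is any function that returns a candidate or `None`, and the example axiom is a stand-in rather than one of the real common sense axioms.

```python
from typing import Callable, Optional, Sequence, Tuple

Candidate = Tuple[str, int]  # (text, word position); a hypothetical representation
Axiom = Callable[[int, Sequence[Candidate]], Optional[Candidate]]


def closest_candidate(pronoun_position: int, candidates: Sequence[Candidate]) -> Candidate:
    """Grammatical default: the candidate nearest to the pronoun."""
    return min(candidates, key=lambda c: abs(pronoun_position - c[1]))


def resolve(
    pronoun_position: int, candidates: Sequence[Candidate], axioms: Sequence[Axiom]
) -> Candidate:
    """Use the first axiom that reaches a conclusion, else fall back to the default."""
    for axiom in axioms:
        choice = axiom(pronoun_position, candidates)
        if choice is not None:
            return choice
    return closest_candidate(pronoun_position, candidates)


def first_mentioned(pronoun_position, candidates):
    """Stand-in axiom for the example: always prefers the earliest candidate."""
    return min(candidates, key=lambda c: c[1])


candidates = [("Maude", 0), ("the door", 4)]
print(resolve(7, candidates, axioms=[]))                 # no axiom applies
print(resolve(7, candidates, axioms=[first_mentioned]))  # an axiom decides
```

Since the default is only right about a third of the time in this test, every pronoun that falls through to `closest_candidate` is more likely wrong than right.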
I will highlight the most prominent mistakes:
Logic could expect Dad to return the favour, were it not that “always” and “now” suggest a continuity, which the program did not pick up on. Consequently, the answers to both “he” and “him” were switched around. This also highlights why this test was more difficult than chance: The more ambiguous pronouns a passage contained, the more likely a mistake in one would carry over to the others.
For this the program compared the similarities of bulbs, hamburgers and onions, but of course knowledge of onions was lacking in the database, so the inference fell flat. Retrieving knowledge from the internet would slow things down, and though speed is no issue in a contest, in daily practice I want my program to read one page per second, not one sentence per second.
People aren’t known to have wings, otherwise the bodypart location paradox would have excluded Larry. Alternatively one would have to know figurative meanings of English idioms, an added layer of difficulty.
The program considers “to…” to indicate Maude’s reason for leaving “in order to” do something.
“Backward” = “back”, “Southward” = “south”, therefore “Edward” = “Ed”. Although the pronoun was interpreted correctly, “Ed” was of course not found among the multiple choice answers.
As I mentioned in my previous post, the “what goes around comes around” axiom was the least reliable, causing five misinterpretations in this test. Sometimes it triggered on trivial events, other times the events did not make sense this way (scolding to get someone to do good). It had better be limited to events that are direct cause and result, as they had been in most Winograd schemas.
Consecutive mental activity is typically by the same person, but of course not when it’s a comparison. Though the context system does distinguish comparisons, the axioms did not.
While the pronoun was interpreted correctly, there was a technical hitch with selecting “freemans” from the multiple choice answers.
“enough” was translated to “enough beans” but lost its plural status in the translation, after which the beans were no longer considered a candidate for plural “them”.
Most of these problems are easily fixed and are not inherent to the common sense axioms, apart from #40 and its like. The majority of problems were instead linguistic: Small flaws in the grammar rules, difficulty with long-threaded phrasing, limited coverage of the context system, and problems with the contest’s XML-format interface. It just goes to show how perfect every part of the system has to be before it pays off, and how little you can tell about a program’s abilities from the surface.
The language barrier
As a test of common sense I found this setup less suitable than the original plan with Winograd schemas, which were more concise and profound in which areas of common sense they tested (e.g. spatial relations, physics, social interactions). Had I known from the start that the qualifying round would mainly feature novel prose, I would probably not have embarked on this challenge. Now the prose passages contained too many variables to tell whether results were due to language or common sense, and it never got to the Winograd schema round. This puts us back at the Turing Test where it’s either everything or nothing, and that is not a useful measure of progress. Swapping the rounds would be a good idea for next time.
It was nice to see serious competitors with a wide variety of technology tackling the problem, and although the overall results are unimpressive, I am pleased that my partial solution did as well as some academic efforts, with a minimum of resources at that. I am not disappointed in my common sense axioms as many of them were well applicable in this test, including all the pronouns that weren’t graded. I will broaden their application to ambiguous locations and indirect object relations, where I have greater need for them.
However, my main interest is the development of intelligent processes and I do not intend to linger on this aspect of language processing more than necessary. It is worth remembering that much can be said without ambiguity, and software like Stanford’s Coreference Resolver already achieves 90% precision on average texts. Though common sense has widespread application, it ultimately serves to filter and limit possibilities, while the possibilities in areas like problem solving and planning have yet to expand. For that reason I do not expect human levels of common sense to be reached within ten years either, but we can certainly make strides towards it.