I introduced the Winograd Schema Challenge* before, a linguistic contest for artificial intelligence. In this post I will highlight a few of the methods I developed for this challenge. Long story short: A.I. are given sentences with ambiguous pronouns, and have to tell what they refer to by using “common sense reasoning”. 140 example Winograd schemas were published to practice on. In the example below, notice how “she” means a different person depending on a single word.
I chose to approach this not as the test of intelligence that it allegedly is, but as an opportunity to develop a common sense subsystem. My A.I. program already uses a number of intelligent processes in question answering, I don’t need to test what I already know it does. Common sense however, it lacked, and was often the cause of misunderstandings. Particularly locations (“I shot an elephant in my pajamas”) had proven to be so misinterpretable that it worked better to ignore them altogether. Some common sense could remedy that, or as the dictionary puts it: “Sound practical judgment that is independent of specialized knowledge”.
“When I use a word, it means just what I choose it to mean” – Humpty Dumpty
Before I could even get to solving any pronouns with common sense, there was the obstacle of understanding everything else. I use a combination of grammar and semantics to extract every mentioned detail, and I often had to hone the language rules to cope with the no-holds-barred level of writing. Here’s why language is hard:
“shepherds” are not herds of shep, but herders of sheep.
“a picture of shepherds” does not mean the picture belonged to the shepherds.
“sheep” may or may not mean the irregular plural.
“with sheep” does not mean the sheep were used to paint with.
“ended up” is not an upward direction.
“looking” does not mean watching, but resembling.
“more like dogs” does not mean that a greater number of people like dogs.
“they” can technically refer to any combination of sheep, shepherds, and Sam.
One might say that to solve this particular pronoun one needs only program the key words “they…look…like X”, but that would be shortsighted. Consider e.g. “they were looking for things like X”, or “it looks like they like X”, or “Their appearance did not match that of X”. Then consider every possible question one might ask about this sentence. That’s why thorough understanding of language matters: It’s not just about solving particular cases.
“The only true wisdom is in knowing you know nothing” – Socrates
My approach seemed to differ from those of most universities. The efforts I read of were collecting all the knowledge databases and statistics they could get, so that the A.I. could directly look up the answers, or infer them step by step from very specific knowledge about bees landing on flowers to get nectar to make honey.
I on the other hand had departed from the premise that knowledge was not going to be the solution, since Winograd schemas are so made that the answers can’t be Googled. This was most apparent from the use of names like “Jane” and “Joan” as subjects. So as knowledge of the subjects couldn’t be relied on, the only things left to examine were the interactions and relations between the subjects: Who is doing what, where, and why.
So I combed over the 140 example schemas dozens of times, looking for basic underlying concepts that covered as broad a range as possible. At first there seemed to be no common aspects between the schemas. They weren’t kidding when they said they had covered “a wide range of world knowledge and linguistic features”. Eventually I tuned out the many details and looked only at why the answers worked. From that angle I noticed that many schemas centered around concepts of size, amount, time, location, possessions, physics, feelings, causes and results: The building blocks of our world. This, I could work with.
Of course my program would have to know which words indicated which concepts, but I already had that covered. I had once composed word lists with meanings of “being”, “having”, “doing”, “talking” and “thinking”, for the convenience of having some built-in common knowledge. They allowed the program to presume that any object can be possessed, spoken of and thought about, but typically doesn’t speak or think itself. Now, to recognise a concept of possession in a sentence, it sufficed to detect that the relation between two subjects (usually the verb) was in the “having” word list: “own, get, receive, gain, give, take, require, want, confiscate, etc.”. While these were finite lists, I could as well have had the A.I. search for synonyms in a database or dictionary. I just prefer common sense to be reliably built-in.
He who has everything wants nothing
To start off simple, I coded a most common procedure: The transfer of possessions between people. My word list of “having” verbs was subdivided so that all synonyms of “get/receive/take” had a value of 1 (having), and all synonyms of “give/lend/transfer” had a value of -1 (not having), making it easier to compare the states of possession that these words represented. I then coded ten of their natural sequences:
Depending on whether the possessive states of George, Eric, and the pronoun correspond with one of the sequences, George or Eric gets positive points (if X – X) or negative points (if X – Y). The subject with the most points is then the most probable to fit the pronoun’s role.
Certain words however indicate the opposite of the sequences, such as objections (“but/despite/though”), amounts (“not/less”), and passive tenses (“was given”). These were included in the scoring formula as negative factors so that the points would be subtracted instead of added, or vice versa. The words “because” and “so” had a similar effect, but only because they indicate the order of events. It was therefore more consistent to add time as a factor (verb tenses etc.) rather than depend on explicit mentions of “because”.
In the example, “he was eager” represents a state of wanting. This corresponds with the sequence “X gives – Y wants”. Normally the “giving” subject would then get negative points for “wanting”, but the objection “even though” inverts this and makes it more probable instead: “X gives – (even though) X wants”. And so it is most likely that the subject who gave something, “George”, is the same subject as the “he” who was eager. Not so much math as it is logic.
What goes around comes around
A deeper hidden logic that I found in many schemas, is that bad consequences come from bad causes, and good consequences from good causes. If X hurts Y, Y will hurt X back. If X likes Y, Y was probably nice to X. To recognise these cases I employed my program’s problem detection function: It examines whether the subjects and verbs are bad (“hurt”) or good (“like”) and who did it to who. To fill the gaps in my program’s knowledge I adapted the AFINN sentiment word list, along with that of Hu and Liu. This provided me with positive/negative values for about 5000 words once stemmed, necessary to cover the extensive vocabulary used in the examples.
My initial formula for “do good = get good” and “do bad = get bad” seemed to solve just about everything, but it flunked these two cases, and after weeks of reconfigurations it turned out the logic of karma was nothing so straightforward. It mattered a great deal whether the subjects were active, passive, experiencing, emoting, or just being. And even then there were exceptions: “stealing” can be rewarding or punished, and “envy” feels bad about something good. The axiom ended up as one of the least reliable, the results nowhere near as assured as laws of physics. The reason that it still had a high success rate was that it follows psychology that the writers weren’t consciously aware of applying: Whether the subjects were “bullied”, “clogged”, or “in the trash” was only stage dressing for an intuitive sense of good and bad. A “common” sense, therefore still valid. After refinements, this axiom solved about one quarter of the examples, while exceptions to the rule were caught by the more dependable axioms. Most particularly, emotions followed a set of logic all of their own.
Dead men tell no tales
The rather simple axiom here is that people who are dead don’t do anything, therefore the dead person couldn’t be Thomson as he was “visiting”, and had to be Cooper. One could also use word statistics to find a probable association between the words “grave” and “dead”, but the logical impossibility of dead men walking is stronger proof and holds up even if he’d visited “Cooper’s house”.
I had doubts about the worth of programming this as an axiom because it is very narrow in use. Nevertheless life and death is a very basic concept, and it would be convenient if an A.I. realised that people can not perform tasks if they die along the way. This time, instead of tediously listing all possible causes of death, I had the A.I. search them in its database, essentially adding an inference. This allowed me to easily expand the axiom to the creation and destruction of objects as well: Crushed cars don’t drive.
The last factor was time: My program combines all time-related words and verb tenses into timestamps, so that it can tell whether an action was done before or after the being dead. Easily said of course, but past tense + “in 1765″(presumably years) + “at that date” + past tense + “for five years” is quite a sequence.
The interesting part of this axiom is its exceptions: Dead people do still decay, rest, lay still. Grammatically these are active tense verbs like any other, but semantically they are distinctly involuntary. One statistical hint may help identify them; a verb of involuntary action is rarely paired with a grammatical object. One does not “decay a tree” or “die someone”, one just “dies”. Though a simpler way for an A.I. to learn these exceptions would be to read which verbs are “done” by a dead person in unambiguous texts, as there surely are.
Tell me something I don’t know
This is another example of a simple axiom, but it is noteworthy for its great practical use, as novels and news are full of reporting clauses. “X told (Y) that she…” can refer to X, Y, or anyone mentioned earlier. But if Kate had retired, Kate would have known that about herself and wouldn’t need to be told. Hence it was more likely Dr. Adams who retired. The reverse is true if “Dr. Adams asked Kate when she had retired”: One doesn’t ask things that one knows about oneself. This is where my word list of “talking” verbs came in handy: Some verbs request information, other verbs give it, just like the transfer of possessions.
Unfortunately this logic only offers moderate probability and knows many exceptions. “X asked Y if he looked okay” does have X asking about himself, as one isn’t necessarily as aware of passive traits as one is of one’s actions. Another interesting exception is “X told Y that he was working too much”, which is most likely about Y, despite that Y is aware of working. So in addition, criticisms are usually about someone else, and at non-actions this axiom just isn’t conclusive, as the schema’s alternative version also shows:
Knowing is (only) half the battle
This schema is a good example of how knowledge about trucks and buses won’t help, as both are relatively slow. Removing them from the picture leaves us only with “zoomed by” and “going fast” as meaningful contents. In my system, “going fast” automatically entails “is fast”, and this allows the A.I. to infer the answer from the verb: If the truck “zoomed”, and one knows that “zooming” is fast, then it follows that it was the truck who was fast. The opposite would be true for “not fast” or “slow”: Because zooming is fast, it could then not be the truck, leaving only the bus as probable.
As always, the problem with inferences is that they require knowledge to infer from, and although we didn’t need to know anything about trucks and buses, we still needed to know that zooming is fast. When I tested this with “raced” my A.I. solved the schema, but for “zoomed” it just didn’t know. Most of the other example schemas would have taken more elaborate inferences requiring even more knowledge, and so knowledge-dependent inference was rarely an effective or efficient solution. I was disappointed to find this, as inference is my favourite method for everything.
Putting it to the test
In total I developed 20 general axioms/inferences that covered 140 ambiguous sentences, half of all examples. (i.e. 70 Winograd schemas of 2 versions each). The axioms range from paradoxes of physics to linguistic conventions. Taken together they reveal a core principle of opposites, amounts, and “to/from” transitions.
Having read my simplified explanations, you may fall into the trap of thinking that the Winograd Schema Challenge is actually easy, despite sixty years of A.I. history suggesting otherwise. Here’s the catch: I have only explained the last step of the process. Getting to that point took very complex analyses of language and syntax, where many difficulties and ambiguities still remain. One particular schema went wrong because the program considered “studying hard” to mean that someone had a hard time studying.
In the end I ran an unprepared test on a different set of Winograd Schemas, with which the university of Texas had achieved a 73% success rate. After adjusting the factor of time in three axioms, my program got 45% of the first 100 schemas correct (62% if you count lucky guesses). The ones it couldn’t solve were knowledge-dependent (mermaids having tails), contained vocabulary that my program lacked, had uncommon phrasing (“Tradition dictated the captain hold the cup”), or contained ambiguous names. Like “Steve Jobs” not being a type of jobs, and the company “Disney” being referable as “it”, whereas “(Walt) Disney” is referable as “he”. The surname ambiguity I could fix in an afternoon. The rest, not so much.
“Common sense is the collection of prejudices acquired by age eighteen” – Einstein
While working on the Winograd schemas, I kept wondering whether the methods I programmed can be considered intelligent processes. Certainly reasoning is an intelligent process, and many of my methods are basic inferences. i.e. By combining two given facts, the program concludes a third fact that wasn’t apparent. I suppose what makes me hesitate to call these inferences particularly intelligent is that the program has been told which sort of proof to infer which sort of conclusion from, as opposed to having it search for proof entirely without categories.
Yet we ourselves use these axioms all the time: When someone asks for something, we presume they want it. When someone gives something, we presume we can take it. Practically it makes no difference whether such rules are learned, taught or programmed, we use them all the same. Therefore I must conclude that most of my methods are just as intelligent as when humans apply the same logic. How intelligent that is of humans, is something we should consider, rather than presume.
I do not consider it a sensible endeavour however to manually program axioms for everything: The vocabulary involved would be too diverse to manage. But for the most basic concepts like time, space and physics, I believe it is more efficient to model them as systems with rules than to build a baby robot that has a hard time figuring out how even the law of gravity works. Everything else, including all exceptions to the axioms, can be taught or learned as knowledge.
Another question is whether the Winograd Schema Challenge tests intelligence, something that was also suggested of its predecessor, the Turing Test. Perhaps due to my approach, I find that it mainly tests language processing (a challenge in itself) and knowledge of the ground rules of our world. Were this another planet where gravity goes upward and apologising is considered offensive, knowing those rules would be the key to the schemas more often than intelligence. Of course intelligence does need to be applied to something to test it, and the test offers a domain inbetween too easy to fake and too impossible to try. And because a single word can entirely change the outcome, the test requires a more detailed analysis than just the comparison of two key words. My conclusion is that the Winograd Schema Challenge does not primarily test intelligence, but is more inviting to intelligent approaches than unintelligent circumventions.
Figuring out the mechanisms behind various Winograd schemas was a pleasant challenge. It felt much like doing advanced crossword puzzles; Solving verbal descriptions from different angles, with intersecting solutions that didn’t always add up. Programming the methods however was a chore, getting all the effects of modifying words “because/so/but/not” to play nice in mathematical formulas, and making the axioms also work in reverse on a linearly processing computer.
I should be surprised if I were to do better than universities and companies, but I would hope to do well enough to show that resources aren’t everything. My expectations are nevertheless that despite the contest’s efforts to encourage reasoning, less interesting methods like rote learning will win through sheer quantity, as even the difficult schemas contain common word combinations like “ask – answer”, “lift – heavy” and “try – successful”. But then, how could they not.
Regardless the outcome of the test, it’s been an interesting side-quest into another elusive area of computer abilities. And I already benefit from the effort: I now have a helpful support to my A.I.’s language understanding, and potentially a tool to enhance many other processes with. That I will no longer find “elephants in my pajamas”, is good enough for me.