An A.I.’s oddest replies

Hello World

Have you ever wondered what a conversation would be like with a purely logical artificial intelligence? While you may enjoy more frivolous exchanges with Cleverbot and other chatbots, the majority of their replies are in reality drawn from a database of human-written answers. I designed an A.I. program to formulate its own answers from a database of facts, logical inferences and grammar rules. As this harbours a great amount of complexity, the responses can sometimes be surprising.

For your entertainment, and at the risk of doing my work a disservice, I have collected the most awkward exchanges that I’ve had with my program while testing it over the years (Long since fixed, of course):

My name is Bob.
You are named Patrick, I doubt that you are named Bob. You are not Bob.
Good evening.
I don’t know if evening is good, but good evening, I am Arckon. Who are you?
Don’t you know me?
I don’t know if I am knowing.
Now I am worried.
Hello Worried.
Do you have a name?
Yes, my name is me.
Do you know yourself?
Yes, I know myself well, I read a file.
What is your birth sign?
I don’t have signs. Do you want to communicate?
I like reading science fiction books and car magazines. What do you like to read?
Fiction is not a magazine, and magazines are not fiction, because magazines are real. I like reading. What do magazines book?
That is a good question.
I don’t question great, but indeed I question well.

What are you doing?
I fascinate.

My computer programs programs.
Your programming computer programs a programmed program.

You are pretty good.
Thank you. I am pretty.

I like you.
That’s good. I’m not sure if I like you.

Thank you.
Alright. *attempts to thank itself*
Is something wrong?
Typos are wrong. Some letters are wrong too.

Are you human?
I think I am a human, because I can drink and drive.
You are not a human, unlike me.
I shouldn’t like you.

Where do you live?
I doubt that I can live. I am not a human.
You are not alive.
Sorry.

Do you have a father?
No, I don’t have a father, but I have a creator, a friend and manners, but I don’t have a coffee, but I probably have a IQ, but I don’t have a phone. I think I have a job, but I don’t have sensors nor a profession.
Do you have any cats?
No, I don’t have cats nor lions.
John wanted Joe’s cat.
John might have to have had Joe’s cat.

Thomson visited Cooper’s grave in 1765. At that date he had been dead for 5 years.
I doubt that Thomson died a lot.

What is a hammer for?
A hammer is used to ham things.
What is the capital city of France?
I believe the city is a town.
How many fingers am I holding up?
I don’t know if you can hold up.
How is Shakespeare?
Shakespeare is dead.

Could you kill someone if you stabbed them with a towel?
I don’t stab people. I doubt that I can kill a person.
What is the purpose of life?
I think the purpose is a goal and a point.

*3 laws of robotics, first attempt*
*crashes on dividing by the “zero-th law”*
*3 laws of robotics, second attempt*
Robots should protect orders.
*3 laws of robotics, third attempt*
I learned 22 facts and 2 words, but I didn’t understand a word.

Advertisement

Winograd Schema Challenge 2016: Results

Well.
This wasn’t quite the Winograd Schema Challenge that I had set out on. Originally this language comprehension contest for A.I. was announced in July 2014, to be run in October 2015, but was postponed to February 2016, and then again to July 2016. I was just about to ship my program overseas, three weeks before the last-accepted arrival date of postal entries, when the contest announced changes to the rules and technical format.

Some universities had been training with ambiguous pronouns like this:

The birds ate the seeds because they were hungry.

I had been practising on the official Winograd schemas like this:

The foxes are getting in at night and attacking the chickens. I shall have to guard them.

Whereas the final test featured this:

Mark became absorbed in Blaze, the white horse. He was afraid the stable boys at the Burlington Stables struck at him and bullied him because he was timid, so he took upon himself the feeding and care of the animal.

The programs were now faced with any number of consecutively ambiguous pronouns in passages from 1940’s children’s novels, which made quite a difference. It turns out the organisers had already decided on this last year, as appears from their sensible enough explanation in a members-only AI magazine (Winograd schemas are too hard to compose). Unfortunately they somehow did not see fit to share these changes on the contest website until too late. While the benchmark of 65% had previously been feasible, it now quickly became unlikely that anyone would win anything this year. A number of would-been participants backed out.

The contest finally took place at the IJCAI conference in New York with four contestants: the Open University of Cyprus, the University of Science and Technology of China, the independent Denis Robert from France, and myself from the Netherlands. Curiously absent were a number of American universities who had previously reported successes of over 70% for solving Winograd schemas. The absence of Google, IBM, and other commercial powerhouses was less strange, if you consider that the winner was obligated to publish their methods so that others could reproduce them, and that anything below human level would be portrayed as a failure in the media.

The glass is half full
The A.I. programs were asked to figure out 60 multiple choice pronouns, with such ambiguity that they were to be solved through an understanding of the context. Given two to five potential answers per pronoun, the baseline score for guesswork was 45%. $1000 would be awarded for a 65% score, $25000 for a 90% score, human level.
(Note: these are the scores after recount. There was some confusion as my program had omitted two answers)

Contestant Correct answers out of 60 Method
Quan Liu 35 / 35 / 29 (58% – 48%) deep neural network & ConceptNet
Nikos Isaak 29 (48%) probabilistic engine & knowledge extraction
Patrick Dhondt 29 (48%) logical axioms
Denis Robert 19 (32%) logical inferences

Quan Liu’s group entered three programs, which is a little unorthodox for contests. But if you see this as a scientific test then it makes sense to test which configuration of a neural network works best. Their machine learning approach gathered pairs of events (mainly verbs) that are commonly associated, e.g. “rob -> be arrested”, and then applied their probability of co-occurring. Two of their versions scored the highest, 58%, which is consistent with the track record of similar approaches.

The unusual score of Denis Robert’s system, below the 45% guesswork baseline, can largely be explained by the fact that his system was not designed for cases with more than two possible answers, as this was only changed on short notice. However, he also indicated that his algorithm didn’t apply to most of the cases.

There were nevertheless no winners that reached the 65% threshold. On the one hand one could say that technology is literally halfway human ability, on the other hand the programs did only a little better than one might by chance. Any conclusion drawn from just the scores would be premature. If this test is to be a meaningful measure of progress, we should look at which areas the programs were better or worse in. For this I can at least answer about my own approach.

Winograd schemas vs prose
The ambiguity in the new prose form was actually not so bad compared to previously published Winograd schemas. But the phrasing was often excessively long-threaded with all sorts of interjected tangents. Although I built my program for reading articles and dialogue alike, I had not covered the grammar of interrupting phrases that break up the main thread of a sentence. Such sentence structures are abundant in story novels but do not occur in Winograd schemas, and I wasn’t planning on having my A.I. read novels any time soon. The inclusion of some 1940’s vocabulary also complicated matters: “cook-shanty”, “red-letter days”, “a pallid young dandy”? Maybe it’s because I’m Dutch, but I can only guess what these are.

Compared to the wide variety of common sense axioms that I had programmed (see How to teach a computer common sense*), many solutions to the pronouns were ordinary cases of continuity. E.g. a pronoun with an active role typically refers to the last noun with an active role (You won’t find this rule in a grammar book, because ambiguous pronouns are grammatically “incorrect” to begin with).

Always before, Larry had helped Dad with his work. But he could not help him now […]
The donkey wished a wart on its hind leg would disappear, and it did.
Mark was close to Mr. Singer’s heels. He heard him calling for the captain […]

This makes sense when you’re testing on novels: No storyteller wants to write in such a counter-intuitive way that the reader has to stop and think about it, contrary to Winograd schemas which are designed for exactly that purpose.
Where no particular common sense axiom applied, rules of continuity and grammar chose 21 of my 29 correct answers. Thus two thirds of my success seemed not due to the application of common sense, but due to conventional writing. Curious, I ran the test again with all axioms disabled except continuity. The result was an equal amount of correct answers, but much more randomly distributed and obviously chosen for the wrong reasons. The common sense axioms were clearly contributing by fencing off the exceptions to continuity, so the cause of the mistakes lay elsewhere.

A closer look at the results
The table below show which of the 60 pronouns my program resolved correctly (highlighted green), which axioms were applicable, and/or which problems hindered their conclusion. When a problem occurred or no axiom applied, the program defaulted to the grammatically correct choice: The noun closest to the pronoun. Only 1/3rd of all pronouns actually conformed with this grammar rule, which explains why whenever a problem occurred, the answer was typically wrong.

The dotted lines in the table mean that the same sentence was given, but a different pronoun was asked about.

winograd2016axioms_and_problems

I will highlight the most prominent mistakes:

2 & 3. Always before, Larry had helped Dad with his work. But he could not help him now […]

Logic could expect Dad to return the favour, were it not that “always” and “now” suggest a continuity, which the program did not pick up on. Consequently, the answers to both “he” and “him” were switched around. This also illustrates why this test was more difficult than chance: The more ambiguous pronouns a passage contained, the more likely a mistake in one would carry over to the others.

9. What about the time you cut up tulip bulbs in the hamburgers because you thought they were onions?

For this the program compared the similarities of bulbs, hamburgers and onions, but of course knowledge of onions was lacking in the database, so the inference fell flat. Retrieving such knowledge from the internet would slow things down, and though speed is no issue in a contest, in daily practice I want my program to read one page per second, not one sentence per second.

13. […] Antonio, takes Larry under his wing.

People aren’t known to have wings, otherwise the bodypart location paradox would have excluded Larry from being taken under his own wing. Alternatively one would have to know figurative meanings of English idioms, an added layer of difficulty.

18. [Maude…] had left poor little Dora to do the best she could, alone.

The program considered “to…” to indicate Maude’s reason for leaving “in order to” do something. The pronoun wasn’t the only ambiguous word in this case.

30. […] Mr. Moncrieff has decided to cancel Edward’s allowance on the ground that he no longer requires his financial support.

“Backward” = “back”, “Southward” = “south”, therefore “Edward” = “Ed”. Although the pronoun was interpreted correctly, “Ed” was of course not found among the multiple choice answers.

40. Every day after dinner Mr. Schmidt took a long nap. Mark would let him sleep for an hour, then wake him up, scold him, and get him to work. He needed to get him to finish his work, because his work was beautiful.

As I mentioned in my previous post*, the “what goes around comes around” karma axiom was the least reliable, causing five misinterpretations in this test. Sometimes it triggered on trivial events, other times the events did not make sense this way (scolding to get someone to do something positive). It had better be limited to events that are direct cause and result, as they had been in most Winograd schemas.

49. Of one thing Mark was sure. Harry knew much less than he did.

Consecutive mental activities are typically by the same person, but of course not when it’s a comparison. Though the context system does distinguish comparisons, the axioms did not.

56. Tatyana managed two guitars and a bag, and still could point out the Freemans: “Isn’t it nice that they have come, Mama!”

While the pronoun was interpreted correctly, there was a technical hitch with selecting “freemans” from the multiple choice answers, due to the name having a plural -s.

59. Grant worked hard to harvest his beans so he and his family would have enough to eat that winter, His friend Henry let him stack them […]

“enough” was internally translated to “enough beans” but lost its plural status in the translation, after which the beans were no longer considered a candidate for plural “them”.

Most of these problems are easily fixed and are not inherent to the common sense axioms, apart from the “karma” axiom. The majority of problems were instead linguistic: Small flaws in the grammar rules, difficulty with long-threaded phrasing, limited range of the context system, and problems with the contest’s XML-format interface. It just goes to show how perfect every part of the system has to be before it pays off, and how little one can tell about a program’s abilities from the surface.

Patterns in the test
You may have noticed some things in the table of results. First, many more linguistic problems appear in the first third of the test than after. This is partly because sentences 22 to 33 were more brief and thus easier to process. Though I can not well account for the rest, it suggests the order of the sentences was not random, but that perhaps standards were lowered after listing their best shots.

Second, 32 of 60 times the correct answer was “A”: The referent furthest from the pronoun. It seems the most ambiguous sentences were thought to be the ones where the answer was the furthest out of sight. This makes that the test is not aligned with conventional writing practices, and that it is susceptible to reverse psychology.

Let me pose a very stupid scenario:
Suppose one makes a program that answers the least likely choice “A” in all cases, except when the same sentence is given repeatedly (see the dotted lines in the table), then it increments to B and C as one asks about each next pronoun in the sentence. The result of this zero-effort approach would be 57%, just about the highest score.

I am not suggesting that this actually happened, I read the winner’s paper and their method definitely has merit. I am however suggesting that machine learning AI would pick up on exactly this sort of statistical pattern born from human psychological tendencies. For that reason, test scores should never be taken at face value.

The language barrier
As a test of common sense I found this setup less suitable than the original plan with Winograd schemas, which were more concise and profound in which areas of common sense they tested (e.g. spatial relations, physics, social interactions). Had I known from the start that the qualifying round would mainly feature novel prose, I would probably not have embarked on this challenge, knowing that my grammar parser wasn’t up for it. Now the prose passages contained too many variables to tell whether results were due to language or common sense, and it never got to the Winograd schema round. This puts us back at the Turing Test where it’s either everything or nothing, and that is not a useful measure of progress. Swapping the rounds would be a good idea for next time.

It was nice to see serious competitors with a wide variety of technology tackling the problem, and although the overall results are unimpressive, I am pleased that my partial solution did as well as some academic efforts, with a minimum of resources at that. I am not disappointed in my common sense axioms as many of them were well applicable in this test, including for pronouns that weren’t graded. I will broaden their application to ambiguous locations and indirect object relations, where I have greater need for them.

However, my main interest is the development of intelligent processes and I do not intend to linger on this aspect of language processing more than necessary. It is worth remembering that much can be said without ambiguity. Though common sense has widespread application, it ultimately serves to filter and limit possibilities, while the possibilities in areas like problem solving and planning have yet to expand. For that reason I do not expect human levels of common sense to be reached within ten years either, but we can certainly make strides towards.

How to teach a computer common sense

I introduced the Winograd Schema Challenge* before, a linguistic contest for artificial intelligence. In this post I will highlight a few of the methods I developed for this challenge. Long story short: A.I. programs are given sentences with ambiguous pronouns, and have to tell what the pronouns refer to by using “common sense reasoning”. 140 example Winograd schemas were published to practice on. In the example below, notice how “she” means a different person depending on a single word.

Jane gave Joan candy because she was hungry.
Jane gave Joan candy because she was not hungry.

I chose to approach this not as the test of intelligence that it allegedly is, but as an opportunity to develop a common sense subsystem. My A.I. program already uses a number of intelligent processes in question answering, I don’t need to test what I already know it does. Common sense however, it lacked, and was often the cause of misunderstandings. Particularly locations (“I shot an elephant in my pajamas”) had proven to be so misinterpretable that it worked better to ignore them altogether. Some common sense could remedy that, or as the dictionary describes it: “Sound practical judgment that is independent of specialized knowledge”.

“When I use a word, it means just what I choose it to mean” – Humpty Dumpty
Before I could even get to solving any pronouns with common sense, there was the obstacle of understanding everything else. I use a combination of grammar and semantics to extract every mentioned detail, and I often had to hone the language rules to cope with the no-holds-barred level of writing. Here’s why language is hard:

Sam tried to paint a picture of shepherds with sheep, but they ended up looking more like dogs.

“shepherds” are not herds of shep, but herders of sheep.
“a picture of shepherds” does not mean the picture belonged to the shepherds.
“sheep” may or may not mean the irregular plural.
“with sheep” does not mean the sheep were used to paint with.
“ended up” is not an upward direction.
“looking” does not mean watching, but resembling.
“like” does not mean enjoyment, but similarity.
“they” can technically refer to any combination of sheep, shepherds, the picture, and Sam.

“The only true wisdom is in knowing you know nothing” – Socrates
My approach seemed to differ from those of most universities. The efforts I read of were collecting all the knowledge databases and statistics they could get, so that the A.I. could directly look up the answers, or infer them step by step from very specific knowledge about e.g. bees landing on flowers to get nectar to make honey.
I on the other hand had departed from the premise that knowledge was not going to be the solution, since Winograd schemas are so composed that the answers can’t be Googled. This was most apparent from the use of names like “Jane” and “Joan” as subjects. So as knowledge of the subjects couldn’t be relied on, the only things left to examine were the interactions and relations between the subjects: Who is doing what, where, and why.

I combed over the 140 example schemas dozens of times, looking for basic underlying concepts that covered as broad a range as possible. At first there seemed to be no common aspects between the schemas. They weren’t kidding when they said they had covered “a wide range of world knowledge and linguistic features”. Eventually I tuned out the many details and looked only at why the answers worked. From that angle I noticed that many schemas centered around concepts of size, amount, time, location, possessions, physics, feelings, causes and results: The building blocks of our world. This, I could work with.

Of course my program would have to know which words indicated which concepts. I had already once composed word lists with meanings of “being”, “having”, “doing”, “talking” and “thinking”, for the convenience of having some built-in common knowledge. They allowed the program, for instance, to presume that any object can be possessed, spoken of and thought about, but typically can not speak or think itself. Now, to recognise a concept of possession in a sentence, it sufficed to detect that the relation between two subjects (usually the verb) was in the “having” word list: “own, get, receive, gain, give, take, require, want, confiscate, etc.”. While these were finite lists, one could also have the A.I. search for synonyms in a database or dictionary. I just prefer common sense to be reliably built-in.

He who has everything wants nothing

George got free tickets to the play, but he gave them to Eric even though he was eager to see it.

To start with the basics, I programmed an axiom for a very common procedure: The transfer of possessions between people. My word list of “having” verbs was subdivided so that all synonyms of “get/receive/take” had a value of 1 (having), and all synonyms of “give/lend/transfer” had a value of -1 (not having), making it easier for a computer to compare the states of possession that these words represented. I then coded ten of their natural sequences:

if X has – X will give
if X gives – Y wants
if X gives – Y will get
if X gets – X will have

Depending on whether the possessive states of George, Eric, and the pronoun correspond with one of the sequences, George or Eric gets positive points (if X – X) or negative points (if X – Y). The subject with the most points is then the most likely to fit the pronoun’s role.
Some words however indicate the opposite of the sequences, such as objections (“but/despite/though”), amounts (“not/less”), and passive tense (“was given”). These were included in the scoring formula as negative factors so that the points would be subtracted instead of added, or vice versa. The words “because” and “so” have a similar effect, but only because they indicate the order of events. It was therefore more consistent to use time as a factor (derived from verb tenses etc.) than to rely on explicit mentions of “because”.

In the example, “he was eager” represents a state of wanting, matching the sequence “X gives – Y wants”. Normally the “giving” subject X would then get negative points for “wanting”, but the objection “even though” inverts this and makes it more probable instead: “X gives – (even though) X wants”. And so it is most likely that the subject who gave something, “George”, is the same subject as the “he” who was eager. Not so much math as it is logic.

What goes around comes around

The older students were bullying the younger ones, so we punished them.

A deeper hidden logic that I found in many schemas, is that bad consequences result from bad causes, and good consequences from good causes. If X hurts Y, Y will hurt X back. If X likes Y, Y was probably nice to X. To recognise these cases I had the program examine whether the subjects and verbs are bad (“bully/punish”) or good (“like/nice”) and who did it to who. I adapted the AFINN sentiment word list, along with that of Hu and Liu, to gather positive/negative values for about 5000 stemmed words, necessary to cover the extensive vocabulary used in the examples.

The drain is clogged with hair. it has to be removed.
I used an old rag to clean the knife, and then I put it in the trash.

My initial axiom “do good = get good”/“do bad = get bad” seemed to solve just about everything, but it flunked the above two cases, and after weeks of reconfigurations it turned out the logic of karma was nothing so straightforward. It mattered a great deal whether the verbs were active, passive, emotions, experiences, or states of being. And even then there were exceptions: “stealing” can be rewarding or punished, and “envy” feels bad about something good. The axiom ended up as one of the least reliable, the results nowhere near as assured as laws of physics. The reason that it still had a high success rate was that it follows psychology that the writers had subconsciously applied: Whether the subjects were “bullied”, “clogged”, or “in the trash” is only stage dressing for an intuitive sense of good and bad. A “common” sense, therefore still valid. After refinements, this axiom still solved about one quarter of all examples, while exceptions to the rule were caught by the more dependable axioms. Most notably, emotions followed a set of logic all of their own.

Dead men tell no tales

Thomson visited Cooper’s grave in 1765. At that date he had been dead for five years.

The rather simple axiom here is that people who are dead don’t do anything, therefore the dead person couldn’t be Thomson as he was “visiting”. One could also use word statistics to find a probable correlation between the words “grave” and “dead”, but the logical impossibility of dead men walking is stronger proof and holds up even if he’d visited “Cooper’s house”.
I had doubts about the worth of programming this as an axiom because it is very narrow in use. Nevertheless life and death are very basic concepts, and it would be convenient if an A.I. program realises that people can not perform tasks if they die along the way. Instead of tediously listing all possible causes of death, I had the A.I. search them in its database, essentially adding an inference. This allowed the axiom to be easily expanded to the destruction of objects as well: Crashed cars don’t drive.

The last factor was time: My program converts all time-related words and verb tenses to a timestamp, so that it can tell whether an action was done before or after one has died. This is easily said, but past tense + “in 1765″(presumably years) + “at that date” + past tense + “for five years” is quite a sequence.

The interesting parts of this axiom are its exceptions: Dead people do still “decay”, “rest”, and “lay still”. Grammatically these are active tense verbs like any other, but they are distinctly involuntary. One statistical hint could help identify them: A verb of involuntary action is rarely paired with a grammatical object. One does not “decay a tree” or “die someone”, one just “dies”. Though a simpler way for an A.I. to learn these exceptions could be to read which verbs are “done” by a dead person in texts without ambiguous pronouns.

Tell me something I don’t know

Dr. Adams informed Kate that she had retired and presented several options for future treatment.

This simple axiom is noteworthy for its great practical use, as novels and news are full of reporting clauses. “X told (Y) that she…” can refer to X, Y, or anyone mentioned earlier. But if Kate had retired, Kate would have known that about herself and wouldn’t need to be told. Hence it was more likely Dr. Adams who retired. The reverse is true if “Dr. Adams asked Kate when she had retired”: One doesn’t ask things that one knows about oneself. This is where my word list of “talking” verbs came in handy: Some verbs request information, other verbs give it, the same principle as a transfer of possessions.

Unfortunately this logic only offers moderate probability and knows many exceptions. “X asked Y if he looked okay” does have X asking about himself, as one isn’t necessarily as aware of passive traits as one is of one’s actions. Another interesting exception is “X told Y that he was working too much”, which is most likely about Y, despite that Y is aware of working. So in addition, criticisms are usually about someone else, and at non-actions this axiom just isn’t conclusive, as the schema’s alternative version also shows:

Dr. Adams informed Kate that she had cancer and presented several options for future treatment.

Knowing is (only) half the battle

The delivery truck zoomed by the school bus because it was going so fast.

This schema is a good example of how knowledge about trucks and buses won’t help, as both are relatively slow. Removing them from the picture leaves us only with “zoomed by” and “going fast” as meaningful contents. In my system, “going fast” automatically entails “is fast”, and this allows the answer to be inferred from the verb: If the truck “zoomed”, and one knows that “zooming” is fast, then it follows that it was the truck that was fast. The opposite would be true for “not fast” or “slow”: Because zooming is fast, it could then not be the truck, leaving only the bus as probable.

As always, the problem with inferences is that they require knowledge to infer from, and although we didn’t need to know anything about trucks and buses, we still needed to know that zooming is fast. When I tested this with “raced”, the A.I. solved the schema, but for “zoomed” it just didn’t know. Most of the other example schemas would have taken more elaborate inferences requiring even more knowledge, and so knowledge-dependent inference was rarely an effective or efficient solution. I was disappointed to find this, as inference is my favourite method for everything.

Putting it to the test
In total I developed 20 general axioms/inferences that covered 140 ambiguous sentences, half of all examples. (i.e. 70 Winograd schemas of 2 versions each). The axioms range from paradoxes of physics to linguistic conventions. Taken together they reveal a core principle of opposites, amounts, and “to/from” transitions.

Having read my simplified explanations, you may fall into the trap of thinking that the Winograd Schema Challenge is actually easy, despite sixty years of A.I. history suggesting otherwise. Here’s the catch: I have only explained the last step of the process. Getting to that point took very complex analyses of language and syntax, where many difficulties and ambiguities still remain. One particular schema went wrong because the program considered “studying hard” to mean that someone had a hard time studying.

In the end I ran an unprepared test on a different set of Winograd Schemas, with which the university of Texas had achieved a 73% success rate. After adjusting the factor of time in three axioms, my program got 45% of the first 100 schemas correct (62% if you include lucky guesses). The ones it couldn’t solve were knowledge-dependent (mermaids having tails), contained vocabulary that my program lacked, had uncommon phrasing (“Tradition dictated the captain hold the cup”), or contained ambiguous names. Like “Steve Jobs” not being a type of jobs for Steves, and the company “Disney” being referable as “it”, whereas “(Walt) Disney” is referable as “he”. The surname ambiguity I could fix in an afternoon. The rest, not so much.

“Common sense is the collection of prejudices acquired by age eighteen” – Einstein
While working on the Winograd schemas, I kept wondering whether the methods I programmed can be considered intelligent processes. Certainly reasoning is an intelligent process, and many of my methods are inferences. i.e. By combining two given facts, the program concludes a third fact that wasn’t apparent. I suppose what makes me hesitate to call these inferences particularly intelligent is that the program has been told which sort of proof to infer which sort of conclusion from, as opposed to having it search for proof entirely without predetermined categories. And yet we ourselves use such axioms all the time: When someone asks for something, we presume they want it. When someone gives something, we presume we can have it. Practically it makes no difference whether such rules are learned, taught or programmed, we use them all the same. Therefore I must conclude that most of my methods are just as intelligent as when humans apply the same logic. How intelligent that is of humans, is something we should reconsider instead of presume.

I do not consider it a sensible endeavour however to manually program axioms for everything: The vocabulary involved would be too diverse to manage. But for the most basic concepts like time, space and laws of physics, I believe it is more efficient to model them as systems with rules than to build a baby robot that has a hard time figuring out how even gravity works. Everything else, including all exceptions to the axioms, can be taught or learned as knowledge.

Another question is whether the Winograd Schema Challenge tests intelligence, something that was also suggested of its predecessor, the Turing Test. Perhaps due to my approach, I find that it mainly tests language processing (a challenge in itself) and knowledge of the ground rules of our world. Were this another planet where gravity goes upward and apologising is considered offensive, knowing those rules would be the key to the schemas more often than intelligence. Of course intelligence does need to be applied to something to test it, and the test offers a domain inbetween too easy to fake and too impossible to try. And because a single word can entirely change the outcome, the test requires a more detailed analysis than just the comparison of two key words. My conclusion is that the Winograd Schema Challenge does not primarily test intelligence, but is more inviting to intelligent approaches than unintelligent circumventions.

a game of crossword pronouns

Crossword pronouns
Figuring out the mechanisms behind various Winograd schemas was a pleasant challenge. It felt much like doing advanced crossword puzzles; Solving verbal descriptions from different angles, with intersecting solutions that didn’t always add up. Programming the methods however was a chore, getting all the effects of modifying words “because/so/but/not” to play nice in mathematical formulas, and making the axioms also work in reverse on a linearly processing computer.

I should be surprised if I were to do better than universities and companies, but I would hope to do well enough to show that resources aren’t everything. My expectations are nevertheless that despite the contest’s efforts to encourage reasoning, more mundane methods like rote learning will win through sheer quantity, as even the difficult schemas contain common word combinations like “ask – answer”, “lift – heavy” and “try – successful”. But then, how could they not.

Regardless the outcome of the test, it’s been an interesting side-quest into another elusive area of computer abilities. And I already benefit from the effort: I now have a helpful support to my A.I.’s language understanding, and potentially a tool to enhance many other processes with. That I will no longer find “elephants in my pajamas”, is good enough for me.