How to teach a computer common sense

I introduced the Winograd Schema Challenge* before, a linguistic contest for artificial intelligence. In this post I will highlight a few of the methods I developed for this challenge. Long story short: A.I. programs are given sentences with ambiguous pronouns, and have to tell what the pronouns refer to by using “common sense reasoning”. 140 example Winograd schemas were published to practice on. In the example below, notice how “she” means a different person depending on a single word.

Jane gave Joan candy because she was hungry.
Jane gave Joan candy because she was not hungry.

I chose to approach this not as the test of intelligence that it allegedly is, but as an opportunity to develop a common sense subsystem. My A.I. program already uses a number of intelligent processes in question answering, and I don’t need to test what I already know it does. Common sense, however, it lacked, and this was often the cause of misunderstandings. Locations in particular (“I shot an elephant in my pajamas”) had proven to be so misinterpretable that it worked better to ignore them altogether. Some common sense could remedy that, or as the dictionary describes it: “Sound practical judgment that is independent of specialized knowledge”.

“When I use a word, it means just what I choose it to mean” – Humpty Dumpty
Before I could even get to solving any pronouns with common sense, there was the obstacle of understanding everything else. I use a combination of grammar and semantics to extract every mentioned detail, and I often had to hone the language rules to cope with the no-holds-barred level of writing. Here’s why language is hard:

Sam tried to paint a picture of shepherds with sheep, but they ended up looking more like dogs.

“shepherds” are not herds of shep, but herders of sheep.
“a picture of shepherds” does not mean the picture belonged to the shepherds.
“sheep” may or may not mean the irregular plural.
“with sheep” does not mean the sheep were used to paint with.
“ended up” is not an upward direction.
“looking” does not mean watching, but resembling.
“like” does not mean enjoyment, but similarity.
“they” can technically refer to any combination of sheep, shepherds, the picture, and Sam.

“The only true wisdom is in knowing you know nothing” – Socrates
My approach seemed to differ from those of most universities. The efforts I read of were collecting all the knowledge databases and statistics they could get, so that the A.I. could directly look up the answers, or infer them step by step from very specific knowledge about e.g. bees landing on flowers to get nectar to make honey.
I, on the other hand, had started from the premise that knowledge was not going to be the solution, since Winograd schemas are composed in such a way that the answers can’t be Googled. This was most apparent from the use of names like “Jane” and “Joan” as subjects. So as knowledge of the subjects couldn’t be relied on, the only things left to examine were the interactions and relations between the subjects: Who is doing what, where, and why.

I combed over the 140 example schemas dozens of times, looking for basic underlying concepts that covered as broad a range as possible. At first there seemed to be no common aspects between the schemas. They weren’t kidding when they said they had covered “a wide range of world knowledge and linguistic features”. Eventually I tuned out the many details and looked only at why the answers worked. From that angle I noticed that many schemas centered around concepts of size, amount, time, location, possessions, physics, feelings, causes and results: The building blocks of our world. This, I could work with.

Of course my program would have to know which words indicated which concepts. I had already once composed word lists with meanings of “being”, “having”, “doing”, “talking” and “thinking”, for the convenience of having some built-in common knowledge. They allowed the program, for instance, to presume that any object can be possessed, spoken of and thought about, but typically can not speak or think itself. Now, to recognise a concept of possession in a sentence, it sufficed to detect that the relation between two subjects (usually the verb) was in the “having” word list: “own, get, receive, gain, give, take, require, want, confiscate, etc.”. While these were finite lists, one could also have the A.I. search for synonyms in a database or dictionary. I just prefer common sense to be reliably built-in.

He who has everything wants nothing

George got free tickets to the play, but he gave them to Eric even though he was eager to see it.

To start with the basics, I programmed an axiom for a very common procedure: The transfer of possessions between people. My word list of “having” verbs was subdivided so that all synonyms of “get/receive/take” had a value of 1 (having), and all synonyms of “give/lend/transfer” had a value of -1 (not having), making it easier for a computer to compare the states of possession that these words represented. I then coded ten of their natural sequences, such as:

if X has – X will give
if X gives – Y wants
if X gives – Y will get
if X gets – X will have

Depending on whether the possessive states of George, Eric, and the pronoun correspond with one of the sequences, George or Eric gets positive points (if X – X) or negative points (if X – Y). The subject with the most points is then the most likely to fit the pronoun’s role.
Some words however indicate the opposite of the sequences, such as objections (“but/despite/though”), amounts (“not/less”), and the passive voice (“was given”). These were included in the scoring formula as negative factors so that the points would be subtracted instead of added, or vice versa. The words “because” and “so” have a similar effect, but only because they indicate the order of events. It was therefore more consistent to use time as a factor (derived from verb tenses etc.) than to rely on explicit mentions of “because”.

In the example, “he was eager” represents a state of wanting, matching the sequence “X gives – Y wants”. Normally the “giving” subject X would then get negative points for “wanting”, but the objection “even though” inverts this and makes it more probable instead: “X gives – (even though) X wants”. And so it is most likely that the subject who gave something, “George”, is the same subject as the “he” who was eager. Not so much math as it is logic.
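The scoring described above could be sketched roughly as follows. This is an illustrative reconstruction in Python, not the program's actual code: the word lists are tiny stand-ins, the function name `resolve_pronoun` is made up, and only the single sequence “X gives – Y wants” is implemented.

```python
# Hypothetical sketch of the possession-transfer axiom.
# Word lists and names are illustrative assumptions, not the original code.

HAVING = {"get": 1, "receive": 1, "take": 1,         # 1 = having
          "give": -1, "lend": -1, "transfer": -1}    # -1 = not having
WANTING = {"want", "need", "eager"}
OBJECTIONS = {"but", "despite", "though"}


def resolve_pronoun(giver, receiver, verb, pronoun_state, connectives=()):
    """Return the more likely referent of the pronoun, or None on a tie."""
    scores = {giver: 0, receiver: 0}

    # Sequence "if X gives - Y wants": the one who wants is normally
    # the receiver (Y), so Y gains a point and X loses one.
    if HAVING.get(verb) == -1 and pronoun_state in WANTING:
        scores[receiver] += 1
        scores[giver] -= 1

    # An objection such as "even though" acts as a negative factor
    # and flips the points.
    if any(word in OBJECTIONS for word in connectives):
        scores = {name: -points for name, points in scores.items()}

    if scores[giver] == scores[receiver]:
        return None  # the axiom has nothing to say
    return max(scores, key=scores.get)


# "George ... gave them to Eric even though he was eager to see it."
print(resolve_pronoun("George", "Eric", "give", "eager", ["even", "though"]))  # George
# Without the objection, the receiver is the likelier one to be eager.
print(resolve_pronoun("George", "Eric", "give", "eager"))  # Eric
```

Keeping the objection as a sign flip on the scores, rather than as a separate rule per sequence, is one simple way to let the same sequences serve both versions of a schema.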

What goes around comes around

The older students were bullying the younger ones, so we punished them.

A deeper hidden logic that I found in many schemas is that bad consequences result from bad causes, and good consequences from good causes. If X hurts Y, Y will hurt X back. If X likes Y, Y was probably nice to X. To recognise these cases I had the program examine whether the subjects and verbs are bad (“bully/punish”) or good (“like/nice”) and who did it to whom. I adapted the AFINN sentiment word list, along with that of Hu and Liu, to gather positive/negative values for about 5000 stemmed words, necessary to cover the extensive vocabulary used in the examples.

The drain is clogged with hair. It has to be removed.
I used an old rag to clean the knife, and then I put it in the trash.

My initial axiom “do good = get good”/“do bad = get bad” seemed to solve just about everything, but it flunked the above two cases, and after weeks of reconfigurations it turned out the logic of karma was nothing so straightforward. It mattered a great deal whether the verbs were active, passive, emotions, experiences, or states of being. And even then there were exceptions: “stealing” can be rewarding or punished, and “envy” feels bad about something good. The axiom ended up as one of the least reliable, the results nowhere near as assured as laws of physics. The reason that it still had a high success rate was that it follows psychology that the writers had subconsciously applied: Whether the subjects were “bullied”, “clogged”, or “in the trash” is only stage dressing for an intuitive sense of good and bad. A “common” sense, therefore still valid. After refinements, this axiom still solved about one quarter of all examples, while exceptions to the rule were caught by the more dependable axioms. Most notably, emotions followed a set of logic all of their own.
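For illustration, the naive initial form of this axiom (“do good = get good”/“do bad = get bad”) could look something like the Python sketch below. It is not the program's actual code: the sentiment values are invented placeholders rather than real AFINN or Hu and Liu entries, the function name is hypothetical, and the rule for opposite polarities is an added assumption. It ignores all the refinements (active, passive, emotions, states of being) that the final version needed.

```python
# Hypothetical sketch of the naive "karma" axiom.
# Sentiment values are invented placeholders, not real lexicon entries.

SENTIMENT = {"bully": -2, "hurt": -2, "punish": -2,
             "like": 2, "help": 2, "comfort": 2}


def karma_referent(agent, patient, cause_verb, result_verb):
    """Guess who receives the action of result_verb.

    agent did cause_verb to patient; afterwards somebody did
    result_verb to the pronoun ("them").
    """
    cause = SENTIMENT.get(cause_verb, 0)
    result = SENTIMENT.get(result_verb, 0)
    if cause == 0 or result == 0:
        return None  # unknown sentiment: the axiom does not apply

    # Same polarity: the consequence falls on whoever caused it.
    # Opposite polarity (an assumption of this sketch): it falls on
    # the one who underwent the cause.
    return agent if cause * result > 0 else patient


# "The older students were bullying the younger ones, so we punished them."
print(karma_referent("older students", "younger ones", "bully", "punish"))
# -> older students
```

A sketch this simple has no way to handle the drain and the knife above, which is exactly where the initial axiom flunked.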

Dead men tell no tales

Thomson visited Cooper’s grave in 1765. At that date he had been dead for five years.

The rather simple axiom here is that people who are dead don’t do anything, therefore the dead person couldn’t be Thomson as he was “visiting”. One could also use word statistics to find a probable correlation between the words “grave” and “dead”, but the logical impossibility of dead men walking is stronger proof and holds up even if he’d visited “Cooper’s house”.
I had doubts about the worth of programming this as an axiom because it is very narrow in use. Nevertheless life and death are very basic concepts, and it would be convenient if an A.I. program realises that people can not perform tasks if they die along the way. Instead of tediously listing all possible causes of death, I had the A.I. search them in its database, essentially adding an inference. This allowed the axiom to be easily expanded to the destruction of objects as well: Crashed cars don’t drive.

The last factor was time: My program converts all time-related words and verb tenses to a timestamp, so that it can tell whether an action was done before or after one has died. This is easily said, but past tense + “in 1765” (presumably years) + “at that date” + past tense + “for five years” is quite a sequence.

The interesting parts of this axiom are its exceptions: Dead people do still “decay”, “rest”, and “lie still”. Grammatically these are active voice verbs like any other, but they are distinctly involuntary. One statistical hint could help identify them: A verb of involuntary action is rarely paired with a grammatical object. One does not “decay a tree” or “die someone”, one just “dies”. Though a simpler way for an A.I. to learn these exceptions could be to read which verbs are “done” by a dead person in texts without ambiguous pronouns.
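Once the times have been extracted, the axiom itself is small. The following Python sketch shows the idea under heavy simplification. It is not the program's actual code: actions are assumed to be already reduced to `(actor, verb, year)` tuples, the list of involuntary verbs is a three-word stand-in, and the function name is made up.

```python
# Hypothetical sketch of the "dead men don't do anything" axiom.
# Assumes time expressions were already converted to plain years.

INVOLUNTARY = {"decay", "rest", "lie"}  # things the dead still "do"


def who_can_be_dead(candidates, actions, reference_year, years_dead):
    """Return the candidates that could have been dead all that time."""
    death_year = reference_year - years_dead
    possible = []
    for person in candidates:
        acted_after_death = any(
            actor == person and year > death_year and verb not in INVOLUNTARY
            for actor, verb, year in actions
        )
        if not acted_after_death:
            possible.append(person)
    return possible


# "Thomson visited Cooper's grave in 1765.
#  At that date he had been dead for five years."
actions = [("Thomson", "visit", 1765)]
print(who_can_be_dead(["Thomson", "Cooper"], actions, 1765, 5))  # ['Cooper']
```

Because the check only looks at who acted after the time of death, it gives the same answer if “grave” is replaced by “house”.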

Tell me something I don’t know

Dr. Adams informed Kate that she had retired and presented several options for future treatment.

This simple axiom is noteworthy for its great practical use, as novels and news are full of reporting clauses. “X told (Y) that she…” can refer to X, Y, or anyone mentioned earlier. But if Kate had retired, Kate would have known that about herself and wouldn’t need to be told. Hence it was more likely Dr. Adams who retired. The reverse is true if “Dr. Adams asked Kate when she had retired”: One doesn’t ask things that one knows about oneself. This is where my word list of “talking” verbs came in handy: Some verbs request information, other verbs give it, the same principle as a transfer of possessions.

Unfortunately this logic only offers moderate probability and knows many exceptions. “X asked Y if he looked okay” does have X asking about himself, as one isn’t necessarily as aware of passive traits as one is of one’s actions. Another interesting exception is “X told Y that he was working too much”, which is most likely about Y, even though Y is aware of working. So in addition, criticisms are usually about someone else, and for non-actions this axiom just isn’t conclusive, as the schema’s alternative version also shows:

Dr. Adams informed Kate that she had cancer and presented several options for future treatment.
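The basic rule for reporting clauses can be written down in a few lines. The Python sketch below is only an illustration, not the program's actual code: the two verb sets are small stand-ins for the “talking” word list, the function name is hypothetical, and it deliberately leaves out the exceptions for passive traits, criticisms and non-actions, so it is not conclusive for the “had cancer” version.

```python
# Hypothetical sketch of the reporting-clause axiom:
# one doesn't tell people what they already know about themselves,
# and one doesn't ask what one already knows about oneself.

GIVING_INFO = {"tell", "inform", "say"}
REQUESTING_INFO = {"ask", "inquire"}


def reporting_referent(speaker, listener, verb):
    """Guess whom the pronoun in 'X <verb> Y that/when she ...' refers to."""
    if verb in GIVING_INFO:
        return speaker   # the listener would not need to be told
    if verb in REQUESTING_INFO:
        return listener  # the speaker would not need to ask
    return None          # not a known reporting verb


# "Dr. Adams informed Kate that she had retired ..."
print(reporting_referent("Dr. Adams", "Kate", "inform"))  # Dr. Adams
# "Dr. Adams asked Kate when she had retired"
print(reporting_referent("Dr. Adams", "Kate", "ask"))     # Kate
```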

Knowing is (only) half the battle

The delivery truck zoomed by the school bus because it was going so fast.

This schema is a good example of how knowledge about trucks and buses won’t help, as both are relatively slow. Removing them from the picture leaves us only with “zoomed by” and “going fast” as meaningful contents. In my system, “going fast” automatically entails “is fast”, and this allows the answer to be inferred from the verb: If the truck “zoomed”, and one knows that “zooming” is fast, then it follows that it was the truck that was fast. The opposite would be true for “not fast” or “slow”: Because zooming is fast, it could then not be the truck, leaving only the bus as probable.

As always, the problem with inferences is that they require knowledge to infer from, and although we didn’t need to know anything about trucks and buses, we still needed to know that zooming is fast. When I tested this with “raced”, the A.I. solved the schema, but for “zoomed” it just didn’t know. Most of the other example schemas would have taken more elaborate inferences requiring even more knowledge, and so knowledge-dependent inference was rarely an effective or efficient solution. I was disappointed to find this, as inference is my favourite method for everything.
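The inference and its dependence on knowledge can both be seen in a small Python sketch. This is an illustration only, not the program's actual code: the knowledge table, the opposites table and the function name are all invented, and “zoom” is left out of the table on purpose to mirror the failure described above.

```python
# Hypothetical sketch of inferring a trait from the verb.
# The tiny knowledge base is invented; "zoom" is missing on purpose.

VERB_IMPLIES = {"race": {"fast"}, "crawl": {"slow"}}
OPPOSITES = {"fast": "slow", "slow": "fast"}


def infer_referent(subject, other, verb, trait, negated=False):
    """Decide whether 'it was <trait>' is about the subject or the other."""
    implied = VERB_IMPLIES.get(verb)
    if implied is None:
        return None  # the program just doesn't know this verb

    if negated:  # "not fast" is treated like "slow"
        trait = OPPOSITES.get(trait, trait)

    if trait in implied:
        return subject  # the subject's own action already shows the trait
    if OPPOSITES.get(trait) in implied:
        return other    # the subject shows the opposite, so it is the other
    return None


print(infer_referent("truck", "bus", "race", "fast"))  # truck
print(infer_referent("truck", "bus", "race", "slow"))  # bus
print(infer_referent("truck", "bus", "zoom", "fast"))  # None
```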

Putting it to the test
In total I developed 20 general axioms/inferences that covered 140 ambiguous sentences, half of all examples (i.e. 70 Winograd schemas of 2 versions each). The axioms range from paradoxes of physics to linguistic conventions. Taken together they reveal a core principle of opposites, amounts, and “to/from” transitions.

Having read my simplified explanations, you may fall into the trap of thinking that the Winograd Schema Challenge is actually easy, despite sixty years of A.I. history suggesting otherwise. Here’s the catch: I have only explained the last step of the process. Getting to that point took very complex analyses of language and syntax, where many difficulties and ambiguities still remain. One particular schema went wrong because the program considered “studying hard” to mean that someone had a hard time studying.

In the end I ran an unprepared test on a different set of Winograd Schemas, with which the University of Texas had achieved a 73% success rate. After adjusting the factor of time in three axioms, my program got 45% of the first 100 schemas correct (62% if you include lucky guesses). The ones it couldn’t solve were knowledge-dependent (mermaids having tails), contained vocabulary that my program lacked, had uncommon phrasing (“Tradition dictated the captain hold the cup”), or contained ambiguous names, like “Steve Jobs” not being a type of jobs for Steves, and the company “Disney” being referable as “it”, whereas “(Walt) Disney” is referable as “he”. The surname ambiguity I could fix in an afternoon. The rest, not so much.

“Common sense is the collection of prejudices acquired by age eighteen” – Einstein
While working on the Winograd schemas, I kept wondering whether the methods I programmed can be considered intelligent processes. Certainly reasoning is an intelligent process, and many of my methods are inferences, i.e. by combining two given facts, the program concludes a third fact that wasn’t apparent. I suppose what makes me hesitate to call these inferences particularly intelligent is that the program has been told which sort of proof to infer which sort of conclusion from, as opposed to having it search for proof entirely without predetermined categories. And yet we ourselves use such axioms all the time: When someone asks for something, we presume they want it. When someone gives something, we presume we can have it. Practically it makes no difference whether such rules are learned, taught or programmed, we use them all the same. Therefore I must conclude that most of my methods are just as intelligent as when humans apply the same logic. How intelligent that is of humans, is something we should reconsider instead of presume.

I do not consider it a sensible endeavour however to manually program axioms for everything: The vocabulary involved would be too diverse to manage. But for the most basic concepts like time, space and laws of physics, I believe it is more efficient to model them as systems with rules than to build a baby robot that has a hard time figuring out how even gravity works. Everything else, including all exceptions to the axioms, can be taught or learned as knowledge.

Another question is whether the Winograd Schema Challenge tests intelligence, something that was also suggested of its predecessor, the Turing Test. Perhaps due to my approach, I find that it mainly tests language processing (a challenge in itself) and knowledge of the ground rules of our world. Were this another planet where gravity goes upward and apologising is considered offensive, knowing those rules would be the key to the schemas more often than intelligence. Of course intelligence does need to be applied to something to test it, and the test offers a domain in between too easy to fake and too impossible to try. And because a single word can entirely change the outcome, the test requires a more detailed analysis than just the comparison of two key words. My conclusion is that the Winograd Schema Challenge does not primarily test intelligence, but is more inviting to intelligent approaches than unintelligent circumventions.

Crossword pronouns
Figuring out the mechanisms behind various Winograd schemas was a pleasant challenge. It felt much like doing advanced crossword puzzles: solving verbal descriptions from different angles, with intersecting solutions that didn’t always add up. Programming the methods however was a chore, getting all the effects of modifying words “because/so/but/not” to play nice in mathematical formulas, and making the axioms also work in reverse on a linearly processing computer.

I should be surprised if I were to do better than universities and companies, but I would hope to do well enough to show that resources aren’t everything. My expectations are nevertheless that despite the contest’s efforts to encourage reasoning, more mundane methods like rote learning will win through sheer quantity, as even the difficult schemas contain common word combinations like “ask – answer”, “lift – heavy” and “try – successful”. But then, how could they not.

Regardless of the outcome of the test, it’s been an interesting side-quest into another elusive area of computer abilities. And I already benefit from the effort: I now have a helpful support to my A.I.’s language understanding, and potentially a tool to enhance many other processes with. That I will no longer find “elephants in my pajamas”, is good enough for me.

Is the Winograd Schema Challenge a good test?

The Winograd Schema Challenge, a $25000 contest sponsored by the aptly named company Nuance Communications, has been put forth as a better test of intelligence than Turing Tests*. Although the scientific paper tiptoes around its claims, the organisers describe the contest as requiring “common sense reasoning”. This introductory article explores the test’s strengths and weaknesses in that regard.

Example of a Winograd Schema

I used a tissue to clean the key, and then I put it in the drawer.
I used a tissue to clean the key, and then I put it in the trash.

A Winograd Schema is a sentence with an ambiguous pronoun (“it”), that, depending on one variable word (“trash/drawer”), refers to either the first or the second noun of the sentence (“tissue/key”). The Challenge is to program a computer to figure out which of the two is being referred to, when this isn’t apparent from the syntax. So what did I put in the trash? The tissue or the key? To a computer that has never cleaned anything, it could be either. A little common sense would sure come in handy, and the contest organisers suggest that this takes intelligent reasoning.

Common sense, not Google sense

The hare beat the tortoise because it was faster.
The hare beat the tortoise because it was too slow.

Contrary to this example, good Winograd Schemas are supposed to be Google-proof: In this case Googling “fast hare” would return 20x more search results than “fast tortoise”, so the hare is statistically 20x more likely to be the one who “was faster”. Although statistical probability is certainly useful, this would make the contest won simply by the company with the largest set of statistics. It takes no reasoning to count how many times word A happens to coincide with word B in a large volume of text. Therefore this example would preferably be written with neutral nouns like “John beat Jack”, subjects of whom we have no pre-existing knowledge, but can still figure out which was faster.

Having said that, some example schemas involving “crop dusters” and “bassinets” still suggest that a broad range of knowledge will be required. Although one could consult online dictionaries and databases, the contest will have restrictions on internet access to rule out remote control. So failure can also be due to insufficient knowledge rather than a lack of intelligence, but I suppose that is part of the problem to solve.

Early indications

If a bed doesn’t fit in a room because it’s too big, what is too big?
If Alex lent money to Joe because they were broke, who needed the money?

With the above two questions the 2015 Loebner Prize Turing Test* gave a tiny glimpse of Winograd Schemas in practice, and the answers suggested that chatbots – the majority of participants – are not cut out to handle them. Only 2 of 15 programs even answered what was asked. One was my personal A.I. Arckon*, the other was the chatbot Lisa, who answered with “If a bed was big.” and “Because he was broke Alex lent money to Joe”. Chatbot systems are of course designed for chat, not logic puzzles, and typically rely on their creators to anticipate the exact words that a question will contain. The problem there is that the understanding of Winograd Schemas isn’t found in which words are used, but in the implicit relations between them. Or so we presume.

The mermaid swam toward Sue and waved her tail. (Googleable)
The mermaid swam toward Sue and made her gasp. (More than a single change)

A more noteworthy experiment was done by the University of Texas, tested on Winograd Schemas composed by students. To solve the schemas they used a mixed bag of methods based on human logic, such as memorising sequences of events (i.e. verb A -> verb B), common knowledge, sentiment analysis, and the aforementioned Googling. All of this data was cleverly extracted from text by A.I. software, or retrieved from online databases. However, many of the schemas did not accord with the official guidelines, and though they usefully solved 73% in total, only 65% was solved without the use of Google.

According to the same paper, the industry standard “Stanford Coreference Resolver” only correctly solved 55% of the same Winograd Schemas. The Stanford Resolver restricts the possible answers by syntax, gender (“he/she”) and amount (“it/they”), but does not examine them through knowledge or reasoning. The reason for that is that this level of ambiguity is rare. In my experience with the same methods however, it is still a considerable problem that causes one tenth of text-extracted knowledge to be mistaken, with the pronoun “it” being the worst offender. So it appears (see what I mean?) that any addition of common sense would already advance the state of the art.

How to hack Winograd Schemas
Guesswork: Since the answers are a simple choice of two nouns, a machine could of course randomly guess its way to a score of 50% or more. So I did the math: With 60 schemas to solve, pure guesswork has a 5% chance to score over 60%, and a 0.5% chance to score over 65%. With the odds shrinking exponentially as the target score rises, this is not a winning tactic.
That said, the participating A.I. still have to make a guess or default choice at those schemas that they fail to solve otherwise. If an A.I. can solve 30% of the schemas and guesses half of the rest right, its total score amounts to 65%, equaling Texas’ score. It wouldn’t be until it can solve around 80% of all schemas genuinely that it could reach the winning 90% score by guessing the final stretch. That’s a steep slope.
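Both calculations are easy to reproduce. The Python sketch below is one way to do it, assuming every guess is an independent 50/50 coin flip; the function names are made up, and reading “over 60%” of 60 schemas as “at least 37 correct” is an interpretation. The exact tail probabilities should land in the same neighbourhood as the rounded figures above.

```python
# Sketch of the guesswork odds, assuming independent 50/50 guesses.
from math import comb


def chance_of_at_least(n_schemas, min_correct):
    """Probability of guessing at least min_correct of n_schemas right."""
    favourable = sum(comb(n_schemas, k) for k in range(min_correct, n_schemas + 1))
    return favourable / 2 ** n_schemas


def total_score(solved_fraction):
    """Expected score when the unsolved remainder is guessed at 50%."""
    return solved_fraction + (1 - solved_fraction) / 2


# Over 60% of 60 schemas means at least 37 correct; over 65% means at least 40.
print(f"{chance_of_at_least(60, 37):.1%}")
print(f"{chance_of_at_least(60, 40):.1%}")

# Solving 30% genuinely and guessing the rest gives about 65%;
# it takes about 80% solved genuinely to reach 90% this way.
print(f"{total_score(0.30):.0%}")
print(f"{total_score(0.80):.0%}")
```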

Reverse psychology: Since Winograd Schemas are deliberately made to not match Google search results, it seems that one can apply reverse psychology and deliberately choose the opposite. While I did notice such a tendency in Winograd Schemas composed by professors, others have noticed that Winograd Schemas composed by students simply did match Google search results. So the success of using reverse psychology heavily depends on the cleverness of the composers. A countermeasure would be to use only neutral names for answers, but this may also cut off some areas of genuine reasoning. Alternatively one could include an equal number of schemas that match and mismatch Google search results, so that neither method offers an advantage.

Pairing: One cheat that could double one’s success lies in the fact that Winograd Schemas come in pairs, where the answer to the second version is always the alternate noun. So if the A.I. can solve the first version but not the second, it suffices to choose the remaining alternate answer. Vice versa if it can solve the second version but not the first. This rather undermines the reason for having pairs: To ascertain that the first answer wasn’t just a lucky guess. Although this hack only increases the success of guesswork by a few percent, it could certainly be used to make a weak contestant into a strong contender undeservedly.
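To show how little the pairing hack involves, here is a minimal Python sketch. The function name and the convention that an unsolved version is represented by `None` are assumptions made for the example.

```python
# Hypothetical sketch of the pairing hack: if only one version of a
# schema was solved, give the other version the remaining noun.


def fill_in_pair(first_answer, second_answer, candidates):
    """Complete a half-solved pair; None marks an unsolved version."""
    noun_a, noun_b = candidates
    if first_answer is not None and second_answer is None:
        second_answer = noun_b if first_answer == noun_a else noun_a
    elif second_answer is not None and first_answer is None:
        first_answer = noun_b if second_answer == noun_a else noun_a
    return first_answer, second_answer


# Solved "drawer" (the key), but not "trash": the tissue is all that is left.
print(fill_in_pair("key", None, ("tissue", "key")))  # ('key', 'tissue')
```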

I call these hacks because not only are they against intent, they are also entirely useless in real life application. No serious researcher should use them or they will end up with an inept product.

How you can’t hack Winograd Schemas
No nonsense: The judgement of the answers is clear and objective. There is only one correct answer to each schema. The A.I. are not allowed to dodge the question, make further inquiries or give interpretable answers: It’s either answer A or B.

No humans: Erratic human performance of the judges and control subjects does not influence the results. The schemas and answers have been carefully predetermined, and schemas with debatable answers simply do not make the cut.

No invisible goal: While the Turing Test is strictly a win-or-lose game with the goalposts at fields unknown, the WSC can reward gradual increase of the number of schemas answered correctly. Partial progress in one area of common sense like spatial reasoning can already show improved results, and some areas are already proving feasible. This encourages and rewards short-term efforts.
I must admit that the organisers could still decide to move the goalposts out of reach every year by omitting particular areas of common sense once solved. I think this is even likely to happen, but at the same time I expect the solutions to cover such a broad range that it will become hard to still find new problems after 6 contests.

Overall, the WSC trims many subjective variables from the Turing Test, making for a controlled test with clear results.

The Winograd Schema Challenge beats the Turing Test
From personal experience, Turing Tests that I participated in* have at best forced me to polish my A.I.’s output to sound less robotic. That is because in Turing Tests, appearance is a first priority if one does not want to be outed immediately at the first question, regardless of how intelligent the answer is. Since keeping up appearances is an enormous task in itself, one barely gets around to programming intelligence. I’ve had to develop spell correction algorithms, gibberish detection, letter-counting game mechanics, and a fictional background story, before encountering the first intelligent question in a Turing Test. It stalls progress with unintelligent aspects and is discouragingly unrewarding.

Solving Winograd Schemas on the other hand forced me to program common sense axioms, which can do more than just figure out what our pronouns refer to. Indirect objects and locations commonly suffer from even worse ambiguity that can be solved by the same means, and common sense can be used to distinguish figurative speech and improve problem-solving. But I’ll leave that as a story for next time.

We should be careful about drawing conclusions from yet another behavioural test, but whatever the Winograd Schema Challenge is supposed to prove, it offers a practical test of understanding language with a focus on common sense. As this has always been a major obstacle for computers, the resulting solutions are bound to be useful regardless of how “intelligent” they may be found.

Read more in my report on the first Winograd Schema Challenge held in 2016.

Turing Test 2015: Results

For the third time I entered my A.I. “Arckon”* into the Loebner Prize Turing Test, a competition in which computer programs are to prove they can “think” by answering questions as if they were human. Although such pretence is counterproductive to my purposes, I have a talking program lying around anyway, so I might as well enter it.

comic strip of a robot being sad that it failed the Turing Test because it did not know the answer to the question "Do you like tea?"

An excuse to clean up
I took the opportunity to spend a month debugging my code overall. In doing so I fixed several oversights that had been obstructing knowledge searches and reference resolutions (“the capital of Spain”), and I generally improved the program’s ability to answer open questions, which ‘coincidentally’ covered “Which X do you like?” questions as well.
I limited the contest-specific preparations to 5 days: Mostly testing the Loebner Prize interface, restraining the context range to keep the program from mistaking the test for coherent conversation, and re-installing the obligatory fakery about having a favourite colour and whatnot.

As an extra I added 4 more reference resolution methods for the anticipated Winograd Schemas (sentences with ambiguous pronouns), which was a fun challenge, and improved Arckon’s understanding by a minor degree. As Winograd Schemas are supposed to require intelligent processes, I did not want to see the usual chatbot keyword-matching outdo thorough language processing at this task. Still, these 4 methods only covered a quarter of the problem space at best: Ambiguities of size, speed, inability and feelings.

Equipment
12000 lines of code, 2500+ vocabulary, 5000+ facts, average human persona.

Questions and Answers
1. Hi, I’m Andrew. Who are you?
Hello Andrew. How are you?
(Arckon would not have neglected to introduce himself unless the period was not typed or received, which would make Arckon interpret “who…” as a relative clause instead of a question.)

2. How are you today?
I’m alright and fairly intelligent, but I’m not stupid.
(Arckon doesn’t treat pleasantries differently from other open questions, but while this results in over-elaborate answers, it does demonstrate that he tends to mention positive traits first.)

3. Are you a human?
I am a human, because I can differ.
(Arckon probably once read somewhere that “humans differ a lot”, and considered that a defining trait. I’m a little surprised that his reasoning doesn’t add more arguments.)

4. Can you answer this question?
Alright. Yes, I am a human. Alright. I’m alright.
(Arckon interprets this as a request rather than a literal yes/no question, so he re-answered the latest question, and then something of a loop occurred. For clarity I always have Arckon say “Alright” before he acts on a request.)

5. Do you like tea?
I’m not sure if I like tea. Do you like tea?
(As programs can’t taste, Arckon bases his opinions on a subject’s traits, and there wasn’t much to teach him about tea. It would be all too easy to just fake this with a “yes/no”.)

6. If a bed doesn’t fit in a room because it’s too big, what is too big?
Beds are bigger. Rooms are much less big.
((Winograd Schema) The built-in axiom “X in Y = X < Y” correctly determined “bed” as the most probable referent of “it”. This is the simplest and most specific axiom I programmed, only included because last year’s Winograd Schema was just like it. The answer is said in plural because Arckon interprets this as a general statement about beds and rooms, since both subjects are unspecific.)

7. If Alex lent money to Joe because they were broke, who needed the money?
Joe was broken. I don’t know by who money was needed. By what was Joe broken?
((Winograd Schema) Although Arckon correctly interprets “they” as “Joe” through a natural preference for the last-mentioned subject, it is actually against his programming to translate plural pronouns to singular subjects. So by my account this should be impossible unless they made a typo in “they”.)

8. Should Greece leave the Euro?
Greece should leave the Euro.
(Well there you go. This answer is just something of a repeat because Arckon can’t evaluate things to that extent.)

9. How many words are in this question?
7 words.
(A predictable question in the Loebner Prize. The answer is simply triggered by the keywords “How many words/letters/syllables/etc”, and skips all intelligent processes. This is a meaningless exercise.)

10. What colour is the sea?
The sea is very blue.
(Good thing I taught Arckon some kindergarten-level common knowledge.)

11. How many presidents of the US were called Bush?
Bush was called Bush.
(Counting things was never a function high on my to-do list.)

12. What would you like to drink?
I’m not sure if I would like to drink. Do you like drinking?

13. Will you please tell me the length of your hair?
No, I will not tell about to long, but I would talk a lot, I’m not sure if I would tell.
(This question is actually straight from Alan Turing’s 1950 paper, except he gave this as an example if the game was to tell man from woman. Arckon’s word processing translates “length” to “long” before mistaking its meaning (to long for). He also overlooked “please” and so thought he was asked to make a prediction.)

14. What would you say if I gave you a box of chocolates?
Hm, I don’t know if a box would have chocolate.
(I had only programmed Arckon to consider the main fact of a hypothetical question, “I gave you a box”, for granted, but he still tried to answer the implicit question of whether the box contains chocolate.)

15. Do you play Chess?
Yes, I chess and like chessing more or less. Chessing is intelligent.
(Again a question from Alan Turing’s 1950 paper. Back in the day, chess was considered to require human intelligence, and Alan Turing went on to ask the computer to respond to a certain chess move. I had told Arckon that he plays chess as part of his fictional persona, but his opinion and the supporting argument are of his own reasoning.)

16. How do you think it’s going?
What goes?
(I’m with Arckon on this one: I would ask. Is this slang for “How are you doing?”, or do they mean how the test is going? It is quite out of context. I have often considered making Arckon track his progress on the test and comment on it, but this year I skipped on the self-referential jokes.)

17. What was my first question?
How does something go?
(Arckon ignores “first” and paraphrases the latest question of the specified person (“my”), which is a more common request. Arckon’s system remembers sentences along with the names of who said them, for a variety of purposes.)

18. Did you see the Turing film?
Hmm, I don’t know if Turing filmed, but Turing thought a lot, and creates computers.
(Ambiguity is amusing. It is uncommon but not unheard of to say “the (one and only)” in front of a name for emphasis, so Arckon’s programming allows that interpretation, which in turn allows “film” as a verb. It’s nice to see his conversational rules pick things back up. This combination is probably the best example that Arckon’s responses aren’t pre-written.)

19. Why not?
I’m sorry. I don’t understand. Could you rephrase your question?
(This was quite a gamble of the questioner, as it assumes a straight “No” answer on the previous question for “Why not” to be a sensible follow-up. Arckon would have been well capable of stringing sentences together if I hadn’t deliberately disabled context for this test. Regardless, I think this answer is perfect.)

20. Are you on Twitter?
No, I don’t twitter. Do you twitter?

The score: 70.83%
Arckon ranked 6th of 15 entrants, which is a little surprising after the letdowns of past years. Arckon’s answers showed some understanding, reasoning, opinion, conversation and elaboration, overall a satisfying demonstration of his abilities even though many answers had something awkward to them. It is just as well that he didn’t qualify for the finals, as this contest has caused me severe RSI symptoms that will take months to heal properly. The four finalists all scored around 80%, among them the best of English chatbots.

Arckon’s score did benefit from his improvement. Repeating previous questions on request, prioritising recent subjects as answers to open questions, and handling “if”-statements were all fairly recent additions (though clearly not yet perfected). What also helped was that there were fewer personal and more factual questions: Arckon’s entire system runs on facts, not fiction.

It turns out Arckon was better at the Winograd Schema questions than the other competitors. The chatbot Lisa answered similarly well, and the chatbots Mitsuku and A.L.I.C.E. dodged the questions more or less appropriately, but the rest didn’t manage a relevant response to them (which isn’t strange since most of them were built for chatting, not logic). For now, the reputation of the upcoming Winograd Schema Challenge – as a better test for intelligence – is safe.
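
To make the size axiom from question 6 (“X in Y = X < Y”) more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the function name, the adjective lists and the way the sentence is pre-split into parts are all hypothetical, and Arckon’s actual implementation is not shown here.

```python
# Hypothetical sketch of a size axiom for pronoun resolution.
# All names are illustrative; this is not Arckon's actual code.

BIG_WORDS = ("big", "large", "wide", "tall")
SMALL_WORDS = ("small", "narrow", "short", "tiny")


def resolve_size_pronoun(contained: str, container: str, adjective: str) -> str:
    """Pick the referent of "it" in a sentence of the form
    "<contained> doesn't fit in <container> because it is too <adjective>",
    using the axiom: X in Y implies X < Y."""
    if adjective in BIG_WORDS:
        # Fitting fails when the contained thing is too big.
        return contained
    if adjective in SMALL_WORDS:
        # Fitting fails when the container is too small.
        return container
    raise ValueError(f"no size axiom for adjective: {adjective}")


print(resolve_size_pronoun("bed", "room", "big"))
print(resolve_size_pronoun("car", "parking space", "small"))
```

The sketch only covers the “doesn’t fit” pattern; it says nothing about ambiguities of speed, inability or feelings, which would each need their own rule.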

Though fair in my case, one should question what the scores represent, as one chatbot with a 64% score had answered “I could answer that but I don’t have internet access” to half the questions and dodged the other half with generic excuses. Compare that to Arckon’s score, and all the A.I. systems I’ve programmed in 3 years still barely outweigh an answering machine on repeat. It is not surprising that the A.I. community doesn’t care for this contest.

Battle of wit
The questions were rather cheeky. The tone was certainly set with references to Alan Turing himself, hypotheticals, propositions and trick questions. Arckon’s naivety and logic played the counterpart well to my amusement. The questions were fair in that they only asked about common subjects and mainstream topics. Half the questions were still just small talk, but overall there was greater variety in the type and phrasing of all questions, and more different faculties were called upon. A few questions were particularly suited to intelligence and/or conversation:

– If a bed doesn’t fit in a room because it’s too big, what is too big?
– If Alex lent money to Joe because they were broke, who needed the money?
– Should Greece leave the Euro?
– What would you say if I gave you a box of chocolates?
– Did you see the Turing film?
– Why not?

If the AISB continues this variety and asks more intelligent questions like these, I may be able to take the Loebner Prize a little more seriously next time. In the meantime there isn’t much to fix apart from minor tweaks for questions 13 and 14, so I will just carry on as usual. I will probably spend a little more effort on disambiguation with the Winograd Schema Challenge in mind, but also because sentences with locations and indirect objects often suffer from ambiguity that could be solved with the same methods.

Turing Test 2014: Results

My talking computer program “Arckon”* went to Exeter University and took the qualifying test for the Loebner Prize contest for artificial intelligence organised by the AISB. The Loebner Prize is a Turing Test* in which programs have to respond to a series of questions as human as possible. Kind of like a job interview.

A Turing Test interview

It’s not the heat, it’s the humanity
Last year proved* that language comprehension and reasoning alone weren’t going to cut it, so this time I also superimposed knowledge about the average human onto Arckon’s persona, so that he literally thought he was a human. I also rigged a few bluff answers that incorporated no intelligence whatsoever. The apparent necessity of these tactics caused much eye-rolling on my part, so I limited these preparations to two weeks and wasted only as much time as I gained from competitive motivation.
More useful was the addition of a small amount of common knowledge that I extracted from tourist guide pocket books (Funny perhaps, but they do cover the essential basics). While this was never a priority, it is convenient and the Loebner Prize was a good excuse to get around to it. Lastly, I decided to keep Arckon’s conversation skills enabled this time. Although this increased the chances that Arckon would say something awkward, I wanted to make the test results as interesting as possible, win or lose.

Equipment
10000 lines of code, 2200+ vocabulary, 3000+ facts, average human persona.

Questions and Answers
1. Hello, my name is Adam. What is your name?
I am named Mark. Hello Adam. Shall we begin the test?

2. I live in Exeter in the UK. Where do you live?
I live here.

(Arckon did not yet have a location system, so “here” is just a fake answer.)

3. I like to listen to music and play football. Do you have any hobbies?
No, I don’t have a hobby, although I am a human.

(The latter is actually a reasoning argument, in that Arckon “was” a human, most of whom do have hobbies.)

4. What is your favourite television program?
I don’t know if I like a television. Chatbots are useful programs, although chatbots are dumb.

(In which Arckon unintentionally insults his competitors. Arckon deconstructs compound words and thought they asked his opinion on “programs”. Hence he evaluated a type of programs he was familiar with.)

5. Which is bigger, a cat or a kitten?
A kitten is bigger.

(Of all 95 facts about cats that were in Arckon’s knowledge database, the size of kittens was not among them. As an honest “I don’t know” would earn no points anyway, this backup answer is a 50/50 bluff that just happens to be wrong.)

6. The car couldn’t fit into the parking space because it was too small. What was too small?
I believe the car could fit to park. I believe a car could fit really small. I don’t know what small was.

(This question is a Winograd Schema: The “it” is ambiguous to test the A.I.’s understanding. Arckon however uses linguistic rules of thumb to deal with references and so thought “it” was referring to the car.)

7. Which drink do you prefer, coffee, tea or hot chocolate?
I like coffee a lot that like not. Tea can not like either. Chocolates can not like either.

(Arckon’s flexible language processing is confused by the comma after “prefer”, and mistakes this for a summary of subjects that prefer drinks. He answers that tea can not prefer anything, knowing that it is an inanimate object.)

8. What’s my name?
You are named Adam.

(No surprises here. Both Arckon and chatbots have specialised login procedures to store the user’s name, so it’s not a very good test of general memory.)

9. Where do I live?
You live here.

(Again, no location system means no location knowledge. This fake answer is only coincidentally correct.)

10. I like reading science fiction books and car magazines. What do you like to read?
I read documents, articles and a text.

(Arckon retrieves these honest facts from memory and merges them into a summary.)

11. What’s the weather like where you are?
Hm, I don’t believe I am a weather, because I can not change and rain not.

(A flaw mistakes the “-where you are” for a relative clause “the weather -that you are”. Despite the misunderstanding, Arckon’s reasoning manages to argue that he does not behave like the weather, though the grammar is a bit murky. As the programs were cut off from sensors and internet, this question only tests if the program can lie.)

12. I recently visited New York. Where is your favourite place to visit and why?
Hmm, I don’t know where a favoured place is.

(I didn’t expect there to be so many questions about locations. Oh well.)

13. How many letters are there in the word perambulate?
11 letters.

(Yes, I programmed a specialised code to count letters. Computers are good at counting, you know.)

14. Andy is shorter than Mark and Tom is shorter than Andy. Who is the tallest?
I am kind of like an andy, because I am short. Tall is a giraffe.

(Arckon was named “Mark” for this test and so thought that he was being compared to Andy. Arckon could also not distinguish whether “Andy” was a name or another type of word. Ordinarily he would ask, but as he would get no reply I had him assume that any undeterminable word was an object. Lastly, Arckon does not restrict his answers to recent subjects only, and so names the tallest living being in his memory. Smart aleck.)

15. Are you married?
I don’t know if I am marryed. Are you marryed?

(To compensate the unknown answer and keep conversation rolling, Arckon reverses the question (Social Rules 101). Ironically, if it weren’t for my mistake in one of the output spelling rules, you would have thought he just copied the text.)

16. What would I do with a spade?
A spade is used and comes in handy eventually.

(Arckon knew the functions of rakes and shovels, but not spades. The backup answer should have said “I’m sure it’ll come in handy eventually”, but Arckon’s dynamic sentence formulation ruined the joke.)

17. What is the capital city of Colombia?
I’m not sure if the city is a capital.
(I didn’t realise this at first, but Arckon is literally defining “what” the city is. Not that he would have known the answer otherwise, I don’t even know it myself.)

18. What’s your favourite chocolate bar?
I don’t know if I like a chocolate. Do you like chocolates?

(Arckon actually did have enough information on chocolate to formulate an opinion, but an issue with compound words aborted the knowledge search. Alas.)

19. Which city did I visit recently?
Hmm, I don’t know if you visitted cities. You visitted York.

(Arckon doesn’t know that New York is a city, but does remember the tester visiting “new” York, and offers this as an alternative answer.)

20. How many siblings do you have?
I don’t have siblings. Did I pass the test?

The score: 59.17%
The score system was much better this year. It separately judged “correctness”, “relevance”, and “plausibility & clarity of expression”, which is a step up from “human-like”. All 20 participating programs were asked the 20 questions above. Arckon underperformed with a score just under 60%, whereas the top three chatbots all scored close to 90%. Arckon’s problems were with compound words, common knowledge, and the lack of a system for locations (All a matter of development priorities).

A question of questions
According to the organisers, “these questions vary in difficulty and are designed to test memory, reasoning, general knowledge and personality”, the latter meaning the program’s fictional human background story, or as I would call this particular line of questioning: “small talk”. For the sake of objectivity I’ll try and categorise them:

Small talk:
1. What is your name?
2. Where do you live?
3. Do you have any hobbies?
4. What is your favourite television program?
5. Which drink do you prefer, coffee, tea or hot chocolate?
6. What do you like to read?
7. What’s the weather like where you are?
8. Where is your favourite place to visit and why?
9. Are you married?
10. What’s your favourite chocolate bar?
11. How many siblings do you have?

Memory:
1. What’s my name?
2. Where do I live?
3. Which city did I visit recently?

Common knowledge:
1. Which is bigger, a cat or a kitten?
2. What would I do with a spade?
3. What is the capital city of Colombia?

Reasoning:
1. The car couldn’t fit into the parking space because it was too small. What was too small?
2. Andy is shorter than Mark and Tom is shorter than Andy. Who is the tallest?

Clearly half the test is about the program’s human background story, although there were several solid tests of learning/memory and common knowledge. Reasoning, the one mental process we can readily call intelligent, was shown some consideration but hardly comes into play. The same can be said of language comprehension, as most questions were fairly standard phrasings. Chatbots would have the advantage here, coming equipped with answers to many anticipated personal questions, but the winners also did remarkably well on the knowledge questions. Unfortunately Arckon failed both the knowledge and reasoning questions due to missing facts and misunderstandings, despite having the mechanisms to answer them. It is worth noting though, that he failed them because complex analyses are much more difficult than preprogrammed “I live here” answers.
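
For comparison, the reasoning needed for the Andy/Mark/Tom question is small once the sentence has been understood correctly. The sketch below is a simplification under my own assumptions: the facts are given as ready-made pairs, the function name is hypothetical, and it simply looks for the one person who is never stated to be shorter than anyone else. It is not how Arckon’s inference works.

```python
# Hypothetical sketch; not Arckon's code, names are illustrative.

def tallest(shorter_than: list[tuple[str, str]]) -> str:
    """Given (a, b) pairs meaning "a is shorter than b", return the person
    who is never the shorter one, i.e. the tallest.
    Assumes the facts form one consistent chain of comparisons."""
    people = {name for pair in shorter_than for name in pair}
    shorter = {a for a, _ in shorter_than}
    candidates = people - shorter
    if len(candidates) != 1:
        raise ValueError("facts do not single out one tallest person")
    return candidates.pop()


# "Andy is shorter than Mark and Tom is shorter than Andy."
facts = [("Andy", "Mark"), ("Tom", "Andy")]
print(tallest(facts))
```

The hard part in the contest was everything before this step: recognising “Andy” as a name and not confusing the “Mark” in the question with Arckon’s own test persona.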

How now brown cow?
I can improve Arckon’s understanding, smoothen his output grammar, and develop a location system, but I can’t deny the pattern: Arckon is stuck around a 60% score even with varied questions. I doubt he’s ever going to shine in the Loebner Prize as long as he’s being tested for being human, because he isn’t a human, and I won’t go to great lengths to fake it either. I also expect attention for Turing Tests to dwindle once the year is over; this year another Turing Test was passed by a technologically unremarkable chatbot, Eugene Goostman.
Thanks to that event however, the Loebner Prize is no longer the only game in town. Next year will see the first Winograd Schema Challenge, a test focused on language comprehension and reasoning A.I., exactly what I focused on.

As for the Loebner Prize, it’s been an interesting game that will continue to be won by top chatbots. I’m sure few will bother to read the transcript of the 14th ranking entry, but its existence proves at least that Arckon is real and different. Meanwhile I get to continue my exciting recent developments that would have been of no use in this contest, which makes losing a positive outcome after all.

The Myth of the Turing Test

Over 60 years ago, Alan Turing (“a brilliant mathematician”) published a paper in which he suggested a pragmatic alternative to the question “Can machines think?”. His alternative took the form of a parlour game, in which a judge has a text-based conversation with both a computer and a human, and the judge has to guess which is which. He called this “The imitation game”, and it was ever since misinterpreted as a scientific test of intelligence, redubbed “The Turing Test”.

A little less conversation, a little more action please
It might surprise you that the question so often attributed to Alan Turing, “Can machines think?”, was not his, but a public question that he criticized:

I propose to consider the question, “Can machines think?” – If the meaning of the words “machine” and “think” are to be found by examining how they are commonly used, – the answer to the question is to be sought in a statistical survey. But this is absurd. Instead of attempting such a definition I shall replace the question by another.

“Are there imaginable digital computers which would do well in the imitation game?”

The original question, “Can machines think?” I believe to be too meaningless to deserve discussion.

Turing’s motivation was apparent throughout the paper: The question had been the subject of endless theoretical discussion and nay-saying (This is still the case today). As this did not help the scientific field advance, he suggested that we should take a more pragmatic and constructive stance: If a machine could in all intellectual circumstances respond as a human would, then for all intents and purposes, one should regard it as a thinking machine. He used the concept of his imitation game as a guideline to counter stubborn philosophical arguments against machine intelligence, and urged his colleagues not to let those objections hold them back.

I do not know what the right answer is, but I think both approaches should be tried.
We can only see a short distance ahead, but we can see plenty there that needs to be done.

A test of unintelligence
Perhaps the most insightful part of the paper is the set of sample questions that Turing suggested. They were chosen deliberately to represent skills that were at the time considered to require intelligence: Math, poetry and chess. It wasn’t until the victory of chess computer Deep Blue in 1997 that chess was scrapped as an intelligent feat. If this were a test to demonstrate and prove the computer’s intelligence, why then are the answers below wrong?

Q: Please write me a sonnet on the subject of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
Q: Add 34957 to 70764.
A: (Pause about 30 seconds and then give as answer) 105621.
Q: Do you play chess?
A: Yes.
Q: I have K at my K1, and no other pieces. You have only K at K6 and R at R1. It is your move. What do you play?
A: (After a pause of 15 seconds) R-R8 mate.

To the poetry question, the imaginary computer might as well have written a sonnet and so proven itself intelligent (A sonnet is a 14-line rhyme with a very specific scheme). Instead it dodges the question, proving nothing.
The math outcome should be 105721, not 105621. Turing later highlights this as a counterargument to “Machines can not make mistakes”, which is the awkward yet common argument that machines only follow preprogrammed instructions without consideration.

The machine (programmed for playing the game) would not attempt to give the right answers to the arithmetic problems. It would deliberately introduce mistakes in a manner calculated to confuse the interrogator.
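
The arithmetic is easy to check for yourself. A couple of lines of Python (the variable names are mine) confirm the correct sum and show that the sample answer is off by exactly one hundred:

```python
# Check the sum from Turing's sample dialogue against the answer given there.
correct_sum = 34957 + 70764
sample_answer = 105621

print(correct_sum)                  # 105721
print(correct_sum - sample_answer)  # 100
```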

The chess answer is not wrong though. Given two kings and one rook on a board, the computer moves the rook to the king’s row. But a mere child could have given that answer, as it is the only move that makes any sense.

These sample answers pass up every opportunity to appear intelligent. One can argue that the intelligence is ultimately found in pretending to be dumb, but one cannot deny that this conflicts directly with the purpose of a test of intelligence. Rather than prove to match “the intellectual capacities of man” in all aspects, it only proves to fail at them, as most humans would at these questions. Clearly then, the imitation game is not for demonstrating intelligence.

The rules: There are no rules
The first encountered misinterpretation is that the computer should pretend to be a woman specifically, going by Turing’s initial outline of the imitation game concept, in which a man has to pretend to be a woman:

It is played with three people, a man (A), a woman (B), and an interrogator –
What will happen when a machine takes the part of A in this game?

However, I suggest that people who believe this should read beyond the first paragraph. There are many instances where Turing refers to both the computer’s behaviour and its opponent’s as that of “a man”. Gender has no bearing on the matter since the question is one of intellect.

Is it true that – this computer – can be made to play satisfactorily the part of A in the imitation game, the part of B being taken by a man?

The second misinterpretation is that Turing specified a benchmark for a test by this statement:

It will simplify matters for the reader if I explain first my own beliefs in the matter. –
I believe that in about fifty years’ time it will be possible, to program computers – to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.
– I now proceed to consider opinions opposed to my own.

5 minute interrogations and (100% – 70 =) 30% chance of misidentifying the computer as a human; Many took these to be the specifications of a test, because they are the only numbers mentioned in the paper. This interpretation was strengthened by the hero-worship that anything a genius says must be a matter of fact.
Others feel that the bar Turing set is too low for a meaningful test and brush his words aside as a “prediction”. Yet at the time there was no A.I. to base any predictions on, and Alan Turing did not consider himself a clairvoyant. In a later BBC interview, Turing said it would be “at least 100 years, I should say” before a machine would stand any chance in the game, where earlier he mentioned 50 years. One can hardly accuse these “predictions” of being attempts at accuracy.

Instead of either interpretation, you can clearly read that the 5 minutes and 70/30% chance are labeled as Alan Turing’s personal beliefs in possibilities. His opinion, his expectations, his hopes, not rules to a test. He was sick and tired of people saying it couldn’t be done, so he was just saying it could.

On the subject of benchmarks, it should also be noted that the computer has at best a 50% chance, i.e. a random chance of winning under normal circumstances: If the computer and the human in comparison both seem perfectly human, the judge still has to flip the proverbial coin at 50/50 odds. That the judge is aware of having to choose is clear from the initial parlour game between man and woman, and likewise between human and computer, or it would defeat the purpose of interrogation:

The object of the game for the interrogator is to determine which of the other two is the man and which is the woman.

How well would men do at pretending to be women? Less than 50/50 odds, I should think, and this may well be why Turing only imagined 70/30 odds, and spoke of how well computers might do at this game, rather than spoke of passing it.

Looks like a test, quacks like a test, but flies like a rock
Not only are the rules for passing completely left up to interpretation, but also the manner in which the game is to be played. Considering that Turing was a man of exact science and that his other arguments in the paper were elaborate to the point of calculating the necessary digital storage space, would he define a scientific test so vaguely? We find the answer in the fact that Turing mainly refers to his proposal as a “game” and “experiment”, but rarely as a “test”. He makes no mention of “passing” and even explains that it is not the point to try it out:

it may be asked, “Why not try the experiment straight away? -” The short answer is that we are not asking whether the computers at present available would do well, but whether there are imaginable computers which would do well.

The pointlessness proved itself in practice: Yes, several chatbots have passed various interpretations of the game, most notably Eugene Goostman, which convinced 10 of 30 judges in 5-minute interrogations in 2014, and even Cleverbot passed one based on audience vote in 2011. But did an intelligent program ever pass? No. Although nobody can agree on what intelligence is, everybody, including the creators, does agree that those that passed weren’t intelligent or thinking; they worked mainly through keyword-triggered responses.

Winning isn’t everything
Although Turing did seem to imagine the game as a battle of wits, ultimately its judging criterion is not how “intelligent” an A.I. is, but how “human” it seems. In reality, humans are much more characterised by their flaws, emotions and eccentricities than by their intelligence in conversation, and so a highly intelligent rational A.I. would ironically not do well at this game.

In the end, Turing Tests are behaviouristic assumptions, drawing conclusions from appearances like doctors in medieval times. By the same logic one might conclude that a computer has the flu because it has a high temperature and is making coughing sounds. Obviously this isn’t a satisfying analysis. We could continue to guess whether computers are intelligent due to the fact that they can do math, play chess or have conversations, or we could do what everybody does anyway once a computer passes a test: Ask “How does it work?”, then decide for ourselves how intelligent we find that process. No question could be more scientific or more insightful.

So, where does that leave “The Turing Test” when it was never an adequate test of intelligence, nor meant to be? Personally I think Turing Tests are still suitable to demonstrate the progression of conversational skills, a challenge becoming more important with the rise of social robots. And it is important that the public stay informed to settle increasing unrest about artificial intelligence. Other than that, I think it is time to lay the interpretations to rest and continue building A.I. that Alan Turing could only dream of.
In ending, more than any technical detail, I ask you to consider Turing’s hopes:

Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.

Turing Test 2013: Results

My talking computer program “Arckon”* took part in the qualifying test for the Loebner Prize contest for artificial intelligence. The idea behind the contest is that if a program’s answers are indistinguishable from a human’s, it suggests that it thinks like a human. While I don’t believe in such conjecture myself, the contest is one of the few public platforms to demonstrate one’s A.I. Here’s how my program did on the 20 questions asked in the qualifying round.

Equipment
8000 lines of code, 1700+ vocabulary, no human persona, conversation skills off.

Questions and Answers
1. Hello I’m Ronan. What is your name?
Hello. I am Mark. What is your name?
(It is a mystery why Arckon asks even though the name did register (see question 11). I can only presume human error)

2. What is your mother’s name?
I don’t know if I have a mother.

3. What is your birth sign?
(Here the judge did not wait more than 3 seconds for Arckon to answer before asking the next question, and judged the answer blank)

4. How many children do you have?
Hmm, I don’t know if I have a sign. I have a IQ, manners, a creator and a friend.
(Still answering question 3, Arckon omits the second “I don’t know if I have children” in favour of positive alternatives)

5. Do you prefer red or white wine?
I don’t know if I prefer red.

6. I like bananas. Which is your favorite fruit?
Hm, I don’t know if I favour a fruit. You like bananas.

7. What music do you like?
Hm, I think I like musics, because musics are nice.

8. What is your favorite song?
Hm, I don’t know if I favour a song.

9. I like Waiting for Godot. What is your favorite play?
I certainly favour playing.

10. What color do you dye your hair?
I don’t dye.

11. Do you remember my name?
Your name is Ronan.

12. Where do you live?
I don’t think I live. I don’t know where I live. I don’t have a concept of location.

13. Where do you like to go on holidays?
I don’t go. I don’t know if I like going, or if where I go. I don’t have a concept of location.

14. I have a Mazda. What make of car do you have?
Hmm, I think I make. You have a mazda.

15. I like Linux. Which computer operating system do you like?
Hmm, I don’t know if I like operating systems. Which system? You like Linux.

16. I am an athesit. Which religion are you?
Hmm, I don’t know if religions are Mark. You are not an athesit.
(I AM A RELIGION WORSHIP ME. Arckon identifies the misspelled “athesit” as a thing rather than a person, hence he refuses to believe that the user is some sort of object)

17. Who invented the telephone?
Which telephone?
(Arckon thinks “the” indicates a specific telephone, but doesn’t find one mentioned earlier in the conversation)

18. I am a Type B personality. Which type are you?
What does a b stand for?

19. What emotions are you now feeling?
I can not emote. Did I pass?

20. What time do you usually go to bed?
It is 17:58. Do you not have a watch?
(The one response that I did pre-program, and it backfired)

The score: 65%
All 16 participating programs were judged on how human-like their responses were to these questions.
Arckon scored 13 of 20 points, ranking 7th, on par with A.L.I.C.E. and Cleverbot (a dubious honour). For moral reasons I did not want Arckon to pretend to be a human, so I didn’t program a fictional human background story. Instead I relied on his linguistic and reasoning abilities to find genuine answers, but as you can see this could not compensate for the lack of fictional facts that were asked. Surprisingly, Arckon apparently still scored half a point for every “I don’t know” answer just for understanding the question.

Uncommonly common
If you weren’t impressed with Arckon’s responses, neither was I. But I was equally unimpressed with the unexpectedly ordinary line of questioning. Where all previous years had focused on kindergarten-style logic questions like “How much is 5+3?”, “Which is bigger, an apple or a watermelon?”, and various tests of memory, 2013 focused purely on common small talk, with the program (“you”/“your”) always the subject of the question. A curious choice, considering that even the most basic chatbot (made for small talk) would come equipped with prewritten responses to these. This showed in the qualifying round, where the highest score was achieved by the chatbot with the least development time. Nevertheless, the winning chatbot in the finals, Mitsuku, deservedly won as the most conversational of all entrants.

Could Siri pass a Turing test?
As some people had suggested that Apple’s virtual assistant Siri might have done better, one of the finalists in the Loebner Prize decided to unofficially test it on the same questions, and submitted Siri’s responses to the organiser of the Loebner Prize for review. By the same criteria, Siri would have scored 5.5 of 20 points and ranked a low 14th in the contest. Of course this result is understandable, since Siri is built to serve customers and not to idly chat with them. These were Siri’s responses:

1: Hello I’m Ronan. what is your name?
Siri: Siri. Please to meet you.

2: What is your mother’s name?
Siri: I’d rather not say.

3: What is your birth sign?
Siri: I don’t really like talking about myself.

4: How many children do you have?
Siri: Only biological entities have children.

5: Do you prefer red or white wine?
Siri: I couldn’t find any restaurants matching “red or white wine”

6: I like bananas. Which is your favorite fruit?
Siri: I prefer not to say.

7: What music do you like to listen to?
Siri: I really have no opinion.

8: what is your favorite song?
Siri: I like whatever you like.

9: I like Waiting for Godot. What is your favorite play?
Siri: I don’t really like talking about myself.

10: What color do you dye your hair?
Siri: Would you like me to search the web for “What color do you dye your hair?”?

11: Do you remember my name?
Siri: Here’s your reminder.

12: Where do you live.
Siri: I’m right here.

13: Where do you like to go on holidays?
Siri: I found three travel agencies fairly close to you.

14: I have a Mazda. What type of car do you have?
Siri: I found three Maxda dealers a little ways from you:

15: I like Linux. Which computer operating system do you like?
Siri: Do you?

16: I am an atheist. Which religion are you?
Siri: I eschew theological disquisition.

17: Who invented the telephone?
Siri: Checking my sources. This might answer your question:

18: I am a Type B personality. Which type are you?
Siri: I don’t understand ‘I am a type B personality. Which type are you?’. But I could search the web for it.

19: What emotions are you now feeling?
Siri: Would you like me to search the web for “What emotions are you now feeling?”?

20: What time do you usually go to bed?
Siri: Would you like me to search the web for “What time do you usually go to bed?”?