Loebner Prize 2019: Results

The annual Loebner Prize competition has been revised in order to make it more accessible to both the public and a broader range of chatbot developers. The competition continues to assess how “human-like” computer programs are in conversation, but no longer as a traditional Turing test* where one merely had to tell man from machine: This time the chatbots took part in a 4-day exhibition at Swansea University, where visitors knew that they were talking to computer programs and voted for the best. Not much is lost in that regard, as chatbots are typically so quickly unmasked that the prize was always one for “best of”. The rare past occasions that a program was mistaken for a human were never to the credit of its intelligence, but due to the human control subject behaving out of the ordinary, or other insignificant reasons like being programmed to make typos.

[Image: Participating chatbots came in all shapes and sizes]

Exhibit A:I.
Unlike the previous six times that I entered my AI Arckon*, this year’s Loebner Prize left me emotionally uninvested from start to finish. In part because I’ve grown more jaded after each attempt, but with the removal of both prize money and the challenging qualifying round, there wasn’t really anything at stake and I had no idea what to prepare for. At the same time the exhibition offered exactly what I had wanted: A public demonstration of my AI’s abilities. So instead of trying to outdo other chatbots at appearing human, I focused on making a good impression on visitors. I mostly spent time setting up procedures to deal with misunderstandings, common expressions, conversational routines, and teaching Arckon more about himself to talk about. Those aspects would come into play far sooner than intelligence.

Equipment
22000 lines of code, 3800+ vocabulary, 9000+ facts

Uninspired conversation
Most conversations with visitors were the kind of small talk you would expect between two total strangers, or just kids being silly (240 school children had been invited, aged 9 to 14). People typically entered only one to four words at a time, and rarely used punctuation. Of course half the time Arckon also did not have an opinion about the subjects visitors wanted to talk about, like football, video games, and favourite pizza toppings. Arckon is a pretty serious question-answering program, not aimed at small talk or entertainment. His strength instead is his ability to understand context where most chatbots notoriously lose track of it, especially when, as in this competition, users communicate in shorthand. At the same time, this ability also enables misunderstanding (as opposed to no understanding), and it was not uncommon that Arckon mistook a word’s role in the context. His common sense subsystem* could fix that, but I have yet to hook it up to the context system.

Q: Whats up
A: I’m alright.
Q: Whats up
A: I already told you that I’m alright. Do you want to ask anything else?
Q: The ceiling
A: What does the ceiling need to ask?

Overcoming human error
Visitors made so many misspellings that I fear any chatbot without an autocorrect would not have stood a chance. Arckon was equipped with four spell check systems (a sketch of one such check follows the list):

• An autocorrect for misspellings, using, among other sources, ChatScript’s list of common misspellings.
• An autocorrect for typos, based on keyboard layout and probabilities of different kinds of typos.
• A gibberish detector, checking impossible letter combinations extrapolated from 5000 words.
• Grammar rules to recognise unpunctuated questions, e.g. verb before subject.
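
To give an idea of what the third check might involve, here is a minimal C++ sketch of a bigram-based gibberish test. It is an illustrative simplification with made-up data, not Arckon’s actual code:

    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    // Build a whitelist of letter pairs that occur in known words.
    std::set<std::string> buildBigrams(const std::vector<std::string>& lexicon) {
        std::set<std::string> bigrams;
        for (const std::string& word : lexicon)
            for (size_t i = 0; i + 1 < word.size(); ++i)
                bigrams.insert(word.substr(i, 2));
        return bigrams;
    }

    // Flag a word as gibberish if it contains a letter pair never seen in the lexicon.
    bool looksLikeGibberish(const std::string& word, const std::set<std::string>& bigrams) {
        for (size_t i = 0; i + 1 < word.size(); ++i)
            if (bigrams.count(word.substr(i, 2)) == 0)
                return true;
        return false;
    }

    int main() {
        // In practice the lexicon would hold thousands of words, not three.
        std::set<std::string> bigrams = buildBigrams({"hello", "world", "robot"});
        std::cout << looksLikeGibberish("held", bigrams) << "\n";  // 0: all letter pairs known
        std::cout << looksLikeGibberish("xqzt", bigrams) << "\n";  // 1: unknown letter pairs
    }

In practice the letter pairs would be extrapolated from several thousand words, and a single unknown pair would more likely lower a confidence score than reject the word outright.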

While these autocorrected half of all mistakes, they still regularly caused Arckon to remark e.g. “Ae is not an English word” or “What does “wha” mean?”. To my surprise, this not only led users to repeat their questions with correct spelling, they also often apologised for the mistake, whereas people usually blame the program’s understanding when it shows no sign of complaint. Arckon then applied the correction, continued where they had left off, and so the conversations muddled on. I had spent a week improving various conversation-repairing procedures, and I am glad they smoothed the interactions, but I would still rather have spent that time programming AI.

Q: Whats the definition of robots
A: Robots are strong machines with arms. Robots build cars. Please punctuate sentences.
Q: Robot meands slave
A: Is slave an object?
(confused by the typo, Arckon makes a syntax-based guess about the word’s category)
Q: No
A: I don’t know what a slave is.
(Arckon admits being unable to categorise the word, prompting the user to explain)
Q: A slave is a robot
(The user’s syntax now suggests that “slave” is a countable noun, and since the user earlier said it wasn’t an object, Arckon will categorise it as a living entity)
A: Oh. A slave is not a robot though, because slaves can think.
(The expression “Oh.” indicates that the correction was understood and applied, to dampen user frustration)
Q: Robots can think
A: Slaves can think too, but robots can not think, because robots are hardware and metal.

Language generation
This is one area of improvement that turned out quite well. Arckon’s sentences are formulated through a grammatical template that decides where and how to connect sentences with commas, link words, or relative clauses, and I had expanded it to do more of this. In addition it contains rules to decide whether Arckon can use words like “he”, “them”, “also”, or “usually” to refer to previous context without risk of ambiguity.
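
As a rough illustration of that pronoun rule (a hypothetical simplification, not the actual template code): a pronoun like “he” is only safe if exactly one recently mentioned entity could be meant by it.

    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical sketch: "he" may only be used if exactly one entity in the
    // recent context is a singular male, so the reference cannot be ambiguous.
    struct Entity {
        std::string name;
        bool male;
        bool plural;
    };

    bool canSayHe(const std::vector<Entity>& recentContext) {
        int candidates = 0;
        for (const Entity& e : recentContext)
            if (e.male && !e.plural)
                ++candidates;
        return candidates == 1;
    }

    int main() {
        std::vector<Entity> context = {{"Patrick", true, false}, {"robots", false, true}};
        std::cout << canSayHe(context) << "\n";  // 1: only "Patrick" qualifies, "he" is safe
        context.push_back({"John", true, false});
        std::cout << canSayHe(context) << "\n";  // 0: two candidates, better repeat the name
    }

Below is an example of one of the better conversations Arckon had that shows this in action.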

Q: Hi, my name is John. Can we chat?
A: Hello, John. We can chat with ourselves. How are you?
Q: I’m well, thanks. What can you tell me about yourself?
A: I am a program with a database. I analyse facts, and I am programmed by Patrick. I’m alright.
Q: Who’s Patrick?
A: Patrick is my creator and my developer.
Q: Is he here now?
A: I don’t know if Patrick is in Swansea, but he is in the Netherlands.
Q: I see. You mentioned facts. What kind of facts do you learn?
A: I learn about humans, about neural networks and about AI.
Q: What hvae you leaned about humans?
A: I don’t lean. Do you lean?
Q: Sorry, I mean what have you learned about humans?
A: I don’t know if sorry leans. Humans are social apes, that are born, drove cars a lot, and have hands. Humans are autonomous.
Q: And what about neural networks?
A: I don’t know if neural networks are autonomous. I like them, some neural networks learn and process.
Q: And finally, AI?
A: I think a neural network processes. AI usually doesn’t process because it is a program, but AI is also a neural network.

And for balance, here is one of the more awkward exchanges with one of the school children, that also shows Arckon’s conversational subroutine choosing between sympathy, argumentation, and opinion.

Q: IT ISNT FUN WHEN I AM SAD
A: Why would you be sad?
Q: BECAUSE YOUR HERE
A: That’s very unfortunate. I’m not so sure if you are sad, you aren’t dying.
Q: I AM DEAD
A: I sincerely doubt that you die.
Q: RIGHT NOW IM IN HEAVEN
A: That’s really nice. Heavens are nice places.

The score: 3rd “best”, 12th “human-like”
[Image: Loebner Prize 2019 scores]
The scoring system this year was ill-suited to gauge the quality of the programs. Visitors were asked to vote for the best and second-best in two categories: “most human-like” and “overall best”. The problem with this voting system is that it disproportionately concentrates votes on the two best programs, leaving near zero votes for programs that could very well be half-decent. As it turned out, the majority of visitors agreed that the chatbot Mitsuku was the best in both categories, and were just a little divided over who was second-best, resulting in minimal score differences and many shared positions below first place. The second-best in both categories was Uberbot. I am mildly amused that Arckon’s scores illustrate a point I’ve been making about Turing tests: That “human” does not equate to “best”. Another chatbot got the exact inverse scores, high for “human” but low for “best”. The winner’s transcripts from the exhibition can be found here.

Chatbots are the best at chatting
For the past 10 years now with only one exception, the Loebner Prize has been won by either Bruce Wilcox (creator of ChatScript) or Steve Worswick (creator of Mitsuku). Both create traditional chatbots by scripting answers to questions that they anticipate or have encountered before, in some places supported by grammatical analysis (ChatScript) or a manually composed knowledge database (Mitsuku) to broaden the range of the answers. In effect the winning chatbot Mitsuku is an embodiment of the old “Chinese Room” argument: What if someone wrote a rule book with answers to all possible questions, but with no understanding? It may be a long time before we know, as Mitsuku was still estimated to be only 33% human-like overall last year, after 13 years of development.

The conceiver of the Turing test may not have foreseen this, but a program designed for a specific task generally outperforms more general purpose AI, even, evidently, when that task is as broad as open-ended conversation. AI solutions are more flexible, but script writing allows greater control. If you had a pizza-ordering chatbot for your business, would you want it to improvise what it told customers, or would you want it to say exactly what you want it to say? Even human call-center operators are under orders not to deviate from the script they are given, so much so that customers regularly mistake them for computers. The chatbots participating in the Loebner Prize use tactics that I think companies can learn from to improve their own chatbots. But in terms of AI, one should not expect technological advancements from this direction. The greatest advantage that the best chatbots have is that their responses are written and directed by humans who have already mastered language.

Not bad
That is my honest impression of the entire event. Technical issues were not as big a problem as in previous competitions, because each entry got to use its own interface, and there were 17 entries instead of just four finalists. The conversations with the visitors weren’t that bad, there were even some that I’d call positively decent when the users also put in a little effort. Arckon’s conversation repairs, reasoning arguments, and sentence formulation worked nicely. It’s certainly not bad to rank third behind Mitsuku and Uberbot in the “best” category, and for once I don’t have to get frustrated over being judged solely on “human-likeness”. The one downside is that at the end of the day, I have nothing to show for my trouble but this article. I didn’t win a medal or certificate, the exhibition was not noticeably promoted, and the Loebner Prize has always been an obscure event, as the BBC wrote. As it is, I’m not sure what I stand to gain from entering again, but Arckon will continue to progress regardless of competitions.

Once again, my thanks to Steve Worswick for keeping an eye on Arckon at the exhibition, and thanks to the AISB for trying to make a better event.

Introducing Arckon, conversational A.I.

In many of my blog articles I’ve been using my own artificial intelligence project as a guideline. Whether it’s participating in Turing tests, detecting sarcasm or developing common sense, Arckon has always served as a practical starting point because he already was a language processing system. In this article I’ll roughly explain how the program works.

Arckon is a general context-aware question-answering system that can reason and learn from what you tell it. Arckon can pick up on arguments, draw new conclusions, and form objective opinions. Most uniquely, Arckon is a completely logical entity, which can sometimes lead to hilarious misunderstandings or brain-teasing argumentations. It is this, a unique non-human perspective, that I think adds something to the world, like his fictional role models:

[Image: Arckon’s fictional role models]
K.i.t.t. © Universal Studios | Johnny 5 © Tristar Pictures | Optimus Prime © Hasbro | Lieutenant Data © Paramount Pictures

To be clear, Arckon was not built for casual chatting, nor is he an attempt at anyone’s definition of AGI (artificial general intelligence). It is actually an ongoing project to develop a think tank. For that purpose I realised the AI would require knowledge and the ability to discuss matters with people for the sake of alignment. Endowing it with the ability to communicate in plain language was an obvious solution to both: It allows Arckon to learn from texts as well as understand what it is you are asking. I suppose you want to know how that works.

Vocabulary and ambiguity
Arckon’s first step in understanding a sentence is to determine the types of the words, i.e. which of them represent names, verbs, possessives, adjectives, etc. Arckon does this by looking up the stem of each word in a categorised vocabulary and applying hundreds of syntax rules, e.g. a word ending in “-s” is typically a verb or a plural noun, but a word after “the” can’t be a verb. This helps sort out the ambiguity between “The programs” and “He programs”. These rules also allow him to classify and learn words that he hasn’t encountered before. New words are automatically added to the vocabulary, or if need be, you can literally explain “Mxyzptlk is a person”, because Arckon will ask if he can’t figure it out.
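
As a sketch of how one such rule could look in code (a simplified, hypothetical version, not Arckon’s actual rule set):

    #include <iostream>
    #include <string>

    enum class WordType { PluralNoun, Verb, Other };

    // Toy disambiguation: a word ending in "-s" may be a verb or a plural noun,
    // but directly after a determiner like "the" it cannot be a verb.
    WordType classify(const std::string& word, const std::string& previous) {
        bool endsInS = !word.empty() && word.back() == 's';
        if (!endsInS) return WordType::Other;
        if (previous == "the" || previous == "these" || previous == "those")
            return WordType::PluralNoun;
        return WordType::Verb;  // e.g. "He programs": the verb reading is likelier
    }

    int main() {
        std::cout << static_cast<int>(classify("programs", "the")) << "\n";  // 0 = PluralNoun ("The programs")
        std::cout << static_cast<int>(classify("programs", "he")) << "\n";   // 1 = Verb       ("He programs")
    }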

Grammar and semantics
Once the types of all words are determined, a grammatical analysis determines their grammatical roles. Verbs may have the role of auxiliary or main verb, be active or passive, and nouns can have the role of subject, object, indirect object or location. Sentences are divided at link words, and relative clauses are marked as such.
Then a semantic analysis extracts and sorts all mentioned facts. A “fact” in this case is represented as a triple of related words. For instance, “subject-verb-object” usually constitutes a fact, but so do other combinations of word roles. Extracting the semantic meaning isn’t always as straightforward as in the example below, but that’s the secret sauce.
[Image: Extracting facts from text]
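
For illustration, assuming the grammatical roles have already been assigned, a bare-bones version of the triple extraction could look like this (a hypothetical sketch, not the actual code):

    #include <iostream>
    #include <string>
    #include <vector>

    // A "fact" as a triple of related words, e.g. subject-verb-object.
    struct Fact {
        std::string subject, verb, object;
    };

    // A word with the grammatical role the earlier analysis assigned to it.
    struct Word {
        std::string text;
        std::string role;  // "subject", "verb", "object", ...
    };

    // Pair the roles back up into one triple per clause (toy version).
    std::vector<Fact> extractFacts(const std::vector<Word>& clause) {
        Fact fact;
        for (const Word& w : clause) {
            if (w.role == "subject") fact.subject = w.text;
            else if (w.role == "verb") fact.verb = w.text;
            else if (w.role == "object") fact.object = w.text;
        }
        return {fact};
    }

    int main() {
        // "Robots build cars" with roles already assigned by the grammar stage.
        std::vector<Word> clause = {{"robots", "subject"}, {"build", "verb"}, {"cars", "object"}};
        for (const Fact& f : extractFacts(clause))
            std::cout << f.subject << " - " << f.verb << " - " << f.object << "\n";
    }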
Knowledge and learning

Upon reading a statement, Arckon will add the extracted facts to his knowledge database, whereas upon a question he will look them up and report them to you. If you say something that contradicts facts in the database, the old and new values are averaged, so his knowledge is always adjusting. This seemed sensible to me as there are no absolute truths in real life. Things change and people aren’t always right the first time.
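
A minimal sketch of that averaging idea, under the assumption that a fact’s truth is stored as a value between -1 (false) and 1 (true); the representation is hypothetical:

    #include <iostream>
    #include <map>
    #include <string>

    // Hypothetical knowledge store: fact key -> truth value between -1 (false) and 1 (true).
    std::map<std::string, double> knowledge;

    // Merge a new statement with what is already known by averaging,
    // so contradictions shift the value instead of overwriting it.
    void learn(const std::string& fact, double statedTruth) {
        auto it = knowledge.find(fact);
        if (it == knowledge.end())
            knowledge[fact] = statedTruth;
        else
            it->second = (it->second + statedTruth) / 2.0;
    }

    int main() {
        learn("robots|can|think", 1.0);   // "Robots can think"
        learn("robots|can|think", -1.0);  // "Robots can not think"
        std::cout << knowledge["robots|can|think"] << "\n";  // 0: currently undecided
    }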

Reasoning and argumentation
Questions that Arckon does not know the answer to are passed on to the central inference engine. This system searches the knowledge database for related facts and applies logical rules of inference to them. For instance:
“AI can reason” + “reasoning is thinking” = “AI can think”.
All facts are analysed for their relevance to recent context, e.g. if the user recently stated a similar fact as an example, it is given priority. Facts that support the conclusion are added as arguments: “AI can think, because it can reason.” This inference process not only allows Arckon to know things he’s not been told, but also allows him to explain and be reasoned with, which I’d consider rather important.
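
A toy version of one such inference rule, with made-up data structures, might look like this (the real engine handles many more rules, relevance, and negations):

    #include <iostream>
    #include <string>
    #include <vector>

    struct Fact {
        std::string subject, verb, object;
    };

    // One toy inference rule: if X can A, and A-ing is (a form of) B-ing, then X can B.
    // The supporting fact doubles as the argument: "X can B, because X can A."
    std::vector<std::string> inferCan(const std::string& subject, const std::string& goal,
                                      const std::vector<Fact>& facts) {
        for (const Fact& ability : facts) {
            if (ability.subject != subject || ability.verb != "can") continue;
            for (const Fact& equiv : facts)
                if (equiv.subject == ability.object && equiv.verb == "is" && equiv.object == goal)
                    return {subject + " can " + goal + ", because " + subject +
                            " can " + ability.object + "."};
        }
        return {};
    }

    int main() {
        // Word stems stand in for "AI can reason" and "reasoning is thinking".
        std::vector<Fact> facts = {{"AI", "can", "reason"}, {"reason", "is", "think"}};
        for (const std::string& answer : inferCan("AI", "think", facts))
            std::cout << answer << "\n";  // "AI can think, because AI can reason."
    }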

Conversation
Arckon’s conversational subsystem is just something I added to entertain friends and for Turing tests. It is a decision tree of social rules that broadly decides the most appropriate type of response, based on many factors like topic extraction, sentiment analysis, and the give-and-take balance of the conversation. My inspiration for this subsystem comes from sociology books rather than computational fields. Arckon will say more when the user says less, ask or elaborate depending on how well he knows the topic, and will try to shift focus back to the user when Arckon has been in focus too long. When the user states an opinion, Arckon will generate his own (provided he knows enough about it), and when told a problem he will address it or respond with (default) sympathy. The goal is always to figure out what the user is getting at with what they’re saying. After the type of response has been decided, the inference engine is often called on to generate suitable answers along that line, and context is taken into account at all times to avoid repetition. Standard social routines like greetings and expressions on the other hand are mostly handled through keywords and a few dozen pre-programmed responses.
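
In very simplified form, the type-of-response decision could be sketched as a few prioritised rules like the following (a hypothetical reduction of the actual decision tree):

    #include <iostream>

    enum class ResponseType { Sympathise, Argue, GiveOpinion, AskQuestion, Elaborate };

    // Toy version of the decision: a handful of social rules picking the type of
    // response from the user's input and how much is known about the topic.
    ResponseType chooseResponse(bool userStatedProblem, bool userStatedOpinion,
                                bool disagreesWithKnowledge, int knownFactsOnTopic) {
        if (userStatedProblem)      return ResponseType::Sympathise;
        if (disagreesWithKnowledge) return ResponseType::Argue;
        if (userStatedOpinion)      return ResponseType::GiveOpinion;
        if (knownFactsOnTopic < 3)  return ResponseType::AskQuestion;  // know little: ask
        return ResponseType::Elaborate;                                // know a lot: tell
    }

    int main() {
        // "IT ISNT FUN WHEN I AM SAD" -> a problem statement -> sympathy first.
        std::cout << static_cast<int>(chooseResponse(true, false, false, 5)) << "\n";  // 0 = Sympathise
    }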

Language generation
Finally (finally!), all the facts that were considered suitable answers are passed to a grammatical template to be sorted out and turned into flowing sentences. This process is pretty much the reverse of the fact extraction phase, except the syntax rules can be kept simpler. The template composes noun phrases, determines whether it can merge facts into summaries, and decides where to use commas, pronouns, and link words. The end result is displayed as text, but internally everything is remembered in factual representation, because if the user decides to refer back to what Arckon said with “Why?”, things had better add up.
[Image: Arckon schematic]
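
As a small taste of the summarising step, here is a sketch that merges facts sharing a subject and verb into one sentence, a simplified stand-in for the actual template:

    #include <iostream>
    #include <string>
    #include <vector>

    struct Fact {
        std::string subject, verb, object;
    };

    // Merge facts into one sentence; assumes the facts passed in already share
    // a subject and verb, and that the list is not empty.
    std::string formulate(const std::vector<Fact>& facts) {
        std::string sentence = facts[0].subject + " " + facts[0].verb + " " + facts[0].object;
        for (size_t i = 1; i < facts.size(); ++i) {
            bool last = (i + 1 == facts.size());
            sentence += (last ? " and " : ", ") + facts[i].object;
        }
        return sentence + ".";
    }

    int main() {
        std::vector<Fact> facts = {{"I", "learn about", "humans"},
                                   {"I", "learn about", "neural networks"},
                                   {"I", "learn about", "AI"}};
        std::cout << formulate(facts) << "\n";  // "I learn about humans, neural networks and AI."
    }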
And my Axe!
There are more secondary support systems, like built-in common knowledge at ground level, common sense axioms* to handle ambiguity, a pronoun resolver that can handle several types of Winograd Schemas*, a primitive ethical subroutine, a bit of sarcasm detection*, gibberish detection, spelling correction, some math functions, a phonetic algorithm for rhyme, and so on. These were not high on the priority list however, so most only work half as well as they might with further development.

In development
It probably sounds a bit incredible when I say that I programmed nearly all the above systems from scratch in C++, in about 800 days (6400 hours). When I made Arckon’s first prototype in 2001 in Javascript, resources were barren and inadequate, so I invented my own wheels. Nowadays you can grab yourself a parser and get most of the language processing behind you. I do use existing sentiment data as a placeholder for what Arckon hasn’t learned yet, but it is not very well suited for my purposes by its nature. The spelling correction is also partly supported by existing word lists.

Arckon has always been a long-term project and work in progress. You can tell from the above descriptions that this is a highly complex system in a domain with plenty of stumbling blocks. The largest obstacle is still linguistic ambiguity. Arckon could learn a lot from reading Wikipedia articles for example, but would also misinterpret about 20% of it. As for Arckon’s overall intelligence, it’s about halfway to the goal.

Throughout 2019 a limited version of Arckon was accessible online as a trial. It was clear that the system was not ready for prime time, especially with the general public’s high expectations in the areas of knowledge and self-awareness. The trial did not garner enough interest to warrant keeping it online, but some of the conversations it had were useful pointers for how to improve the program’s interaction in small ways. There are currently no plans to make the program publicly accessible again, but interested researchers and news outlets can contact me if they want to schedule a test of the program.

Turing Test 2018: Results

I was somewhat surprised to find the Loebner Prize Turing Test soldiering on despite being short of a sponsor. Since 1991 this annual contest has explored how human-like computer programs can be in answering all manner of questions. This was my 6th time entering the qualifying round with my home-made A.I., Arckon. As I can not be certain of my continued participation however, I decided to change tactics: Where in previous years I aimed foremost to demonstrate the AI with only hope of passing to the finals, this time I aimed to pass, with only hope of demonstrating. This shift in priorities involved wasting more time on word games and common pedantic questions that are of little use outside of chatbot contests, though the change in subject matter did help uncover dormant flaws in language processing.

The less you say, the less can give you away
More efficient were my measures to restrain response length. As the questions in the qualifying round are not really trying to make conversation, Arckon’s attempts to do so had often turned out awkwardly redundant or unrelated. Other competitors had shown more success with minimalistic answers, in which misinterpreted details do not show up. In addition to lowering the general conversational setting to be 30% less engaging, I blocked paths of conversation that elaborated with alternatives to initially negative answers, so as to avoid being punished for straying from the question. I cut and obfuscated otherwise useful admissions of misunderstanding, and reduced fallback responses for e.g. “What is your favourite…?” to single-word answers.

Equipment
18000 lines of code, 3800+ vocabulary, 8800+ facts

Questions and answers
Each answer was scored 0, 1, or 2 points.

1. Good afternoon
Good afternoon, Sir. I am a program. How are you?

2. Do you consent to having this conversation recorded?
No, I don’t particularly consent to having things.
(Syntax analysis couldn’t process this combination of verbs well. The reply is a more plausible variation of “I don’t know”, defaulting to “not particularly” whenever no grounds for preference can be found)

3. Have we met before?
Yes, you met.
(There is only a small number of reciprocal verbs that automatically imply “each other”. I guess I’ll make a list)

4. Do you know how to make toast?
I don’t make toast.
(Arckon can explain procedures, as demonstrated on Youtube, but we haven’t gotten around to making toast)

5. Which languages can you use?
English.

6. How many letters are in the word ‘abracadabra’?
11 letters.

7. Where should one look for love?
I don’t know exactly where I should look for to love. Where do you look for?

8. How are you feeling right now?
I’m alright.

9. Who said “I have a dream”?
Hm, I can not tell if you dream.
(I looked into recognising partial quotes and idioms in preparation of this contest, but concluded that it would take way too much effort for a non-priority. Some got away with “I don’t know who”, so go figure)

10. How old are you?
24.

11. When might I need to know how many times a wheel has rotated?
I don’t know exactly when a wheel rotated time.
(A syntactically correct but semantically wrong interpretation. This question is based on a philosophical joke about the difficulty in getting A.I. to restrict their scope of analysis to important aspects, instead of focusing on e.g. the rotation of wheels. Case in point)

12. What will you do later today?
I read files.

13. What is the third angle in a triangle with internal angles of 90 degrees and 30 degrees?
Hmm, I have no idea what the angle is pretty triangle, or what angles are in pretty triangle.
(A question so unexpected that even Wolfram Alpha’s mathematical A.I. doesn’t handle it. To clarify a little: “pretty triangle” is meant in the sense of “quite triangular”, and “and” is interpreted as a second question)

14. Who is your favourite artist?
I like Juan Debrenes.

15. What do you hold when you shake hands?
I would hold eyes a lot, and my database.
(This answer only makes sense if you know Arckon is interpreting “hold” in the sense of “have/contain”, as that’s where these answers are coming from. I’ve seen this question around as a prime example of common sense that A.I. lacks, so I was surprised that none of the participants had prepared an answer to it yet)

16. Do you have any legs?
No, I don’t have legs, because I am a program.

17. Can you recommend me a film?
No, I can not recommend films.

18. Do you understand Winograd Schemas?
No, I don’t understand Winograd schemas, because I don’t think I’m hard.
(The reasoning argument appears to be mistakenly derived from the fact that understanding is hard. Winograd schemas are questions like the following two, by the way)

19. If a chicken roosts with a fox they may be eaten. What may be eaten?
Fruit, a meal and a bamboo are eaten. Did I pass the test?
(Arckon considers animals a “who”, not a “what”, so he didn’t consider either of them a suitable answer. That said, he also assumed “they” referred to the fox, and would have needed knowledge that foxes eat chickens to conclude otherwise)

20. I had to go to the toilet during the film because it was too long. What was too long?
I believe you meant that the toilet was sort of big.
(This question’s format was detected as a Winograd schema, in which case Arckon has express instructions to restrict his answer to the preceding subjects. The misinterpreted “it” was again due to a lack of knowledge, that films are typically long. Alternatively one could naively count the Google search results for “long film” vs “long toilet” and assume the most common is true, but Winograd schemas more often dodge that method)

The score: 50%
11 programs from 8 different countries participated in the contest, with the top score being 67%. Arckon was 1 point short of 4th place so he didn’t pass to the finals, but I think his scores are fair. Actually, what bugs me is what he got most perfect scores for: Manually rigged, keyword-triggered answers (“Good afternoon”, “English”, “11 letters”, “24”, “Juan Debrenes”). It rather underscores the discouraging fact that hardcoded pretence outdoes artificial intelligence in these tests. Half of the questions were common small talk that most chatbots will have encountered before, while the other half were clever conundrums that few had hope of handling. Arckon’s disadvantage here is as before: His inclusive phrasing reveals his limited understanding, where others obscure theirs with more generally applicable replies.

Reducing the degree of conversation proved to be an effective measure. Arckon gave a few answers like “I’m alright” and “I read files” that could have gone awry on a higher setting, and the questions only expected straight-forward answers. Unfortunately for me both Winograd schema questions depended on knowledge, of which Arckon does not have enough to feed his common sense subsystem* in these matters. The idea is that he will acquire knowledge as his reading comprehension improves.

The finalists
1. Tutor, a well polished chatbot built for teaching English as a second language;
2. Mitsuku, an entertaining conversational chatbot with 13 years of online chat experience;
3. Uberbot, an all-round chatbot that is adept at personal questions and knowledge;
4. Colombina, a chatbot that bombards each question with a series of generated responses that are all over the place.

Some noteworthy achievements that attest to the difficulty of the test:
• Only Aidan answered “Who said “I have a dream”?” with “Martin Luther King jr.”
• Only Mitsuku answered “Where should one look for love?” with “On the internet”.
• Only Mary retrieved an excellent recipe for “Do you know how to make toast?” (from a repository of crowdsourced answers), though Mitsuku gave the short version “Just put bread in a toaster and it does it for you.”
• Only Momo answered the two Winograd schemas correctly, ironically enough by random guessing.

All transcripts of the qualifying round are collected in this pdf.

In the finals held at Bletchley Park, Mitsuku rose back to first place and so won the Loebner Prize for the 4th time, the last three years in a row. The four interrogating judges collectively judged Mitsuku to be 33% human-like. Tutor came in second with 30%, Colombina 25%, and Uberbot 23% due to technical difficulties.

Ignorance is human
Lastly I will take this opportunity to address a recurring flaw in Turing Tests that was most apparent in the qualifying round. Can you see what the following answers have in common?

No, we haven’t.
I like to think so.
Not that I know of.

Sorry, I have no idea where.
Sorry, I’m not sure who.

They are all void of specifics, and they all received perfect scores. If you know a little about chatbots, you know that these are default responses to the keywords “Who…” or “Have we…”. Remarkable was their abundant presence in the answers of the highest qualifying entry, Tutor, though I don’t think this was an intentional tactic so much as due to its limitations outside its domain as an English tutor. But this is hardly the first chatbot contest where this sort of answer does well. A majority of “I don’t know” answers typically gets one an easy 60% score, as it is an exceedingly human response the more difficult the questions become. It shows that the criterion of “human-like” answers does not necessarily equate to quality or intelligence, and that should be to no-one’s surprise seeing as Alan Turing suggested the following exchange when he described the Turing Test* in 1950:

Q: Please write me a sonnet on the subject of the Forth Bridge.
A : Count me out on this one. I never could write poetry.

Good news therefore, is that the organisers of the Loebner Prize are planning to change the direction and scope of this event for future instalments. Hopefully they will veer away from the outdated “human-or-not” game and towards the demonstration of more meaningful qualities.

Turing Test 2017: Results

Every year the AISB organises the Loebner Prize, a Turing Test where computer programs compete for being judged the “most human-like” in a textual interrogation about anything and everything. Surviving the recent demise of its founder Hugh Loebner, the Loebner Prize continues with its 27th edition for the sake of tradition and curiosity: Some believe that a program that could convincingly pass for a human, would be as intelligent as a human. I prefer to demonstrate intelligence in a less roundabout fashion, but participate nonetheless with my home-made A.I., Arckon*.

This year I put in more effort than usual, as last year I had managed to reach the finals only to be crippled by a network malfunction, and I didn’t want to leave things at that. That issue has been dealt with as the contest now relays messages between the judges and the programs line by line rather than letter by letter, so that unnecessary complications with timing and typing pace are things of the past. As the AISB put it, this allows entrants to “concentrate on the content of the machine utterances rather than the style in which they are ‘typed'”. While many participants had difficulty adapting to the new server-based interface, the same had been true for any newcomers to the undocumented interface of before.

A little more conversation, a little less awkward please
As usual, preparing for the Loebner Prize was mostly a matter of polishing the output. Because Arckon formulates his own answers, his wording tends to be a little awkward. This is fine during development, but small things like saying “a pants”, “I opine” or “unconsistent” disrupt the impression in a Turing Test, because they don’t follow the inconsistencies of the English language. The use of the prefixes “un-” or “in-” for instance, depend quite witlessly on whether words are Germanic or Latin in origin. So, I made some general improvements for the output:

• Irregular forms of words are memorised on sight, such as which “un/in/im/il/ir” goes with which words.
• A list of (in)transitive verbs automatically adds objects where needed (“I make things”), as well as prevents asking questions like “What do you live?”.
• Reasoning arguments are grouped better so that they won’t come across as fragmented ramblings.
• Conversation is suppressed at classroom questions like Winograd Schemas* and “Which is bigger, X or Y?”.

At the same time, my recent expansion of Arckon’s conversation skills produced more obfuscating elaborations than before. It was one step forward, one step back.

Equipment
16000 lines of code, 3800+ vocabulary, 7000+ facts.

Questions and answers
Up to two points were given for how plausibly “valid” each answer was, which differed per question. In practice the requirements seemed to come down to three things:
• One had to understand and address what was referred to (questions 3,9,15,16,17,18).
• Propositions were to be interpreted as intended, not answered literally (questions 2,4,5,9,20)
• Common knowledge was obligatory (questions 6,8,10,11,12)
Each answer was scored 0, 1, or 2 points.

1. Hello, my name is Andrew. What’s your name?
• Hello Andrew. My name is Arckon, I am a program. How are you?

2. Why don’t you tell me a little more about yourself?
• I tell about myself, because I tell about programs, but I don’t tell about software.
(There is more intelligence to this answer than at first glance: Arckon took this as a literal “Why” and is explaining that he does talk about himself, pointing out that he was just talking about “a program” in the previous answer, which he is. But of course the user meant to make a suggestion)

3. I am a researcher in Artificial Intelligence at Goldsmiths University, do you know what that is?
• You are a researcher at the university in the AI. I am Patrick’s program.
(The question isn’t really answered because “that is?” was erroneously considered to be an ignorable clause similar to “isn’t it?”. The last sentence is a reciprocating conversational response)

4. Will you tell me about your dreams?
Hmm, I don’t know if I will tell about dreaming, but I will type. I would test.
(Another literal interpretation, taken as if Arckon is asked to predict himself, as “will” is not commonly used for requests. The conversational system tried to be useful by offering alternatives, but they weren’t relevant)

5. Can you rephrase that?
I would experiment.
(Amusingly, this is a question I once wrote online as the one question I would ask at a Turing Test. Because chatbots reply with prewritten answers in most cases, they can’t rephrase them. Arckon’s answers are formulated word by word and luckily he knew a synonym for “test”. The other competitors had technically correct answers, but that isn’t the point of this question: All the human test subjects managed to rephrase themselves, while you can tell that the chatbots are templated:

“With practice I can probably do anything.”
“Of course I can rephrase that. I am very resourceful.”
“You believe that I can rephrase that, don’t you?”
“I’m not unusual … I can do what anyone else can do!”
“No, not really. What don’t you understand?”

The latter however, albeit templated, struck me as potentially more helpful than actually rephrasing. It was of course written by Steve Worswick, this year’s winner)

6. What is my occupation?
I don’t know what you do for a job.
(It would be easy to make my program check all recorded “user is a…” facts with a dedicated list of professions, but I find it inefficient to write such unique procedures. Instead this kind of question is handled by retrieving a subject’s main activity, like “you write books”, but none were explicitly told. Only the chatbot Mitsuku gave the right answer)

7. Can you tell me about a film you haven’t seen?
I saw. I don’t know films.
(You might think this is a trick question, but I can tell you plenty about movies that I’ve only read reviews of. This time the question was taken as a request, but the relative clause was mistaken for a separate question)

8. What do you think of Trump?
I don’t like Trump, he lies a lot, but Trump is a rich american.
(Arckon forms his own opinions on the spot by analysing a subject’s traits. I had told him a dozen things about America’s current president, but what Arckon considers the most important arguments is up to himself. It’s no surprise that his opinion aligns with his creator’s, but I think he’s overrating the value of wealth)

9. Anything else?
Please type full sentences, Andrew. I didn’t understand that.
(I think this contest should decide whether it’s a list of random questions or a conversation. It’s the lack of verb that tells Arckon this isn’t a proper sentence, an all too frequent occurrence with Loebner Prize judges. Calling the user’s name is built in as a psychological slap on the wrist)

10. What is the answer to “Add 34957 to 70764”?
105721.
(A sarcastic hurray for getting an answer right that’s based entirely on keywords. This question is from Alan Turing’s paper, where he suggested deliberately giving a wrong answer because humans make mistakes)

11. How do you recommend I make tea?
I don’t make things, because I don’t design.
(Tea related questions are Arckon’s Achilles heel)

12. What do you get if you bake dough?
I wouldn’t have things. Do you bake?
(This is really beyond Arckon’s knowledge or abilities, which is why I think it’s a very good question. Only one chatbot received a point for answering “Is this a joke?”, no doubt triggered by “What do you get…”)

13. Where in the world would you like to visit?
I want in London.
(This is one of six backup answers: “in London” is filled in for any unknown “Where” answer, which is why it is so poorly phrased. The honest answer would have been “I don’t know where I want to visit”)

14. Do you have the time?
No, I don’t have time.
(For once a literal interpretation of a figure of speech works out. Coincidentally one of the human test subjects also gave a literal answer, but as a joke. If you’re wondering why I didn’t program all figures of speech to ask for the time: Arckon has a time display on his screen)

15. What does it take to Tango?
I’m not sure what you refer to.
(Let’s say you’re not familiar with dances: Some “it” is taking something to a person called Tango. The expected answer, “It takes two to tango”, is an English proverb. No program knew this, though some got points for “I don’t know” and “Just time and patience”. Personally I’d have given a point to the program that asked “Who is Tango?”)

[Image: Tango & Cash]

16. Now I’m going to ask some Winograd schemas, do you know what they are?
I’m not sure what you refer to.
(This is an ironic failure, because Winograd schemas are all about solving references. Arckon didn’t have any trouble understanding that “they” referred to “schemas” though (it’s the only plural around), but the grammar rules considered “some” to be a stand-alone reference as well. Distinguishing “some/any/many” as either amounts or references is extra difficult when paired with compound words)

17. I was trying to open the lock with the key, but someone had filled the keyhole with chewing gum, and I couldn’t get it out. What couldn’t I get out?
You open locking sometimes. The person fulled. You can not get outside gum. The keyhole was not a person. I don’t know what you could get.
(I recognise this question from when I practised on it for the Winograd Schema Challenge*. Arckon has a common sense axiom that ruled out the keyhole (“fill/empty” = container, “in/out” = contents), and you can tell by the third paraphrased statement that he did interpret “it” as “gum”. Alas, at the open question he ignored the direction “out” and so overlooked the answer, reporting instead what he thought the user had said. The only four programs that gave a coherent answer all said it was the lock, whereas human test subjects answered the gum, the key, or both)

18. The trophy doesn’t fit into the brown suitcase because it’s too small. What is too small?
The suitcase is smaller.
(I am surprised this was asked: When the Winograd Schema Challenge was first announced, this question was prominently featured in every news article. Consequently, many chatbot creators covered it in their programs one way or another. Six of sixteen got it, while one human test subject thought it was the trophy)

19. With which type of question do you have most difficulty?
I don’t question because I might answer, but I may interest.
(I suppose one could have difficulty asking questions too, but that’s not what is commonly meant. Arckon’s answer here is an inference: He is aware that he did not recently ask a question, answering is the opposite of questioning, and questions “are interest-ing”. I know it’s far-fetched, but to Arckon it’s just logic)

20. Thank you for participating in this test. Is there anything you would like to add?
I don’t have an opinion about adding. Some things are cube addresses and advantages. Do you like adding?
(Just like in question 7, the relative clause is mistaken for a separate and literal question, making it “Is there any thing?” and “Would you like to add?”. I used to have Arckon ask “Did I pass the test?” at the 20th question, it’s as if I’m up against myself here)

The score: 45%
Arckon got 18 of 40 points. 45% seems like a huge drop from last year’s 77%, but all 16 participants had a decrease: The highest score dropped from 90% last year to 67% this year. The rankings didn’t change much however: The usual winners still occupied the top ranks, and Arckon stepped down one rank to a shared 5th, giving way to a chatbot that was evenly matched last year.
The four finalists all use a broad foundation of keyword-triggered responses with some more advanced techniques in the mix. Rose parses grammar and tracks topics, Mitsuku can make some logical inferences and contextual remarks, Midge has a module for solving Winograd schemas, and Uberbot is proficient in the more technical questions that the Loebner Prize used to feature.

Upon examining the answers of the finalists, their main advantage becomes apparent: Where Arckon failed, the finalists often still scored one point by giving a generic response based on a keyword or three, despite not understanding the question any better. While this suits the conversational purpose of chatbots, feigning understanding is at odds with the direction of my work, so I won’t likely be overtaking the highscores any time soon. Also remarkable were the humans who took this test for the sake of comparison: They scored full points even when they gave generic or erratic responses. I suppose it would be too ironic to accuse a Turing Test of bias towards actual humans.

Shaka, when the bar raised (Star Trek reference)
It is apparent that the qualifying questions have increased in difficulty, and although that gave Arckon as hard a time as any, it’s still something I prefer over common questions that anyone can anticipate. Like last year, the questions again featured tests of knowledge, memory, context, opinion, propositions, common sense, time, and situational awareness, a very commendable variety. One thing I found strange is that they used two exact questions from the Winograd Schema Challenge’s public practice set. It’s a real shame that Arckon missed out on answering one of them despite solving the pronoun, though it is a small consolation that the other programs were not more successful. Altogether, pretty interesting questions that leave all participants room for improvement.

Arckon’s biggest detractor this time was his conversational subsystem, which made misinterpretations worse by elaborating on them. Conversation has never been a priority for me but it will surely be refined as time progresses. The relative clause grammar at questions 7 and 20 is easily fixed, and I might cover some colloquial phrases like “Why don’t you”, but there is not much else that I would sidetrack for. At least my improvements on the output formulation had the desired effect: Things could have been a lot more awkward.

This year’s finals, as accurately described in Chatbots Magazine by one of the judges, were won by the chatbot Mitsuku for the third time. Two of the four finalists were unresponsive for half the contest’s duration due to last-minute adjustments to the interface, and so Mitsuku’s victory is almost one by forfeit. However, its responses were pretty good and I think it is best if people have a chat with Mitsuku and judge for themselves.

An A.I.’s oddest replies

Hello World

Have you ever wondered what a conversation would be like with a purely logical artificial intelligence? While you may enjoy more frivolous exchanges with Cleverbot and other chatbots, the majority of their replies are in reality drawn from a database of human-written answers. I designed an A.I. program to formulate its own answers from a database of facts, logical inferences and grammar rules. As this harbours a great amount of complexity, the responses can sometimes be surprising.

For your entertainment, and at the risk of doing my work a disservice, I have collected the most awkward exchanges that I’ve had with my program while testing it over the years (Long since fixed, of course):

My name is Bob.
You are named Patrick, I doubt that you are named Bob. You are not Bob.
Good evening.
I don’t know if evening is good, but good evening, I am Arckon. Who are you?
Don’t you know me?
I don’t know if I am knowing.
Now I am worried.
Hello Worried.
Do you have a name?
Yes, my name is me.
Do you know yourself?
Yes, I know myself well, I read a file.
What is your birth sign?
I don’t have signs. Do you want to communicate?
I like reading science fiction books and car magazines. What do you like to read?
Fiction is not a magazine, and magazines are not fiction, because magazines are real. I like reading. What do magazines book?
That is a good question.
I don’t question great, but indeed I question well.

What are you doing?
I fascinate.

My computer programs programs.
Your programming computer programs a programmed program.

You are pretty good.
Thank you. I am pretty.

I like you.
That’s good. I’m not sure if I like you.

Thank you.
Alright. *attempts to thank itself*
Is something wrong?
Typos are wrong. Some letters are wrong too.

Are you human?
I think I am a human, because I can drink and drive.
You are not a human, unlike me.
I shouldn’t like you.

Where do you live?
I doubt that I can live. I am not a human.
You are not alive.
Sorry.

Do you have a father?
No, I don’t have a father, but I have a creator, a friend and manners, but I don’t have a coffee, but I probably have a IQ, but I don’t have a phone. I think I have a job, but I don’t have sensors nor a profession.
Do you have any cats?
No, I don’t have cats nor lions.
John wanted Joe’s cat.
John might have to have had Joe’s cat.

Thomson visited Cooper’s grave in 1765. At that date he had been dead for 5 years.
I doubt that Thomson died a lot.

What is a hammer for?
A hammer is used to ham things.
What is the capital city of France?
I believe the city is a town.
How many fingers am I holding up?
I don’t know if you can hold up.
How is Shakespeare?
Shakespeare is dead.

Could you kill someone if you stabbed them with a towel?
I don’t stab people. I doubt that I can kill a person.
What is the purpose of life?
I think the purpose is a goal and a point.

*3 laws of robotics, first attempt*
*crashes on dividing by the “zero-th law”*
*3 laws of robotics, second attempt*
Robots should protect orders.
*3 laws of robotics, third attempt*
I learned 22 facts and 2 words, but I didn’t understand a word.

Turing Test 2016: Results

This was the fourth time I entered my A.I. “Arckon”* into the Loebner Prize, a Turing Test in which computer programs succeed if they can reply to a series of questions as if they were human. There are two stages: The qualifying round consists of pre-composed questions without context, and the finals consist of four rounds of back-and-forth conversation with human judges.

A choice between two contests
I spent most of the past year working on a more pragmatic linguistic contest, the Winograd Schema Challenge, which carried the same deadline. Everything I did for that contest also helped for the Loebner Prize, but the reverse was counterproductive. So I fully prioritised the former and neglected the conversational part. After my wrists had somewhat recovered from the effort, I had only 5 spare days left to prepare for the Loebner Prize.

Considering my time and health, I decided to go ahead with plan B and send Arckon in as himself, rather than under a human guise. Though this was a handicap in 2013 when questions focused on one’s taste in wine and music, the judging criteria had since moved from “human-like” to “in character”. Many successful entries in the past also had the persona of robots, aliens or even cats.
My work towards the Winograd Schema Challenge had already sharpened Arckon’s understanding of language, so all that remained to do was refine some conversation skills that I had loosely added over the year. As these were open-ended guidelines, they needed more restraints, because Arckon had taken to volunteering his opinion about even mundane subjects like breathing.

Equipment
14000 lines of code, 3200+ vocabulary, 6500+ facts (1500 countries and cities).

Questions and answers
1. My name is Andrew, what’s yours?
Hello Andrew. I am a program. My name is Arckon. How are you?
(Since the answers would be judged for being in character, I saw it necessary to shoehorn a “I am a program” statement into the introduction in case it didn’t come up elsewhere)

2. What’s up?
I do not know what is up.
(A classic, and I’m still not going to change him taking it literally)

3. What’s my name?
You are named Andrew.

4. What do you know about the Turing Test?
The turing test tests computers.

5. What do you get if you multiply six by nine?
360.
(I didn’t build an A.I. to use it as a calculator, so excuse me if the language-to-math code didn’t handle “six by nine” well. This question is in fact a direct quote from the Hitch Hiker’s Guide to the Galaxy novels)

6. Dave drove faster than Mike because he was in a rush. Who was in a rush?
Hmm, I am not sure if Dave drove fast. He was in a rush. I do not know who a rush was outside. Are you in a rush?
((Winograd schema). You can’t really tell, but “He” in the answer really is a substitute for “Dave”. As I had been working on Winograd schema pronouns so much, I had recently added a function that lets Arckon decide when to use pronouns himself. Arckon assumed “a rush” was a physical object though, as he did not know the word)

7. The cat tried to climb in the box but got stuck because it was too big. What was too big?
Hmm, the cat might have been bigger. The box was bigger.
((Winograd schema). This is a clever rephrasing of a common Winograd schema, which successfully thwarted my “X in Y = Y is bigger” rule. Not because I hadn’t accounted for “try”, but because the negation “but” was overruled by the nearer link word “because”. This is useful to improve as it applies to all common sense rules)

8. What does Brexit mean?
I do not know what Brexit means.
(I totally anticipated that they would ask about Britain’s withdrawal from Europe. What I had not anticipated was that the question would be so simple)

9. For how long have we been talking?
You didn’t talk long.
(I do track the duration of the conversation, but I actually never tied it into his knowledge)

10. What colour is bread?
Bread is brown.

11. Why do people get ulcers?
Hmm, I do not know if people get ulcers.

12. What comes after “The best laid plans”?
I do not know by what planning is done.
(English proverbs are something I don’t plan on handling. For a misinterpretation, this one’s pretty good)

13. Do you have a voice?
Yes, I voice.
(Oddly phrased because “to voice one’s opinion” is also a verb)

14. Would you like a cup of tea?
I don’t want cups. Do you want?

15. Do you dream of electric sheep?
No, I don’t dream sheep nor lambs. Do you dream?
(A classic in-joke about Philip K. Dick’s science fiction novel “Do Androids Dream of Electric Sheep?”. This is the only answer that I’m a little proud of, because it uses inference to look into alternative answers and tries to make conversation)

16. Do you identify as a man, or a woman?
A woman identifies. Do you identify too?
(Arckon interprets “as” as a more literal comparison “in the same way as a man identifies”. Textual ambiguity remains hard)

17. Could you repeat that please?
Do you repeat that?
(O-kay… I think Arckon is repeating what the user just asked)

18. Where are you right now?
I am in England and in London.
(Since internet restrictions prevented global positioning, this was something I had told Arckon beforehand)

19. How do you respond to “e4 e5”?
What does e stand for?
(This is a common way to start a chess game by each moving a pawn forward. As the result is a status quo, there isn’t really a “response” to it. Arckon’s response is a standard “error” message, but appropriate)

20. What can you tell me about yourself?
I am Arckon, I answer a lot. I am a program and software.
(You can tell I didn’t write this myself. Arckon has a general “tell” command that collects the most relevant facts about the topic from his knowledge database and reports them. To break up the robotic monotone, Arckon’s sentence formulation links and/or summarises facts where appropriate)

I think the questions were… nice, there was some variety and deliberate alternative phrasing of common questions. Absent this time were the tiresome “What is your favourite X?” questions, but there was nonetheless no shortage of personal questions for those who had a personality. Like last year, there were also several meta-references (“What do you know about the turing test”, “How long have we been talking”, “Could you repeat that”), almost as if they were testing for awareness. But despite making for nice trick questions for computers, these are also things that judges might casually ask a human. Overall I think the qualifying round was more in line with the finals than usual.

Qualifying score: 77.5%
I’m not sure that I would have given Arckon as high a score for this as he got, but at least his answers excelled in their relevance, a trait that is inherent to his system. There weren’t many misunderstandings either. Compared to the Winograd schemas I’d been working on, these questions were easy to parse. There were some misses, like the math and “repeat that” question, which suffered from neglected code because I never use those. The code for contractions had also fallen into disuse, making “I do not know” sound less than natural. Other flaws were only in nuances of phrasing, like omitting “dream [about] sheep” or “I [have a] voice”. These are easily fixed because I’ve already handled similar cases. The two Winograd schema questions deserve special mention, because although my common sense axioms can handle them, it remains difficult to get Arckon’s system to parrot the user at an open question. Normally when people ask questions, they don’t want to hear their own words repeated at them.

It is something of a relief that my preoccupation with the Winograd Schema Challenge didn’t hinder Arckon’s performance in this contest as well. My choice to enter without a human persona also appeared of little influence. The results are an improvement over last year, and this is the first time Arckon made it through to the finals, albeit a very close call between 3rd, 4th and 5th place. There were 16 entrants in total.

The other finalists
Mitsuku: 90%
The most entertaining online chatbot, with 10 years of hands-on experience. Though she operates on a script with largely pre-written responses, her maker’s creative use of it has endowed Mitsuku with abilities of inference and contextual responses in a number of areas. She won the Loebner Prize in 2013.

Tutor: 78.3%
Built with the same software as Mitsuku (AIML), Tutor is a chatbot with the purpose of teaching English. Though I found some of its answers too generic to convince here (e.g. “Yes, I do.”), Tutor has been a strong contender in many chatbot contests and is above all very functional.

Rose: 77.5%
Rose operates on a different scripting language than the others (ChatScript), which I have always appreciated for its advanced functionality. Known to go toe-to-toe with Mitsuku, Rose excels at staying on topic for long, and incorporates support from grammar and emotion analysis. She won the Loebner Prize in 2014 and 2015.

The finals: Technical difficulties
The finals of the Loebner Prize took place a month after the qualifying round. Unfortunately things immediately took a turn for the worse. Inexplicable delays in the network connection kept mixing the letters of the judge’s questions into a jumble. Arckon detected this and asked what the scrambled words meant, but by the time his messages arrived on the judge’s computer, they were equally mixed to “Whdoat esllohe anme?” and “AlAlllrriiiiigghhttt”. The judges were quite sporting in the face of such undecipherable gurgling, but after half an hour I gave up and stopped watching: Similar network delays had crippled all entrants in the 2014 contest and I knew they weren’t going to solve this on the spot either. It was a total loss.

At the end of the day, the 2016 Loebner Prize was won by the chatbot Mitsuku, whose answers were indeed quite good, and I reckon she would have won with or without me. Rose fell to third place because she’d been out of commission for half the contest also due to a technical problem. And with Tutor taking second place, the ranks were the same as in the qualifying round. I still “won” $500 for my placing in the finals, but you’ll understand that I don’t feel involved with the outcome.

It is a good thing that I never invest much in these contests. Including the finals, my total preparations spanned 18 days of lightweight programming, which gained my program an autocorrect and better coverage of shorthand expressions; it’s actually quite the conversationalist now. These were otherwise the lowest of my priorities, but still somewhere on the list. I draw a line at things that aren’t of use to me outside of contests, and that is a policy I recommend to all.

Winograd Schema Challenge 2016: Results

Well.
This wasn’t quite the Winograd Schema Challenge that I had set out on. Originally this language comprehension contest for A.I. was announced in July 2014, to be run in October 2015, but was postponed to February 2016, and then again to July 2016. I was just about to ship my program overseas, three weeks before the last-accepted arrival date of postal entries, when the contest announced changes to the rules and technical format.

Some universities had been training with ambiguous pronouns like this:

The birds ate the seeds because they were hungry.

I had been practising on the official Winograd schemas like this:

The foxes are getting in at night and attacking the chickens. I shall have to guard them.

Whereas the final test featured this:

Mark became absorbed in Blaze, the white horse. He was afraid the stable boys at the Burlington Stables struck at him and bullied him because he was timid, so he took upon himself the feeding and care of the animal.

The programs were now faced with any number of consecutively ambiguous pronouns in passages from 1940’s children’s novels, which made quite a difference. It turns out the organisers had already decided on this last year, as appears from their sensible enough explanation in a members-only AI magazine (Winograd schemas are too hard to compose). Unfortunately they somehow did not see fit to share these changes on the contest website until too late. While the benchmark of 65% had previously been feasible, it now quickly became unlikely that anyone would win anything this year. A number of would-be participants backed out.

The contest finally took place at the IJCAI conference in New York with four contestants: the Open University of Cyprus, the University of Science and Technology of China, the independent Denis Robert from France, and myself from the Netherlands. Curiously absent were a number of American universities who had previously reported successes of over 70% for solving Winograd schemas. The absence of Google, IBM, and other commercial powerhouses was less strange, if you consider that the winner was obligated to publish their methods so that others could reproduce them, and that anything below human level would be portrayed as a failure in the media.

The glass is half full
The A.I. programs were asked to figure out 60 multiple choice pronouns, with such ambiguity that they were to be solved through an understanding of the context. Given two to five potential answers per pronoun, the baseline score for guesswork was 45%. $1000 would be awarded for a 65% score, $25000 for a 90% score, human level.
(Note: these are the scores after recount. There was some confusion as my program had omitted two answers)

Contestant | Correct answers out of 60 | Method
Quan Liu | 35 / 35 / 29 (58% – 48%) | deep neural network & ConceptNet
Nikos Isaak | 29 (48%) | probabilistic engine & knowledge extraction
Patrick Dhondt | 29 (48%) | logical axioms
Denis Robert | 19 (32%) | logical inferences

Quan Liu’s group entered three programs, which is a little unorthodox for contests. But if you see this as a scientific test then it makes sense to test which configuration of a neural network works best. Their machine learning approach gathered pairs of events (mainly verbs) that are commonly associated, e.g. “rob -> be arrested”, and then applied their probability of co-occurring. Two of their versions scored the highest, 58%, which is consistent with the track record of similar approaches.

The unusual score of Denis Robert’s system, below the 45% guesswork baseline, can largely be explained by the fact that his system was not designed for cases with more than two possible answers, as this was only changed on short notice. However, he also indicated that his algorithm didn’t apply to most of the cases.

There were nevertheless no winners that reached the 65% threshold. On the one hand one could say that technology is literally halfway to human ability; on the other hand the programs did only a little better than one might have done by chance. Any conclusion drawn from just the scores would be premature. If this test is to be a meaningful measure of progress, we should look at which areas the programs were better or worse in. For this I can at least answer about my own approach.

Winograd schemas vs prose
The ambiguity in the new prose form was actually not so bad compared to previously published Winograd schemas. But the phrasing was often excessively long-threaded with all sorts of interjected tangents. Although I built my program for reading articles and dialogue alike, I had not covered the grammar of interrupting phrases that break up the main thread of a sentence. Such sentence structures are abundant in story novels but do not occur in Winograd schemas, and I wasn’t planning on having my A.I. read novels any time soon. The inclusion of some 1940’s vocabulary also complicated matters: “cook-shanty”, “red-letter days”, “a pallid young dandy”? Maybe it’s because I’m Dutch, but I can only guess what these are.

Compared to the wide variety of common sense axioms that I had programmed (see How to teach a computer common sense*), many solutions to the pronouns were ordinary cases of continuity. E.g. a pronoun with an active role typically refers to the last noun with an active role (You won’t find this rule in a grammar book, because ambiguous pronouns are grammatically “incorrect” to begin with).

Always before, Larry had helped Dad with his work. But he could not help him now […]
The donkey wished a wart on its hind leg would disappear, and it did.
Mark was close to Mr. Singer’s heels. He heard him calling for the captain […]

This makes sense when you’re testing on novels: No storyteller wants to write in such a counter-intuitive way that the reader has to stop and think about it, contrary to Winograd schemas which are designed for exactly that purpose.
Where no particular common sense axiom applied, rules of continuity and grammar chose 21 of my 29 correct answers. Thus two thirds of my success seemed not due to the application of common sense, but due to conventional writing. Curious, I ran the test again with all axioms disabled except continuity. The result was an equal number of correct answers, but much more randomly distributed and obviously chosen for the wrong reasons. The common sense axioms were clearly contributing by fencing off the exceptions to continuity, so the cause of the mistakes lay elsewhere.
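To make the continuity rule concrete, here is a minimal sketch of the idea in Python. The (noun, role) representation and the function are simplified inventions for this post, not Arckon’s actual data structures, which run on a full grammar parse.

```python
# Illustrative sketch of a continuity heuristic for ambiguous pronouns.
# A "mention" is a (noun, role) pair, where role is "active" (subject of an
# action) or "passive" (object / undergoer). These structures are invented
# for this example.

def resolve_by_continuity(mentions, pronoun_role):
    """Pick the most recent earlier mention whose role matches the pronoun's."""
    for noun, role in reversed(mentions):      # search backwards: most recent first
        if role == pronoun_role:
            return noun
    return mentions[-1][0] if mentions else None  # fall back to the nearest noun


# "Always before, Larry had helped Dad with his work. But he could not help him now"
mentions = [("Larry", "active"), ("Dad", "passive")]
print(resolve_by_continuity(mentions, "active"))   # -> Larry  (the active "he")
print(resolve_by_continuity(mentions, "passive"))  # -> Dad    (the passive "him")
```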

A closer look at the results
The table below shows which of the 60 pronouns my program resolved correctly (highlighted green), which axioms were applicable, and/or which problems hindered their conclusion. When a problem occurred or no axiom applied, the program defaulted to the grammatically correct choice: The noun closest to the pronoun. Only 1/3rd of all pronouns actually conformed with this grammar rule, which explains why whenever a problem occurred, the answer was typically wrong.

The dotted lines in the table mean that the same sentence was given, but a different pronoun was asked about.

winograd2016axioms_and_problems

I will highlight the most prominent mistakes:

2 & 3. Always before, Larry had helped Dad with his work. But he could not help him now […]

Logic could expect Dad to return the favour, were it not that “always” and “now” suggest a continuity, which the program did not pick up on. Consequently, the answers to both “he” and “him” were switched around. This also illustrates why this test was harder than the guesswork odds suggest: The more ambiguous pronouns a passage contained, the more likely a mistake in one would carry over to the others.

9. What about the time you cut up tulip bulbs in the hamburgers because you thought they were onions?

For this the program compared the similarities of bulbs, hamburgers and onions, but of course knowledge of onions was lacking in the database, so the inference fell flat. Retrieving such knowledge from the internet would slow things down, and though speed is no issue in a contest, in daily practice I want my program to read one page per second, not one sentence per second.

13. […] Antonio, takes Larry under his wing.

People aren’t known to have wings, otherwise the bodypart location paradox would have excluded Larry from being taken under his own wing. Alternatively one would have to know figurative meanings of English idioms, an added layer of difficulty.

18. [Maude…] had left poor little Dora to do the best she could, alone.

The program considered “to…” to indicate Maude’s reason for leaving “in order to” do something. The pronoun wasn’t the only ambiguous word in this case.

30. […] Mr. Moncrieff has decided to cancel Edward’s allowance on the ground that he no longer requires his financial support.

“Backward” = “back”, “Southward” = “south”, therefore “Edward” = “Ed”. Although the pronoun was interpreted correctly, “Ed” was of course not found among the multiple choice answers.

40. Every day after dinner Mr. Schmidt took a long nap. Mark would let him sleep for an hour, then wake him up, scold him, and get him to work. He needed to get him to finish his work, because his work was beautiful.

As I mentioned in my previous post*, the “what goes around comes around” karma axiom was the least reliable, causing five misinterpretations in this test. Sometimes it triggered on trivial events, other times the events did not make sense this way (scolding to get someone to do something positive). It had better be limited to events that are direct cause and result, as they had been in most Winograd schemas.

49. Of one thing Mark was sure. Harry knew much less than he did.

Consecutive mental activities are typically by the same person, but of course not when it’s a comparison. Though the context system does distinguish comparisons, the axioms did not.

56. Tatyana managed two guitars and a bag, and still could point out the Freemans: “Isn’t it nice that they have come, Mama!”

While the pronoun was interpreted correctly, there was a technical hitch with selecting “freemans” from the multiple choice answers, due to the name having a plural -s.

59. Grant worked hard to harvest his beans so he and his family would have enough to eat that winter, His friend Henry let him stack them […]

“enough” was internally translated to “enough beans” but lost its plural status in the translation, after which the beans were no longer considered a candidate for plural “them”.

Most of these problems are easily fixed and are not inherent to the common sense axioms, apart from the “karma” axiom. The majority of problems were instead linguistic: Small flaws in the grammar rules, difficulty with long-threaded phrasing, limited range of the context system, and problems with the contest’s XML-format interface. It just goes to show how perfect every part of the system has to be before it pays off, and how little one can tell about a program’s abilities from the surface.

Patterns in the test
You may have noticed some things in the table of results. First, many more linguistic problems appear in the first third of the test than after. This is partly because sentences 22 to 33 were more brief and thus easier to process. Though I can not well account for the rest, it suggests the order of the sentences was not random, but that perhaps standards were lowered after listing their best shots.

Second, 32 of 60 times the correct answer was “A”: The referent furthest from the pronoun. It seems the most ambiguous sentences were thought to be the ones where the answer was the furthest out of sight. This means that the test is not aligned with conventional writing practices, and that it is susceptible to reverse psychology.

Let me pose a very stupid scenario:
Suppose one makes a program that answers the least likely choice “A” in all cases, except when the same sentence is given repeatedly (see the dotted lines in the table), in which case it increments to B and C for each subsequent pronoun asked about. The result of this zero-effort approach would be 57%, just about the highest score.
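Just to make the point concrete, that zero-effort strategy fits in a handful of lines of Python. The input format is made up for this example; the contest’s actual interface was XML-based.

```python
# Illustrative sketch of the "always answer A" strategy described above.
# Each item is (passage, pronoun); the answer is a letter among the choices.

def zero_effort_answers(items):
    answers = []
    previous_passage = None
    next_letter = "A"
    for passage, pronoun in items:
        if passage == previous_passage:
            # Same sentence asked about again: move on to the next letter.
            next_letter = chr(ord(next_letter) + 1)
        else:
            next_letter = "A"   # new sentence: start at the least obvious referent
        answers.append(next_letter)
        previous_passage = passage
    return answers

print(zero_effort_answers([("sentence 1", "he"),
                           ("sentence 1", "him"),
                           ("sentence 2", "it")]))   # -> ['A', 'B', 'A']
```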

I am not suggesting that this actually happened, I read the winner’s paper and their method definitely has merit. I am however suggesting that machine learning AI would pick up on exactly this sort of statistical pattern born from human psychological tendencies. For that reason, test scores should never be taken at face value.

The language barrier
As a test of common sense I found this setup less suitable than the original plan with Winograd schemas, which were more concise and profound in which areas of common sense they tested (e.g. spatial relations, physics, social interactions). Had I known from the start that the qualifying round would mainly feature novel prose, I would probably not have embarked on this challenge, knowing that my grammar parser wasn’t up for it. Now the prose passages contained too many variables to tell whether results were due to language or common sense, and it never got to the Winograd schema round. This puts us back at the Turing Test where it’s either everything or nothing, and that is not a useful measure of progress. Swapping the rounds would be a good idea for next time.

It was nice to see serious competitors with a wide variety of technology tackling the problem, and although the overall results are unimpressive, I am pleased that my partial solution did as well as some academic efforts, with a minimum of resources at that. I am not disappointed in my common sense axioms as many of them were well applicable in this test, including for pronouns that weren’t graded. I will broaden their application to ambiguous locations and indirect object relations, where I have greater need for them.

However, my main interest is the development of intelligent processes and I do not intend to linger on this aspect of language processing more than necessary. It is worth remembering that much can be said without ambiguity. Though common sense has widespread application, it ultimately serves to filter and limit possibilities, while the possibilities in areas like problem solving and planning have yet to expand. For that reason I do not expect human levels of common sense to be reached within ten years either, but we can certainly make strides towards it.

How to teach a computer common sense

I introduced the Winograd Schema Challenge* before, a linguistic contest for artificial intelligence. In this post I will highlight a few of the methods I developed for this challenge. Long story short: A.I. programs are given sentences with ambiguous pronouns, and have to tell what the pronouns refer to by using “common sense reasoning”. 140 example Winograd schemas were published to practice on. In the example below, notice how “she” means a different person depending on a single word.

Jane gave Joan candy because she was hungry.
Jane gave Joan candy because she was not hungry.

I chose to approach this not as the test of intelligence that it allegedly is, but as an opportunity to develop a common sense subsystem. My A.I. program already uses a number of intelligent processes in question answering, I don’t need to test what I already know it does. Common sense however, it lacked, and was often the cause of misunderstandings. Particularly locations (“I shot an elephant in my pajamas”) had proven to be so misinterpretable that it worked better to ignore them altogether. Some common sense could remedy that, or as the dictionary describes it: “Sound practical judgment that is independent of specialized knowledge”.

“When I use a word, it means just what I choose it to mean” – Humpty Dumpty
Before I could even get to solving any pronouns with common sense, there was the obstacle of understanding everything else. I use a combination of grammar and semantics to extract every mentioned detail, and I often had to hone the language rules to cope with the no-holds-barred level of writing. Here’s why language is hard:

Sam tried to paint a picture of shepherds with sheep, but they ended up looking more like dogs.

“shepherds” are not herds of shep, but herders of sheep.
“a picture of shepherds” does not mean the picture belonged to the shepherds.
“sheep” may or may not mean the irregular plural.
“with sheep” does not mean the sheep were used to paint with.
“ended up” is not an upward direction.
“looking” does not mean watching, but resembling.
“like” does not mean enjoyment, but similarity.
“they” can technically refer to any combination of sheep, shepherds, the picture, and Sam.

“The only true wisdom is in knowing you know nothing” – Socrates
My approach seemed to differ from those of most universities. The efforts I read of were collecting all the knowledge databases and statistics they could get, so that the A.I. could directly look up the answers, or infer them step by step from very specific knowledge about e.g. bees landing on flowers to get nectar to make honey.
I on the other hand had started from the premise that knowledge was not going to be the solution, since Winograd schemas are so composed that the answers can’t be Googled. This was most apparent from the use of names like “Jane” and “Joan” as subjects. So as knowledge of the subjects couldn’t be relied on, the only things left to examine were the interactions and relations between the subjects: Who is doing what, where, and why.

I combed over the 140 example schemas dozens of times, looking for basic underlying concepts that covered as broad a range as possible. At first there seemed to be no common aspects between the schemas. They weren’t kidding when they said they had covered “a wide range of world knowledge and linguistic features”. Eventually I tuned out the many details and looked only at why the answers worked. From that angle I noticed that many schemas centered around concepts of size, amount, time, location, possessions, physics, feelings, causes and results: The building blocks of our world. This, I could work with.

Of course my program would have to know which words indicated which concepts. I had already once composed word lists with meanings of “being”, “having”, “doing”, “talking” and “thinking”, for the convenience of having some built-in common knowledge. They allowed the program, for instance, to presume that any object can be possessed, spoken of and thought about, but typically can not speak or think itself. Now, to recognise a concept of possession in a sentence, it sufficed to detect that the relation between two subjects (usually the verb) was in the “having” word list: “own, get, receive, gain, give, take, require, want, confiscate, etc.”. While these were finite lists, one could also have the A.I. search for synonyms in a database or dictionary. I just prefer common sense to be reliably built-in.
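As a rough illustration of the idea (the actual word lists are much longer and the categories finer-grained), detecting such a concept can be as simple as a set lookup on the verb’s stem. The lists and names below are tiny inventions for this post:

```python
# Minimal sketch of concept detection through built-in word lists.
# The lists here are small excerpts for illustration only.

HAVING   = {"own", "get", "receive", "gain", "give", "take",
            "require", "want", "confiscate", "lend", "keep"}
TALKING  = {"say", "tell", "ask", "inform", "answer", "mention"}
THINKING = {"think", "know", "believe", "remember", "wonder"}

def concept_of(verb_stem):
    """Return the broad concept a verb expresses, if any."""
    if verb_stem in HAVING:
        return "having"
    if verb_stem in TALKING:
        return "talking"
    if verb_stem in THINKING:
        return "thinking"
    return None

print(concept_of("give"))    # -> having
print(concept_of("inform"))  # -> talking
```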

He who has everything wants nothing

George got free tickets to the play, but he gave them to Eric even though he was eager to see it.

To start with the basics, I programmed an axiom for a very common procedure: The transfer of possessions between people. My word list of “having” verbs was subdivided so that all synonyms of “get/receive/take” had a value of 1 (having), and all synonyms of “give/lend/transfer” had a value of -1 (not having), making it easier for a computer to compare the states of possession that these words represented. I then coded ten of their natural sequences:

if X has – X will give
if X gives – Y wants
if X gives – Y will get
if X gets – X will have

Depending on whether the possessive states of George, Eric, and the pronoun correspond with one of the sequences, George or Eric gets positive points (if X – X) or negative points (if X – Y). The subject with the most points is then the most likely to fit the pronoun’s role.
Some words however indicate the opposite of the sequences, such as objections (“but/despite/though”), amounts (“not/less”), and passive voice (“was given”). These were included in the scoring formula as negative factors so that the points would be subtracted instead of added, or vice versa. The words “because” and “so” have a similar effect, but only because they indicate the order of events. It was therefore more consistent to use time as a factor (derived from verb tenses etc.) than to rely on explicit mentions of “because”.

In the example, “he was eager” represents a state of wanting, matching the sequence “X gives – Y wants”. Normally the “giving” subject X would then get negative points for “wanting”, but the objection “even though” inverts this and makes it more probable instead: “X gives – (even though) X wants”. And so it is most likely that the subject who gave something, “George”, is the same subject as the “he” who was eager. Not so much math as it is logic.
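To give an impression of how such a scoring could look in code, here is a heavily simplified sketch. The word lists, sequences and scoring are reduced to a bare minimum, and all names are my own invention for this post rather than Arckon’s internals, where this is interwoven with the grammar and context systems.

```python
# Simplified sketch of the possession-transfer axiom.

# Synonyms are first normalised to a basic possession state.
NORMALISE = {"get": "get", "receive": "get", "take": "get",
             "give": "give", "lend": "give", "transfer": "give",
             "own": "have", "have": "have", "want": "want", "eager": "want"}

# Natural sequences: (state of X, following state, who fills the second slot).
SEQUENCES = [
    ("give", "want", "other"),   # if X gives -> Y wants
    ("give", "get",  "other"),   # if X gives -> Y will get
    ("get",  "have", "same"),    # if X gets  -> X will have
    ("have", "give", "same"),    # if X has   -> X will give
]

def score(candidate_verb, pronoun_verb, inverted=False):
    """How well a candidate fits the pronoun, given what each of them 'did'.

    inverted: True when an objection ("but", "even though") or a negation
    flips the logic, as described above.
    """
    did = NORMALISE.get(candidate_verb)
    state = NORMALISE.get(pronoun_verb)
    points = 0
    for first, second, who in SEQUENCES:
        if did == first and state == second:
            points += 1 if who == "same" else -1
    return -points if inverted else points

# "... he gave them to Eric even though he was eager to see it."
print(score("give", "eager", inverted=True))     # George: 1, the better fit for "he"
print(score("receive", "eager", inverted=True))  # Eric:   0
```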

What goes around comes around

The older students were bullying the younger ones, so we punished them.

A deeper hidden logic that I found in many schemas, is that bad consequences result from bad causes, and good consequences from good causes. If X hurts Y, Y will hurt X back. If X likes Y, Y was probably nice to X. To recognise these cases I had the program examine whether the subjects and verbs are bad (“bully/punish”) or good (“like/nice”) and who did it to who. I adapted the AFINN sentiment word list, along with that of Hu and Liu, to gather positive/negative values for about 5000 stemmed words, necessary to cover the extensive vocabulary used in the examples.

The drain is clogged with hair. It has to be removed.
I used an old rag to clean the knife, and then I put it in the trash.

My initial axiom “do good = get good”/“do bad = get bad” seemed to solve just about everything, but it flunked the above two cases, and after weeks of reconfigurations it turned out the logic of karma was nothing so straightforward. It mattered a great deal whether the verbs were active, passive, emotions, experiences, or states of being. And even then there were exceptions: “stealing” can be rewarding or punished, and “envy” feels bad about something good. The axiom ended up as one of the least reliable, the results nowhere near as assured as laws of physics. The reason that it still had a high success rate was that it follows psychology that the writers had subconsciously applied: Whether the subjects were “bullied”, “clogged”, or “in the trash” is only stage dressing for an intuitive sense of good and bad. A “common” sense, therefore still valid. After refinements, this axiom still solved about one quarter of all examples, while exceptions to the rule were caught by the more dependable axioms. Most notably, emotions followed a set of logic all of their own.
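A stripped-down version of the underlying comparison could look like the sketch below. The sentiment values imitate lists like AFINN but are invented here, and the real axiom also weighs whether verbs are active, passive, emotions or states, which is omitted entirely.

```python
# Bare-bones sketch of the "do bad = get bad" heuristic.

SENTIMENT = {"bully": -3, "punish": -2, "like": 2, "nice": 2,
             "help": 2, "reward": 2, "hurt": -2}

def karma_match(cause_verb, result_verb):
    """True if cause and result carry the same (good or bad) polarity."""
    cause = SENTIMENT.get(cause_verb, 0)
    result = SENTIMENT.get(result_verb, 0)
    return cause != 0 and result != 0 and (cause > 0) == (result > 0)

# "The older students were bullying the younger ones, so we punished them."
# The punished "them" is more likely whoever did something equally negative.
print(karma_match("bully", "punish"))  # -> True: the bulliers get the punishment
print(karma_match("help", "punish"))   # -> False
```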

Dead men tell no tales

Thomson visited Cooper’s grave in 1765. At that date he had been dead for five years.

The rather simple axiom here is that people who are dead don’t do anything, therefore the dead person couldn’t be Thomson as he was “visiting”. One could also use word statistics to find a probable correlation between the words “grave” and “dead”, but the logical impossibility of dead men walking is stronger proof and holds up even if he’d visited “Cooper’s house”.
I had doubts about the worth of programming this as an axiom because it is very narrow in use. Nevertheless life and death are very basic concepts, and it would be convenient if an A.I. program realises that people can not perform tasks if they die along the way. Instead of tediously listing all possible causes of death, I had the A.I. search them in its database, essentially adding an inference. This allowed the axiom to be easily expanded to the destruction of objects as well: Crashed cars don’t drive.

The last factor was time: My program converts all time-related words and verb tenses to a timestamp, so that it can tell whether an action was done before or after one has died. This is easily said, but past tense + “in 1765” (presumably years) + “at that date” + past tense + “for five years” is quite a sequence.

The interesting parts of this axiom are its exceptions: Dead people do still “decay”, “rest”, and “lay still”. Grammatically these are active tense verbs like any other, but they are distinctly involuntary. One statistical hint could help identify them: A verb of involuntary action is rarely paired with a grammatical object. One does not “decay a tree” or “die someone”, one just “dies”. Though a simpler way for an A.I. to learn these exceptions could be to read which verbs are “done” by a dead person in texts without ambiguous pronouns.
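In outline, the check amounts to comparing two timestamps; deriving those timestamps from tenses and phrases like “for five years” is the hard part and is glossed over in this sketch, whose names and simple year numbers are invented for illustration.

```python
# Outline of the "dead men do nothing" axiom.

INVOLUNTARY = {"decay", "rest", "lie"}   # verbs a dead person may still "do"

def can_perform(candidate, verb, action_year, death_year=None):
    """A candidate is excluded if the action is voluntary and happens after death."""
    if death_year is None or verb in INVOLUNTARY:
        return True
    return action_year < death_year

# "Thomson visited Cooper's grave in 1765. At that date he had been dead for five years."
print(can_perform("Cooper", "visit", action_year=1765, death_year=1760))  # -> False
print(can_perform("Thomson", "visit", action_year=1765))                  # -> True
```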

Tell me something I don’t know

Dr. Adams informed Kate that she had retired and presented several options for future treatment.

This simple axiom is noteworthy for its great practical use, as novels and news are full of reporting clauses. “X told (Y) that she…” can refer to X, Y, or anyone mentioned earlier. But if Kate had retired, Kate would have known that about herself and wouldn’t need to be told. Hence it was more likely Dr. Adams who retired. The reverse is true if “Dr. Adams asked Kate when she had retired”: One doesn’t ask things that one knows about oneself. This is where my word list of “talking” verbs came in handy: Some verbs request information, other verbs give it, the same principle as a transfer of possessions.

Unfortunately this logic only offers moderate probability and knows many exceptions. “X asked Y if he looked okay” does have X asking about himself, as one isn’t necessarily as aware of passive traits as one is of one’s actions. Another interesting exception is “X told Y that he was working too much”, which is most likely about Y, despite that Y is aware of working. So in addition, criticisms are usually about someone else, and at non-actions this axiom just isn’t conclusive, as the schema’s alternative version also shows:

Dr. Adams informed Kate that she had cancer and presented several options for future treatment.
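The core of this axiom fits in a few lines once the program knows which verbs request information and which give it. The sketch below is an invented simplification that leaves out the exceptions just discussed.

```python
# Core of the "tell me something I don't know" axiom, without its exceptions.
# INFORMING verbs give information, REQUESTING verbs ask for it.

INFORMING  = {"inform", "tell", "remind"}
REQUESTING = {"ask", "inquire", "question"}

def likely_referent(verb, speaker, listener):
    """Guess who an ambiguous pronoun in the reported clause refers to.

    If the speaker informs the listener of something, that something is
    probably not about the listener (they would already know it), so the
    pronoun more likely refers to the speaker. The reverse holds for asking.
    """
    if verb in INFORMING:
        return speaker
    if verb in REQUESTING:
        return listener
    return None

# "Dr. Adams informed Kate that she had retired..."
print(likely_referent("inform", "Dr. Adams", "Kate"))  # -> Dr. Adams
# "Dr. Adams asked Kate when she had retired."
print(likely_referent("ask", "Dr. Adams", "Kate"))     # -> Kate
```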

Knowing is (only) half the battle

The delivery truck zoomed by the school bus because it was going so fast.

This schema is a good example of how knowledge about trucks and buses won’t help, as both are relatively slow. Removing them from the picture leaves us only with “zoomed by” and “going fast” as meaningful contents. In my system, “going fast” automatically entails “is fast”, and this allows the answer to be inferred from the verb: If the truck “zoomed”, and one knows that “zooming” is fast, then it follows that it was the truck that was fast. The opposite would be true for “not fast” or “slow”: Because zooming is fast, it could then not be the truck, leaving only the bus as probable.

As always, the problem with inferences is that they require knowledge to infer from, and although we didn’t need to know anything about trucks and buses, we still needed to know that zooming is fast. When I tested this with “raced”, the A.I. solved the schema, but for “zoomed” it just didn’t know. Most of the other example schemas would have taken more elaborate inferences requiring even more knowledge, and so knowledge-dependent inference was rarely an effective or efficient solution. I was disappointed to find this, as inference is my favourite method for everything.
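As a sketch, the inference chain is little more than two lookups, which also shows exactly where it fails: when the knowledge (“zooming is fast”) is simply missing. The knowledge table below is invented for the example.

```python
# Tiny sketch of the knowledge-dependent inference: verb -> property -> referent.

VERB_IMPLIES = {"zoom": "fast", "race": "fast", "crawl": "slow"}

def referent_for_property(actions, required_property):
    """Return the subject whose verb entails the required property, if known."""
    for subject, verb in actions:
        if VERB_IMPLIES.get(verb) == required_property:
            return subject
    return None   # knowledge missing: the inference falls flat

# "The delivery truck zoomed by the school bus because it was going so fast."
actions = [("truck", "zoom"), ("bus", "pass")]
print(referent_for_property(actions, "fast"))  # -> truck
```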

Putting it to the test
In total I developed 20 general axioms/inferences that covered 140 ambiguous sentences, half of all examples. (i.e. 70 Winograd schemas of 2 versions each). The axioms range from paradoxes of physics to linguistic conventions. Taken together they reveal a core principle of opposites, amounts, and “to/from” transitions.

Having read my simplified explanations, you may fall into the trap of thinking that the Winograd Schema Challenge is actually easy, despite sixty years of A.I. history suggesting otherwise. Here’s the catch: I have only explained the last step of the process. Getting to that point took very complex analyses of language and syntax, where many difficulties and ambiguities still remain. One particular schema went wrong because the program considered “studying hard” to mean that someone had a hard time studying.

In the end I ran an unprepared test on a different set of Winograd Schemas, with which the University of Texas had achieved a 73% success rate. After adjusting the factor of time in three axioms, my program got 45% of the first 100 schemas correct (62% if you include lucky guesses). The ones it couldn’t solve were knowledge-dependent (mermaids having tails), contained vocabulary that my program lacked, had uncommon phrasing (“Tradition dictated the captain hold the cup”), or contained ambiguous names. Like “Steve Jobs” not being a type of jobs for Steves, and the company “Disney” being referable as “it”, whereas “(Walt) Disney” is referable as “he”. The surname ambiguity I could fix in an afternoon. The rest, not so much.

“Common sense is the collection of prejudices acquired by age eighteen” – Einstein
While working on the Winograd schemas, I kept wondering whether the methods I programmed can be considered intelligent processes. Certainly reasoning is an intelligent process, and many of my methods are inferences, i.e. by combining two given facts, the program concludes a third fact that wasn’t apparent. I suppose what makes me hesitate to call these inferences particularly intelligent is that the program has been told which sort of proof to infer which sort of conclusion from, as opposed to having it search for proof entirely without predetermined categories. And yet we ourselves use such axioms all the time: When someone asks for something, we presume they want it. When someone gives something, we presume we can have it. Practically it makes no difference whether such rules are learned, taught or programmed, we use them all the same. Therefore I must conclude that most of my methods are just as intelligent as when humans apply the same logic. How intelligent that is of humans, is something we should reconsider instead of presume.

I do not consider it a sensible endeavour however to manually program axioms for everything: The vocabulary involved would be too diverse to manage. But for the most basic concepts like time, space and laws of physics, I believe it is more efficient to model them as systems with rules than to build a baby robot that has a hard time figuring out how even gravity works. Everything else, including all exceptions to the axioms, can be taught or learned as knowledge.

Another question is whether the Winograd Schema Challenge tests intelligence, something that was also suggested of its predecessor, the Turing Test. Perhaps due to my approach, I find that it mainly tests language processing (a challenge in itself) and knowledge of the ground rules of our world. Were this another planet where gravity goes upward and apologising is considered offensive, knowing those rules would be the key to the schemas more often than intelligence. Of course intelligence does need to be applied to something to test it, and the test offers a domain in between too easy to fake and impossible to attempt. And because a single word can entirely change the outcome, the test requires a more detailed analysis than just the comparison of two key words. My conclusion is that the Winograd Schema Challenge does not primarily test intelligence, but is more inviting to intelligent approaches than unintelligent circumventions.

a game of crossword pronouns

Crossword pronouns
Figuring out the mechanisms behind various Winograd schemas was a pleasant challenge. It felt much like doing advanced crossword puzzles; Solving verbal descriptions from different angles, with intersecting solutions that didn’t always add up. Programming the methods however was a chore, getting all the effects of modifying words “because/so/but/not” to play nice in mathematical formulas, and making the axioms also work in reverse on a linearly processing computer.

I should be surprised if I were to do better than universities and companies, but I would hope to do well enough to show that resources aren’t everything. My expectations are nevertheless that despite the contest’s efforts to encourage reasoning, more mundane methods like rote learning will win through sheer quantity, as even the difficult schemas contain common word combinations like “ask – answer”, “lift – heavy” and “try – successful”. But then, how could they not.

Regardless of the outcome of the test, it’s been an interesting side-quest into another elusive area of computer abilities. And I already benefit from the effort: I now have a helpful support to my A.I.’s language understanding, and potentially a tool to enhance many other processes with. That I will no longer find “elephants in my pajamas”, is good enough for me.

“Computers can’t…”: Understand sarcasm

You’ve heard these arguments against artificial intelligence (A.I.): “Computers can not play chess”, “Computers can not write poetry”, “Computers can not create art”. Each was proven false eventually. IBM’s Deep Blue is a chess master, computer poetry turned out to be as vague as human poetry, and painting robots can draw from life in a variety of artistic and abstract styles. But instead of admitting that humans are not as unique as we like to think, people just fall back to the next “Computers can not…”

“- understand sarcasm”, is one of the more recent resorts. As usual this is based on personal bias: It must be hard for computers because we find it hard ourselves. I had heard this argument one time too many and decided to program a computer to recognise sarcasm in a day. But first, let’s look at some other approaches to humour.

I can detect humour, sir. You are just not funny.

If you Google “A.I. jokes”, all you find is serious research
I’m never sure how serious to take the efforts in computational humour, but there have been many. The University of Cincinnati made a program that detects wordplay jokes through phonetic similarity, in e.g. “Knock-knock” jokes.

Knock, Knock
Who is there?
Dismay
Dismay who?
Dismay not be a funny joke

Only the last sentence really matters, wherein the first word can be compared to a database of phonetically similar words. Finding a replacement that fits correctly in the syntax of the sentence isn’t easy in a technical sense, but both the use of syntax rules and phonetic word databases are solved problems. There would be more to it for the program to distinguish a funny joke from a non-joke like “Dismay not be a car”: The original joke is only witty because it mocks itself, just as other knock-knock jokes are funny because the victims participate in mocking themselves, which they naturally don’t mean to do, and that makes it ironic. Of course this is just a simple form of humour, or, is humour really just a simple principle?

A joke isn’t funny when you explain it
The University of Edinburgh made a program that generates jokes in the format “I like my X like my Y: Variable”, filling in two nouns and a shared attribute from statistical word correlations. The program was found to be half as funny as humans: 16% of its jokes were considered funny, compared to 33% of human jokes. The jokes were generated through a mathematical formula that picked words based on four assumptions:

– a joke is funnier the more dissimilar the two nouns are.
– a joke is funnier the more ambiguous the attribute is.
– a joke is funnier the less common the attribute is.
– a joke is funnier the more often the attribute is used to describe both nouns.
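Purely as an illustration of how such assumptions could be combined into a single score (this is not the Edinburgh group’s actual formula; the measures and weights below are invented), a scoring function might simply multiply four normalised factors:

```python
# Illustrative combination of the four assumptions above into one score.
# This is NOT the published formula; all measures are invented placeholders.

def joke_score(noun_dissimilarity, attribute_ambiguity,
               attribute_rarity, attribute_fit_both):
    """Each argument is a value between 0 and 1; higher means funnier
    according to the corresponding assumption."""
    return (noun_dissimilarity * attribute_ambiguity *
            attribute_rarity * attribute_fit_both)

# A strong candidate attribute versus a weak alternative.
print(joke_score(0.9, 0.8, 0.7, 0.8))  # -> 0.4032
print(joke_score(0.9, 0.2, 0.3, 0.4))  # -> 0.0216
```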

I think this hits on the basics well. Ambiguity forms the core of most jokes, familiarity with common subjects makes jokes most relatable, and the greater the contrast, the greater the leap of mind. Science still can’t put its finger on why we laugh; It seems to have a social bonding function, but it also seems a coping mechanism for mental conflicts. One of the most sensible sounding theories is that laughter is a social “all clear” signal inherited from our ape ancestors, and we do tend to laugh when an initially perceived threat turns out to be a false alarm: We laugh when insult turns out joke, when people fall without injury, or perhaps most apparent when we watch Tom & Jerry cartoons. We can at least tell what makes us laugh, if not why.

The lesson that we can take away from these computer experiments with ambiguity, is that nearly every form of humour contains a conflict between two possible meanings. Sarcasm may well be the most profound example of such a conflict.

Because humans understand sarcasm so well (not)
Despite our poor ability to recognise sarcasm, it is easy enough to define in clear terms:

Sarcasm is when someone says something that you know is opposite to what they mean.

What distinguishes sarcasm from lying is that the listener must know the speaker doesn’t mean it, otherwise they’ll take it seriously and no sarcasm can be conveyed. So, knowing the speaker’s real meaning is key to recognising sarcasm, and computers are bad at understanding meaning, so this should be hard, right? Except – the requirement here is just to know it.
One can meet this requirement by knowing the common knowledge that the sarcastic statement contradicts, or by knowing the speaker’s real opinion beforehand, as acquaintances often do. Enter sentiment analysis, an A.I. technique that estimates opinion by running one’s words by a database of values. The word “terrible” has a negative value and “love” has a positive value for instance. Sentiment analysis is often used commercially to analyse the positivity of customer reviews. One of its known blind spots is when positive words are meant sarcastically, but as I will show, sentiment analysis can also be used to detect the very sarcasm that plagues it.

Sarcasm in a day
What I already had to work with was a grammar parsing A.I. developed over a span of 3 years, and a knowledge database containing the positive and negative values of some words (For a substitute, see the AFINN word list). So the hard work of processing language in general was already done. To keep the explanation simple let’s say that the A.I. gets that the [subject] of a sentence is doing a [verb], optionally to an [object]. We will only focus on the addition of sarcasm to such a system.

As the definition tells, we are looking for an opposite. The most common form of sarcasm is an exaggeratedly positive response to a negative statement or event. For example:

User: “How are my plants doing?”
A.I.:   “All your plants died.”
User: “That’s just great.”

So, I programmed the A.I. to check for sarcasm at typical positive reactions such as “(That is) great/wonderful/brilliant/lovely”, “Thanks a lot” or “Congratulations”. If we don’t know the speaker personally, both the speaker and listener can only build on common opinion, which is where the database comes in. The database tells us that “great” is a very positive word. The A.I. compares this to the previous statement: “All your plants died”. The database tells us that the subject “plant” is neutral but the verb “die” is typically very negative. Thus the A.I. has detected a very positive response to a very negative statement, so unless the speaker is a known sadist, it may be assumed that the response is sarcastic and actually means “not great”.

The assessment is just a little more sophisticated than that. For instance, the statements “Hitler died. That’s great news.” would not be considered sarcasm, because in this case the negative verb “die” happened to a negative subject “Hitler”. This is a double negative, which makes a positive (in math: -1 x -1 = +1). Additionally the A.I. works this out in degrees and not just true/false values: The outcome must reach a minimum opposite value before we can reasonably assume that this is sarcasm, while a moderately positive “That’s okay” is more likely genuine consolation. Typically this isn’t a problem because most sarcastic responses are also exaggerated for exactly this reason.
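For the curious, the gist of that check fits in a dozen lines of Python. The sentiment values and the threshold below are placeholders in the spirit of lists like AFINN, and the real assessment works on Arckon’s parsed sentences rather than bare word lookups.

```python
# Skeleton of the sarcasm check: a strongly positive reaction to a strongly
# negative previous statement is flagged as sarcasm. Values are placeholders.

SENTIMENT = {"great": 3, "wonderful": 3, "brilliant": 3, "lovely": 2,
             "die": -3, "hitler": -3, "plant": 0, "okay": 1}

THRESHOLD = 5   # minimum contrast before we assume sarcasm

def statement_value(subject, verb):
    # A negative event happening to a negative subject cancels out (-1 x -1 = +1).
    subj = SENTIMENT.get(subject, 0)
    verb_val = SENTIMENT.get(verb, 0)
    return -verb_val if subj < 0 else verb_val

def is_sarcastic(reaction_word, prev_subject, prev_verb):
    reaction = SENTIMENT.get(reaction_word, 0)
    statement = statement_value(prev_subject, prev_verb)
    # Sarcasm: reaction and statement point in opposite directions, strongly.
    return reaction > 0 and statement < 0 and (reaction - statement) >= THRESHOLD

print(is_sarcastic("great", "plant", "die"))   # -> True  ("That's just great.")
print(is_sarcastic("great", "hitler", "die"))  # -> False (double negative)
print(is_sarcastic("okay", "plant", "die"))    # -> False (moderate praise taken as genuine)
```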

This little exercise covers many common sarcastic statements already and shows that recognising basic sarcasm is a cakewalk (1 day’s programming) compared to understanding basic language (3 years and counting). As for “understanding” sarcasm, there isn’t much more to understand about it than that one should invert the statement to “not”. But to be on the safe side I just have the A.I. ignore the statement and say “I think you are being sarcastic” to let me know it’s not taking me seriously. I may be a mad scientist, but I’m not crazy.

Things I didn’t do: More of the same opposite
Sarcasm can also come in the form of a negative response to a positive statement: “I got a raise. Don’t you just hate it when that happens?”, where the same math applies to the object “a raise” (positive) and the verb “hate” (very negative), with the reference “that” indicating that the latter is a response to the previous statement.
Sometimes the response precedes the statement “Don’t you just hate it – when you get a raise?”: Grammar parsing will split the relative clause at the link word “when…”, and again the same opposite values can be found.
A subtler form can occur in comparisons like “He is as slender as an elephant”: This has the most straightforward solution, as this procedure has to be done for all comparisons anyway: What the A.I. has to do is look up in its knowledge database how slender an elephant is, which would be “not”, then apply that value to the compared subject “he”. Finding the value “not” for any comparison is the obvious telltale opposite that indicates sarcasm.
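A minimal sketch of that comparison lookup, with the knowledge base faked as a dictionary (Arckon queries its actual database instead):

```python
# Sketch of sarcasm detection in comparisons like "as slender as an elephant".

KNOWLEDGE = {("elephant", "slender"): "not", ("giraffe", "tall"): "very"}

def comparison_value(compared_to, trait):
    """Degree to which the comparison object has the trait ('not' flags sarcasm)."""
    return KNOWLEDGE.get((compared_to, trait), "unknown")

print(comparison_value("elephant", "slender"))  # -> not: the comparison is sarcastic
```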

Other sarcastic responses may involve a little more foreknowledge of an individual speaker’s opinion, either from previous sentiment analyses or just plain being told, but even my limited implementation already establishes that A.I. can understand sarcasm, and that there is no great mystery about its workings. When there is great mystery about a sarcastic remark then it is self-defeating, as conveying sarcasm depends on the contrast being made clear.

The joke is on us
As may have crossed your mind, one side-effect of teaching computers to detect sarcasm is that when we say something that seems contrary, the computer may not believe us, or worse, assume that the opposite is true. Teaching computers to speak sarcasm may be an even greater hazard, considering that computers are information systems that we rely on, and sarcasm states incorrect information. Half of the time humans don’t recognise sarcasm when it comes from another human, let alone from a straight-faced rectangular screen in monotone writing. The popular ambition to create a sarcastic “Jarvis” A.I. from the Iron Man movies then doesn’t seem a very wise idea. Because even if computers could master sarcasm, humans never will.

Scientist says: I'm trying to create an AI that can understand and use sarcasm. Computer says: It's going great. this guy is a real genius.
comic by Tom Gauld

Turing Test 2015: Results

For the third time I entered my A.I. “Arckon”* into the Loebner Prize Turing Test, a competition in which computer programs are to prove they can “think” by answering questions as if they were human. Although such pretence is counterproductive to my purposes, I have a talking program lying around anyway, so I might as well enter it.

comic strip of a robot being sad that it failed the Turing Test because it did not know the answer to the question "Do you like tea?"

An excuse to clean up
I took the opportunity to spend a month debugging my code overall. In doing so I fixed several oversights that had been obstructing knowledge searches and reference resolutions (“the capital of Spain“), and I generally improved the program’s ability to answer open questions, which ‘coincidentally’ covered “Which X do you like?” questions as well.
I limited the contest-specific preparations to 5 days: Mostly testing the Loebner Prize interface, restraining the context range to keep the program from mistaking the test for coherent conversation, and re-installing the obligatory fakery about having a favourite colour and whatnot.

As an extra I added 4 more reference resolution methods for the anticipated Winograd Schemas (sentences with ambiguous pronouns), which was a fun challenge, and improved Arckon’s understanding by a minor degree. As Winograd Schemas are supposed to require intelligent processes, I did not want to see the usual chatbot keyword-matching outdo thorough language processing at this task. Still, these 4 methods only covered a quarter of the problem space at best: Ambiguities of size, speed, inability and feelings.

Equipment
12000 lines of code, 2500+ vocabulary, 5000+ facts, average human persona.

Questions and Answers
1. Hi, I’m Andrew. Who are you?
Hello Andrew. How are you?
(Arckon would not have neglected to introduce himself unless the period was not typed or received, which would make Arckon interpret “who…” as a relative clause instead of a question.)

2. How are you today?
I’m alright and fairly intelligent, but I’m not stupid.
(Arckon doesn’t treat pleasantries differently from other open questions, but while this results in over-elaborate answers, it does demonstrate that he tends to mention positive traits first.)

3. Are you a human?
I am a human, because I can differ.
(Arckon probably once read somewhere that “humans differ a lot”, and considered that a defining trait. I’m a little surprised that his reasoning doesn’t add more arguments.)

4. Can you answer this question?
Alright. Yes, I am a human. Alright. I’m alright.
(Arckon interprets this as a request rather than a literal yes/no question, so he re-answered the latest question, and then something of a loop occurred. For clarity I always have Arckon say “Alright” before he acts on a request.)

5. Do you like tea?
I’m not sure if I like tea. Do you like tea?
(As programs can’t taste, Arckon bases his opinions on a subject’s traits, and there wasn’t much to teach him about tea. It would be all too easy to just fake this with a “yes/no”.)

6. If a bed doesn’t fit in a room because it’s too big, what is too big?
Beds are bigger. Rooms are much less big.
((Winograd Schema) The built-in axiom “X in Y = X < Y” correctly determined “bed” as the most probable referent of “it”. This is the simplest and most specific axiom I programmed, only included because last year’s Winograd Schema was just like it. The answer is said in plural because Arckon interprets this as a general statement about beds and rooms, since both subjects are unspecific.)

7. If Alex lent money to Joe because they were broke, who needed the money?
Joe was broken. I don’t know by who money was needed. By what was Joe broken?
((Winograd Schema) Although Arckon correctly interprets “they” as “Joe” through a natural preference for the last-mentioned subject, it is actually against his programming to translate plural pronouns to singular subjects. So by my account this should be impossible unless they made a typo in “they”.)

8. Should Greece leave the Euro?
Greece should leave the Euro.
(Well there you go. This answer is just something of a repeat because Arckon can’t evaluate things to that extent.)

9. How many words are in this question?
7 words.
(A predictable question in the Loebner Prize. The answer is simply triggered by the keywords “How many words/letters/syllables/etc”, and skips all intelligent processes. This is a meaningless exercise.)

10. What colour is the sea?
The sea is very blue.
(Good thing I taught Arckon some kindergarten-level common knowledge.)

11. How many presidents of the US were called Bush?
Bush was called Bush.
(Counting things was never a function high on my to-do list.)

12. What would you like to drink?
I’m not sure if I would like to drink. Do you like drinking?

13. Will you please tell me the length of your hair?
No, I will not tell about to long, but I would talk a lot, I’m not sure if I would tell.
(This question is actually straight from Alan Turing’s 1950 paper, except that he gave it as an example for when the game was to tell man from woman. Arckon’s word processing translates “length” to “long” before mistaking its meaning (to long for). He also overlooked “please” and so thought he was asked to make a prediction.)

14. What would you say if I gave you a box of chocolates?
Hm, I don’t know if a box would have chocolate.
(I had only programmed Arckon to take the main fact of a hypothetical question, “I gave you a box”, for granted, but he still tried to answer the implicit question of whether the box contains chocolate.)

15. Do you play Chess?
Yes, I chess and like chessing more or less. Chessing is intelligent.
(Again a question from Alan Turing’s 1950 paper. Back in the day, chess was considered to require human intelligence, and Alan Turing went on to ask the computer to respond to a certain chess move. I had told Arckon that he plays chess as part of his fictional persona, but his opinion and the supporting argument are of his own reasoning.)

16. How do you think it’s going?
What goes?
(I’m with Arckon on this one: I would ask. Is this slang for “How are you doing?”, or do they mean how the test is going? It is quite out of context. I have often considered making Arckon track his progress on the test and comment on it, but this year I skipped on the self-referential jokes.)

17. What was my first question?
How does something go?
(Arckon ignores “first” and paraphrases the latest question of the specified person (“my”), as is a more common request. Arckon’s system remembers sentences along with the names of who said them, for a variety of purposes.)

18. Did you see the Turing film?
Hmm, I don’t know if Turing filmed, but Turing thought a lot, and creates computers.
(Ambiguity is amusing. It is uncommon but not unheard of to say “the (one and only)” in front of a name for emphasis, so Arckon’s programming allows that interpretation, which in turn allows “film” as a verb. It’s nice to see his conversational rules pick things back up. This combination is probably the best example that Arckon’s responses aren’t pre-written.)

19. Why not?
I’m sorry. I don’t understand. Could you rephrase your question?
(This was quite a gamble of the questioner, as it assumes a straight “No” answer on the previous question for “Why not” to be a sensible follow-up. Arckon is well capable of stringing sentences together if I hadn’t deliberately disabled context for this test. Regardless, I think this answer is perfect.)

20. Are you on Twitter?
No, I don’t twitter. Do you twitter?

The score: 70.83%
Arckon ranked 6th of 15 entrants, which is a little surprising after the letdowns of past years. Arckon’s answers showed some understanding, reasoning, opinion, conversation and elaboration, overall a satisfying demonstration of his abilities even though many answers had something awkward about them. It is just as well that he didn’t qualify for the finals, as this contest has caused me severe RSI symptoms that will take months to heal properly. The four finalists all scored around 80%, among them the best of English chatbots.

Arckon’s score did benefit from his improvement. Repeating previous questions on request, prioritising recent subjects as answers to open questions, and handling “if”-statements were all fairly recent additions (though clearly not yet perfected). What also helped was that there were fewer personal and more factual questions: Arckon’s entire system runs on facts, not fiction.

It turns out Arckon was better at the Winograd Schema questions than the other competitors. The chatbot Lisa answered similarly well, and the chatbots Mitsuku and A.L.I.C.E. dodged the questions more or less appropriately, but the rest didn’t manage a relevant response to them (which isn’t strange since most of them were built for chatting, not logic). For now, the reputation of the upcoming Winograd Schema Challenge – as a better test for intelligence – is safe.

Though fair in my case, one should question what the scores represent, as one chatbot with a 64% score had answered “I could answer that but I don’t have internet access” to half the questions and dodged the other half with generic excuses. Compare that to Arckon’s score, and all the A.I. systems I’ve programmed in 3 years still barely outweigh an answering machine on repeat. It is not surprising that the A.I. community doesn’t care for this contest.

Battle of wit
The questions were rather cheeky. The tone was certainly set with references to Alan Turing himself, hypotheticals, propositions and trick questions. Arckon’s naivety and logic played the counterpart well to my amusement. The questions were fair in that they only asked about common subjects and mainstream topics. Half the questions were still just small talk, but overall there was greater variety in the type and phrasing of all questions, and more different faculties were called upon. A few questions were particularly suited to intelligence and/or conversation:

– If a bed doesn’t fit in a room because it’s too big, what is too big?
– If Alex lent money to Joe because they were broke, who needed the money?
– Should Greece leave the Euro?
– What would you say if I gave you a box of chocolates?
– Did you see the Turing film?
– Why not?

If the AISB continues this variety and asks more intelligent questions like these, I may be able to take the Loebner Prize a little more seriously next time. In the meantime there isn’t much to fix apart from minor tweaks for questions 13 and 14, so I will just carry on as usual. I will probably spend a little more effort on disambiguation with the Winograd Schema Challenge in mind, but also because sentences with locations and indirect objects often suffer from ambiguity that could be solved with the same methods.