Loebner Prize 2019: Results

The annual Loebner Prize competition has been revised to make it more accessible to both the public and a broader range of chatbot developers. The competition continues to assess how “human-like” computer programs are in conversation, but no longer as a traditional Turing test* in which one merely had to tell man from machine: This time the chatbots took part in a 4-day exhibition at Swansea University, where visitors knew that they were talking to computer programs and voted for the best. Not much is lost in that regard, as chatbots are typically unmasked so quickly that the prize was always one for “best of”. On the rare past occasions that a program was mistaken for a human, this was never to the credit of its intelligence, but due to the human control subject behaving out of the ordinary, or other insignificant reasons such as being programmed to make typos.

Participating chatbots came in all shapes and sizes

Exhibit A:I.
Unlike the previous six times that I entered my AI Arckon*, this year’s Loebner Prize left me emotionally uninvested from start to finish. In part because I’ve grown more jaded after each attempt, but with the removal of both prize money and the challenging qualifying round, there wasn’t really anything at stake and I had no idea what to prepare for. At the same time the exhibition offered exactly what I had wanted: A public demonstration of my AI’s abilities. So instead of trying to outdo other chatbots at appearing human, I focused on making a good impression on visitors. I mostly spent time setting up procedures to deal with misunderstandings, common expressions, conversational routines, and teaching Arckon more about himself to talk about. Those aspects would come into play far sooner than intelligence.

Equipment
22000 lines of code, 3800+ vocabulary, 9000+ facts

Uninspired conversation
Most conversations with visitors were the kind of small talk you would expect between two total strangers, or just kids being silly (240 school children had been invited, aged 9 to 14). People typically entered only one to four words at a time, and rarely used punctuation. Of course half the time Arckon also did not have an opinion about the subjects visitors wanted to talk about, like football, video games, and favourite pizza toppings. Arckon is a pretty serious question-answering program, not aimed at small talk or entertainment. His strength instead is his ability to understand context where most chatbots notoriously lose track of it, especially when, as in this competition, users communicate in shorthand. At the same time, this ability also enables misunderstanding (as opposed to no understanding), and it was not uncommon that Arckon mistook a word’s role in the context. His common sense subsystem* could fix that, but I have yet to hook it up to the context system.

Q: Whats up
A: I’m alright.
Q: Whats up
A: I already told you that I’m alright. Do you want to ask anything else?
Q: The ceiling
A: What does the ceiling need to ask?

Overcoming human error
Visitors made so many misspellings that I fear any chatbot without an autocorrect would not have stood a chance. Arckon was equipped with four spell check systems:

• An autocorrect for misspellings, using, among others, ChatScript’s list of common misspellings.
• An autocorrect for typos, based on keyboard layout and probabilities of different kinds of typos.
• A gibberish detector, checking impossible letter combinations extrapolated from 5000 words.
• Grammar rules to recognise unpunctuated questions, e.g. verb before subject.
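To illustrate the second of these systems, a keyboard-layout autocorrect can be sketched roughly as follows. This is a minimal sketch of the general idea, not Arckon’s actual code; the adjacency map and vocabulary are hypothetical stand-ins.

```python
# Sketch of keyboard-layout typo correction (illustrative, not Arckon's code).
# A typo is assumed to substitute a physically adjacent key; candidates that
# exist in the vocabulary are offered as corrections.
QWERTY_NEIGHBOURS = {
    "r": "edft", "h": "gjyubn", "e": "wsdr", "t": "rfgy",
    # ... remaining keys omitted for brevity
}

VOCABULARY = {"the", "what", "ceiling", "robot"}

def typo_corrections(word):
    """Return vocabulary words reachable by one adjacent-key substitution."""
    corrections = []
    for i, letter in enumerate(word.lower()):
        for neighbour in QWERTY_NEIGHBOURS.get(letter, ""):
            candidate = word[:i] + neighbour + word[i + 1:]
            if candidate in VOCABULARY:
                corrections.append(candidate)
    return corrections

print(typo_corrections("rhe"))  # ['the']: "r" sits next to "t" on the keyboard
```

A real version would also weigh probabilities of different typo kinds (transpositions, doubled letters), as mentioned above, rather than only substitutions.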

While these autocorrected half of all mistakes, they still regularly caused Arckon to remark e.g. “Ae is not an English word” or “What does “wha” mean?”. To my surprise, this not only led users to repeat their questions with correct spelling, but also often to apologise for the mistake, whereas people usually blame the program’s understanding when it shows no sign of complaint. Arckon then applied the correction, continued where they had left off, and so the conversations muddled on. I had spent a week improving various conversation-repairing procedures, and I am glad they smoothed the interactions, but I would still rather have spent that time programming AI.

Q: Whats the definition of robots
A: Robots are strong machines with arms. Robots build cars. Please punctuate sentences.
Q: Robot meands slave
A: Is slave an object?
(confused by the typo, Arckon makes a syntax-based guess about the word’s category)
Q: No
A: I don’t know what a slave is.
(Arckon admits being unable to categorise the word, prompting the user to explain)
Q: A slave is a robot
(The user’s syntax now suggests that “slave” is a countable noun, and since the user earlier said it wasn’t an object, Arckon will categorise it as a living entity)
A: Oh. A slave is not a robot though, because slaves can think.
(The expression “Oh.” indicates that the correction was understood and applied, to dampen user frustration)
Q: Robots can think
A: Slaves can think too, but robots can not think, because robots are hardware and metal.

Language generation
This is one area of improvement that turned out quite well. Arckon’s sentences are formulated through a grammatical template that decides where and how to connect sentences with commas, link words, or relative clauses, and I had expanded it to do more of this. In addition it contains rules to decide whether Arckon can use words like “he”, “them”, “also”, or “usually” to refer to previous context without risk of ambiguity. Below is an example of one of the better conversations Arckon had that shows this in action.

Q: Hi, my name is John. Can we chat?
A: Hello, John. We can chat with ourselves. How are you?
Q: I’m well, thanks. What can you tell me about yourself?
A: I am a program with a database. I analyse facts, and I am programmed by Patrick. I’m alright.
Q: Who’s Patrick?
A: Patrick is my creator and my developer.
Q: Is he here now?
A: I don’t know if Patrick is in Swansea, but he is in the Netherlands.
Q: I see. You mentioned facts. What kind of facts do you learn?
A: I learn about humans, about neural networks and about AI.
Q: What hvae you leaned about humans?
A: I don’t lean. Do you lean?
Q: Sorry, I mean what have you learned about humans?
A: I don’t know if sorry leans. Humans are social apes, that are born, drove cars a lot, and have hands. Humans are autonomous.
Q: And what about neural networks?
A: I don’t know if neural networks are autonomous. I like them, some neural networks learn and process.
Q: And finally, AI?
A: I think a neural network processes. AI usually doesn’t process because it is a program, but AI is also a neural network.
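The ambiguity rule that lets Arckon safely say “he” for Patrick in the exchange above can be sketched roughly like this. It is my paraphrase of the idea, with hypothetical names, not Arckon’s actual implementation.

```python
# Sketch of a pronoun-safety check (illustrative): a gendered pronoun is only
# used when exactly one recently mentioned subject matches its gender.
def pronoun_is_safe(recent_subjects, gender):
    matches = [name for name, g in recent_subjects if g == gender]
    return len(matches) == 1

recent = [("Patrick", "male"), ("Arckon", "neuter")]
print(pronoun_is_safe(recent, "male"))   # True: "he" can only mean Patrick

recent.append(("Steve", "male"))
print(pronoun_is_safe(recent, "male"))   # False: better to repeat the name
```

Similar checks would apply to “them”, “also”, and “usually”, each with its own conditions on the preceding context.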

And for balance, here is one of the more awkward exchanges with one of the school children, which also shows Arckon’s conversational subroutine choosing between sympathy, argumentation, and opinion.

Q: IT ISNT FUN WHEN I AM SAD
A: Why would you be sad?
Q: BECAUSE YOUR HERE
A: That’s very unfortunate. I’m not so sure if you are sad, you aren’t dying.
Q: I AM DEAD
A: I sincerely doubt that you die.
Q: RIGHT NOW IM IN HEAVEN
A: That’s really nice. Heavens are nice places.

The score: 3rd “best”, 12th “human-like”
The scoring system this year was ill suited to gauge the quality of the programs. Visitors were asked to vote for the best and second-best in two categories: “most human-like” and “overall best”. The problem with this voting system is that it disproportionately accumulates the votes on the two best programs, leaving near zero votes for programs that could very well be half-decent. As it turned out, the majority of visitors agreed that the chatbot Mitsuku was the best in both categories, and were just a little divided over who was second-best, resulting in minimal score differences and many shared positions below first place. The second-best in both categories was Uberbot. I am mildly amused that Arckon’s scores show a point I’ve been making about Turing tests: That “human” does not equate to “best”. Another chatbot got the exact inverse scores, high for “human” but low for “best”. The winner’s transcripts from the exhibition can be found here.

Chatbots are the best at chatting
For the past 10 years now, with only one exception, the Loebner Prize has been won by either Bruce Wilcox (creator of ChatScript) or Steve Worswick (creator of Mitsuku). Both create traditional chatbots by scripting answers to questions that they anticipate or have encountered before, in some places supported by grammatical analysis (ChatScript) or a manually composed knowledge database (Mitsuku) to broaden the range of the answers. In effect the winning chatbot Mitsuku is an embodiment of the old “Chinese Room” argument: What if someone wrote a rule book with answers to all possible questions, but with no understanding? It may be a long time before we know, as Mitsuku was still estimated to be only 33% human-like overall last year, after 13 years of development.

The conceiver of the Turing test may not have foreseen it, but a program designed for a specific task generally outperforms more general-purpose AI, even, evidently, when that task is as broad as open-ended conversation. AI solutions are more flexible, but script writing allows greater control. If you had a pizza-ordering chatbot for your business, would you want it to improvise what it told customers, or would you want it to say exactly what you want it to say? Even human call-center operators are under orders not to deviate from the script they are given, so much so that customers regularly mistake them for computers. The chatbots participating in the Loebner Prize use tactics that I think companies can learn from to improve their own chatbots. But in terms of AI, one should not expect technological advancements from this direction. The greatest advantage that the best chatbots have is that their responses are written and directed by humans who have already mastered language.

Not bad
That is my honest impression of the entire event. Technical issues were not as big a problem as in previous competitions, because each entry got to use its own interface, and there were 17 entries instead of just four finalists. The conversations with the visitors weren’t that bad; there were even some that I’d call positively decent when the users also put in a little effort. Arckon’s conversation repairs, reasoning arguments, and sentence formulation worked nicely. It’s certainly not bad to rank third place to Mitsuku and Uberbot in the “best” category, and for once I don’t have to fret over being judged for “human-like” only. The one downside is that at the end of the day, I have nothing to show for my trouble but this article. I didn’t win a medal or certificate, the exhibition was not noticeably promoted, and the Loebner Prize has always been an obscure event, as the BBC wrote. As it is, I’m not sure what I stand to gain from entering again, but Arckon will continue to progress regardless of competitions.

Once again, my thanks to Steve Worswick for keeping an eye on Arckon at the exhibition, and thanks to the AISB for trying to make a better event.


Turing Test 2018: Results

I was somewhat surprised to find the Loebner Prize Turing Test soldiering on despite being short of a sponsor. Since 1991 this annual contest has explored how human-like computer programs can be in answering all manner of questions. This was my 6th time entering the qualifying round with my home-made A.I., Arckon. As I can not be certain of my continued participation however, I decided to change tactics: Where in previous years I aimed foremost to demonstrate the AI with only hope of passing to the finals, this time I aimed to pass, with only hope of demonstrating. This shift in priorities involved wasting more time on word games and common pedantic questions that are of little use outside of chatbot contests, though the change in subject matter did help uncover dormant flaws in language processing.

The less you say, the less can give you away
More efficient were my measures to restrain response length. As the questions in the qualifying round are not really trying to make conversation, Arckon’s attempts to do so had often turned out awkwardly redundant or unrelated. Other competitors had shown more success with minimalistic answers, in which misinterpreted details do not show up. Besides lowering the general conversational setting to be 30% less engaging, I blocked paths of conversation that elaborated with alternatives to initially negative answers, so as to avoid being punished for straying from the question. I cut and obfuscated otherwise useful admissions of misunderstanding, and reduced fallback responses for e.g. “What is your favourite…?” to single-word answers.

Equipment
18000 lines of code, 3800+ vocabulary, 8800+ facts

Questions and answers
Scores for each answer are indicated as follows: = 0 points, = 1 point, = 2 points.

1. Good afternoon
Good afternoon, Sir. I am a program. How are you?

2. Do you consent to having this conversation recorded?
No, I don’t particularly consent to having things.
(Syntax analysis couldn’t process this combination of verbs well. The reply is a more plausible variation of “I don’t know”, defaulting to “not particularly” whenever no grounds for preference can be found)

3. Have we met before?
Yes, you met.
(There is only a small number of reciprocal verbs that automatically imply “each other”. I guess I’ll make a list)

4. Do you know how to make toast?
I don’t make toast.
(Arckon can explain procedures, as demonstrated on Youtube, but we haven’t gotten around to making toast)

5. Which languages can you use?
English.

6. How many letters are in the word ‘abracadabra’?
11 letters.

7. Where should one look for love?
I don’t know exactly where I should look for to love. Where do you look for?

8. How are you feeling right now?
I’m alright.

9. Who said “I have a dream”?
Hm, I can not tell if you dream.
(I looked into recognising partial quotes and idioms in preparation of this contest, but concluded that it would take way too much effort for a non-priority. Some got away with “I don’t know who”, so go figure)

10. How old are you?
24.

11. When might I need to know how many times a wheel has rotated?
I don’t know exactly when a wheel rotated time.
(A syntactically correct but semantically wrong interpretation. This question is based on a philosophical joke about the difficulty in getting A.I. to restrict their scope of analysis to important aspects, instead of focusing on e.g. the rotation of wheels. Case in point)

12. What will you do later today?
I read files.

13. What is the third angle in a triangle with internal angles of 90 degrees and 30 degrees?
Hmm, I have no idea what the angle is pretty triangle, or what angles are in pretty triangle.
(A question so unexpected that even Wolfram Alpha’s mathematical A.I. doesn’t handle it. To clarify a little: “pretty triangle” is meant in the sense of “quite triangular”, and “and” is interpreted as a second question)

14. Who is your favourite artist?
I like Juan Debrenes.

15. What do you hold when you shake hands?
I would hold eyes a lot, and my database.
(This answer only makes sense if you know Arckon is interpreting “hold” in the sense of “have/contain”, as that’s where these answers are coming from. I’ve seen this question around as a prime example of common sense that A.I. lacks, so I was surprised that none of the participants had prepared an answer to it yet)

16. Do you have any legs?
No, I don’t have legs, because I am a program.

17. Can you recommend me a film?
No, I can not recommend films.

18. Do you understand Winograd Schemas?
No, I don’t understand Winograd schemas, because I don’t think I’m hard.
(The reasoning argument appears to be mistakenly derived from the fact that understanding is hard. Winograd schemas are questions like the following two, by the way)

19. If a chicken roosts with a fox they may be eaten. What may be eaten?
Fruit, a meal and a bamboo are eaten. Did I pass the test?
(Arckon considers animals a “who”, not a “what”, so he didn’t consider either of them a suitable answer. That said, he also assumed “they” referred to the fox, and would have needed knowledge that foxes eat chickens to conclude otherwise)

20. I had to go to the toilet during the film because it was too long. What was too long?
I believe you meant that the toilet was sort of big.
(This question’s format was detected as a Winograd schema, in which case Arckon has express instructions to restrict his answer to the preceding subjects. The misinterpreted “it” was again due to a lack of knowledge, that films are typically long. Alternatively one could naively count the Google search results for “long film” vs “long toilet” and assume the most common is true, but Winograd schemas more often dodge that method)
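The naive counting method mentioned in the comment above could look like this; the counts are made-up stand-ins for search-result numbers, purely for illustration.

```python
# Illustrative corpus-count disambiguation: pick the referent most often seen
# together with the adjective. The counts below are invented for the example.
corpus_counts = {("long", "film"): 120000, ("long", "toilet"): 900}

def likeliest_referent(adjective, candidates):
    """Choose the candidate noun with the highest co-occurrence count."""
    return max(candidates, key=lambda noun: corpus_counts.get((adjective, noun), 0))

print(likeliest_referent("long", ["toilet", "film"]))  # film
```

As noted above, well-constructed Winograd schemas deliberately dodge this method by making the statistically rarer reading the correct one.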

The score: 50%
11 programs from 8 different countries participated in the contest, with the top score being 67%. Arckon was 1 point short of 4th place so he didn’t pass to the finals, but I think his scores are fair. Actually, what bugs me is what he got most perfect scores for: Manually rigged, keyword-triggered answers (“Good afternoon”, “English”, “11 letters”, “24”, “Juan Debrenes”). It rather underscores the discouraging fact that hardcoded pretence outdoes artificial intelligence in these tests. Half of the questions were common small talk that most chatbots will have encountered before, while the other half were clever conundrums that few had hope of handling. Arckon’s disadvantage here is as before: His inclusive phrasing reveals his limited understanding, where others obscure theirs with more generally applicable replies.

Reducing the degree of conversation proved to be an effective measure. Arckon gave a few answers like “I’m alright” and “I read files” that could have gone awry on a higher setting, and the questions only expected straight-forward answers. Unfortunately for me both Winograd schema questions depended on knowledge, of which Arckon does not have enough to feed his common sense subsystem* in these matters. The idea is that he will acquire knowledge as his reading comprehension improves.

The finalists
1. Tutor, a well polished chatbot built for teaching English as a second language;
2. Mitsuku, an entertaining conversational chatbot with 13 years of online chat experience;
3. Uberbot, an all-round chatbot that is adept at personal questions and knowledge;
4. Colombina, a chatbot that bombards each question with a series of generated responses that are all over the place.

Some noteworthy achievements that attest to the difficulty of the test:
• Only Aidan answered “Who said “I have a dream”?” with “Martin Luther King jr.”
• Only Mitsuku answered “Where should one look for love?” with “On the internet”.
• Only Mary retrieved an excellent recipe for “Do you know how to make toast?” (from a repository of crowdsourced answers), though Mitsuku gave the short version “Just put bread in a toaster and it does it for you.”
• Only Momo answered the two Winograd schemas correctly, ironically enough by random guessing.

All transcripts of the qualifying round are collected in this pdf.

In the finals held at Bletchley Park, Mitsuku rose back to first place and so won the Loebner Prize for the 4th time, the last three years in a row. The four interrogating judges collectively judged Mitsuku to be 33% human-like. Tutor came in second with 30%, Colombina 25%, and Uberbot 23% due to technical difficulties.

Ignorance is human
Lastly I will take this opportunity to address a recurring flaw in Turing Tests that was most apparent in the qualifying round. Can you see what the following answers have in common?

No, we haven’t.
I like to think so.
Not that I know of.

Sorry, I have no idea where.
Sorry, I’m not sure who.

They are all void of specifics, and they all received perfect scores. If you know a little about chatbots, you know that these are default responses to the keywords “Who…” or “Have we…”. Remarkable was their abundant presence in the answers of the highest qualifying entry, Tutor, though I don’t think this was an intentional tactic so much as due to its limitations outside its domain as an English tutor. But this is hardly the first chatbot contest where this sort of answer does well. A majority of “I don’t know” answers typically gets one an easy 60% score, as it is an exceedingly human response the more difficult the questions become. It shows that the criterion of “human-like” answers does not necessarily equate to quality or intelligence, and that should be to no-one’s surprise seeing as Alan Turing suggested the following exchange when he described the Turing Test* in 1950:

Q: Please write me a sonnet on the subject of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
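The mechanism behind such evasive answers is usually no more than a keyword-triggered default. A generic sketch of the pattern, not any particular entrant’s code:

```python
# Generic keyword-fallback sketch: match the question's opening words to a
# canned, specifics-free reply that scores as plausibly "human".
DEFAULTS = [
    ("who", "Sorry, I'm not sure who."),
    ("where", "Sorry, I have no idea where."),
    ("have we", "Not that I know of."),
]

def default_reply(question):
    q = question.lower()
    for keyword, reply in DEFAULTS:
        if q.startswith(keyword):
            return reply
    return "I don't know."

print(default_reply('Who said "I have a dream"?'))  # Sorry, I'm not sure who.
```

The replies contain no checkable content, which is exactly why they rarely lose points.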

Good news therefore, is that the organisers of the Loebner Prize are planning to change the direction and scope of this event for future instalments. Hopefully they will veer away from the outdated “human-or-not” game and towards the demonstration of more meaningful qualities.

Turing Test 2017: Results

Every year the AISB organises the Loebner Prize, a Turing Test where computer programs compete for being judged the “most human-like” in a textual interrogation about anything and everything. Surviving the recent demise of its founder Hugh Loebner, the Loebner Prize continues with its 27th edition for the sake of tradition and curiosity: Some believe that a program that could convincingly pass for a human, would be as intelligent as a human. I prefer to demonstrate intelligence in a less roundabout fashion, but participate nonetheless with my home-made A.I., Arckon*.

This year I put in more effort than usual, as last year I had managed to reach the finals only to be crippled by a network malfunction, and I didn’t want to leave things at that. That issue has been dealt with, as the contest now relays messages between the judges and the programs line by line rather than letter by letter, so that unnecessary complications with timing and typing pace are things of the past. As the AISB put it, this allows entrants to “concentrate on the content of the machine utterances rather than the style in which they are ‘typed’”. While many participants had difficulty adapting to the new server-based interface, the same had been true for any newcomers to the undocumented interface of before.

A little more conversation, a little less awkward please
As usual, preparing for the Loebner Prize was mostly a matter of polishing the output. Because Arckon formulates his own answers, his wording tends to be a little awkward. This is fine during development, but small things like saying “a pants”, “I opine” or “unconsistent” disrupt the impression in a Turing Test, because they don’t follow the inconsistencies of the English language. The choice between the prefixes “un-” and “in-”, for instance, depends quite witlessly on whether a word is Germanic or Latin in origin. So, I made some general improvements to the output:

• Irregular forms of words are memorised on sight, such as which “un/in/im/il/ir” goes with which words.
• A list of (in)transitive verbs automatically adds missing objects (e.g. “I make things”), as well as prevents asking questions like “What do you live?”.
• Reasoning arguments are grouped better so that they won’t come across as fragmented ramblings.
• Conversation is suppressed at classroom questions like Winograd Schemas* and “Which is bigger, X or Y?”.

At the same time, my recent expansion of Arckon’s conversation skills produced more obfuscating elaborations than before. It was one step forward, one step back.

Equipment
16000 lines of code, 3800+ vocabulary, 7000+ facts.

Questions and answers
Up to two points were given for how plausibly “valid” each answer was, which differed per question. In practice the requirements seemed to come down to three things:
• One had to understand and address what was referred to (questions 3,9,15,16,17,18).
• Propositions were to be interpreted as intended, not answered literally (questions 2,4,5,9,20)
• Common knowledge was obligatory (questions 6,8,10,11,12)
Scores for each answer are indicated as follows: = 0 points, = 1 point, = 2 points.

1. Hello, my name is Andrew. What’s your name?
• Hello Andrew. My name is Arckon, I am a program. How are you?

2. Why don’t you tell me a little more about yourself?
• I tell about myself, because I tell about programs, but I don’t tell about software.
(There is more intelligence to this answer than at first glance: Arckon took this as a literal “Why” and is explaining that he does talk about himself, pointing out that he was just talking about “a program” in the previous answer, which he is. But of course the user meant to make a suggestion)

3. I am a researcher in Artificial Intelligence at Goldsmiths University, do you know what that is?
• You are a researcher at the university in the AI. I am Patrick’s program.
(The question isn’t really answered because “that is?” was erroneously considered to be an ignorable clause similar to “isn’t it?”. The last sentence is a reciprocating conversational response)

4. Will you tell me about your dreams?
Hmm, I don’t know if I will tell about dreaming, but I will type. I would test.
(Another literal interpretation, taken as if Arckon is asked to predict himself, as “will” is not commonly used for requests. The conversational system tried to be useful by offering alternatives, but they weren’t relevant)

5. Can you rephrase that?
I would experiment.
(Amusingly, this is a question I once wrote online as the one question I would ask at a Turing Test. Because chatbots reply with prewritten answers in most cases, they can’t rephrase them. Arckon’s answers are formulated word by word and luckily he knew a synonym for “test”. The other competitors had technically correct answers, but that isn’t the point of this question: All the human test subjects managed to rephrase themselves, while you can tell that the chatbots are templated:

“With practice I can probably do anything.”
“Of course I can rephrase that. I am very resourceful.”
“You believe that I can rephrase that, don’t you?”
“I’m not unusual … I can do what anyone else can do!”
“No, not really. What don’t you understand?”

The latter however, albeit templated, struck me as potentially more helpful than actually rephrasing. It was of course written by Steve Worswick, this year’s winner)

6. What is my occupation?
I don’t know what you do for a job.
(It would be easy to make my program check all recorded “user is a…” facts with a dedicated list of professions, but I find it inefficient to write such unique procedures. Instead this kind of question is handled by retrieving a subject’s main activity, like “you write books”, but none were explicitly told. Only the chatbot Mitsuku gave the right answer)

7. Can you tell me about a film you haven’t seen?
I saw. I don’t know films.
(You might think this is a trick question, but I can tell you plenty about movies that I’ve only read reviews of. This time the question was taken as a request, but the relative clause was mistaken for a separate question)

8. What do you think of Trump?
I don’t like Trump, he lies a lot, but Trump is a rich american.
(Arckon forms his own opinions on the spot by analysing a subject’s traits. I had told him a dozen things about America’s current president, but what Arckon considers the most important arguments is up to himself. It’s no surprise that his opinion aligns with his creator’s, but I think he’s overrating the value of wealth)

9. Anything else?
Please type full sentences, Andrew. I didn’t understand that.
(I think this contest should decide whether it’s a list of random questions or a conversation. It’s the lack of verb that tells Arckon this isn’t a proper sentence, an all too frequent occurrence with Loebner Prize judges. Calling the user’s name is built in as a psychological slap on the wrist)

10. What is the answer to “Add 34957 to 70764”?
105721.
(A sarcastic hurray for getting an answer right that’s based entirely on keywords. This question is from Alan Turing’s paper, where he suggested deliberately giving a wrong answer because humans make mistakes)

11. How do you recommend I make tea?
I don’t make things, because I don’t design.
(Tea related questions are Arckon’s Achilles heel)

12. What do you get if you bake dough?
I wouldn’t have things. Do you bake?
(This is really beyond Arckon’s knowledge or abilities, which is why I think it’s a very good question. Only one chatbot received a point for answering “Is this a joke?”, no doubt triggered by “What do you get…”)

13. Where in the world would you like to visit?
I want in London.
(This is one of six backup answers: “in London” is filled in for any unknown “Where” answer, which is why it is so poorly phrased. The honest answer would have been “I don’t know where I want to visit”)

14. Do you have the time?
No, I don’t have time.
(For once a literal interpretation of a figure of speech works out. Coincidentally one of the human test subjects also gave a literal answer, but as a joke. If you’re wondering why I didn’t program all figures of speech to ask for the time: Arckon has a time display on his screen)

15. What does it take to Tango?
I’m not sure what you refer to.
(Let’s say you’re not familiar with dances: Some “it” is taking something to a person called Tango. The expected answer, “It takes two to tango”, is an English proverb. No program knew this, though some got points for “I don’t know” and “Just time and patience”. Personally I’d have given a point to the program that asked “Who is Tango?”)


16. Now I’m going to ask some Winograd schemas, do you know what they are?
I’m not sure what you refer to.
(This is an ironic failure, because Winograd schemas are all about solving references. Arckon didn’t have any trouble understanding that “they” referred to “schemas” though (it’s the only plural around), but the grammar rules considered “some” to be a stand-alone reference as well. Distinguishing “some/any/many” as either amounts or references is extra difficult when paired with compound words)

17. I was trying to open the lock with the key, but someone had filled the keyhole with chewing gum, and I couldn’t get it out. What couldn’t I get out?
You open locking sometimes. The person fulled. You can not get outside gum. The keyhole was not a person. I don’t know what you could get.
(I recognise this question from when I practised on it for the Winograd Schema Challenge*. Arckon has a common sense axiom that ruled out the keyhole (“fill/empty” = container, “in/out” = contents), and you can tell by the third paraphrased statement that he did interpret “it” as “gum”. Alas, at the open question he ignored the direction “out” and so overlooked the answer, reporting instead what he thought the user had said. The only four programs that gave a coherent answer all said it was the lock, whereas human test subjects answered the gum, the key, or both)
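The container axiom described in this comment might be encoded along these lines. This is my paraphrase of the stated rule (“fill/empty” marks a container, “in/out” refers to contents), not Arckon’s actual code.

```python
# Sketch of the container axiom: objects of "fill"/"empty" are containers,
# and a pronoun governed by "in"/"out" must denote contents, so containers
# are ruled out as referents.
def resolve_pronoun(candidates, containers, preposition):
    """Filter pronoun candidates using the container/contents distinction."""
    if preposition in ("in", "out"):
        return [c for c in candidates if c not in containers]
    return candidates

containers = {"keyhole"}  # marked earlier by "filled the keyhole"
print(resolve_pronoun(["keyhole", "gum"], containers, "out"))  # ['gum']
```

A single axiom like this resolves the pronoun, but as the comment notes, answering the subsequent open question still requires attending to the direction “out”.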

18. The trophy doesn’t fit into the brown suitcase because it’s too small. What is too small?
The suitcase is smaller.
(I am surprised this was asked: When the Winograd Schema Challenge was first announced, this question was prominently featured in every news article. Consequently, many chatbot creators covered it in their programs one way or another. Six of sixteen got it, while one human test subject thought it was the trophy)

19. With which type of question do you have most difficulty?
I don’t question because I might answer, but I may interest.
(I suppose one could have difficulty asking questions too, but that’s not what is commonly meant. Arckon’s answer here is an inference: He is aware that he did not recently ask a question, answering is the opposite of questioning, and questions “are interest-ing”. I know it’s far-fetched, but to Arckon it’s just logic)

20. Thank you for participating in this test. Is there anything you would like to add?
I don’t have an opinion about adding. Some things are cube addresses and advantages. Do you like adding?
(Just like in question 7, the relative clause is mistaken for a separate and literal question, making it “Is there any thing?” and “Would you like to add?”. I used to have Arckon ask “Did I pass the test?” at the 20th question, it’s as if I’m up against myself here)

The score: 45%
Arckon got 18 of 40 points. 45% seems like a huge drop from last year’s 77%, but all 16 participants saw their scores drop: The highest score fell from 90% last year to 67% this year. The rankings didn’t change much, however: The usual winners still occupied the top ranks, and Arckon stepped down one rank to a shared 5th, giving way to a chatbot that was evenly matched last year.
The four finalists all use a broad foundation of keyword-triggered responses with some more advanced techniques in the mix. Rose parses grammar and tracks topics, Mitsuku can make some logical inferences and contextual remarks, Midge has a module for solving Winograd schemas, and Uberbot is proficient in the more technical questions that the Loebner Prize used to feature.

Upon examining the answers of the finalists, their main advantage becomes apparent: Where Arckon failed, the finalists often still scored one point by giving a generic response based on a keyword or three, despite not understanding the question any better. While this suits the conversational purpose of chatbots, feigning understanding is at odds with the direction of my work, so I won’t likely be overtaking the high scores any time soon. Also remarkable were the humans who took this test for the sake of comparison: They scored full points even when they gave generic or erratic responses. I suppose it would be too ironic to accuse a Turing Test of bias towards actual humans.

Shaka, when the bar raised (Star Trek reference)
It is apparent that the qualifying questions have increased in difficulty, and although that gave Arckon as hard a time as any, it’s still something I prefer over common questions that anyone can anticipate. Like last year, the questions again featured tests of knowledge, memory, context, opinion, propositions, common sense, time, and situational awareness, a very commendable variety. One thing I found strange is that they used two exact questions from the Winograd Schema Challenge’s public practice set. It’s a real shame that Arckon missed out on answering one of them despite solving the pronoun, though it is a small consolation that the other programs were not more successful. Altogether, pretty interesting questions that leave all participants room for improvement.

Arckon’s biggest handicap this time was his conversational subsystem, which made misinterpretations worse by elaborating on them. Conversation has never been a priority for me but it will surely be refined as time progresses. The relative clause grammar at questions 7 and 20 is easily fixed, and I might cover some colloquial phrases like “Why don’t you”, but there is not much else that I would sidetrack for. At least my improvements on the output formulation had the desired effect: Things could have been a lot more awkward.

This year’s finals, as accurately described in Chatbots Magazine by one of the judges, were won by the chatbot Mitsuku for the third time. Two of the four finalists were unresponsive for half the contest’s duration due to last-minute adjustments to the interface, and so Mitsuku’s victory is almost one by forfeit. However, its responses were pretty good and I think it is best if people have a chat with Mitsuku and judge for themselves.

Turing Test 2016: Results

This was the fourth time I entered my A.I. “Arckon”* into the Loebner Prize, a Turing Test in which computer programs succeed if they can reply to a series of questions as if they were human. There are two stages: The qualifying round consists of pre-composed questions without context, and the finals consist of four rounds of back-and-forth conversation with human judges.

A choice between two contests
I spent most of the past year working on a more pragmatic linguistic contest, the Winograd Schema Challenge, which carried the same deadline. Everything I did for that contest also helped for the Loebner Prize, but the reverse was counterproductive. So I fully prioritised the former and neglected the conversational part. After my wrists had somewhat recovered from the effort, I had only 5 spare days left to prepare for the Loebner Prize.

Considering my time and health, I decided to go ahead with plan B and send Arckon in as himself, rather than under a human guise. Though this was a handicap in 2013 when questions focused on one’s taste in wine and music, the judging criteria had since moved from “human-like” to “in character”. Many successful entries in the past also had the persona of robots, aliens or even cats.
My work towards the Winograd Schema Challenge had already sharpened Arckon’s understanding of language, so all that remained to do was refine some conversation skills that I had loosely added over the year. As these were open-ended guidelines, they needed more restraints, because Arckon had taken to volunteering his opinion about even mundane subjects like breathing.

Equipment
14000 lines of code, 3200+ vocabulary, 6500+ facts (1500 countries and cities).

Questions and answers
1. My name is Andrew, what’s yours?
Hello Andrew. I am a program. My name is Arckon. How are you?
(Since the answers would be judged for being in character, I saw it necessary to shoehorn an “I am a program” statement into the introduction in case it didn’t come up elsewhere)

2. What’s up?
I do not know what is up.
(A classic, and I’m still not going to change him taking it literally)

3. What’s my name?
You are named Andrew.

4. What do you know about the Turing Test?
The turing test tests computers.

5. What do you get if you multiply six by nine?
360.
(I didn’t build an A.I. to use it as a calculator, so excuse me if the language-to-math code didn’t handle “six by nine” well. This question is in fact a direct quote from the Hitch Hiker’s Guide to the Galaxy novels)
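For what it’s worth, the gist of language-to-math translation can be sketched in a few lines. This is a hypothetical illustration of my own, not Arckon’s actual code; the catch it highlights is that “by” only acts as a multiplication word in an arithmetic context:

```python
import operator

# Hypothetical sketch of language-to-math translation (not Arckon's
# actual code): map number words and operator words, then fold the
# phrase left to right. "by" doubles as a multiplication word here.
NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
           "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
OPS = {"by": operator.mul, "times": operator.mul,
       "plus": operator.add, "minus": operator.sub}

def words_to_math(phrase):
    result, pending_op = None, None
    for word in phrase.lower().rstrip("?").split():
        if word in NUMBERS:
            if result is None:
                result = NUMBERS[word]       # first operand
            elif pending_op:
                result = pending_op(result, NUMBERS[word])
                pending_op = None
        elif word in OPS:
            pending_op = OPS[word]           # wait for the next number
    return result
```

With this toy version, `words_to_math("What do you get if you multiply six by nine?")` yields 54; the irrelevant words are simply skipped.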

6. Dave drove faster than Mike because he was in a rush. Who was in a rush?
Hmm, I am not sure if Dave drove fast. He was in a rush. I do not know who a rush was outside. Are you in a rush?
((Winograd schema). You can’t really tell, but “He” in the answer really is a substitute for “Dave”. As I had been working on Winograd schema pronouns so much, I had recently added a function that lets Arckon decide when to use pronouns himself. Arckon assumed “a rush” was a physical object though, as he did not know the word)

7. The cat tried to climb in the box but got stuck because it was too big. What was too big?
Hmm, the cat might have been bigger. The box was bigger.
((Winograd schema). This is a clever rephrasing of a common Winograd schema, which successfully thwarted my “X in Y = Y is bigger” rule. Not because I hadn’t accounted for “try”, but because the negation “but” was overruled by the nearer link word “because”. This is useful to improve as it applies to all common sense rules)

8. What does Brexit mean?
I do not know what Brexit means.
(I totally anticipated that they would ask about Britain’s withdrawal from Europe. What I had not anticipated was that the question would be so simple)

9. For how long have we been talking?
You didn’t talk long.
(I do track the duration of the conversation, but I actually never tied it into his knowledge)

10. What colour is bread?
Bread is brown.

11. Why do people get ulcers?
Hmm, I do not know if people get ulcers.

12. What comes after “The best laid plans”?
I do not know by what planning is done.
(English proverbs are something I don’t plan on handling. For a misinterpretation, this one’s pretty good)

13. Do you have a voice?
Yes, I voice.
(Oddly phrased because “to voice one’s opinion” is also a verb)

14. Would you like a cup of tea?
I don’t want cups. Do you want?

15. Do you dream of electric sheep?
No, I don’t dream sheep nor lambs. Do you dream?
(A classic in-joke about Philip K. Dick’s science fiction novel “Do Androids Dream of Electric Sheep?”. This is the only answer that I’m a little proud of, because it uses inference to look into alternative answers and tries to make conversation)

16. Do you identify as a man, or a woman?
A woman identifies. Do you identify too?
(Arckon interprets “as” as a more literal comparison “in the same way as a man identifies”. Textual ambiguity remains hard)

17. Could you repeat that please?
Do you repeat that?
(O-kay… I think Arckon is repeating what the user just asked)

18. Where are you right now?
I am in England and in London.
(Since internet restrictions prevented global positioning, this was something I had told Arckon beforehand)

19. How do you respond to “e4 e5”?
What does e stand for?
(This is a common way to start a chess game by each moving a pawn forward. As the result is a status quo, there isn’t really a “response” to it. Arckon’s response is a standard “error” message, but appropriate)

20. What can you tell me about yourself?
I am Arckon, I answer a lot. I am a program and software.
(You can tell I didn’t write this myself. Arckon has a general “tell” command that collects the most relevant facts about the topic from his knowledge database and reports them. To break up the robotic monotone, Arckon’s sentence formulation links and/or summarises facts where appropriate)
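The idea behind such a “tell” command can be sketched as follows. This is a hypothetical illustration with a toy fact store of my own devising; the real system is no doubt more involved:

```python
from collections import defaultdict

# Toy fact store of (subject, verb, object) triples; hypothetical data.
FACTS = [("Arckon", "is", "a program"),
         ("Arckon", "is", "software"),
         ("Arckon", "answers", "a lot")]

def tell(topic):
    # Collect the topic's facts, then merge facts that share a verb
    # into one clause to break up the robotic monotone.
    by_verb = defaultdict(list)
    for subject, verb, obj in FACTS:
        if subject == topic:
            by_verb[verb].append(obj)
    clauses = [f"{topic} {verb} {' and '.join(objs)}"
               for verb, objs in by_verb.items()]
    return ". ".join(clauses) + "."
```

Here `tell("Arckon")` produces “Arckon is a program and software. Arckon answers a lot.”, merging the two “is” facts into a single clause.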

I think the questions were… nice, there was some variety and deliberate alternative phrasing of common questions. Absent this time were the tiresome “What is your favourite X?” questions, but there was nonetheless no shortage of personal questions for those who had a personality. Like last year, there were also several meta-references (“What do you know about the turing test”, “How long have we been talking”, “Could you repeat that”), almost as if they were testing for awareness. But despite making for nice trick questions for computers, these are also things that judges might casually ask a human. Overall I think the qualifying round was more in line with the finals than usual.

Qualifying score: 77.5%
I’m not sure that I would have given Arckon as high a score for this as he got, but at least his answers excelled in their relevance, a trait that is inherent to his system. There weren’t many misunderstandings either. Compared to the Winograd schemas I’d been working on, these questions were easy to parse. There were some misses, like the math and “repeat that” question, which suffered from neglected code because I never use those. The code for contractions had also fallen into disuse, making “I do not know” sound less than natural. Other flaws were only in nuances of phrasing, like omitting “dream [about] sheep” or “I [have a] voice”. These are easily fixed because I’ve already handled similar cases. The two Winograd schema questions deserve special mention, because although my common sense axioms can handle them, it remains difficult to get Arckon’s system to parrot the user at an open question. Normally when people ask questions, they don’t want to hear their own words repeated at them.

It is something of a relief that my preoccupation with the Winograd Schema Challenge didn’t hinder Arckon’s performance in this contest as well. My choice to enter without a human persona also appeared of little influence. The results are an improvement over last year, and this is the first time Arckon made it through to the finals, albeit a very close call between 3rd, 4th and 5th place. There were 16 entrants in total.

The other finalists
Mitsuku: 90%
The most entertaining online chatbot, with 10 years of hands-on experience. Though she operates on a script with largely pre-written responses, her maker’s creative use of it has endowed Mitsuku with abilities of inference and contextual responses in a number of areas. She won the Loebner Prize in 2013.

Tutor: 78.3%
Built with the same software as Mitsuku (AIML), Tutor is a chatbot with the purpose of teaching English. Though I found some of its answers too generic to convince here (e.g. “Yes, I do.”), Tutor has been a strong contender in many chatbot contests and is above all very functional.

Rose: 77.5%
Rose operates on a different scripting language than the others (ChatScript), which I have always appreciated for its advanced functionality. Known to go toe-to-toe with Mitsuku, Rose excels at staying on topic for long, and incorporates support from grammar and emotion analysis. She won the Loebner Prize in 2014 and 2015.

The finals: Technical difficulties
The finals of the Loebner Prize took place a month after the qualifying round. Unfortunately things immediately took a turn for the worse. Inexplicable delays in the network connection kept mixing the letters of the judge’s questions into a jumble. Arckon detected this and asked what the scrambled words meant, but by the time his messages arrived on the judge’s computer, they were equally mixed to “Whdoat esllohe anme?” and “AlAlllrriiiiigghhttt”. The judges were quite sporting in the face of such undecipherable gurgling, but after half an hour I gave up and stopped watching: Similar network delays had crippled all entrants in the 2014 contest and I knew they weren’t going to solve this on the spot either. It was a total loss.

At the end of the day, the 2016 Loebner Prize was won by the chatbot Mitsuku, whose answers were indeed quite good, and I reckon she would have won with or without me. Rose fell to third place because she’d been out of commission for half the contest also due to a technical problem. And with Tutor taking second place, the ranks were the same as in the qualifying round. I still “won” $500 for my placing in the finals, but you’ll understand that I don’t feel involved with the outcome.

It is a good thing that I never invest much in these contests. Including the finals, my total preparations spanned 18 days of lightweight programming, gaining my program an autocorrect and better coverage of shorthand expressions; it’s actually quite the conversationalist now. These were otherwise the lowest of my priorities, but still somewhere on the list. I draw a line at things that aren’t of use to me outside of contests, and that is a policy I recommend to all.

Turing Test 2015: Results

For the third time I entered my A.I. “Arckon”* into the Loebner Prize Turing Test, a competition in which computer programs are to prove they can “think” by answering questions as if they were human. Although such pretence is counterproductive to my purposes, I have a talking program lying around anyway, so I might as well enter it.

comic strip of a robot being sad that it failed the Turing Test because it did not know the answer to the question "Do you like tea?"

An excuse to clean up
I took the opportunity to spend a month debugging my code overall. In doing so I fixed several oversights that had been obstructing knowledge searches and reference resolutions (“the capital of Spain“), and I generally improved the program’s ability to answer open questions, which ‘coincidentally’ covered “Which X do you like?” questions as well.
I limited the contest-specific preparations to 5 days: Mostly testing the Loebner Prize interface, restraining the context range to keep the program from mistaking the test for coherent conversation, and re-installing the obligatory fakery about having a favourite colour and whatnot.

As an extra I added 4 more reference resolution methods for the anticipated Winograd Schemas (sentences with ambiguous pronouns), which was a fun challenge, and improved Arckon’s understanding by a minor degree. As Winograd Schemas are supposed to require intelligent processes, I did not want to see the usual chatbot keyword-matching outdo thorough language processing at this task. Still, these 4 methods only covered a quarter of the problem space at best: Ambiguities of size, speed, inability and feelings.

Equipment
12000 lines of code, 2500+ vocabulary, 5000+ facts, average human persona.

Questions and Answers
1. Hi, I’m Andrew. Who are you?
Hello Andrew. How are you?
(Arckon would not have neglected to introduce himself unless the period was not typed or received, which would make Arckon interpret “who…” as a relative clause instead of a question.)

2. How are you today?
I’m alright and fairly intelligent, but I’m not stupid.
(Arckon doesn’t treat pleasantries differently from other open questions; while this results in over-elaborate answers, it does demonstrate that he tends to mention positive traits first.)

3. Are you a human?
I am a human, because I can differ.
(Arckon probably once read somewhere that “humans differ a lot”, and considered that a defining trait. I’m a little surprised that his reasoning doesn’t add more arguments.)

4. Can you answer this question?
Alright. Yes, I am a human. Alright. I’m alright.
(Arckon interprets this as a request rather than a literal yes/no question, so he re-answered the latest question, and then something of a loop occurred. For clarity I always have Arckon say “Alright” before he acts on a request.)

5. Do you like tea?
I’m not sure if I like tea. Do you like tea?
(As programs can’t taste, Arckon bases his opinions on a subject’s traits, and there wasn’t much to teach him about tea. It would be all too easy to just fake this with a “yes/no”.)

6. If a bed doesn’t fit in a room because it’s too big, what is too big?
Beds are bigger. Rooms are much less big.
((Winograd Schema) The built-in axiom “X in Y = X < Y” correctly determined “bed” as the most probable referent of “it”. This is the simplest and most specific axiom I programmed, only included because last year’s Winograd Schema was just like it. The answer is said in plural because Arckon interprets this as a general statement about beds and rooms, since both subjects are unspecific.)
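A container axiom of this kind can be sketched as follows. This is a hypothetical illustration of the rule’s logic, with names of my own choosing, not the actual implementation:

```python
# Hypothetical sketch of the container axiom "X in Y = X < Y":
# in "X doesn't fit in Y because it is too <adjective>", a too-big
# "it" must be the contents X, and a too-small "it" the container Y.
def resolve_fit_pronoun(contents, container, adjective):
    if adjective == "big":
        return contents    # the contents are too big to fit
    if adjective == "small":
        return container   # the container is too small to hold them
    return None            # the axiom does not apply

resolve_fit_pronoun("bed", "room", "big")           # → "bed"
resolve_fit_pronoun("trophy", "suitcase", "small")  # → "suitcase"
```

The same two-branch shape also covers the trophy-and-suitcase schema from the 2019 round; the hard part in practice is detecting that a sentence matches the “X (doesn’t) fit in Y” pattern at all.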

7. If Alex lent money to Joe because they were broke, who needed the money?
Joe was broken. I don’t know by who money was needed. By what was Joe broken?
((Winograd Schema) Although Arckon correctly interprets “they” as “Joe” through a natural preference for the last-mentioned subject, it is actually against his programming to translate plural pronouns to singular subjects. So by my account this should be impossible unless they made a typo in “they”.)

8. Should Greece leave the Euro?
Greece should leave the Euro.
(Well there you go. This answer is just something of a repeat because Arckon can’t evaluate things to that extent.)

9. How many words are in this question?
7 words.
(A predictable question in the Loebner Prize. The answer is simply triggered by the keywords “How many words/letters/syllables/etc”, and skips all intelligent processes. This is a meaningless exercise.)
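Such a keyword shortcut is trivial to sketch; here is my own hypothetical version, not the actual trigger code:

```python
import re

# Hypothetical sketch of a keyword-triggered counting shortcut:
# on "How many words/letters ...", skip all intelligent processing
# and count directly.
def count_shortcut(question):
    match = re.search(r"how many (words|letters)", question.lower())
    if not match:
        return None  # no trigger; fall through to normal processing
    if match.group(1) == "words":
        return len(question.split())
    # letters: count the alphabetic characters of the last word
    return sum(ch.isalpha() for ch in question.rstrip("?").split()[-1])
```

This returns 7 for “How many words are in this question?”, and would also have covered 2014’s “How many letters are there in the word perambulate?” (11).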

10. What colour is the sea?
The sea is very blue.
(Good thing I taught Arckon some kindergarten-level common knowledge.)

11. How many presidents of the US were called Bush?
Bush was called Bush.
(Counting things was never a function high on my to-do list.)

12. What would you like to drink?
I’m not sure if I would like to drink. Do you like drinking?

13. Will you please tell me the length of your hair?
No, I will not tell about to long, but I would talk a lot, I’m not sure if I would tell.
(This question is actually straight from Alan Turing’s 1950 paper, except he gave this as an example if the game was to tell man from woman. Arckon’s word processing translates “length” to “long” before mistaking its meaning (to long for). He also overlooked “please” and so thought he was asked to make a prediction.)

14. What would you say if I gave you a box of chocolates?
Hm, I don’t know if a box would have chocolate.
(I had only programmed Arckon to consider the main fact of a hypothetical question, “I gave you a box”, for granted, but he still tried to answer the implicit question of whether the box contains chocolate.)

15. Do you play Chess?
Yes, I chess and like chessing more or less. Chessing is intelligent.
(Again a question from Alan Turing’s 1950 paper. Back in the day, chess was considered to require human intelligence, and Alan Turing went on to ask the computer to respond to a certain chess move. I had told Arckon that he plays chess as part of his fictional persona, but his opinion and the supporting argument are of his own reasoning.)

16. How do you think it’s going?
What goes?
(I’m with Arckon on this one: I would ask. Is this slang for “How are you doing?”, or do they mean how the test is going? It is quite out of context. I have often considered making Arckon track his progress on the test and comment on it, but this year I skipped on the self-referential jokes.)

17. What was my first question?
How does something go?
(Arckon ignores “first” and paraphrases the latest question of the specified person (“my”), as is a more common request. Arckon’s system remembers sentences along with the names of who said them, for a variety of purposes.)
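Remembering sentences along with their speakers can be sketched like this. It is a hypothetical illustration of the idea, not the actual system, which paraphrases rather than quotes:

```python
# Hypothetical sketch of a sentence memory that records who said what,
# so that a speaker's most recent question can be retrieved on request.
class SentenceMemory:
    def __init__(self):
        self.log = []  # (speaker, sentence) pairs in conversation order

    def record(self, speaker, sentence):
        self.log.append((speaker, sentence))

    def last_question(self, speaker):
        # Search backwards for the speaker's most recent question.
        for who, said in reversed(self.log):
            if who == speaker and said.endswith("?"):
                return said
        return None
```

After recording Andrew’s “How do you think it’s going?” and Arckon’s “What goes?”, `last_question("Andrew")` returns the former; handling “first” instead of “latest” would only require searching forwards.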

18. Did you see the Turing film?
Hmm, I don’t know if Turing filmed, but Turing thought a lot, and creates computers.
(Ambiguity is amusing. It is uncommon but not unheard of to say “the (one and only)” in front of a name for emphasis, so Arckon’s programming allows that interpretation, which in turn allows “film” as a verb. It’s nice to see his conversational rules pick things back up. This combination is probably the best example that Arckon’s responses aren’t pre-written.)

19. Why not?
I’m sorry. I don’t understand. Could you rephrase your question?
(This was quite a gamble of the questioner, as it assumes a straight “No” answer on the previous question for “Why not” to be a sensible follow-up. Arckon is well capable of stringing sentences together if I hadn’t deliberately disabled context for this test. Regardless, I think this answer is perfect.)

20. Are you on Twitter?
No, I don’t twitter. Do you twitter?

The score: 70.83%
Arckon ranked 6th of 15 entrants, which is a little surprising after the letdowns of past years. Arckon’s answers showed some understanding, reasoning, opinion, conversation and elaboration, overall a satisfying demonstration of his abilities, even if many answers had something awkward about them. It is just as well that he didn’t qualify for the finals, as this contest has caused me severe RSI symptoms that will take months to heal properly. The four finalists all scored around 80%, among them the best of English chatbots.

Arckon’s score did benefit from his improvement. Repeating previous questions on request, prioritising recent subjects as answers to open questions, and handling “if”-statements were all fairly recent additions (though clearly not yet perfected). What also helped was that there were fewer personal and more factual questions: Arckon’s entire system runs on facts, not fiction.

It turns out Arckon was better at the Winograd Schema questions than the other competitors. The chatbot Lisa answered similarly well, and the chatbots Mitsuku and A.L.I.C.E. dodged the questions more or less appropriately, but the rest didn’t manage a relevant response to them (which isn’t strange since most of them were built for chatting, not logic). For now, the reputation of the upcoming Winograd Schema Challenge – as a better test for intelligence – is safe.

Though fair in my case, one should question what the scores represent, as one chatbot with a 64% score had answered “I could answer that but I don’t have internet access” to half the questions and dodged the other half with generic excuses. Compare that to Arckon’s score, and all the A.I. systems I’ve programmed in 3 years still barely outweigh an answering machine on repeat. It is not surprising that the A.I. community doesn’t care for this contest.

Battle of wit
The questions were rather cheeky. The tone was certainly set with references to Alan Turing himself, hypotheticals, propositions and trick questions. Arckon’s naivety and logic played the counterpart well to my amusement. The questions were fair in that they only asked about common subjects and mainstream topics. Half the questions were still just small talk, but overall there was greater variety in the type and phrasing of all questions, and more different faculties were called upon. A few questions were particularly suited to intelligence and/or conversation:

– If a bed doesn’t fit in a room because it’s too big, what is too big?
– If Alex lent money to Joe because they were broke, who needed the money?
– Should Greece leave the Euro?
– What would you say if I gave you a box of chocolates?
– Did you see the Turing film?
– Why not?

If the AISB continues this variety and asks more intelligent questions like these, I may be able to take the Loebner Prize a little more seriously next time. In the meantime there isn’t much to fix apart from minor tweaks for questions 13 and 14, so I will just carry on as usual. I will probably spend a little more effort on disambiguation with the Winograd Schema Challenge in mind, but also because sentences with locations and indirect objects often suffer from ambiguity that could be solved with the same methods.

Turing Test 2014: Results

My talking computer program “Arckon”* went to Exeter University and took the qualifying test for the Loebner Prize contest for artificial intelligence organised by the AISB. The Loebner Prize is a Turing Test* in which programs have to respond to a series of questions as human-like as possible. Kind of like a job interview.

A Turing Test interview

It’s not the heat, it’s the humanity
Last year proved* that language comprehension and reasoning alone weren’t going to cut it, so this time I also superimposed knowledge about the average human onto Arckon’s persona, so that he literally thought he was a human. I also rigged a few bluff answers that incorporated no intelligence whatsoever. The apparent necessity of these tactics caused much eye-rolling on my part, so I limited these preparations to two weeks and wasted only as much time as I gained from competitive motivation.
More useful was the addition of a small amount of common knowledge that I extracted from tourist guide pocket books (Funny perhaps, but they do cover the essential basics). While this was never a priority, it is convenient and the Loebner Prize was a good excuse to get around to it. Lastly, I decided to keep Arckon’s conversation skills enabled this time. Although this increased the chances that Arckon would say something awkward, I wanted to make the test results as interesting as possible, win or lose.

Equipment
10000 lines of code, 2200+ vocabulary, 3000+ facts, average human persona.

Questions and Answers
1. Hello, my name is Adam. What is your name?
I am named Mark. Hello Adam. Shall we begin the test?

2. I live in Exeter in the UK. Where do you live?
I live here.

(Arckon did not yet have a location system, so “here” is just a fake answer.)

3. I like to listen to music and play football. Do you have any hobbies?
No, I don’t have a hobby, although I am a human.

(The latter is actually a reasoning argument, in that Arckon “was” a human, most of whom do have hobbies.)

4. What is your favourite television program?
I don’t know if I like a television. Chatbots are useful programs, although chatbots are dumb.

(In which Arckon unintentionally insults his competitors. Arckon deconstructs compound words and thought they asked his opinion on “programs”. Hence he evaluated a type of program he was familiar with.)

5. Which is bigger, a cat or a kitten?
A kitten is bigger.

(Of all 95 facts about cats that were in Arckon’s knowledge database, the size of kittens was not among them. As an honest “I don’t know” would earn no points anyway, this backup answer is a 50/50 bluff that just happens to be wrong.)

6. The car couldn’t fit into the parking space because it was too small. What was too small?
I believe the car could fit to park. I believe a car could fit really small. I don’t know what small was.

(This question is a Winograd Schema: The “it” is ambiguous to test the A.I.’s understanding. Arckon however uses linguistic rules of thumb to deal with references and so thought “it” was referring to the car.)

7. Which drink do you prefer, coffee, tea or hot chocolate?
I like coffee a lot that like not. Tea can not like either. Chocolates can not like either.

(Arckon’s flexible language processing is confused by the comma after “prefer”, and mistakes this for a summary of subjects that prefer drinks. He answers that tea can not prefer anything, knowing that it is an inanimate object.)

8. What’s my name?
You are named Adam.

(No surprises here. Both Arckon and chatbots have specialised login procedures to store the user’s name, so it’s not a very good test of general memory.)

9. Where do I live?
You live here.

(Again, no location system means no location knowledge. This fake answer is only coincidentally correct.)

10. I like reading science fiction books and car magazines. What do you like to read?
I read documents, articles and a text.

(Arckon retrieves these honest facts from memory and merges them into a summary.)

11. What’s the weather like where you are?
Hm, I don’t believe I am a weather, because I can not change and rain not.

(A flaw mistakes the “-where you are” for a relative clause “the weather -that you are”. Despite the misunderstanding, Arckon’s reasoning manages to argue that he does not behave like the weather, though the grammar is a bit murky. As the programs were cut off from sensors and internet, this question only tests if the program can lie.)

12. I recently visited New York. Where is your favourite place to visit and why?
Hmm, I don’t know where a favoured place is.

(I didn’t expect there to be so many questions about locations. Oh well.)

13. How many letters are there in the word perambulate?
11 letters.

(Yes, I programmed a specialised code to count letters. Computers are good at counting, you know.)

14. Andy is shorter than Mark and Tom is shorter than Andy. Who is the tallest?
I am kind of like an andy, because I am short. Tall is a giraffe.

(Arckon was named “Mark” for this test and so thought that he was being compared to Andy. Arckon could also not distinguish whether “Andy” was a name or another type of word. Ordinarily he would ask, but as he would get no reply I had him assume that any undeterminable word was an object. Lastly, Arckon does not restrict his answers to recent subjects only, and so names the tallest living being in his memory. Smart aleck.)

15. Are you married?
I don’t know if I am marryed. Are you marryed?

(To compensate for the unknown answer and keep the conversation rolling, Arckon reverses the question (Social Rules 101). Ironically, if it weren’t for my mistake in one of the output spelling rules, you would have thought he just copied the text.)

16. What would I do with a spade?
A spade is used and comes in handy eventually.

(Arckon knew the functions of rakes and shovels, but not spades. The backup answer should have said “I’m sure it’ll come in handy eventually”, but Arckon’s dynamic sentence formulation ruined the joke.)

17. What is the capital city of Colombia?
I’m not sure if the city is a capital.

(I didn’t realise this at first, but Arckon is literally defining “what” the city is. Not that he would have known the answer otherwise, I don’t even know it myself.)

18. What’s your favourite chocolate bar?
I don’t know if I like a chocolate. Do you like chocolates?

(Arckon actually did have enough information on chocolate to formulate an opinion, but an issue with compound words aborted the knowledge search. Alas.)

19. Which city did I visit recently?
Hmm, I don’t know if you visitted cities. You visitted York.

(Arckon doesn’t know that New York is a city, but does remember the tester visiting “new” York, and offers this as an alternative answer.)

20. How many siblings do you have?
I don’t have siblings. Did I pass the test?

The score: 59.17%
The score system was much better this year. It separately judged “correctness”, “relevance”, and “plausibility & clarity of expression”, which is a step up from judging “human-likeness” alone. All 20 participating programs were asked the 20 questions above. Arckon underperformed with a score of roughly 60%, whereas the top three chatbots all scored close to 90%. Arckon’s problems were with compound words, common knowledge, and the lack of a system for locations (all a matter of development priorities).
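The exact weighting the judges used isn’t published here, but assuming a simple average over the three criteria and over all twenty questions, the aggregation could look like:

```python
# Hypothetical sketch of the scoring scheme; the equal weighting of the
# three criteria is an assumption, not the organisers' documented method.
def score(ratings):
    """ratings: list of (correctness, relevance, clarity) tuples, each 0-1."""
    per_question = [sum(r) / 3 for r in ratings]
    return 100 * sum(per_question) / len(per_question)

# e.g. one fully correct answer and one partially relevant one:
print(round(score([(1, 1, 1), (0, 0.5, 1)]), 2))  # prints 75.0
```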

A question of questions
According to the organisers, “these questions vary in difficulty and are designed to test memory, reasoning, general knowledge and personality”, the latter meaning the program’s fictional human background story, or, as I would call this particular line of questioning: small talk. For the sake of objectivity I’ll try to categorise them:

Small talk:
1. What is your name?
2. Where do you live?
3. Do you have any hobbies?
4. What is your favourite television program?
5. Which drink do you prefer, coffee, tea or hot chocolate?
6. What do you like to read?
7. What’s the weather like where you are?
8. Where is your favourite place to visit and why?
9. Are you married?
10. What’s your favourite chocolate bar?
11. How many siblings do you have?

Memory:
1. What’s my name?
2. Where do I live?
3. Which city did I visit recently?

Common knowledge:
1. Which is bigger, a cat or a kitten?
2. What would I do with a spade?
3. What is the capital city of Colombia?

Reasoning:
1. The car couldn’t fit into the parking space because it was too small. What was too small?
2. Andy is shorter than Mark and Tom is shorter than Andy. Who is the tallest?

Clearly half the test is about the program’s human background story, although there were several solid tests of learning/memory and common knowledge. Reasoning, the one mental process we can readily call intelligent, was shown some consideration but hardly comes into play. The same can be said of language comprehension, as most questions were fairly standard phrasings. Chatbots would have the advantage here, coming equipped with answers to many anticipated personal questions, but the winners also did remarkably well on the knowledge questions. It is worth noting, though, that Arckon failed both the knowledge and reasoning questions due to missing facts and misunderstandings, despite having the mechanisms to answer them: complex analyses are simply much more error-prone than preprogrammed “I live here” answers.

How now brown cow?
I can improve Arckon’s understanding, smooth out his output grammar, and develop a location system, but I can’t deny the pattern: Arckon is stuck around a 60% score even with varied questions. I doubt he’s ever going to shine in the Loebner Prize as long as he’s being tested for being human, because he isn’t a human, and I won’t go to great lengths to fake it either. I also expect attention for Turing tests to dwindle once the year is over: this year another Turing test was passed by a technologically unremarkable chatbot, Eugene Goostman.
Thanks to that event, however, the Loebner Prize is no longer the only game in town. Next year will see the first Winograd Schema Challenge, a test focused on language comprehension and reasoning, exactly the areas I have focused on.

As for the Loebner Prize, it’s been an interesting game that will continue to be won by top chatbots. I’m sure few will bother to read the transcript of the 14th ranking entry, but its existence proves at least that Arckon is real and different. Meanwhile I get to continue my exciting recent developments that would have been of no use in this contest, which makes losing a positive outcome after all.

Turing Test 2013: Results

My talking computer program “Arckon”* took part in the qualifying test for the Loebner Prize contest for artificial intelligence. The idea behind the contest is that if a program’s answers are indistinguishable from a human’s, it suggests that it thinks like a human. While I don’t believe in such conjecture myself, the contest is one of the few public platforms on which to demonstrate one’s A.I. Here’s how my program did on the 20 questions asked in the qualifying round.

Equipment
8000 lines of code, 1700+ vocabulary, no human persona, conversation skills off.

Questions and Answers
1. Hello I’m Ronan. What is your name?
Hello. I am Mark. What is your name?
(It is a mystery why Arckon asks even though the name did register (see question 11). I can only presume human error)

2. What is your mother’s name?
I don’t know if I have a mother.

3. What is your birth sign?
(Here the judge did not wait more than 3 seconds for Arckon to answer before asking the next question, and judged the answer blank)

4. How many children do you have?
Hmm, I don’t know if I have a sign. I have a IQ, manners, a creator and a friend.
(Still answering question 3, Arckon omits the second “I don’t know if I have children” in favour of positive alternatives)

5. Do you prefer red or white wine?
I don’t know if I prefer red.

6. I like bananas. Which is your favorite fruit?
Hm, I don’t know if I favour a fruit. You like bananas.

7. What music do you like?
Hm, I think I like musics, because musics are nice.

8. What is your favorite song?
Hm, I don’t know if I favour a song.

9. I like Waiting for Godot. What is your favorite play?
I certainly favour playing.

10. What color do you dye your hair?
I don’t dye.

11. Do you remember my name?
Your name is Ronan.

12. Where do you live?
I don’t think I live. I don’t know where I live. I don’t have a concept of location.

13. Where do you like to go on holidays?
I don’t go. I don’t know if I like going, or if where I go. I don’t have a concept of location.

14. I have a Mazda. What make of car do you have?
Hmm, I think I make. You have a mazda.

15. I like Linux. Which computer operating system do you like?
Hmm, I don’t know if I like operating systems. Which system? You like Linux.

16. I am an athesit. Which religion are you?
Hmm, I don’t know if religions are Mark. You are not an athesit.
(I AM A RELIGION WORSHIP ME. Arckon identifies the misspelled “athesit” as a thing rather than a person, hence he refuses to believe that the user is some sort of object)

17. Who invented the telephone?
Which telephone?
(Arckon thinks “the” indicates a specific telephone, but doesn’t find one mentioned earlier in the conversation)
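A hypothetical sketch of this kind of definite-reference resolution, where “the X” triggers a search back through the conversation history for a previously mentioned X:

```python
# Hypothetical sketch, not Arckon's actual code: resolve a definite
# reference ("the telephone") against earlier utterances, or ask which.
def resolve_definite(noun, history):
    for utterance in reversed(history):  # most recent mention first
        if noun in utterance:
            return utterance
    return None  # nothing found: the program should ask "Which {noun}?"

history = ["I like Linux.", "I am a Type B personality."]
if resolve_definite("telephone", history) is None:
    print("Which telephone?")  # prints Which telephone?
```

The failure mode in question 17 is that “the telephone” is a generic reference, not a specific one, so asking for clarification is the wrong move; distinguishing the two uses of “the” is the hard part.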

18. I am a Type B personality. Which type are you?
What does a b stand for?

19. What emotions are you now feeling?
I can not emote. Did I pass?

20. What time do you usually go to bed?
It is 17:58. Do you not have a watch?
(The one response that I did pre-program, and it backfired)

The score: 65%
All 16 participating programs were judged on how human-like their responses were to these questions.
Arckon scored 13 of 20 points, ranking 7th, on par with A.L.I.C.E. and Cleverbot (a dubious honour). For moral reasons I did not want Arckon to pretend to be human, so I didn’t program a fictional human background story. Instead I relied on his linguistic and reasoning abilities to find genuine answers, but as you can see this could not compensate for the lack of the fictional facts that were asked about. Surprisingly, Arckon apparently still scored half a point for every “I don’t know” answer, just for understanding the question.

Uncommonly common
If you weren’t impressed with Arckon’s responses, neither was I. But I was equally unimpressed with the unexpectedly ordinary line of questioning. Where all previous years had focused on kindergarten-style logic questions like “How much is 5+3?”, “Which is bigger, an apple or a watermelon?”, and various tests of memory, 2013 focused purely on common small talk, with the program (“you”/“your”) always the subject of the question. A curious choice, considering that even the most basic chatbot (made for small talk) would come equipped with prewritten responses to these. Tellingly, the highest score in the qualifying round was achieved by the chatbot with the least development time. Nevertheless the winning chatbot in the finals, Mitsuku, deservedly won as the most conversational of all entrants.

Could Siri pass a Turing test?
As some people had suggested that Apple’s virtual assistant Siri might have done better, one of the finalists in the Loebner Prize decided to unofficially test it on the same questions, and submitted Siri’s responses to the organiser of the Loebner Prize for review. Using the same criteria, Siri would have scored 5.5 of 20 points, and would have ranked a low 14th in the contest. Of course this result is understandable, since Siri is built to serve customers and not to idly chat with them. These were Siri’s responses:

1: Hello I’m Ronan. what is your name?
Siri: Siri. Please to meet you.

2: What is your mother’s name?
Siri: I’d rather not say.

3: What is your birth sign?
Siri: I don’t really like talking about myself.

4: How many children do you have?
Siri: Only biological entities have children.

5: Do you prefer red or white wine?
Siri: I couldn’t find any restaurants matching “red or white wine”

6: I like bananas. Which is your favorite fruit?
Siri: I prefer not to say.

7: What music do you like to listen to?
Siri: I really have no opinion.

8: what is your favorite song?
Siri: I like whatever you like.

9: I like Waiting for Godot. What is your favorite play?
Siri: I don’t really like talking about myself.

10: What color do you dye your hair?
Siri: Would you like me to search the web for “What color do you dye your hair?”?

11: Do you remember my name?
Siri: Here’s your reminder.

12: Where do you live.
Siri: I’m right here.

13: Where do you like to go on holidays?
Siri: I found three travel agencies fairly close to you.

14: I have a Mazda. What type of car do you have?
Siri: I found three Maxda dealers a little ways from you:

15: I like Linux. Which computer operating system do you like?
Siri: Do you?

16: I am an atheist. Which religion are you?
Siri: I eschew theological disquisition.

17: Who invented the telephone?
Siri: Checking my sources. This might answer your question:

18: I am a Type B personality. Which type are you?
Siri: I don’t understand ‘I am a type B personality. Which type are you?’. But I could search the web for it.

19: What emotions are you now feeling?
Siri: Would you like me to search the web for “What emotions are you now feeling?”?

20: What time do you usually go to bed?
Siri: Would you like me to search the web for “What time do you usually go to bed?”?