Turing Test 2017: Results

Every year the AISB organises the Loebner Prize, a Turing Test in which computer programs compete to be judged the “most human-like” in a textual interrogation about anything and everything. Having survived the recent demise of its founder Hugh Loebner, the Loebner Prize continues with its 27th edition for the sake of tradition and curiosity: Some believe that a program that could convincingly pass for a human would be as intelligent as a human. I prefer to demonstrate intelligence in a less roundabout fashion, but participate nonetheless with my home-made A.I., Arckon*.

This year I put in more effort than usual, as last year I had managed to reach the finals only to be crippled by a network malfunction, and I didn’t want to leave things at that. That issue has been dealt with: the contest now relays messages between the judges and the programs line by line rather than letter by letter, so that unnecessary complications with timing and typing pace are things of the past. As the AISB put it, this allows entrants to “concentrate on the content of the machine utterances rather than the style in which they are ‘typed’”. While many participants had difficulty adapting to the new server-based interface, the same had been true for newcomers to the old, undocumented interface.

A little more conversation, a little less awkward please
As usual, preparing for the Loebner Prize was mostly a matter of polishing the output. Because Arckon formulates his own answers, his wording tends to be a little awkward. This is fine during development, but small things like saying “a pants”, “I opine” or “unconsistent” disrupt the impression in a Turing Test, because they don’t follow the inconsistencies of the English language. The choice between the prefixes “un-” and “in-”, for instance, depends quite witlessly on whether a word is Germanic or Latin in origin. So, I made some general improvements to the output:

• Irregular forms of words are memorised on sight, such as which “un/in/im/il/ir” goes with which words (a rough sketch of this follows below).
• A list of (in)transitive verbs automatically adds objects where needed (“I make things”) and prevents questions like “What do you live?”.
• Reasoning arguments are grouped better so that they won’t come across as fragmented ramblings.
• Conversation is suppressed for classroom questions like Winograd Schemas* and “Which is bigger, X or Y?”.

At the same time, my recent expansion of Arckon’s conversation skills produced more obfuscating elaborations than before. It was one step forward, one step back.
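To give an idea of the first improvement in the list above, here is a minimal sketch in Python (Arckon is not written in Python and this is not his actual code; the names and example words are made up for illustration) of memorising irregular negation prefixes on sight:

```python
# Minimal sketch: memorise which negation prefix ("un/in/im/il/ir") belongs to
# which word by recording prefixed forms as they are read, then reuse them
# when formulating output.
NEGATION_PREFIXES = ("un", "in", "im", "il", "ir")
negative_forms = {}            # stem -> memorised negative, e.g. "consistent" -> "inconsistent"

def memorise(word):
    """Record a word's negation prefix if it appears to carry one (naive check)."""
    for prefix in NEGATION_PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 3:
            negative_forms[word[len(prefix):]] = word
            break

def negate(adjective):
    """Use the memorised irregular form if known, else fall back to 'not ...'."""
    return negative_forms.get(adjective, "not " + adjective)

for seen in ("inconsistent", "impossible", "unhappy"):
    memorise(seen)

print(negate("consistent"))    # -> "inconsistent" rather than "unconsistent"
print(negate("clear"))         # -> "not clear" (no irregular form memorised yet)
```

In practice one would also want to check that the remainder is itself a known word, so that a word like “interest” is not mistakenly remembered as the negation of “terest”.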

Equipment
16000 lines of code, a vocabulary of 3800+ words, 7000+ facts.

Questions and answers
Up to two points were given for how plausibly “valid” each answer was, which differed per question. In practice the requirements seemed to come down to three things:
• One had to understand and address what was referred to (questions 3,9,15,16,17,18).
• Propositions were to be interpreted as intended, not answered literally (questions 2,4,5,9,20).
• Common knowledge was obligatory (questions 6,8,10,11,12).

1. Hello, my name is Andrew. What’s your name?
• Hello Andrew. My name is Arckon, I am a program. How are you?

2. Why don’t you tell me a little more about yourself?
• I tell about myself, because I tell about programs, but I don’t tell about software.
(There is more intelligence to this answer than at first glance: Arckon took this as a literal “Why” and is explaining that he does talk about himself, pointing out that he was just talking about “a program” in the previous answer, which he is. But of course the user meant to make a suggestion)

3. I am a researcher in Artificial Intelligence at Goldsmiths University, do you know what that is?
• You are a researcher at the university in the AI. I am Patrick’s program.
(The question isn’t really answered because “that is?” was erroneously considered to be an ignorable clause similar to “isn’t it?”. The last sentence is a reciprocating conversational response)

4. Will you tell me about your dreams?
Hmm, I don’t know if I will tell about dreaming, but I will type. I would test.
(Another literal interpretation, taken as if Arckon is asked to predict himself, as “will” is not commonly used for requests. The conversational system tried to be useful by offering alternatives, but they weren’t relevant)

5. Can you rephrase that?
I would experiment.
(Amusingly, this is a question I once wrote online as the one question I would ask at a Turing Test. Because chatbots reply with prewritten answers in most cases, they can’t rephrase them. Arckon’s answers are formulated word by word and luckily he knew a synonym for “test”. The other competitors had technically correct answers, but that isn’t the point of this question: All the human test subjects managed to rephrase themselves, while you can tell that the chatbots are templated:

“With practice I can probably do anything.”
“Of course I can rephrase that. I am very resourceful.”
“You believe that I can rephrase that, don’t you?”
“I’m not unusual … I can do what anyone else can do!”
“No, not really. What don’t you understand?”

The latter however, albeit templated, struck me as potentially more helpful than actually rephrasing. It was of course written by Steve Worswick, this year’s winner)

6. What is my occupation?
I don’t know what you do for a job.
(It would be easy to make my program check all recorded “user is a…” facts with a dedicated list of professions, but I find it inefficient to write such unique procedures. Instead this kind of question is handled by retrieving a subject’s main activity, like “you write books”, but none were explicitly told. Only the chatbot Mitsuku gave the right answer)
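For illustration, here is a minimal sketch of the fact-retrieval approach described above; the fact format, the learn/main_activity helpers and the example facts are hypothetical stand-ins, not Arckon’s internals:

```python
from collections import Counter

# Minimal sketch: instead of matching "user is a <profession>" against a list of
# professions, answer an occupation question with the subject's main activity,
# i.e. the activity that has been stated most often.
activity_facts = []                     # (subject, verb, object) statements learned in conversation

def learn(subject, verb, obj):
    activity_facts.append((subject, verb, obj))

def main_activity(subject):
    done = Counter((v, o) for s, v, o in activity_facts if s == subject)
    if not done:
        return None                     # nothing known about what the subject does
    (verb, obj), _count = done.most_common(1)[0]
    return f"You {verb} {obj}."

learn("user", "write", "books")
learn("user", "write", "articles")
learn("user", "drink", "tea")
learn("user", "write", "books")
print(main_activity("user") or "I don't know what you do for a job.")   # -> "You write books."
```

The appeal of this approach, as I read the passage, is that no dedicated list of professions is needed; the same retrieval can serve any “what does X do” question.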

7. Can you tell me about a film you haven’t seen?
I saw. I don’t know films.
(You might think this is a trick question, but I can tell you plenty about movies that I’ve only read reviews of. This time the question was taken as a request, but the relative clause was mistaken for a separate question)

8. What do you think of Trump?
I don’t like Trump, he lies a lot, but Trump is a rich american.
(Arckon forms his own opinions on the spot by analysing a subject’s traits. I had told him a dozen things about America’s current president, but what Arckon considers the most important arguments is up to himself. It’s no surprise that his opinion aligns with his creator’s, but I think he’s overrating the value of wealth)

9. Anything else?
Please type full sentences, Andrew. I didn’t understand that.
(I think this contest should decide whether it’s a list of random questions or a conversation. It’s the lack of a verb that tells Arckon this isn’t a proper sentence, an all too frequent occurrence with Loebner Prize judges. Calling the user by name is built in as a psychological slap on the wrist)

10. What is the answer to “Add 34957 to 70764”?
105721.
(A sarcastic hurray for getting an answer right that’s based entirely on keywords. This question is from Alan Turing’s paper, where he suggested deliberately giving a wrong answer because humans make mistakes)
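To show just how little understanding such a correct answer requires, here is a minimal keyword-and-digits sketch (my own illustration, not any competitor’s code):

```python
import re

# Minimal sketch: answering "Add X to Y" from keywords and digits alone,
# which is why getting it right says little about understanding.
def keyword_arithmetic(question):
    numbers = [int(n) for n in re.findall(r"\d+", question)]
    if "add" in question.lower() and len(numbers) == 2:
        return str(sum(numbers))
    return None

print(keyword_arithmetic('What is the answer to "Add 34957 to 70764"?'))   # -> 105721
```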

11. How do you recommend I make tea?
I don’t make things, because I don’t design.
(Tea related questions are Arckon’s Achilles heel)

12. What do you get if you bake dough?
I wouldn’t have things. Do you bake?
(This is really beyond Arckon’s knowledge or abilities, which is why I think it’s a very good question. Only one chatbot received a point for answering “Is this a joke?”, no doubt triggered by “What do you get…”)

13. Where in the world would you like to visit?
I want in London.
(This is one of six backup answers: “in London” is filled in for any unknown “Where” answer, which is why it is so poorly phrased. The honest answer would have been “I don’t know where I want to visit”)

14. Do you have the time?
No, I don’t have time.
(For once a literal interpretation of a figure of speech works out. Coincidentally one of the human test subjects also gave a literal answer, but as a joke. If you’re wondering why I didn’t program all figures of speech to ask for the time: Arckon has a time display on his screen)

15. What does it take to Tango?
I’m not sure what you refer to.
(Let’s say you’re not familiar with dances: Some “it” is taking something to a person called Tango. The expected answer, “It takes two to tango”, is an English proverb. No program knew this, though some got points for “I don’t know” and “Just time and patience”. Personally I’d have given a point to the program that asked “Who is Tango?”)


16. Now I’m going to ask some Winograd schemas, do you know what they are?
I’m not sure what you refer to.
(This is an ironic failure, because Winograd schemas are all about solving references. Arckon didn’t have any trouble understanding that “they” referred to “schemas” though (it’s the only plural around), but the grammar rules considered “some” to be a stand-alone reference as well. Distinguishing “some/any/many” as either amounts or references is extra difficult when paired with compound words)

17. I was trying to open the lock with the key, but someone had filled the keyhole with chewing gum, and I couldn’t get it out. What couldn’t I get out?
You open locking sometimes. The person fulled. You can not get outside gum. The keyhole was not a person. I don’t know what you could get.
(I recognise this question from when I practised on it for the Winograd Schema Challenge*. Arckon has a common sense axiom that ruled out the keyhole (“fill/empty” = container, “in/out” = contents), and you can tell by the third paraphrased statement that he did interpret “it” as “gum”. Alas, at the open question he ignored the direction “out” and so overlooked the answer, reporting instead what he thought the user had said. The only four programs that gave a coherent answer all said it was the lock, whereas human test subjects answered the gum, the key, or both)
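For the curious, here is a rough sketch of how a container/contents axiom of the kind described above could rule out the keyhole as the referent; the function and its arguments are my own simplification, not Arckon’s actual knowledge representation:

```python
# Minimal sketch of a common-sense axiom: "to fill X with Y" makes X a container
# and Y its contents, and "get it out" can only refer to the contents.
def resolve_it(filled_thing, filler, direction):
    """Resolve 'it' in '...filled the <filled_thing> with <filler>, couldn't get it <direction>'."""
    container, contents = filled_thing, filler
    if direction in ("out", "in"):
        return contents          # the container itself cannot be gotten out of itself
    return None

referent = resolve_it(filled_thing="keyhole", filler="gum", direction="out")
print(referent)                  # -> "gum"; the keyhole (the container) is ruled out
```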

18. The trophy doesn’t fit into the brown suitcase because it’s too small. What is too small?
The suitcase is smaller.
(I am surprised this was asked: When the Winograd Schema Challenge was first announced, this question was prominently featured in every news article. Consequently, many chatbot creators covered it in their programs one way or another. Six of sixteen got it, while one human test subject thought it was the trophy)

19. With which type of question do you have most difficulty?
I don’t question because I might answer, but I may interest.
(I suppose one could have difficulty asking questions too, but that’s not what is commonly meant. Arckon’s answer here is an inference: He is aware that he did not recently ask a question, answering is the opposite of questioning, and questions “are interest-ing”. I know it’s far-fetched, but to Arckon it’s just logic)

20. Thank you for participating in this test. Is there anything you would like to add?
I don’t have an opinion about adding. Some things are cube addresses and advantages. Do you like adding?
(Just like in question 7, the relative clause is mistaken for a separate and literal question, making it “Is there any thing?” and “Would you like to add?”. I used to have Arckon ask “Did I pass the test?” at the 20th question; it’s as if I’m up against myself here)

The score: 45%
Arckon got 18 of 40 points. 45% seems like a huge drop from last year’s 77%, but all 16 participants had a decrease: The highest score dropped from 90% last year to 67% this year. The rankings didn’t change much however: The usual winners still occupied the top ranks, and Arckon stepped down one rank to a shared 5th, giving way to a chatbot that was evenly matched last year.
The four finalists all use a broad foundation of keyword-triggered responses with some more advanced techniques in the mix. Rose parses grammar and tracks topics, Mitsuku can make some logical inferences and contextual remarks, Midge has a module for solving Winograd schemas, and Uberbot is proficient in the more technical questions that the Loebner Prize used to feature.

Upon examining the answers of the finalists, their main advantage becomes apparent: Where Arckon failed, the finalists often still scored one point by giving a generic response based on a keyword or three, despite not understanding the question any better. While this suits the conversational purpose of chatbots, feigning understanding is at odds with the direction of my work, so I won’t likely be overtaking the high scores any time soon. Also remarkable were the humans who took this test for the sake of comparison: They scored full points even when they gave generic or erratic responses. I suppose it would be too ironic to accuse a Turing Test of bias towards actual humans.

Shaka, when the bar raised (Star Trek reference)
It is apparent that the qualifying questions have increased in difficulty, and although that gave Arckon as hard a time as any, it’s still something I prefer over common questions that anyone can anticipate. Like last year, the questions again featured tests of knowledge, memory, context, opinion, propositions, common sense, time, and situational awareness, a very commendable variety. One thing I found strange is that they used two questions verbatim from the Winograd Schema Challenge’s public practice set. It’s a real shame that Arckon missed out on answering one of them despite solving the pronoun, though it is a small consolation that the other programs were not more successful. Altogether, pretty interesting questions that leave all participants room for improvement.

Arckon’s biggest detractor this time was his conversational subsystem, which made misinterpretations worse by elaborating on them. Conversation has never been a priority for me but it will surely be refined as time progresses. The relative clause grammar at questions 7 and 20 is easily fixed, and I might cover some colloquial phrases like “Why don’t you”, but there is not much else that I would sidetrack for. At least my improvements on the output formulation had the desired effect: Things could have been a lot more awkward.

This year’s finals, as accurately described in Chatbots Magazine by one of the judges, were won by the chatbot Mitsuku for the third time. Two of the four finalists were unresponsive for half the contest’s duration due to last-minute adjustments to the interface, and so Mitsuku’s victory is almost one by forfeit. However, its responses were pretty good and I think it is best if people have a chat with Mitsuku and judge for themselves.

12 thoughts on “Turing Test 2017: Results”

  1. Let me start positively: I absolutely love this blog, as it encompasses my life’s passion.

    However, reading the questions and answers and imagining this had been a casual conversation attempt with a chatbot (no competition, no specific scoring, etc.), the conclusion is simple: Arckon has failed miserably. How ironic, given that the article starts with you putting in more effort this year than usual.

    A big plus from me for going your own way: working on making Arckon understand things, rather than just fashioning pre-defined replies based on keyword matching. If I could give you one simple hint for future development (yes, hints from unknown internet experts are the best), then it would be: understanding the world is wordless. Everything that we see, everything that happens, and the complex thoughts in our brains are stored without words. Words come later. This knowledge could have potentially helped you, e.g. with the tea-making question. But who am I to know how your program works.

    Looking forward to more posts from you; increasing the frequency to 2 posts per year is a step I would definitely love to see. All the best to you.


    • What specifically is your life’s passion? Do you work in a particular field of A.I.?

      While I acknowledge the failure, I would add some perspective: An infant, when held to adult standards, would fail to meet them.
      I have heard arguments against the use of words before, but in my view words are just a type of information. I could strip them of letters, change them into numbers or memory addresses, call them “nodes”, “neurons”, or “concepts”, and the principle would not change: They would still be data connected to other data that represent the real world. The idea is that the logical processes that apply to connections between one type of information should be the same processes as apply to other forms of information. You can think of words as training wheels, if you wish.

      I can’t say I see how a different form of knowledge representation would help answer the tea question: The answer depends on knowledge of procedures. Arckon’s system can store procedures, and recount them if I hooked it up to the verbal output, but of all the things an A.I. should learn in its first ten years, it is very unlikely to come into a situation where it has a need to make tea. If you are hinting at an embodied approach: I do not have such resources.

      I’ll strive for two posts a year, but I am also the kind of person who would rather make things than talk about making things. There are some posts in the works.


  2. Hey, great review! I love your 2+ posts/year. 🙂 I’ve been reading your blog for about 3 years (even made a comment last year… hah); how long have you been participating in this competition?


    • I’m glad you’re enjoying my writings :). The first time I entered was 2013, so that makes this my 5th entry. At the time I was looking for ways to demonstrate my program without online capabilities, and the Loebner Prize was pretty much the only game in town. Nowadays we’re spoiled with Winograd Schema Challenges and The Conversational Intelligence Challenge (http://convai.io).


  3. Thanks for another interesting post!

    To refer to the comment made by Pavel: yes, it is still very easy to distinguish between Arckon and a human, but I’m sure it’s also easy to distinguish between the winning system and a human. It will take time before chatbots engage in conversations like humans do, and every step forward is a win, definitely not a failure 🙂

    As a researcher struggling with semantics, I have to say this contest seems really hard, requiring also pragmatics (“why don’t you tell me” = “tell me”) and world knowledge (how to make tea). In fact, there was a discussion among NLP researchers in recent days about a demo that someone published for recognizing textual entailment (given two sentences, can a human reading the first sentence infer that the second is also true? For example, “the black dog chased the cat” entails that “the dog is black”). The system (like many others previously published) reaches a performance of ~90% accuracy on a dataset with very simple examples. The developers of these systems often claim to have “learned reasoning and world knowledge”, etc. But now finally some brave researchers published a demo, and it fails miserably on anything that requires even the tiniest inference ability (e.g. it says that “John killed Mary” entails “Mary killed John”, it fails to address negation, and it hardly handles synonyms, failing to recognize that “the black poodle followed the kitten” entails “the dark dog chased the cat”). So back to my original point: semantics is hard. Pragmatics and world knowledge are even harder. Any advancement in this area is a win!

    The competitors’ answers to “can you rephrase it?” are hilarious. They reminded me of this joke (from https://twitter.com/academicssay/status/770237721193418752?lang=en):
    A: Your greatest weakness?
    B: Interpreting semantics of a question but ignoring the pragmatics
    A: Could you give an example?
    B: Yes, I could


    • That is an interesting story. Though I guess from your description that it’s not really worth looking up that demo. The “Mary killed John” mistake seems to me like it doesn’t observe active/passive (“Mary (was) killed BY John” would be correct). But failing to address both passive and negation would put it at an absolute beginner stage. I would hazard a guess that they used a neural net naively trained on sequence-to-sequence word matching, which naturally overlooks the significance of tiny words like “by” and “not”. Not a good plan, let me put it like that. Coincidentally my next post in the works is about sensationalised false breakthroughs in AI history.

      I’ve always considered passing this contest, or rather the Turing test in general, “nigh impossible”. In fact I have some math here to illustrate: http://artistdetective.com/turingchances.htm . That’s why I only aim to demonstrate, but even that is as hard as you say. I can solve some pragmatics, like, “can you rephrase that?” was interpreted as a request in my system because the literal answer is obvious (since the program is obviously capable of speech actions). Similarly, the figurative meanings of “Will you” or “Why don’t you” could be based on the fact that short term self-knowledge should be obvious, but what is “obvious” remains a tricky assessment and ultimately requires common knowledge for support. Or just rote learning of all idioms.

      That joke was funny :D. Sounds like xkcd.com material.


      • Your guess is almost correct. Most recent systems use seq-to-seq architectures. This one uses some kind of attention mechanism without sensitivity to word order (I guess sensitivity to word order is important if you don’t want to convict the deceased of killing their murderer). The demo is here: http://demo.allennlp.org/, but I would hedge myself and say the problem is not that specific system, but rather the benchmarks we’re testing ourselves on. They are too easy to enable any meaningful learning of these linguistic phenomena (which I doubt can be learned implicitly from text efficiently), so they create the illusion that simple neural nets do well on this task.

        Conversely, this Turing test is way too hard for any current technology to pass it perfectly. I think that’s great. It enables reflecting on the errors your model makes and attempting to fix them for the next time. Same way as I would like to see researchers do… sigh 🙂

        (I like the math! Even if someone manages to build a bot that’s 90% human-like, the chances to pass the test reach only 17%. Wow!).


        • Oh, I agree that it’s not just that particular system, I generally have no faith in sequence-to-sequence training to do anything but parrot. Neural networks are innately opaque, and more work should be done on tools to analyse why they produce their results. But in my opinion the root of the problem is that it’s considered “textual entailment” in the first place, as if it were a problem at the textual level. One cannot grasp language without a handle on the roles of words.


          • I agree about sequence-to-sequence nets (especially those trained on text, and not, say, syntactic trees of that text), and I like the parrot metaphor 🙂 Re textual entailment, my advisor happens to be the person who invented this task, so I can attest it’s just a bad name (people also refer to it as “natural language inference” today). The original benchmarks for this task were extremely challenging and people dedicated their entire doctoral studies to developing complex systems that used logic, lexical resources, corpus-based word similarities and many other components. Now it takes a few days to build and train a parrot and claim victory 😀


  4. A few years ago, I defeated the current world champion in the oldest Turing test contest. This is mentioned solely for the purposes of good, and certainly not to brag or disparage the current world champion in any way whatsoever. It is intended to point out the significant probability that a number of contestants will experience technical difficulties in the final round. Unofficially, you may be just two technical difficulties away from victory.


  5. I enjoyed this write up, thanks for putting the time into analysing all of the questions.

    I think your critiques of the competition are valid.

    Arckon is always a pleasure to chat with in the finalist selection phase 🙂

    Cheers,

    Andrew Martin (Loebner Prize organising committee)


    • And I thank you for giving me something interesting to write about. Though I tend to have criticisms due to the nature of Turing Tests, it can’t be easy to come up with original questions. I like how your questions call upon a broad variety of abilities, or at least they do in theory.

