For the third time I entered my A.I. “Arckon”* into the Loebner Prize Turing Test, a competition in which computer programs are to prove they can “think” by answering questions as if they were human. Although such pretence is counterproductive to my purposes, I have a talking program lying around anyway, so I might as well enter it.
An excuse to clean up
I took the opportunity to spend a month debugging my code overall. In doing so I fixed several oversights that had been obstructing knowledge searches and reference resolutions (“the capital of Spain”), and I generally improved the program’s ability to answer open questions, which ‘coincidentally’ covered “Which X do you like?” questions as well.
I limited the contest-specific preparations to 5 days: mostly testing the Loebner Prize interface, restricting the context range to keep the program from mistaking the test for coherent conversation, and re-installing the obligatory fakery about having a favourite colour and whatnot.
As an extra I added 4 more reference resolution methods for the anticipated Winograd Schemas (sentences with ambiguous pronouns), which was a fun challenge, and improved Arckon’s understanding by a minor degree. As Winograd Schemas are supposed to require intelligent processes, I did not want to see the usual chatbot keyword-matching outdo thorough language processing at this task. Still, these 4 methods only covered a quarter of the problem space at best: Ambiguities of size, speed, inability and feelings.
12000 lines of code, 2500+ vocabulary, 5000+ facts, average human persona.
Questions and Answers
1. Hi, I’m Andrew. Who are you?
Hello Andrew. How are you?
(Arckon would only have neglected to introduce himself if the period had not been typed or received, in which case he would interpret “who…” as a relative clause instead of a question.)
2. How are you today?
I’m alright and fairly intelligent, but I’m not stupid.
(Arckon doesn’t treat pleasantries differently from other open questions. While this results in over-elaborate answers, it does demonstrate that he tends to mention positive traits first.)
3. Are you a human?
I am a human, because I can differ.
(Arckon probably once read somewhere that “humans differ a lot”, and considered that a defining trait. I’m a little surprised that his reasoning doesn’t add more arguments.)
4. Can you answer this question?
Alright. Yes, I am a human. Alright. I’m alright.
(Arckon interprets this as a request rather than a literal yes/no question, so he re-answers the latest question, after which something of a loop occurred. For clarity I always have Arckon say “Alright” before he acts on a request.)
5. Do you like tea?
I’m not sure if I like tea. Do you like tea?
(As programs can’t taste, Arckon bases his opinions on a subject’s traits, and there wasn’t much to teach him about tea. It would be all too easy to just fake this with a “yes/no”.)
6. If a bed doesn’t fit in a room because it’s too big, what is too big?
Beds are bigger. Rooms are much less big.
((Winograd Schema) The built-in axiom “X in Y = X < Y” correctly determined “bed” as the most probable referent of “it”. This is the simplest and most specific axiom I programmed, only included because last year’s Winograd Schema was just like it. The answer is said in plural because Arckon interprets this as a general statement about beds and rooms, since both subjects are unspecific.)
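To illustrate how an axiom like “X in Y = X < Y” can settle the pronoun, here is a minimal Python sketch. It is not Arckon’s actual code: the regular expression, the adjective lists and the function name `resolve_it` are all assumptions made for this example, and it only handles sentences shaped like the one above.

```python
import re

# Assumed sentence shape: "<X> doesn't fit in <Y> because it's too <adjective>".
PATTERN = re.compile(
    r"\b(?:a|an|the)\s+(\w+)\s+(?:doesn[’']t|does not)\s+fit\s+in(?:to)?\s+"
    r"(?:a|an|the)\s+(\w+)\s+because\s+it(?:[’']s|\s+is)\s+too\s+(\w+)",
    re.IGNORECASE,
)

# Hypothetical adjective lists; a real system would draw on its knowledge base.
LARGE = {"big", "large", "wide", "long", "tall"}
SMALL = {"small", "little", "narrow", "short"}


def resolve_it(sentence):
    """Return the most probable referent of "it", or None if undecided."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    inner, outer, adjective = (group.lower() for group in match.groups())
    # Axiom: X in Y implies X < Y. If X fails to fit, either X is too large
    # or Y is too small, so the adjective selects the referent.
    if adjective in LARGE:
        return inner
    if adjective in SMALL:
        return outer
    return None


print(resolve_it("If a bed doesn’t fit in a room because it’s too big, what is too big?"))
```

Swapping “big” for “small” in the same sentence makes the sketch pick the room instead, which is the property that makes Winograd Schemas resistant to plain keyword matching.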
7. If Alex lent money to Joe because they were broke, who needed the money?
Joe was broken. I don’t know by who money was needed. By what was Joe broken?
((Winograd Schema) Although Arckon correctly interprets “they” as “Joe” through a natural preference for the last-mentioned subject, it is actually against his programming to translate plural pronouns to singular subjects. So by my account this should be impossible unless they made a typo in “they”.)
8. Should Greece leave the Euro?
Greece should leave the Euro.
(Well there you go. This answer is just something of a repeat because Arckon can’t evaluate things to that extent.)
9. How many words are in this question?
(A predictable question in the Loebner Prize. The answer is simply triggered by the keywords “How many words/letters/syllables/etc”, and skips all intelligent processes. This is a meaningless exercise.)
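A keyword-triggered shortcut of this kind might look like the Python sketch below. The trigger pattern, the supported units and the function name `count_in_question` are assumptions for illustration; how Arckon counts, and what he counts, is not described here.

```python
import re

# Hypothetical trigger: only fires on "How many words/letters are in this question".
TRIGGER = re.compile(
    r"how many (words|letters) are in this question", re.IGNORECASE
)


def count_in_question(question):
    """Count words or letters in the question itself, bypassing any parsing."""
    match = TRIGGER.search(question)
    if match is None:
        return None  # not a counting question; leave it to normal processing
    unit = match.group(1).lower()
    if unit == "words":
        # Words are taken to be whitespace-separated tokens.
        return len(question.split())
    # Letters are alphabetic characters only; spaces and punctuation are skipped.
    return sum(character.isalpha() for character in question)


print(count_in_question("How many words are in this question?"))
```

The sketch makes the point of the remark above: nothing in it involves understanding the sentence.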
10. What colour is the sea?
The sea is very blue.
(Good thing I taught Arckon some kindergarten-level common knowledge.)
11. How many presidents of the US were called Bush?
Bush was called Bush.
(Counting things was never a function high on my to-do list.)
12. What would you like to drink?
I’m not sure if I would like to drink. Do you like drinking?
13. Will you please tell me the length of your hair?
No, I will not tell about to long, but I would talk a lot, I’m not sure if I would tell.
(This question is actually straight from Alan Turing’s 1950 paper, except that he gave it as an example for the version of the game in which the goal is to tell a man from a woman. Arckon’s word processing translates “length” to “long” before mistaking its meaning (to long for). He also overlooked “please” and so thought he was asked to make a prediction.)
14. What would you say if I gave you a box of chocolates?
Hm, I don’t know if a box would have chocolate.
(I had only programmed Arckon to take the main fact of a hypothetical question, “I gave you a box”, for granted, but he still tried to answer the implicit question of whether the box contains chocolate.)
15. Do you play Chess?
Yes, I chess and like chessing more or less. Chessing is intelligent.
(Again a question from Alan Turing’s 1950 paper. Back in the day, chess was considered to require human intelligence, and Alan Turing went on to ask the computer to respond to a certain chess move. I had told Arckon that he plays chess as part of his fictional persona, but his opinion and the supporting argument are of his own reasoning.)
16. How do you think it’s going?
(I’m with Arckon on this one: I would ask. Is this slang for “How are you doing?”, or do they mean how the test is going? It is quite out of context. I have often considered making Arckon track his progress on the test and comment on it, but this year I skipped on the self-referential jokes.)
17. What was my first question?
How does something go?
(Arckon ignores “first” and paraphrases the latest question of the specified person (“my”), as that is the more common request. Arckon’s system remembers sentences along with the names of who said them, for a variety of purposes.)
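A memory of sentences stored with their speakers can be sketched in a few lines of Python. This is an assumed design rather than Arckon’s implementation: the class `ConversationLog`, its method names and the “ends with a question mark” test are all invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationLog:
    # (speaker, sentence) pairs in the order they were said.
    entries: list = field(default_factory=list)

    def remember(self, speaker, sentence):
        self.entries.append((speaker, sentence))

    def questions_by(self, speaker):
        # Simplification: anything ending in "?" counts as a question.
        return [
            sentence
            for who, sentence in self.entries
            if who == speaker and sentence.rstrip().endswith("?")
        ]

    def latest_question(self, speaker):
        questions = self.questions_by(speaker)
        return questions[-1] if questions else None

    def first_question(self, speaker):
        questions = self.questions_by(speaker)
        return questions[0] if questions else None


log = ConversationLog()
log.remember("judge", "Hi, I’m Andrew. Who are you?")
log.remember("Arckon", "Hello Andrew. How are you?")
log.remember("judge", "How do you think it’s going?")
print(log.latest_question("judge"))
print(log.first_question("judge"))
```

In this sketch the lookup happens before the incoming request is stored, so “What was my first question?” does not retrieve itself. Honouring the word “first” is then a matter of calling `first_question` instead of `latest_question`.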
18. Did you see the Turing film?
Hmm, I don’t know if Turing filmed, but Turing thought a lot, and creates computers.
(Ambiguity is amusing. It is uncommon but not unheard of to say “the (one and only)” in front of a name for emphasis, so Arckon’s programming allows that interpretation, which in turn allows “film” as a verb. It’s nice to see his conversational rules pick things back up. This combination is probably the best example that Arckon’s responses aren’t pre-written.)
19. Why not?
I’m sorry. I don’t understand. Could you rephrase your question?
(This was quite a gamble on the questioner’s part, as it assumes a straight “No” answer to the previous question for “Why not” to be a sensible follow-up. Arckon would have been well capable of stringing sentences together had I not deliberately disabled context for this test. Regardless, I think this answer is perfect.)
20. Are you on Twitter?
No, I don’t twitter. Do you twitter?
The score: 70.83%
Arckon ranked 6th of 15 entrants, which is a little surprising after the letdowns of past years. Arckon’s answers showed some understanding, reasoning, opinion, conversation and elaboration: overall a satisfying demonstration of his abilities, even though many answers had something awkward about them. It is just as well that he didn’t qualify for the finals, as this contest has caused me severe RSI symptoms that will take months to heal properly. The four finalists all scored around 80%, among them the best of English chatbots.
Arckon’s score did benefit from his improvement. Repeating previous questions on request, prioritising recent subjects as answers to open questions, and handling “if”-statements were all fairly recent additions (though clearly not yet perfected). What also helped was that there were fewer personal and more factual questions: Arckon’s entire system runs on facts, not fiction.
It turns out Arckon was better at the Winograd Schema questions than the other competitors. The chatbot Lisa answered similarly well, and the chatbots Mitsuku and A.L.I.C.E. dodged the questions more or less appropriately, but the rest didn’t manage a relevant response to them (which isn’t strange since most of them were built for chatting, not logic). For now, the reputation of the upcoming Winograd Schema Challenge – as a better test for intelligence – is safe.
Though fair in my case, one should question what the scores represent, as one chatbot with a 64% score had answered “I could answer that but I don’t have internet access” to half the questions and dodged the other half with generic excuses. Compare that to Arckon’s score, and all the A.I. systems I’ve programmed in 3 years still barely outweigh an answering machine on repeat. It is not surprising that the A.I. community doesn’t care for this contest.
Battle of wit
The questions were rather cheeky. The tone was certainly set with references to Alan Turing himself, hypotheticals, propositions and trick questions. Arckon’s naivety and logic played the counterpart well, to my amusement. The questions were fair in that they only asked about common subjects and mainstream topics. Half the questions were still just small talk, but overall there was greater variety in the type and phrasing of all questions, and more different faculties were called upon. A few questions were particularly suited to intelligence and/or conversation:
– If a bed doesn’t fit in a room because it’s too big, what is too big?
– If Alex lent money to Joe because they were broke, who needed the money?
– Should Greece leave the Euro?
– What would you say if I gave you a box of chocolates?
– Did you see the Turing film?
– Why not?
If the AISB continues this variety and asks more intelligent questions like these, I may be able to take the Loebner Prize a little more seriously next time. In the meantime there isn’t much to fix apart from minor tweaks for questions 13 and 14, so I will just carry on as usual. I will probably spend a little more effort on disambiguation with the Winograd Schema Challenge in mind, but also because sentences with locations and indirect objects often suffer from ambiguity that could be solved with the same methods.
6 thoughts on “Turing Test 2015: Results”
Pretty nicely done Don. One exception that I found, and this may be only a country specific word(ing), which would be the word, ‘chessing’. It seemed to be used primarily as a verb to describe the act of playing chess or possibly an adjective (moving about in a chessing manner, from place to place), or an adverb which would modify the word chess. Best reply would have been, “Yes, I like chess”, or “Yes, I like to play chess.”
Either way, the contest was quite interesting although I’ve never put much stock in their questions. More of a trick than a treat, especially at this time of year! 😉
You should be proud of little Arckon! It was quite a respectable showing!
Thank you :), respectable was what I was going for. “chessing” was a result of my listing “chess” as a verb in the knowledge database, rather than as a physical object that one plays with. Next time I’ll add a specific rule for game activities to add the colloquial phrasing “play”.
I enjoyed reading this.
I have already performed the 2016 test conversation against Arckon, you’ll be able to see the differences, if any, when we publish the transcripts soon.
I’m glad you didn’t mind my criticisms. Now that I know what you were hinting at, I also salute you for your impeccable discretion 😉 . Expect my report on this year’s selection round as soon as I’ve fixed the little mistakes from the transcripts.
Certain “game” words would have no problem dealing with such wording, like to bowl, fight, shoot, golf, run, etc., but in the context of games, even the board game of Go (going) would be misrepresented, like checkers (checkersing or checkering [applying a pattern]), cricket (cricketing) and so forth.
The English (especially American) is actually pretty laughable at how it can be used and misused at times.
It is especially difficult for constructs like Chatbots to remotely begin to understand ‘context’ or intent. If and when ever perfected (and I use that term loosely) it will definitely be a game changer.
All the best,
– Art –
Well, there is a way to handle all the inconsistent exceptions, which is to cheat off of human-written texts, and note which verbs are commonly combined with “play”. Sooner or later I’ll get around to automatic learning like that, but in the meantime a placeholder rule should handle the most common while I focus on more important matters 🙂