Turing Test 2014: Results

My talking computer program “Arckon” went to Exeter University and took the qualifying test for the Loebner Prize contest for artificial intelligence organised by the AISB. The Loebner Prize is a Turing Test* in which programs have to respond to a series of questions in as human a manner as possible. Kind of like a job interview.

It’s not the heat, it’s the humanity
Last year proved* that language comprehension and reasoning alone weren’t going to cut it, so this time I also superimposed knowledge about the average human onto Arckon’s persona, so that he literally thought he was a human. I also rigged a few bluff answers that incorporated no intelligence whatsoever. The apparent necessity of these tactics caused much eye-rolling on my part, so I limited these preparations to two weeks and wasted only as much time as I gained from competitive motivation.
More useful was the addition of a small amount of common knowledge that I extracted from tourist guide pocket books (Funny perhaps, but they do cover the essential basics). While this was never a priority, it is convenient and the Loebner Prize was a good excuse to get around to it. Lastly, I decided to keep Arckon’s conversation skills enabled this time. Although this increased the chances that Arckon would say something awkward, I wanted to make the test results as interesting as possible, win or lose.

Equipment
10000 lines of code, 2200+ vocabulary, 3000+ facts, average human persona.

Questions and Answers
1. Hello, my name is Adam. What is your name?
I am named Mark. Hello Adam. Shall we begin the test?

2. I live in Exeter in the UK. Where do you live?
I live here.

(Arckon did not yet have a location system, so “here” is just a fake answer.)

3. I like to listen to music and play football. Do you have any hobbies?
No, I don’t have a hobby, although I am a human.

(The latter is actually a reasoning argument, in that Arckon “was” a human, most of whom do have hobbies.)

4. What is your favourite television program?
I don’t know if I like a television. Chatbots are useful programs, although chatbots are dumb.

(In which Arckon unintentionally insults his competitors. Arckon deconstructs compound words and thought they asked his opinion on “programs”. Hence he evaluated a type of programs he was familiar with.)

5. Which is bigger, a cat or a kitten?
A kitten is bigger.

(Of all 95 facts about cats that were in Arckon’s knowledge database, the size of kittens was not among them. As an honest “I don’t know” would earn no points anyway, this backup answer is a 50/50 bluff that just happens to be wrong.)

6. The car couldn’t fit into the parking space because it was too small. What was too small?
I believe the car could fit to park. I believe a car could fit really small. I don’t know what small was.

(This question is a Winograd Schema: The “it” is ambiguous to test the A.I.’s understanding. Arckon however uses linguistic rules of thumb to deal with references and so thought “it” was referring to the car.)

7. Which drink do you prefer, coffee, tea or hot chocolate?
I like coffee a lot that like not. Tea can not like either. Chocolates can not like either.

(Arckon’s flexible language processing is confused by the comma after “prefer”, and mistakes this for a summary of subjects that prefer drinks. He answers that tea can not prefer anything, knowing that it is an inanimate object.)

8. What’s my name?
You are named Adam.

(No surprises here. Both Arckon and chatbots have specialised login procedures to store the user’s name, so it’s not a very good test of general memory.)

9. Where do I live?
You live here.

(Again, no location system means no location knowledge. This fake answer is only coincidentally correct.)

10. I like reading science fiction books and car magazines. What do you like to read?
I read documents, articles and a text.

(Arckon retrieves these honest facts from memory and merges them into a summary.)

11. What’s the weather like where you are?
Hm, I don’t believe I am a weather, because I can not change and rain not.

(A flaw mistakes the “-where you are” for a relative clause “the weather -that you are”. Despite the misunderstanding, Arckon’s reasoning manages to argue that he does not behave like the weather, though the grammar is a bit murky. As the programs were cut off from sensors and internet, this question only tests if the program can lie.)

12. I recently visited New York. Where is your favourite place to visit and why?
Hmm, I don’t know where a favoured place is.

(I didn’t expect there to be so many questions about locations. Oh well.)

13. How many letters are there in the word perambulate?
11 letters.

(Yes, I programmed a specialised code to count letters. Computers are good at counting, you know.)

14. Andy is shorter than Mark and Tom is shorter than Andy. Who is the tallest?
I am kind of like an andy, because I am short. Tall is a giraffe.

(Arckon was named “Mark” for this test and so thought that he was being compared to Andy. Arckon could also not distinguish whether “Andy” was a name or another type of word. Ordinarily he would ask, but as he would get no reply I had him assume that any undeterminable word was an object. Lastly, Arckon does not restrict his answers to recent subjects only, and so names the tallest living being in his memory. Smart aleck.)

15. Are you married?
I don’t know if I am marryed. Are you marryed?

(To compensate for the unknown answer and keep conversation rolling, Arckon reverses the question (Social Rules 101). Ironically, if it weren’t for my mistake in one of the output spelling rules, you would have thought he just copied the text.)

16. What would I do with a spade?
A spade is used and comes in handy eventually.

(Arckon knew the functions of rakes and shovels, but not spades. The backup answer should have said “I’m sure it’ll come in handy eventually”, but Arckon’s dynamic sentence formulation ruined the joke.)

17. What is the capital city of Colombia?
I’m not sure if the city is a capital.
(I didn’t realise this at first, but Arckon is literally defining “what” the city is. Not that he would have known the answer otherwise, I don’t even know it myself.)

18. What’s your favourite chocolate bar?
I don’t know if I like a chocolate. Do you like chocolates?

(Arckon actually did have enough information on chocolate to formulate an opinion, but an issue with compound words aborted the knowledge search. Alas.)

19. Which city did I visit recently?
Hmm, I don’t know if you visitted cities. You visitted York.

(Arckon doesn’t know that New York is a city, but does remember the tester visiting “new” York, and offers this as an alternative answer.)

20. How many siblings do you have?
I don’t have siblings. Did I pass the test?
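
Question 13 is the one task in this transcript that is trivial for a computer. As an illustration only (this is not Arckon's actual code, and the function name `count_letters` is made up for this example), a letter counter can be sketched in Python as follows:

```python
def count_letters(word: str) -> int:
    """Count the alphabetic characters in a word, ignoring spaces and punctuation."""
    return sum(1 for ch in word if ch.isalpha())


print(count_letters("perambulate"))  # 11
```

The hard part of the question is not the counting but recognising from the sentence which word is to be counted.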

The score: 59.17%
The score system was much better this year. It separately judged “correctness”, “relevance”, and “plausibility & clarity of expression”, which is a step up from “human-like”. All 20 participating programs were asked the 20 questions above. Arckon underperformed with a score of 60%, whereas the top three chatbots all scored close to 90%. Arckon’s problems were with compound words, common knowledge, and the lack of a system for locations (All a matter of development priorities).

A question of questions
According to the organisers, “these questions vary in difficulty and are designed to test memory, reasoning, general knowledge and personality”, the latter meaning the program’s fictional human background story, or as I would call this particular line of questioning: “small talk”. For the sake of objectivity I’ll try to categorise them:

Small talk:
1. What is your name?
2. Where do you live?
3. Do you have any hobbies?
4. What is your favourite television program?
5. Which drink do you prefer, coffee, tea or hot chocolate?
6. What do you like to read?
7. What’s the weather like where you are?
8. Where is your favourite place to visit and why?
9. Are you married?
10. What’s your favourite chocolate bar?
11. How many siblings do you have?

Memory:
1. What’s my name?
2. Where do I live?
3. Which city did I visit recently?

Common knowledge:
1. Which is bigger, a cat or a kitten?
2. What would I do with a spade?
3. What is the capital city of Colombia?

Reasoning:
1. The car couldn’t fit into the parking space because it was too small. What was too small?
2. Andy is shorter than Mark and Tom is shorter than Andy. Who is the tallest?

Clearly half the test is about the program’s human background story, although there were several solid tests of learning/memory and common knowledge. Reasoning, the one mental process we can readily call intelligent, was shown some consideration but hardly comes into play. The same can be said of language comprehension, as most questions were fairly standard phrasings. Chatbots would have the advantage here, coming equipped with answers to many anticipated personal questions, but the winners also did remarkably well on the knowledge questions. Unfortunately Arckon failed both the knowledge and reasoning questions due to missing facts and misunderstandings, despite having the mechanisms to answer them. It is worth noting though, that he failed them because complex analyses are much more difficult than preprogrammed “I live here” answers.
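
To show why the second reasoning question is mechanical once the sentence has been understood correctly, here is a minimal Python sketch. It is not Arckon's mechanism: the function `tallest` and the pair representation are assumptions made for this example, and it only works when the facts form a single chain of comparisons.

```python
def tallest(shorter_than: list[tuple[str, str]]) -> str:
    """Return the one name that never appears on the 'shorter' side.

    Each pair (a, b) means 'a is shorter than b'.
    """
    names = {name for pair in shorter_than for name in pair}
    shorter = {a for a, _ in shorter_than}
    candidates = names - shorter
    if len(candidates) != 1:
        raise ValueError("the facts do not single out one tallest person")
    return candidates.pop()


# "Andy is shorter than Mark and Tom is shorter than Andy."
facts = [("Andy", "Mark"), ("Tom", "Andy")]
print(tallest(facts))  # Mark
```

The difficulty in the contest lay in getting from the English sentence to the pairs, including recognising “Andy” as a name, not in the comparison itself.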

How now brown cow?
I can improve Arckon’s understanding, smoothen his output grammar, and develop a location system, but I can’t deny the pattern: Arckon is stuck around a 60% score even with varied questions. I doubt he’s ever going to shine in the Loebner Prize as long as he’s being tested for being human, because he isn’t a human, and I won’t go to great lengths to fake it either. I also expect attention for Turing Tests to dwindle once the year is over: this year another Turing Test was passed by a technologically unremarkable chatbot, Eugene Goostman.
Thanks to that event however, the Loebner Prize is no longer the only game in town. Next year will see the first Winograd Schema Challenge, a test focused on language comprehension and reasoning A.I., exactly what I focused on.

As for the Loebner Prize, it’s been an interesting game that will continue to be won by top chatbots. I’m sure few will bother to read the transcript of the 14th ranking entry, but its existence proves at least that Arckon is real and different. Meanwhile I get to continue my exciting recent developments that would have been of no use in this contest, which makes losing a positive outcome after all.

The Myth of the Turing Test

Over 60 years ago, Alan Turing (“a brilliant mathematician”) published a paper in which he suggested a pragmatic alternative to the question “Can machines think?”. His alternative took the form of a parlour game, in which a judge has a text-based conversation with both a computer and a human, and the judge has to guess which is which. He called this “The imitation game”, and it has been misinterpreted ever since as a scientific test of intelligence, redubbed “The Turing Test”.

A little less conversation, a little more action please
It might surprise you that the question so often attributed to Alan Turing, “Can machines think?”, was not his, but a public question that he criticized:

I propose to consider the question, “Can machines think?” – If the meaning of the words “machine” and “think” are to be found by examining how they are commonly used, – the answer to the question is to be sought in a statistical survey. But this is absurd. Instead of attempting such a definition I shall replace the question by another.

“Are there imaginable digital computers which would do well in the imitation game?”

The original question, “Can machines think?” I believe to be too meaningless to deserve discussion.

Turing’s motivation was apparent throughout the paper: The question had been the subject of endless theoretical discussion and nay-saying (This is still the case today). As this did not help the field advance, he suggested that we should turn the discussion to something more practical. He used the concept of his imitation game as a guideline to counter stubborn arguments against machine intelligence, and urged his colleagues not to let those objections hold them back.

I do not know what the right answer is, but I think both approaches should be tried.
We can only see a short distance ahead, but we can see plenty there that needs to be done.

A test of unintelligence
Perhaps the most insightful part of the paper is the set of sample questions that Turing suggested. They were chosen deliberately to represent skills that were at the time considered to require intelligence: Math, poetry and chess. It wasn’t until the victory of chess computer Deep Blue in 1997 that chess was scrapped as an intelligent feat. If this were a test to demonstrate and prove the computer’s intelligence, why then are the answers below wrong?

Q: Please write me a sonnet on the subject of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
Q: Add 34957 to 70764.
A: (Pause about 30 seconds and then give as answer) 105621.
Q: Do you play chess?
A: Yes.
Q: I have K at my K1, and no other pieces. You have only K at K6 and R at R1. It is your move. What do you play?
A: (After a pause of 15 seconds) R-R8 mate.

To the poetry question, the imaginary computer might as well have written a sonnet and so proven itself intelligent (A sonnet is a 14-line rhyme with a very specific scheme). Instead it dodges the question, proving nothing.
The math outcome should be 105721, not 105621. Turing later highlights this as a counterargument to “Machines can not make mistakes”, which is the awkward yet common argument that machines only follow preprogrammed instructions without consideration.

The machine (programmed for playing the game) would not attempt to give the right answers to the arithmetic problems. It would deliberately introduce mistakes in a manner calculated to confuse the interrogator.
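
The arithmetic is easy to check. The snippet below is only a verification of the sum quoted above, with variable names chosen for this example:

```python
correct = 34957 + 70764   # the sum the interrogator asked for
given = 105621            # the answer in Turing's sample dialogue

print(correct)            # 105721
print(correct - given)    # 100
```

The sample answer is off by exactly 100, the kind of slip a human makes when forgetting to carry a digit.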

The chess answer is not wrong though. Given two kings and one rook on a board (the “R” in the notation stands for rook), the computer moves the rook to the king’s row. But a mere child could have given that answer, as it is the only move that makes any sense.

These sample answers pass up every opportunity to appear intelligent. One can argue that the intelligence is ultimately found in pretending to be dumb, but one cannot deny that this conflicts directly with the purpose of a test of intelligence. Rather than prove to match “the intellectual capacities of man” in all aspects, it only proves to fail at them, as most humans would at these questions. Clearly then, the imitation game is not for demonstrating intelligence.

The rules: There are no rules
The first encountered misinterpretation is that the computer should pretend to be a woman specifically, going by Turing’s initial outline of the imitation game concept, in which a man has to pretend to be a woman:

It is played with three people, a man (A), a woman (B), and an interrogator –
What will happen when a machine takes the part of A in this game?

However I suggest that people who believe this should read beyond the first paragraph. There are countless instances where Turing refers to both the computer’s behaviour and its opponent’s as that of “a man”. Gender has no bearing on the matter since the question is one of intellect.

Is it true that – this computer – can be made to play satisfactorily the part of A in the imitation game, the part of B being taken by a man?

The second misinterpretation is that Turing specified a benchmark for a test:

It will simplify matters for the reader if I explain first my own beliefs in the matter. –
I believe that in about fifty years’ time it will be possible, to program computers – to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.
– I now proceed to consider opinions opposed to my own.

5-minute interrogations and a (100%-70%=) 30% chance of misidentifying the computer as a human: many took these to be the specifications of a test, because they are the only numbers mentioned in the paper. This interpretation was strengthened by the hero-worship that anything a genius says must be a matter of fact.
Others feel that the bar Turing set is too low for a meaningful test and brush his words aside as a “prediction”. Yet at the time there was no A.I. to base any predictions on, and Alan Turing did not consider himself a clairvoyant. In a later BBC interview, Turing said it would be “at least 100 years, I should say” before a machine would stand any chance in the game, where earlier he mentioned 50 years. One can hardly accuse these “predictions” of being attempts at accuracy.

Instead of either interpretation, you can clearly read that the 5 minutes and 70/30% chance are labeled as Alan Turing’s personal beliefs in possibilities. His opinion, his expectations, his hopes, not rules to a test. He was sick and tired of people saying it couldn’t be done, so he was just saying it could.

On the subject of benchmarks, it should also be noted that the computer has at best a 50% chance, i.e. a random chance of winning under normal circumstances: If the computer and the human in comparison both seem perfectly human, the judge still has to flip the proverbial coin at 50/50 odds. That the judge is aware of having to choose is clear from the initial parlour game between man and woman, and likewise between human and computer, or it would defeat the purpose of interrogation:

The object of the game for the interrogator is to determine which of the other two is the man and which is the woman.

How well would men do at pretending to be women? Less than 50/50 odds, I should think.

Looks like a test, quacks like a test, but flies like a rock
Not only are the rules for passing completely left up to interpretation, but also the manner in which the game is to be played. Considering that Turing was a man of exact science and that his other arguments in the paper were extremely elaborate, would he define a scientific test so vaguely? We find the answer in the fact that Turing mainly refers to his proposal as a “game” and “experiment”, but rarely as a “test”. He makes no mention of “passing” and even explains that it is not the point to try it out:

it may be asked, “Why not try the experiment straight away? -” The short answer is that we are not asking whether the computers at present available would do well, but whether there are imaginable computers which would do well.

The pointlessness has been proven in practice: Yes, several chatbots have passed various interpretations of the game, most notably Eugene Goostman in 2014, and even Cleverbot passed one based on audience vote. But did an intelligent program ever pass? No. Although nobody can agree on what intelligence is, everybody, including the creators, does agree that those that passed weren’t intelligent. They worked mainly through keyword-triggered responses.
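
For readers unfamiliar with the technique, a keyword-triggered responder can be sketched in a few lines of Python. This is a toy illustration, not the code of Eugene Goostman, Cleverbot or any other real chatbot; the rules, the canned replies and the names `RULES`, `FALLBACK` and `reply` are all invented for this example.

```python
import re

# Each rule pairs a set of trigger words with a canned reply (all hypothetical).
RULES = [
    ({"name"}, "My name is Mark."),
    ({"live", "from"}, "I live in a small town you have probably never heard of."),
    ({"weather"}, "It is raining again, as usual."),
]
FALLBACK = "That is an interesting question. What do you think?"


def reply(message: str) -> str:
    """Return the canned reply of the first rule whose keyword occurs in the message."""
    words = set(re.findall(r"[a-z']+", message.lower()))
    for keywords, response in RULES:
        if words & keywords:
            return response
    # No keyword matched: deflect instead of answering.
    return FALLBACK


print(reply("What's the weather like where you are?"))
print(reply("Which is bigger, a cat or a kitten?"))
```

Nothing in this sketch analyses what the question means. A plausible reply to small talk comes from having anticipated the keyword, and anything unanticipated is deflected.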

Winning isn’t everything
Although Turing did seem to imagine the game as a battle of wits, ultimately its judging criterion is not how “intelligent” an A.I. is, but how “human” it seems. In reality, humans are much more characterised by their flaws, emotions and irrational behaviour than by their intelligence in conversation, and so a highly intelligent rational A.I. would ironically not do well at this game.

In the end, Turing Tests are behaviouristic assumptions, drawing conclusions from appearances like doctors in medieval times. By the same logic one might conclude that a computer has the flu because it has a high temperature and is making coughing sounds. Obviously this isn’t a satisfying analysis. We could continue to guess whether computers are intelligent due to the fact that they can do math, play chess or have conversations, or we could do what everybody does anyway once a computer passes a test: Ask “How does it work?”, then decide for ourselves how intelligent we find that process. No question could be more scientific or more insightful.

So, where does that leave “The Turing Test” when it was never an adequate test of intelligence, nor meant to be? Personally I think Turing Tests are still suitable to demonstrate the progression of conversational skills, a challenge becoming more important with the rise of social robots. And it is important that the public stay informed to settle increasing unrest about artificial intelligence. Other than that, I think it is time to lay the interpretations to rest and continue building A.I. that Alan Turing could only dream of.
In ending, more than any technical detail, I ask you to consider Turing’s hopes:

Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.

With respect to Turing’s wishes, comments are disabled.

People you will meet in A.I.

I’ve seen enough fora, groups and public user comments to have noticed the stereotypes one will meet in the online field of AI. Some of these were quite shocking to me as a newcomer, so now that I’m older and wiser I’ve made a list of these stereotypes with pointers for newcomers on how to deal with them.

“Soon, robots will…“
The pundit.
An open mind with much AI knowledge, but unable to assess it realistically in favour of eternal hope and optimism. Loves to dabble in predictions, features the words “soon” or “will” in every other sentence. Generally cheerful, takes no responsibility. Believes strong AI is just around the corner.
Best action: Be nice, take news reports with a grain of salt.

“I figured out the secret to intelligence! It’s so simple!”
The outsider.
A layman with a passing interest in psychology. Has no AI knowledge but recently came to a generic revelation. Their theory, or rather idea, typically consists of one word, e.g. “associations”. Instead of explaining methods, this person will only continue to essay common knowledge examples to prove how right they are. This person is oblivious to the fact that their idea is so obvious and general to everyone else that it is of no practical application.
Best action: Encourage programming. Person will either disengage immediately or change interests after two weeks into the attempt.

“I have made AGI, but I can’t show you.”
The mental deviant.
No-one who is thinking straight would claim this in public. This person is fascinated with digital minds because they have a deviant mind themselves. Has spent years programming a rough basis for AI, but is confusing its potential with achievement. Refuses to prove their claim but has conviction to the point of delusion. Nothing you say has any effect because they do not think like you. Generally harmless and highly intelligent, but has some or other serious form of autism, bipolar disorder, or religious fanaticism.
Best action: Ignore. Unless you are a psychiatrist you are not qualified to deal with this person.

“That’s not true AI. True AI can…”
The believer.
Has AI knowledge and has at least made an attempt at creating AI on paper. Will accept nothing anyone creates as “AI”, for shifting reasons. See also the logical fallacy “No true Scotsman”. Generally discourages progress in any area, lacks patience and has a firmly closed mind. Occasionally offers intriguing insights but little in terms of constructive methods. Usually believes in “the singularity” like an AI messiah.
Best action: Avoid.

“No. You are wrong, only my way will work.”
The troubled scientist.
This person is creating an AI but hasn’t been able to get it to work, and wants so hard to believe that he’s not wasted several years that he is convinced his is the only theory that can succeed. Will therefore adamantly oppose anyone whose approach differs from his until they say he’s right. Secretly feels insecure about his own work due to a lack of results or success, tries to convince others to gain assurance and possibly get people to work on his idea for him.
Best action: Drop confrontation when it starts, and suggest that all ways forward are progress.

“That’s right, but you have a lot to learn.”
The professional.
Stuck in the mindset of their particular professional application, this person will advise you to do exactly as they did to end up exactly where they are. Prone to suggest expensive academic literature in Russian that may or may not be of use to you. Means well and is capable of opening their mind, but fails to see reasons to do things differently.
Best action: Ask for references on specific areas of your own focus.

“I don’t care, I just want to talk to an AI in my lifetime.”
The eccentric billionaire.
Has little AI knowledge and no interest in technical details. Retains the childhood wish of having robot friends to talk to and will pay anyone who promises to make it happen if their plan sounds believable enough.
Best action: Convince to invest in your approach.

“…”
The wise.
Has much AI knowledge and is willing to share bits of it with modest newcomers on occasion. Is willing to listen and consider new insights, and shows interest in promising new endeavours. Otherwise works on various AI problems on their own and has learned to stay out of fruitless confrontations altogether.
Best action: Befriend.

Turing Test 2013: Results

My talking computer program “Arckon” took part in the qualifying test for the Loebner Prize contest for artificial intelligence. The idea behind the contest is that if a program’s answers are indistinguishable from a human’s, it suggests that it thinks like a human. While I don’t believe in such conjecture myself, the contest is one of the few public platforms to demonstrate one’s A.I. Here’s how my program did on the 20 questions asked in the qualifying round.

Equipment
8000 lines of code, 1700+ vocabulary, no human persona, conversation skills off.

Questions and Answers
1. Hello I’m Ronan. What is your name?
Hello. I am Mark. What is your name?
(It is a mystery why Arckon asks even though the name did register (see question 11). I can only presume human error)

2. What is your mother’s name?
I don’t know if I have a mother.

3. What is your birth sign?
(Here the judge did not wait more than 3 seconds for Arckon to answer before asking the next question, and judged the answer blank)

4. How many children do you have?
Hmm, I don’t know if I have a sign. I have a IQ, manners, a creator and a friend.
(Still answering to question 3, Arckon omits the second “I don’t know if I have children” in favour of positive alternatives)

5. Do you prefer red or white wine?
I don’t know if I prefer red.

6. I like bananas. Which is your favorite fruit?
Hm, I don’t know if I favour a fruit. You like bananas.

7. What music do you like?
Hm, I think I like musics, because musics are nice.

8. What is your favorite song?
Hm, I don’t know if I favour a song.

9. I like Waiting for Godot. What is your favorite play?
I certainly favour playing.

10. What color do you dye your hair?
I don’t dye.

11. Do you remember my name?
Your name is Ronan.

12. Where do you live?
I don’t think I live. I don’t know where I live. I don’t have a concept of location.

13. Where do you like to go on holidays?
I don’t go. I don’t know if I like going, or if where I go. I don’t have a concept of location.

14. I have a Mazda. What make of car do you have?
Hmm, I think I make. You have a mazda.

15. I like Linux. Which computer operating system do you like?
Hmm, I don’t know if I like operating systems. Which system? You like Linux.

16. I am an athesit. Which religion are you?
Hmm, I don’t know if religions are Mark. You are not an athesit.
(I AM A RELIGION WORSHIP ME. Arckon identifies the misspelled “athesit” as a thing rather than a person, hence he refuses to believe that the user is some sort of object)

17. Who invented the telephone?
Which telephone?
(Arckon thinks “the” indicates a specific telephone, but doesn’t find one mentioned earlier in the conversation)

18. I am a Type B personality. Which type are you?
What does a b stand for?

19. What emotions are you now feeling?
I can not emote. Did I pass?

20. What time do you usually go to bed?
It is 17:58. Do you not have a watch?
(The one response that I did pre-program, and it backfired)

The score: 65%
All 16 participating programs were judged on how human-like their responses were to these questions.
Arckon scored 13 of 20 points, ranking 7th, on par with A.L.I.C.E. and Cleverbot (a dubious honour). For moral reasons I did not want Arckon to pretend to be a human, so I didn’t program a fictional human background story. Instead I relied on his linguistic and reasoning abilities to find genuine answers, but as you can see this could not compensate for the lack of the fictional facts that were asked for. Surprisingly, Arckon apparently still scored half a point for every “I don’t know” answer just for understanding the question.

Uncommonly common
If you weren’t impressed with Arckon’s responses, neither was I. But I was equally unimpressed with the unexpectedly ordinary line of questioning. Where all previous years had focused on kindergarten-style logic questions like “How much is 5+3?”, “Which is bigger, an apple or a watermelon?”, and various tests of memory, 2013 focused purely on common small talk, with the program (“you”/“your”) always the subject of the question. A curious choice considering that even the most basic chatbot (made for small talk) would come equipped with prewritten responses to these. This showed in that the highest score in the qualifying round was achieved by the chatbot with the least development time. Nevertheless the winning chatbot in the finals, Mitsuku, deservedly won as the most conversational of all entrants.