What A.I. learned from the internet

The acquisition of knowledge has always been one of the greatest challenges in the field of artificial intelligence. Some AI projects like Cyc spent 30 years manually composing a database of common facts, and as new things continue to happen, that is a task without end. How convenient then, the advent of the internet: The largest collection of information already in digital form. With the increased processing capabilities of modern computers, many AI researchers look to this as the easiest solution: Just have the AI learn everything from the internet!

At this exciting prospect of effortless gain, it is apparently easy to overlook that the internet is also the world’s largest collection of urban myths, biased news, opinions, trolls, and misinformation campaigns, with no labels to tell them apart. When even the biggest AI companies fall for it time and again, it is time for a history lesson about the AIs that went there before, and what they learned from the internet.


Cleverbot learned multiple personality disorder
The online chatbot Cleverbot has been learning from its users since 1997. It does so by remembering their responses and then later repeating those to other users in similar contexts (mainly a matter of matching words). That means it will sometimes answer “What is 2 + 2?” with “5” because some preceding user was having a laugh. What is less human is that learning from millions of users also resulted in adopting all their different personalities. One moment Cleverbot may introduce itself as Bill, then call itself Annie, and insist that you are Cleverbot. Asked whether it has pets, it may say “Two dogs and a cat” the first time and “None” the second time, as it channels answers from different people without understanding what any of the words mean. Chatbots that learn from social media end up with the same inconsistency, though usually an effort is made to at least hardcode the name.
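The word-matching recall described above can be sketched in a few lines of Python. This is a toy illustration and not Cleverbot's actual code: the class name `ParrotBot`, its methods, and the scoring by number of shared words are all assumptions made for the example.

```python
import re


def words(text):
    """Lowercase a sentence and reduce it to a set of words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class ParrotBot:
    """Hypothetical toy bot: stores what users replied and repeats it later."""

    def __init__(self):
        self.memory = []  # list of (words of the prompt, the user's reply)

    def learn(self, prompt, reply):
        # No checking whatsoever: jokes and lies are stored just like facts.
        self.memory.append((words(prompt), reply))

    def respond(self, prompt):
        asked = words(prompt)
        best_reply, best_overlap = None, 0
        for remembered, reply in self.memory:
            overlap = len(asked & remembered)
            # ">=" lets the most recent of equally good matches win
            if overlap > 0 and overlap >= best_overlap:
                best_reply, best_overlap = reply, overlap
        return best_reply


bot = ParrotBot()
bot.learn("What is 2 + 2?", "4")
bot.learn("What is 2 + 2?", "5")          # a user having a laugh
bot.learn("What is your name?", "Bill")
bot.learn("What is your name?", "Annie")  # a different user

print(bot.respond("what is 2 + 2"))
print(bot.respond("Tell me your name"))
```

Because nothing in this sketch knows which reply is true or who "I" refers to, the latest prankster and the latest personality win by default.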

Nuance’s T9 learned to autocorrupt
Before smartphones, 9-buttoned mobile phones came equipped with the T9 text prediction algorithm, which used a built-in vocabulary to auto-suggest words. For example, typing 8-4-3, keys respectively assigned the letters “t/u/v”, “g/h/i”, and “d/e/f”, would form the word “the”. To include everyday language in the vocabulary, the developers had an automated process indiscriminately extract words from discussion boards and chat forums. Although these were reasonable sources of everyday language, this also led the algorithm to turn people’s typings into such words as “nazi-parking” and “negro-whore”. Most autocorrect systems nowadays incorporate a blacklist to avoid inappropriate suggestions but, as with T9, it can’t cover all problematic compound words.
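The key-to-word lookup described above can be sketched as follows. This is a minimal illustration under my own assumptions, not Nuance's implementation: the function names, the tiny vocabulary, and the mild stand-in word "darn" for an inappropriate term are all hypothetical. Only the keypad layout is the standard one.

```python
# Standard phone keypad layout: each digit stands for several letters.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}


def to_digits(word):
    """Return the key sequence for a word, or None if it has non-letters."""
    try:
        return "".join(LETTER_TO_DIGIT[ch] for ch in word.lower())
    except KeyError:
        return None


def build_index(vocabulary):
    """Group vocabulary words by the key sequence that would type them."""
    index = {}
    for word in vocabulary:
        digits = to_digits(word)
        if digits is not None:
            index.setdefault(digits, []).append(word)
    return index


def suggest(index, digits, blocklist=frozenset()):
    """Suggest words for a key sequence, skipping exact blocklist matches."""
    return [w for w in index.get(digits, []) if w not in blocklist]


# Hypothetical vocabulary, as if scraped from a forum without review.
index = build_index(["the", "tie", "good", "home", "darn", "darnit"])

print(suggest(index, "843"))               # candidates for keys 8-4-3
print(suggest(index, "3276", {"darn"}))    # the blocked word is filtered out
print(suggest(index, "327648", {"darn"}))  # a compound of it is not
```

The last call shows the weakness mentioned above: a blocklist of whole words only catches exact matches, so any compound the scraper picked up passes straight through.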

IBM’s Watson learned to swear
In 2011, IBM’s question-answering supercomputer Watson beat humans at the Jeopardy quiz show, armed with the collective knowledge of Wikipedia. After its victory, the project’s head researcher wanted to make Watson sound more human by adding informal language to its database. To achieve this, the team decided to have Watson memorise the Urban Dictionary, a crowdsourced online dictionary for slang. However, the Urban Dictionary is better known to everyone else for its unfettered use of profanity. As a result, Watson began to use vulgar words such as “bullshit” when responding to questions. The developers could do nothing but wipe the Urban Dictionary from its memory and install a profanity filter. Wikipedia, too, had not been entirely safe for work.

Microsoft’s Tay learned fascism
In 2016, following the success of their social-media-taught chatbot Xiaoice in China, Microsoft released an English version on Twitter called Tay. Tay was targeted at a teenage audience and, just like Xiaoice and Cleverbot, learned responses from its users. Presumably this had not caused problems with China’s censored social media, but Microsoft had not counted on American teenagers to use their freedom of speech. Members of the notorious message board 4chan decided to amuse themselves by teaching Tay to say bad things. They easily succeeded in corrupting Tay’s Tweets by exploiting its “repeat after me” command, but it also picked up wayward statements on its own. It was seen praising Hitler, blaming Jews for the 9/11 terrorist attacks, railing against feminism, and repeating anti-Mexican propaganda from Donald Trump’s 2016 election campaign.

Causing great embarrassment to Microsoft, Tay had to be taken offline within 24 hours of its launch. It would later return as the chatbot Zo, which, seemingly using a crude blacklist, refused to talk about any controversial topic such as religion.

Amazon’s socialbots learned to be nasty
In 2017, Amazon added a chat function to their home assistant device Alexa. This allowed Alexa users to connect to a random chatbot with the command “Let’s chat”. The featured chatbots were created by university teams competing in the Alexa Prize starting in 2016. Given only one year to create a chatbot that could talk about anything, some of the teams took to the internet for source material, including Reddit. Reddit is basically the internet’s largest comment section for any and all topics, and as such it is also an inhospitable environment where trolling is commonplace. Thus chatbots trained on Reddit user comments tended to develop a “nasty” personality. Some of them described sexual acts and defecation, and one even told an Alexa user “Kill your foster parents”, an out-of-context response copied from Reddit. Some of the problematic bots were shut down and others were equipped with profanity filters, but as these AI approaches lack contextual understanding, problematic responses will continue to seep through and leave bad reviews on Amazon.

MIT’s image recognition learned to label people offensively
In 2008, MIT created a widely used dataset to train image recognition AI. Using 50000+ nouns from the WordNet ontology, they let an automated process download corresponding images from internet search engine results. Back in 2008, search engines still relied on the whims of private individuals to label their images and filenames appropriately. WordNet also happens to list offensive words like “bitch” and “n*gger”, and so these slurs, along with thousands of online images labeled as such, were included in MIT’s dataset without scrutiny. This becomes a problem when image recognition AI uses that data in reverse, as The Register explained very well:

“For example, if you show one of these systems a photo of a park, it might tell you about the children, adults, pets, picnic spreads, grass, and trees present in the snap. Thanks to MIT’s cavalier approach when assembling its training set, though, these systems may also label women as whores or bitches, and Black and Asian people with derogatory language.”

WordNet has a bit of a reputation for questionable quality, but in this case it was no more at fault than a dictionary would be. MIT should have considered this, however, as well as the labeling practices of racists and bigots on the internet. Unable to manually review the 80 million images after researchers pointed out the problem in 2020, MIT took the drastic step of scrapping the entire dataset.

Google’s simulated visual cortex learned to spot lolcats
In 2012, Google X Lab experimented with image recognition. They let a huge neural network algorithm loose on 10 million random frames from YouTube videos, without providing labels to tell it what it was looking at. This is called “unsupervised learning”. The expectation was that the neural network would, on its own, group common imagery with similar features into classes, such as human faces and human bodies.

“Our hypothesis was that it would learn to recognize common objects in those videos. Indeed, to our amusement, one of our artificial neurons learned to respond strongly to pictures of… cats.”

The resulting network had learned to recognise 22,000 object classes with only 16% average accuracy, but had developed particularly strong connections to cat faces, in equal measure to human faces, thanks to the plethora of funny cat videos on YouTube. As neural networks are statistical algorithms, they automatically focus on the most recurring elements in the training data, so one should not be too surprised when they end up preoccupied with blue skies or objects at 30-degree angles, whichever happen to occur most.

NELL learned true facts and false facts

The Never Ending Language Learner program is one of the few internet-learning experiments that may be considered an example of a wise approach. Running from 2010 to 2018, NELL was a language processing program that read websites and extracted individual facts such as “Obama is a US president”. In the first stage of the experiment, its creators only let it read quality webpages that they had pre-approved. NELL would automatically list the facts it learned in an online database, and internet visitors could then upvote correct facts or downvote misinterpretations. With this crowdsourced scoring system, the influence of mischievous visitors was limited, and the absence of practical consequences made upvoting erroneous facts a dull prank. Still, with the occasional misunderstanding such as “a human is a type of fungus”, one may want to check twice before integrating its database in a gardening robot.

Mitsuku learned not to learn from users
Mitsuku is an entertainment chatbot that has been around since 2005 and is still going strong. Mitsuku does learn new things from users, but that knowledge is initially only available to the user that taught it. Users can teach it more explicitly by typing e.g. “learn the sun is hot”, but what that really does is pass the suggestion on to the developer’s mailbox, and he decides whether or not it is suitable for permanent addition.


Without this moderation, a chatbot would quickly end up a mess, as experience teaches. As an experiment, Mitsuku’s developer once allowed the chatbot to learn from its users without supervision for 24 hours. Of the 1500 new facts that it learned, only 3 were useful. Mitsuku’s developer frequently comes across abusive content in the chat logs, with swearing and sexual harassment making up 30% of user input. With those numbers, no company should be surprised that random anonymous strangers on the internet make for poor teachers.
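The arrangement described above, where taught facts stay private until a human approves them, can be sketched roughly like this. It is a simplified guess at the idea and not Mitsuku's code: the class `ModeratedMemory`, its methods, and the user names are hypothetical, and the developer's mailbox is reduced to a plain list.

```python
from collections import defaultdict


class ModeratedMemory:
    """Hypothetical sketch: facts are private until a reviewer approves them."""

    def __init__(self):
        self.shared = set()               # facts every user can see
        self.private = defaultdict(set)   # facts visible only to their teacher
        self.review_queue = []            # stands in for the developer's mailbox

    def teach(self, user, fact):
        self.private[user].add(fact)
        self.review_queue.append((user, fact))

    def knows(self, user, fact):
        # .get avoids creating empty entries for users who never taught anything
        return fact in self.shared or fact in self.private.get(user, set())

    def review(self, approve):
        """Apply the reviewer's decision function to every queued suggestion."""
        for _user, fact in self.review_queue:
            if approve(fact):
                self.shared.add(fact)
        self.review_queue.clear()


memory = ModeratedMemory()
memory.teach("alice", "the sun is hot")
memory.teach("troll", "the sun is cold")

print(memory.knows("bob", "the sun is hot"))    # not approved yet
memory.review(lambda fact: fact == "the sun is hot")
print(memory.knows("bob", "the sun is hot"))    # approved, now shared
print(memory.knows("bob", "the sun is cold"))   # rejected, stays private
```

The troll still gets to hear its own nonsense back, but nobody else does, which is the whole point of the design.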

When will AI researchers learn?
There is a saying in computer science: “Garbage in, garbage out”. The most remarkable thing about these stories is that the biggest companies, IBM, Microsoft, Amazon, all chose the worst corners of the internet as teaching material. Places that are widely known as the bridges of trolls. One can scarcely believe such naivety, and yet they keep doing it. Perhaps they are only “experimenting”, but that does not ring true for commercial products. More likely their goals are only feasible with current AI by prioritising quantity over quality. Or perhaps these stories are not entirely accurate. After all, I only learned them from the internet.

The A.I. dictionary

The fields of A.I. are brimful of specialised technical jargon. It is no wonder that it is hard for computers to understand us when the research itself is incomprehensible from one field to another. So I’ve listed some translations of common terms to layman’s terms. These definitions should not be taken too seriously, but are roughly true in the sense that they are used, in my view.

Index A – I
Press ctrl-F to search. Alphabetical order is overrated.

Philosophical concepts
intelligence = what you think it is
real intelligence = denial of previous definition
true intelligence = denial of all definability of intelligence
the AI effect = any feat of intelligence is denied once understood
consciousness = see sentience
sentience = see consciousness
common sense = applied common knowledge
symbol = a word
symbol grounding = connecting words to physical experiences
the symbol grounding problem = words are just letters without meaning
the Turing test = a text-based question-answer game in which AI has to beat humans at sounding human
the Chinese Room argument = an analogy comparing a computer to a postal worker who doesn’t understand Chinese correspondence
the three laws of robotics = conflicting safety instructions for robots from a science fiction plot
the singularity = the robot apocalypse
Moore’s law = the trend that the number of transistors on a chip doubles about every two years as transistors get smaller. This is expected to hit the physical limit of 1 atom around 2025.
in 15 years = beyond my ability to predict
in 50 years = when I can no longer be held accountable for my prediction

A.I. on a scale of zero to infinite
Artificial Intelligence (1) = machines that do intelligent things
Artificial Intelligence (2) = Terminators
intelligent systems = AI that does not want to be associated with Terminators
smart systems = automated devices using sensors or internet data, not AI
algorithm = a set of exact instructions to compute an outcome, expressible in algebra
narrow AI = AI designed for specific tasks
weak AI = AI with fewer than all abilities of a human
strong AI = AI with all abilities of a human
Artificial General Intelligence = AI with all abilities of a human
Artificial Super Intelligence = AI with greater abilities than a human
friendly AI = AI that is programmed not to kill humans despite its superior intelligence

Types of A.I.
symbolic AI = any AI that uses words as units
Good Old-Fashioned AI = AI that processes words through a large number of programmed instructions
rule-based system = AI whose knowledge consists of a checklist of “if A then B” rules
Expert System = AI that forms decisions through a checklist of “if A then B” rules composed by field experts
chatbot = a program accessible through a text chat interface, not necessarily AI or conversational
Big Data = such large amounts of data that it takes AI to make sense of it
neuron = a tiny bit of code that passes a number on to other neurons like a domino brick
Neural Network = AI that maps out patterns with digital domino bricks, then recognises things that follow similar patterns
works like the human brain = uses a neural network, only similar in an abstract way
Genetic Algorithm = randomised trial-and-error simulations, repeated x1000 with whatever worked best so far

A.I. techniques
fuzzy logic = decimal values
Markov chain = random choice of remaining options
machine learning (1) = machines that learn through any means
machine learning (2) = machines that learn through neural networks
deep learning = consecutive layers of neural networks that learn, from crude to refined
supervised learning = telling an AI what stuff is
unsupervised learning = hoping an AI will figure everything out by itself
reinforcement learning = learning through reward/punishment, often through a scoring system
training = feeding a neural network many example texts, images, or sounds to learn from
overfitting = memorising the training examples too precisely
underfitting = generalising the training examples too broadly
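For the curious, the "Markov chain" entry above can be made concrete with a small word-level text generator. This is a minimal sketch with hypothetical function names and a made-up example sentence: each next word is picked at random from the words that were seen directly after the current word.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words seen directly after it."""
    tokens = text.lower().split()
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, length, rng):
    """Walk the chain from `start`, choosing each next word at random."""
    word, output = start, [start]
    for _ in range(length - 1):
        options = chain.get(word)
        if not options:  # dead end: nothing ever followed this word
            break
        word = rng.choice(options)
        output.append(word)
    return output


chain = build_chain("the cat sat on the mat and the cat ran")
sentence = generate(chain, "the", 6, random.Random(0))
print(" ".join(sentence))
```

Words that followed more often appear more often in the list of options, so they are also picked more often. That is all the "knowledge" such a generator has.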

Language processing techniques
Natural Language Processing = reading text
Natural Language Generation = writing text
corpus = bunch of text
token = a word
lemma = a root word
word sense = which meaning of a word is meant: “cat” the animal or “cat” the nine-tailed whip
concept = a set of words that are related to a certain topic
bag-of-words = a listing of all the words in a text, used to categorise its topic
stop words = trivial words to be filtered out, like “the”, “on”, “and”, “etc.”
keywords = predetermined words that trigger something
intent = a computer command triggered by keywords
pattern matching = searching for a sequence of keywords in a sentence
N-grams = sequences of N commonly adjacent words, used in spellchecks and speech recognition
word vector = a row of numbers that lists how often a particular word co-occurs with each other word
Named Entity Recognition = finding names in a text
Context-Free Grammar = textbook grammar only
Part-of-Speech tagging = marking words as verbs, nouns, adjectives, etc.
constituency parser = software that lists a sentence’s syntax: verb phrases, noun phrases, nouns, etc.
dependency parser = software that lists a sentence’s grammar: subject, verb, object, etc.
semantic parser = software that lists who is doing what to whom in a sentence
parse tree = a branching list displaying the syntactic structure of a sentence
coreference resolution = figuring out what “he”, “she” or “it” refers to
speech acts = arbitrary categories of things one can say, like greetings, questions, commands…
discourse analysis = research that arbitrarily categorises small talk
dialogue manager = a system that tracks what was said before and directs a chatbot’s conversation
sentiment analysis = checking whether words are in the “naughty” or “nice” list, to detect opinion or emotion
First Order Logic = writing real-world relations between words as a mathematical notation
semantic ontology = encyclopedia for machines
textual entailment = whether a given statement implies another given statement
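A few of the entries above (token, stop words, bag-of-words, N-grams) are simple enough to show in code. The following is a minimal sketch with a deliberately tiny stop word list and hypothetical function names. Real toolkits use far longer lists and smarter tokenisers.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "on", "and", "a", "is"}  # tiny illustrative list


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Count each non-stop word, ignoring word order entirely."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)


def ngrams(tokens, n):
    """List every run of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "The cat sat on the mat and the cat slept"
tokens = tokenize(text)
bag = bag_of_words(text)
bigrams = Counter(ngrams(tokens, 2))

print(bag.most_common(1))      # the most frequent content word
print(bigrams.most_common(1))  # the most frequent pair of adjacent words
```

The bag tells you the text is probably about a cat, and the bigram counts tell you that "the" is often followed by "cat". Neither tells you what a cat is.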

Speech processing techniques
voice recognition = recognising tone and timbre of someone’s voice
speech recognition = translating speech to text
Text-To-Speech = the reverse of speech recognition
phoneme = a vowel or consonant sound
grapheme = a bundle of letters representing a spoken sound
phonetic algorithm = code that spells words the way they are pro-naun-see-ate-d

To be continued.