How to summarize the internet

An ironically long article about a summariser browser add-on.

Introductory anecdote:
Due to my interest in artificial intelligence I can’t help but get exposed to online articles about the subject. But as illustrated in the previous article*, this particular field is flooded with speculative futurism, uninformed opinions and sheer clickbait, wasting my time more often than not.

But I also happen to be an amateur language programmer, so I can do something about it. I spent years developing an A.I. program that can comprehend text through grammar and semantics, and I figured I might as well put it to use. So I added a function that would read whatever document was on my screen, filter out all unimportant sentences, and show me the remainder. It worked pretty well, and required surprisingly few of the A.I.’s resources. Now I’ve ported this summarisation function to a browser add-on, so that everyone can summarise online articles at the click of a button:

banner_chrome       banner_firefox

Problem statement: Statistics are average
Document summarisers do of course already exist, and their methods are inventively inhuman:

• The simplest method, used in e.g. SMMRY, counts how often each word occurs in the text, and then picks out the sentences that contain the most frequent words, which are presumably the main topics. Common words like “the” should of course be ignored, either with a simple blacklist, or with another word-counting technique by the confusing name “Term Frequency – Inverse Document Frequency”: How frequently a word occurs in the text versus how commonly it occurs across a large collection of other texts.
• Another common method looks at each paragraph and picks out the one sentence that has the most words in common with its neighbouring sentences, therefore covering the most of the paragraph’s subject matter. Sentence length is factored in so that it won’t just always pick the longest sentence.
• The most advanced method, “Latent Semantic Analysis”, picks out sentences that contain frequently occurring, strongly associated words, i.e. words that are often used together in a sentence are presumably associated with one and the same topic. This way synonyms of the main topics are also covered.
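
To make the first, word-counting method concrete, here is a minimal sketch. It is not SMMRY's actual code: the stop-word list and all function names are made up for illustration, and the division by sentence length is borrowed from the second method so that long sentences don't win by default.

```javascript
// Hypothetical blacklist of common words; a real one would be much longer.
const STOP_WORDS = new Set(["the", "a", "an", "of", "and", "to", "in", "is", "it", "that"]);

// Count how often each non-blacklisted word occurs in the text.
function countWords(text) {
  const counts = new Map();
  for (const word of text.toLowerCase().match(/[a-z']+/g) || []) {
    if (STOP_WORDS.has(word)) continue;
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Score a sentence by the summed frequency of its words, divided by its
// word count so that long sentences are not favoured automatically.
function scoreSentence(sentence, counts) {
  const words = sentence.toLowerCase().match(/[a-z']+/g) || [];
  if (words.length === 0) return 0;
  let total = 0;
  for (const word of words) total += counts.get(word) || 0;
  return total / words.length;
}

// Return the `howMany` best-scoring sentences, in their original order.
function frequencySummary(sentences, howMany) {
  const counts = countWords(sentences.join(" "));
  return sentences
    .map((sentence, index) => ({ sentence, index, score: scoreSentence(sentence, counts) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, howMany)
    .sort((a, b) => a.index - b.index)
    .map((entry) => entry.sentence);
}
```

Note that nothing in this scoring knows what a conclusion is: a sentence only wins by repeating the words that the rest of the text already uses.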

In my experience, however, I observed one problem with these statistical methods: Although they succeeded in retrieving an average of the subject matter, they tended to omit the point that the writer was trying to make, and that is the one thing I want to know. This oversight stands to reason: A writer’s conclusion is often just one or two sentences near the end, so its statistical footprint is small, and like an answer to a question, it doesn’t necessarily share many words with the rest of the article. I decided to take a more psychological approach. Naturally, I ended up re-inventing a method that dates all the way back to 1968.

The technical part: Skip if you don’t need to know.
Before one can even start, pinpointing the main text in a website’s HTML code is a challenge that developers have struggled with since forever, as custom formatting threw consistency out the window. Ideally one would just retrieve the text in between the HTML’s “article” tags, but these are not always present, and when they are, they can still contain menus and ads. Tools like Mozilla’s Readability or Beautiful Soup are recommended, but I use JavaScript’s RegExp functions to strip the HTML of hidden comments, JavaScript code, everything after the first footer tag, and then header, nav, style and form tags. This order is important, as hidden comments may contain obsolete code that could throw off the rest. I split the remainder into HTML tags and regular pieces of text with this simple bit of code:
var parts = text.split(/<[/]?[a-zA-Z][^>]*?>/gi);
Then I go over all parts, regarding “div”, “p” and “br” tags as line breaks, and apply the following rules:
• If more than 20 words follow a “div” layout tag, that’s where the main text starts.
• Any line < 12 words that doesn’t end in a period is a header, image description, blockquote, or advert.
• Any line < 12 words beginning with a typical imperative like “log”, “read”, or “subscribe” can be ignored.
• Any hyperlink that spans an entire line is a menu, advert, or link to another article.
• Two consecutive hyperlinks not separated by regular words are part of a horizontal menu.
Once all parts of the main text are identified, they can be pieced together to form one clean block of text.
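
As a rough sketch of how such line rules could be applied, the code below implements only the two rules about short lines; the 20-word start rule and the hyperlink rules are left out. All function names are hypothetical, and the capture group around the tag pattern is this sketch's own choice, so that the tags are kept in the split result and can be inspected.

```javascript
const TAG = /^<[/]?[a-zA-Z][^>]*?>$/;
const LINE_BREAK_TAG = /^<[/]?(div|p|br)\b/i;
const IMPERATIVE_START = /^(log|read|subscribe)\b/i; // only the three verbs named above

// Split HTML into tags and text. The capture group keeps the tags in the result.
function splitIntoParts(html) {
  return html.split(/(<[/]?[a-zA-Z][^>]*?>)/g);
}

// "div", "p" and "br" tags start a new line, other tags are dropped,
// and regular text is appended to the current line.
function partsToLines(parts) {
  const lines = [""];
  for (const part of parts) {
    if (!TAG.test(part)) {
      lines[lines.length - 1] += part;
    } else if (LINE_BREAK_TAG.test(part)) {
      lines.push("");
    }
  }
  return lines.map((line) => line.replace(/\s+/g, " ").trim()).filter((line) => line !== "");
}

// Apply the two short-line rules.
function isBodyLine(line) {
  const wordCount = line.split(" ").length;
  if (wordCount >= 12) return true;
  if (!/[.]["”')\]]?$/.test(line)) return false; // header, caption, blockquote or advert
  if (IMPERATIVE_START.test(line)) return false;  // "Subscribe to ...", "Read more ..."
  return true;
}

function extractMainText(html) {
  return partsToLines(splitIntoParts(html)).filter(isBodyLine).join("\n");
}
```

For example, given a header, a "Subscribe to our newsletter today." line and one long paragraph, only the long paragraph survives.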
The final preparation is to divide the text into sentences. This too is more complicated than you’d think. Consider the many different uses of punctuation:
“Mr. Watson of I.B.M. bought domain.com for 1.2 billion at 10:00 a.m. today.”
I already had experience with this in my original A.I. project, but to port it to JavaScript I had to condense it into the most ridiculous line of code that I’ve ever written:
text.match(/[\s\S]+?([aeiouy]+[a-z]+[.]+[\s]+|[a-z]+[aeiouy]+[.]+[\s]+|[0-9]+[.]+[\s]+|[:]+(?![0-9])|[;!?\n]+[\s]*)/gi);

This splits the text only after words of more than one letter with at least one vowel, followed by punctuation and a space. That way initials, abbreviations, acronyms and decimals are ignored. It does not cover abbreviations that contain a vowel, like “doc. Brown”, however, which strictly speaking would require a list of all abbreviations in every field.
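
A small usage sketch of the pattern above, wrapped in a hypothetical `splitSentences` helper. One assumption is added here: a line break is appended to the text, because the pattern expects whitespace after a closing period and would otherwise skip a final sentence that ends the text abruptly.

```javascript
// The pattern is copied verbatim from the line of code above.
const SENTENCE_PATTERN = /[\s\S]+?([aeiouy]+[a-z]+[.]+[\s]+|[a-z]+[aeiouy]+[.]+[\s]+|[0-9]+[.]+[\s]+|[:]+(?![0-9])|[;!?\n]+[\s]*)/gi;

function splitSentences(text) {
  // Assumption of this sketch: append a line break so that a last sentence
  // without trailing whitespace is still matched.
  const matches = (text + "\n").match(SENTENCE_PATTERN) || [];
  return matches.map((sentence) => sentence.trim()).filter((sentence) => sentence !== "");
}

const sentences = splitSentences(
  "Mr. Watson of I.B.M. bought domain.com for 1.2 billion at 10:00 a.m. today. The sale was confirmed later!"
);
console.log(sentences.length); // 2
```

The periods in “Mr.”, “I.B.M.”, “domain.com”, “1.2” and “a.m.” are all passed over, and so is the colon in “10:00”, because it is followed by a digit.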

A writer’s approach to summarisation
My target for the summariser add-on was a combination of two things: It should extract what the writer found important, minus what I find unimportant. Unimportant being things like introductions, asides, examples, vague statements, speculation and other weak arguments.

Word choice
While writing styles vary, all writers choose their words to emphasise or downplay what they consider important. Consider the difference between “This is very important.” and “Some may consider this important.” In a way the writer has already filtered the information for you. With this understanding, I set the summariser to look for several types of cues in the writer’s choice of words:

• Examples: “e.g.”, “for instance”, “among other”, “just one of”
• Uncertainty: “may”, “suppose”, “conjecture”, “question”, “not clear”
• Commonly known: “standard”, “as usual”, “of course”, “obvious”
• Advice: “recommendation”, “require”, “need”, “must”, “insist”
• Main arguments: “problem”, “goal”, “priority”, “conclude”, “decision”
• Literal importance: “negligible”, “insignificant”, “vital”, “valuable”
• Strong opinions: “horrible”, “fascinate”, “astonishing”, “extraordinary”
• Amounts: “some”, “a few”, “many”, “very”, “huge”, “millions”

At this point one may be tempted to take a statistical approach again and score each sentence for how many positive and negative cues it contains, but that’s not quite right: There is a hierarchy to the cues because they differ in meaning. For example, uncertainty like “maybe very important” makes for a weak argument no matter how many positive cues it contains. So each type of cue is given a certain level of priority over others. Their exact hierarchy is a delicate matter of tuning, but it is more or less the order as listed, with negative cues typically overruling positive cues.
Another aspect that must be taken into account is that modifiers such as amounts and negations affect the cues in linear order, i.e. they apply to the words that follow them:
“It is not important to read” is not equal to “It is important not to read”, even if they contain the same words. Only the latter should be included in the summary.
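
The hierarchy and the linear order could be sketched roughly as follows. This is not the add-on's actual code: the word lists are the samples quoted above plus “maybe” and “important” from the examples, the polarity values are guesses, and negation is reduced to a “not” directly in front of a cue.

```javascript
// Cue types in descending priority: the first type that matches decides.
// Polarity -1 means "leave out", +1 means "keep". These values are assumptions.
const CUE_TYPES = [
  { name: "example",     polarity: -1, cues: ["e.g.", "for instance", "among other", "just one of"] },
  { name: "uncertainty", polarity: -1, cues: ["may", "maybe", "suppose", "conjecture", "question", "not clear"] },
  { name: "common",      polarity: -1, cues: ["standard", "as usual", "of course", "obvious"] },
  { name: "advice",      polarity: 1,  cues: ["recommendation", "require", "need", "must", "insist"] },
  { name: "argument",    polarity: 1,  cues: ["problem", "goal", "priority", "conclude", "decision"] },
  { name: "importance",  polarity: 1,  cues: ["important", "vital", "valuable"] },
];

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function classify(sentence) {
  const lower = sentence.toLowerCase();
  for (const type of CUE_TYPES) {
    for (const cue of type.cues) {
      // Whole-word match, optionally preceded by "not", which flips the cue.
      const pattern = new RegExp("(?<![a-z])(not\\s+)?" + escapeRegExp(cue) + "(?![a-z])");
      const match = lower.match(pattern);
      if (match) {
        return { type: type.name, polarity: match[1] ? -type.polarity : type.polarity };
      }
    }
  }
  return { type: null, polarity: 0 };
}

console.log(classify("It is not important to read").polarity);  // -1
console.log(classify("It is important not to read").polarity);  // 1
console.log(classify("This is maybe very important.").type);    // "uncertainty"
```

Because the types are checked in order of priority rather than counted, the “maybe” in the last example settles the matter before “important” is ever looked at.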

Sentence weaving
Beside word choice, further cues can be found at sentence level:
• Headers are rarely followed by an important point, as they just stated it themselves.
• A major point, such as a recommendation, tends to be followed by a sentence with valuable elaboration.
• A sentence ending in a colon is not important itself: It announces that the point follows.
• A question is just a prelude to the point that the writer wants to drive home in the next sentence.
• Cues in sentences that contain references like “the following” reflect the importance of other sentences, rather than their own.
• Sentences of fewer than 10 words are usually transitions or afterthoughts, unless word choice tells otherwise.

Along with these cues one should always observe context: If an important sentence begins with a reference like “This”, then the preceding sentence also needs to be included in order to make sense, even if it was otherwise ignorable. Conversely, if the preceding sentence can be omitted without loss of context, link words like “But”, “nevertheless”, and “also” should be removed to avoid confusion in the summary.
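
These two context rules can be sketched as a post-processing pass over the selected sentences. The function name and the `keepFlags` input (one true/false per sentence, decided by the earlier cues) are hypothetical, and the list of reference words beyond “This” is an assumption.

```javascript
const REFERENCE_START = /^(this|that|these|those)\b/i; // only "This" is named above; the rest is assumed
const LINK_WORD_START = /^(but|nevertheless|also)\b[,]?\s*/i;

function applyContext(sentences, keepFlags) {
  const keep = keepFlags.slice();
  // Pass 1, back to front: a kept sentence that starts with a reference
  // pulls in the sentence before it, which may in turn pull in its own predecessor.
  for (let i = sentences.length - 1; i > 0; i--) {
    if (keep[i] && REFERENCE_START.test(sentences[i])) keep[i - 1] = true;
  }
  // Pass 2: remove a leading link word when the preceding sentence is left out.
  const summary = [];
  for (let i = 0; i < sentences.length; i++) {
    if (!keep[i]) continue;
    let sentence = sentences[i];
    if ((i === 0 || !keep[i - 1]) && LINK_WORD_START.test(sentence)) {
      sentence = sentence.replace(LINK_WORD_START, "");
      sentence = sentence.charAt(0).toUpperCase() + sentence.slice(1);
    }
    summary.push(sentence);
  }
  return summary;
}
```

Given “The battery lasts two hours.”, “This is the main problem.”, “Lunch was nice.” and “But the price must drop.”, with only the second and fourth marked as important, the first sentence is pulled in for context and the fourth becomes “The price must drop.”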

Story flow and the lack thereof
Summarisation methods that are based on well-formatted academic text sensibly assume that the first and last sentences of paragraphs are of particular importance, as they tend to follow a basic story arc:
Introduction -> problem -> obstacles -> climax -> resolution.
Online articles however feature considerably shorter paragraphs, so that in practice the first sentence has an equal chance of being a trivial introduction or an important problem statement. Some paragraphs are just blockquotes or filler contents, and sometimes the “resolution” of the arc is postponed to entice further reading, as the entire article is a story arc itself.

But worst of all, many online articles have the dreadful habit of making every two sentences into a paragraph of their own. Perhaps because it creates more room for sidebar advertisements.

While I originally awarded some default importance to first and last sentences, I found that word choice is such an abundantly present cue that it is a more dependable indicator. Not every blogger is a good writer, after all. The frequent abuse of paragraph breaks also forced me to take a different approach in composing the summary: Breaks are only inserted if the next paragraph contains a highly important point of its own, otherwise it is considered a continuation. This greatly improved readability.
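
The paragraph-break rule might look something like this. The data shape (an array of paragraphs, each an array of kept sentences with an `importance` number) and the threshold are assumptions made for the sake of the sketch.

```javascript
const HIGH_IMPORTANCE = 3; // hypothetical threshold for "a highly important point"

// paragraphs: array of arrays of { text, importance }, holding only the kept sentences.
function composeSummary(paragraphs) {
  let summary = "";
  for (const paragraph of paragraphs) {
    if (paragraph.length === 0) continue; // nothing from this paragraph survived
    const text = paragraph.map((sentence) => sentence.text).join(" ");
    const hasMajorPoint = paragraph.some((sentence) => sentence.importance >= HIGH_IMPORTANCE);
    if (summary === "") {
      summary = text;
    } else {
      // Only a paragraph with a major point of its own earns a break.
      summary += (hasMajorPoint ? "\n\n" : " ") + text;
    }
  }
  return summary;
}
```

This way a run of two-sentence paragraphs collapses into one readable block, while a paragraph that carries a conclusion still stands apart.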

Conclusion
The resulting summariser add-on typically reduces well-written articles to 40 – 50% of their original length, and flimsy content to 20 – 30%. With my approach the summary cannot be constrained to a preset length, but a future improvement could be to add an adjustable setting to only include sentences of the highest levels of importance, such as conclusions only.

Another inherent effect of my approach is that if the writer makes the same point twice, the summary will also include it twice. While technically correct, this could be amended by comparing sentences for repeated strings of words, and ideally synonyms as well.
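
A minimal sketch of what such a comparison could look like, assuming that a shared run of four consecutive words is enough to call two sentences a repeat. The function names and the run length are arbitrary, and synonyms are not handled.

```javascript
// Collect every run of `length` consecutive words in a sentence.
function wordRuns(sentence, length) {
  const words = sentence.toLowerCase().match(/[a-z0-9']+/g) || [];
  const runs = new Set();
  for (let i = 0; i + length <= words.length; i++) {
    runs.add(words.slice(i, i + length).join(" "));
  }
  return runs;
}

// Keep the first occurrence of a point, drop later sentences that repeat
// one of its word runs.
function dropRepeats(sentences, runLength = 4) {
  const seen = new Set();
  const result = [];
  for (const sentence of sentences) {
    const runs = wordRuns(sentence, runLength);
    const repeated = [...runs].some((run) => seen.has(run));
    if (!repeated) result.push(sentence);
    for (const run of runs) seen.add(run);
  }
  return result;
}
```

A shorter run length catches more rephrased repeats, but also starts to drop sentences that merely share a common phrase.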

In conclusion, I should say that my summariser is not necessarily “better” than statistical summarisers, but different, in that it specifically searches for the points that the writer wanted to get across, rather than retrieving the general subject matter. This may suit other users as well as it does me, and I hope that many will find it contributes to a better internet experience.

You can install free Chrome and Firefox versions from their web stores:
banner_chrome       banner_firefox

Below is an example summary, skipping trivia and retrieving the key announcement:
screenshot670
