Unsupervised Auto-labeling of Websites

May 28th, 2021 • A simple and fast way of generating labels for websites.


This post was written by Luka Dulčić, our data scientist and software engineer.

Ever wondered how nice it would be if labels were automatically generated for your links so you didn’t have to write them? Labels are great for adding context to things, grouping content, and making content easily searchable. That’s why we decided to give it a shot and implement auto-labeling of links in Pincone with a little bit of Natural Language Processing.

The whole process of generating labels is actually quite simple. First, we get the HTML of the link → then we extract text from the HTML → finally, we use a keyword extractor to generate labels.

Auto-generated labels when adding a link.

HTML → Text

The first step of implementing auto-labeling is extracting text from the link’s HTML. This is not a simple task at all: our target is literally the whole internet, and the web is full of weird and broken HTML. We want to extract the link’s “main body” of text, the part that is representative of the link. For example, when we analyse a news article, we want to extract the article text; everything else on the webpage is noise to us. Luckily, there are many tools available nowadays that do just that.

Scrapinghub published the “Article Extraction Benchmark” paper in May 2020, for which they compiled a dataset of 182 news articles, annotated them, and ran many of the currently available “HTML → Text” extractors on those links. They also published the code and dataset. The table below shows their results, with my addition of a few fields: the licence of each open-source project, the date of the last release, and the number of open issues.

| Name | Precision | Recall | F1 | Accuracy | Language | Last Release | # of Issues | Stars | Free | Open-Source | Licence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AutoExtract | 0.984 | 0.956 | 0.970 | 0.470 | Unknown | | | | FALSE | FALSE | Commercial licence |
| DiffBot | 0.958 | 0.944 | 0.951 | 0.348 | Unknown | | | | FALSE | FALSE | Commercial licence |
| boilerpipe | 0.850 | 0.870 | 0.860 | 0.006 | Python | 10/03/2017 | 21 | 493 | TRUE | TRUE | Apache v2 |
| dragnet | 0.925 | 0.889 | 0.907 | 0.221 | Python | 04/16/2019 | 17 | 951 | TRUE | TRUE | MIT |
| html-text | 0.500 | 0.994 | 0.665 | 0.000 | Python | 07/22/2020 | 10 | 71 | TRUE | TRUE | MIT |
| newspaper | 0.917 | 0.906 | 0.912 | 0.260 | Python | 09/28/2019 | 349 | 10968 | TRUE | TRUE | MIT |
| readability | 0.913 | 0.931 | 0.922 | 0.315 | Python | 07/04/2020 | 29 | 1969 | TRUE | TRUE | Apache v2 |
| xpath-text | 0.246 | 0.992 | 0.394 | 0.000 | XPath | | | | TRUE | TRUE | |
| trafilatura | 0.925 | 0.966 | 0.945 | 0.221 | Python | 04/21/2021 | 12 | 125 | TRUE | TRUE | GPL v3 |
| go_readability | 0.912 | 0.975 | 0.943 | 0.210 | Go | 10/11/2020 | 3 | 266 | TRUE | TRUE | MIT |
| readability.js | 0.853 | 0.924 | 0.887 | 0.149 | Javascript | 01/13/2021 | 160 | 3956 | TRUE | TRUE | Apache v2 |
| go_domdistiller | 0.901 | 0.956 | 0.927 | 0.066 | Go | 12/22/2020 | 1 | 8 | TRUE | TRUE | MIT |
| news_please | 0.917 | 0.906 | 0.911 | 0.249 | Python | 05/06/2021 | 18 | 1038 | TRUE | TRUE | Apache v2 |
| goose3 | 0.930 | 0.847 | 0.887 | 0.227 | Python | 04/27/2021 | 12 | 492 | TRUE | TRUE | Apache v2 |
| inscriptis | 0.517 | 0.993 | 0.679 | 0.000 | Python | 01/04/2021 | 1 | 67 | TRUE | TRUE | Apache v2 |
| html2text | 0.499 | 0.983 | 0.662 | 0.000 | Python | 01/16/2021 | 60 | 1090 | TRUE | TRUE | GPL v3 |
| beautifulsoup | 0.499 | 0.994 | 0.665 | 0.000 | Python | 10/03/2020 | | | TRUE | TRUE | MIT |
| justext | 0.858 | 0.754 | 0.802 | 0.088 | Python | 03/06/2016 | 8 | 425 | TRUE | TRUE | BSD 2-Clause |

Scrapinghub’s research focuses on extracting text from a specific set of web pages – news and blogs. Our target is broader, but these extractors generally work well for extracting relevant text from any kind of website.

When deciding which one to choose for Pincone, we compiled a small dataset of links → ran the extractors mentioned above → generated labels using the extracted text. We only considered free and open-source extractors, and since the auto-labeling stack is written in Python, only extractors written in Python.
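
To give an idea of what this looks like, here’s a minimal sketch of running a few of the Python extractors side by side on a single page. The URL is hypothetical, and this is only the general idea, not the actual comparison harness we used.

import html_text
import requests
import trafilatura
from readability import Document

url = "https://example.com/article"  # hypothetical page
html = requests.get(url, timeout=10).text

candidates = {
    # Readability returns cleaned-up "readable" HTML; html-text flattens it.
    "readability": html_text.extract_text(Document(html).summary()),
    # trafilatura extracts the main text directly (returns None on failure).
    "trafilatura": trafilatura.extract(html) or "",
    # html-text alone keeps all text, including navigation and boilerplate.
    "html_text": html_text.extract_text(html),
}

for name, text in candidates.items():
    print(f"{name}: {len(text)} chars, starts with {text[:60]!r}")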

The results of this experiment showed that extractors with high F1 scores (>0.85) consistently produced very similar labels, which indicates that all of them are probably good enough. On the other hand, “simple” extractors with bad F1 scores and high recall, such as beautifulsoup and html-text, generated pretty bad labels. That’s because they tend to extract every piece of text from the website, which adds a lot of noise.

In the end, we decided to use Readability. It has a good F1 score, it’s written in Python, it’s open-source, its licence allows us to use it in Pincone, and it hasn’t been abandoned by its maintainers. Trafilatura is also a very good extractor with an even higher F1 score, but unfortunately it’s licensed under GPL v3, which doesn’t allow us to use it in Pincone.

Here’s a code snippet that shows how to extract text with Readability. Readability actually generates a “readable” version of the HTML, which is why we use html-text as the last step to extract the actual text.

import html_text
from readability import Document

def extract_text(html):
    # Document parses the raw HTML; summary() returns the cleaned-up
    # "readable" HTML of the main content, which html-text flattens to plain text.
    doc = Document(html)
    return html_text.extract_text(doc.summary())

Generating Labels

Ok, now that we’ve extracted text from the HTML, let’s generate labels from it. Labels are generated by extracting keywords from the text and using the top keywords as labels. Keywords are not perfect labels, and the obvious downside is that you can never get a label that doesn’t appear in the text, but keywords are very often close to the “true” labels we would choose ourselves. The (technical) upside of using keywords as labels is that it’s a completely unsupervised method: we don’t need any training data, and there are a lot of keyword extractors available.

We considered keyword extractors from the textacy package: TextRank, YAKE, sCAKE, and SGRank. Unfortunately, we didn’t have any data to properly evaluate how these extractors perform for label generation, so we collected a small dataset of various links → generated labels for those links with every keyword extractor → manually inspected the labels and decided which extractor to choose.

This is our general overview of keyword extractors that we tested:

TextRank — TextRank is very verbose: it often generates keywords consisting of more than 3 or 4 words, which is not very desirable for labels, and those keywords are often not very good. In our opinion, of all the keyword extractors we tested, TextRank is the worst fit for auto-labeling.

YAKE — YAKE consistently produced good keywords in our experiments. Its keywords are very concise, often capturing the essence of the link content, which is exactly what we need.

sCAKE — Like YAKE, it consistently produced good keywords in experiments, but it was a bit too verbose. It’s not as bad as TextRank, but it often generates too many words for a single label.

SGRank — SGRank was not bad in our experiments, but it was not as good as YAKE or sCAKE. It just didn’t consistently capture the keywords most relevant to the content.

Our dilemma was between YAKE and sCAKE. In the end, we chose YAKE because its keywords are more concise and therefore more suitable as labels. There are a couple of real examples of links with labels generated by every extractor at the end of this post, so you can get a feel for how these keyword extractors perform.

The code sample below shows a minimal working example of the complete auto-labeling pipeline for the English language.

import requests
import textacy
from textacy.extract.keyterms import yake
import html_text
from readability import Document

def extract_text(html):
    # Reduce the page to its "readable" main content, then flatten it to plain text.
    doc = Document(html)
    return html_text.extract_text(doc.summary())

def generate_keywords(text):
    # Run the text through a spaCy pipeline, then extract (keyword, score)
    # pairs with YAKE, lemmatizing keywords and allowing 1- and 2-word terms.
    doc = textacy.make_spacy_doc(text, "en_core_web_sm")
    return yake(doc, normalize='lemma', ngrams=(1, 2))

def generate_labels(url):
    # Full pipeline: fetch HTML → extract text → extract keywords.
    html = requests.get(url).content
    text = extract_text(html)
    return generate_keywords(text)

Pincone auto-labeling relies on spaCy for language support; the spaCy documentation lists the languages it currently supports.
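
The snippet above is English-only; one straightforward way to generalise it is to detect the language of the extracted text and map it to an installed spaCy pipeline. Here’s a minimal sketch assuming the langdetect package and a hand-picked model mapping; both are assumptions for illustration, not necessarily how Pincone does it.

import textacy
from langdetect import detect  # assumption: langdetect for language detection
from textacy.extract.keyterms import yake

# Hypothetical mapping from detected language codes to installed spaCy models.
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
    "es": "es_core_news_sm",
}

def generate_keywords_any_language(text):
    model = SPACY_MODELS.get(detect(text))
    if model is None:
        return []  # unsupported language: skip auto-labeling
    doc = textacy.make_spacy_doc(text, model)
    return yake(doc, normalize='lemma', ngrams=(1, 2))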

Issues

Speed

The time it takes to generate keywords grows linearly with the length of the text. Depending on your use case, this can be an issue. In Pincone, keyword extraction is part of the link-scraping pipeline, which needs to be as fast as possible to ensure good UX for users.

We set the timeout for the keyword-extraction step at three seconds so that it doesn’t compromise the UX of adding links. In the majority of cases this works just fine: keywords are extracted before the timeout is hit, and users are served labels along with the link content. But for links with a lot of content, keyword extraction will often hit the timeout and users will not get auto-generated labels. That’s a trade-off we are willing to make.
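
One simple way to enforce such a timeout is to run the extraction in a worker and abandon the result after three seconds. Below is a minimal sketch using concurrent.futures; it’s an illustration, not necessarily Pincone’s actual implementation.

from concurrent.futures import ThreadPoolExecutor, TimeoutError

KEYWORD_TIMEOUT_SECONDS = 3
executor = ThreadPoolExecutor(max_workers=4)

def generate_keywords_with_timeout(text):
    # Run generate_keywords (from the pipeline above) in a worker thread
    # and give up on the result once the timeout is hit.
    future = executor.submit(generate_keywords, text)
    try:
        return future.result(timeout=KEYWORD_TIMEOUT_SECONDS)
    except TimeoutError:
        # Serve the link without labels. The worker thread still runs to
        # completion; a process pool would allow cancelling the work outright.
        return []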

The graph below shows the speed of auto-labeling as a function of document length. We scraped 800+ links from the Hacker News front page and ran auto-labeling on their text. Document length refers to the length of the text after extraction from HTML. The graph shows that the vast majority of websites can be auto-labeled in under a second.

Quality

The quality of extracted keywords as labels is debatable. This is a very subjective issue: if you gave 10 people the task of writing labels for an article, you’d likely get 10 sets of labels that don’t overlap much. Our goal here was to capture what the link is about. Users can accept the generated labels, change them, or remove them; the point is that they don’t have to write labels themselves for every link.

Capturing the essence of the link content with a keyword extractor works fine in most cases, but not always. Sometimes only a subset of the generated labels fits the link content, or, in rare cases, none of them do. This can happen because there is too much noise in the text, because there is not enough text to analyse, or because the keyword extractor simply failed to capture the content. We try to minimise these cases by adjusting the threshold for keyword confidence and by filtering out keywords which are not relevant, such as stopwords, dates, or ordinals.
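
A filter along these lines could look like the sketch below. The score threshold is a made-up number and the exact rules are assumptions for illustration; YAKE scores are lower-is-better, so keywords scoring above the threshold are dropped.

import spacy

nlp = spacy.load("en_core_web_sm")

def filter_keywords(keyterms, max_score=0.15):
    # keyterms: (keyword, score) pairs from YAKE; lower score = more relevant.
    labels = []
    for keyword, score in keyterms:
        if score > max_score:  # hypothetical confidence threshold
            continue
        doc = nlp(keyword)
        # Drop keywords made up entirely of stopwords or number-like tokens,
        # and anything spaCy recognises as a date or ordinal.
        if all(tok.is_stop or tok.like_num for tok in doc):
            continue
        if any(ent.label_ in ("DATE", "ORDINAL") for ent in doc.ents):
            continue
        labels.append(keyword)
    return labels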

Conclusion

Although it’s not without its issues, our auto-labeling pipeline is a simple and fast way of generating relevant labels for links. We believe it’s a valuable addition that will improve the experience of using Pincone for our users.

Examples

Below are some examples of links and the labels generated by each keyword extractor we tested.

Design for Fidgeting

| textrank | yake | scake | sgrank |
| --- | --- | --- | --- |
| strange tool usage pattern | tool | software design | work |
| software design | design | acceptable design principle | usage pattern |
| well tool | fun | design organization | Yuan comment |
| individual tool usage | body | product design | unapologetic |
| acceptable design principle | play | human body | thinking |

Paul Graham's New Ideas Essay

| textrank | yake | scake | sgrank |
| --- | --- | --- | --- |
| big new idea | idea | big new idea | new idea |
| feeble new idea | new | feeble new idea | domain expert |
| radical new idea | people | radical new idea | reasonable person |
| reasonable people | domain | reasonable domain expert | reason people |
| reasonable domain expert | new idea | reasonable people | way |

Google Blog: A simpler and safer future without passwords

| textrank | yake | scake | sgrank |
| --- | --- | --- | --- |
| password management easy | password | password management easy | management easy |
| complicated password | security | complicated password | information safe |
| strong password | safe | strong password | password |
| personal information safe | complicated | online security | complicated password |
| online security | online | security risk | big threat |

I want to thank Domagoj Alagić for pointing me to the textacy package, Ivan Božić for reading the drafts of this post, and, last but not least, Lea Metličić for reading the drafts and editing the post. Thank you! 😊
