May 28th, 2021 • A simple and fast way of generating labels for websites.
This post was written by Luka Dulčić, our data scientist and software engineer.
Ever wondered how nice it would be if labels were automatically generated for your links so you didn’t have to write them? Labels are great for adding context to things, grouping content, and making content easily searchable. That’s why we decided to give it a shot and implement auto-labeling of links in Pincone with a little bit of Natural Language Processing.
The whole process of generating labels is actually quite simple. First, we get the HTML of the link → then we extract text from the HTML → finally, we use a keyword extractor to generate labels.
The first step of implementing auto-labeling is extracting text from the link’s HTML. This is not a simple task at all: our target is effectively the whole internet, and there is all kinds of weird and broken HTML out there. We would like to extract the link’s “main body” of text, the part that is representative of the link. For example, when we analyse a news article, we want to extract the article text; everything else on the webpage is noise to us. Luckily, there are many tools available nowadays that do just that.
Scrapinghub published the “Article Extraction Benchmark” paper in May 2020 where they compiled a dataset of 182 news articles, annotated them, and ran many of the currently available “HTML → Text” extractors on those links. They also published the code and dataset. The table below shows their results with my addition of some fields like the licence of open-source projects, time of the last commit, and the number of issues.
| Name | Precision | Recall | F1 | Accuracy | Language | Last Release | # of Issues | Stars | Free | Open-Source | Licence |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AutoExtract | 0.984 | 0.956 | 0.970 | 0.470 | Unknown | | | | FALSE | FALSE | Commercial licence |
| DiffBot | 0.958 | 0.944 | 0.951 | 0.348 | Unknown | | | | FALSE | FALSE | Commercial licence |
| boilerpipe | 0.850 | 0.870 | 0.860 | 0.006 | Python | 10/03/2017 | 21 | 493 | TRUE | TRUE | Apache v2 |
| dragnet | 0.925 | 0.889 | 0.907 | 0.221 | Python | 04/16/2019 | 17 | 951 | TRUE | TRUE | MIT |
| html-text | 0.500 | 0.994 | 0.665 | 0.000 | Python | 07/22/2020 | 10 | 71 | TRUE | TRUE | MIT |
| newspaper | 0.917 | 0.906 | 0.912 | 0.260 | Python | 09/28/2019 | 349 | 10968 | TRUE | TRUE | MIT |
| readability | 0.913 | 0.931 | 0.922 | 0.315 | Python | 07/04/2020 | 29 | 1969 | TRUE | TRUE | Apache v2 |
| xpath-text | 0.246 | 0.992 | 0.394 | 0.000 | XPath | | | | TRUE | TRUE | |
| trafilatura | 0.925 | 0.966 | 0.945 | 0.221 | Python | 04/21/2021 | 12 | 125 | TRUE | TRUE | GPL v3 |
| go_readability | 0.912 | 0.975 | 0.943 | 0.210 | Go | 10/11/2020 | 3 | 266 | TRUE | TRUE | MIT |
| readability.js | 0.853 | 0.924 | 0.887 | 0.149 | Javascript | 01/13/2021 | 160 | 3956 | TRUE | TRUE | Apache v2 |
| go_domdistiller | 0.901 | 0.956 | 0.927 | 0.066 | Go | 12/22/2020 | 1 | 8 | TRUE | TRUE | MIT |
| news_please | 0.917 | 0.906 | 0.911 | 0.249 | Python | 05/06/2021 | 18 | 1038 | TRUE | TRUE | Apache v2 |
| goose3 | 0.930 | 0.847 | 0.887 | 0.227 | Python | 04/27/2021 | 12 | 492 | TRUE | TRUE | Apache v2 |
| inscriptis | 0.517 | 0.993 | 0.679 | 0.000 | Python | 01/04/2021 | 1 | 67 | TRUE | TRUE | Apache v2 |
| html2text | 0.499 | 0.983 | 0.662 | 0.000 | Python | 01/16/2021 | 60 | 1090 | TRUE | TRUE | GPL v3 |
| beautifulsoup | 0.499 | 0.994 | 0.665 | 0.000 | Python | 10/03/2020 | | | TRUE | TRUE | MIT |
| justext | 0.858 | 0.754 | 0.802 | 0.088 | Python | 03/06/2016 | 8 | 425 | TRUE | TRUE | BSD 2-Clause |
Scrapinghub’s research focuses on extracting text from a specific set of web pages – news and blogs. Our target is broader, but these extractors generally work well for extracting relevant text from any kind of website.
When deciding which one to choose for Pincone, we compiled a small dataset of links → ran the extractors mentioned above → generated labels using the extracted text. We only considered free and open-source extractors. Also, since the auto-labeling stack is written in Python, we only considered extractors written in Python.
The result of this experiment showed us that extractors with high F1 scores (>0.85) consistently produced very similar labels, which indicates that all of them are probably good enough. On the other hand, “simple” extractors with bad F1 scores and high recall, such as beautifulsoup and html-text, generated pretty bad labels. This is because these extractors tend to extract every piece of text from the website, which adds a lot of noise to the text.
In the end, we decided to use Readability. It has a good F1 score, it’s written in Python, it’s open-source, its licence allows us to use it in Pincone, and it hasn’t been abandoned by its maintainers. Trafilatura is also a very good extractor with an even higher F1 score, but unfortunately, it’s licenced under GPL v3, which doesn’t allow us to use it in Pincone.
Here’s a code snippet which shows how to extract text with Readability. Readability is actually a tool which generates a “readable” version of the HTML, which is why we use html-text as the last step to extract the actual text.
```python
import html_text
from readability import Document


def extract_text(html):
    # Readability builds a cleaned-up, "readable" HTML view of the page;
    # html_text then turns that cleaned HTML into plain text
    doc = Document(html)
    return html_text.extract_text(doc.summary())
```
OK, now that we’ve extracted the text from the HTML, let’s generate labels from it. Labels are generated by extracting keywords from the text and using the top keywords as labels. Keywords are not perfect labels, and the obvious downside is that you can never get a label that doesn’t appear in the text, but keywords are very often close to those “true” labels that we would choose ourselves. The (technical) upside of using keywords as labels is that it’s a completely unsupervised method: we don’t need any training data, and there are a lot of keyword extractors available.
We considered keyword extractors from the textacy package: TextRank, YAKE, sCAKE, and SGRank. Unfortunately, we didn’t have any data to properly evaluate how keyword extractors perform for label generation. Instead, we collected a small dataset of various links → generated labels for those links with every keyword extractor → manually inspected those labels and decided which keyword extractor to choose.
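Here’s a minimal sketch of how such a side-by-side comparison can be run with textacy (the extractor functions come from textacy’s keyterms module; the `top_n` value and the dictionary layout are our illustrative choices):

```python
import textacy
from textacy.extract.keyterms import textrank, yake, scake, sgrank


def compare_extractors(text, top_n=5):
    # Build one spaCy doc and run all four keyword extractors on it
    doc = textacy.make_spacy_doc(text, "en_core_web_sm")
    return {
        "textrank": textrank(doc, topn=top_n),
        "yake": yake(doc, topn=top_n),
        "scake": scake(doc, topn=top_n),
        "sgrank": sgrank(doc, topn=top_n),
    }
```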
This is our general overview of keyword extractors that we tested:
TextRank — TextRank is very verbose: it often generates keywords consisting of more than three or four words, which is not very desirable for labels, and the keywords themselves are often not very good. In our opinion, of all the keyword extractors that we tested, TextRank is the worst fit for auto-labeling.
YAKE — YAKE consistently produced good keywords in our experiments. Keywords are very concise, often capturing the essence of the link content which is what we need.
sCAKE — Like YAKE, sCAKE consistently produced good keywords in our experiments, but it was a bit too verbose. It’s not as bad as TextRank, but it often generates too many words for a single label.
SGRank — SGRank was not bad in our experiments, but it was not as good as YAKE or sCAKE: it didn’t consistently capture the keywords most relevant to the content.
Our dilemma was between YAKE and sCAKE. In the end, we decided to use YAKE because its keywords are more concise and therefore more suitable as labels. There are a couple of real examples of links with labels generated by each extractor at the end of this post, so you can get a feel for how these keyword extractors perform.
The code sample below shows a minimal working example of the complete auto-labeling pipeline for the English language.
```python
import requests
import textacy
from textacy.extract.keyterms import yake
import html_text
from readability import Document


def extract_text(html):
    # Isolate the "main body" of text from the raw HTML
    doc = Document(html)
    return html_text.extract_text(doc.summary())


def generate_keywords(text):
    # Extract lemmatised uni- and bigram keywords with YAKE
    doc = textacy.make_spacy_doc(text, "en_core_web_sm")
    return yake(doc, normalize="lemma", ngrams=(1, 2))


def generate_labels(url):
    html = requests.get(url).content
    text = extract_text(html)
    return generate_keywords(text)
```
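Calling the pipeline then looks like this (the URL is purely illustrative):

```python
# yake returns (keyterm, score) pairs; the keyterms are used as labels
labels = generate_labels("https://example.com/some-article")
print(labels)
```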
Pincone auto-labeling relies on spaCy for language support, so the languages we can auto-label are the ones spaCy supports at the moment (the full list is in the spaCy documentation).
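Assuming the corresponding spaCy model is installed, supporting another language only means swapping the model name. For example, a hypothetical German variant of `generate_keywords` might look like this:

```python
# Hypothetical German pipeline, assuming the model was installed with:
#   python -m spacy download de_core_news_sm
def generate_keywords_de(text):
    doc = textacy.make_spacy_doc(text, "de_core_news_sm")
    return yake(doc, normalize="lemma", ngrams=(1, 2))
```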
The time needed to generate keywords grows linearly with the length of the text. Depending on your use case, this can be an issue. In Pincone, keyword extraction is part of the link-scraping pipeline, which needs to be as fast as possible to ensure a good experience for users.
We set the timeout at three seconds for the keyword-extraction step to avoid compromising the UX of adding links. In the majority of cases, this works just fine: keywords are extracted before the timeout is hit, and users are served labels along with the link content. But for links with a lot of content, keyword extraction will often hit the timeout, and those users will not get autogenerated labels. That’s a trade-off we are willing to make.
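A minimal sketch of such a guard, using a thread pool to enforce the budget (this is an illustrative mechanism, not necessarily the one Pincone uses):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

KEYWORD_TIMEOUT_SECONDS = 3
# Shared worker pool; the pool size is an illustrative choice
executor = ThreadPoolExecutor(max_workers=4)


def generate_labels_with_timeout(url):
    future = executor.submit(generate_labels, url)
    try:
        return future.result(timeout=KEYWORD_TIMEOUT_SECONDS)
    except TimeoutError:
        # Extraction took too long: serve the link without labels.
        # Note that the worker thread itself keeps running to completion.
        return []
```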
The graph below shows the speed of auto-labeling depending on document length. We scraped 800+ links from the Hacker News front page and ran auto-labeling on the text. Document length refers to the length of text after extracting it from HTML. We can see from the graph that the vast majority of websites can be auto-labeled in under a second.
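The exact benchmarking code isn’t published; here’s a minimal sketch of how such per-document timings can be collected:

```python
import time


def time_labeling(text):
    # Returns (document length, seconds spent extracting keywords)
    start = time.perf_counter()
    generate_keywords(text)
    return len(text), time.perf_counter() - start
```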
The quality of extracted keywords as labels is debatable. This is a very subjective matter: if you gave 10 people the task of writing labels for an article, you’d likely get 10 sets of labels that don’t overlap much. Our goal here was to capture what the link is about. The user can accept those labels, change them, or remove them. The point is that users don’t have to write labels themselves for every link.
Capturing the essence of the link content with a keyword extractor works fine in most cases, but not always. Sometimes only a subset of the generated labels fits the link content, or, in rare cases, none of them fit. This can happen because there is too much noise in the text, because there is not enough text to analyse well, or because the keyword extractor simply failed to capture the content. We try to minimise these cases by adjusting the threshold for keyword confidence and by filtering out keywords which don’t make good labels, such as stopwords, dates, or ordinals.
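As an illustration, here’s a minimal sketch of such a filter built on spaCy’s stopword flags and named-entity labels (the exact rules and the confidence threshold Pincone uses aren’t shown here):

```python
import spacy

nlp = spacy.load("en_core_web_sm")


def filter_keywords(keywords):
    kept = []
    for term, score in keywords:
        doc = nlp(term)
        if all(token.is_stop for token in doc):
            continue  # drop pure stopword phrases
        if any(ent.label_ in ("DATE", "ORDINAL") for ent in doc.ents):
            continue  # dates and ordinals rarely make good labels
        kept.append((term, score))
    return kept
```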
Although it’s not without its issues, our auto-labeling pipeline is a simple and fast way of generating relevant labels for links. We believe it’s a valuable addition that will improve the experience of using Pincone for our users.
Below are some examples of links and the output from relevant keyword extractors.
| textrank | yake | scake | sgrank |
|---|---|---|---|
| strange tool usage pattern | tool | software design | work |
| software design | design | acceptable design principle | usage pattern |
| well tool | fun | design organization | Yuan comment |
| individual tool usage | body | product design | unapologetic |
| acceptable design principle | play | human body | thinking |
| textrank | yake | scake | sgrank |
|---|---|---|---|
| big new idea | idea | big new idea | new idea |
| feeble new idea | new | feeble new idea | domain expert |
| radical new idea | people | radical new idea | reasonable person |
| reasonable people | domain | reasonable domain expert | reason people |
| reasonable domain expert | new idea | reasonable people | way |
Google Blog: A simpler and safer future without passwords
| textrank | yake | scake | sgrank |
|---|---|---|---|
| password management easy | password | password management easy | management easy |
| complicated password | security | complicated password | information safe |
| strong password | safe | strong password | password |
| personal information safe | complicated | online security | complicated password |
| online security | online | security risk | big threat |
I want to thank Domagoj Alagić for pointing me to the textacy package, Ivan Božić for reading the drafts of this post, and last but not least, Lea Metličić for reading the drafts and editing the post. Thank you! 😊