May 28th, 2021 • A simple and fast way of generating labels for websites.
This post was written by Luka Dulčić, our data scientist and software engineer.
Ever wondered how nice it would be if labels were automatically generated for your links so you didn’t have to write them? Labels are great for adding context to things, grouping content, and making content easily searchable. That’s why we decided to give it a shot and implement auto-labeling of links in Pincone with a little bit of Natural Language Processing.
The whole process of generating labels is actually quite simple. First, we get the HTML of the link → then we extract text from the HTML → finally, we use a keyword extractor to generate labels.
The first step of implementing auto-labeling is extracting text from the link’s HTML. This is not a simple task at all: our target is effectively the whole internet, and there is all kinds of weird and broken HTML out there. We would like to extract the link’s “main body” of text, the part that is representative of the link. For example, when we analyse a news article, we want to extract the article text; everything else on the webpage is noise to us. Luckily, there are many tools available nowadays that do just that.
Scrapinghub published the “Article Extraction Benchmark” paper in May 2020 where they compiled a dataset of 182 news articles, annotated them, and ran many of the currently available “HTML → Text” extractors on those links. They also published the code and dataset. The table below shows their results with my addition of some fields like the licence of open-source projects, time of the last commit, and the number of issues.
| Name | Precision | Recall | F1 | Accuracy | Language | Last Release | # of Issues | Stars | Free | Open-Source | Licence |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AutoExtract | 0.984 | 0.956 | 0.970 | 0.470 | Unknown | | | | FALSE | FALSE | Commercial licence |
| DiffBot | 0.958 | 0.944 | 0.951 | 0.348 | Unknown | | | | FALSE | FALSE | Commercial licence |
| boilerpipe | 0.850 | 0.870 | 0.860 | 0.006 | Python | 10/03/2017 | 21 | 493 | TRUE | TRUE | Apache v2 |
| dragnet | 0.925 | 0.889 | 0.907 | 0.221 | Python | 04/16/2019 | 17 | 951 | TRUE | TRUE | MIT |
| html-text | 0.500 | 0.994 | 0.665 | 0.000 | Python | 07/22/2020 | 10 | 71 | TRUE | TRUE | MIT |
| newspaper | 0.917 | 0.906 | 0.912 | 0.260 | Python | 09/28/2019 | 349 | 10968 | TRUE | TRUE | MIT |
| readability | 0.913 | 0.931 | 0.922 | 0.315 | Python | 07/04/2020 | 29 | 1969 | TRUE | TRUE | Apache v2 |
| xpath-text | 0.246 | 0.992 | 0.394 | 0.000 | XPath | | | | TRUE | TRUE | |
| trafilatura | 0.925 | 0.966 | 0.945 | 0.221 | Python | 04/21/2021 | 12 | 125 | TRUE | TRUE | GPL v3 |
| go_readability | 0.912 | 0.975 | 0.943 | 0.210 | Go | 10/11/2020 | 3 | 266 | TRUE | TRUE | MIT |
| readability.js | 0.853 | 0.924 | 0.887 | 0.149 | Javascript | 01/13/2021 | 160 | 3956 | TRUE | TRUE | Apache v2 |
| go_domdistiller | 0.901 | 0.956 | 0.927 | 0.066 | Go | 12/22/2020 | 1 | 8 | TRUE | TRUE | MIT |
| news_please | 0.917 | 0.906 | 0.911 | 0.249 | Python | 05/06/2021 | 18 | 1038 | TRUE | TRUE | Apache v2 |
| goose3 | 0.930 | 0.847 | 0.887 | 0.227 | Python | 04/27/2021 | 12 | 492 | TRUE | TRUE | Apache v2 |
| inscriptis | 0.517 | 0.993 | 0.679 | 0.000 | Python | 01/04/2021 | 1 | 67 | TRUE | TRUE | Apache v2 |
| html2text | 0.499 | 0.983 | 0.662 | 0.000 | Python | 01/16/2021 | 60 | 1090 | TRUE | TRUE | GPL v3 |
| beautifulsoup | 0.499 | 0.994 | 0.665 | 0.000 | Python | 10/03/2020 | | | TRUE | TRUE | MIT |
| justext | 0.858 | 0.754 | 0.802 | 0.088 | Python | 03/06/2016 | 8 | 425 | TRUE | TRUE | BSD 2-Clause |
Scrapinghub’s research focuses on extracting text from a specific set of web pages – news and blogs. Our target is broader, but these extractors generally work well for extracting relevant text from any kind of website.
When deciding which one to choose for Pincone, we compiled a small dataset of links → ran the extractors mentioned above → generated labels using the extracted text. We only considered free and open-source extractors. Also, since the auto-labeling stack is written in Python, we only considered extractors written in Python.
The result of this experiment showed us that extractors with high F1 scores (>0.85) consistently produced very similar labels, which indicates that all of them are probably good enough. On the other hand, “simple” extractors with bad F1 scores and high recall, such as beautifulsoup and html-text, generated pretty bad labels. This is because these extractors tend to extract every piece of text from the website, which adds a lot of noise to the text.
In the end, we decided to use Readability. It has a good F1 score, it’s written in Python, it’s open-source, its licence allows us to use it in Pincone, and it hasn’t been abandoned by its maintainers. Trafilatura is also a very good extractor with an even higher F1 score, but unfortunately, it’s licenced under GPL v3, which doesn’t allow us to use it in Pincone.
Here’s a code snippet which shows how to extract text with Readability. Readability is actually a tool which generates a “readable” version of the HTML, which is why we use html-text as the last step to extract the actual text.
```python
import html_text
from readability import Document


def extract_text(html):
    # Readability builds a cleaned-up, "readable" HTML view of the page;
    # html_text then turns that cleaned HTML into plain text
    doc = Document(html)
    return html_text.extract_text(doc.summary())
```
OK, now that we’ve extracted the text from the HTML, let’s generate labels from it. Labels are generated by extracting keywords from the text and using the top keywords as labels. Keywords are not perfect labels, and the obvious downside is that you can never get a label that doesn’t appear in the text, but keywords are very often close to those “true” labels that we would choose ourselves. The (technical) upside of using keywords as labels is that it’s a completely unsupervised method: we don’t need any training data, and there are a lot of keyword extractors available.
We considered keyword extractors from the textacy package: TextRank, YAKE, sCAKE, and SGRank. Unfortunately, we didn’t have any data to properly evaluate how keyword extractors perform for label generation. Instead, we collected a small dataset of various links → generated labels for those links with every keyword extractor → manually inspected those labels and decided which keyword extractor to choose.
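Here’s a minimal sketch of how such a side-by-side comparison can be run with textacy (the extractor functions come from textacy’s keyterms module; the `top_n` value and the dictionary layout are our illustrative choices):

```python
import textacy
from textacy.extract.keyterms import textrank, yake, scake, sgrank


def compare_extractors(text, top_n=5):
    # Build one spaCy doc and run all four keyword extractors on it
    doc = textacy.make_spacy_doc(text, "en_core_web_sm")
    return {
        "textrank": textrank(doc, topn=top_n),
        "yake": yake(doc, topn=top_n),
        "scake": scake(doc, topn=top_n),
        "sgrank": sgrank(doc, topn=top_n),
    }
```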
This is our general overview of keyword extractors that we tested:
TextRank — TextRank is very verbose: it often generates keywords consisting of more than three or four words, which is not very desirable for labels, and the keywords themselves are often not very good. In our opinion, of all the keyword extractors that we tested, TextRank is the worst fit for auto-labeling.
YAKE — YAKE consistently produced good keywords in our experiments. Keywords are very concise, often capturing the essence of the link content which is what we need.
sCAKE — Like YAKE, sCAKE consistently produced good keywords in our experiments, but it was a bit too verbose. It’s not as bad as TextRank, but it often generates too many words for a single label.
SGRank — SGRank was not bad in our experiments, but it was not as good as YAKE or sCAKE: it didn’t consistently capture the keywords most relevant to the content.
Our dilemma was between YAKE and sCAKE. In the end, we decided to use YAKE because its keywords are more concise and therefore more suitable as labels. There are a couple of real examples of links with labels generated by each extractor at the end of this post, so you can get a feel for how these keyword extractors perform.
The code sample below shows a minimal working example of the complete auto-labeling pipeline for the English language.
```python
import requests
import textacy
from textacy.extract.keyterms import yake
import html_text
from readability import Document


def extract_text(html):
    # Isolate the "main body" of text from the raw HTML
    doc = Document(html)
    return html_text.extract_text(doc.summary())


def generate_keywords(text):
    # Extract lemmatised uni- and bigram keywords with YAKE
    doc = textacy.make_spacy_doc(text, "en_core_web_sm")
    return yake(doc, normalize="lemma", ngrams=(1, 2))


def generate_labels(url):
    html = requests.get(url).content
    text = extract_text(html)
    return generate_keywords(text)
```
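Calling the pipeline then looks like this (the URL is purely illustrative):

```python
# yake returns (keyterm, score) pairs; the keyterms are used as labels
labels = generate_labels("https://example.com/some-article")
print(labels)
```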
Pincone auto-labeling relies on spaCy for language support, so the languages we can auto-label are the ones spaCy supports at the moment (the full list is in the spaCy documentation).
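Assuming the corresponding spaCy model is installed, supporting another language only means swapping the model name. For example, a hypothetical German variant of `generate_keywords` might look like this:

```python
# Hypothetical German pipeline, assuming the model was installed with:
#   python -m spacy download de_core_news_sm
def generate_keywords_de(text):
    doc = textacy.make_spacy_doc(text, "de_core_news_sm")
    return yake(doc, normalize="lemma", ngrams=(1, 2))
```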
The time needed to generate keywords grows linearly with the length of the text. Depending on your use case, this can be an issue. In Pincone, keyword extraction is part of the link-scraping pipeline, which needs to be as fast as possible to ensure a good experience for users.
We set the timeout at three seconds for the keyword-extraction step to avoid compromising the UX of adding links. In the majority of cases, this works just fine: keywords are extracted before the timeout is hit, and users are served labels along with the link content. But for links with a lot of content, keyword extraction will often hit the timeout, and those users will not get autogenerated labels. That’s a trade-off we are willing to make.
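A minimal sketch of such a guard, using a thread pool to enforce the budget (this is an illustrative mechanism, not necessarily the one Pincone uses):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

KEYWORD_TIMEOUT_SECONDS = 3
# Shared worker pool; the pool size is an illustrative choice
executor = ThreadPoolExecutor(max_workers=4)


def generate_labels_with_timeout(url):
    future = executor.submit(generate_labels, url)
    try:
        return future.result(timeout=KEYWORD_TIMEOUT_SECONDS)
    except TimeoutError:
        # Extraction took too long: serve the link without labels.
        # Note that the worker thread itself keeps running to completion.
        return []
```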
The graph below shows the speed of auto-labeling depending on document length. We scraped 800+ links from the Hacker News front page and ran auto-labeling on the text. Document length refers to the length of text after extracting it from HTML. We can see from the graph that the vast majority of websites can be auto-labeled in under a second.
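The exact benchmarking code isn’t published; here’s a minimal sketch of how such per-document timings can be collected:

```python
import time


def time_labeling(text):
    # Returns (document length, seconds spent extracting keywords)
    start = time.perf_counter()
    generate_keywords(text)
    return len(text), time.perf_counter() - start
```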
The quality of extracted keywords as labels is debatable. This is a very subjective matter: if you gave 10 people the task of writing labels for an article, you’d likely get 10 sets of labels that don’t overlap much. Our goal here was to capture what the link is about. The user can accept those labels, change them, or remove them. The point is that users don’t have to write labels themselves for every link.
Capturing the essence of the link content with a keyword extractor works fine in most cases, but not always. Sometimes only a subset of the generated labels fits the link content, or, in rare cases, none of them fit. This can happen because there is too much noise in the text, because there is not enough text to analyse well, or because the keyword extractor simply failed to capture the content. We try to minimise these cases by adjusting the threshold for keyword confidence and by filtering out keywords which don’t make good labels, such as stopwords, dates, or ordinals.
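As an illustration, here’s a minimal sketch of such a filter built on spaCy’s stopword flags and named-entity labels (the exact rules and the confidence threshold Pincone uses aren’t shown here):

```python
import spacy

nlp = spacy.load("en_core_web_sm")


def filter_keywords(keywords):
    kept = []
    for term, score in keywords:
        doc = nlp(term)
        if all(token.is_stop for token in doc):
            continue  # drop pure stopword phrases
        if any(ent.label_ in ("DATE", "ORDINAL") for ent in doc.ents):
            continue  # dates and ordinals rarely make good labels
        kept.append((term, score))
    return kept
```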
Although it’s not without its issues, our auto-labeling pipeline is a simple and fast way of generating relevant labels for links. We believe it’s a valuable addition that will improve the experience of using Pincone for our users.
Below are some examples of links and the output from relevant keyword extractors.
| textrank | yake | scake | sgrank |
|---|---|---|---|
| strange tool usage pattern | tool | software design | work |
| software design | design | acceptable design principle | usage pattern |
| well tool | fun | design organization | Yuan comment |
| individual tool usage | body | product design | unapologetic |
| acceptable design principle | play | human body | thinking |
| textrank | yake | scake | sgrank |
|---|---|---|---|
| big new idea | idea | big new idea | new idea |
| feeble new idea | new | feeble new idea | domain expert |
| radical new idea | people | radical new idea | reasonable person |
| reasonable people | domain | reasonable domain expert | reason people |
| reasonable domain expert | new idea | reasonable people | way |
Google Blog: A simpler and safer future without passwords
| textrank | yake | scake | sgrank |
|---|---|---|---|
| password management easy | password | password management easy | management easy |
| complicated password | security | complicated password | information safe |
| strong password | safe | strong password | password |
| personal information safe | complicated | online security | complicated password |
| online security | online | security risk | big threat |
I want to thank Domagoj Alagić for pointing me to the textacy package, Ivan Božić for reading the drafts of this post, and last but not least, Lea Metličić for reading the drafts and editing the post. Thank you! 😊