TF-IDF and Prominent Words & Phrases: SEO Impact & Comparison

Michał Suski
8 min read · Jun 5, 2019

The relationship between words can be confusing at times, even though we have a number of ways to try to decipher it. One such method is known as TF-IDF, and it lets you determine how strongly specific words are connected to a chosen subject. What I aim to do here is explain what TF-IDF is and how it is calculated, as well as why it’s worth paying attention to in the first place.

What is TF-IDF?

Put simply, this is a method for calculating the weight of words based on how frequently they occur. More technically, it belongs to a group of algorithms that calculate the statistical weight of terms. It might sound a little confusing, but don’t let that stop you from reading on; it sounds harder than it actually is.

The analysis is based on how regularly the word occurs within the document, as well as the inverse frequency of the word within a specified collection of documents. This means that TF-IDF can show you which words carry the most importance within the text provided. We can also use the top ten websites within your niche as a reference and, on that basis, optimise your website using the high-frequency words.

TF (term frequency)

How term frequency is calculated

TF checks how often the word appears in the document in relation to the total amount of content within it.
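Written out, assuming the common length-normalised variant (TF has several variants, and this may not be the exact one used by any given tool):

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
$$

Here $f_{t,d}$ is the number of times the term $t$ occurs in the document $d$, and the denominator is the total number of terms in $d$. For example, a word that appears 3 times in a 100-word document has a term frequency of 0.03.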

IDF (inverse document frequency)

How inverse document frequency is calculated

IDF takes the number of documents in which the word occurs and compares it, as an inverse ratio, with the total number of documents in the set, so words that appear almost everywhere receive a low weight. The best part is that it can also pick out words that are popular within a topic, so the results fit the area you are working in.
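In the textbook form, assuming no smoothing (many implementations add a constant to avoid division by zero or zero weights):

$$
\mathrm{idf}(t, D) = \log \frac{N}{\left|\{d \in D : t \in d\}\right|}
$$

Here $N$ is the number of documents in the set $D$, and the denominator is the number of documents that contain the term $t$. With the natural logarithm, a word found in 10 of 1,000 documents gets $\log 100 \approx 4.6$, while a word found in every document gets $\log 1 = 0$.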

How TF-IDF is calculated
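
In the textbook form, the TF-IDF weight of a term is simply the product of the two values above. A minimal sketch in Python, assuming whitespace tokenisation, length-normalised TF and an unsmoothed natural-log IDF (the function names and toy documents are illustrative and not taken from any particular tool):

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def term_frequency(tokens):
    """Share of the document taken up by each term."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(tokenized_docs):
    """log(N / number of documents containing the term) for every term."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    idf = inverse_document_frequency(tokenized)
    return [
        {term: tf * idf[term] for term, tf in term_frequency(tokens).items()}
        for tokens in tokenized
    ]


docs = [
    "seo services for local business",
    "seo audit and seo reporting",
    "local business directory",
]
weights = tf_idf(docs)
# Rank the terms of the second document from most to least important.
ranking = sorted(weights[1].items(), key=lambda item: item[1], reverse=True)
```

In this toy set, "seo" appears twice in the second document but still ranks below "audit", because it also occurs in two of the three documents and its IDF is correspondingly low.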

Why is it interesting?

“TF-IDF analysis allows you to optimise the balance of terms in your content according to what is already being shown by the algorithm.” — Matt Diggity

Thanks to TF-IDF, we can discover words that are relevant in the context of a particular expression. This is ideal for fully optimising pages, as well as for building up relevant topics to earn better SEO results. It also allows you to rank the words from most important to least, which gives you a clear idea of the scope of the words for the selected topic.

TF-IDF efficiency

The set of documents that you use is the basic variable that affects the final weight of each word when the IDF is calculated. The problem with these collections is that the IDF needs to be recalculated for each of the words appearing in the documents whenever the collection changes, which makes the calculation hard to keep effective and efficient.

The larger the set of data in question, the more data you will need to process. This can strain the infrastructure and makes it harder to decide how large the collection should be. In the end, however, the larger the set, the more accurate the results you will receive.

Some of the problems that affect the efficiency of the TF-IDF calculation are as follows:

  • Specifying the size of a set of documents.
  • Taking care of the impartiality of the collection.
  • Cyclic updates resulting from the creation of new words.
  • The need to have separate collections for different languages.
  • Extremely expensive TF-IDF calculation for expressions consisting of two or more words.

Despite all of this, TF-IDF remains an exceptionally effective and useful tool when creating and optimising content. No tool is perfect, and this one works excellently to make things clearer for you.

What don’t we know about TF-IDF?

Of course, there are things we don’t know about TF-IDF as well, and that is part of what makes it so interesting. We don’t know if Google uses TF-IDF, and even if it does, we have no idea in what form. This is one of the most basic issues with the algorithm, because it depends so heavily on the set of documents being analysed.

If the set has been poorly matched or is incomplete, it will skew the overall word weights. For example, using Wikipedia as the set of documents for IDF analysis will not necessarily work well, because every collection is biased to some degree.
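
A small sketch of that effect, assuming an unsmoothed natural-log IDF and two made-up toy collections (both collections and the helper `idf` are hypothetical): the same word gets a very different weight depending on which set of documents it is measured against.

```python
import math


def idf(term, docs):
    """Unsmoothed IDF of `term` in `docs`; None if the term never occurs."""
    containing = sum(1 for doc in docs if term in doc.lower().split())
    if containing == 0:
        return None
    return math.log(len(docs) / containing)


# A general-purpose collection: "backlinks" is rare, so it looks important.
general = [
    "history of the roman empire",
    "how backlinks affect search rankings",
    "recipe for sourdough bread",
    "guide to alpine hiking",
]
# A collection drawn only from SEO pages: "backlinks" is almost everywhere.
seo_only = [
    "building backlinks for local seo",
    "backlinks and anchor text",
    "audit your backlinks",
    "keyword research basics",
]

idf_general = idf("backlinks", general)  # log(4 / 1)
idf_seo = idf("backlinks", seo_only)     # log(4 / 3)
```

Neither value is wrong in itself; each one simply reflects the collection it was computed from, which is why the choice of collection matters so much.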

However, if Google does, in fact, use TF-IDF, it is in a better position than any other tool on the market, because its corpus is made up of every piece of content on the web. The results would, therefore, be impartial and useful for comparison. According to Matt Diggity, some private correlation studies suggest that it is highly likely that Google does use it.

Effective TF-IDF analysis requirements

In order to get the best and most effective results from TF-IDF analysis, the following is required:

  • A large set of documents for valuable analysis.
  • A database with precalculated IDF for each word in order to get valuable results fast.
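
One way to picture the second requirement, as a sketch only: the IDF table is computed once from the reference collection and stored, so that scoring a new page is just one lookup per word. The in-memory dict below stands in for a real database, and `build_idf_table`, `score_page` and the default weight for unseen words are assumptions made for this example.

```python
import math
from collections import Counter


def build_idf_table(docs):
    """One-off pass over the reference collection (the expensive part)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    n_docs = len(docs)
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def score_page(text, idf_table, default_idf=0.0):
    """Weigh a new page using only lookups into the precalculated table."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {
        # Words missing from the table fall back to `default_idf` (an assumption).
        term: (count / len(tokens)) * idf_table.get(term, default_idf)
        for term, count in counts.items()
    }


reference = [
    "seo services and seo audits",
    "content marketing services",
    "technical seo checklist",
]
idf_table = build_idf_table(reference)  # computed once, reused for every page
scores = score_page("seo checklist for new services", idf_table)
```

Giving unseen words a weight of zero keeps the sketch simple; a real tool would need a deliberate policy here, since a word missing from the collection may be rare rather than unimportant.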

At Surfer, we realised how easily incorrect datasets can blur the entire picture, and how hard it is to define the best ones. We decided that it would be best to leave this decision to Google’s algorithm and to analyse the semantic features of the top ten results instead.

We believe that the top ten is representative of the most relevant websites, according to Google’s algorithm. Surfer works to find the common words and phrases for these websites, and in many cases, the results are very similar to the TF-IDF output of other tools. After this, Surfer collects a second set of keywords: the most popular words and phrases that occur on each of the top ten websites.

Surfer then cross-examines both data sets (common and popular) before selecting the most meaningful words and phrases. Those that are less important (think privacy policies and terms and conditions) are rejected and set aside. As a result of these operations, we get the most prominent words, just like with TF-IDF, but we are also more resistant to potential errors.

It is also worth noting that TF-IDF may be a ranking factor, but there are a number of different algorithms that Google is able to use. The Prominent words and phrases function will work regardless of the method, while TF-IDF only works if we assume that Google uses this (or a similar) technique.

How Prominent words and phrases are calculated

The process of determining Prominent words and phrases begins in the same way as standard TF-IDF, that is, with the calculation of the frequency of occurrence. The results of this step are presented as two tables: one for popular words and one for popular phrases. A word or phrase appears there if it is one of the top 30 and occurs in the content at least twice.

The second part of this process is known as the calculation of common words and expressions, and it covers the pages appearing in the top ten search results. We rely directly on the current results provided by the Google algorithm, and a word or phrase is listed if it appears on at least four pages from the top ten.

The two sets are then intersected, so the results contain only the words and phrases that are found in both. The phrases obtained in this manner paint an accurate and clear picture of the content that is currently being surfaced by Google, regardless of the way in which that content was analysed.
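
The steps described above can be sketched roughly as follows. This is an interpretation of the description, not Surfer's actual implementation: it assumes whitespace tokenisation, treats the top-30 and at-least-twice rules as applying per page, and handles single words only. All function names, parameters and sample pages are hypothetical.

```python
from collections import Counter


def popular_terms(pages, top_n=30, min_count=2):
    """Terms in a page's top `top_n` by frequency that occur at least `min_count` times."""
    popular = set()
    for page in pages:
        counts = Counter(page.lower().split())
        for term, count in counts.most_common(top_n):
            if count >= min_count:
                popular.add(term)
    return popular


def common_terms(pages, min_pages=4):
    """Terms that appear on at least `min_pages` of the given pages."""
    page_freq = Counter()
    for page in pages:
        page_freq.update(set(page.lower().split()))  # once per page
    return {term for term, n in page_freq.items() if n >= min_pages}


def prominent_terms(pages, top_n=30, min_count=2, min_pages=4):
    """Keep only the terms found in both the popular and the common set."""
    return popular_terms(pages, top_n, min_count) & common_terms(pages, min_pages)


# Toy stand-ins for the text of pages from the top ten results.
pages = [
    "seo services seo agency",
    "seo services pricing seo",
    "seo seo audit services",
    "seo company seo services",
    "privacy policy privacy",
]
result = prominent_terms(pages)
```

With these sample pages, `result` contains only "seo": "services" is common (four pages) but never repeated within a page, while "privacy" is repeated on one page but appears on no others, so each fails one of the two filters.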

Why we decided to go with Prominent instead of TF-IDF analysis

The results from TF-IDF are incredibly valuable, but they only cover single words. Expressions are what differentiate the content, and TF-IDF analysis for expressions from a large database is almost impossible due to the number of calculations involved. There are two reasons why we rely heavily on Prominent analysis:

  • We can analyse expressions that have a greater differentiating value.
  • We are not trying to recreate the Google algorithm; we analyse its results. Thanks to this, Prominent words and phrases are independent of how (and if) Google uses TF-IDF.

Results comparison

We conducted an experiment comparing TF-IDF analysis with Prominent words and phrases. The phrase that we used was “SEO services in the USA”. When we ran both analyses and examined the final results, we reached the following conclusions:

  • More words and expressions were found in Prominent.
  • There was greater accuracy, and instead of the word SEO we received a dozen variants.

Results: http://bit.ly/2QMUocD

As a side note, for the TF-IDF analysis, we used SEObility.

To Conclude

TF-IDF is both an effective and valuable way to optimise your content for a specific keyword. However, we decided that our own solution would be the best way forward. Prominent words and phrases are able to provide more data and are based directly on the results that have been provided by the Google algorithm. You would find TF-IDF in Surfer if it provided better guidance.

In many ways, Prominent is very similar to TF-IDF, and the first part of the process is the same. The second part is the one that differs (as we have explained), but even so, it is not particularly complicated and works just as quickly as standard TF-IDF. You will find that the results are much clearer and more detailed than usual thanks to the requirements for the Prominent list. It’s a system well worth using.

Originally published at https://surferseo.com.

Michał Suski

SEO expert, on-page optimization enthusiast and co-founder of time saving seo tool — Surfer