
Using machine learning (for those who can't code) to help with your keyword research

I've already written about why keyword research isn't dead. A point I constantly make is that categorizing your keywords is essential if the research is going to be useful, because it lets you optimize for topics and clusters rather than individual keywords.

My keyword research documents often run to 20,000 to 50,000 keywords, which are typically split into two, three, or sometimes more categories that reflect the taxonomy of the site in question.

As you can see, I have divided the keywords into 4 filterable columns, in which you can select a specific "topic" and display the collective search volume for a cohort of keywords. What you can't see is that there are over 8,000 keywords.

A few years ago, I categorized this fairly manually, using some simple formulas where I could. It took forever. So I created a keyword categorization tool to help me. It was built with PHP and is still fairly rudimentary, but it has shortened the time it takes me to do keyword research and categorize it from a few days to 12 to 15 hours depending on how many keywords there are.

I'm a sucker for a trend. The moment every SEO started shouting about how great Python is, I jumped on the bandwagon, of course. My goal is to optimize the keyword research process further, and I love learning such an adaptable language. But then I came across this video by David Sottimano, in which he introduced BigML into my life. Imagine an online drag-and-drop machine learning service. A system that literally anyone can use. This is BigML.

I'm still pursuing my ultimate goal of mastering Python, but in the meantime BigML has given me some very interesting insights that have already sped up my keyword categorization. The aim of this article is to give you some ideas for using (free) technologies that already exist to work smarter.

Before we get into it: BigML is a freemium tool. There is a monthly fee if you want to process a lot of data or add extra features (e.g. more than one person on the account at the same time). However, to get the results in this article, the free tier is more than sufficient. In fact, the free tier will probably always be enough for you unless you are a serious data scientist who needs to analyze a lot of variables.

Step 1 – Get Training Data

In this example, we'll pretend we are doing keyword research for River Island (a large clothing retailer in the UK, for all my friends across the pond). (If you are reading this and work for River Island, I won't be doing the full keyword research.)

If we look at River Island's site taxonomy, we see the following:

For the purpose of this guide, we'll only do keyword research for menswear and focus on these few product lines:

Suppose I hypothetically want to divide my keywords into the following categories and subcategories:

Tops > Coats and Jackets

Tops > T-Shirts and Vests

Bottoms > Jeans

Bottoms > Pants and Chinos

We'll do "Bottoms" first.

Get the "Jeans" URL for River Island and plug it into SEMRush:

Filter by the top 20 positions and export the keywords:

I chose the top 20 positions because, below that, sites often rank for irrelevant and sometimes quite strange keywords. Yes, River Island ranks 58th for this term:

We do not want these terms to affect our training model.

If we filter "Jeans" by keywords at positions 1 to 20 and export, we get 900-odd keywords. Put them in a spreadsheet and add the headings "Category 1" and "Category 2". Then drop "bottoms" into Category 1 and "jeans" into Category 2 and fill down:
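If you would rather script this step than fill down in a spreadsheet, here is a minimal Python sketch of the same idea. The sample keywords, the "Keyword" heading and both function names are my own assumptions for illustration; a real SEMRush export has more columns.

```python
import csv
import io

def label_keywords(keywords, category_1, category_2):
    """Attach the same two category labels to every keyword in one export."""
    return [
        {"Keyword": kw, "Category 1": category_1, "Category 2": category_2}
        for kw in keywords
    ]

def to_csv(rows):
    """Serialize labelled rows to CSV text with the training-data headings."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["Keyword", "Category 1", "Category 2"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical sample keywords standing in for a SEMRush export.
jeans_export = ["mens skinny jeans", "ripped jeans men", "black jeans"]
training_rows = label_keywords(jeans_export, "bottoms", "jeans")
print(to_csv(training_rows))
```

Each export gets one call with its own pair of labels, and the labelled rows can then be combined into a single training file.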

This is the beginning of your machine learning "training data". There is probably already enough data here, but I like to be thorough, so I will also pull all the keywords from a company that I know ranks highly for nearly every clothing-based keyword: ASOS.

I will repeat the process for their jeans page:

After exporting the resulting ranking keywords from SEMRush, adding them to my spreadsheet, filling down the categories and deduplicating the list, I have 1,300 keywords for Bottoms > Jeans.

I will repeat the process for:

Bottoms > Pants and Chinos

Tops > Coats and Jackets

Tops > T-Shirts and Vests

For these three I didn't bother plugging the River Island domain into SEMRush, because ASOS ranked for so many keywords that there was enough data for my training model.

After a quick search and replace to remove brand keywords:

And after deduplication, I have almost 8,000 keywords left, which are divided into "bottoms" and "tops" on the first level and into subcategories such as "jeans" and "pants / chinos" on the second level.

Tip – You may need to use the TRIM function to remove stray spaces after the search and replace, otherwise this sheet will upload incorrectly when we use it as training data:
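For anyone who prefers a script to find-and-replace, the brand removal, trimming and deduplication can be sketched in a few lines of Python. The function name and sample keywords are hypothetical, and lowercasing everything before deduplicating is my own assumption rather than something the spreadsheet does.

```python
import re

def clean_keywords(keywords, brand_terms):
    """Remove brand terms, trim stray spaces, and drop duplicates (order kept)."""
    # Match any brand term as a whole phrase, ignoring case.
    brand_pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in brand_terms) + r")\b",
        flags=re.IGNORECASE,
    )
    seen = set()
    cleaned = []
    for keyword in keywords:
        without_brand = brand_pattern.sub("", keyword)
        # Collapse repeated spaces and trim the ends, like TRIM in a spreadsheet.
        tidy = " ".join(without_brand.split()).lower()
        # Skip keywords that were only a brand name, and anything already seen.
        if tidy and tidy not in seen:
            seen.add(tidy)
            cleaned.append(tidy)
    return cleaned

raw = [
    "asos skinny jeans",
    "skinny jeans ",
    "River Island skinny jeans",
    "mens chinos asos",
    "asos",
]
print(clean_keywords(raw, ["river island", "asos"]))
# prints ['skinny jeans', 'mens chinos']
```

Removing a brand term from the start or end of a keyword is exactly what leaves the stray spaces the tip warns about, which is why the trim happens after the replacement.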

Time spent so far: 5 minutes

You can of course do this for all of River Island's products and with any number of categories. If you cover both menswear and womenswear, gender would probably become the first category. You might then add a fourth category, in which things like "jackets" are split further into items like "puffer jackets" and "leather jackets".

If you struggle to work out which categories you might need, I will be writing a post on that shortly. Sometimes it's just common sense, but there is also a machine learning program that can help when you need it:

Step 2 – Training your machine learning model

Cool – we have our list of 8,000 unbranded keywords categorized in 5 minutes.

Save the file as CSV, then go to BigML and register. It's free.

Now we're going to follow some incredibly simple steps to train the machine learning model to categorize keywords.

Go to the Sources tab and upload your training data:

After loading, click the file to open the settings:

Click "Configure source" and make sure the category columns are set to "categorical":

In most cases, the rest of the settings should be fine. If you would like to learn more about them, I recommend watching BigML's education channel on YouTube here.

Close the "Configure source" settings and click the "Configure dataset" button. Then deselect "Category 2":

Click the "Create dataset" button:

Before you do, though, rename the "Dataset name" to "ML Blog Data (Category 1)".

Select your new dataset on the "Datasets" tab:

All of your keywords are now "tokenized", meaning each keyword has been split into its individual words. From here there are so many exciting models that you can train, but for the purposes of this article, we'll do the simplest. Select the 1-click supervised model:

After the calculation is complete, a decision tree like the following is displayed:

Again, I'm not going to go over everything you can do with it, but essentially a series of if statements is created from the data you provide to determine how likely each category is.

The circle I hovered over in the picture is, for example, a decision path with the following attributes: if the keyword does not contain "jeans" or "pants", it is probably "tops", with a confidence score of 85.71%.
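To make the "series of if statements" idea concrete, here is a toy Python sketch of a single decision path. It is not BigML's algorithm: the rule is written by hand, the training rows are made up, and the confidence here is just the share of training rows on the same path that agree with the prediction, which is a simpler figure than the one BigML reports.

```python
def tokens(keyword):
    """Split a keyword into a set of lowercase word tokens."""
    return set(keyword.lower().split())

def on_bottoms_path(keyword):
    """Hand-written rule: does the keyword contain 'jeans' or 'pants'?"""
    return bool(tokens(keyword) & {"jeans", "pants"})

def predict_category_1(keyword, training_rows):
    """Follow one decision path and report a toy confidence for it."""
    prediction = "bottoms" if on_bottoms_path(keyword) else "tops"
    # Labels of the training rows that follow the same path as this keyword.
    same_path = [
        label for kw, label in training_rows
        if on_bottoms_path(kw) == on_bottoms_path(keyword)
    ]
    agreeing = sum(1 for label in same_path if label == prediction)
    confidence = agreeing / len(same_path) if same_path else 0.0
    return prediction, confidence

# Hypothetical training rows: (keyword, Category 1 label).
training_rows = [
    ("black skinny jeans", "bottoms"),
    ("mens cargo pants", "bottoms"),
    ("leather jacket", "tops"),
    ("white t shirt", "tops"),
    ("puffer coat", "tops"),
    ("slim fit chinos", "bottoms"),
]

print(predict_category_1("denim jacket", training_rows))  # prints ('tops', 0.75)
print(predict_category_1("ripped jeans", training_rows))  # prints ('bottoms', 1.0)
```

The "chinos" row is why the "tops" path is not 100% confident: it contains neither "jeans" nor "pants" but is still labelled "bottoms". A real tree would add another branch to deal with it.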

You can actually create a so-called "ensemble model", which is usually even more accurate. You can also split the data into a training set and a test set and run a controlled test to see how accurate the model is before you use it. If you want to know more about it, please contact me or read the documentation on the website.
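The controlled test boils down to holding back some labelled keywords, predicting them, and counting how many predictions match. Here is a rough Python sketch under my own assumptions: the rows are invented, the 80/20 split is arbitrary, and a fixed hand-written rule stands in for a trained model (in a real test you would train on the first part and score on the second).

```python
import random

def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle labelled rows and hold back a fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - test_size
    return shuffled[:cut], shuffled[cut:]

def accuracy(predict, rows):
    """Share of rows where the predicted label matches the true label."""
    correct = sum(1 for keyword, label in rows if predict(keyword) == label)
    return correct / len(rows)

def rule_predict(keyword):
    """Stand-in for a trained model: a single hand-written rule."""
    words = set(keyword.lower().split())
    return "bottoms" if words & {"jeans", "pants"} else "tops"

# Hypothetical labelled keywords: (keyword, Category 1 label).
labelled = [
    ("black jeans", "bottoms"), ("ripped jeans", "bottoms"),
    ("cargo pants", "bottoms"), ("slim fit chinos", "bottoms"),
    ("skinny jeans", "bottoms"), ("leather jacket", "tops"),
    ("puffer coat", "tops"), ("white t shirt", "tops"),
    ("denim jacket", "tops"), ("striped vest", "tops"),
]

train_rows, test_rows = split_rows(labelled)
print(len(train_rows), len(test_rows))
print(accuracy(rule_predict, test_rows))
```

The point of holding rows back is that the accuracy is measured on keywords the model has not seen, which is closer to how it will behave on your uncategorized list.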

So we created a model to categorize the keywords in Category 1. We now have to do the same for the second category.

Go back to your sources and select your training data again:

Repeat the above steps, but this time deselect Category 1 when you configure your dataset:

Create a 1-click supervised model as before:

Voila – Your second decision tree:

Now we have two trained models that use machine learning to categorize your keywords with a fairly high level of accuracy.

Time spent so far: 10 minutes (possibly one hour if you did all the product categories on the River Island website)

Get the rest of your keywords

So far we have only trained our models to cover 2 categories and 4 subcategories. Assuming you've trained them for every product on the River Island website (which will likely take an hour or two; you could even get a virtual assistant to do it for you while you put your feet up), the rest of your keyword research will be just as easy.

All I'm going to do now is plug the following competitor domains into SEMRush at domain level and export the ranking keywords for their entire sites (to clarify, I'm not going into every product folder as I did for the training data):
https: // www.

And I could go on.

After deduping all the keywords from these websites and removing branded keywords, I have about 100,000 uncategorized keywords.

I can also use some standard keyword research techniques, for example using merge words with Keyword Planner or Ahrefs' Keywords Explorer, to get even more keyword suggestions. The nice thing is that we don't have to spend a long time making sure the keywords we export are categorized correctly. We can literally just plug in domains and seed keywords and export.

Then you'll put this huge, ugly, uncategorized list into Google Sheets:

Time spent so far: 25 minutes (or an hour and 25 minutes if you did all the product categories on the River Island website)

Using the BigML API to categorize your keywords

Get the BigML add-on for Google Sheets:

You will need to enter your username and API key, but you can find them easily in the settings of your BigML dashboard.

Now the fun begins.

Highlight the range that needs to be categorized and select the trained model that you want to use. In this case I'm using Category 1 (at the moment I think we can only predict one category at a time. I haven't figured out how to do both at once, which is why we trained two different models):

Then click on "Predict" and let it run:

It may take a while depending on how many keywords you have, but at least you can get on with other tasks in the meantime. You will notice that there is also a confidence score. I tend to filter for anything below 50% and delete those rows. I have 100,000 keywords, so I won't miss a few.

Next we make a copy of the sheet, delete the two new columns and do exactly the same for Category 2:
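The same filter is easy to express in Python if you ever export the predictions. This is a small sketch with invented rows; the field names are hypothetical, and I'm assuming the score is stored as a fraction between 0 and 1 rather than as a percentage.

```python
def keep_confident(predictions, threshold=0.5):
    """Keep only the rows whose confidence is at or above the threshold."""
    return [row for row in predictions if row["confidence"] >= threshold]

# Hypothetical predictions as they might look after running "Predict".
predictions = [
    {"keyword": "mens skinny jeans", "category": "bottoms", "confidence": 0.92},
    {"keyword": "summer outfit ideas", "category": "tops", "confidence": 0.41},
    {"keyword": "leather jacket", "category": "tops", "confidence": 0.5},
]

kept = keep_confident(predictions)
print([row["keyword"] for row in kept])
# prints ['mens skinny jeans', 'leather jacket']
```

Raising the threshold trades coverage for cleaner categories, which is an easy trade to make when you start with 100,000 keywords.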

Once we have both categorizations and have deleted the keywords with a low "confidence", all you have to do is clear the formatting and then run a VLOOKUP to merge them:

Repeat this for as many categories as you need, and then pull in all the other important data for your final keyword research document:
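For the curious, the lookup-and-merge step looks roughly like this in Python. The rows and field names are made up for illustration, and dropping any keyword that is missing from the second sheet is my own choice here (in a spreadsheet, the lookup would return an error for those rows instead).

```python
def merge_categories(category_1_rows, category_2_rows):
    """Join two prediction sheets on keyword, like a lookup between sheets."""
    # Build a lookup table from keyword to its Category 2 prediction.
    lookup = {row["keyword"]: row["category"] for row in category_2_rows}
    merged = []
    for row in category_1_rows:
        # Skip keywords removed from the second sheet for low confidence.
        if row["keyword"] in lookup:
            merged.append({
                "keyword": row["keyword"],
                "category_1": row["category"],
                "category_2": lookup[row["keyword"]],
            })
    return merged

# Hypothetical sheets after the low-confidence rows were deleted.
sheet_1 = [
    {"keyword": "mens skinny jeans", "category": "bottoms"},
    {"keyword": "leather jacket", "category": "tops"},
    {"keyword": "mens chinos", "category": "bottoms"},
]
sheet_2 = [
    {"keyword": "mens skinny jeans", "category": "jeans"},
    {"keyword": "leather jacket", "category": "coats and jackets"},
]

for row in merge_categories(sheet_1, sheet_2):
    print(row)
```

Only keywords that survived the confidence filter in both sheets end up in the merged list, so every remaining keyword has a value for each category.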

Some concluding remarks

So there we have it – an easy way to categorize 100,000 keywords in less than a few hours of actual work (I mean, you have to wait for the ML to go through the keywords one by one, but that isn't work for you).

I haven't found a way to predict both categories at the same time, but I imagine there is one. The model we used is not as accurate as some of the other options in the engine. For example, an ensemble model would probably give better results, especially if the training set were smaller, but the configuration is somewhat more complicated. You can also use the engine to identify categories and closely related topics. But that's for another post.

It's pretty simple, but surprisingly powerful and a really nice introduction to machine learning. Have fun!

The opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.

About the Author

Andy Chadwick is a digital marketing consultant who specializes in SEO but also covers PPC services with his company digitalquokka. He is primarily known for his unique approach to keyword research, having developed his own tools that help categorize keywords. Andy started teaching himself SEO in 2013 when he co-founded a company that raised over £2.5m in its third year. Since leaving the company in 2018, he has consulted for and supported other start-ups and international organizations with their digital marketing strategies.