As developers, we need to be ready to rapidly find solutions to the many problems that we face every day and, as we know, every second counts. This need to get as much done as fast as possible sometimes forces us to implement solutions that are not optimal but are easy and fast to build. Many times, these implementations either require a high degree of manual interaction or they are too tailored to the particular problem at hand.

One common case where we see this kind of mistake is when we have to design a system that classifies data into different groups according to certain criteria. These problems often seem so specific that we tend to think that, with just a few simple lines of code, the groups will be perfectly delimited.

Text Classification Using Machine Learning

A real-life example of this kind of problem is URL classification. Kompyte is able to classify different types of URLs and understand if a specific URL is, for example, a main page, a blog page, a landing page, and so on. This is really useful: it saves our users time, and we can easily detect pages that require urgent attention, like new promotions or updates to pricing pages.

The most straightforward solution that might come to mind is to look for special words (or ‘tokens’) either in the URL or in the webpage itself. Main pages are usually those with an empty path (like https://www.kompyte.com/), blog pages tend to be those with the words ‘blog’ or ‘news’ anywhere in the URL (for example https://www.kompyte.com/blog/), and pricing pages usually contain a ‘pricing’ token.
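To make that concrete, here is a minimal sketch of such a token-based heuristic in Python. The category names, token lists, and the classify_url helper are hypothetical and only illustrate the idea; they are not Kompyte's actual implementation.

```python
import re
from urllib.parse import urlparse

# Hypothetical token lists for each page category (not Kompyte's real rules).
CATEGORY_TOKENS = {
    "blog": {"blog", "news"},
    "pricing": {"pricing"},
}

def classify_url(url: str) -> str:
    """Classify a URL by looking for known tokens in its path."""
    path = urlparse(url).path.strip("/")
    if not path:
        return "main"  # empty path -> assume it is the main page
    # Split the path on common separators and compare against the token lists.
    tokens = set(re.split(r"[/\-_.]+", path.lower()))
    for category, keywords in CATEGORY_TOKENS.items():
        if tokens & keywords:
            return category
    return "unknown"

print(classify_url("https://www.kompyte.com/"))         # main
print(classify_url("https://www.kompyte.com/blog/"))    # blog
print(classify_url("https://www.kompyte.com/pricing"))  # pricing
```

A few lines of code, one dictionary of keywords, and every URL seems to fall neatly into place.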

Sounds like a simple, efficient and clean solution, doesn’t it? Sadly, it is not.

If we look at real examples, we find that those words cover only a small set of cases. There are many other words that can identify those sections that we didn’t take into consideration, for example ‘main’, ‘plans’, and the translations of each of these words into other languages. It is clear now that a biased implementation such as this one, even though it is easy and fast to build, does not actually solve the problem. What’s worse, it requires a lot of manual interaction: not everything is detected automatically, so we have to add each new case by hand as it comes up.

Obviously, we need a solution that is both more general and self-sufficient, so that it can classify URLs with minimal manual interaction. In our line of work, it is crystal clear that the extra time needed to develop a more complex but automatic solution will eventually pay off.

To learn how we solved this particular problem, stay tuned for the second part of this article. That’s right, it’s a cliffhanger – you can find the second part right here.