As developers, we need to be ready to find solutions quickly to the many problems we face every day and, as we know, every second counts. This constant pressure sometimes forces us to implement solutions that are non-optimal but quick and easy to build. Often, these implementations either require a high degree of manual intervention or are too narrowly tailored to the specific problem at hand.
A common case where we see this kind of mistake is when we have to design a system that classifies data into groups according to certain criteria. These problems often look so specific that we tend to think a few simple lines of code will delimit the groups perfectly.
Text Classification Using Machine Learning
A real-life example of this kind of problem is URL classification. Kompyte is able to classify different types of URLs and understand whether a specific URL is, for example, a main page, a blog page, a landing page, etc. This is really useful as it saves our users time, and we can easily detect pages that might require urgent attention, like new promotions or updates to their pricing pages.
The most straightforward solution that might come to mind is to look for special words (or ‘tokens’) either inside the URL or in the webpage itself. Main pages are usually those with an empty path (like http://www.kompyte.com/), blog pages tend to be those with the words ‘blog’ or ‘news’ in any position of the URL (for example http://www.kompyte.com/blog/) and pricing pages usually contain the ‘pricing’ token (http://www.kompyte.com/pricing/).
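To make the idea concrete, here is a minimal sketch of such a token-based classifier in Python. It is only an illustration, not Kompyte's actual code: the `TOKENS` table and the `classify_url` function are hypothetical names, and we assume that tokens are matched against whole path segments only (not against the domain or the page content).

```python
from urllib.parse import urlparse

# Hypothetical token table: page type -> tokens that suggest it.
TOKENS = {
    "blog": {"blog", "news"},
    "pricing": {"pricing"},
}

def classify_url(url: str) -> str:
    """Naive rule-based classifier that only looks at URL path segments."""
    path = urlparse(url).path
    # Split the path into non-empty, lower-cased segments.
    segments = [segment.lower() for segment in path.split("/") if segment]
    if not segments:
        # An empty path is treated as the main page.
        return "main"
    for page_type, tokens in TOKENS.items():
        if any(segment in tokens for segment in segments):
            return page_type
    # No known token was found anywhere in the path.
    return "unknown"
```

With this sketch, `classify_url("http://www.kompyte.com/blog/")` returns `"blog"`, and anything without a known token falls through to `"unknown"`.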
Sounds like a simple, efficient and clean solution, doesn’t it? Sadly, it is not.
If we check more examples, we quickly find that those words cover just a small subset of the actual cases. Many other words can identify those sections that we didn’t take into consideration: for example, ‘main’, ‘plans’ and the translations of each of these words into other languages. It is now clear that a biased implementation such as this one, even though it is quick and easy to build, does not actually solve the problem. What’s worse, it requires a lot of manual intervention: not everything is detected automatically, so we have to manually add every case we did not anticipate.
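A quick way to see the gap is to run a fixed token list over a handful of URLs and collect the ones it fails to recognize. The sketch below is purely illustrative: the `example.com` URLs, the `KNOWN_TOKENS` set and the `is_recognized` helper are all made up for this example, and the Spanish words ‘noticias’ and ‘precios’ simply stand in for translated section names.

```python
from urllib.parse import urlparse

# Hypothetical fixed token list, as in the naive approach.
KNOWN_TOKENS = {"blog", "news", "pricing"}

def is_recognized(url: str) -> bool:
    """True if the URL has an empty path or contains a known token."""
    segments = [s.lower() for s in urlparse(url).path.split("/") if s]
    return not segments or any(s in KNOWN_TOKENS for s in segments)

# Made-up sample URLs, including synonyms and translations.
sample_urls = [
    "http://www.example.com/",
    "http://www.example.com/blog/",
    "http://www.example.com/pricing/",
    "http://www.example.com/plans/",
    "http://www.example.com/main/",
    "http://www.example.com/noticias/",
    "http://www.example.com/precios/",
]

# URLs that the fixed token list cannot place in any category.
missed = [url for url in sample_urls if not is_recognized(url)]
print(f"{len(missed)} of {len(sample_urls)} URLs were not recognized")
```

Every entry in `missed` would need a new token added by hand, which is exactly the kind of manual upkeep we want to avoid.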
Clearly, we need a solution that is both more general and more self-sufficient, so that it can classify URLs with minimal manual intervention. In our line of work, the extra time needed to develop a more complex but automatic solution eventually pays for itself.
To learn how we solved this particular problem, stay tuned to the second part of this article. That’s right, it’s a cliffhanger – you can find the second part right here.