This is part 2 of 2 on our discussion about implementing text classification using machine learning. You can find part 1 here.
The most common way to classify ‘Machine Learning’ algorithms is by the level of supervision they require. Most classification algorithms belong to the supervised learning algorithms. Among all of these algorithms, I will focus in Support Vector Machines (SVM), due to their simplicity and the fact that they are one of the most popular classification algorithms over the last decade.
We have to point out that one of the main limitations that we all face with supervised learning algorithms is the necessity to create a set of examples big enough to successfully define the different groups that we want to classify. For this reason, the first thing that we have to do is to generate a good asset of examples.
SVMs are vectorial classification algorithms, that is to say, they are based on the division of points inside a multi-dimensional space. In order to make this separation, hyperplanes are calculated so that the distance between those hyperplanes and the points in each group is maximized. Sometimes, the groups are not linearly divisible. In these cases various transformations to the vectorial spaces are performed trying to make the groups linearly divisible.
As I previously mentioned, SVMs can successfully classify points inside a multi-dimensional space. But how can we apply this to a real-life problem, where we classify texts and not points or vectors? The answer is simple: parameterizing the texts so that the most important features are represented with each dimension of the space. If we do this, each text will be defined by a point inside the vectorial space and as a result, we will be able to apply the mentioned algorithm.
It is this parameterization task where the most important difficulty lays. There are a lot of ways to parametrize a text. And we ought to keep in mind that a small difference can determine whether a classifier works or not. What’s more, the number of characteristics should be limited, as a problem where the dimensions are infinite cannot be solved.
Let’s apply these facts to a real problem: URL classification.
As we already mentioned in the first post, using a list of ‘tokens’ associated to each joint of URLs is neither optimal nor viable. Despite this, having a partial solution like this one can give us some tips about how we should parametrize the examples in order to apply the SVM algorithm.
In that naive method, we used a list of ‘tokens’ that could be found in some URLs of each group. The main issue was the fact that we did not have a list of tokens that was big enough.
Using SVMs, we can workaround this. One way is to get all possible ‘tokens’ in which example URLs can be splitted using a simple regular expression and associate each one to a dimension. In this case, the selection of the optimal ‘tokens’ will be implicit, as the classifier will always select the examples that contain the most common tokens among the elements in each group.
This solution greatly improves the manual one presented in the previous post. First of all, it uses all possible tokens that can be found in the example list. Second, it also considers negative examples ie words that are only found in URLs that do not belong to a certain group, even if they also contain a token that most of the ULRs in that group share. An example of this second case is the URL “http://www.kompyte.com/comments/feed”, as containing “comments” means that the URL is not the blog’s feed, even though it contains the word “feed”.
Avoiding slanted solutions:
With SVMs, we have greatly improved our naive solution. Nevertheless, the algorithm is still too slanted, as it only considers each page’s URL and not every possible token is taken into consideration because the number of examples is very limited. Now, we will try to find some workarounds to these persistent issues.
To fix the first issue, we could add more sources for the token list. For example we could use the ‘title’ or ‘meta’ tags, or even all the texts contained in each webpage. However, it opens a new difficulty: the number of tokens can be too high to be processable. In this case, we have to limit the number of possible token. One of the most common ways to do this is to sort them according to their frequency and discard the least frequent ones.
To increase the number of examples and as a result reduce the bias of the solution, one possible solution is to automatically find new examples by applying the classifier to a group of non-classified URLs. This method can be repeated successively to further increase the number of examples. This method is commonly named “bootstrapping”.