In the previous part we continued work on our binary classifier. We showed how to use scikit-learn's Pipeline class to combine feature calculation and prediction into one simple step. In this part we will improve the performance of our model by adding new features. This will also let us show how useful pipelines are when prediction depends on many different features.
Additional features and DictVectorizer
So far we used only one column of our dataset: uri. Now we would like to use the remaining columns: method, http_version, is_static and __has_referer__ as additional features. But first we need to convert them into vectors that can be used by the predictive model. For this purpose we will use another of the scikit-learn transformers: DictVectorizer. DictVectorizer implements so-called one-hot encoding for categorical features stored in dictionaries. Before we explain what one-hot encoding is, let's just look at an example:
For now we will focus on only two columns: http_version (string) and is_static (boolean). Let's take the first 10 records from our development dataset, select these two columns and convert the result to a list of dictionaries (this is the format accepted by DictVectorizer). We will use these 10 records to show how DictVectorizer transforms data.
Let's now use DictVectorizer's fit_transform() method to transform our records. By default DictVectorizer outputs data in sparse format, but for this small dataset we will set sparse=False. Each record is then represented by a list with three binary values. The dictvectorizer.vocabulary_ dictionary shows what each position represents.
The first element is equal to one when http_version is equal to v1.1, the second when http_version is equal to v2.0, and the last when is_static is true. This is how one-hot encoding works: each possible value of http_version seen in the dataset gets its own binary column. Boolean values like is_static are encoded as a single column, since they are already binary.
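To make the encoding concrete, here is a minimal sketch on three made-up records. The values are illustrative assumptions, not rows from our dataset, but the column names and the DictVectorizer calls are the ones described above.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records in the list-of-dictionaries format DictVectorizer accepts.
records = [
    {"http_version": "v1.1", "is_static": False},
    {"http_version": "v2.0", "is_static": True},
    {"http_version": "v1.1", "is_static": True},
]

# sparse=False returns a dense NumPy array, which is easier to read for tiny inputs.
dictvectorizer = DictVectorizer(sparse=False)
encoded = dictvectorizer.fit_transform(records)

print(encoded.tolist())
# [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]]

print(dictvectorizer.vocabulary_)
# {'http_version=v1.1': 0, 'http_version=v2.0': 1, 'is_static': 2}
```

String values get one column per distinct value (named column=value), while the boolean is passed through as a single 0/1 column.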
FeatureUnion and custom transformers
Our new model will use features from both CountVectorizer and DictVectorizer. In order to combine these two sets of features we can use scikit-learn's FeatureUnion. FeatureUnion will run both transformers in parallel on the input data and then concatenate the results.
However, we still have one more problem to solve. We would like to run the pipeline directly on our full dataset, which is stored as a pandas DataFrame. CountVectorizer should operate only on the uri column, and DictVectorizer should use all the columns besides uri. Our pipeline therefore requires an additional step that selects the appropriate data for each vectorizer. We will write a custom transformer to do this.
Any class that implements the fit() and transform() methods can be used as a transformer in scikit-learn's Pipeline. Additionally, we can inherit from BaseEstimator and TransformerMixin to obtain some common scikit-learn methods such as fit_transform() (which is just a shortcut for calling fit() and then transform()).
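One possible shape for such a transformer is sketched below. The class name ColumnSelector, its as_dicts flag and the two-row DataFrame are our own assumptions for illustration; the original transformer may look different.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: pick columns from a DataFrame for a later vectorizer."""

    def __init__(self, columns, as_dicts=False):
        self.columns = columns      # a single column name or a list of names
        self.as_dicts = as_dicts    # True -> list of dicts (needs a list of names)

    def fit(self, X, y=None):
        # Nothing to learn; returning self keeps the scikit-learn contract.
        return self

    def transform(self, X):
        selected = X[self.columns]
        if self.as_dicts:
            # DictVectorizer expects one dictionary per record.
            return selected.to_dict(orient="records")
        return selected


# Made-up data, only to show both modes of the selector.
df = pd.DataFrame({
    "uri": ["/index.php", "/static/logo.png"],
    "method": ["GET", "GET"],
    "is_static": [False, True],
})

uris = ColumnSelector("uri").fit_transform(df)
dicts = ColumnSelector(["method", "is_static"], as_dicts=True).fit_transform(df)
```

Here fit_transform() comes from TransformerMixin, and BaseEstimator supplies get_params() and set_params() based on the constructor arguments.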
Finally, we can create a processing pipeline that includes all the new features. There are two steps inside: FeatureUnion and XGBoost. FeatureUnion in turn combines two smaller pipelines: one that calculates text features and one that calculates categorical features.
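The structure described above could be assembled roughly as follows. This is a sketch under several assumptions: ColumnSelector is a hypothetical helper, the six-row DataFrame and its labels are invented, has_referer is an assumed spelling of the referer column, and scikit-learn's GradientBoostingClassifier stands in for XGBoost so that the example depends only on scikit-learn and pandas.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper that selects DataFrame columns for a vectorizer."""

    def __init__(self, columns, as_dicts=False):
        self.columns = columns
        self.as_dicts = as_dicts

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        selected = X[self.columns]
        return selected.to_dict(orient="records") if self.as_dicts else selected


categorical = ["method", "http_version", "is_static", "has_referer"]

model = Pipeline([
    ("features", FeatureUnion([
        # Text features: character n-grams of size 1 to 3 from the uri column.
        ("text", Pipeline([
            ("select", ColumnSelector("uri")),
            ("count", CountVectorizer(analyzer="char", ngram_range=(1, 3))),
        ])),
        # Categorical features: one-hot encoding of the remaining columns.
        ("categorical", Pipeline([
            ("select", ColumnSelector(categorical, as_dicts=True)),
            ("onehot", DictVectorizer()),
        ])),
    ])),
    ("classifier", GradientBoostingClassifier(random_state=0)),
])

# Invented toy data; the real dataset and labels come from the earlier parts.
df = pd.DataFrame({
    "uri": ["/index.php", "/static/a.css", "/admin.php", "/static/b.js",
            "/login.php", "/static/c.png"],
    "method": ["GET", "GET", "POST", "GET", "POST", "GET"],
    "http_version": ["v1.1", "v2.0", "v1.1", "v2.0", "v1.1", "v2.0"],
    "is_static": [False, True, False, True, False, True],
    "has_referer": [False, True, False, True, True, True],
})
labels = [1, 0, 1, 0, 1, 0]

model.fit(df, labels)
probabilities = model.predict_proba(df)[:, 1]  # probability of the positive class
```

Because the selectors live inside the pipeline, fit() and predict_proba() both take the raw DataFrame, and the two feature sets are concatenated column-wise by FeatureUnion.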
Notice that we also added the ngram_range parameter to the constructor of CountVectorizer. In the previous experiments we used only single characters as the tokens counted by CountVectorizer. This time CountVectorizer will operate on n-grams of size between 1 and 3. What are n-grams? An n-gram is a run of n consecutive items from a sequence. In our case the items are characters and the sequence is the uri. In natural language processing, words are often used as the items of n-grams. N-grams of words are also often called shingles.
To illustrate how n-grams are created, here is the list of all 3-grams of the uri /index.php:
/in, ind, nde, dex, ex., x.p, .ph, php.
As we mentioned earlier, we will use n-grams of size between 1 and 3, so as a result we will obtain attributes that represent occurrences of single characters (unigrams), pairs of characters (bigrams) and triples of characters (trigrams).
Now all that is left is to call fit() on our model and see what the results are.
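The sliding-window idea behind character n-grams can be sketched in a few lines of plain Python. The helper char_ngrams is our own illustration, not part of scikit-learn.

```python
def char_ngrams(text, n):
    """Return all runs of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


uri = "/index.php"

trigrams = char_ngrams(uri, 3)
print(trigrams)
# ['/in', 'ind', 'nde', 'dex', 'ex.', 'x.p', '.ph', 'php']

# Sizes 1 to 3 together, which is what ngram_range=(1, 3) asks for:
# 10 unigrams + 9 bigrams + 8 trigrams = 27 tokens for this uri.
all_ngrams = [gram for n in range(1, 4) for gram in char_ngrams(uri, n)]
print(len(all_ngrams))
# 27
```

CountVectorizer additionally lowercases its input by default and counts how often each n-gram occurs, so the actual features are counts per distinct n-gram rather than this raw list.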
As we can see, adding new features has resulted in a significant improvement of average precision. The score improved from 73% to 96%. This may be due to the n-grams, to the categorical features from DictVectorizer, or to both. We added both sets of attributes in a single step, so we cannot be sure which one contributed most to the increase in performance. However, we can find out by analyzing the feature importance extracted from xgboost.
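One way to compare the two feature groups is to sum the importances over each group's columns. The sketch below is built on assumptions: the data is invented, and scikit-learn's GradientBoostingClassifier is used as a stand-in for XGBoost (XGBoost's scikit-learn wrapper exposes a feature_importances_ attribute of the same name, although the numbers are computed differently).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy data, only to make the example runnable.
uris = ["/index.php", "/static/a.css", "/admin.php", "/static/b.js",
        "/login.php", "/static/c.png", "/wp-login.php", "/static/d.css"]
records = [{"method": m, "is_static": s} for m, s in [
    ("GET", False), ("GET", True), ("POST", False), ("GET", True),
    ("POST", False), ("GET", True), ("POST", False), ("GET", True)]]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

count = CountVectorizer(analyzer="char", ngram_range=(1, 3))
onehot = DictVectorizer(sparse=False)

# Text columns first, categorical columns second (the same order a
# FeatureUnion listing the text branch first would produce).
X = np.hstack([count.fit_transform(uris).toarray(), onehot.fit_transform(records)])

classifier = GradientBoostingClassifier(random_state=0).fit(X, labels)

n_text = len(count.vocabulary_)            # number of n-gram columns
importance = classifier.feature_importances_
by_group = {
    "n-grams": float(importance[:n_text].sum()),
    "categorical": float(importance[n_text:].sum()),
}
print(by_group)
```

The per-group totals only make sense if the column order of the importance vector matches the order in which the feature blocks were concatenated, which is why the split point n_text is taken from the fitted CountVectorizer.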
Not surprisingly, it seems that n-grams contributed most to the predictive model. We could probably enhance performance even more by creating more advanced features from the uri. Still, 96% average precision is already a very good result. We can now tune the probability threshold of our model to achieve the desired precision-recall tradeoff. Let's assume that our business requirement is a precision of at least 99.5%. A short piece of code can tell us which threshold value we need and what recall we will achieve at this precision.
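A minimal sketch of such a check, using scikit-learn's precision_recall_curve, is shown below. The function name threshold_for_precision and the eight scores are our own assumptions; in practice y_true and y_score would be the development labels and the model's predicted probabilities.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_precision(y_true, y_score, min_precision):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least min_precision, or None."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final (precision=1, recall=0) point has no threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    candidates = np.where(precision >= min_precision)[0]
    if candidates.size == 0:
        return None
    best = candidates[np.argmax(recall[candidates])]
    return float(thresholds[best]), float(precision[best]), float(recall[best])


# Invented labels and scores, only to demonstrate the call.
y_true = [0, 0, 1, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.35, 0.6, 0.65, 0.8, 0.9, 0.95]

threshold, precision, recall = threshold_for_precision(y_true, y_score, 0.995)
print(threshold, precision, recall)
# 0.65 1.0 0.8
```

On this toy input, every record scored 0.65 or higher is a true positive, so the precision requirement is met while four of the five positives are still caught.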