Integrated Knowledge Solutions

CountVectorizer to HashingVectorizer for Numeric Representation of Text

According to some estimates, the unstructured data accounts for about 90% of the data being generated everyday. A large part of unstructured data consists of text in the form of emails, news reports, social media postings, phone transcripts, product reviews etc. Analyzing such data for pattern discovery requires converting text to numeric representation in the form of a vector using words as features. Such a representation is known as vector space model in information retrieval; in machine learning it is known as bag-of-words (BoW) model.

In this post, I will describe different text vectorizers from sklearn library. I will do this using a small corpus of four documents, shown below.

corpus = ['The sky is blue and beautiful',
'The king is old and the queen is beautiful',
'Love this beautiful blue sky',
'The beautiful queen and the old king']

CountVectorizer

The CountVectorizer is the simplest way of converting text to vector. It tokenizes the documents to build a vocabulary of the words present in the corpus and counts how often each word from the vocabulary is present in each and every document in the corpus. Thus, every document is represented by a vector whose size equals the vocabulary size and entries in the vector for a particular document show the count for words in that document. When the document vectors are arranged as rows, the resulting matrix is called document-term matrix; it is a convenient way of representing a small corpus.

For our example corpus, the CountVectorizer produces the following representation.

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
Doc_Term_Matrix = pd.DataFrame(X.toarray(),columns= vectorizer.get_feature_names())
Doc_Term_Matrix

['and', 'beautiful', 'blue', 'is', 'king', 'love', 'old', 'queen', 'sky', 'the', 'this']

The column headings are the word features arranged in alphabetical order and row indices refer to documents in the corpus. In the present example, the size of the resulting document-term matrix is 4×11, as there are 4 documents in the example corpus and there are 11 distinct words in the corpus. Since common words such as “is”, “the”, “this” etc do not provide any indication about the document content, we can safely remove such words by telling CountVectorizer to perform stop word filtering as shown below.

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
Doc_Term_Matrix = pd.DataFrame(X.toarray(),columns= vectorizer.get_feature_names())
Doc_Term_Matrix

['beautiful', 'blue', 'king', 'love', 'old', 'queen', 'sky']

Although the document-term matrix for our small corpus example doesn’t have too many zeros, it is easy to conceive that for any large corpus the resulting matrix will be a sparse matrix. Thus, internally the sparse matrix representation is used to store document vectors.

N-gram Word Features

One issue with the bag of words representation is the loss of context. The BoW representation just focuses on words presence in isolation; it doesn’t use the neighboring words to build a more meaningful representation. The CountVectorizer provides a way to overcome this issue by allowing a vector representation using N-grams of words. In such a model, N successive words are used as features. Thus, in a bi-gram model, N = 2, two successive words will be used as features in the vector representations of documents. The result of such a vectorization for our small corpus example is shown below. Here the parameter ngram_range = (1,2) tells the vectorizer to use two successive words along with each single word as features for the resulting vector representation.

vectorizer = CountVectorizer(ngram_range = (1,2),stop_words='english')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
Doc_Term_Matrix = pd.DataFrame(X.toarray(),columns= vectorizer.get_feature_names())
Doc_Term_Matrix

['beautiful', 'beautiful blue', 'beautiful queen', 'blue', 'blue beautiful', 'blue sky', 'king', 'king old', 'love', 'love beautiful', 'old', 'old king', 'old queen', 'queen', 'queen beautiful', 'queen old', 'sky', 'sky blue']

It is obvious that while N-gram features provide context and consequently better results in pattern discovery, it comes at the cost of increased vector size.

TfidfVectorizer

Simply using the word count as a feature value of a word really doesn’t reflect the importance of that word in a document. For example, if a word is present frequently in all documents in a corpus, then its count value in different documents is not helpful in discriminating between different documents. On other hand, if a word is present only in a few of documents, then its count value in those documents can help discriminating them from the rest of the documents. Thus, the importance of a word, i.e. its feature value, for a document not only depends upon how often it is present in that document but also how is its overall presence in the corpus. This notion of importance of a word in a document is captured by a scheme, known as the term frequency-inverse document frequency (tf-idf ) weighting scheme.

The term frequency is a ratio of the count of a word’s occurrence in a document and the number of words in the document. Thus, it is a normalized measure that takes into consideration the document length. Let us show the count of word i in document j by $tf_{ij}$ . The document frequency of word i represents the number of documents in the corpus with word i in them. Let us represent document frequency for word i by $df_i$ . With N as the number of documents in the corpus, the tf-idf weight $w_{ij}$ for word i in document j is computed by the following formula:

$w_{ij} = tf_{ij}\times(1 + \text{log}\frac{1+N}{1+df_{ij}})$

The sklearn library offers two ways to generate the tf-idf representations of documents. The TfidfTransformer transforms the count values produced by the CountVectorizer to tf-idf weights.

from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
Doc_Term_Matrix = pd.DataFrame(tfidf.toarray(),columns= vectorizer.get_feature_names())
pd.set_option("display.precision", 2)
Doc_Term_Matrix

Another way is to use the TfidfVectorizer which combines both counting and term weighting in a single class as shown below.

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range = (1,2),stop_words='english')
tfidf = vectorizer.fit_transform(corpus)
Doc_Term_Matrix = pd.DataFrame(tfidf.toarray(),columns= vectorizer.get_feature_names())
Doc_Term_Matrix

One thing to note is that the tf-idf weights are normalized so that the resulting document vector is of unit length. You can easily check this by squaring and adding the weight values along each row of the document-term matrix; the resulting sum should be one. This sum represents the squared length of the document vector.

HashingVectorizer

There are two main issues with the CountVectorizer and TdidfVectorizer. First, the vocabulary size can grow so much so as not to fit in the available memory for large corpus. In such a case, we need two passes over data. If we were to distribute the vectorization task to several computers, then we will need to synchronize vocabulary building across computing nodes. The other issue arises in the context of an online text classifier built using the count vectorizer, for example spam classifier which needs to decide whether an incoming email is spam or not. When such a classifier encounters words not in its vocabulary, it ignores them. A spammer can take advantage of this by deliberately misspelling words in its message which when ignored by the spam filter will cause the spam message appear normal. The HashingVectorizer overcomes these limitations.

The HashingVectorizer is based on feature hashing, also known as the hashing trick. Unlike the CountVectorizer where the index assigned to a word in the document vector is determined by the alphabetical order of the word in the vocabulary, the HashingVectorizer maintains no vocabulary and determines the index of a word in an array of fixed size via hashing. Since no vocabulary is maintained, the presence of new or misspelled words doesn’t create any problem. Also the hashing is done on the fly and memory need is diminshed.

You may recall that hashing is a way of converting a key into an address of a table, known as the hash table. As an example, consider the following hash function for a string s of length n:

$\text{hash(s)} = (s[0] + s[1] \cdot p +s[2]\cdot p^2 +\cdots + s[n-1] \cdot p^{n-1})\text{ mod }m$

The quantities p and m are chosen in practice to minimize collision and set the hash table size. Letting p = 31 and m = 1063, both prime numbers, the above hash function will map the word “blue” to location 493 in an array of size 1064 as per the calculations:

$\text{hash(blue)} = (2 + 12 \cdot 31 + 21\cdot 31^2 + 5\cdot 31^3)\text{ mod 1063} = 493\text{, }$

where letter b is replaced by 2, l by 12, and so on based on their positions in the alphabet sequence. Similarly, the word “king” will be hashed to location 114 although “king” comes later than “blue” in alphabetical order.

The HashingVectorizer implemented in sklearn uses the Murmur3 hashing function which returns both positive and negative values. The sign of the hashed value is used as the sign of the value stored in the document-term matrix. By default, the size of the hash table is set to $2^{20}$ ; however, you can specify the size if the corpus is not exceedingly large. The result of applying HashingVectorizer to our example corpus is shown below. You will note that the parameter n_features, which determines the hash table size, is set to 6. This has been done to show collisions since our corpus has 7 distinct words after filtering stop words.

from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(n_features=6,norm = None,stop_words='english')
X = vectorizer.fit_transform(corpus)
Doc_Term_Matrix = pd.DataFrame(X.toarray())
Doc_Term_Matrix

You will note that column headings are integer numbers referring to hash table locations. Also that hash table location indexed 5 shows the presence of collisions. There are three words that are being hashed to this location. These collisions disappear when the hash table size is set to 8 which is more than the vocabulary size of 7. In this case, we get the following document-term matrix.

The HashingVectorizer has a norm parameter that determines whether any normalization of the resulting vectors will be done or not. When norm is set to None as done in the above, the resulting vectors are not normalized and the vector entries, i.e. feature values, are all positive or negative integers. When norm parameter is set to l1, the feature values are normalized so as the sum of all feature values for any document sums to positive/negative 1. In the case of our example corpus, the result of using l1 norm will be as follows.

With norm set to l2, the HashingVectorizer normalizes each document vector to unit length. With this setting, we will get the following document-term matrix for our example corpus.

The HashingVectorizer is not without its drawbacks. First of all, you cannot recover feature words from the hashed values and thus tf-idf weighting cannot be applied. However, the inverse-document frequency part of the tf-idf weighting can be still applied to the resulting hashed vectors, if needed. The second issue is that of collision. To avoid collisions, hash table size should be selected carefully. For very large corpora, the hash table size of $2^{18}$ or more seems to give good performance. While this size might appear large, some comparative numbers illuminate the advantage of feature hashing. For example, an email classifier with hash table of size 4 million locations has been shown to perform well on a well-known spam filtering dataset having 40 million unique words extracted from 3.2 million emails. That is a ten times reduction in the document vectors size.

To summarize different vectorizers, the TfidfVectorizer appears a good choice and possibly the most popular choice for working with a static corpus or even with a slowing changing corpus provided periodic updating of the vocabulary and the classification model is not problematic. On the other hand, the HashingVectorizer is the best choice when working with a dynamic corpus or in an online setting.

Principal Component Analysis Explained with Examples

(Originally published on August 21, 2018)

Any machine learning model building task begins with a collection of data vectors wherein each vector consists of a fixed number of components. These components represent the measurements, known as attributes or features, deemed useful for the given machine learning task at hand. The number of components, i.e. the size of the vector, is termed as the dimensionality of the feature space. When the number of features is large, we are often interested in reducing their number to limit the number of training examples needed to strike a proper balance with the number of model parameters. One way to reduce the number of features is to look for a subset of original features via some suitable search technique. Another way to reduce the number of features or dimensionality is to map or transform the original features in to another feature space of smaller dimensionality. The Principal Component Analysis (PCA) is an example of this feature transformation approach where the new features are constructed by applying a linear transformation on the original set of features. The use of PCA does not require knowledge of the class labels associated with each data vector. Thus, PCA is characterized as a linear, unsupervised technique for dimensionality reduction.

Basic Idea Behind PCA

The basic idea behind PCA is to exploit the correlations between the original features. To understand this, lets look at the following two plots showing how a pair of variables vary together. In the left plot, there is no relationship in how X-Y values are varying; the values seem to be varying randomly. On the other hand, the variations in the right side plot exhibits a pattern; the Y values are moving up in a linear fashion. In terms of correlation, we say that the values in the left plot show no correlation while the values in the right plot show good correlation. It is not hard to notice that given a X-value from the right plot, we can reasonably guess the Y-value; however this cannot be done for X-values in the left graph. This means that we can represent the data in the right plot with a good approximation lying along a line, that is we can reduce the original two-dimensional data in one dimensions. Thus achieving dimensionality reduction. Of course, such a reduction is not possible for data in the left plot where there is no correlation in X-Y pairs of values.

PCA Steps

Now that we know the basic idea behind the PCA, let's look at the steps needed to perform PCA. These are:

We start with N d-dimensionaldata vectors, $\boldsymbol{x}_i, i= 1, \cdots,N$, and find the eigenvalues and eigenvectors of the sample covariance matrix of size d x d using the given data
We select the top k eigenvalues, k ≤ d, and use the corresponding eigenvectors to define the linear transformation matrix A of size k x d for transforming original features into the new space.
Obtain the transformed vectors, $\boldsymbol{y}_i, i= 1, \cdots,N$, using the following relationship. Note the transformation involves first shifting the origin of the original feature space using the mean of the input vectors as shown below, and then applying the transformation.

$\boldsymbol{y}_i = \bf{A}(\bf{x}_i - \bf{m}_x)$

The transformed vectors are the ones we then use for visualization and building our predictive model. We can also recover the original data vectors with certain error by using the following relationship

$\boldsymbol{\hat x}_i = \boldsymbol{A}^t\boldsymbol{y}_i + \boldsymbol{m}_x$

The mean square error (mse) between the original and reconstructed vectors is the sum of the eigenvalues whose corresponding eigenvectors are not used in the transformation matrix A.

$ e_{mse} = \sum\limits_{j=k+1}\limits^d \lambda_j$

Another way to look at how good the PCA is doing is by calculating the percentage variability, P, captured by the eigenvectors corresponding to top k eigenvalues. This is expressed by the following formula

$ P = \frac{\sum\limits_{j=1}^k \lambda_j}{\sum_{j=1}^d \lambda_j}$

A Simple PCA Example

Let's look at PCA computation in python using 10 vectors in three dimensions. The PCA calculations will be done following the steps given above. Lets first describe the input vectors, calculate the mean vector and the covariance matrix.

Next, we get the eigenvalues and eigenvectors. We are going to reduce the data to two dimensions. So we form the transformation matrix A using the eigenvectors of top two eigenvalues.

With the calculated A matrix, we transform the input vectors to obtain vectors in two dimensions to complete the PCA operation.

Looking at the calculated mean square error, we find that it is not equal to the smallest eigenvalue (0.74992815) as expected. So what is the catch here? It turns out that the formula used in calculating the covariance matrix assums the number of examples, N, to be large. In our case, the number of examples is rather small, only 10. Thus, if we multiply the mse value by N/(N-1), known as the small sample correction, we will get the result identical to the smallest eigenvalue. As N becomes large, the ratio N/(N-1) approaches unity and no such correction is required.

The above PCA computation was deliberately done through a series of steps. In practice, the PCA can be easily done using the scikit-learn implementation as shown below.

Before wrapping up, let me summarize a few takeaways about PCA.

You should not expect PCA to provide much reduction in dimensionality if the original features have little correlation with each other.
It is often a good practice to perform data normalization prior to applying PCA. The normalization converts each feature to have a zero mean and unit variance. Without normalization, the features showing large variance tend to dominate the result. Such large variances could also be caused by the scales used for measuring different features. This can be easily done by using sklearn.preprocessing.StandardScalar class.
Instead of performing PCA using the covariance matrix, we can also use the correlation matrix. The correlation matrix has a built-in normalization of features and thus the data normalization is not needed. Sometimes, the correlation matrix is referred to as the standardized covariance matrix.
Eigenvalues and eigenvectors are typically calculated by the singular value decomposion (SVD) method of matrix factorization. Thus, PCA and SVD are often viewed the same. But you should remember that the starting point for PCA is a collection of data vectors that are needed to compute sample covariance/correlation matrices to perform eigenvector decomposition which is often done by SVD.

Gradient Boosted Regression and Classification Trees

A long while ago I had posted about how a "A Bunch of Weak Classifiers Working Together Can Outperform a Single Strong Classifier." Such a bunch of classifiers/regressors is called an ensemble. As mentioned in that post, bagging and boosting are two techniques that are used to create an ensemble of classifiers/regressors. In this post, I will explain how to build an ensemble of decision trees using gradient boosting. Before going into the details, however, let us first understand boosting, gradient boosting, and why the decision tree classifiers and regression trees are the most suitable candidates for creating an ensemble.

The individual units in an ensemble are known as weak learners. Such learners are brittle in nature, i.e. a small change in the training data can have a major impact on their performance. This brittle nature helps to ensure that the weak learners are not well-correlated with each other. The choice of the tree model for weak learners is popular because tree models exhibit decent brittleness and such models are easily trained without much computation effort.

Boosting is one technique to combine weak learners to obtain a single strong learner or model. Bagging is another such technique. In bagging, all weak learners learn in parallel and independent of other learners. Every learner in bagging looks at only a part of the training data or a subset of features or both. In boosting, the weak learners are build sequentially using the full training data and the results from the preceding weak learners. Thus, boosting builds an additive model and the output of the ensemble is taken as the weighted sum of the predictions of the weak learners.

In the original boosting algorithm AdaBoost, the successive weak learners try to improve the accuracy by giving more importance to those training examples that are misclassified by the prior weak learners. In gradient boosting, the results from the preceding weak learners are compared with the desired output to obtain a measure of the error which the next weak learner tries to minimize. Thus, the focus in gradient boost is on the residuals, i.e. the difference between the desired output and the actual output, rather than on those specific examples that were misclassified by the earlier learners. The term gradient is used because the error minimization is carried out by using the gradient of the error function. The sequence of successive trees are of identical depth from 1 (stump) to 4.

A Walk Through Example to Build a Gradient Boosted Regression Tree

Let us work through an example to clearly understand how gradient boosted trees are build. We will take a regression example because building a gradient boosted regression tree is easier to understand than a gradient boosted classification tree. The example consists of two predictors, fertilizer input (x0) and insecticide input (x1), and the output variable is the crop yield (y). The goal is to build a gradient boosted regression tree to predict the crop yield given the input numbers for x0 and x1.

This image has an empty alt attribute; its file name is screen-shot-2022-02-15-at-1.26.17-pm.png

We will build a tree that minimizes the squared error (mse) between the actual output and the predicted output. We begin by building a base learner that yields the same prediction irrespective of the values of the predictor variables. This base prediction, let us denote it as y_hat_0, equals average of the output variable y, y_bar. Next, we calculate the residuals, the difference between the actual output y and the predicted output to obtain the residuals as shown below.

This image has an empty alt attribute; its file name is screen-shot-2022-02-15-at-2.01.27-pm.png

The next step is to adjust the predictions by relating the residuals with x0 and x1. We do this by fitting a regression tree to the residuals. We will use a tree of depth 1. To build the tree, we find the feature and the cut-off or the threshold value that best partitions the training data to minimize the mean squared error (mse). For the root node, such a combination consists of predictor x1 and the cut-off value of 11.5. The resulting tree is shown below. This tree was build using the DecisionTreeRegressor of the Sklearn library. Let's look at the meanings of different entries shown in the tree diagram. The mse value of 181.889 in the internal node of the tree is nothing but the error value if the residuals_0 column is approximated by the column average which is being shown by "value = 0." The number of examples associated with a node in the tree is given by "samples."

This image has an empty alt attribute; its file name is image-3.png

The above tree shows residual approximation by -11.667 for those training examples whose x1 value is less than or equal to 11.5 and the approximation by 11.667 when x1 is greater than 11.5. Assuming a learning rate of 0.75, we combine the above tree with the base tree (stump) to obtain updated predictions and the next set of residuals as shown below. To understand how the updated predictions were calculated, consider the predictor variables x0=6 and x1=4. The above tree tells us that for x1=4, the residual should be approximated by -11.667. The previous prediction for this particular x0,x1 combination is 57.667. Thus, the new prediction in this case becomes 57.667+0.75*(-11.667) which equals 48.916.

This image has an empty alt attribute; its file name is screen-shot-2022-02-15-at-5.36.20-pm.png

With the new set of residuals, we again build a tree to relate x0, x1 with residuals_1. This tree is shown below.

This image has an empty alt attribute; its file name is image-4.png

We update the predictions. For the first training example, the updating leads to the prediction 48.916+0.75*(-2.717) which equals 46.879. These updated predictions, y_hat2, and the new residuals, residuals_2 are shown below.

This image has an empty alt attribute; its file name is screen-shot-2022-02-15-at-8.37.10-pm.png

The tree building process continues in this fashion to generate either the specified number of weak learners or to the acceptable level of error.

Now that we know how gradient boosted trees are created, let us finish the above example using the GradientBoostingRegressor from the Sklearn library as shown below. The number of learners has been specified as 10 and the depth is set to 1.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Create data for illustration
X = np.array([[6,4],[12,5], [16,9],[22,14],[24,20],[32,24]])
y = np.array([40, 46, 52, 60, 68,80])
regressor = GradientBoostingRegressor(
    max_depth=1,
    n_estimators=10,
    learning_rate=0.75
)
gbt = regressor.fit(X, y)
gbt.predict(X)

array([40.18852649, 46.37977649, 49.56846063, 62.91676214, 67.36073713,
       79.58573713])

How about Gradient Boosted Classification Trees?

Let us convert the above regression example to a classification example by considering whether the crop yield was profitable or not based on fertilizer and insecticide costs. Thus, the output variable y now has two values, 0 (loss) and 1 (gain) as shown below.

This image has an empty alt attribute; its file name is screen-shot-2022-02-20-at-6.32.41-pm.png

Just like the procedure for building gradient boosted regression trees, we begin with a crude approximation for the output variable. The initial probability of class 1 over all the training examples, equal to 0.667, is taken as this approximation. The gradient boosted classifier is built using regression trees just like it is done while building gradient regression trees. The successive learners try to reduce the residuals, the difference between the y and the predicted probability.

Log Likelihood Loss Function and Log(odds)

The loss function minimized in the gradient boosted classification tree algorithm is the negative log likelihood. For a classification problem with two classes as we have, the negative log likelihood for the i-th training example is expressed by the following formula where p stands for the probability that the example is from class 1.

This image has an empty alt attribute; its file name is screen-shot-2022-02-21-at-4.38.03-pm.png

The loss is summed over all training examples to get the over all loss. The ratio p/(1-p) is called the odds or odds ratio, and often is expressed in terms of log(odds) by taking the natural log of the odds ratio. The probability p of an event and its log(odds) are related through the following equation:

This image has an empty alt attribute; its file name is screen-shot-2022-02-24-at-4.27.43-pm.png

Through some simple algebraic manipulation, we can also express the loss function as

This image has an empty alt attribute; its file name is screen-shot-2022-02-21-at-4.46.00-pm.png

Instead of calculating the gradient of the loss function with respect to p, it turns out that adjusting log(odds) offers a better way to minimize the loss function. Thus, taking the gradient of the loss function with respect to log(odd) results in the following expression:

This image has an empty alt attribute; its file name is screen-shot-2022-02-21-at-7.08.24-pm.png

The first term on the RHS of the equation is the negative of the desired output while the second term corresponds to the predicted probability (see the relationship between p and log(odds) above). The negative of the RHS in the above equation defines the residuals that we try to minimize by building successive trees via regression.

Back to Example

Coming back to our example, since the initial probability attached to every training example is 0.667 (ratio of class 1 labels and class 0 labels), the residuals are:

-0.667,  0.333,  0.333,  0.333,  0.333, -0.667

Let us try to fit a regression tree to the residuals at hand. Again, we will fit a tree of unit depth using the DecisionTreeRegressor from the Sklearn. The resulting tree shown below.

This image has an empty alt attribute; its file name is image-7.png

While in the case of the gradient regression trees, the "value" from the leaf nodes were directly used in the calculations of the updated residuals, we need to map "value" into log(odds) for calculating a new set of residuals in the present case. This mapping is done using the following relationship:

This image has an empty alt attribute; its file name is image-8.png

Let us try to understand the quantities in the RHS of the above expression. We will do so by referring to the right leaf node of the above tree. This leaf node has 5 samples that refer to the index "i" in the above expression. The item "value" in the leaf nodes is the residual for each example in the leaf node. There are five examples associated with the right leaf node; these are all but the first example of our data set. The previous probability is 0.667 for each of these examples. Thus, the mapped_log(odds) for these five examples are 5*0.133/(5*0.667*0.333), which equals 0.599. With 0.75 as the learning rate, the new log(odds) values for these five examples become 0.693(old log(odds)) + 0.75*(0.599) resulting in 1.142 for each of the five examples of the data set for which x0<9. Converting these log(odds) to probabilities, we see that 0.758 is the probability for the last five examples in the right leaf node to be from class 1. Carrying out similar calculations for the left leaf node, we end up with -1.559 as the updated log(odds) which translates to 0.174 as the probability that our first example in the data set is from class 1 or from the positive class.

Subtracting the updated probabilities from column y, we get a new set of residuals as shown below:

[-0.174,  0.242,  0.242,  0.242,  0.242, -0.758]

The regression tree built to fit these residuals is shown below.

This image has an empty alt attribute; its file name is image-9.png

We need to calculate the mapped_log(odds) again. The example associated with the right leaf in this case is the last example of the data set. The mapped log(odds) value for this example comes out as -4.132 following the formula for mapping shown above. The updated log(odds) value is then the 1.142(prior value) + 0.75* (-4.132). Converting the result into the probability, we get 0.123 as the probability that the last example in our data set comes from the positive class. Thus, it is classified as belonging to the negative (label 0) class. Let us now map the left leaf node value to log(odds) using the mapping formula.

= 5*0.159/((4*0.758*(1-0.758)+0.174*(1-0.174)))
= 0.906

This results in 0.906. Out of the five examples associated with the left leaf node, the four examples(all with y = 1) were together in the previous tree and had the updated log(odds) value 1.142 at that instance. The current log(odds) value for these four examples is then 1.142+0.75*0.906, equal to 1.821 which yields a probability value of 0.860. Thus, these four examples are correctly assigned the probability of 0.860 for the positive class. The remaining fifth example in the left node has a prior log(odds) value of -1.559. Thus, its updated log(odds) becomes -1.559+0.75*0.906, equal to -0.879. This gives 0.293 as the probability of this example from the positive class. Since the probability for the negative class is higher, this example is also correctly classified. At this stage, there is no need to add anymore learner because all examples have been correctly classified.

The above example was done to illustrate how the gradient boosted classifier builds its learners. Let us apply the GradientBoostingClassifier from the Sklearn library to check our step by step assembly of the learners explained above. We will print out the probability values to compare against our calculations.

from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(
    criterion='mse', max_depth=1,
    n_estimators=2,
    learning_rate=0.75
)
gbdt = clf.fit(X, y)
list(gbdt.predict_proba(X))

[array([0.70657313, 0.29342687]),
 array([0.13928973, 0.86071027]),
 array([0.13928973, 0.86071027]),
 array([0.13928973, 0.86071027]),
 array([0.13928973, 0.86071027]),
 array([0.87645946, 0.12354054])]

You can see that these probabilities are identical to those calculated earlier. Thus, you now know how to build a gradient boosted classifier.

Before You Leave

Let us summarize the post before you leave. Boosting relies on an ensemble of trees to perform regression and classification. The gradient boosted trees for regression and classification give excellent performance. Using the Sklearn library without paying much attention to the number of learners and tree depth can lead to over-fitting; so care must be taken. One disadvantage common to all ensemble methods is that the simplicity of understanding model that one gets using a single tree is lost. A speedier and more accurate and scalable version of gradient boosted trees is the extreme gradient boosted trees (XGBoost). The performance of XGBoost algorithm is at par with deep learning models and this model is very popular.

Image

Insert an image to make a visual statement.

Settings

Alternative text

Describe the purpose of the image.(opens in a new tab)

Leave empty if decorative.

Aspect ratio

Width

Height

Scale

Resolution

Select the size of the source image.

Looking for other block settings? They've moved to the styles tab.