Evaluation of Boosting Algorithms: XGBoost vs LightGBM

In this article, we’re going to discuss boosting algorithms. Boosting algorithms started with the advent of AdaBoost, and today one of the most powerful of them is XGBoost. XGBoost is an algorithm that every aspiring as well as experienced data scientist has in their arsenal. But what really is XGBoost? Let’s discuss that in more detail.

XGBoost stands for eXtreme Gradient Boosting. “The name XGBoost, though, actually refers to the engineering goal to push the limit of computation resources for boosted tree algorithms, which is the reason why many people use XGBoost” – Tianqi Chen, creator of XGBoost.

XGBoost has been featured in many winning solutions – if not most – and has been dominating machine learning competitions on Kaggle, KDDCup and many other such platforms. XGBoost is an optimized, distributed gradient boosting implementation designed to be highly efficient, flexible and portable. It provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. But even though XGBoost has it all, when given a huge dataset it takes a long time to run.

Enter LightGBM. Microsoft has lately been increasing its development of tools in the analytics and machine learning space, and one such recently released tool is LightGBM. LightGBM is a fast, distributed, high-performance gradient boosting (GBDT, GBRT, GBM) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks. LightGBM grows the tree leaf-wise, splitting the leaf with the best fit, whereas most other boosting implementations grow the tree depth-wise (level-wise). When growing on the same leaf, the leaf-wise algorithm can reduce more loss than the level-wise algorithm and hence may achieve much better accuracy. It is also very fast, hence the word ‘Light’.

Below is a diagrammatic representation by the makers of LightGBM that explains the difference between how LightGBM and XGBoost build trees.

[Figure: leaf-wise tree growth (LightGBM) vs. level-wise tree growth (XGBoost)]

Leaf-wise splits increase model complexity and may lead to overfitting; this can be mitigated by specifying the max_depth parameter, which limits the depth to which splitting will occur.
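For instance, in LightGBM this means constraining both the number of leaves and the depth. The parameter names below are LightGBM’s own; the values are only illustrative, not the ones used later in this article.

```python
import lightgbm as lgb

# max_depth caps how deep leaf-wise growth can go; num_leaves caps the leaf count.
# The values here are only illustrative.
model = lgb.LGBMRegressor(num_leaves=31, max_depth=7, learning_rate=0.1)
```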

The question then arises: which boosting algorithm is better, XGBoost or its challenger LightGBM? To find out, we thought it would be best to implement the two frameworks on a real dataset, so we picked a competition on Kaggle called “Allstate Claims Severity”. A little overview of the data follows. Allstate is a personal insurer in the United States that is continually seeking fresh ideas to improve its claims service for the over 16 million households it protects, and it is currently developing automated methods of predicting the cost, and hence severity, of claims. The data consists of information about individuals who are Allstate customers; the target is to predict the amount of the claim an individual will make. To predict the cost of claims, we’re going to apply the XGBoost and LightGBM algorithms and compare their results to see which works better. The data consists of 132 features and 188,319 observations.

The implementation can be divided into three phases: data pre-processing, data modeling, and evaluating the performance of the models.
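The code screenshots from the original post are not reproduced here; the snippets that follow are minimal Python sketches of each step, so file names, column names such as "loss", and parameter values are assumptions rather than the exact code we ran. To start, load the training data and summarise the target:

```python
import pandas as pd

# Load the Kaggle training data; the target column is named "loss" in this dataset.
train = pd.read_csv("train.csv")

print(train["loss"].describe())           # count, mean, std, min, quartiles, max
print("Skewness:", train["loss"].skew())  # how far the target is from symmetric
```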

As you can see from the summary statistics above, 75% of the claims fall under an amount of $3,864, while the maximum value is $121,012, so there must be outliers, and the variance is very high as a result. Let’s check the distribution of the target variable to get a better understanding.

The target variable appears to be skewed, whereas it should ideally be normally distributed. One of the assumptions we make before applying any regression is that the variables follow a normal distribution, so we now have to transform the target variable so that it is normally distributed.
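Before transforming, we first deal with the extreme values. A sketch of the 99th-percentile cap described below (the article’s exact code is not shown, so treat this as one reasonable way to do it):

```python
# Keep only the rows whose target falls below the 99th percentile.
cap = train["loss"].quantile(0.99)
train = train[train["loss"] < cap].copy()

print("New max:", train["loss"].max())
print("Skewness after capping:", train["loss"].skew())
```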

Skewness in the target variable is 3.79, which means that the target variable is highly skewed; the skew of a normal distribution is 0. Above, we remove the outliers by capping at the 99th percentile, i.e. we keep the 99% of the population that falls below it, since the remaining 1% contains the outliers. After removing the outliers, the maximum value of the target variable is $13,981; previously it was $121,012, and because of that 1% of outliers the whole distribution and behaviour of the target variable was affected.

The distribution looks a bit better after removing the outliers, and the skewness has come down too. But as you can see in the graph above, it still does not fully follow the blue bell curve; the distribution is not fully normal yet. To make the target variable follow a normal distribution, we’re going to apply a transformation to it — in this case, the Box-Cox transformation. The Box-Cox transformation is parameterized by an exponent, lambda (λ): all values of λ are considered and the optimal value for the data is selected, the “optimal value” being the one that results in the best approximation of a normal distribution. The Box-Cox transformation is readily available in the scipy library.
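A sketch of applying the transformation with scipy (assuming the target is strictly positive, as claim amounts are):

```python
from scipy import stats

# stats.boxcox fits the optimal lambda and returns the transformed values.
train["loss_bc"], lmbda = stats.boxcox(train["loss"])

print("Fitted lambda:", lmbda)
print("Skewness after Box-Cox:", pd.Series(train["loss_bc"]).skew())
```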

When we applied the Box-Cox transformation, skewness in the target variable came down to -0.0004, which is approximately 0. The distribution of the target variable now closely follows a normal distribution.

We’ve written a function with which we can reverse the Box-Cox-transformed values and recover the original values, which we need in order to calculate the metrics for the model.
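Our own function is not shown here; one straightforward implementation of the inverse Box-Cox transform looks like this:

```python
import numpy as np

def reverse_boxcox(y, lmbda):
    """Map Box-Cox-transformed values back to the original scale."""
    if lmbda == 0:
        return np.exp(y)
    return np.power(y * lmbda + 1.0, 1.0 / lmbda)
```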

There are 116 categorical variables and 14 continuous variables. The pre-processing for the continuous variables is done and they all follow a roughly normal distribution; the next step is to convert the text variables to numerical ones and drop unnecessary variables from the training set.
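A sketch of that encoding step, assuming the Kaggle dataset’s column naming (cat1–cat116 for the categorical columns and an id column that we drop):

```python
# Convert the 116 text columns to integer codes and drop the id column.
cat_cols = [c for c in train.columns if c.startswith("cat")]
for c in cat_cols:
    train[c] = pd.factorize(train[c], sort=True)[0]

train = train.drop(columns=["id"])
```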

The data pre-processing phase is done, and we have split the train set into two parts: we’re going to train the model on the first part, which contains 80% of the original train set, and test it on the remaining 20%.
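An 80/20 split along those lines might look like this, fitting on the Box-Cox-transformed target (the random seed is an arbitrary choice):

```python
from sklearn.model_selection import train_test_split

X = train.drop(columns=["loss", "loss_bc"])
y = train["loss_bc"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)
```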

The next phase is the modeling phase. We will apply the XGBoost and LightGBM algorithms to this dataset and compare the results. Before we do, we need to convert our data, which is stored in a data frame, into a matrix format (sparse or dense), as XGBoost and LightGBM work only with numeric values.
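Concretely, XGBoost consumes a DMatrix and LightGBM a Dataset, both built from the numeric feature matrix:

```python
import xgboost as xgb
import lightgbm as lgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

lgb_train = lgb.Dataset(X_train, label=y_train)
lgb_valid = lgb.Dataset(X_valid, label=y_valid, reference=lgb_train)
```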

We’ve applied the XGBoost model; the parameters we’ve passed to it control how XGBoost builds its trees, and we arrived at these values after many previous iterations. We’re going to pass similar parameter values to LightGBM so that we can have a fair comparison.
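The exact parameter values we used are not reproduced here, so the settings in this sketch are illustrative placeholders chosen to keep the two models roughly comparable; timing the training runs lets us compare speed later:

```python
import time

# Placeholder parameters, kept as similar as the two APIs allow.
xgb_params = {"objective": "reg:squarederror", "eta": 0.1, "max_depth": 7,
              "subsample": 0.8, "colsample_bytree": 0.8, "eval_metric": "mae"}
lgb_params = {"objective": "regression_l1", "learning_rate": 0.1, "max_depth": 7,
              "num_leaves": 127, "bagging_fraction": 0.8, "bagging_freq": 1,
              "feature_fraction": 0.8, "metric": "mae"}

t0 = time.time()
xgb_model = xgb.train(xgb_params, dtrain, num_boost_round=500,
                      evals=[(dvalid, "valid")], verbose_eval=False)
xgb_seconds = time.time() - t0

t0 = time.time()
lgb_model = lgb.train(lgb_params, lgb_train, num_boost_round=500,
                      valid_sets=[lgb_valid])
lgb_seconds = time.time() - t0
```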

We’ve applied both XGBoost and LightGBM; now it’s time to compare the performance of the algorithms. Since we’ve used them to solve a regression problem, we’re going to compare the Mean Absolute Error of both models as well as their execution times.
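In this sketch, the predictions are mapped back to the original scale with the inverse Box-Cox function from earlier before computing the error, so that the MAE is in dollars:

```python
from sklearn.metrics import mean_absolute_error

preds_xgb = xgb_model.predict(dvalid)
preds_lgb = lgb_model.predict(X_valid)

mae_xgb = mean_absolute_error(reverse_boxcox(y_valid, lmbda),
                              reverse_boxcox(preds_xgb, lmbda))
mae_lgb = mean_absolute_error(reverse_boxcox(y_valid, lmbda),
                              reverse_boxcox(preds_lgb, lmbda))

print(f"XGBoost : MAE={mae_xgb:.2f}, train time={xgb_seconds:.1f}s")
print(f"LightGBM: MAE={mae_lgb:.2f}, train time={lgb_seconds:.1f}s")
```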

The Mean Absolute Error for both models is more or less the same. The standout, of course, is the execution time: LightGBM took only 5 seconds, which is pretty fast — hence the word ‘Light’. XGBoost, on the other hand, took 110 seconds, and this could be the defining factor when choosing which algorithm is more suitable for large datasets. This is just one case, though; we should test further and apply these algorithms to many other problems to see how they perform, because it would be naïve to draw conclusions about which is best after a single evaluation. It’s up to the readers to decide which algorithm they think is best. In my view there’s no such thing as one being universally better than the other: both XGBoost and LightGBM have their pros and cons, and it all comes down to the problem you’re trying to solve and which algorithm suits it better, i.e. which can produce better results for the business. After all, data science is about using data and drawing out insights to make better business decisions that improve and boost the business.


Author:

Anand Mohan Munigoti

(Data Science Team)

