/proc is cool…

Have you ever peeped into the /proc directory on a Linux system?
I believe it is a magical directory that can tell us quite a few things transparently, especially when you are debugging problems related to networking and performance.
I check this directory quite often whenever I'm wearing my Operations hat. 🙂 A while ago I looked into /proc while debugging a Hyperledger Fabric smart-contract deployment and its interactions with the network peers and orderers. So I thought of sharing some basic tips on this; most of the senior folks already know it, but they can refresh their memories here too… 🙂

Let me demonstrate a couple of those things here. First, let's grab the PID of the process called 'peer' on my Hyperledger Fabric Linux host:

root@blk_chain_hlf1:/home/ganga# pgrep peer
23709

And we are gonna look for it under the /proc directory:

root@blk_chain_hlf1:/home/ganga# ls -l /proc/23709/
total 0
dr-xr-xr-x 2 root root 0 Oct 27 22:53 attr
-r--r--r-- 1 root root 0 Oct 27 22:53 cgroup
--w------- 1 root root 0 Oct 27 22:53 clear_refs
-r--r--r-- 1 root root 0 Oct 27 01:05 cmdline
-rw-r--r-- 1 root root 0 Oct 27 22:53 comm
-rw-r--r-- 1 root root 0 Oct 27 22:53 coredump_filter
-r--r--r-- 1 root root 0 Oct 27 22:53 cpuset
lrwxrwxrwx 1 root root 0 Oct 27 22:53 cwd -> /opt/gopath/src/github.com/hyperledger/fabric/peer
-r-------- 1 root root 0 Oct 27 22:53 environ
lrwxrwxrwx 1 root root 0 Oct 27 01:05 exe -> /usr/local/bin/peer
dr-x------ 2 root root 0 Oct 26 15:31 fd
dr-x------ 2 root root 0 Oct 27 22:53 fdinfo
...
-r--r--r-- 1 root root 0 Oct 27 22:53 wchan

Let's start by looking at the file descriptors. File descriptors are handles to the files (and sockets and pipes) that the process currently has open:

root@blk_chain_hlf1:/home/ganga# ls -l /proc/23709/fd
total 0
lr-x------ 1 root root 64 Oct 26 15:31 0 -> pipe:[4502095]
l-wx------ 1 root root 64 Oct 26 15:31 1 -> pipe:[4502096]
l-wx------ 1 root root 64 Oct 27 22:56 10 -> /var/hyperledger/production/ledgersData/chains/index/000001.log
lrwx------ 1 root root 64 Oct 27 22:56 11 -> /var/hyperledger/production/ledgersData/stateLeveldb/LOCK
l-wx------ 1 root root 64 Oct 27 22:56 16 -> /var/hyperledger/production/ledgersData/historyLeveldb/LOG
l-wx------ 1 root root 64 Oct 27 22:56 17 -> /var/hyperledger/production/ledgersData/historyLeveldb/MANIFEST-000000
l-wx------ 1 root root 64 Oct 27 22:56 18 -> /var/hyperledger/production/ledgersData/historyLeveldb/000001.log
lrwx------ 1 root root 64 Oct 27 22:56 23 -> /var/hyperledger/production/ledgersData/chains/chains/myc/blockfile_000000
lrwx------ 1 root root 64 Oct 27 22:56 24 -> socket:[4503236]
lrwx------ 1 root root 64 Oct 27 22:56 7 -> /var/hyperledger/production/ledgersData/chains/index/LOCK
l-wx------ 1 root root 64 Oct 27 22:56 8 -> /var/hyperledger/production/ledgersData/chains/index/LOG
l-wx------ 1 root root 64 Oct 27 22:56 9 -> /var/hyperledger/production/ledgersData/chains/index/MANIFEST-000000

Another thing we can do is look at exe, which tells us which executable this process is running:

root@blk_chain_hlf1:/home/ganga# ls -l /proc/23709/exe
lrwxrwxrwx 1 root root 0 Oct 27 01:05 /proc/23709/exe -> /usr/local/bin/peer

Next, let's look at cmdline; we can cat it to see the command line this process was started with. Note that the arguments in cmdline are separated by NUL bytes, so they may appear glued together on screen.

root@blk_chain_hlf1:/home/ganga# ls -l /proc/23709/cmdline
-r–r–r– 1 root root 0 Oct 27 01:05 /proc/23709/cmdline
root@blk_chain_hlf1:/home/ganga# cat /proc/23709/cmdline

One more thing we can see is its environment variables:

root@blk_chain_hlf1:/home/ganga# cat /proc/23709/environ

The content of environ gets dumped out as one blob like the above, but no worries: we can make it more readable by turning each NUL separator into a newline –
root@blk_chain_hlf1:/home/ganga# cat /proc/23709/environ | tr '\0' '\n'
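The same NUL-splitting trick can be done programmatically. Here is a minimal Python sketch (the helper name is mine, and the pid 23709 is the one from this session; both /proc/&lt;pid&gt;/environ and /proc/&lt;pid&gt;/cmdline are NUL-separated):

```python
def parse_nul_separated(raw: bytes) -> list:
    """Split a NUL-separated /proc blob (environ or cmdline) into a
    list of strings, dropping the empty trailing entry."""
    return [field.decode() for field in raw.split(b"\0") if field]

# Reading a live process would look like this (needs permission):
# with open("/proc/23709/environ", "rb") as f:
#     for entry in parse_nul_separated(f.read()):
#         print(entry)

# Example with a captured blob:
blob = b"PATH=/usr/local/bin\0HOME=/root\0"
print(parse_nul_separated(blob))  # ['PATH=/usr/local/bin', 'HOME=/root']
```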

Really cool……

Evaluation of Boosting Algorithms: XGBoost vs LightGBM

In this article, we're going to discuss boosting algorithms. Boosting algorithms started with the advent of AdaBoost, and today's most powerful boosting algorithm is XGBoost. Today, XGBoost is an algorithm that every aspiring as well as experienced data scientist has in their arsenal. But what really is XGBoost? Let's discuss.

XGBoost stands for eXtreme Gradient Boosting. “The name XGBoost, though, actually refers to the engineering goal to push the limit of computation resources for boosted tree algorithms, which is the reason why many people use XGBoost” – Tianqi Chen, creator of XGBoost.

XGBoost has been featured in many winning solutions – if not most – and has been dominating machine-learning competitions on Kaggle, KDD Cup and many other such platforms. XGBoost is an optimized distributed gradient-boosting implementation, designed to be highly efficient, flexible and portable. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data-science problems in a fast and accurate way. But even though XGBoost has it all, on very large datasets it can take a long time to run.

Enter LightGBM. Microsoft has lately been expanding its tooling in the analytics and machine-learning space, and one recently released tool is LightGBM. LightGBM is a fast, distributed, high-performance gradient-boosting (GBDT, GBRT, GBM) framework based on decision-tree algorithms, used for ranking, classification and many other machine-learning tasks. LightGBM grows the tree leaf-wise, always splitting the leaf with the best fit, whereas most other boosting implementations grow the tree depth-wise (level-wise). Growing the same number of leaves, the leaf-wise algorithm can reduce more loss than the level-wise algorithm and hence may reach better accuracy; it is also very fast, hence the name 'Light'.

Below is a diagrammatic representation by the makers of LightGBM explaining the difference between how LightGBM and XGBoost build trees.


Leaf-wise splits increase model complexity and may lead to overfitting; this can be mitigated with the max_depth parameter, which limits how deep the splitting can go.
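As a quick sanity check on that interplay (num_leaves and max_depth are LightGBM's real parameter names; the helper function itself is just an illustration): a binary tree limited to depth d can hold at most 2**d leaves, so when max_depth is constrained, num_leaves is usually set below that bound.

```python
def max_possible_leaves(max_depth: int) -> int:
    """A binary tree limited to `max_depth` levels of splits
    can hold at most 2**max_depth leaves."""
    return 2 ** max_depth

# With max_depth=7, anything above 128 leaves is unreachable, so a
# leaf-wise learner would typically be given num_leaves < 128 here.
print(max_possible_leaves(7))  # 128
```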

The question then arises: which boosting algorithm is better, XGBoost or its challenger LightGBM? To find out, we thought it best to implement both frameworks on a real dataset, so we picked a Kaggle competition called "Allstate Claims Severity". A little overview of the data: Allstate is a personal insurer in the United States, continually seeking fresh ideas to improve its claims service for the over 16 million households it protects, and it is currently developing automated methods of predicting the cost, and hence severity, of claims. The data contains information about individuals who are Allstate customers; the target is to predict the amount of the claim an individual will make. So, to predict the cost of claims, we're going to use the XGBoost and LightGBM algorithms and compare their results to see which works better. The data consists of 132 features and 188,319 observations.

The implementation can be divided into three phases: data pre-processing, modeling, and evaluating the performance of the model.

As you can see from the above screens, 75% of the data falls under an amount of $3,864. The maximum value is $121,012, so there must be outliers, and as a result the variance is very high too. Let's check the distribution of the target variable to get a better understanding.

The target variable appears to be skewed, whereas it should be normally distributed. One of the assumptions we make before applying any regression is that the variables follow a normal distribution; so we now have to transform the target variable to make it approximately normal.

Skewness of the target variable is 3.79, which means it is highly skewed; the skew of a normal distribution is 0. Above, we remove the outliers by capping the data at the 99th percentile, i.e. we keep the values that fall under 99% of the population, and the remaining 1% contains the outliers. After the capping, the max value of the target variable is 13,981; previously it was 121,012, and because of that 1% of outliers, the whole distribution and behavior of the target variable was affected.
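The 99%-capping step can be sketched in plain Python (a stand-in for the pandas/numpy quantile call presumably used in the notebook; the function name is mine):

```python
import math

def cap_at_quantile(values, q=0.99):
    """Cap every value at the q-th quantile (nearest-rank method),
    i.e. keep the bottom 99% intact and clip the extreme top 1%."""
    ordered = sorted(values)
    idx = max(0, math.ceil(q * len(ordered)) - 1)  # nearest-rank index
    cap = ordered[idx]
    return [min(v, cap) for v in values]

losses = list(range(100)) + [121012]   # one extreme claim amount
capped = cap_at_quantile(losses)
print(max(losses), max(capped))  # 121012 99
```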

The distribution looks a bit better now after removing the outliers, and the skewness has come down too. But as you can see in the above graph, it still does not follow the blue bell curve fully; the distribution is not fully normal yet. To make the target variable follow a normal distribution, we apply a Box-Cox transformation to it. The Box-Cox transformation is parameterized by an exponent, lambda (λ); a range of λ values is considered and the optimal value for your data is selected, where the "optimal value" is the one that results in the best approximation of a normal distribution curve. The Box-Cox transformation is readily available in the scipy library.

After applying the Box-Cox transformation, the skewness of the target variable has come down to -0.0004, which is approximately 0. The target variable now follows a normal distribution.

We've written a function that reverses the Box-Cox-transformed values back to the original scale, which we need in order to compute the metrics for the model.
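In scipy this pair is `scipy.stats.boxcox` and `scipy.special.inv_boxcox`; a stdlib-only sketch of the transform and its inverse makes the round trip visible:

```python
import math

def boxcox(x: float, lam: float) -> float:
    """Box-Cox transform: log(x) when lambda == 0, else (x**lam - 1)/lam."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

def inv_boxcox(y: float, lam: float) -> float:
    """Reverse the transform to recover the original value."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1) ** (1 / lam)

# Round trip with an arbitrary lambda (0.35 is just an example value):
lam = 0.35
value = 3864.0
assert abs(inv_boxcox(boxcox(value, lam), lam) - value) < 1e-6
```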

There are 116 categorical variables and 14 continuous variables. The pre-processing for the continuous variables is done and they all follow a normal distribution; the next step is to convert the text variables to numerical codes and drop unnecessary variables from the training set.
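Converting the 116 text columns to integer codes is typically done with pandas.factorize or sklearn's LabelEncoder; a minimal stdlib equivalent (function name mine) shows the idea:

```python
def factorize(column):
    """Map category labels to integer codes in first-seen order,
    returning both the codes and the label -> code mapping."""
    mapping = {}
    codes = []
    for label in column:
        if label not in mapping:
            mapping[label] = len(mapping)
        codes.append(mapping[label])
    return codes, mapping

codes, mapping = factorize(["A", "B", "A", "D"])
print(codes, mapping)  # [0, 1, 0, 2] {'A': 0, 'B': 1, 'D': 2}
```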

The data pre-processing phase is done, and we have split the train set into two parts: we train the model on the first part, which contains 80% of the original train set, and test it on the remaining 20%.
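The 80/20 split can be sketched with the stdlib (sklearn's train_test_split is the usual tool; the seed and names here are mine):

```python
import random

def train_test_split(rows, test_frac=0.2, seed=42):
    """Shuffle row indices, then carve off the last test_frac as a hold-out."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(len(rows) * (1 - test_frac))
    return [rows[i] for i in idx[:cut]], [rows[i] for i in idx[cut:]]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 80 20
```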

The next phase is modeling. We will apply the XGBoost and LightGBM algorithms to this data set and compare the results. Before we do, we need to convert our data from a data frame into a matrix format, viz. a sparse or dense matrix, as XGBoost and LightGBM work only with numeric vectors.

We've applied the XGBoost model; the parameters we passed control how XGBoost builds its trees, and we arrived at these values after many earlier iterations. We pass similar parameter values to LightGBM so that we get a fair evaluation/comparison.
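For reference, a hedged sketch of what "similar parameters" can look like. The parameter names are the real XGBoost/LightGBM ones, but the values below are illustrative placeholders, not the exact values used in this experiment:

```python
# Illustrative tree-building parameters; the two dicts are kept
# deliberately aligned so the comparison stays fair.
xgb_params = {
    "objective": "reg:linear",   # regression objective (older XGBoost name)
    "eta": 0.1,                  # learning rate
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
}
lgb_params = {
    "objective": "regression",
    "learning_rate": 0.1,        # same role as eta above
    "max_depth": 7,
    "num_leaves": 100,           # kept below 2**max_depth = 128
    "bagging_fraction": 0.8,     # same role as subsample
    "feature_fraction": 0.8,     # same role as colsample_bytree
}
print(xgb_params["eta"] == lgb_params["learning_rate"])  # True
```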

We've applied both XGBoost and LightGBM; now it's time to compare the performance of the two algorithms. Since we used them to solve a regression problem, we compare the mean absolute error (MAE) of both models as well as their execution times.
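MAE itself is simple enough to write out (sklearn.metrics.mean_absolute_error is the usual choice in practice):

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between predictions and ground truth."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Errors of 10, 10 and 30 average out to about 16.67:
print(mean_absolute_error([100, 200, 300], [110, 190, 330]))
```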

The mean absolute error of the two models is more or less the same. The standout, of course, is the execution time: LightGBM took only 5 seconds – it's pretty fast, hence the name 'Light' – while XGBoost took 110 seconds. That could be the deciding factor when choosing which algorithm is more suitable for large datasets. That said, this is just one case; we should test more, apply these algorithms to many other problems and see how they perform, and it would be naïve to draw conclusions about which is best after a single evaluation. It's up to the readers to decide which algorithm they think is best. I'd say there is no such thing as one being strictly better than the other: both XGBoost and LightGBM have their pros and cons, and it all comes down to the problem you're trying to solve and which algorithm suits it better, i.e. which one produces better results for the business. After all, data science is about using data and drawing insights from it to improve the business and help make better business decisions.



Anand Mohan Munigoti

(Data Science Team)