Getting git experiences

I had to fetch a file's contents dynamically from an AWS CodeCommit repo and supply those contents as a list to the Groovy shell in a Jenkins job. The file, which lives in the CodeCommit repo, is tiny and simply holds one word per line. Those items correspond to the tags (modules) defined in the Robot file used for automation.

I could place the file on the Jenkins server manually and read it from that location in the Jenkins Groovy shell. But the file may be modified, and new entries may be added over time as new modules come up for automation. The contents of this file are shown as checkbox items when you hit the Jenkins job's 'Build with Parameters' button, and the user selects one or more of them as inputs to the job. So fetching the file from the CodeCommit repo dynamically whenever the Jenkins job runs is unavoidable; managing the file manually would be a laborious affair.

Well, that's the background story. So the first step is to get, download, or check out that module-list file from the CodeCommit repo without cloning the entire repo. I can't use the SCM step (which clones the whole repo) for this automation job because the checkbox items have to be shown before the repo is cloned. Hope you got it.

There are a couple of ways to check out a single file from a repo: git sparse checkout, git archive, and a few others. You can even use wget on the raw content URL (for example, https://raw.githubusercontent.com/ambatigan/list-items/master/items_list.txt) if it is a GitHub repo. But this is an AWS CodeCommit repo, and CodeCommit does not expose raw content URLs the way GitHub does. AWS does provide API calls to fetch the blob of a file's content, but those calls must be authenticated first.

If you have svn installed on your server, you can also try 'svn export' against the GitHub URL (for example, svn export https://github.com/ambatigan/list-items.git/trunk/items_list.txt). But for the AWS CodeCommit repo this approach needs the credentials to be supplied in the command itself.

I discovered that the AWS CLI introduced a get-file subcommand for aws codecommit (available only in the latest CLI version at the time, 1.16.x). The response of this get-file command returns the fileContent base64-encoded, and you can use the default base64 utility on the Linux server to decode it back to the original content.

For example, the response to the get-file command (ex: aws codecommit get-file --repository-name Testing-Automation --file-path /Jenkins/Dev/Resources/Input_data/tag_names.txt) is as follows:

[ec2-user@ip-60-0-1-94 ~]$ aws codecommit get-file --repository-name Testing-Automation --file-path /Jenkins/Dev/Resources/Input_data/tag_names.txt
{
    "filePath": "Jenkins/Dev/Resources/Input_data/tag_names.txt",
    "blobId": "a6c7ac16cf059e739c3ad50efc2375d95feea03c",
    "commitId": "e2a51016ce9b3f281504257124e6b6d72d3e338e",
    "fileSize": 26,
    "fileContent": "QWxsCkxvZ2luClRyZW5kcwpTaXRlX21hcAo=",
    "fileMode": "NORMAL"
}

And we can decode it like this:

[ec2-user@ip-60-0-1-94 ~]$ echo QWxsCkxvZ2luClRyZW5kcwpTaXRlX21hcAo= | base64 -d
All
Login
Trends
Site_map
[ec2-user@ip-60-0-1-94 ~]$

Summary: echo `aws codecommit get-file --repository-name Testing-Automation --file-path /Jenkins/Dev/Resources/Input_data/tag_names.txt | jq -r '.fileContent'` | base64 -d
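If you prefer doing the same retrieval from a script rather than a shell one-liner, here is a rough boto3 sketch of the same call. Treat it as an illustration, assuming the host already has CodeCommit read permissions; boto3 hands the fileContent blob back as raw bytes, so the separate base64 step the CLI needs shouldn't be required.

import boto3

# Assumes AWS credentials/role with CodeCommit read access are already in place.
client = boto3.client("codecommit")

response = client.get_file(
    repositoryName="Testing-Automation",
    filePath="/Jenkins/Dev/Resources/Input_data/tag_names.txt",
)

# boto3 returns the fileContent blob as raw bytes, so a plain decode to text
# is enough; each line is one module/tag name.
tag_names = response["fileContent"].decode("utf-8").splitlines()
print(tag_names)   # expected here: ['All', 'Login', 'Trends', 'Site_map']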

Refer to https://github.com/ambatigan/list-items to see how I implemented this core concept in the Groovy shell of the Jenkins job.

PMI ACP

I'm a PMI Agile Certified Practitioner (ACP) now! I'm pleased to share that I passed this tough exam with an 'Above Target' rating on my second attempt, on 16th July 2018.

Yes, I failed the first attempt, a couple of months earlier. That happened because I underestimated the PMI ACP exam. Though I had spent a reasonable number of hours preparing, I went in with the wrong mindset and failed. I had the impression that this exam was not that hard, at least compared to the other exams I had taken, and that instilled a sense of complacency and overconfidence in me. I thought that with an IT background and 11 years of managing software development projects, I should be able to sail through. So, unsurprisingly, I failed the PMI ACP exam on the first attempt. But that was not the end of the story. I drew a fishbone diagram for a root cause analysis of why I failed, used it to identify the factors behind the failure, and started the journey again.

And today the big day arrived and I cracked the PMI ACP! Passing any exam brings satisfaction and relief, but even more so when you have invested serious money and time into it. 🙂

Thanks to the 400+ page book, PMI-ACP Exam Prep by Mike Griffiths, without which I couldn't have accomplished this success.

disk space

Have you ever run into a situation where your root partition shows as full (100%), but nothing seems to need cleaning up? In other words, the df output reports 100% (or some huge number) for a partition, but the du output shows nowhere near that much usage?

I hit exactly this on one of our production hosts: the root partition was at 100%, but running du on all the folders under that partition accounted for nowhere near that much space.

That led me to run the following lsof command to investigate:

undisclosed-host:/ # lsof|grep -i delete
….
….
lrthdf5   55870  root    1w      REG                8,3 13876053553    3278175 /var/log/lrthdf5/onl_dev_bin_mc3_dev.log (deleted)
lrthdf5   56098  root    2w      REG                8,3 13881587249    3278181 /var/log/lrthdf5/onl_replay_mc3replay.log (deleted)

From the lsof output above, we can see that the processes with PIDs 55870 and 56098 still hold the files /var/log/lrthdf5/onl_dev_bin_mc3_dev.log and /var/log/lrthdf5/onl_replay_mc3replay.log open on the file descriptors shown (1 and 2), even though the files have been deleted.

It looked like somebody, as part of a maintenance process, had deleted those log files while they were still being written by the running processes.

Once those files were identified, I freed the space they occupied by shutting down the processes in question.

Before:

undisclosed-host:/ # df -Th
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda3 ext3 63G 60G 384M 100% /
tmpfs tmpfs 16G 0 16G 0% /sys/fs/cgroup
udev tmpfs 16G 212K 16G 1% /dev
tmpfs tmpfs 16G 0 16G 0% /dev/shm
/dev/sdb1 xfs 5.0T 3.7T 1.3T 74% /mnt/sdb1

After shutting down the process, the used space came down from 100% to 78%:

undisclosed-host:/ # stpcap onl_replay mc3replay
Fri Feb 9 00:18:09 PST 2018: Stopping capture processes. Please wait, it may take a while…

undisclosed-host:/ # df -Th
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda3 ext3 63G 47G 14G 78% /
tmpfs tmpfs 16G 0 16G 0% /sys/fs/cgroup
udev tmpfs 16G 212K 16G 1% /dev
tmpfs tmpfs 16G 0 16G 0% /dev/shm
/dev/sdb1 xfs 5.0T 3.7T 1.3T 74% /mnt/sdb1

Alternatively, it is possible to force the system to de-allocate the space consumed by an in-use file by truncating it through the proc filesystem, without stopping the process. That is a more surgical approach; it wasn't needed in my case, though.
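For the record, here is a minimal sketch of that proc-filesystem trick, using PID 55870 and fd 1 from the lsof output above as an example (a shell redirect such as > /proc/55870/fd/1 does the same job). Truncating the /proc fd entry releases the blocks of the deleted file while the process keeps running, so only use it when the process can tolerate its log being emptied.

import os

# /proc/<pid>/fd/<fd> resolves to the deleted-but-open file; truncating it to
# zero bytes de-allocates the space without stopping the process.
pid, fd = 55870, 1
os.truncate(f"/proc/{pid}/fd/{fd}", 0)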

/proc is cool …….

Have you ever peeked into the /proc directory on a Linux system?
I believe it is a magical directory that lays a lot of things bare, especially if you are debugging problems related to networking and performance.
I check this directory quite often whenever I'm in Operations attire. 🙂 A while ago I looked into /proc while debugging a Hyperledger Fabric smart contract deployment and its interactions with the network peers and orderers. So I thought of sharing some basic tips on this; most of the senior folks already know about it, but they can refresh their memories here too. 🙂

Let me demonstrate a few of those things here. First, let me grab the PID of a process called 'peer' on my Hyperledger Fabric Linux host.

root@blk_chain_hlf1:/home/ganga# pgrep peer
23709

And we are going to look it up under the /proc directory:

root@blk_chain_hlf1:/home/ganga# ls -l /proc/23709/
total 0
dr-xr-xr-x 2 root root 0 Oct 27 22:53 attr
….
-r--r--r-- 1 root root 0 Oct 27 22:53 cgroup
--w------- 1 root root 0 Oct 27 22:53 clear_refs
-r--r--r-- 1 root root 0 Oct 27 01:05 cmdline
-rw-r--r-- 1 root root 0 Oct 27 22:53 comm
-rw-r--r-- 1 root root 0 Oct 27 22:53 coredump_filter
-r--r--r-- 1 root root 0 Oct 27 22:53 cpuset
lrwxrwxrwx 1 root root 0 Oct 27 22:53 cwd -> /opt/gopath/src/github.com/hyperledger/fabric/peer
-r-------- 1 root root 0 Oct 27 22:53 environ
lrwxrwxrwx 1 root root 0 Oct 27 01:05 exe -> /usr/local/bin/peer
dr-x------ 2 root root 0 Oct 26 15:31 fd
dr-x------ 2 root root 0 Oct 27 22:53 fdinfo
......
......
-r--r--r-- 1 root root 0 Oct 27 22:53 wchan
root@blk_chain_hlf1:/home/ganga#

Let's start by looking at the file descriptors. A process's file descriptors are handles to the files (and pipes and sockets) it currently has open; under /proc they show up as symlinks pointing at those targets:

root@blk_chain_hlf1:/home/ganga# ls -l /proc/23709/fd
total 0
lr-x------ 1 root root 64 Oct 26 15:31 0 -> pipe:[4502095]
l-wx------ 1 root root 64 Oct 26 15:31 1 -> pipe:[4502096]
l-wx------ 1 root root 64 Oct 27 22:56 10 -> /var/hyperledger/production/ledgersData/chains/index/000001.log
lrwx------ 1 root root 64 Oct 27 22:56 11 -> /var/hyperledger/production/ledgersData/stateLeveldb/LOCK
.......
l-wx------ 1 root root 64 Oct 27 22:56 16 -> /var/hyperledger/production/ledgersData/historyLeveldb/LOG
l-wx------ 1 root root 64 Oct 27 22:56 17 -> /var/hyperledger/production/ledgersData/historyLeveldb/MANIFEST-000000
l-wx------ 1 root root 64 Oct 27 22:56 18 -> /var/hyperledger/production/ledgersData/historyLeveldb/000001.log
lrwx------ 1 root root 64 Oct 27 22:56 23 -> /var/hyperledger/production/ledgersData/chains/chains/myc/blockfile_000000
lrwx------ 1 root root 64 Oct 27 22:56 24 -> socket:[4503236]
..........
lrwx------ 1 root root 64 Oct 27 22:56 7 -> /var/hyperledger/production/ledgersData/chains/index/LOCK
l-wx------ 1 root root 64 Oct 27 22:56 8 -> /var/hyperledger/production/ledgersData/chains/index/LOG
l-wx------ 1 root root 64 Oct 27 22:56 9 -> /var/hyperledger/production/ledgersData/chains/index/MANIFEST-000000
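If you'd rather enumerate these programmatically, here is a small Python sketch that walks the same /proc/<pid>/fd directory (the PID is the 'peer' process we grepped earlier; reading another user's fd directory needs root):

import os

pid = 23709
fd_dir = f"/proc/{pid}/fd"

# Every entry is a symlink whose target is the open file, pipe or socket.
for fd in sorted(os.listdir(fd_dir), key=int):
    target = os.readlink(os.path.join(fd_dir, fd))
    print(f"{fd} -> {target}")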

Another thing we can do is take a look at exe, which tells us which executable this process is running:

root@blk_chain_hlf1:/home/ganga# ls -l /proc/23709/exe
lrwxrwxrwx 1 root root 0 Oct 27 01:05 /proc/23709/exe -> /usr/local/bin/peer
root@blk_chain_hlf1:/home/ganga#

Next, we will look at cmdline; we can cat it to see the exact command the process was started with. The arguments inside the file are NUL-separated, which is why they run together when printed with cat:

root@blk_chain_hlf1:/home/ganga# ls -l /proc/23709/cmdline
-r--r--r-- 1 root root 0 Oct 27 01:05 /proc/23709/cmdline
root@blk_chain_hlf1:/home/ganga# cat /proc/23709/cmdline
peernodestart--peer-chaincodedev=true-oorderer:7050

One more thing we can see is its environment variables:

root@blk_chain_hlf1:/home/ganga# cat /proc/23709/environ
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/binHOSTNAME=95b88453a1acCORE_PEER_ID=peerCORE_PEER_GOSSIP_EXTERNALENDPOINT=peer:7051CORE_LOGGING_LEVEL=DEBUGCORE_PEER_LOCALMSPID=DEFAULTCORE_PEER_ADDRESS=peer:7051CORE_VM_ENDPOINT=unix:///host/var/run/docker.sockCORE_PEER_MSPCONFIGPATH=/etc/hyperledger/mspFABRIC_CFG_PATH=/etc/hyperledger/fabricHOME=/root

The content of environ gets dumped out as one blob like that, but no worries: we can make it more readable by converting each NUL separator into a newline:

root@blk_chain_hlf1:/home/ganga# cat /proc/23709/environ | tr '\0' '\n'
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=95b88453a1ac
CORE_PEER_ID=peer
CORE_PEER_GOSSIP_EXTERNALENDPOINT=peer:7051
CORE_LOGGING_LEVEL=DEBUG
CORE_PEER_LOCALMSPID=DEFAULT
CORE_PEER_ADDRESS=peer:7051
CORE_VM_ENDPOINT=unix:///host/var/run/docker.sock
CORE_PEER_MSPCONFIGPATH=/etc/hyperledger/msp
FABRIC_CFG_PATH=/etc/hyperledger/fabric
HOME=/root
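The same NUL-splitting idea works from code too. Here is a short Python sketch that reads cmdline and environ for the peer process and splits both on the '\0' separator:

pid = 23709

# argv entries in /proc/<pid>/cmdline are separated by NUL bytes,
# with a trailing NUL at the end.
with open(f"/proc/{pid}/cmdline", "rb") as f:
    argv = [arg.decode() for arg in f.read().split(b"\0") if arg]
print(" ".join(argv))

# environ uses the same NUL-separated KEY=VALUE layout.
with open(f"/proc/{pid}/environ", "rb") as f:
    for entry in f.read().split(b"\0"):
        if entry:
            print(entry.decode())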

Really cool……

Evaluation of Boosting Algorithms: XGBoost vs LightGBM

In this article, we're going to discuss boosting algorithms. Boosting algorithms started with the advent of AdaBoost, and today's most powerful boosting algorithm is XGBoost. XGBoost is an algorithm that every aspiring as well as experienced data scientist has in their arsenal. But what really is XGBoost? Let's discuss that.

XGBoost stands for eXtreme Gradient Boosting. “The name XGBoost, though, actually refers to the engineering goal to push the limit of computation resources for boosted tree algorithms, which is the reason why many people use XGBoost” – Tianqi Chen, creator of XGBoost.

XGBoost has been featured in many winning solutions, if not most, and has been dominating machine learning competitions on Kaggle, KDD Cup and many other such platforms. XGBoost is an optimized distributed gradient boosting implementation, designed to be highly efficient, flexible and portable. It provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. But even though XGBoost has it all, it can take a long time to run when given a huge dataset.

Enter LightGBM. Microsoft has lately been ramping up its development of tools in the analytics and machine learning space, and one recently released tool is LightGBM. LightGBM is a fast, distributed, high-performance gradient boosting (GBDT, GBRT, GBM) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks. LightGBM grows trees leaf-wise, splitting the leaf with the best fit, whereas most other boosting implementations grow trees depth-wise (level-wise). When growing from the same leaf, the leaf-wise algorithm can reduce more loss than the level-wise algorithm and can therefore reach better accuracy, and it is also very fast, hence the word 'Light'.

Below is a diagrammatic representation by the makers of LightGBM explaining the difference between how LightGBM and XGBoost build trees.

[Figure: leaf-wise (LightGBM) vs. level-wise (XGBoost) tree growth]

Leaf-wise splits increase model complexity and may lead to overfitting; this can be countered by specifying the max_depth parameter, which limits the depth to which splitting will occur.

The question then arises: which boosting algorithm is better, XGBoost or its challenger LightGBM? To find out, we thought it best to implement both frameworks on a real dataset, so we picked a Kaggle competition called "AllState Claims Severity". A little overview of the data: AllState is a personal insurer in the United States that is continually seeking fresh ideas to improve its claims service for the over 16 million households it protects, and it is currently developing automated methods of predicting the cost, and hence severity, of claims. The data consists of information about AllState customers, and the target is to predict the amount of the claim an individual will make. So, to predict the cost of claims, we're going to use the XGBoost and LightGBM algorithms and compare their results to see which works better. The data consists of 132 features and 188,319 observations.

The implementation can be divided into three phases: data pre-processing, data modeling, and evaluating the performance of the model.

As the summary statistics above show, 75% of the claims fall under an amount of $3,864. The maximum value is $121,012, so there must be outliers, and the target has very high variance as a result. Let's check the distribution of the target variable to get a better understanding.

The target variable appears to be skewed, whereas it should be normally distributed. One of the assumptions we make before applying any regression is that the variables follow a normal distribution, so we have to transform the target variable until it is approximately normally distributed.

Skewness of the target variable is 3.79, which means it is highly skewed; the skew of a normal distribution is 0. We remove the outliers by capping the data at the 99th percentile, i.e. we keep the population that falls under the 99th percentile and treat the remaining 1% as outliers. After removing the outliers, the max value of the target variable is 13,981; previously it was 121,021, and that 1% of outliers was distorting the whole distribution and behavior of the target variable.

The distribution looks a bit better after removing the outliers, and the skewness has come down too. But as the graph shows, it still does not follow the bell curve fully; the distribution is not yet fully normal. To make the target variable follow a normal distribution, we apply a transformation, in this case the Box-Cox transformation. Box-Cox is parameterized by an exponent, lambda (λ); all candidate values of λ are evaluated and the one that yields the best approximation of a normal distribution is selected. The Box-Cox transformation is readily available in the scipy library.
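As a rough sketch of that step (assuming the target lives in a pandas column called loss; the file path, column name and the +1 shift are illustrative, not necessarily exactly what we used):

import pandas as pd
from scipy import stats

train = pd.read_csv("train.csv")          # Allstate training data (hypothetical path)
loss = train["loss"].clip(upper=train["loss"].quantile(0.99))   # cap at the 99th percentile

# stats.boxcox searches over lambda and returns the transformed values
# together with the lambda it settled on.
loss_bc, fitted_lambda = stats.boxcox(loss + 1)   # +1 keeps values strictly positive
print("fitted lambda:", fitted_lambda)
print("skew after transform:", stats.skew(loss_bc))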

After applying the Box-Cox transformation, the skewness of the target variable came down to -0.0004, which is effectively 0, and the distribution of the target variable now follows a normal distribution.

We've also written a function to reverse the Box-Cox-transformed values and recover the original scale, which we will need in order to calculate the metrics for the models.
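One possible shape for such a helper, again just a sketch, using scipy's inv_boxcox and the fitted lambda from above:

from scipy.special import inv_boxcox

def reverse_boxcox(values, lmbda, shift=1):
    # Inverse of the transform above: undo Box-Cox, then remove the shift
    # that was added before transforming.
    return inv_boxcox(values, lmbda) - shift

# e.g. reverse_boxcox(loss_bc, fitted_lambda) gives back the capped losses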

There are 116 categorical variables and 14 continuous variables. The pre-processing for the continuous variables is done and they all follow an approximately normal distribution; the next step is to convert the text variables to numerical ones and drop unnecessary variables from the training set.

With the data pre-processing phase done, we split the train set into two parts: we train the model on the first part, which contains 80% of the original train set, and test it on the remaining 20%.

The next phase is the modeling phase. We will apply the XGBoost and LightGBM algorithms to this data set and compare the results. Before we do, we need to convert our data from the data frame it is stored in into a matrix format (sparse or dense), as XGBoost and LightGBM only work with numeric vectors.
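A minimal sketch of that conversion, assuming X_train/X_valid are the already-encoded feature matrices and y_train/y_valid the Box-Cox-transformed targets from the 80/20 split (these names are placeholders, not from the original code):

import xgboost as xgb
import lightgbm as lgb

# XGBoost wants its own DMatrix container...
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

# ...and LightGBM its Dataset wrapper.
lgb_train = lgb.Dataset(X_train, label=y_train)
lgb_valid = lgb.Dataset(X_valid, label=y_valid, reference=lgb_train)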

We've applied the XGBoost model; the parameters we pass to it are the ones XGBoost uses to build trees, and we arrived at these values after many earlier iterations. We pass similar parameter values to LightGBM so that we can have a fair evaluation/comparison.
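To give an idea of what "similar parameters" means, here is an illustrative sketch; the actual tuned values are not reproduced here, these are just reasonable placeholders kept roughly comparable across the two libraries:

import time

xgb_params = {
    "objective": "reg:linear",     # squared-error regression
    "eta": 0.1,
    "max_depth": 6,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
}
lgb_params = {
    "objective": "regression",
    "learning_rate": 0.1,
    "num_leaves": 63,              # roughly comparable capacity to max_depth=6
    "bagging_fraction": 0.8,
    "feature_fraction": 0.8,
}

start = time.time()
xgb_model = xgb.train(xgb_params, dtrain, num_boost_round=500)
xgb_seconds = time.time() - start

start = time.time()
lgb_model = lgb.train(lgb_params, lgb_train, num_boost_round=500)
lgb_seconds = time.time() - start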

We've applied both XGBoost and LightGBM; now it's time to compare their performance. Since we used XGBoost and LightGBM to solve a regression problem, we're going to compare the Mean Absolute Error of both models as well as their execution times.
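A sketch of that comparison, reusing the hypothetical reverse_boxcox helper and the timings captured above so the errors are reported on the original dollar scale:

from sklearn.metrics import mean_absolute_error

# Predictions come out on the Box-Cox scale, so invert them before scoring.
xgb_pred = reverse_boxcox(xgb_model.predict(dvalid), fitted_lambda)
lgb_pred = reverse_boxcox(lgb_model.predict(X_valid), fitted_lambda)
y_true = reverse_boxcox(y_valid, fitted_lambda)

print("XGBoost  MAE: %.2f  (trained in %.1fs)" % (mean_absolute_error(y_true, xgb_pred), xgb_seconds))
print("LightGBM MAE: %.2f  (trained in %.1fs)" % (mean_absolute_error(y_true, lgb_pred), lgb_seconds))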

The Mean Absolute Error of the two models is more or less the same. The standout, of course, is execution time: LightGBM took only 5 seconds, which is pretty fast, hence the word 'Light', while XGBoost took 110 seconds. That could be the deciding factor when choosing which algorithm to apply to large datasets. Of course, this is just one case; we should test further and apply both algorithms to many other problems before drawing conclusions, and it would be naïve to declare a winner after a single evaluation. It's up to the readers to decide which algorithm they think is best. I don't believe one is strictly better than the other: both XGBoost and LightGBM have their pros and cons, and it comes down to the problem you're trying to solve and which algorithm suits it better, i.e. which can produce better results for the business. After all, data science is about using data and drawing insights from it to improve the business and help make better business decisions.

Author:

Anand Mohan Munigoti

(Data Science Team)

Rambling at Blockchain technologies

Based on my research and understanding of blockchain concepts, an organization can set itself up for success with this technology if it makes the following three important decisions when designing and developing blockchain solutions:

  1. Choosing the appropriate use case
  2. Choosing the right platform
  3. Building or having the relevant expertise

Let us go through them one by one.

Choosing the appropriate use case:

We need to pick a use case that delivers immediate value, sets the arena for a larger transformational play, and unlocks value for those participating in the network.

Choosing the right platform

We need to select a blockchain platform that wouldn’t make us compromise on functionality.

The key things the platform should have are –

  1. Look for a platform with significant development momentum. In other words, it should be backed by a large developer community, have a clear mission, and publish a clear roadmap of future releases mapped to upcoming features.
  2. It should allow us to write chaincode logic against the ledger, let us write chaincode in more than one language (no limitations on programming), and give us a good tool set to play with.
  3. It should have a flexible consensus methodology, flexible in the sense of a pluggable approach that can swap in another or a third-party consensus algorithm.
  4. The platform should allow us to run a permissioned blockchain network, so that only certain members can host a ledger, with the ability to host and authenticate the entire blockchain.
  5. Data privacy is key for a permissioned blockchain network, so the platform should allow the various participants to execute transactions while letting the other participants see the activity only on a need-to-know basis.

Building or having the relevant expertise

  1. Understand the key concepts: cryptocurrency, the blockchain data structure, members, ordering nodes, validating peers, committers, consensus algorithms, replication mechanisms, networking concepts, etc.
  2. Go through blockchain platforms like Hyperledger Fabric, Ethereum, Sawtooth, etc. and identify the key concepts while playing with the samples.
  3. Play with the available tools to get acquainted with the blockchain network, the transaction workflow, and smart contract creation, deployment and querying.
  4. Learn at least one approach to programming smart contracts.