Why XGBoost?

XGBoost is short for eXtreme Gradient Boosting.

What is boosting? Quick Explanation:

Two common terms used in ML are bagging and boosting.
Bagging: an approach where you take random samples of the data, build a learner on each sample, and take a simple mean of their predictions to get the bagged probabilities.
Boosting: similar, but the sample is selected more intelligently. Each successive round gives more weight to the observations that are hard to classify.
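
To make the difference concrete, here is a minimal sketch in plain Python. It is not XGBoost's own mechanism: the reweighting rule is the classic AdaBoost-style update, used here only as a well-known illustration of "give more weight to hard observations" (gradient boosting fits each new model to the errors of the previous ones instead). The function names `bootstrap_sample` and `reweight` are made up for this example.

```python
import math
import random

def bootstrap_sample(data, rng):
    """Bagging step: draw len(data) items uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

def reweight(weights, correct):
    """Boosting step (AdaBoost-style, an assumption for illustration).

    weights: current observation weights, summing to 1
    correct: booleans, True where the current learner classified correctly
    Returns new weights in which misclassified observations count for more.
    """
    # Weighted error rate of the current learner.
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    # Clamp so a perfect (or completely wrong) learner does not break the log.
    err = min(max(err, 1e-10), 1 - 1e-10)
    alpha = 0.5 * math.log((1 - err) / err)
    # Shrink weights of correct items, grow weights of misclassified ones.
    new = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]

rng = random.Random(0)
data = list(range(5))
sample = bootstrap_sample(data, rng)   # bagging: every row is equally likely

weights = [0.2] * 5                    # boosting: start from equal weights
correct = [True, True, True, True, False]
new_weights = reweight(weights, correct)
```

In this toy run the single misclassified observation ends up carrying half of the total weight, so the next learner is pushed to get it right.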

XGBoost – Why Is It So Important?

  • In broad terms, it’s the efficiency, accuracy, and feasibility of this algorithm.
  • It has both a linear model solver and tree learning algorithms. What makes it fast is its capacity to do parallel computation on a single machine.
  • It also has additional features for doing cross-validation and finding important variables.

Features – XGBoost:

  • Speed: it can automatically do parallel computation on Windows and Linux, with OpenMP. It is generally over 10 times faster than the classical gbm.
  • Input Type: it takes several types of input data:
      • Dense Matrix: R’s dense matrix, i.e. matrix;
      • Sparse Matrix: R’s sparse matrix, i.e. Matrix::dgCMatrix;
      • Data File: local data files;
      • xgb.DMatrix: its own class (recommended).
  • Sparsity: it accepts sparse input for both tree booster and linear booster, and is optimized for sparse input ;
  • Customization: it supports customized objective functions and evaluation functions.
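
As an illustration of the customization point: a custom objective in XGBoost supplies, for every observation, the first and second derivative (gradient and Hessian) of the loss with respect to the raw prediction. The sketch below computes those two quantities for the logistic loss in plain Python. The function name `logistic_grad_hess` is hypothetical, and wiring it into an actual training call is deliberately left out.

```python
import math

def sigmoid(z):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad_hess(raw_scores, labels):
    """Gradient and Hessian of the log loss w.r.t. the raw (pre-sigmoid) score.

    raw_scores: model outputs before the sigmoid
    labels:     true classes, 0 or 1
    """
    grad, hess = [], []
    for z, y in zip(raw_scores, labels):
        p = sigmoid(z)
        grad.append(p - y)            # first derivative
        hess.append(p * (1.0 - p))    # second derivative
    return grad, hess

grad, hess = logistic_grad_hess([0.0, 2.0, -1.0], [1, 0, 0])
```

A custom evaluation function follows the same idea: it takes predictions and labels and returns a single number to monitor.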

Numeric vs Categorical Variables:

XGBoost manages only numeric vectors.

What to do when you have categorical data?

A simple method to convert a categorical variable into a numeric vector is one-hot encoding: each category level becomes its own 0/1 indicator column.
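
A minimal sketch of the idea in plain Python, assuming a single categorical column given as a list of strings (the helper name `one_hot` is made up for this example, and the column order is simply the sorted category levels):

```python
def one_hot(values):
    """Turn a list of category labels into 0/1 indicator rows.

    Returns (levels, rows): levels is the sorted list of distinct categories,
    and rows[i][j] is 1 when values[i] == levels[j], else 0.
    """
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)
        row[index[v]] = 1     # exactly one "hot" column per observation
        rows.append(row)
    return levels, rows

levels, rows = one_hot(["red", "green", "red", "blue"])
# levels -> ['blue', 'green', 'red']
# rows   -> [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

With many distinct levels the result is mostly zeros, which is where sparse input becomes useful.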

Tree Boosting in a Nutshell:

We first briefly review the learning objective of tree boosting. For a given data set with n examples and m features, a tree ensemble model (shown in Fig. above) uses K additive functions to predict the output.
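
"K additive functions" means the prediction for an example x is the sum f_1(x) + f_2(x) + ... + f_K(x), where each f_k is one tree. The sketch below shows just that summation in plain Python, using one-split trees (stumps) with hand-picked thresholds and leaf values. Everything in it is made up for illustration; real trees are learned from data and are usually deeper.

```python
def make_stump(feature, threshold, left_value, right_value):
    """Build a one-split tree: return left_value if x[feature] < threshold."""
    def f(x):
        return left_value if x[feature] < threshold else right_value
    return f

def predict(trees, x):
    """Ensemble output: the sum of every tree's leaf value for x."""
    return sum(f(x) for f in trees)

# K = 3 hypothetical trees over m = 2 features.
trees = [
    make_stump(0, 5.0, -1.0, 1.0),
    make_stump(1, 2.0, 0.5, -0.5),
    make_stump(0, 8.0, 0.0, 0.25),
]

score = predict(trees, [6.0, 1.0])   # 1.0 + 0.5 + 0.0 = 1.5
```

Boosting builds this sum one tree at a time, each new tree correcting what the trees before it still get wrong.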

Industry Usage?

XGBoost has also been widely adopted by industry users, including Google, Alibaba and Tencent, and various startup companies. According to a popular article in Forbes, XGBoost can scale smoothly to hundreds of workers (with each worker utilizing multiple processors) and solve machine learning problems involving terabytes of real-world data.

