XGBoost is short for the eXtreme Gradient Boosting package.
What is boosting? Quick Explanation:
Two common terms in machine learning are bagging and boosting.
Bagging: an approach where you draw random samples of the data (with replacement), fit a learner on each sample, and average the learners' predictions to get the bagged estimate.
Boosting: similar, but the samples are chosen more intelligently. Each successive learner gives more weight to the observations that the earlier learners found hard to classify.
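To make the reweighting idea concrete, here is a minimal sketch in plain Python. It is an AdaBoost-style toy on one-dimensional data, not the algorithm XGBoost itself uses (XGBoost is a gradient boosting method and fits each new tree to gradients rather than reweighting explicitly). The function names (`fit_stump`, `adaboost`, `predict`) and the toy data are made up for illustration.

```python
import math

def fit_stump(xs, ys, weights):
    """Return (threshold, polarity, weighted_error) of the best 1-D stump.

    A stump predicts `polarity` when x <= threshold and `-polarity` otherwise.
    """
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x <= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[2]:
                best = (threshold, polarity, err)
    return best

def stump_predict(stump, x):
    threshold, polarity, _ = stump
    return polarity if x <= threshold else -polarity

def adaboost(xs, ys, rounds=3):
    """Fit `rounds` stumps; return the ensemble and the final example weights."""
    n = len(xs)
    weights = [1.0 / n] * n          # start with equal weight on every example
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(xs, ys, weights)
        err = max(stump[2], 1e-10)   # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)   # vote strength of this stump
        ensemble.append((alpha, stump))
        # Misclassified examples (y * prediction == -1) get their weight increased.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble, weights

def predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy data (illustrative only): no single threshold separates the two classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble, final_weights = adaboost(xs, ys, rounds=3)
predictions = [predict(ensemble, x) for x in xs]
```

On this toy set no single stump can classify every point, but three reweighted stumps voting together can. A bagging version would instead fit each stump on an independent random resample and average the votes with equal weight.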
XGBOOST – Why is it so Important?
- In broad terms, it’s the efficiency, accuracy, and feasibility of this algorithm.
- It includes both a linear model solver and tree learning algorithms. What makes it fast is its capacity for parallel computation on a single machine.
- It also has additional features for doing cross-validation and finding important variables.
Features – XGBOOST:
- Speed: it can automatically do parallel computation on Windows and Linux, with OpenMP. It is generally over 10 times faster than the classical gbm.
- Input Type: it takes several types of input data:
  - Dense Matrix: R’s dense matrix, i.e. matrix;
  - Sparse Matrix: R’s sparse matrix, i.e. Matrix::dgCMatrix;
  - Data File: local data files;
  - xgb.DMatrix: its own class (recommended).
- Sparsity: it accepts sparse input for both tree booster and linear booster, and is optimized for sparse input.
- Customization: it supports customized objective functions and evaluation functions.
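As a rough illustration of the customization point: a customized objective generally comes down to supplying, for each example, the first and second derivatives of the loss with respect to the raw prediction. The sketch below computes them for logistic loss in plain Python. The function name `logistic_grad_hess` is hypothetical, and the wiring into the library's training call is deliberately left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad_hess(raw_scores, labels):
    """Per-example gradient and Hessian of log loss w.r.t. the raw score.

    raw_scores are untransformed margins; labels are 0 or 1.
    """
    grads, hesses = [], []
    for score, label in zip(raw_scores, labels):
        p = sigmoid(score)             # predicted probability
        grads.append(p - label)        # first derivative of log loss
        hesses.append(p * (1.0 - p))   # second derivative of log loss
    return grads, hesses

# Illustrative values only.
grads, hesses = logistic_grad_hess([0.0, 2.0, -1.0], [1, 0, 1])
```

A customized evaluation function is simpler: it only needs to return a metric name and a single number computed from the predictions and the labels.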
Numeric vs Categorical Variables:
Xgboost manages only numeric vectors.
What to do when you have categorical data?
A simple method to convert a categorical variable into a numeric vector is One Hot Encoding.
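A minimal sketch of one-hot encoding in plain Python is shown below. The helper name `one_hot_encode` and the sample values are illustrative assumptions; in practice a library routine would normally do this.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector with one column per category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1     # exactly one column is set per observation
        rows.append(row)
    return categories, rows

colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
# categories -> ['blue', 'green', 'red']
# encoded    -> [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each category becomes its own numeric column, so the result can be passed on as a dense or sparse numeric matrix.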
Tree Boosting in a Nutshell:
We first briefly review the learning objective of tree boosting. For a given data set with n examples and m features, a tree ensemble model (shown in the figure above) uses K additive functions to predict the output.
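Written out, the statement above takes the following form. The symbols here are assumptions, since the text does not fix a notation: x_i is the feature vector of example i, y_i its target, and each f_k is one regression tree drawn from the space F of trees.

```latex
\mathcal{D} = \{(x_i, y_i)\}, \quad |\mathcal{D}| = n, \quad x_i \in \mathbb{R}^m, \quad y_i \in \mathbb{R}

\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}
```

In words, the prediction for an example is the sum of the scores that each of the K trees assigns to it.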
XGBoost has also been widely adopted by industry users, including Google, Alibaba and Tencent, and by various startup companies. According to a popular article in Forbes, xgboost scales smoothly to hundreds of workers (each using multiple processors) and solves machine learning problems involving terabytes of real-world data.