# Non-technical intro to Random Forest and Gradient Boosting in machine learning

The collective wisdom of many is likely more accurate than that of any one. (Wisdom of the crowd, attributed to Aristotle, 4th century BC)

The concept of an *ensemble* is fundamental to many areas of our lives. A *choir* of singers is an *ensemble*. A *band* of instrumentalists is an *ensemble*. A group of *vocalists* singing different parts (bass, tenor, alto, soprano) is an *ensemble*. A *group* of kids singing melodious a cappella is an *ensemble*. You can already see the trend, right?

In all the examples above, each individual may be a good performer; however, when they perform together, they can deliver an exceptionally beautiful performance.

This is the same concept behind *ensemble modeling* in machine learning, where we pool or chain a collection of learning models together to build a generalized, robust model that predicts well when new data is provided.

A model is any of the supervised or unsupervised algorithms listed below, each of which has a proven record of strong performance in predictive modeling. However, when a collection of the same (or different) models is aggregated, we can often get more accurate predictions than from any single model.

**Supervised Learning:** The classification (or label) of each sample in the dataset is known and predetermined.

**Unsupervised Learning:** We have little to no prior knowledge of the results or data grouping. We need to examine the relationships in the data to determine an appropriate clustering.

Supervised Learning ("Y" is Known) | Unsupervised Learning ("Y" is Unknown) | Semi-Supervised Learning (Sometimes we know "Y") |
---|---|---|
Regression (Lasso, Ridge, Logistic) | Clustering (K-means clustering, Mean shift clustering, Spectral Clustering) | Prediction & Classifications |
Decision Tree (Gradient Boosting & Random Forest) | Apriori Rule | Clustering |
Neural Network | Kernel Density Estimation | Expectation-Maximization (EM) |
Support Vector Machine (SVM) | Principal Component Analysis (PCA) (Kernel PCA, Sparse PCA) | Transductive Support Vector Machines (TSVM) |
Naive Bayes | Singular Value Decomposition (SVD) | Manifold Regularization |
K-Nearest Neighbor (KNN) | Self-Organizing Map (SOM) | Auto-encoder (Multilayer Perceptron, Restricted Boltzmann machine (RBM)) |

### Single Model

### Ensemble Model

In general, an *ensemble model* is all about using many different models in collaboration to combine their strengths and compensate for their weaknesses, as well as to make the resulting model generalize better to future data.

Ensemble models have significant applications in several business domains: movie recommendation engines, aircraft maintenance forecasting, business/sales forecasting, election forecasting using polls from various markets and demographics, and weather forecasting. Ensemble modeling is a dependable way to improve the accuracy and reliability of predictions when machine learning is applied in these domains.

### How do we combine RESULTS of multiple models?

Here we are concerned with how we treat the outputs coming from the multiple models. If we’re using 7 different models and each of them generates an output, how do we make the final prediction based on the outputs from these 7 models? *Remember, a model may be any of the supervised or unsupervised algorithms listed above.*

**Interval Targets:** We take the average of the outputs coming from the different models. This suits numeric, continuous data.

**Categorical Targets:** We count the results from all the models and use the result with the majority of votes as the final prediction. This suits results such as True/False or Yes/No.
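To make the two combining rules concrete, here is a minimal sketch in plain Python. The function names and the seven sample outputs are made up for illustration; they stand in for whatever your seven models would actually return for one new observation.

```python
from collections import Counter
from statistics import mean


def combine_interval(outputs):
    """Interval target: average the numeric predictions of all models."""
    return mean(outputs)


def combine_categorical(outputs):
    """Categorical target: return the label that most models voted for."""
    return Counter(outputs).most_common(1)[0][0]


# Hypothetical outputs from 7 models for a single new observation
numeric_outputs = [10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 12.0]
label_outputs = ["Yes", "No", "Yes", "Yes", "No", "Yes", "No"]

print(combine_interval(numeric_outputs))     # 11.0
print(combine_categorical(label_outputs))    # Yes
```

With an odd number of models and two possible labels, a majority vote can never tie, which is one practical reason to prefer an odd ensemble size for Yes/No targets.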

### How do we combine PROCESSES of multiple models?

Here we discuss the techniques we can use to connect multiple models together, that is, the connection process that occurs before each model generates a result. There are four ways we can effectively join multiple models together:

- One Algorithm, (trained with) Different Data Samples
- One Algorithm, (tuned with) Different Configuration Options
- Different Algorithms (Chaining)
- Expert (Rule-Based Systems) knowledge (Joining with Heuristic ideas)

### 1.) One Algorithm, (trained with) Different Data Samples

*Bagging (Bootstrap Aggregating)*

This method is formally referred to as **Bagging (Bootstrap Aggregating)**: random samples are drawn (with replacement) from the full dataset to form different subsets, each subset is fed into a model/algorithm, and the results are fed into an *Ensembler*. The two parallel blocks in the middle of the diagram are the training processes, while the last block, the *Ensembler*, is the result combiner.

Notice that in this example, we have used a *Decision Tree*, which is one of the models/algorithms listed above. However, any of the other methods could have been used as well. **Random Forest** is a particularly popular, high-performance algorithm built around the concept of Bagging.

**Random Forest:** Random forest is an ensemble of decision trees, where the training (sample) dataset is recursively partitioned along different branches such that similar observations are grouped together at the terminal leaves of each decision tree.

To provide more robustness, an ensemble / collection of decision trees can be trained to become a *Forest* where new observations are scored by majority vote or averaging. *Each Decision Tree is trained from different random samples from the full dataset.*
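The bagging idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: each "tree" is a one-split decision stump on a single feature, the toy data is invented, and the function names are hypothetical. A real Random Forest grows full trees and adds further randomization.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a one-split stump on (x, label) pairs with labels 0/1.

    Every observed x is tried as a threshold; the stump predicts 1 when
    x > threshold and 0 otherwise, and the threshold with the fewest
    training errors is kept.
    """
    best_threshold, best_errors = None, len(points) + 1
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(y) for x, y in points)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagged_stumps(points, n_models=25, seed=0):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]
        thresholds.append(fit_stump(sample))
    return thresholds


def predict(thresholds, x):
    """The 'Ensembler': majority vote over all stumps."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: the label is 1 when x is above 5
data = [(x, int(x > 5)) for x in range(1, 11)]
forest = bagged_stumps(data)
print(predict(forest, 2.0), predict(forest, 9.0))
```

Because every stump sees a slightly different sample, the individual thresholds differ, but the vote over all of them is stable.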

Random Forest is not just about bagging; it has some specific enhancements:

For example, the Decision Trees in a Random Forest are typically grown deep (well-trained) on their own samples, and each split considers only a random subset of the features. This means that averaging a large number of trees will result in a more robust prediction than using a single, highly fine-tuned decision tree.

Also, Random Forest can be used to determine the variables (or features) with the most significant impact. Hence, RF can help in pruning the (not-so-influential) variables if there’s a need for such a reduction.

*Boosting*

Similar to *Bagging*, Boosting uses a single algorithm to train models on different data samples, but with a twist. Boosting starts by training a single model on a sample of the dataset.

Those data points that are *misclassified* are assigned heavier *weights* in the next iteration so that focus is on the misclassified points. This helps to create an adaptive learning approach which improves the performance of the algorithm.
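The reweighting step can be sketched as follows. This is an AdaBoost-style illustration in plain Python, assuming labels of +1/-1, a single numeric feature, and one-split stumps as the weak model; the function names and the toy data are hypothetical.

```python
import math


def fit_weighted_stump(xs, ys, weights):
    """Return (weighted_error, threshold, direction) with the lowest error.

    The stump predicts `direction` when x > threshold, else `-direction`.
    """
    best = None
    for threshold in xs:
        for direction in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if (direction if x > threshold else -direction) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, direction)
    return best


def boost(xs, ys, n_rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n                      # every point starts equal
    models = []
    for _ in range(n_rounds):
        error, threshold, direction = fit_weighted_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # this model's say in the vote
        models.append((alpha, threshold, direction))
        # Misclassified points (y * prediction == -1) get heavier weights
        weights = [
            w * math.exp(-alpha * y * (direction if x > threshold else -direction))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return models, weights


def predict(models, x):
    """Weighted vote of all stumps."""
    score = sum(a * (d if x > t else -d) for a, t, d in models)
    return 1 if score > 0 else -1


# Hypothetical toy data that no single stump can classify perfectly
xs = [1, 2, 3, 4, 5, 6]
ys = [-1, -1, 1, -1, 1, 1]

_, weights_after_one_round = boost(xs, ys, n_rounds=1)
models, _ = boost(xs, ys, n_rounds=3)
print(weights_after_one_round)
print([predict(models, x) for x in xs])
```

After the first round, the one misclassified point (x = 4) carries half of the total weight, so the second stump is pushed to get it right.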

This technique can be computationally expensive because the models must be trained sequentially. Boosting can be used with any model, not just decision trees; a well-known example of Boosting that uses Decision Trees is **Gradient Boosting**.
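Gradient Boosting puts the "learn from the previous mistakes" idea into practice a little differently: instead of reweighting points, each new tree is fitted to the errors (residuals) left by the trees before it. Below is a minimal sketch for a numeric target with squared-error loss, assuming one-split regression stumps as the trees. The function names, the learning rate of 0.5, and the data are illustrative choices, not part of any library.

```python
def fit_regression_stump(xs, targets):
    """Find the split with the lowest squared error; predict the mean per side."""
    best = None
    for threshold in xs:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        if not right:            # skip splits that leave one side empty
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)                 # start from the overall mean
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Residuals are what the current ensemble still gets wrong
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_regression_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )


def mean_squared_error(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0]

for rounds in (0, 1, 20):
    model = gradient_boost(xs, ys, n_rounds=rounds)
    print(rounds, mean_squared_error(model, xs, ys))
```

The training error shrinks as rounds are added, which is also why boosting needs care: with enough rounds it will happily fit noise.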

Boosting effectively transforms a weak model into a stronger, more powerful model. However, Boosting degrades with noisy data: noisy or mislabeled points keep getting misclassified, so they receive heavier and heavier weights that may not be appropriate in the next iteration.

In comparison with Bagging, Boosting uses a feedback-control technique, so it is adaptive. Bagging can run in parallel, while Boosting requires sequential processing.

### 2.) One Algorithm, (tuned with) Different Configuration Options

In this approach, we vary the configuration options of the SAME algorithm on the SAME dataset. This helps to determine the best tweaks, parameters, or options that optimize the performance of the model or algorithm. For example, multiple configurations of a Neural Network model can be trained and evaluated with different tuning parameters to determine the number of *Hidden Units (HU)* that works best for the model.

Notice how the four NNs are grouped into three (3) ensembles and then compared at the final output: Ensemble #1 (all NNs with 3HU, 10HU, 30HU, 50HU), Ensemble #2 (10HU, 30HU, 50HU), and Ensemble #3 (30HU & 50HU).
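The same grouping can be sketched in plain Python. To keep the example short and dependency-free, a k-nearest-neighbour classifier stands in for the Neural Network, and the number of neighbours `k` plays the role of the Hidden Units setting. The data, the chosen `k` values, and the function names are all assumptions made for illustration.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def ensemble_predict(train, x, ks):
    """Vote across the SAME algorithm run with different configurations."""
    votes = Counter(knn_predict(train, x, k) for k in ks)
    return votes.most_common(1)[0][0]   # ties go to the label seen first


def accuracy(train, validation, ks):
    hits = sum(ensemble_predict(train, x, ks) == y for x, y in validation)
    return hits / len(validation)


# Hypothetical toy data: one feature, two classes
train = [(1, "A"), (2, "A"), (3, "A"), (4, "A"),
         (7, "B"), (8, "B"), (9, "B"), (10, "B")]
validation = [(1.5, "A"), (3.5, "A"), (7.5, "B"), (9.5, "B")]

# Three ensembles built from four configurations, mirroring the grouping above
ensembles = {
    "Ensemble #1": (1, 3, 5, 7),
    "Ensemble #2": (3, 5, 7),
    "Ensemble #3": (5, 7),
}
for name, ks in ensembles.items():
    print(name, accuracy(train, validation, ks))
```

On real data the three ensembles would usually score differently on the validation set, and the comparison tells you which configurations are worth keeping.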

### 3.) Different Algorithms (Chaining)

Here the ensemble uses multiple, completely different algorithms. Each algorithm or model assumes some form of relationship between the inputs (features, variables) and the targets. For example, Linear Regression assumes a linear relation, a Decision Tree assumes a constant relation within ranges of the inputs, while a Neural Network assumes a nonlinear relation that depends on its architecture.
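As a rough sketch of mixing different assumptions, the snippet below averages three simple models in plain Python: a least-squares line (linear relation), a one-split stump (constant within ranges), and a 1-nearest-neighbour lookup that stands in for the nonlinear Neural Network. The function names, the fixed split at 2, and the data are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_stump(xs, ys, threshold):
    """One-split tree: a constant prediction on each side of the threshold."""
    left = mean(y for x, y in zip(xs, ys) if x <= threshold)
    right = mean(y for x, y in zip(xs, ys) if x > threshold)
    return lambda x: left if x <= threshold else right


def fit_nearest(xs, ys):
    """1-nearest-neighbour: returns the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def average_ensemble(models):
    """Interval target, so the outputs are combined by averaging."""
    return lambda x: mean(model(x) for model in models)


# Hypothetical toy data
xs = [1, 2, 3, 4]
ys = [2.0, 4.0, 6.0, 8.0]

models = [fit_line(xs, ys), fit_stump(xs, ys, threshold=2), fit_nearest(xs, ys)]
ensemble = average_ensemble(models)
print([model(2.5) for model in models], ensemble(2.5))
```

Each model gives a different answer at x = 2.5 because each one assumes a different shape for the relationship; the ensemble lands in between.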

## In Summary

- Ensemble models are great for producing robust, highly optimized, improved models.
- Random Forest and Gradient Boosting are ensemble-based algorithms.
- Random Forest uses the *Bagging technique* while Gradient Boosting uses the *Boosting technique*.
- Bagging uses *multiple random data sampling* for modeling while Boosting uses iterative refinement for modeling.
- Ensemble models are not easy to interpret, and they often work like a little black box.
- Multiple algorithms must be used sparingly so that the prediction system remains reasonably tractable.

I hope you find this useful. If you think I’ve missed something important, please use the comment box below to suggest changes or discuss your implementations. Cheers!