---
layout: global
title: Ensembles - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Ensembles
---

* Table of contents
{:toc}
An *ensemble method* is a learning algorithm which creates a model composed of a set of other base models. MLlib supports two major ensemble algorithms: `GradientBoostedTrees` and `RandomForest`. Both use decision trees as their base models.
## Gradient-Boosted Trees vs. Random Forests
Both Gradient-Boosted Trees (GBTs) and Random Forests are algorithms for learning ensembles of trees, but the training processes are different. There are several practical trade-offs:
- GBTs train one tree at a time, so they can take longer to train than Random Forests. Random Forests can train multiple trees in parallel.
- On the other hand, it is often reasonable to use smaller (shallower) trees with GBTs than with Random Forests, and training smaller trees takes less time.
- Random Forests can be less prone to overfitting. Training more trees in a Random Forest reduces the likelihood of overfitting, but training more trees with GBTs increases the likelihood of overfitting. (In statistical language, Random Forests reduce variance by using more trees, whereas GBTs reduce bias by using more trees.)
- Random Forests can be easier to tune since performance improves monotonically with the number of trees (whereas performance can start to decrease for GBTs if the number of trees grows too large).
In short, both algorithms can be effective, and the choice should be based on the particular dataset.
## Random Forests
Random forests are ensembles of decision trees, and they are among the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting. Like decision trees, random forests handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.
MLlib supports random forests for binary and multiclass classification and for regression, using both continuous and categorical features. MLlib implements random forests using the existing decision tree implementation. Please see the decision tree guide for more information on trees.
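As a concrete illustration, the following is a minimal sketch of training and evaluating a random forest classifier with MLlib's `RandomForest.trainClassifier` API. It assumes a `SparkContext` named `sc` is already in scope and that a LIBSVM-formatted dataset exists at the path shown; the parameter values are illustrative defaults, not tuned settings.

```scala
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

// Load a LIBSVM-formatted dataset into an RDD[LabeledPoint].
// The file path is an assumption; substitute your own data.
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a random forest for binary classification.
// An empty categoricalFeaturesInfo map means all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10
val featureSubsetStrategy = "auto" // let the algorithm choose the subset size
val impurity = "gini"
val maxDepth = 4
val maxBins = 32

val model = RandomForest.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity,
  maxDepth, maxBins)

// Evaluate: fraction of misclassified test instances.
val testErr = testData.map { point =>
  if (model.predict(point.features) == point.label) 0.0 else 1.0
}.mean()
println(s"Test error = $testErr")
```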
### Basic algorithm
Random forests train a set of decision trees separately, so the training can be done in parallel. The algorithm injects randomness into the training process so that each decision tree is a bit different. Combining the predictions from each tree reduces the variance of the predictions, improving the performance on test data.
**Training**
The randomness injected into the training process includes:
- Subsampling the original dataset on each iteration to get a different training set (a.k.a. bootstrapping).
- Considering different random subsets of features to split on at each tree node.
Apart from these randomizations, decision tree training is done in the same way as for individual decision trees.
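To make these two randomizations concrete, here is a small self-contained Scala sketch. It is purely illustrative (not MLlib's implementation); the dataset sizes, seed, and the square-root subset size are assumptions chosen for the example.

```scala
import scala.util.Random

val rng = new Random(42)
val numRows = 1000
val numFeatures = 20

// 1. Bootstrapping: sample row indices with replacement, so each tree
//    is trained on a slightly different view of the data.
val bootstrapSample = Seq.fill(numRows)(rng.nextInt(numRows))

// 2. Feature subsetting: at each node, only a random subset of features
//    is considered as split candidates (e.g. sqrt(numFeatures) of them
//    for classification).
val subsetSize = math.sqrt(numFeatures).ceil.toInt
val candidateFeatures = rng.shuffle((0 until numFeatures).toList).take(subsetSize)

println(s"This node considers features: $candidateFeatures")
```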
**Prediction**
To make a prediction on a new instance, a random forest must aggregate the predictions from its set of decision trees. This aggregation is done differently for classification and regression.
**Classification**: Majority vote. Each tree's prediction is counted as a vote for one class. The label is predicted to be the class which receives the most votes.

**Regression**: Averaging. Each tree predicts a real value. The label is predicted to be the average of the tree predictions.
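Both aggregation rules are easy to state in code. The following is an illustrative Scala sketch (not MLlib internals), where the hypothetical `treePredictions` stands for the per-tree predictions on a single instance:

```scala
// Hypothetical per-tree predictions for one instance.
val treePredictions = Seq(1.0, 0.0, 1.0, 1.0, 0.0)

// Classification: majority vote over the predicted class labels.
val votedClass = treePredictions
  .groupBy(identity)
  .maxBy { case (_, votes) => votes.size }
  ._1

// Regression: average of the per-tree real-valued predictions.
val averaged = treePredictions.sum / treePredictions.size

println(s"Majority vote: $votedClass, average: $averaged")
```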