In this post, I am going to compare two popular ensemble methods, Random Forests (RF) and Gradient Boosting Machine (GBM). GBM and RF are both ensemble learning methods that predict (regression or classification) by combining the outputs of individual trees (we assume tree-based GBM, i.e., GBT). They share the general strengths and weaknesses of ensemble methods mentioned in my previous post, so here I compare them only with respect to each other.
GBM and RF differ in how the trees are built: the order in which they are trained and the way their results are combined. It has been shown that GBM performs better than RF if its parameters are tuned carefully [1,2].
Gradient Boosting: GBT builds trees one at a time, where each new tree helps to correct the errors made by the previously trained trees.
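To make this sequential idea concrete, here is a minimal sketch of gradient boosting for regression with squared error loss, where each new tree is fit to the residuals (the negative gradient) of the current ensemble. The toy data, tree depth, learning rate, and number of trees are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative only).
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)

n_trees, learning_rate = 100, 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
trees = []

for _ in range(n_trees):
    residuals = y - prediction                 # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2)  # shallow tree, as is typical for GBM
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # each new tree corrects the current errors
    trees.append(tree)

def predict(X_new):
    """Combine the constant start value with all sequentially trained trees."""
    return y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)
```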
A) Real-world application
A great application of GBM is anomaly detection in supervised settings where the data is often highly unbalanced, such as DNA sequences, credit card transactions, or cybersecurity.
Reference [3] presents a more specific application in this context: a supervised anomaly detection task tackled with a learning-to-rank approach. Learning to rank is the application of machine learning to building ranking models for information retrieval systems. The result is that anomalies are found with high precision without passing too many genuine examples to the experts. According to this manuscript, gradient boosting has been shown to be a powerful method for learning-to-rank problems on real-life datasets due to two main features:
- It performs the optimization in function space (rather than in parameter space), which makes the use of custom loss functions much easier.
- Boosting focuses step by step on difficult examples, which gives a nice strategy for dealing with unbalanced datasets by strengthening the impact of the positive class.
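The exact ranking objective of [3] is beyond a short snippet, but the second point, strengthening the impact of the positive class, can be illustrated with per-sample weights in an off-the-shelf gradient boosting classifier. The inverse-class-frequency weighting below is a common heuristic and an assumption of this sketch, not the method of the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Highly unbalanced toy data: roughly 2% positives (illustrative only).
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)

# Up-weight the rare positive class by inverse class frequency.
class_weight = {0: 1.0, 1: (y == 0).sum() / (y == 1).sum()}
sample_weight = np.array([class_weight[label] for label in y])

gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gbm.fit(X, y, sample_weight=sample_weight)
```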
B) Strengths of the model
Since boosted trees are derived by optimizing an objective function, GBM can essentially be used for any objective for which we can write out the gradient. This includes objectives such as ranking and Poisson regression, which are harder to achieve with RF.
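As one concrete example, scikit-learn's histogram-based gradient boosting offers a Poisson objective for count targets (assuming scikit-learn 1.0 or later, where the estimator is stable and the Poisson loss is available); the synthetic count data below is purely illustrative.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

# Synthetic count data: targets drawn from a Poisson whose rate depends on X.
rng = np.random.RandomState(0)
X = rng.uniform(0, 2, size=(1000, 3))
y = rng.poisson(lam=np.exp(0.5 * X[:, 0] - 0.3 * X[:, 1]))

# Poisson deviance as the boosting objective -- the gradient is easy to write,
# so the tree ensemble is optimized directly for count data.
model = HistGradientBoostingRegressor(loss="poisson", max_iter=200, learning_rate=0.1)
model.fit(X, y)
print(model.predict(X[:5]))  # predicted (non-negative) expected counts
```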
C) Weaknesses of the model
- GBMs are more sensitive to overfitting if the data is noisy.
- Training generally takes longer because the trees are built sequentially.
- GBMs are harder to tune than RF. There are typically three parameters: the number of trees, the depth of the trees, and the learning rate; each tree built is generally shallow.
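As a rough illustration of this tuning burden, the three parameters are usually searched jointly; the grid below is an arbitrary example on synthetic data, not a recommended setting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# The three knobs interact: more trees usually call for a smaller learning rate.
param_grid = {
    "n_estimators": [100, 300, 500],   # number of trees
    "max_depth": [2, 3, 4],            # depth of the (shallow) trees
    "learning_rate": [0.01, 0.05, 0.1],
}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```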
Random Forest: RFs train each tree independently, using a random sample of the data. This randomness helps to make the model more robust than a single decision tree, and less likely to overfit on the training data [1,2].
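To illustrate the "independent trees on random samples" idea, here is a bare-bones sketch: each tree gets its own bootstrap sample of the rows and considers a random subset of features at each split, which is essentially what RandomForestClassifier automates. Dataset, ensemble size, and feature rule are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rng = np.random.RandomState(0)

# Each tree is trained independently on its own bootstrap sample of the rows.
trees = []
for _ in range(50):
    idx = rng.randint(0, len(X), size=len(X))          # sample rows with replacement
    tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Final prediction: majority vote over the independent trees.
votes = np.stack([t.predict(X) for t in trees])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
```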
A) Real-world application
The most prominent application of random forest is multi-class object detection in large-scale real-world computer vision problems [4]. RF methods can handle a large amount of training data efficiently and are inherently suited for multi-class problems.
Another application is in bioinformatics, such as medical diagnosis [5]. This method is especially attractive for this application in the following cases:
- The real-world data is noisy and contains many missing values; some of the attributes are categorical or semi-continuous.
- There is a need to integrate different data sources, which raises the issue of how to weight them.
- We need high predictive accuracy for a high-dimensional problem with highly correlated features.
B) Strengths of the model
- RF is much easier to tune than GBM. There are typically two parameters in RF: number of trees and number of features to be selected at each node.
- RF is harder to overfit than GBM.
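To contrast with the GBM grid above, a random forest often works well with little more than these two parameters; the values below are illustrative defaults rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Only two knobs to worry about: number of trees and features tried per split.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                            n_jobs=-1, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
```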
C) Weaknesses of the model
- The main limitation of the Random Forests algorithm is that a large number of trees may make the algorithm slow for real-time prediction.
- For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels. Therefore, the variable importance scores from random forests are not reliable for this type of data. Methods such as partial permutations have been used to address the problem [7] (see the sketch after this list).
- If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups [7].
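One widely used remedy for the biased impurity-based importances mentioned above is permutation importance, which scikit-learn exposes directly; it is a close relative of, not identical to, the partial permutation scheme of [7]. The data and settings here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and measure the drop in score;
# unlike impurity-based importances, this does not favor high-cardinality features.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```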
Note: A very important point is how RF and GBM handle missing data. Gradient boosted trees use CART trees (in the standard setup, as proposed by their authors), and CART trees are also used in Random Forests. CART handles missing values either by imputation with the average, by a rough average/mode, or by an averaging/mode based on proximities. However, one can build a GBM or RF with other types of decision trees. The usual replacement for CART is C4.5, proposed by Quinlan. In C4.5 the missing values are not replaced in the data set; instead, the impurity function takes the missing values into account by penalizing the impurity score with the ratio of missing values [8].
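As a practical aside (this is scikit-learn behaviour, not CART or C4.5 itself): RandomForestClassifier has traditionally required missing values to be imputed beforehand, while the histogram-based gradient boosting estimators (scikit-learn 1.0 or later) handle NaN natively by learning a default split direction. A minimal sketch, with mean imputation chosen purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X[np.random.RandomState(0).rand(*X.shape) < 0.1] = np.nan  # knock out ~10% of values

# Random forest: impute first (mean imputation here, purely for illustration).
rf = make_pipeline(SimpleImputer(strategy="mean"),
                   RandomForestClassifier(n_estimators=200, random_state=0))
rf.fit(X, y)

# Histogram-based gradient boosting: NaNs are routed to a learned default branch.
gbm = HistGradientBoostingClassifier(random_state=0)
gbm.fit(X, y)
```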
References
[1] When would one use Random Forests over Gradient Boosted Machines?
[2] What are the advantages/disadvantages of using Gradient Boosting over Random Forests?
[3] Efficient top rank optimization with gradient boosting for supervised anomaly detection
[4] An Introduction to Random Forests for Multi-class Object Detection
[5] Using Random Forest for Reliable Classification and Cost-Sensitive Learning for Medical Diagnosis
[6] Random Forest: Features and Advantages
[7] Random Forest: Disadvantages
[8] Why doesn't Random Forest handle missing values in predictors?