If you are familiar with machine learning methods, you must have heard about one of the strongest weapons in ML: XGBoost. It is a powerful tool with many advantages and optimizations. In this article, I will talk about just one advantage of XGBoost: it can deal with missing values naturally. But how?
Sparsity-Aware Split Finding
In the original paper, the authors came up with a brilliant idea called Sparsity-Aware Split Finding. The algorithm is shown below.
Let me explain how it works without using lots of mathematical symbols.
- When a node needs to split, the data are first divided into two groups: rows with missing values for the feature and rows without.
- Only the rows without missing values are used to enumerate candidate thresholds.
- All rows with missing values are tentatively sent to one side, and the Gain is computed. The statistics of the side containing the missing values do not need to be recomputed from scratch: they are obtained by subtracting the accumulated statistics of the non-missing rows from the parent's totals.
- Both directions (missing values left, missing values right) are scored, and the one with the higher Gain becomes the default direction for that split.
In short, XGBoost tries putting all rows with missing values on each side in turn, and keeps whichever direction yields the maximum Gain as the default direction for missing values.
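The steps above can be sketched in plain Python. This is a minimal illustration, not XGBoost's implementation: the function name `best_split_with_missing` is hypothetical, missing values are marked with `None`, and I use a simplified variance-reduction style score instead of XGBoost's gradient/Hessian-based Gain with regularization. The handling of missing values, however, follows the same idea.

```python
import math

def best_split_with_missing(x, y):
    """Find the best threshold for one feature, and decide which side
    the rows with missing values (None) should go by default.

    Sketch of Sparsity-Aware Split Finding with a simplified
    variance-reduction score in place of XGBoost's real Gain.
    """
    # Step 1: separate rows with and without missing feature values.
    present = [(xi, yi) for xi, yi in zip(x, y) if xi is not None]
    missing_y = [yi for xi, yi in zip(x, y) if xi is None]
    present.sort()

    def score(ys):
        # Squared-sum score: higher total score means lower squared error.
        return (sum(ys) ** 2) / len(ys) if ys else 0.0

    parent_score = score([yi for _, yi in present] + missing_y)
    best = (-math.inf, None, None)  # (gain, threshold, missing_goes_left)

    # Step 2: enumerate candidate thresholds using only non-missing rows.
    for i in range(1, len(present)):
        thr = (present[i - 1][0] + present[i][0]) / 2
        left_y = [yi for xi, yi in present if xi < thr]
        right_y = [yi for xi, yi in present if xi >= thr]

        # Step 3: try missing values on each side and keep the better Gain.
        for goes_left in (True, False):
            l = left_y + missing_y if goes_left else left_y
            r = right_y if goes_left else right_y + missing_y
            gain = score(l) + score(r) - parent_score
            if gain > best[0]:
                best = (gain, thr, goes_left)
    return best
```

For example, with `x = [1, 2, 3, None, None, 10, 11]` and targets `y = [0, 0, 0, 5, 5, 5, 5]`, the best threshold is 6.5 and the missing rows (whose targets look like the right group) are sent right by default.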