Machine Learning Kaggle Competition Part Two: Improving
Feature engineering, feature selection, and model evaluation
As with most problems in life, there are several potential approaches to a Kaggle competition:
- Lock yourself away from the outside world and work in isolation
I recommend against the “lone genius” path, not only because it’s exceedingly lonely, but also because you will miss out on the most important part of a Kaggle competition: learning from other data scientists. If you work by yourself, you end up relying on the same old methods while the rest of the world adopts more efficient and accurate techniques.
As a concrete example, I had recently grown dependent on the random forest model, automatically applying it to every supervised machine learning task. This competition finally made me realize that although the random forest is a decent starting model, everyone else has moved on to the superior gradient boosting machine. A quick comparison is sketched below.
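To make that comparison concrete, here is a minimal sketch using scikit-learn. The competition data isn't shown here, so a synthetic dataset from `make_classification` stands in, and the hyperparameters are illustrative defaults rather than anything tuned for a real leaderboard:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the competition data (assumption: any
# tabular binary-classification task would work here)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

# Score both models with 5-fold cross-validated ROC AUC
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: ROC AUC = {scores.mean():.4f} +/- {scores.std():.4f}")
```

On real competition data the gap between the two (and the value of a dedicated library such as LightGBM or XGBoost) tends to be larger than on a toy dataset like this, but the point is that swapping models under a fixed cross-validation setup is only a few lines of code.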
The other extreme approach is also limiting:
- Copy one of the leader’s scripts (called “kernels” on Kaggle), run it, and shoot up the leaderboard without writing a single line of code