Machine Learning Kaggle Competition Part Two: Improving

Feature engineering, feature selection, and model evaluation

Like most problems in life, a Kaggle competition can be approached in several ways:

  1. Lock yourself away from the outside world and work in isolation

I recommend against the “lone genius” path, not only because it’s exceedingly lonely, but also because you will miss out on the most important part of a Kaggle competition: learning from other data scientists. If you work by yourself, you end up relying on the same old methods while the rest of the world adopts more efficient and accurate techniques.

As a concrete example, until recently I was dependent on the random forest model, automatically applying it to any supervised machine learning task. This competition finally made me realize that although the random forest is a decent starting model, everyone else has moved on to the superior gradient boosting machine.
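To make the switch concrete, here is a minimal sketch of trading a random forest for a gradient boosting machine in scikit-learn; the dataset is a synthetic stand-in, not the competition data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a competition dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# The "same old method": a random forest with sensible defaults
rf = RandomForestClassifier(n_estimators=100, random_state=42)
print("Random forest CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())

# The gradient boosting machine most competitors have moved on to
gbm = GradientBoostingClassifier(n_estimators=100, random_state=42)
print("GBM CV accuracy:", cross_val_score(gbm, X, y, cv=5).mean())
```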

The other extreme approach is also limiting:

  2. Copy one of the leaders’ scripts (called “kernels” on Kaggle), run it, and shoot up the leaderboard without writing a single line of code
Read More

Automated Feature Engineering In Python

How to automatically create machine learning features

Machine learning is increasingly moving from hand-designed models to automatically optimized pipelines using tools such as H2O, TPOT, and auto-sklearn. These libraries, along with methods such as random search, aim to simplify the model selection and tuning parts of machine learning by finding the best model for a dataset with little to no manual intervention. However, feature engineering, an arguably more valuable aspect of the machine learning pipeline, remains almost entirely human labor.

Feature engineering, also known as feature creation, is the process of constructing new features from existing data to train a machine learning model. This step can be more important than the actual model used because a machine learning algorithm only learns from the data we give it, and creating features that are relevant to a task is absolutely crucial (see the excellent paper “A Few Useful Things to Know about Machine Learning”).

Typically, feature engineering is a drawn-out manual process, relying on domain knowledge, intuition, and data manipulation. This process can be extremely tedious and the final features will be limited both by human subjectivity and time. Automated feature engineering aims to help the data scientist by automatically creating many candidate features out of a dataset from which the best can be selected and used for training.
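To see what the manual process looks like in practice, here is a small pandas sketch building hand-crafted aggregation features from a hypothetical table of client loans (the table and column names are made up for illustration):

```python
import pandas as pd

# Hypothetical table of loans, several rows per client
loans = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 2],
    "loan_amount": [5000, 12000, 3000, 8000, 4000],
    "rate": [3.5, 4.1, 7.2, 5.0, 6.3],
})

# Each aggregation below is a choice a human has to think up and justify
features = loans.groupby("client_id").agg(
    total_loan_amount=("loan_amount", "sum"),
    mean_rate=("rate", "mean"),
    n_loans=("loan_amount", "count"),
)
print(features)
```

Multiply this by dozens of tables and hundreds of candidate aggregations and the tedium becomes clear.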

In this article, we will walk through an example of using automated feature engineering with the featuretools Python library. We will use an example dataset to show the basics (stay tuned for future posts using real-world data). The complete code for this article is available on GitHub.
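As a preview, here is a minimal deep feature synthesis sketch with featuretools, reusing the hypothetical loans table from above; treat it as an illustration of the API rather than the worked example from the notebook:

```python
import featuretools as ft
import pandas as pd

# The same hypothetical loans table as in the manual example
loans = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 2],
    "loan_amount": [5000, 12000, 3000, 8000, 4000],
    "rate": [3.5, 4.1, 7.2, 5.0, 6.3],
})

# Register the table in an entity set and derive a clients entity from it
es = ft.EntitySet(id="clients")
es = es.entity_from_dataframe(entity_id="loans", dataframe=loans,
                              index="loan_id", make_index=True)
es = es.normalize_entity(base_entity_id="loans", new_entity_id="clients",
                         index="client_id")

# Deep feature synthesis: apply and stack aggregation primitives automatically
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="clients",
                                      agg_primitives=["sum", "mean", "count"])
print(feature_matrix.head())
```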

Read More

Machine Learning Kaggle Competition Part One: Getting Started

Learning the Kaggle Environment and an Introductory Notebook

In the field of data science, there are almost too many resources available: from DataCamp to Udacity to KDnuggets, there are thousands of places to learn data science online. However, if you are someone who likes to jump in and learn by doing, Kaggle might be the single best location for expanding your skills through hands-on data science projects.

While it was originally known as a place for machine learning competitions, Kaggle — which bills itself as “Your Home for Data Science” — now offers an array of data science resources. Although this series of articles will focus on a competition, it’s worth pointing out the main aspects of Kaggle:

  • Datasets: Tens of thousands of datasets of all different types and sizes that you can download and use for free. This is a great place to go if you are looking for interesting data to explore or to test your modeling skills.
  • Machine Learning Competitions: Once the heart of Kaggle, these tests of modeling skill are a great way to learn cutting-edge machine learning techniques and hone your abilities on interesting problems using real data.
  • Learn: A series of data science learning tracks, covering everything from SQL to deep learning, taught in Jupyter Notebooks.
  • Discussion: A place to ask questions and get advice from the thousands of data scientists in the Kaggle community.
  • Kernels: Online programming environments running on Kaggle’s servers where you can write Python/R scripts, or Jupyter Notebooks. These kernels are entirely free to run (you can even add a GPU) and are a great resource because you don’t have to worry about setting up a data science environment on your own computer. The kernels can be used to analyze any dataset, compete in machine learning competitions, or complete the learning tracks. You can copy and build on existing kernels from other users and share your kernels with the community for feedback.
Read More

Automated Machine Learning On The Cloud In Python

An introduction to the future of data science

Two trends have recently become apparent in data science:

  1. Data analysis and model training are done using cloud resources
  2. Machine learning pipelines are algorithmically developed and optimized

This article will give a brief introduction to these topics and show how to implement them, using Google Colaboratory to do automated machine learning on the cloud in Python.
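To give a flavor of the second trend, here is a minimal sketch of algorithmic pipeline optimization with the TPOT library; it runs the same in a Colaboratory notebook as on a local machine, and the digits dataset is just a convenient stand-in:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# Any tabular dataset works; digits is a convenient built-in stand-in
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# TPOT evolves entire pipelines: preprocessing, model, and hyperparameters
tpot = TPOTClassifier(generations=5, population_size=20,
                      verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print("Test accuracy:", tpot.score(X_test, y_test))

# Export the best pipeline found as plain scikit-learn code
tpot.export("best_pipeline.py")
```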

Read More

A Complete Machine Learning Walk-Through In Python: Part Three

Interpreting a machine learning model and presenting results

Machine learning models are often criticized as black boxes: we put data in one side, and get out answers — often very accurate answers — with no explanations on the other. In the third part of this series showing a complete machine learning solution, we will peer into the model we developed to try and understand how it makes predictions and what it can teach us about the problem. We will wrap up by discussing perhaps the most important part of a machine learning project: documenting our work and presenting results.
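As a first taste of what peering into a model can look like, here is a hedged sketch of pulling feature importances out of a tree-based scikit-learn model; the data and feature names are synthetic stand-ins, and the notebook itself goes well beyond this:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the project's data
X, y = make_regression(n_samples=500, n_features=8, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = GradientBoostingRegressor(random_state=42).fit(X, y)

# Rank features by how much each one reduces the loss across all splits
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```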

Part one of the series covered data cleaning, exploratory data analysis, feature engineering, and feature selection. Part two covered imputing missing values, implementing and comparing machine learning models, hyperparameter tuning using random search with cross validation, and evaluating a model.
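For reference, hyperparameter tuning with random search and cross validation, as recapped above, looks roughly like this in scikit-learn; the search space here is illustrative, not the one used in part two:

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data
X, y = make_regression(n_samples=500, n_features=8, random_state=42)

# Distributions to sample hyperparameter values from
param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(2, 6),
    "learning_rate": uniform(0.01, 0.2),
}

# Try 25 random combinations, scoring each with 4-fold cross validation
search = RandomizedSearchCV(GradientBoostingRegressor(random_state=42),
                            param_distributions, n_iter=25, cv=4,
                            random_state=42, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```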

All the code for this project is on GitHub. The third Jupyter Notebook, corresponding to this post, is here. I encourage anyone to share, use, and build on this code!

Read More