How to Visualize a Decision Tree from a Random Forest in Python Using Scikit-Learn

Published on August 18, 2018

Categories: useful tips, visualization

A helpful utility for understanding your model

Here’s the complete code: just copy and paste into a Jupyter Notebook or Python script, replace with your data and run:

Code to visualize a decision tree and save as png ([on GitHub here](https://gist.github.com/WillKoehrsen/ff77f5f308362819805a3defd9495ffd)).
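A minimal sketch of what such a script might look like, assuming the iris dataset and scikit-learn's `export_graphviz`. The dataset, hyperparameters, and file names `tree.dot` and `tree.png` are illustrative choices, and the linked gist remains the reference version. Converting to PNG relies on the Graphviz `dot` program, so that step is skipped if it is not installed.

```python
import shutil
import subprocess

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Train a small random forest (illustrative settings; replace with your data).
iris = load_iris()
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(iris.data, iris.target)

# A random forest is a collection of decision trees; pick one to draw.
estimator = model.estimators_[0]

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_source = export_graphviz(
    estimator,
    out_file=None,
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
    rounded=True,
    filled=True,
    precision=2,
)

with open("tree.dot", "w") as f:
    f.write(dot_source)

# Convert to PNG only if the Graphviz `dot` executable is on the PATH.
if shutil.which("dot"):
    subprocess.run(["dot", "-Tpng", "tree.dot", "-o", "tree.png"], check=True)
```

In a Jupyter Notebook, the resulting `tree.png` could then be displayed with `IPython.display.Image`.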

The final result is a complete decision tree as an image.

Decision Tree for Iris Dataset

Parallelizing Feature Engineering with Dask

Published on August 16, 2018

Categories: parallelization, feature engineering, Dask

How to scale Featuretools using parallel processing

When a computation is prohibitively slow, the most important question to ask is: “What is the bottleneck?” Once you know the answer, the logical next step is to figure out how to get around that bottleneck.

Often, as we’ll see, the bottleneck is that we aren’t taking full advantage of our hardware resources, for example, running a calculation on only one core when our computer has eight. Simply getting a bigger machine — in terms of RAM or cores — will not solve the problem if our code isn’t written to use all our resources. The solution therefore is to rewrite the code to utilize whatever hardware we do have as efficiently as possible.

In this article, we’ll see how to refactor our automated feature engineering code to run in parallel on all our laptop’s cores, in the process reducing computation time by over 8x. We’ll make use of two open-source libraries — Featuretools for automated feature engineering and Dask for parallel processing — and solve a problem with a real-world dataset.

We’ll combine two important technologies: automated feature engineering in Featuretools and parallel computation in Dask.

Our exact solution is specific to this problem, but the general approach we develop can be used to scale your own computations to larger datasets.
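The general idea, splitting the data into independent partitions and computing features for each one in parallel, can be sketched with `dask.delayed`. This is a toy illustration under stated assumptions, not the article's actual Featuretools pipeline: `partition_features`, `build_features`, and the row layout are hypothetical names made up for the example.

```python
import dask
from dask import delayed


def partition_features(rows):
    """Toy stand-in for feature engineering on one partition (hypothetical)."""
    amounts = [row["amount"] for row in rows]
    return {
        "customer": rows[0]["customer"],
        "total": sum(amounts),
        "mean": sum(amounts) / len(amounts),
    }


def build_features(partitions, scheduler="threads"):
    """Build one lazy task per partition, then run them all in parallel."""
    tasks = [delayed(partition_features)(part) for part in partitions]
    # dask.compute returns a tuple with one result per task, in order.
    return list(dask.compute(*tasks, scheduler=scheduler))


partitions = [
    [{"customer": "a", "amount": 1.0}, {"customer": "a", "amount": 3.0}],
    [{"customer": "b", "amount": 5.0}],
]
features = build_features(partitions)
```

The threaded scheduler keeps the sketch simple. For CPU-bound pure-Python work, passing `scheduler="processes"` is usually what spreads the load across cores.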

The Most Important Part of a Data Science Project is Writing a Blog Post

Published on August 11, 2018

Categories: writing, data science

Writing creates opportunities, gives you critical communication practice, and makes you a better data scientist through feedback

It can be tempting to call a data science project complete after you’ve uploaded the final code to GitHub or handed in your assignment. However, if you stop there, you’re missing out on the most crucial step of the process: writing and sharing an article about your project. Writing a blog post isn’t typically considered part of the data science pipeline, but to get the most from your work, it should be the standard last step in any of your projects.

There are three benefits to writing even a simple blog post about your work:

Communication Practice: good code by itself is not enough. The best analysis will have no impact if you can’t make people care about the work.
Writing Creates Opportunities: by exposing your work to the world, you’ll be able to form connections that can lead to job offers, collaborations, and new project ideas.
Feedback: the cycle of getting better is: do work, share, listen to constructive criticism, improve work, repeat.

Writing is one of those activities — exercise and education also come to mind — that might have no payout in the short term but almost unlimited potential rewards in the long term. Personally, I make $0 from the 10,000 daily views my blog posts receive, each of which takes 3–15 hours to write. Yet, I also wouldn’t have a full-time data science job were it not for my articles.

Moreover, I know the quality of my data science work is much higher, both because I intend to write about it and because I have used the feedback I’ve received, making the long-term return from writing decidedly positive.

Why Automated Feature Engineering Will Change the Way You Do Machine Learning

Published on August 9, 2018

Categories: feature engineering, machine learning

Automated feature engineering will save you time, build better predictive models, create meaningful features, and prevent data leakage

There are few certainties in data science — libraries, tools, and algorithms constantly change as better methods are developed. However, one trend that is not going away is the move towards increased levels of automation.

Recent years have seen progress in automating model selection and hyperparameter tuning, but the most important aspect of the machine learning pipeline, feature engineering, has largely been neglected. The most capable entry in this critical field is Featuretools, an open-source Python library. In this article, we’ll use this library to see how automated feature engineering will change the way you do machine learning for the better.

Featuretools is an open-source Python library for automated feature engineering.

Automated feature engineering is a relatively new technique, but, after using it to solve a number of data science problems using real-world data sets, I’m convinced it should be a standard part of any machine learning workflow. Here we’ll take a look at the results and conclusions from two of these projects with the full code available as Jupyter Notebooks on GitHub.

How to Get the Right Data? Why Not Ask for It?

Published on July 26, 2018

Categories: data, learning

An example of why the most important skills in data science may not be technical

While the technical skills of data science — think modeling with a gradient boosting machine — get most of the attention, other equally important, general-purpose problem-solving abilities can be overlooked. Proficiency in asking the right question, being persistent, and taking advantage of multiple resources are critical to the success of a data science project but often take a back seat to coding ability when people ask what it takes to be a data scientist.

Recently, I was reminded of the importance of these non-technical skills while working on a data-science-for-good project. The project, currently live on Kaggle, involves identifying schools in New York City that would most benefit from programs that encourage disadvantaged students to take the Specialized High Schools Admissions Test (SHSAT). This task comes with a small data set including test results from 2016, but the organizers encourage the use of any publicly available data.

Data Science is for more than just getting people to click on ads (Get Started Here)