Feature Engineering: What Powers Machine Learning


How to Extract Features from Raw Data for Machine Learning

This is the third in a four-part series on how we approach machine learning at Feature Labs. The complete set of articles is:

  1. Overview: A General-Purpose Framework for Machine Learning
  2. Prediction Engineering: How to Set Up Your Machine Learning Problem
  3. Feature Engineering (this article)
  4. Modeling: Teaching an Algorithm to Make Predictions

These articles cover the concepts and a full implementation as applied to predicting customer churn. The project Jupyter Notebooks are all available on GitHub. (Full disclosure: I work for Feature Labs, a startup developing tooling, including Featuretools, for solving problems with machine learning. All of the work documented here was completed with open-source tools and data.)


Feature Engineering

It’s often said that “data is the fuel of machine learning.” This isn’t quite true: data is more like the crude oil of machine learning, which means it has to be refined into features (predictor variables) before it’s useful for training a model. Without relevant features, you can’t train an accurate model, no matter how complex the machine learning algorithm. The process of extracting features from a raw dataset is called feature engineering.
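As a toy illustration of the refining step (hypothetical field names, not the churn dataset used in the series), raw transaction records can be aggregated into per-customer predictor variables:

```python
from collections import defaultdict
from statistics import mean

# Raw data: one record per transaction (hypothetical example).
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 10.0},
]

def make_features(rows):
    """Aggregate raw transactions into per-customer features."""
    by_customer = defaultdict(list)
    for row in rows:
        by_customer[row["customer_id"]].append(row["amount"])
    return {
        cid: {
            "num_transactions": len(amounts),
            "total_spent": sum(amounts),
            "mean_amount": mean(amounts),
        }
        for cid, amounts in by_customer.items()
    }

features = make_features(transactions)
print(features[1])  # {'num_transactions': 2, 'total_spent': 55.0, 'mean_amount': 27.5}
```

In practice this kind of aggregation is what Pandas `groupby` or Featuretools primitives automate, but the idea is the same: many raw rows become one row of features per entity.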

Read More

Prediction Engineering: How to Set Up Your Machine Learning Problem


An explanation and implementation of the first step in solving problems with machine learning

This is the second in a four-part series on how we approach machine learning at Feature Labs. The other articles can be found below:

  1. Overview: A General-Purpose Framework for Machine Learning
  2. Feature Engineering: What Powers Machine Learning (coming soon)
  3. Modeling: Teaching an Algorithm to Make Predictions (coming soon)

These articles will cover the concepts and a full implementation as applied to predicting customer churn. The project Jupyter Notebooks are all available on GitHub. (Full disclosure: I work for Feature Labs, a startup developing tooling, including Featuretools, for solving problems with machine learning. All of the work documented here was completed with open-source tools and data.)

Read More

How to Create Value with Machine Learning


A General-Purpose Framework for Defining and Solving Meaningful Problems in 3 Steps

Imagine the following scenario: your boss asks you to build a machine learning model that predicts, every month, which customers of your subscription service will churn during that month, with churn defined as no active membership for more than 31 days. You painstakingly make labels by finding historical examples of churn, brainstorm and engineer features by hand, and then train and manually tune a machine learning model to make predictions.

Pleased with the metrics on the holdout testing set, you return to your boss with the results, only to be told now you must develop a different solution: one that makes predictions every two weeks with churn defined as 14 days of inactivity. Dismayed, you realize none of your previous work can be reused because it was designed for a single prediction problem.

You wrote a labeling function for a narrow definition of churn, and the downstream steps in the pipeline, feature engineering and modeling, were also dependent on the initial parameters, so they will have to be redone. Because a specific set of values was hard-coded, you’ll have to build an entirely new pipeline to address what is only a small change in the problem definition.
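The alternative is to parameterize the problem definition instead of hard-coding it. A minimal sketch of a reusable labeling function (hypothetical data layout, not the series’ actual labeling code):

```python
from datetime import date, timedelta

def label_churn(last_active_by_customer, cutoff, inactive_days):
    """Label each customer 1 (churned) if, as of `cutoff`, they have
    been inactive for more than `inactive_days`; otherwise 0."""
    threshold = timedelta(days=inactive_days)
    return {
        cid: int(cutoff - last_active > threshold)
        for cid, last_active in last_active_by_customer.items()
    }

# Hypothetical last-active dates for two customers.
last_active = {"a": date(2018, 11, 1), "b": date(2018, 9, 1)}

# The same function serves both problem definitions: monthly / 31-day churn...
monthly = label_churn(last_active, cutoff=date(2018, 12, 1), inactive_days=31)
# ...and biweekly / 14-day churn, with no pipeline rewrite.
biweekly = label_churn(last_active, cutoff=date(2018, 11, 14), inactive_days=14)
print(monthly, biweekly)
```

When the boss changes the definition, only the arguments change; the labeling logic, and everything downstream that consumes its output, stays intact.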


Structuring the Machine Learning Process

This situation is indicative of how solving problems with machine learning is currently approached: the process is ad hoc and requires a custom solution for each set of parameters, even when using the same data. As a result, companies miss out on the full benefits of machine learning because they are limited to solving a small number of problems with a time-intensive approach.

A lack of standardized methodology means there is no scaffolding for solving problems with machine learning that can be quickly adapted and deployed as parameters to a problem change.

How can we improve this process? Making machine learning more accessible will require a general-purpose framework for setting up and solving problems. This framework should accommodate existing tools, adapt rapidly to changing parameters, apply across different industries, and provide enough structure to give data scientists a clear path for laying out and working through meaningful problems with machine learning.

At Feature Labs, we’ve put a lot of thought into this issue and developed what we think is a better way to solve useful problems with machine learning. In the next three parts of this series, I’ll lay out how we approach framing and building machine learning solutions in a structured, repeatable manner built around the steps of prediction engineering, feature engineering, and modeling.

We’ll walk through the approach as applied in full to one use case — predicting customer churn — and see how we can adapt the solution if the parameters of the problem change. Moreover, we’ll be able to utilize existing tools — Pandas, Scikit-Learn, Featuretools — commonly used for machine learning.

The general machine learning framework is outlined below:

  1. Prediction Engineering: State the business need, translate into a machine learning problem, and generate labeled examples from a dataset
  2. Feature Engineering: Extract predictor variables — features — from the raw data for each of the labels
  3. Modeling: Train a machine learning model on the features, tune for the business need, and validate predictions before deploying to new data

A general-purpose framework for defining and solving meaningful problems with machine learning
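Put schematically, the three steps chain together so that only the problem parameters change between runs. The following pure-Python skeleton uses hypothetical names and trivially simple logic; it is a sketch of the structure, not the series’ Featuretools/Scikit-Learn implementation:

```python
def prediction_engineering(last_seen_day, cutoff, churn_days):
    """Step 1: turn the business need into labeled examples
    (here: 1 if days inactive at the cutoff exceed churn_days)."""
    return {cid: int(cutoff - seen > churn_days)
            for cid, seen in last_seen_day.items()}

def feature_engineering(usage_log):
    """Step 2: extract predictor variables from the raw data."""
    return {cid: {"num_events": len(events), "total": sum(events)}
            for cid, events in usage_log.items()}

def modeling(features, labels):
    """Step 3: train a (deliberately trivial) model; a real pipeline
    would fit a Scikit-Learn estimator here and validate it."""
    churn_rate = sum(labels.values()) / len(labels)
    return lambda feature_row: int(churn_rate >= 0.5)

# Toy data: integer day indices and per-customer event amounts (hypothetical).
last_seen_day = {"a": 90, "b": 10}
usage_log = {"a": [1.0], "b": [2.0, 3.0]}

labels = prediction_engineering(last_seen_day, cutoff=100, churn_days=31)
features = feature_engineering(usage_log)
model = modeling(features, labels)
print(labels, model(features["a"]))
```

Changing the prediction problem means re-running the same chain with new `cutoff` and `churn_days` arguments rather than rebuilding the pipeline.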

Read More

Recurrent Neural Networks by Example in Python



Using a Recurrent Neural Network to Write Patent Abstracts

The first time I attempted to study recurrent neural networks, I made the mistake of trying to learn the theory behind things like LSTMs and GRUs first. After several frustrating days looking at linear algebra equations, I happened on the following passage in Deep Learning with Python:

In summary, you don’t need to understand everything about the specific architecture of an LSTM cell; as a human, it shouldn’t be your job to understand it. Just keep in mind what the LSTM cell is meant to do: allow past information to be reinjected at a later time.

This was Francois Chollet, author of the Keras library and an expert in deep learning, telling me I didn’t need to understand everything at the foundational level! I realized that my mistake had been starting at the bottom, with the theory, instead of just trying to build a recurrent neural network.

Shortly thereafter, I switched tactics and decided to try the most effective way of learning a data science technique: find a problem and solve it!
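Chollet’s point, that you can work with recurrent layers while treating the cell as “past information reinjected at a later time,” is visible even in a bare-bones vanilla RNN step. This NumPy sketch shows only the recurrence (it is not an LSTM and not the patent-abstract model; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for a vanilla RNN cell (arbitrary choices).
input_dim, hidden_dim = 4, 3
W_xh = rng.normal(size=(input_dim, hidden_dim))   # input -> hidden
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden -> hidden: the "reinjection"
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """One recurrence step: the new hidden state mixes the current
    input with the previous state, so past information flows forward."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b)

# Run a short random sequence through the cell, carrying the state along.
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = rnn_step(x_t, h)

print(h.shape)  # (3,)
```

An LSTM replaces this single `tanh` update with gated updates that control what is kept and reinjected, but the carried-forward hidden state is the core idea in both.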

Read More

Biases and How to Overcome Them


We’re awful at viewing the world objectively. Data can help.

There’s a pervasive myth — perhaps taught to you by an economics course — that humans are rational. The traditional view is that we objectively analyze the world, draw accurate conclusions, and make decisions in our best interest. While few people completely buy into this argument anymore, we are still often unaware of our cognitive biases, with the result that we vote, spend money, and form opinions based on a distorted view of the world.

A recent personal experience where I badly misjudged reality due to cognitive illusions brought home this point and demonstrated the importance of fact-checking our views of the world. While this situation had no negative consequences, it was a great reminder that we are all subject to powerful biases and that personal opinions are no substitute for checking the data.

Read More