Thoughts On The Two Cultures Of Statistical Modeling

(Source)

Accuracy beats interpretability, and other takeaways from “Statistical Modeling: The Two Cultures” by Leo Breiman

In the paper “Statistical Modeling: The Two Cultures”, Leo Breiman — developer of random forests as well as bagging and boosted ensembles — describes two contrasting approaches to modeling in statistics:

  1. Data Modeling: choose a simple (often linear) model based on intuition about the data-generating mechanism. The emphasis is on model interpretability, and validation, if done at all, is through goodness-of-fit tests.
  2. Algorithmic Modeling: choose whichever model achieves the highest predictive accuracy on validation data, with no consideration for model explainability.

At the time of writing in 2001, Breiman estimated that 98% of statisticians were in the data modeling culture while 2% (himself included) were in the algorithmic modeling culture. The paper is written as a call to arms for statisticians to stop relying solely on data modeling — which leads to “misleading conclusions” and “irrelevant theory” — and embrace algorithmic modeling to solve novel real-world problems arising from massive data sets. Breiman was an academic, working as a statistician at Berkeley for 21 years, but he had previously spent 13 years as a freelance consultant, giving him a well-formed perspective on how statistics can be useful in industry.
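The contrast between the two cultures can be sketched concretely. The toy dataset and model choices below are my own assumptions for illustration, not examples from Breiman's paper: a simple, interpretable linear model versus a random forest judged purely on held-out accuracy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical nonlinear data-generating process (an assumption for
# illustration, not from the paper)
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.1, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Data modeling culture: a simple, interpretable linear model
linear = LinearRegression().fit(X_train, y_train)

# Algorithmic modeling culture: a black-box model chosen for
# predictive accuracy on held-out data
forest = RandomForestRegressor(random_state=0).fit(X_train, y_train)

print(linear.score(X_test, y_test), forest.score(X_test, y_test))
```

On data like this, the linear model's goodness-of-fit is poor while the forest's held-out accuracy is high, which is exactly the trade Breiman argues statisticians were refusing to make.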

Read More

How To Avoid Common Difficulties In Your Data Science Programming Environment

(Source)

Reduce the incidental issues in your programming environment so you can focus on the important data science problems.

Consider the following situation: you’re trying to practice your soccer skills, but each time you take to the field, you encounter some problems: your shoes are on the wrong feet, the laces aren’t tied correctly, your socks are too short, your shorts are too long, and the ball is the wrong size. This is a ridiculous situation, but it’s analogous to the one many data scientists find themselves in, due to a few common, easily solvable issues:

  • Failure to manage library dependencies
  • Inconsistent code style
  • Inconsistent naming conventions
  • Different development environments across a team
  • Not using an integrated development environment for code editing

All of these mistakes “trip” you up, costing you time and valuable mental resources spent worrying about small details. Instead of solving data science problems, you find yourself struggling with incidental difficulties: setting up your environment or getting your code to run. Fortunately, the above issues are simple to fix with the right tooling and approach. In this article, we’ll look at best practices for a data science programming environment that will give you more time and concentration for working on the problems that matter.

Read More

Data Scientists: Your Variable Names Are Awful. Here’s How To Fix Them

Wading your way through data science code is like hacking through a jungle. (Source)

A Simple Way to Greatly Improve Code Quality

Quick, what does the following code do?


for i in range(n):
    for j in range(m):
        for k in range(l):
            temp_value = X[i][j][k] * 12.5
            new_array[i][j][k] = temp_value + 150

It’s impossible to tell, right? If you were trying to modify or debug this code, you’d be at a loss unless you could read the author’s mind. Even if you were the author, a few days after writing this code you wouldn’t know what it does because of the unhelpful variable names and the use of “magic” numbers.

Working with data science code, I often see examples like the above (or worse): code with variable names such as X, y, xs, x1, x2, tp, tn, clf, reg, xi, yi, ii, and numerous unnamed constant values. To put it frankly, data scientists (myself included) are terrible at naming variables, when we go to the trouble of naming them at all.

As I’ve grown from writing research-oriented data science code for one-off analyses to production-level code (at Cortex Building Intel), I’ve had to improve my programming by unlearning practices from data science books, courses, and the lab. There are many differences between deployable machine learning code and the way data scientists are taught to program, but we’ll start here by focusing on two common problems with a large impact:

  • Unhelpful/confusing/vague variable names
  • Unnamed “magic” constant numbers

Both these problems contribute to the disconnect between data science research (or Kaggle projects) and production machine learning systems. Yes, you can get away with them in a Jupyter Notebook that runs once, but when you have mission-critical machine-learning pipelines running hundreds of times per day with no errors, you have to write readable and understandable code. Fortunately, there are best practices from software engineering we data scientists can adopt to this end, including the ones we’ll cover in this article.
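As a sketch of the fix, the unreadable triple loop from earlier can be rewritten with descriptive names and named constants. The names and the meaning of the two constants (a scale and an offset) are my own assumptions made for illustration, since the original snippet deliberately gives no clues:

```python
import numpy as np

# Named constants replace the "magic" numbers 12.5 and 150; what they
# actually represent is a hypothetical reading of the original code
SCALE_FACTOR = 12.5
OFFSET = 150

# Small array standing in for the original X; dimensions are arbitrary
n_rows, n_cols, n_channels = 2, 3, 4
raw_values = np.ones((n_rows, n_cols, n_channels))

# Same arithmetic as the three nested loops, but vectorized and with
# names that say what is happening: scale each value, then shift it
scaled_values = raw_values * SCALE_FACTOR + OFFSET
```

The point is not the vectorization but the names: a reader (including future you) can now tell at a glance what the transformation does.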

Read More

Notes On Software Construction From Code Complete

(Source)

Lessons from “Code Complete: A Practical Handbook of Software Construction” with applications to data science

When people ask about the hardest part of my job as a data scientist, they often expect me to say building machine learning models. Given that all of our ML modeling is done in about 3 lines:


from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(training_features, training_targets)

predictions = model.predict(testing_features)

I reply that machine learning is one of the easier parts of the job. Rather, the hardest part of being a data scientist in industry is the software engineering required to build the infrastructure that goes into running machine learning models continuously in production.
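A self-contained version of that snippet might look like the following. The choice of LinearRegression and the synthetic features and targets are assumptions for illustration; the original shorthand leaves the model and data unspecified:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for real features and targets
rng = np.random.default_rng(0)
training_features = rng.normal(size=(100, 3))
training_targets = training_features @ np.array([1.0, -2.0, 0.5]) + 3.0
testing_features = rng.normal(size=(10, 3))

# The "three lines" of modeling: instantiate, fit, predict
model = LinearRegression()
model.fit(training_features, training_targets)
predictions = model.predict(testing_features)
```

Everything around those three lines — getting the data in, validating it, serving the predictions, monitoring the pipeline — is the software engineering the rest of this piece is about.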

Starting out at Cortex Building Intel, I could write a good Jupyter Notebook for a one-time machine learning project, but I had no idea what it meant to “run machine learning in production”, let alone how to do it. Half a year in, having built several ML systems making predictions around the clock to help engineers run buildings more efficiently, I’ve learned it takes a whole lot of software construction and a tiny bit of data science. Moreover, while there are not yet standard practices in data science, there are time-tested best practices for writing software that can help you be more effective as a programmer.

With a relative lack of software engineering skills entering my job, I’ve had to learn quickly. Much of that came from interacting with other software engineers and soaking up their knowledge, but some of it has also come from resources such as textbooks and online tutorials. One of those textbooks is the 900-page masterwork on constructing quality software, Code Complete: A Practical Handbook of Software Construction by Steve McConnell. In this article, I wanted to outline the high-level points regarding software construction I took away from reading this book. These are as follows:

  1. Thoroughly plan your project before touching a keyboard
  2. Write readable code because it’s read more than it’s written
  3. Reduce the complexity of your programs to free mental capacity
  4. Test and review every line of code in a program
  5. Be an egoless programmer
  6. Iterate on your designs and repeatedly measure progress
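
As one small illustration of point 3, a hypothetical function (my own example, not one from the book) becomes easier to follow when nested conditionals are replaced with guard clauses, a standard complexity-reducing refactor:

```python
def process_reading_nested(reading):
    # Deeply nested version: every branch adds mental bookkeeping
    if reading is not None:
        if reading >= 0:
            if reading <= 100:
                return reading / 100
    return None


def process_reading_flat(reading):
    # Guard clauses handle the invalid cases up front, leaving the
    # happy path unindented and easy to read
    if reading is None:
        return None
    if reading < 0 or reading > 100:
        return None
    return reading / 100
```

Both functions compute the same thing; the second simply demands less working memory from the reader, which is the kind of freed mental capacity McConnell is after.
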

Read More

Master’s In Computer Science At Georgia Tech: Personal Statement

(Source)

Why I’m pursuing an advanced degree in computer science

Author’s Note: this is my personal statement for application to Georgia Tech’s Online Master’s in Computer Science (OMSCS). This degree, ranked 8th in the country for computer science, is the best deal in graduate education (at least in the United States), coming in at under $7,000, compared to over $70,000 for degrees at lower-ranked institutions. It’s designed for working professionals, which means I’ll be working full-time at Cortex Building Intel while I pursue the degree. While this is still a work in progress and I haven’t yet been accepted, I thought I’d share it, and any feedback is much appreciated. If I get in, I’m very much looking forward to continuing my education, which is a must for anyone in the field of data science!

Update July 12, 2019: I have been accepted into the program. I will be attending the Online Master’s in Computer Science starting January 2020.

Read More