Practical Advice for Data Science Writing

Useful tips for writing about your data science projects

Writing is something that everyone wants to do more of, yet we often find it difficult to get started. We know that writing about data science projects improves our communication abilities, opens doors, and makes us better data scientists, but we often struggle with thoughts that our writing isn’t good enough or that we don’t have the necessary background or education.

I’ve struggled with these feelings myself, and, over the past year, have developed a mindset to get through these barriers as well as general principles about data science writing. While there is no one secret to writing, there are practical tips that make it easier to establish a productive writing habit:

  1. Aim for 90%: the imperfect project that gets finished is better than the perfect project you never complete
  2. Consistency helps: the more you write, the easier it gets
  3. Don’t worry about credentials: in data science there are no barriers to prevent you from contributing or learning anything you want
  4. The best tool is the one that gets the job done: don’t over-optimize your writing software, blogging platform, or development environment
  5. Read widely and deeply: borrow, remix, and improve on others’ ideas

In this article, we’ll go through each of these points briefly, and I’ll touch on ways in which I’ve implemented them to improve my writing. Over the course of dozens of articles, I’ve made lots of mistakes, and, rather than making these same errors yourself, you can learn from my experiences.

Read More

An Implementation and Explanation of the Random Forest in Python

A guide for using and understanding the random forest by building up from a single decision tree.

Fortunately, with libraries such as Scikit-Learn, it’s now easy to implement hundreds of machine learning algorithms in Python. It’s so easy that we often don’t need any underlying knowledge of how the model works in order to use it. While knowing all the details is not necessary, it’s still helpful to have an idea of how a machine learning model works under the hood. This lets us diagnose the model when it’s underperforming or explain how it makes decisions, which is crucial if we want to convince others to trust our models.

In this article, we’ll look at how to build and use the Random Forest in Python. In addition to seeing the code, we’ll try to get an understanding of how this model works. Because a random forest is made of many decision trees, we’ll start by understanding how a single decision tree makes classifications on a simple problem. Then, we’ll work our way to using a random forest on a real-world data science problem. The complete code for this article is available as a Jupyter Notebook on GitHub.
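
To make the "forest built from single trees" idea concrete, here is a minimal sketch in plain Python. It is not the article's notebook code and not Scikit-Learn's implementation. It assumes the simplest possible tree (a depth-one "stump" with a single split), a made-up toy dataset, and hypothetical function names (`fit_stump`, `fit_forest`, and so on) chosen for this illustration. Each stump is trained on a bootstrap sample of the rows and a random subset of the features, and the forest predicts by majority vote.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def fit_stump(rows, labels, features):
    """Fit a depth-one tree: the single (feature, threshold) split with the
    lowest weighted Gini impurity among the candidate features."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    if best is None:
        # No valid split exists: always predict the majority label.
        label = majority(labels)
        return (features[0], float("inf"), label, label)
    return best[1:]


def predict_stump(stump, row):
    """Send the row left or right of the threshold and return that side's label."""
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample of the rows and a
    random subset of about sqrt(n_features) features."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    k = max(1, int(n_features ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample rows with replacement
        features = rng.sample(range(n_features), k)     # random feature subset
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], features)
        )
    return forest


def predict_forest(forest, row):
    """Majority vote over the predictions of every tree in the forest."""
    return majority([predict_stump(stump, row) for stump in forest])


# Toy data (an assumption for this sketch): class 0 has small values, class 1 large.
rows = [[1, 2], [2, 1], [3, 3], [2, 2], [7, 8], [8, 7], [9, 9], [8, 8]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(rows, labels)
print(predict_forest(forest, [0, 0]))
print(predict_forest(forest, [10, 10]))
```

Real random forests grow much deeper trees and draw a fresh feature subset at every split. With stumps there is only one split per tree, so the two choices coincide, which keeps the sketch short.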

Read More

How to Put Fully Interactive, Runnable Code in a Medium Post

Code is meant to be interactive.

  1. Head to repl.it
  2. Write some code in your favorite language.
  3. Copy and paste the URL for your repl into a Medium post.
  4. Publish the post to make the code interactive.

Interactive code. In a Medium article! (from repl.it).

Feel free to play around with the code above. This is a simple example, but you can create complex scripts for readers to run in many languages.

Read More

A Data Science for Good Machine Learning Project Walk-Through in Python: Part Two

Getting the most from our model, figuring out what it all means, and experimenting with new techniques

Machine learning is a powerful framework that from the outside may look complex and intimidating. However, once we break down a problem into its component steps, we see that machine learning is really only a sequence of understandable processes, each one simple by itself.

In the first half of this series, we saw how we could implement a solution to a “data science for good” machine learning problem, leaving off after we had selected the Gradient Boosting Machine as our model of choice.

Model evaluation results from part one.

In this article, we’ll continue with our pipeline for predicting poverty in Costa Rica by optimizing the model, interpreting its predictions, and trying out some experimental techniques.

The full code is available as a Jupyter Notebook both on Kaggle (where it can be run in the browser with no downloads required) and on GitHub. This is an active Kaggle competition and a great project to get started with machine learning or to work on some new skills.

Read More

A Data Science for Good Machine Learning Project Walk-Through in Python: Part One

Solving a complete machine learning problem for societal benefit

Data science is an immensely powerful tool in our data-driven world. Call me idealistic, but I believe this tool should be used for more than getting people to click on ads or spend more time consumed by social media.

In this article and the sequel, we’ll walk through a complete machine learning project on a “Data Science for Good” problem: predicting household poverty in Costa Rica. Not only do we get to improve our data science skills in the most effective manner — through practice on real-world data — but we also get the reward of working on a problem with social benefits.

It turns out the same skills used by companies to maximize ad views can also be used to help relieve human suffering.

The full code is available as a Jupyter Notebook both on Kaggle (where it can be run in the browser with no downloads required) and on GitHub. This is an active Kaggle competition and a great project to get started with machine learning or to work on some new skills.

Read More