The Most Important Part of a Data Science Project is Writing a Blog Post

Published on August 11, 2018

Categories: writing, data science

Writing creates opportunities, gives you critical communication practice, and makes you a better data scientist through feedback

It can be tempting to call a data science project complete after you’ve uploaded the final code to GitHub or handed in your assignment. However, if you stop there, you’re missing out on the most crucial step of the process: writing and sharing an article about your project. Writing a blog post isn’t typically considered part of the data science pipeline, but to get the most from your work, it should be the standard last step in any of your projects.

There are three benefits to writing even a simple blog post about your work:

Communication Practice: good code by itself is not enough. The best analysis will have no impact if you can’t make people care about the work.
Writing Creates Opportunities: by exposing your work to the world, you’ll be able to form connections that can lead to job offers, collaborations, and new project ideas.
Feedback: the cycle of getting better is: do work, share, listen to constructive criticism, improve work, repeat.

Writing is one of those activities (exercise and education also come to mind) that might have no payout in the short term but almost unlimited potential rewards in the long term. Personally, I make $0 from my blog posts, which together receive 10,000 daily views and each take 3–15 hours to write. Yet, I also wouldn’t have a full-time data science job were it not for my articles.

Moreover, I know the quality of my data science work is much higher, both because I intend to write about it and because I apply the feedback I’ve received, making the long-term return from writing decidedly positive.

Communication: Good Code is not enough

I know the feeling: you’ve put up some Jupyter Notebooks or scripts on GitHub and you want to stop and say “I’ve done the work, now I’ll let other people discover it.” While this might happen in an ideal world, in the real world, getting your projects noticed requires communicating your results.

It would be nice if the best work on GitHub automatically surfaced to the top, but in reality, it’s the work that is best communicated that has the greatest impact.

Think about the last time you found a project code repository on GitHub: if you’re like me, you read an interesting article about a project and then followed through to the code. People go from an article to the code because first they need a compelling reason to check out the code. That’s not meant to be cynical. It’s just the way things work: people aren’t going to dig into your analysis until they know what you did and why it’s important or interesting.

To give a real-world example, my Data Analysis repo is a collection of numerous data science projects, most written with very rough code. Yet, because I wrote a few articles about some of the projects, it has over 600 stars. While stars are not a great way to measure impact, it’s clear that people are using this code and finding value in it. In contrast, the other day when I stumbled on this repo for Bayesian Optimization of Combinatorial Structures (BOCS), which objectively has better code than anything I’ve written, I was shocked to see it had only 2 stars. Much like great ideas die in isolation, the best code will go unnoticed without compelling communication of the results.

An Analysis is Only as Valuable as the Explanation

The value of an analysis is proportional not to using the best algorithm or the most data, but rather to how well you can share the results with a wide audience. In 1854, John Snow helped slow a cholera epidemic in London using 578 data points, a public essay, and a dot map. Rather than hide away his results in a notebook and hope that people stumbled on them, he published his work and made it easily accessible.

John Snow’s dot map of the London cholera outbreak. (Source)

In the end, he was able to convince the town members to disable a water pump, thereby stopping the spread of cholera and achieving the objective of data science: make better real-world decisions using data.

Writing a blog post gives you practice in one of the most critical parts of data science: communicating your work to a wide audience. Well-written code and a thorough analysis are a good start, but to complete your project, you need to tie it into a compelling narrative. An article is the perfect medium to explain your results and make people care about all your hard work.

Opportunities: Writing Opens Doors

Although data science can be more objective in hiring than other fields, getting a job is still mostly about who you know — or who knows of you — rather than what you know. The whole point of going to college (only a slight exaggeration here) is not to learn things you’ll use in your career, but to get to know people and make connections in your intended career field.

Fortunately, in data science at this point, going to college is helpful but not a necessity. With the ability to reach thousands of people online through a blog post, you can form those critical connections and open doors just through the act of writing and sharing, with no tuition required. When you write about your projects in a public forum, you can gain access to opportunities that don’t come just from turning in an assignment.

I went to college for mechanical engineering, and didn’t make a single connection (let alone learn any useful skills) in data science at school. However, I did start writing in my last semester, and as a result, was able to form numerous relationships with potential employers, collaborators, and even book editors (the answer is eventually) that have been immensely helpful as I navigate the start of a data science career.

Going back to the first point, my code is nowhere near as good as many other data scientists’, but I’ve been fortunate to get opportunities because I’m able to make my work accessible.

I have never been contacted by someone who found me solely through GitHub, but I’ve been contacted hundreds of times by people who read my articles.

While my employer — Feature Labs — did find my GitHub work, it wasn’t by searching for “great data science analysis” on GitHub. Rather, it was through an article I’d written that walked through a project and summarized the conclusions. Remember, it’s not code to article, it’s article to code.

A blog post is a great medium for building important connections because it shows that (1) you’re doing good data science work and (2) you care about sharing it and teaching it to others. Excessive enthusiasm for data science is not a prerequisite for a job, but showing that you are interested in the field and in learning will help attract employers, especially if you are just starting out and don’t have much experience. Furthermore, well-written blog posts can have a long shelf life, giving you a portfolio for potentially years to come.

There isn’t yet an established path to a data science job, which means that we all get to forge our own. Writing and sharing with the community can help you form all-important connections and gain a foothold in the field.

Because data science is a new field, there are rarely any standard answers. The best way to learn is to try something out, make a mistake, and learn from that experience. Putting your work out in a public venue means you can get feedback from thousands of data scientists with thousands of years of collective experience. That’s the benefit of being part of a community: together, we know more than any one person ever could, and by being a contributing member of that community, you can take advantage of that knowledge by using feedback to improve your own work.

Dealing with feedback on the Internet can be tough, but I’ve found the data science community, and in particular, Towards Data Science on Medium, to be extremely civil. My strategy for dealing with comments is:

Positive Comments: acknowledge with a thanks
Constructive Criticism: write down the comment, fix any parts of the current analysis that can be fixed, and practice implementing the recommendation whenever possible in future projects
Non-constructive criticism: ignore

Cycle of Improvement

Unfortunately, we don’t review our own work as often as we should. Fortunately, we can share it with the world and have other people review it. These other people are probably more honest about our work than we would have been, so we get a more objective assessment by sharing.

The most valuable part of a class is never the content; it’s the feedback you get from professors on your assignments. Fortunately, you can get that feedback without taking any classes by publicly sharing your projects with the data science community in a blog post.

Although school teaches us to be failure-averse, it’s only by repeatedly failing and then improving as a result that we get any better. Unequivocally, I’m a better writer and data scientist because I’ve put my work out for criticism and listened to the feedback.

What to Do?

Right now, you probably have one or a dozen Jupyter notebooks that would make great articles! Take an hour or two to write up one of these and put it out into the world. It doesn’t have to be perfect: as long as you have done the data science work, people will respect your article.

If you struggle with releasing anything that’s not perfect (one of my largest problems), then set a time limit, say 60 minutes, and release whatever you get done in that time. I’ve had to do this a couple of times, and it’s made my resulting work more to the point and more effective.

Right now, take one of your Jupyter notebooks, and write an article. Put it out on Medium and then let the community see your work. Although the rewards are not instantaneous, over time the benefits will accrue:

You’ll get better at the crucial task of communication
Opportunities/connections will open up
Your data science and writing will improve as you build on constructive criticism

Keep working on your data science projects, but don’t stop at the moment the code goes up on GitHub or is turned in. Take that final step and write an article. Your future self will thank you!

As always, I welcome feedback, constructive criticism, and hearing about your data science projects. I can be reached on Twitter @koehrsen_will or on my personal website at willk.online.