Data Visualization Hackathon Style

My effort to liberate data from spreadsheets

Everyone — corporations, governments, individuals — has data, but few people know how to use it effectively. Data can tell us much about how to make better decisions, but often this knowledge is hidden within the numbers. One problem is that most of the data looks something like this:

Although the information here, global CO2 emissions, is “open data” in the sense that it’s publicly available for anyone to download, it might as well be locked away for all the good it does anyone sitting in a spreadsheet. At its core, data science is about taking these meaningless pages of numbers and turning them into useful knowledge. One of the most effective ways of revealing the insights within the numbers is through data visualization.
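As a minimal sketch of what that transformation can look like in code (assuming a hypothetical file co2_emissions.csv with year and emissions columns, not the actual code or data from the project):

```python
# A minimal sketch: turn a spreadsheet of CO2 data into a chart.
# Assumes a hypothetical file "co2_emissions.csv" with "year" and
# "emissions" columns; this is not the project's actual code or data.
import pandas as pd
import matplotlib.pyplot as plt

# Read the raw numbers out of the spreadsheet
df = pd.read_csv("co2_emissions.csv")

# A single plot turns the wall of numbers into a readable trend
df.plot(x="year", y="emissions", legend=False)
plt.xlabel("Year")
plt.ylabel("Global CO2 emissions")
plt.show()
```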

Data from a spreadsheet turned into knowledge

For HackCWRU 2018, a hackathon hosted at Case Western Reserve University, I wanted to explore the public CO2 data and make it accessible to a general audience. For those who haven’t had the experience, a hackathon is where a bunch of passionate makers — coders, artists, hardware specialists, and occasionally data scientists — get together for a weekend to work on projects for 24 or 36 straight hours. Sometimes there are specific problems to solve, but in other cases, such as with HackCWRU, you are free to choose your team and project. With a limited amount of time to accomplish your goal, sleeping is generally discouraged!

Read More

Bayes Rule Applied

Using Bayesian Inference on a real-world problem

The fundamental idea of Bayesian inference is to become “less wrong” with more data. The process is straightforward: we have an initial belief, known as a prior, which we update as we gain additional information. Although we don’t think about it as Bayesian inference, we use this technique all the time. For example, we might initially think there is a 50% chance we will get a promotion at the end of the quarter. If we receive positive feedback from our manager, we adjust our estimate upwards, and conversely, we might decrease the probability if we make a mess with the coffee machine. As we continually gather information, we refine our estimate to get closer to the “true” answer.

Our intuitive actions are formalized in the simple yet powerful equation known as Bayes’ Rule:
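P(A|B) = P(B|A) × P(A) / P(B)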

We read the left side, called the posterior, as the conditional probability of event A given event B. On the right side, P(A) is our prior, the initial belief about the probability of event A; P(B|A) is the likelihood (also a conditional probability), which we derive from our data; and P(B) is a normalization constant that makes the posterior distribution sum to 1. In statistical language, the general form of Bayes’ Rule is: the posterior probability equals the likelihood times the prior, divided by the normalization constant. This short equation leads to the entire field of Bayesian inference, an effective method for reasoning about the world.
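To make this concrete, here is the promotion example worked through in a few lines of Python. The 0.5 prior comes from the example above; the two likelihood values are made up for illustration:

```python
# Bayes' Rule applied to the promotion example (hypothetical numbers).
p_promotion = 0.5                    # P(A): prior belief in a promotion
p_feedback_given_promotion = 0.9     # P(B|A): positive feedback if promoted
p_feedback_given_no_promotion = 0.4  # P(B|not A): positive feedback anyway

# P(B): the normalization constant, via the law of total probability
p_feedback = (p_feedback_given_promotion * p_promotion
              + p_feedback_given_no_promotion * (1 - p_promotion))

# Posterior: P(A|B) = P(B|A) * P(A) / P(B)
posterior = p_feedback_given_promotion * p_promotion / p_feedback
print(f"P(promotion | positive feedback) = {posterior:.2f}")  # about 0.69
```

One piece of good news from our manager moves the estimate from 0.50 to about 0.69; each new piece of information nudges the belief again.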

Read More

Markov Chain Monte Carlo In Python

A Complete Real-World Implementation

Over the past few months, I encountered one term again and again in the data science world: Markov Chain Monte Carlo. In my research lab, in podcasts, in articles: every time I heard the phrase, I would nod and think “that sounds pretty cool” while having only a vague idea of what anyone was talking about. Several times I tried to learn MCMC and Bayesian inference, but every time I started reading the books, I soon gave up. Exasperated, I turned to the best method for learning any new skill: apply it to a problem.

Using some of my sleep data I had been meaning to explore and a hands-on application-based book (Bayesian Methods for Hackers, available free online), I finally learned Markov Chain Monte Carlo through a real-world project. As usual, it was much easier (and more enjoyable) to understand the technical concepts when I applied them to a problem rather than reading them as abstract ideas on a page. This article walks through the introductory implementation of Markov Chain Monte Carlo in Python that finally taught me this powerful modeling and analysis tool.

The full code and data for this project are on GitHub. I encourage anyone to take a look and use it on their own data. This article focuses on applications and results, so many topics are covered only at a high level, but I have tried to provide links for those wanting to learn more!
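To give a flavor of the technique before you dive in, here is a minimal sketch of the Metropolis algorithm, the simplest MCMC method, written in plain NumPy on fake sleep data. This illustrates the general idea only; it is not the project’s actual implementation:

```python
# A generic Metropolis sampler (the simplest MCMC algorithm) in plain NumPy.
# Illustration only: fake data and a deliberately simple model.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=6.5, scale=1.0, size=100)  # fake "hours of sleep" data

def log_posterior(mu):
    # Flat prior on mu plus a Gaussian likelihood with known sigma = 1
    return -0.5 * np.sum((data - mu) ** 2)

mu_current = 0.0  # start from a deliberately bad guess
samples = []
for _ in range(10_000):
    mu_proposal = mu_current + rng.normal(scale=0.5)  # propose a random step
    # Accept the step with probability min(1, posterior ratio)
    if np.log(rng.uniform()) < log_posterior(mu_proposal) - log_posterior(mu_current):
        mu_current = mu_proposal
    samples.append(mu_current)

# Drop the burn-in; what remains approximates the posterior distribution of mu
posterior_samples = np.array(samples[1000:])
print(f"Posterior mean: {posterior_samples.mean():.2f}")  # close to 6.5
```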

Read More

The Multiple Comparisons Problem

How to avoid being fooled by randomness

The CEO of a major drug company has a problem. The new miracle drug developed by his chemists to increase willpower has failed in every trial. The CEO cannot believe these results, but the researchers tell him there is no evidence of an effect on willpower at the 0.05 significance level. Convinced the drug must be beneficial in some way, the CEO has a brilliant idea: instead of testing the drug for just one effect, test it for 1000 different effects at the same time, all at the same significance level. Even if it doesn’t increase willpower, it must do something, like reduce anxiety or boost memory. Skeptical, the researchers redo the trials exactly as the CEO says, monitoring 1000 different health measures of subjects on the drug. The researchers come back with astounding news: the drug had a significant effect on 50 of the measured values! Miraculous, right? Actually, it would be more surprising if they had found no significant effects with this experimental design.

The CEO’s mistake is an example of the multiple comparisons problem. The issue comes down to the noisiness of real-world data. While the chance of noise affecting any one result may be small, the more measurements we make, the larger the probability that a random fluctuation is misclassified as a meaningful result. At a 0.05 significance level, we should expect about 5% of tests run on pure noise, here 50 out of 1000, to come back “significant.” While this trips up researchers performing objective studies, it can also be used for nefarious purposes.
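A quick simulation makes this concrete. The sketch below uses hypothetical numbers (50 subjects per group, drawn from identical distributions, i.e., a drug with zero true effect) and runs 1000 independent significance tests on pure noise:

```python
# Simulate the CEO's experiment: a drug with NO true effect on anything,
# tested for 1000 different effects at the 0.05 significance level.
# All numbers here are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, alpha = 1000, 0.05
false_positives = 0

for _ in range(n_tests):
    drug_group = rng.normal(size=50)     # 50 subjects on the drug
    placebo_group = rng.normal(size=50)  # identical distribution: no effect
    _, p_value = stats.ttest_ind(drug_group, placebo_group)
    if p_value < alpha:
        false_positives += 1

# With no real effect, we still expect about n_tests * alpha = 50 "discoveries"
print(f"Significant results from pure noise: {false_positives}")
```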

The CEO has a drug he wants to sell, but it doesn’t do what it was designed to do. Instead of admitting failure, he instructs his researchers to keep looking until they find some vital sign the drug improves. Even if the drug has absolutely no effect on any health marker, the researchers will eventually find it improves some measure because of random noise in the data. For this reason, the multiple comparisons problem is also called the look-elsewhere effect: if a researcher doesn’t find the result she wants, she can just keep looking until she finds some beneficial effect!

If at first you don’t succeed, just keep searching!

Read More

Python Is The Perfect Tool For Any Problem

Reflecting on my first Python program

Reflection is always a helpful (and sometimes entertaining) exercise. For nostalgia’s sake — if one can be nostalgic for something 2 years old — I wanted to share my first Python program. I initially picked up Python as an aerospace engineering student to avoid spreadsheets, and little did I know how good a decision this would turn out to be.

My Python education began with the book Automate the Boring Stuff with Python by Al Sweigart, an excellent application-based book with simple programs to do useful tasks. When I learn a new topic, I look for any chance to use it, and I needed a problem to solve in Python. Fortunately, I found one in the form of a $200 textbook required for a class. My personal limit for textbooks is about $20 (Automate the Boring Stuff is free online), and I refused to even rent this book. Desperate to get the book before the first assignment, I saw it was available through Amazon for a free one-week trial with a new account. I got the book for one week and was able to do the first assignment. While I could have kept creating new accounts one week at a time, I needed a better solution. Enter Python and my first programming application.

Read More