How Reproducible Research and Jupyter may very well be the future of Data Science
This week, on Saturday, 20th August 2016, we were treated to a thought-provoking session by Zubin Dowlaty: “Ensemble Methods and Interactive Visualizations using Notebooks”. The video description of the session touted the use of something Zubin termed a “Portfolio Ensemble Method” of predictive analysis using Jupyter Notebooks. After seeing the video, my expectations were, understandably, high.
Well, they were more than surpassed, and how!
For those of you new to MSU, or living in Antarctica, Zubin Dowlaty heads Innovation and Development at Mu Sigma. Throughout the session, his knowledge of Data Science and Statistical Modeling shone through, the result of 21 years in the field. He struck a fine balance between technical know-how and creative enthusiasm that kept his audience rapt, and he made it look so easy!
He opened the session with a discussion on “Reproducible Research”, a paradigm that he says is the direction Data Science is headed in. The idea is that, in data analysis and scientific computing, the minimum standard is that results must be reproducible: the code and data are arranged in such a way that the experiment and its conclusions can be re-created. The need for such practices is widely recognized, but they are not yet widely followed, largely because not all data scientists and statisticians have adopted the tools that reproducible research requires.
Zubin spoke at length about the difference between professional data scientists and amateurs. “Professional Data Scientists are craftsmen,” he says.
“What we do is create art out of this deluge of data and present it to our clients.”
So, quite naturally, the clients would be curious about how the solutions were arrived at, using the data that they provide.
Sometimes we are given, say, 5000 rows of data, and the final result makes use of only about 3000 of them. In such cases, the client would want to know what happened to the 2000 that weren’t used. The metaphor, “I’ve seen the final dish, and found it tasty, now I wanna see the kitchen, no matter how dirty it is,” fits this case perfectly.
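To make the "show me the kitchen" idea concrete, here is a minimal sketch of how one might account for every dropped row. It is not from Zubin's session: the `audit_filter` helper, the `sales` column, the two cleaning rules and the data itself are all made up for illustration, with the row counts chosen to match the 5000/3000 example above.

```python
from collections import Counter


def audit_filter(rows, rules):
    """Apply named rules in order; record why each rejected row was dropped."""
    kept, dropped = [], {}
    for row in rows:
        for reason, keep in rules:
            if not keep(row):
                dropped[row["id"]] = reason  # first failing rule wins
                break
        else:
            kept.append(row)  # row passed every rule
    return kept, dropped


# Hypothetical raw data: 5000 rows, some with missing or negative sales.
rows = []
for i in range(5000):
    if i % 5 == 0:
        sales = None          # missing value
    elif i % 5 == 1:
        sales = -1.0          # invalid (negative) value
    else:
        sales = float(i)
    rows.append({"id": i, "sales": sales})

# Hypothetical cleaning rules, each with a human-readable reason.
rules = [
    ("missing sales", lambda r: r["sales"] is not None),
    ("negative sales", lambda r: r["sales"] >= 0),
]

kept, dropped = audit_filter(rows, rules)
summary = Counter(dropped.values())

print(f"rows received: {len(rows)}")
print(f"rows used:     {len(kept)}")
for reason, count in summary.items():
    print(f"dropped ({reason}): {count}")
```

The point is not the helper itself but the habit: every row that leaves the analysis does so with a recorded reason, so the question "what happened to the other 2000?" has an answer that can be re-run.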
You really don’t need any sophisticated technology to follow the principles of Reproducible Research. “Reproducible Research is NOT Technology,” asserts Zubin, “It’s a mindset, a discipline that professional Data Scientists follow in order to achieve the ‘Gold Standard’ of publication and research reproducibility.” Having said that, using the right tools can only make our jobs that much easier and smoother.
His favorite quote, one he reiterated throughout the session, was, “Are you a mad one?” (Figure – 1)
He kept asking us, “Do you conform to the old standards, like sheep, or do you dance to your own tune and tell the others, ‘You’re listening to the wrong tune’?”
That was the most thought-provoking part of the session for me.
His next topic provided the perfect segue to Nietzsche’s dictum. Jupyter Notebook is a tool that makes code and data interweave with insights and prose beautifully. According to the Jupyter website, “Jupyter is a web application that allows you to create and share documents containing live code, equations, visualizations and explanatory text.”
What this means for you:
- You can include your executable code along with a description of the code on the same page.
- Sharing notebooks is as easy as sending an email, Dropbox or GitHub link.
- The output can be exported to a range of formats, such as HTML, PDF, LaTeX and Markdown.
- Jupyter has support for more than 40 languages, including R, Scala, Python, Julia and SAS (yes, SAS! Even I couldn’t believe it! Of course, you (or your client) will still need a paid subscription to operate the SAS kernel, else you can’t connect to anything, but meh… Semantics…).
- Jupyter also has support for Spark clusters! Need I say more?
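Part of why notebooks are so easy to share is that an `.ipynb` file is just a JSON document holding a list of cells. As a rough sketch (not something shown in the session), the snippet below builds a tiny notebook by hand with one prose cell and one code cell. The helper names and the cell contents are made up for illustration; the field names follow the version 4 notebook format as I understand it.

```python
import json


def markdown_cell(text):
    """A prose cell: explanatory text that sits next to the code."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """A code cell that has not been run yet (no outputs, no execution count)."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


notebook = {
    "cells": [
        markdown_cell("## Row audit\nWe start with the raw row count."),
        code_cell("rows_received = 5000\nprint(rows_received)"),
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3",
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4,
}

# Saving this string to a file ending in .ipynb gives a notebook Jupyter can open.
notebook_json = json.dumps(notebook, indent=1)
print(notebook_json[:60])
```

Because prose and code live in the same file, emailing or linking that one file shares the analysis and its explanation together.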
After introducing us to the wonders of Jupyter, Zubin showed us what he termed “the Portfolio Method” of predictive modeling. It works like this: first, select a ‘champion model’, the modeling technique whose results we base our recommendation to the client on. The main idea is to ‘not put all our eggs in one basket’. So, we validate the results of our champion model by running the training data set through other predictive models, to:
- Corroborate our findings from the champion model against other models (after all, Trust but Verify)
- Find ways to improve the results of our champion model based on recommendations from the other models.
For the purpose of this demonstration, the Champion model chosen was OLS regression (Ordinary Least Squares). Then, the verification models chosen were:
- Robust Regression
- Stepwise Regression
- Partial Least Squares
- Decision Trees
- Random Forest
- SVM (Support Vector Machines) and Neural Nets
- GAM (Generalized Additive Model) and MARS (Multivariate Adaptive Regression Splines)
- LARS (Least Angle Regression)
- Latent Class Regression
Each of these models offers a second opinion on the same data, which we can use both to check the champion model (OLS) and to improve it.
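To give a feel for the champion-versus-challengers idea, here is a deliberately tiny sketch. It is not Zubin's implementation: it uses a single predictor, made-up data with a few planted outliers, and only one challenger, a Theil-Sen fit standing in for the "Robust Regression" entry in the list above. All function and variable names are my own.

```python
import random
import statistics
from itertools import combinations


def fit_ols(xs, ys):
    """Champion: ordinary least squares, one predictor. Returns (intercept, slope)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_theil_sen(xs, ys):
    """Challenger: Theil-Sen robust fit (median of all pairwise slopes)."""
    pairs = combinations(zip(xs, ys), 2)
    slopes = [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in pairs if x2 != x1]
    slope = statistics.median(slopes)
    intercept = statistics.median([y - slope * x for x, y in zip(xs, ys)])
    return intercept, slope


def mean_abs_error(model, xs, ys):
    """Average absolute prediction error of a fitted (intercept, slope) line."""
    intercept, slope = model
    return statistics.fmean([abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)])


# Made-up data: y = 2 + 3x plus noise (seeded so the run is repeatable).
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 0.5) for x in xs]

# Even rows are used for training, odd rows are held out for validation.
train_x, train_y = xs[0::2], ys[0::2]
test_x, test_y = xs[1::2], ys[1::2]

# Plant outliers in the last five training rows to stress the champion.
train_y[-5:] = [y + 30.0 for y in train_y[-5:]]

portfolio = {"OLS (champion)": fit_ols, "Theil-Sen (challenger)": fit_theil_sen}
fitted = {name: fit(train_x, train_y) for name, fit in portfolio.items()}
errors = {name: mean_abs_error(m, test_x, test_y) for name, m in fitted.items()}

for name, err in errors.items():
    print(f"{name}: slope={fitted[name][1]:.2f}, holdout MAE={err:.2f}")

# Challengers that beat the champion on held-out data deserve a closer look.
champion_error = errors["OLS (champion)"]
better = [name for name, err in errors.items() if err < champion_error]
print("Challengers beating the champion:", better)
```

When a challenger clearly beats the champion, as the robust fit should on data contaminated like this, that is the cue to go back and ask why: here, the outliers. That is the "Trust but Verify" step, and the resulting fix (cleaning or down-weighting those rows) is how the other models feed back into the champion.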
“Progress is impossible without change, and those who cannot change their minds cannot change anything,” said George Bernard Shaw. Well, I believe we have reached a turning point in our journey as Decision Scientists, and this is it. Decisions are no longer about pipelines, but about transparency. Do we stick to Excel Spreadsheets and VBA and immerse ourselves in pain, or is it time to revolutionize how we think, break the shackles of dogma and explore?