Catallaxy Services | Launching a Data Science Project: Cleaning is Half the Battle

ABSTRACT

There's an old adage in software development: Garbage In, Garbage Out. This adage certainly applies to data science projects: if you simply throw raw data at models, you will end up with garbage results. In this session, we will build an understanding of just what it takes to implement a data science project whose results are not garbage. We will use the Microsoft Team Data Science Process as our model for project implementation, learning what each step of the process entails. To motivate this walkthrough, we will see what we can learn from a survey of data professionals' salaries.

ADDITIONAL MEDIA

I performed a version of this talk for DataPlatformGeeks. You can get the recording on their YouTube channel.

SLIDES

Click here to access the slides for this presentation.

The slides are licensed under Creative Commons Attribution-ShareAlike.

DEMO CODE

Click here to access demo code for this presentation. This includes a Jupyter notebook which walks through our example.

The source code is licensed under the terms offered by the GPL.

LINKS & FURTHER INFO

For a more detailed explanation, check out my blog series entitled Launching A Data Science Project, where I expand on the topics covered in this talk.

Setup Resources

I use the following in this talk:

R
Jupyter Notebooks. I have a guide for Windows and notes for Linux installation.

Resources

Microsoft's Team Data Science Process formed the crux of this talk. Although I do not follow the process exactly, I think it serves as a good starting point for a research-heavy project in an Agile world.
Microsoft also has an implementation guide for their Team Data Science Process using Azure Machine Learning. This is still useful in general even if you don't use Azure ML.
Definitely check out Microsoft's algorithm cheat sheet. There are many algorithms not covered in this PDF, but it gives you a head start on thinking through algorithm choices.
Raj Bandyopadhyay has his own take on what a data scientist actually does. It stops before getting to the "real" development phase, but I definitely agree with the questions he asks.
Feature Engineering has its own special definition that I wanted to include separately from other items. It's sometimes hard to tell where data analysis leaves off and feature engineering picks up, but the linked definition does a good job.
SethBling has a YouTube video on MariFlow, along with a Google Doc showing how to set it up on your own.
SethBling also set up MarI/O, machine learning for video games.
Finally, I am indebted to Brent Ozar for his annual data professional salary survey.