Completed Twelve Myths of Data Science

This week I completed Ben Sullins’ (@teslanomicsco) 30-minute Lynda.com video, “Twelve Myths About Data Science,” in which he helps sort fact from fiction in claims that vendors and the media have made over the past several years. I’d argue the video should really be titled “Twelve Myths About Big Data,” since the core content concerns data processing systems for “large” data. I understand the choice, though, given that Gartner has removed the term “Big Data” from the hype cycle and there are probably more people searching for “data science” now. In the end I enjoyed the video and would recommend watching it if you have access to Lynda.com. Below are a few of my key takeaways from the video.

Key Takeaways:

1. Is this a solution to a problem or a solution in search of a problem?

As technology professionals we all like playing with the latest technology, but before rushing out and standing up an expensive Hadoop or Spark cluster, ask yourself: “Is this solution really necessary for the given use-case?” I think this applies to most technology solutions, as all too often we have a solution in search of a problem. Data processing solutions such as Hadoop and Spark were created for use-cases in which it was impractical to use existing data processing technology.

2. Data Scientists are the key users of “Big Data” systems but shouldn’t be responsible for building and maintaining those systems

At this point there are numerous articles that explain the differences between Data Scientists and Data Engineers, so I won’t go into detail. In short, Data Scientists are the people with advanced statistics and domain knowledge who use data to gain insights, whereas Data Engineers are the folks with a technical background who work with Data Scientists to design and build the systems required to process, clean, and store that data.

3. Be realistic about the cost and resources required

While cloud technologies have made it much easier to provision the infrastructure for big data processing systems, there is still a lot of work done behind the scenes to make the data flow and become usable. The key point here is that if you have a problem that is a valid use-case for one of the “Big Data” technologies, then you need to be realistic about the investment required to build and maintain the system and to analyse the data to solve that problem. I always get a good laugh when I come across a job posting that seems to be looking for an entire data department in a single person.