Working with data is still full of human challenges

The summer break was a great opportunity to go through my Kindle library. One book that really stuck with me was “Hello World: How to Be Human in the Age of the Machine” by Hannah Fry. It’s a great read that will get you thinking about several philosophical questions related to the modern world, and more specifically about how data analysis affects our lives!

It was also a great reminder, for my nerdier/techie side, that the toughest business challenges usually have the “human” element at their core, rather than technology! Let me share two great stories from Hannah’s book and explain why I consider them to be among the most fundamental difficulties anyone working on anomaly detection faces.

Don’t take the existence of digital data for granted

Remember a few years back when Google’s DeepMind and IBM’s Watson were often featured in the news as they tried to move into healthcare and create a digital super-doctor? A few of the reasons that gave the brilliant minds in both these teams a hard time were:

  • The ambiguity of natural language makes it hard for machines to interpret patient feedback. Just think of the last time you managed to have an intuitive conversation with a chatbot (probably never), or how ambiguous everyday language can be: when someone says their stomach is killing them, are they being literal or just referring to a stomach ache?
  • The lousy state of historical data in healthcare: most medical care centers do not have access to a patient’s complete medical history, and the handwriting of the average doctor is not particularly digital-age-friendly.

When you start your journey to create a machine that would “eradicate cancer”, you would expect to be dealing with more technologically advanced issues!

But the truth is that data professionals coming from digitally mature fields (e.g., web analytics) sometimes live in a bubble that lets us forget the effort it took to reach the state we are in today:

  • Data engineers often take the availability of digital data for granted
  • Data analysts and data scientists can forget how difficult it can be just to create a complete dataset by combining different data sources
  • Managers often underestimate the effort data analysts devote to cleaning up a dataset before they can produce reliable insights

Machines will imitate human behavior

Another example that blew my mind is the challenge faced by teams trying to speed up court rulings by building a machine learning model that would go through the facts of each case and propose a ruling to a judge. Even though most of us would expect the judicial system to be objective, with little room for misinterpretation, research has shown this is not the case.

In a 2001 study, several UK judges were asked whether they’d award bail to a number of imaginary defendants, and they failed to agree unanimously on a single one of the 41 cases presented to them! Up to half of the judges differed in their opinion of the best course of action on any one case. And here’s the best part: a few of the cases even appeared more than once in the study (with different names), and most judges didn’t manage to make the same decision on the same case when seeing it for the second time.

If you were tasked with creating a decision-making ML model (like this one), chances are you would expect a clear set of requirements and goals (because this is how engineering teams work). It would be challenging to reach a successful result when there’s such diversity in the success criteria. And trust me, diverse success criteria are not isolated to this specific use case.

ML models are like kids that copy their parents, and you can’t expect your “kid” to behave if you fail to train it (this is a fun and interesting take on the topic). Data scientists should not forget the “human” part of their role if they want to succeed.

Successful anomaly detection is a human challenge

Maybe the real reason I enjoyed reading “Hello World” is that I could see myself in it and recognize many of the challenges we have faced while building Baresquare’s anomaly detection product. It reminded me how human our job can be, and that the value of Baresquare isn’t just in its state-of-the-art algorithms and underlying tech; it’s in the distilled domain knowledge of data analysts who know what they’re looking for.

Defining what is interesting is a human problem

I’ve been working in the field of data analysis for over 15 years now, and I’ve seen how much has changed. Anomaly detection used to be all about finding a way to identify outliers (a technology problem), but these days it’s more about finding ways to identify anomalies that are actually interesting. Strangely enough, the definition of “interesting” is still unclear or debatable in most cases, and it feels like a “human” problem: we first need to define what we actually need.

Anomaly detection is not just about detecting outliers, but also about understanding why something is an outlier and what can be done with that information. You have to constantly prioritize (or even re-prioritize) outliers and group them together according to your users’ needs and feedback.

This is where critical thinking and domain knowledge come in. The more you know about your data, the more effective your algorithms will be at identifying anomalies and producing meaningful insights.
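To make this distinction concrete, here is a minimal Python sketch of the two steps: a purely statistical outlier check, followed by a layer of domain rules that decides which outliers are worth a human’s attention. The page names, thresholds, and the “business-critical” rule are all made up for illustration; this is not how Baresquare works under the hood.

```python
from statistics import mean, stdev

# Hypothetical daily sessions per page; the numbers are made up for illustration.
daily_sessions = {
    "/home":     [1200, 1180, 1215, 1190, 1230, 640, 1205],
    "/checkout": [300, 310, 295, 305, 290, 150, 298],
    "/careers":  [40, 35, 42, 38, 5, 37, 41],
}

def zscore_outliers(values, threshold=2.0):
    """Return (day_index, value, z-score) for points far from the series mean."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [(i, v, (v - mu) / sigma) for i, v in enumerate(values)
            if abs(v - mu) / sigma >= threshold]

# Step 1: the "technology" part -- flag statistical outliers per page.
raw_anomalies = {page: zscore_outliers(vals) for page, vals in daily_sessions.items()}

# Step 2: the "human" part -- encode domain knowledge to decide what is
# actually interesting. The rule here is invented: only surface anomalies
# on pages that matter to the business and that move enough absolute traffic.
BUSINESS_CRITICAL = {"/home", "/checkout"}
MIN_ABSOLUTE_CHANGE = 100  # sessions; an arbitrary illustrative threshold

interesting = []
for page, anomalies in raw_anomalies.items():
    baseline = mean(daily_sessions[page])
    for day, value, z in anomalies:
        if page in BUSINESS_CRITICAL and abs(baseline - value) >= MIN_ABSOLUTE_CHANGE:
            interesting.append((page, day, value, round(z, 1)))

# Sort by severity so the analyst sees the most important item first.
interesting.sort(key=lambda item: abs(item[3]), reverse=True)
print(interesting)
```

The first step is the easy, technological part; the second step is where the analyst’s judgment lives, and it is the part you keep revisiting as your users give you feedback on what they actually care about.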

The value of Baresquare lies in its ability to think like a human analyst. It’s not just an anomaly detection tool that blindly processes data and returns long lists of anomalies. It leverages the domain knowledge of experienced data analysts and is designed to produce the results you would expect from an experienced colleague, and, most importantly, to deliver them in a human-friendly format.

If you are interested in using Baresquare and seeing how it performs on your own data, you can use our free forever plan with a Google Analytics account you own. You will be amazed by how it can change the way you interact with your dataset on a daily basis.

Panagiotis

Written By

Panagiotis (pronounced Panayotis) is a passionate G(r)eek with experience in digital analytics projects and website implementation. A fan of clear and effective processes, task automation, and problem-solving technical hacks. He has hands-on experience with projects ranging from small to enterprise-level companies, starting from communication with the customers and ending with the transformation of business requirements into the final deliverable.