Exploratory Data Analysis: using visuals to see your data
Most attention in data visualisation focuses on its role as a means of communicating data to others. However, this only represents one side of the coin. The other key purpose of visualisation is to help us, as analysts, explore data. Visuals help supplement statistical analysis, offering techniques that can help us thoroughly interrogate data to unearth insights and qualities that may otherwise be hidden from view.
As the mathematician John Tukey once described it, ‘Exploratory data analysis (EDA) is an attitude, a flexibility, and a reliance on display, not a bundle of techniques.’ There is no single path to undertaking this activity effectively; it requires a number of different technical, practical and conceptual capabilities:
Instinct of the analyst: The attitude and flexibility that Tukey describes are about recognising the importance of the traits of the analyst. Effective EDA is not about the tool. There are many vendors out there pitching their devices as the magic ‘point and click’ option that will uncover deep discoveries. Technology inevitably plays a key role in facilitating this endeavour, but you should not underestimate the value of a good analyst: the analyst is arguably more influential than the differentiating characteristics between one tool and the next.
In the absence of a defined procedure for conducting EDA, the analyst needs to possess a capacity to recognise and pursue a scent of enquiry. A good analyst will have that special blend of natural inquisitiveness and the sense to know what approaches (statistical or visual) to employ and when. Furthermore, when these traits combine with strong subject knowledge, better judgments are made about which findings from the analysis are meaningful and which are not.
Reasoning: As I have mentioned, efficiency is a particularly important aspect of this exploration activity. The act of interrogating data, waiting for it to volunteer its secrets, can take a lot of time and energy. Even with smaller datasets you can find yourself tempted into trying out myriad combinations of different analyses, driven by the desire to find the killer insight lurking away in the shadows.
Reasoning is an attempt to help reduce the size of this challenge. Even in relatively small datasets you can neither expect nor afford to pursue all potential avenues of enquiry. With so many statistical and visual methods available to analysts, unleashing the full exploratory artillery is rarely feasible. EDA is about being smart, recognising that you need to be discerning about your tactics. In academia there are two distinct approaches to reasoning – deductive and inductive – that I feel are usefully applied in this discussion:
- Deductive reasoning is targeted: You have a specific curiosity or hypothesis, framed by subject knowledge, and you are going to interrogate the data in order to determine whether there is any evidence of relevance or interest in the concluding finding. I consider this adopting a detective’s mindset (Sherlock Holmes). This will assist in confirming the things you think you know, as well as helping to investigate the things you are aware you don’t know. Sometimes the consequence of this reasoning is not to obtain answers but to have a better understanding of the key questions.
- Inductive reasoning is much more open in nature: You will ‘play around’ with the data, based initially on a sense about what might be of interest, and then wait and see what emerges. In some ways this is like prospecting, hoping for that moment of serendipity when you unearth gold. You will maintain an open mind, letting the flow of discovery take you down potentially unexpected permutations. It is important to give yourself room to embark on these somewhat less structured exploratory journeys.
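To make the distinction between the two approaches a little more concrete, here is a minimal Python sketch. It assumes a tiny made-up dataset: the column names and values are hypothetical and the `pearson` helper is written purely for illustration. The deductive step asks one targeted question; the inductive step scans every pairing and ranks whatever emerges.

```python
from itertools import combinations
from math import sqrt

# Hypothetical data, invented purely for illustration.
data = {
    "ad_spend": [10, 20, 30, 40, 50],
    "sales": [12, 24, 33, 46, 55],
    "rainfall": [5, 3, 6, 2, 4],
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Deductive: one targeted question, framed by subject knowledge.
hunch = pearson(data["ad_spend"], data["sales"])

# Inductive: look at every pairing and see what stands out,
# strongest relationship (positive or negative) first.
scan = sorted(
    ((a, b, pearson(data[a], data[b])) for a, b in combinations(data, 2)),
    key=lambda row: abs(row[2]),
    reverse=True,
)

print(f"ad_spend vs sales: {hunch:.2f}")
for a, b, r in scan:
    print(f"{a} vs {b}: {r:.2f}")
```

Neither step delivers an answer on its own. A strong coefficient is only a prompt to look more closely, ideally with a chart and with subject knowledge to hand.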
I tend to think about EDA by comparing it to the challenge of solving a ‘Where’s Wally?’ visual puzzle. The process of finding Wally feels random. You tend to begin by letting your eyes race around the scene like a dog who has just been let out of the car and is torpedoing across a field. After the initial burst of randomness, perhaps subconsciously, you then go through a more considered process of visual analysis. Elimination takes place by working around different parts of the scene and sequentially declaring ‘Wally-free’ zones. This aids your focus and strategy for where to look next. As you then move across each mini-scene you are pattern-matching, seeking the giveaway characteristics of the boy wearing glasses, a red-and-white-striped hat and jumper, and blue trousers.
The objective of this task is clear and singular in definition. The challenge of EDA is rarely that clean. There will always be a source curiosity to follow and you might find evidence of the ‘Wally’ somewhere in your data. However, unlike the ‘Where’s Wally?’ challenge, in EDA you have the chance also to find other answers. Things that might alter the scope of what qualifies as interesting and relevant. In unearthing other discoveries, you might determine that you no longer care about Wally; finding him no longer represents the main enquiry.
Chart types: This is about seeing data from all available visual angles. The power of visual perception means that we can easily rely on our pattern-matching and sense-making capabilities – in harmony with contextual subject knowledge – to make observations about our data. Through visualising your data for yourself you are able to establish a greater acquaintance with the characteristics of your data’s values: its magnitude, distribution, relationships, and exceptions.
Visualisations help you move beyond looking at data towards starting to see it. You discover what is in your data but also, crucially, what is NOT in there. Every chart type offers a different view of your data and facilitates specific observations. You need to learn the capabilities and limitations of each chart type to understand how and when to deploy them. You also need to develop your charting vocabulary by embracing a larger range of options, not limiting yourself to the narrow set of tried and trusted approaches. As with your statistical literacy, broadening your visual literacy will widen the potential view of your data.
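As a small illustration of the difference between looking at data and seeing it, the sketch below uses only the Python standard library and a made-up list of values (hypothetical, invented for this example). A crude text histogram stands in for a proper chart; the `text_histogram` helper is my own naming, not part of any library.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical response times in seconds, invented for illustration.
values = [2, 3, 3, 4, 4, 4, 5, 5, 6, 48]

def text_histogram(xs, bin_width):
    """Return one line per occupied bin, with one '#' per value in it."""
    counts = Counter((x // bin_width) * bin_width for x in xs)
    return [
        f"{start:>3}-{start + bin_width - 1:<3} {'#' * counts[start]}"
        for start in sorted(counts)
    ]

# The summary statistics alone give little hint of the data's shape.
print(f"mean={mean(values):.1f} median={median(values)}")

# The display shows where values cluster and where the exception sits.
for line in text_histogram(values, bin_width=5):
    print(line)
```

Only occupied bins are printed, so the empty stretch between the main cluster and the lone high value shows up as a jump in the bin labels. A real charting tool would show that gap far more clearly, which is rather the point.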
Subject knowledge: Conducting exploratory analysis without the requisite subject domain knowledge will leave you exposed: you may not know if what you are seeing is meaningful, significant or unexpected. The approach to bolstering your knowledge of a subject is largely common sense: you explore the places (books, websites) and consult the people (experts, colleagues) that will collectively give you the best chance of asking the right questions of your data and knowing how to interpret the answers you get back.
Nothing to see here?: What if you have found nothing? You have hit a dead end. Despite trying out all conceivable angles of attack, you have discovered no significance in any relationships and have fundamentally found nothing ‘interesting’ about the patterns and shape of your data. What do you do? In these situations you need to adopt the attitude that nothing is usually something. Going down blind alleys and hitting dead ends can be useful. The ‘nothing to see here’ discovery can help you develop focus by eliminating dimensions of possible analysis, as I illustrated with the ‘Where’s Wally?’ example. If you have attributes of nothingness in your data – gaps, nulls, zeroes – you might find that these could prove to be the critical insight.
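One way to take nothingness seriously is to count it. The following is a minimal sketch, assuming a made-up series of daily readings (the dates and values are hypothetical) and a helper I have called `nothingness_profile` for the purpose of illustration. It tallies nulls and zeroes and lists any dates missing from the covered range.

```python
from datetime import date, timedelta

# Hypothetical daily readings, invented for illustration.
readings = {
    date(2024, 1, 1): 14,
    date(2024, 1, 2): 0,
    date(2024, 1, 3): None,
    date(2024, 1, 5): 9,
    date(2024, 1, 6): 0,
}

def nothingness_profile(series):
    """Count nulls and zeroes, and list dates missing from the range."""
    days = sorted(series)
    span = (days[-1] - days[0]).days + 1
    expected = {days[0] + timedelta(days=i) for i in range(span)}
    return {
        "nulls": sum(v is None for v in series.values()),
        "zeroes": sum(v == 0 for v in series.values()),
        "gaps": sorted(expected - set(series)),
    }

profile = nothingness_profile(readings)
print(profile)
```

Whether a zero means ‘nothing happened’ or ‘nothing was recorded’ is not something the code can tell you; that judgment rests on knowing the subject.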
There is always something interesting in your data. If a value has not changed over time, maybe it was supposed to – that is an insight. If everything is the same size, that is the story. If there is no significance in the quantities, categories or spatial relationships, make those your insights. You will only know that these findings are relevant by truly understanding the context of the subject matter. This is why you must make as much effort as possible to convert as many of your unknowns into knowns.