Whether you’re a large enterprise or a small business, you’ve probably been told that data is one of your organization’s most important resources. Fair enough. But if you’re not already actively using analytics to tap this resource, where do you start? While this blog post won’t teach you how to set up an analytics process within your organization, it will provide some practical pointers on handling analytics, whatever the size of your organization (or your institutional experience with it).
1.- Consider More than Just Simple Descriptive Statistics
Large datasets can be tough to handle, and whatever the data, you need a toehold to begin exploring it. But remember to move beyond simple, go-to descriptive statistics like the mean or standard deviation: they can mask important insights into your data and send you down blind alleys before you have even begun.
To prove this point, statistician Francis Anscombe constructed four small bivariate datasets with identical (or nearly identical) descriptive statistics, now known as Anscombe’s quartet. Even though the datasets have the same mean and sample variance for their respective X and Y values (and nearly the same Pearson correlation and linear regression line), the huge differences between them become immediately clear once they are graphed.
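If you want to check this for yourself, here is a minimal sketch that computes the summary statistics for the four datasets using only the Python standard library. The data values are the published quartet; the helper name `summarize` and the dictionary layout are just illustrative choices.

```python
from math import sqrt

# Anscombe's quartet. Datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return means, sample variances, Pearson r, and the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "pearson_r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

if __name__ == "__main__":
    for name, stats in SUMMARIES.items():
        print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four printed rows should be nearly indistinguishable, which is exactly the problem: nothing in these numbers hints that one dataset is a curve, one is a line with a single outlier, and one has almost all of its x values stacked on a single point.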
The point is not that mean, median, and standard deviation are bad, but that they can only give you so much insight into unfamiliar datasets. Consider incorporating other fast methods of initial data inspection such as histograms, cumulative distribution functions (CDFs), and quantile-quantile (Q-Q) plots.
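As a rough illustration of how little code a first look requires, the sketch below builds a binned histogram and an empirical CDF with the standard library alone. The sample is made up for the example, and the function names `histogram_counts` and `make_ecdf` are hypothetical helpers, not part of any library.

```python
import statistics
from bisect import bisect_right


def histogram_counts(values, bins=5):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width if all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts


def make_ecdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def ecdf(x):
        return bisect_right(ordered, x) / n

    return ecdf


# Made-up sample with two clusters and nothing in between.
SAMPLE = [1, 2, 2, 3, 3, 3, 17, 18, 18, 19, 19, 19]
COUNTS = histogram_counts(SAMPLE)
ECDF = make_ecdf(SAMPLE)

if __name__ == "__main__":
    print("mean:", round(statistics.mean(SAMPLE), 2))
    print("histogram:", COUNTS)
    print("F(3), F(10), F(19):", ECDF(3), ECDF(10), ECDF(19))
```

The mean of this sample lands in the empty gap between the two clusters, while the histogram and the flat stretch of the CDF make the gap obvious at a glance.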
2.- Visualize Your Data
The example of Anscombe’s quartet in the last point illustrates that sometimes the fastest way to get initial insights about the shape of your data is through visualization. At the same time, don’t limit your visualizations to back-of-the-envelope sketches like histograms or CDFs; more sophisticated techniques like scatter-plot matrices, or two-dimensional projections produced by principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE), can help you make sense of multi-dimensional data and point you in the most fruitful directions for further investigation. Beyond detecting structure in high-dimensional data, visualization techniques like an Andrews plot can also help you detect outliers.
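To make the Andrews plot idea a little more concrete: each observation (x1, x2, x3, ...) is turned into the curve f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + ... for t between -pi and pi, and observations that behave differently produce curves that stand apart from the rest. The sketch below computes the curves without plotting them. The scoring step (average distance from the pointwise median curve) is my own simple stand-in for eyeballing the plot, not a standard definition, and the rows are invented. In practice you would usually standardize each column first so no single feature dominates.

```python
import math
import statistics


def andrews_curve(row, ts):
    """Evaluate the Andrews curve of one observation at each t in ts."""
    values = []
    for t in ts:
        total = row[0] / math.sqrt(2)
        for i, x in enumerate(row[1:], start=1):
            harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
            trig = math.sin if i % 2 == 1 else math.cos
            total += x * trig(harmonic * t)
        values.append(total)
    return values


def outlier_scores(rows, n_points=50):
    """Score each row by its mean absolute distance from the median curve."""
    ts = [-math.pi + 2 * math.pi * k / (n_points - 1) for k in range(n_points)]
    curves = [andrews_curve(row, ts) for row in rows]
    median_curve = [statistics.median(column) for column in zip(*curves)]
    return [
        statistics.mean([abs(c - m) for c, m in zip(curve, median_curve)])
        for curve in curves
    ]


# Invented three-feature observations; the last row is deliberately odd.
ROWS = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [0.9, 1.9, 3.1],
    [1.0, 2.0, 3.1],
    [5.0, -1.0, 0.0],
]
SCORES = outlier_scores(ROWS)

if __name__ == "__main__":
    for row, score in zip(ROWS, SCORES):
        print(row, round(score, 3))
```

Feeding the same curves to a plotting library would give you the actual Andrews plot; the scores are only a quick way to see which curve you would expect to stick out.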
3.- Consider Outliers…Carefully
Outliers in your data might be the proverbial canaries in the coal mine that alert you to underlying problems in your analysis. Then again, they might not be; some outlying data you might never manage to explain. The point is to neither ignore outliers nor get sucked down rabbit holes chasing analytic noise. Come up with a process and a practice for deciding why some data should be flagged as ‘unusual’, and then either move on or track that data for meta-analysis (for example, unusual data generated by communications issues rather than physical phenomena) without devoting undue time to the issue.
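One possible way to turn this into a routine is to flag outliers with a fixed rule and log them instead of deleting them. The sketch below uses Tukey's fences (anything more than 1.5 interquartile ranges beyond the quartiles), which is only one of many reasonable rules. The readings, the `flag_outliers` helper, and the `suspected_cause` field are all hypothetical.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return (index, value) pairs lying outside Tukey's fences.

    The fences are Q1 - k*IQR and Q3 + k*IQR; k=1.5 is the conventional choice.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]


# Hypothetical sensor readings; the last one looks suspicious.
READINGS = [10, 11, 12, 11, 10, 12, 11, 13, 10, 95]
FLAGGED = flag_outliers(READINGS)

# Keep a log rather than dropping the points, so they can be revisited later.
OUTLIER_LOG = [
    {"index": i, "value": v, "suspected_cause": "unknown"} for i, v in FLAGGED
]

if __name__ == "__main__":
    print(OUTLIER_LOG)
```

The original data stays untouched, and the log gives you somewhere to record an explanation if one ever turns up.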
4.- Keep Yourself Honest by Reporting Your Confidence
The world is a messy place. Anomalous things occur. Errors creep into data through instrumentation or reproduction. In short, noise happens.
While there is a danger of finding patterns in the noise, your analysis should nevertheless reflect that you are sifting through noise to discern actual patterns, and that you can never fully get rid of statistical noise in real-world data. Always report a measure of confidence alongside your estimates. This can take many forms depending on what you are estimating: confidence intervals, p-values, Bayes factors, and so on. The point is never to present your estimates as cleaner or more precise than they actually are, especially to yourself!
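If you have no formula handy for the uncertainty of your estimator, a percentile bootstrap is one general-purpose option. The sketch below is a bare-bones version under simple assumptions (independent observations, a statistic that behaves reasonably under resampling); the measurements and the `bootstrap_ci` helper are made up for illustration.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Return (point estimate, lower bound, upper bound) via a percentile bootstrap."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Recompute the statistic on many resamples drawn with replacement.
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return stat(values), lower, upper


# Hypothetical measurements.
MEASUREMENTS = [4.1, 5.0, 4.7, 5.3, 4.9, 5.8, 4.4, 5.1, 4.8, 5.6, 4.5, 5.2]
POINT, LOWER, UPPER = bootstrap_ci(MEASUREMENTS)

if __name__ == "__main__":
    print(f"mean = {POINT:.2f}, 95% interval = [{LOWER:.2f}, {UPPER:.2f}]")
```

Reporting the interval next to the point estimate costs one extra line of output and makes it much harder to over-read a small sample.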
5.- Look at Small Examples of Your Data and Your Analysis of that Data
In his (very) short story “On Exactitude in Science,” Jorge Luis Borges tells of a group of cartographers who create a map so detailed and accurate that it is as big as the empire they are attempting to capture. This is precisely the opposite of what good data analytics does. To provide useful analysis, you must remove many, many features from your underlying data to summarize it. The trick is to remove the right ones.
One way of evaluating whether you are removing extraneous features and keeping only useful ones is to look at small examples of the data. Using subsets of the data is especially helpful when writing computer code for your analysis. Examples drawn from your larger datasets allow you to examine the data in its full complexity and gain a more complete understanding of how the code is working on the data.
Ensure also that you are not too focused on one set of cases, particularly on common cases. Techniques such as stratified sampling can help make sure that you consider examples from the full distribution of values of your data, especially ones at the extreme ends.
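Here is a minimal sketch of stratified sampling for picking examples to inspect, assuming you can assign each record to a stratum with a simple function. The values, the `bucket` thresholds, and the helper name `stratified_sample` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, per_stratum, seed=0):
    """Draw up to `per_stratum` records at random from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for label in sorted(strata):
        group = strata[label]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


def bucket(value):
    """Hypothetical strata: the two extreme tails and everything in between."""
    if value < 10:
        return "low"
    if value > 900:
        return "high"
    return "typical"


# Made-up data: 94 typical values plus three at each extreme.
VALUES = [50 + (i % 7) for i in range(94)] + [1, 2, 3] + [990, 995, 999]
EXAMPLES = stratified_sample(VALUES, bucket, per_stratum=2)

if __name__ == "__main__":
    print(EXAMPLES)
    print(Counter(bucket(v) for v in EXAMPLES))
```

A plain random sample of six values from this data would most likely contain only typical ones; sampling per stratum guarantees the extremes show up in what you look at.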
6.- Slice Your Data into Related Subsets
Slicing data is subtly different from extracting subsets of data to create example sets. Examples serve as a sanity check for your analytical methodology across the entire dataset, while a slice of a dataset encapsulates a chunk of data that you expect to have different metrics from other subsets of your dataset. For example, you might slice demographic data by sex, age, or race, or vehicular data by make and model.
Data slicing accomplishes two things. First, it allows you to make better predictions about similar phenomena in your data. Second, it enables you to gauge internal consistency of your methodology and helps you determine whether you are measuring the right things across your entire dataset.
For example, comparing data slices might help you determine that you have bad data in one data slice. Note, however, that when comparing data slices, you want to ensure that you compare a similar amount of data from different data slices. Taking too much data from one slice and not enough from another can give you a biased comparison.
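One simple way to keep a slice comparison balanced is to downsample every slice to the size of the smallest one before computing your metric. That is an assumption on my part about how to get "a similar amount of data" (weighting or resampling repeatedly are alternatives), and the vehicle records and `compare_slices` helper below are invented for illustration.

```python
import random
import statistics
from collections import defaultdict


def compare_slices(records, slice_key, value_key, seed=0):
    """Compare the mean of `value_key` across slices using equal-sized samples."""
    rng = random.Random(seed)
    slices = defaultdict(list)
    for record in records:
        slices[record[slice_key]].append(record[value_key])
    n_used = min(len(values) for values in slices.values())
    return {
        name: {
            "n_total": len(values),
            "n_used": n_used,
            "mean": statistics.mean(rng.sample(values, n_used)),
        }
        for name, values in sorted(slices.items())
    }


# Hypothetical vehicle data: make "A" has far more rows than make "B".
RECORDS = (
    [{"make": "A", "mpg": mpg} for mpg in [30, 32, 31, 29, 33, 30, 31, 32]]
    + [{"make": "B", "mpg": mpg} for mpg in [22, 24, 23]]
)
COMPARISON = compare_slices(RECORDS, "make", "mpg")

if __name__ == "__main__":
    for make, stats in COMPARISON.items():
        print(make, stats)
```

Keeping `n_total` next to `n_used` in the output is a cheap reminder of how much data each slice really has, which matters when one slice is tiny.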
7.- Check Your Data for Consistency over Time
Systems in your organization tend to evolve over time—both the systems that use your data and those that produce it. Something in your data pipeline is likely to break at some point, and data slices by time are most likely to identify those problems. It is almost always wise to slice your data by some unit of time.
The unit you use depends on your underlying data. A lot of data runs in daily or weekly cycles, though some machinery might produce data where hourly time slices make more sense. Geospatial data might also have a role to play in your data slicing by time: for example, taking into account time zones would be important when analyzing global data streams from Twitter.
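As a small sketch of time slicing with time zones taken into account, the code below converts timestamps to UTC before bucketing them by day, then lists any days with no data at all, which is often the first visible symptom of a broken pipeline. The timestamps and the helper names `daily_counts_utc` and `missing_days` are made up; choosing UTC days as the unit is an assumption for the example, not a recommendation for every dataset.

```python
from collections import Counter
from datetime import date, datetime, timedelta, timezone


def daily_counts_utc(timestamps):
    """Count events per UTC calendar day from ISO 8601 strings with offsets."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts)
        if dt.tzinfo is None:
            raise ValueError(f"timestamp has no UTC offset: {ts}")
        counts[dt.astimezone(timezone.utc).date()] += 1
    return dict(sorted(counts.items()))


def missing_days(counts):
    """Return the days between the first and last observed day with no events."""
    days = sorted(counts)
    if not days:
        return []
    gaps = []
    current = days[0]
    while current < days[-1]:
        if current not in counts:
            gaps.append(current)
        current += timedelta(days=1)
    return gaps


# Hypothetical events recorded with different local offsets.
EVENTS = [
    "2024-03-01T23:30:00-05:00",  # already March 2 in UTC
    "2024-03-02T01:15:00+09:00",  # still March 1 in UTC
    "2024-03-01T12:00:00+00:00",
    "2024-03-02T00:30:00+01:00",  # still March 1 in UTC
    "2024-03-04T10:00:00+00:00",
]
COUNTS = daily_counts_utc(EVENTS)
GAPS = missing_days(COUNTS)

if __name__ == "__main__":
    print(COUNTS)
    print("days with no data:", GAPS)
```

Bucketing these same events by their local calendar date would put two events on each of March 1 and March 2, so the choice of time zone changes the daily picture even in a five-row example.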
This is far from being an exhaustive list of best practices for dealing with your data or undertaking analytics. For example, I haven’t touched on tips for gathering or cleaning your data. Nevertheless, all of these are relevant practices that can help you with your analytics whether you’re completely new to it or an old hand who wants to fight off bad habits.