Without being aware of it, the decisions you make when joining data can skew your results.

Last month, we discussed the use of external data as benchmarks to provide perspective when analyzing your own internal data. Even if you are not pulling information from public data sources, you are likely comparing information across applications and departments. Every time you want to compare an item from a purchase order with a line item from an invoice, you want a set of values that match exactly between the tables you are comparing. Many times you don’t. Sometimes this mismatch is engineered intentionally — for instance, to make sure that a report supports a decision that has already been made — but that is a subject for another article. I want to discuss how skew[1] can be introduced unintentionally, sometimes without anyone even noticing. Here are some simple examples along with some corrective suggestions to consider:

Joins as Filters:

Say you want to match customer information against calls into your help line. Your CRM system stores phone numbers as your users enter them, with formats ranging from (888) 555-1212 to 888.555.1212 to 8885551212. Your call center system’s activity log stores phone numbers as 8885551212. If you attempt to join customers to calls using the two phone number columns, you will filter out every CRM record that stores a formatted phone number. At first glance, it will appear that there are no support calls for any of those customers. That implies you have too many staff members in your call center or that they are speaking with the wrong people — both biased conclusions caused by the absence of a potentially large number of filtered records. For this example, you would want to consider using a rule or a script to remove all formatting from the CRM phone number value before the join is made. This is an example of a rule-based data map, and it is a very powerful tool.
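A minimal sketch of this rule-based map in pandas, using hypothetical column names and made-up sample data: stripping every non-digit character from the CRM phone number recovers the matches that a naive join silently drops.

```python
import pandas as pd

# Hypothetical sample data: the CRM stores phone numbers in mixed
# formats, while the call log stores digits only.
crm = pd.DataFrame({
    "customer": ["Acme", "Bravo", "Cargo"],
    "phone": ["(888) 555-1212", "888.555.1213", "8885551214"],
})
calls = pd.DataFrame({
    "phone": ["8885551212", "8885551213"],
    "minutes": [12, 5],
})

# A naive join on the raw columns finds nothing at all.
broken = crm.merge(calls, on="phone", how="inner")

# Rule-based map: strip everything that is not a digit before joining.
crm["phone_clean"] = crm["phone"].str.replace(r"\D", "", regex=True)

matched = crm.merge(
    calls, left_on="phone_clean", right_on="phone",
    how="inner", suffixes=("_crm", "_call"),
)
```

Here the broken join returns zero rows, while the cleaned join recovers both customers who actually called.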

Another common rule-based correction is transforming “LastName, FirstName” to “FirstName LastName.” The point I want to highlight here is that having no records returned due to a bad match is better than having some records returned. A report with no records is simply broken; one with some records is almost certainly skewing your result, and if you don’t catch it, this unintended bias can lead to erroneous decisions.
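The name transform can be sketched the same way; the split point and column name below are hypothetical:

```python
import pandas as pd

# Hypothetical example: one system stores "LastName, FirstName",
# the other "FirstName LastName".
names = pd.DataFrame({"name": ["Doe, Jane", "Smith, John"]})

def flip(name: str) -> str:
    # Split on the first comma and reverse the two parts.
    last, first = [part.strip() for part in name.split(",", 1)]
    return f"{first} {last}"

names["name_clean"] = names["name"].map(flip)
```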

Perhaps you want to use stock market close data as a benchmark for sales. Your business is open 362 days this year, while the US stock market is open 250. We learned from the example above that if we match on the date, then the 112 days you’re open and the market is closed will disappear from your report. In data science, the Venn diagram is frequently an introductory concept for visualizing this, and this match will return only the intersection of sales and market data. One of the simplest ways to address this issue is to use a left join, which forces your report to return all sales data plus any close data that matches on date, resulting in sales but no market data for every Saturday, Sunday, and market holiday. If you attempt to graph this data set, you will have jarring gaps in the market data.
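A small pandas illustration of the inner-versus-left distinction, using invented sales and close figures over a long weekend:

```python
import pandas as pd

# Hypothetical sample: sales every day, market closes only on weekdays.
sales = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-03-01", "2024-03-02", "2024-03-03", "2024-03-04"]),
    "sales": [100, 80, 90, 110],
})
market = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-01", "2024-03-04"]),  # Fri, Mon
    "close": [5100.0, 5120.0],
})

# Inner join keeps only the intersection: weekend sales vanish.
inner = sales.merge(market, on="date", how="inner")

# Left join keeps every sales row; 'close' is NaN on non-trading days,
# which shows up as gaps in a line chart.
left = sales.merge(market, on="date", how="left")
```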

A slightly more complicated solution is to use a numbers table and left join it to the market data table. This join produces a table with all of the dates for the year. You could then assign the last close value to the empty rows. We could argue about whether this is more or less accurate, but I think in the end we’d agree that the graph is more visually appealing without the breaks in the market data line.
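The numbers-table approach might look like this in pandas, where a generated calendar stands in for the numbers table and a forward fill carries the last close into the empty rows (dates and values are made up):

```python
import pandas as pd

market = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-01", "2024-03-04"]),
    "close": [5100.0, 5120.0],
})

# A "numbers" (calendar) table: one row per date in the range.
calendar = pd.DataFrame({"date": pd.date_range("2024-03-01", "2024-03-05")})

# Left join the calendar to the market data, then carry the last
# known close forward into the non-trading days.
filled = calendar.merge(market, on="date", how="left")
filled["close"] = filled["close"].ffill()
```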

The final filtering issue I want to discuss is mapping. Let’s imagine that over the last few years, your business has grown due to the acquisition of similar businesses in other locations. The products you sell are mostly the same across locations and regions, but each of your acquired companies used different product codes for these items. If you want to do any trend analysis crossing the boundary between the old product item codes and the new, then a better solution than editing the history in each legacy system would be the creation of a mapping table containing the old code and the new. This step also prevents counting apples as oranges when the legacy data uses a code for something other than what the current system uses it for.
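A mapping table is just another join; a minimal sketch with hypothetical codes:

```python
import pandas as pd

# Hypothetical legacy sales rows keyed by old product codes.
legacy = pd.DataFrame({"oldcode": ["123", "456"], "qty": [10, 4]})

# Mapping table: old code -> current code.
code_map = pd.DataFrame({
    "oldcode": ["123", "456"],
    "newcode": ["ABC", "XYZ"],
})

# Left join so legacy rows with no mapping are kept (and visible as NaN)
# rather than silently filtered out.
mapped = legacy.merge(code_map, on="oldcode", how="left")
```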

Joins as Multipliers:

Ah, but what happens when your mapping table has multiple destinations for the same code? Let’s say that Acquisition A uses code 123 for your product ABC, and that Acquisition B uses code 123 for your product DEF. When you map this data on product code alone, you multiply your results:

locationcode   oldcode      oldcode   newcode
1              123          123       ABC
1              123          123       DEF
2              123          123       ABC
2              123          123       DEF

You get four rows where you expected two. (This is corrected by adding the location code to the mapping table and to the join.)
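The multiplication, and the fix, can be reproduced in a few lines of pandas (sample quantities invented):

```python
import pandas as pd

# Sales rows from two acquired locations, both reusing old code 123.
sales = pd.DataFrame({
    "locationcode": [1, 2],
    "oldcode": ["123", "123"],
    "qty": [10, 7],
})

# The mapping table has two destinations for code 123.
code_map = pd.DataFrame({
    "locationcode": [1, 2],
    "oldcode": ["123", "123"],
    "newcode": ["ABC", "DEF"],
})

# Joining on oldcode alone multiplies the rows: 2 sales rows become 4.
bad = sales.merge(code_map[["oldcode", "newcode"]], on="oldcode")

# Adding locationcode to the join restores one row per sale.
good = sales.merge(code_map, on=["locationcode", "oldcode"])
```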

What makes this type of error so pernicious is that it likely won’t duplicate or multiply everything in your dataset — just individual codes here and there. If you are looking at a dashboard where this data is aggregated, you might not notice the skew.

Reporting on Your Reports:

Now, this is going to sound terribly meta, but I’m going to say it anyway: it is possible and desirable to create reports that focus on the quality of your data. Datasets possess characteristics that can be monitored over time, and tolerances can be set for anomalies. When an anomaly is detected, you will find something that has changed in your business. Simple measures worth tracking over time include row counts, the number of distinct values in key columns, duplicate rates, and the share of unmatched records.

Having a few reports looking at the metadata of your reporting platform can save hours or even weeks of frustrating audit time when a change in your business introduces one of these errors into a long-trusted report. Entire courses are taught on data quality, and if you find yourself performing analytical tasks on a daily basis, then taking one might be worthwhile. But typically, these simple metrics are enough to indicate that a skew has occurred in your dataset.
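Such a metadata report can start very small; the column names and tolerances below are hypothetical:

```python
import pandas as pd

# Hypothetical daily report output to be monitored over time.
report = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": [50.0, 20.0, 20.0, None],
})

# A few simple quality metrics for this run.
metrics = {
    "row_count": len(report),
    "duplicate_keys": int(report["order_id"].duplicated().sum()),
    "null_amounts": int(report["amount"].isna().sum()),
}
# Compare these numbers against the previous run: a sudden jump in
# row_count or duplicate_keys often signals a join acting as a
# multiplier; a drop in row_count suggests a join acting as a filter.
```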

 

As an Alliance Advisor, I offer expertise in this and other areas of data analysis. Please contact me if you or your company would like to start a conversation about your needs.

[1] I chose this term to avoid using bias, and based this decision on this definition of skew: “To give a bias to; distort.” I did not intend to confuse it with the statistical term skewness.