In my previous post, I covered the various methods you can use to access a data lookup table and create realistic test data that is usable and protects citizens’ rights to privacy under the EU General Data Protection Regulation (GDPR).
I concluded a lot depends how a lookup table is accessed. Some access methods give repeatable results, some don’t. Some are reversible, some aren’t. In this post we’ll look at the concept of two-tier access to a lookup table. The idea is to pre-filter the lookup table with some criteria so replacement values only come from a selected range of rows.
How Two-Tier Access to a Lookup Table Works
Consider the example of an international organization with clients in many countries. If the organization wants to mask addresses, it will likely be required to replace any address with another coming from the same country.
One reason for this is different address formatting exists between countries. Another is that a multi-national company would typically have products or services (or prices) tailored by country.
The company’s client table could like this:
The customers come from six different countries. The masking requirement is to preserve the original country and mask it in such a way that the replacement address conforms to the formatting standards of the given country.
We need a lookup table with a selection of addresses. Let’s assume that the masking team has created the following lookup, where “Seq” is the key (lookup tables should always have a primary key column), “STREET” is the street address and so on:
There are two issues with the table, but let’s say that the masking team isn’t aware of them yet. The team moves forward and creates a masking rule using the method of two-tier access to a lookup table.
In tier one, the original country is selected and compared against the lookup table, resulting in a pre-selected range of addresses matching the original country. Tier two will simply select one of the addresses within the pre-selected range (let’s say random access will be used):
Let’s execute the masking rule and check the result:
We see that countries remain unchanged and the replacement addresses are from the same country as the source. However, a careful reader will spot two problems:
- The addresses of customers from Austria and the U.K. are unchanged (rows one and three)
- The address of the client from Poland looks the same, save for being lowercase (row six)
What, then, has gone wrong?
Ad-1. Go back to the lookup table. Is there any address from Austria or the U.K? No! That’s a design error. If we want the replacement addresses to be from the same country as the source, we obviously need to know our business and prepare the lookup table appropriately.
Ad-2. It looks like the rule attempted to mask the address (uppercase in the source, lowercase in the replacement), but the address looks like the same. Rule error? No! Again, if you check the lookup table, you will see one of the two Polish addresses is actually the same as the source. The other is different, but we have a 50 percent chance of picking the same address. Since we used the random method for the second tier, chances are the next execution will pick up the other address.
Two-tier access to a lookup table is the method of choice when your replacement values must come from a specific data range. Tier-one processing will pre-select the range based on user criteria and tier-two will select a row from the range. As demonstrated above, it’s not only about creating the rule; the lookup table must be carefully designed so the end results align with requirements.
Reviewing Your Data Lookup Table
Translation is a popular method of data disguise, as it produces readable results and is therefore typically used for replacement of names or addresses. A lot depends on how the lookup table is accessed, as various access paths produce different results. However, creating a good lookup table is a science on its own.
We need to analyze the requirements, know our data sources, and design the lookup table and determine how it should be accessed. Only then one can develop the masking rules and deliver the solution that both meets tester’s requirements for usability and complies with the GDPR’s call to respect citizens’ rights to privacy.
This concludes my series, “The GDPR Clock Is Ticking,” intended to provide you with a practical approach to test data privacy. You can read the previous articles in the series and subscribe to get updates when other data and security blogs are posted at InsideTechTalk.com.
- How to Start a Test Data Privacy Project
- Pseudonymization and Test Data Quality
- Data Disguise Techniques
- Creating a Data Lookup Table
- Accessing a Data Lookup Table
Photo: Flickr:Tim Green