In my previous post, I covered common data disguise techniques and discussed how format preserving encryption helps create test data that is both usable and protects citizens’ rights to privacy under the EU General Data Protection Regulation (GDPR).
I concluded format preserving encryption is the best technique for numerical data, such as telephone numbers or any kind of ID numbers. It is not, however, the technique of choice for character data, such as names and addresses. Why? Let’s see an example:
Input: Marcin Grabinski
Sample Encryption Output*: LR/TVdWcXniHAdoN0zhLEw
Sample Format Preserving Encryption Output†: Ppqigh Seeslsxor
Classical encryption provides a high level of security, but makes the data unusable for testing. Format preserving encryption keeps the format (uppercase and lowercase, space in between) and length, but still gives testers a hard time—unless tests are fully automated. As long as testers are humans, they’ll prefer working with real names rather than “Ppqigh Seeslsxor.”
This brings us to data translation (a.k.a. data lookup), another commonly used data disguise technique. Data lookup is based on a table containing values used for replacement and an access path governing which row is used for replacement. Data lookup is typically used for names and addresses, but its use can go beyond that.
Creating a Data Lookup Table
If realistic names or addresses must be used, there is no magic to ease the process—you have to create a data lookup table. There are two major approaches:
- Fictional names
Names such as Mickey Mouse, Donald Duck or Snow White could be used. Just go through a database of Disney movies to get some ideas and have fun creating the data lookup table! I jest.
- Names from a real customer database
Now let’s get serious. There’s nothing wrong in using real names to populate your data lookup table—as long as they are disconnected from other identifying information such as date of birth or SSN. Take some first names, some last names and put them together into a translation table, like this:
Smithing must have been a noble occupation, since millions of people bear a last name meaning or derived from the word smith. Before you object to using “John Smith” for data disguising because so many people bear the name, think for a moment: I’m signing this post and all previous posts with my real name, Marcin Grabiński. Does it breach my right to privacy? No, because I’m not providing any other sensitive information such as date of birth, ID number or address.
My name is uncommon, but it’s not unique so it doesn’t identify me as a natural person. As long as a name is only a name, disconnected from a person’s other pieces of information, it fulfills the requirements of pseudonymization, and your organization will be okay under the EU GDPR. If you’re still in doubt about using “John Smith,” feel free to use my name in your lookup table.
Determining the Details of a Data Lookup Table
You may have noticed our sample data lookup table contains first and last names in a few different formats. This is the reality for storing data in real life databases—the same information is repeated in various formats. It’s a good practice to create a data lookup table reflecting the various formats. In terms of performance, it’s better to fetch a row from the data lookup table and replace the source values, rather than concatenate the pieces forming a name.
How large should a data lookup table be? Don’t be tempted into thinking it need as many rows as a production contact table contains. First of all, names aren’t unique and it’s perfectly okay to have 10,000 “John Smiths” (apologies to all Johns bearing that surname). Second, for testing purposes it’s rarely necessary to reflect the variety of names we have in real life. Customers I have worked with on test data privacy projects typically take 1-5 percent of their production names. The same ratio is good for addresses.
With addresses, one more consideration needs to be taken into account. They have to be valid, as, typically, a ZIP code is checked against a city and street. That can create a dilemma whether private addresses can be used, even if disconnected from clients’ names.
Most companies choose to use public addresses rather than private. For example, a UK bank I worked for created a lookup table for addresses using their 400 UK branches to avoid having any private address in a testing environments. (A bank having approximately 2 million end clients, was perfectly okay with just 400 distinct addresses in their test databases.) It’s a very sound approach.
Some customers, however, have a requirement for a more sophisticated addresses database including both private and public locations. Typically, insurance companies would have a choice of valid addresses, including private, because policy ratings can be based on exact property location.
Reviewing Your Data Lookup Table
Once a lookup table is set up, some thought must be given to the way the data lookup table is accessed and how replacement rows are retrieved. There are various techniques, like random, sequential, direct or hash lookup. Depending on what is used, the results can be irreversible or reversible, unique or non-unique, consistent or inconsistent across executions. I will cover these issues in an upcoming article. Stay tuned and subscribe to InsideTechTalk.com not to miss it.
To learn more about test data privacy in light of the GDPR, read the other posts in my “The GDPR Clock Is Ticking” blog series:
- How to Start a Test Data Privacy Project
- Pseudonymization and Test Data Quality
- Data Disguise Techniques
- Accessing a Data Lookup Table
- Two-tier Access to a Lookup Table
*I used https://www.tools4noobs.com/online_tools/encrypt/ with typed-in key to get the result
†I used Compuware Test Data Privacy and a Format Preserving Encryption Rule, with the same key as above