Accessing a data lookup table
August 9, 2016 Cybersecurity, Data 0 Comments

The GDPR Clock Is Ticking: Accessing a Data Lookup Table

In my previous post, I wrote about creating a data lookup table to generate realistic, usable test data that protects citizens’ rights to privacy under the EU General Data Protection Regulation (GDPR).

I concluded that a lot depends on the method used for accessing a data lookup table. Some access methods give repeatable results, some don’t. Some are reversible, some aren’t.

In this post, I’ll review the most commonly used methods for accessing a data lookup table, illustrate them with an example and discuss their impact on data quality and data security. Below are the input and sample lookup tables I’ll reference throughout this piece.

Input (Real data):

Test Data Privacy | Data Lookup Table

Sample Lookup Table:

Test Data Privacy | Data Lookup Table

Methods for Accessing a Data Lookup Table

Random Access

Random access is self-explanatory. A random row is selected from the lookup table and values from that row are used to replace respective source values.

Using our sample input and lookup table, the result might be:

Test Data Privacy | Data Lookup Table

If I execute the data masking again, I might get another result:

Test Data Privacy | Data Lookup Table

Using the random method for accessing a data lookup table produces inconsistent results across executions, which might be welcomed by security compliance but considered unacceptable to testers. The results are non-reversible, i.e. knowing the output, one cannot guess the input. The method by default gives non-unique results, although uniqueness can be requested.

Conclusion:

  • Consistency across executions: No
  • Reversible: No
  • Unique: No (by default; uniqueness can be requested)

Sequential Access

Another self-explanatory method for accessing a data lookup table is the sequential method. During execution, rows are selected sequentially until the end of the lookup table is reached, and then again from the beginning (unless uniqueness is required).

Using our sample input and lookup table the result will be:

Test Data Privacy | Data Lookup Table

No matter how many times I execute my masking, the result will be always the same (provided no one fiddles with the lookup table!). This means the method is reversible. You quite easily could figure out the sequence and, therefore, know that “Jean Kovalev” in tests always translates to “Marcin Grabinski” in production.

Conclusion:

  • Consistency across executions: Yes
  • Reversible: Yes
  • Unique: No (by default; uniqueness can be requested)

Direct Lookup Access (Search)

Using direct lookup for accessing a data lookup table is based on having some real piece of data in the lookup table and using it for replacement-values retrieval. In our example the lookup table contains Client ID (column “ID”) which is equal to the real identification number (column CLIENT_ID in the production table).

Using our sample input and lookup table the result will be:

Test Data Privacy | Data Lookup Table

No matter how many times I execute my masking, the result will be always the same (again, provided no one fiddles with the lookup table!). This means the method is reversible. You could figure out what replaces what and, therefore, know that “Hans Kowalski” in tests always translates to “Marcin Grabinski” in production.

One con of this method is the lookup table requires careful preparation—you need a replacement value for each Client ID represented in the production.

Conclusion:

  • Consistency across executions: Yes
  • Reversible: Yes
  • Unique: Yes (as long as the lookup has a unique row for each unique source)

Hash Access

Using the hash method for accessing a data lookup table converts any input string into a number between one and the size of the lookup table. That number is used as a row pointer.

The conversion is consistent but doesn’t guarantee uniqueness. Many source values can be converted to the same pointer, which is great because it eliminates the need for creating lookup tables that reflect production (as in the Search method described above).

Using our sample input and lookup table (with the “CLIENT_ID” column used as the source value for the hashing function) the result will be:

Test Data Privacy | Data Lookup Table

The result will be consistent across executions (as long as the lookup remains untouched). What’s more, by using CLIENT_ID as the source value for hashing, the results will be consistent across the enterprise (assuming CLIENT_ID is used consistently). Note, however, that hashing is a “N to 1” function, i.e. many client IDs can be converted to the same number. This makes the method irreversible.

Conclusion:

  • Consistency across executions: Yes
  • Reversible: No
  • Unique: No (by default; uniqueness can be requested)

Reviewing Your Data Lookup Table

To summarize, before a lookup table is set up, some thoughts must be given to the method you’ll use for accessing the data lookup table and how replacement rows are retrieved. There are various techniques, like random, sequential, direct and hash lookup. Depending on what is used, the results can be irreversible or reversible, unique or non-unique, consistent or inconsistent across executions.

In the spirit of accessing a data lookup table, I’ll explain the possible results:

Test Data Privacy | Data Lookup Table

  1. Reversible? These cells indicate a person with access to target (masked) values and the lookup table could figure out the source (real) values.
  2. Unique? These cells indicate the methods that? produce non-unique results by default, but uniqueness can be requested.
  3. Consistency? These cells indicate the results will be consistent as long as the lookup table doesn’t change.

In the next, and last, post of my blog series, I’ll cover the topic of why replacement values in a lookup table sometimes must come from a defined range. For example, international companies typically require that when masking addresses, the replacement address is from the same country as the original.

I’ll also expand on uniqueness, as many organizations misunderstand it, yet require unique values where reality is not unique. Stay tuned and subscribe to InsideTechTalk.com not to miss it.

To Learn more about test data privacy in light of the GDPR, read the other posts in my “The GDPR Clock Is Ticking” blog series:

The following two tabs change content below.

Marcin Grabinski

Marcin Grabiński, EMEA Technical Solution Specialist at Compuware, has almost 20 years of experience in the IT sector, including over 15 in mainframe. Marcin is Compuware’s resident expert in its Test Data Privacy solution and he has been involved in the execution of numerous data privacy projects with major companies in the financial services sector across Europe, making him the Most Frequent Traveler in Compuware. Marcin is also a history fanatic brimming with little-known historical facts.
Share: