Based on my previous posts, you should have an idea of how to approach a test data privacy project and be aware of the considerations for test data quality in light of the upcoming EU General Data Protection Regulation (GDPR). Now let’s look at common data disguise techniques, and then take a closer look at one of them in particular.
Data Disguise Techniques
First, we need to clarify the different terms used, sometimes interchangeably, to describe the process of data anonymization. Organizations embarking on a test data privacy project should create their own glossaries to remove any misunderstandings. For the sake of this post (and a few upcoming), I will use the following definitions:
Data disguise: The permanent replacement of data to conceal the identity of original data. This action is accomplished using one or more techniques, such as translation, encryption, replacement or generation, to replace the underlying data. Data disguising is also known as data sanitization and data obfuscation.
Data encryption: The process of encoding messages or information so that only authorized parties can read them. A data encryption scheme usually uses a pseudo-random encryption key generated by an algorithm or provided by the person who initiates the encryption. In the context of test data privacy, we will narrow this definition and speak of format-preserving encryption instead.
Format-preserving encryption: A variation of data encryption that keeps the original format of the input data, thus making the result more usable for testing purposes. More on this later.
Data masking: The temporary replacement of data to hide it from the view of the user, typically with a symbol, prior to the display or printing of the data. Note that the underlying data is not modified.
Hashing: The process of converting an input value (seed) to a hash value (typically a number) using an algorithm. In the context of test data privacy, this process is used to obtain a replacement value from a hash table (look-up table). The hash value serves as a pointer into the look-up table (by row number or key). It is part of a disguise technique that we will call data translation.
Data translation is based upon using existing values stored within files as replacements to sensitive data values. Data translation fits well for fields that require resulting values to be fictionalized, yet still readable to a user and valid for an application test. For this reason, different data translation methods need to be identified depending on the data, access method and organization of the translation tables being used to replace existing data.
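To make the hashing and data translation definitions more concrete, here is a minimal Python sketch, not the method of any particular product. The look-up table `REPLACEMENT_NAMES`, the function `translate` and the secret value are all hypothetical names invented for this illustration. A keyed hash (HMAC-SHA-256) of the input is reduced to a row number, and that row supplies the fictional replacement.

```python
import hashlib
import hmac
from typing import Sequence

# Hypothetical look-up table of fictional replacement values.
REPLACEMENT_NAMES = [
    "Alice Archer",
    "Brian Baker",
    "Carla Cooper",
    "David Draper",
    "Emma Ellis",
]


def translate(value: str, table: Sequence[str], secret: bytes) -> str:
    """Return a replacement from `table`, using a keyed hash of `value` as the row pointer."""
    # Normalize so that "Marcin Grabinski" and " marcin grabinski " map to the same row.
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(secret, normalized, hashlib.sha256).digest()
    # Interpret the first 8 bytes as an integer and reduce it to a valid row number.
    row = int.from_bytes(digest[:8], "big") % len(table)
    return table[row]


if __name__ == "__main__":
    secret = b"demo-secret"  # assumption: in practice this would be managed securely
    print(translate("Marcin Grabinski", REPLACEMENT_NAMES, secret))
```

Because the row is derived from the input, the same original value always receives the same replacement, which helps keep related records consistent across files. With a table this small, different inputs will often share a replacement; a real translation table would need to be much larger.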
Data Disguise Techniques to Start with: Format-Preserving Encryption
Most organizations start their data disguise efforts with data encryption because it is easy to implement, fast and secure. We need to remember, however, that data disguising is not a goal in itself: disguised data still needs to be used in tests. Therefore, it’s better to use format-preserving encryption.
Let’s consider the example of a phone number:
Input: +44 1628 611000
If we use a conventional encryption scheme, then depending on the algorithm and key used, we might get something like the following (I used an online encryption tool to get the result):
Sample Encryption Output: IbhuIv6YK2M38C97EXQpFA
From a data security perspective, the output string looks good. You can’t even tell it’s a phone number, let alone try to decipher it. But here’s the crux of the problem: how are application testers supposed to use this cryptic string? If we use format-preserving encryption instead, the result looks different:
Sample Format-Preserving Encryption Output: +44 2356 257230
To obtain the above result I used a test data privacy toolset that supports format-preserving encryption, and I requested that the first three characters remain unmasked. The encryption key was the same as in the first example.
The phone number is encrypted, but we can clearly see it’s from the UK and the formatting is reader-friendly. Additionally, the original length has been preserved. You might argue the new value could be somebody else’s phone number. Indeed, it could; however, remember that in the context of test data privacy the goal is not disguising data for its own sake. The goal is pseudonymization, that is, making it reasonably difficult to re-identify a natural person.
If we swap two people’s phone numbers, we end up with valid test data that looks realistic but is fictional (i.e. the phone numbers are real, but they no longer belong to the people they are attached to). This way the test data satisfies both the EU GDPR and testing requirements, being usable while protecting the right to privacy.
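The toolset used for the phone number example is not named here, so the following Python sketch only illustrates the general idea of keeping the format while changing the content. It is a toy keyed substitution, not a vetted format-preserving encryption scheme such as the FF1 mode specified by NIST, and it should not be used to protect real data. The names `disguise`, `_keystream` and `_shift`, and the `tweak` and `keep_prefix` parameters, are assumptions made for this illustration. Each digit is replaced by a digit and each letter by a letter of the same case, while spaces, punctuation and the first `keep_prefix` characters stay as they are.

```python
import hashlib
import hmac
import string


def _keystream(key: bytes, tweak: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from the key and tweak using HMAC-SHA-256."""
    out = b""
    counter = 0
    while len(out) < length:
        block = hmac.new(key, tweak + counter.to_bytes(4, "big"), hashlib.sha256)
        out += block.digest()
        counter += 1
    return out[:length]


def _shift(ch: str, k: int, direction: int) -> str:
    """Rotate `ch` by `k` positions within its own character class."""
    for alphabet in (string.digits, string.ascii_lowercase, string.ascii_uppercase):
        if ch in alphabet:
            return alphabet[(alphabet.index(ch) + direction * k) % len(alphabet)]
    return ch  # spaces, '+', accented letters, etc. pass through unchanged


def disguise(value: str, key: bytes, tweak: bytes = b"",
             keep_prefix: int = 0, decrypt: bool = False) -> str:
    """Toy format-preserving substitution; pass decrypt=True to reverse it."""
    direction = -1 if decrypt else 1
    head, tail = value[:keep_prefix], value[keep_prefix:]
    stream = _keystream(key, tweak, len(tail))
    return head + "".join(_shift(ch, stream[i], direction) for i, ch in enumerate(tail))


if __name__ == "__main__":
    key = b"demo-key"  # assumption: a real key would come from a key store
    masked = disguise("+44 1628 611000", key, keep_prefix=3)
    print(masked)
    print(disguise(masked, key, keep_prefix=3, decrypt=True))
```

The `tweak` parameter is there because, in this toy version, every value processed with the same key and tweak is shifted by the same amounts per position; passing something record-specific reduces that weakness, although a standards-based scheme remains the right choice for production use.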
Deciding on the Right Data Disguise Techniques
Let’s take our data encryption example further and encrypt my own name, using the same methods as above with the same encryption key.
Input: Marcin Grabinski
Sample Encryption Output: LR/TVdWcXniHAdoN0zhLEw
Sample Format-Preserving Encryption Output: Ppqigh Seeslsxor
Again, classic data encryption provides a high level of security; however, it makes the data unusable for testing. Format-preserving encryption keeps the format and length, but unless tests are fully automated, the result would give testers a hard time.
We see, therefore, that format-preserving encryption works well for numerical data, such as phone numbers, account numbers and any kind of ID. If the end result must be readable to human eyes, however, it’s not the technique of choice for string data, such as names or addresses.
In my test data privacy career, I’ve met only one customer who agreed to use encryption for all data. The reason was simple: all tests were fully automated and the testing software didn’t mind dealing with “Ppqigh Seeslsxor.”
However, most organizations would need something more sophisticated. If we need the disguised data to resemble reality, my recommendation is data translation (replacement with a look-up table), which I will cover in a future post. Stay tuned by subscribing to InsideTechTalk.com.
To learn more about test data privacy in light of the GDPR, read the other posts in my “The GDPR Clock Is Ticking” blog series.