The EU General Data Protection Regulation (GDPR) is already law in all 28 EU member states, and it becomes enforceable on May 25, 2018. As I pointed out in my last post, test environments are a point of exposure under the GDPR, so action must be taken to desensitize test data and improve data protection. More than likely, a lot of IT organizations need to be thinking about how to start a test data privacy project.
You may be tempted to start a test data privacy project by buying a data masking tool. That instinct isn't wrong: you will need one, and a good one at that. However, the journey begins with you and your teammates.
After you define a test data privacy project, the journey should be supported by a project manager.
Your test data privacy project team should include these stakeholders:
- Representatives of data protection compliance
- Test management
- Database administrators
- Application subject matter experts
The goal of a test data privacy project is to eliminate the risk of a data breach from testing. It’s a myth that thorough testing requires real data. It’s best if sensitive data is entirely omitted. The question for stakeholders is: What is considered sensitive data?
What to Consider Sensitive Data for a Test Data Privacy Project
The GDPR says:
“The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.”
The key is “identifiable natural person.” Is a person’s name sensitive and does it need protection? Consider this: My name is Marcin Grabiński. I’m disclosing it freely to anyone, including you, my dear readers. It’s not a very common name, but it isn’t unique and it doesn’t identify me—others could have the same name.
Therefore, it’s absolutely valid to have real names in your testing databases. Surprised? The key is that a real name cannot be linked to a real home address, date of birth, passport number, or any other identifying information. Most test data privacy experts would nod their heads with approval if you used the following combination of information for testing:
- Marcin Grabiński (my real name)
- Norden Rd, Maidenhead, Berkshire SL6 4AY, UK (a valid address in the UK, but not mine)
- 09.06.1976 (not my real date of birth)
- +1 313-227-7088 (a valid phone number, but not mine)
The above information is a realistic set of personal data, but nothing sensitive is disclosed. A careful reader may notice that the UK address and the U.S. phone number make an unlikely pairing, but let's leave that aside for now. I'll write another article later about test data quality with respect to anonymization (the GDPR actually uses the term "pseudonymization"). Subscribe to InsideTechTalk.com and you won't miss it.
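The idea above can be sketched in code: keep a realistic-looking record, but break the link between the name and the real identifying details. This is a minimal illustration, not a real masking tool; the substitution pools and field names are my own assumptions for the sketch.

```python
# Minimal pseudonymization sketch: the name may stay, because on its own
# it does not identify a natural person. The pools below are illustrative
# assumptions; a real tool would draw from large lookup tables.
import random

ADDRESS_POOL = [
    "Norden Rd, Maidenhead, Berkshire SL6 4AY, UK",
    "12 Baker St, London NW1 6XE, UK",
]
PHONE_POOL = ["+1 313-227-7088", "+44 20 7946 0958"]

def pseudonymize(record, rng=random):
    """Return a copy of the record with identifying fields replaced
    by valid-looking but unrelated values."""
    safe = dict(record)
    safe["address"] = rng.choice(ADDRESS_POOL)
    safe["phone"] = rng.choice(PHONE_POOL)
    safe["date_of_birth"] = "09.06.1976"  # fixed fake value for the sketch
    return safe

real = {
    "name": "Marcin Grabiński",
    "address": "<real address>",
    "phone": "<real phone>",
    "date_of_birth": "<real DOB>",
}
print(pseudonymize(real)["name"])  # prints "Marcin Grabiński": the name is kept
```

The result is a record that looks real enough for testing while no longer identifying anyone.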
Create an Inventory of Columns in Databases
When it comes to initially determining sensitive data, those representing GDPR compliance might say all test data must be disguised, while testers might say nothing should be anonymized. Once all the stakeholders agree on what is sensitive, an inventory should be made of all columns (fields) in all databases (data sources) across all testing environments.
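The inventory step can be automated by querying the database catalog. Here is a minimal sketch using SQLite's `PRAGMA table_info` so the example is self-contained; on other platforms you would query the catalog views instead (for example, `INFORMATION_SCHEMA.COLUMNS` on SQL Server or MySQL). The table and column names are illustrative.

```python
# Build a (table, column) inventory across a database. SQLite in-memory
# is used here only to keep the sketch runnable end to end.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CUSTOMER_TABLE (NAME TEXT, ADDRESS TEXT, CONTACT_PHONE TEXT);
    CREATE TABLE PART_TABLE (PART_NO INTEGER, DESCRIPTION TEXT);
""")

def column_inventory(conn):
    """Return a list of (table, column) pairs for every user table."""
    inventory = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        for row in conn.execute(f"PRAGMA table_info({table})"):
            inventory.append((table, row[1]))  # row[1] is the column name
    return inventory

print(column_inventory(conn))
```

Running this once per data source across all testing environments gives you the raw list of columns that the stakeholders then review.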
The next step is to assign all sensitive columns (fields) to a category. In your database, there might be dozens of columns (fields) storing phone numbers (e.g., TELEPHONE, CONTACT_PHONE, MOBILE_NR, AUX_PHONE_NUMBER), but all of them fall into one category, "Telephone Number."
Why should this be done? Because you don't want to work with hundreds, if not thousands, of sensitive columns (fields); you want to be able to define disguise rules per category. Typically, an organization would define about 10 to 25 categories of sensitive data. This approach will significantly lower the effort required for consistent and thorough disguise. You may want to jump to an earlier article, "The Magic of Data Elements," to learn more about the concept of categorizing sensitive data.
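A first pass at this categorization can often be scripted by matching column names against patterns. This is a sketch under my own assumptions: the pattern list is illustrative, and a real project would refine it with the stakeholders named earlier rather than trust name matching alone.

```python
# Assign columns to sensitive-data categories by name patterns.
# The patterns are illustrative assumptions, not a complete rule set.
import re

CATEGORY_PATTERNS = {
    "Telephone Number": re.compile(r"PHONE|MOBILE|TEL", re.IGNORECASE),
    "Address":          re.compile(r"ADDR|STREET|CITY|POSTCODE", re.IGNORECASE),
    "Name":             re.compile(r"NAME|SURNAME", re.IGNORECASE),
}

def categorize(column):
    """Return the category of a column name, or None if no pattern matches."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(column):
            return category
    return None

# All four phone-like columns collapse into one category:
for col in ["TELEPHONE", "CONTACT_PHONE", "MOBILE_NR", "AUX_PHONE_NUMBER"]:
    assert categorize(col) == "Telephone Number"
print(categorize("PART_NO"))  # prints None: no pattern matched
```

Anything the patterns miss (or wrongly flag) still needs a human review, which is exactly where the application subject matter experts earn their place on the team.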
At the end of the preparation process, you should have a spreadsheet like this (a simplified illustration):

| Table (file)   | Column (field) | Sensitive Data Category |
|----------------|----------------|-------------------------|
| CUSTOMER_TABLE | NAME           | Name                    |
| CUSTOMER_TABLE | ADDRESS        | Address                 |
| CUSTOMER_TABLE | CONTACT_PHONE  | Telephone Number        |
| PART_TABLE     | PART_NO        | (none)                  |
| PART_TABLE     | DESCRIPTION    | (none)                  |
At one glance you can assess whether any given table (file) contains sensitive data (PART_TABLE doesn't and thus can be excluded from the scope). On the other hand, you can clearly see that if a rule for disguising addresses is created, tests should involve CUSTOMER_TABLE.
Categorizing sensitive data in this way is a great first step in a test data privacy project, with the end goal of improving data privacy and complying with the GDPR before fines start hitting noncompliant companies less than two years from now. As promised, in my next post I'll dive deeper into the idea of pseudonymization. For now, I recommend reading this article from the International Association of Privacy Professionals.
To learn more about test data privacy in light of the GDPR, read the other posts in my “The GDPR Clock Is Ticking” blog series:
- Pseudonymization and Test Data Quality
- Data Disguise Techniques
- Creating a Data Lookup Table
- Accessing a Data Lookup Table
- Two-tier Access to a Lookup Table