Anonymisation & Pseudonymisation of Data-I
In the previous series of these articles, we tried to establish the meaning of privacy and its relation to data and protection under the Indian PDPB2019. We also looked at data privacy in some detail, along with the concepts of personal and sensitive personal information. The last series touched upon data classification.
In this series, we look at how data is to be treated under the Indian PDPB2019.
The applicability of the Bill is summarised in Section 2 of the Indian PDPB2019, which states as under:
-Application of Act to processing of personal data
The provisions of this Act shall apply to,
(a) the processing of personal data where such data has been collected, disclosed, shared or otherwise processed within the territory of India;
(b) the processing of personal data by the State, any Indian company, any citizen of India or any person or body of persons incorporated or created under Indian law;
(c) the processing of personal data by data fiduciaries or data processors not present within the territory of India, if such processing is
(i) in connection with any business carried on in India, or any systematic activity of offering goods or services to data principals within the territory of India; or
(ii) in connection with any activity which involves profiling of data principals within the territory of India.
The provisions of this Act shall not apply to the processing of anonymised data, other than the anonymised data referred to in the Bill.
Among the host of security techniques available, pseudonymisation and anonymisation are highly recommended by privacy regulations. Such techniques minimise risk and help Data Fiduciaries and Data Processors meet their data compliance obligations.
Anonymized Data
Anonymisation is defined in Section 3(2) as:
“anonymisation” in relation to personal data, means such irreversible process of transforming or converting personal data to a form in which a data principal cannot be identified, which meets the standards of irreversibility specified by the Authority.
Accordingly, anonymised data is data about living individuals which has been transformed such that it is not possible to identify the data principal from the data alone or from the data together with certain other information.
Another crucial aspect to be noted is that the provisions of this Act will not be applicable to Anonymized data.
Therefore, to protect data, anonymise personal data and make sure re-identification by combining anonymised data with other population data is impossible. Statistical packages may have tooling for anonymisation. Additionally, Anonymisation may also be described as a type of information sanitisation whose intent is to protect privacy. It is the process of either encrypting or removing personally identifiable information from data sets, so that the people whom the data describe remain anonymous. Identifiers can apply to any natural or legal person, living or dead, including their dependents, ascendants and descendants. Included are other related persons, direct or through interaction.
Anonymised data is always unrecognisable, even to the data owner.
On the other hand, Pseudonymisation is a procedure in which identifying fields in a data record are replaced by artificial identifiers (pseudonyms). There can be a single pseudonym for a collection of replaced fields or a pseudonym per replaced field. “Pseudonymisation” of data means replacing any identifying characteristics of data with a pseudonym, or, in other words, a value which does not allow the data subject to be directly identified. The purpose is to make it harder to identify individuals from the data record and thus to lower respondent or patient objections to its use. Data in this form are suitable for extensive analytics and processing.
As an example, pseudonymisation helps in the following scenario.
So, when is it necessary not to fully anonymise your data? When data subjects have the right to withdraw their data from a study. Here, the data controller has to be able to identify the data of a specific subject in order to delete it from the dataset.
Although pseudonymisation has many uses, it should be distinguished from anonymisation: in many cases it provides only limited protection for the identity of data subjects, because it still allows identification by indirect means. Where a pseudonym is used, it is often possible to identify the data subject by analysing the underlying or related data.
The legal distinction between anonymised and pseudonymised data lies in whether the data is categorised as personal data. Pseudonymous data still allows for some form of re-identification (even indirect and remote), while anonymous data cannot be re-identified.
In general terms, a natural person can be considered as “identified” when, within a group of persons, he or she is “distinguished” from all other members of the group. Accordingly, the natural person is “identifiable” when, although the person has not been identified yet, it is possible to do it… Thus, a person does not have to be named in order to be identified. If there is other information enabling an individual to be connected to data about them, which could not be about someone else in the group, they may still “be identified”.
Anonymisation & Pseudonymisation of Data-II
Just to reiterate, Pseudonymization is a method to substitute identifiable data with a reversible, consistent value. Anonymization is the destruction of the identifiable data.
Some basic guidelines for the process of Anonymisation are as under:
-remove all direct identifiers
-remove indirect identifiers that are not essential for reusing the data
-remove indirect identifiers with a high disclosure risk, such as usual or known characteristics
-reduce the level of detail of the indirect identifier
A combination of indirect identifiers may also lead to identification of a respondent; for instance, research about people with hearing and speech impairments in a specific village. Consequently, in certain cases it is advisable to report at a broader geographic level, such as the state instead of the precise village or town. This is primarily done to respect the privacy of the individual.
Another example is the combination of age in days and date of exam, which may reveal the exact date of birth of the respondent. In research concerning school classes, participating children may thus be identified. In this case, either the exam date can be reduced to the year, or the age should be rounded to the month or year.
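The reduction in detail described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name generalise_record, the field names and the sample values are hypothetical, and a year is approximated as 365.25 days.

```python
from datetime import date


def generalise_record(age_in_days: int, exam_date: date) -> dict:
    """Coarsen two indirect identifiers so that, together, they no
    longer reveal an exact date of birth."""
    # Approximate age in whole years (365.25 days per year is an assumption).
    age_years = int(age_in_days // 365.25)
    # Keep only the year of the exam, dropping day and month.
    return {"age_years": age_years, "exam_year": exam_date.year}


# Invented sample values, for illustration only.
record = generalise_record(age_in_days=4000, exam_date=date(2020, 3, 14))
print(record)
```

In practice it may be enough to coarsen only one of the two fields, as the text suggests; coarsening both is simply the more cautious choice.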
Make sure you do not share the following direct identifiers with others or archive them in a public archive:
The Indian PDPB2019 does not apply to anonymised data. However, pseudonymised data falls fully within the scope of the Bill and must be treated with the same levels of consideration in terms of collection, security, processing and deletion.
Non- Anonymized Data
Whatever is not considered "anonymised" is data that retains some element of identification of an individual. The concept of anonymisation relates to personal data, but there are other categories of data which are neither personal data nor anonymised data. These may include data that identifies a company, or business data that carries no personal identity and so falls outside the definitions of both personal data and anonymised data. Such data is also outside the scope of the PDPB2019.
One common practice used for pseudonymisation is a process called tokenization. This provides a logical token for each unique name and requires access to additional information to re-identify the data:
Name of Author | Token/Pseudo Name | Anonymised |
Amish | XUYE03T | aaaaaa |
Gurcharan | PTQZB9Y | aaaaaa |
Ruskin | PJ24T | aaaaaa |
Amish | XUYE03T | aaaaaa |
Corbett | TYLH | aaaaaa |
Wodehouse | BCDKW9 | aaaaaa |
Gurcharan | PTQZB9Y | aaaaaa |
Ruskin | PJ24T | aaaaaa |
Here, with the pseudonymised data, we may not know the identity of the data subject, but we can correlate entries with specific subjects (records 1 and 4, 2 and 7, and 3 and 8 refer to the same person). If we have access to the token lookup tables, we can re-identify the data and get back to the real identity. With the anonymised data, however, we only know that there are 8 records, and there is no method to re-identify the data.
With anonymisation, we must also be concerned about "indirect re-identification". Returning to our author example above, an analysis of the writing style of our anonymous authors might allow us to identify them indirectly. We might not be able to identify the name, but we might be able to tell that specific books were written by the same person, because of their unique writing style.
For instance, directly identifiable data such as audio or video files are hard to anonymise (without losing their scientific value) and should in general not be published in open access.
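The behaviour of the author table can be reproduced with a small Python sketch. It is a minimal illustration, not a production tokenization service: the Tokenizer class, its method names and the random 8-character token format are assumptions made for this example, and the tokens it generates will differ from those shown in the table.

```python
import secrets


class Tokenizer:
    """Gives each distinct name one random token and keeps the lookup table."""

    def __init__(self) -> None:
        self._token_by_name = {}
        self._name_by_token = {}

    def tokenize(self, name: str) -> str:
        # Reuse an existing token so the same person always gets the same pseudonym.
        if name not in self._token_by_name:
            token = secrets.token_hex(4).upper()
            while token in self._name_by_token:  # retry on the unlikely collision
                token = secrets.token_hex(4).upper()
            self._token_by_name[name] = token
            self._name_by_token[token] = name
        return self._token_by_name[name]

    def reidentify(self, token: str) -> str:
        # Only possible for whoever holds the lookup table.
        return self._name_by_token[token]


authors = ["Amish", "Gurcharan", "Ruskin", "Amish",
           "Corbett", "Wodehouse", "Gurcharan", "Ruskin"]

tokenizer = Tokenizer()
pseudonymised = [tokenizer.tokenize(name) for name in authors]
anonymised = ["aaaaaa" for _ in authors]  # every record looks the same

for name, token, anon in zip(authors, pseudonymised, anonymised):
    print(f"{name} | {token} | {anon}")
```

Records 1 and 4 receive the same token, so they can still be correlated, and reidentify recovers the name for anyone with access to the lookup table. The anonymised column keeps no such link.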
“Pseudonymisation is effectively only a security measure. It does not change the status of the data as personal data. The Indian PDPB2019 makes it clear that pseudonymised personal data remains personal data and within the scope of the Bill.”
Additionally, where clinical trial data has had all identifiers removed, it can only be considered anonymised data if it is impossible to re-identify the trial subjects, even when the data is cross-referenced against supporting documentation.
While there may be incentives for some organisations to process data in anonymised form, this technique may devalue the data, so that it is no longer useful for some purposes. Therefore, before anonymisation, consideration should be given to the purposes for which the data is to be used.
A variety of methods are available depending on the degree of risk and the intended use of the data.
A directory replacement method involves modifying the name of individuals integrated within the data, while maintaining consistency between values, such as “postcode + city”.
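One possible reading of directory replacement is sketched below in Python. It assumes that each original value is swapped for an entry from a replacement directory, that the same original always receives the same replacement, and that a postcode and its city are replaced together as one pair so they stay consistent. The DirectoryReplacer class, the directories and the sample records are all invented for illustration.

```python
class DirectoryReplacer:
    """Swaps each distinct original value for the next unused directory entry."""

    def __init__(self, directory):
        self._unused = list(directory)
        self._assigned = {}

    def replace(self, original):
        # The same original value always maps to the same replacement.
        if original not in self._assigned:
            if not self._unused:
                raise ValueError("replacement directory exhausted")
            self._assigned[original] = self._unused.pop(0)
        return self._assigned[original]


# Hypothetical replacement directories.
names = DirectoryReplacer(["Person A", "Person B", "Person C"])
places = DirectoryReplacer([("110001", "New Delhi"), ("400001", "Mumbai")])

# Invented sample records.
records = [
    {"name": "Amish", "postcode": "560001", "city": "Bengaluru"},
    {"name": "Corbett", "postcode": "600001", "city": "Chennai"},
    {"name": "Amish", "postcode": "560001", "city": "Bengaluru"},
]

replaced = []
for row in records:
    # Replace postcode and city as one unit so the pair remains plausible.
    postcode, city = places.replace((row["postcode"], row["city"]))
    replaced.append({"name": names.replace(row["name"]),
                     "postcode": postcode, "city": city})

for row in replaced:
    print(row)
```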
Scrambling techniques involve a mixing or obfuscation of letters. The process can sometimes be reversible. For example: Sameer could become Meesar.
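A reversible scramble can be sketched in Python as a keyed shuffle of the letters. The functions scramble and unscramble and the integer key are assumptions made for this example. Python's random module is not cryptographically secure, so this only illustrates the idea and is not a protective measure in itself.

```python
import random


def scramble(text: str, key: int) -> str:
    """Reorder the characters of text using a shuffle seeded by key."""
    order = list(range(len(text)))
    random.Random(key).shuffle(order)
    return "".join(text[i] for i in order)


def unscramble(scrambled: str, key: int) -> str:
    """Undo scramble() by rebuilding the same shuffle from the same key."""
    order = list(range(len(scrambled)))
    random.Random(key).shuffle(order)
    original = [""] * len(scrambled)
    for position, source_index in enumerate(order):
        # scramble() placed text[source_index] at this position.
        original[source_index] = scrambled[position]
    return "".join(original)


mixed = scramble("Sameer", key=42)
print(mixed, "->", unscramble(mixed, key=42))
```

Anyone who knows the key can reverse the scramble, which is why scrambled data is still best treated as pseudonymised rather than anonymised.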
A masking technique allows a part of the data to be hidden with random characters or other data. For example: Pseudonymisation with masking of identities or important identifiers. The advantage of masking is the ability to identify data without manipulating actual identities.
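A simple form of masking can be sketched in Python as follows. The mask function, its parameters and the sample number are hypothetical; this version hides everything except the last few characters with a fixed mask character, which is one common convention rather than a prescribed method.

```python
def mask(value: str, visible: int = 4, mask_char: str = "X") -> str:
    """Hide all but the last `visible` characters of value."""
    # Values too short to mask partially are hidden completely.
    if visible <= 0 or len(value) <= visible:
        return mask_char * len(value)
    hidden_count = len(value) - visible
    return mask_char * hidden_count + value[-visible:]


print(mask("9876543210"))  # XXXXXX3210
print(mask("abc"))         # XXX
```

The visible tail lets staff confirm they are looking at the right record without exposing the full identifier.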
Personalised anonymisation is another popular method. This allows users to apply their own anonymisation technique. Custom anonymisation can be carried out using scripts or an application.
Data blurring uses an approximation of data values to render their meaning obsolete and/or render the identification of individuals impossible.
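Blurring by approximation can be sketched in Python by replacing an exact value with the band it falls into. The function blur_to_band and the band widths are assumptions chosen for illustration, and the sketch assumes non-negative whole-number inputs such as ages or salaries.

```python
def blur_to_band(value: int, band_width: int) -> str:
    """Replace an exact value with the range (band) that contains it."""
    lower = (value // band_width) * band_width
    return f"{lower}-{lower + band_width - 1}"


print(blur_to_band(37, 10))        # 30-39
print(blur_to_band(52300, 10000))  # 50000-59999
```

Wider bands give stronger protection but less analytical detail, so the width should be chosen with the intended use of the data in mind.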
Data masking versus data encryption: A comparison of two pseudonymisation methods
Distinct from data masking, data encryption translates data into another form, or code, so that only people with access to a secret key (formally called a decryption key) or password can read it.
Data masking is a more widely applicable solution as it enables organizations to maintain the usability of their customer data.
By Sameer Mathur
-Founder & CEO, SM Consulting
-President, Delhi-NCR Chapter of the Foundation of Data Protection Professionals in India
-With inputs from Mr. Vijayashankar Nagaraj Rao
You can also read about the classification of data which we have discussed in detail in our previous blog.