Privacy Requirements
With advances in information and communication technology, there is an increasing need to store electronic data securely and share it with authorized users. If a large amount of data is made public, it can be used for various research purposes, and data mining techniques can be used as extraction tools to draw maximum information from huge collections of data. However, there is another side to this public access: sharing data publicly can leak sensitive information about users, which in turn creates privacy and ethical issues. Because of these privacy concerns, many individuals refrain from sharing their data publicly, and this results in unavailability of data.
Hence, privacy should be given top priority in the data mining field. Privacy Preserving Data Mining (PPDM) is gaining ground as a research area for resolving the privacy issues in data mining. This paper surveys the literature and draws conclusions on the basis of some pre-defined parameters. The initial part of the paper provides the basic concepts of PPDM. The next section classifies PPDM on the basis of two scenarios, distributed and centralized. The paper then explores ways of studying the privacy issues. Finally, it presents a comparative study of different PPDM techniques along with their limitations, followed by conclusions.
Problem
In 2002, there was a countrywide protest in Japan over data collection by the government. In another instance, there was significant worry over the US Total Information Awareness (TIA) program, which even prompted the introduction of a bill in the US Senate. This bill would have barred the responsible US department from conducting data mining initiatives. Such reactions not only ignore the importance of data mining but also eliminate probable chances of making important discoveries.
Privacy requirements
Among all the concerns when studying different types of data, whether professional or medical, the main one is privacy. There are two ways in which privacy can be interpreted. In the first, privacy is very important with respect to medical data because it contains sensitive information about a patient's health, and its disclosure could lead to the patient being excluded from society. While performing medical data mining, the original data is required to draw conclusions accurately; incorrect data may lead to unusable solutions. If any kind of person-specific information is disclosed, it can lead to various issues, including ethical problems. Hence, the privacy concerns of the individual should be handled with extra care, and prior research should be done before publishing the data.
In the second interpretation, privacy relates to preventing confidential data from being disclosed during the data mining process. The severity of a privacy issue depends on the sensitivity of the disclosed data and the context in which it is used. The main concern, however, is how to provide privacy during data mining without losing information. Various methods can provide privacy, but they tend to lose information, such as hiding of data, compression of data, and removal of attributes. Besides this, there is the issue of computational burden: cryptographic methods tend to create additional computational and technical overhead. The feasibility of a secure protocol implementation depends on the size of the combinatorial circuit that computes the function to be evaluated (Nathiya, Kuyin & Sundari, 2016).
Methods
In a distributed scenario, as the number of participating parties increases, both the computational and communication costs become very high (Nathiya, Kuyin & Sundari, 2016). No single PPDM algorithm is able to resolve all of these concerns. No generic solution is available that addresses every privacy issue, but recent research focuses on finding efficient protocols for key problems that can bring privacy, computational feasibility, and data utility to the same level.
A considerable number of data mining methods work on the assumption that almost all data is available at a central location known as a data warehouse. This creates a huge privacy problem, because exposure of a single piece of data may disclose all the data. Although users are comfortable sharing some of their data with some entities, they are not comfortable exposing all of it. This was the main cause of the protest in Japan: the government was not gathering new data; instead, it was storing in a single repository all the data previously managed by the prefectures. Mediators and federated databases do not resolve this centralization issue.
They end up changing only the nature of attacks, because mediators then provide access to the data, and through this access the data gets exposed irrespective of its storage location. It is immaterial whether the data warehouse has a physical existence or not: if the data mining algorithm can access it, then an attacker could access it too.
PPDM, or Privacy-Preserving Data Mining, techniques allow information to be extracted from large datasets while preventing access by non-authenticated users. Along with this, PPDM methods also remove some original data to ensure that privacy is maintained (Nathiya, Kuyin & Sundari, 2016). This degradation in the quality of data is the common trade-off between data quality and level of privacy. PPDM methods are designed with the aim of extracting maximum knowledge from data while maintaining a level of privacy. Throughout this paper, "transformed data" refers to the data that results from a privacy-preserving technique. A natural solution to this problem is de-identification, i.e., removing all identifiable information from the data and then letting users access it.
The main issue then lies in defining what counts as identifiable data. Even where de-identification is feasible and legally acceptable, it does not ensure that the data will retain its utility. Better alternatives are available: first, avoid creating a centralized data warehouse in the first place; then, use distributed data mining algorithms that minimize the data exchange needed to build accepted models.
Anonymization Based
A primitive solution to this problem is "de-identification", which means removing all critical and sensitive information from the data before it is released. However, pinpointing the exact data that needs to be removed is often a difficult task. Moreover, even if the de-identification process turns out to be legally acceptable, it is still extremely hard to perform effectively without losing the main utility of the data. Anonymization is achievable through techniques such as data removal, generalization, swapping, and permutation ("Data Anonymization Approach for Data Privacy", 2015). Among these, "k-anonymity" is considered the classical anonymization method, and the majority of studies are centred around it.
Apart from these, there are other methods that are improved versions of the k-anonymity method, including t-closeness, p-sensitive, (a,k)-anonymity, l-diversity, km-anonymization, and (k,e)-anonymity (Wei, Natesan Ramamurthy & Varshney, 2018). k-anonymity has some drawbacks, and the improved versions try to fill in those gaps. The works by Samarati and Sweeney (Sweeney, 2002) prove that just removing sensitive information from data is not sufficient; rather, k-anonymity and the methods based on it should be used to provide better protection. A quasi-identifier (QI) is a combination of person-specific attributes that together can identify an individual, and k-anonymity-based methods operate on these QI attributes.
For instance, in k-anonymity, a quasi-identifier such as date of birth could simply be generalized to month of birth. Furthermore, task-independent techniques also exist to preserve the information, utility, and privacy of data; in these, only the sensitive data in the raw data is transformed before it is sent for mining. In the generalized privacy-preserving methods, information loss occurs because both the sensitive and the QI attributes are generalized. By transforming only the sensitive attributes and certain parts of the QI, some methods aim for both privacy and minimal information loss. Specialized use cases are also handled by anonymization methods. For instance, a modified l-diversity model allows data operators to hide sensitive medical attributes of patient data; in this case, the additional characteristics and conditions of medical information are taken into consideration.
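As an illustrative sketch (not taken from the cited works), the generalization step and the k-anonymity check described above can be expressed as follows; the toy records, attribute names, and the value k = 2 are hypothetical:

```python
from collections import Counter

def generalize_dob(dob):
    """Generalize a full date of birth 'YYYY-MM-DD' to month of birth 'YYYY-MM'."""
    return dob[:7]

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs >= k times."""
    counts = Counter(tuple(r[qi] for qi in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical toy table: 'dob' and 'zip' act as the quasi-identifier (QI).
raw = [
    {"dob": "1990-05-01", "zip": "12345", "diagnosis": "flu"},
    {"dob": "1990-05-17", "zip": "12345", "diagnosis": "cold"},
    {"dob": "1990-05-30", "zip": "12345", "diagnosis": "flu"},
]

# Generalizing date of birth to month of birth makes the table 2-anonymous:
# every record now shares its QI tuple with at least one other record.
generalized = [{**r, "dob": generalize_dob(r["dob"])} for r in raw]
```

The raw table is not 2-anonymous because each exact date of birth is unique; after generalization all three records fall into one QI group of size three.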
Apart from this, related problems are also handled using anonymization methods. Another anonymization method, called condensation, is a statistical approach that constructs constrained clusters within a dataset and then generates pseudo-data using the statistics of those clusters (Nathiya, Kuyin & Sundari, 2016). It builds groups of heterogeneous size from the whole data, ensuring that each record lies in a group whose size is at least equal to its anonymity level. Pseudo-data is then generated from every group in order to create a synthetic dataset with a distribution similar to that of the original data. This approach can also be used efficiently for the classification problem.
Finally, this pseudo-data provides an added layer of security, because it is challenging to mount attacks on synthetic data. Since the overall aggregate behaviour of the data is preserved, it remains useful for data mining operations. The methods presented above are not an exhaustive list, and methods exist that go beyond those mentioned here; however, these are currently the most common anonymization techniques for privacy preservation.
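The condensation idea can be sketched for a single numeric attribute as below. This is a simplified assumption-laden illustration, not the algorithm from the cited work: grouping is done by simple sorting rather than constrained clustering, and pseudo-data is drawn from a Gaussian fitted to each group's mean and standard deviation.

```python
import random
import statistics

def condense(values, k, rng):
    """Toy condensation sketch for one numeric attribute: partition the sorted
    values into groups of size >= k (the anonymity level), then draw pseudo-data
    from each group's mean and standard deviation."""
    values = sorted(values)
    groups = [values[i:i + k] for i in range(0, len(values), k)]
    # Merge a trailing undersized group into its predecessor so that every
    # group still contains at least k records.
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    pseudo = []
    for group in groups:
        mu = statistics.mean(group)
        sigma = statistics.pstdev(group)
        # One synthetic record per real record, so the dataset size is preserved.
        pseudo.extend(rng.gauss(mu, sigma) for _ in group)
    return groups, pseudo
```

Because each synthetic value is drawn from group-level statistics rather than copied from a record, the pseudo-data preserves the aggregate distribution while decoupling individual values from individuals.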
Data perturbation
Yet another approach is data perturbation, which means the data is modified so that it no longer reflects the real data. This perturbed version of the data can then be released for data mining operations; this is known as a data distortion method for protecting privacy ("Compression of Geometric Data with the Use of Perturbation Functions", 2018). Adding noise drawn from a known distribution is among the other widely accepted perturbation techniques. Before beginning data mining operations, the miner must be able to reconstruct the distribution of the original data from the perturbed version.
This is because the data no longer reflects real values. Even if the data is somehow leaked, it would not point to any individual or violate his or her privacy, since it remains distorted. A particular example is that of the US Census Bureau's Public Use Microdata Sets. One of the most common perturbation techniques is data swapping, wherein real data values are exchanged among one another so as to preserve certain statistics while destroying the real values. An alternative is randomization, wherein noise is added to the data so as to prevent discovery of any real value.
Since the data does not reflect real-world values, it cannot violate privacy. The main challenge, however, lies in obtaining appropriate mining results from the perturbed data. Perturbation methods can be employed in both scenarios, distributed as well as central-server, and involve data distortion techniques such as condensation, noise addition, and randomization. Although perturbation techniques are used to achieve data privacy, they have certain limitations:
- Since this method utilizes distributions rather than original records, the range of applicable algorithmic techniques is restricted.
- Simultaneously, loss of key information that is available in multidimensional databases is another limitation.
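The data swapping technique mentioned above can be sketched minimally as follows; the record layout and attribute names are hypothetical. Shuffling a single column keeps its marginal statistics (sum, mean, histogram) exactly while breaking the link between a value and the individual it belonged to:

```python
import random

def swap_column(records, column, rng):
    """Data swapping sketch: shuffle one attribute's values across records.
    Column-level statistics are preserved exactly, but the association
    between a value and any individual record is destroyed."""
    values = [r[column] for r in records]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(records, values)]
```

Note that swapping preserves only per-column statistics; correlations between the swapped column and the remaining attributes are distorted, which is one source of the information loss discussed above.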
The randomization method mentioned above is a type of data distortion technique, also known as a statistical technique, and it was introduced by Warner to resolve the survey response problem (Ojala, Vuokko, Kallio, Haiminen & Mannila, 2009). In a randomized process, the data is scrambled in such a way that one cannot decipher, with probability better than a previously defined threshold, whether a given record contains right or wrong information. The information received from every single individual is scrambled, yet if the number of users is large, the aggregate information of those users can still be estimated with commendable accuracy.
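A Warner-style randomized response scheme can be sketched as below; the truth probability p = 0.75 and the simulated population are hypothetical choices for illustration. Each respondent reports the truth only with probability p, so no single answer is trustworthy, yet the aggregate proportion is recoverable by inverting the expected response rate:

```python
import random

def randomized_response(truth, p, rng):
    """Warner-style randomized response: report the true answer with
    probability p, otherwise report its negation, so no single response
    can be trusted beyond the threshold p."""
    return truth if rng.random() < p else not truth

def estimate_proportion(responses, p):
    """Unbiased estimate of the true 'yes' proportion pi.
    E[observed] = p*pi + (1 - p)*(1 - pi), hence
    pi = (observed - (1 - p)) / (2*p - 1) for p != 0.5."""
    observed = sum(responses) / len(responses)
    return (observed - (1 - p)) / (2 * p - 1)
```

With many respondents the estimator converges on the true proportion even though every individual answer remains deniable, which is exactly the accuracy-with-privacy behaviour described above.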
Cryptography Based
The approaches discussed in the prior sections are applicable only when the data may be disclosed beyond the control of the collection process. However, even if the data is spread across multiple sites that are legally prohibited from sharing their collections with one another, it is still possible to build a data mining model. Cryptography-based techniques are widely studied in many distributed environments. Secure Multiparty Computation (SMC) is a technique for maintaining privacy in various types of distributed data mining operations. SMC is typically based on three techniques: secret sharing, homomorphic encryption, and the semi-honest model (Pinkas, 2002).
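The secret-sharing building block can be illustrated with a minimal additive-sharing sketch for a private sum, simulating all parties in one process; the modulus and input values are hypothetical, and a real deployment would exchange shares over a network:

```python
import random

MODULUS = 2 ** 31  # hypothetical working modulus, large enough for the inputs

def share(secret, n_parties, rng):
    """Additive secret sharing over Z_MODULUS: split a value into n random
    shares that sum to the secret; any n-1 shares look uniformly random."""
    shares = [rng.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

def private_sum(secrets, rng):
    """Simulate the protocol in one process: each party shares its secret with
    every party, each party sums the shares it received, and the partial sums
    combine to the total without any single secret being revealed."""
    n = len(secrets)
    all_shares = [share(s, n, rng) for s in secrets]
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    return sum(partial_sums) % MODULUS
```

Each party sees only random-looking shares from the others, so its view reveals nothing beyond the final sum, which is the "nothing but the result" guarantee SMC aims for.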
An SMC protocol typically deals with two types of adversary: malicious and semi-honest. The majority of work assumes the semi-honest type, wherein the adversary follows the protocol specification but tries to learn more from the exchanged information than the results alone reveal (Naidu, 2018). Such a protocol needs only two communications between each data site and the mixer in one round of collection. Pseudonymization is an approach that breaks the link between medical information and personal information. It provides a method of traceable anonymity for health records. This method differs in that the personally identifiable information is not removed from the data; rather, it is transformed into another piece of information that cannot be linked back to a patient unless a secret key is known.
This is known as encryption, a well-known technique in the security domain ("An Encryption Scheme for Privacy Preserving in Data Mining using different classification Algorithm", 2016). Such encryption can be performed either at the application level or at the database level. The major advantage of an SMC-based solution is that it gives a clear idea of what is essentially revealed. In a perfect scenario, an SMC-based protocol reveals nothing, but in the real world this may not be entirely achievable. Despite this, SMC theory provides credible proofs that clearly distinguish what is secret from what is known. The major drawback, however, is efficiency: SMC protocols are inefficient, especially for the large inputs that data mining operations are made of.
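The keyed transformation behind pseudonymization can be sketched with a keyed hash (HMAC-SHA256); the patient identifier and secret key below are hypothetical, and this is one possible realisation rather than the scheme from the cited work:

```python
import hashlib
import hmac

def pseudonymize(identifier, secret_key):
    """Replace a direct identifier with a keyed pseudonym (HMAC-SHA256).
    Without the secret key the pseudonym cannot be linked back to the
    patient, while the same identifier always maps to the same pseudonym,
    so records stay linkable for whoever holds the key."""
    return hmac.new(secret_key, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()
```

The determinism is what makes the anonymity "traceable": records for one patient remain joinable across datasets, yet re-identification requires the key.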
Limitations
There is no PPDM technique that offers a one-size-fits-all solution. As a result, the ultimate choice is often made by weighing the benefits and the trade-offs between information loss and level of privacy, typically measured by the practical feasibility or complexity of the techniques and by data utility metrics. Homomorphic encryption and oblivious transfer protocols are two techniques for preserving security and privacy that can achieve complete privacy without any data loss. However, these techniques are not optimized and efficient enough for real-time applications (Laskar & Lachit, 2014). On top of this, homomorphic encryption requires trade-offs between efficiency and functionality.
Conclusion
Organizations and corporations around the world collect data in order to improve their services. This process requires data collection, analysis, and sometimes the sharing of sensitive data. This is where Privacy Preserving Data Mining methods come into the picture and help attain partial or complete privacy of the data being used for mining. This paper presented several data mining methods that are able to provide privacy for today's data. However, trade-offs remain between information loss, privacy, and overall computational overhead.
References
An Encryption Scheme for Privacy Preserving in Data Mining using different classification Algorithm. (2016). International Journal of Science and Research (IJSR), 5(6), 2424-2427.
Compression of Geometric Data with the Use of Perturbation Functions. (2018). Avtometriya, (4). doi: 10.15372/aut20180403
Data Anonymization Approach for Data Privacy. (2015). International Journal of Science and Research (IJSR), 4(12), 1534-1539. doi: 10.21275/v4i12.12121502
Laskar, D., & Lachit, G. (2014). A review on "Privacy Preservation Data Mining (PPDM)". International Journal of Computer Applications Technology and Research, 3(7), 403-408. doi: 10.7753/ijcatr0307.1003
Naidu, P. (2018). An Efficient Approach for Privacy Preserving Data Mining using SMC Techniques and Related Algorithms. International Journal for Research in Applied Science and Engineering Technology, 6(4), 1731-1737. doi: 10.22214/ijraset.2018.4292
Nathiya, S., Kuyin, C., & Sundari, J. (2016). Providing Multi Security in Privacy Preserving Data Mining. International Journal of Engineering and Computer Science.
Ojala, M., Vuokko, N., Kallio, A., Haiminen, N., & Mannila, H. (2009). Randomization methods for assessing data analysis results on real-valued matrices. Statistical Analysis and Data Mining, 2(4), 209-230. doi: 10.1002/sam.10042
Pinkas, B. (2002). Cryptographic techniques for privacy-preserving data mining. ACM SIGKDD Explorations Newsletter, 4(2), 12-19. doi: 10.1145/772862.772865
Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), 557-570. doi: 10.1142/s0218488502001648
Wei, D., Natesan Ramamurthy, K., & Varshney, K. (2018). Distribution-preserving k-anonymity. Statistical Analysis and Data Mining: The ASA Data Science Journal.