Generating Privacy-Preserving Data without Compromising Analytical Utility

Authors: Xiaopeng Luo, Shanlin Feng, Yunlin Liu, and Taoting Xiao
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 550-563
Keywords: Mixed data, Feature selection, Feature ranking, Privacy-preserving

Abstract

As machine learning advances rapidly, the demand for data is growing just as fast. In many practical applications, however, efficiently extracting useful information from private data remains a challenging problem. Moreover, personal information in the data can seriously threaten the privacy of the participating users, whose data are the building blocks of data-driven decision-making. The spread of communication technology and data collection devices has led to mixed-type data containing both numerical and categorical features. Mixed data provides richer, more comprehensive information that helps reveal hidden patterns between features and data labels. Not all features contribute equally to a classification task, however: features that correlate poorly with the labels may add no valuable information to the dataset and can even reduce the accuracy of the analysis. Such features are considered noise, irrelevant to the analysis task. This paper proposes a novel data synthesis method that accounts for the relevance of heterogeneous features to the data labels, even in scenarios with limited data. By enforcing strict privacy constraints through differential privacy and protecting users' private information with noise, the method generates new data that increases the quantity and diversity of the training data while preserving its utility. We evaluate the generated data experimentally by assessing its utility for training classifiers. The experimental results demonstrate that the method preserves the original data's utility and improves the classifiers' results.
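
The abstract does not specify the noise mechanism, so the following is only a minimal sketch of how noise-based differential privacy is commonly applied to mixed-type records, not the authors' method. It assumes the Laplace mechanism for numerical features and k-ary randomized response for categorical features, with the privacy budget split evenly across features. All function names, feature names, bounds, and domains are hypothetical.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def perturb_numeric(value, lower, upper, epsilon, rng):
    """Clip to [lower, upper], then add Laplace noise with scale (upper - lower) / epsilon."""
    clipped = min(max(value, lower), upper)
    return clipped + laplace_noise((upper - lower) / epsilon, rng)


def perturb_categorical(value, domain, epsilon, rng):
    """k-ary randomized response: keep the true category with
    probability e^eps / (e^eps + k - 1), otherwise report another one uniformly."""
    k = len(domain)
    keep_prob = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < keep_prob:
        return value
    return rng.choice([c for c in domain if c != value])


def perturb_record(record, numeric_bounds, categorical_domains, epsilon, rng):
    """Split epsilon evenly across features (sequential composition) and perturb each one.

    Assumption: an even split; a relevance-aware scheme might allocate the budget differently.
    """
    eps_each = epsilon / (len(numeric_bounds) + len(categorical_domains))
    noisy = {}
    for name, (lower, upper) in numeric_bounds.items():
        noisy[name] = perturb_numeric(record[name], lower, upper, eps_each, rng)
    for name, domain in categorical_domains.items():
        noisy[name] = perturb_categorical(record[name], domain, eps_each, rng)
    return noisy


# Hypothetical record with one numerical and one categorical feature.
example = perturb_record(
    {"age": 37.0, "job": "nurse"},
    numeric_bounds={"age": (18.0, 90.0)},
    categorical_domains={"job": ["nurse", "clerk", "other"]},
    epsilon=1.0,
    rng=random.Random(0),
)
```

Smaller values of epsilon give stronger privacy but noisier records. That trade-off is why weighting features by their relevance to the labels, as the abstract describes, can matter for preserving utility.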