Information technology is undergoing a Big Bang of its own, in which the amount of data generated worldwide doubles every two years. This unparalleled age of digital breakthroughs is propelled by billions of smartphones and other connected devices. The volume, variety, and velocity of data generation have led to a paradigm shift in data-driven decision-making.
Increasingly powerful and sophisticated software tools are taking over decision-making in business environments, in some tasks outperforming human judgment. Businesses harness cutting-edge artificial intelligence (AI) technologies to draw insightful inferences and make predictions about user behavior.
AI technologies rely mainly on the data fed into the system to train models and improve their performance. Most often, this data comprises sensitive personal information about users, which businesses use to personalize user experiences and optimize targeted advertising. The extensive use of AI, however, raises crucial concerns about user privacy.
In this blog post, we will try to decode data privacy in the age of AI by understanding its challenges and solutions.
Privacy challenges in AI and their respective solutions
The problem of re-identification
AI is known for its seemingly limitless pattern-recognition capabilities and its ability to establish connections between disparate data points, which poses a risk of unauthorized re-identification. Many studies have shown that information anonymized and scrubbed of all identifiers can still be re-identified using emerging computational techniques.
When different data sets (with no trace of personally identifiable information in any of them) are combined, the ‘Data Mosaic effect’ can emerge: together, the data sets make it possible to uniquely identify an individual. This increases the privacy risk of allowing private AI companies to process consumers’ personal data, even where anonymization has been carried out.
This is where privacy legislation like the GDPR comes into play: it requires any organization that uses or analyzes personal data to identify a legal basis for doing so, such as informed consent.
Pseudonymization techniques, such as masking direct identifiers, were once considered sufficient to “anonymize” a dataset. However, recent developments in linkage attacks, which employ data mining and machine learning to find overlapping matches between the common attributes, or quasi-identifiers, of two or more data sets, have exposed the vulnerabilities of pseudonymization as a privacy protection.
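To make the mechanics concrete, here is a minimal sketch of a linkage attack in Python. The two data sets, the chosen quasi-identifiers, and all values are hypothetical; real attacks follow the same join-on-shared-attributes pattern at a much larger scale.

```python
# A minimal sketch of a linkage attack. Both data sets below are
# hypothetical; they share quasi-identifiers (ZIP code, birth year, sex).
import pandas as pd

# "Anonymized" medical records: direct identifiers removed,
# quasi-identifiers kept.
medical = pd.DataFrame({
    "zip": ["60629", "60629", "10001"],
    "birth_year": [1984, 1991, 1975],
    "sex": ["F", "M", "F"],
    "diagnosis": ["diabetes", "asthma", "hypertension"],
})

# A public data set (e.g., a voter roll) that includes names.
voters = pd.DataFrame({
    "name": ["A. Jones", "B. Smith"],
    "zip": ["60629", "10001"],
    "birth_year": [1984, 1975],
    "sex": ["F", "F"],
})

# Joining on the shared quasi-identifiers re-attaches names to diagnoses,
# defeating the removal of direct identifiers.
linked = medical.merge(voters, on=["zip", "birth_year", "sex"])
print(linked[["name", "diagnosis"]])
```

Neither data set contains personally identifiable health information on its own; the privacy failure only appears once they are combined, which is exactly the mosaic effect described above.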
While anonymization can help alleviate some privacy concerns, the GDPR in particular sets a very high bar, with requirements for audits and assessments of how effective the anonymization is. The reality, however, is that for most AI models the output is so vast and varied that no assessment method can reliably validate that the anonymization meets the required standard.
Profiling attacks constitute another kind of privacy attack; they leverage AI to re-identify individuals based on their behavioral patterns. Peer-reviewed joint research by the Vienna University of Economics and Business and MOSTLY AI, for example, demonstrates how profiling attacks can successfully re-identify individuals from their browsing patterns.
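The intuition behind such attacks can be sketched in a few lines: a behavioral fingerprint (here, the set of domains a user visits) is matched against an auxiliary data set where identities are known. All names, histories, and the similarity measure below are illustrative assumptions, not the method of the cited paper.

```python
# A minimal sketch of a profiling attack: re-identify users by matching
# behavioral fingerprints (sets of visited domains) across data sets.
# All identifiers and browsing histories are hypothetical.

def jaccard(a, b):
    """Similarity between two sets of visited domains."""
    return len(a & b) / len(a | b)

# "Anonymized" browsing histories released for analytics.
anonymous = {
    "user_17": {"news.example", "forum.example", "shop.example"},
    "user_42": {"video.example", "mail.example", "news.example"},
}

# An auxiliary data set where identities are known
# (e.g., logs from a logged-in service).
known = {
    "alice@example.com": {"news.example", "forum.example",
                          "shop.example", "maps.example"},
    "bob@example.com": {"video.example", "mail.example", "blog.example"},
}

# Link each anonymous profile to the most similar known identity.
for anon_id, history in anonymous.items():
    best = max(known, key=lambda ident: jaccard(history, known[ident]))
    print(anon_id, "->", best, f"(similarity {jaccard(history, known[best]):.2f})")
```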
The risk of unauthorized re-identification by AI can be effectively mitigated by embracing technologies that allow organizations to achieve commercial objectives without compromising data privacy and security. Advances in privacy-preserving techniques like differential privacy and synthetic data generation offer promising alternatives.
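As a brief illustration of the first technique, here is a minimal sketch of differential privacy using the classic Laplace mechanism. The data set, the query, and the epsilon value are hypothetical choices for demonstration only.

```python
# A minimal sketch of the Laplace mechanism for differential privacy.
# The data, query, and epsilon below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng()

def dp_count(values, threshold, epsilon):
    """Return a differentially private count of values above a threshold.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon satisfies epsilon-differential privacy.
    """
    true_count = sum(v > threshold for v in values)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [23, 35, 41, 58, 62, 29, 44]               # hypothetical user ages
print(dp_count(ages, threshold=40, epsilon=0.5))  # noisy answer, not exact
```

Smaller epsilon values inject more noise, giving stronger privacy guarantees at the cost of accuracy; choosing this trade-off is the central design decision when deploying differential privacy.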
Organizations that prioritize consumer trust avoid transferring actual production data (the data powering live systems that end users interact with through websites or mobile apps) into non-production (development and testing) environments. They understand the principle of data minimization: a customer’s data should be used only to serve that customer.
Hence, they adopt statistically representative synthetic data for privacy-preserving AI and analytics. Synthetic data generation (SDG) creates artificial data sets with the same statistical properties as the original but without any trace of identifiable information about real individuals. SDG eliminates the one-to-one correspondence between data points and actual customers, making it significantly harder to re-identify individuals.
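The sketch below shows the idea in its simplest possible form: fit distributions to the real data and sample artificial rows from them. The customer data and columns are hypothetical, and real SDG tools model the full joint distribution (often with deep generative models) rather than independent marginals as done here.

```python
# A minimal, illustrative sketch of synthetic data generation:
# fit simple distributions to the real data and sample new rows.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical "real" customer data.
real = pd.DataFrame({
    "age": rng.normal(40, 12, size=1000).round(),
    "plan": rng.choice(["basic", "pro", "enterprise"],
                       size=1000, p=[0.6, 0.3, 0.1]),
})

n = 1000
plan_freq = real["plan"].value_counts(normalize=True)
synthetic = pd.DataFrame({
    # Sample ages from a normal distribution fitted to the real ages.
    "age": rng.normal(real["age"].mean(), real["age"].std(), size=n).round(),
    # Sample plans from the empirical category frequencies.
    "plan": rng.choice(plan_freq.index, size=n, p=plan_freq.values),
})

# Aggregate statistics match, yet no synthetic row corresponds
# one-to-one with a real customer.
print(real["age"].mean(), synthetic["age"].mean())
```

Because each synthetic row is drawn from a model rather than copied from a person, the data set retains its analytical value for testing and model training while breaking the link back to any individual.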
The power of consent
Most existing privacy laws, including the Federal Trade Commission’s enforcement framework, rely on a model of consumer choice built around ‘notice-and-consent’. In an AI-driven world, this model may no longer be fit for purpose: current consent mechanisms rarely offer fine-grained control over how data is used for AI.
Under GDPR Article 6(1)(a), even if users consent to their data being collected, it is doubtful that the consent is truly “informed.” Users can hardly comprehend the full extent of unpredictable AI uses, given that the technology continues to evolve rapidly and may involve unforeseen data uses over time.
Furthermore, the requirements for contract as a legal basis for AI processing of EU personal data under GDPR Article 6(1)(b) will often not be satisfied, since there are less intrusive, “data-minimized” means of performing the contract that do not require AI processing.
Consent models also fit poorly with critical-infrastructure applications such as self-driving cars and smart traffic signals. Imagine, for example, a self-driving car pausing mid-intersection so the driver can review and accept a data permission needed for split-second decision-making, such as a sudden maneuver to avoid an accident. Such real-time decisions require immediate action and cannot wait for anyone’s approval, rendering the notice-and-consent model ineffective.