AI, Privacy, and Security

This lesson examines the intersection of artificial intelligence systems with privacy and security concerns. Core topics include the types of data processed by AI, mechanisms through which privacy is eroded, limitations of common protection techniques, and implications for surveillance and predictive policing. The discussion covers both technical and ethical dimensions, drawing on economic, legal, and sociotechnical perspectives.

Overview

AI systems rely on large-scale data collection and processing, creating inherent conflicts with individual privacy rights and data security. Core topics include:

  • Classification of data types: personal, sensitive, behavioural, and inferred.
  • Mechanisms of privacy erosion through tracking and inference.
  • Limitations of anonymization and privacy-preserving techniques.
  • High-risk applications such as predictive policing, biometric surveillance, and automated decision systems.
  • Trade-offs between model utility and protection strength.
  • Technical safeguards: federated learning, differential privacy, encrypted computation, and synthetic data.
  • Regulatory, governance, and economic factors in AI data protection.

The analysis integrates technical detail with ethical, legal, and sociotechnical perspectives, addressing how AI amplifies information and power asymmetries and assessing the effectiveness and constraints of current mitigation approaches in high-dimensional, adversarial, and large-scale settings.

Learning Objectives

  • Distinguish among personal, sensitive, behavioural, and inferred data categories and identify how each is used in AI systems.
  • Explain re-identification risks and why k-anonymity, differential privacy, and related protection methods frequently fall short in practice.
  • Describe how AI amplifies privacy erosion in everyday digital interactions.
  • Evaluate the privacy implications of predictive policing, facial recognition, and automated surveillance systems.
  • Compare privacy-enhancing technologies and regulatory approaches to data protection in AI contexts.
  • Identify key ethical tensions between utility of AI models and individual privacy rights.

Motivation

Machine learning models require large volumes of data for training and inference. The performance gains from more data create incentives to collect and retain extensive records of human behaviour. When these records include personal attributes or can be linked across contexts, they enable detailed profiling, targeted manipulation, and exclusionary decisions. Privacy protections developed for static databases prove inadequate against adaptive adversaries and high-dimensional feature spaces typical in modern AI. Understanding these dynamics is necessary for designing systems that balance analytical utility with fundamental rights.

AI and Data Privacy

Artificial intelligence systems depend on diverse categories of data for training and operation, each carrying distinct privacy implications. Personal data directly identifies individuals, sensitive data reveals protected attributes such as health status or political opinions, behavioural data captures actions and interactions across digital and physical contexts, and inferred data consists of attributes derived algorithmically from the preceding types. AI-driven services routinely collect behavioural data at scale through tracking mechanisms embedded in platforms, devices, and applications, enabling construction of detailed individual profiles used for prediction, targeting, and decision-making. This pervasive data processing erodes privacy by exposing patterns individuals did not intend to disclose, often without meaningful consent, while amplifying risks of unauthorized linkage, profiling, and secondary use across contexts.

Types of Data

AI systems process multiple categories of data with different privacy implications.

Personal data

Any information relating to an identified or identifiable natural person, whether through direct identifiers or identifying combinations of attributes.

Examples:

  • Full name combined with date of birth and postal code.
  • Email address used for account registration on a social media platform.
  • National identification number or social security number stored in a health insurance database.
  • Phone number linked to a mobile banking application.

Sensitive data

Personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic data, biometric data for unique identification, data concerning health, or data concerning sex life or sexual orientation.

Examples:

  • Medical diagnosis codes and treatment history in electronic health records.
  • Genetic test results indicating predisposition to hereditary diseases.
  • Fingerprint or iris scan templates used in biometric authentication systems.
  • Political party membership or voting records in electoral databases.
  • Religious affiliation inferred from donation patterns to specific organizations.

Behavioural data

Information generated by an individual’s actions, interactions, and movements in digital or physical environments.

Examples:

  • Sequence of websites visited during a browsing session tracked via cookies and device fingerprinting.
  • Timestamped GPS coordinates logged by a ride-sharing application.
  • Clickstream data showing which products were viewed and for how long on an e-commerce site.
  • Messages sent and received on a messaging platform, including metadata such as time, recipient, and message length.
  • Music tracks played and skipped on a streaming service.

Inferred data

Attributes or characteristics computed or predicted by algorithmic models from personal, sensitive, or behavioural data.

Examples:

  • Credit risk score computed from transaction history, browsing patterns, and social connections.
  • Political ideology estimate derived from liked pages, followed accounts, and shared articles.
  • Mental health indicators predicted from language use in social media posts and search queries.
  • Sexual orientation inferred from social graph structure and interaction patterns.
  • Likelihood of job change predicted from professional network activity and resume keyword trends.

Inferred data often carries the highest risk because it reveals information individuals did not explicitly disclose and may be inaccurate.

The Impact of AI on Privacy in Everyday Life

AI-driven services collect behavioural data at scale through tracking embedded in digital platforms, mobile applications, smart devices, and connected infrastructure. The resulting high-resolution data streams construct detailed behavioural profiles revealing habits, social ties, and vulnerabilities individuals did not intend to disclose, often without effective awareness or control. The cumulative effect is pervasive surveillance, targeted manipulation, differential treatment, and diminished personal autonomy in routine digital interactions.

Examples:

  • Video streaming platforms record watch time, pause events, rewinds, and device switching to refine recommendations and build taste profiles.
  • Social media feeds track dwell time on posts, scroll velocity, and reactions to rank future content and infer emotional states.
  • Smart home devices capture voice commands, ambient sound patterns, and motion sensor triggers to create occupancy and routine models.
  • Mobile advertising ecosystems use cross-app tracking IDs to link in-app purchases, search history, and location pings for hyper-personalized ads.
  • Fitness trackers log heart rate variability, sleep stages, and step counts to infer stress levels, recovery status, and lifestyle patterns.

These systems generate detailed behavioural profiles used for micro-targeted advertising, price discrimination, and content curation. Cumulative effects reduce individual autonomy by shaping information exposure and decision environments.

Re-identification and the Limits of Anonymization

Anonymization removes or alters direct identifiers so that the data can no longer be attributed to a specific individual without additional information. Re-identification matches anonymized records to real individuals using auxiliary data sources or background knowledge.

Example:

The Netflix Prize dataset (2006) contained anonymized movie ratings. Narayanan and Shmatikov (2008) linked it to public IMDb profiles using unique rating patterns, re-identifying many users with high accuracy.

Common anonymization techniques and their limitations

k-anonymity

Each record in the released dataset must be indistinguishable from at least k−1 other records with respect to a set of quasi-identifiers.

Example of failure:
If every record in an equivalence class (e.g., age range 20-30, ZIP 90210) shares the same disease such as diabetes, an adversary who knows a target belongs to the class learns the sensitive attribute without any direct linkage (the homogeneity attack).
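The k-anonymity level of a release can be checked directly: group records by their quasi-identifier values and take the smallest group size. A minimal sketch in Python, with hypothetical records and field names:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the dataset's k: the size of the smallest group of
    records sharing the same quasi-identifier values."""
    groups = Counter(tuple(rec[q] for q in quasi_identifiers) for rec in records)
    return min(groups.values())

# Hypothetical records; age range and ZIP code are the quasi-identifiers.
records = [
    {"age": "20-30", "zip": "90210", "disease": "diabetes"},
    {"age": "20-30", "zip": "90210", "disease": "diabetes"},
    {"age": "20-30", "zip": "90210", "disease": "diabetes"},
    {"age": "30-40", "zip": "10001", "disease": "flu"},
    {"age": "30-40", "zip": "10001", "disease": "none"},
]

print(k_anonymity(records, ["age", "zip"]))  # 2: the 30-40/10001 group
```

Note that the first group satisfies k = 3 yet every member has diabetes, which is exactly the homogeneity failure: k-anonymity bounds re-identification, not attribute disclosure.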

l-diversity

Each equivalence class must contain at least l well-represented sensitive attribute values.

Example of failure:
An equivalence class contains three records with age range 25-35 and ZIP 90210, whose sensitive values are {HIV positive, HIV positive, negative}. The class satisfies l = 2, yet an adversary who knows a target is in the group infers HIV-positive status with probability 2/3 (the skewness attack).
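Distinct l-diversity can be checked the same way, by counting distinct sensitive values per equivalence class. A sketch using the failing class from the example (records are hypothetical):

```python
from collections import defaultdict

def l_diversity(records, quasi_identifiers, sensitive):
    """Return the minimum number of distinct sensitive values found
    in any equivalence class (distinct l-diversity)."""
    classes = defaultdict(set)
    for rec in records:
        classes[tuple(rec[q] for q in quasi_identifiers)].add(rec[sensitive])
    return min(len(values) for values in classes.values())

# The failing class: two distinct values, but a heavily skewed split.
records = [
    {"age": "25-35", "zip": "90210", "status": "HIV positive"},
    {"age": "25-35", "zip": "90210", "status": "HIV positive"},
    {"age": "25-35", "zip": "90210", "status": "negative"},
]

print(l_diversity(records, ["age", "zip"], "status"))  # 2
```

The class formally passes l = 2, yet the 2/3 probability leak remains: diversity of values does not bound the skew of their distribution.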

t-closeness

The distribution of a sensitive attribute in any equivalence class must be close to the distribution of the attribute in the overall table (measured by Earth mover’s distance or similar).

Example of failure:
In a salary dataset, a group with average salary significantly higher than population average reveals that members are likely senior employees even if exact values are hidden.

Differential privacy

The output of a computation is statistically indistinguishable whether any single individual’s data is included or excluded, typically achieved by adding calibrated noise.

Example of failure in practice:
When applied to small subpopulations (e.g., rare medical conditions), the required noise level makes the output statistically useless for the very groups most in need of analysis.

In high-dimensional settings (thousands of features), even differential privacy struggles to preserve meaningful accuracy while preventing re-identification.
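The Laplace mechanism makes the calibration concrete: a counting query has sensitivity 1, so Laplace noise with scale 1/ε gives ε-differential privacy. A minimal sketch with illustrative parameters, showing why the same noise scale that barely perturbs a large count swamps a rare-subpopulation count:

```python
import math
import random

def dp_count(true_count, epsilon, rng=random):
    """Release a count under epsilon-differential privacy by adding
    Laplace noise with scale 1/epsilon (a count has sensitivity 1)."""
    scale = 1.0 / epsilon
    u = rng.random() - 0.5                      # uniform on (-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    noise = -scale * sign * math.log(1.0 - 2.0 * abs(u))  # inverse-CDF sample
    return true_count + noise

random.seed(0)
# A city-wide count survives noise of scale 10; a rare-condition count does not.
print(dp_count(100_000, epsilon=0.1))  # relative error around 0.01%
print(dp_count(5, epsilon=0.1))        # the same noise scale dwarfs the value
```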

Policing, Surveillance, and Predictive Systems as Privacy Issues

AI enables new forms of surveillance and control.

Examples:

  • Predictive policing platforms (e.g., PredPol) use historical crime reports to forecast crime locations, leading to increased patrols in certain neighborhoods and reinforcing arrest disparities.
  • Real-time facial recognition in public spaces (e.g., Clearview AI database combined with municipal cameras) allows identification of individuals without their knowledge or consent.
  • Social scoring systems (e.g., China’s Social Credit System) aggregate financial behaviour, traffic violations, social media activity, and neighbour reports to assign scores affecting loan access, travel permissions, and job eligibility.
  • Automated risk assessment tools at borders (e.g., EU’s iBorderCtrl or U.S. CBP systems) profile travellers based on facial micro-expressions, travel history, and social media content to flag potential security risks.
  • Workplace monitoring software analyzes keystroke patterns, webcam feeds, and email sentiment to infer productivity, emotional state, and unionization risk.

These applications concentrate power in state and corporate hands, reduce anonymity in public spaces, and create chilling effects on free expression and association.

Privacy and Data Protection Approaches

Technical Approaches

Technical methods aim to enable useful AI computation while minimizing exposure of raw personal data. These approaches operate at different stages of the data lifecycle and offer varying protection strengths against different threat models.

Federated learning

Model updates are computed locally on user devices using private data; only aggregated or differentially private model updates are sent to a central server.

Examples

  • Google Gboard keyboard uses federated learning to improve next-word prediction without uploading full typing histories.
  • Mobile health applications train personalized activity recognition models on wearable sensor data while keeping raw accelerometer and heart rate traces on-device.
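The federated averaging idea can be sketched in a few lines: each client takes gradient steps on its local data, and the server averages the resulting weights, weighted by dataset size. A toy simulation with a linear model (the data, model, and hyperparameters are all illustrative):

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: gradient steps for squared loss
    on a linear model. Raw (X, y) never leaves the client."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def federated_average(weights, clients):
    """Server round: collect locally trained weights and average them,
    weighting each client by its number of examples."""
    updates = [(len(y), local_update(weights, X, y)) for X, y in clients]
    total = sum(n for n, _ in updates)
    return sum(n * w for n, w in updates) / total

# Simulate three clients whose private data follow y = 2x.
rng = np.random.default_rng(0)
clients = []
for _ in range(3):
    X = rng.normal(size=(20, 1))
    clients.append((X, X @ np.array([2.0])))

w = np.zeros(1)
for _ in range(30):             # 30 communication rounds
    w = federated_average(w, clients)
print(w)                        # close to the true coefficient 2.0
```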

Secure multi-party computation

Multiple parties jointly compute a function over their private inputs while keeping those inputs secret from each other; the output is revealed only to authorized parties.

Examples

  • Several banks compute aggregate fraud statistics (e.g., total suspicious transactions per region) without disclosing individual customer transaction details to competitors.
  • Hospitals perform collaborative risk scoring for rare diseases by computing joint logistic regression parameters without sharing patient-level records.
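A common building block for such computations is additive secret sharing: each party splits its input into random shares that sum to the true value, so that only the sum of all inputs is ever reconstructed. A minimal sketch with hypothetical bank counts:

```python
import random

PRIME = 2**61 - 1  # all arithmetic is modulo a large prime

def share(secret, n_parties):
    """Split a secret into n additive shares summing to it mod PRIME.
    Any n-1 shares together reveal nothing about the secret."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Three banks secret-share their private fraud counts.
counts = [120, 45, 300]
all_shares = [share(c, 3) for c in counts]

# Party i holds one share from each bank and publishes only their sum.
partial_sums = [sum(s[i] for s in all_shares) % PRIME for i in range(3)]
print(reconstruct(partial_sums))  # 465 — no bank revealed its own count
```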

Homomorphic encryption

Allows computation on ciphertexts such that the result, after decryption, matches the result of operations on the corresponding plaintexts.

Examples

  • Cloud-based epidemiological analysis computes population-level statistics (mean blood pressure, disease prevalence) directly on encrypted patient records stored by hospitals.
  • Insurance companies perform actuarial calculations on encrypted claims data without decrypting individual policyholder information during processing.
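Fully homomorphic schemes are mathematically heavy, but the underlying property is easy to demonstrate with textbook RSA, which is multiplicatively homomorphic: the product of two ciphertexts decrypts to the product of the plaintexts. A toy sketch with tiny primes for illustration only (real deployments use lattice-based schemes or moduli of 2048 bits and up):

```python
# Toy textbook RSA, which is multiplicatively homomorphic.
p, q = 61, 53
n = p * q                           # modulus (3233)
e = 17                              # public exponent
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent (Python 3.8+)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

c1, c2 = encrypt(7), encrypt(11)
product_cipher = (c1 * c2) % n      # multiply without ever decrypting
print(decrypt(product_cipher))      # 77 == 7 * 11
```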

Synthetic data generation

Generates artificial datasets that statistically mimic real data distributions while containing no actual personal records, often using generative models such as GANs or diffusion models.

Examples

  • Pharmaceutical research creates synthetic electronic health records to train drug response prediction models when access to real rare-disease patient data is restricted by privacy regulations.
  • Financial institutions generate synthetic transaction datasets for stress-testing anti-money-laundering models without risking exposure of customer financial histories.
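At its simplest, synthetic data generation fits a distribution to the real data and samples fresh records from it; GANs and diffusion models replace the fitted Gaussian below with far richer distributions. A minimal sketch in which all "real" data are simulated:

```python
import numpy as np

def synthesize(real, n_samples, rng):
    """Fit a multivariate Gaussian to real records and sample synthetic
    rows: means and correlations are preserved, no real row is reused."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)

rng = np.random.default_rng(0)
# Simulated "real" records: two correlated clinical measurements.
real = rng.multivariate_normal([120.0, 80.0],
                               [[25.0, 15.0], [15.0, 16.0]], size=2000)
synthetic = synthesize(real, 2000, rng)

print(synthetic.mean(axis=0))            # close to [120, 80]
print(np.corrcoef(synthetic.T)[0, 1])    # close to the real correlation 0.75
```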

On-device inference

Performs model inference entirely on the user’s device using local compute resources, avoiding transmission of input data to remote servers.

Examples

  • Smartphone voice assistants (e.g., Siri, Google Assistant) run speech-to-text conversion locally for common commands to reduce latency and prevent audio snippets from leaving the device.
  • Camera-based applications perform face detection or object recognition directly on the device sensor stream, keeping image frames off the network.

Regulatory and Governance Approaches

Regulatory frameworks impose obligations on data controllers and processors for AI systems handling personal data, limiting collection, ensuring lawful processing, granting rights, and requiring risk assessments.

Examples:

  • GDPR Article 6 requires a lawful basis for processing personal data (e.g., consent, contract, legitimate interests with balancing test). Article 9 restricts special category (sensitive) data unless explicit consent or narrow exception applies.

  • CCPA/CPRA provides California residents opt-out rights for sale/sharing of personal information and limits on sensitive data use, plus disclosure requirements for significant automated decisions.

  • GDPR Article 35 requires Data Protection Impact Assessments for high-risk processing (e.g., large-scale profiling, significant-effect automated decisions, large-scale sensitive data).

  • GDPR Article 22 grants rights regarding solely automated decisions with legal or significant effects, including human intervention, contestation, and explanation of logic.

  • GDPR data minimization (Article 5(1)(c)) limits data to what is necessary, enforced via audits, retention policies, and erasure rights (Article 17).

  • In Canada, PIPEDA governs commercial personal information handling, requiring meaningful consent, collection limitation, and safeguards. OPC guidance addresses automated decision-making by AI, stressing transparency and accountability.

These rules shape AI data practices in the EU, USA, and Canada, with extraterritorial influence.

Economic Perspectives

Digital platform markets show high concentration, with dominant firms holding large user bases and data volumes. This reduces incentives to compete on privacy strength, as users face high switching costs and few privacy-focused alternatives. Firms gain private benefits from data collection (better model performance, targeted advertising revenue) while externalizing social costs (privacy erosion, discrimination, behavioural chilling effects), resulting in over-collection beyond the social optimum.

Examples of over-collection

  • Targeted advertising platforms collect extensive behavioural data to maximize ad revenue, even when marginal privacy harm exceeds marginal benefit to users.
  • Social networks retain detailed interaction histories indefinitely to improve engagement algorithms, despite limited incremental value and growing privacy risks.

Privacy-utility trade-offs

Stronger privacy measures typically reduce signal quality and predictive accuracy.

Examples:

  • Adding differential privacy noise to recommendation systems lowers recommendation relevance, reducing user engagement and platform revenue.
  • Federated learning on mobile devices limits data centralization but can degrade model performance due to heterogeneous local datasets.

Adversarial attacks on privacy

Adversarial attacks exploit model outputs to breach privacy.

Examples:

  • Membership inference attack: An attacker queries a health prediction model and infers whether a specific patient’s record was used for training by comparing confidence scores.
  • Model inversion attack: An adversary reconstructs facial images from a face recognition model’s output probabilities, recovering training set faces without direct access.
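The confidence-thresholding variant of membership inference is simple enough to sketch directly: overfitted models tend to be more confident on records they were trained on, so unusually high confidence becomes the membership signal. The confidence values below are hypothetical:

```python
def membership_guesses(confidences, threshold=0.9):
    """Flag each record as a suspected training-set member when the
    model's prediction confidence on it exceeds the threshold."""
    return [c >= threshold for c in confidences]

# Hypothetical model confidences: higher on training records (members).
member_conf = [0.99, 0.97, 0.95, 0.98, 0.96]
non_member_conf = [0.70, 0.85, 0.60, 0.92, 0.75]

guesses = membership_guesses(member_conf + non_member_conf)
hits = sum(guesses[:5]) + sum(not g for g in guesses[5:])
print(hits / 10)  # 0.9 — membership is recovered far above chance
```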

Common Limitations Observed in Practice

Differential privacy noise

Adding calibrated noise to achieve meaningful privacy guarantees frequently degrades model utility to unacceptable levels for tasks involving rare events or small subpopulations.

Examples

  • In fraud detection systems, noise addition reduces true positive rates for low-frequency fraud patterns, causing financial institutions to miss significant fraudulent transactions.
  • In medical diagnostics, differential privacy applied to genomic datasets obscures signals needed to identify rare disease variants, limiting clinical utility for precision medicine.

Federated learning vulnerabilities

Federated learning distributes training across client devices, but the decentralized nature introduces attack vectors not present in centralized training.

Examples

  • Model poisoning attacks: A small fraction of malicious participants (e.g., compromised smartphones) inject poisoned local updates, causing the global model to misclassify targeted inputs after aggregation.
  • Byzantine attacks: Adversaries send arbitrary or inconsistent updates that disrupt convergence, leading to reduced accuracy even when only a minority of devices are malicious.
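A standard partial defence is to replace the mean with a robust aggregator such as the coordinate-wise median, which a minority of Byzantine clients cannot drag arbitrarily far. A minimal sketch with hypothetical one-dimensional updates:

```python
import statistics

def mean_aggregate(updates):
    """Plain mean aggregation: a single outlier controls the result."""
    return [sum(col) / len(col) for col in zip(*updates)]

def median_aggregate(updates):
    """Coordinate-wise median: robust to a minority of bad updates."""
    return [statistics.median(col) for col in zip(*updates)]

# Four honest clients report updates near 1.0; one Byzantine client lies.
updates = [[1.0], [1.1], [0.9], [1.0], [1000.0]]
print(mean_aggregate(updates))    # pulled to ~200.8 by the single attacker
print(median_aggregate(updates))  # [1.0] — essentially unaffected
```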

Re-identification of anonymized data

Techniques intended to remove direct identifiers prove insufficient against linkage attacks using auxiliary information available in public or commercial datasets.

Examples

  • Mobility datasets anonymized by removing names and exact locations are re-identified using public social media check-ins or credit card transaction metadata with high accuracy.
  • Netflix Prize dataset (anonymized ratings) was re-identified by cross-referencing with public IMDb ratings, exposing individual viewing histories despite removal of explicit identifiers.
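The Netflix-style linkage attack reduces to finding the public profile with the greatest overlap with the anonymized record. A minimal sketch with hypothetical ratings (real attacks additionally weight rare items and tolerate fuzzy matches on dates and scores):

```python
def best_match(anon_ratings, public_profiles):
    """Link an anonymized rating set to the named public profile that
    shares the most (item, rating) pairs with it."""
    return max(public_profiles,
               key=lambda name: len(anon_ratings & public_profiles[name]))

# "Anonymized" record: no name attached, but a near-unique rating pattern.
anon = {("MovieA", 5), ("MovieB", 2), ("MovieC", 4)}

# Hypothetical public profiles scraped from a review site.
public = {
    "alice": {("MovieA", 5), ("MovieB", 2), ("MovieD", 3)},
    "bob":   {("MovieA", 3), ("MovieC", 4)},
}

print(best_match(anon, public))  # alice — two shared pairs beat bob's one
```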

Bias reinforcement in predictive policing

Systems trained on historical enforcement data inherit and amplify existing biases present in arrest, stop, and conviction records.

Examples

  • PredPol-style hotspot prediction concentrates patrols in neighbourhoods with historical over-policing, increasing arrest rates in those areas and feeding more biased data back into the model (feedback loop).
  • Risk assessment tools used in pretrial detention or sentencing assign higher risk scores to individuals from over-policed communities due to disproportionate prior contact records, perpetuating racial and socioeconomic disparities.
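The feedback loop is easy to see in a toy simulation: if patrols always go to the district with the most recorded crime, and only patrolled districts generate new records, a small initial disparity compounds indefinitely even when true crime rates are identical. All numbers below are illustrative:

```python
def patrol_feedback(recorded, crimes_per_patrol=5, rounds=10):
    """Each round, concentrate patrols on the district with the most
    recorded crime; only the patrolled district adds new records,
    so the initial bias feeds back into the next allocation."""
    recorded = list(recorded)
    for _ in range(rounds):
        hotspot = recorded.index(max(recorded))
        recorded[hotspot] += crimes_per_patrol  # observation follows presence
    return recorded

# Two districts with identical true crime rates but a biased history:
print(patrol_feedback([6, 5]))  # [56, 5] — district 0 absorbs every patrol
```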

Summary

AI systems depend on large-scale data collection that includes personal, sensitive, behavioural, and inferred information. Re-identification risks undermine anonymization efforts, while everyday applications erode privacy through pervasive tracking. Surveillance and predictive tools concentrate power and amplify bias. Technical protections (differential privacy, federated learning) and regulatory frameworks (consent, minimization) offer partial mitigation, but significant trade-offs remain between utility and privacy. Addressing these tensions requires coordinated advances in technology, policy, and institutional design.