This lesson examines the intersection of artificial intelligence with privacy and security concerns. AI systems rely on large-scale data collection and processing, creating inherent conflicts with individual privacy rights and data security. Core topics include: classification of data types (personal, sensitive, behavioural, inferred); mechanisms of privacy erosion through tracking and inference; limitations of anonymization and privacy-preserving techniques; high-risk applications such as predictive policing, biometric surveillance, and automated decision systems; trade-offs between model utility and protection strength; technical safeguards (federated learning, differential privacy, encrypted computation, synthetic data); and regulatory, governance, and economic factors in AI data protection. The analysis integrates technical detail with ethical, legal, and sociotechnical perspectives, addressing how AI amplifies information and power asymmetries while assessing the effectiveness and constraints of current mitigation approaches in high-dimensional, adversarial, and large-scale settings.
Machine learning models require large volumes of data for training and inference. The performance gains from more data create incentives to collect and retain extensive records of human behaviour. When these records include personal attributes or can be linked across contexts, they enable detailed profiling, targeted manipulation, and exclusionary decisions. Privacy protections developed for static databases prove inadequate against adaptive adversaries and high-dimensional feature spaces typical in modern AI. Understanding these dynamics is necessary for designing systems that balance analytical utility with fundamental rights.
Artificial intelligence systems depend on diverse categories of data for training and operation, each carrying distinct privacy implications. Personal data directly identifies individuals, sensitive data reveals protected attributes such as health status or political opinions, behavioural data captures actions and interactions across digital and physical contexts, and inferred data consists of attributes derived algorithmically from the preceding types. AI-driven services routinely collect behavioural data at scale through tracking mechanisms embedded in platforms, devices, and applications, enabling construction of detailed individual profiles used for prediction, targeting, and decision-making. This pervasive data processing erodes privacy by exposing patterns individuals did not intend to disclose, often without meaningful consent, while amplifying risks of unauthorized linkage, profiling, and secondary use across contexts.
AI systems process multiple categories of data with different privacy implications.
Personal data: any information relating to an identified or identifiable natural person (direct identifiers).
Examples: name, home address, email address, government identification number, IP address.
Sensitive data: personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic data, biometric data processed for unique identification, data concerning health, or data concerning sex life or sexual orientation.
Examples: medical diagnoses, fingerprint or face templates, genetic test results, records of religious affiliation.
Behavioural data: information generated by an individual's actions, interactions, and movements in digital or physical environments.
Examples: browsing and search history, purchase records, location traces, app usage logs, viewing patterns.
Inferred data: attributes or characteristics computed or predicted by algorithmic models from personal, sensitive, or behavioural data.
Examples: credit risk scores, predicted health conditions, personality profiles, estimated political leaning.
Inferred data often carries the highest risk because it reveals information individuals did not explicitly disclose and may be inaccurate.
AI-driven services collect behavioural data at scale through tracking embedded in digital platforms, mobile applications, smart devices, and connected infrastructure. Streaming platforms record viewing patterns, pause events, and device switches to build taste profiles. Social media tracks dwell time, scroll behaviour, reactions, and interaction sequences to infer preferences and emotional states. Voice assistants capture audio commands and ambient sounds to model routines. Location services log geotraces and venue visits. Advertising networks link cross-app signals and device fingerprints for persistent identification. Wearables monitor physiological data and activity to derive health and lifestyle inferences. These high-resolution data streams construct detailed behavioural profiles revealing habits, social ties, and vulnerabilities individuals did not intend to disclose, often without effective awareness or control, leading to pervasive surveillance, targeted manipulation, differential treatment, and diminished personal autonomy in routine digital interactions.
These systems generate detailed behavioural profiles used for micro-targeted advertising, price discrimination, and content curation. Cumulative effects reduce individual autonomy by shaping information exposure and decision environments.
Anonymization removes or alters direct identifiers so that the data can no longer be attributed to a specific individual without additional information. Re-identification matches anonymized records to real individuals using auxiliary data sources or background knowledge.
Example:
The Netflix Prize dataset (2006) contained anonymized movie ratings. Narayanan and Shmatikov (2008) linked it to public IMDb profiles using unique rating patterns, re-identifying many users with high accuracy.
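The same linkage logic can be sketched in a few lines of Python; all names and records below are fabricated for illustration:

```python
# Toy linkage attack: join an "anonymized" dataset to a public one
# on shared quasi-identifiers (age, ZIP). All records are fabricated.

anonymized_health = [
    {"age": 34, "zip": "02139", "diagnosis": "diabetes"},
    {"age": 47, "zip": "90210", "diagnosis": "hypertension"},
]

public_registry = [  # e.g. a voter roll with names attached
    {"name": "A. Smith", "age": 34, "zip": "02139"},
    {"name": "B. Jones", "age": 52, "zip": "10001"},
]

def link(anon, public):
    """Re-identify anonymized rows whose quasi-identifiers match
    exactly one public record."""
    matches = []
    for rec in anon:
        hits = [p for p in public
                if p["age"] == rec["age"] and p["zip"] == rec["zip"]]
        if len(hits) == 1:  # a unique match means re-identification
            matches.append((hits[0]["name"], rec["diagnosis"]))
    return matches

print(link(anonymized_health, public_registry))
# A unique (age, zip) combination re-identifies A. Smith's diagnosis.
```

Real attacks work the same way at scale, with richer auxiliary data (rating patterns, timestamps) in place of age and ZIP.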
k-Anonymity: each record in the released dataset must be indistinguishable from at least k−1 other records with respect to a set of quasi-identifiers.
Example of failure:
A dataset with age ranges (20-29, 30-39) and ZIP codes, where most records in an equivalence class share the same disease (e.g., diabetes), allows inference of the sensitive attribute even without direct linkage (the homogeneity attack).
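A minimal k-anonymity check, over made-up rows, computes the smallest equivalence-class size on the quasi-identifiers:

```python
from collections import Counter

# Minimal k-anonymity check over toy, fabricated rows:
# k is the size of the smallest equivalence class on the
# quasi-identifiers (age range, ZIP).

rows = [
    ("20-29", "90210", "diabetes"),
    ("20-29", "90210", "diabetes"),
    ("30-39", "10001", "flu"),
]

def k_of(table):
    classes = Counter((age, zipc) for age, zipc, _ in table)
    return min(classes.values())

print(k_of(rows))  # 1: the lone ("30-39", "10001") record breaks k >= 2
# Note the first class is 2-anonymous yet homogeneous: both records
# share "diabetes", so group membership alone discloses the diagnosis.
```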
l-Diversity: each equivalence class must contain at least l well-represented sensitive attribute values.
Example of failure:
An equivalence class contains three records with ages 25-35 and ZIP code 90210, but the sensitive values are {HIV positive, HIV positive, negative}. An adversary who knows a target is in this group can infer a 2/3 probability of HIV-positive status.
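The corresponding l-diversity check counts distinct sensitive values per equivalence class; with the toy rows below it reports l = 2 even though the skewed class still leaks information:

```python
from collections import defaultdict

# l-diversity check over toy, fabricated rows: each equivalence class
# must contain at least l distinct sensitive values.

rows = [
    ("25-35", "90210", "HIV positive"),
    ("25-35", "90210", "HIV positive"),
    ("25-35", "90210", "negative"),
]

def l_of(table):
    classes = defaultdict(set)
    for age, zipc, sensitive in table:
        classes[(age, zipc)].add(sensitive)
    return min(len(values) for values in classes.values())

print(l_of(rows))  # 2: the class is 2-diverse, yet an adversary still
# infers a 2/3 chance of "HIV positive" for any known member.
```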
t-Closeness: the distribution of a sensitive attribute in any equivalence class must be close to the distribution of the attribute in the overall table (measured by earth mover's distance or similar).
Example of failure:
In a salary dataset, a group with average salary significantly higher than population average reveals that members are likely senior employees even if exact values are hidden.
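The intuition can be sketched numerically; as a simplifying assumption, the snippet uses total variation distance as a stand-in for earth mover's distance, and the salary labels are fabricated:

```python
from collections import Counter

# t-closeness intuition: compare a class's sensitive-value distribution
# with the overall table's. Total variation distance is used here as a
# simple proxy for earth mover's distance (illustrative assumption).

overall = ["low"] * 8 + ["high"] * 2   # population: 80% low, 20% high
one_class = ["high", "high", "low"]    # one class skews heavily high

def tv_distance(a, b):
    pa, pb = Counter(a), Counter(b)
    support = set(pa) | set(pb)
    return 0.5 * sum(abs(pa[v] / len(a) - pb[v] / len(b)) for v in support)

print(tv_distance(one_class, overall))
# A large distance signals the class leaks that its members earn more,
# even though no exact salary is released.
```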
Differential privacy: the output distribution of a computation changes by at most a factor of e^ε whether any single individual's data is included or excluded, typically achieved by adding calibrated noise drawn from a Laplace or Gaussian distribution.
Example of failure in practice:
When applied to small subpopulations (e.g., rare medical conditions), the required noise level makes the output statistically useless for the very groups most in need of analysis.
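A sketch of the Laplace mechanism illustrates the small-count problem; the epsilon value and counts below are arbitrary:

```python
import math
import random

# Laplace mechanism sketch: a count query has sensitivity 1, so
# epsilon-DP is achieved by adding Lap(1/epsilon) noise. For small
# subpopulations the noise can swamp the true signal.

def laplace_noise(scale, rng):
    """Sample Lap(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon, rng):
    """epsilon-DP count query: sensitivity 1, noise scale 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
eps = 0.1  # strong privacy guarantee => noise scale of 10
print(dp_count(5, eps, rng))     # rare-condition count: noise dominates
print(dp_count(5000, eps, rng))  # large count: relative error is tiny
```

With ε = 0.1 the noise standard deviation is about 14, so a true count of 5 is essentially unrecoverable while a count of 5000 is barely perturbed.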
In high-dimensional settings (thousands of features), even differential privacy struggles to preserve meaningful accuracy while preventing re-identification.
AI enables new forms of surveillance and control.
Examples: predictive policing based on historical enforcement data, facial recognition and biometric surveillance in public spaces, social media monitoring, and automated scoring for access to credit, employment, or benefits.
These applications concentrate power in state and corporate hands, reduce anonymity in public spaces, and create chilling effects on free expression and association.
Technical methods aim to enable useful AI computation while minimizing exposure of raw personal data. These approaches operate at different stages of the data lifecycle and offer varying protection strengths against different threat models.
Federated learning: model updates are computed locally on user devices using private data; only aggregated or differentially private model updates are sent to a central server.
Examples: next-word prediction for mobile keyboards (e.g., Google's Gboard), on-device speech model improvement, and collaborative training across hospitals without pooling patient records.
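A stripped-down federated averaging round for a one-parameter linear model might look like this; the data is toy, and real deployments add secure aggregation and differential privacy on the updates:

```python
# Minimal federated averaging sketch for a 1-D linear model y = w * x.
# Client data never leaves the "device"; only weights are shared.

def local_update(w, data, lr=0.01):
    """One local SGD epoch on a client's private (x, y) pairs."""
    for x, y in data:
        grad = 2 * (w * x - y) * x  # gradient of squared error
        w -= lr * grad
    return w

def fed_avg(global_w, clients, rounds=50):
    for _ in range(rounds):
        updates = [local_update(global_w, c) for c in clients]
        global_w = sum(updates) / len(updates)  # server only aggregates
    return global_w

clients = [
    [(1.0, 3.1), (2.0, 5.9)],   # client A's private data (true w ~ 3)
    [(3.0, 9.2), (4.0, 11.8)],  # client B's private data
]
print(fed_avg(0.0, clients))  # converges near w = 3
```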
Secure multi-party computation: multiple parties jointly compute a function over their private inputs while keeping those inputs secret from each other; the output is revealed only to authorized parties.
Examples: joint fraud or money-laundering detection across banks without sharing customer records, privacy-preserving salary or wage-gap benchmarking, and private set intersection for contact discovery.
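One building block, additive secret sharing, can be sketched as follows; the three-party salary-sum scenario is hypothetical:

```python
import random

# Additive secret sharing over a prime field: three parties learn the
# sum of their private inputs without any party seeing another's input.
# A toy sketch of one MPC building block, not a full protocol.

P = 2**61 - 1  # a Mersenne prime used as the field modulus

def share(secret, n_parties, rng):
    """Split `secret` into n random shares that sum to it mod P."""
    shares = [rng.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

rng = random.Random(42)
salaries = [52_000, 61_000, 48_000]  # each party's private input

# Each party splits its input and hands one share to every party.
all_shares = [share(s, 3, rng) for s in salaries]

# Party j locally sums the shares it received; combining the partial
# sums reveals only the total, never any individual salary.
partial = [sum(col) % P for col in zip(*all_shares)]
total = sum(partial) % P
print(total)  # 161000, the sum of all salaries
```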
Homomorphic encryption: allows computation on ciphertexts such that the result, after decryption, matches the result of the same operations on the corresponding plaintexts.
Examples: outsourcing analytics on encrypted medical or financial records to untrusted cloud providers, and encrypted-inference prototypes built on libraries such as Microsoft SEAL.
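A textbook Paillier construction illustrates additive homomorphism; the fixed small primes make this a demonstration only, far below secure key sizes:

```python
import math
import random

# Toy Paillier cryptosystem showing additive homomorphism:
# E(m1) * E(m2) mod n^2 decrypts to m1 + m2. The tiny fixed primes
# are for illustration only and provide no real security.

p, q = 999983, 1000003          # small demonstration primes
n, nsq = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, nsq)), -1, n)

def encrypt(m, rng):
    r = rng.randrange(1, n)      # random blinding factor
    return (pow(g, m, nsq) * pow(r, n, nsq)) % nsq

def decrypt(c):
    return (L(pow(c, lam, nsq)) * mu) % n

rng = random.Random(7)
c1, c2 = encrypt(12, rng), encrypt(30, rng)
c_sum = (c1 * c2) % nsq          # addition happens on ciphertexts
print(decrypt(c_sum))            # 42, computed without seeing 12 or 30
```

Multiplying ciphertexts adds plaintexts, which is exactly the property that lets an untrusted server aggregate encrypted values.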
Synthetic data generation: produces artificial datasets that statistically mimic real data distributions while containing no actual personal records, often using generative models such as GANs or diffusion models.
Examples: synthetic patient records for health research, synthetic transaction data for fraud-model development, and synthetic microdata releases by statistical agencies.
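A deliberately naive sketch fits an independent Gaussian per column and samples from it; real generators model joint structure across columns, and all data below is fabricated:

```python
import random
import statistics

# Naive synthetic-data sketch: fit per-column Gaussians to "real" data
# and sample artificial records. GANs and diffusion models capture
# joint structure; this marginal-only version is illustrative.

real_ages = [34, 45, 29, 52, 41, 38, 47, 33]
real_incomes = [48_000, 67_000, 39_000, 81_000,
                59_000, 54_000, 72_000, 45_000]

def fit_and_sample(column, n, rng):
    mu = statistics.mean(column)
    sigma = statistics.stdev(column)
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(0)
synthetic_ages = fit_and_sample(real_ages, 100, rng)
synthetic_incomes = fit_and_sample(real_incomes, 100, rng)

# Synthetic rows resemble the real marginals but match no real person.
print(round(statistics.mean(synthetic_ages), 1))
```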
On-device inference: performs model inference entirely on the user's device using local compute resources, avoiding transmission of input data to remote servers.
Examples: on-device face recognition for phone unlocking, local speech-to-text transcription, and on-device photo and text classification in mobile operating systems.
Regulatory frameworks impose obligations on data controllers and processors for AI systems handling personal data, limiting collection, ensuring lawful processing, granting rights, and requiring risk assessments.
Examples:
GDPR Article 6 requires a lawful basis for processing personal data (e.g., consent, contract, legitimate interests with balancing test). Article 9 restricts special category (sensitive) data unless explicit consent or narrow exception applies.
CCPA/CPRA provides California residents opt-out rights for sale/sharing of personal information and limits on sensitive data use, plus disclosure requirements for significant automated decisions.
GDPR Article 35 requires Data Protection Impact Assessments for high-risk processing (e.g., large-scale profiling, significant-effect automated decisions, large-scale sensitive data).
GDPR Article 22 grants rights regarding solely automated decisions with legal or significant effects, including human intervention, contestation, and explanation of logic.
GDPR data minimization (Article 5(1)(c)) limits data to what is necessary, enforced via audits, retention policies, and erasure rights (Article 17).
In Canada, PIPEDA governs commercial personal information handling, requiring meaningful consent, collection limitation, and safeguards. OPC guidance addresses AI automated decision-making, stressing transparency and accountability.
These rules shape AI data practices in the EU, USA, and Canada, with extraterritorial influence.
Digital platform markets show high concentration, with dominant firms holding large user bases and data volumes. This reduces incentives to compete on privacy strength, as users face high switching costs and few privacy-focused alternatives. Firms gain private benefits from data collection (better model performance, targeted advertising revenue) while externalizing social costs (privacy erosion, discrimination, behavioural chilling effects), resulting in over-collection beyond the social optimum.
Stronger privacy measures generally reduce signal quality and predictive accuracy: noise injection, data minimization, and aggregation all remove information that models would otherwise exploit.
Adversarial attacks exploit model outputs to breach privacy, including membership inference (determining whether a record was in the training set), model inversion (reconstructing training inputs), and attribute inference from predictions.
Adding calibrated noise to achieve meaningful privacy guarantees frequently degrades model utility to unacceptable levels for tasks involving rare events or small subpopulations.
Federated learning distributes training across client devices, but the decentralized nature introduces attack vectors not present in centralized training.
Notice-and-choice frameworks rely on users actively reading and understanding privacy policies, but real-world behaviour shows systematic failure of this model.
Techniques intended to remove direct identifiers prove insufficient against linkage attacks using auxiliary information available in public or commercial datasets.
Systems trained on historical enforcement data inherit and amplify existing biases present in arrest, stop, and conviction records.
AI systems depend on large-scale data collection that includes personal, sensitive, behavioural, and inferred information. Re-identification risks undermine anonymization efforts, while everyday applications erode privacy through pervasive tracking. Surveillance and predictive tools concentrate power and amplify bias. Technical protections (differential privacy, federated learning) and regulatory frameworks (consent, minimization) offer partial mitigation, but significant trade-offs remain between utility and privacy. Addressing these tensions requires coordinated advances in technology, policy, and institutional design.