
Introduction
Existing privacy protections are not sufficient to curtail automated decision-making by big technology companies. Automated decisions are increasingly widespread and can have harmful impacts.
Artificial Intelligence (AI) relies on vast amounts of data. Data's social or relational properties can reveal information about individuals that they never directly provided, which reduces the meaningful control individuals have over their data.
This article explores the tension between data production practices and privacy protection in the AI age.
Automated decision-making directly impacts our lives
Companies increasingly deploy AI systems to make automated decisions about millions of customers and workers.
Consider how digital ride-hailing platforms use AI to dispatch large numbers of ride matches at a time and tailor dynamic pricing. Often, these decisions are presented to users as choices, but only in a significantly constrained sense. Drivers, in particular, may be harmed by automated decisions about their tasks and remuneration, leaving them with little control over their work. At a sufficiently large scale, collective harm can occur when people in similar circumstances are affected by the same decisions.
In advanced economies, harms from automated decisions by AI systems have been reported in health insurance, unemployment benefits, and other domains. AI systems' underperformance in their operational contexts is one risk, but other problems associated with automated decisions go beyond accidental harm.
How is automated decision-making done?
In commercial applications, automated decision-making is based on actionable insights derived from predictive analytics. AI systems allow these analytics to be automated and scaled up: they can combine vast amounts of data from various sources, process the data, and make decisions autonomously for millions of users (or cases) at a time.
For example, in recommender systems, machine learning (ML) models predict users’ preferred products, movies, or music by learning from a dataset of other users with similar browsing or purchasing histories. The model outputs a recommended list as an actionable insight.
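As a rough illustration of the idea (not any particular platform's system, and with entirely hypothetical purchase data), the sketch below scores unseen items for one user according to how often similar users bought them, with similarity computed over purchase histories:

```python
# Toy collaborative-filtering sketch: recommend items bought by similar users.
import numpy as np

# Rows = users, columns = items; 1 means the user bought the item.
purchases = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 0, 0, 0],  # the user we recommend for
])

target = purchases[-1]
others = purchases[:-1]

# Cosine similarity between the target user and every other user.
sims = others @ target / (
    np.linalg.norm(others, axis=1) * np.linalg.norm(target) + 1e-9
)

# Score items by the purchases of similar users, weighted by similarity.
scores = sims @ others
scores[target == 1] = -np.inf           # never recommend items already bought
recommended = np.argsort(scores)[::-1]  # item indices, best first
print(recommended[:2])                  # the "actionable insight": a recommended list
```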
ML models rely on an immense volume of data: generally, the more data a model is trained on, the more accurate its outputs tend to be. Because ML models' performance depends on the size and quality of data, companies are clamouring to expand the scale of data production.
This has led to the mushrooming of a data production industry dedicated to the collection, processing, storage, and circulation of data. Individuals are now subjected not only to the collection of identifiable personal information but also to expanding surveillance of their behaviour, which turns all aspects of life into data. This is known as datafication.
Legal scholars have pointed out the incompatibility of privacy with data production in the AI age, because data is social or relational in nature. Advances in statistical tools and AI have changed the ways in which data are processed and used. To understand this incompatibility, we must first look at how value is derived from data's relationality.
The value of social data
Data is social
In 2018, it was revealed that Cambridge Analytica had harvested user data through the app "thisisyourdigitallife" to develop predictive psychological profiles, which were used to target users with similar profiles with political advertisements on Facebook. Most of the affected Facebook users never disclosed their data to Cambridge Analytica, yet accurate prediction exposed them to ad targeting all the same.
The incident demonstrated a privacy problem: the ability of Cambridge Analytica's algorithms to make predictions about one group of people based on information collected from another shows that data reveals relationships between people.
Consider a financial services platform that uses an ML model trained on user data such as browsing histories, socio-economic class, and financial product preferences. Suppose Alice shares only her browsing history with this platform. The model infers sensitive information about her, such as socio-economic class and financial interests, from her browsing data. If the platform then uses this inferred information to target her with advertisements for financial products, Alice is affected by the data of others, regardless of whether she ever chose to disclose the target information herself.
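A minimal sketch of this scenario (the feature names, values, and model choice are all hypothetical): a classifier fitted on other users who shared both browsing-derived features and their socio-economic class is used to infer Alice's class from her browsing data alone.

```python
# Hypothetical illustration: inferring an undisclosed attribute from browsing data.
from sklearn.linear_model import LogisticRegression

# Other users disclosed both browsing-derived features (e.g. visit counts to
# luxury retail, budgeting tools, payday-loan sites) and their socio-economic class.
other_users_browsing = [
    [12, 3, 0],
    [1, 14, 6],
    [0, 9, 11],
    [15, 2, 1],
]
other_users_class = ["high", "low", "low", "high"]

model = LogisticRegression().fit(other_users_browsing, other_users_class)

# Alice shared only her browsing history, never her socio-economic class,
# yet the model still produces an inference the platform can act on.
alice_browsing = [[2, 11, 8]]
print(model.predict(alice_browsing))  # an inferred class, not a disclosure by Alice
```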
Salome Viljoen (2021) called this the "relationality" of data. Relationality refers to the phenomenon whereby information about others can reveal information about us when processed or aggregated.
Data production is motivated by the social nature of data
An individual datum is not useful in itself; it is only by relating one datum to another that meaningful links emerge to inform valuable insights. According to Viljoen (2021), in the digital economy, data isn't collected solely because of what it reveals about us as individuals. Rather, data is valuable primarily because of how it can be aggregated and processed to reveal things (and inform actions) about groups of people. Datafication, in other words, is a social process, not a personal one.
Companies and organisations now voraciously collect data to produce predictive analytics about users. More data yields better approximations of groups and of the relationships between the features linked to users.
Machine learning aims to "automatically detect meaningful patterns in training data" in order to make predictions about new data. This ability to gain insights and automate decisions is crucial for deriving value from data. The goal is to develop a prediction rule that approximates the relationship between pieces of information, such as correlations between input features and target variables.
In a way, models construct identities at an aggregated level, sometimes called "profiles." For example, "women earning below the median wage" is an input variable, or profile, that groups individuals based on shared characteristics. The prediction rule approximates the relationship between profiles and a target variable, such as the likelihood that women earning below the median wage will take out loans. This relationship is the target function, or "pattern". ML seeks to predict the target variable in new data based on the patterns modelled in the training data.
A subset of users' data is selected as training data for a predictive model. These are often the data of users who disclose some target information, such as gender or earnings. A prediction rule is modelled between the target information and readily available auxiliary information such as browsing history, clicks, and latency. The prediction rule modelled from this pool of data is then used to infer the same information about the rest of the users, even if they have not explicitly disclosed it. This prediction becomes an actionable insight used either to make automated decisions about users, such as ad targeting, or to aid human decision-making.
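A minimal sketch of such a prediction rule, using the loan example above (the column names, values, and the simple group-level rule are all hypothetical; production systems use far richer features and models):

```python
# Profile-based prediction rule: learn a pattern from users who disclosed the
# target information, then apply it to users who never disclosed it.
import pandas as pd

# Training pool: users who disclosed the target variable (loan uptake)
# alongside the auxiliary attributes used to form profiles.
training = pd.DataFrame({
    "gender":       ["F", "F", "M", "F", "M", "F"],
    "below_median": [True, True, False, True, False, True],
    "took_loan":    [1, 0, 1, 1, 0, 1],
})

# The "pattern": for each profile (combination of auxiliary attributes),
# estimate the likelihood of the target variable.
pattern = (
    training
    .groupby(["gender", "below_median"])["took_loan"]
    .mean()
    .rename("p_loan")
)

# The remaining users never disclosed whether they took a loan; the learned
# pattern is applied to their profiles to infer it anyway.
others = pd.DataFrame({
    "gender":       ["F", "M"],
    "below_median": [True, False],
})
insights = others.join(pattern, on=["gender", "below_median"])
print(insights)  # the inferred loan likelihood becomes the "actionable insight"
```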
Current privacy protections are insufficient
The dominant regulatory approach to information flow is a combination of transparency and choice, also known as notice-and-consent or informed consent. The approach "requires that individuals be notified and grant their permission before information about them is collected and used." It also stresses the role of individuals as data subjects and their autonomy over information disclosures. Hence, regulatory efforts often emphasise the protection of personal information or personally identifiable information.
Data relationality undermines privacy protection based on informed consent. Data protection laws protect information at an individual level, whereas AI sidesteps the need for an individual's informed consent to learn information about that individual.
The ability of AI to produce highly accurate predictions about us from the aggregated information of others erodes privacy and constrains the meaningful control individuals can have over their data.
Privacy disclosures (or privacy notices) explaining how users' data are collected and used are now widely implemented across the web. On the notice-and-consent view, data subjects' privacy is protected so long as they have legitimate control over the permissions they grant to disclose their personal information.
In reality, most digital platforms offer opt-in contracts for their services on a "take-it-or-leave-it" basis. These contracts leave users with little real deciding power, as big digital platforms accrue users by undercutting competition from alternative platforms.
Helen Nissenbaum has argued that informed consent is impractical in the Internet age. Modern Big Data analytics draw on and combine data from various sources, and companies also trade data with one another, making it hard for users to assess the trade-offs of giving away their information. The ability of AI to infer private information about us from public auxiliary information such as cookies, clickstreams, latencies, and IP addresses makes drawing boundaries between private and public information a futile exercise, and makes any individual privacy calculus intractably difficult.
Conclusion
In an age of pervasive information flow, where data collection, processing, and use are everywhere, data governance is crucial. The crux of data governance lies in "balancing data openness and control." Because data brings essential benefits in the public interest, improved access to and broader sharing of data can expand the reach of benefits that raise living standards. Conversely, loose data flows can lead to misuse and unjust outcomes.
Protecting privacy has been one of the critical principles of data handling, intended to strengthen trust in information systems. However, current privacy protection regimes rely on individualist notions of information control, which may not be sufficient to safeguard society from the harms of an economy driven by social predictions built on shared data.