It is true that artificial intelligence (AI) will come to influence almost every aspect of our lives. In the scramble to realize the potential economic and societal benefits promised by AI, the ready availability of massive, complex, and assumed-to-be generalizable datasets with which to train and test new algorithms is vital. The interaction of governments with their citizens throughout their lives generates huge volumes of diverse information, and these continuously expanding repositories of data are now seen as a public good, providing the raw material for AI industries.
In passing the National Artificial Intelligence Initiative Act of 2020 (NAIIA), the United States has adopted a path similar to that of the European Union, as defined within the European Commission's Coordinated Plan on Artificial Intelligence 2021 Review. Under the provisions of the NAIIA, the National Artificial Intelligence Research Resource Task Force (NAIRRTF) has been constituted to make recommendations to Congress on, among other things, the capabilities necessary to create shared computing infrastructure for use by AI researchers and potential solutions to "barriers to the dissemination and use of high-quality government data sets."
Three factors together create a potent cocktail of issues for the NAIRRTF to address: the goal of trustworthy AI systems, shared computing resources that will presumably form an extension of existing government cloud infrastructure, and the envisioned influence of AI systems in the real and virtual worlds. For evidence of the difficulties associated with the use of public data for apparently altruistic motives, the NAIRRTF needs to look no further than a protracted debate in the United Kingdom surrounding the proposed use of medical data for research purposes.
As part of the General Practice Data for Planning and Research (GPDPR) initiative — not to be confused with the pertinent General Data Protection Regulation (GDPR) — NHS Digital intends to collect medical data from patient interactions with general practitioners to form a national data resource in the UK in support of research activity. In response to widespread concerns about patient privacy, NHS Digital has attempted to provide public reassurance. The provision of medical data to Google DeepMind by The Royal Free London NHS Foundation Trust in 2016, however, illustrates the pitfalls associated with individual consent when sharing data for research. The NAIRRTF must recognize how the failures documented by the Royal Free case, and perceived misuse of private data in AI development such as the current controversy around facial image scraping by Clearview AI, serve to undermine public trust in the AI industry.
The NAIRRTF must also tackle challenges created by the scope of the NAIIA legislation, which aims to coordinate knowledge transfer from AI research across civilian agencies, the Department of Defense, and the intelligence community. While patients can explicitly opt out of the GPDPR system, the dual issues of consent and auditability represent one of the primary barriers to success for the National Artificial Intelligence Research Resource (NAIRR). If I retain ownership of the data generated by my interaction with the government, how can I exert control over its use? How can I ensure enforcement of my consent, so that my data may be used by AI researchers working in one domain but not by researchers working in another?
For example, how can I verify that my data is used to develop new clinical AI solutions and AI tools for drug discovery, which I endorse, while preventing the use of my data for research in other areas, such as military AI applications, to which I may be opposed? This problem becomes acute where a specific AI technology, such as image classification, can be applied in both domains. Only by effective implementation of policy constraints at the point of use can data resources used in AI development be audited based on consent, providing transparency for the data owner, algorithm developer, and regulatory bodies.
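As a rough illustration of what enforcing consent at the point of use might look like, the following Python sketch checks each data request against the purposes a data owner has approved and records every decision, allowed or denied, in an audit log. The class names, purpose labels, and identifiers are hypothetical and chosen only for this example; nothing here describes an actual NAIRR design.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ConsentRecord:
    """Purposes a data owner has approved (hypothetical schema)."""
    owner_id: str
    allowed_purposes: frozenset


@dataclass
class PolicyGate:
    """Checks consent when data is requested and logs every decision."""
    consents: dict
    audit_log: list = field(default_factory=list)

    def request(self, owner_id, requester, purpose):
        record = self.consents.get(owner_id)
        # Deny by default: no consent record means no access.
        allowed = record is not None and purpose in record.allowed_purposes
        # Log denials as well as approvals so the trail is complete.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "owner_id": owner_id,
            "requester": requester,
            "purpose": purpose,
            "allowed": allowed,
        })
        return allowed


# Illustrative data: one owner who endorses clinical and drug-discovery use.
consents = {
    "citizen-001": ConsentRecord(
        "citizen-001", frozenset({"clinical_ai", "drug_discovery"})
    ),
}
gate = PolicyGate(consents)

clinical_ok = gate.request("citizen-001", "hospital-lab", "clinical_ai")
military_ok = gate.request("citizen-001", "defense-lab", "military_ai")
unknown_ok = gate.request("citizen-999", "hospital-lab", "clinical_ai")
```

In this sketch the decision depends on the declared purpose of the request, not on the underlying technique, so the same image classification method could be approved for clinical research and refused for a military application. A real system would also need to verify that the declared purpose is truthful, which this example does not attempt.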
Visibility of data use is also important because the value of modern AI applications resides in their ability to process large, disparate, multidimensional data sets to provide new insights or execute a specified decision task with real-world consequences. As the breadth and depth of data increase, so does the risk that individuals become identifiable, even where precautionary anonymization has taken place. Consequently, the NAIRRTF needs to determine how the privacy of individuals will be protected where the combination of different datasets brings together indirect identifiers of discrete individuals or communities, and where the function of an AI system is directly, or indirectly, biased with respect to different subjects.
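The re-identification risk described above can be made concrete with a small sketch. One common way to reason about it, used here as an assumption and not drawn from the NAIIA or the task force's work, is k-anonymity: a data set is k-anonymous if every combination of indirect identifiers is shared by at least k records. The records and field names below are invented for illustration.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group sharing the same indirect-identifier values."""
    if not records:
        return 0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    return min(Counter(keys).values())


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of identifier values covers at least k records."""
    return smallest_group(records, quasi_identifiers) >= k


# Invented records with no names or direct identifiers.
records = [
    {"zip": "20001", "birth_year": 1980, "sex": "F"},
    {"zip": "20001", "birth_year": 1980, "sex": "F"},
    {"zip": "20001", "birth_year": 1980, "sex": "M"},
    {"zip": "20001", "birth_year": 1975, "sex": "M"},
    {"zip": "20001", "birth_year": 1975, "sex": "M"},
    {"zip": "20001", "birth_year": 1975, "sex": "M"},
]

# Two attributes: every group holds three records.
before = smallest_group(records, ["zip", "birth_year"])
# Adding a third attribute, as joining another data set might, isolates one record.
after = smallest_group(records, ["zip", "birth_year", "sex"])
```

With two attributes, each person is hidden among three records; once a third attribute is combined, one record stands alone and could be tied to a single individual even though no name was ever stored.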
As the full implications of the SUNBURST cyber incident emerge, the NAIRRTF needs to look to an architectural design for the NAIRR that puts individual privacy and consent at its center. Demonstrable enforcement of policies and permissions at the point that data is used is fundamental to the success of the NAIRR mission, and it requires a segmented infrastructure that restricts the use of data to authorized users and applications. In its recommendations to Congress on shared computing resources, the NAIRRTF must include new technical innovations that support both the NAIRR objectives and the adoption of a zero-trust architecture within government.
To ensure public confidence in the AI systems enabled by the NAIRR, the same public contributing the massive volumes of data required for their development must have confidence that their data is being used in a safe and ethical manner. While addressing the problem of making government data available to the AI research community, the NAIRRTF must not lose sight of concerns at the individual level and must confront the issues of trust that continue to beleaguer AI research and implementation. Only by establishing the foundations for public trust in the NAIRR will the ideal of trustworthy AI be realized.