The last couple of years have seen a tremendous increase in the tempo of debate around the benefits, problems and risks associated with Artificial Intelligence (AI). But what is not emphasised enough is that machine learning models only learn patterns and make predictions based on the data they are trained on. 

In other words, AI is only as good as the data that is fed into it.

This is particularly the case for generative AI, the form of AI technology currently receiving perhaps the most public and media attention. Generative AI is an umbrella term for machine learning systems, such as ChatGPT, that generate artificial digital content. ‘Rather than merely processing existing data, generative AI entails the creation of models that can generate new content, such as images, videos, music, or text. Deep learning is a type of machine learning that employs neural networks to understand patterns and generate new data.’[1]

Link Digital understands that quality data is essential to creating what is increasingly referred to as ‘ethical AI’, that is, AI run on ethical principles: AI that is safer, more reliable and more equitable. The following article offers a very brief explanation of what is meant by good data in the context of AI, along with some recent protocols and guidelines that have been developed to help find or create it. Topics like this are also discussed in a new series of public forums Link Digital has started hosting on the last Thursday of every month (Australian Eastern Standard Time), the details of which are included at the end.

Good data for AI: what does it mean and how can we get it?

As AI is much discussed but arguably little understood, some basic definitions are useful to start with. The Arts Law Centre of Australia argues that while there is no single accepted definition of AI, it can be understood as ‘computer systems that are able to perform tasks normally requiring human intelligence. This includes visual perception, speech recognition, decision-making, translation between languages and generation of creative outputs.’[2] The Canadian Government’s 2023 Directive on Automated Decision-Making defines AI as ‘Information technology that performs tasks that would ordinarily require biological brainpower to accomplish, such as making sense of spoken language, learning behaviours or solving problems.’ The United Nations Educational, Scientific and Cultural Organisation (UNESCO) adds the key observation that AI can be divided into several interrelated disciplines, including natural language processing, knowledge representation, automated reasoning, machine learning and robotics.[3]

What we now understand as AI has been around since at least 1950, when the English mathematician Alan Turing published ‘Computing Machinery and Intelligence’, which proposed a machine intelligence test he termed ‘The Imitation Game’.[4] The first rules to govern artificial intelligence systems, arguably the precursor to what we now refer to as ethical AI, were the ‘Three Laws of Robotics’, collected the same year in the American science fiction writer Isaac Asimov’s anthology, I, Robot. The term ‘Artificial Intelligence’ was coined six years later by the American computer scientist John McCarthy.[5]

Several factors have led us to where we are now, both in terms of AI’s growing abilities and the at times almost fevered debate around its transformative possibilities and dangers. Foremost are the massive expansion of computing power since the early 2000s, the gradual digitisation of nearly all key aspects of life, and technology’s growing ability to generate and capture data. These trends were accelerated by the COVID-19 pandemic, which increased data sharing within and between governments, and between governments and researchers, to support vaccine research and collaboration on measures such as contact tracing, much of which was also fed into AI applications.

Why good data matters for AI models

As a 2023 UNESCO report, Open Data for AI: What Now?, puts it, the current AI boom is largely based ‘on massive advancements in machine (and particularly) deep learning, which necessitates large data sets.’[6] While some of these AI applications are run on closed or privately owned data, for the most part they operate on open data, that is, data ‘anyone can freely access, use, modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness).’[7]

Open data has major benefits for AI. Aside from the fact that it can accelerate the development of AI by making more data available on which to train it, open data can make it easier for policymakers, researchers and the public to compare the effectiveness of different AI models, which promotes greater transparency. 

But it is not enough for that data to be open and shareable; it also needs to be quality data.

Major concerns around AI include its implications for privacy, security, reliability, accuracy, fairness and inclusion. In addition to posing longer-term ethical questions, these issues have very real and immediate implications. As the Canadian Government’s Directive on Automated Decision-Making puts it: ‘An AI tool could, for example, decide whether someone is eligible for a service, determine the level of benefits someone is entitled to, or process survey data to inform policy direction.’ Central to this concern is AI’s capacity to exhibit bias. This might surface in content generated by ChatGPT, or in a service decision influenced by an AI algorithm, either of which might be discriminatory, non-representative or based on stereotypes, including those around racial identity and gender.[8]

The key source of this bias is the datasets drawn from the open Internet that are used to train AI, which contain biases that AI technology simply perpetuates. Deloitte notes this bias can take several forms: ‘induction bias’, which occurs ‘when an algorithm is trained on a dataset which provides limited interaction with varying demographics. For example, facial recognition systems that are trained primarily on the faces of white men are significantly more likely to misidentify the faces of women or minorities’; ‘latent bias’, where algorithms trained on historical data may reproduce stereotypes; and ‘selection bias’, where one group is overrepresented in a dataset and another underrepresented.[9]
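
Selection bias in particular lends itself to a simple automated check. The following is a minimal sketch, in Python with pandas, of how a training set’s demographic makeup might be compared against reference population shares; the column name, group labels and reference figures are all hypothetical.

```python
import pandas as pd

# Minimal sketch: compare a training set's demographic makeup against
# reference population shares to flag potential selection bias.
# The column name, group labels and reference shares are hypothetical.

def representation_report(df, group_col, reference_shares):
    """Compare each group's share of the dataset with a reference share."""
    observed = df[group_col].value_counts(normalize=True)
    rows = []
    for group, expected in reference_shares.items():
        share = float(observed.get(group, 0.0))
        rows.append({
            "group": group,
            "dataset_share": round(share, 3),
            "reference_share": expected,
            "ratio": round(share / expected, 2) if expected else None,
        })
    return pd.DataFrame(rows)

# Toy example: one group dominates the training data.
train = pd.DataFrame({"gender": ["male"] * 80 + ["female"] * 20})
print(representation_report(train, "gender", {"male": 0.5, "female": 0.5}))
```

A ratio well above or below 1.0 would not prove the resulting model is biased, but it flags the dataset for closer scrutiny before training begins.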

In addition to the implications for individuals or groups of people, inaccurate, out-of-date, incomplete or biased data can have a significant impact on organisations and businesses. It can lead to incorrect assumptions and decisions and undermine trust in those providing digital services.

As the World Wide Web Foundation’s Open Data Barometer noted, a lot of open data, including government data, is typically incomplete and can be of low quality.[10] This can stem from problems in the way the data was collected, or from the fact that it is not standardised, is mislabelled, or comes from unknown sources.
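
Some of these problems can be surfaced with basic automated checks before the data goes anywhere near a model. Below is a minimal sketch, assuming the data is loaded into a pandas DataFrame; the toy records are hypothetical.

```python
import pandas as pd

# Minimal sketch of basic quality checks on an open dataset before it is
# used for AI training. The toy records below are hypothetical.
df = pd.DataFrame({
    "country": ["Australia", "australia", "AUS", None],
    "value": [1.2, 3.4, None, 5.6],
})

# Completeness: what share of each column is missing?
print(df.isna().mean().rename("share_missing"))

# Standardisation: the same entity spelled several ways is a red flag.
print(df["country"].str.strip().str.lower().value_counts())

# Flag incomplete rows for curation rather than silently dropping them,
# preserving a record of how the dataset was altered.
needs_review = df[df.isna().any(axis=1)]
print(f"{len(needs_review)} of {len(df)} rows need review")
```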

High-quality data for AI

The call for quality data is one of the key principles of the movement around what is referred to as the Third Wave of Open Data. This emphasises that in addition to making data easy to find, access and understand, it must be of high quality. It is also in line with the FAIR principles, which seek to ensure open data can be accessed efficiently and that there is greater interoperability between data-sharing portals. First enunciated in a 2016 article in the open-access journal Scientific Data, the FAIR principles are Findability, Accessibility, Interoperability and Reusability of digital data.[11] The FAIR acronym has also been interpreted as ‘Federated, AI-Ready’ to underline the importance of data being usable by AI systems.[12] FAIR in this context means that the data is Findable, Accessible, Interoperable and thus Reusable by both humans and machines.
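
In practice, much of FAIR comes down to publishing rich, machine-readable metadata alongside the data itself. The sketch below shows what such a record might look like as a Python dict, with field names loosely following the DCAT vocabulary used by many open data portals; all of the values are hypothetical.

```python
# A minimal sketch of FAIR-ready dataset metadata as a Python dict.
# Field names loosely follow the DCAT vocabulary; values are hypothetical.
dataset_metadata = {
    # Findable: a persistent identifier and rich descriptive metadata.
    "identifier": "https://doi.org/10.xxxx/example-dataset",  # hypothetical DOI
    "title": "Example disaggregated service-usage dataset",
    "keywords": ["open data", "service usage", "disaggregated"],
    # Accessible: a standard protocol (HTTPS) and an explicit licence.
    "access_url": "https://data.example.org/dataset/example",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # Interoperable: open, machine-readable formats and shared vocabularies.
    "distribution_format": "text/csv",
    "conforms_to": "https://www.w3.org/TR/vocab-dcat-2/",
    # Reusable: provenance and context so others can judge fitness for reuse.
    "provenance": "Collected 2022-2023 via consented user surveys",
    "temporal_coverage": "2022-01-01/2023-12-31",
}
```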

What can be done to collect higher-quality open data, and to mitigate the risks posed by existing open data that might exhibit these problems?

In recognition that most AI systems are only as good as the data they are trained on, many governments and organisations have developed laws and suggested protocols to help improve data quality. For example, the European Parliament is about to adopt new regulations stipulating that providers of foundation AI models, such as Google and ChatGPT owner OpenAI, must describe the data sources, including potentially copyrighted data, used to train their models.

The UNESCO Open Data for AI: What Now? report notes the vital importance of effective governance frameworks in ensuring the data collected is of sufficient quality. This includes ensuring that the data is not outdated, is comprehensive and is collected from credible sources. It should have been collected ‘with consent only and not in privacy-invasive ways. Data about people should be disaggregated where relevant, ideally by income, sex, age, race, ethnicity, migratory status, disability and geographic location.’[13] Moreover, ‘efforts are required to convert the relevant data in a machine-readable format, which involves curating of the collected data, i.e., cleaning and labelling.’[14]
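
To make those curation steps concrete, here is a minimal sketch in Python with pandas of cleaning, labelling and disaggregating a small dataset. The toy records, column names and age bands are hypothetical, and real curation pipelines would be considerably more involved.

```python
import pandas as pd

# Minimal sketch of the curation steps described above: cleaning,
# labelling and disaggregating data. The records, column names and
# age bands are hypothetical.
raw = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female", None],
    "age": [23, 41, 23, 67, 35, 29],
})

# Cleaning: drop duplicate rows and rows missing essential fields.
clean = raw.drop_duplicates().dropna(subset=["sex", "age"]).copy()

# Labelling: convert raw ages into machine-readable age bands.
clean["age_band"] = pd.cut(clean["age"], bins=[0, 17, 34, 54, 120],
                           labels=["<18", "18-34", "35-54", "55+"])

# Disaggregation: tabulate by sex and age band rather than publishing
# only aggregate totals.
print(clean.groupby(["sex", "age_band"], observed=True)
           .size()
           .rename("respondents")
           .reset_index())
```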

Best Practices for Collecting and Curating AI Data

The European Parliament’s law, the UNESCO report, the Canadian Government’s Directive on Automated Decision-Making, and Implementing Australia’s AI Ethics Principles: A selection of Responsible AI practices and resources, a 2023 report by Australia’s Commonwealth Scientific and Industrial Research Organisation (CSIRO) and the not-for-profit Gradient Institute, all focus particularly on measures to reduce potential bias in open data.

Common themes include the need for fairness, privacy and protection of data, and transparency and explainability. The CSIRO/Gradient Institute principles stress that there ‘should be transparency and responsible disclosure so people can understand when they are being significantly impacted by AI and can find out when an AI system is engaging with them.’[15] They state that ‘when an AI system significantly impacts a person, community, group or environment, there should be a timely process to allow people to challenge the use or outcomes of the AI system.’[16] The principles also stress the importance of participation: enlisting the public, where appropriate and possible, to help correct bias; deliberately engaging with marginalised communities; and hiring staff from diverse backgrounds, disciplines and demographics.[17]

The role of data in AI success

The Canadian Government’s Directive on Automated Decision-Making suggests a wide range of measures to minimise bias. These include reviewing content developed using generative AI for biases or stereotypical associations, formulating prompts that generate content offering holistic perspectives and minimising bias, and striving to understand the data used to train the tool in question, where it came from, and how it was selected and prepared. ‘Open datasets also carry with them the imprint of how they were created. These datasets contain critical information reflecting a valuable historical record of transactions. But if those historical records are incomplete or reflect historical biases, they might train future AI models to recreate those biases.’ The Directive is backed by an explicit strategy of using data to foster equity, fairness and inclusion: Statistics Canada’s Disaggregated Data Action Plan, created under the aegis of the 2023-2026 Data Strategy for the Federal Public Service, includes the ambitious aim of breaking data down by specific population groups, with a current focus on Indigenous Peoples, women and people with disability.

How Link Digital supports AI with data management

As an organisation deeply committed to effectively leveraging open data to solve real-world problems, Link Digital understands the necessity of creating ethical AI. One important way we can begin to have a meaningful impact in this area is through engaging in discussions about the importance of open data and how we can make it work for greater transparency, equity, and democratic accountability.

This is one of the many topics discussed in the series of forums Link Digital has hosted over recent months. You can watch the recording of Data Governance for AI, which we hosted with Open North. In it, John Griffin shares Open North’s perspective on responsible AI use in government: reduce the focus on AI’s uniqueness, prioritise digital maturity, and match issues with solutions.


You can also watch recordings of the other forum topics.

These forums will connect you with like-minded experts who are passionate about discussing the importance of open data.

Want to stay updated on the latest developments in the field? Receive weekly updates from Link Digital.


[1] United Nations Educational, Scientific and Cultural Organisation (UNESCO), Open Data for AI: What Now? (Paris: UNESCO, 2023), 39, https://www.unesco.org/en/articles/open-data-ai-what-now

[2] Arts Law Centre, “Artificial Intelligence (AI) and copyright,” accessed September 19, 2023, https://www.artslaw.com.au/information-sheet/artificial-intelligence-ai-and-copyright/

[3] UNESCO, Open data for AI, 39.

[4] Council of Europe, “History of Artificial Intelligence,” accessed September 19, 2023, https://www.coe.int/en/web/artificial-intelligence/history-of-ai

[5] Ibid.

[6] UNESCO, Open data for AI, 39.

[7] “The Open Definition,” Open Knowledge Foundation, accessed September 18, 2023, https://opendefinition.org/

[8] For examples see “4 shocking AI biases,” Prolific, accessed September 18, 2023, https://www.prolific.co/blog/shocking-ai-bias; and Terence Shin, “Real-life Examples of Discriminating Artificial Intelligence,” Towards Data Science, June 5, 2020, https://towardsdatascience.com/real-life-examples-of-discriminating-artificial-intelligence-cae395a90070

[9] Tasha Austin, Kara Busath, Allie Diehl, and Pankaj Kamleshkumar Kishani, “Trustworthy open data for trustworthy AI: Opportunities and risks of using open data for AI,” Deloitte Insights, December 10, 2021, https://www2.deloitte.com/xe/en/insights/industry/public-sector/open-data-ai-explainable-trustworthy.html

[10] World Wide Web Foundation, Open Data Barometer: Global Report, 4th edition, 14-15, https://webfoundation.org/research/open-data-barometer-fourth-edition/

[11] Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al., “The FAIR Guiding Principles for scientific data management and stewardship,” Scientific Data 3, 160018 (2016): 1-9, https://doi.org/10.1038/sdata.2016.18

[12] UNESCO, Open data for AI, 15.

[13] Ibid., 17.

[14] Ibid., 18.

[15] Alistair Reid, Simon O’Callaghan, and Yaya Lu, Implementing Australia’s AI Ethics Principles: A selection of Responsible AI practices and resources (Gradient Institute and CSIRO, 2023), 28.

[16] Ibid., 38.

[17] Ibid., 15.