What is the future of open data and CKAN – the Comprehensive Knowledge Archive Network – given the rapid changes taking place in the open data ecosystem, in particular the rise of AI?

This question was the focus of a seminar organised by the Open Knowledge Foundation (OKF) to celebrate the recent Open Data Day 2026 and the 20th anniversary of CKAN, which included presentations from Link Digital Executive Director Steven De Costa and Link Digital Full Stack Developer/CKAN expert Oleksandr Cherniavsky.

The open data movement and CKAN, a major component of the open source technology that supports it, face significant challenges and major opportunities. Core to both is making open data infrastructure more functional and relevant in the face of advances in machine learning, which is changing the terms of effective data management.

This article looks at some of the challenges and opportunities discussed in the OKF seminar. It also touches on a new Link Digital capability presented during the event, which aims to address the gap between the large amount of data held on CKAN open data portals and the struggle many portal users have in locating and understanding the data they want.

Different stages of the open data movement

For over two decades the open data movement – in which Link Digital has been an active participant – has been defined by the relatively straightforward philosophy and purpose of making data freely available for everyone to use and republish, without restrictions.

The ‘First Wave’ of this movement was defined by volume: how many datasets can be released? The ‘Second Wave’ focused on the FAIR principles (Findable, Accessible, Interoperable, and Reusable).

Central to the ‘Third Wave’, which emerged in the late 2010s, was the proposition that the open data ecosystem had moved beyond simply releasing data and towards systematising it: helping governments, private corporations and civil society organisations better know, understand and share the data they hold, in order to provide improved products and services and a more informed citizenry.

We are currently in the Fourth Wave, characterised by The Gov Lab as the next frontier, ‘where open data is more conversational and AI ready, data quality and provenance are centre stage, a whole range of new use cases from open data are feasible and there are new avenues of data collaboration.’ 

Adapting infrastructure for generative AI

The rise of generative AI and large language models (LLMs) means that making data open, while still crucial, is now no longer enough.

Data now needs to be specifically structured to make it machine readable for the AI agents that now crawl portals on behalf of LLMs and specialised analytical tools. Modern open data repositories are moving away from static files to architectures that support vector search and real-time streaming, allowing AI agents to query data directly rather than downloading massive batches. And data provenance is increasingly vital, allowing creators to tag their work with ‘AI-usage’ permissions.

Threats facing the open data movement

Within this emerging framework are both threats and opportunities for the open data movement.

The threats were discussed at length during the OKF event. These include AI’s ability to amplify bias, and the risk of distortions and hallucinations as the internet becomes flooded with AI-generated content and models train on their own outputs until they lose their grounding in reality.

More specific to the open data movement is the fact that a large amount of open data is not AI ready. If a portal’s data isn’t machine-actionable it risks becoming almost invisible: AI agents may skip the portal entirely. Or, worse, they may misinterpret key facts because the underlying data lacks the necessary structure and trust, which in turn can misrepresent or damage an organisation’s authority.

The opportunities in the AI era

One of the most significant opportunities comes from the fact that after years of scraping the public internet, AI developers have nearly exhausted the supply of high-quality, human-generated text. This has created an interesting tension: while more data is being produced than ever before, the openly accessible portion of high-quality training data is shrinking as platforms move behind paywalls to protect their assets from AI scrapers. High-quality, verified open data from government and scientific institutions can thus serve as a ground truth of sorts: an anchor that prevents AI models from drifting into hallucination.

This is a positive development for the open data movement, but taking advantage of it is not easy. As Ruth Del Campo, General Director for Data at the Spanish Ministry for Digital Transformation and the Civil Service, put it in one of the panels at the OKF seminar: “How do we provide curated, trusted, open data resources for LLMs, in a way that creates efficiencies but does not exacerbate inequalities?”

What does this mean for data portals?

The current environment throws up many questions, but to focus on just one, what are the implications for one of the most common CKAN uses, open data portals?

For the last two decades, the goals of data portal management have been relatively simple: efficient, reliable storage, and making data findable and shareable. Organisations built digital portals to act as repositories, which sometimes essentially functioned as passive digital filing cabinets where information was placed for safekeeping.

AI is fundamentally shifting the goal posts, and the traditional open data portal, a static repository of CSV files, is no longer fit for purpose. There are many ways that portals will have to evolve to meet this, but a key one is surfacing their most impactful data.

For AI agents, this means making the data machine readable and findable. For individual CKAN portal users, it means that traditional search, matching keywords in titles, descriptions, and other metadata fields, is not enough. It works well when users know the exact phrasing used by publishers, but real-world users often search differently. Enabling users to undertake semantic search that is contextually informed is vital.
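The difference between the two approaches can be illustrated with a minimal sketch. The dataset titles and three-dimensional vectors below are hand-made toys, not real embeddings; in practice the vectors would come from an embedding model.

```python
import math

# Toy embeddings standing in for a real embedding model's output.
# The titles and vectors are illustrative only.
EMBEDDINGS = {
    "Road traffic accident statistics": [0.9, 0.1, 0.2],
    "Public library opening hours":     [0.1, 0.9, 0.1],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def keyword_search(query, titles):
    """Classic keyword matching: a hit only if a query word appears in the title."""
    words = set(query.lower().split())
    return [t for t in titles if words & set(t.lower().split())]

def semantic_search(query_vector, top_k=1):
    """Rank datasets by cosine similarity between query and title embeddings."""
    ranked = sorted(
        EMBEDDINGS,
        key=lambda t: cosine_similarity(query_vector, EMBEDDINGS[t]),
        reverse=True,
    )
    return ranked[:top_k]

# "car crash data" shares no words with "Road traffic accident statistics"...
print(keyword_search("car crash data", EMBEDDINGS))  # → []
# ...but its embedding sits close to that dataset's embedding, so it is found.
print(semantic_search([0.85, 0.15, 0.25]))  # → ['Road traffic accident statistics']
```

The keyword search returns nothing because the publisher's phrasing differs from the user's, while the embedding-based search still surfaces the right dataset.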

Introducing Link Digital’s Ask AI capabilities

Improved data discoverability is always going to be an ongoing challenge for open data portals. Ask AI, a new CKAN extension by Link Digital, seeks to address one aspect of it: many users still struggle to locate the right dataset when they don’t know the exact terminology. They manually inspect metadata. They download files to understand the structure. In practice, access exists, but usability often lags far behind. Ask AI introduces an AI-powered layer on top of CKAN to close this gap between data discoverability and interpretation.

Ask AI search layer

At its core is semantic similarity search: instead of matching words, it matches meaning. Behind the scenes, dataset and resource metadata – along with supported document content – are converted into vector embeddings and stored in PostgreSQL using an extension called pgvector. When a user performs a search, the system retrieves results based on semantic proximity rather than simple text overlap.

The effect is broader recall without sacrificing relevance, discovery across inconsistent terminology, and reduced dependence on perfect metadata alignment. For organisations managing large portals, this materially improves dataset discoverability without requiring publishers to change workflows. It also means that not only is the data on CKAN-powered portals more findable and accessible, but users do not need to be data scientists to understand their search results.

Ask AI’s security configuration

Importantly, in terms of data security, Ask AI can be configured so that sensitive data remains within a secure infrastructure and proprietary information is never used to train public models like ChatGPT or Gemini. A data manager can define exactly which datasets the AI can access and who is authorised to query them.
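The kind of policy a data manager might define can be sketched like this. The policy keys and the enforcement helper are purely illustrative assumptions, not Ask AI's actual configuration API.

```python
# Illustrative sketch only: these keys and the helper below are assumptions,
# not Ask AI's actual configuration interface.
ASK_AI_POLICY = {
    "allowed_datasets": {"air-quality-2025", "bus-routes"},
    "allowed_roles": {"analyst", "admin"},
}

def can_query(user_role, dataset_id, policy=ASK_AI_POLICY):
    """Allow a query only if both the user's role and the dataset are in policy."""
    return (
        user_role in policy["allowed_roles"]
        and dataset_id in policy["allowed_datasets"]
    )

print(can_query("analyst", "air-quality-2025"))  # → True
print(can_query("guest", "air-quality-2025"))    # → False
```

The point of such a gate is that the AI layer only ever sees datasets that have been explicitly allow-listed, rather than everything the portal holds.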

CKAN has made a vital contribution to making data open and shareable. Key to the next part of its journey will be how it can use AI to make the data on CKAN portals more accessible and understandable.

“Knowledge is getting closer. It used to be that we had to go to a library, and you had to look stuff up, then you had to go to stack overflow and ask the right questions and search through a whole lot of stuff,” De Costa said during his presentation at the OKF seminar. “A lot of that searching and discovery is chaperoned into our hands quite easily.”

What we need now is the ability to get a better understanding of the data we are asking for and what we get. “The future of data in an AI world is largely how we come to understand knowledge in our systems, in our civic systems, in our governance systems, in our commercial systems.”

There are many aspects to this, including being much more mindful about how we create our data systems, improved data governance, data pipelines and the training of more appropriate LLMs, to name a few. But we are confident Ask AI can play a role in this effort.

You can read more about Ask AI in this short article on our website.

Or sign up for a demo, and we can talk about your project and how Ask AI can improve your users’ interactions with data. Get in touch to discuss how we can help.