Link Digital and open data

A lot has been written, including on this site, about the challenges posed by the rapid rise of AI for the open data ecosystem. But at Link Digital we don’t just want to talk about this. We are doing something about it.

We have two decades of experience building data management solutions for a wide range of clients globally, using the Comprehensive Knowledge Archive Network (CKAN) and other open source tools. Our core mission remains empowering our clients to make meaningful change with their data. But now we are using our extensive experience to think differently about how to make open data infrastructure more effective in the face of advances in machine learning. 

This means developing open source tools that not only make data management more AI-ready but also ensure it’s done in a secure and ethical way. Here is a brief overview of three ways that Link Digital is currently helping to do this.

Ask AI: a new extension for CKAN-based data portals

You will have heard the term ‘agentic’ now being used a lot in relation to how data portals are evolving. In short, agentic portals incorporate AI agents to understand, manipulate, and act upon the data on a user’s behalf. While much of this functionality is still in development, one area that Link Digital has worked on is the incorporation of natural language or semantic search interaction into our open data portals and catalogues. 

In a traditional portal, users rely on keywords to find a dataset. With semantic understanding, portals don’t just match filenames; they understand the context of the data. This relates to what data specialists Dr. Stefaan Verhulst and Adam Zable described in a recent article as “a broader paradox at the heart of contemporary data governance: data may be open, yet it remains functionally inaccessible for many intended users.” In other words, many organisations now sit on data collections so large that they become too hard for non-technical or non-expert staff to use, let alone interested members of the public.

Ask AI, a new CKAN extension by Link Digital, is designed to lower the barrier to the effective use of data portals by allowing users to get expert-level insights without needing to be data scientists. Users often struggle to locate the right dataset when they are unsure of the exact terminology to search for, and end up manually inspecting metadata and downloading files they may not need just to understand their structure. Ask AI instead uses semantic similarity to help users search: rather than matching words, it matches meaning. Behind the scenes, dataset and resource metadata – along with supported document content – are converted into vector embeddings and stored in PostgreSQL using an extension called pgvector. When a user performs a search, the system retrieves results based on semantic proximity rather than simple text overlap.

For organisations managing large portals, this is a game changer: it improves dataset discoverability without requiring publishers to change their data workflows. To ensure accuracy, every AI-generated answer provides a direct link to the source dataset. This makes the data on CKAN-powered portals more findable and thus more accessible, and users do not need to be data scientists to understand their search results.
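To make the retrieval step described above more concrete, here is a minimal sketch of ranking by semantic proximity. It is not the Ask AI implementation: the dataset titles and three-dimensional vectors are invented for illustration, and a real deployment would use model-generated embeddings with hundreds of dimensions. In pgvector the `<=>` operator computes cosine distance, so the same ranking could be expressed in SQL by ordering rows on something like `embedding <=> query_vector` (column and parameter names here are hypothetical).

```python
import math

# Toy 3-dimensional "embeddings". Real embeddings come from a model and have
# hundreds of dimensions; these numbers are invented purely for illustration.
DATASET_EMBEDDINGS = {
    "Annual rainfall by region": [0.9, 0.1, 0.0],
    "Public transport patronage": [0.0, 0.2, 0.9],
    "Hospital waiting times": [0.1, 0.9, 0.2],
}


def cosine_distance(a, b):
    """Return 1 - cosine similarity; smaller means closer in meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def semantic_search(query_embedding, embeddings, limit=2):
    """Rank dataset titles by cosine distance to the query embedding."""
    ranked = sorted(
        embeddings.items(),
        key=lambda item: cosine_distance(query_embedding, item[1]),
    )
    return [title for title, _ in ranked[:limit]]


# Pretend this vector is the embedding of "how much did it rain last year?".
# The query need not share keywords with a title, only a nearby vector.
query = [0.8, 0.2, 0.1]
results = semantic_search(query, DATASET_EMBEDDINGS)
print(results)
```

The rainfall dataset ranks first because its vector points in nearly the same direction as the query, regardless of which words appear in the title.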

Importantly, while the Ask AI feature enables users to interact with a data catalogue or portal more like they’re interacting with ChatGPT, it does so without sacrificing security. This is because Ask AI can be configured in such a way that sensitive data remains within a secure infrastructure and proprietary information is never used to train public models like ChatGPT or Gemini. It enables a data manager to define exactly which datasets the AI can access and who is authorised to query them. This means the AI cannot guess or make up facts as it is tethered to the trusted datasets in your portal. 

You can find out more about Link Digital’s Ask AI feature here.

Machine readable multilingual metadata

MLCommons is an open engineering consortium dedicated to improving AI systems by developing standardised benchmarks and large-scale open datasets. Earlier this year they released Croissant 1.1, a new high-level metadata format for machine learning datasets designed to make open datasets on CKAN easier to discover, use, and integrate with machine learning tools and platforms.

Before Croissant 1.1, AI models often struggled to understand the datasets hosted on CKAN because the descriptive metadata was meant for humans, not machine learning pipelines. The work of MLCommons means that any dataset can now be automatically translated into a format for AI tools to understand instantly without manual cleanup.

Link Digital has been an active participant in the Croissant 1.1 working group and the efforts of one of our designers were key to the implementation of multilingual metadata support. By enabling datasets to hold labels and descriptions in multiple languages at the same time, this update ensures that global data portals can speak the same language – making AI research more inclusive and accessible globally. 

This is a small but vital contribution to meeting the much larger challenge caused by the fact that most of the world’s most powerful AI has been built using data labelled in English. While large language models (LLMs) are getting better at providing multilingual tools, this is no substitute for having metadata in the actual language rather than relying on a tool to do the translation for you. Native multilingual metadata is a major step forward in democratising the technology.
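As a rough illustration of what holding labels in several languages at once can look like, the sketch below uses generic JSON-LD language-tagged values (`@value` and `@language`). Croissant is built on JSON-LD, but the exact serialisation chosen in Croissant 1.1 may differ, so treat this shape as an assumption and consult the specification. The dataset and the `localised` helper are invented for the example.

```python
import json

# Hypothetical dataset description using JSON-LD language-tagged values.
# The dataset is invented; the exact Croissant 1.1 layout may differ.
metadata = {
    "@type": "Dataset",
    "name": [
        {"@value": "Air quality measurements", "@language": "en"},
        {"@value": "Mesures de la qualité de l'air", "@language": "fr"},
        {"@value": "Mediciones de calidad del aire", "@language": "es"},
    ],
}


def localised(values, language, fallback="en"):
    """Return the text tagged with `language`, else the fallback language."""
    by_language = {v["@language"]: v["@value"] for v in values}
    return by_language.get(language, by_language.get(fallback))


print(localised(metadata["name"], "fr"))
print(localised(metadata["name"], "de"))  # no German label, falls back to English
print(json.dumps(metadata, ensure_ascii=False, indent=2))
```

Because every label travels with the dataset, a portal or pipeline can pick the reader's language directly instead of translating an English label on the fly.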

The Objective Observer Initiative: An ethical framework for AI

The Objective Observer Initiative (OOI) is a new capability being developed by Link Digital. 

While tools like Croissant and Ask AI can make a vital contribution to the ‘how’ of machine readability, by improving facets such as discoverability and metadata, OOI is focused on ethical issues involved in developing more trustworthy models. 

In the past, open data simply meant the file was available for download. Link Digital posits that in an AI world, making data open, while vital, is not enough because AI can’t easily distinguish between high-quality facts and ‘hallucinated’ or biased data. In addition to sharing data, Link Digital believes that what is needed is ‘verifiable honesty.’

The OOI does this by focusing on the ‘signals’ in data. In a CKAN context, this means adding metadata that doesn’t just describe the columns but declares the intent of the data creator. When AI reads a dataset, it can use this additional metadata to verify whether the data is honest, unbiased, and fit for purpose. It can also check whether a dataset’s observed outcomes match the stated intent of the individual or organisation that uploaded it.

This differs markedly from the main measure of satisfaction currently used for digital platforms, which treats engagement, clicks, and efficiency as the only signals of consent. The current situation relies on feeding data into opaque machine learning tools and passively receiving the results that emerge on the other side. Changing this has the potential to turn CKAN from a simple repository into a more fundamental source of truth for responsible AI.
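The idea of declaring intent as metadata and then checking observed outcomes against it can be pictured with a very small sketch. The OOI is still in development and no schema is given here, so every name below (`IntentDeclaration`, `expected_coverage`, `matches_stated_intent`) is hypothetical and the coverage check is only one imaginable signal.

```python
from dataclasses import dataclass


@dataclass
class IntentDeclaration:
    """Hypothetical intent metadata; these field names are not an OOI schema."""
    stated_purpose: str
    expected_coverage: float  # share of regions the publisher says are covered


def matches_stated_intent(declaration, observed_coverage, tolerance=0.05):
    """Check whether observed coverage is within tolerance of what was declared."""
    return abs(declaration.expected_coverage - observed_coverage) <= tolerance


declaration = IntentDeclaration(
    stated_purpose="Report air quality for every region in the state",
    expected_coverage=1.0,
)

# A dataset covering 98% of regions is consistent with the declaration;
# one covering only 60% is flagged as not matching its stated intent.
print(matches_stated_intent(declaration, observed_coverage=0.98))
print(matches_stated_intent(declaration, observed_coverage=0.60))
```

The point of the sketch is only that a declared intent gives a machine something explicit to compare the data against.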

While the OOI is still in development, it is already influencing Link Digital’s work with data portals. As Link Digital’s Executive Director put it in a recent interview: “There is an opportunity to co-construct datasets that provide optimistic signals. Such datasets would naturally be made discoverable on portals such as those we develop and maintain for our clients using CKAN, Drupal and other open technologies.”

You can read more about the OOI on our site here.

Greater discoverability, multilingual metadata for machine learning datasets, and a more transparent and ethical model of AI. Link Digital is not just talking about how the open data ecosystem can meet the challenges of AI; we are helping it to do so.