What is a data dictionary and what are its benefits?

This question was the subject of a recent presentation by Ian Ward, one of our Senior Solutions Architects, as part of the Canadian Polar Data Workshop VI, co-hosted by the Waterloo Climate Institute and the Polar Data Catalogue Canadian Cryospheric Information Network at the University of Waterloo.

I spoke to Ian about his talk and how data dictionaries can help users better understand their data.

Defining a data dictionary

Ian defines a data dictionary as a file that is typically published alongside a dataset to provide users with the necessary information to interpret the data that you’re giving them – like a centralised reference guide that defines the structure, meaning, and rules of a specific dataset or database.

“It can be anything from a CSV file that lists the individual columns and what they’re supposed to contain, or it could be a fully described Croissant file that includes provenance information and instructions on how to collect individual image files and make them correspond to tables in a database. The most important things are just titles and descriptions of what the data [is] that you’re publishing.”

This is very similar to other forms of metadata, which is often used as the starting point of a data dictionary. A data dictionary “can go down to the individual columns or you can also have information about the controlled lists and things like that, but it really varies based on who the intended audience is.

“So, if you’re working in reproducible science, it’s going to be important that you have information about every single step that was involved in the production of that data. That would be an important part of your data dictionary. But for many users, just having clear titles and descriptions of the fields is still much better than looking at a plain CSV file.”
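To make the simplest form Ian describes concrete, here is a minimal sketch of a data dictionary kept as a CSV file that lists each column with a title and description. The column names, descriptions and sample rows are all hypothetical and were invented for illustration; they do not come from Ian's talk or from any real dataset.

```python
import csv
import io

# Hypothetical data dictionary: one row per column of the published dataset.
DATA_DICTIONARY_CSV = """column,title,description,unit
station_id,Station ID,Code of the sampling location,
sample_date,Sample date,Date the measurement was taken (YYYY-MM-DD),
ice_thickness,Ice thickness,Measured thickness of the sea ice,cm
"""

# Hypothetical dataset published alongside the dictionary.
DATASET_CSV = """station_id,sample_date,ice_thickness
HB01,2024-02-01,112
HB02,2024-02-01,97
"""


def load_dictionary(text):
    """Return a mapping from column name to its dictionary entry."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["column"]: row for row in reader}


def undocumented_columns(dataset_text, dictionary):
    """List dataset columns that have no entry in the data dictionary."""
    header = next(csv.reader(io.StringIO(dataset_text)))
    return [name for name in header if name not in dictionary]


dictionary = load_dictionary(DATA_DICTIONARY_CSV)
missing = undocumented_columns(DATASET_CSV, dictionary)

for name, entry in dictionary.items():
    print(f"{name}: {entry['title']} - {entry['description']}")
print("Undocumented columns:", missing)
```

Even a file this small gives a reader a title and description for every field, which is the baseline Ian argues for.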

Who can benefit from a data dictionary?

In terms of who can benefit from a data dictionary, Ian maintains it is anybody who is going to be working with your data. “Because it means that if they are building systems that they can rely on the format that’s going to be published [and] they can see whether that format changes over time. Perhaps data that was published last year is going to be different than this year, and updating the data dictionary makes it clear that when they’re doing that integration that they need to account for that.”

“But also, the people that are publishing the data can benefit as well. It forces you to think as a data publisher about exactly how you’re going to format things. And you might potentially think while you’re describing them, well, you know, maybe as a consumer, you know, this way of writing it might be better than another way.”
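The point about formats changing from one year to the next can be sketched in a few lines. The example below compares two versions of a data dictionary and reports which columns were added, removed or redefined. The two dictionaries and the function name compare_dictionaries are hypothetical, chosen only to illustrate the idea.

```python
def compare_dictionaries(old, new):
    """Report columns added, removed, or redefined between two versions."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(
        name for name in set(old) & set(new) if old[name] != new[name]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical dictionaries: column name -> description including units.
last_year = {
    "station_id": "Code of the sampling location",
    "ice_thickness": "Sea ice thickness in inches",
}
this_year = {
    "station_id": "Code of the sampling location",
    "ice_thickness": "Sea ice thickness in centimetres",
    "sample_date": "Date the measurement was taken",
}

changes = compare_dictionaries(last_year, this_year)
print(changes)
```

Someone integrating the data could run a check like this before loading a new release and see straight away that a unit has changed.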

One use case Ian gave during his talk came from the Canadian Watershed Information Network website: sea ice levels on Hudson Bay [a large saltwater body located in northeastern Canada]. “That’s one of their most important datasets and I went through a bunch of possible different ways of formatting a data dictionary for that type of dataset. And when you’re producing a dataset like that, it might be obvious things like the units that you’re using or what a code means for a particular location where a sample has been captured. So, you can say exactly what units you’re using, how a measurement is taken, and what the different codes that you see there actually correspond to in the real world.”
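As a rough illustration of how units and location codes from a data dictionary turn a raw record into something readable, consider the sketch below. The station codes, their meanings and the field names are hypothetical and are not taken from the Canadian Watershed Information Network data.

```python
# Hypothetical controlled list: location code -> real-world meaning.
STATION_CODES = {
    "HB01": "Example station, western shore",
    "HB02": "Example station, eastern shore",
}

# Hypothetical units for each measured field.
FIELD_UNITS = {"ice_thickness": "cm"}


def describe(record):
    """Render one raw record using the codes and units from the dictionary."""
    place = STATION_CODES.get(record["station_id"], "unknown station code")
    unit = FIELD_UNITS["ice_thickness"]
    return f"{place}: ice thickness {record['ice_thickness']} {unit}"


print(describe({"station_id": "HB01", "ice_thickness": "112"}))
```

Without the two lookup tables, a reader sees only "HB01" and "112" and has to guess what each one means.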

Ian believes data dictionaries also have a role to play in helping to enable transparent, future-ready governments. “Everything is driven by the data that we’re publishing and being able to understand that is key to being able to turn that [data] into the outcomes that you’d like to see. Anybody that’s publishing and using data, which is almost everybody, can benefit from a clearer understanding of what’s being published and what’s being consumed.”

Governments are increasingly turning to AI to improve services, anticipate citizen needs, and make faster, smarter decisions. But AI doesn’t work without high-quality data.

“One of the formats that I was discussing was the Croissant format that was specifically designed for machine learning data because that data is often extremely large and formatted very differently than your more typical data set that you might see, maybe produced with Excel.” A data dictionary “allows you to tie together information like directories of files and tables in a way that you can then consume it with a system that would be used to train a language model.”

“But on the other side, of course, these models are built using the data that we publish. So, clearly having a clear understanding of exactly what’s being published makes for much better training so that you can build the types of systems that you’re trying to do whether that’s with AI or with any kind of automated system.”

Want to talk about data dictionaries? 

Contact us here and one of Link Digital’s experts will be in touch with you.