Table of contents
- The question being asked by data managers…
- CEOS and CanWIN
- A lot of data, but is it doing anything?
- Open source a gamechanger in research environments
- Sharing information in a way that made sense
- Using a portal to tell better stories with data
- A change to combine multiple datasets
- CKAN will play a key role in these plans
- Want to create a data portal? Don’t reinvent the wheel
When we talk about data portals, we usually focus on their role in collecting and cataloguing data, improving its quality and making it more discoverable and shareable. But even with the very best data portal, it can still just be data sitting on a site.
The question being asked by data managers…
A question data managers are asking more and more is how to move the data on a portal into the hands of different audiences: not just expert groups like researchers and policy analysts, but local communities who can use the insights generated by the data to act on the issues of most concern to them. Claire Herbert, Head of Digital Strategy at the Canadian Watershed Information Network (CanWIN), an open data portal managed by University of Manitoba’s Centre for Earth Observation Science (CEOS), is one data manager for whom this concern is front and centre.
CEOS and CanWIN
CEOS is a leading multidisciplinary and collaborative research centre focused on understanding how the earth will respond to climate change. It conducts fieldwork globally, with the Arctic freshwater marine system a particular focus due to the acute impacts of climate change on it. CanWIN is the data portal that collects and manages the large array of climate and hydrological data generated by this research, including detailed spatial data, and shares it with researchers.
But the portal also works with other stakeholders, including Indigenous rights holders in Canada. ‘What they really want to know is, how is climate change affecting my bay which my community depends on for multiple uses,’ says Herbert. ‘To be able to tell that story, as a researcher, I need access to multiple types of data, for example, weather station data, atmospheric and water quality data, etc. And then I can put all those datasets together and actually make a story about it. I can say I have analysed all these different bits of data, and this is the result. From this ten-year study I can see that water quality has changed over time in the area, you’re getting more algae blooms, and the water temperature has increased by two degrees. And we know from global climate change patterns that this is typical of climate change.’
Herbert believes an important task for CanWIN ‘is trying to get out science facts in a way that is understandable to the public and tells a story about a relevant topic.’
A lot of data, but is it doing anything?
The original CanWIN database was developed by Canada’s federal government to host freshwater data collected by various departments and agencies. The data often went back decades, but it was completely unorganised and largely inaccessible. ‘Biological data. Physical data. Geological data. A lot of that information was originally hard copies, before we had computers. A lot of it wasn’t digitalised, even after we had computers. And what was digitised was in multiple government departments around the country.’
The problem came to a head for Herbert, a freshwater biologist by training, while she was working for the Department of Fisheries and Oceans. The agency had been collecting freshwater data going back decades, but the data was inaccessible to the public, and many other freshwater datasets were stored in different databases around the country.
Parallel to this, another government department responsible for sharing water-related data, Environment and Climate Change Canada, realised it had a similar problem. It started to think about different data management solutions but faced major difficulties. It was using a proprietary database system mandated by the government. The system was expensive, and its very specific schema made it inflexible and complex to use, particularly for its users, who were scientists, not database experts. Herbert even took a diploma in database administration to try and figure out how to better manage data. ‘So, long story short, then the government entered into an agreement with the University of Manitoba for us to host that freshwater database and I was hired to manage it.’
Open source a gamechanger in research environments
It was while she was trying to understand all the different types of freshwater data that existed, and how they were collected, that Herbert realised how unsuited a rigidly structured proprietary database was to an environment in which researchers are always adding variables so they can collect information in a way that makes sense. She started looking at open source technology.
CanWIN was originally created as a relational database developed with Visual Studio. Ultimately, however, this was not able to capture and share the complexity of the multi-disciplinary data produced and stored on the portal. In 2020, Link Digital was engaged to completely overhaul the portal using CKAN (the Comprehensive Knowledge Archive Network) as its backbone, along with Drupal for content management.
CKAN’s modular structure made it flexible and easy to customise with extensions. ‘Extensions are really important to what we do,’ Herbert emphasises. CKAN also aligned with the funding structure of higher education, which is mainly grant-based, and thus largely rules out having the budget to fund an expensive, ongoing software licence with a proprietary vendor.
Sharing information in a way that made sense
‘We designed CanWIN specifically to focus on freshwater and Arctic data. We did not try and do everything. And we chose CKAN ultimately because it gave us the flexibility to create a space where we felt we could share the information in a way that made sense.’ As part of this, Herbert adds, ‘we work in a lot of developing countries, so I didn’t want to build something that couldn’t be used by any collaborators because it was too costly.’
The other gamechanger was CKAN’s ability to help CEOS develop consistent, standardised metadata schemas. This enables researchers to understand the type of data, how it was collected and what instruments were used to collect it. They were also able to create a common data dictionary that researchers working across the globe were able to use to compare data, especially when they wanted to look at the effects of things like climate change on a system.
CKAN also offered the ability to create a user-friendly backend. While CanWIN staff still do a lot of the actual uploading of data, ‘we can give researchers access to be able to upload the data themselves. We can have required fields, not required fields, controlled vocabularies, drop-down menus. And the fact that we have it wrapped in Drupal, that makes a huge difference and makes it more user friendly.’
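To make the idea of required fields and controlled vocabularies concrete, here is a minimal sketch in plain Python. It is not CKAN code and not CanWIN's actual schema: the field names, the vocabulary values, and the `validate_record` helper are all hypothetical, chosen only to show how an upload form could reject incomplete or inconsistent metadata.

```python
# Hypothetical schema: field names and vocabulary values are illustrative only,
# not CanWIN's real metadata schema.
SCHEMA = {
    "title": {"required": True},
    "instrument": {"required": True},
    # A controlled vocabulary is what a drop-down menu would offer.
    "sample_medium": {
        "required": True,
        "choices": {"freshwater", "sea ice", "atmosphere"},
    },
    "notes": {"required": False},
}


def validate_record(record, schema=SCHEMA):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, rules in schema.items():
        value = record.get(field)
        if value is None or value == "":
            # Empty optional fields are fine; empty required fields are not.
            if rules.get("required"):
                problems.append(f"missing required field: {field}")
            continue
        choices = rules.get("choices")
        if choices is not None and value not in choices:
            problems.append(
                f"{field}: '{value}' is not in the controlled vocabulary"
            )
    return problems
```

A record that fills every required field with an allowed value returns an empty list, while a record with a missing instrument or a free-text medium such as "lake" returns one message per problem.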
CanWIN currently hosts 124 datasets, each of which contains between three and seven resources, which are often datasets in their own right. But an additional 34,000 datasets are searchable from other data portals CanWIN federates: Canada’s national open data portal and the Columbia Basin Water Hub, run by Living Lakes Canada, a charitable water stewardship non-government organisation working with community groups to protect freshwater.
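For readers who want to see what searching datasets and their resources looks like programmatically, the sketch below uses CKAN's Action API, which exposes a `package_search` action at `/api/3/action/package_search`. The base URL `https://portal.example.org` is a placeholder, not CanWIN's address, and the helper names `build_search_url` and `summarise` are hypothetical. The code only builds the request URL and reads an already decoded JSON response, so it makes no network call itself.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, rows=10):
    """Build a CKAN package_search URL for a free-text query."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def summarise(response):
    """Return (title, resource count) pairs from a decoded package_search reply.

    CKAN wraps results as {"success": ..., "result": {"count": ..., "results": [...]}}.
    """
    if not response.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    datasets = response["result"]["results"]
    return [(d["title"], len(d.get("resources", []))) for d in datasets]
```

Fetching the URL with any HTTP client and passing the decoded JSON to `summarise` gives a quick view of how many resources sit under each matching dataset.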

Using a portal to tell better stories with data
CanWIN’s initial mission was to improve the collection and sharing of freshwater data. With the upgrade, it moved to help researchers better tackle the complexity of the multi-disciplinary data they were collecting, including showing the relationships between the data and the larger projects they were related to, and the instruments used to collect it.
The next stage of CanWIN’s evolution includes plans to increase the number of hubs it pulls data from. But the central concern, according to Herbert, is getting the data out to communities. Anyone can already access the data on the portal easily. Likewise, CanWIN can generate metrics that allow its team to track where datasets were used in public policy documents. But Herbert believes this approach is not enough and a bigger effort needs to be made to simplify access to the data and make it more understandable for a non-expert audience.
A change to combine multiple datasets
One change will improve CanWIN’s main landing page to better highlight research reports featuring data the portal hosts. Another idea is specially produced plain-language knowledge products that combine multiple datasets from the portal to tell a more comprehensive story. ‘Elevated summaries that contain what you might want to tell a politician in two minutes that’s important to them. And those knowledge products can then be used to inform policy about topics like climate change.’
‘Knowledge products mean there are multiple actions you can do from them. They can be used to inform policy, but they can also be used to take to a community for information. They can also be used to teach in a school. They are a method to get science facts out to the public, especially in an era where it’s so easy to spread misinformation.’ And where they want to, people can also connect with the researcher who collected that data and ask questions or collaborate with them.
CKAN will play a key role in these plans
‘CKAN allows us to not only host the data that we collect, but to standardize it into common language terms that everyone can understand. So, I don’t have to be a weather station specialist to know what barometric pressure means, because that description will be in there.’
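One way to picture the common-language descriptions Herbert mentions is a small data dictionary that maps raw column names to plain-language labels. The sketch below is an assumption-laden illustration in plain Python: the column names (`baro_press`, `water_temp`), the units, and the `describe_columns` helper are hypothetical and do not come from CanWIN's data dictionary.

```python
# Hypothetical data dictionary: keys, units, and wording are illustrative only.
DATA_DICTIONARY = {
    "baro_press": {
        "label": "Barometric pressure",
        "unit": "hPa",
        "description": "Pressure exerted by the weight of the atmosphere "
                       "at the measurement site.",
    },
    "water_temp": {
        "label": "Water temperature",
        "unit": "degrees Celsius",
        "description": "Temperature of the water at the sampling depth.",
    },
}


def describe_columns(columns, dictionary=DATA_DICTIONARY):
    """Map each column name to a plain-language description, if one exists."""
    described = {}
    for column in columns:
        entry = dictionary.get(column)
        if entry is None:
            # Flag undocumented columns instead of guessing their meaning.
            described[column] = "no description available"
        else:
            described[column] = (
                f"{entry['label']} ({entry['unit']}): {entry['description']}"
            )
    return described
```

Columns without an entry are flagged rather than guessed at, which makes gaps in the dictionary visible to whoever curates it.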
Want to create a data portal? Don’t reinvent the wheel
Herbert has advice for those starting their own journey managing a data portal.
‘Based on my experience, I would say don’t reinvent the wheel. Start simple. What’s the most basic key component you need to have and then figure out who does them. Because there’s going to be people who already do them. Then figure out which tool is going to work for you. There’s almost nothing that people have to create themselves.’
‘This is a problem that I still see in all the aspects of data management. People just want to always run in and do their own thing. And then you get isolated repositories that are absolutely no use in the long term or can’t be supported, because they are expensive or difficult to maintain.’
Linked to this, it is important to have an idea of your short- and long-term goals for your portal. ‘If you understand your goals, you can make a budget and try and figure out how you can actually sustain that portal in the long term and, if you can’t, look for alternatives. There are better ways, like collaborating with an existing repository, that may serve you better with the money you have.’