The release of Croissant 1.1 (in coordination with MLCommons) represents an important milestone in making datasets more accessible and interoperable worldwide. At the core of this update is a major feature, multilingual metadata support, which Link Digital’s senior solutions architect, Ian Ward, helped to implement.
Croissant is a new high-level metadata format for machine learning datasets designed to make datasets easier to discover, use, share, and integrate with machine learning tools and platforms. Ward’s contribution aligns Croissant’s metadata structure with the standards used in CKAN, specifically the ckanext-fluent and ckanext-scheming approaches that he has long championed. By enabling datasets to hold labels and descriptions in multiple languages simultaneously, this update ensures that global data portals can ‘speak the same language,’ making AI research more inclusive and accessible worldwide.
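To make the idea concrete, here is a minimal Python sketch of a language-keyed field in the style ckanext-fluent uses, where each translated field is an object mapping language codes to text. The field name, the sample values, and the pick_language helper are hypothetical illustrations, and the exact way Croissant 1.1 serializes multilingual values may differ from this.

```python
from typing import Mapping

# Hypothetical multilingual field in the ckanext-fluent style:
# one entry per language code, all held at the same time.
title_translated = {
    "en": "Air quality measurements",
    "fr": "Mesures de la qualité de l'air",
}


def pick_language(field: Mapping[str, str], preferred: str, fallback: str = "en") -> str:
    """Return the text for the preferred language, then the fallback,
    then any available value, or an empty string if the field is empty."""
    if preferred in field:
        return field[preferred]
    if fallback in field:
        return field[fallback]
    return next(iter(field.values()), "")


print(pick_language(title_translated, "fr"))
print(pick_language(title_translated, "de"))  # no German entry, so the fallback is used
```

A portal that renders the dataset for a French-speaking user can show the French title, while a user whose language is missing still gets a usable label instead of a blank.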

Why it matters for the global community
In the past, much of the world’s most powerful AI was built using data labeled primarily in English, creating a ‘digital language gap.’ By making it easy for datasets to be described in any language, Ian and the Link Digital team are helping to democratize AI. Our work is helping researchers in non-English-speaking regions to find, understand, and use high-quality data more effectively, ensuring that the benefits of artificial intelligence are shared by everyone, regardless of what language they speak.
Ongoing work
Beyond this milestone, Ian has been a key voice in shaping the future of the ML data standard. His ongoing work and discussions focus on two areas: composite foreign key support, which allows multiple fields to be linked together to define a unique relationship between different tables or files within a dataset, and data validation, a capability offered through Croissant’s Python library (mlcroissant) and visual editor that ensures the metadata and the underlying data adhere to defined structures. Both pieces of work are already paving the way for the next release, further solidifying Croissant as the premier format for high-quality, reproducible machine learning datasets.
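As a rough illustration of what a composite foreign key means in practice, the sketch below checks that every combination of fields in one table has a matching row in another. It is plain Python and does not use the mlcroissant API; the table names, field names, sample rows, and the missing_composite_keys helper are all hypothetical.

```python
from typing import Iterable, List, Mapping, Sequence, Tuple

Row = Mapping[str, object]


def missing_composite_keys(
    child_rows: Iterable[Row],
    parent_rows: Iterable[Row],
    child_fields: Sequence[str],
    parent_fields: Sequence[str],
) -> List[Tuple[object, ...]]:
    """Return the composite keys found in child_rows that have no match in parent_rows.

    child_fields and parent_fields are compared position by position,
    so the field names may differ between the two tables.
    """
    parent_keys = {tuple(row[f] for f in parent_fields) for row in parent_rows}
    missing = []
    for row in child_rows:
        key = tuple(row[f] for f in child_fields)
        if key not in parent_keys:
            missing.append(key)
    return missing


# Hypothetical data: a station is identified by country AND station number together.
stations = [
    {"country": "AU", "station_id": 1},
    {"country": "CA", "station_id": 1},
]
readings = [
    {"country": "AU", "station": 1, "pm25": 7.5},
    {"country": "CA", "station": 2, "pm25": 4.0},
]

orphans = missing_composite_keys(
    readings, stations, ("country", "station"), ("country", "station_id")
)
print(orphans)
```

In this sample the second reading fails the check: "CA" is a known country, but no station is registered under the pair ("CA", 2). A check on the country field alone would miss this, which is why the fields have to be linked together as one key.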
You can read more about Croissant here or visit its GitHub repository.