
Federating data: How APPN gets it together
Providing seamless, standardised access to phenomics data across all the APPN Nodes has meant developing new strategies for data management and storage. Innovative use of a persistent research organisation identifier (ROR) supports catalogue and search functionality, but consistency from units of measurement to metadata terms is the real key.
Shortly before attending the eighth International Plant Phenotyping Symposium (IPPS8) in Lincoln, Nebraska, APPN Data Architect Dr Rakesh David was invited to address students and faculty at the University of Kentucky (UK) Department of Horticulture on federating data in a collaborative ecosystem.
Federating data involves addressing the challenge of managing and organising data across multiple systems and sites. It enables APPN’s Nodes to connect and collaborate without having to move all data into a single shared location.
Finding ways to federate data that has been collected across multiple locations, host organisations, projects and partnerships is critical to the success of our multi-nodal network. With APPN establishing one of the first truly national plant phenotyping infrastructure networks in the world, the University of Kentucky was keen to learn how we are tackling the issue.
Consistent metadata standards are the key
Federating the data collected at our Nodes while applying consistent protocols and the FAIR data principles is central to APPN’s data management vision. In fact, consistency is one of the key ways we will deliver additional value that would not be possible without a network.
Working against this are the sheer volume of data that our high-throughput, high-resolution digital phenotyping technologies generate, the differences between the data repository services used by each of our Node hosts, and the need to respect the intellectual property rights associated with the research projects we support.
As Dr David explained at UK, maintaining consistency in the fine detail underpins the accessibility and interoperability of data at scale.
“APPN is working hard to agree on standardised terminologies to document units of measurement and so on from the outset,” he says, “so that different datasets can be used together easily and accurately. Standardised terminology also ensures that data assets aren’t missed in later searches.”
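To illustrate why agreed unit terminology matters when datasets are combined, here is a minimal Python sketch. It is not APPN's implementation: the vocabulary table, the field names (`plant`, `height`, `unit`) and the sample measurements are all hypothetical, chosen only to show how mapping each Node's unit labels onto one agreed term lets values from different sources sit side by side.

```python
# Hypothetical controlled vocabulary (an assumption for illustration only):
# each unit label a Node might record maps to one agreed term ("mm") and
# the factor that converts a value into that unit.
CANONICAL_UNITS = {
    "mm": ("mm", 1.0),
    "millimetre": ("mm", 1.0),
    "millimetres": ("mm", 1.0),
    "cm": ("mm", 10.0),
    "centimetre": ("mm", 10.0),
    "centimetres": ("mm", 10.0),
    "m": ("mm", 1000.0),
}


def standardise(value, unit):
    """Return (value, unit) expressed in the agreed canonical unit."""
    key = unit.strip().lower()
    if key not in CANONICAL_UNITS:
        # An unknown label is reported rather than guessed at.
        raise ValueError(f"Unrecognised unit term: {unit!r}")
    canonical, factor = CANONICAL_UNITS[key]
    return value * factor, canonical


# Made-up plant height records from two Nodes using different unit labels.
node_a = [{"plant": "A1", "height": 42.0, "unit": "cm"}]
node_b = [{"plant": "B1", "height": 385.0, "unit": "Millimetres"}]

# After standardisation the two datasets can be compared directly.
combined = [
    {"plant": r["plant"], "height_mm": standardise(r["height"], r["unit"])[0]}
    for r in node_a + node_b
]
```

Raising an error for an unrecognised label, rather than passing the value through, reflects the point made above: terms that fall outside the agreed vocabulary are exactly the ones that cause data to be misread or missed later.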
Even with standardisation of terminology at each Node, APPN needs a way to federate data across our hosts’ scattered data repositories.
Using persistent identifiers to power dataset searches
To solve this challenge, the APPN data team is adopting a simple approach that is easy to implement in the metadata offered by each repository. It makes use of persistent identifiers (PIDs).
APPN has registered a persistent organisation identifier with the global Research Organization Registry (ROR) to represent our whole network: https://ror.org/02zj7b759. Wherever this identifier is used, it serves as an unambiguous reference to APPN. Embedding it in the metadata for a dataset signifies that the dataset is associated with APPN’s operations, and the set of all datasets with this identifier forms the national data collection generated using APPN’s facilities.
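The idea that the collection is simply "every dataset whose metadata carries APPN's ROR" can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the schema of any real repository: the `related_organisations` field, the record layout and the sample titles are hypothetical. Only the ROR identifier itself comes from the text above.

```python
# APPN's registered ROR identifier, as given in the article.
APPN_ROR = "https://ror.org/02zj7b759"


def normalise_ror(identifier):
    """Reduce a ROR reference to its bare ID so URL variants compare equal."""
    return identifier.strip().lower().rstrip("/").rsplit("/", 1)[-1]


def is_appn_dataset(record):
    """True if the record's metadata references APPN's ROR.

    'related_organisations' is a hypothetical field name; real repositories
    will each expose organisation identifiers in their own metadata schema.
    """
    target = normalise_ror(APPN_ROR)
    return any(
        normalise_ror(pid) == target
        for pid in record.get("related_organisations", [])
    )


# Made-up metadata records, as if harvested from three separate repositories.
records = [
    {"title": "Wheat canopy imaging trial",
     "related_organisations": ["https://ror.org/02zj7b759"]},
    {"title": "Unrelated soil survey",
     "related_organisations": []},
    {"title": "Barley drought response scans",
     "related_organisations": ["ror.org/02zj7b759/"]},
]

# The "collection" is whatever carries the identifier, wherever it is stored.
appn_collection = [r["title"] for r in records if is_appn_dataset(r)]
```

Because membership depends only on the identifier, no dataset has to be moved or re-registered to join the collection; a repository only needs to include the ROR in the metadata it already publishes.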
To catalogue this collection and include new datasets as the Nodes produce them, APPN will leverage investments by another NCRIS partner, the Australian Research Data Commons (ARDC). ARDC maintains Research Data Australia (RDA), a general-purpose catalogue for collecting metadata on research datasets from most Australian universities and research infrastructures.
APPN and ARDC have co-invested in a new Federating APPN Data Collections project that will enhance RDA’s handling of persistent identifiers such as APPN’s ROR and make it possible to find all datasets that reference them. Using this service, APPN will offer a dedicated APPN theme page on RDA and have access to the full and evolving collection of APPN datasets to build its own rich discovery portal.
Breaking down data silos for open access
By combining consistent metadata with high-level indexing, APPN is breaking down the silos that can keep data assets within the institution where they were generated.
By applying consistent metadata and PIDs, APPN Nodes and researchers will ensure that phenotyping datasets can be easily found by others, regardless of where they are stored, and that APPN can interpret and combine measurements from different datasets to offer richer data services to its users.
These strategies will do more than simply link APPN’s decentralised infrastructure. They will also have a significant impact on our users’ ability to advance crop science efficiently – by making it easy for researchers to identify existing knowledge and avoid research repetition, and by enabling the ‘big picture’ knowledge and inferences of meta-analysis involving multiple projects.
“The audience at the University of Kentucky were very interested in how APPN has approached this traditionally difficult topic,” Dr David says.
“Federating data is essential for building more effective and efficient integrated research infrastructures, such as APPN.
“Globally, people see us as something of a pioneer and there is a lot of interest in how we bring the elements of our network together, with data federation being essential to our success – and the success of future projects seeking to build on our output.”
6 February 2025