Welcome to the Future of Science podcast. This episode is about as meta as it gets. We'll cover FAIR data, metadata, frustration about metadata, and metadata about metadata. We'll explain how FAIR data, which means findable, accessible, interoperable, and reusable data, enables machine actionability. We'll explore which sci-fi future is possible if the internet were composed of FAIR digital objects, and why we need data stewards to get there. We'll also see how the principles got their cute little acronym name and how that likely helped with the quick adoption, even in high-level bureaucratic institutions like the G20 and World Economic Forum. We'll learn about the incredible amount of money going to waste in the public and private sectors due to not having implemented FAIR data principles yet. Oh, and we'll also talk about how the banking industry tried to buy the internet. Yep, you heard that right. So, let's jump right in.
In my opinion, the FAIR principles are guidelines for machine actionability. They provide a range of behaviors that are necessary for automated findability, accessibility, interoperability, and reusability. They are also known as the FAIR guiding principles, and I believe that, six years later, they have proven to be an excellent guide. They really help you determine where you stand in terms of machine actionability and identify the gaps that need to be filled to take you further. So, they do an excellent job as a guide. The question is, why do we need them?
There is so much data now that it exceeds human comprehension. It is beyond the ability of a human to manage it all, and as a consequence, a lot of really important data that is created can simply be lost in plain sight. Even if you have access to some of this data and can make it interoperable with other data to derive some important results, it just doesn't scale anymore. You can’t go in and process that data.
The story behind the acronym began at a workshop held in January 2014 at the Lorentz Center, which focused on managing complex data. The vision was to create an infrastructure that supports data, using the metaphor of an airport. The workshop was called "Jointly Designing a Data Fair Port", mimicking the experience of a passenger at an airport, where many services and vendors are available to support the journey from point A to point B and make it more comfortable.
But the idea was to do something similar for data. Could we work out a system whereby vendors could come in and offer services, like the restaurants and shops in a food court? There would be a place where customers could access services that would make their data journeys easier. At that point, FAIRport was just a play on words for airport; the acronym did not yet exist. After the workshop, the attendees began thinking about creating a set of principles to make data more machine-actionable, so they started drafting them. We now know them as the 15 one-liners of the FAIR principles. The story I've heard is that at one point, Barend Mons, the initiator of the workshop, took these 15 principles and simply rearranged them in the order of findable, accessible, interoperable, and reusable. The acronym was born: the 15 one-liners each found an appropriate place under F, A, I, and R, and this magical acronym was created.
The power of the acronym is that it sounds very egalitarian. Who doesn't want to be fair? Yet, at the same time, the FAIR principles really have nothing to do with any kind of moralistic approach. They are technical principles that need to be in place before the machine can actually do FAIR on its own.
Machine actionability is really the idea of efficiency. If things become too inefficient, then activities become impossible altogether. Not just slow, impossible. And at the moment we're facing diminishing returns with the volume and complexity of data. Data is viewed to be absolutely essential from a theoretical point of view, but really from a practical point of view, people need to save money. They need to process information if they want to advance their own work. And that's just not going to happen if the machine cannot assist the human in the right functions.
But making this data machine-actionable requires additional labor on the front end for researchers. So it's a question of how we can motivate researchers or data producers to put in the effort required to make data FAIR. There are several answers here, some for the short term and some for the long term. In the short term, you can appeal to the moralistic or egalitarian view: if you make this effort, your data will probably be of higher quality, will be reused more often when appropriate, and you might even increase reproducibility. These are all great ideas, and we all love to think about better science, but we are all aware of the demands on our time, especially researchers' time. It's a great view of the future, but not necessarily something that we can easily implement day to day. Then there's another set of possibilities: more and more funding agencies are requiring FAIR research. There's the carrot-and-stick approach, and they are bringing a big stick with regard to FAIR. This will get the attention of researchers, especially academic researchers who receive publicly funded research grants; they are starting to feel the heat. But funders should also realize that they need to mandate and put a budget behind that mandate. I think that's an encouraging development. There are a number of cases where funding agencies are realizing and becoming more understanding of the role they play in the FAIRification of data. There's a very serious commitment that funders, and by the way publishers, will need to make to create an infrastructure for this.
The surprising thing is that FAIR was taken up immediately after the FAIR principles were published in 2016. You saw commitments to FAIR everywhere: online, on web pages, and in white papers. Then there was a period of languishing, of trying to come to terms with what this really means. As for what the commitment really amounts to, there are very concrete things we can point to. For example, the Dutch funding organization ZonMw, during the COVID pandemic, made a commitment in their national COVID program that research products and research assets should be made FAIR. That was a big statement. Then assistance was provided through the GoFAIR Foundation and also Health-RI in actually supporting about 130 research projects in that program to make their data FAIR. There was a series of Metadata for Machines workshops and the deployment of a portal environment that could display the metadata that was created. So ZonMw really put their money where their mouth was: they funded the FAIRification, and we engaged the researchers and the data stewards in these 130 projects. It was a pretty huge effort. And there are a number of programs coming up where this process will be refined and extended, in dementia research and cystic fibrosis.
At the same time, we now also have pretty in-depth discussions with the National Institutes of Health (NIH) in the United States. I think we're getting close to deploying a series of Metadata for Machines workshops and FAIR Implementation Profile (FIP) workshops for some projects and institutes at the NIH.
The NIH is a particularly interesting target organization for FAIR, because it is a very large organization composed of 23 research institutes. They're all on one campus, all part of one big enterprise, and yet data interoperability there is like anywhere else: practically nonexistent.
So for NIH level FAIRification, we imagine that you identify cells or genes or proteins in one institute and those identifiers and those vocabularies could probably just as well be used by the other 22 institutes.
There is a report out there, sponsored by the European Commission, where they conducted a cost-benefit analysis of having FAIR data versus not having FAIR data. They estimated that, for the European Union alone, over 10 billion euros a year are wasted due to not having FAIR data.
And this report does not even take into account the effects of slowing down research or having less replicable research or innovation as a result of that. So, all of these more intangible indirect effects are not even considered. It really seems to be a significant issue where there is at least a theoretical cost of not having FAIR data. However, there is no one who is directly paying for these costs. So, it is a public goods problem in a way. There is an obvious cost to not having these public goods, but who will actually provide them?
In 2016, the FAIR principles were published in their modern form. Almost immediately, the European Open Science Cloud High-Level Expert Group published their first report and FAIR was all over it. It was recognized early on that FAIR should be fundamental even to the European Open Science Cloud. It's also important to say that in our work from 2016 on, it was really the private industry that had the most compelling use case or commitment to FAIR data. We had worked with some larger pharmaceutical companies who, even very early on in 2017, would start discussions saying that they didn't need to see the justification for FAIR because they understood its importance.
They were losing approximately 2.3 billion a year in their own internal data management. Their estimate was staggering. So, I think that the 10 billion cited by PwC is extremely conservative. It was in the industry alone where, if we could help FAIRify their data internally, it would already bring them huge savings. So, the answer that you may be prompting here, Philip, is properly seen, and I'm going to quote my colleague Rob Holt in this, "FAIR is not a cost. It's an investment." If we can win back some of the inefficiency that is now endemic, then it should be the case that whatever we invest in our FAIRification should pay back dividends going forward.
In order to achieve FAIRness, there are certain things that data producers need to do. It requires a commitment that becomes part of your daily routine. However, FAIRness cannot be achieved unless there is an infrastructure in place to support it. Researchers cannot be expected to develop new identification or resolution services, access protocols, or authorization protocols. Instead, researchers need to develop good metadata descriptions of what they are doing, and that metadata needs to be made machine-actionable. Once the data has been "FAIRified" and made machine-actionable, it needs to be orchestrated in some way, which is an infrastructure problem. In the past, we have built infrastructures such as electricity grids and highway networks as public goods. Similarly, we need a data infrastructure, and the FAIR principles are guiding the innovation process towards the point where there will be approaches mature enough to warrant large investments and implementation. This will likely be done under appropriate governance, open and freely available, similar to how the internet came into being through a series of partnerships among industry, government, and individuals.
And similarly to the internet: I get the impression that there was absolutely nothing inevitable about the internet we have now. The form the internet takes today wasn't written in destiny. The internet we have now could, in fact, be much better than it is. There were ways to make the internet much more powerful and trustworthy.
There were also ways that the internet could have been stalled in a much more rudimentary state. Bob Kahn, the co-inventor of TCP/IP, once said that there was a moment when the banking industry wanted to buy TCP/IP or somehow own it. If that had happened, the internet today would be just ATMs. That would have been the extent of the vision. Taking that lesson to now, we think there's an emerging data network that's trying to be born. There is a moment now where small fluctuations can really matter a lot. We could end up with a fantastic data infrastructure, or we could end up with a mediocre one, or we could end up with none at all. There's nothing inevitable about it.
We cannot underestimate the value of acronyms. If the acronym had turned out to be something like SCAM, no one would follow the SCAM principles. But if it's FAIR, then people can get behind that.
Even with this acronym, however, there is a distinction between the researchers and data producers who feel the extra burden of FAIR and the high-level organizations who bought into it immediately.
This is a bit of a contradiction, because these large, authoritative bodies tend to move very slowly on purpose, since their actions have a significant impact. But at the same time, it is these very high-level groups that are most aware of the waste and inefficiencies in data production. I can imagine a high-level bureaucrat who's feeling the heat because the 10 billion they invested in some program produced a lot of data that nobody can find anymore. Nobody wants to be responsible for taking that risk. FAIR was attractive very early on to these high-level groups because it spoke to the mitigation of risk, and that's what rings the bell at those higher levels. Large bureaucracies are generally directed by their risk assessments, so they're always trying to mitigate risk; they're not trying to innovate or push boundaries at the front.
What this meant was that the same year the principles were published, the G20 had FAIR written into their communiqué. The G7 did the same when they met in Italy the following year. It's been mentioned at the World Economic Forum in different ways, and you see it at funding organizations, the EOSC, and the European Commission. There was an immediate attraction among very high-level organizations.
But that's at the high level. Then, the people who have to implement it, at the lower levels in the lab or in the research departments, see it more like a burden. They need help and assistance with it.
FAIR was born in Europe and for whatever reason, very strong commitments were made early on in Europe. There was fortuitous timing with the announcement of the European Open Science Cloud. And this meant that early on there was a lot of money to follow on EOSC, which allowed the signal to be amplified here in Europe. While there is an awareness of FAIR in the US, as well as in Canada and Brazil, it just hasn't received the same level of exacting implementation pressures that we've had here in Europe.
My point of view is that of an American. However, I think that in the United States, there is often a belief that the US is the leader and innovator, and that new things happen here. Therefore, if a good idea comes from Europe, it may take longer to gain traction in the US. FAIR has been recognized in the US from the beginning, but it has gone through some hype cycles. At times, people would say that it was only hype or that they had been doing it for 20 years already. As a result, people tended to find reasons not to take FAIR seriously or to go very deep with it. However, I am happy to say that this is now changing. Even in the last year, we have seen a renewed and much more serious and committed interest in FAIR to the point where we are now looking to implement it. The way history tends to work is that the US is often slow to get going, but when it does, it can be very serious.
The nice thing about the FAIR principles, again, as guiding principles, is that you can use them to prioritize your approach. You can assess where you are in your data practices and how well they adhere to the FAIR principles. Then, you can easily spot low-hanging fruit. There are simple things you can do to boost your FAIRness, and you can create a roadmap for things that might be at a project level or that you would have to invest in later. One thing about the FAIR principles that is quite useful is that they elevate the idea of metadata to be a first-class citizen with the data itself.
The way science has developed over the past 200 years, metadata has been considered the responsibility of the publisher. Science or Nature records the text, tables, and figures, and provides citation information so people can access and cite the work. For a long time, people thought that was the extent of metadata: it wasn't my job; it was the publisher's job. I thought about it for two minutes when I submitted my paper for publication. But if data is going to exist digitally on a network, it needs to be more findable and accessible. If I want to reuse it, I need to know more about where it came from, the processes that produced it, the parameters that were in place, and dates and times. Metadata about licensing and about who is allowed to use data under what conditions was only vaguely or non-rigorously kept by research scientists. FAIR is asking us to reevaluate that. The metadata should persist even if the data is no longer available. The FAIR principles ask that metadata become a permanent record, a publication in and of itself. Datasets can be slated for deletion or non-maintenance, or go missing for accidental or purposeful reasons, and the metadata should be there to say, "Hey, the data is missing."
In addition, what FAIR has taught me over the years is that the line between metadata and data is actually quite blurry, if not non-existent. Even to the point where I think, "Oh, if I want to FAIRify an Excel table, the scientist probably creates columns of numbers and then puts a column header on top." Those column headers, which most people would say are part of the data set, are actually telling you something about those numbers.
And a critical point is that when we talk about FAIR metadata, it's not only important to have metadata, but it's also important to have metadata that can be understood by machines. This is achieved through the use of controlled vocabularies or terminologies embedded in schemas, which allow machines to interpret the properties of the metadata in a way that makes sense to humans. Instead of arbitrary column headers, there are headers that mean something for machines and humans alike. Ideally, this establishes a coherent relationship between machines and humans, where machines can handle high-volume routine tasks and humans can focus on more creative and theoretical work. However, this is only possible if semantic information is embedded in the data and metadata that describe those datasets.
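To make the contrast concrete, here is a minimal sketch in Python (the dataset and the choice of schema.org as the vocabulary are illustrative assumptions, not from the conversation): the same record described twice, once with arbitrary column headers and once as JSON-LD-style metadata whose `@context` maps each header to a term from a controlled vocabulary, so a machine can resolve what each field means.

```python
# The kind of table a scientist typically produces: headers only the
# original author understands.
opaque_row = {"col1": "Mus musculus", "col2": 18.5}

# The same record with a JSON-LD-style @context (schema.org terms chosen
# here purely as an example of a controlled vocabulary).
fair_row = {
    "@context": {
        "species": "https://schema.org/name",
        "weight_grams": "https://schema.org/weight",
    },
    "species": "Mus musculus",
    "weight_grams": 18.5,
}

# A machine can now look up what each header means via the @context,
# instead of guessing what "col1" and "col2" stand for.
semantic_headers = {k: fair_row["@context"][k]
                    for k in fair_row if k != "@context"}
```

The point is not the particular vocabulary, it is that every header resolves to a shared, machine-resolvable identifier rather than a private label.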
I think of the FAIR principles as an interface. These are guidelines for building an interface between the human brain and the cyber-computational world. It's really the bridge between the human world and the machine world. And this has a very serious consequence for implementation.
Infrastructures will always grow, and there will always be some innovation at the infrastructure level. But FAIR will always require the active participation of researchers. As long as we have computers and people, we will need human beings to codify their knowledge and practice at the metadata level in their datasets and descriptions so that the machine can better assist them.
Now, currently, codifying this human knowledge in a machine representation is a bit tedious, to say the least. It involves clarifying terminologies, building ontologies with technical specifications, making sure those are available, and coming to terms with the metadata schemas, and the different forms of metadata schemas, that we would like to interoperate. So there's a lot of work that needs to go into scaling, automating, or semi-automating this process to make it easier for the ordinary practicing botanist to begin to record FAIR metadata without a lot of hassle.
For example, artificial intelligence could help humans generate the metadata they need to make their scientific data machine-actionable. In a lot of machine learning, you might set some kind of objective function to optimize the machine's performance for a specific task or domain. Machines could be asked to create metadata optimized for interoperability with a specific domain, for example, which would otherwise require different teams of people to implement. It's possible that machines could perform these sophisticated tasks more efficiently.
There are different terminologies related to data management, such as data management plan, data manager, data stewardship plan, and data steward. If there is a distinction between these terms, data stewardship takes a broader view of the long-term viability of the data.
A data steward is a person with technical skills in data representation, data science, or data analysis, but their focus is on the data itself, not on IT. They need to have knowledge of the domain-level content and practices. The ideal data steward is someone who is trained as a researcher in a specific domain, such as chemistry or biology, and then gains the technical skills to help develop experiments, methods, and decisions about how to capture data and render it machine-actionable. Data modeling and semantics require someone who is involved with the content of the subject itself, either from a methodological or a subject-area point of view. Data stewards need to be good scientists and understand how to set up an observation or experiment to yield meaningful results. They also need to know the domain content, such as genes, proteins, and diseases. Data stewards should not be viewed as mere technicians in the lab, but as professionals in a role equivalent to the principal investigator. Research teams need both a data steward and a principal investigator working together to create research assets that are maximally FAIR.
The professionalization of this field has already begun. I am seeing more and more job offers for data stewards coming across email or LinkedIn, and you can see what people are asking for in these roles. I think people are being hired at higher pay grades as well, indicating an increase in demand. Research departments are starting to carve out permanent positions for data stewards in their budgets and institutions. All indications point to this emerging as a real professional occupation. Recently, GoFAIR partnered with IOS Press to launch FAIR Connect, which we think of as a journal for practicing data stewards. It's an attempt to give data stewards a professional platform to communicate and to make their work more citable and FAIRified. They can build up portfolios of their data stewardship work over time and create a professional profile of their accomplishments.
A FAIR Implementation Profile (FIP) is actually a simple idea.
When you look at these 15 one-liners, the FAIR principles (F1, F2, F3, etc.), you'll notice that these principles request a certain computational behavior that would enable FAIR. To determine the level of FAIRness achieved, we can ask a specific question: What technology did you use to implement F1, F2, F3? You should be able to answer this question rigorously.
The FAIR implementation profile is simply a list of technologies. It is about documenting the technological choices you have made.
In addition, we have discovered that a community is fundamental to FAIR. You can build a list of technologies that might implement FAIR, but who decides on that list, and on whose behalf does it apply? A complete FAIR Implementation Profile includes a FAIR implementation community and a list of what we call FAIR Enabling Resources (FERs) that support each of the FAIR principles. That is what it is abstractly.
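The structure just described can be sketched as plain data. This is a hypothetical illustration in Python (the community names and specific resource choices are assumptions for the example, not an official FIP format): each FIP pairs a community with its chosen FAIR Enabling Resource per principle, and two profiles can be compared mechanically to surface divergent choices.

```python
# Two hypothetical FAIR Implementation Profiles: a community plus, for
# each FAIR principle it addresses, the chosen FAIR Enabling Resource.
fip_a = {
    "community": "Example Dementia Research Community",
    "declarations": {
        "F1": "DataCite DOI",                  # persistent, globally unique IDs
        "A1": "HTTPS with content negotiation",
        "I1": "RDF serialized as Turtle",
        "R1.1": "CC BY 4.0 license",
    },
}
fip_b = {
    "community": "Example Cystic Fibrosis Community",
    "declarations": {
        "F1": "DataCite DOI",
        "A1": "HTTPS with content negotiation",
        "I1": "JSON-LD",
        "R1.1": "CC BY 4.0 license",
    },
}

def diverging_principles(a, b):
    """Principles where two communities chose different enabling
    resources, i.e. potential interoperability gaps between them."""
    keys = set(a["declarations"]) | set(b["declarations"])
    return sorted(k for k in keys
                  if a["declarations"].get(k) != b["declarations"].get(k))
```

Comparing these two profiles flags only principle I1 (Turtle versus JSON-LD), which is exactly the kind of overlap-and-gap analysis described here.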
More concretely, the idea is that if a community, big or small, temporary or permanent, comes together and creates a list of FAIR implementation choices, they can make a data management plan or a data stewardship plan that anyone can follow, with the technology choices built in. This way, even a data steward who has never heard of the FIP can implement it in a consistent manner. The choices can be disseminated to the community, and a FIP from one community can be compared to that of another. This way, you can see how different communities have chosen to implement FAIR and spot overlaps, gaps, or inconsistencies that might hinder interoperability or interaction.
And then, that's a starting point for revising your FIP. Maybe what you've decided upon initially could be optimized later on. The FIP, if done widely for a large number of communities, would really paint a technology landscape of FAIR. This is how people are currently implementing FAIR, and it would give you a starting point for an informed, logical step towards convergence on a more robust implementation strategy.
We think of the FIP as a piece of metadata, and we've gone to some effort to build a tool that allows people to build a FAIR implementation profile that is itself machine-actionable.
To give you a concrete example, again from the work we're doing with ZonMw, a funding agency here in the Netherlands:
The idea is to launch a new program in dementia research. Before that program is even announced, there could be an attempt to build a FAIR implementation profile by bringing in the right people. This would include technology experts to consider the real technology components, and domain experts in dementia to figure out the vocabulary they want to use and the appropriate metadata schema for the kinds of things they want to record.
By going through this exercise, you end up with a FAIR Implementation Profile that can be announced when you say, "Hey, here's a new call on dementia research." People would apply and understand that if they receive a grant from this program, they will have to follow a DMP template or a data stewardship plan template and produce FAIR research outputs at the end. By following this template, which is informed by the FIP, they can borrow the agreed-upon vocabularies, metadata schemas, and FAIR Data Points. This way, their work will migrate directly into FAIR without a lot of nervous energy spent making these decisions themselves. And everybody in that research program, say 100 projects, would be implementing FAIR in a consistent way, according to what the experts had decided early on. Most researchers don't have to worry about the FIP unless they want to. They might leave that burden to someone else, a trusted authority, who will then say, "Okay, in your data creation efforts, you will be using these metadata forms and these vocabularies."
On a scale of 1 to 100, I would rate our progress at 30. In the past few weeks, since the FAIR focus week we had in Leiden and the EOSC symposium in Prague, I have seen a lot of brilliant progress being made. Although there is still a need for coordination, or some kind of convergence, to happen, I think we have identified the critical technology gaps. There is clarity on what needs to be done, even though some of these projects might still be quite substantial. We still have a long way to go, but I am optimistic that we can close the remaining gap to 100% exponentially.
However, we need to avoid using too many "ifs" and focus on making progress. We could wander around for the next 100 years, trying to achieve perfection but never really getting there. However, I believe that in the next decade, we could get very close to that asymptote.
George Strawn gave a keynote speech at the FAIR focus week. He was the person who handed off what was then called NSFNET to AT&T and Verizon to become the internet. And he articulated the vision for FAIR: the idea that there will be a moment when it looks like there's just one computer out there running one giant dataset. Users won't have to move between different platforms, datasets, and vocabularies, finding mappings between them to do the data munging; everything will be automated. Just as Google presents the entire internet as if it were a single directory, FAIR would be the next step, where even operations are automated to such a degree that what you see as a user is essentially one computer system interoperating with one dataset. You can search that dataset as you see fit.
I use a Mac, and I would love the moment when I can have a directory of files of any file type. I would also love to be able to add bookmarks from the internet into that directory and treat web pages the same way as files and directories. For example, I would like to pull services into my directory and make it easy to organize things so I can access them quickly and operate with them. However, on my Mac, the directory is designed only for files, not for other kinds of digital objects. If we had FAIR digital objects, I could manipulate those objects as if they were real objects and use them the way I want to use them.
Erik: And then there are more sophisticated visions of this. Earlier this week, we talked about data visiting. The idea is that instead of saying, "Hey, I would like to reanalyze some data. Could you copy it and send it to me, please?" we would use data visiting. With data visiting, I would send a query to a data station, the query would happen over there, and then I would get the result. What's nice about this approach is that even very sensitive data can be analyzed under the appropriate conditions without having to be copied and sent all over the place where you never know what's going to happen with the data. There are other advantages to this approach as well, and we can imagine using the FAIR principles behind it. There are lots of nice innovations going on at different levels around data visiting.
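The data-visiting idea can be sketched in a few lines of Python. This is a toy illustration, not any real data-station protocol, and all names (`DataStation`, the record fields) are hypothetical: the sensitive records never leave the station; the visitor submits a query, the query runs where the data lives, and only the aggregate result travels back.

```python
# Toy sketch of data visiting: the records stay local to the station,
# and only query results are returned to the visitor.
class DataStation:
    def __init__(self, records):
        self._records = records   # never copied out of the station

    def visit(self, query):
        """Run the visitor's query locally and return only its result."""
        return query(self._records)

# Hypothetical sensitive records held at the station.
station = DataStation([
    {"age": 70, "diagnosis": "dementia"},
    {"age": 64, "diagnosis": "none"},
])

# The visitor learns the mean age, not the individual records.
mean_age = station.visit(lambda rows: sum(r["age"] for r in rows) / len(rows))
```

A real deployment would add authentication, query auditing, and output checks, but the inversion is the same: the query moves, the data does not.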
By training, I am an evolutionary biologist, and I am interested in some theoretical issues that I find very intriguing. I have worked with many systems to test theories using molecules in the laboratory, such as RNA and protein molecules. After receiving a PhD in bioinformatics with an evolutionary approach, I did a postdoc at MIT, where I learned how to do biochemistry in the lab.
Over time, my interest became more data-intensive, and it wasn't just about experimenting with a particular RNA sequence. I began to imagine all possible RNA sequences and how to approach them. We can synthesize a lot in the laboratory using certain techniques, and I can keep track of it on a computer. However, I realized that there are frustrating blocks and obstacles to doing data science.
We were working with some ideas around nanopublications early on, and then we had the meeting on FAIR in 2014. Two years later, there was the paper on the FAIR principles. In 2016, a project funded by Health Holland called FAIRdict was launched, and we tried to build some reference implementations for FAIR. There was a broad spectrum of community engagement, and we worked to jumpstart this.
By 2018, the GoFAIR initiative was launched by the German, French, and Dutch ministries. GoFAIR was meant to be a temporary kickstart for FAIR implementation, primarily for the purposes of the European Open Science Cloud, but it had a global view. I would have to say that GoFAIR has been the best job I've ever had. Working with the community to develop things like FAIR-in-the-wild or the Metadata for Machines workshop approach has been incredibly productive. We get a lot of traction, and there's a lot of interest. In a way, GoFAIR has been like a five-year sabbatical outside the research domain, just in this infrastructure domain. As it turns out, I will be re-entering research in the Leiden academic community in the next couple of months and will bring some of the bag of tricks that we have from FAIR to try to do some more of this molecular evolutionary biology.
First question: geology or biology?
One and the same. They're one and the same. Different buildings, different people. Biological systems are isolated entities only in an artificial sense. If you see an organism or a cell, it's only an entity because it's easy for us to study it that way. All life forms are deeply interconnected with the geology of a planet, the atmosphere, and the oceans. This is where FAIR can really shine because we need to interoperate between biological, geological, atmospheric, and oceanic domains for a good handle on current environmental predicaments.
What about… mobile or desktop?
Desktop. I think that there could be an age-related factor at play. I prefer to do my work at my desk, but when I stand up and walk away, I don’t want to do work anymore.
And do you ask permission or do you beg for forgiveness?
I'm constantly begging for forgiveness. There are so many opportunities, you know, that I call "low-hanging fruit" - things that can be acted upon without requiring a lot of funding or a mandate. I probably give in to that temptation too often, and then we have to do damage control afterwards.
What about feeling or thinking?
Erik: Both. In the theme of geology and biology, I have a tendency to think that these dichotomies are somewhat artificial. They serve a certain purpose at a certain time, but we shouldn't really separate them at a conceptual level.
Please fill in the blank. Science is...
Humanity's sixth sense.
You know, if you run a scientific project, you formulate hypotheses, make observations, control experiments, collect data, cancel out noise, and account for problems. You might end up with a conclusion, which to me is just another observation - a way of seeing the world that goes beyond what our eyes or ears can do. Science is like an additional sense organ that allows us to see patterns in the world. It's a collective effort, and we must work together to make it happen.
If you could change just one thing in the universe, what would it be?
I would change all the zeros to ones and all the ones to zeros, creating a mirror image. I have had conversations with people before about this, and it may seem philosophical or even mystical, but somehow everything fits together in a way. There are many horrible things we can see in the world and things we would like to change, but somehow it all fits as one big picture. If I had to change anything, it would be just coloring the bits different colors or something.
However, as a mission in FAIR, what I could say is that it would be nice to see data stewards and the data stewardship role being elevated, as it is something that can happen quite effectively across the world and even in developing countries. I think there is a tremendous opportunity to create a real profession of data stewards and see people with wonderful opportunities in that field.