Towards a FAIR Research Identifier

If you look back at articles published by Elsevier or other publishers just 20 years ago, almost 100% of the URL references in them are affected by link rot or content drift (Klein et al., 2014). This means that either we can't access the content at all, or it has changed significantly from its original, cited version.

As a researcher, this is frustrating. But for the scientific record, it's devastating. Sources that scientists base their research on may be revised, wholly withdrawn, or arbitrarily lost.

Now scale this system up. We are increasingly publishing research with code, data, Twitter threads, talks, protocols, and all sorts of other associated information. How would the current system manage linking to the thousands or hundreds of thousands of data files related to future publications in fluid dynamics, astrophysics, or genetics research? We'd never be able to link to these files in a way that would guarantee accessibility.

This problem is not new; it's part of how the current version of the Internet is designed. Publishers have been discussing it for decades, which led to the DOI system. However, the DOI system does not solve the fundamental problem. It still relies on URLs, which only point to the location of a file, so moving the file still breaks the link. The DOI system primarily pays for the cost of creating and manually updating a lookup table that links DOIs with the most current URLs. This system is labor-intensive, bureaucratic, expensive, and doesn't work well. Even with all the maintenance, around 51.7% of DOI requests don't reach the correct target (Klein and Balakireva, 2020).

In our podcast episode with Juan Benet, he discussed the need for identifiers that don't need monitoring to remain active and that trace when content has changed. IPFS identifies and locates files by using content identifiers. A cryptographic hash is generated based on the content of the file. This hash identifies each file uniquely and enables multiple parties to store a copy of the same file using the same address - an essential requirement to protect content against data silos, promoting institutional sovereignty.
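To make the idea concrete, here is a minimal sketch of content addressing. It is an illustration under stated assumptions, not how IPFS is implemented: a plain SHA-256 hex digest stands in for a real content identifier (actual IPFS CIDs use a different encoding and split large files into chunks), and the `Store` class and `content_id` function are hypothetical names invented for this example.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Hypothetical stand-in for a CID: the SHA-256 hex digest of the bytes."""
    return hashlib.sha256(data).hexdigest()


class Store:
    """Toy content-addressed store: the key is derived from the content itself."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The address is computed from the bytes, not chosen by the host.
        cid = content_id(data)
        self._blobs[cid] = data
        return cid

    def get(self, cid: str) -> bytes:
        data = self._blobs[cid]
        # Anyone can re-hash what they received to check it matches the address.
        if content_id(data) != cid:
            raise ValueError("stored content does not match its identifier")
        return data


# Two independent parties store the same file and end up with the same address.
university = Store()
publisher = Store()
paper = b"Results: the effect size was 0.42."

addr_a = university.put(paper)
addr_b = publisher.put(paper)
print(addr_a == addr_b)
```

Because the address depends only on the bytes, it does not matter which of the two stores answers a request: a reader holding the identifier can fetch the file from either one and verify that it is the file that was cited.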

Even beyond being able to always find a research object, changing even the tiniest bit of the content (e.g., a comma, a pixel) yields an entirely different hash. This allows these content identifiers to act as a 'trace' on research updates. This means that when you access a research object that uses such a content identifier, you can see any changes that happened, such as added data, updated code dependencies, or edits in the research paper. This functionality flips our current research paradigm inside out: instead of having a single, largely inaccessible 'version of record,' we have a persistent and traceable record of every version of evolving research objects.
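The sensitivity of the hash can be shown in a few lines. This is again only a sketch: SHA-256 is assumed as the hash function, and `content_id`, `has_changed`, and the `history` list are hypothetical names for illustration, not part of IPFS or any specific protocol.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Hypothetical stand-in for a CID: the SHA-256 hex digest of the bytes."""
    return hashlib.sha256(data).hexdigest()


def has_changed(data: bytes, cited_id: str) -> bool:
    """True if the bytes in hand are not the bytes that were originally cited."""
    return content_id(data) != cited_id


# Two versions of a sentence that differ by a single character.
v1 = b"We observed a significant effect, p < 0.05."
v2 = b"We observed a significant effect; p < 0.05."  # comma became a semicolon

# An ordered list of identifiers is a simple record of every version.
history = [content_id(v1), content_id(v2)]

print(history[0])
print(history[1])
print(has_changed(v2, history[0]))
```

The two printed identifiers share no visible resemblance even though the texts differ by one character. A citation that records the first identifier therefore pins the exact version that was read, and any later edit shows up as a new entry in the history instead of silently replacing the old one.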

Research should adhere to the FAIR principles: Findable, Accessible, Interoperable, and Reusable. To achieve this, we need a way to access and utilize all the research that has been and will be conducted. DOIs lack persistence and the ability to track changes to the scientific record. However, content identifiers offer a promising solution. This solution is already being introduced to the scientific community by dPID.org. dPID stands for decentralized persistent identifier, and this protocol builds upon IPFS’s content identifiers to enhance the existing DOI system. With this new protocol, any research object, such as code or data files, can be identified and versioned, and the links will remain functional even if the location changes. We are excited to further explore and support technical innovations like the ones presented by Juan through IPFS, protocols like dPID, and their potential to contribute to a more FAIR science.