Pushing for more open citations

tl;dr: let's get more citations into open databases

Citations are the foundation of scholarly communication, but until the I4OC (Initiative for Open Citations) came along, publishers made only 1% of them available in a machine-readable form. The initiative has already secured commitments to make citations for 54% of publications available! These are or will be made available through CrossRef and OpenCitations. Opening up citations this way is a publisher-side decision, and some publishers, big and small, have not committed for various reasons.

What about the other 46% of the citations in scholarly publications?

This project idea explores whether we can make those available independently, without direct commitments from the publishers (securing those is what I4OC is already doing). In turn, this may nudge non-committal publishers to commit to I4OC.

A list of scholarly publications can be readily retrieved with the CrossRef API and cross-checked against the OpenCitations API to identify whether each publication is among the 46% whose citations aren't available.
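
A minimal sketch of that cross-check, assuming the OpenCitations endpoint listed under Examples and that the JSON responses have already been fetched. The function names are my own, and treating both [] and [{}] as "no citations available" is an assumption based on the note under Tangents, not documented behaviour.

```python
from urllib.parse import quote

# Endpoint template as listed under Examples.
OPENCITATIONS_REFS = "https://opencitations.net/index/croci/api/v1/references/{doi}"


def references_url(doi: str) -> str:
    """Build the OpenCitations lookup URL for one DOI."""
    return OPENCITATIONS_REFS.format(doi=quote(doi, safe="/"))


def has_open_references(payload: list) -> bool:
    """True if the response holds at least one non-empty citation record.

    Assumption: both [] and [{}] mean that no citations are available.
    """
    return any(isinstance(item, dict) and item for item in payload)


def missing_dois(responses: dict) -> list:
    """Given {doi: parsed JSON response}, list the DOIs without open citations."""
    return [doi for doi, payload in responses.items()
            if not has_open_references(payload)]
```

The fetching itself is left out on purpose, so the classification can be tested offline against saved responses.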

Subsequently, for the 46% of papers without citations, we'd need to check whether the publication is accessible. This can also be done with the CrossRef API by inspecting the license field (e.g., CC-BY 3.0), but then we'd miss a lot, considering that metadata is often poorly deposited and that some articles are free to read without carrying a permissive license. The Unpaywall API or data dump could help out with that.
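
Roughly, the accessibility check could combine both sources as below. This is a sketch working on already-parsed JSON: it assumes the CrossRef work record carries a `license` list with `URL` entries and that the Unpaywall record exposes an `is_oa` boolean. The function names are hypothetical, and matching on "creativecommons.org" is a crude stand-in for a proper license check (it does not distinguish CC-BY from more restrictive CC variants).

```python
def crossref_license_urls(work: dict) -> list:
    """Return the license URLs from a CrossRef work record.

    Accepts either the full API response or just its "message" part.
    """
    message = work.get("message", work)
    return [entry["URL"] for entry in message.get("license", []) if "URL" in entry]


def has_cc_license(work: dict) -> bool:
    """Crude check: does any deposited license URL point to Creative Commons?"""
    return any("creativecommons.org" in url for url in crossref_license_urls(work))


def is_accessible(work: dict, unpaywall: dict) -> bool:
    """Accessible if CrossRef lists a CC license or Unpaywall flags it as OA."""
    return has_cc_license(work) or bool(unpaywall.get("is_oa"))
```

Falling back to Unpaywall catches the free-to-read articles whose license metadata was never deposited with CrossRef.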

For those publications that are accessible, the raw references could be extracted. This would require building robust ways to identify the reference sections across publishers, but I think that is doable with some testing. There will also be plenty of publications that are not available in HTML, which I'd skip for now (PDFs are something for another day...).
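
To give an idea of the HTML extraction, here is a sketch using only the standard library. It assumes, purely for illustration, that references are `<li>` items inside an element whose class attribute contains "reference"; real publisher pages will differ and would each need their own rule. It also does not handle list items nested inside other list items.

```python
from html.parser import HTMLParser


class ReferenceListParser(HTMLParser):
    """Collect the text of <li> items inside a container whose class mentions the marker."""

    def __init__(self, marker="reference"):
        super().__init__()
        self.marker = marker      # hypothetical class-name fragment to look for
        self.container = None     # tag name of the matched container, if any
        self.depth = 0            # nesting depth of same-named tags in the container
        self.in_item = False
        self.buffer = []
        self.references = []

    def handle_starttag(self, tag, attrs):
        if self.container is None:
            classes = dict(attrs).get("class") or ""
            if self.marker in classes.lower():
                self.container, self.depth = tag, 1
        elif tag == self.container:
            self.depth += 1
        if self.container is not None and tag == "li" and not self.in_item:
            self.in_item = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_item:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if self.container is None:
            return
        if tag == "li" and self.in_item:
            # Collapse whitespace so each reference becomes one clean line.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.references.append(text)
            self.in_item = False
        if tag == self.container:
            self.depth -= 1
            if self.depth == 0:
                self.container = None


def extract_references(html: str) -> list:
    """Return the raw reference strings found in an HTML page."""
    parser = ReferenceListParser()
    parser.feed(html)
    parser.close()
    return parser.references
```

A per-publisher table of markers (or a proper HTML library with CSS selectors) would probably replace the single `marker` argument in practice.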

Parsing and disambiguating these raw references is going to be the tricky part (see also this blog). First, the reference format differs per publisher webpage. Second, I already saw that direct links to DOIs are not always included in the page content but are loaded through hidden calls. Finally, the references would have to be reformatted to CrossRef metadata standards to ease ingestion into existing pipelines.
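
The easiest slice of this, pulling a DOI out of a raw reference and wrapping it in a CrossRef-style record, could look like the sketch below. The regular expression is a common heuristic rather than an official DOI grammar, and stripping trailing punctuation may truncate the rare DOI that really ends in such a character. The output keys (`key`, `unstructured`, `DOI`) mirror what I understand CrossRef reference objects to look like, which should be verified against the CrossRef schema; the `ref{n}` key scheme is made up.

```python
import re

# Heuristic DOI matcher: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_doi(raw: str):
    """Return the first DOI-like string in a raw reference, or None."""
    match = DOI_PATTERN.search(raw)
    if match is None:
        return None
    # Drop sentence punctuation that tends to trail a DOI, then normalise case.
    return match.group(0).rstrip(".,;)").lower()


def to_crossref_reference(raw: str, index: int) -> dict:
    """Wrap a raw reference string in a CrossRef-style reference record."""
    reference = {"key": f"ref{index}", "unstructured": raw}
    doi = find_doi(raw)
    if doi:
        reference["DOI"] = doi
    return reference
```

References without a detectable DOI keep only the unstructured text, so the disambiguation step can still be run on them later.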

This would require resources to operate. The service could simply provide the information separately from CrossRef/OpenCitations (which aren't the same?), but then pipelines using it would need additional adjustments. By creating a CrossRef API clone and adding the extra information, pipelines using that API could simply change the base URL. That might also be a stepping stone towards improving metadata coverage in other respects? (inspect some of the issues here)
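
From the perspective of an existing pipeline, the base URL swap would be about this small. The clone's host name in the usage note is a made-up placeholder, not a real service.

```python
CROSSREF_BASE = "https://api.crossref.org"


def works_url(doi: str, base_url: str = CROSSREF_BASE) -> str:
    """Build the /works/{doi} URL against CrossRef or a compatible clone."""
    return f"{base_url.rstrip('/')}/works/{doi}"
```

Calling `works_url(doi)` keeps hitting CrossRef, while `works_url(doi, base_url="https://librecite.example")` would point the same code at the hypothetical clone.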

Enough to explore and discuss? :)

Risks

  • There could be legal risks? I haven't searched for resources on this, but I seem to remember that some publishers claim citations fall within their copyright. That claim wouldn't hold for CC-licensed articles, but does it hold for those that are freely available but restrictively licensed? A reference list is not creative work after all, but more legal exploration would be needed.

Project name ideas

  • Guerilla citations
  • Extended Citations
  • librecite

Examples

  • DOI: 10.1016/j.jscs.2018.08.007
  • DOI: 10.1001/jama.2017.21168
  • OpenCitations API: https://opencitations.net/index/croci/api/v1/references/{doi}
  • CrossRef API: https://api.crossref.org/works/{doi}
  • Unpaywall API: http://api.unpaywall.org/{doi}?email={email}

Tangents

  • How about book data? Citations for books seem quite relevant, but they are even less available.
  • Note: OpenCitations REST API may return [] if nothing is deposited and [{}] if there is nothing to deposit? Haven't found a use case yet

This seemed interesting to use for extracting citations. I’ve asked about its performance, because that of course is important.