5.5 Content accessibility

5.5.1 Suggested training topics

  1. Content provisioning

    1. XML: JATS, ONIX, TEI

    2. PDF

    3. HTML: website features and SEO

    4. Citations in metadata and content data

    5. Research data

    6. Version control

  2. Content harvesting and indexing

    1. OAI-PMH protocol and infrastructure

    2. Introduction to indexing services and their requirements

    3. Website optimization for indexing

  3. Content depositing and export

    1. Repositories and journal hosting services

    2. Depositing protocols: SWORD

    3. Export format types: CSV (DC), JSON, XML (JATS, ONIX)

  4. Content long-term archiving

    1. Archiving services: CLOCKSS, Portico, PKP Preservation Network, PubMed Central, national / institutional

5.5.2 Notes on the training topics

This block covers the broad domain of content accessibility and is divided into four sections: content provisioning in human- and machine-readable formats, content harvesting and indexing, content depositing and export, and long-term archiving. The topics in this block partially mirror those defined for metadata (§5.2 “Metadata I”), but add features specific to the content side of the journals’ output.

The “Report on challenges and help measures faced by Open Access journals and platforms” (D3.2 (Laakso et al. 2024)) underlines the importance of making content available in machine-readable formats, which support text and data mining. The training topics on XML provisioning thus cover common standards such as JATS, ONIX and TEI. PDF is a thematic area of its own: alongside layout features (title, headers, footers, marginal notes and endnotes, margins, borders, etc.), it covers the optimization of PDF documents for online findability. Google Scholar, for example, provides the following recommendations for PDF articles on the web:

  1. The full text of the paper should be in a PDF file that ends with ".pdf".

  2. The title of the paper should appear in large font size on top of the first page.

  3. The authors of the paper should be listed right below the title on a separate line.

  4. There should be a bibliography section titled, e.g., "References" or "Bibliography" at the end.
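Two of the recommendations above lend themselves to a simple automated check. The following is a minimal sketch, not an official validator: the function name `check_pdf_conventions` is hypothetical, the example URL is a placeholder, and the text is assumed to have already been extracted from the PDF by some other tool.

```python
from urllib.parse import urlparse

# Headings accepted as a bibliography title; the two examples come from the list above.
REFERENCE_HEADINGS = {"references", "bibliography"}


def check_pdf_conventions(pdf_url: str, extracted_text: str) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    # Recommendation 1: the file name should end with ".pdf" (checked literally here).
    if not urlparse(pdf_url).path.endswith(".pdf"):
        problems.append("URL path does not end with .pdf")
    # Recommendation 4: look for a line consisting only of a bibliography heading.
    lines = [line.strip().lower() for line in extracted_text.splitlines()]
    if not any(line in REFERENCE_HEADINGS for line in lines):
        problems.append("no 'References' or 'Bibliography' heading found")
    return problems


if __name__ == "__main__":
    text = "An example article\nJane Doe\n\nBody text\n\nReferences\nDoe, J. (2020)."
    print(check_pdf_conventions("https://journal.example.org/article/12/12.pdf", text))
```

Recommendations 2 and 3 concern font size and page layout, which cannot be judged from plain text alone and are therefore left out of this sketch.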

The importance of journals’ website features should not be underestimated, both for article landing pages (always in HTML and generated by the publishing system) and for full-text articles in HTML (in many cases not generated by the system). The topic on HTML provisioning therefore covers website features such as the selection of a suitable domain name, matching keywords, and unique URLs for article landing pages and related research data. It also considers the optimisation of website features for search engines, such as placing each article and each abstract in a separate HTML file, the configuration of meta tags and the structuring of the robots.txt file. Alignment with the COUNTER Code of Practice, the presence of alerting services, sharing on social networks, post-publication evaluation and commenting, support for multimedia and open peer review are additional elements that enhance the visibility of the resources on the web.
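As an illustration of meta tag configuration and robots.txt structuring, the sketch below generates the `citation_*` meta tags that Google Scholar reads from article landing pages and tests a robots.txt policy with Python's standard `urllib.robotparser`. The helper names, the example URLs and the robots.txt rules are assumptions made for this example; a publishing system would normally emit these tags itself.

```python
from html import escape
from urllib.robotparser import RobotFileParser


def scholar_meta_tags(title, authors, publication_date, pdf_url):
    """Build <meta> tags for an article landing page (one citation_author tag per author)."""
    tags = [("citation_title", title)]
    tags += [("citation_author", author) for author in authors]
    tags.append(("citation_publication_date", publication_date))
    tags.append(("citation_pdf_url", pdf_url))
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in tags
    )


# Hypothetical policy: keep crawlers out of the admin area, allow everything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /
"""


def crawler_may_fetch(robots_txt, url, user_agent="*"):
    """Check whether the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(scholar_meta_tags(
        "An example article", ["Doe, Jane", "Roe, Richard"],
        "2024/05/17", "https://journal.example.org/article/12/12.pdf",
    ))
    print(crawler_may_fetch(ROBOTS_TXT, "https://journal.example.org/article/view/12"))
```

Checking the policy locally before deployment helps catch a rule that would accidentally hide article pages from indexing services.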

The next topic in the thematic area of “Content provisioning” deals with the correct structuring of articles’ references in XML, HTML and PDF. The citations should follow the Open Citations standard so that they can be correctly deposited to Crossref. To be properly indexed, the references section should have a standard heading (e.g. “References” or “Bibliography”) in HTML and PDF output. Content production often includes accompanying research data, which is why this topic is also included in this block. The thematic area closes with version control, which is gaining importance with the spread of preprints and overlay journals.

The section on content harvesting and indexing starts by explaining the technical set-up of the OAI-PMH infrastructure and the core features of this protocol. It proceeds to an introduction to indexing services, such as BASE, CORE, OpenAIRE and Google Scholar, and their varying requirements. As these aggregators usually use repository registries (OpenDOAR, re3data) to obtain information on the data sources they intend to harvest, the procedure for registering in both registries, as well as directly with the indexing services, is explained.
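To make the harvesting step more concrete, the following sketch builds an OAI-PMH `ListRecords` request URL and parses a response in the `oai_dc` metadata format. The endpoint `https://journal.example.org/oai` and the sample response are invented for illustration; the sketch performs no network request, and a real harvester would also need to handle protocol errors and keep requesting pages while a resumption token is returned.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_list_records(xml_text):
    """Return (records, resumption_token); the token is None when no further page is indicated."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "titles": [t.text for t in record.iterfind(".//dc:title", NS)],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token or None)


# Invented sample response, shortened to one record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:journal.example.org:article/12</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example article</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://journal.example.org/oai"))
    print(parse_list_records(SAMPLE_RESPONSE))
```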

The section on content depositing and hosting services comprises introductions to repository hosting software such as EPrints (eprints.org), Digital Commons (digitalcommons.bepress.com) or DSpace (dspace.org). The SWORD protocol, which enables the remote depositing of resources into repositories, is also covered there. Another topic in this section is content export in CSV, XML or JSON, structured according to DC, JATS, ONIX or any other common standard that enables data mining. The final training section is devoted to long-term archiving and preservation and demonstrates services such as CLOCKSS, Portico, PKP Preservation Network, PubMed Central, and potentially any national or institutional service. Archiving serves as a backup in case the platforms on which publishers host their books and journals cease to exist, or the publishers themselves go out of business.
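The export topic can be illustrated with a short sketch that writes the same records as CSV with Dublin Core column names and as JSON. The selection of fields, the `"; "` separator for multiple creators and the sample record are assumptions made for this example, not a prescribed export profile.

```python
import csv
import io
import json

# A small, assumed subset of Dublin Core elements used as column names.
DC_FIELDS = ["dc:identifier", "dc:title", "dc:creator", "dc:date"]

# Invented sample record; dc:creator is a list because it is repeatable.
RECORDS = [
    {
        "dc:identifier": "oai:journal.example.org:article/12",
        "dc:title": "An example article",
        "dc:creator": ["Doe, Jane", "Roe, Richard"],
        "dc:date": "2024-05-17",
    }
]


def to_csv(records):
    """Serialise records as CSV; repeated creators are joined into one cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DC_FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = dict(record)
        row["dc:creator"] = "; ".join(record["dc:creator"])
        writer.writerow(row)
    return buffer.getvalue()


def to_json(records):
    """Serialise records as JSON, keeping repeatable fields as lists."""
    return json.dumps(records, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    print(to_csv(RECORDS))
    print(to_json(RECORDS))
```

CSV is flat, so repeatable elements have to be joined into a single cell, whereas JSON can keep them as lists; this difference is worth pointing out when participants choose an export format.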

5.5.3 Modules build-up: Content accessibility I - provisioning, harvesting, depositing and archiving

Table 14: Modules for the training block “Content accessibility”

5.5.4 Training materials

  1. Existing training materials.

Table 15: Existing materials for the training block “Content accessibility I: provisioning, harvesting, depositing and archiving”

  2. Training materials planned to be produced by CRAFT-OA

Table 16: Potential materials for the training block “Content accessibility I: provisioning, harvesting, depositing and archiving”
