ISAW Papers 20.1 (2021)

Introducing the Semantic Web and Linked Open Data

Sarah E. Bond, University of Iowa; Paul Dilley, University of Iowa; and Ryan Horne, University of California Los Angeles

In: Sarah E. Bond, Paul Dilley, and Ryan Horne, eds. 2021. Linked Open Data for the Ancient Mediterranean: Structures, Practices, Prospects. ISAW Papers 20.

URI: http://hdl.handle.net/2333.1/c2fqzh3d

Abstract: This introduction presents a series of introductory case studies written by principal investigators of, or contributors to, various digital projects related to the ancient Mediterranean world which use linked open data. Some of the chapters are based on presentations given at the “Linking the Big Ancient Mediterranean” conference at the University of Iowa in summer 2016. As a group, the contributions cover a wide variety of geographic regions and forms of evidence, and discuss data structures as they relate to a number of sub-disciplines within the study of the ancient world.

Library of Congress Subjects: Semantic Web--Congresses; Linked data--Congresses.

In 2001, Tim Berners-Lee, James Hendler, and Ora Lassila published their vision for the future of the Web in the magazine Scientific American. “The Semantic Web” outlined the necessity for a decentralized extension to the Web which would encourage a standardized order and purpose-driven future for the internet. They prophesied that “[a] new form of Web content that is meaningful to computers will unleash a revolution of new possibilities.”¹ What followed was an impassioned plea for a set of standards that would create a meaningful, i.e. semantic, future predicated on open exchange within the Web. Berners-Lee is himself credited with inventing the World Wide Web in 1989 and co-founded the Open Data Institute (ODI) in 2012. He is also the current director of the World Wide Web Consortium (W3C). For the past 30 years, Berners-Lee and many others have worked tirelessly to keep the Web open and accessible to all, but it is only in the last two decades that developers have begun to prepare more actively for a Web future refocused on a system of annotation, citation, and preservation, one that depends on the adoption of a set of standards collectively called Linked Open Data (LOD). Linked Open Data has also become an important subfield within the Digital Humanities, not only in connection with adjacent disciplines such as cultural heritage management and informatics, but also touching more generally on aspects of data-driven projects and research strategies.

Essential Concepts and Terminology for Linked Open Data²

One of the foundational ideas underpinning LOD is the notion that different datasets can be connected through the Web in an open, freely accessible manner. This is achieved by encouraging individuals, organizations, and institutions not only to make their own data openly available, but also to cite stable open data that already exists on the Web. Berners-Lee outlined the following basic principles which guide LOD development:

Use URIs as names for things
Use HTTP URIs so that people can look up those names
When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
Include links to other URIs, so that they can discover more things³

For data to conform to LOD standards, the data must additionally be released under an open licensing agreement, allowing it to be used, expanded, or built upon by others.⁴ In practical terms, this means that projects seeking to interface with the LOD world need to disambiguate and identify the entities in their data sets, provide some means for those individual entities to be accessible through the web, and link those entities to other data providers.

Although it is highly recommended that they do so, it is not strictly necessary for a project to offer its own data in a LOD format in order to use existing resources. What is required is that a digital project establishes some link between its own data and a LOD resource. For many projects dealing with ancient geography, for example, this takes the form of linking a piece of text, artifact, or list to an appropriate Pleiades project ID, which has a stable URI. For instance, if a study is done on ancient letters that mention Rome, each individual letter would be represented in a spreadsheet or database, with one field denoting the places mentioned within it and another holding the appropriate Pleiades IDs (in this case https://pleiades.stoa.org/places/423025). After constructing such a database or spreadsheet, it is then a simple matter to use the API of the data provider to get any relevant information.
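The spreadsheet described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed workflow: the letter identifiers and column names (letter_id, place_mentioned, pleiades_id) are invented for the example, and only the Pleiades ID for Rome comes from the text.

```python
import csv
import io

# Base of Pleiades place URIs, as given in the text for Rome.
PLEIADES_BASE = "https://pleiades.stoa.org/places/"

# Hypothetical spreadsheet: the letter IDs and column names are invented
# for illustration. Only the Pleiades ID for Rome is taken from the text.
LETTERS_CSV = """letter_id,place_mentioned,pleiades_id
letter-001,Rome,423025
letter-002,Rome,423025
letter-003,,
"""


def pleiades_uri(pleiades_id):
    """Turn a bare Pleiades ID into its stable URI."""
    return PLEIADES_BASE + pleiades_id


def load_links(csv_text):
    """Return (letter_id, place_uri) pairs for rows that cite a place."""
    links = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["pleiades_id"]:  # skip letters with no identified place
            links.append((row["letter_id"], pleiades_uri(row["pleiades_id"])))
    return links


links = load_links(LETTERS_CSV)
for letter_id, uri in links:
    print(letter_id, "->", uri)
```

Each resulting pair is a link between a project's own record and a LOD resource; the URIs can then be handed to whatever API the data provider offers.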

However, LOD is far more useful when data and connections are contributed back to the community writ large. One of the primary methods for linking data is the Resource Description Framework standard, or RDF. This technology is a method for representing the relationships among different web-based resources in a standard format that can be read and understood by machines. To do so, RDF uses Uniform Resource Identifiers (URIs) to uniquely identify and disambiguate different web resources. Most internet users encounter URIs through the HyperText Transfer Protocol (HTTP). For example, the website for the Pleiades project is https://pleiades.stoa.org; the entire address is a URI, with https as the scheme and pleiades.stoa.org as the authority. What this means in practical terms is that https://pleiades.stoa.org/ identifies a unique site that should resolve in an expected manner.
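The parts of a URI named above can be inspected with Python's standard library. The sketch below uses urllib.parse.urlsplit, whose netloc attribute corresponds to what the text calls the authority; the helper name describe_uri is introduced here only for illustration.

```python
from urllib.parse import urlsplit


def describe_uri(uri):
    """Split a URI into the components discussed in the text."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,     # e.g. "https"
        "authority": parts.netloc,  # e.g. "pleiades.stoa.org"
        "path": parts.path,         # identifies the resource at that authority
    }


rome = describe_uri("https://pleiades.stoa.org/places/423025")
print(rome)
```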

The digital ecosystem is not a static environment. Websites can, and often do, change, become defunct, or disappear altogether. However, the ideal URI points to a web resource that will not conceptually change. The design and display elements of the Pleiades site, and perhaps some of the underlying data itself, may be altered, but the URI for the city of Rome in the Pleiades Project, https://pleiades.stoa.org/places/423025, should always point to some information about Rome, and not, say, a new method for creating rocket engines or a recipe for a sour imperial stout. RDF describes the relationship between these ideally stable URIs through a triple, or a declaration taking the form of <subject><predicate><value> (the value is often called the object). A project may wish to state that it has data on an artifact found in Rome, and the resulting RDF triple can be expressed by using the ID of the artifact itself in a web-accessible database as the subject, found in as the predicate, and https://pleiades.stoa.org/places/423025 as the value. After constructing a file of these triples in a recognized format, a project can submit its data to an aggregator, which makes it accessible to the larger LOD world by establishing a recognized means to search and present data in the LOD cloud relevant to the study domain. The details of this process, including refining data for alignments with LOD resources and publishing the results, are discussed below.
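As a rough sketch of what a file of such triples might look like, the following Python writes statements in N-Triples, one recognized RDF serialization in which each URI is wrapped in angle brackets and each statement ends with a period. The artifact and predicate URIs under example.org are hypothetical placeholders, not real vocabulary terms, and the sketch orients the statement with the artifact as subject and the place as the value; that orientation is an assumption of the example.

```python
# The Pleiades URI for Rome comes from the text.
ROME = "https://pleiades.stoa.org/places/423025"

# Hypothetical placeholders: a real project would use its own artifact
# URIs and a predicate drawn from a published vocabulary.
ARTIFACT = "https://example.org/artifacts/42"
FOUND_IN = "https://example.org/vocab/foundIn"


def to_ntriple(subject, predicate, value):
    """Serialize one triple of URIs as a single N-Triples statement."""
    return "<{}> <{}> <{}> .".format(subject, predicate, value)


def write_ntriples(triples):
    """Join several statements into the text of an N-Triples file."""
    return "\n".join(to_ntriple(*t) for t in triples) + "\n"


statement = to_ntriple(ARTIFACT, FOUND_IN, ROME)
print(statement)
```

A library such as rdflib would normally handle serialization and validation; the hand-rolled version here only makes the shape of a triple visible.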

From Linked Open Data to Linked Ancient World Data

In the spring of 2012 and 2013, the fields of Classics, Ancient History, and Classical Archaeology took a definitive step towards the adoption of LOD standards with the convening of two Linked Ancient World Data Institutes (LAWDI) funded by the Office of Digital Humanities of the National Endowment for the Humanities.⁵ Two three-day workshops, one at the Institute for the Study of the Ancient World at New York University (2012) and the other at Drew University (2013), brought together a number of digital humanists focused on applying Linked Open Data standards to the digital study of the ancient Mediterranean and the Near East.⁶ These workshops focused on the need for stable URIs and parsable, open data provided in formats which could easily facilitate machine-based reuse of this data. Models for such LAWD standards included the Pelagios Commons consortium and the Pleiades Project. Participants began to discuss, debate, and decide upon more standardized vocabularies within the online environment. If anything, it was apparent to the many philologists and language-based academics in attendance that establishing an open, stable language spoken between projects on the Web would be able to enrich, connect, and stabilize the data and bibliographies on sites nested on servers across the globe. In recent years, linked data related to the ancient world has continued to develop in multiple areas, including large-scale linked data projects such as the Linking Latin project (LiLa), based at Università Cattolica del Sacro Cuore, Milan, which aims to build a knowledge base of linked NLP resources for Latin (https://lila-erc.eu); and Biblissima, a data aggregator which seeks to link collections of IIIF-compliant manuscript images (http://beta.biblissima.fr/).

In summer 2016, Sarah E. Bond, Paul Dilley, and Ryan Horne decided to convene a conference similar to the LAWDI of 2012 and 2013 at the University of Iowa in Iowa City, IA, in order to address how LOD had developed in the intervening years and how their new digital project, called Big Ancient Mediterranean (BAM), might best adopt LAWD standards and connect to the current network of digital projects. Funded by the University of Iowa’s Obermann Center for Advanced Studies and hosted within the University of Iowa Library’s Digital Scholarship & Publishing Studio, “Linking the Big Ancient Mediterranean” brought together an international group of scholars from the fields of Classics, Archaeology, Digital Humanities, Museum Studies, Art History, Library and Information Science, and Computer Science for a two-day conference. Rather than simply publishing the paper presentations in a closed-access edited collection, presenters decided to construct a handbook that would aid future digital projects focused on the ancient and late ancient Mediterranean and Near East. There was a consensus that we wished to publish this volume in a manner that made it open access and online in the spirit of LOD. What follows is a series of case studies written by the principal investigators of, or contributors to, various digital projects who have themselves adopted LOD structures and experienced the myriad hopes, frustrations, setbacks, and accomplishments that come with beginning or revising a digital humanities project.

In the second chapter of this volume, ancient historian and ancient geographer Ryan Horne discusses not the “what” but rather the “how” of LOD, with particular attention to the issue of space. He addresses the basics of how principal investigators and contributors to digital projects can first assess the nature of their data before determining its relationship(s) with people, places, objects, and concepts. He remarks on possible metadata considerations and the necessity for unique IDs for each datapoint, as well as the need to record replicable research methodologies carefully and to reconcile data with other providers before publishing it under an open license of the project’s choosing. In the third chapter, archaeologist and Co-PI of PeriodO Adam Rabinowitz moves us from considerations of space to the challenges of conceptualizing and quantifying time. Rabinowitz addresses how to incorporate time within datasets that will be modeled as LOD, beginning with the basic modeling and expression of Gregorian calendar dates and then transitioning to the modeling and expression of periods and other temporal expressions. He also distinguishes the different modeling approaches needed to address continuous time-stamped data versus data stamped with relative periods. Ultimately, he argues that “structured representation of periods in datasets related to the past will help us to understand our own scholarly disciplines.”

In addition to the dynamics of space and time, a fourth chapter in the volume engages a further broad category of evidence from the ancient world: people. Gabriel Bodard describes linked open data for people and names, as developed by the SNAP:DRGN project. He offers a rich overview of the varieties of data about people available from ancient sources, and the “factoid” model for prosopographical databases. Distinct from, but related to, the practices of prosopographies, personal names lexica, and catalogues is the SNAP cookbook for person data in RDF. Each person has a required URI, as well as type (e.g. mortal or immortal), citation, and publisher information; name is optional but recommended (it might not always be known), as are attestations; data such as associated place, date, occupation, relationship, and other identifiers are to be recorded when available.
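The division into required, recommended, and when-available fields described above can be pictured as a simple record type. The following Python dataclass is only an illustrative sketch: the field names are chosen for readability and are not the terms of the SNAP ontology, and the example URI is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PersonRecord:
    # Required fields, following the summary in the text.
    uri: str
    person_type: str  # e.g. "mortal" or "immortal"
    citation: str
    publisher: str
    # Optional but recommended.
    name: Optional[str] = None
    attestations: List[str] = field(default_factory=list)
    # Recorded when available.
    associated_place: Optional[str] = None
    date: Optional[str] = None
    occupation: Optional[str] = None

    def missing_required(self):
        """List the required fields that were left blank."""
        required = {
            "uri": self.uri,
            "person_type": self.person_type,
            "citation": self.citation,
            "publisher": self.publisher,
        }
        return [key for key, value in required.items() if not value.strip()]


# Hypothetical record; every value here is a placeholder.
record = PersonRecord(
    uri="https://example.org/persons/1",
    person_type="mortal",
    citation="Example Prosopography, entry 1",
    publisher="",
)
print(record.missing_required())
```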

The representation of materiality is another challenge for LOD modeling. In her chapter, digital humanities librarian and 3D modeling specialist Hannah Scates Kettler reflects on the challenges of research in the third dimension. After providing a history of LOD and 3D modeling, Scates Kettler addresses the research, publication, and dissemination of 3D research. The issues of ethics surrounding the modeling of cultural heritage are of central concern, as is the creation of principles that outline and enforce these ethics. Scates Kettler underscores the need for interdisciplinary work within the humanities in order to convey information, but also returns to a key theme of this handbook, the value of adopting LOD as a guiding principle: “Linked Open Data has the potential to not only aid in the discovery of 3D research, but also aid in the creation of new relationships that provide context to 3D data, provide ways of viewing and interaction (should appropriate metadata be captured) and provide an extension on a much too short research lifecycle for 3D.” As Scates Kettler and the rest of the contributors within this handbook demonstrate, LOD can break the silos of traditional academic content by creating democratized networks of information. But ethical and open 3D modeling means its citizens must still adhere to certain rules and best practices.

Two other chapters discuss LOD as it relates to material objects. Andrew Meadows and Ethan Gruber, in a revised version of their 2014 article in ISAW Papers, describe the current landscape of LOD relating to numismatics. After giving an overview of coins as historical sources and archaeological objects, and of the publication of online numismatic databases, they turn to the challenges of describing these complex objects, which are simultaneously physical, textual, and geographical. They present the ontology that meets these descriptive challenges, developed in 2014 and now in use by nomisma.org, including how it can be used to connect ideal coin types to individual specimens of coins, and to coin hoards. Pietro Liuzzo describes how a linked open data model employing TEI and RDF triples is used to adapt current models of codex structure and stratigraphy, as described in La Syntaxe du Codex. Essai de codicologie structurale (Andrist, Canart, and Maniaci 2013), to a digital environment, and in particular for Beta maṣāḥǝft: Manuscripts of Ethiopia and Eritrea, a resource for the rich heritage of manuscripts in Ethiopic (Ge’ez). Liuzzo describes how the physical structure of the manuscript is encoded in TEI, which is then used to produce RDF triples, which can in turn be visualized and verified as part of the research workflow. Liuzzo’s analysis suggests how ad-hoc rather than comprehensive methodologies can be of fundamental research value, and indeed can lead to further collaboration.

Several chapters in the volume address LOD as it relates to literature. Thomas Koentges introduces the CITE architecture and CTS identifiers, in conversation with developers and practitioners Christopher Blackwell, Gregory Crane, Neel Smith, and James Tauber. The CITE architecture, which developed out of the Homer Multitext Project, is “a highly precise framework to reference research data in textual humanities in a machine-actionable way.” It formalizes the connection between different versions, whether manuscript, print, or electronic, of a given work at the level of the text. CITE architecture is used in conjunction with Canonical Text Services (CTS) Uniform Resource Names (URNs), which point to a particular section of the text, which can be viewed in various ways (including Perseus’s new Scaife viewer). Alison Babeu and Paul Dilley explore a related topic, linked data for Authors and Works, outlining the development of authoritative digital catalogues of ancient works, with related metadata, including time, place of composition, and genre. Building upon print resources such as the TLG, as well as digitized editions, the Perseus Catalog offers stable URNs for ancient authors and works, while the Virtual International Authority File (VIAF) builds on a consortium of international libraries operated by the OCLC to present the names of authors and their works in multiple languages.
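To show how a CTS URN points to a particular section of a text, the following sketch splits one into its colon-separated parts: a namespace, a work hierarchy (text group, work, version), and a passage reference. It is a simplified parser written for illustration, assuming that general shape; it ignores finer features such as exemplar components, sub-references, and passage ranges, and the sample URN is offered only as an example of the form.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str
    textgroup: str
    work: Optional[str]
    version: Optional[str]
    passage: Optional[str]


def parse_cts_urn(urn):
    """Split a CTS URN into namespace, work hierarchy, and passage."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError("not a CTS URN: " + urn)
    # The work component is dot-separated: textgroup.work.version
    hierarchy = parts[3].split(".")
    work = hierarchy[1] if len(hierarchy) > 1 else None
    version = hierarchy[2] if len(hierarchy) > 2 else None
    # The passage component is optional and may be empty.
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(parts[2], hierarchy[0], work, version, passage)


# Sample URN, used here only to illustrate the form.
parsed = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1")
print(parsed)
```

Because the version sits in its own slot, the same passage reference can be paired with a different manuscript, print, or electronic version of the same work.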

The use of LOD to amplify and preserve cultural heritage is an issue within a number of the projects in the LOD handbook, including those focused on specific regions and/or traditions. In his chapter, historian of early Christianity and Co-PI of Syriaca.org David A. Michelson examines the modeling, encoding, and publishing of historical data surrounding a language by using the example of Syriac, which was “a dialect of Aramaic which was used widely in the Middle East and Asia during the first millennium of the current era.” Michelson discusses how Syriaca.org applied the principles of LOD to aggregate, relate, and then publish the historical information pulled from Syriac sources. He investigates how the project reconciled the graph data structure of LOD with the encoding guidelines provided by a consortium known as the Text Encoding Initiative (TEI), which defines a long-established eXtensible Markup Language (XML) format for marking up and encoding texts. Michelson uses Syriaca.org to show that a digital project using LOD can create a paradigm for other projects looking to digitize, organize, and preserve texts connected to other endangered languages. The themes of preservation and stability are also seen in the chapter by papyrologist and digital humanist Ryan Baumann, who uses Papyri.info for his case study. Papyri.info allows for the searching and editing of thousands of papyrological documents, mostly from Egypt, but also exemplifies the need for social and institutional commitments to establishing and providing stable identifiers for data. Baumann again stresses the need for project managers to establish relationships between texts, objects, and their representations. He also points to the need for projects to create a data preservation plan with their institution to ensure that the URIs linked to are secure now and well into the future.

Finally, several contributions engage with large archaeological datasets. In “Origins and Antidotes of Omission: Southeastern European Archaeology, Linked Open Data, and the Possibilities for Archaeological Integration,” Anne Hunnell Chen and Jamie Folsom discuss their project for creating a LOD ecosystem for data relating to the archaeology of southeastern Europe, a relatively understudied region, focusing on the importance of including local partners in the integration of data. Sebastian Heath provides an example of LOD in action through his dataset of 260 Roman amphitheatres, which is available in an open-license GitHub repository. He walks readers through the .geojson file, with its structured data on amphitheatre size, location, etc., and demonstrates how one can use SPARQL to query JSON-LD files with IPython notebooks, for example to select amphitheatres belonging to a particular chronological period. This procedure is an excellent way of exploring one’s own research data in a linked environment, and the data can also be made available for others to query.
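The kind of selection described above can be approximated without any RDF tooling. The sketch below is a plain-Python stand-in for what a SPARQL query would express, not the method used in the chapter: it filters a tiny GeoJSON collection by period. The features, identifiers, and property names (id, period, capacity) are invented placeholders and do not reflect the schema or contents of the actual amphitheatre dataset.

```python
import json

# Invented placeholder data: these features and property names are
# hypothetical and are not taken from the real dataset.
SAMPLE_GEOJSON = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
     "properties": {"id": "exampleA", "period": "period-1", "capacity": 1000}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [1.0, 1.0]},
     "properties": {"id": "exampleB", "period": "period-2", "capacity": 2000}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [2.0, 2.0]},
     "properties": {"id": "exampleC", "period": "period-1", "capacity": 3000}}
  ]
}
"""


def select_by_period(collection, period):
    """Return the ids of features assigned to the given period."""
    return [
        feature["properties"]["id"]
        for feature in collection["features"]
        if feature["properties"].get("period") == period
    ]


data = json.loads(SAMPLE_GEOJSON)
selected = select_by_period(data, "period-1")
print(selected)
```

A SPARQL query over JSON-LD generalizes this idea: because the properties are tied to shared URIs, the same question can be asked across datasets published by different projects.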

Conclusion

The chapters in this volume reflect many different places, times, sources, and approaches. We hope that it will provide readers with a sense of the possibilities for participating in the linked data environment, and an introduction to some of the major resources and techniques for doing so. While this volume cannot address every potential question or concern now or in the future, it can provide resources for accessing and understanding the terminologies, methodologies, and goals of the LOD movement in the study of the ancient and late ancient Mediterranean and Near East.

Notes

¹ Tim Berners-Lee, James Hendler, and Ora Lassila, “The Semantic Web,” Scientific American (May 2001): 29-37. See the subsequent assessment in Christian Bizer, Tom Heath, and Tim Berners-Lee, “Linked Data - The Story So Far,” International Journal on Semantic Web and Information Systems 5.3 (2009): 1-22.

² For a more extensive practical introduction in a Digital Humanities context, see Jonathan Blaney, “Introduction to the Principles of Linked Open Data” (2017; modified 2020): https://programminghistorian.org/en/lessons/intro-to-linked-data (Accessed August 10, 2020).

³ Tim Berners-Lee, “Linked Data” (July 2006; modified June 2009): http://www.w3.org/DesignIssues/LinkedData.html (Accessed August 10, 2020).

⁴ See the Journal of Open Humanities Data for a useful list of open licenses: https://openhumanitiesdata.metajnl.com/about/#q5 (Accessed August 10, 2020).

⁵ “Linked Ancient World Data Institute,” The Digital Classicist Wiki: https://wiki.digitalclassicist.org/Linked_Ancient_World_Data_Institute (Accessed August 10, 2020).

⁶ See Thomas Elliott, Sebastian Heath, and John Muccigrosso, “Current Practice in Linked Open Data for the Ancient World,” ISAW Papers 7 (2014): http://dlib.nyu.edu/awdl/isaw/isaw-papers/7/ (Accessed August 10, 2020), and the rest of the papers in that volume. For a recent overview of linked data relating to the ancient world, see Hugh Cayless, “Sustaining Linked Ancient World Data,” in Monica Berti, ed., Digital Classical Philology: Ancient Greek and Latin in the Digital Revolution (Berlin: De Gruyter, 2019): 35-50. https://www.degruyter.com/viewbooktoc/product/502894.
