This article is available at the URI http://dlib.nyu.edu/awdl/isaw/isaw-papers/7/blackwell-smith/ as part of the NYU Library's Ancient World Digital Library in partnership with the Institute for the Study of the Ancient World (ISAW). More information about ISAW Papers is available on the ISAW website.

Except where noted, ©2014 Christopher W. Blackwell and D. Neel Smith; distributed under the terms of the Creative Commons Attribution License

This article can be downloaded as a single file

ISAW Papers 7.5 (2014)

The Homer Multitext and RDF-Based Integration

Christopher W. Blackwell and D. Neel Smith

The Project

The Homer Multitext (HMT) is an international collaboration aimed at recovering and documenting the history of Greek epic poetry based on primary source documents, particularly the fragmentary papyri from late Antiquity and the annotated Byzantine codices of the Homeric Iliad. The data generated by the project consists of image files, transcriptions and translations of Greek and Latin texts in poetry and prose, commentary texts, and relationships among these. The digital library architecture that the project has developed since 2001 to manage this work is called CITE for Collections, Indices, Texts and Extensions.

Through participation in the LAWDI workshop at New York University in the summer of 2012, the HMT’s editors recognized the significant potential of RDF triples not only as a means of linking between projects, but of capturing and integrating the project’s data for internal use. The simplicity of RDF, combined with its flexibility, freely available tools, and widespread support, moved us to integrate RDF into the heart of our workflow and architecture.

Separation of Concerns

Integrating a large, diverse, and evolving body of data, under development by a widely distributed group of scholars at many stages of their careers, from first-year students of Greek to tenured professors at major universities, requires rigorous attention to separation of concerns. We have tried to separate scholarly activities cleanly, and to associate with each activity the archival data format most appropriate for it.

For example, it is most convenient to edit a transcription of a Greek text as a valid TEI-XML document. Subsequent analysis of that document, however, is made considerably more difficult by the arbitrarily deep hierarchical structure of any non-trivial XML text; for many kinds of analysis, processing a flattened, tabular format is preferable. For serving, querying, and sharing the archival material, an RDF triplestore is most efficient and broadly useful. For presentation of data delivered to web browsers for human or machine consumption, a combination of XML, XSLT, and CSS is most convenient. Since 2012, much of the development on the CITE architecture has focused on a test-driven environment for a publication cycle of edit, test, integrate, compile, serve, and format.

Task | Format | Tools | Validation
Editing
Texts | TEI-XML | Oxygen, vel sim. | RelaxNG Schema
Collections | Plain text, comma- or tab-delimited | Google Fusion Tables, Git Online Editor, &c. | csv/tsv parsing libraries
Building & Integrating
Testing Validity of Texts | flattened, tabular data | Gradle build-system | Custom scripts, Perseus Morphological Service
Integrating Texts, Data, and Images | RDF triples | Gradle build-system | Custom scripts
Publishing
Serving Data | RDF triples | Fuseki SPARQL endpoint, vel sim. |
Discovery, Query, Retrieval | RDF triples > XML | CITE Servlet |
End-user display | Citations > XML Fragments > HTML | CITEKit |

URN Citation

Every object in the HMT can be cited by a URN—CTS URNs for texts, CITE URNs for objects and images.1

A Model of “Text”

The most complex data in the HMT are the texts, and for this separation of concerns to work we need a conceptual model of “text” that allows us to move from hierarchical XML to tabular data to RDF and back to hierarchical XML without loss.

For our purposes, we define a “text” as an “ordered hierarchy of citation objects”, following the OHCO2 defined by Neel Smith and Gabriel Weaver.2 By prioritizing units of citation over any other hierarchy, we can guarantee the most important scholarly activity: citation and retrieval. Any other content elements or orthogonal views of a text can be accommodated by this model, to a greater or lesser degree of granularity, depending on editorial decisions about citation.
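The lossless round trip from hierarchy to table and back can be illustrated with a minimal sketch. It assumes a toy XML edition whose `book` and `line` elements and `n` attributes are hypothetical stand-ins (the project's real texts are TEI-XML), and the functions `flatten` and `rebuild` are illustrative names, not part of the CITE tools. The point is only that, once every leaf node carries a URN and a sequence number, the hierarchy can be recovered from the flat rows alone.

```python
import xml.etree.ElementTree as ET

# Toy edition: element and attribute names are hypothetical, not the HMT's TEI markup.
SAMPLE = """<text urn="urn:cts:greekLit:tlg0012.tlg001.msA">
  <book n="1"><line n="1">first line</line><line n="2">second line</line></book>
  <book n="2"><line n="1">third line</line></book>
</text>"""

def flatten(xml_source):
    """Return one (sequence, urn, content) row per citable leaf node, in document order."""
    root = ET.fromstring(xml_source)
    base = root.get("urn")
    rows = []
    for book in root.findall("book"):
        for line in book.findall("line"):
            urn = f"{base}:{book.get('n')}.{line.get('n')}"
            rows.append((len(rows) + 1, urn, line.text or ""))
    return rows

def rebuild(rows):
    """Rebuild the hierarchy from flat rows, relying only on URN and sequence."""
    ordered = sorted(rows)
    base = ordered[0][1].rsplit(":", 1)[0]
    root = ET.Element("text", urn=base)
    books = {}
    for _, urn, content in ordered:
        book_n, line_n = urn.rsplit(":", 1)[1].split(".")
        if book_n not in books:
            books[book_n] = ET.SubElement(root, "book", n=book_n)
        line = ET.SubElement(books[book_n], "line", n=line_n)
        line.text = content
    return root

rows = flatten(SAMPLE)
round_trip = flatten(ET.tostring(rebuild(rows), encoding="unicode"))
```

Flattening the rebuilt document yields the same rows as flattening the original, which is the property the citation-first model is meant to guarantee.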

OHCO2 is implemented in the CITE architecture through the Canonical Text Services protocol, which defines a structure for a catalog, a small number of valid requests for discovery and retrieval, and the format of responses to those requests.

Collections and Images

CITE defines a “collection” as a group of data objects sharing a defined set of fields. Each object has a URN, and named fields defined in a catalog. Collections may be ordered or unordered; in an ordered collection each object has one field that defines its place in a sequence with an integer value.
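As a rough illustration of an ordered collection, the following sketch reads comma-delimited text and sorts its objects by an integer ordering field. The `demo` namespace, the field names `sequence` and `label`, and the function `load_ordered_collection` are all assumptions made for the example; a real CITE Collection takes its field definitions from its catalog.

```python
import csv
import io

# Hypothetical ordered collection; the namespace, fields, and values are illustrative only.
SAMPLE_CSV = """urn,sequence,label
urn:cite:demo:signs.2,2,second sign
urn:cite:demo:signs.1,1,first sign
urn:cite:demo:signs.3,3,third sign
"""

def load_ordered_collection(source, ordering_field):
    """Read delimited text and return its objects sorted by the integer ordering field."""
    reader = csv.DictReader(io.StringIO(source))
    objects = list(reader)
    for obj in objects:
        # Every object must supply every field named in the header row.
        missing = [name for name in reader.fieldnames if obj.get(name) in (None, "")]
        if missing:
            raise ValueError(f"{obj.get('urn')} lacks fields: {missing}")
    return sorted(objects, key=lambda obj: int(obj[ordering_field]))

collection = load_ordered_collection(SAMPLE_CSV, "sequence")
```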

Sub-references to URNs

CITE and CTS URNs define texts, collections, or images. Because scholarship demands citation of specific parts of objects (passages of text, regions-of-interest on objects, particular fields of a data-object), any of these URNs may include a sub-reference, providing arbitrary granularity, specific to the object defined by the URN.

Type | URN | Points to… | Sub-reference?
CTS Text | urn:cts:greekLit:tlg0012.tlg001.msA | Homer, Iliad, edition of Manuscript A | none
CTS Text | urn:cts:greekLit:tlg0012.tlg001.msA:1.1 | Homer, Iliad, edition of Manuscript A, Book 1, Line 1 | none
CTS Text | urn:cts:greekLit:tlg0012.tlg001.msA:1.1@μῆνιν | Homer, Iliad, edition of Manuscript A, Book 1, Line 1 | the string “μῆνιν”
CITE Image | urn:cite:hmt:vaimg.VA052RN-0053 | hmt namespace, vaimg collection, image VA052RN-0053 | none
CITE Image | urn:cite:hmt:vaimg.VA052RN-0053@0.1381,0.4192,0.3954,0.0368 | hmt namespace, vaimg collection, image VA052RN-0053 | a rectangular region-of-interest
CITE Object | urn:cite:hmt:venAsign.10 | hmt namespace, venAsign collection, item 10 | none
CITE Object | urn:cite:hmt:venAsign.10@GreekName | hmt namespace, venAsign collection, item 10 | the contents of the field GreekName for this item
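The URN forms listed above can be taken apart mechanically. The following is a minimal sketch, not the project's own URN library: the `ParsedUrn` class and `parse_urn` function are hypothetical names, and the parsing rules are inferred only from the examples in the table (colon-separated components, with an optional sub-reference after `@`).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ParsedUrn:
    kind: str               # "cts" or "cite"
    namespace: str          # e.g. "greekLit" or "hmt"
    work_or_object: str     # e.g. "tlg0012.tlg001.msA" or "venAsign.10"
    passage: Optional[str]  # CTS only, e.g. "1.1"
    subref: Optional[str]   # whatever follows "@", if anything

def parse_urn(urn: str) -> ParsedUrn:
    """Split a CTS or CITE URN into its components and optional sub-reference."""
    body, _, subref = urn.partition("@")
    parts = body.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] not in ("cts", "cite"):
        raise ValueError(f"not a CTS or CITE URN: {urn}")
    # Only CTS URNs carry a passage component after the work identifier.
    passage = parts[4] if parts[1] == "cts" and len(parts) > 4 else None
    return ParsedUrn(parts[1], parts[2], parts[3], passage, subref or None)

text_ref = parse_urn("urn:cts:greekLit:tlg0012.tlg001.msA:1.1@μῆνιν")
field_ref = parse_urn("urn:cite:hmt:venAsign.10@GreekName")
```

Here `text_ref.passage` is the citation "1.1" and `text_ref.subref` is the string within that line, while `field_ref.subref` names a field of the object.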

Doing the Work at Build-Time

Validation and Testing

We use Gradle to build our .ttl file, compiling XML texts together with collections and indices saved as .csv or .tsv files. Our build-scripts perform validation on XML files as well as other domain-specific tests. HMT-MOM (for “Mandatory Ongoing Maintenance”) includes scripts that enforce a specified canon of legitimate Unicode characters for Greek texts, ensure the referential integrity of URN values in indices and CITE Collection data structures, and support further manual review by providing visualizations of the state of completion for each folio our collaborators edit. HMT-MOM also does linguistic checking, matching each word-token against the Morpheus morphological parser; words that fail to match must be identified as non-standard forms actually present on a manuscript, non-lexical strings (numbers, case-endings, etc.), or new Greek vocabulary to be entered (ultimately) into a new lexicon of the language.
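Two of these checks, the character canon and referential integrity, can be sketched in a few lines. This is not HMT-MOM's code: the allowed ranges below are an assumed canon chosen for illustration (the project's actual list of legitimate characters is not reproduced here), and the function names and `demo` URNs are hypothetical.

```python
import unicodedata

# Assumed canon for illustration only: the Greek and Coptic block, the Greek Extended
# block, whitespace, and a few punctuation marks.
ALLOWED_RANGES = [(0x0370, 0x03FF), (0x1F00, 0x1FFF)]
ALLOWED_EXTRA = set(" \n.,;\u00b7\u2019")

def illegal_characters(text):
    """Return the sorted list of distinct characters outside the assumed canon."""
    normalized = unicodedata.normalize("NFC", text)
    bad = set()
    for ch in normalized:
        in_range = any(lo <= ord(ch) <= hi for lo, hi in ALLOWED_RANGES)
        if not in_range and ch not in ALLOWED_EXTRA:
            bad.add(ch)
    return sorted(bad)

def dangling_references(index_pairs, known_urns):
    """Return index entries in which either URN is not defined anywhere in the data."""
    return [(a, b) for a, b in index_pairs if a not in known_urns or b not in known_urns]

known = {"urn:cts:demo:g.w.e:1.1", "urn:cite:demo:img.1"}
index = [
    ("urn:cts:demo:g.w.e:1.1", "urn:cite:demo:img.1"),
    ("urn:cts:demo:g.w.e:1.2", "urn:cite:demo:img.1"),  # 1.2 is not a known URN
]
problems = dangling_references(index, known)
```

Running such checks at build-time means that a stray Latin character or a mistyped URN stops the build, rather than surfacing later as a failed retrieval.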

Inferencing

The URN-syntax of all CITE citations captures hierarchical relationships between group/work/edition/citation (for texts) or namespace/collection/object (for data objects and images). The citations show us that urn:cts:greekLit:tlg0012.tlg001.msA:1.11 (Homer, Iliad, MS. A edition, Book 1, line 11), belongs to urn:cts:greekLit:tlg0012.tlg001 (Homer, Iliad), and so forth.
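Because the hierarchy is encoded in the citation itself, the containing URNs can be derived by string manipulation alone. The sketch below assumes only the CTS URN shape shown in this article; `cts_ancestors` is a hypothetical helper name, not a function of the CITE libraries.

```python
def cts_ancestors(urn):
    """List every URN that contains the given CTS URN, from nearest to most general."""
    body = urn.split("@", 1)[0]            # a sub-reference plays no part in the hierarchy
    parts = body.split(":")
    if parts[:2] != ["urn", "cts"] or len(parts) < 4:
        raise ValueError(f"not a CTS URN: {urn}")
    head = ":".join(parts[:3])             # e.g. "urn:cts:greekLit"
    work = parts[3].split(".")             # e.g. ["tlg0012", "tlg001", "msA"]
    passage = parts[4].split(".") if len(parts) > 4 else []
    ancestors = []
    # Shorten the passage reference one level at a time: 1.11 -> 1
    for depth in range(len(passage) - 1, 0, -1):
        ancestors.append(f"{head}:{'.'.join(work)}:{'.'.join(passage[:depth])}")
    # Then the whole version, if a passage was cited at all.
    if passage:
        ancestors.append(f"{head}:{'.'.join(work)}")
    # Then shorten the work hierarchy: edition -> work -> text group.
    for depth in range(len(work) - 1, 0, -1):
        ancestors.append(f"{head}:{'.'.join(work[:depth])}")
    return ancestors

chain = cts_ancestors("urn:cts:greekLit:tlg0012.tlg001.msA:1.11")
```

For the line cited in the paragraph above, `chain` runs from Book 1 of the MS. A edition, through the edition and the Iliad, up to the text group.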

Earlier versions of our CITE services (Perl, Cocoon, eXist, AppEngine) were processor-intensive, sorting out hierarchical relationships at N levels of depth on the fly. Some of the solutions to implementing a generic architecture for complex data were clever.

In the current implementation of CITE/CTS services, we take advantage of RDF to avoid cleverness at all costs. The Gradle build citemgr processes our XML texts, tab-delimited and comma-delimited collections and indices, and their catalogues, and makes explicit at build-time every relationship necessary to capture the model of a text or collection-object.
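What "making relationships explicit at build-time" might look like can be sketched as follows. This is not citemgr's output: the `ex:` prefix and the verb names (`ex:belongsTo`, `ex:hasSequence`, `ex:hasTextContent`, `ex:prev`, `ex:next`) are hypothetical, as are the `demo` URNs. The project's real verbs can be listed with the query given below in the article.

```python
import json

def explicit_triples(rows):
    """Turn (sequence, urn, text) rows into (subject, verb, object) triples."""
    ordered = sorted(rows)
    triples = []
    for i, (seq, urn, text) in enumerate(ordered):
        version = urn.rsplit(":", 1)[0]
        triples.append((urn, "ex:belongsTo", version))
        triples.append((urn, "ex:hasSequence", seq))
        triples.append((urn, "ex:hasTextContent", text))
        # Neighbours are stated outright, so nothing needs computing at query time.
        if i > 0:
            triples.append((urn, "ex:prev", ordered[i - 1][1]))
        if i < len(ordered) - 1:
            triples.append((urn, "ex:next", ordered[i + 1][1]))
    return triples

def to_turtle(triples):
    """Serialize triples as simple Turtle statements, one per line."""
    lines = ["@prefix ex: <http://example.org/verbs/> ."]
    for subject, verb, obj in triples:
        if isinstance(obj, int):
            rendered = str(obj)
        elif obj.startswith("urn:"):
            rendered = f"<{obj}>"
        else:
            rendered = json.dumps(obj, ensure_ascii=False)  # quoted, escaped literal
        lines.append(f"<{subject}> {verb} {rendered} .")
    return "\n".join(lines)

triples = explicit_triples([
    (2, "urn:cts:demo:g.w.e:1.2", "second line"),
    (1, "urn:cts:demo:g.w.e:1.1", "first line"),
])
turtle = to_turtle(triples)
```

The design trade is deliberate: the file grows, because every relationship is written down, but the service that answers requests only has to look statements up.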

A complete list of the RDF verbs used to describe the Homer Multitext data is available through this query:

SPARQL Endpoint: http://beta.hpcc.uh.edu:3030/ds/
Query: select distinct ?v where { ?s ?v ?o . }
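One way to submit that query from a script is an HTTP GET carrying a `query` parameter, as the SPARQL Protocol allows. The sketch below makes two assumptions that should be checked against the actual server: that the dataset's query service is named `query`, and that the endpoint given above is still reachable. The function names are hypothetical, and the block defines the network call without executing it.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://beta.hpcc.uh.edu:3030/ds/"  # from the article; may no longer be reachable
QUERY = "select distinct ?v where { ?s ?v ?o . }"

def query_url(endpoint, query, service="query"):
    """Build a GET URL; the service name "query" is an assumption about the server."""
    return endpoint.rstrip("/") + "/" + service + "?" + urllib.parse.urlencode({"query": query})

def distinct_verbs(endpoint, query):
    """Send the query and return the values bound to ?v (requires network access)."""
    request = urllib.request.Request(
        query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        results = json.load(response)
    return [binding["v"]["value"] for binding in results["results"]["bindings"]]

url = query_url(ENDPOINT, QUERY)
```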

Building the CITE Services, then, is an exercise in constructing sufficient SPARQL queries to retrieve triples, based on their URN subjects.

We have found that this approach speeds up developing and exposing HMT data, using a CITE service written as a Java Servlet, with a Fuseki triple-store. The build-process that constructs a .ttl file of 443,000 lines, containing all of our HMT data (texts, objects, images), currently takes one minute, 13 seconds, on a three-year-old Mac Pro.

With this system, we have a very clean separation of concerns between archival data, served data, the network service, and end-user applications, with each standing entirely on its own. The archival texts and data are complete, of course, and end-users can retrieve them by resolving citations through the CITE service, and our RDF storage captures and makes explicit all intrinsic relationships among objects in our digital library. This strikes us as a clean, open, and forward-looking approach.

The cost is the time spent building the .ttl file (which is inconsiderable), and inefficiency in the middle layers, between the CITE Service and the SPARQL endpoint. It takes 10 SPARQL queries to retrieve and re-assemble one citation-node of a text, in response to a CTS GetPassagePlus query.
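The cost of those queries grows linearly with the number of citation-nodes requested. The following back-of-the-envelope sketch uses the figure of 10 queries per node reported above; the node count and the per-query latency are assumed values chosen only to show the arithmetic, not measurements from the project.

```python
QUERIES_PER_NODE = 10  # figure reported in the text

def sequential_cost(citation_nodes, seconds_per_query):
    """Return (total queries, total seconds) when every query is a separate round trip."""
    total_queries = citation_nodes * QUERIES_PER_NODE
    return total_queries, total_queries * seconds_per_query

# Assumed inputs: 100 citation-nodes and 20 milliseconds per round trip.
queries, seconds = sequential_cost(citation_nodes=100, seconds_per_query=0.02)
```

Under these assumptions a request for 100 nodes costs 1,000 round trips, so even a small per-query latency dominates the response time.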

This approach may not scale. In our ongoing collaboration with the Department of Informatik at Leipzig University, which is working toward implementing a CTS library containing tens of thousands of books, we are finding that the computing cost of making an average of 10 SPARQL queries for each requested citation-node, when the SPARQL server is hosting millions of statements, might require a more efficient, more clever solution that works directly with URNs at query time. Alternatively, a client that connects to a SPARQL endpoint using web sockets may solve what is primarily an I/O bottleneck. In either case, the value of having cleanly separated concerns (data, integration, storage, services, applications) will be even more apparent.

Source Code and Data

All data for the Homer Multitext is freely available.

Package | URL
Direct download of archived images | http://amphoreus.hpcc.uh.edu
Nexus Repository for versioned artifacts | http://beta.hpcc.uh.edu/nexus/index.html
HMT-XML: working repository for project data | https://github.com/neelsmith/hmtarchive
CITE-Manager: integrate, test, and build RDF from project data | https://github.com/neelsmith/citemgr
CITE-Servlet: CITE/CTS Services implemented as a Java servlet, querying a SPARQL endpoint | https://github.com/neelsmith/citeservlet
CITEKit: resolve CITE/CTS URNs to their objects via AJAX in HTML | https://bitbucket.org/Eumaeus/citekit

Notes

1 Blackwell, C., and D.N. Smith. “A Gentle Introduction to CTS & CITE URNs.” Homer Multitext Project Documentation (November 2012). http://www.homermultitext.org/hmt-doc/guides/urn-gentle-intro.html.

2 Smith, D. Neel, and Gabriel Weaver. “Applying Domain Knowledge from Structured Citation Formats to Text and Data Mining: Examples Using the CITE Architecture.” Text Mining Services (2009): 129.