ISAW Papers 7.5 (2014)

The Homer Multitext and RDF-Based Integration

Christopher W. Blackwell and D. Neel Smith

The Project

The Homer Multitext (HMT) is an international collaboration aimed at recovering and documenting the history of Greek epic poetry based on primary source documents, particularly the fragmentary papyri from late Antiquity and the annotated Byzantine codices of the Homeric Iliad. The data generated by the project consists of image files, transcriptions and translations of Greek and Latin texts in poetry and prose, commentary texts, and relationships among these. The digital library architecture that the project has developed since 2001 to manage this work is called CITE, for Collections, Indices, Texts and Extensions.

Through participation in the LAWDI workshop at New York University in the summer of 2012, the HMT’s editors recognized the significant potential of RDF triples not only as a means of linking between projects, but also as a means of capturing and integrating the project’s data for internal use. The simplicity of RDF, combined with its flexibility, freely available tools, and widespread support, moved us to integrate RDF into the heart of our workflow and architecture.

Separation of Concerns

Integrating a large, diverse, and evolving body of data, under development by a widely distributed group of scholars at many stages of their careers, from first-year students of Greek to tenured professors at major universities, requires rigorous attention to separation of concerns. We have tried to separate scholarly activities cleanly and to associate with each activity the archival data format most appropriate for it.

For example, it is most convenient to edit a transcription of a Greek text as a valid TEI-XML document. Subsequent analysis of that document, however, is made considerably more difficult by the arbitrarily deep hierarchical structure of any non-trivial XML text; for many kinds of analysis, processing a flattened, tabular format is preferable. For serving, querying, and sharing the archival material, an RDF triplestore is most efficient and broadly useful. For presentation of data delivered to web browsers for human or machine consumption, a combination of XML, XSLT, and CSS is most convenient. Since 2012, much of the development on the CITE architecture has focused on a test-driven environment for a publication cycle of edit, test, integrate, compile, serve, and format.

Task	Format	Tools	Validation
Editing
Texts	TEI-XML	Oxygen, vel sim.	RelaxNG Schema
Collections	Plain text, comma- or tab-delimited	Google Fusion Tables, Git Online Editor, &c.	csv/tsv parsing libraries
Building & Integrating
Testing Validity of Texts	flattened, tabular data	Gradle build-system	Custom scripts, Perseus Morphological Service
Integrating Texts, Data, and Images	RDF triples	Gradle build-system	Custom scripts
Publishing
Serving Data	RDF triples	Fuseki SPARQL endpoint, vel sim.
Discovery, Query, Retrieval	RDF triples > XML	CITE Servlet
End-user display	Citations > XML Fragments > HTML	CITEKit
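The move from hierarchical XML to a flattened, tabular form can be illustrated with a minimal sketch. It assumes a drastically simplified, un-namespaced document with hypothetical `div` and `l` elements carrying `n` attributes; real HMT editions are full TEI documents, and the project's own build scripts are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; real HMT editions are namespaced TEI-XML.
SAMPLE = """<text>
  <div n="1">
    <l n="1">μῆνιν ἄειδε θεά</l>
    <l n="2">οὐλομένην</l>
  </div>
</text>"""


def flatten(xml_string):
    """Return (citation, text) rows, one per cited line of the document."""
    root = ET.fromstring(xml_string)
    rows = []
    for div in root.iter("div"):
        for line in div.iter("l"):
            # The citation value is built from the hierarchy: book.line
            ref = div.get("n") + "." + line.get("n")
            content = "".join(line.itertext()).strip()
            rows.append((ref, content))
    return rows


def to_tsv(rows):
    """Serialize rows as tab-delimited text, one citation node per line."""
    return "\n".join("\t".join(row) for row in rows)


if __name__ == "__main__":
    print(to_tsv(flatten(SAMPLE)))
```

Each row pairs a citation value with its text content, which is the shape that later testing and integration steps can process without walking an XML tree.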

URN Citation

Every object in the HMT can be cited by a URN: CTS URNs for texts, CITE URNs for objects and images.¹

A Model of “Text”

The most complex data in the HMT are the texts, and for this separation of concerns to work we need a conceptual model of “text” that allows us to move from hierarchical XML to tabular data to RDF and back to hierarchical XML without loss.

For our purposes, we define a “text” as an “ordered hierarchy of citation objects”, following the OHCO² defined by Neel Smith and Gabriel Weaver.² By prioritizing units of citation over any other hierarchy, we can guarantee the most important scholarly activity: citation and retrieval. Any other content elements or orthogonal views of a text can be accommodated by this model, to a greater or lesser degree of granularity, depending on editorial decisions about citation.

OHCO² is implemented in the CITE architecture through the Canonical Text Services protocol, which defines a structure for a catalog, a small number of valid requests for discovery and retrieval, and the format of responses to those requests.

Collections and Images

CITE defines a “collection” as a group of data objects sharing a defined set of fields. Each object has a URN, and named fields defined in a catalog. Collections may be ordered or unordered; in an ordered collection each object has one field that defines its place in a sequence with an integer value.

Sub-references to URNs

CITE and CTS URNs define texts, collections, or images. Because scholarship demands citation to specific parts of objects (passages of text, regions-of-interest on objects, particular fields of a data-object), all CITE URNs may include a sub-reference, providing arbitrary granularity, specific to the object defined by the URN.

Type	URN	Points to…	Sub-reference?
CTS Text	urn:cts:greekLit:tlg0012.tlg001.msA	Homer, Iliad, edition of Manuscript A	none
CTS Text	urn:cts:greekLit:tlg0012.tlg001.msA:1.1	Homer, Iliad, edition of Manuscript A, Book 1, Line 1	none
CTS Text	urn:cts:greekLit:tlg0012.tlg001.msA:1.1@μῆνιν	Homer, Iliad, edition of Manuscript A, Book 1, Line 1	the string “μῆνιν”
CITE Image	urn:cite:hmt:vaimg.VA052RN-0053	`hmt` namespace, `vaimg` collection, image `VA052RN-0053`	none
CITE Image	urn:cite:hmt:vaimg.VA052RN-0053@0.1381,0.4192,0.3954,0.0368	`hmt` namespace, `vaimg` collection, image `VA052RN-0053`	a rectangular region-of-interest
CITE Object	urn:cite:hmt:venAsign.10	`hmt` namespace, `venAsign` collection, item 10	none
CITE Object	urn:cite:hmt:venAsign.10@GreekName	`hmt` namespace, `venAsign` collection, item 10	the contents of the field `GreekName` for this item
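The anatomy of the URNs in the table can be made concrete with a small parser. This is a sketch based only on the examples shown above, not on the formal CTS or CITE URN specifications; the `ParsedUrn` type and its field names are hypothetical.

```python
from typing import NamedTuple, Optional


class ParsedUrn(NamedTuple):
    kind: str                # "cts" or "cite"
    namespace: str           # e.g. "greekLit" or "hmt"
    identifier: str          # work hierarchy (CTS) or collection.object (CITE)
    passage: Optional[str]   # CTS passage such as "1.1"; None otherwise
    subref: Optional[str]    # whatever follows "@"; None if absent


def parse_urn(urn):
    """Split a CTS or CITE URN into its components."""
    # Split off the sub-reference first, since it may contain any characters.
    base, at_sign, subref = urn.partition("@")
    parts = base.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] not in ("cts", "cite"):
        raise ValueError("not a CTS or CITE URN: " + urn)
    passage = parts[4] if len(parts) > 4 and parts[4] else None
    return ParsedUrn(parts[1], parts[2], parts[3], passage,
                     subref if at_sign else None)


if __name__ == "__main__":
    for example in (
        "urn:cts:greekLit:tlg0012.tlg001.msA:1.1@μῆνιν",
        "urn:cite:hmt:venAsign.10@GreekName",
    ):
        print(parse_urn(example))
```

How a sub-reference is interpreted depends on the kind of object: a string within a passage, a region of an image, or a field name within a collection object.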

Doing the Work at Build-Time

Validation and Testing

We use Gradle to build our .ttl file, compiling XML texts, and collections and indices saved as .csv or .tsv files. Our build-scripts perform validation on XML files as well as other domain-specific tests. HMT-MOM (for “Mandatory Ongoing Maintenance”) includes scripts that enforce a specified canon of legitimate Unicode characters for Greek texts, ensure the referential integrity of URN values in indices and CITE Collection data structures, and support further manual review by providing visualizations of the state of completion for each folio our collaborators edit. HMT-MOM also does linguistic checking, matching each word-token against the Morpheus morphological parser; words that fail to match must be identified as non-standard forms actually present on a manuscript, non-lexical strings (numbers, case-endings, etc.), or new Greek vocabulary to be entered (ultimately) into a new lexicon of the language.
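Two of the checks described above, a character canon and referential integrity, can be sketched in a few lines. This is not HMT-MOM's code: the allowed character set below is an assumed stand-in (the Greek and Greek Extended Unicode blocks plus a little punctuation), and the function names are hypothetical.

```python
import unicodedata

# Assumed stand-in for the project's canon of legitimate characters;
# the real HMT list is not given in this paper.
ALLOWED_EXTRA = set(" ,.;·'\n")


def is_allowed(ch):
    """True for Greek and Coptic, Greek Extended, or listed punctuation."""
    cp = ord(ch)
    return 0x0370 <= cp <= 0x03FF or 0x1F00 <= cp <= 0x1FFF or ch in ALLOWED_EXTRA


def invalid_characters(text):
    """Return the sorted, distinct characters that fall outside the canon."""
    # Normalize first so that equivalent encodings are judged alike.
    normalized = unicodedata.normalize("NFC", text)
    return sorted({ch for ch in normalized if not is_allowed(ch)})


def dangling_references(index_pairs, known_urns):
    """Return index entries that mention a URN missing from known_urns."""
    return [(source, target) for source, target in index_pairs
            if source not in known_urns or target not in known_urns]


if __name__ == "__main__":
    print(invalid_characters("μῆνιν abc"))
```

Running such checks at build time means that a failed test stops the build before any flawed data reaches the triplestore.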

Inferencing

The URN-syntax of all CITE citations captures hierarchical relationships between group/work/edition/citation (for texts) or namespace/collection/object (for data objects and images). The citations show us that urn:cts:greekLit:tlg0012.tlg001.msA:1.11 (Homer, Iliad, MS. A edition, Book 1, line 11), belongs to urn:cts:greekLit:tlg0012.tlg001 (Homer, Iliad), and so forth.
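Because the hierarchy is encoded in the citation itself, the containing URNs can be derived by string manipulation alone. The function below is a sketch of that idea for CTS URNs written as they appear in this paper; the name `cts_ancestors` is hypothetical and the code is not taken from the CITE libraries.

```python
def cts_ancestors(urn):
    """List the URNs that contain this one, nearest container first."""
    base = urn.partition("@")[0]          # a sub-reference plays no part here
    parts = base.split(":")
    head = ":".join(parts[:3])            # e.g. "urn:cts:greekLit"
    work = parts[3].split(".")            # group, work, edition
    passage = parts[4].split(".") if len(parts) > 4 and parts[4] else []
    work_id = ".".join(work)

    ancestors = []
    # Shorten the passage one level at a time: 1.11 is contained by 1.
    for depth in range(len(passage) - 1, 0, -1):
        ancestors.append(head + ":" + work_id + ":" + ".".join(passage[:depth]))
    # A cited passage is contained by the text it cites.
    if passage:
        ancestors.append(head + ":" + work_id)
    # Shorten the work hierarchy: edition, then work, then text group.
    for depth in range(len(work) - 1, 0, -1):
        ancestors.append(head + ":" + ".".join(work[:depth]))
    return ancestors


if __name__ == "__main__":
    for ancestor in cts_ancestors("urn:cts:greekLit:tlg0012.tlg001.msA:1.11"):
        print(ancestor)
```

For the example in the text, the derived list includes urn:cts:greekLit:tlg0012.tlg001, the notional work to which line 1.11 of the MS. A edition belongs.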

Earlier versions of our CITE services (Perl, Cocoon, eXist, AppEngine) were processor-intensive, sorting out hierarchical relationships at N levels of depth on the fly. Some of the solutions to implementing a generic architecture for complex data were clever.

In the current implementation of CITE/CTS services, we take advantage of RDF to avoid cleverness at all costs. The Gradle build, citemgr, processes our XML texts, our tab-delimited and comma-delimited collections and indices, and their catalogues, and makes explicit at build time every relationship necessary to capture the model of a text or collection object.
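What "making relationships explicit at build time" might look like can be sketched as follows. The predicate names and the `http://example.org/vocab/` namespace are placeholders invented for this illustration; they are not the RDF verbs the HMT actually uses, and the code is not citemgr's.

```python
# Hypothetical vocabulary, for illustration only; NOT the HMT's RDF verbs.
VOCAB = "http://example.org/vocab/"


def explicit_triples(edition_urn, passages):
    """State containment and sequence for every citation node of an edition."""
    nodes = [edition_urn + ":" + passage for passage in passages]
    triples = []
    for position, node in enumerate(nodes):
        # Containment: each citation node belongs to its edition.
        triples.append((node, VOCAB + "belongsTo", edition_urn))
        # Sequence: record both neighbours so no query has to compute them.
        if position > 0:
            triples.append((node, VOCAB + "prev", nodes[position - 1]))
        if position < len(nodes) - 1:
            triples.append((node, VOCAB + "next", nodes[position + 1]))
    return triples


def to_ntriples(triples):
    """Serialize triples whose subjects, predicates, and objects are all IRIs."""
    return "\n".join("<%s> <%s> <%s> ." % triple for triple in triples)


if __name__ == "__main__":
    edition = "urn:cts:greekLit:tlg0012.tlg001.msA"
    print(to_ntriples(explicit_triples(edition, ["1.1", "1.2", "1.3"])))
```

The trade-off is a larger data file in exchange for simpler queries: nothing about order or containment has to be worked out when a request arrives.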

A complete list of the RDF verbs used to describe the Homer Multitext data is available through this query:


SPARQL Endpoint	`http://beta.hpcc.uh.edu:3030/ds/`
Query	`select distinct ?v where { ?s ?v ?o . }`
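A query like this one can be submitted over HTTP using the standard SPARQL protocol. The sketch below assumes that the dataset exposes a query service named `query` beneath the endpoint URL, a common Fuseki arrangement that this paper does not confirm, and that the server is still reachable; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen

ENDPOINT = "http://beta.hpcc.uh.edu:3030/ds/"
VERBS_QUERY = "select distinct ?v where { ?s ?v ?o . }"


def query_url(endpoint, sparql, service="query"):
    """Build a GET URL; the service name "query" is an assumption."""
    return urljoin(endpoint, service) + "?" + urlencode({"query": sparql})


def run_query(endpoint, sparql):
    """Send the query and decode a SPARQL JSON results document."""
    request = Request(query_url(endpoint, sparql),
                      headers={"Accept": "application/sparql-results+json"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def verbs(results):
    """Extract the values bound to ?v from a SPARQL JSON results document."""
    return [binding["v"]["value"] for binding in results["results"]["bindings"]]


if __name__ == "__main__":
    # Prints the request URL only; call run_query() to contact the server.
    print(query_url(ENDPOINT, VERBS_QUERY))
```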

Building the CITE Services, then, is an exercise in constructing sufficient SPARQL queries to retrieve triples, based on their URN subjects.

We have found that this accelerates our development and exposure of HMT data, using a CITE service written as a Java Servlet, with a Fuseki triple-store. The build-process that constructs a .ttl file of 443,000 lines, containing all of our HMT data (texts, objects, images), currently takes one minute, 13 seconds, on a three-year-old Mac Pro.

With this system, we have a very clean separation of concerns between archival data, served data, the network service, and end-user applications, with each standing entirely on its own. The archival texts and data are complete, of course, and end-users can retrieve them by resolving citations through the CITE service, and our RDF storage captures and makes explicit all intrinsic relationships among objects in our digital library. This strikes us as a clean, open, and forward-looking approach.

The cost is the time spent building the .ttl file (which is inconsiderable), and inefficiency in the middle layers, between the CITE Service and the SPARQL endpoint. It takes 10 SPARQL queries to retrieve and re-assemble one citation-node of a text, in response to a CTS GetPassagePlus query.

This approach may not scale. In our ongoing collaboration with the Department of Informatik at Leipzig University, which is working toward implementing a CTS library containing tens of thousands of books, we are finding that the computing cost of making an average of 10 SPARQL queries for each requested citation-node, when the SPARQL server is hosting millions of statements, might require a more efficient, more clever solution that works directly with URNs at query time. Alternatively, a client that connects to a SPARQL endpoint using web sockets may solve what is primarily an I/O bottleneck. In either case, the value of having cleanly separated concerns (data, integration, storage, services, applications) will be even more apparent.

Source Code and Data

All data for the Homer Multitext is freely available.

Package	URL
Direct download of archived images	http://amphoreus.hpcc.uh.edu
Nexus Repository for versioned artifacts	http://beta.hpcc.uh.edu/nexus/index.html
HMT-XML: working repository for project data	https://github.com/neelsmith/hmtarchive
CITE-Manager: integrate, test, and build RDF from project data	https://github.com/neelsmith/citemgr
CITE-Servlet: CITE/CTS Services implemented as a Java servlet, querying a SPARQL endpoint	https://github.com/neelsmith/citeservlet
CITEKit: resolve CITE/CTS URNs to their objects via AJAX in HTML	https://bitbucket.org/Eumaeus/citekit

Notes

¹ Blackwell, C., and D.N. Smith. “A Gentle Introduction to CTS & CITE URNs.” Homer Multitext Project Documentation (November 2012). http://www.homermultitext.org/hmt-doc/guides/urn-gentle-intro.html.

² Smith, D. Neel, and Gabriel Weaver. “Applying Domain Knowledge from Structured Citation Formats to Text and Data Mining: Examples Using the CITE Architecture.” Text Mining Services (2009): 129.