This article is available at the URI http://dlib.nyu.edu/awdl/isaw/isaw-papers/7/blackwell-smith/ as part of the NYU Library's Ancient World Digital Library in partnership with the Institute for the Study of the Ancient World (ISAW). More information about ISAW Papers is available on the ISAW website.
Except where noted, ©2014 Christopher W. Blackwell and D. Neel Smith; distributed under the terms of the Creative Commons Attribution License
ISAW Papers 7.5 (2014)
The Homer Multitext and RDF-Based Integration
Christopher W. Blackwell and D. Neel Smith
The Project
The Homer Multitext (HMT) is an international collaboration aimed at recovering and documenting the history of Greek epic poetry based on primary source documents, particularly the fragmentary papyri from late Antiquity and the annotated Byzantine codices of the Homeric Iliad. The data generated by the project consists of image files, transcriptions and translations of Greek and Latin texts in poetry and prose, commentary texts, and relationships among these. The digital library architecture that the project has developed since 2001 to manage this work is called CITE for Collections, Indices, Texts and Extensions.
Through participation in the LAWDI workshop at New York University in the summer of 2012, the HMT’s editors recognized the significant potential of RDF triples not only as a means of linking between projects, but of capturing and integrating the project’s data for internal use. The simplicity of RDF, combined with its flexibility, freely available tools, and widespread support, moved us to integrate RDF into the heart of our workflow and architecture.
Separation of Concerns
Integrating a large, diverse, and evolving body of data, under development by a widely distributed group of scholars at many stages of their career, from first-year students of Greek to tenured Professors at major universities, requires rigorous attention to separation of concerns. We have tried to separate scholarly activities cleanly, and associate with each activity an archival data format most appropriate for it.
For example, it is most convenient to edit a transcription of a Greek text as a valid TEI-XML document. Subsequent analysis of that document, however, is made considerably more difficult by the arbitrarily deep hierarchical structure of any non-trivial XML text; for many kinds of analysis, processing a flattened, tabular format is preferable. For serving, querying, and sharing the archival material, an RDF triplestore is most efficient and broadly useful. For presentation of data delivered to web browsers for human or machine consumption, a combination of XML, XSLT, and CSS is most convenient. Since 2012, much of the development on the CITE architecture has focused on a test-driven environment for a publication cycle of edit, test, integrate, compile, serve, and format.
Task | Format | Tools | Validation |
---|---|---|---|
Editing | |||
Texts | TEI-XML | Oxygen, vel sim. | RelaxNG Schema |
Collections | Plain text, comma- or tab-delimited | Google Fusion Tables, Git Online Editor, &c. | csv/tsv parsing libraries |
Building & Integrating | |||
Testing Validity of Texts | flattened, tabular data | Gradle build-system | Custom scripts, Perseus Morphological Service |
Integrating Texts, Data, and Images | RDF triples | Gradle build-system | Custom scripts |
Publishing | |||
Serving Data | RDF triples | Fuseki SPARQL endpoint, vel sim. | |
Discovery, Query, Retrieval | RDF triples > XML | CITE Servlet | |
End-user display | Citations > XML Fragments > HTML | CITEKit | |
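The move from hierarchical XML to a flattened, tabular format can be illustrated with a minimal sketch. It assumes a drastically simplified, namespace-free stand-in for a TEI edition (the element names `div` and `l` and the sample content are chosen for illustration only), and it is not the project's actual build code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a TEI edition: books contain lines.
# Real TEI documents use an XML namespace and far richer markup.
SAMPLE = """<text>
  <div n="1">
    <l n="1">μῆνιν ἄειδε θεὰ</l>
    <l n="2">οὐλομένην</l>
  </div>
</text>"""

def flatten(xml_string, base_urn):
    """Return one (urn, text) row per citable line."""
    root = ET.fromstring(xml_string)
    rows = []
    for div in root.iter("div"):
        for line in div.iter("l"):
            # The citation is built from the n attributes: book.line
            urn = f"{base_urn}:{div.get('n')}.{line.get('n')}"
            rows.append((urn, (line.text or "").strip()))
    return rows

rows = flatten(SAMPLE, "urn:cts:greekLit:tlg0012.tlg001.msA")
for urn, text in rows:
    print(f"{urn}\t{text}")
```

Each row pairs a citation with its text, which is the kind of two-column structure that tab-delimited tooling can test and analyze without walking an XML tree.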
URN Citation
Every object in the HMT can be cited by a URN—CTS URNs for texts, CITE URNs for objects and images.1
A Model of “Text”
The most complex data in the HMT are the texts, and for this separation of concerns to work we need a conceptual model of “text” that allows us to move from hierarchical XML to tabular data to RDF and back to hierarchical XML without loss.
For our purposes, we define a “text” as an “ordered hierarchy of citation objects”, following the OHCO2 defined by Neel Smith and Gabriel Weaver.2 By prioritizing units of citation over any other hierarchy, we can guarantee the most important scholarly activity: citation and retrieval. Any other content elements or orthogonal views of a text can be accommodated by this model, to a greater or lesser degree of granularity, depending on editorial decisions about citation.
OHCO2 is implemented in the CITE architecture through the Canonical Text Services protocol, which defines a structure for a catalog, a small number of valid requests for discovery and retrieval, and the format of responses to those requests.
Collections and Images
CITE defines a “collection” as a group of data objects sharing a defined set of fields. Each object has a URN, and named fields defined in a catalog. Collections may be ordered or unordered; in an ordered collection each object has one field that defines its place in a sequence with an integer value.
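As a rough illustration of an ordered collection, the sketch below reads comma-delimited records and sorts them by an integer sequence field. The collection `urn:cite:example:signs`, the field names, and the values are all invented for the example; they are not taken from the project's catalogs.

```python
import csv
import io

# Invented sample data: a tiny ordered collection saved as comma-delimited text.
SAMPLE_CSV = """urn,sequence,label
urn:cite:example:signs.2,2,second object
urn:cite:example:signs.1,1,first object
"""

def load_ordered_collection(csv_text, sequence_field="sequence"):
    """Parse the records and return them sorted by the sequence field."""
    objects = list(csv.DictReader(io.StringIO(csv_text)))
    for obj in objects:
        # The sequence field must hold an integer value.
        obj[sequence_field] = int(obj[sequence_field])
    return sorted(objects, key=lambda obj: obj[sequence_field])

collection = load_ordered_collection(SAMPLE_CSV)
for obj in collection:
    print(obj["sequence"], obj["urn"], obj["label"])
```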
Sub-references to URNs
CITE and CTS URNs define texts, collections, or images. Because scholarship demands citation of specific parts of objects (passages of text, regions-of-interest on objects, particular fields of a data-object), both kinds of URN may include a sub-reference, providing arbitrary granularity specific to the object defined by the URN.
Type | URN | Points to… | Sub-reference? |
---|---|---|---|
CTS Text | urn:cts:greekLit:tlg0012.tlg001.msA | Homer, Iliad, edition of Manuscript A | none |
CTS Text | urn:cts:greekLit:tlg0012.tlg001.msA:1.1 | Homer, Iliad, edition of Manuscript A, Book 1, Line 1 | none |
CTS Text | urn:cts:greekLit:tlg0012.tlg001.msA:1.1@μῆνιν | Homer, Iliad, edition of Manuscript A, Book 1, Line 1 | the string “μῆνιν” |
CITE Image | urn:cite:hmt:vaimg.VA052RN-0053 | hmt namespace, vaimg collection, image VA052RN-0053 | none |
CITE Image | urn:cite:hmt:vaimg.VA052RN-0053@0.1381,0.4192,0.3954,0.0368 | hmt namespace, vaimg collection, image VA052RN-0053 | a rectangular region-of-interest |
CITE Object | urn:cite:hmt:venAsign.10 | hmt namespace, venAsign collection, item 10 | none |
CITE Object | urn:cite:hmt:venAsign.10@GreekName | hmt namespace, venAsign collection, item 10 | the contents of the field GreekName for this item |
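The URN forms in the table can be taken apart with a few string operations. The following is a minimal sketch that covers only the shapes shown above; the `ParsedUrn` type and `parse_urn` function are hypothetical names, and the sketch ignores features of the full URN specifications, such as passage ranges.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ParsedUrn:
    kind: str                # "cts" or "cite"
    namespace: str           # e.g. "greekLit" or "hmt"
    identifier: str          # work hierarchy (CTS) or collection.object (CITE)
    passage: Optional[str]   # CTS passage reference, if any
    subref: Optional[str]    # whatever follows "@", if anything

def parse_urn(urn: str) -> ParsedUrn:
    # Split off the sub-reference first, since it may contain any characters.
    base, _, subref = urn.partition("@")
    parts = base.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] not in ("cts", "cite"):
        raise ValueError(f"not a CTS or CITE URN: {urn}")
    passage = parts[4] if parts[1] == "cts" and len(parts) > 4 else None
    return ParsedUrn(parts[1], parts[2], parts[3], passage, subref or None)

text = parse_urn("urn:cts:greekLit:tlg0012.tlg001.msA:1.1@μῆνιν")
image = parse_urn("urn:cite:hmt:vaimg.VA052RN-0053@0.1381,0.4192,0.3954,0.0368")
print(text)
print(image)
```

How a sub-reference is interpreted depends on the kind of object: a string within a passage, a rectangle on an image, or a field name on a collection object.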
Doing the Work at Build-Time
Validation and Testing
We use Gradle to build our .ttl file, compiling XML texts together with collections and indices saved as .csv or .tsv files. Our build-scripts perform validation on XML files as well as other domain-specific tests. HMT-MOM (for “Mandatory Ongoing Maintenance”) includes scripts that enforce a specified canon of legitimate Unicode characters for Greek texts, ensure the referential integrity of URN values in indices and CITE Collection data structures, and support further manual review by providing visualizations of the state of completion for each folio our collaborators edit. HMT-MOM also does linguistic checking, matching each word-token against the Morpheus morphological parser; words that fail to match must be identified as non-standard forms actually present on a manuscript, non-lexical strings (numbers, case-endings, etc.), or new Greek vocabulary to be entered (ultimately) into a new lexicon of the language.
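A character-canon test of the kind described here might look like the following sketch. The whitelist is an assumption made for illustration (any character whose Unicode name begins with GREEK or COMBINING, plus a handful of punctuation marks); the project's actual canon of legitimate characters is not reproduced here.

```python
import unicodedata

# Hypothetical punctuation whitelist, for illustration only.
ALLOWED_PUNCTUATION = set(" .,;·'’\n")

def is_allowed(ch):
    """Accept whitelisted punctuation and Greek or combining characters."""
    if ch in ALLOWED_PUNCTUATION:
        return True
    name = unicodedata.name(ch, "")
    return name.startswith("GREEK") or name.startswith("COMBINING")

def illegal_characters(text):
    """Return the sorted set of characters that fall outside the whitelist."""
    # Normalize first so that equivalent encodings are tested consistently.
    normalized = unicodedata.normalize("NFC", text)
    return sorted({ch for ch in normalized if not is_allowed(ch)})

print(illegal_characters("μῆνιν ἄειδε"))
print(illegal_characters("μῆνιν abc"))
```

A build script could fail, or flag a folio for review, whenever the returned list is not empty.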
Inferencing
The URN-syntax of all CITE citations captures hierarchical relationships between group/work/edition/citation (for texts) or namespace/collection/object (for data objects and images). The citations show us that urn:cts:greekLit:tlg0012.tlg001.msA:1.11 (Homer, Iliad, MS. A edition, Book 1, line 11), belongs to urn:cts:greekLit:tlg0012.tlg001 (Homer, Iliad), and so forth.
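The containment relationships implied by a CTS URN can be derived mechanically from its syntax. The sketch below is one way to list them, assuming the URN forms used in this article; the function name `ancestors` is hypothetical.

```python
def ancestors(cts_urn):
    """List the progressively more general URNs implied by a CTS URN."""
    base = cts_urn.partition("@")[0]          # ignore any sub-reference
    parts = base.split(":")
    prefix = ":".join(parts[:3])              # urn:cts:<namespace>
    work = parts[3].split(".")                # text group, work, edition
    passage = parts[4].split(".") if len(parts) > 4 else []
    result = []
    # Shorten the passage reference first (1.11 -> 1), then drop it entirely.
    for i in range(len(passage) - 1, 0, -1):
        result.append(f"{prefix}:{parts[3]}:{'.'.join(passage[:i])}")
    if passage:
        result.append(f"{prefix}:{parts[3]}")
    # Then shorten the work hierarchy (edition -> work -> text group).
    for i in range(len(work) - 1, 0, -1):
        result.append(f"{prefix}:{'.'.join(work[:i])}")
    return result

chain = ancestors("urn:cts:greekLit:tlg0012.tlg001.msA:1.11")
for urn in chain:
    print(urn)
```

For the example line, the chain runs from Book 1 of the edition, through the edition and the work, up to the text group.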
Earlier versions of our CITE services (perl, Cocoon, eXist, AppEngine) were processor-intensive, sorting out hierarchical relationships at N levels of depth on the fly. Some of the solutions to implementing a generic architecture for complex data were clever.
In the current implementation of CITE/CTS services, we take advantage of RDF to avoid cleverness at all costs. The Gradle build citemgr processes our XML texts, tab-delimited and comma-delimited collections and indices, and their catalogues, and makes explicit at build-time every relationship necessary to capture the model of a text or collection-object.
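To make the idea of explicit build-time relationships concrete, the sketch below writes out Turtle statements for a short sequence of citation nodes. The verb names under `http://example.org/hypothetical-verbs/` are placeholders invented for this example and are not the verbs actually used in the HMT data.

```python
# Placeholder namespace: these verbs are invented for illustration.
EX = "http://example.org/hypothetical-verbs/"

def turtle_literal(s):
    """Quote a string as a Turtle literal, escaping special characters."""
    escaped = s.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'

def triples_for_sequence(rows):
    """Spell out content, position, and neighbors for each (urn, text) row."""
    lines = []
    for i, (urn, text) in enumerate(rows):
        lines.append(f"<{urn}> <{EX}hasTextContent> {turtle_literal(text)} .")
        lines.append(f"<{urn}> <{EX}hasSequence> {i + 1} .")
        if i > 0:
            lines.append(f"<{urn}> <{EX}prev> <{rows[i - 1][0]}> .")
        if i < len(rows) - 1:
            lines.append(f"<{urn}> <{EX}next> <{rows[i + 1][0]}> .")
    return lines

sample_rows = [
    ("urn:cts:greekLit:tlg0012.tlg001.msA:1.1", "μῆνιν ἄειδε θεὰ"),
    ("urn:cts:greekLit:tlg0012.tlg001.msA:1.2", "οὐλομένην"),
]
statements = triples_for_sequence(sample_rows)
print("\n".join(statements))
```

Because every relationship is stated outright, a service answering a request needs only to look statements up; it does not have to compute ordering or containment at query time.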
A complete list of the RDF verbs used to describe the Homer Multitext data is available through this query:
SPARQL Endpoint | http://beta.hpcc.uh.edu:3030/ds/ |
Query | select distinct ?v where { ?s ?v ?o . } |
Building the CITE Services, then, is an exercise in constructing sufficient SPARQL queries to retrieve triples, based on their URN subjects.
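A client can submit the query above over HTTP following the SPARQL protocol, which passes the query text in a `query` parameter. The sketch below uses only the Python standard library. The service name `query` appended to the endpoint is an assumption (a common Fuseki default, not confirmed by this article), and the endpoint may no longer be reachable, so the network call is defined but not executed.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://beta.hpcc.uh.edu:3030/ds/"   # endpoint as given in this article
SERVICE = "query"                               # assumed Fuseki service name
QUERY = "select distinct ?v where { ?s ?v ?o . }"

def build_request_url(endpoint, service, query):
    """Encode the query as the 'query' parameter of a GET request."""
    return endpoint + service + "?" + urllib.parse.urlencode({"query": query})

def distinct_verbs(endpoint=ENDPOINT, service=SERVICE, timeout=30):
    """Run the query and return the value bound to ?v in each result row."""
    request = urllib.request.Request(
        build_request_url(endpoint, service, QUERY),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        results = json.load(response)
    return [row["v"]["value"] for row in results["results"]["bindings"]]

url = build_request_url(ENDPOINT, SERVICE, QUERY)
print(url)
```

Calling `distinct_verbs()` against a live endpoint would return the list of verbs as strings.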
We have found that this approach accelerates the development and exposure of HMT data, using a CITE service written as a Java Servlet, with a Fuseki triple-store. The build-process that constructs a .ttl file of 443,000 lines, containing all of our HMT data (texts, objects, images), currently takes one minute, 13 seconds, on a three-year-old Mac Pro.
With this system, we have a very clean separation of concerns between archival data, served data, the network service, and end-user applications, with each standing entirely on its own. The archival texts and data are complete, of course, and end-users can retrieve them by resolving citations through the CITE service, and our RDF storage captures and makes explicit all intrinsic relationships among objects in our digital library. This strikes us as a clean, open, and forward-looking approach.
The cost is the time spent building the .ttl file (which is inconsiderable), and inefficiency in the middle layers, between the CITE Service and the SPARQL endpoint. It takes 10 SPARQL queries to retrieve and re-assemble one citation-node of a text, in response to a CTS GetPassagePlus query.
This approach may not scale. In our ongoing collaboration with the Department of Informatik at Leipzig University, which is working toward implementing a CTS library containing tens of thousands of books, we are finding that the computing cost of making an average of 10 SPARQL queries for each requested citation-node, when the SPARQL server is hosting millions of statements, might require a more efficient, more clever solution that works directly with URNs at query time. Alternatively, a client that connects to a SPARQL endpoint using web sockets may solve what is primarily an I/O bottleneck. In either case, the value of having cleanly separated concerns (data, integration, storage, services, applications) will be even more apparent.
Sourcecode and Data
All data for the Homer Multitext is freely available.
Package | URL |
---|---|
Direct download of archived images | http://amphoreus.hpcc.uh.edu |
Nexus Repository for versioned artifacts | http://beta.hpcc.uh.edu/nexus/index.html |
HMT-XML: working repository for project data | https://github.com/neelsmith/hmtarchive |
CITE-Manager: integrate, test, and build RDF from project data | https://github.com/neelsmith/citemgr |
CITE-Servlet: CITE/CTS Services implemented as a Java servlet, querying a SPARQL endpoint | https://github.com/neelsmith/citeservlet |
CITEKit: resolve CITE/CTS URNs to their objects via AJAX in HTML | https://bitbucket.org/Eumaeus/citekit |
Notes
1 Blackwell, C., and D.N. Smith. “A Gentle Introduction to CTS & CITE URNs.” Homer Multitext Project Documentation (November 2012). http://www.homermultitext.org/hmt-doc/guides/urn-gentle-intro.html.
2 Smith, D. Neel, and Gabriel Weaver. “Applying Domain Knowledge from Structured Citation Formats to Text and Data Mining: Examples Using the CITE Architecture.” Text Mining Services (2009): 129.