This article is available at the URI http://dlib.nyu.edu/awdl/isaw/isaw-papers/20-8/ as part of the NYU Library's Ancient World Digital Library in partnership with the Institute for the Study of the Ancient World (ISAW). More information about ISAW Papers is available on the ISAW website.

©2021 Thomas Koentges; text and images distributed under the terms of the Creative Commons Attribution 4.0 International (CC-BY) license.
ISAW Papers 20.8 (2021)

The CITE Architecture: Q & A Regarding CTS and CITE

Thomas Koentges, Christopher Blackwell, Gregory Crane, Neel Smith, James Tauber

In: Sarah E. Bond, Paul Dilley, and Ryan Horne, eds. 2021. Linked Open Data for the Ancient Mediterranean: Structures, Practices, Prospects. ISAW Papers 20.

URI: http://hdl.handle.net/2333.1/fttdz9dt

Abstract: In this conversation we tried to capture different voices in the communities using and developing the CTS/CITE framework. By doing so, this contribution offers a beginner-friendly introduction that answers three questions: what, why, and how? Although there are traditional articles available for CTS/CITE, the authors (a group of users, developers, and project leaders working with CTS) realised that most of our insights were generated through conversations. Important progress, such as CEX (a simple text format now common to exchange whole CTS/CITE libraries), was only possible through constant communication. This chapter invites the reader into this conversation and also acts as a demonstration of how one can generate citable text using CTS identifiers and textual nodes.

Library of Congress Subjects: Classical literature--Bibliography--Methodology; Machine-readable bibliographic data.

[urn:cts:scholarship:lod.discussion:intro.1]

The CITE architecture is a highly precise framework to reference research data in textual humanities in a machine-actionable way. Traditional reference systems rely on a good deal of human interpretation. For instance, traditionally Plato is often cited according to Stephanus pages and Stephanus paragraphs; that is, identifiers inspired by the edition of the Corpus Platonicum published by Henri Estienne in 1578. Yet, if one looks at different modern editions of Plato’s text, precisely where a certain paragraph or page starts can differ from edition to edition, because Stephanus’ original paragraphs often break the text mid-sentence and sometimes even mid-word. As a result, comparing the text of two different editions requires some adjustments and interpretation by a human reader. Taking another example, a reader has to be aware that Petronii Arbitri Affranii Satyrici lib., Petronii Arbitri Satyri fragmenta ex libro quinto decimo et sexto decimo, Petronii Arbitri Satiricon, Petronii Arbitri Satyricon, Petronii Arbitri Satirici fragmenta, and excerta petronii satiriei, point to the same work.

[urn:cts:scholarship:lod.discussion:intro.2]

Because the CITE architecture is highly precise, it is used in many historical text digitization efforts and research projects: most significantly, in the Homer Multitext Project (the project for which the CITE architecture was originally created), the Perseus Digital Library and Scaife Digital Library Viewer at Tufts University, and the Open Greek and Latin Project (OGL), which, between them, have created the largest openly-licensed, structured machine-actionable corpus of classical Latin and Greek literature. It is also used by Das Deutsche Textarchiv, the neo-Latin Croatiae auctores Latini bibliotheca electronica (CroALa), the Arabic Kitab project, the Persian Digital Library, and the Sanskrit Digital Critical Edition of the Nyāyabhāṣya project.

[urn:cts:scholarship:lod.discussion:intro.3]

To explain the fundamental concepts and applications of the framework, I have gathered a group of experts who work with the CITE architecture on a daily basis. Although I have had many real-world discussions with each participant, what follows is an entirely virtual discussion, written in a collaborative editing environment. I, Thomas Koentges, am an Assistant Professor in Digital Humanities at the University of Leipzig, Germany, and a Fellow for Historical Language Processing and Data Analysis at Harvard’s Center for Hellenic Studies. I am an associate of the Perseus Digital Library and have worked with Christopher Blackwell and Neel Smith on CITE microservices. My interlocutors, in alphabetical order, are: Christopher Blackwell, who is the Louis G. Forgione University Professor of Classics at Furman University and one of the two project architects of the Homer Multitext Project (HMT), the requirements of which were the original impetus for developing CITE; Gregory Crane, who is the Editor-in-Chief of the Perseus Digital Library, Professor in Classics at Tufts University, and has been working in the field that some call Digital Classics; Neel Smith, who is Professor of Classics at the College of the Holy Cross, faculty advisor to the Holy Cross Manuscripts, Inscriptions, and Documents Club, the other project architect of the HMT, and a founder of the CITE Architecture project; and last but not least, James Tauber, who is the CEO and founder of Eldarion, Inc., works as both a professional software developer and researcher in digital philology, is involved with the Perseus Digital Library and OGL, and led the development of the new Scaife Digital Library Viewer, which makes extensive use of CTS.

[urn:cts:scholarship:lod.discussion:intro.4]

In addition to explaining the CITE architecture, this chapter is itself an example of a CTS-compliant text. Each logical passage of this text has a CTS URN. URN is short for Uniform Resource Name and describes an identifier that follows a URN scheme. These resource names are intended to serve as unique, persistent location-independent resource identifiers. CTS is short for Canonical Text Services and CTS URNs are the notation used in the CITE architecture to identify citable textual data. For example, the identifier for this introductory section prefacing the discussion is urn:cts:scholarship:lod.discussion:intro, and if we drill even deeper, then this paragraph is urn:cts:scholarship:lod.discussion:intro.4. If you were to send this second URN to a CTS text collection containing that passage, you would retrieve the textual data: that is, this exact paragraph.
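As a rough illustration of the notation described above, the following Python sketch splits a CTS URN into its colon-separated components and checks whether one passage reference sits beneath another. The names `CtsUrn`, `parse_cts_urn`, and `contains` are hypothetical and do not belong to any CITE library; the sketch assumes a URN of the shape urn:cts:namespace:work[:passage] and ignores further features of the notation.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str           # e.g. "scholarship"
    work: str                # dot-separated work identifier, e.g. "lod.discussion"
    passage: Optional[str]   # dot-separated passage reference, or None


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN on colons (assumed shape: urn:cts:namespace:work[:passage])."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {urn!r}")
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(parts[2], parts[3], passage)


def contains(outer: str, inner: str) -> bool:
    """True if `inner` cites the same passage as `outer` or a passage beneath it."""
    a, b = parse_cts_urn(outer), parse_cts_urn(inner)
    if (a.namespace, a.work) != (b.namespace, b.work):
        return False
    if a.passage is None:
        return True   # a work-level URN covers every passage of that work
    if b.passage is None:
        return False
    return b.passage == a.passage or b.passage.startswith(a.passage + ".")


section = "urn:cts:scholarship:lod.discussion:intro"
paragraph = "urn:cts:scholarship:lod.discussion:intro.4"
print(parse_cts_urn(paragraph))
print(contains(section, paragraph))
```

The containment check works purely on the identifier; retrieving the text itself would still require a CTS text collection that holds the passage.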

[urn:cts:scholarship:lod.discussion:intro.5]

An important feature of this way of referring to textual data is that it is medium independent; that is, the CTS URN refers to the textual information itself, rather than how it is displayed, regardless of whether it is split by medium-dependent borders (such as the pagination of a physical book or word-count based pagination of a visualization). For instance, imagine that you want to talk about this introduction with a myopic colleague in a seminar. Both of you print this text with a slightly different font size. In your version of the text, this paragraph is on page 1, while your colleague’s version has the paragraph on page 2. Because of the CTS URN, both of you can pinpoint this part of the discussion using urn:cts:scholarship:lod.discussion:intro.5, without any confusion about which page number to reference.

[urn:cts:scholarship:lod.discussion:intro.6]

While this is, of course, an overly simple example, given that the introduction is obviously at the beginning, imagine that you want to discuss a passage in the middle of a long text or that this discussion will be reprinted with entirely different page numbers. You probably see where I am coming from, but still you might say “but traditional print publishing has ways of dealing with the latter” and I would answer “yes, that is true. But also imagine all the possible different forms in which this article could be published.” Whether that section of text is published on a blog, in a book, or read aloud, the textual information of that passage is always referenceable with the same URN. This is worth emphasizing: A CTS URN can cite a passage of text in a printed edition as easily as a passage in an electronic text. The way textual information is managed with CTS URNs moves beyond the page, beyond the book, beyond the webpage: even beyond publishing or visualization forms we haven’t yet imagined. The CTS URN points to the textual information itself. How you publish, display, voice, or visualize it is up to you.

[urn:cts:scholarship:lod.discussion:intro.7]

Just as you can precisely reference textual data with CTS URNs, you can also reference non-textual data or metadata with CITE2 URNs. A CITE2 URN can identify a physical folio of a manuscript in Venice as easily as it can cite a data record in digital form. The principle is always the same: you have an identifier and you have information that is referenced by this identifier. Scholarship is created by connecting and analyzing information. By using the CITE architecture, you make your scholarship machine actionable and machine understandable.

[urn:cts:scholarship:lod.discussion:intro.8]

But I am getting ahead of myself and have probably already used a lot of jargon that has not been properly explained. Thus, in order to unpack this complex topic and show what CTS/CITE is, I will turn to a group of experts to explain what it can and can’t do, and most importantly, how historical language researchers can use it to their advantage.

[urn:cts:scholarship:lod.discussion:question.1]

Thomas Koentges: Chris, Greg, Neel, James, thank you for joining me and for agreeing to this format. Given how important Plato is for our field, I hope the dialogue format is not too unorthodox. Firstly, to go back to the beginning, Chris and Neel, when and why did you create the CITE architecture?

[urn:cts:scholarship:lod.discussion:answer.1]

Chris Blackwell & Neel Smith: Casey Dué, Mary Ebbott, and Gregory Nagy conceived a project to produce a 21st-century edition of the Homeric Iliad, an edition based on their understanding of the nature of that text as a product of a thousand-year evolution of a tradition of composition-in-performance. This edition would be a “multitext” rather than a critical edition; the goal would not be to analyze the surviving witnesses to the text and, when they differ on a line or passage, throw away all but a single, ostensibly “authentic” version. Rather, they see that in the transmission of the Iliad the various differing readings preserved in manuscripts and ancient commentaries reflect a fundamental aspect of the Iliad: that variation is an essential part of this poetry. When we were brought on to this project, we knew we would need to be able to talk precisely and imprecisely about many versions of the “same” text. Precisely, as in “Book 2, line 4 of the Iliad as preserved on the Venetus A manuscript”; imprecisely, as in “Book 10 of the Iliad, in all of its manifestations through its transmission history.” We also knew that this would be a long task, which motivated us to try to separate concerns as much as possible, on the assumption that generations of technologies would likely pass into obsolescence before we had a scholarly product. Canonical citation has worked as a linking mechanism for centuries, and CITE is just a machine-actionable update to that ancient technology.

[urn:cts:scholarship:lod.discussion:question.2]

TK: So it was invented by philologists for philologists and as a result it has already been implemented in several edition projects. Who, then, can use the CITE architecture?

[urn:cts:scholarship:lod.discussion:answer.2a]

CB & NS: Anyone with canonically citable texts or data. So what does that mean? We can start with texts as an example. If you have a text, and you can assign identifiers to “text-group,” “work,” and “version,” and you can assign a unique identifier to each of its parts (which may or may not be in a hierarchy), and the contents of those parts are not going to change, then you have a canonically citable text. This seems simple to us, but we have been surprised at how many electronic texts are put online without attention to their long-term status as objects of study. Scholars with unique texts, or texts with complicated histories, seem unwilling simply to assign an arbitrary identifier to a text-group (especially), or work, or version. There seems to be a perceived need for an identifier to be descriptive, and (with that mistake made), for the descriptive identifier to have the burden of saying everything about the text. This is an unhelpful attitude. A scholar with a newly edited papyrus, for example, can just define IDs in an independent namespace and proceed: urn:cts:papyrins:02034.34.1:1 is a perfectly good URN. CTS URNs, after all, are generally going to be handled by machines.

Another stumbling block seems to be the necessity that a canonically citable text not change. Digital scholars love doing dynamic things. While a text-under-constant-editing (or a data-collection) may be a valuable locus of scholarly activity, it is not yet a published scholarly work. Publish a text citable by CTS URN, and if you edit it later, publish a new version. This is one of the important benefits of CTS, which allows many versions to be aligned.

[urn:cts:scholarship:lod.discussion:answer.2b]

James Tauber: In terms of usage, I think it’s worth considering not only the publishers of texts (and, as Chris points out, other citable data) but those consuming such texts or wanting to say something about those texts or passages within them. You don’t need to be publishing new works to benefit from CTS. On the consumption side, one application of CTS is in building reading environments like the Scaife Digital Library Viewer. Canonical references can be used for retrieval of texts from a repository to read, as bookmarks, for reading lists, and so on. Another is in text-processing tools that ingest machine-actionable texts. Being able to retrieve texts and text passages via a uniform identifier can help improve the reproducibility of text-processing tasks because you can record the CTS URNs of the exact passages of a particular version you retrieved and others can repeat the same. Finally, CTS URNs become a powerful identifier for those wanting to say something about a text, such as annotators or commentary writers. One can conceive of CTS-based citation servers which can be queried for known annotations or commentaries on a particular passage. In this way the publishers of texts, the readers of those texts, and the commentators on those texts can work without even knowing the others exist, but things interoperate through common use of CTS URNs.

[urn:cts:scholarship:lod.discussion:question.3]

TK: Thanks for mentioning the Scaife Digital Library Viewer (https://scaife.perseus.org), James, which is the latest environment in which people can read the CTS-compatible text of OGL, including the Perseus Digital Library, which used the CITE architecture from early on. Greg, when and why did you move to CTS?

[urn:cts:scholarship:lod.discussion:answer.3]

Gregory Crane: It took me a long time to understand the significance of CTS. We have for many years been able to call up a particular chunk of a particular version of a particular text with a particular URL; that is, get me this precise span of words in the first Murray edition of Aeschylus’ Agamemnon. It took me a while before I really understood the difference between our functionality and a generalized API. We started to shift our data to CTS in 2012.

[urn:cts:scholarship:lod.discussion:question.4]

TK: So it is used by large-scale projects, but can also be used by any individual researcher or programmer handling canonical text. However, when introducing CTS, I often encounter the argument: “Well, that might work for your text, but our text is special.” What do you think of this argument and do you know of any “special cases” that could not be expressed through the CITE architecture framework?

[urn:cts:scholarship:lod.discussion:answer.4]

CB & NS: There may well be some extraordinary texts that are incapable of being cited by a CTS URN. However, I have never actually seen one or heard one convincingly described. Were such a text to exist, it is hard to imagine how to pursue scholarly research on a text that cannot be cited. I have heard this claim many times, yet it always arises from one of three problems. Either the speaker actually has no text or does not understand the text, or the speaker is not content to let a citation consist of arbitrary identifiers but insists on overloading the CTS URN in an effort to turn it into a library catalog entry, or the speaker is trying to create an impossibly deep citation hierarchy (e.g., urn:cts:sometexts:group.work.ed:1.3.45.2.6.78.note.123.4). I would note that the texts people actually have cared about over the centuries have one-, two-, or at most three-level citation hierarchies. At Holy Cross, we surveyed every classical Greek text we could identify, and couldn’t find any that required more than three levels in their passage hierarchy.

[urn:cts:scholarship:lod.discussion:question.5]

TK: Yet, because some researchers still assume that their text cannot be expressed in CTS (whether or not that is a correct assumption), there is a push for another standard called DTS. What do you know about it and how does it differ from CTS?

[urn:cts:scholarship:lod.discussion:answer.5]

JT: One non-technical difference between CTS and DTS is the way in which the specifications are being developed. There are two common approaches to writing standards: you either extract the details from a working system that’s already proven or you get a bunch of stakeholders and agree on how things should interoperate. Each has advantages and disadvantages. CTS takes the former approach, DTS the latter. On the technical side, one major difference is in the style of the protocol. The original CTS protocol is reminiscent of the kind developed in the late 1990s around XML-RPC and SOAP. DTS is being built with modern approaches to Web API design and relies more on existing specifications like Hydra and JSON-LD.

[urn:cts:scholarship:lod.discussion:question.6]

TK: It’s definitely a development one should keep in mind, although as Chris pointed out, CTS could be applicable to lots of use cases. Chris, for people who might assume that their text does not fit the requirements for a successful implementation of CTS, I would like to return to your comment that the text content of the CTS nodes cannot be changed. In my experience that is often misunderstood. Taking your example, I know that urn:cts:papyrins:02034.34.1:1 contains specific unchangeable text, but one might argue that the same text in this node could be represented differently to emphasize different features. How does CTS deal with this?

[urn:cts:scholarship:lod.discussion:answer.6]

CB: My grandfather was a Baptist minister and had many Bibles. He had at least two that were the King James translation. One of these was fancy, with small print, and had the words of Jesus printed in red. Another was a “large print” edition. It is easy to see that “John 3:16” in each of these contained the same text. It is equally easy to see that if my 98-year-old grandfather wanted to read that citable passage, it would really matter which of the two versions he picked up. The two versions differed, not in the text, but in the markup that presented that text. CTS is good for making distinctions like this while preserving scholarly identity. URNs to my grandfather’s two Bibles would share textgroup (“New Testament”), work (“John”), and version (“KJV”) components; they would also share a citation (“3.16”). But we would call each of these an analytical exemplar, a further level of the bibliographic hierarchy after version. For example:

  • urn:cts:bibles:nt.john.kjv.rubricated:3.16
  • urn:cts:bibles:nt.john.kjv.large:3.16

The editors of the HMT have produced an edition of the Greek text of the Venetus A manuscript of the Iliad. The archival edition is a collection of TEI-XML documents. But we publish many exemplars of that edition, sub-versions derived from the edition according to defined analytical principles. One exemplar might be a plain-text version with all abbreviations expanded. Another exemplar might be normalized according to modern orthographic practices for Ancient Greek. All of these are “the text of the Venetus A manuscript,” and yet all of them are distinct, citable data. There may be other scholarly frameworks that also give this level of control over scholarly identity, but CTS and CITE have served our project very well.
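To make the relationship between exemplars concrete, here is a minimal Python sketch that breaks the work component of the two Bible URNs above into the levels of the bibliographic hierarchy and compares them. The function names are hypothetical, and the sketch assumes that the dot-separated work component lists text group, work, version, and exemplar in that order.

```python
LEVELS = ("textgroup", "work", "version", "exemplar")


def bibliographic_levels(urn: str):
    """Return (namespace, levels, passage) for a CTS URN with a passage part."""
    parts = urn.split(":")
    if len(parts) != 5 or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"expected urn:cts:namespace:work:passage, got {urn!r}")
    # Pair each dot-separated identifier with its level name, in order.
    levels = dict(zip(LEVELS, parts[3].split(".")))
    return parts[2], levels, parts[4]


def same_text_different_exemplar(urn_a: str, urn_b: str) -> bool:
    """True if both URNs cite the same passage of the same version
    but name different exemplars of that version."""
    ns_a, levels_a, passage_a = bibliographic_levels(urn_a)
    ns_b, levels_b, passage_b = bibliographic_levels(urn_b)
    same_version = all(levels_a.get(k) == levels_b.get(k) for k in LEVELS[:3])
    return (
        ns_a == ns_b
        and same_version
        and passage_a == passage_b
        and levels_a.get("exemplar") != levels_b.get("exemplar")
    )


rubricated = "urn:cts:bibles:nt.john.kjv.rubricated:3.16"
large_print = "urn:cts:bibles:nt.john.kjv.large:3.16"
print(bibliographic_levels(rubricated))
print(same_text_different_exemplar(rubricated, large_print))
```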

[urn:cts:scholarship:lod.discussion:question.7]

TK: One has to point out though that everything is very text-focused. Chris and Neel, why did you make text objects first-class citizens in the CITE architecture?

[urn:cts:scholarship:lod.discussion:answer.7]

CB & NS: We are philologists, and we started with the Iliad. But more specifically, Neel gave a presentation (ca. 2002) at the Center for Hellenic Studies, “Toward a Text Server,” in which he articulated the basic requirements of machine-actionable canonical citation, and we went from there. In the early years we thought of these as a controlled set of parameters sent to a service; URNs came later.

[urn:cts:scholarship:lod.discussion:question.8]

TK: Speaking of which, I already mentioned above (in urn:cts:scholarship:lod.discussion:intro.7) that the CITE architecture differentiates between two types of URNs. What are they?

[urn:cts:scholarship:lod.discussion:answer.8]

CB & NS: CTS URNs are for identifying passages of text, where the “text” is an ordered hierarchy of citation objects in a bibliographic hierarchy of group, work, version, exemplar. CITE2 URNs are for citing discrete objects in a collection of objects sharing a similar structure—essentially, “everything else.”
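A small Python sketch of how software might tell the two kinds of identifier apart: it looks only at the scheme that follows the urn: prefix. The function name `urn_kind` and the descriptive labels it returns are illustrative assumptions, not part of the CITE specifications.

```python
def urn_kind(urn: str) -> str:
    """Classify a URN as citing a text passage (CTS) or a collection object (CITE2)."""
    parts = urn.split(":")
    if len(parts) < 3 or parts[0] != "urn":
        raise ValueError(f"not a URN: {urn!r}")
    kinds = {
        "cts": "text passage",         # ordered hierarchy of citation objects
        "cite2": "collection object",  # discrete object in a collection
    }
    if parts[1] not in kinds:
        raise ValueError(f"neither a CTS nor a CITE2 URN: {urn!r}")
    return kinds[parts[1]]


for example in (
    "urn:cts:scholarship:lod.discussion:intro.7",
    "urn:cite2:myimages.vatican:1",
):
    print(example, "->", urn_kind(example))
```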

[urn:cts:scholarship:lod.discussion:question.9]

TK: So it all started with text and began just a few years after the invention of XML, the format preferred by the Text Encoding Initiative to encode textual editions. Speaking of TEI XML, I know that some people think that the CITE architecture is directly connected to XML, which is not quite correct. Could you clarify how the CITE architecture is related to (TEI) XML?

[urn:cts:scholarship:lod.discussion:answer.9a]

NS: The CITE architecture is related to XML in much the same way that an inventory of food items in my kitchen is related to a container, such as my spice rack. Sometimes, there will be a mapping between the two (oregano is on the top left of the container/spice rack), but in other cases there will be no relation (coffee beans are in a different cabinet; milk is in the refrigerator).

[urn:cts:scholarship:lod.discussion:answer.9b]

CB: CITE is a collection of protocols based on defined data models. Since you ask about TEI XML, and therefore textual things: CTS is based on a model of “text” that is an ordered hierarchy of citation objects. TEI XML is a markup vocabulary, not a data model. It is possible to create a text in TEI XML that is an ordered hierarchy of citation objects and thus valid for CTS, but it is also possible to create TEI XML documents that are not.

I downloaded a TEI edition of Herodotus that had been online for years, and when I tried to process it for use in CTS I discovered that its citation-values were not unique; that is, there were two passages identified as “6.32.” So this was not valid as a CTS text, despite being perfectly valid TEI.

For another example, take any large TEI XML edition of a text (Herodotus will serve), open it in a text editor, and scroll to some arbitrary point in the middle. Now, quickly, what passage are you looking at? Probably, you have some identifier on the paragraph or division that you see, but you would need to scroll around to figure out any higher-level citation values. In contrast, in any CTS environment, you will never see a passage of text without knowing its precise citation. It is of course quite possible to attach full citation-values to each citable passage in a TEI XML text. This would make it much, much easier to transform for CTS, and I wish more editors did so.

[urn:cts:scholarship:lod.discussion:answer.9c]

JT: In many ways I view a CTS system as consisting of the same tripartite set of specifications as the Web itself: an addressing scheme, a resource retrieval protocol, and a resource content format. TEI XML can definitely serve as the resource content format for a CTS system. As Chris points out, though, being a valid CTS text in TEI XML is a stricter condition than just being a valid TEI XML document. Crucial to making a TEI XML document compatible with CTS is a references declaration that says how to map a passage reference like 2.35 to the structure of the XML. Furthermore, as Chris also highlights, such references must be unique. This is not a burdensome requirement, however, and passage identifier uniqueness is worth testing on TEI XML documents whether you’re using CTS or not.
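The uniqueness test mentioned here is simple to sketch. The Python below assumes the passage references have already been extracted from a document into a list (how to extract them depends on the document's references declaration and is not shown); the function name `duplicate_references` and the sample list are hypothetical.

```python
from collections import Counter


def duplicate_references(references):
    """Return each passage reference that occurs more than once, in first-seen order."""
    counts = Counter(references)
    return [ref for ref, n in counts.items() if n > 1]


# Invented sample imitating the Herodotus case described above,
# where two passages were both identified as "6.32".
extracted = ["6.31", "6.32", "6.32", "6.33"]

duplicates = duplicate_references(extracted)
if duplicates:
    print("Not valid as a CTS text; repeated references:", duplicates)
else:
    print("All passage references are unique.")
```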

[urn:cts:scholarship:lod.discussion:question.10]

TK: Yes, I think it is very important to stress that CTS is not tied to XML, and for a while now we have been searching for simpler formats that can be more easily adopted by edition projects and traditional scholars. Following workshop discussions at DH2016 in Kraków, Neel devised the flat tabular format 82XF and, from there, CEX was developed, a format that kept the simple approach but was extended to include all kinds of linked data. Without going into lengthy data format discussions, what do you see as the advantages of using CEX alongside or instead of XML? I know that the Kitab project team uses CEX while building their corpus because it is a flat format that is easier to maintain. Did such thinking influence you when designing CEX?

[urn:cts:scholarship:lod.discussion:answer.10a]

CB & NS: With a flat, tabular format, aggregation, disaggregation, and transformation become incredibly easy, since we all have access to a body of Unix utilities, dating back to the 1960s, aimed precisely at these problems. Also, with CEX, we can aggregate any type of CTS or CITE data in a single file, as we do for HMT data-releases. I suppose that might be possible with some elaborate XML schema, but I would not want to try to work with it.

[urn:cts:scholarship:lod.discussion:answer.10b]

JT: I’ve been working with flat, tabular formats that are easily processed by Unix utilities for 25 years, but I was also involved in the creation of the XML specification from the very beginning, so I am comfortable working with both approaches; each has its strengths and weaknesses. There is no doubt that there are many orders of magnitude more texts in XML than in CEX, and I have not myself used CEX, but I can certainly appreciate the benefits of CEX-like formats, both for certain applications and for certain styles in which people want to work.

[urn:cts:scholarship:lod.discussion:question.11]

TK: Is it correct that CEX not only makes the textual library exchangeable for scholars, but also makes all linked data they have created shareable between research projects?

[urn:cts:scholarship:lod.discussion:answer.11]

CB & NS: Yes. I can hand you my whole text-library and 17 databases in a single human-readable text file. Better, I can hand you portions of my text-library and slices of my databases in a single text file. And you can validate and discover the contents of that text file computationally.

[urn:cts:scholarship:lod.discussion:question.12]

TK: Let’s talk a bit more about CEX. How is it structured precisely?

[urn:cts:scholarship:lod.discussion:answer.12]

CB & NS: It is probably easiest to point to the CEX specification (available at https://cite-architecture.github.io/citedx/CEX-spec-3.0.1), but briefly, a CEX file is a plain-text file. Its contents are divided into blocks. Each block is introduced by an identifying line beginning with “#!”. The valid block labels are:

  • #!cexversion
  • #!citelibrary
  • #!ctsdata
  • #!ctscatalog
  • #!citecollections
  • #!citeproperties
  • #!citedata
  • #!imagedata
  • #!relations
  • #!datamodels

The specified formatting of the contents of a block differs according to type. The contents of a #!ctsdata block, for example, consist of two fields separated by a delimiter; the default is “#” but this is configurable. The first field is the CTS URN for a passage of text, the second field is the contents of that passage. The order of records in a #!ctsdata block is significant. That sequence gives the “ordered” in “ordered hierarchy of citation objects”; the URNs themselves provide the “hierarchy.”
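The block structure just described can be read with a few lines of Python. This is a minimal sketch rather than a conforming CEX reader: it handles only the "#!" block labels and the two-field #!ctsdata records with the default "#" delimiter, and it skips blank lines. The function names and the abbreviated sample data are invented for illustration; the specification remains the authority on the details of each block type.

```python
def parse_cex_blocks(text: str) -> dict:
    """Group the non-blank lines of a CEX string under their '#!' block labels."""
    blocks = {}
    current = None
    for line in text.splitlines():
        if line.startswith("#!"):
            current = line[2:].strip()
            blocks.setdefault(current, [])  # a label may occur more than once
        elif current is not None and line.strip():
            blocks[current].append(line)
    return blocks


def cts_records(blocks: dict, delimiter: str = "#") -> list:
    """Return (urn, passage) pairs from the ctsdata block, preserving file order."""
    records = []
    for line in blocks.get("ctsdata", []):
        # Split on the first delimiter only, so the passage text
        # may itself contain the delimiter character.
        urn, sep, passage = line.partition(delimiter)
        if not sep:
            raise ValueError(f"no delimiter in ctsdata record: {line!r}")
        records.append((urn, passage))
    return records


# Invented sample with abbreviated passage text.
SAMPLE = """#!cexversion
3.0

#!ctsdata
urn:cts:scholarship:lod.discussion:question.12#TK: How is it structured precisely?
urn:cts:scholarship:lod.discussion:answer.12#CB & NS: A CEX file is a plain-text file.
"""

for urn, passage in cts_records(parse_cex_blocks(SAMPLE)):
    print(urn, "|", passage)
```

A list of pairs is used instead of a dictionary keyed by URN because the order of the records is significant.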

[urn:cts:scholarship:lod.discussion:question.13]

TK: I would like to follow up on the way CEX structures image data, which might make this the most technical question of this chapter. But given that images that one might want to use could be stored in different collections in different locations on the web, and given the push by cultural heritage organizations to share collections via the International Image Interoperability Framework (IIIF), I think that this is becoming an increasingly important issue, and so I will ask this technical question anyway. I could see that you specify base URLs in the #!imagedata block. I assume that you structured it so that those base URLs, when combined with the CITE2 URN, retrieve the correct image. How do you deal with a collection that stems from multiple sources where you cannot control with which ID you retrieve an image? For example, the first folio of my manuscript is digitized by the Vatican and I have to use a IIIF string, while I can use a IIIF manifest of a different cultural heritage collection for the following five folia, and I have local Deep Zoom images for the rest. Is it possible to encode that?

[urn:cts:scholarship:lod.discussion:answer.13]

CB & NS: Binary image data is complicated because, naturally, there has to be a connection between the CITE collection (which is portable) and some real-world server. We can’t deliver binary image data in the CEX file itself. At present, there are two solutions.

First, make two collections, one for the Vatican image, and one for the others.

Second, make two versions of the same collection: urn:cite2:myimages.vatican:1 and urn:cite2:myimages.localdz:2, urn:cite2:myimages.localdz:3, etc. By requesting objects at the collection-level (without specifying a version ID), you get all your images. But you can associate images with particular hosts by using ID-level URNs when documenting the CiteBinaryImage data model.
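The second solution can be sketched in Python as follows. The URNs are written in the abbreviated shape used in the answer above, so the sketch simply treats the last two colon-separated fields as collection.version and object ID. The host URLs and function names are hypothetical, and the mapping from version to host stands in for whatever the CiteBinaryImage data model actually records.

```python
# Hypothetical hosts: one base URL per version of the image collection.
HOSTS = {
    "vatican": "https://iiif.example.org/vatican/",
    "localdz": "https://images.example.org/deepzoom/",
}


def split_image_urn(urn: str):
    """Return (collection, version, object_id) from the last two URN fields."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[:2] != ["urn", "cite2"]:
        raise ValueError(f"not a CITE2 URN: {urn!r}")
    collection, _, version = parts[-2].partition(".")
    return collection, version, parts[-1]


def in_collection(urns, collection):
    """Collection-level request: match every version of the named collection."""
    return [u for u in urns if split_image_urn(u)[0] == collection]


def host_for(urn: str) -> str:
    """Version-level lookup: which server delivers this particular image?"""
    _, version, _ = split_image_urn(urn)
    return HOSTS[version]


images = [
    "urn:cite2:myimages.vatican:1",
    "urn:cite2:myimages.localdz:2",
    "urn:cite2:myimages.localdz:3",
]

for urn in in_collection(images, "myimages"):
    print(urn, "->", host_for(urn))
```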

[urn:cts:scholarship:lod.discussion:question.14]

TK: Thanks, I think that is a sensible way forward. So there are plenty of advantages to implementing the CITE architecture: anyone working with canonical text can use it and it offers enough room for different exemplars or complex analyses of the text. It is also not tied to any data format, but can be expressed in flat formats that are easily understandable not only by machines, but also by humans. Beyond this, one of the biggest selling points, in my opinion, is that with implementation, one has direct access to a multitude of open-source libraries and tools. Which tools do you use and which stand out for you?

[urn:cts:scholarship:lod.discussion:answer.14a]

CB & NS: Now that we are by default serializing CITE data in the CEX format, much of my editorial work uses basic UNIX utilities and applications: cat, tab, nl, vim (Neel uses emacs, but we remain on speaking terms). After many years of developing code libraries in Groovy, we largely abandoned those about 18 months ago in favor of Scala. We have libraries, each with large and growing numbers of unit tests, for working with URNs, corpora of texts, collections of objects, images, JSON serializations, CEX serializations, and arbitrary relations among URNs. Scala seems very well suited to this kind of work, not least because, thanks to ScalaJS, we can use the same libraries to write command-line applications, server-side applications, and end-user web applications.

[urn:cts:scholarship:lod.discussion:answer.14b]

JT: In building the Scaife Digital Library Viewer, we made extensive use of the CapiTainS suite for the Python programming language. The Nautilus server allowed us to serve up text passages via the CTS protocol and CapiTainS also provided a client library for Python code to request passages from Nautilus or any other CTS-protocol-compatible server.
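As a rough illustration of what a request over the CTS protocol looks like, the following Python sketch builds a GetPassage URL using only the standard library. It does not use the CapiTainS client API; the endpoint URL and the function name are hypothetical, and the URN is simply the identifier of this answer.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real CTS server publishes its own base URL.
ENDPOINT = "https://cts.example.org/api/cts"


def cts_request_url(endpoint: str, request: str, urn: str) -> str:
    """Build a CTS request URL from a request name and a CTS URN."""
    # Colons are left unescaped so the URN stays readable in the URL.
    query = urlencode({"request": request, "urn": urn}, safe=":")
    return f"{endpoint}?{query}"


url = cts_request_url(
    ENDPOINT, "GetPassage", "urn:cts:scholarship:lod.discussion:answer.14b"
)
```

Because the passage is addressed purely by its URN, the same call works against any server that implements the protocol; only the endpoint changes.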

[urn:cts:scholarship:lod.discussion:answer.14c]

TK: If I may (and since I am the only Gopher in the group), I would like to mention the programming language Go. I use it in my own tools, one of the CITE microservices is written in it, and I have written a Go library for working with CTS and CITE URNs. In summary, it’s good to see a growing list of tools and software libraries that users can employ to interact with and produce CTS data.

[urn:cts:scholarship:lod.discussion:question.15]

TK: Thank you for all the information. I have one final question: From a developer’s perspective, what are the strengths and shortfalls of the CITE architecture?

[urn:cts:scholarship:lod.discussion:answer.15]

CB: I am of course biased, since the evolution of the CITE architecture has to a great extent followed my own experiences and perceived needs as a developer. I find our current Scala libraries a pleasure to use, not least because of the pretty large, and growing, body of tests that accompany each one. I think the OHCO2 library, for CTS texts, offers a very, very rich body of methods; it is rare these days for me to find a manipulation of CTS data that is more than a one-liner away in Scala. Working with CITE collection data is necessarily more verbose. Necessarily, because CITE collections, by definition, each contain an arbitrary number of typed properties. To work with them generically, you have to discover those properties and their types before addressing objects and their particular property values.
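The extra step of discovering properties before reading values can be sketched as follows. This is a Python illustration under stated assumptions, not the Scala libraries' API: the type labels, the property names, and the sample object are all hypothetical.

```python
from typing import Any, Callable, Dict

# One converter per basic property type; the labels used as keys are illustrative.
CONVERTERS: Dict[str, Callable[[str], Any]] = {
    "boolean": lambda s: s.strip().lower() == "true",
    "number": float,
    "ctsurn": str,
    "cite2urn": str,
    "string": str,
}

# Hypothetical property declarations for one collection: name -> declared type.
PROPERTY_TYPES: Dict[str, str] = {
    "label": "string",
    "folios": "number",
    "illustrated": "boolean",
    "text": "ctsurn",
}

# A hypothetical object whose values arrive as raw strings.
RAW_OBJECT: Dict[str, str] = {
    "label": "Demo codex",
    "folios": "12",
    "illustrated": "true",
    "text": "urn:cts:scholarship:lod.discussion:",
}


def typed_object(raw: Dict[str, str], property_types: Dict[str, str]) -> Dict[str, Any]:
    """Look up each property's declared type first, then convert its raw value."""
    return {
        name: CONVERTERS[property_types[name]](value)
        for name, value in raw.items()
    }
```

Generic code cannot assume that a property such as `folios` exists or is numeric; it has to consult the declarations first, which is where the additional verbosity comes from.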

I am particularly excited at the potential of discoverable data models, which we are currently exploiting in published code for a Binary Image Model (connecting CITE collections to binary image data on real servers delivered according to different protocols), the Documented Scholarly Edition (DSE) model (a graph of text-bearing artifact, image, and transcription), and the Commentary Model (one citable object or text commenting on another). Data Models allow further compositions or elaborations on generically citable data. The model I am beginning to work on now is a “Typed String Property” model. CITE collections can have properties of specified types, but the types are limited to: boolean, number, CTS URN, CITE2 URN, or string. Each property of a collection, and each property of each object in a collection, is citable by a CITE2 URN. By declaring a Typed String Property model, we can, for example, list specific collection-properties that are not merely of type “string” but of type “Markdown” or “GeoJSON.” Any app or service is free to ignore that data model, and the properties will come through as strings. But an app or service that is aware of these data models can discover these “extended string properties” and handle them accordingly. We get much greater functionality that degrades gracefully. We can add specified string-types as we see fit, without altering the CITE protocol or breaking any existing implementations.
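The graceful degradation described here can be sketched in Python. The sketch is only an assumption about how such a model might be consumed: the property URNs, the registry layout, and the function name are hypothetical, and only a GeoJSON handler is provided.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical data-model declaration: property URN -> extended string type.
TYPED_STRING_PROPERTIES: Dict[str, str] = {
    "urn:cite2:demo:places.v1.geometry:": "GeoJSON",
    "urn:cite2:demo:places.v1.note:": "Markdown",
}

# Handlers this particular app knows about; GeoJSON is parsed as JSON.
HANDLERS: Dict[str, Callable[[str], Any]] = {
    "GeoJSON": json.loads,
}


def read_property(property_urn: str, value: str, model_aware: bool) -> Any:
    """Return a property value, interpreting it only if the app knows the model."""
    if not model_aware:
        return value  # an unaware app simply sees a string
    extended_type = TYPED_STRING_PROPERTIES.get(property_urn)
    handler = HANDLERS.get(extended_type)
    # Unknown or unhandled types fall back to the plain string.
    return handler(value) if handler else value
```

An app that ignores the model receives the raw string, and an aware app without a handler for a given type (Markdown in this sketch) does the same, so adding new string types never breaks existing consumers.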

[urn:cts:scholarship:lod.discussion:reading]

TK: Thank you so much for answering all my questions so patiently. In case readers still have questions, we have put together a short list of further reading:

 

Blackwell, C., T. Koentges, and N. Smith (2018). “CITE Exchange Format (CEX): Simple, Plain-Text Interchange of Heterogeneous Datasets.” In Digital Humanities 2018: Conference Abstracts, 541–43 (Mexico City: Universidad Nacional Autónoma de México).

Dué, C., M. Ebbott, C. Blackwell, N. Smith, D. Frame, L. Muellner, and G. Nagy (2016). “The Homer Multitext Project.” Available at: https://www.homermultitext.org/index.html.

Koentges, T. (2018, June 1). ThomasK81/gocite: Kite Surfer (Version 1.0.1). Zenodo. Available at: http://doi.org/10.5281/zenodo.1257467.

Kuczera, A. (2016). “Digital Editions beyond XML – Graph-based Digital Editions.” In Proceedings of the 3rd HistoInformatics Workshop on Computational History, 37–46. Available at: http://ceur-ws.org/Vol-1632/paper_5.pdf.

Smith, D. N. (2009). “Citation in Classical Studies.” Digital Humanities Quarterly 3, no. 1: 121–37.