This article is available at the URI http://dlib.nyu.edu/awdl/isaw/isaw-papers/20-13/ as part of the NYU Library's Ancient World Digital Library in partnership with the Institute for the Study of the Ancient World (ISAW). More information about ISAW Papers is available on the ISAW website.
©2021 Sebastian Heath; text and images distributed under the terms of the Creative Commons Attribution 4.0 International (CC-BY) license.
ISAW Papers 20.13 (2021)
Applied Use of JSON, GeoJSON, JSON-LD, SPARQL, and IPython Notebooks for Representing and Interacting with Small Datasets
Sebastian Heath, Institute for the Study of the Ancient World, New York University
URI: http://hdl.handle.net/2333.1/t1g1k70v
In: Sarah E. Bond, Paul Dilley, and Ryan Horne, eds. 2021. Linked Open Data for the Ancient Mediterranean: Structures, Practices, Prospects. ISAW Papers 20.
Abstract: This paper describes the role of standards-based and open source file formats and tools in representing and interacting with small datasets. The example used is a database of Roman amphitheaters that is based on the GeoJSON variant of JSON; both formats are briefly defined and explained by example. It is stressed that the code sharing site GitHub can map the spatial information in GeoJSON files by default. Next, a series of IPython notebooks - all of which can be run interactively or downloaded for further development - show the implementation of a lightweight interface for exploring amphitheater seating capacity. In conclusion, the paper emphasizes that using existing tools can make it easier to maintain focus on the intellectual content of a dataset.
Library of Congress Subjects: Amphitheaters--Italy--Statistics; Linked data--Use studies.
Introduction
The main goal of this paper is to show that a selection of the standards, methods, and tools that fall under the rubric of Linked Open Data (LOD) can be the basis for creating flexible representations, as well as interactive presentations, of small datasets. As will become clear, by 'representation' I mean the specific instance of a file that conforms to particular standards and is therefore reusable in multiple computational contexts. By 'presentation' I mean the transformation of that file into more human-readable results, such as visualizations or interactive web-pages. The specific use-case is a dataset providing brief information about Roman amphitheaters. There are approximately 260 of these structures, which occur throughout Roman territory, even if unevenly distributed.1 All were built between the early first century BCE and the early fourth century CE, though most Roman amphitheaters are first or second century CE in date. These aspects of the data - a relatively small number of entities that show spatial and chronological variability within the set - make for an interesting test-case of the use of LOD methods and tools. They also allow the discussion here to be published in conjunction with all the associated data and with brief scripts that many readers will be able to run themselves, either after downloading or in a cloud-based environment.2 There are links to the latter in the text that follows.
The discussion that follows will move from an overview of the specifics of an LOD-informed representation of the phenomenon of Roman amphitheaters, then to querying that data using the SPARQL query language, and finally to a limited implementation of a graphical and interactive user interface. My intent is that this interface is useful as a repeatable and reusable example of working code. LOD influences all parts of what follows, though more general tools will come into play, including the Python programming language and interactive IPython notebooks. These additions mean that there is no attempt to be "pure" or "strict" LOD. Discussion of actual practice will always be to the fore, and that practice will also suggest a path for using data in historical investigation, even if that is not the primary focus here. Although it is a gross generalization to say that computers only work with 1's and 0's and humans work with ideas, working to bridge the gap between those two perspectives remains a topic of discussion within the wider field of "Digital Humanities."3 By the end of this paper, a set of tools and data will have been assembled that offer an additional starting point in this ongoing effort.
There are other introductory topics to address early on. Firstly, "Roman amphitheaters" here usually means fully-enclosed, quite large, at least partially stone, oval, public structures used primarily for the staging of gladiatorial combats, fights involving animals, and public executions.4 These activities made them an important setting for social and political interaction in the Imperial period.5 Amphitheaters are distinct from theaters, which are generally half-round and primarily used for dramatic events. Even the succinctly stated criteria used here highlight that there are borderline cases, including the so-called Gallo-Roman amphitheaters that have seating only partially enclosing an oval arena. Those are included in the dataset, though it would be easy to exclude them from any analyses that would be improved by doing so. There is also dynamism in the number of amphitheaters in use at any one time. The form, or at least permanent stone versions of it, likely originated in southern Italy in the early first century BCE.6 Initial spread was slow, and then from the mid-first century CE to the mid-second century many were built. As new amphitheaters appeared, older ones went out of use. A compelling pairing of growth and loss is the destruction of the amphitheater at Pompeii in 79 CE, an event that buried 20,000 seats in ash, and the opening shortly thereafter of the Flavian Amphitheater in Rome, the so-called Colosseum, which was in use by 80 CE. With 50,000 seats, the Flavian Amphitheater was comparatively huge. Many Roman amphitheaters fell within the range of 10,000 to 25,000 seats. An interface for exploring amphitheater capacities that is built using open data and open tools appears towards the end of this paper (Fig. 14).
Another topic to consider is this paper's audience. I do not mean what follows as a ground-up introduction to using JSON, JSON-LD, GeoJSON, SPARQL, and IPython notebooks to publish data about the Roman Empire. I do offer brief definitions of those terms, but readers with no experience in the Linked Open Data digital ecosystem might not be satisfied with this discussion as an entry point to the topic. Nonetheless, I will stress throughout that representing data using well-documented file formats and then manipulating that data with open-source tools allows the focus to be on the intellectual content of a dataset and on how it can be queried and the results displayed. I will show "out of the box" functionality inherent in file formats, with mapping being the most visually compelling example. The combined application of all the third-party tools that I will use is tantamount to a test of whether or not I have usefully represented the phenomenon of Roman amphitheaters. To the extent readers think the answer is "yes," this paper is one more contribution that keeps the dialog between standards-compliance and the needs of individual research efforts at the center of discussion of the role of digital tools in Humanistic research.7
Representing the Data: JSON and GeoJSON
As of this writing, the dataset under discussion here is available in a GitHub repository published under a Public Domain dedication, meaning that it meets the expectations implied by the 'O' in LOD.8 While the current author is the main contributor, and is certainly responsible for any shortcomings and incompleteness, the commit history shows that early data collection was a shared effort. Versions of this repository are also published via Zenodo.org, which means there is a DOI for the collected resource.9
The main data appears in the file 'roman-amphitheaters.geojson'. By the end of this section, it will be clear that this file contains both structured data about each amphitheater - such as dimensions, an indication of chronology, and capacity - and spatial data in the form of a point giving the center - accurate to meters when possible - of the arena. After exploring a few specifics of this representation, I will show that the data can be queried using the SPARQL query language, which works with simple statements known as 'triples'. But before that, a direct look at the serialization - that is, the sequence of characters that allows both humans and computers to recognize the information content of a file - will be useful.
Some unpacking of file extensions and names of formats is necessary. The '.geojson' extension means the information in 'roman-amphitheaters.geojson' is represented using the JSON format as a starting point, with additional compliance to the GeoJSON standard for recording spatial data. For its part, JSON records information as "key-value pairs".10 An example of four key-value pairs adapted from the Roman amphitheater data is:
{
  "id": "romeFlavianAmphitheater",
  "title": "Flavian Amphitheater at Rome",
  "chronogroup": "flavian",
  "pleiades": "https://pleiades.stoa.org/places/423025"
}
As an isolated snippet of JSON, the above is quite readable, which is one advantage of the format. To the left of each ":" is a 'key', and to the right is the associated 'value'; these are surrounded by curly brackets, with the implication being that the key-value pairs describe a single entity. The information in Fig. 1 can be rephrased as "There is an amphitheater with the unique ID 'romeFlavianAmphitheater'; it has the more human-readable title 'Flavian Amphitheater at Rome'; it was built in the Flavian period; and - by the way - it's useful to associate this record with the Pleiades URI 'https://pleiades.stoa.org/places/423025'." At the end of that long sentence I am being somewhat wordy, particularly in comparison to the JSON itself. That is because, like many databases, this specific serialization obscures the nature of the connection being made between a vocabulary and the values indicated. In this case, there is a reference to Pleiades, which describes itself as a "gazetteer of past places."11 Visiting the web address in the JSON snippet displays a page that has the title "Roma" and a further description reading "The capital of the Roman Republic and Empire." As used above, then, the link to Pleiades is imprecise. It is not suggesting a narrow equivalence as it is clear that the scope of the Pleiades identifier is far broader than the individual record in the amphitheater dataset. This use is instead an invocation of a well-recognized general identifier within a specific, even idiosyncratic, dataset. This is good Linked Open Data practice, and as will be seen below, one that comes with a good return on effort when this data is made available on the internet in an interactive setting.
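The reading of keys and values described above can be sketched in a few lines of Python. This is only an illustration using the standard library's json module, not code taken from the notebooks that accompany this paper; the variable names are my own assumptions.

```python
import json

# The four key-value pairs from the snippet above, held as a JSON string.
snippet = """
{
  "id": "romeFlavianAmphitheater",
  "title": "Flavian Amphitheater at Rome",
  "chronogroup": "flavian",
  "pleiades": "https://pleiades.stoa.org/places/423025"
}
"""

# json.loads converts the JSON object into a Python dictionary.
record = json.loads(snippet)

# Whatever sits to the left of a colon becomes a dictionary key,
# and whatever sits to the right becomes its value.
for key, value in record.items():
    print(f"{key}: {value}")
```

The curly brackets become a single dictionary, which matches the reading that the pairs together describe one entity.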
Pleiades, however, does have an identifier for the Flavian Amphitheater itself (https://pleiades.stoa.org/places/285857974) and it will be useful to include that in the amphitheater data. This is easy to do, as shown by the following expanded JSON snippet that adds a key for 'pleiadesspecific' (Fig. 2):
{
  "id": "romeFlavianAmphitheater",
  "title": "Flavian Amphitheater at Rome",
  "chronogroup": "flavian",
  "pleiades": "https://pleiades.stoa.org/places/423025",
  "pleiadesspecific": "https://pleiades.stoa.org/places/285857974"
}
This snippet remains readable. But it also allows me to introduce an important aspect of using JSON to represent structured data: when information is not known, there is no need to have a blank field. This can be seen by browsing the roman-amphitheaters.geojson file itself; many entries do not have a 'pleiadesspecific' key, either because there is no relevant identifier in Pleiades or because it has not yet been entered. Looking further inside that file reveals a number of 'fields' that are not present for every entry. These range from expected fields that are sometimes not available for poorly preserved structures, such as maximum length (see 'exteriormajor'), to more specialized aspects of amphitheater studies such as the presence or absence of below-ground tunnels in the arena (look for the key 'hypogeum').
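One practical consequence of optional keys is that code reading the data should not assume every key is present. The following is a minimal sketch of that idea in Python. The second record and the helper name specific_uri are invented for illustration and do not come from the dataset.

```python
import json

# Two illustrative records. The first follows the snippet above; the second
# is hypothetical and simply omits the optional 'pleiadesspecific' key.
records = json.loads("""
[
  {"id": "romeFlavianAmphitheater",
   "title": "Flavian Amphitheater at Rome",
   "pleiadesspecific": "https://pleiades.stoa.org/places/285857974"},
  {"id": "hypotheticalAmphitheater",
   "title": "Hypothetical Amphitheater"}
]
""")


def specific_uri(record):
    """Return the 'pleiadesspecific' value, or None when the key is absent."""
    return record.get("pleiadesspecific")


# Separate the records that carry the optional key from those that do not.
with_specific = [r["id"] for r in records if "pleiadesspecific" in r]
without_specific = [r["id"] for r in records if "pleiadesspecific" not in r]

for r in records:
    print(r["id"], "->", specific_uri(r))
```

Using dict.get rather than square-bracket access means an absent key yields None instead of raising an error, so missing information needs no placeholder in the file.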
Direct inspection of the data on GitHub will certainly reveal that the snippets appearing above are very simplified. The file itself has more structure. This is in part because, as noted, it conforms to the GeoJSON variant of JSON, which here supports directly recording the approximate centerpoints of amphitheaters. Fig. 3 gives a still simplified snippet that indicates how these points appear in the data:
{
  "type": "Feature",
  "id": "romeFlavianAmphitheater",
  "properties": {
    "title": "Flavian Amphitheater at Rome",
    "chronogroup": "flavian",
    "pleiades": "https://pleiades.stoa.org/places/423025",
    "pleiadesspecific": "https://pleiades.stoa.org/places/285857974"
  },
  "geometry": {
    "type": "Point",
    "coordinates": [
      12.492269,
      41.890169,
      22
    ]
  }
}
GeoJSON is a formally published Internet Engineering Task Force (IETF) proposal, giving it the effective status of a usable standard.12 Although GeoJSON does impose requirements on how information is represented, it remains quite readable. The above snippet builds on the brief information about the Flavian Amphitheater already introduced, but places all but the ID in a 'properties' block. There is also a 'geometry' block, which in this case defines a point in three-dimensional space at longitude 12.492269, latitude 41.890169, and elevation of 22 meters. Again, this specific representation - one that establishes the identity of the Flavian Amphitheater at Rome, gives very brief descriptive information, and indicates the central point of the structure - has this precise form because it is valid GeoJSON. This conformance to a standard means that readers can copy-and-paste the text into a tool that renders GeoJSON as a map. At the time of this writing, the sites geojsonlint.org and geojson.io work well. Fig. 4 shows the GeoJSON snippet rendered by GeoJSONLint.com.
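The same structure that lets a mapping tool draw the point also makes it straightforward to read in code. As a hedged sketch, again using only Python's standard json module, the following reads the feature shown above and pulls out its coordinates. The function name point_coordinates is my own, and the check it performs is deliberately minimal rather than a full GeoJSON validation.

```python
import json

# The simplified GeoJSON feature from the snippet above.
feature = json.loads("""
{
  "type": "Feature",
  "id": "romeFlavianAmphitheater",
  "properties": {
    "title": "Flavian Amphitheater at Rome",
    "chronogroup": "flavian"
  },
  "geometry": {
    "type": "Point",
    "coordinates": [12.492269, 41.890169, 22]
  }
}
""")


def point_coordinates(feature):
    """Return (longitude, latitude, elevation) for a Point feature.

    Elevation is None when only two coordinates are given.
    """
    if feature.get("type") != "Feature":
        raise ValueError("expected a GeoJSON Feature")
    geometry = feature["geometry"]
    if geometry.get("type") != "Point":
        raise ValueError("expected a Point geometry")
    coords = geometry["coordinates"]
    # GeoJSON lists longitude first, then latitude, then optional elevation.
    lon, lat = coords[0], coords[1]
    elevation = coords[2] if len(coords) > 2 else None
    return lon, lat, elevation


lon, lat, elevation = point_coordinates(feature)
print(feature["properties"]["title"], lon, lat, elevation)
```

Note the coordinate order: longitude comes before latitude, which is the reverse of how coordinates are often written in prose.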