ISAW Papers 7.19 (2014)

Berkeley Prosopography Services

Laurie Pearce and Patrick Schmitz

Berkeley Prosopography Services (BPS, berkeleyprosopography.org) is an interactive toolkit for analyzing and visualizing prosopographical datasets. It is available to researchers working in diverse disciplines on data that derive from a variety of text sources and formats. BPS developed as a collaboration between University of California, Berkeley researchers in Near Eastern Studies, eager for digital tools to facilitate prosopographical research, and a central Research IT team working to develop digital resources that serve actual research needs.

BPS innovates by providing a complete package of software tools to perform association and computation tasks for name disambiguation, by adding a new model for curation and collaboration, and by connecting Social Network Analysis (SNA) tools and visualizations. At the heart of the BPS productivity and visualization tools, and the workspace support for exploration and collaboration, is an assertion model predicated on heuristics conventionally (and manually) implemented by researchers working with onomastic and prosopographical data.

BPS tools include 1) functionality to import TEI documents and convert them to the BPS data model, 2) a disambiguation engine to associate names with persons based upon configurable heuristic rules, 3) an assertion model that supports flexible researcher curation and tracks provenance, 4) social network analysis and 5) graph visualization tools to analyze and understand social relations, and 6) a workspace model supporting exploratory research and collaboration. The assertion model poses a challenge to the assignment of unique identifiers and the application of Linked Open Data (LOD).

The processing steps reflect the BPS architecture, which is divided into three major areas (diagram available at: http://berkeleyprosopography.org/docs/BPSarchitecture#FigureC):

1. In Text Preprocessing, a corpus is converted from some native format to TEI. The development corpus for BPS is a group of ~500 Akkadian cuneiform legal documents from Hellenistic Uruk, a corpus of the project Hellenistic Babylonia: Texts, Images and Names (HBTIN, oracc.org/hbtin). That project is a component of the Oracc consortium (On-line Richly Annotated Cuneiform Corpora, oracc.org), represented at LAWDI 2012 by Steve Tinney. HBTIN adheres to the shared standards and best practices of the Oracc community and the Cuneiform Digital Library Initiative (CDLI, cdli.ucla.edu).

The TEI markup (in the case of the HBTIN documents, a Unicode representation of transliterated Akkadian) includes elements denoting the individual documents, the activities within each document, and the persons that have roles in those activities. This markup may be generated by hand or by semi-automated processes that recognize names, filiation, roles and activities (in either case, most of this happens outside the BPS system). Oracc generates the TEI for the HBTIN texts used in BPS. Planned work includes the addition of services to support a broader range of corpus formats as input (e.g., direct from an existing database), and to support simple NLP plug-ins to enrich TEI (e.g., with role markup, based upon patterns).

2. In Disambiguation and Social Network Analysis, TEI is ingested and parsed by corpus services, and a native data model is built internally. The workspace services share this model, and leverage authentication and authorization components to support login and access controls on corpus and workspace resources. The disambiguation engine incorporates configurable rules that may be generic or corpus-specific, and associates the name citations in each document with actual persons depicted in the texts. It includes support for assertions that researchers make to confirm or reject the possibilities suggested by the engine. Finally, GraphML is passed to the SNA services to compute significant features of the social networks.

3. The Presentation, Visualization, and Reporting area presents results from various core model and analysis components, including the declared data model in each corpus (names, activities, etc.), assertions that the researcher has made or imported from others, family tree visualizations, as well as interactive network graphs for exploration and understanding.
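The flow through the first two areas can be illustrated with a minimal sketch that reads name citations from TEI and emits GraphML. Everything specific here is an assumption made for illustration: the element and attribute layout of the sample (`div type="activity"`, `persName` with a `role` attribute), the placeholder names, and the helper functions `extract_citations` and `to_graphml` are hypothetical and are not the actual Oracc/HBTIN markup or BPS code. The sketch also links raw names directly, whereas BPS would first disambiguate them into persons.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical TEI-like fragment with placeholder names; the real
# Oracc/HBTIN markup may be structured differently.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text xml:id="doc1">
    <div type="activity" xml:id="act1">
      <persName role="seller">Name-A</persName>
      <persName role="buyer">Name-B</persName>
      <persName role="witness">Name-C</persName>
    </div>
  </text>
</TEI>"""

def extract_citations(tei_xml):
    """Return (document id, activity id, role, name) for each name instance."""
    root = ET.fromstring(tei_xml)
    rows = []
    for text in root.findall("tei:text", TEI_NS):
        for div in text.findall("tei:div[@type='activity']", TEI_NS):
            for pers in div.findall("tei:persName", TEI_NS):
                rows.append((text.get(XML_ID), div.get(XML_ID),
                             pers.get("role"), pers.text))
    return rows

def to_graphml(citations):
    """Link names cited in the same activity; return a GraphML string."""
    by_activity = {}
    for _doc, activity, _role, name in citations:
        by_activity.setdefault(activity, set()).add(name)
    graphml = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    graph = ET.SubElement(graphml, "graph", id="G", edgedefault="undirected")
    for name in sorted(set().union(*by_activity.values())):
        ET.SubElement(graph, "node", id=name)
    edges = set()
    for members in by_activity.values():
        edges.update(combinations(sorted(members), 2))
    for source, target in sorted(edges):
        ET.SubElement(graph, "edge", source=source, target=target)
    return ET.tostring(graphml, encoding="unicode")

citations = extract_citations(SAMPLE)
graph_document = to_graphml(citations)
```

The resulting string could be handed to any SNA library that reads GraphML; which features BPS computes from it is not specified here.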

The assertion model underlies several areas of BPS functionality, but is described here in its primary context: making assertions about disambiguation.

A primary task in prosopography is to determine which real-world person corresponds to a given name instance. All name instances in a corpus, both within a single document (intra-document) and in documents across the corpus (inter-document), provide evidence for disambiguation. The algorithmic model is based upon the heuristics that researchers have long used, and so is familiar to BPS users. To begin, a unique person is posited for each name instance in each document. Then, the model attempts to collapse persons into one another, so that the persons posited for name instances that refer to a given real-world person are collapsed into a single person in the model as well. It does this according to user-configured rules that operate on various features (properties) of each original person. Filiation (declaration of parents and ancestors) is a primary feature used by the model. Additional features include the activity in which each associated name instance is cited, the roles that the citation had in the activity, the date of the respective activities, etc.
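The starting state described above can be sketched as follows. This is only an illustration under stated assumptions: the `NameInstance` class, its fields, the `person-N` identifiers, and the matching test on name plus declared father are hypothetical and do not reproduce the BPS data model or its configured rules.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Dict, Optional

@dataclass
class NameInstance:
    doc_id: str
    name: str
    father: Optional[str] = None  # filiation, when the text declares it
    # candidate person id -> weight (assumed representation)
    weights: Dict[str, float] = field(default_factory=dict)

def posit_persons(instances):
    """Give every name instance its own person, carrying the full weight."""
    for index, inst in enumerate(instances):
        inst.weights = {f"person-{index}": 1.0}
    return instances

def collapse_candidates(instances):
    """Yield index pairs whose name and declared filiation both match."""
    for (i, a), (j, b) in combinations(enumerate(instances), 2):
        if a.name == b.name and a.father is not None and a.father == b.father:
            yield i, j

# Placeholder names, not drawn from any corpus.
instances = posit_persons([
    NameInstance("doc1", "Name-A", father="Name-B"),
    NameInstance("doc2", "Name-A", father="Name-B"),
    NameInstance("doc2", "Name-A", father="Name-C"),
])
pairs = list(collapse_candidates(instances))
```

Only the first two instances are paired, because the third declares a different father; in BPS, whether and how strongly such a pair collapses depends on the rules configured for the corpus.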

The rules of the model operate on these features, and each rule has one of three functions:

Shift rules shift weight from one person to another
Boost rules magnify the effect of applied shift rules
Discount rules reduce the effect of applied shift rules

A rule that produces a conclusive match between two person/name instances may shift 100% of the weight from one to the other. A rule that indicates a likely but not certain match may shift less weight. Name-matching rules are generally modeled as shift rules. Rules that provide additional evidence for a match are modeled as a boost, and tend to leverage features like location or activity. Rules that provide evidence of counter-indication are modeled as a discount; examples include date rules that consider the typical life-span and span of activity, along with the dates of the respective activities. If two activities are 30 years apart, it is less likely that two person/name instances refer to the same real-world person, and so even if the names match, a discount reduces the effect of the collapse.
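The interaction of the three rule types can be sketched numerically. The arithmetic below is an assumption made for illustration: boosts are treated as multipliers above 1, discounts as multipliers below 1, and the combined amount is clamped to the range 0 to 1. The text does not specify how BPS actually combines rules, and the functions `effective_shift` and `apply_shift` are hypothetical.

```python
def effective_shift(shift, boosts=(), discounts=()):
    """Combine one shift rule with boost (>1) and discount (<1) multipliers.

    Assumed arithmetic: multiply the factors, then clamp to [0, 1].
    """
    amount = shift
    for factor in boosts:
        amount *= factor
    for factor in discounts:
        amount *= factor
    return max(0.0, min(1.0, amount))

def apply_shift(weights, source, target, fraction):
    """Move `fraction` of the weight held by `source` onto `target`."""
    moved = weights.get(source, 0.0) * fraction
    result = dict(weights)
    result[source] = result.get(source, 0.0) - moved
    result[target] = result.get(target, 0.0) + moved
    return result

# A name match that would shift 80% of the weight, halved by a date
# discount because the two activities lie far apart in time.
fraction = effective_shift(0.8, discounts=[0.5])
after = apply_shift({"person-0": 1.0}, "person-0", "person-1", fraction)
```

Under these assumptions a conclusive rule corresponds to a fraction of 1.0, which moves all of the weight and fully collapses one person into the other.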

Rules may apply only to person/names within a document (intra-document rules), or to persons across the corpus (inter-document rules). Many rules can operate in either mode, but function slightly differently in the two contexts. The end result of applying the rules is, for each name instance, a set of probabilities over the real-world persons to which that name instance may correspond (low-weight probabilities can be filtered out to simplify results). Each researcher can configure their confidence in each rule that is configured for their corpus, and thereby individually control how the heuristic proceeds. Changing these values also allows researchers to explore what-if scenarios.
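Two of the mechanisms mentioned here, per-researcher confidence and filtering of low-weight candidates, can be sketched as follows. The choices made are assumptions for illustration only: confidence is treated as a simple multiplier on a rule's shift, and the surviving candidates are renormalized after filtering. The function names and the default threshold are hypothetical.

```python
def rule_strength(base_shift, confidence):
    """Scale a rule's shift by the researcher's confidence in that rule.

    Assumes confidence is a value between 0 (ignore the rule) and 1.
    """
    return base_shift * confidence

def filter_candidates(weights, threshold=0.05):
    """Drop candidates below `threshold` and renormalize the remainder."""
    kept = {person: w for person, w in weights.items() if w >= threshold}
    total = sum(kept.values())
    return {person: w / total for person, w in kept.items()} if total else {}

# What-if comparison: the same rule under two confidence settings.
cautious = rule_strength(0.8, confidence=0.5)
trusting = rule_strength(0.8, confidence=1.0)

simplified = filter_candidates(
    {"person-0": 0.70, "person-1": 0.28, "person-2": 0.02}
)
```

Re-running the disambiguation with different confidence values and comparing the filtered results is one way to read the what-if exploration described above.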

Once the disambiguation model has run and produced weighted probabilities for each name-instance in a document, a researcher can review these in the UI, and then decide to confirm or discard the results of the model. These assertions are modeled as an action that the model can take to override the computed results, and so operate something like a boost or discount (in fact, the user can optionally set a confidence on each assertion, so it may be more of a hint than a conclusion on their part).
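One way to picture an assertion overriding computed results is the sketch below. The formulas are assumptions: a confirmation blends the computed weights with full weight on the confirmed person, a rejection scales the rejected person's weight down, and the result is renormalized. The function `apply_assertion` and the action names are hypothetical, not the BPS implementation.

```python
def apply_assertion(weights, person, action, confidence=1.0):
    """Adjust candidate weights for one name instance after a review.

    `weights` is assumed to sum to 1. With confidence 1.0 the assertion is
    conclusive; lower values act as a hint.
    """
    if action == "confirm":
        result = {p: w * (1.0 - confidence) for p, w in weights.items()}
        result[person] = result.get(person, 0.0) + confidence
    elif action == "reject":
        result = dict(weights)
        result[person] = result.get(person, 0.0) * (1.0 - confidence)
    else:
        raise ValueError(f"unknown action: {action}")
    total = sum(result.values())
    return {p: w / total for p, w in result.items()} if total else result

computed = {"person-0": 0.6, "person-1": 0.4}
confirmed = apply_assertion(computed, "person-0", "confirm")
hinted = apply_assertion(computed, "person-0", "confirm", confidence=0.5)
rejected = apply_assertion(computed, "person-0", "reject")
```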

Within the BPS model broadly, assertions have one or more anchors in corpus documents that specify the resources (in the most general sense) upon which the assertion operates, an action that must be realizable in the model, and provenance (the researcher ID and the date when the assertion was created). The common assertions described above allow researchers to specify that a given name-citation is or is not the same person as another name-citation. However, other types of assertions are also possible. Users may assert the date of a document for which metadata was missing, or where damage precludes conclusive dating in the original corpus. Users implicitly assert their confidence in each rule when they use the rules configuration interface.
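The three parts of an assertion named above (anchors, action, provenance) suggest a record along the following lines. The class and field names, the action labels, and the sample values are hypothetical; this is a sketch of the description, not the BPS schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

@dataclass(frozen=True)
class Anchor:
    doc_id: str                   # corpus document the assertion points into
    target: Optional[str] = None  # e.g. a name-citation id; None = whole document

@dataclass(frozen=True)
class Assertion:
    anchors: Tuple[Anchor, ...]   # one or more resources the assertion operates on
    action: str                   # assumed labels: "same-person", "not-same-person", "set-date"
    researcher: str               # provenance: who made the assertion
    created: date                 # provenance: when it was made
    value: Optional[str] = None   # payload for actions that need one, e.g. a date
    confidence: float = 1.0       # optional; below 1.0 the assertion acts as a hint

# Placeholder identifiers and dates, for illustration only.
same_person = Assertion(
    anchors=(Anchor("doc1", "cite-3"), Anchor("doc2", "cite-1")),
    action="same-person",
    researcher="researcher-1",
    created=date(2014, 1, 1),
)
dated = Assertion(
    anchors=(Anchor("doc7"),),
    action="set-date",
    researcher="researcher-1",
    created=date(2014, 1, 1),
    value="placeholder-date",
    confidence=0.7,
)
```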

Since the assertions are abstract (although tied to a corpus or application model), they can be serialized and published from a researcher's workspace. Other researchers working with the same corpus can accept selected assertions from a peer and apply them in their own workspace, and reject or ignore those with which they do not agree. As each pass of the disambiguation process effectively reflects the computation of an algorithm defined by each user, the results (that is, the individuals disambiguated from multiple name-instances) may differ.
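Publishing and selective acceptance can be sketched with plain JSON. The serialization format, the dictionary keys, and the functions `publish` and `import_selected` are assumptions for illustration; the text does not state which format BPS uses.

```python
import json

def publish(assertions):
    """Serialize a workspace's assertions (plain dicts) to a JSON string."""
    return json.dumps(assertions, sort_keys=True)

def import_selected(serialized, accept):
    """Load a peer's assertions, keeping only those `accept` approves."""
    return [a for a in json.loads(serialized) if accept(a)]

# Placeholder assertions from a hypothetical peer workspace.
peer_assertions = [
    {"action": "same-person", "anchors": ["doc1#cite-3", "doc2#cite-1"],
     "researcher": "researcher-1", "created": "2014-01-01"},
    {"action": "not-same-person", "anchors": ["doc1#cite-3", "doc5#cite-2"],
     "researcher": "researcher-1", "created": "2014-01-01"},
]

published = publish(peer_assertions)
# The receiving researcher accepts only the positive identifications.
accepted = import_selected(published, lambda a: a["action"] == "same-person")
```

Because each workspace then runs the disambiguation with its own accepted assertions and rule settings, two researchers starting from the same published set can still arrive at different persons.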

Challenge for implementing LOD in the BPS model:

In the realm of prosopographical research, LOD should find immediate reception and application. In traditional prosopographical research, some attributes within any given domain (e.g., toponyms, names of rulers, and generally agreed-upon terms) may readily be assigned unique identifiers. Likewise, the promotion of the results of a single-authority disambiguation model may prompt the assignment of unique identifiers, in spite of the uncertainty that may surround a disambiguation. BPS differs from all other prosopography tools and projects in formalizing and integrating the probabilistic heuristics prosopographers naturally apply in their research, and in providing a workspace environment in which individual or collaborating researchers may approach a single corpus with different assertions. Each modification may result in variant disambiguations, which the researcher explores and may accept or reject. Because the results that the probabilistic tools generate are mutable, BPS faces a challenge in assigning unique identifiers to disambiguated individuals. Recognizing the value of LOD, BPS researchers continue to investigate the application of unique identifiers to the results the tools generate.