Transcription Encoding Workflow

We use a multi-stage workflow for the transcriptions in order to allow for ease of collaboration during the transcription stage. All of the transcriptions are in the Google Drive folder; depending on the individual transcriber's preference, this transcription may have been automatically generated from Transkribus or has been transcribed by hand.

Once the transcription has been finished and proofed according to the guidelines in the Google Drive, the next step is to encode the transcription in TEI. We have an automated process for creating a semi-valid TEI document from the Google Doc transcription. Since Google Doc is a presentation format, the transformation from the Google Doc to TEI will be necessarily imperfect and will require significant manual editing in order to ensure that the structure, semantics, and presentation of the text are encoded properly.

Step 0: Update your repository

Before encoding your transcription, make sure you have your GitHub and oXygen environments properly set up (see Getting Started) and that you have update your local version of the repository (see Workflow).

Step 1: Check Google Drive Transcription

The automated conversion process makes a number of assumptions about the format of the Transcription Roster and the Google Doc; any divergence from these assumptions will quite likely cause the transformation to fail, so it is best to ensure the following before proceeding to run the conversion:

The transcription roster MUST contain the IDs of the assigned transcriber, the actual transcriber, and the proofer where the ID is usually the person's initials and a 1 (i.e. JT1). If there are multiple people for a given role, separate with a slash (i.e. JT1/LD1).

Ensure that these cells DO NOT contain any spaces

The document ID in the first cell MUST match the name of the document in Google Drive, which MUST match the name of the document's ID (i.e. v01.0001.01).

Transcriptions MUST NOT contain any Track Changes (insert, remove, format, etc). Ensure that all track changes are resolved before converting (NB: comments and notes are OK as these are converted to XML comments (!-- comment --)

Once you have confirmed the above, add your ID to the ENCODER column. This is important as we want to avoid multiple people encoding an item.

Step 2: Convert the Transcription

Open the Google Doc transcription and copy the document's URL

Open oXygen (and confirm you are in the lim-data project--it should say

lim-data.xpr

in the corner

Scroll to the bottom of the project pane, find the file named

convertTranscription.html

, and double-click it to open it

Then press the red "Play" button (a white circle with a red triangle)

The first is your encoder ID. Enter your id (i.e. LD1) and press OK

The second is the Google Doc URL. Paste in the Google Doc's URL and press OK

A pane should appear at the bottom of your oXygen application with a bunch of messages that scroll past as the conversion process proceeds. This will take about a minute or so.

Once the conversion is finished (and if it did not encounter any errors), then the document should automatically open in a new tab in oXygen and the information at the bottom should say "BUILD SUCCESSFUL"

If all was successful, you can close the results pane:

If the conversion fails, then a secondary tab will appear at the bottom and will say something dramatic and unhelpful like "FATAL ERROR."

If you see this, then there are a few steps you can take to try and debug. Often, these errors are caused by some inconsistency in the Transcription Roster, so double check that everything is formatted as described above.

If the Roster looks OK, then check the "ANT [convertGoogleDoc]" tab. There will be a lot of superfluous information in that window, but try scrolling up a few lines and seeing if you can find a line that begins with the words "ERROR." These are customized error messages for this project, which attempt to explain what caused the error. For instance, in this example, the document had already been transcribed:

If you can't find a line like that or if it's not clear how to resolve the issue, then the best course of action is to get in touch with the LiM team using the RAs email list, which includes the DHIL developer. It may be the case that another person has encountered the same issue and knows how to fix it or there is something broken with the conversion script. Please include the output of the ANT [convertGoogleDoc] tab at the bottom of your message (it will be a lot of text) since this will help with debugging the issue.

Step 3: Encoding the Transcription

The automatically generated transcription will almost certainly be riddled with validation errors. These are expected and wanted: the conversion tries to make some educated guesses about formatting, etc, but will usually produce generic or intentionally invalid TEI in order to flag places where human intervention is required.

The overall goal of this first pass of encoding is to make sure the document is valid and accurately represents the source material. While each document will have different requirements in terms of encoding, there are some common errors that appear within documents.

TODO: Add table of errors with solutions