Annotations Organization Proof-of-Concept

How to organize, backup, and move files between different stages of annotations

The Overheard Speech (OvS) project keeps several copies of the same eaf files (annotations) at different stages of the annotation pipeline. This document is a proof-of-concept for how to create, store, and copy these eaf files so that no intermediate data is lost, and for a chain of command that states who is responsible for each step. Since this is a proof-of-concept, let me know if any naming or organizational aspects should be changed.

Types of annotation files

The path to the overheard speech folder in blab_share will be referred to as ovs_path. There are four main folders related to annotations: annotations, annotations-in-progress, annotations-to-be-superchecked, and csv.

  • annotations : This folder contains the master copy of every annotation. These annotations will be named {id}_{age} and saved in a folder named OvS_{id}_{age}. If the annotation has not been worked on, this will be a blank eaf file. Otherwise, all changes that have been superchecked will be added to this file.

  • annotations-in-progress : This folder contains copies of the eaf files from annotations that are currently being worked on by an RA. These annotations will be named {id}_{age}_{initials} and saved in a folder named OvS_{id}_{age}_{initials}. These are copies of the master copies, created by the RA when they begin working on the file.

  • annotations-to-be-superchecked : This folder contains copies of the eaf files from annotations-in-progress that have been annotated by the RA and are ready to be superchecked. These annotations will be named {id}_{age}_{RA's initials}_{superchecker's initials} and saved in a folder named OvS_{id}_{age}_{RA's initials} . These are copies of the in-progress copies, created by the superchecker.

  • csv : This folder contains csv files of compiled annotations from all the master copies. These csv files can be generated by a blabpy script, which will take all the master copies in annotations and combine them into one big csv.
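The naming rules above can be sketched as a small path helper. This is only an illustration: the function name eaf_path, its parameters, the ".eaf" file extension, and the example id, age, and initials are all assumptions, not part of any existing blabpy code.

```python
from pathlib import Path


def eaf_path(ovs_path, stage, rec_id, age, ra=None, superchecker=None):
    """Build the expected location of an eaf file for one pipeline stage.

    Hypothetical helper: folder and file names follow the conventions
    described above; the ".eaf" extension is an assumption.
    """
    base = f"{rec_id}_{age}"
    if stage == "annotations":
        folder, name = f"OvS_{base}", base
    elif stage == "annotations-in-progress":
        if ra is None:
            raise ValueError("RA initials are required for in-progress copies")
        folder, name = f"OvS_{base}_{ra}", f"{base}_{ra}"
    elif stage == "annotations-to-be-superchecked":
        if ra is None or superchecker is None:
            raise ValueError("RA and superchecker initials are both required")
        # The folder carries only the RA's initials; the file carries both.
        folder, name = f"OvS_{base}_{ra}", f"{base}_{ra}_{superchecker}"
    else:
        raise ValueError(f"Unknown stage: {stage}")
    return Path(ovs_path) / stage / folder / f"{name}.eaf"


# Example with made-up values for id, age, and initials.
print(eaf_path("ovs_path", "annotations-in-progress", "01", "12", ra="AB"))
```

Keeping the naming logic in one place would mean that a later change to the conventions only has to be made once.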

GitHub

Annotation Pipeline

The exact scripts, code, and commands for certain steps are (for now) theoretical. I will simply refer to a nebulous "copy script" that can make a copy of an eaf file, rename it, and move it to the correct folder, with warning messages before overwriting or deleting any important folders or files.
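As a rough idea of what such a copy script could look like, here is a minimal sketch using only the Python standard library. The function name copy_eaf and the overwrite flag are hypothetical; the real script may end up looking quite different.

```python
import shutil
import warnings
from pathlib import Path


def copy_eaf(source, destination, overwrite=False):
    """Copy an eaf file to a new folder and name (hypothetical sketch).

    Refuses to overwrite an existing file unless overwrite=True, and
    warns when it does overwrite. The source file is never modified.
    """
    source, destination = Path(source), Path(destination)
    if not source.is_file():
        raise FileNotFoundError(f"No eaf file found at {source}")
    if destination.exists():
        if not overwrite:
            raise FileExistsError(
                f"{destination} already exists; pass overwrite=True to replace it"
            )
        warnings.warn(f"Overwriting existing file {destination}")
    # Create the target folder (e.g. OvS_{id}_{age}_{initials}) if needed.
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)
    return destination
```

Failing by default when the destination exists is a deliberate choice here: an RA who runs the script twice by accident would get an error instead of silently replacing a day's work.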

  • When a new audio file is added to the OvS project, an eaf file will be generated by the lab tech by randomly sampling 15 2-minute segments. This will also generate a selected region.csv that details the onset-offset of these segments. This is a master copy to be saved in annotations.

  • When an RA is assigned a file, they will run the copy script to make a copy of the eaf file in annotations-in-progress with their initials. This will be the only copy they have to interact with. The initial copy should be pushed to GitHub.

  • At the end of each day, the person responsible for OvS will push any in-progress annotations to GitHub, to ensure that the day's work can be recovered. Ideally, the RA would be responsible for backing up their own work. However, write access to the GitHub repository should be limited to staff. As such, let the RA know about other ways of backing up their work (eaf auto backup, blab_share backup) and that, regardless, their end-of-day work will be pushed to GitHub.

  • When the RA is finished, they will run the copy script to make a copy of their annotations in annotations-to-be-superchecked. This ensures that their original working copy is still available for reference, while the superchecker can work on a separate copy.

  • Superchecking! This can be done with the newly created validation script. If there are any minor fixes, the superchecker can make the changes in the superchecker copy.

  • This is a step that can be further workshopped. Theoretically, once the eaf file has been superchecked, the changes should be merged into the master copy. Since there are no parallel annotations happening, this can be done by deleting the old master copy and replacing it with the new version. This is how I'm proposing to do it.

    • Run the copy script, which will delete the original master copy, make a copy of the superchecked file and rename it to the same as the master copy, and move it to annotations, making this the new master copy. This way, git can track the changes in the eaf file (since it has the same name as the master copy).

    • Alternatively, this might be a good step to do manually, just to make sure no files are deleted by accident.

  • Once everything is validated, clean up unnecessary copies by deleting the supercheck and in-progress folders of the respective file.
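The last two steps (replacing the master copy and cleaning up) could be sketched as follows. This is a hedged illustration, not the proposed script: the names promote_to_master and clean_up and the confirm and dry_run flags are hypothetical. Note one deliberate difference from the description above: instead of deleting the master copy first, the sketch copies the superchecked file next to it and then swaps it into place, so the master copy is never missing if the copy fails halfway.

```python
import shutil
from pathlib import Path


def promote_to_master(superchecked, master, confirm=False):
    """Replace the master eaf with the superchecked copy (hypothetical sketch).

    The result keeps the master copy's name, so git sees it as a change
    to the same file.
    """
    superchecked, master = Path(superchecked), Path(master)
    if not superchecked.is_file():
        raise FileNotFoundError(f"No superchecked file at {superchecked}")
    if not confirm:
        raise RuntimeError(f"Refusing to replace {master} without confirm=True")
    # Copy next to the master first, then swap it in with a single rename.
    temporary = master.with_name(master.name + ".tmp")
    shutil.copy2(superchecked, temporary)
    temporary.replace(master)
    return master


def clean_up(*folders, dry_run=True):
    """Return the working folders that exist; delete them unless dry_run."""
    found = [Path(folder) for folder in folders if Path(folder).is_dir()]
    if not dry_run:
        for folder in found:
            shutil.rmtree(folder)
    return found
```

Running clean_up with dry_run=True first lists what would be deleted, which fits the suggestion above that deletion may be safer as a manually checked step.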

The main con of this pipeline is that there are a lot of copies of files, and since everything is pushed to GitHub, it will bog down the repository. However, the advantage is that everything is backed up and retrievable, and you can view changes in the eaf files through Git. An alternative option is to make a separate branch for each RA's annotations, which can then be deleted. I think this might introduce some unnecessary complexity to the pipeline, but might be a good tradeoff.
