Annotations Pipeline

This page is a work in progress.

The VIHI corpus is still being added to, with multiple annotations in progress in parallel, and at the same time it needs to be accessible for research and analyses. In addition, versions of the corpus with differing levels of completeness live in different places. This pipeline ensures that new data are checked and processed appropriately, while current versions of the data remain accessible. Most steps have a separate gitbook page that details the exact process.

Where the data lives

There are two GitHub repositories of interest: VIHI_LENA and vihi_annotations. In addition, BLAB_SHARE contains the rest of the VIHI corpus and its derivatives.

  • VIHI_LENA contains what are currently the final versions of all the annotations. New annotations are merged into these files once they have been superchecked.

  • vihi_annotations contains aggregated csv files of all current annotations, along with scripts to propagate new annotations from VIHI_LENA into these csv files. This repo needs to be cloned to your local ~/BLAB_DATA (instructions) so that blabr::get_vihi_annotations() can access it.

  • See here for a detailed breakdown of the data in BLAB_SHARE and how it is organized. Of relevance to the pipeline, BLAB_SHARE contains new annotations in progress, as well as a clone of VIHI_LENA at BLAB_SHARE/VIHI/SubjectFiles/LENA/annotations.
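Before calling blabr::get_vihi_annotations(), it can help to confirm that the local clone is where it is expected to be. The following is a minimal Python sketch of such a check, not part of blabr or of the lab's scripts. The function name find_local_clone is hypothetical, and the sketch assumes the clone sits in a folder named vihi_annotations directly inside ~/BLAB_DATA.

```python
from pathlib import Path


def find_local_clone(repo_name="vihi_annotations", base_dir=None):
    """Return the path to a local clone under base_dir, or None if it is missing.

    Assumption: the clone lives in a folder named after the repo, directly
    inside ~/BLAB_DATA. Adjust base_dir if your setup differs.
    """
    base = Path(base_dir) if base_dir is not None else Path.home() / "BLAB_DATA"
    clone = base / repo_name
    # A git working copy normally has a .git entry at its top level.
    if clone.is_dir() and (clone / ".git").exists():
        return clone
    return None


if __name__ == "__main__":
    clone = find_local_clone()
    if clone is None:
        print("vihi_annotations not found under ~/BLAB_DATA; clone it first.")
    else:
        print(f"Found local clone at {clone}")
```

The check only looks for the folder and its .git entry; it does not verify that the clone is up to date.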

The Pipeline

Each step of the pipeline has three parameters: who is doing the step, where the data is stored, and a link to an instruction page.
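As an illustration only, the three parameters of a step could be captured in a small record like the Python sketch below. The class name PipelineStep, its field names, and the describe helper are hypothetical, and the values in the example are placeholders rather than actual steps of this pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineStep:
    name: str           # hypothetical short label for the step
    who: str            # who is doing the step
    data_location: str  # where the data is stored
    instructions: str   # link to the instruction page


def describe(step):
    """Format one step as a single human-readable line."""
    return (
        f"{step.name}: done by {step.who}; "
        f"data in {step.data_location}; see {step.instructions}"
    )


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    example = PipelineStep(
        name="example step",
        who="(person or role)",
        data_location="VIHI_LENA",
        instructions="(link to gitbook page)",
    )
    print(describe(example))
```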
