Annotations in csv format

All complete annotations are collected into two large csv files covering all tiers and intervals (annotations.csv and intervals.csv). The annotations are exported using a blabpy function and saved to a repo as these csv files, from which they can be loaded with a blabr function. See the vihi_annotations repo for details.

blabpy

  • blabpy.pipeline.extract_aclew_data extracts annotations from a single file, or recursively from a folder, and returns two dataframes: annotations and intervals. It can be used on any eaf files and produces the most raw version of the data.

    • The intervals table has one row per coding interval, extracted from the code, code_num, sampling_type, onset/offset and context tiers. The two tables can be merged using the eaf_filename and code_num columns.

    • The annotations table has one row per participant-level annotation; all extra annotations (vcm, xds, etc.) are in their own columns. A missing child-tier segment is represented as NA, an empty one as an empty string. Any annotation outside an interval is assigned a code_num of -1.

  • A version of this function in blabpy.vihi.pipeline has been adapted to combine this data with additional information extracted from selected_regions.csv files (specifically the rank of each code_num)
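The merge described above can be sketched with toy data. This is a minimal illustration using only the Python standard library, not blabpy's own code: eaf_filename, code_num, sampling_type and xds are names from the text, while the participant and transcription columns and all the values are hypothetical. With pandas dataframes, the same join would be a left `merge` on the two key columns.

```python
import csv
import io

# Toy stand-ins for intervals.csv and annotations.csv.
# The rows, and the participant/transcription columns, are hypothetical.
INTERVALS_CSV = """eaf_filename,code_num,sampling_type
example.eaf,1,random
example.eaf,2,high-volume
"""

ANNOTATIONS_CSV = """eaf_filename,code_num,participant,transcription,xds
example.eaf,1,CHI,hello,NA
example.eaf,2,FA1,look at that,
example.eaf,-1,FA1,outside any interval,C
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, keeping 'NA' and '' distinct."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_annotations_with_intervals(annotations, intervals):
    """Left-join annotations to intervals on (eaf_filename, code_num)."""
    by_key = {(row["eaf_filename"], row["code_num"]): row for row in intervals}
    merged = []
    for annotation in annotations:
        key = (annotation["eaf_filename"], annotation["code_num"])
        # Annotations with code_num -1 have no matching interval row,
        # so they keep only their own columns.
        interval = by_key.get(key, {})
        merged.append({**interval, **annotation})
    return merged


merged = merge_annotations_with_intervals(
    read_rows(ANNOTATIONS_CSV), read_rows(INTERVALS_CSV)
)
for row in merged:
    # repr() makes the difference between 'NA' (missing) and '' (empty) visible.
    print(row["code_num"], row.get("sampling_type"), repr(row["xds"]))
```

Reading the values as plain strings keeps the NA/empty-string distinction intact; a reader that converts both to a missing value would lose it.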

vihi_annotations

  • This repo contains the current version of the large VIHI csv files.

  • It contains an update.sh script which clones the most recent version of VIHI_LENA and calls the blabpy function above to collect all the intervals and update the current csv files. The lab technician should run it every time a new file has gone through the annotation/supercheck/merge pipeline and is ready to be part of the final csv files.

  • As noted above, these csv files are the most "raw" version of the annotations. annotations.csv has every tier, along with every annotation and code in the eaf files (with the exception of one or two entire tiers that can be excluded in the update.sh script), while intervals.csv has only the intervals that were explicitly coded in the eaf files. In addition, no checks are performed on these files once they have been through the superchecking steps. Thus, using these csv files directly is not recommended.

  • The repo needs to be cloned to your local ~/BLAB_DATA (instructions) to be accessible by blabr::get_vihi_annotations()
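As a quick sanity check before loading, one could verify that the local clone contains both csv files. This is a rough sketch under two assumptions that are not confirmed by this page: that the clone lives in a folder named vihi_annotations inside ~/BLAB_DATA, and that both files sit at the top level of the repo. The helper name missing_vihi_files is hypothetical.

```python
from pathlib import Path

EXPECTED_FILES = ("annotations.csv", "intervals.csv")


def missing_vihi_files(blab_data=Path.home() / "BLAB_DATA"):
    """Return the expected csv files that are absent from the local clone."""
    # Assumption: the clone is a folder named after the repo, and the
    # csv files are at its top level.
    repo = Path(blab_data) / "vihi_annotations"
    return [name for name in EXPECTED_FILES if not (repo / name).is_file()]


missing = missing_vihi_files()
if missing:
    print("Missing from local clone:", ", ".join(missing))
else:
    print("vihi_annotations clone looks complete.")
```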

An older version of this page notes that the current version of update.sh uses the "Dev versions (0.0.0.9xxx)". I have no idea what this means but will look into it.

blabr

  • blabr::get_vihi_annotations() loads the csv files from the repo mentioned above.

  • The vihi_annotations repo needs to be cloned to your local ~/BLAB_DATA (instructions) to be accessible by blabr::get_vihi_annotations()

  • This function is the *preferred* way to work with the full VIHI corpus since:

    • It performs various checks to make sure that there are no errors among annotations according to ACLEW standards

    • It ensures the correct data type of every column (including empty/NA values)

    • It processes the data to add derived columns that are useful but not included in the original, such as is_top_5_hivol, and the first 90-minute interval (interval 0), which is not explicitly coded in the eaf files.

    • It has options for loading:

      • An older version of the csv files

      • Only the annotations, only the intervals, or the annotations with interval data merged to it

      • Only the random samples, only VI and TD matches, or the entire corpus

      • Annotations with or without PI

    • See the details of all of these here.
