Seedlings Project Goals

Description of Seedlings and how they (barely) work

General notes:

  • pay attention to which branch things are on; it is a mess (sorry again)

  • BergelsonLab can have private repos but SeedlingsBabylab cannot, so never put PI (private information) on SeedlingsBabylab, and make sure that a BergelsonLab repo is private before putting PI on it

Listened time

Goal of the project

Get the amount of listened time in each cha file so that we can figure out if all the files have been consistently annotated.

Retrieving the current listened time

Use the script recap_regions_listen_time_compute.py in /Volumes/Fas-Phyc-PEB-Lab/Duke/Seedlings/Scripts_and_Apps/Github/seedlings/annot_distr (usage here). The README has not been updated since surplus was added, and a lot of the scripts were already there before I (GB) got here; sorry about the mess.

Checking the output

The basic processing pipeline is:

  • running the script (lab tech)

  • checking the output (lab manager)

    • if errors in output (pointed out by lab manager), modify script and re-run (lab tech)

    • if inconsistencies in cha files, fix (lab manager) and re-run script (lab tech)

Reliability

See here

Samples were taken from LENA recordings from different corpora and annotated both by the lab that recorded them and by another lab. The idea is to measure how much the two sets of annotations agree with each other.

Annotation format

We use the pyannote library to store the annotations. From there, we can use the built-in pyannote metrics to compute what we need.

Detection of speech

The first thing coders should agree on is when there is speech and when there is not. We use the pyannote detection measures; see the code for the detection accuracy and kappa computation.

For detection, we computed:

  • %true pos, %false pos, %true neg, %false neg

  • precision, recall, f1

  • accuracy (pe), kappa
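
The measures listed above can be illustrated with a minimal sketch that does not use pyannote. It assumes both coders' annotations have already been converted to equal-length lists of booleans (one per fixed-size frame, True meaning speech), with the first list treated as the reference. The function name and the frame representation are assumptions made for illustration, not the lab's actual code.

```python
def detection_metrics(reference, hypothesis):
    """Compare two equal-length boolean frame sequences (True = speech)."""
    if len(reference) != len(hypothesis) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(reference)

    # Confusion counts, with `reference` taken as ground truth.
    tp = sum(1 for r, h in zip(reference, hypothesis) if r and h)
    fp = sum(1 for r, h in zip(reference, hypothesis) if not r and h)
    fn = sum(1 for r, h in zip(reference, hypothesis) if r and not h)
    tn = n - tp - fp - fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Accuracy is the observed agreement between the two coders.
    accuracy = (tp + tn) / n
    # Chance agreement, from each coder's overall proportion of speech.
    ref_speech = (tp + fn) / n
    hyp_speech = (tp + fp) / n
    chance = ref_speech * hyp_speech + (1 - ref_speech) * (1 - hyp_speech)
    # Cohen's kappa: agreement beyond chance, rescaled to at most 1.
    kappa = (accuracy - chance) / (1 - chance) if chance < 1 else 1.0

    return {
        "pct_true_pos": tp / n, "pct_false_pos": fp / n,
        "pct_true_neg": tn / n, "pct_false_neg": fn / n,
        "precision": precision, "recall": recall, "f1": f1,
        "accuracy": accuracy, "kappa": kappa,
    }


# Toy example: 10 frames, the coders disagree on two of them.
coder_a = [True, True, True, True, False, False, False, False, False, False]
coder_b = [True, True, True, False, True, False, False, False, False, False]
detection = detection_metrics(coder_a, coder_b)
```

In this toy example accuracy is 0.8 but kappa is only about 0.58, because two coders who each mark 40% of the frames as speech would already agree on 52% of the frames by chance.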

Identification of speaker

Each segment of speech is labeled with a speaker by each coder. The second thing coders should agree on is the identity of this speaker -- at least, for now, whether it is a female adult, a male adult, the target child or some other child. See the code for how speakers are grouped into categories (CHI: target child, FA: female adult, MA: male adult, OC: other child).
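
As a rough illustration of such a grouping, here is a minimal sketch. The raw speaker codes it accepts (CHI, FA1, MA2, FC1, MC1, UC1 and so on) and the decision to leave every other code unmapped are assumptions made for this example; the actual rules are the ones in the code referred to above.

```python
import re


def speaker_category(code):
    """Map a raw speaker code to CHI, FA, MA or OC (None if unmapped).

    Hypothetical rule: adult codes keep their prefix, and every child
    code other than the target child is collapsed into OC.
    """
    code = code.strip().upper()
    if code == "CHI":
        return "CHI"
    match = re.fullmatch(r"(FA|MA|FC|MC|UC)\d*", code)
    if match is None:
        return None  # e.g. electronic or unknown speakers, left out here
    prefix = match.group(1)
    return prefix if prefix in ("FA", "MA") else "OC"


raw_codes = ["CHI", "FA1", "MA2", "FC1", "MC1", "EE1"]
categories = [speaker_category(code) for code in raw_codes]
```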

For identification, we computed:

  • identification error rate, computed twice: once with the raw annotation as reference and once with the rely annotation as reference

  • %match, %false alarm, %miss

  • accuracy, kappa
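
Why the error rate is computed once per reference can be seen in a minimal frame-based sketch. It assumes each annotation has been converted to a list with one entry per fixed-size frame, holding a speaker category or None for no speech, and it ignores overlapping speech. It uses the usual definition of identification error rate, (confusion + false alarm + miss) / total reference speech. The function name and the data layout are hypothetical, not the lab's implementation.

```python
def identification_metrics(reference, hypothesis):
    """Frame-level identification measures, relative to `reference`."""
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have equal length")
    pairs = list(zip(reference, hypothesis))
    total = sum(1 for r, _ in pairs if r is not None)
    if total == 0:
        raise ValueError("reference contains no speech")

    match = sum(1 for r, h in pairs if r is not None and r == h)
    miss = sum(1 for r, h in pairs if r is not None and h is None)
    false_alarm = sum(1 for r, h in pairs if r is None and h is not None)
    confusion = sum(1 for r, h in pairs
                    if r is not None and h is not None and r != h)

    # All rates are expressed relative to the reference speech frames.
    return {
        "match": match / total,
        "miss": miss / total,
        "false_alarm": false_alarm / total,
        "confusion": confusion / total,
        "error_rate": (confusion + miss + false_alarm) / total,
    }


raw = ["CHI", "CHI", "FA", "FA", "FA", None, "MA", "OC"]
rely = ["CHI", "FA", "FA", None, None, None, "MA", "OC"]

raw_as_reference = identification_metrics(raw, rely)
rely_as_reference = identification_metrics(rely, raw)
```

The measure is not symmetric: with these toy sequences the error rate is 3/7 with raw as reference and 3/5 with rely as reference, because the denominator is the amount of speech in whichever annotation serves as reference.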

More detailed annotation

CHI tiers

VCM, LEX, MWU annotations per category: basic reliability, very similar to the speaker id reliability. This requires those annotations to be read in first, which is not the case right now.

Other tiers

XDS annotations: multi-category, multi-coder reliability. Ask Middy what to compute and Stephan how to compute it (?)

Tests

Build basic eaf files with known reliability values, run the metrics on them, and check that the computed values are correct.

TODO

  • read XDS, VCM, LEX and MWU annotations in pyannote format

  • adapt the speaker id metrics so that they compute the same values for VCM, LEX and MWU

  • implement reliability for XDS -- ask Stephan

  • create test files and set up CI

Babble corpus

See here

HV

Four repos: HV creation, HV Elika, HV Sarah, and templates

Vital records

Birth records

See here

Death records

See here for issue filed about how to handle death records

Lab passwords

See

blabr

See here

R library containing functions useful to the lab. Functions recently identified as missing: renaming participants consistently (leading 0 and so on) and merging with CDI (Charlotte is working on that).
