Seedlings Project Goals
Description of Seedlings and how they (barely) work
General notes:
Pay attention to which branch things are on; it is a mess (sorry again).
BergelsonLab can have private repos but SeedlingsBabylab cannot, so never put PI (private information) in a SeedlingsBabylab repo, and check that a BergelsonLab repo is private before putting PI in it.
Listened time
Goal of the project
Get the amount of listened time in each cha file so that we can figure out if all the files have been consistently annotated.
Retrieving the current listened time
Use the script in /Volumes/Fas-Phyc-PEB-Lab/Duke/Seedlings/Scripts_and_Apps/Github/seedlings/annot_distr called recap_regions_listen_time_compute.py
(usage here -- the README has not been updated since surplus was added, and a lot of the scripts were there before I (GB) got here, sorry about the mess)
Checking the output
The basic processing pipeline is:
running the script (lab tech)
checking the output (lab manager)
if errors in output (pointed out by lab manager), modify script and re-run (lab tech)
if inconsistencies in cha files, fix (lab manager) and re-run script (lab tech)
Reliability
See here
Samples were taken from LENA recordings from different corpora and annotated both by the lab that recorded them and by another lab. The idea is to measure how much the two sets of annotations agree with each other.
Annotation format
We use the pyannote library to store the annotations. From there, we can use the built-in pyannote metrics to compute what we need.
Detection of speech
The first thing coders should agree on is when there is speech and when there is not. We use the pyannote detection measures; see the code for the detection accuracy and kappa computation.
For detection, we computed:
%true pos, %false pos, %true neg, %false neg
precision, recall, f1
accuracy (pe), kappa
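As an illustration of how these detection values relate to each other, here is a minimal sketch in plain Python. It is not the lab's code and does not use the pyannote API: it assumes each coder's annotation has already been cut into equal-length frames marked True (speech) or False (no speech), and the function name `detection_scores` is hypothetical.

```python
def detection_scores(reference, hypothesis):
    """Compare two equal-length sequences of frame labels (True = speech).

    Assumption: both coders' annotations were sampled on the same frame grid.
    """
    if len(reference) != len(hypothesis) or not reference:
        raise ValueError("both coders must cover the same, non-empty set of frames")
    ref = [bool(r) for r in reference]
    hyp = [bool(h) for h in hypothesis]
    n = len(ref)

    # Confusion counts, with the first coder taken as the reference.
    tp = sum(r and h for r, h in zip(ref, hyp))
    fp = sum((not r) and h for r, h in zip(ref, hyp))
    fn = sum(r and (not h) for r, h in zip(ref, hyp))
    tn = n - tp - fp - fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = (((tp + fn) / n) * ((tp + fp) / n)
                + ((tn + fp) / n) * ((tn + fn) / n))
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0

    return {
        "true_pos": tp / n, "false_pos": fp / n,
        "true_neg": tn / n, "false_neg": fn / n,
        "precision": precision, "recall": recall, "f1": f1,
        "accuracy": observed, "kappa": kappa,
    }


# Toy example: 10 frames, the coders disagree on two of them.
coder_a = [True, True, True, True, False, False, False, False, False, False]
coder_b = [True, True, True, False, True, False, False, False, False, False]
scores = detection_scores(coder_a, coder_b)
```

Precision, recall and the false positive and false negative rates depend on which coder is taken as the reference; accuracy and kappa do not.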
Identification of speaker
Each segment of speech is labeled with a speaker by each coder. The second thing coders should agree on is the identity of this speaker -- at least, for now, whether it is a female adult, a male adult, the target child or some other child. See the code for how speakers are grouped into categories (CHI: target child, FA: female adult, MA: male adult, OC: other child).
For identification, we computed:
identification error rate, computed twice: once with the raw annotation as reference and once with the rely annotation as reference
%match, %false alarm, %miss
accuracy, kappa
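The sketch below shows why the identification error rate is computed once per reference: the denominator is the amount of speech in the reference, so swapping the roles changes the value. It is plain Python, not the pyannote implementation. It assumes frame-level labels (one of CHI, FA, MA, OC, or None for no speech) and the usual definition of the rate as (confusion + false alarm + miss) divided by the reference speech; the function name and the toy data are hypothetical.

```python
def identification_scores(reference, hypothesis):
    """Frame-level identification scores.

    Each frame holds a speaker category ("CHI", "FA", "MA", "OC") or None.
    All rates are normalised by the number of speech frames in the reference.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("both coders must cover the same frames")
    total = sum(r is not None for r in reference)
    if total == 0:
        raise ValueError("the reference contains no speech")

    match = confusion = miss = false_alarm = 0
    for r, h in zip(reference, hypothesis):
        if r is None and h is None:
            continue              # both coders agree there is no speech
        if r is None:
            false_alarm += 1      # speech only in the hypothesis
        elif h is None:
            miss += 1             # speech only in the reference
        elif r == h:
            match += 1            # same speaker category
        else:
            confusion += 1        # both mark speech, categories differ

    return {
        "match": match / total,
        "false_alarm": false_alarm / total,
        "miss": miss / total,
        "confusion": confusion / total,
        "ier": (confusion + false_alarm + miss) / total,
    }


# Toy frames for the two coders (hypothetical data).
raw = ["CHI", "CHI", "FA", "FA", None, None, "MA", "OC"]
rely = ["CHI", "CHI", "FA", "MA", None, "FA", "MA", "OC"]

raw_as_reference = identification_scores(raw, rely)
rely_as_reference = identification_scores(rely, raw)
```

With this toy data the raw annotation has 6 speech frames and the rely annotation has 7, so the same two disagreements give an error rate of 2/6 with raw as reference and 2/7 with rely as reference.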
More detailed annotation
CHI tiers
VCM, LEX, MWU annotations per category: basic reliability, very similar to speaker id reliability. This requires those annotations to be read in first, which they currently are not.
Other tiers
XDS annotations: multi-category multi-coder reliability; ask Middy what to compute and Stephan how (?)
Tests
Build basic eaf files for which reliability values are known and on which to run the metrics and check that values are correct
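A possible shape for such a test, sketched in plain Python: instead of real eaf files, it builds two tiny annotations in memory as (start_ms, end_ms, label) tuples, converts them to 100 ms frames, and checks Cohen's kappa against a value worked out by hand. The helper names, the frame size and the toy segments are all assumptions made for illustration; a real test would read eaf files through the lab's own loading code.

```python
from collections import Counter


def segments_to_frames(segments, duration_ms, step_ms=100):
    """Turn (start_ms, end_ms, label) segments into one label per frame.

    Frames not covered by any segment are None (no speech).
    """
    n_frames = duration_ms // step_ms
    frames = [None] * n_frames
    for start_ms, end_ms, label in segments:
        for i in range(start_ms // step_ms, min(end_ms // step_ms, n_frames)):
            frames[i] = label
    return frames


def cohen_kappa(frames_a, frames_b):
    """Cohen's kappa between two equal-length label sequences."""
    n = len(frames_a)
    observed = sum(a == b for a, b in zip(frames_a, frames_b)) / n
    counts_a, counts_b = Counter(frames_a), Counter(frames_b)
    # Chance agreement: product of each coder's marginal proportions per label.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


# Two hand-built annotations of a 1 second sample (10 frames of 100 ms).
coder_a = [(0, 400, "CHI"), (600, 1000, "FA")]
coder_b = [(0, 400, "CHI"), (600, 800, "FA"), (800, 1000, "MA")]

frames_a = segments_to_frames(coder_a, duration_ms=1000)
frames_b = segments_to_frames(coder_b, duration_ms=1000)

# By hand: observed agreement 8/10, chance agreement (16 + 4 + 8) / 100 = 0.28,
# so kappa = (0.80 - 0.28) / (1 - 0.28).
kappa_known = cohen_kappa(frames_a, frames_b)
kappa_identical = cohen_kappa(frames_a, frames_a)
```

Identical annotations should give a kappa of 1, which makes a useful second sanity check alongside the hand-computed case.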
TODO
read XDS, VCM, LEX and MWU annotations in pyannote format
change spkr id metrics so that they compute the same values but for VCM, LEX and MWU
implement reliability for XDS -- ask Stephan
create test files+CI
Babble corpus
See here
HV
Four repos: HV creation, HV Elika, HV Sarah, and templates
Vital records
Birth records
See here
Death records
See here for issue filed about how to handle death records
Lab passwords
See
blabr
See here
R library containing functions useful to the lab. Functions recently identified as missing: renaming participants consistently (leading zeros and so on) and merging with CDI (Charlotte is working on that).
Last updated