Audio Annotation Checks
Audio files need to be checked. This is a second round of coding--the file has been first-pass annotated, but now we need to catch the mistakes the original coder made, plus add any imageable, concrete nouns they missed.
Audio files are broken into Subregions, which are one-hour chunks. They start and end with comments that look like this:
Subregions are numbered 1 - 5 chronologically in the file. Subregions are RANKED based on an algorithm that tells us how much talking happens in the hour.
Annotation Notes has helpful information about what should/should not be coded, as well as formatting instructions, etc.
Age (months) | What to code
6, 7 | Full file should be coded
8, 9, 10, 11, 12, 13 | Top 4 Subregions (lowest ranked not coded)
14, 15, 16, 17 | Top 3 Subregions (2 lowest ranked not coded)
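The chart can also be read as a small lookup. The sketch below is only an illustration: the function names are hypothetical, and it assumes that rank 1 means the subregion with the most talk, which this page does not state explicitly. It does not account for a lowest-ranked subregion being used as a make-up region.

```python
FULL_FILE = "full file"  # marker for ages where the whole file is coded


def coding_scope(age_months):
    """Return FULL_FILE, or how many top-ranked subregions should be coded."""
    if age_months in (6, 7):
        return FULL_FILE
    if 8 <= age_months <= 13:
        return 4
    if 14 <= age_months <= 17:
        return 3
    raise ValueError(f"No rule in the chart for age {age_months} months")


def is_coded_subregion(rank, age_months):
    """True if a subregion of this rank is expected to be coded.

    Assumption: rank 1 is the top-ranked (most talk) subregion.
    """
    scope = coding_scope(age_months)
    if scope == FULL_FILE:
        return True
    return rank <= scope
```

For example, for a 10-month file, `is_coded_subregion(5, 10)` is false, so a rank-5 subregion would only contain codes if it was used as a make-up region.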
Make a working file for your assigned file in Fas-Phyc-PEB-Lab/Seedlings/Working_Files. Working files should be named [SubNum]_[Month] and should have subfolders for audio and video. Copy the following files from Subject Files into the audio subfolder:
From Subject_Files/[XX]/[XX_XX]/Home_Visit/Coding/Audio_Annotation:
[XX_XX]_Audio_Coding_Issues.docx
[XX_XX]_sparse_code.cha
From Subject_Files/[XX]/[XX_XX]/Home_Visit/Processing/Audio_Files:
[XX_XX].wav
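If you prefer to script the copy rather than do it by hand, a minimal sketch follows. It assumes that [XX] is the subject number and [XX_XX] is [SubNum]_[Month], and that the subfolders are literally named "audio" and "video"; the function name and arguments are hypothetical. It only copies from Subject Files and never modifies them.

```python
from pathlib import Path
import shutil


def make_working_folder(subject_files, working_files, sub, month):
    """Create Working_Files/[SubNum]_[Month] and copy the three audio files in."""
    visit = f"{sub}_{month}"  # assumed to match the [XX_XX] folder name
    home = Path(subject_files) / sub / visit / "Home_Visit"
    sources = [
        home / "Coding" / "Audio_Annotation" / f"{visit}_Audio_Coding_Issues.docx",
        home / "Coding" / "Audio_Annotation" / f"{visit}_sparse_code.cha",
        home / "Processing" / "Audio_Files" / f"{visit}.wav",
    ]
    # Stop before creating anything if a source file is missing.
    missing = [str(p) for p in sources if not p.exists()]
    if missing:
        raise FileNotFoundError("Missing source files: " + ", ".join(missing))

    work = Path(working_files) / visit
    audio = work / "audio"  # subfolder names are assumptions
    audio.mkdir(parents=True, exist_ok=True)
    (work / "video").mkdir(exist_ok=True)
    for src in sources:
        shutil.copy2(src, audio / src.name)
    return work
```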
You may save directly over the version in Working Files. If anything goes wrong, tell lab staff, who can address the problem. As long as you're copying files from Subject Files, we'll always have a backup. Please be careful in your work, though!
Open the Audio Coding Issues Word document and read through the information that's available. It's useful to know which subregions the original coder used (these should be the 3 or 4 top-ranked subregions), whether they marked any skips, and why.
Open the .cha file and search (Ctrl+F) for "subregion", which should bring you to Subregion 1. Check the rank to see if it should be coded (see chart above, referring to the age in months of subject), or if it is lowest-ranked, check to see if it was used as a make-up region.
If it is a coded region, use Ctrl+G to find the "ends at ..." comment. Make note of the ending timestamp (in Audio Coding Issues, make a new section called "Checker Notes") so that you know when it's coming up as you're listening to the file. To find timestamps, go to Mode -> Expand bullets (shortcut: Esc-a); timestamps will appear at the end of each segmented tier.
Now return to the beginning of the subregion (Ctrl+F again for "subregion 1"). From there, find the first coded word in the subregion (Ctrl+F for " &=").
Skip back several lines, or to the beginning of the conversation. (TIP: You can see where there is speech in the file by looking at the automated "Conversational Turns" in the .cha file.)
Your job is to listen through all coded words, and the surrounding regions that contain speech. You should NOT listen through the entire subregions, since there are large chunks of silence that someone already listened through. When there are conversations, listen through the entire section. Once you reach a stretch of silence, Ctrl+F for " &=" to skip to the next coded word. Start listening several lines before the word, and continue listening and checking words.
For each word, check each code to see:
if the word should be coded
whether the sentence frame, object presence, and speaker are correct
if the speaker codes are correctly attributed (check Speaker_Codes document in the subject's Subject File)
if the coder missed any words that were said to the child or in the direct vicinity of the child
Repeat for Subregions 2-5. Also search (Ctrl+F) "extra" to see if any extra time is coded in the file. Listen through any extra time until the "end extra time" comment. Check to see that any skips in the file are made up and check to see if the make up time corresponds to skip time using the Fancy Skip Calculator (Fas-Phyc-PEB-Lab/Seedlings/Fancy Skip Calculator.xlsx).
Codes that are already present in the file are formatted exactly as follows:
[word] &=x_x_XXX_annotationID
e.g. boat &=d_y_MOT_0x679fb2
When you add codes that were missing (i.e. you're checking a file and the original coder missed "ball" in the sentence, "Where's your ball?" said by mom) you should format them exactly as follows:
[word] &=x_x_XXX
e.g. ball &=q_n_MOT
Newly coded words won't have annotation IDs assigned to them yet. You'll run a script to insert them after you finish coding the file.
Make sure there is no underscore after speaker code (for newly added words!)
Add object words and codes before the word count (usually before "0.", but if word count for that region is greater than 0, before the &=, ex. &=5_04).
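As a rough self-check on the two formats above, the following sketch classifies each code on a line as already having an annotation ID, newly added, or malformed (for example, a trailing underscore after the speaker code). The pattern is an assumption based only on the examples on this page: one lowercase letter for each of the first two fields, a three-character speaker code, and an ID that starts with 0x followed by hex digits. It is not a replacement for CLAN check.

```python
import re

# Assumed shape: word, space, &=, letter, _, letter, _, 3-character speaker code,
# then whatever non-space characters follow (ideally nothing, or _0x... ID).
CODE = re.compile(r"(?P<word>\S+) &=[a-z]_[a-z]_[A-Z0-9]{3}(?P<rest>\S*)")
ANNOTATION_ID = re.compile(r"_0x[0-9a-f]+")


def classify_codes(line):
    """Return (word, status) pairs for every code found on one line."""
    results = []
    for match in CODE.finditer(line):
        rest = match.group("rest")
        if rest == "":
            status = "new"        # newly added code, no annotation ID yet
        elif ANNOTATION_ID.fullmatch(rest):
            status = "has_id"     # existing code with an annotation ID
        else:
            status = "malformed"  # e.g. stray underscore after the speaker code
        results.append((match.group("word"), status))
    return results
```

Run over the examples above, `boat &=d_y_MOT_0x679fb2` comes back as `has_id`, `ball &=q_n_MOT` as `new`, and `ball &=q_n_MOT_` as `malformed`.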
For child-produced utterances, code them as the intended word. The utterance will also go to the intended word at basic level. For example, if the child tries to say an object word (ball) and does not pronounce it correctly (ex. "ba"), you should write the following:
ball &=n_y_CHI
You should also add the corresponding PHO tier, to be filled in later by a phonetic transcriber. To add the PHO tier, press ENTER to make a new line and type %pho:[TAB]. After the TAB, write three number signs with no spaces in between (###).
Add the file to this list, writing the Subject Number and Month for File Name, Audio, and "No" for whether it is transcribed
Important: if there is ALREADY a PHO tier, and you're adding a repetition of a word, or a new word, but corresponding to the same line as another child production, go to the END of the first transcription, put a SPACE, and add "###"
If this is what already exists:
But you think the child says "pancakes" TWO times, you should do this:
The same applies if you add TWO or more new CHI codes on the same line: put as many "###" placeholders as there are words, each separated from the next by one SPACE.
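The placeholder rules above can be summarized in a short sketch. The helper names are hypothetical, and the "existing" tier in the usage note is made up for illustration rather than taken from a real file.

```python
def new_pho_tier(n_words):
    """Build a fresh PHO tier with one ### placeholder per new CHI word."""
    if n_words < 1:
        raise ValueError("A PHO tier needs at least one child-produced word")
    return "%pho:\t" + " ".join(["###"] * n_words)


def append_placeholder(existing_tier, n_new=1):
    """Add ### placeholders to the END of an existing PHO tier, one SPACE apart."""
    return existing_tier.rstrip() + " " + " ".join(["###"] * n_new)
```

For instance, `new_pho_tier(2)` gives `%pho:` followed by a tab and `### ###`, and `append_placeholder` applied to a tier that already holds one placeholder or transcription adds a single space and one more `###`.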
Parents often refer to an object by a nickname (ex. "baba" referring to a bottle). If the parent refers to a "baba" in the file, and then the child subsequently uses the same form, then you can use the child form as a word. Making this decision might entail opening the Basic Level in Subject Files to see if adults are coded using the child form of the word and if the child form is the most common form.
Press Esc+L to run CLAN check. CLAN will bring you to any errors in the formatting of the file. At the bottom of the page in a black highlighted box, it will explain the error to you. Common errors and solutions can be found here.
Go to Windows -> Commands. Change your working directory to the directory where your file lives by clicking the Working button and navigating to your file. Type this command in the text box, replacing the file name with the name of the file you're working on:
If the command runs properly, you’ll either get an error message (with a line number for where the error appears) or you’ll get an output of all of the words coded in your file. Check the list for any typos or outputs that look weird (for example, you shouldn’t see any codes like “d_n_FAT” in the list--if you do, your formatting is wrong for that word). Keep the CLAN output window open so that you can compare the number of words detected to your .csv output (the next step!). Save the file with any changes you make.
After checking the file for format errors, you can run the following command in the command box to check for spelling errors: mor +xl 08_09_sparse_code.cha [changing the file name appropriately]
If you get an error that tells you CLAN can't locate the "mor lib," go to File -> Get MOR grammar -> English.
The output will be a new CLAN file with any entries that do not appear in the dictionary. Most will be compound words (with pluses, which don't appear in any dictionary), but some will be actual misspellings and typos. After you fix the entries in your .cha, you can delete this output from your Working File.
Step 4: add Annotation IDs
There is currently no script for this. See Audio Add Annotation IDs for details.
See https://github.com/BergelsonLab/gitbook/blob/main/data-pipeline/basic-levels/update-sparse_code.csvs