Audio Reliability Checks

We recode 10% of the data from each file in order to determine inter-coder reliability for utterance type and object presence.

Caution! Lines become REVERSED during reliability checks: for utterances that contain TWO OR MORE coded nouns, the coded nouns do not appear in the order that they are uttered.

Please take especially careful note when there are two instances of the SAME WORD within the same utterance. They can be distinguished only by their Annotation ID.

Background: review Audio Annotation Checks if necessary

1) Generate Reliability files [Lab Coordinator]

Do not do this until AFTER you have already sent the recodes back to SubjectFiles!

Relevant repo: https://github.com/SeedlingsBabylab/reliability

Go to /Volumes/Fas-Phyc-PEB-Lab/Seedlings/Working_Files/

Create a directory for the month you're running reliability on, called "reliability_[month]". Within that directory, make "audio" and "video" subfolders. In audio, make the following subfolders:

full_files
orig_10_percent
reliability_checks
spreadsheets
debug
- compare_csvs (folder within debug)
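The folder layout above can also be created with a short script. This is only a sketch: the function name make_reliability_dirs and its arguments are made up for illustration and are not part of the reliability repo.

```python
from pathlib import Path

# Subfolders of audio/ listed above; compare_csvs lives inside debug.
AUDIO_SUBFOLDERS = [
    "full_files",
    "orig_10_percent",
    "reliability_checks",
    "spreadsheets",
    "debug/compare_csvs",
]


def make_reliability_dirs(working_files, month):
    """Create reliability_[month] with its audio and video subfolders.

    Hypothetical helper: working_files is the path to Working_Files,
    month is a string such as "11".
    """
    root = Path(working_files) / f"reliability_{month}"
    (root / "video").mkdir(parents=True, exist_ok=True)
    for sub in AUDIO_SUBFOLDERS:
        # parents=True also creates audio/ and debug/ as needed.
        (root / "audio" / sub).mkdir(parents=True, exist_ok=True)
    return root
```

For example, make_reliability_dirs("/Volumes/Fas-Phyc-PEB-Lab/Seedlings/Working_Files", "11") would set up reliability_11. Existing folders are left untouched, so it is safe to rerun.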

Open get_cha.py in Atom (this is in "collect" scripts). The start_dir should be set to Subject Files. The out_dir should be set to the full_files folder that you just created. The subj should be set to "" and the month should be set to whichever month you want to gather (ex.: "11"). Run the script from the terminal:

python get_cha.py

Now we need to extract the 10% of annotations and fill them into new cha files. To do this, we run the batch_sample.py script. It takes one argument: the path to the full_files folder. This script will output files with 10% of annotations replaced with "word &=X_X_MOT" into the "orig_10_percent" and "reliability_checks" folders.
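To make the sampling step concrete, here is a rough sketch of how 10% of the annotations could be blanked. It is not the code in batch_sample.py, which may select annotations differently. The annotation shape "word &=U_P_SPK_annotid", the regular expression, and the function name blank_sample are all assumptions for illustration.

```python
import random
import re

# Assumed annotation shape: "word &=U_P_SPK_annotid",
# e.g. "ball &=d_y_MOT_0x1a2b3c" (utterance type, object presence, speaker, ID).
CODE = re.compile(r"&=([A-Za-z])_([A-Za-z])_([A-Z0-9]{3})_(\w+)")


def blank_sample(text, fraction=0.10, seed=None):
    """Return (blanked_text, sampled_ids).

    Roughly `fraction` of the annotations get their utterance type and
    object presence replaced with X_X; speaker and annotid are kept.
    """
    ids = [m.group(4) for m in CODE.finditer(text)]
    if not ids:
        return text, set()
    k = max(1, round(len(ids) * fraction))
    sampled = set(random.Random(seed).sample(ids, k))

    def repl(match):
        if match.group(4) in sampled:
            return f"&=X_X_{match.group(3)}_{match.group(4)}"
        return match.group(0)  # not sampled: leave the code unchanged

    return CODE.sub(repl, text), sampled
```

Passing a seed makes the selection repeatable, which helps if the blank files ever need to be regenerated.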

2) Conduct Reliability Recodes [RA]

a) Setup

Navigate to your assigned .cha in the following directory: /Volumes/Fas-Phyc-PEB-Lab/Seedlings/Working_Files/reliability_[month]/audio/reliability_checks
Copy the .cha into its corresponding Processing/Audio_Files folder (in Subject_Files): /Volumes/Fas-Phyc-PEB-Lab/Seedlings/Subject_Files/.../Home_Visit/Processing/Audio_Files
- NOTE: Please DO NOT leave the file in this folder when not in use! If you do not finish an audio reliability file by the end of your shift, move the .cha back into the reliability_checks folder and replace the older version, and leave a note for yourself in Asana to finish the file.
Open the file in CLAN. Go to Mode -> turn OFF Chat mode (shortcut: Esc-m)
Also in Mode -> click 'Expand bullets' (shortcut: Esc-a) and timestamps will show up at the end of each segment tier.
Go to Edit -> CLAN Options -> Make sure Auto-Wrap in TEXT mode and Auto-Wrap in CLAN Output are both unchecked (not on!).

b) Check codes

Search (Mac shortcut: ⌘ + F, PC shortcut: Ctrl + F) for "X_X". The codes that you need to fill in have the form "word &=X_X_[SPEAKER]_[annotid]", where the first X is the utterance type and the second X is the object presence for each coded word.
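If you want to double-check that no blank codes are left before handing a file back, a small script can list them. This is a sketch only: the regular expression assumes a three-character speaker code and an annotid made of letters and digits, and the function name unfilled_codes is hypothetical.

```python
import re

# Assumed shape of a code: "word &=U_P_SPK_annotid", where U and P are
# single letters and a blank one is still "X".
CODE = re.compile(r"(\S+) &=([A-Za-z])_([A-Za-z])_([A-Z0-9]{3})_(\w+)")


def unfilled_codes(text):
    """List (line_number, word, annotid) for codes that still contain an X."""
    found = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in CODE.finditer(line):
            word, utt_type, obj_pres, _speaker, annotid = m.groups()
            if "X" in (utt_type, obj_pres):
                found.append((lineno, word, annotid))
    return found
```

Because each result includes the annotid, two instances of the same word in one utterance stay distinguishable.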

Place your cursor a few lines above the word in order to hear a bit of context. Listen to the utterance and replace the X in each code with what you think are the appropriate codes.

To move to the next "X_X" use ⌘ + G or Ctrl + G. Continue until you fill in all of the codes.

If the word should not have been coded, make the utterance type and object presence "o". Don't delete any codes from the blank_rel_10.cha file.

If you notice that the speaker code is wrong, or if there are missing words in the section of the file you're listening to, don't change it in the recode file, but do leave a note in this google doc with the timestamp and description of the change you need to make or the missing word you need to add:

https://docs.google.com/document/d/1eKncqrDu5OXwDb559--ILXAdgZSzAMqUedcuV65hFvw/edit?usp=sharing

*Since this doc is stored on the cloud, DO NOT write any identifiable info (e.g. names) on this.

When you're finished, move the .cha back into the reliability_checks directory, replacing the older version.

Mark your task complete on Asana.

3) Generate the spreadsheet [Lab Coordinator]

After all the blank annotations in the reliability_checks folder have been recoded, we need to compare those annotations with the original ones. Run the batch_compare.py script. It takes one argument: the path to the folder that contains the full_files, reliability_checks, etc. folders. It will generate a csv of the mismatches between the recodes and originals, which it will output to the spreadsheets folder.
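The comparison can be pictured as pairing each recoded annotation with its original by annotid. The sketch below is not the code in batch_compare.py. The annotation regex, the function names, and the row keys (orig_utt_type, recode_utt_type, and so on) are assumptions for illustration.

```python
import re

# Assumed annotation shape: "word &=U_P_SPK_annotid".
CODE = re.compile(r"(\S+) &=([A-Za-z])_([A-Za-z])_([A-Z0-9]{3})_(\w+)")


def codes_by_id(text):
    """Map annotid -> (word, utt_type, obj_pres) for every code in a .cha text."""
    return {
        m.group(5): (m.group(1), m.group(2), m.group(3))
        for m in CODE.finditer(text)
    }


def compare(orig_text, recode_text):
    """Return one row per recoded annotation, paired with the original by annotid."""
    orig = codes_by_id(orig_text)
    rows = []
    for annotid, (word, utt_type, obj_pres) in codes_by_id(recode_text).items():
        if annotid not in orig:
            # A missing ID means the two files are out of sync.
            raise KeyError(f"annotid {annotid} not found in the original file")
        _, orig_utt, orig_pres = orig[annotid]
        rows.append({
            "word": word,
            "annotid": annotid,
            "orig_utt_type": orig_utt,
            "recode_utt_type": utt_type,
            "orig_obj_pres": orig_pres,
            "recode_obj_pres": obj_pres,
        })
    return rows
```

Matching on annotid rather than on word or position is what keeps repeated words and reordered codes from being mispaired.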

If the script crashes, compare the two csv files that were sent to the debug folder. Find where there is a mismatch and fix accordingly in the orig or recode .cha.

Place a copy of the reliability spreadsheet in the folder: /Volumes/Fas-Phyc-PEB-Lab/Seedlings/Compiled_Data/reliability_sheets_FINAL. This copy will remain clean and will not be used by RAs for consensus meetings.

On the copy of the spreadsheet that will remain inside the month folder, use the following directions to make the consensus spreadsheet:

Use the EXACT text function to compare the utt_type and obj_pres columns from orig and recode. The EXACT formula should look like this: =EXACT(C2:C1779, E2:E1779). Then use CONCATENATE to combine the resulting TRUE/FALSE columns. The CONCATENATE formula should look like this: =CONCATENATE(I2:I1779,"_",J2:J1779). Adjust the row ranges to match the length of your spreadsheet. Filter out TRUE_TRUE to find mismatched codes.
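The same EXACT and CONCATENATE logic can be written in Python, which can help when checking the spreadsheet by hand. This is a sketch under assumptions: the row keys (orig_utt_type, recode_utt_type, orig_obj_pres, recode_obj_pres) are hypothetical names, not the actual column headers of the reliability spreadsheet.

```python
def true_false_label(row):
    """Mimic EXACT on each pair of codes, then CONCATENATE with "_".

    Like EXACT, the comparison is case-sensitive.
    """
    utt_match = row["orig_utt_type"] == row["recode_utt_type"]
    pres_match = row["orig_obj_pres"] == row["recode_obj_pres"]
    return f"{str(utt_match).upper()}_{str(pres_match).upper()}"


def mismatches(rows):
    """Keep only the rows whose label is not TRUE_TRUE."""
    return [row for row in rows if true_false_label(row) != "TRUE_TRUE"]
```

A row labelled FALSE_TRUE disagrees on utterance type only, TRUE_FALSE on object presence only, and FALSE_FALSE on both.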

Assign RAs audio consensus and wordmerge tasks. For audio, changes are made directly in the .cha file in Subject Files.

4) Conduct consensus [RA]

The consensus spreadsheet points out any differences in the file between the original answers and your re-coded answers.

a) Set up the consensus spreadsheet

Open the consensus spreadsheet for the month you are doing reliability for: Fas-Phyc-PEB-Lab/Seedlings/Working_Files/reliability_[month]/audio/spreadsheets
Select all the cells. Click the Filter icon in the upper right-hand corner.
On the True_False column, select the drop-down arrow. Un-select TRUE_TRUE (this removes the matching codes; now only mismatches remain).
On the File column, select the drop-down arrow and select only the name of the reliability file you want to check.

b) Set up the sparse_code.cha

Find the sparse_code.cha for the file in Subject Files: Fas-Phyc-PEB-Lab/Seedlings/Subject_Files/[SubjectNo]/[SubjectNo_month]/Home_Visit/Coding/Audio_Annotation
Copy-paste the sparse_code.cha into its Home_Visit/Processing/Audio_Files folder
1. Do not drag and drop!
Open the .cha file. Press Ctrl+a on your keyboard to view timestamps in the file.

c) Do the consensus

If you have a consensus buddy...

Grab a consensus buddy (any research assistant who is currently present)
Grab a headphones splitter. Plug both sets of headphones in.
Look at the spreadsheet. For each mismatched code...
1. On the spreadsheet, copy the onset timestamp.
2. Go to the .cha file. Press Ctrl+f. Paste the timestamp into the search box. Press Enter.
3. Listen from a little before the utterance until a little after (for context)
4. What do you think the code should be? Discuss with your partner. Reference the spreadsheet, Annotation Notes, CWI, or other documentation as necessary. Come to a conclusion.
5. If you disagreed with the original code, make any necessary changes in the sparse_code.cha directly in Subject Files (NOT in the reliability spreadsheet or anywhere else!).
6. Repeat for every mismatched code from the spreadsheet.
7. Clan check + Save
8. Move the sparse_code.cha back to its Home_Visit/Coding/Audio_Annotation folder
  1. No need to run the add_annotid, parseclan, wordmerge, etc. scripts at this time -- that happens in the post-consensus step

If you don't have a consensus buddy (i.e. nobody else is present at the moment)...

Look at the spreadsheet. For each mismatched code...
1. On the spreadsheet, copy the onset timestamp.
2. Go to the .cha file. Press Ctrl+f. Paste the timestamp into the search box. Press Enter.
3. Listen from a little before the utterance until a little after (for context)
4. Without being influenced by the original code (as indicated in the .cha) or by the recode (as indicated in the spreadsheet), what do you think the code should be?
  1. If you agree with the original code: leave as is
  2. If you agree with the recode: change it to the recode
  3. If you don't agree with either:
    Consult documentation such as Annotation Notes and CWI
    Then, if you still don't agree with either the original OR recode, make a note in the Reliability Issues doc saying that you need to consult with someone about it.
5. Make any necessary changes in the sparse_code.cha directly in Subject Files (NOT in the reliability spreadsheet or anywhere else!).
6. Clan check + Save
7. Move the .cha file back to Coding/Audio_Annotation
  1. Replace the old version of the .cha file in Coding/Audio_Annotation
  2. Delete the one that you were working with in Processing/Audio_Files.
  3. No need to run the add_annotid, parseclan, wordmerge, etc. scripts at this time -- that will happen in the post-consensus step
Repeat for every mismatched code from the spreadsheet.

d) Post-consensus

For each of your assigned files:

Reliability issues doc
1. Copy-paste the .cha from Coding/Audio_Annotation into Processing/Audio_Files
2. Open the Reliability Check Coding Issues document. This is the doc that you used to write notes to yourself when you were doing the recodes. https://docs.google.com/document/d/1eKncqrDu5OXwDb559--ILXAdgZSzAMqUedcuV65hFvw/edit?usp=sharing
3. Check if the doc contains any notes about your file.
4. Implement those fixes directly in the .cha in Subject Files and then highlight the comment on the doc once it is taken care of.
Run the scripts and check the basic levels
1. Move the .cha file back to Coding/Audio_Annotation
  1. Replace the old version of the .cha file in Coding/Audio_Annotation
  2. Delete the one that you were working with in Processing/Audio_Files.
2. Ask Zhenya to add annotation IDs. See here.
3. Update the sparse_code.csv files

