Production Checks

MAKE SURE TO READ THE SECTION ON ANNOTATION CONVENTIONS AT THE BOTTOM

N.B. See Child Productions for information on how we typically annotate child productions in SEEDLingS files.

This process requires two sets of scripts: those in cellrel and those in datavyu_scripts.

video

Place the original set of opf files (gathered from Subject_Files) into a folder called original_opfs
Pull out a new column with just CHI cells.
- batch_getchild.rb (in datavyu_scripts)
- requires setting input/output directories in the script.
  - $input - original_opfs folder
  - $output - full_with_chi_col folder
- These are the files that will be fully checked for CHI/pho annotations. The getchild script places an empty %pho cell after each CHI cell in the new column, and adds a new cell field recording the cell's original ordinal number in the column it was extracted from.
- outputs will go in the full_with_chi_col folder
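The layout of the new column can be pictured with a short Python sketch. This is not batch_getchild.rb itself (that script is Ruby and runs against Datavyu); the dictionary keys, the function name `extract_chi_column`, and the "%pho:" placeholder text are all hypothetical. Placing the empty %pho as a point cell at the CHI offset follows the video convention at the bottom of this page and is an assumption about the script's behavior.

```python
def extract_chi_column(cells):
    """Return a new column: each CHI cell followed by an empty %pho cell.

    `cells` is a list of dicts with hypothetical keys:
    ordinal, speaker, text, onset, offset.
    """
    new_column = []
    for cell in cells:
        if cell["speaker"] != "CHI":
            continue
        # Copy the CHI cell and remember where it sat in the source column.
        new_column.append({
            "text": cell["text"],
            "onset": cell["onset"],
            "offset": cell["offset"],
            "original_ordinal": cell["ordinal"],
        })
        # Empty %pho placeholder for the checker to fill in
        # (assumed to be a point cell at the CHI cell's offset).
        new_column.append({
            "text": "%pho:",
            "onset": cell["offset"],
            "offset": cell["offset"],
            "original_ordinal": cell["ordinal"],
        })
    return new_column


# Made-up example cells (times in milliseconds).
cells = [
    {"ordinal": 1, "speaker": "MOT", "text": "ball", "onset": 1000, "offset": 1500},
    {"ordinal": 2, "speaker": "CHI", "text": "ba", "onset": 2000, "offset": 2400},
]
column = extract_chi_column(cells)
```

Here `column` holds the one CHI cell followed by its empty %pho cell, both tagged with the original ordinal 2.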
After those files have been fully CHI/pho checked, we need to pull out 10% of each file's CHI/pho's and create new opf files with the annotations blanked out.
- with the batch_recode_pho.rb script (in datavyu_scripts)
- requires setting input/output directories in the script:
  - $input_dir - the folder with the outputs from step #1 (full_with_chi_col)
  - $output_dir - where it will output blank versions of the files (the reliability_checks folder)
  - $original_out - where it will output non-blank versions of the selected 10% (the orig_10_percent folder)
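As a rough illustration of the sampling idea, here is a minimal Python sketch. It is not the batch_recode_pho.rb script: the function name `sample_ten_percent`, the (CHI, pho) tuple representation, and the rounding rule are all assumptions made for the example.

```python
import math
import random


def sample_ten_percent(pairs, fraction=0.10, seed=None):
    """Pick a random subset of (CHI, pho) pairs for reliability recoding.

    Returns (originals, blanks): the selected pairs as coded, and copies
    with the pho annotation emptied for the re-coder to fill in.
    """
    if not pairs:
        return [], []
    rng = random.Random(seed)
    # Rounding up with a floor of 1 is an assumption, not the script's rule.
    n = max(1, math.ceil(len(pairs) * fraction))
    # Sort the chosen indices so the sample keeps file order.
    chosen = sorted(rng.sample(range(len(pairs)), n))
    originals = [pairs[i] for i in chosen]
    blanks = [(chi, "") for chi, _pho in originals]
    return originals, blanks


# Made-up annotations standing in for one file's CHI/pho pairs.
pairs = [(f"word{i}", f"pho{i}") for i in range(25)]
originals, blanks = sample_ten_percent(pairs, seed=0)
```

The `originals` list corresponds to what goes into orig_10_percent, and `blanks` to what goes into reliability_checks. Keeping both in the same order is what lets the two versions be paired up again later.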
The re-coders go through and annotate the files (which they didn't originally code in step #1) in the reliability_checks folder.
We batch csv export the pairs of orig_10_percent and reliability_checks files with the batch_basic_level.rb script (in datavyu_scripts). This script is run inside Datavyu, and its input/output variables must be set in the script itself before running.
- $input_dir - folder with the contents of orig_10_percent and reliability_checks
- $output_dir - folder where the csv files will be dumped
These pairs of csv files (generated in the previous step) will be merged into a single spreadsheet containing the original and recoded annotations side by side for all the files. This is the spreadsheet that will be used to calculate reliability scores. The aggregated spreadsheet is generated with the phochi_batch_bl.py script (in cellrel). It takes 1 argument when run from the command line: the path to the $output_dir from step #4 (the csv export in the previous step).
The orig_10_percent opf files will be combined with the reliability check files with the combine_recode_chi.rb script (in datavyu_scripts), to produce combined consensus opf files (ending in "_consensus_relia.opf", and output into the converge_out folder). These files have 2 columns: one with the original cells and one with the recoded cells. Cells that disagree between original and recode will have "MISMATCH" as a code in the recode column. The original and reliability coders should sit down together, reach a consensus about what the appropriate annotation should be, and make that final assignment in the "recode" column of that file.
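The mismatch flagging can be sketched in a few lines of Python. This is only an illustration, not combine_recode_chi.rb: the function name `build_consensus_rows` is hypothetical, and it assumes that a disagreeing recode value is replaced by "MISMATCH" and that leading and trailing whitespace is ignored when comparing.

```python
MISMATCH = "MISMATCH"


def build_consensus_rows(original, recode):
    """Pair original and recoded annotations, flagging disagreements.

    Both arguments are lists of annotation strings in the same order.
    Returns a list of (original, recode_column_value) tuples.
    """
    if len(original) != len(recode):
        raise ValueError("original and recode must have the same number of cells")
    rows = []
    for orig, rec in zip(original, recode):
        # Ignoring surrounding whitespace is an assumption of this sketch.
        agreed = orig.strip() == rec.strip()
        rows.append((orig, rec if agreed else MISMATCH))
    return rows


# Made-up transcriptions for three sampled cells.
rows = build_consensus_rows(["ba", "dada", "mo"], ["ba", "tata", "mo "])
```

Only the second row is flagged here. That is the kind of row the two coders would discuss and then overwrite with the agreed annotation.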
Now we merge the consensus versions of the codes back into the original files with the merge_reliability_chi.rb script (in datavyu_scripts). You need to set input/output variables for this script too.
- $origin_in - the original_opfs folder
- $recode_in - the converge_out folder
- $output - final_out folder
The files that are now in final_out are the final versions of the opf files. They can be re-basicleveled/wordmerged and sent back to Subject_Files.

audio

In the Production_checks/CLANfiles/XX_month folder, there should be 4 folders:

full_files
reliability_checks
orig_10_percent
spreadsheets

THESE FOLDERS MUST EXIST OR ELSE THE SCRIPTS WILL NOT WORK
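A quick pre-flight check can confirm the folders are in place before running anything. The sketch below is an assumption-laden helper, not part of the reliability repo: the function name `missing_folders` is hypothetical, and the temporary directory only stands in for a real XX_month folder.

```python
import tempfile
from pathlib import Path

REQUIRED_FOLDERS = (
    "full_files",
    "reliability_checks",
    "orig_10_percent",
    "spreadsheets",
)


def missing_folders(month_dir):
    """Return the names of required folders that do not exist in month_dir."""
    root = Path(month_dir)
    return [name for name in REQUIRED_FOLDERS if not (root / name).is_dir()]


# Demonstration on a throwaway directory that only has full_files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "full_files").mkdir()
    missing = missing_folders(tmp)
```

An empty list means all 4 folders are present. Anything else names the folders that need to be created first.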

Fill the full_files folder with the original .cha files (and their corresponding audios)
- do the initial annotations on these
Generate the 10% sampled files
- this is done with the batch_sample_chi.py script (from reliability repo)
- it will read all the .cha files in the full_files folder and sample 10% of the CHI annotations, outputting the blank/orig sampled cha's into reliability_checks (for the blanks) and orig_10_percent (for the originals)
After annotating the blank reliability 10% cha's in reliability_checks, run the batch_compare_chi.py script.
- this will generate the comparison spreadsheet.
- the script takes 1 argument: the path to the folder that contains the full_files, reliability_checks, orig_10_percent, and spreadsheets folders.
- it will output the spreadsheet into the spreadsheets folder

annotation conventions

audio

Values in %pho lines must be separated by a single space, with exactly 1 phonetic transcription for each orthographic CHI annotation preceding them. The transcriptions must be ordered identically to the CHI's they refer to, both within a single %pho line and across %pho lines. In other words, if you pulled every individual transcription out of the %pho lines into one long list (in the order they appear in the file), and did the same for the _CHI annotations, the two lists would line up item for item.
- The # of CHI's and # of pho's must be identical. If they're not, scripts will crash (if you're lucky) or return garbage output that won't be caught until much later.
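A count check along these lines can catch a mismatch before the scripts are run. This is a minimal sketch under loud assumptions: it assumes CHI annotations look like `word &=n_y_CHI` and that transcriptions sit on lines starting with `%pho:`. The regular expression, the function name `count_chi_and_pho`, and the sample lines are all made up for illustration and may need adjusting to the real files.

```python
import re

# Assumed annotation shape: "&=" + one letter + "_" + one letter + "_CHI".
CHI_ANNOTATION = re.compile(r"&=[a-z]_[a-z]_CHI")


def count_chi_and_pho(lines):
    """Return (number of CHI annotations, number of %pho transcriptions)."""
    chi_count = 0
    pho_count = 0
    for line in lines:
        if line.startswith("%pho:"):
            # Each whitespace-separated token is one transcription.
            pho_count += len(line[len("%pho:"):].split())
        else:
            chi_count += len(CHI_ANNOTATION.findall(line))
    return chi_count, pho_count


# Made-up .cha-style lines.
lines = [
    "*CHI:\tball &=n_y_CHI dog &=n_n_CHI .",
    "%pho:\tbɑ dɔ",
    "*MOT:\tball &=d_y_MOT .",
    "*CHI:\tmore &=r_y_CHI .",
    "%pho:\tmɔ",
]
chi_count, pho_count = count_chi_and_pho(lines)
```

If the two counts differ for a file, fix the annotations before sampling or comparing. Matching counts do not prove the ordering is right, only that nothing is missing or extra.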

video

Each %pho should be a point cell (identical onset and offset) that matches up with the offset of the corresponding CHI cell.
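The point-cell rule can be expressed as a small check. The sketch below is illustrative only: cells are plain dictionaries with hypothetical `onset` and `offset` keys (in milliseconds), not Datavyu objects, and the function name `pho_cell_problems` is made up.

```python
def pho_cell_problems(chi_cell, pho_cell):
    """Return a list of convention violations for one CHI/%pho cell pair."""
    problems = []
    if pho_cell["onset"] != pho_cell["offset"]:
        problems.append("%pho is not a point cell (onset != offset)")
    if pho_cell["onset"] != chi_cell["offset"]:
        problems.append("%pho does not line up with the CHI cell's offset")
    return problems


# Made-up cells: one correct pair and one %pho with a stretched offset.
chi = {"onset": 2000, "offset": 2400}
good_pho = {"onset": 2400, "offset": 2400}
bad_pho = {"onset": 2400, "offset": 2500}

good_result = pho_cell_problems(chi, good_pho)
bad_result = pho_cell_problems(chi, bad_pho)
```

An empty list means the pair follows the convention. The second pair is reported as not being a point cell.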

