# Production Checks

N.B. See [Child Productions](https://gitbook.bergelsonlab.com/data-pipeline/child-productions) for information on how we typically annotate child productions in SEEDLingS files.

## Production Checks

The two sets of scripts you'll need for this process are in [cellrel](https://github.com/SeedlingsBabylab/cellrel) and [datavyu\_scripts](https://github.com/SeedlingsBabylab/datavyu_scripts).

**MAKE SURE TO READ THE SECTION ON ANNOTATION CONVENTIONS AT THE BOTTOM**

### video

1. Place the original set of opf files (gathered from Subject\_Files) into a folder called **original\_opfs**
2. Pull out a new column with just CHI cells.
   * **batch\_getchild.rb** (in datavyu\_scripts)
   * requires setting input/output directories in the script.
     * *$input* - **original\_opfs** folder
     * *$output* - **full\_with\_chi\_col** folder
   * These are the files that will be fully checked for CHI/pho annotations. The getchild script places an empty %pho cell after each CHI cell in the new column, and adds a cell field recording the cell's original ordinal number in the column it was extracted from.
   * outputs will go in the **full\_with\_chi\_col** folder
3. After those files have been fully CHI/pho checked, we need to pull out 10% of each file's CHI/pho's and create new opf files with the annotations blanked out.
   * this is done with the **batch\_recode\_pho.rb** script (in datavyu\_scripts)
   * requires setting input/output directories in the script:
     * *$input\_dir* - the folder with the outputs from step #2 (**full\_with\_chi\_col**)
     * *$output\_dir* - where it will output blank versions of the files (the **reliability\_checks** folder)
     * *$original\_out* - where it will output non-blank versions of the selected 10% (the **orig\_10\_percent** folder)
4. The re-coders go through and annotate the files in the **reliability\_checks** folder. Each re-coder should only work on files they did not code during the original full check.
5. We batch csv export the pairs of **orig\_10\_percent** and **reliability\_checks** files with the **batch\_basic\_level.rb** script (in datavyu\_scripts). This script is run inside Datavyu, and its input/output variables must be set in the script itself before running.
   * *$input\_dir* - folder with the contents of **orig\_10\_percent** and **reliability\_checks**
   * *$output\_dir* - folder where the csv files will be dumped
6. These pairs of csv files (generated in the previous step) will be merged into a single spreadsheet containing the original and recoded annotations side by side for all the files. This is the spreadsheet that will be used to calculate reliability scores. The aggregated spreadsheet is generated with the **phochi\_batch\_bl.py** script (in cellrel). It takes 1 argument when run from the command line: the path to **$output\_dir** from step #5.
7. The **orig\_10\_percent** opf files will be combined with the **reliability\_checks** files with the **combine\_recode\_chi.rb** script (in datavyu\_scripts), to produce combined consensus opf files (ending in "\_consensus\_relia.opf", and output into the **converge\_out** folder). These files have 2 columns: one with the original cells and one with the recoded cells. Cells that disagree across original and recode will have "MISMATCH" as a code in the recode column. The original and reliability coders should sit down and reach a consensus about what the appropriate annotations should be, and make that final assignment in the "recode" column in that file.
8. Now we merge the consensus versions of the codes back into the original files with the **merge\_reliability\_chi.rb** script (in datavyu\_scripts). You need to set input/output variables for this script too.
   * *$origin\_in* - the **original\_opfs** folder
   * *$recode\_in* - the **converge\_out** folder
   * *$output* - **final\_out** folder
9. The files now in **final\_out** are the final versions of the opf files. They can be re-basicleveled/wordmerged and sent back to Subject\_Files.
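
The sampling in step 3 can be pictured with a short Python sketch. This is not the code in **batch\_recode\_pho.rb**; it only illustrates the idea of picking roughly 10% of the CHI/pho cells and producing a blanked copy for the re-coder. The dictionary layout (`ordinal`, `chi`, `pho`), the function names, and the rounding rule are all assumptions made for the example.

```python
import random


def sample_cells(cells, fraction=0.10, seed=None):
    """Pick a random subset of CHI/pho cells, keeping their original order.

    `cells` is a list of dicts (hypothetical layout, not the opf format).
    The sample size is rounded to the nearest whole cell, with a minimum
    of one; the real script may round differently.
    """
    if not cells:
        return []
    k = max(1, round(len(cells) * fraction))
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(len(cells)), k))
    return [cells[i] for i in picked]


def blank_copy(cells):
    """Return copies of the cells with the %pho annotation emptied."""
    return [{**cell, "pho": ""} for cell in cells]


if __name__ == "__main__":
    # Made-up cells standing in for one file's CHI/pho column.
    cells = [{"ordinal": i, "chi": f"word{i}", "pho": f"pho{i}"}
             for i in range(1, 21)]
    originals = sample_cells(cells, seed=1)   # would go to orig_10_percent
    blanks = blank_copy(originals)            # would go to reliability_checks
    print(len(originals), "cells sampled")
    print(blanks)
```

Keeping the `ordinal` field on each sampled cell is what would let the consensus codes be matched back to the right cell in the full file later on.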

### audio

In the `Production_checks/CLANfiles/XX_month` folder, there should be 4 folders:

* **full\_files**
* **reliability\_checks**
* **orig\_10\_percent**
* **spreadsheets**

**THESE FOLDERS MUST EXIST OR ELSE THE SCRIPTS WILL NOT WORK**
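
Since the scripts depend on these folders, it can help to check for them before running anything. The snippet below is a minimal sketch of such a check; `missing_folders` is a hypothetical helper, not part of the reliability repo, and the example path is a placeholder.

```python
from pathlib import Path

# The four folder names listed above.
REQUIRED_FOLDERS = (
    "full_files",
    "reliability_checks",
    "orig_10_percent",
    "spreadsheets",
)


def missing_folders(month_dir):
    """Return the required folder names that are absent from month_dir."""
    root = Path(month_dir)
    return [name for name in REQUIRED_FOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Placeholder path; point this at the real XX_month folder.
    missing = missing_folders("Production_checks/CLANfiles/XX_month")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("All required folders are present.")
```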

1. Fill the **full\_files** folder with the original .cha files (and their corresponding audios)
   * do the initial annotations on these
2. Generate the 10% sampled files
   * this is done with the `batch_sample_chi.py` script (from [reliability repo](https://github.com/SeedlingsBabylab/reliability))
   * it will read all the .cha files in the **full\_files** folder and sample 10% of the CHI annotations, outputting the blank/orig sampled cha's into **reliability\_checks** (for the blanks) and **orig\_10\_percent** (for the originals)
3. After annotating the blank reliability 10% cha's in reliability\_checks, run the `batch_compare_chi.py` script.
   * this will generate the comparison spreadsheet.
   * the script takes 1 argument - the path to the folder that contains the **full\_files**, **reliability\_checks**, **orig\_10\_percent** and **spreadsheets** folders.
   * it will output the spreadsheet into the spreadsheets folder
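
To make the comparison step concrete, here is a small Python sketch of a side-by-side comparison. It is not the logic of `batch_compare_chi.py`; it assumes the original and recoded %pho values have already been extracted into two equally long lists, and the column names are made up for the example.

```python
import csv
import io

FIELDNAMES = ["index", "original", "recode", "match"]


def compare_rows(original, recoded):
    """Pair original and recoded values and flag whether each pair agrees."""
    if len(original) != len(recoded):
        raise ValueError("original and recoded lists differ in length")
    return [
        {"index": i, "original": orig, "recode": rec, "match": orig == rec}
        for i, (orig, rec) in enumerate(zip(original, recoded), start=1)
    ]


def to_csv(rows):
    """Render the comparison rows as csv text (hypothetical column names)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up transcriptions for illustration only.
    rows = compare_rows(["ba", "da", "mama"], ["ba", "ga", "mama"])
    print(to_csv(rows))
```

Raising an error on a length mismatch is a deliberate choice here: silently pairing up lists of different lengths would shift every later comparison by one.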

### annotation conventions

#### audio

* Values in %pho lines must be separated by a single space, with exactly 1 phonetic transcription for each orthographic CHI annotation preceding them. The transcriptions must be in the same order as the CHI's they refer to, both within a single %pho line and across %pho lines. In other words, if you pulled every transcription out of all the %pho lines into one long list (in file order), and did the same for every \_CHI annotation, the two lists would line up one-to-one.
  * The number of CHI's and the number of pho's should be identical. If they're not, scripts will crash (if you're lucky) or return garbage output that won't be caught until much later.
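
A quick pre-check along these lines can catch count and spacing problems before the scripts run. The sketch below is only an illustration: it assumes that CHI annotations can be recognized as tokens ending in `_CHI` and that phonetic lines start with `%pho:`, and the sample lines are invented. Adjust both patterns to match the real .cha files.

```python
import re

# Assumption: a CHI annotation is any token ending in "_CHI".
CHI_TOKEN = re.compile(r"\S+_CHI\b")
# Assumption: phonetic transcription lines start with this prefix.
PHO_PREFIX = "%pho:"


def check_pho_alignment(lines):
    """Return a list of problems found in the CHI / %pho bookkeeping."""
    chi_count = 0
    pho_count = 0
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(PHO_PREFIX):
            body = line[len(PHO_PREFIX):].strip()
            values = body.split(" ") if body else []
            # An empty string here means two spaces in a row.
            if "" in values:
                problems.append(
                    f"line {number}: %pho values not separated by a single space")
            pho_count += sum(1 for value in values if value)
        else:
            chi_count += len(CHI_TOKEN.findall(line))
    if chi_count != pho_count:
        problems.append(
            f"{chi_count} CHI annotations but {pho_count} %pho values")
    return problems


if __name__ == "__main__":
    # Invented sample lines; real files may look different.
    sample = [
        "*CHI:\tball &=d_y_CHI dog &=n_n_CHI .",
        "%pho:\tba",
    ]
    for problem in check_pho_alignment(sample):
        print(problem)
```

This only checks counts and spacing. It cannot tell whether the transcriptions are in the right order, so that still has to be verified by eye.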

#### video

* Each %pho should be a point cell (identical onset and offset) that matches up with the offset of the corresponding CHI cell.
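
The point-cell rule can be expressed as a small check. The sketch below uses a hypothetical `Cell` class with onset and offset times in milliseconds; it does not read opf files or use the Datavyu API, it only shows the two conditions a %pho cell has to satisfy.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for a Datavyu cell (times in milliseconds)."""
    onset: int
    offset: int
    code: str = ""


def pho_cell_problems(chi, pho):
    """Return a list of ways the %pho cell breaks the convention."""
    problems = []
    if pho.onset != pho.offset:
        problems.append("%pho is not a point cell (onset differs from offset)")
    if pho.onset != chi.offset:
        problems.append("%pho does not sit at the offset of its CHI cell")
    return problems


if __name__ == "__main__":
    # Made-up timestamps for illustration.
    chi = Cell(onset=1000, offset=1500, code="ball")
    print(pho_cell_problems(chi, Cell(onset=1500, offset=1500)))
    print(pho_cell_problems(chi, Cell(onset=1000, offset=1500)))
```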
