Future Directions/Project Ideas

This page collects preliminary notes on the overall project status, along with future ideas and starting points for building on the OvS work.

Project Status

  • Naturalistic study/corpus annotation: please refer to the coding_scheme spreadsheet for the most up-to-date annotation progress

    • As of December 2025: All 22 participants have at least 2 early infancy files completely annotated (drawn from the 6;0~9;0, 10;0~13;0, and 14;0~17;0 age ranges). For all participants (n = 12) who have a corresponding follow-up file (SF5), annotation is complete, including the CHI tier.

  • Experimental study:

    • As of December 2025: still preliminary, with some progress on stimuli (sample scripts and videos created) and questionnaires (Overheard Speech Questionnaire)

Note: checking in with Jasenia about specific progress on the experimental study

Preliminary Future Direction Ideas

The points below summarize preliminary project ideas that could be used for a senior thesis or Bath intern project, based on the current project status (as of December 2025).

Naturalistic study/corpus annotation

  1. XDS/CDS reliability coding for all 3 early infancy files + follow-up file (n = 69 files)

  2. Resampling & Validation:

The current sampling method is outlined here.

  • Hi-volubility (hi-vol) annotations (n=22)

    • Annotate 5 two-minute hi-vol clips to examine whether the nature of children's language input (e.g. how many speakers are present, how many minutes of adult vs. child speech are heard, and how much speech each talker directs to the target child/another child/an adult) and the 4 language measures are consistent with the current data collected for the full 30 minutes (15 random clips x 2 minutes)

    • Note: Some files originate from VIHI, which already includes hi-vol transcriptions

  • 3 different early infancy timepoints (n=22)

    • Select 3 pre-determined early infancy timepoints that differ from the original 3 timepoints (refer to the "table_sampling_approach" tab in the coding_scheme spreadsheet for the original 3 timepoints selected for each subject)

    • Calculate the 4 language measures and validate them against the current data from the originally selected 3 early infancy timepoints

    Note (delete when confirming with Jasenia): ask how the specific month was selected for each of the 3 early infancy categories (e.g. why month 16 instead of month 14 for subject #8?)

  • Using TalkBank or another pre-existing corpus, extract 15 two-minute clips to examine whether the 4 language measures (MLU, unique tokens, syntactic complexity, and decontextualized language usage) are consistent with the current data

  • Re-calculate the 4 language measures (specifically MLU and unique tokens) per speaker for each unique 2-minute clip across all 15 clips, instead of calculating these measures broadly across the total 30-minute period (15 random clips x 2 minutes)

  • Run VTC to (1) compute the total number of minutes each speaker talks during the whole 16 hours, (2) calculate the average speaking time in one clip for each speaker, and (3) validate the results against the current data
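
A minimal sketch of steps (1) and (2) of the VTC item above, assuming the VTC output is available as RTTM-style lines (whitespace-separated, with turn onset and duration in seconds in the 4th and 5th fields and the label in the 8th). The function names, clip onsets, and sample lines are hypothetical and only for illustration. Totals are keyed by whatever label the classifier emits, which may be a voice-type class rather than an individual speaker.

```python
from collections import defaultdict


def parse_turns(rttm_lines):
    """Yield (onset_s, duration_s, label) for each SPEAKER line (assumed RTTM layout)."""
    for line in rttm_lines:
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue  # skip blank or non-turn lines
        yield float(fields[3]), float(fields[4]), fields[7]


def total_minutes(rttm_lines):
    """Step (1): total minutes of speech per label over the whole recording."""
    totals = defaultdict(float)
    for _onset, duration, label in parse_turns(rttm_lines):
        totals[label] += duration / 60.0
    return dict(totals)


def mean_minutes_per_clip(rttm_lines, clip_onsets_s, clip_len_s=120.0):
    """Step (2): average minutes of speech per label within each sampled clip.

    Only the part of a turn that overlaps a clip is counted.
    """
    seconds = defaultdict(float)
    for onset, duration, label in parse_turns(rttm_lines):
        for clip_start in clip_onsets_s:
            overlap = min(onset + duration, clip_start + clip_len_s) - max(onset, clip_start)
            if overlap > 0:
                seconds[label] += overlap
    n_clips = len(clip_onsets_s)
    return {label: s / 60.0 / n_clips for label, s in seconds.items()}


# Hypothetical sample lines, not real project data
sample = [
    "SPEAKER rec1 1 0.0 60.0 <NA> <NA> FEM <NA> <NA>",
    "SPEAKER rec1 1 100.0 40.0 <NA> <NA> KCHI <NA> <NA>",
    "SPEAKER rec1 1 300.0 30.0 <NA> <NA> FEM <NA> <NA>",
]
print(total_minutes(sample))
print(mean_minutes_per_clip(sample, clip_onsets_s=[0.0, 240.0]))
```

Step (3) would then compare these per-label values with the speaking times derived from the manual annotations for the same clips.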

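The per-clip re-calculation of MLU and unique tokens proposed above could be sketched as follows. This is an illustration under stated assumptions, not the project's actual pipeline: utterances are given as hypothetical (clip_id, speaker, text) tuples, tokens are split on whitespace, and MLU is counted in words rather than morphemes.

```python
from collections import defaultdict


def per_clip_measures(utterances):
    """Compute MLU (in words) and unique token count per (clip, speaker).

    utterances: iterable of (clip_id, speaker, text) tuples (assumed input shape).
    """
    utts_by_key = defaultdict(list)
    for clip_id, speaker, text in utterances:
        tokens = text.lower().split()  # assumption: simple whitespace tokenization
        if tokens:  # ignore empty utterances
            utts_by_key[(clip_id, speaker)].append(tokens)

    results = {}
    for key, utts in utts_by_key.items():
        all_tokens = [tok for utt in utts for tok in utt]
        results[key] = {
            "n_utts": len(utts),
            "mlu_words": len(all_tokens) / len(utts),
            "unique_tokens": len(set(all_tokens)),
        }
    return results


# Hypothetical example data, not taken from the corpus
example = [
    (1, "MOT", "look at the dog"),
    (1, "MOT", "the dog"),
    (1, "FAT", "where is it"),
    (2, "MOT", "more milk"),
]
for key, measures in sorted(per_clip_measures(example).items()):
    print(key, measures)
```

Keeping one row per clip and speaker makes it possible to look at clip-to-clip variability, which the pooled 30-minute values cannot show.
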
Experimental study

  1. Stimuli creation:

    1. Videos & Audio starting point: samples stored on NSF Grant SharePoint (Michika has access)

    2. PsychoPy:

      1. Add finalized videos and audios

      2. Insert red box for final answer selection

    3. Stimuli integration with eye tracker (if in-person) or habit2 (if online)

  2. Admin:

    1. Modify IRB protocol #2337 (Sound and Meaning) with finalized stimuli, appropriate forms (e.g. consent form), study procedures, etc.

      1. Figure out whether the format will be in-person or online, and amend the protocol as necessary
