Future Directions/Project Ideas
This page summarizes the general project status and collects preliminary ideas and starting points for how we can make use of the OvS work.
Project Status
Naturalistic study/corpus annotation: please refer to the coding_scheme spreadsheet for the most up-to-date annotation progress
As of December 2025: All 22 participants have at least 2 early infancy files completely annotated (a combination of 6;0~9;0, 10;0~13;0, and 14;0~17;0). All participants with a corresponding follow-up file (SF5; n = 12) are fully annotated, including the CHI tier.
Experimental study:
As of December 2025: still preliminary, with some progress on stimuli (sample scripts and videos created) and questionnaires (Overheard Speech Questionnaire)
Preliminary Future Direction Ideas
The points below summarize preliminary project ideas that could be used for a senior thesis or Bath intern project, based on the current project status (as of December 2025)
Naturalistic study/corpus annotation
XDS/CDS reliability coding for all 3 early infancy + follow-up files (n = 69 files)
Resampling & Validation:
The current sampling method is outlined here.
Hi-volubility (hi-vol) annotations (n=22)
Annotate 5 two-minute hi-vol clips to examine whether the nature of children's language input (e.g. how many speakers are present, how many minutes of adult vs. child speech are heard, how much speech each talker directs to the target child/another child/an adult) and the 4 language measures are consistent with the current data collected for the full 30 minutes (15 random clips x 2 minutes)
Note: Some files originate from VIHI, which includes pre-existing hi-vol transcriptions
3 different early infancy timepoints (n=22)
Select 3 pre-determined early infancy timepoints that differ from the original 3 timepoints (refer to the "table_sampling_approach" tab in the coding_scheme spreadsheet for the original 3 timepoints selected for each subject)
Calculate the 4 language measures and validate them against the current data from the originally selected 3 early infancy timepoints
Using TalkBank or another pre-existing corpus, extract 15 two-minute clips to examine whether the 4 language measures (MLU, unique tokens, syntactic complexity, and decontextualized language usage) are consistent with the current data
Re-calculate the 4 language measures (specifically MLU and unique tokens) for each speaker in each unique 2-minute clip, for all 15 clips, instead of broadly calculating these measures across the total 30-minute period (15 random clips x 2 minutes)
Run VTC to (1) compute the total number of minutes each speaker is talking during the whole 16 hours, (2) calculate the average speaking time in one clip for each speaker, and (3) validate with current data
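As a rough starting point for the last two ideas above, the sketch below shows (a) a per-speaker, per-clip recalculation of MLU and unique tokens, and (b) summing speaking time per speaker from diarization output. It rests on several assumptions that are not part of the current pipeline: MLU is counted in words rather than morphemes, tokens are split on whitespace (real transcripts would need annotation codes stripped first), the function names and the speaker codes used in the usage note are hypothetical, and the VTC output is assumed to be RTTM-style lines with duration in the fifth field and speaker label in the eighth.

```python
from collections import defaultdict


def per_clip_measures(utterances):
    """Compute word-based MLU and unique-token counts per (clip, speaker).

    `utterances` is an iterable of (clip_id, speaker, text) tuples.
    Assumption: whitespace tokenization and word-based MLU are
    simplifications, not the project's actual coding scheme.
    """
    grouped = defaultdict(list)
    for clip_id, speaker, text in utterances:
        tokens = text.lower().split()
        if tokens:  # skip empty utterances
            grouped[(clip_id, speaker)].append(tokens)

    results = {}
    for key, utts in grouped.items():
        all_tokens = [tok for utt in utts for tok in utt]
        results[key] = {
            "mlu_words": len(all_tokens) / len(utts),
            "unique_tokens": len(set(all_tokens)),
        }
    return results


def speaking_minutes(rttm_lines):
    """Sum segment durations (in minutes) per speaker label.

    Assumption: each line is RTTM-style, e.g.
    'SPEAKER <file> 1 <onset_s> <duration_s> <NA> <NA> <label> <NA> <NA>'.
    Lines that do not match this shape are skipped.
    """
    totals = defaultdict(float)
    for line in rttm_lines:
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        totals[fields[7]] += float(fields[4])
    return {label: seconds / 60 for label, seconds in totals.items()}
```

To use it, pass the transcribed utterances as tuples such as `(1, "MOT", "look at the ball")` to `per_clip_measures`, and the lines of each VTC output file to `speaking_minutes`. Dividing each speaker's total by the number of clips would give the average speaking time per clip, and both results can then be compared with the measures currently computed over the pooled 30 minutes.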
Experimental study
Stimuli creation:
Videos & Audio starting point: samples stored on NSF Grant SharePoint (Michika has access)
PsychoPy:
Add finalized videos and audios
Insert red box for final answer selection
Stimuli integration with eyetracker (if in-person) or habit2 (online)
Admin:
Modify IRB protocol #2337 (Sound and Meaning) with finalized stimuli, appropriate forms (e.g. consent form), study procedures, etc.
Figure out whether the format will be in-person or online, and amend the protocol as necessary