Naturalistic study

This page describes the sampling, processing, and data management procedures for the SEEDLingS and Overheard Speech (OvS) projects, covering both Study A (6–17 mo.) and Study B (follow-up at 54 mo.).

Introduction

The naturalistic study of overheard speech has two goals:

1) To characterize the speech children hear and overhear in their daily lives.

2) To test the relationship between children's speech input and their language development.

There are two projects within this study. Study A addresses Goal 1, while Study B addresses Goal 2. For more information, read Objective 1 in the Project Description.

Sampling approach

Early Infancy Files (Study A)

The naturalistic overheard speech studies used longitudinal data from the SEEDLingS corpus. This corpus consists of monthly daylong recordings (16 hrs each) of 44 infants in their homes, collected from age 6 to 17 months.

From the 44 infants, we selected 19 infants who had at least one sibling in their household. This increased the likelihood of sampling speech to both the target child and other children in the household.

To sample the age distribution of SEEDLingS as uniformly as possible, we selected three recordings per child, one from each of the following age windows: 6-9 months, 10-13 months, and 14-17 months. To see which age recordings were selected for each SEEDLingS child, refer to the 'table_sampling_approach' tab in coding_orders.xlsx.
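The windowed selection can be sketched as follows. This is a minimal illustration, not the procedure actually used: the text above does not say how a recording was chosen within each window, so the random choice, the function name `pick_one_recording_per_window`, and the example child are all assumptions.

```python
import random

# Age windows (in months, inclusive) named in the text above.
AGE_WINDOWS = [(6, 9), (10, 13), (14, 17)]


def pick_one_recording_per_window(available_months, rng):
    """Return one recording month per age window, in window order.

    available_months: months (6-17) at which a child has a daylong recording.
    rng: a random.Random instance, so the selection is reproducible.
    Choosing at random within a window is an assumption of this sketch.
    """
    selected = []
    for low, high in AGE_WINDOWS:
        candidates = sorted(m for m in available_months if low <= m <= high)
        if not candidates:
            raise ValueError(f"no recording available in window {low}-{high} months")
        selected.append(rng.choice(candidates))
    return selected


# Hypothetical child with a recording at every month from 6 to 17.
rng = random.Random(0)
months = pick_one_recording_per_window(range(6, 18), rng)
print(months)  # three months, one per window
```

The recordings actually selected for each child are the ones listed in the spreadsheet; the sketch only shows the shape of the rule.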

Some infants had fully transcribed and annotated recordings from previous lab studies. To reduce annotation time, we incorporated these existing files when possible. For these files (as for all files in the overheard speech project), we added tiers as needed (e.g., the cds tier, which specifies which child was being addressed when the xds tier indicates child-directed speech (C)).

To see which files already existed, refer to the 'coding_order' tab in coding_order.xlsx.

Follow-Up/SF5 Files (Study B)

Study B uses a subset of 12 infants from Study A. These 12 infants contributed a daylong recording at 54 months for a SEEDLingS follow-up study. These children also completed the Clinical Evaluation of Language Fundamentals (CELF) test, a comprehensive assessment that measures general language abilities in children ages 3-6 years. The CELF includes a Core Language score (CLS), which evaluates children's lexical, syntactic, and morphological abilities.

Note: Although 12 infants contributed daylong recordings at 54 months, 16 infants in the overheard-speech group completed the CELF assessment. Therefore, when testing the relationship between early language input (at 6–17 months) and later language outcomes (at 54 months), the sample size can be increased to 16. However, when examining the relationship between input at 54 months and language outcomes at the same age, the sample size will remain 12.
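The two sample sizes in the note follow from which data each analysis requires. The sketch below shows that logic with placeholder subject IDs; the IDs, the function name `analysis_sample`, and the labels "early" and "concurrent" are hypothetical and stand in for the real subject lists.

```python
def analysis_sample(celf_ids, recording_54mo_ids, input_age):
    """Return the subject IDs usable for an input-to-CELF analysis.

    input_age: "early" for input at 6-17 months,
               "concurrent" for input at 54 months.
    Assumes every ID passed in belongs to the overheard-speech group,
    so every subject has early (6-17 month) recordings.
    """
    celf_ids = set(celf_ids)
    if input_age == "early":
        # Only a CELF score is needed in addition to the early recordings.
        return celf_ids
    if input_age == "concurrent":
        # Needs both a CELF score and a 54-month daylong recording.
        return celf_ids & set(recording_54mo_ids)
    raise ValueError(f"unknown input_age: {input_age!r}")


# Placeholder IDs only: 16 infants with CELF, 12 of whom have a 54-month recording.
celf = range(1, 17)
recorded_at_54 = range(1, 13)

print(len(analysis_sample(celf, recorded_at_54, "early")))       # 16
print(len(analysis_sample(celf, recorded_at_54, "concurrent")))  # 12
```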

How the files were generated

For existing files (e.g., VIHI or ACLEW), I manually copied the .eaf file into the corresponding folder, placed it in its respective OvS subfolder within the eaf directory, and renamed it according to the established naming system for .eaf files and subfolders (see the OvS repository for details on the naming conventions of .eaf files and their contents).

For new files, I provided Zhenya Kalenkovich (former BLAB IT) with a list of recordings that required .eaf files (for the complete list, see list_seedlings_files_ovs.xlsx in the Aim 1 folder of the NSF SBE Fellowship SharePoint folder). The list included the SEEDLingS subject number, Age Folder, and duration (in minutes) of the audio .wav file. Zhenya then generated an .eaf file for each home recording using the blabpy function (see the Generating .eaf Files page for more details). For each new file, fifteen 2-minute segments were randomly selected. Each file is located within its corresponding subfolder in the eaf folder (see the OvS repository for details on the naming conventions of .eaf files and their contents).
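To make the segment selection concrete, here is a minimal sketch of one way to draw fifteen random 2-minute segments from a recording of known duration. It is not the blabpy implementation, which is documented on the Generating .eaf Files page. The function name `sample_segments`, the alignment of segments to a 2-minute grid, and the no-overlap guarantee are assumptions made for this illustration.

```python
import random


def sample_segments(duration_min, n_segments=15, segment_min=2, seed=None):
    """Pick non-overlapping, equal-length segments from a recording.

    The recording is divided into consecutive slots of `segment_min` minutes,
    and `n_segments` slots are drawn without replacement.
    Returns a sorted list of (onset_ms, offset_ms) tuples.
    """
    n_slots = int(duration_min) // segment_min
    if n_slots < n_segments:
        raise ValueError("recording is too short for the requested segments")
    rng = random.Random(seed)
    slots = sorted(rng.sample(range(n_slots), n_segments))
    segment_ms = segment_min * 60 * 1000
    return [(slot * segment_ms, (slot + 1) * segment_ms) for slot in slots]


# A 16-hour recording is 960 minutes long.
segments = sample_segments(duration_min=960, seed=1)
for onset_ms, offset_ms in segments[:3]:
    print(onset_ms, offset_ms)
```

Drawing slots from a fixed grid keeps segments from overlapping without any rejection loop. The times are in milliseconds because ELAN annotations are timed in milliseconds.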

Data Types and Locations

  1. CELF

    • all_vocab.csv: This comma-delimited file contains several language assessment scores of SEEDLingS infants, including their standardized score on the CELF test. The csv is located in this repository.

    • raw_scores.csv: This password-protected csv file contains the raw scores of each SEEDLingS infant who completed the CELF test at follow-up.

    • CELF test material and scoring booklets: The test material and scoring booklets are stored on the shelves in the RA room in William J. Hall (WJH).

Note: Both all_vocab.csv and raw_scores.csv include SEEDLingS infants who are not part of the OvS sample, since inclusion in the Overheard Speech study required infants to have a sibling.
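Because these files cover more infants than the OvS sample, analyses need to filter them down first. The sketch below shows one way to do that with the standard library. The column names, subject IDs, and score values are made-up placeholders, not the real contents of all_vocab.csv.

```python
import csv
import io

# Made-up stand-in for all_vocab.csv; real column names and values will differ.
ALL_VOCAB_TEXT = """subject_id,celf_standard_score
01,101
02,95
03,110
"""

# Hypothetical set of OvS subject IDs (infants with at least one sibling).
OVS_SUBJECTS = {"01", "03"}


def filter_to_ovs(csv_file, ovs_subjects, id_column="subject_id"):
    """Return only the rows whose subject ID belongs to the OvS sample."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row[id_column] in ovs_subjects]


# With the real file, pass open("all_vocab.csv", newline="") instead of StringIO.
ovs_rows = filter_to_ovs(io.StringIO(ALL_VOCAB_TEXT), OVS_SUBJECTS)
print([row["subject_id"] for row in ovs_rows])  # ['01', '03']
```

IDs are compared as strings so that leading zeros in subject numbers are preserved.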

  2. Daylong recordings at 6 to 17 months and at 54 months - for a given age in months (e.g., 6 months), each SEEDLingS infant has a corresponding daylong recording at that month.

    • wav file: This is the audio file extracted from LENA. We attach this file to the eaf file in order to hear the speech in the child's environment.

    • eaf file: This is the file used to annotate and transcribe a recording in ELAN.

  3. Demographics Data and Household Size - this is a password-protected csv file that contains the household size for each child and the ages of the child's siblings.

  4. Speaker Codes - lists all the speakers found in the 6 to 17 month daylong recordings. There is one document per infant. This document is useful to consult before annotating to get a sense of who might appear in the recording.

    • For more information about the speaker coding system, see the Speaker Codes page.

  5. Cast of Characters - another list of individuals in the 6 to 17 month daylong recordings. There is one document per infant. This document is sometimes used alongside the Speaker Codes document before annotating the files to get a sense of who might appear in the recording.

  6. Coding issues - this is a document written by a coder (typically an RA) while transcribing and annotating a file. It records any issues they encountered during transcription, such as difficulty identifying who is speaking within a 2-minute clip.
