SEEDLingS - Nouns
Work with the SEEDLingS - Nouns dataset
SEEDLingS corpus data collection included monthly day-long audio recordings (~16 h) and hour-long video recordings, all made at home. 44 children were recorded every month from 6 to 17 months of age, which gives 44 x 12 = 528 audio recordings and the same number of video recordings. One video and one audio recording didn't make it into the corpus, so there are 527 audio and 527 video files in total.
SEEDLingS - Nouns refers to both:
the annotation effort by the Bergelson Lab to annotate nouns in these files, and
the resulting dataset in the form of a set of CSV tables.
This page describes working with the dataset with only a brief description of how it came to be. The following resources contain additional information:
SEEDLingS Corpus Companion contains details about both the dataset and the annotation process.
The README in the repository containing the dataset has a shorter version of what's on this page, plus a few extra details.
The SEEDLingS - Nouns dataset contains all the noun annotations together with additional information about the recordings and the annotation process, all in text-only CSV files. See What's inside? for details about the files' contents.
The dataset doesn't contain audio or video files. Only nouns have been annotated at the corpus-wide level; the files haven't been transcribed. The exception to both of these statements is the annotations made as part of the ACLEW project. For each recording in the ACLEW sample of 44, 30 two-minute-long segments were sampled and transcribed. The transcriptions and the audio recordings are available through HomeBank.
Each audio and video file was at least partially annotated for concrete imageable nouns. Here is how much time was annotated in different files:
Video files were annotated in full.
Audio files:
Files from months 6 and 7 were annotated in full.
Files from months 8-13 have 4 hours annotated.
Files from months 14-17 have 3 hours annotated.
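The intended amounts for audio files can be written as a small lookup. This is only a sketch: the function name `intended_audio_hours` is made up for illustration and is not part of the dataset or its scripts, and `None` is used here to mean "annotated in full".

```python
def intended_audio_hours(month):
    """Return the intended number of annotated hours for an audio file.

    Hypothetical helper, not part of the dataset or its scripts.
    Returns None when the file was meant to be annotated in full.
    """
    if not 6 <= month <= 17:
        raise ValueError("recordings were made at months 6 through 17")
    if month <= 7:
        return None  # months 6 and 7: annotated in full
    if month <= 13:
        return 4  # months 8-13: 4 hours annotated
    return 3  # months 14-17: 3 hours annotated
```

Video files need no such lookup because they were annotated in full.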
The amounts listed above are the intended ones. Because of coder errors and similar issues, the amount of annotated time in a file does not always match the intended amount exactly. We therefore created a system that identifies each annotation as either part of the intended original coding or not. Coding beyond the intended amount is classified as "surplus" and can be identified with the is_surplus variable. By default, SEEDLingS - Nouns includes all annotations, so if the amount of annotated time is important to your work, you may want to remove the surplus ones. Additionally, if you are interested in comparing *equal* amounts of time across files, you should use the is_top_3_hours variable.
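As a minimal sketch of filtering on these two variables, the Python snippet below uses only the standard library and a few made-up rows. The `annotation_id` column and the `TRUE`/`FALSE` encoding of the flags are assumptions for illustration; check the codebook for how the real table stores them.

```python
import csv
import io

# Made-up rows for illustration only. The real table has more columns and
# may encode the flags differently (assumption: text values like TRUE/FALSE).
SAMPLE = """annotation_id,is_surplus,is_top_3_hours
a1,FALSE,TRUE
a2,TRUE,FALSE
a3,FALSE,FALSE
"""


def as_bool(value):
    """Interpret common text encodings of a boolean flag."""
    return value.strip().lower() in {"true", "t", "1", "yes"}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Drop surplus coding, keeping only the intended original coding.
intended = [row for row in rows if not as_bool(row["is_surplus"])]

# Keep only the annotations flagged for equal-time comparisons across files.
top_3_hours = [row for row in rows if as_bool(row["is_top_3_hours"])]
```

With the real data, the same two list comprehensions would be applied to the rows read from seedlings-nouns.csv.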
If you need to work with the dataset in R, see here.
The dataset is hosted online as the public seedlings-nouns repository.
That repository contains only public versions of the dataset. The full edit history and in-progress changes are stored in the seedlings-nouns_private private repository. Use the private repository only if necessary.
In both cases, check Use SEEDLingS-Nouns in the scripts before using the repository files directly.
seedlings-nouns.csv - Contains all the annotated tokens.
recordings.csv - Contains information about each of the audio/video recording sessions.
sub-recordings.csv - If the audio recording was paused one or more times, we consider all the recorded audio as one recording consisting of several sub-recordings. If the recording wasn't paused at any time, the sub-recording is the same as the recording. This file also contains the local time when each sub-recording started and ended.
regions.csv - The audio recordings for months 08 and above weren't annotated in full. This file contains information about the regions that were annotated. See below for information on the types of annotated regions.
*.codebook.csv - For each table listed above, there is an associated codebook listing and describing the columns of the table.
README.md - a readme, duh.
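If you do read the files directly from a local copy of the repository, a sketch along the following lines may help. It uses only the Python standard library. The function names are made up for illustration, and it assumes that every codebook is named `<table>.codebook.csv`, sits next to its table, and is UTF-8 encoded; only the seedlings-nouns.codebook.csv name is confirmed on this page.

```python
import csv
from pathlib import Path

# Table names taken from the file list above.
TABLES = ["seedlings-nouns", "recordings", "sub-recordings", "regions"]


def read_csv(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def load_dataset(dataset_dir):
    """Read each table and its codebook from a local copy of the dataset.

    Hypothetical helper. Returns a dict of the form
    {table_name: {"rows": [...], "codebook": [...]}}.
    """
    dataset_dir = Path(dataset_dir)
    dataset = {}
    for name in TABLES:
        dataset[name] = {
            "rows": read_csv(dataset_dir / f"{name}.csv"),
            # Assumption: codebooks follow the <table>.codebook.csv pattern.
            "codebook": read_csv(dataset_dir / f"{name}.codebook.csv"),
        }
    return dataset
```

All values come back as strings, so numeric and boolean columns need converting before use.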
The main seedlings-nouns.csv table contains all noun annotations in the SEEDLingS repository plus a few extra columns. Here is what it looks like:
Here are partial descriptions of the columns:
Full descriptions and extra info are in the seedlings-nouns.codebook.csv codebook. Here is a glimpse of it:
The codebook doesn't have its own codebook, so hopefully, the columns are self-explanatory.
See Files for a brief description of each of the other tables. Each of them has a codebook structured in the same way as the seedlings-nouns.csv codebook.
Here are the heads of all three tables.