SEEDLingS - Nouns
Work with the SEEDLingS - Nouns dataset
Data collection for the SEEDLingS corpus included monthly day-long (~16 h) audio recordings and hour-long video recordings, all made at home. Forty-four children were recorded every month from 6 to 17 months of age, which gives 44 x 12 = 528 audio recordings and the same number of video recordings. One video and one audio recording didn't make it into the corpus, so there are 527 audio and 527 video files in total.
SEEDLingS - Nouns refers to both:
the annotation effort by the Bergelson Lab to annotate nouns in these files, and
the resulting dataset in the form of a set of CSV tables.
This page describes working with the dataset with only a brief description of how it came to be. The following resources contain additional information:
SEEDLingS Corpus Companion Book contains details about both the dataset and the annotation process.
The README in the repository containing the dataset is a shorter version of this page but also has a few extra details.
What is and what isn't in the dataset?
The SEEDLingS - Nouns dataset contains all the noun annotations together with additional information about the recordings and the annotation process, all in text-only CSV files. See What's inside? for details about the files' contents.
The dataset doesn't contain audio or video files. Only nouns have been annotated at the corpus-wide level; the files haven't been transcribed. The exception to both of these statements is the set of annotations made as part of the ACLEW project. For each recording in the sample of 44, 30 two-minute-long segments were sampled and transcribed. The transcriptions and the audio recordings are available through HomeBank.
What was annotated?
Each audio and video file was at least partially annotated for concrete imageable nouns. Here is how much time was annotated in different files:
Video files were annotated in full.
Audio files:
Files from months 6 and 7 were annotated in full.
Files from months 8-13 have 4 hours annotated.
Files from months 14-17 have 3 hours annotated.
Special Note about Annotated Time
The above is mostly true, but because of coder errors and similar issues, the amount of annotated time in a file doesn't always match the intended amount. We therefore created a system that marks each annotation as either part of the intended original coding or not. Coding beyond the intended amount is classified as "surplus" and can be identified with the is_surplus variable. If the amount of annotated time is important to your work, you may want to remove surplus annotations; by default, SEEDLingS - Nouns includes all annotations. Additionally, if you are interested in comparing *equal* amounts of time across files, you should use the is_top_3_hours variable.
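The two filters described above can be sketched in base R as follows. This is only an illustration: the toy rows below are invented and mimic just a few columns of seedlings-nouns.csv. With the real data, `nouns` would come from blabr::get_seedlings_nouns(version = 'v1.0.0').

```r
# Toy stand-in for the main table; the rows are invented for illustration.
# With the real data: nouns <- blabr::get_seedlings_nouns(version = 'v1.0.0')
nouns <- data.frame(
  recording_id   = c("Audio_01_06", "Audio_01_06", "Audio_01_06", "Audio_01_08"),
  object         = c("snaps", "shirt", "ball", "dog"),
  is_surplus     = c(TRUE, FALSE, FALSE, FALSE),
  is_top_3_hours = c(FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Keep only annotations that belong to the intended original coding.
# subset() also drops rows where the condition is NA.
nouns_intended <- subset(nouns, !is_surplus)

# Keep only annotations from the top three hours, for comparing
# equal amounts of time across files.
nouns_top_3 <- subset(nouns, is_top_3_hours)

nrow(nouns_intended)  # 3
nrow(nouns_top_3)     # 2
```

Which filter to use depends on the question: dropping surplus rows restores the intended coding scheme, while is_top_3_hours gives the same amount of time in every file.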
Where to find it?
The dataset is hosted online in the public seedlings-nouns repository.
That repository contains only public versions of the dataset. The full edit history and in-progress changes are stored in the seedlings-nouns_private private repository. Use the private repository only if necessary.
In both cases, check Use SEEDLingS-Nouns in the scripts before using the repository files directly.
What's inside?
Files
seedlings-nouns.csv - Contains all the annotated tokens.
recordings.csv - Contains information about each of the audio/video recording sessions.
sub-recordings.csv - If an audio recording was paused one or more times, we consider all the recorded audio as one recording consisting of several sub-recordings. If the recording wasn't paused at any time, the sub-recording is the same as the recording. This file also contains the local time when each sub-recording started and ended.
regions.csv - The audio recordings for months 08 and above weren't annotated in full. This file contains information about the regions that were annotated. See below for information on the types of annotated regions.
*.codebook.csv - For each table listed above, there is an associated codebook listing and describing the columns of the table.
README.md - a readme, duh.
seedlings-nouns.csv
The main seedlings-nouns.csv table contains all noun annotations in the SEEDLingS repository plus a few extra columns. Here is what it looks like:
> glimpse(blabr::get_seedlings_nouns(version = 'v1.0.0'))
Rows: 358,305
Columns: 22
$ recording_id <chr> "Audio_01_06", "Audio_01_06", "Audio_01_06", "Audio_…
$ audio_video <fct> audio, audio, audio, audio, audio, audio, audio, aud…
$ subject_month <chr> "01_06", "01_06", "01_06", "01_06", "01_06", "01_06"…
$ child <fct> 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, …
$ month <fct> 06, 06, 06, 06, 06, 06, 06, 06, 06, 06, 06, 06, 06, …
$ onset <int> 30810, 38980, 294500, 295510, 296310, 300620, 304170…
$ offset <int> 31830, 40190, 295510, 296310, 298440, 301670, 305170…
$ annotid <chr> "0x08406f", "0x6578a9", "0xdc5508", "0x96c739", "0xa…
$ ordinal <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ speaker <fct> MOT, MOT, MOT, SIS, MOT, MOT, MOT, MOT, MOT, MOT, MO…
$ object <chr> "snaps", "coffee", "shirt", "shirt", "shirt", "shirt…
$ basic_level <chr> "snap", "coffee", "shirt", "shirt", "shirt", "shirt"…
$ global_basic_level <chr> "snap", "coffee", "shirt", "shirt", "shirt", "shirt"…
$ transcription <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ utterance_type <fct> q, q, n, n, n, n, d, n, n, n, n, d, d, d, d, q, d, d…
$ object_present <fct> n, n, y, y, y, y, y, y, y, y, y, y, y, y, n, n, n, n…
$ is_subregion <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ is_top_3_hours <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ is_top_4_hours <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ is_surplus <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ position <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 1, 1,…
$ subregion_rank <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 4, 4, 4,…
Here are partial descriptions of the columns:
> blabr::get_seedlings_nouns(version = 'v1.0.0', get_codebook = TRUE) %>%
+ select(column, data_type, description) %>%
+ print(n = 9999)
# A tibble: 22 x 3
column data_type description
<chr> <fct> <chr>
1 recording_id string "Recording ID. Unique recording ID - a combinat…
2 audio_video categorical "Media type: audio or video. Indicates whether …
3 subject_month string "Subject and month ID. Uniquely identifies a pa…
4 child categorical "Child ID. Unique infant identifier. The SEEDLi…
5 month categorical "Month. Age in months on the nearest \"month bi…
6 onset integer "Onset. Onset of the utterance (audios) or noun…
7 offset integer "Offset. Offset of the utterance (audio) or nou…
8 annotid string "Token ID. A randomly generated unique identifi…
9 ordinal integer "Token number. The order that the coded nouns o…
10 speaker categorical "Speaker code. A three-letter code indicating t…
11 object string "Coded noun. A concrete, imageable English noun…
12 basic_level string "Basic Level. Variant of the coded noun that wa…
13 global_basic_level string "Global Basic Level. Each noun's lemma. Decided…
14 transcription string "Transcription. Phonetic transcription of the n…
15 utterance_type categorical "Utterance type. Type of the utterance the noun…
16 object_present categorical "Object presence. Was the object present, i.e.,…
17 is_subregion boolean "Does this interval belong to a subregion?"
18 is_top_3_hours boolean "Is top three hours. Indicates whether the noun…
19 is_top_4_hours boolean "Is top four hours. Indicates whether the noun …
20 is_surplus boolean "Is surplus. Indicates that the noun is from a …
21 position integer "(if token in subregion only) Chronological pos…
22 subregion_rank integer "(if token in subregion only) Rank of the subre…
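Judging by the values shown above, the identifier columns are related: recording_id "Audio_01_06" goes with audio_video "audio", child "01", month "06" and subject_month "01_06". Here is a minimal base R sketch of that relationship. It assumes that every ID follows the `<Audio|Video>_<child>_<month>` pattern; the "Video_" prefix is a guess based on the audio_video values and doesn't appear in the output above.

```r
# Example IDs: the first two appear on this page, the "Video_" one is assumed.
recording_id <- c("Audio_01_06", "Audio_01_07", "Video_01_06")

# Split each ID on underscores into a three-column character matrix.
parts <- do.call(rbind, strsplit(recording_id, "_", fixed = TRUE))

ids <- data.frame(
  recording_id  = recording_id,
  audio_video   = tolower(parts[, 1]),
  child         = parts[, 2],
  month         = parts[, 3],
  subject_month = paste(parts[, 2], parts[, 3], sep = "_"),
  stringsAsFactors = FALSE
)
ids
```

The dataset already contains all of these columns, so there is normally no need to parse the IDs yourself; the sketch only shows how they fit together.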
Full descriptions and extra info are in the seedlings-nouns.codebook.csv codebook. Here is a glimpse of it:
Rows: 22
Columns: 6
$ column <chr> "recording_id", "audio_video", "subject_month", "child…
$ data_type <fct> "string", "categorical", "string", "categorical", "cat…
$ values <chr> "1054 unique values", "audio, video", "528 unique valu…
$ description <chr> "Recording ID. Unique recording ID - a combination of …
$ additional_info <chr> NA, NA, "https://app.gitbook.com/o/-LD2B3y79nAYcWKjWKT…
$ additional_info_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
The codebook doesn't have its own codebook, so hopefully, the columns are self-explanatory.
regions.csv, recordings.csv, sub-recordings.csv
See Files for a brief description of each. Each of them has a codebook structured in the same way as the codebook for seedlings-nouns.csv.
Here are the heads of all three tables.
> blabr::get_seedlings_nouns(version = 'v1.0.0', table = 'regions') %>% print(n = 6, width = Inf)
reading file: /Users/ek221/BLAB_DATA/seedlings-nouns/./regions.csv
# A tibble: 2,898 × 9
recording_id start end is_subregion is_top_3_hours is_top_4_hours
<chr> <int> <int> <lgl> <lgl> <lgl>
1 Audio_01_06 0 600000 FALSE FALSE FALSE
2 Audio_01_06 600000 4200000 TRUE FALSE TRUE
3 Audio_01_06 4200000 14400000 FALSE FALSE FALSE
4 Audio_01_06 14400000 18000000 TRUE TRUE TRUE
5 Audio_01_06 18000000 23400000 FALSE FALSE FALSE
6 Audio_01_06 23400000 27000000 TRUE TRUE TRUE
is_surplus position subregion_rank
<lgl> <int> <int>
1 TRUE NA NA
2 FALSE 1 4
3 TRUE NA NA
4 FALSE 2 2
5 TRUE NA NA
6 FALSE 3 3
# ℹ 2,892 more rows
# ℹ Use `print(n = ...)` to see more rows
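As a rough illustration of how the regions table can be used, the sketch below rebuilds the six rows printed above, computes region durations, and looks up which region a token falls into by its onset. It assumes that start, end and onset are all in milliseconds from the beginning of the recording. The first three onsets come from the seedlings-nouns.csv glimpse above; the fourth one (700000) is made up.

```r
# The six regions of Audio_01_06 shown above (a subset of the columns).
regions <- data.frame(
  recording_id   = "Audio_01_06",
  start          = c(0L, 600000L, 4200000L, 14400000L, 18000000L, 23400000L),
  end            = c(600000L, 4200000L, 14400000L, 18000000L, 23400000L, 27000000L),
  is_subregion   = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE),
  is_top_3_hours = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE),
  is_surplus     = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Duration of each region in minutes (assuming start/end are in ms).
regions$duration_min <- (regions$end - regions$start) / 60000

# Total subregion time in hours among these six rows.
subregion_hours <- sum(regions$duration_min[regions$is_subregion]) / 60
subregion_hours  # 3

# Find the region containing each onset. This works for one recording at a
# time and relies on the regions being contiguous and sorted by start.
onset <- c(30810L, 38980L, 294500L, 700000L)  # the last value is hypothetical
idx <- findInterval(onset, regions$start)
data.frame(onset, region = idx, is_surplus = regions$is_surplus[idx])
```

The main table already carries is_subregion, is_surplus and the other flags for every token, so a lookup like this is mostly useful as a sanity check.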
> blabr::get_seedlings_nouns(version = 'v1.0.0', table = 'recordings') %>%
+ head()
reading file: /Users/ek221/BLAB_DATA/seedlings-nouns/./recordings.csv
# A tibble: 6 × 3
recording_id total_recorded_time_ms total_listened_time_ms
<chr> <int> <int>
1 Audio_01_06 57599000 36497890
2 Audio_01_07 57599000 35324640
3 Audio_01_08 57599000 14700240
4 Audio_01_09 57599000 14400000
5 Audio_01_10 57599000 14400000
6 Audio_01_11 57599000 14400000
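The *_ms columns are easier to read in hours. A minimal base R sketch using the first four rows printed above:

```r
# First four rows of recordings.csv as printed above.
recordings <- data.frame(
  recording_id           = c("Audio_01_06", "Audio_01_07", "Audio_01_08", "Audio_01_09"),
  total_recorded_time_ms = c(57599000L, 57599000L, 57599000L, 57599000L),
  total_listened_time_ms = c(36497890L, 35324640L, 14700240L, 14400000L),
  stringsAsFactors = FALSE
)

ms_per_hour <- 60 * 60 * 1000

# Convert milliseconds to hours.
recordings$recorded_h <- recordings$total_recorded_time_ms / ms_per_hour
recordings$listened_h <- recordings$total_listened_time_ms / ms_per_hour

round(recordings$recorded_h, 2)  # 16 16 16 16
round(recordings$listened_h, 2)  # 10.14 9.81 4.08 4.00
```

The listened time for months 08 and 09 comes out at about four hours, in line with the amounts listed under What was annotated?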
> blabr::get_seedlings_nouns(version = 'v1.0.0', table = 'sub-recordings') %>%
+ head()
reading file: /Users/ek221/BLAB_DATA/seedlings-nouns/./sub-recordings.csv
# A tibble: 6 × 4
recording_id start end start_position_ms
<chr> <dttm> <dttm> <int>
1 Audio_01_06 1920-01-01 08:26:45 1920-01-02 00:26:44 0
2 Audio_01_07 1920-01-01 09:03:54 1920-01-02 01:03:53 0
3 Audio_01_08 1920-01-01 09:31:57 1920-01-02 01:31:56 0
4 Audio_01_09 1920-01-01 06:09:34 1920-01-01 22:09:33 0
5 Audio_01_10 1920-01-01 06:27:24 1920-01-01 22:27:23 0
6 Audio_01_11 1920-01-01 06:36:10 1920-01-01 22:36:09 0
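One way to use the sub-recordings table is to compute how long each sub-recording lasted and add the durations up per recording. The sketch below does this in base R for two of the rows printed above. It assumes that the 1920 dates only anchor the local time of day and that using UTC for the calculation is harmless, since only differences between start and end matter.

```r
# Two rows of sub-recordings.csv as printed above.
sub_recordings <- data.frame(
  recording_id      = c("Audio_01_06", "Audio_01_09"),
  start             = as.POSIXct(c("1920-01-01 08:26:45", "1920-01-01 06:09:34"), tz = "UTC"),
  end               = as.POSIXct(c("1920-01-02 00:26:44", "1920-01-01 22:09:33"), tz = "UTC"),
  start_position_ms = c(0L, 0L),
  stringsAsFactors  = FALSE
)

# Duration of each sub-recording in milliseconds.
sub_recordings$duration_ms <- as.numeric(
  difftime(sub_recordings$end, sub_recordings$start, units = "secs")
) * 1000

# Total duration per recording (a recording can have several sub-recordings).
total_ms <- tapply(sub_recordings$duration_ms, sub_recordings$recording_id, sum)
total_ms
```

For both recordings the total is 57599000 ms, which matches total_recorded_time_ms in the recordings table above.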