SEEDLingS - Nouns

Work with the SEEDLingS - Nouns dataset

SEEDLingS corpus data collection included monthly day-long audio (~16 h) and hour-long video at-home recordings. Forty-four children were recorded every month from 6 to 17 months of age, which comes to 44 x 12 = 528 planned audio recordings and the same number of video recordings. One video and one audio recording didn't make it into the corpus, so there are 527 audio and 527 video files in total.
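The recording counts above can be reproduced with a few lines of R. This is only a worked version of the arithmetic in the paragraph; the variable names are made up for illustration.

```r
# Planned number of recordings if every child was recorded every month.
n_children <- 44
months <- 6:17                       # monthly sessions from 6 to 17 months of age
n_per_modality <- n_children * length(months)

# One audio and one video recording didn't make it into the corpus.
n_audio <- n_per_modality - 1
n_video <- n_per_modality - 1

c(planned = n_per_modality, audio = n_audio, video = n_video,
  total = n_audio + n_video)
```

The total of 1054 files is also the number of unique recording_id values reported in the codebook glimpse further down this page.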

SEEDLingS - Nouns refers to

  • the annotation effort by the Bergelson Lab to annotate nouns in these files,

  • the resulting dataset in the form of a set of CSV tables.

This page describes working with the dataset with only a brief description of how it came to be. The following resources contain additional information:

  • SEEDLingS Corpus Companion Book contains details about both the dataset and the annotation process.

  • A README in the repository containing the dataset has a shorter version of what's on this page, plus a few extra details.

What is and what isn't in the dataset?

The SEEDLingS - Nouns dataset contains all the noun annotations together with additional information about the recordings and the annotation process, all in text-only CSV files. See What's inside? for details about the files' contents.

The dataset doesn't contain audio or video files. Only nouns have been annotated at the corpus-wide level; the files haven't been transcribed. The exception to both statements is the set of annotations made as part of the ACLEW project. For each recording in the sample of 44, 30 two-minute-long segments were sampled and transcribed. The transcriptions and the audio recordings are available through HomeBank.

What was annotated?

Each audio and video file was at least partially annotated for concrete imageable nouns. Here is how much time was annotated in different files:

  • Video files were annotated in full.

  • Audio files:

    • Files from months 6 and 7 were annotated in full.

    • Files from months 8-13 have 4 hours annotated.

    • Files from months 14-17 have 3 hours annotated.
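The schedule above can be written as a small lookup function. This is a minimal sketch: the function name intended_annotated_hours is hypothetical, and it returns NA for recordings annotated in full because their length varies from file to file.

```r
# Intended amount of annotated time, in hours, for one recording.
# Hypothetical helper; the thresholds come from the list above.
intended_annotated_hours <- function(audio_video, month) {
  month <- as.integer(month)         # accepts "06" as well as 6
  stopifnot(audio_video %in% c("audio", "video"), month %in% 6:17)
  if (audio_video == "video" || month <= 7) {
    return(NA_real_)                 # annotated in full, no fixed amount
  }
  if (month <= 13) 4 else 3
}

intended_annotated_hours("audio", "10")
intended_annotated_hours("audio", "15")
intended_annotated_hours("video", "15")
```

The three calls cover an audio file from months 8-13, an audio file from months 14-17, and a video file.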

Special Note about Annotated Time

The amounts listed above are the intended ones. Because of coder errors and similar issues, the amount of annotated time in a file doesn't always match the intended amount exactly. To handle this, each annotation is marked as either part of the intended original coding or not. Coding beyond the intended amount is classified as "surplus" and can be identified with the is_surplus variable. If the amount of annotated time is important to your work, you may want to remove the surplus annotations; by default, SEEDLingS - Nouns includes all annotations. Additionally, if you want to compare equal amounts of time across files, use the is_top_3_hours variable.
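The two filters mentioned in this note could look like this in base R. This is a sketch on a toy data frame: the column names match seedlings-nouns.csv, but the rows are made up for illustration.

```r
# Toy stand-in for seedlings-nouns.csv; the values are made up.
nouns <- data.frame(
  recording_id   = c("Audio_01_06", "Audio_01_06", "Audio_01_14", "Audio_01_14"),
  object         = c("coffee", "shirt", "ball", "dog"),
  is_surplus     = c(TRUE, FALSE, FALSE, TRUE),
  is_top_3_hours = c(FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Keep only annotations that were part of the intended coding.
intended_only <- subset(nouns, !is_surplus)

# Keep only annotations from the top three hours, for equal-time comparisons.
top_3_only <- subset(nouns, is_top_3_hours)

# Token counts per recording on the equal-time subset.
table(top_3_only$recording_id)
```

subset() also drops rows where the condition is NA, which is usually what you want when filtering on a flag.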

Where to find it?

If you need to work with the dataset in R, see here.

The dataset is hosted online as a seedlings-nouns public repository.

That repository contains only public versions of the dataset. The full edit history and in-progress changes are stored in the seedlings-nouns_private private repository. Use the private repository only if necessary.

In both cases, check Use SEEDLingS-Nouns in the scripts before using the repository files directly.

What's inside?

Files

  • seedlings-nouns.csv - Contains all the annotated tokens.

  • recordings.csv - Contains information about each audio/video recording session.

  • sub-recordings.csv - If the audio recording was paused one or more times, we consider all the recorded audio as one recording consisting of several sub-recordings. If the recording wasn't paused at any time, the sub-recording is the same as the recording. This file also contains the local time when each sub-recording started and ended.

  • regions.csv - The audio recordings for months 08 and above weren't annotated in full. This file contains information about the regions that were annotated. See below for information on the types of annotated regions.

  • *.codebook.csv - For each table listed above, there is an associated codebook listing and describing the columns of the table.

  • README.md - a readme, duh.
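One use of sub-recordings.csv is converting a position in the audio to local clock time. The sketch below uses a toy table with the file's column names (recording_id, start, end, start_position_ms). It assumes that start_position_ms is the position, in milliseconds from the start of the audio, at which a sub-recording begins; check the codebook before relying on that. The recording ID and the helper name position_to_local_time are hypothetical.

```r
# Toy stand-in for sub-recordings.csv: one recording that was paused once.
sub_recordings <- data.frame(
  recording_id      = c("Audio_99_06", "Audio_99_06"),
  start             = as.POSIXct(c("1920-01-01 08:00:00", "1920-01-01 10:30:00"), tz = "UTC"),
  end               = as.POSIXct(c("1920-01-01 09:00:00", "1920-01-01 12:30:00"), tz = "UTC"),
  start_position_ms = c(0, 3600000)  # assumption: ms from the start of the audio
)

# Convert a position in the audio (ms) to local clock time.
position_to_local_time <- function(position_ms, subs) {
  # Pick the last sub-recording that starts at or before the position.
  i <- max(which(subs$start_position_ms <= position_ms))
  # POSIXct arithmetic is in seconds, so convert the offset from ms.
  subs$start[i] + (position_ms - subs$start_position_ms[i]) / 1000
}

format(position_to_local_time(1800000, sub_recordings), "%H:%M:%S")  # "08:30:00"
format(position_to_local_time(3660000, sub_recordings), "%H:%M:%S")  # "10:31:00"
```

The second call lands one minute into the second sub-recording, so the 90-minute pause is accounted for.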

seedlings-nouns.csv

The main seedlings-nouns.csv table contains all noun annotations in the SEEDLingS repository plus a few extra columns. Here is what it looks like:

> glimpse(blabr::get_seedlings_nouns(version = 'v1.0.0'))
Rows: 358,305
Columns: 22
$ recording_id       <chr> "Audio_01_06", "Audio_01_06", "Audio_01_06", "Audio_…
$ audio_video        <fct> audio, audio, audio, audio, audio, audio, audio, aud…
$ subject_month      <chr> "01_06", "01_06", "01_06", "01_06", "01_06", "01_06"…
$ child              <fct> 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, …
$ month              <fct> 06, 06, 06, 06, 06, 06, 06, 06, 06, 06, 06, 06, 06, …
$ onset              <int> 30810, 38980, 294500, 295510, 296310, 300620, 304170…
$ offset             <int> 31830, 40190, 295510, 296310, 298440, 301670, 305170…
$ annotid            <chr> "0x08406f", "0x6578a9", "0xdc5508", "0x96c739", "0xa…
$ ordinal            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ speaker            <fct> MOT, MOT, MOT, SIS, MOT, MOT, MOT, MOT, MOT, MOT, MO…
$ object             <chr> "snaps", "coffee", "shirt", "shirt", "shirt", "shirt…
$ basic_level        <chr> "snap", "coffee", "shirt", "shirt", "shirt", "shirt"…
$ global_basic_level <chr> "snap", "coffee", "shirt", "shirt", "shirt", "shirt"…
$ transcription      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ utterance_type     <fct> q, q, n, n, n, n, d, n, n, n, n, d, d, d, d, q, d, d…
$ object_present     <fct> n, n, y, y, y, y, y, y, y, y, y, y, y, y, n, n, n, n…
$ is_subregion       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ is_top_3_hours     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ is_top_4_hours     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ is_surplus         <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ position           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 1, 1,…
$ subregion_rank     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 4, 4, 4,…
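If you read the CSV directly rather than through blabr, note that child and month are zero-padded ("01", "06"), and default type guessing would turn them into integers. The sketch below uses a tiny temporary file with the first two rows shown above. The colClasses setting is just one way to keep the padding, and the recording_id pattern is inferred from the displayed rows rather than taken from the codebook.

```r
# Write a tiny CSV with the same kind of ID columns as seedlings-nouns.csv.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "recording_id,child,month,onset,offset,object",
  "Audio_01_06,01,06,30810,31830,snaps",
  "Audio_01_06,01,06,38980,40190,coffee"
), csv_path)

# Default type guessing turns "01" into the integer 1.
guessed <- read.csv(csv_path)
guessed$child

# Reading the ID columns as character keeps the zero-padding.
nouns <- read.csv(csv_path,
                  colClasses = c(child = "character", month = "character"))
nouns$child

# With padded IDs, the audio recording ID can be rebuilt from its parts
# (pattern inferred from the rows shown above).
rebuilt <- paste("Audio", nouns$child, nouns$month, sep = "_")
all(rebuilt == nouns$recording_id)
```

Keeping the IDs as strings also makes joins against recordings.csv and regions.csv by recording_id straightforward.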

Here are partial descriptions of the columns:

> blabr::get_seedlings_nouns(version = 'v1.0.0', get_codebook = TRUE) %>%
+   select(column, data_type, description) %>%
+   print(n = 9999)
# A tibble: 22 x 3
   column             data_type   description                                     
   <chr>              <fct>       <chr>                                           
 1 recording_id       string      "Recording ID. Unique recording ID - a combinat…
 2 audio_video        categorical "Media type: audio or video. Indicates whether …
 3 subject_month      string      "Subject and month ID. Uniquely identifies a pa…
 4 child              categorical "Child ID. Unique infant identifier. The SEEDLi…
 5 month              categorical "Month. Age in months on the nearest \"month bi…
 6 onset              integer     "Onset. Onset of the utterance (audios) or noun…
 7 offset             integer     "Offset. Offset of the utterance (audio) or nou…
 8 annotid            string      "Token ID. A randomly generated unique identifi…
 9 ordinal            integer     "Token number. The order that the coded nouns o…
10 speaker            categorical "Speaker code. A three-letter code indicating t…
11 object             string      "Coded noun. A concrete, imageable English noun…
12 basic_level        string      "Basic Level. Variant of the coded noun that wa…
13 global_basic_level string      "Global Basic Level. Each noun's lemma. Decided…
14 transcription      string      "Transcription. Phonetic transcription of the n…
15 utterance_type     categorical "Utterance type. Type of the utterance the noun…
16 object_present     categorical "Object presence. Was the object present, i.e.,…
17 is_subregion       boolean     "Does this interval belong to a subregion?"     
18 is_top_3_hours     boolean     "Is top three hours. Indicates whether the noun…
19 is_top_4_hours     boolean     "Is top four hours. Indicates whether the noun …
20 is_surplus         boolean     "Is surplus. Indicates that the noun is from a …
21 position           integer     "(if token in subregion only) Chronological pos…
22 subregion_rank     integer     "(if token in subregion only) Rank of the subre…

Full descriptions and extra info are in the seedlings-nouns.codebook.csv codebook. Here is a glimpse of it:

Rows: 22
Columns: 6
$ column            <chr> "recording_id", "audio_video", "subject_month", "child…
$ data_type         <fct> "string", "categorical", "string", "categorical", "cat…
$ values            <chr> "1054 unique values", "audio, video", "528 unique valu…
$ description       <chr> "Recording ID. Unique recording ID - a combination of …
$ additional_info   <chr> NA, NA, "https://app.gitbook.com/o/-LD2B3y79nAYcWKjWKT…
$ additional_info_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

The codebook doesn't have its own codebook, so hopefully, the columns are self-explanatory.

regions.csv, recordings.csv, sub-recordings.csv

See Files for a brief description of each. Each of them has a codebook structured in the same way as the seedlings-nouns.csv codebook.

Here are the heads of all three tables.

> blabr::get_seedlings_nouns(version = 'v1.0.0', table = 'regions') %>% print(n = 6, width = Inf)
reading file: /Users/ek221/BLAB_DATA/seedlings-nouns/./regions.csv
# A tibble: 2,898 × 9
  recording_id    start      end is_subregion is_top_3_hours is_top_4_hours
  <chr>           <int>    <int> <lgl>        <lgl>          <lgl>
1 Audio_01_06         0   600000 FALSE        FALSE          FALSE
2 Audio_01_06    600000  4200000 TRUE         FALSE          TRUE
3 Audio_01_06   4200000 14400000 FALSE        FALSE          FALSE
4 Audio_01_06  14400000 18000000 TRUE         TRUE           TRUE
5 Audio_01_06  18000000 23400000 FALSE        FALSE          FALSE
6 Audio_01_06  23400000 27000000 TRUE         TRUE           TRUE
  is_surplus position subregion_rank
  <lgl>         <int>          <int>
1 TRUE             NA             NA
2 FALSE             1              4
3 TRUE             NA             NA
4 FALSE             2              2
5 TRUE             NA             NA
6 FALSE             3              3
# ℹ 2,892 more rows
# ℹ Use `print(n = ...)` to see more rows
> blabr::get_seedlings_nouns(version = 'v1.0.0', table = 'recordings') %>%
+ head()
reading file: /Users/ek221/BLAB_DATA/seedlings-nouns/./recordings.csv
# A tibble: 6 × 3
  recording_id total_recorded_time_ms total_listened_time_ms
  <chr>                         <int>                  <int>
1 Audio_01_06                57599000               36497890
2 Audio_01_07                57599000               35324640
3 Audio_01_08                57599000               14700240
4 Audio_01_09                57599000               14400000
5 Audio_01_10                57599000               14400000
6 Audio_01_11                57599000               14400000
> blabr::get_seedlings_nouns(version = 'v1.0.0', table = 'sub-recordings') %>%
+ head()
reading file: /Users/ek221/BLAB_DATA/seedlings-nouns/./sub-recordings.csv
# A tibble: 6 × 4
  recording_id start               end                 start_position_ms
  <chr>        <dttm>              <dttm>                          <int>
1 Audio_01_06  1920-01-01 08:26:45 1920-01-02 00:26:44                 0
2 Audio_01_07  1920-01-01 09:03:54 1920-01-02 01:03:53                 0
3 Audio_01_08  1920-01-01 09:31:57 1920-01-02 01:31:56                 0
4 Audio_01_09  1920-01-01 06:09:34 1920-01-01 22:09:33                 0
5 Audio_01_10  1920-01-01 06:27:24 1920-01-01 22:27:23                 0
6 Audio_01_11  1920-01-01 06:36:10 1920-01-01 22:36:09                 0
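As an example of using regions.csv, the sketch below adds up the time covered by non-surplus regions. It retypes only the six Audio_01_06 rows shown above, so the totals describe those rows and not the whole recording. It also assumes that start and end are in milliseconds, by analogy with the _ms columns in recordings.csv.

```r
# The six regions rows shown above, retyped by hand.
regions <- data.frame(
  recording_id   = "Audio_01_06",
  start          = c(0, 600000, 4200000, 14400000, 18000000, 23400000),
  end            = c(600000, 4200000, 14400000, 18000000, 23400000, 27000000),
  is_top_3_hours = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE),
  is_surplus     = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Region length in hours, assuming start and end are in milliseconds.
regions$duration_h <- (regions$end - regions$start) / 3600000

# Hours per recording in regions that were part of the intended coding.
intended <- aggregate(duration_h ~ recording_id,
                      data = subset(regions, !is_surplus), FUN = sum)
intended

# Hours per recording in the top-three-hours regions (two of them in these rows).
top_3 <- aggregate(duration_h ~ recording_id,
                   data = subset(regions, is_top_3_hours), FUN = sum)
top_3
```

On the full table, the same aggregation gives one row per recording, which can then be compared against the intended amounts of annotated time.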
