# SEEDLingS - Nouns

Data collection for the SEEDLingS corpus included monthly at-home recordings: day-long audio (\~16 h) and hour-long video. Each of the 44 children was recorded every month from 6 to 17 months of age, which makes 44 x 12 = 528 audio recordings and the same number of video recordings. One audio and one video recording didn't make it into the corpus, so there are 527 audio and 527 video files in total.

SEEDLingS - Nouns refers to

* the annotation effort by the Bergelson Lab to annotate nouns in these files,
* the resulting dataset in the form of a set of CSV tables.

This page describes how to work with the dataset and only briefly covers how it came to be. The following resources contain additional information:

* SEEDLingS Corpus Companion [Book](https://seedlings-nouns.bergelsonlab.com/) contains details about both the dataset and the annotation process.
* A [README](https://github.com/BergelsonLab/seedlings-nouns/) in the repository containing the dataset is a shorter version of this page that also has a few extra details.

## What is and what isn't in the dataset?

The SEEDLingS - Nouns dataset contains all the noun annotations together with additional information about the recordings and the annotation process - all in text-only CSV files. See [#whats-inside](#whats-inside "mention") for details about the files' contents.

The dataset doesn't contain audio or video files. Only nouns have been annotated corpus-wide; the files haven't been transcribed. The exception to both statements is the annotations made as part of the [ACLEW](https://sites.google.com/view/aclewdid/home) project. For each of the 44 recordings in that sample, 30 two-minute-long segments were sampled and transcribed. The transcriptions and the audio recordings are available through [HomeBank](https://homebank.talkbank.org/).

## What was annotated?

Each audio and video file was at least partially annotated for concrete imageable nouns. Here is how much time was annotated in different files:

* Video files were annotated in full.
* Audio files:
  * Files from months 6 and 7 were annotated in full.
  * Files from months 8-13 have 4 hours annotated.
  * Files from months 14-17 have 3 hours annotated.

#### Special Note about Annotated Time

The amounts above are the intended ones. Because of coder errors and similar issues, the amount of time actually annotated in a file does not always match the intended amount. We therefore mark each annotation as either part of the intended original coding or not. Coding beyond the intended amount is classified as "surplus" and is flagged by the `is_surplus` variable. By default, SEEDLingS - Nouns includes **all** annotations, so if the amount of annotated time matters for your work, you may want to remove the surplus annotations. Additionally, if you want to compare **equal** amounts of time across files, use the `is_top_3_hours` variable.
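
The filtering described above can be sketched in base R as follows. This is only an illustration: it assumes the table has already been loaded as a data frame with the logical columns `is_surplus` and `is_top_3_hours`, and the rows in the toy data frame are made up rather than taken from the dataset.

```r
# Toy stand-in for the seedlings-nouns table; the rows are illustrative only.
nouns <- data.frame(
  recording_id   = rep("Audio_01_08", 4),
  object         = c("ball", "dog", "cup", "shoe"),
  is_surplus     = c(TRUE, FALSE, FALSE, FALSE),
  is_top_3_hours = c(FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Drop surplus annotations to keep only the intended original coding.
intended <- nouns[!nouns$is_surplus, ]

# Keep only annotations flagged as belonging to the top three hours,
# for comparisons that need equal amounts of time across files.
top3 <- nouns[nouns$is_top_3_hours, ]

nrow(intended)  # 3
nrow(top3)      # 2
```

Which of the two filters is appropriate depends on the analysis, so treat the choice here as an example rather than a recommendation.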

## Where to find it?

{% hint style="info" %}
If you need to work with the dataset in R, see [here](/data-pipeline/use-seedlings-nouns-in-the-scripts.md).
{% endhint %}

The dataset is hosted online as a [seedlings-nouns](https://github.com/BergelsonLab/seedlings-nouns) public repository.

That repository contains only public versions of the dataset. The full edit history and in-progress changes are stored in the [seedlings-nouns\_private](https://github.com/BergelsonLab/seedlings-nouns_private) private repository. Use the private repository only if necessary.

In both cases, check [Use SEEDLingS-Nouns in the scripts](/data-pipeline/use-seedlings-nouns-in-the-scripts.md) before using the repository files directly.

## What's inside?

### Files

* `seedlings-nouns.csv` - Contains all the annotated tokens.
* `recordings.csv` - Contains information about each audio/video recording session.
* `sub-recordings.csv` - If the audio recording was paused one or more times, we consider all the recorded audio as one *recording* consisting of several *sub-recordings*. If the recording wasn't paused at any time, *sub-recording* is the same as *recording*. This file also contains local time when each sub-recording started and ended.
* `regions.csv` - The audio recordings for months 08 and above weren't annotated in full. This file contains information about the regions that were annotated. See [below](https://github.com/BergelsonLab/seedlings-nouns_private/blob/main/public/README.md#audio-recording-regions) for information on the types of annotated regions.
* `*.codebook.csv` - For each table listed above, there is an associated codebook listing and describing the columns of the table.
* `README.md` - The repository's readme.

### seedlings-nouns.csv

The main `seedlings-nouns.csv` table contains all noun annotations in the SEEDLingS repository plus a few extra columns. Here is what it looks like:

<pre class="language-r"><code class="lang-r">> glimpse(blabr::get_seedlings_nouns(version = 'v1.0.0'))
<strong>Rows: 358,305
</strong>Columns: 22
$ recording_id       &#x3C;chr> "Audio_01_06", "Audio_01_06", "Audio_01_06", "Audio_…
$ audio_video        &#x3C;fct> audio, audio, audio, audio, audio, audio, audio, aud…
$ subject_month      &#x3C;chr> "01_06", "01_06", "01_06", "01_06", "01_06", "01_06"…
$ child              &#x3C;fct> 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, …
$ month              &#x3C;fct> 06, 06, 06, 06, 06, 06, 06, 06, 06, 06, 06, 06, 06, …
$ onset              &#x3C;int> 30810, 38980, 294500, 295510, 296310, 300620, 304170…
$ offset             &#x3C;int> 31830, 40190, 295510, 296310, 298440, 301670, 305170…
$ annotid            &#x3C;chr> "0x08406f", "0x6578a9", "0xdc5508", "0x96c739", "0xa…
$ ordinal            &#x3C;int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ speaker            &#x3C;fct> MOT, MOT, MOT, SIS, MOT, MOT, MOT, MOT, MOT, MOT, MO…
$ object             &#x3C;chr> "snaps", "coffee", "shirt", "shirt", "shirt", "shirt…
$ basic_level        &#x3C;chr> "snap", "coffee", "shirt", "shirt", "shirt", "shirt"…
$ global_basic_level &#x3C;chr> "snap", "coffee", "shirt", "shirt", "shirt", "shirt"…
$ transcription      &#x3C;chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ utterance_type     &#x3C;fct> q, q, n, n, n, n, d, n, n, n, n, d, d, d, d, q, d, d…
$ object_present     &#x3C;fct> n, n, y, y, y, y, y, y, y, y, y, y, y, y, n, n, n, n…
$ is_subregion       &#x3C;lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ is_top_3_hours     &#x3C;lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ is_top_4_hours     &#x3C;lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ is_surplus         &#x3C;lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ position           &#x3C;dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 1, 1,…
$ subregion_rank     &#x3C;dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 4, 4, 4,…
</code></pre>
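
Judging only from the values shown in the glimpse, `recording_id` appears to combine the media type, the child ID, and the month, and `subject_month` appears to combine the last two. The base R sketch below splits the IDs on that assumption; the codebook remains the authoritative description of these columns.

```r
# recording_id values as they appear in the outputs on this page.
recording_id <- c("Audio_01_06", "Audio_01_07")

# Split on underscores into media type, child ID, and month.
# Assumption: every ID follows the <media>_<child>_<month> pattern.
parts <- do.call(rbind, strsplit(recording_id, "_", fixed = TRUE))

ids <- data.frame(
  recording_id  = recording_id,
  audio_video   = tolower(parts[, 1]),
  child         = parts[, 2],
  month         = parts[, 3],
  subject_month = paste(parts[, 2], parts[, 3], sep = "_"),
  stringsAsFactors = FALSE
)
ids
```

In practice there is little reason to parse the IDs, because the table already has `audio_video`, `child`, `month`, and `subject_month` columns.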

Here are partial descriptions of the columns:

```r
> blabr::get_seedlings_nouns(version = 'v1.0.0', get_codebook = TRUE) %>%
+   select(column, data_type, description) %>%
+   print(n = 9999)
# A tibble: 22 x 3
   column             data_type   description                                     
   <chr>              <fct>       <chr>                                           
 1 recording_id       string      "Recording ID. Unique recording ID - a combinat…
 2 audio_video        categorical "Media type: audio or video. Indicates whether …
 3 subject_month      string      "Subject and month ID. Uniquely identifies a pa…
 4 child              categorical "Child ID. Unique infant identifier. The SEEDLi…
 5 month              categorical "Month. Age in months on the nearest \"month bi…
 6 onset              integer     "Onset. Onset of the utterance (audios) or noun…
 7 offset             integer     "Offset. Offset of the utterance (audio) or nou…
 8 annotid            string      "Token ID. A randomly generated unique identifi…
 9 ordinal            integer     "Token number. The order that the coded nouns o…
10 speaker            categorical "Speaker code. A three-letter code indicating t…
11 object             string      "Coded noun. A concrete, imageable English noun…
12 basic_level        string      "Basic Level. Variant of the coded noun that wa…
13 global_basic_level string      "Global Basic Level. Each noun's lemma. Decided…
14 transcription      string      "Transcription. Phonetic transcription of the n…
15 utterance_type     categorical "Utterance type. Type of the utterance the noun…
16 object_present     categorical "Object presence. Was the object present, i.e.,…
17 is_subregion       boolean     "Does this interval belong to a subregion?"     
18 is_top_3_hours     boolean     "Is top three hours. Indicates whether the noun…
19 is_top_4_hours     boolean     "Is top four hours. Indicates whether the noun …
20 is_surplus         boolean     "Is surplus. Indicates that the noun is from a …
21 position           integer     "(if token in subregion only) Chronological pos…
22 subregion_rank     integer     "(if token in subregion only) Rank of the subre…
```

Full descriptions and extra info are in the `seedlings-nouns.codebook.csv` codebook. Here is a `glimpse` of it:

```r
Rows: 22
Columns: 6
$ column            <chr> "recording_id", "audio_video", "subject_month", "child…
$ data_type         <fct> "string", "categorical", "string", "categorical", "cat…
$ values            <chr> "1054 unique values", "audio, video", "528 unique valu…
$ description       <chr> "Recording ID. Unique recording ID - a combination of …
$ additional_info   <chr> NA, NA, "https://app.gitbook.com/o/-LD2B3y79nAYcWKjWKT…
$ additional_info_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
```

The codebook doesn't have its own codebook, so hopefully, the columns are self-explanatory.

### regions.csv, recordings.csv, sub-recordings.csv

See [#files](#files "mention") for a brief description of each. Each of them has a codebook structured in the same way as the codebook for `seedlings-nouns.csv`.

Here are the `head`s of all three tables.

```r
> blabr::get_seedlings_nouns(version = 'v1.0.0', table = 'regions') %>% print(n = 6, width = Inf)
reading file: /Users/ek221/BLAB_DATA/seedlings-nouns/./regions.csv
# A tibble: 2,898 × 9
  recording_id    start      end is_subregion is_top_3_hours is_top_4_hours
  <chr>           <int>    <int> <lgl>        <lgl>          <lgl>
1 Audio_01_06         0   600000 FALSE        FALSE          FALSE
2 Audio_01_06    600000  4200000 TRUE         FALSE          TRUE
3 Audio_01_06   4200000 14400000 FALSE        FALSE          FALSE
4 Audio_01_06  14400000 18000000 TRUE         TRUE           TRUE
5 Audio_01_06  18000000 23400000 FALSE        FALSE          FALSE
6 Audio_01_06  23400000 27000000 TRUE         TRUE           TRUE
  is_surplus position subregion_rank
  <lgl>         <int>          <int>
1 TRUE             NA             NA
2 FALSE             1              4
3 TRUE             NA             NA
4 FALSE             2              2
5 TRUE             NA             NA
6 FALSE             3              3
# ℹ 2,892 more rows
# ℹ Use `print(n = ...)` to see more rows
```
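
As a rough sketch of how the regions table can be used, the base R snippet below rebuilds the six rows printed above and adds up region durations. It assumes that `start` and `end` are in milliseconds, which the `_ms` columns of `recordings.csv` suggest but this page does not state. The `is_top_4_hours`, `position`, and `subregion_rank` columns are left out for brevity.

```r
# The six regions rows printed above.
regions <- data.frame(
  recording_id   = rep("Audio_01_06", 6),
  start          = c(0, 600000, 4200000, 14400000, 18000000, 23400000),
  end            = c(600000, 4200000, 14400000, 18000000, 23400000, 27000000),
  is_subregion   = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE),
  is_top_3_hours = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE),
  is_surplus     = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Region duration in hours (assumption: start and end are in milliseconds).
regions$duration_h <- (regions$end - regions$start) / 3.6e6

# Hours covered by subregions and by top-3-hours regions in these six rows.
subregion_h <- sum(regions$duration_h[regions$is_subregion])
top3_h      <- sum(regions$duration_h[regions$is_top_3_hours])
c(subregion_h = subregion_h, top3_h = top3_h)
```

Only the first six regions of `Audio_01_06` are included, so these sums are partial and do not describe the whole recording.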

```r
> blabr::get_seedlings_nouns(version = 'v1.0.0', table = 'recordings') %>%
+ head()
reading file: /Users/ek221/BLAB_DATA/seedlings-nouns/./recordings.csv
# A tibble: 6 × 3
  recording_id total_recorded_time_ms total_listened_time_ms
  <chr>                         <int>                  <int>
1 Audio_01_06                57599000               36497890
2 Audio_01_07                57599000               35324640
3 Audio_01_08                57599000               14700240
4 Audio_01_09                57599000               14400000
5 Audio_01_10                57599000               14400000
6 Audio_01_11                57599000               14400000
```
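
The two time columns are easier to read in hours. The base R sketch below converts them for three of the rows printed above and computes the share of each recording that was listened to. The new column names (`recorded_h`, `listened_h`, `listened_share`) are introduced here for illustration and are not part of the dataset.

```r
# Three of the recordings rows printed above.
recordings <- data.frame(
  recording_id           = c("Audio_01_06", "Audio_01_08", "Audio_01_09"),
  total_recorded_time_ms = c(57599000, 57599000, 57599000),
  total_listened_time_ms = c(36497890, 14700240, 14400000)
)

ms_per_hour <- 60 * 60 * 1000

# Convert milliseconds to hours.
recordings$recorded_h <- recordings$total_recorded_time_ms / ms_per_hour
recordings$listened_h <- recordings$total_listened_time_ms / ms_per_hour

# Share of the recorded time that was listened to.
recordings$listened_share <-
  recordings$total_listened_time_ms / recordings$total_recorded_time_ms

recordings
```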

```r
> blabr::get_seedlings_nouns(version = 'v1.0.0', table = 'sub-recordings') %>%
+ head()
reading file: /Users/ek221/BLAB_DATA/seedlings-nouns/./sub-recordings.csv
# A tibble: 6 × 4
  recording_id start               end                 start_position_ms
  <chr>        <dttm>              <dttm>                          <int>
1 Audio_01_06  1920-01-01 08:26:45 1920-01-02 00:26:44                 0
2 Audio_01_07  1920-01-01 09:03:54 1920-01-02 01:03:53                 0
3 Audio_01_08  1920-01-01 09:31:57 1920-01-02 01:31:56                 0
4 Audio_01_09  1920-01-01 06:09:34 1920-01-01 22:09:33                 0
5 Audio_01_10  1920-01-01 06:27:24 1920-01-01 22:27:23                 0
6 Audio_01_11  1920-01-01 06:36:10 1920-01-01 22:36:09                 0
```
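
As a small consistency check, the durations of the sub-recordings can be computed from `start` and `end` and summed per recording. The base R sketch below does this for two of the rows printed above. The `"UTC"` time zone is used only so that the timestamps parse identically everywhere; it is an assumption of this example, not a statement about the data.

```r
# Two of the sub-recordings rows printed above.
sub_recordings <- data.frame(
  recording_id = c("Audio_01_06", "Audio_01_09"),
  start = as.POSIXct(c("1920-01-01 08:26:45", "1920-01-01 06:09:34"), tz = "UTC"),
  end   = as.POSIXct(c("1920-01-02 00:26:44", "1920-01-01 22:09:33"), tz = "UTC")
)

# Duration of each sub-recording in milliseconds.
sub_recordings$duration_ms <- 1000 * as.numeric(
  difftime(sub_recordings$end, sub_recordings$start, units = "secs")
)

# Total duration per recording, summed over its sub-recordings.
total_ms <- tapply(sub_recordings$duration_ms, sub_recordings$recording_id, sum)
total_ms
```

For these two recordings the totals equal the `total_recorded_time_ms` value of 57599000 shown in the recordings table above.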

