Eyetracking data preparation

We are going to use the data preparation in the ht_seedlings repository as an example of how we want to do it. The information here is meant to help you set up the same kind of pipeline for your own study.

Main parts of the code

The code snippets below are meant to represent the key steps in each notebook. They aren't necessarily actual working code.

01_wrangling.Rmd

Running this file in its entirety will produce several untracked .Rds files under data/eyetracking/local/ that will be used in the subsequent notebooks and in the analysis:

  • recordings.Rds,

  • trials.Rds,

  • fixations.Rds,

  • messages.Rds.

Load and split data

fixations_report <- read_fixation_report(fix_rep_path, remove_practice = TRUE)
message_report <- read_message_report(mes_rep_path, remove_practice = TRUE)
fix_rep_data <- split_fixation_report(fixations_report)
mes_rep_data <- split_message_report(message_report)
data_tables_raw <- merge_split_reports(fix_rep_data, mes_rep_data)

data_tables_raw is a list of five tables: "experiment", "recordings", "trials", "fixations", and "messages", each containing information at the corresponding level. That is, a row in "trials" corresponds to a single trial, and a row in "messages" to a single message.
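As a hypothetical illustration of that hierarchy (the column names below are made up; the real columns come from the DataViewer reports):

```r
# Each table holds one row per unit at its level; lower levels reference
# higher ones through shared key columns (hypothetical names):
data_tables_raw <- list(
  recordings = data.frame(recording_id = c("r1", "r2")),
  trials = data.frame(
    recording_id = c("r1", "r1", "r2"),
    trial_index = c(1, 2, 1)
  ),
  messages = data.frame(
    recording_id = "r1",
    trial_index = 1,
    message = "display_both_images"
  )
)

# One row in trials is one trial:
nrow(data_tables_raw$trials)
```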

Select, rename, clean, fix, validate

data_tables_selected <- data_tables_raw
...modify data_tables_selected...
data_tables_validated <- data_tables_selected
...modify data_tables_validated...
data_tables_transformed <- data_tables_validated
...modify data_tables_transformed...

Fixing and validation take up the most time and the most space in the notebook. Use summary, glimpse, and skimr::skim to explore the data interactively. In the end, however, every check you rely on should be written down as an assertion so that it runs on every knit.

Verification in "ht_seedlings" is done with the help of the assertthat and assertr packages. See this post by Danielle Navarro for an overview of four ways to do assertions.
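As a hypothetical illustration of the two styles (the table and column names below are made up, not the real "ht_seedlings" checks):

```r
library(assertthat)
library(assertr)

# Hypothetical trials table:
trials <- data.frame(recording_id = "r1", trial_index = 1:3, rt = c(360, 420, 510))

# assertthat: one-off checks that stop with an informative error
assert_that(noNA(trials$rt))
assert_that(all(trials$rt > 0))

# assertr: pipe-friendly checks that pass the table through when they hold
trials_checked <- trials |>
  verify(trial_index > 0) |>
  assert(within_bounds(0, Inf), rt)
```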

Save to data/eyetracking/local/

save_tables_separately(data_tables_transformed, "data/eyetracking/local/")

02_target-onsets.Rmd

The aim of this notebook is to identify the target onset time relative to the trial start. For "ht_seedlings", it required a combination of the trial-level information in column rt and the time of the display_both_images message. The specific way to do this will change from study to study. If this is trivial for your study, feel free to incorporate the code into the first notebook. Do still save the result to a separate file called "target_onsets.Rds" to have consistent code across the studies. That is, do not add "target_onset" to the "trials" table.

messages <- readRDS(file.path(local_data_dir, 'messages.Rds'))
trials <- readRDS(file.path(local_data_dir, 'trials.Rds'))
target_onsets <- ...code working on messages and trials ...
saveRDS(target_onsets, file.path(local_data_dir, "target_onsets.Rds"))
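One hypothetical way the elided computation could look (the column names and the exact formula are assumptions for illustration; the real code depends on your study's messages):

```r
library(dplyr)

# Toy stand-ins with hypothetical column names (the real columns come
# from 01_wrangling.Rmd and differ from study to study):
messages <- tibble::tibble(
  recording_id = "r1",
  trial_index = c(1, 2),
  message = "display_both_images",
  time = c(1000, 5000)
)
trials <- tibble::tibble(
  recording_id = "r1",
  trial_index = c(1, 2),
  rt = c(360, 420)
)

# One hypothetical combination of rt and the display_both_images message:
target_onsets <- messages %>%
  filter(message == "display_both_images") %>%
  inner_join(trials, by = c("recording_id", "trial_index")) %>%
  transmute(recording_id, trial_index, target_onset = time + rt)
```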

03_gaze-data.Rmd

The aim of this notebook is to:

  • convert the fixation data into an evenly sampled time series,

  • assign time windows of interest,

  • mark "low-data" trials with too little fixations on areas of interest,

  • mark "low-data" recordings with too many "low-data" trials.

The output is saved to data/eyetracking/clean/fixation_timeseries_taglowdata.Rds (untracked).

fixations <- readRDS(file.path(local_data_dir, 'fixations.Rds'))
target_onsets <- readRDS(file.path(local_data_dir, 'target_onsets.Rds'))

fixations_updated <- ...add target onsets to fixations...

fixation_timeseries <- fixations_to_timeseries(fixations_updated, t_step = 20)
fixation_timeseries_with_windows <- fixation_timeseries %>%
  filter(!is.na(target_onset)) %>%
  assign_time_windows(t_step = 20, t_start = 360)
  
fixation_timeseries_taglowdata <-
  fixation_timeseries_with_windows %>%
  mutate(is_good_timepoint = aoi %in% c('TARGET', 'DISTRACTOR')) %>%
  tag_low_data_trials(window_column = 'shortwin', 
                      t_start = 367)

The variable names are too long and will likely change in the future. I hope the steps are still clear though.

The window and low-data-trial information are at the wrong level of hierarchy in fixation_timeseries_taglowdata. Windows are a property of time (independent of which trial they came from and of the fixation coordinates at any specific time point), and is_low_data_trial is a property of trials. We haven't decided yet, but there is a high chance we are going to split it into window_timeseries and low_data_trials, the same way we split the reports.

Keeping things up-to-date

We want all the files to always be up to date with changes in the code and the data. Eventually, we will set up a system that takes care of that programmatically. For now, we'll have to just re-knit all the notebooks fairly regularly. This can be very slow, but we can use caching to avoid repeating steps that definitely haven't changed, and we can skip steps that we mostly need when running the code interactively.

Caching

Caching saves the output of a piece of code and reuses it the next time you run the code. A very important condition for this to work is for the cache to be invalidated and the code to be rerun when any of the inputs to the code change or the code itself changes.

If you use caching, make sure you've thought through cache invalidation.

knitr chunk caching

knitr has a built-in way to cache code run in chunks: you just need to set the chunk option cache to TRUE and add another chunk option that will change whenever the input does. For R object inputs, you can use digest::digest(obj), which changes if the object does. For files, use tools::md5sum(path), which changes if the contents of the file change; the file name and modification time are ignored.

```{r split-fixation-report}
#| cache = TRUE,
#| cache.extra = digest::digest(fixations_report)

fix_rep_data <- split_fixation_report(fixations_report)
```

#| cache = TRUE is just a way to set chunk options one line at a time, which is useful for readability. You can put everything in the chunk header instead.

cache.extra could have been named anything else: the cache is invalidated when any of the chunk options change, and cache.extra is just the name suggested by knitr's documentation.

For more information, see https://bookdown.org/yihui/rmarkdown-cookbook/cache.html

Skipping code/chunks when knitting

Exploratory output from summary, glimpse, and skimr::skim is mostly useful when running the code interactively. Skip it when knitting either by wrapping the calls in if (interactive()) {...} or by setting the chunk option eval=interactive():

```{r load-preview-message-report}
#| cache = TRUE,
#| cache.extra = tools::md5sum(mes_rep_path)

message_report <- read_message_report(mes_rep_path, remove_practice = TRUE)

if (interactive()) {summary(message_report)}
```
```{r preview_main, eval=interactive()}
summary(fixations_report)
glimpse(fixations_report)
skim(fixations_report)
colnames(fixations_report)
```

Managing files

Data files

Depending on the size, either put them in the git repo or on OSF. You can copy the code from "ht_seedlings" that downloads the data from OSF. If you are committing them to the repo and they are going to change (e.g., the data collection is ongoing), make sure they are in a text format. For the DataViewer reports, if read_fixation_report and read_message_report work, then the files are in text format and you don't need to do anything about them.

Tables produced by your code

We don't want to commit too many files, especially until the data collection is finished and the data preparation notebooks are finalized. What we are currently doing in "ht_seedlings" is not committing any of the output files at all: they are all saved to "data/eyetracking/local", which is git-ignored.

Table metadata

We do want to track some information about the files if we aren't committing them, though, mostly so that changes to the files leave footprints in the git repo. Zhenya hasn't quite figured out a good way to do that yet, so in "ht_seedlings" we are currently doing the following:

  • When saving to the "local" folder, we use save_rds_return_md5 from R/helpers.R (we might move it to blabr later) so that a hash of the table is saved in the output .md document. This way, there is at least some signal when the tables change.

  • For cases when we do expect changes but they should be confined to a subset of columns, it is useful to have a record of hashes of the data in individual columns. Here is one way to do it:

    fixation_timeseries %>%
      sapply(digest::digest) %>%  # hash each column separately
      tibble::enframe() %>%
      arrange(name) %>%
      print(n = Inf)

It isn't necessary to do any of that when first working through the data prep. However, when you go back and change something, you probably do want to know whether it resulted (or didn't result) in changes down the line.
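A minimal base-R sketch of what a helper like save_rds_return_md5 could look like (the real one lives in R/helpers.R and may differ):

```r
# Save an object to an .Rds file and return the file's md5 hash, so that
# the hash ends up printed in the knitted .md document.
save_rds_return_md5 <- function(object, path) {
  saveRDS(object, path)
  unname(tools::md5sum(path))
}

# Usage (path is illustrative):
# save_rds_return_md5(data_tables_transformed$trials,
#                     "data/eyetracking/local/trials.Rds")
```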

Notebook outputs

Use the github_document output type for the notebooks. This produces .md (Markdown) documents that are lightweight and quick to produce; they can be read as-is or rendered on GitHub, and when they change, the git diffs are meaningful. A side effect of using github_document is that figures are saved as .png files.

  • Do commit the .md and .png files.

  • Do not commit any other files produced by knitting (e.g., no .log files). Do add them to .gitignore.

As we are committing .png files, we don't want them to change all the time. This might require setting a random seed in case your figures include random jitter or sampling-based confidence intervals.

Setting the random seed once at the top of the document might not always work, e.g., when the order of randomness-based commands changes. In those cases, use https://withr.r-lib.org/reference/with_seed.html. Also, remind Zhenya to write a function that converts text to a random seed so you can do something like withr::with_seed(seed_from_text('plot-of-participant-attrition'), {qplot(df$attrition)}).
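Until that function exists, a base-R sketch of what a seed_from_text could look like (the name and approach are hypothetical; a real implementation might use a proper hash such as digest::digest2int instead):

```r
# Map a text label to a deterministic integer seed. Position-weighted
# character codes so that anagrams get different seeds; a hash would
# collide less, but this keeps the sketch dependency-free.
seed_from_text <- function(label) {
  codes <- utf8ToInt(label)
  sum(codes * seq_along(codes)) %% .Machine$integer.max
}

# Usage:
# withr::with_seed(seed_from_text("plot-of-participant-attrition"), {
#   qplot(df$attrition)
# })
```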

General git guidelines

  • There should be no untracked or modified files after you make a commit. For example, you don't want to commit the .Rmd without committing the .png files and the .md.

  • To make it easy to check for untracked and modified files, add anything you don't want to track to .gitignore. Then all the changes shown by git status (or your git gui of choice) will be the changes you've just produced.

  • Do not commit .Rds files unless you are sure they won't change. Definitely don't commit any files produced by caching - they are temporary by design.

Joins

One of the reasons we are now using split tables instead of one big one is to avoid having to use distinct() all the time, which often results in both duplicated rows and missing data. The downside is that we have to use inner_join/left_join to combine the tables on the fly. This, too, can lead to both duplicated and missing data. Fortunately, dplyr now has a way to minimize those problems in the form of the unmatched and relationship parameters. A work-in-progress version of the code can temporarily omit them, but in the end, both of them, as well as the by parameter, should be supplied.

All *_join function calls should supply by, unmatched, and relationship parameters.
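For example, a join with all three parameters spelled out might look like this (requires dplyr >= 1.1; the tables and key columns are hypothetical):

```r
library(dplyr)

# Toy tables with hypothetical key columns:
trials <- tibble::tibble(recording_id = "r1", trial_index = 1:3)
target_onsets <- tibble::tibble(
  recording_id = "r1",
  trial_index = 1:3,
  target_onset = c(1360, 1420, 1500)
)

trials_with_onsets <- inner_join(
  trials,
  target_onsets,
  by = c("recording_id", "trial_index"),
  unmatched = "error",        # error instead of silently dropping rows
  relationship = "one-to-one" # error instead of silently duplicating rows
)
```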

Comparison with the older template

  • One long notebook was split into three. Split into as few or as many as necessary to make each notebook short enough to be comprehensible.

  • Instead of combining all the data into one large monotable, we now use several tables: recordings, trials, messages, fixations, target_onsets.

  • For now, the output of 03_gaze-data.Rmd is still a monotable, though with fewer columns. This will probably change too.

  • Having multiple tables means we now need to do lots of joins.

  • We are now knitting to github_document to avoid dealing with LaTeX, to knit quicker, and to have meaningful git diffs.

Last updated