Data Science without the Data

Rhian Davies | @statsRhian

About Me 👋

  • Data Scientist at Jumping Rivers
  • RSS Statistical Ambassador
  • Bad at French (Je suis désolé 😳)

Cartoon of a woman holding out a book

About Jumping Rivers

  • Data science & machine learning
  • Training courses
  • Dashboard development and deployment
  • Infrastructure
  • Managed Posit services

Cartoon of three people working at computers

I’m going to tell you a story

The Client

  • Database of patients with a rare disease
  • Consulted us to perform the data analysis for a study
    • 200 statistical results (count, %, mean, sd, median, IQR)
    • Interrupted Time Series Analysis

A cartoon robot holding a test tube and wearing a lab coat

Stratifications

  • Country
  • Subtypes of the disease
  • Mobility
  • Drug
  • Year
  • Age of patient

For example

  • What is the average age of patients when they are diagnosed (by country and subtype)?

  • What percentage of patients are taking Drug A (by country, subtype and year)?

Simple, yes?

data |>
  group_by(country, subtype) |>
  summarise(mean = mean(age_at_diagnosis))

The challenge 🙈

  • Write a detailed Statistical Analysis Plan without seeing any data
  • Start development with a small subset of the data
  • We can never see the data for Germany

Time to chat 💬

  • Have you experienced scenarios that left you with no data?

  • What problems did you encounter?

Our plan

The power of statistical summaries

  • For each dataset, calculate all the summaries we might need
  • Combine these summaries as we like
    • Mean: \(\frac{1}{N} \sum_{i=1}^{N} x_{i}\)
    • Standard deviation: \(\sqrt{\frac{1}{N - 1} \left( \sum_{i=1}^{N} x_{i}^2 - \frac{1}{N} \left( \sum_{i=1}^{N} x_{i} \right)^2 \right)}\)
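
As a rough sketch of how such summaries might be combined, the base R code below keeps only the count, sum and sum of squares for each dataset, then recovers the pooled mean and standard deviation from them. The helper names summarise_sufficient() and combine_summaries() and the example ages are made up for illustration; they are not taken from the project.

```r
# Hypothetical helper: reduce a numeric vector to the three quantities
# needed to rebuild the mean and standard deviation later.
summarise_sufficient <- function(x) {
  x <- x[!is.na(x)]
  data.frame(n = length(x), sum_x = sum(x), sum_x2 = sum(x^2))
}

# Hypothetical helper: pool per-dataset summaries without the raw data.
combine_summaries <- function(...) {
  parts <- do.call(rbind, list(...))
  n <- sum(parts$n)
  sum_x <- sum(parts$sum_x)
  sum_x2 <- sum(parts$sum_x2)
  # Sample variance from sums: (sum(x^2) - sum(x)^2 / n) / (n - 1)
  var_x <- (sum_x2 - sum_x^2 / n) / (n - 1)
  data.frame(n = n, mean = sum_x / n, sd = sqrt(var_x))
}

# Made-up ages standing in for two datasets that cannot be merged directly
visible_ages <- c(34, 41, 29, 52, 47)
hidden_ages <- c(38, 45, 50, 31)

pooled <- combine_summaries(
  summarise_sufficient(visible_ages),
  summarise_sufficient(hidden_ages)
)
```

The pooled values should agree with mean() and sd() computed on the concatenated vectors. With very large values the sum-of-squares form can lose floating-point precision, so treat this as an illustration rather than a production routine.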

A small cartoon robot stood next to a huge pile of data

Develop an R package

  • Run it on the data we can see
  • Send it to Marcus
  • He sends us an .RDS
  • We can aggregate and plot as needed

devtools::install_local("describeDisease.tar.gz")
library("describeDisease")
run_analysis("path/to/german.xlsx")
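
The hand-off might look roughly like the sketch below. run_analysis_sketch() is a hypothetical stand-in, not the real run_analysis(): it takes a data frame rather than a spreadsheet path, and the data are made up. The column names subtype and age_at_diagnosis follow the earlier example.

```r
# Hypothetical stand-in for the package entry point: it writes only
# aggregated summaries to disk, never patient-level rows.
run_analysis_sketch <- function(data, out_path) {
  groups <- split(data$age_at_diagnosis, data$subtype)
  summaries <- do.call(rbind, lapply(names(groups), function(g) {
    x <- groups[[g]]
    data.frame(subtype = g, n = length(x), sum_x = sum(x), sum_x2 = sum(x^2))
  }))
  saveRDS(summaries, out_path)
  invisible(out_path)
}

# Made-up data in place of the spreadsheet we cannot see
hidden_data <- data.frame(
  subtype = c("A", "A", "B", "B", "B"),
  age_at_diagnosis = c(30, 40, 50, 60, 70)
)

rds_path <- tempfile(fileext = ".RDS")
run_analysis_sketch(hidden_data, rds_path)

# On our side: read the returned file and work with the summaries only
returned <- readRDS(rds_path)
returned$mean <- returned$sum_x / returned$n
```

The point of the design is that the returned file holds one row per group, so nothing patient-level has to leave the data holder's machine.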

Cartoon people holding wrapped presents

Where to develop?

  • Data security is important
  • Client wanted controlled access and logs
  • Shared projects
  • Multiple sessions

The Posit Workbench logo

Data exploration

  • What values are unique per patient?

  • Which stratifications are viable?

  • Quarto document for data exploration and validation

The Posit Workbench logo

Data validation packages 📦

What happened?

Sure, we’ll send you dummy data

Oh no

  • Real data shuffled
  • It was an XLSX worksheet

Cartoon figure saying 'Oh no'

Sure, we’ll send you the schema

Database schema for a single indicator listing allowed entries

Oh no

  • Data didn’t match the specification
  • Data types not defined

Cartoon figure saying 'Oh no'

Sure, we’ll send you validated data

Oh no

  • It wasn’t validated.
  • Patients with stop dates but no start dates
  • Patients with start & stop dates but with the drug name missing
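
Checks for these two problems could be sketched in base R as below. The table layout and column names (patient_id, drug, start_date, stop_date) are assumptions, and the records are made up.

```r
# Made-up prescription records; the column names are assumptions
prescriptions <- data.frame(
  patient_id = c(1, 2, 3, 4),
  drug = c("Drug A", NA, "Drug B", "Drug A"),
  start_date = as.Date(c("2020-01-01", "2020-03-01", NA, "2021-05-01")),
  stop_date = as.Date(c("2020-06-01", "2020-09-01", "2021-02-01", NA))
)

# Stop date recorded but no start date
stop_without_start <- !is.na(prescriptions$stop_date) &
  is.na(prescriptions$start_date)

# Both dates recorded but the drug name is missing
dates_without_drug <- !is.na(prescriptions$start_date) &
  !is.na(prescriptions$stop_date) &
  is.na(prescriptions$drug)

# Keep only the records that fail at least one check
problems <- data.frame(
  patient_id = prescriptions$patient_id,
  stop_without_start,
  dates_without_drug
)
problems <- problems[problems$stop_without_start | problems$dates_without_drug, ]
```

A table like problems contains only identifiers and flags, so it could be sent back to the data owner without exposing the underlying values.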

Whose responsibility is it?

Cartoon figure saying 'Oh no'

Okay, let’s run the analysis

Oh no

Hi Rhian, I have run the code, unfortunately I get the error you can see below.

Error in `purrr::map()`:
In index: 18.
Caused by error in `dplyr::group_by()`:
! Must group by variables found in `.data`.
Column `time_axis` is not found.
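
The slides do not say why time_axis was missing. One way to surface this kind of failure earlier is to check for required columns before any analysis starts, as in this sketch. check_required_columns() is a hypothetical helper and the data are made up.

```r
# Hypothetical helper: stop early, with a readable message, if any column
# the analysis depends on is absent from the data.
check_required_columns <- function(data, required) {
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0) {
    stop(
      "Missing required column(s): ",
      paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Made-up data without `time_axis`
example_data <- data.frame(country = "DE", subtype = "A")

# Capture the message instead of halting, to show what the user would see
result <- tryCatch(
  check_required_columns(example_data, c("country", "subtype", "time_axis")),
  error = function(e) conditionMessage(e)
)
```

Run at the top of the entry point, a check like this fails in seconds rather than partway through a long purrr::map() call, and the message names the column without revealing any data.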

Cartoon figure saying 'Oh no'

Generating results…

Oh no

> wb = openxlsx::createWorkbook("Results")
> openxlsx::addWorksheet(wb, "Analysis by country, subtype and drug")
Error in openxlsx::addWorksheet(wb, "Analysis by country, subtype and drug") : 
  sheetName 'Analysis by country, subtype and drug' too long! Max length is 31 characters.
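
A small guard against the 31-character limit could look like this. shorten_sheet_names() is a hypothetical helper, and plain truncation is only one possible naming strategy.

```r
# Hypothetical helper: shorten worksheet names to the 31-character limit
# reported in the error, and refuse to continue if names would collide.
shorten_sheet_names <- function(sheet_names, max_length = 31) {
  short <- trimws(substr(sheet_names, 1, max_length))
  if (anyDuplicated(short) > 0) {
    stop("Shortened sheet names are not unique; choose shorter titles.")
  }
  short
}

titles <- c(
  "Analysis by country, subtype and drug",
  "Analysis by year"
)
sheet_names <- shorten_sheet_names(titles)
```

The shortened names could then be passed to openxlsx::addWorksheet(), with the full title written into the sheet itself so no information is lost.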

Cartoon figure saying 'Oh no'

“Final” run

Sure, I’ll run it right away and let you know!

Oh no

Unfortunately, I get the error below. The same error also appears when I only use the data that you already have, which is strange because I suppose that you have already tested this script on that data.

    Error in `dplyr::left_join()`:
    ! `...` must be empty.
    ✖ Problematic argument:
    • relationship = "many-to-many"

Cartoon figure saying 'Oh no'

{dplyr} version

  • We specified {dplyr} v1.1.0
  • We needed to specify {dplyr} v1.1.1
  • {renv} or Docker would have avoided this
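
Short of a full {renv} or Docker setup, a runtime guard is one lightweight option. This is a sketch and require_min_version() is a hypothetical helper. It is demonstrated on the base package stats so that it runs anywhere.

```r
# Hypothetical helper: fail fast if an installed package is older than the
# version the analysis code was written against.
require_min_version <- function(pkg, min_version) {
  installed <- utils::packageVersion(pkg)
  if (installed < min_version) {
    stop(
      sprintf(
        "%s >= %s is required, but %s is installed.",
        pkg, min_version, as.character(installed)
      ),
      call. = FALSE
    )
  }
  invisible(installed)
}

# In the analysis package this might be called as:
# require_min_version("dplyr", "1.1.1")
ok <- require_min_version("stats", "3.0.0")
```

The same constraint can also be declared in the package DESCRIPTION file, for example as dplyr (>= 1.1.1) under Imports, so that installation itself flags an outdated version.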

Diffify hex sticker: a red package symbol next to a green package symbol

Voilà 🎉

Facetted ggplot graph showing points and standard deviation

Voilà 🎉

Facetted ggplot graph showing points and standard deviation

In hindsight

  • Push back earlier to evidence the data challenges
  • Set realistic expectations
  • Use a proper database
  • purrr::map2() with tidyr::nest() was a helpful workflow
  • Use a different git workflow
  • Use {renv} from the start
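
The slides do not show how purrr::map2() and tidyr::nest() were combined. One common pattern, sketched here with made-up data and a hypothetical summarise_group() function, is to nest the data by group and pass each nested data frame together with its group label to the same function.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Made-up patient data; column names are assumptions
patients <- tibble(
  country = c("FR", "FR", "FR", "UK", "UK"),
  age_at_diagnosis = c(30, 40, 50, 20, 70)
)

# Hypothetical per-group summary that also receives the group label
summarise_group <- function(group_data, group_name) {
  tibble(
    label = paste("Country:", group_name),
    n = nrow(group_data),
    mean_age = mean(group_data$age_at_diagnosis)
  )
}

results <- patients |>
  nest(data = -country) |>                                  # one row per country
  mutate(summary = map2(data, country, summarise_group)) |> # data + label
  select(country, summary) |>
  unnest(summary)
```

Because each group is processed by an ordinary function of a data frame, that function can be written and tested on the visible subset before the package is run on data nobody on the team can see.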

Questions?