Data Science without the Data

Rhian Davies | @statsRhian

About Me 👋

Data Scientist at Jumping Rivers
RSS Statistical Ambassador
Bad at French (Je suis désolé 😳)

Cartoon of a woman holding out a book

About Jumping Rivers

Data science & machine learning
Training courses
Dashboard development and deployment
Infrastructure
Managed Posit services

Cartoon of three people working at computers

I’m going to tell you a story

The Client

Database of patients with a rare disease
Consulted us to perform the data analysis for a study
- 200 statistical results (count, %, mean, sd, median, IQR)
- Interrupted Time Series Analysis

A cartoon robot holding a test tube and wearing a lab coat

Stratifications

Country
Subtypes of the disease
Mobility
Drug
Year
Age of patient

For example

What is the average age of patients when they are diagnosed (by country and subtype)?
What percentage of patients are taking Drug A (by country, subtype and year)?

Simple, yes?

data |>
  group_by(country, subtype) |>
  summarise(mean = mean(age_at_diagnosis))
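The second example question (percentage of patients taking Drug A) could be sketched in the same style. The toy data, the `drug` and `year` column names, and the choice to count a missing drug as "not on Drug A" are all assumptions made for illustration, not details from the study.

```r
library(dplyr)

# Hypothetical toy data: one row per patient (column names are assumptions)
data <- tibble::tibble(
  country = c("FR", "FR", "FR", "UK", "UK", "UK"),
  subtype = c("I", "I", "I", "I", "I", "I"),
  year    = c(2020, 2020, 2020, 2020, 2020, 2020),
  drug    = c("A", "B", "A", "A", NA, "B")
)

pct_drug_a <- data |>
  group_by(country, subtype, year) |>
  summarise(
    n_patients = n(),
    # Assumption: a missing drug is treated as "not on Drug A"
    n_drug_a   = sum(drug == "A", na.rm = TRUE),
    pct_drug_a = 100 * n_drug_a / n_patients,
    .groups    = "drop"
  )
```

Keeping the counts alongside the percentage means the result can still be re-aggregated later.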

youtube.com/watch?v=2k7KAmgPO20

The challenge 🙈

Write a detailed Statistical Analysis Plan without seeing any data
Start development with a small subset of the data
We can’t see the data for Germany ever

Time to chat 💬

Have you experienced scenarios that left you with no data?
What problems did you encounter?

Our plan

The power of statistical summaries

For each dataset, calculate all the summaries we might need
Combine these summaries as we like
- Mean: \(\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_{i}\)
- Standard deviation: \(s = \sqrt{\frac{1}{N - 1} \left( \sum_{i=1}^{N} x^2_{i} - \frac{1}{N} \left( \sum_{i=1}^{N} x_{i} \right)^2 \right)}\)
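A minimal sketch of why this works: if each dataset reports only its count, sum and sum of squares, those three numbers can be added across datasets and the pooled mean and standard deviation recovered from the totals, using the standard sum-of-squares identity for the sample variance. The function names and the toy values are hypothetical.

```r
# Per-dataset summaries: count, sum and sum of squares (no raw values kept)
summarise_x <- function(x) {
  x <- x[!is.na(x)]
  data.frame(n = length(x), sum_x = sum(x), sum_x2 = sum(x^2))
}

# Add the summaries together, then recover the pooled mean and sd
combine_summaries <- function(summaries) {
  n      <- sum(summaries$n)
  sum_x  <- sum(summaries$sum_x)
  sum_x2 <- sum(summaries$sum_x2)
  data.frame(
    n    = n,
    mean = sum_x / n,
    sd   = sqrt((sum_x2 - sum_x^2 / n) / (n - 1))
  )
}

# Hypothetical ages: one dataset we can see, one we cannot
visible <- c(34, 41, 29, 52)
hidden  <- c(45, 38, 61)

pooled <- combine_summaries(rbind(summarise_x(visible), summarise_x(hidden)))
```

The pooled values match `mean()` and `sd()` on the concatenated data. The sum-of-squares form can lose precision when the mean is large relative to the spread, so it is worth checking against the raw data wherever that is available. Medians and IQRs cannot be combined this way.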

A small cartoon robot stood next to a huge pile of data

Develop an R package

Run it on the data we can see
Send it to Marcus
He sends us an .RDS
We can aggregate and plot as needed

devtools::install_local("describeDisease.tar.gz")
library("describeDisease")
run_analysis("path/to/german.xlsx")
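As a rough illustration of the round trip, the sketch below computes only aggregated summaries and writes them to an .RDS file, so no patient-level rows leave the site. `run_analysis_sketch()` is a hypothetical stand-in, not the real `run_analysis()`; it takes a data frame rather than an .xlsx path so that it runs without external files, and the column names are assumptions.

```r
library(dplyr)

# Hypothetical stand-in for the package entry point
run_analysis_sketch <- function(data, out_path) {
  summaries <- data |>
    group_by(country, subtype) |>
    summarise(
      n      = sum(!is.na(age_at_diagnosis)),
      sum_x  = sum(age_at_diagnosis, na.rm = TRUE),
      sum_x2 = sum(age_at_diagnosis^2, na.rm = TRUE),
      .groups = "drop"
    )
  # Only the aggregated table is saved and sent back
  saveRDS(summaries, out_path)
  invisible(out_path)
}

# Toy data standing in for the dataset we never see
toy <- tibble::tibble(
  country = "DE",
  subtype = c("I", "I", "II"),
  age_at_diagnosis = c(40, 50, 30)
)

rds_path <- run_analysis_sketch(toy, tempfile(fileext = ".rds"))
returned <- readRDS(rds_path)
```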

Cartoon people holding wrapped presents

Where to develop?

Data security is important
Client wanted controlled access and logs
Shared projects
Multiple sessions

The Posit Workbench logo

Data exploration

What values are unique per patient?
Which stratifications are viable?
Quarto document for data exploration and validation

The Posit Workbench logo

Data validation packages 📦

What happened?

Sure, we’ll send you dummy data

Oh no

Real data shuffled
It was an XLSX worksheet

Cartoon figure saying 'Oh no'

Sure, we’ll send you the schema

Database schema for a single indicator listing allowed entries

Oh no

Data didn’t match the specification
Data types not defined

Cartoon figure saying 'Oh no'

Sure, we’ll send you validated data

Oh no

It wasn’t validated.
Patients with stop dates but no start dates
Patients with start & stop dates but with the drug name missing
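Checks like these are cheap to automate. The sketch below flags both problems on a toy table; the table and its column names (`patient_id`, `drug`, `start_date`, `stop_date`) are assumptions for illustration.

```r
library(dplyr)

# Hypothetical treatment records
treatments <- tibble::tibble(
  patient_id = c(1, 2, 3, 4),
  drug       = c("A", NA, "B", "A"),
  start_date = as.Date(c("2020-01-01", "2020-03-01", NA, "2021-05-01")),
  stop_date  = as.Date(c("2020-06-01", "2020-09-01", "2021-01-01", NA))
)

problems <- treatments |>
  mutate(
    # Stop date recorded but no start date
    stop_without_start = !is.na(stop_date) & is.na(start_date),
    # Both dates recorded but the drug name is missing
    dates_without_drug = !is.na(start_date) & !is.na(stop_date) & is.na(drug)
  ) |>
  filter(stop_without_start | dates_without_drug)
```

Returning the offending rows, rather than just a pass/fail flag, gives the data owner something concrete to fix.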

Whose responsibility is it?

Cartoon figure saying 'Oh no'

Okay let’s run the analysis

Oh no

Hi Rhian, I have run the code, unfortunately I get the error you can see below.

Error in `purrr::map()`:
In index: 18.
Caused by error in `dplyr::group_by()`:
! Must group by variables found in `.data`.
Column `time_axis` is not found.
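One way to make this kind of failure easier to diagnose remotely is to check for required columns up front and fail with a message naming what is missing. `check_columns()` is a hypothetical helper, not part of the package described in the talk.

```r
# Hypothetical helper: stop early with a clear message if columns are missing
check_columns <- function(data, required) {
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0) {
    stop(
      "Missing required column(s): ",
      paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(data)
}

# Example: the data lacks `time_axis`, so the check fails before any analysis
result <- tryCatch(
  check_columns(data.frame(country = "DE"), c("country", "time_axis")),
  error = function(e) conditionMessage(e)
)
```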

Cartoon figure saying 'Oh no'

Generating results…

Oh no

> wb = openxlsx::createWorkbook("Results")
> openxlsx::addWorksheet(wb, "Analysis by country, subtype and drug")
Error in openxlsx::addWorksheet(wb, "Analysis by country, subtype and drug") : 
  sheetName 'Analysis by country, subtype and drug' too long! Max length is 31 characters.
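A small guard can avoid this: shorten worksheet names to the 31-character limit reported in the error, and refuse to continue if shortening makes two names identical. `safe_sheet_names()` is a hypothetical helper; the openxlsx call appears only in a comment so the sketch runs without that package.

```r
# Hypothetical helper: truncate worksheet names to the spreadsheet limit
safe_sheet_names <- function(x, max_length = 31) {
  short <- substr(x, 1, max_length)
  if (anyDuplicated(short) > 0) {
    stop("Truncated sheet names are not unique; shorten them by hand.",
         call. = FALSE)
  }
  short
}

sheet <- safe_sheet_names("Analysis by country, subtype and drug")

# Usage with openxlsx (not run here):
# wb <- openxlsx::createWorkbook("Results")
# openxlsx::addWorksheet(wb, sheet)
```

The full title can still be written into a cell on the sheet so that nothing is lost.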

Cartoon figure saying 'Oh no'

“Final” run

Sure, I’ll run it right away and let you know!

Oh no

Unfortunately, I get the error below. The same error also appears when I only use the data that you already have, which is strange because I suppose that you have already tested this script on that data.

    Error in `dplyr::left_join()`:
    ! `...` must be empty.
    ✖ Problematic argument:
    • relationship = "many-to-many"

Cartoon figure saying 'Oh no'

{dplyr} version

We specified {dplyr} v1.1.0
We needed to specify {dplyr} v1.1.1
{renv} or Docker would have avoided this
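Short of a full {renv} lockfile, a package can at least fail fast when a dependency is too old. The sketch below is one hedged way to do that; `check_min_version()` is a hypothetical helper, and the {renv} calls are shown as comments because they modify the project.

```r
# Hypothetical helper: fail early if an installed package is too old
check_min_version <- function(pkg, min_version) {
  installed <- utils::packageVersion(pkg)
  if (installed < min_version) {
    stop(
      pkg, " >= ", min_version, " is required, but ",
      as.character(installed), " is installed.",
      call. = FALSE
    )
  }
  invisible(installed)
}

# In the analysis package this might be called as:
# check_min_version("dplyr", "1.1.1")

# With {renv}, the exact versions are recorded and restored instead:
# renv::init()      # set up a project library
# renv::snapshot()  # write renv.lock with the versions in use
# renv::restore()   # recreate that library on another machine
```

In a package, the same constraint can also be declared in the DESCRIPTION file as `Imports: dplyr (>= 1.1.1)`.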

Diffify hex sticker: a red package symbol next to a green package symbol

Voilà 🎉

Faceted ggplot graph showing points and standard deviation

Voilà 🎉

In hindsight

Push back earlier to evidence the data challenges
Set realistic expectations
Use a proper database
purrr::map2() with tidyr::nest() was a helpful workflow
Use a different git workflow
Use {renv} from the start
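For readers unfamiliar with the nest-and-map pattern, here is a minimal sketch of one plausible form of it: nest the data by a stratification, then use `purrr::map2()` to pass each subset and its label to a summary function. The talk does not show the actual code, so the toy data and `summarise_group()` are assumptions.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data
toy <- tibble::tibble(
  country = c("FR", "FR", "UK", "UK", "UK"),
  age_at_diagnosis = c(30, 40, 50, 60, 70)
)

# Hypothetical summary function applied to each nested subset
summarise_group <- function(group_data, label) {
  tibble::tibble(
    label    = label,
    n        = nrow(group_data),
    mean_age = mean(group_data$age_at_diagnosis)
  )
}

results <- toy |>
  nest(data = -country) |>                       # one row per country
  mutate(summary = map2(data, country, summarise_group))

combined <- bind_rows(results$summary)
```

Because each subset is handled by an ordinary function, the same function can be tested on visible data and then run unchanged on data held elsewhere.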

Questions?

@statsRhian

StatsRhian

jumpingrivers.com

shiny-in-production.jumpingrivers.com