2a. Create patient-level snapshots

About snapshots

patientflow is organised around the following concepts:

Prediction time: A moment in the day at which predictions are to be made, for example 09:30.
Patient snapshot: A summary of data from the EHR capturing what is known about a single patient at the prediction time. Each patient snapshot has a date and a prediction time associated with it.
Group snapshot: A set of patient snapshots. Each group snapshot has a date and a prediction time associated with it.
Prediction window: A period of hours that begins at the prediction time.
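
To make the relationship between a prediction time and a prediction window concrete, here is a minimal sketch using only the standard library. The helper name prediction_window and the 8-hour window length are my own assumptions for illustration; they are not part of patientflow.

```python
from datetime import date, datetime, time, timedelta


def prediction_window(snapshot_date, prediction_time, window_hours):
    """Return the (start, end) datetimes of a prediction window.

    Hypothetical helper for illustration only, not a patientflow function.
    """
    hour, minute = prediction_time
    # The window begins at the prediction time on the snapshot date
    start = datetime.combine(snapshot_date, time(hour=hour, minute=minute))
    # ... and lasts for the chosen number of hours (8 is an assumed value)
    end = start + timedelta(hours=window_hours)
    return start, end


window_start, window_end = prediction_window(date(2023, 1, 1), (9, 30), 8)
print(window_start, window_end)
```

A patient snapshot taken at 09:30 on 2023-01-01 would then be paired with outcomes that fall between these two datetimes.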

To use patientflow your data should be in snapshot form.

In this notebook I suggest how you might prepare your data, starting from data on finished hospital visits. I start with fake data on Emergency Department visits, and demonstrate how to convert it into snapshots. There are two examples:

A simple example of creating snapshots assuming you have one flat table of hospital visits
An example of creating snapshots from data structured as a relational database.

A note on creating your own snapshots

The snapshot creation shown here is designed to work with fake data generated below. You would need to create your own version of this process, to handle the data you have.

In practice, determining from data whether a patient was admitted after the ED visit, and when they were ready to be admitted, can be tricky. How do you account for the fact that the patient may wait in the ED because no bed is available? Likewise, if you are trying to predict discharge at the end of a hospital visit, should that be the time they were ready to leave, or the time they actually left? Discharge delays are common, due to waiting for medication or transport, or waiting for onward care provision to become available.

The outcome that you are aiming for will depend on your setting, and the information needs of the bed managers you are looking to support. You may have to infer when a patient was ready from available data. Suffice to say, think carefully about what it is you are trying to predict, and how you will identify that outcome in data.

Creating fake finished visits

I'll start by loading some fake data resembling the structure of EHR data on Emergency Department (ED) visits, using a function called create_fake_finished_visits. In my fake data, each visit has one row, with an arrival time at the ED, a discharge time from the ED, the patient's age and an outcome of whether they were admitted after the ED visit.

The is_admitted column is our label, indicating the outcome in this imaginary case.

# Reload functions every time
%load_ext autoreload 
%autoreload 2

from patientflow.generate import create_fake_finished_visits
visits_df, _, _ = create_fake_finished_visits('2023-01-01', '2023-04-01', 25)

print(f'There are {len(visits_df)} visits in the fake dataset, with arrivals between {visits_df.arrival_datetime.min().date()} and {visits_df.arrival_datetime.max().date()} inclusive.')
visits_df.head()

There are 2253 visits in the fake dataset, with arrivals between 2023-01-01 and 2023-03-31 inclusive.

	patient_id	visit_number	arrival_datetime	departure_datetime	age	is_admitted	specialty
0	354	1	2023-01-01 05:21:43	2023-01-01 12:35:43	31	1	medical
1	1281	7	2023-01-01 07:22:18	2023-01-01 22:46:18	31	0	None
2	113	15	2023-01-01 07:31:29	2023-01-01 16:12:29	41	0	None
3	114	3	2023-01-01 08:01:26	2023-01-01 10:34:26	33	0	None
4	1937	18	2023-01-01 08:45:38	2023-01-01 16:30:38	14	0	None

Example 1: Create snapshots from fake data - a simple example

My goal is to create snapshots of these visits. First, I define the times of day at which I will be issuing predictions. Each time is expressed as a tuple of (hour, minute).

prediction_times = [(6, 0), (9, 30), (12, 0), (15, 30), (22, 0)] # each time is expressed as a tuple of (hour, minute)

Then using the code below I create an array of all the snapshot dates in some date range that my data covers.

from datetime import datetime, time, timedelta, date

# Create date range
snapshot_dates = []
start_date = date(2023, 1, 1)
end_date = date(2023, 4, 1)

# Iterate to create an array of dates
current_date = start_date
while current_date < end_date:
    snapshot_dates.append(current_date)
    current_date += timedelta(days=1)

print('First ten snapshot dates')
snapshot_dates[0:10]

First ten snapshot dates

[datetime.date(2023, 1, 1),
 datetime.date(2023, 1, 2),
 datetime.date(2023, 1, 3),
 datetime.date(2023, 1, 4),
 datetime.date(2023, 1, 5),
 datetime.date(2023, 1, 6),
 datetime.date(2023, 1, 7),
 datetime.date(2023, 1, 8),
 datetime.date(2023, 1, 9),
 datetime.date(2023, 1, 10)]
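
As an aside, the same list of dates could probably be built more compactly with pandas. This is only a sketch of an alternative; the variable name snapshot_dates_alt is my own, and the loop above remains the version used in the rest of this notebook.

```python
from datetime import date, timedelta

import pandas as pd

start_date = date(2023, 1, 1)
end_date = date(2023, 4, 1)

# pd.date_range includes both endpoints by default, so stop one day early
# to reproduce the half-open range [start_date, end_date) used by the loop
snapshot_dates_alt = list(
    pd.date_range(start_date, end_date - timedelta(days=1), freq="D").date
)

print(len(snapshot_dates_alt), snapshot_dates_alt[0], snapshot_dates_alt[-1])
```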

Next I iterate through the date array, using the arrival and departure times from the hospital visits table to identify any patients who were in the ED at each prediction time (e.g. 09:30 or 12:00) on each date.

import pandas as pd


# Create empty list to store results for each snapshot date
patient_snapshot_list = []

# For each combination of date and time
for date_val in snapshot_dates:
    for hour, minute in prediction_times:
        snapshot_datetime = datetime.combine(date_val, time(hour=hour, minute=minute))

        # Keep visits that had started, and not yet ended, at the snapshot moment
        mask = (visits_df["arrival_datetime"] <= snapshot_datetime) & (
            visits_df["departure_datetime"] > snapshot_datetime
        )
        snapshot_df = visits_df[mask].copy()

        # Skip if no patients at this time
        if len(snapshot_df) == 0:
            continue

        # Add snapshot information columns
        snapshot_df["snapshot_date"] = date_val
        snapshot_df["prediction_time"] = [(hour, minute)] * len(snapshot_df)

        patient_snapshot_list.append(snapshot_df)

# Combine all results into single dataframe
snapshots_df = pd.concat(patient_snapshot_list, ignore_index=True)

# Name the index snapshot_id
snapshots_df.index.name = "snapshot_id"

Note that each record in the snapshots dataframe is indexed by a unique snapshot_id.

snapshots_df.head()

	patient_id	visit_number	arrival_datetime	departure_datetime	age	is_admitted	specialty	snapshot_date	prediction_time
snapshot_id
0	354	1	2023-01-01 05:21:43	2023-01-01 12:35:43	31	1	medical	2023-01-01	(6, 0)
1	354	1	2023-01-01 05:21:43	2023-01-01 12:35:43	31	1	medical	2023-01-01	(9, 30)
2	1281	7	2023-01-01 07:22:18	2023-01-01 22:46:18	31	0	None	2023-01-01	(9, 30)
3	113	15	2023-01-01 07:31:29	2023-01-01 16:12:29	41	0	None	2023-01-01	(9, 30)
4	114	3	2023-01-01 08:01:26	2023-01-01 10:34:26	33	0	None	2023-01-01	(9, 30)

Some patients are present at more than one of the prediction times, giving them more than one entry in snapshots_df.

# Count the number of snapshots per visit and show top five
snapshots_df.visit_number.value_counts().head()

visit_number
1940    7
375     7
1812    7
1733    7
1736    7
Name: count, dtype: int64

Below I show one example of a patient who was in the ED long enough to have multiple snapshots, captured at the various prediction times during their visit.

# Displaying the snapshots for a visit with multiple snapshots
example_visit_number = snapshots_df.visit_number.value_counts().index[0]
snapshots_df[snapshots_df.visit_number == example_visit_number]

	patient_id	visit_number	arrival_datetime	departure_datetime	age	is_admitted	specialty	snapshot_date	prediction_time
snapshot_id
2959	358	1940	2023-03-19 14:29:26	2023-03-21 03:27:26	79	0	None	2023-03-19	(15, 30)
2965	358	1940	2023-03-19 14:29:26	2023-03-21 03:27:26	79	0	None	2023-03-19	(22, 0)
2977	358	1940	2023-03-19 14:29:26	2023-03-21 03:27:26	79	0	None	2023-03-20	(6, 0)
2983	358	1940	2023-03-19 14:29:26	2023-03-21 03:27:26	79	0	None	2023-03-20	(9, 30)
2989	358	1940	2023-03-19 14:29:26	2023-03-21 03:27:26	79	0	None	2023-03-20	(12, 0)
2998	358	1940	2023-03-19 14:29:26	2023-03-21 03:27:26	79	0	None	2023-03-20	(15, 30)
3014	358	1940	2023-03-19 14:29:26	2023-03-21 03:27:26	79	0	None	2023-03-20	(22, 0)
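
To check why this visit has seven snapshots, the sketch below applies the same rule as the mask above (arrival at or before the snapshot moment, departure after it) to a single visit. The function name snapshot_datetimes_for_visit is hypothetical; the arrival and departure times are copied from the table above.

```python
from datetime import datetime, time, timedelta

prediction_times = [(6, 0), (9, 30), (12, 0), (15, 30), (22, 0)]


def snapshot_datetimes_for_visit(arrival, departure, prediction_times):
    """List the snapshot moments at which a single visit was in progress.

    Hypothetical helper for illustration only, not a patientflow function.
    """
    moments = []
    current_date = arrival.date()
    # Walk through every calendar date the visit touches
    while current_date <= departure.date():
        for hour, minute in prediction_times:
            moment = datetime.combine(current_date, time(hour=hour, minute=minute))
            # Same rule as the mask: arrived by the moment, not yet departed
            if arrival <= moment < departure:
                moments.append(moment)
        current_date += timedelta(days=1)
    return moments


arrival = datetime(2023, 3, 19, 14, 29, 26)
departure = datetime(2023, 3, 21, 3, 27, 26)
moments = snapshot_datetimes_for_visit(arrival, departure, prediction_times)
print(len(moments))
```

The patient arrived after 12:00 on 19 March and left before 06:00 on 21 March, so they are captured at two prediction times on the first day and all five on the second.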

Example 2: Create snapshots from a relational database

Electronic Health Record systems and their data warehouses are often structured as relational databases, with information stored on multiple linked tables. Timestamps are used to capture how information about a patient accumulates as the ED visit progresses. Patients may visit various locations in the ED, such as triage, where their acuity is recorded, and different activities related to their care are carried out, like measurements of vital signs or lab tests.

The function below returns three fake dataframes, meant to resemble EHR data.

hospital visit dataframe - already seen above
observations dataframe - with a single measurement, a triage score, plus a timestamp for when that was recorded
lab orders dataframe - with five types of lab orders plus a timestamp for when these tests were requested

The function that creates the fake data returns one triage score for each visit, within 10 minutes of arrival.

visits_df, observations_df, lab_orders_df = create_fake_finished_visits('2023-01-01', '2023-04-01', 25)

print(f'There are {len(observations_df)} triage scores in the observations_df dataframe, for {len(observations_df.visit_number.unique())} visits')
observations_df.head()

There are 2253 triage scores in the observations_df dataframe, for 2253 visits

	visit_number	observation_datetime	triage_score
0	1	2023-01-01 05:25:48.686712	2
1	7	2023-01-01 07:24:04.659833	2
2	15	2023-01-01 07:39:02.025157	4
3	3	2023-01-01 08:10:51.432211	3
4	18	2023-01-01 08:50:52.495502	4

The function that creates the fake data returns a random number of lab tests for each patient, for visits over 2 hours. Not all visits will have lab orders in this fake data.

print(f'There are {len(lab_orders_df)} lab orders in the dataset, for {len(lab_orders_df.visit_number.unique())} visits')
lab_orders_df.head()

There are 5754 lab orders in the dataset, for 2091 visits

	visit_number	order_datetime	lab_name
0	1	2023-01-01 05:51:39.377886	BMP
1	1	2023-01-01 05:58:40.347001	D-dimer
2	1	2023-01-01 06:36:24.534586	CBC
3	1	2023-01-01 06:49:29.836402	Urinalysis
4	7	2023-01-01 07:43:33.443262	Troponin

The create_fake_snapshots() function will pull information from the three fake tables, and prepare snapshots.

from datetime import date
start_date = date(2023, 1, 1)
end_date = date(2023, 4, 1)

from patientflow.generate import create_fake_snapshots

# Create snapshots
new_snapshots_df = create_fake_snapshots(df=visits_df, observations_df=observations_df, lab_orders_df=lab_orders_df, prediction_times=prediction_times, start_date=start_date, end_date=end_date)
new_snapshots_df.head()

	snapshot_date	prediction_time	patient_id	visit_number	is_admitted	age	latest_triage_score	num_bmp_orders	num_d-dimer_orders	num_cbc_orders	num_urinalysis_orders	num_troponin_orders
snapshot_id
0	2023-01-01	(6, 0)	354	1	1	31	2.0	1	1	0	0	0
1	2023-01-01	(9, 30)	354	1	1	31	2.0	1	1	1	1	0
2	2023-01-01	(9, 30)	1281	7	0	31	2.0	1	1	1	0	1
3	2023-01-01	(9, 30)	113	15	0	41	4.0	0	0	1	1	0
4	2023-01-01	(9, 30)	114	3	0	33	3.0	1	1	0	1	1
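
I have not shown the internals of create_fake_snapshots() here, so the following is only a sketch of the general idea, not the library's implementation: for each snapshot, keep only the rows timestamped at or before the snapshot moment, then take the latest triage score and count the lab orders by type. The function features_as_of is hypothetical, and the small tables reproduce the rows for visit 1 shown above, with timestamps truncated to the second.

```python
from datetime import datetime

import pandas as pd

# Rows for visit 1, copied from the fake tables above
observations = pd.DataFrame({
    "visit_number": [1],
    "observation_datetime": pd.to_datetime(["2023-01-01 05:25:48"]),
    "triage_score": [2],
})
lab_orders = pd.DataFrame({
    "visit_number": [1, 1, 1, 1],
    "order_datetime": pd.to_datetime([
        "2023-01-01 05:51:39", "2023-01-01 05:58:40",
        "2023-01-01 06:36:24", "2023-01-01 06:49:29",
    ]),
    "lab_name": ["BMP", "D-dimer", "CBC", "Urinalysis"],
})
lab_names = ["BMP", "D-dimer", "CBC", "Urinalysis", "Troponin"]


def features_as_of(visit_number, snapshot_datetime):
    """Summarise what was known about a visit at the snapshot moment.

    Hypothetical helper for illustration only, not a patientflow function.
    """
    # Only observations recorded at or before the snapshot moment are visible
    obs = observations[
        (observations["visit_number"] == visit_number)
        & (observations["observation_datetime"] <= snapshot_datetime)
    ].sort_values("observation_datetime")
    latest_triage = obs["triage_score"].iloc[-1] if len(obs) > 0 else None

    # Count visible lab orders by type, with zero for types not yet ordered
    orders = lab_orders[
        (lab_orders["visit_number"] == visit_number)
        & (lab_orders["order_datetime"] <= snapshot_datetime)
    ]
    counts = orders["lab_name"].value_counts().reindex(lab_names, fill_value=0)

    features = {"latest_triage_score": latest_triage}
    for name in lab_names:
        features[f"num_{name.lower()}_orders"] = int(counts[name])
    return features


at_six = features_as_of(1, datetime(2023, 1, 1, 6, 0))
at_nine_thirty = features_as_of(1, datetime(2023, 1, 1, 9, 30))
print(at_six)
print(at_nine_thirty)
```

The CBC and urinalysis orders were placed after 06:00, so they are counted only in the 09:30 snapshot, which agrees with the first two rows of the table above.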

Returning to the example visit above, we can see that at 15:30 on 2023-03-19, the first snapshot for this patient, the triage score and one BMP order had already been recorded. The D-dimer and troponin orders were placed between 15:30 and 22:00, so they first appear in the 22:00 snapshot.

new_snapshots_df[new_snapshots_df.visit_number==example_visit_number]

	snapshot_date	prediction_time	patient_id	visit_number	is_admitted	age	latest_triage_score	num_bmp_orders	num_d-dimer_orders	num_cbc_orders	num_urinalysis_orders	num_troponin_orders
snapshot_id
2959	2023-03-19	(15, 30)	358	1940	0	79	3.0	1	0	0	0	0
2965	2023-03-19	(22, 0)	358	1940	0	79	3.0	1	1	0	0	1
2977	2023-03-20	(6, 0)	358	1940	0	79	3.0	1	1	0	0	1
2983	2023-03-20	(9, 30)	358	1940	0	79	3.0	1	1	0	0	1
2989	2023-03-20	(12, 0)	358	1940	0	79	3.0	1	1	0	0	1
2998	2023-03-20	(15, 30)	358	1940	0	79	3.0	1	1	0	0	1
3014	2023-03-20	(22, 0)	358	1940	0	79	3.0	1	1	0	0	1

Summary

Here I have shown how to create patient snapshots from finished patient visits. Note that creating snapshots involves discarding or summarising some information. The lab orders have been reduced to counts, and only the latest triage score has been kept. In the same vein, you might take just the last recorded heart rate or oxygen saturation level, or the latest value of a lab result. A snapshot loses some of the richness of the full data in an EHR, but with the benefit that you get data that replicate unfinished visits.

Note that ED visit data can be patchy in ways that are meaningful. For example, a severely ill patient might have many heart rate values recorded and many lab orders, while a patient with a sprained ankle might have zero heart rate measurements or lab orders. For predicting probability of admission after ED, such variation in data completeness is revealing. By summarising to counts, snapshots allow us to capture that variation in data completeness without having to discard observations that have missing data.
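
As a small illustration of that last point, the sketch below counts heart rate measurements per visit. The table, visit numbers and values are invented for this example and do not come from the fake data above. A visit with no measurements gets a count of zero, rather than a missing value or no row at all.

```python
import pandas as pd

# Invented example: visit 101 has three heart rate readings, visit 102 has none
heart_rates = pd.DataFrame({
    "visit_number": [101, 101, 101],
    "heart_rate": [118, 124, 121],
})
visit_numbers = [101, 102]

# Count readings per visit, then reindex so visits with no readings get zero
num_heart_rate = (
    heart_rates.groupby("visit_number")
    .size()
    .reindex(visit_numbers, fill_value=0)
)
print(num_heart_rate)
```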

In the next notebook I'll show how to make predictions using patient snapshots.