2a. Create patient-level snapshots

About snapshots

patientflow is organised around the following concepts:

  • Prediction time: A moment in the day at which predictions are to be made, for example 09:30.
  • Patient snapshot: A summary of data from the EHR capturing what is known about a single patient at the prediction time. Each patient snapshot has a date and a prediction time associated with it.
  • Group snapshot: A set of patient snapshots. Each group snapshot has a date and a prediction time associated with it.
  • Prediction window: A period of hours that begins at the prediction time.
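
To make these definitions concrete, here is a minimal sketch that represents a prediction time as an (hour, minute) tuple and derives the start and end of a prediction window on a given date. The helper name prediction_window and the 8-hour window length are assumptions chosen for illustration; they are not part of patientflow.

```python
from datetime import date, datetime, time, timedelta


def prediction_window(snapshot_date, prediction_time, window_hours):
    """Return (start, end) datetimes of the window that begins at the prediction time."""
    hour, minute = prediction_time
    start = datetime.combine(snapshot_date, time(hour=hour, minute=minute))
    end = start + timedelta(hours=window_hours)
    return start, end


# Hypothetical values: a 09:30 prediction time and an 8-hour window
window_start, window_end = prediction_window(date(2023, 1, 1), (9, 30), window_hours=8)
print(window_start, "to", window_end)
```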

To use patientflow your data should be in snapshot form.

In this notebook I suggest how you might prepare your data, starting from data on finished hospital visits. I start with fake data on Emergency Department visits, and demonstrate how to convert it into snapshots. There are two examples:

  • A simple example of creating snapshots assuming you have one flat table of hospital visits
  • An example of creating snapshots from data structured as a relational database.

A note on creating your own snapshots

The snapshot creation shown here is designed to work with fake data generated below. You would need to create your own version of this process, to handle the data you have.

In practice, determining from data whether a patient was admitted after the ED visit, and when they were ready to be admitted, can be tricky. How do you account for the fact that the patient may wait in the ED because no bed is available? Likewise, if you are trying to predict discharge at the end of a hospital visit, should that be the time they were ready to leave, or the time they actually left? Discharge delays are common, due to waiting for medication or transport, or waiting for onward care provision to become available.

The outcome that you are aiming for will depend on your setting, and the information needs of the bed managers you are looking to support. You may have to infer when a patient was ready from available data. Suffice to say, think carefully about what it is you are trying to predict, and how you will identify that outcome in data.

Creating fake finished visits

I'll start by loading some fake data resembling the structure of EHR data on Emergency Department (ED) visits, using a function called create_fake_finished_visits. In my fake data, each visit has one row, with an arrival time at the ED, a discharge time from the ED, the patient's age and an outcome of whether they were admitted after the ED visit.

The is_admitted column is our label, indicating the outcome in this imaginary case.

# Reload functions every time
%load_ext autoreload 
%autoreload 2
from patientflow.generate import create_fake_finished_visits
visits_df, _, _ = create_fake_finished_visits('2023-01-01', '2023-04-01', 25)

print(f'There are {len(visits_df)} visits in the fake dataset, with arrivals between {visits_df.arrival_datetime.min().date()} and {visits_df.arrival_datetime.max().date()} inclusive.')
visits_df.head()
There are 2253 visits in the fake dataset, with arrivals between 2023-01-01 and 2023-03-31 inclusive.
   patient_id  visit_number     arrival_datetime   departure_datetime  age  is_admitted  specialty
0         354             1  2023-01-01 05:21:43  2023-01-01 12:35:43   31            1    medical
1        1281             7  2023-01-01 07:22:18  2023-01-01 22:46:18   31            0       None
2         113            15  2023-01-01 07:31:29  2023-01-01 16:12:29   41            0       None
3         114             3  2023-01-01 08:01:26  2023-01-01 10:34:26   33            0       None
4        1937            18  2023-01-01 08:45:38  2023-01-01 16:30:38   14            0       None

Example 1: Create snapshots from fake data - a simple example

My goal is to create snapshots of these visits. First, I define the times of day at which I will issue predictions. Each time is expressed as a tuple of (hour, minute).

prediction_times = [(6, 0), (9, 30), (12, 0), (15, 30), (22, 0)] # each time is expressed as a tuple of (hour, minute)

Then using the code below I create an array of all the snapshot dates in some date range that my data covers.

from datetime import datetime, time, timedelta, date

# Create date range
snapshot_dates = []
start_date = date(2023, 1, 1)
end_date = date(2023, 4, 1)

# Iterate to create an array of dates
current_date = start_date
while current_date < end_date:
    snapshot_dates.append(current_date)
    current_date += timedelta(days=1)

print('First ten snapshot dates')
snapshot_dates[0:10]
First ten snapshot dates
[datetime.date(2023, 1, 1),
 datetime.date(2023, 1, 2),
 datetime.date(2023, 1, 3),
 datetime.date(2023, 1, 4),
 datetime.date(2023, 1, 5),
 datetime.date(2023, 1, 6),
 datetime.date(2023, 1, 7),
 datetime.date(2023, 1, 8),
 datetime.date(2023, 1, 9),
 datetime.date(2023, 1, 10)]
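
As an alternative to the while loop, the same list of dates can be built in one call with pandas. This is only a sketch: the variable name snapshot_dates_alt is my own, and the inclusive argument assumes pandas 1.4 or later.

```python
from datetime import date

import pandas as pd

start_date = date(2023, 1, 1)
end_date = date(2023, 4, 1)

# inclusive="left" keeps start_date and drops end_date,
# matching the `while current_date < end_date` loop above
snapshot_dates_alt = (
    pd.date_range(start_date, end_date, freq="D", inclusive="left").date.tolist()
)

print(len(snapshot_dates_alt), snapshot_dates_alt[0], snapshot_dates_alt[-1])
```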

Next I iterate through the date array, using the arrival and departure times from the hospital visits table to identify any patients who were in the ED at each prediction time (for example 09:30 or 12:00) on each date.

import pandas as pd


# Create empty list to store results for each snapshot date
patient_snapshot_list = []

# For each combination of date and time
for date_val in snapshot_dates:
    for hour, minute in prediction_times:
        snapshot_datetime = datetime.combine(date_val, time(hour=hour, minute=minute))

        # Keep visits that had arrived by, and not yet departed at, the snapshot time
        mask = (visits_df["arrival_datetime"] <= snapshot_datetime) & (
            visits_df["departure_datetime"] > snapshot_datetime
        )
        snapshot_df = visits_df[mask].copy()

        # Skip if no patients at this time
        if len(snapshot_df) == 0:
            continue

        # Add snapshot information columns
        snapshot_df["snapshot_date"] = date_val
        snapshot_df["prediction_time"] = [(hour, minute)] * len(snapshot_df)

        patient_snapshot_list.append(snapshot_df)

# Combine all results into single dataframe
snapshots_df = pd.concat(patient_snapshot_list, ignore_index=True)

# Name the index snapshot_id
snapshots_df.index.name = "snapshot_id"

Note that each record in the snapshots dataframe is indexed by a unique snapshot_id.

snapshots_df.head()
snapshot_id  patient_id  visit_number     arrival_datetime   departure_datetime  age  is_admitted  specialty  snapshot_date  prediction_time
          0         354             1  2023-01-01 05:21:43  2023-01-01 12:35:43   31            1    medical     2023-01-01           (6, 0)
          1         354             1  2023-01-01 05:21:43  2023-01-01 12:35:43   31            1    medical     2023-01-01          (9, 30)
          2        1281             7  2023-01-01 07:22:18  2023-01-01 22:46:18   31            0       None     2023-01-01          (9, 30)
          3         113            15  2023-01-01 07:31:29  2023-01-01 16:12:29   41            0       None     2023-01-01          (9, 30)
          4         114             3  2023-01-01 08:01:26  2023-01-01 10:34:26   33            0       None     2023-01-01          (9, 30)
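
The filter used in the loop treats a patient as present if they arrived at or before the snapshot time and departed strictly after it. The toy example below, with three made-up visits around a 09:30 prediction time, illustrates what that means at the boundaries: a patient arriving exactly at 09:30 is included, while one departing exactly at 09:30 is not. The dataframe toy_visits is invented for this illustration.

```python
from datetime import datetime

import pandas as pd

# Three hypothetical visits around a 09:30 prediction time
toy_visits = pd.DataFrame(
    {
        "visit_number": [1, 2, 3],
        "arrival_datetime": pd.to_datetime(
            ["2023-01-01 09:30:00", "2023-01-01 08:00:00", "2023-01-01 08:00:00"]
        ),
        "departure_datetime": pd.to_datetime(
            ["2023-01-01 11:00:00", "2023-01-01 09:30:00", "2023-01-01 10:00:00"]
        ),
    }
)

snapshot_datetime = datetime(2023, 1, 1, 9, 30)

# Same condition as in the loop: arrived at or before, departed strictly after
mask = (toy_visits["arrival_datetime"] <= snapshot_datetime) & (
    toy_visits["departure_datetime"] > snapshot_datetime
)
present = toy_visits.loc[mask, "visit_number"].tolist()
print(present)  # visits 1 and 3 are present; visit 2 left exactly at 09:30
```

Whether these boundary choices suit your setting is a judgement call; the point is to make them deliberately.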

Some patients are present at more than one of the prediction times, giving them more than one entry in snapshots_df.

# Count the number of snapshots per visit and show top five
snapshots_df.visit_number.value_counts().head()
visit_number
1940    7
375     7
1812    7
1733    7
1736    7
Name: count, dtype: int64

Below I show one example of a patient who was in the ED long enough to have multiple snapshots, captured at the various prediction times during their visit.

# Displaying the snapshots for a visit with multiple snapshots
example_visit_number = snapshots_df.visit_number.value_counts().index[0]
snapshots_df[snapshots_df.visit_number == example_visit_number]

snapshot_id  patient_id  visit_number     arrival_datetime   departure_datetime  age  is_admitted  specialty  snapshot_date  prediction_time
       2959         358          1940  2023-03-19 14:29:26  2023-03-21 03:27:26   79            0       None     2023-03-19         (15, 30)
       2965         358          1940  2023-03-19 14:29:26  2023-03-21 03:27:26   79            0       None     2023-03-19          (22, 0)
       2977         358          1940  2023-03-19 14:29:26  2023-03-21 03:27:26   79            0       None     2023-03-20           (6, 0)
       2983         358          1940  2023-03-19 14:29:26  2023-03-21 03:27:26   79            0       None     2023-03-20          (9, 30)
       2989         358          1940  2023-03-19 14:29:26  2023-03-21 03:27:26   79            0       None     2023-03-20          (12, 0)
       2998         358          1940  2023-03-19 14:29:26  2023-03-21 03:27:26   79            0       None     2023-03-20         (15, 30)
       3014         358          1940  2023-03-19 14:29:26  2023-03-21 03:27:26   79            0       None     2023-03-20          (22, 0)
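
Patient snapshots that share a snapshot date and prediction time together form a group snapshot. As a rough sketch of how you might count the patients in each group snapshot, the example below uses a small made-up table with the same snapshot columns as above. I convert the (hour, minute) tuple to a string label first, purely to keep the grouping keys simple; the column name prediction_time_label is my own invention.

```python
import pandas as pd

# A toy snapshots table with the same snapshot columns as snapshots_df
toy_snapshots = pd.DataFrame(
    {
        "visit_number": [1, 1, 7, 15, 3],
        "snapshot_date": ["2023-01-01"] * 5,
        "prediction_time": [(6, 0), (9, 30), (9, 30), (9, 30), (9, 30)],
    }
)

# Turn (9, 30) into "09:30" so that the group keys are plain strings
toy_snapshots["prediction_time_label"] = toy_snapshots["prediction_time"].apply(
    lambda t: f"{t[0]:02d}:{t[1]:02d}"
)

# One row per group snapshot, with the number of patients in the ED at that moment
group_sizes = toy_snapshots.groupby(["snapshot_date", "prediction_time_label"]).size()
print(group_sizes)
```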

Example 2: Create snapshots from a relational database

Electronic Health Record systems and their data warehouses are often structured as relational databases, with information stored on multiple linked tables. Timestamps are used to capture how information about a patient accumulates as the ED visit progresses. Patients may visit various locations in the ED, such as triage, where their acuity is recorded, and different activities related to their care are carried out, like measurements of vital signs or lab tests.

The function below returns three fake dataframes, meant to resemble EHR data.

  • hospital visit dataframe - already seen above
  • observations dataframe - with a single measurement, a triage score, plus a timestamp for when that was recorded
  • lab orders dataframe - with five types of lab orders plus a timestamp for when these tests were requested

The function that creates the fake data returns one triage score for each visit, recorded within 10 minutes of arrival.

visits_df, observations_df, lab_orders_df = create_fake_finished_visits('2023-01-01', '2023-04-01', 25)

print(f'There are {len(observations_df)} triage scores in the observations_df dataframe, for {len(observations_df.visit_number.unique())} visits')
observations_df.head()
There are 2253 triage scores in the observations_df dataframe, for 2253 visits
   visit_number        observation_datetime  triage_score
0             1  2023-01-01 05:25:48.686712             2
1             7  2023-01-01 07:24:04.659833             2
2            15  2023-01-01 07:39:02.025157             4
3             3  2023-01-01 08:10:51.432211             3
4            18  2023-01-01 08:50:52.495502             4

The function that creates the fake data returns a random number of lab tests for each patient, for visits over 2 hours. Not all visits will have lab orders in this fake data.

print(f'There are {len(lab_orders_df)} lab orders in the dataset, for {len(lab_orders_df.visit_number.unique())} visits')
lab_orders_df.head()
There are 5754 lab orders in the dataset, for 2091 visits
   visit_number              order_datetime    lab_name
0             1  2023-01-01 05:51:39.377886         BMP
1             1  2023-01-01 05:58:40.347001     D-dimer
2             1  2023-01-01 06:36:24.534586         CBC
3             1  2023-01-01 06:49:29.836402  Urinalysis
4             7  2023-01-01 07:43:33.443262    Troponin

The create_fake_snapshots() function will pull information from the three fake tables, and prepare snapshots.

from datetime import date
start_date = date(2023, 1, 1)
end_date = date(2023, 4, 1)

from patientflow.generate import create_fake_snapshots

# Create snapshots
new_snapshots_df = create_fake_snapshots(df=visits_df, observations_df=observations_df, lab_orders_df=lab_orders_df, prediction_times=prediction_times, start_date=start_date, end_date=end_date)
new_snapshots_df.head()
snapshot_id  snapshot_date  prediction_time  patient_id  visit_number  is_admitted  age  latest_triage_score  num_bmp_orders  num_d-dimer_orders  num_cbc_orders  num_urinalysis_orders  num_troponin_orders
          0     2023-01-01           (6, 0)         354             1            1   31                  2.0               1                   1               0                      0                    0
          1     2023-01-01          (9, 30)         354             1            1   31                  2.0               1                   1               1                      1                    0
          2     2023-01-01          (9, 30)        1281             7            0   31                  2.0               1                   1               1                      0                    1
          3     2023-01-01          (9, 30)         113            15            0   41                  4.0               0                   0               1                      1                    0
          4     2023-01-01          (9, 30)         114             3            0   33                  3.0               1                   1               0                      1                    1
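
To show the idea behind this kind of point-in-time summary, here is a minimal sketch for a single visit. It is not the implementation of create_fake_snapshots; the helper summarise_visit_at is hypothetical, and I am assuming that a record counts as known if its timestamp is at or before the snapshot time. The toy rows are the ones shown for visit 1 in the tables above, with timestamps truncated to the second.

```python
from datetime import datetime

import pandas as pd


def summarise_visit_at(visit_number, snapshot_datetime, observations_df, lab_orders_df):
    """Summarise what was known about one visit at snapshot_datetime (illustrative only)."""
    # Keep only rows for this visit recorded at or before the snapshot time
    obs = observations_df[
        (observations_df["visit_number"] == visit_number)
        & (observations_df["observation_datetime"] <= snapshot_datetime)
    ]
    labs = lab_orders_df[
        (lab_orders_df["visit_number"] == visit_number)
        & (lab_orders_df["order_datetime"] <= snapshot_datetime)
    ]

    # Latest triage score, or None if nothing had been recorded yet
    latest_triage = (
        obs.sort_values("observation_datetime")["triage_score"].iloc[-1]
        if len(obs)
        else None
    )

    # Number of orders of each lab type placed so far
    lab_counts = labs["lab_name"].value_counts().to_dict()
    return {"latest_triage_score": latest_triage, "lab_counts": lab_counts}


# Toy data using the column names of observations_df and lab_orders_df
toy_observations = pd.DataFrame(
    {
        "visit_number": [1],
        "observation_datetime": pd.to_datetime(["2023-01-01 05:25:48"]),
        "triage_score": [2],
    }
)
toy_lab_orders = pd.DataFrame(
    {
        "visit_number": [1, 1, 1, 1],
        "order_datetime": pd.to_datetime(
            [
                "2023-01-01 05:51:39",
                "2023-01-01 05:58:40",
                "2023-01-01 06:36:24",
                "2023-01-01 06:49:29",
            ]
        ),
        "lab_name": ["BMP", "D-dimer", "CBC", "Urinalysis"],
    }
)

at_0600 = summarise_visit_at(1, datetime(2023, 1, 1, 6, 0), toy_observations, toy_lab_orders)
at_0930 = summarise_visit_at(1, datetime(2023, 1, 1, 9, 30), toy_observations, toy_lab_orders)
print(at_0600)
print(at_0930)
```

Under these assumptions, the 06:00 summary contains only the BMP and D-dimer orders, and the 09:30 summary contains all four, which agrees with the first two rows of the table above.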

Returning to the example visit above, we can see that at 15:30 on 2023-03-19, the first snapshot for this patient, the triage score and one lab order (BMP) had already been recorded. The D-dimer and troponin orders were placed between 15:30 and 22:00, so they first appear in the 22:00 snapshot.

new_snapshots_df[new_snapshots_df.visit_number==example_visit_number]
snapshot_id  snapshot_date  prediction_time  patient_id  visit_number  is_admitted  age  latest_triage_score  num_bmp_orders  num_d-dimer_orders  num_cbc_orders  num_urinalysis_orders  num_troponin_orders
       2959     2023-03-19         (15, 30)         358          1940            0   79                  3.0               1                   0               0                      0                    0
       2965     2023-03-19          (22, 0)         358          1940            0   79                  3.0               1                   1               0                      0                    1
       2977     2023-03-20           (6, 0)         358          1940            0   79                  3.0               1                   1               0                      0                    1
       2983     2023-03-20          (9, 30)         358          1940            0   79                  3.0               1                   1               0                      0                    1
       2989     2023-03-20          (12, 0)         358          1940            0   79                  3.0               1                   1               0                      0                    1
       2998     2023-03-20         (15, 30)         358          1940            0   79                  3.0               1                   1               0                      0                    1
       3014     2023-03-20          (22, 0)         358          1940            0   79                  3.0               1                   1               0                      0                    1

Summary

Here I have shown how to create patient snapshots from finished patient visits. Note that creating a snapshot involves discarding or summarising some information. The lab orders have been reduced to counts, and only the latest triage score has been taken. In the same vein, you might just take the last recorded heart rate or oxygen saturation level, or the latest value of a lab result. A snapshot loses some of the richness of the full data in an EHR, but with the benefit that you get data that replicate unfinished visits.

Note that ED visit data can be patchy in ways that are meaningful. For example, a severely ill patient might have many heart rate values recorded and many lab orders, while a patient with a sprained ankle might have zero heart rate measurements or lab orders. For predicting probability of admission after ED, such variation in data completeness is revealing. By summarising to counts, snapshots allow us to capture that variation in data completeness without having to discard observations that have missing data.
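
A small sketch of that last point: when lab orders are summarised as counts and joined onto the visits table, a visit with no orders gets a count of zero rather than a missing value, so it does not need to be dropped. The toy tables here are invented, and the num_..._orders column names simply imitate those shown earlier.

```python
import pandas as pd

# Two hypothetical visits: visit 1 has lab orders, visit 2 has none
toy_visits = pd.DataFrame({"visit_number": [1, 2]})
toy_lab_orders = pd.DataFrame(
    {"visit_number": [1, 1, 1], "lab_name": ["BMP", "CBC", "BMP"]}
)

# Count orders per visit and lab type, giving one column per lab type
lab_counts = pd.crosstab(toy_lab_orders["visit_number"], toy_lab_orders["lab_name"])
lab_counts.columns = [f"num_{name.lower()}_orders" for name in lab_counts.columns]

# Left-join onto visits; visits with no orders get 0 instead of being dropped
toy_features = (
    toy_visits.merge(lab_counts, left_on="visit_number", right_index=True, how="left")
    .fillna(0)
    .astype(int)
)
print(toy_features)
```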

In the next notebook I'll show how to make predictions using patient snapshots.