2d. Explore the datasets provided
Two datasets have been provided with this repository.
- ed_visits.csv
- inpatient_arrivals.csv
These accompany my fully worked example of the modelling of emergency demand for beds. And they are also useful to illustrate what patient snapshots might be made up of.
This notebook does some data exploration by plotting charts of all relevant variables in each dataset.
The inpatient_arrivals
dataset contains arrival times of all patients who visited the UCLH Emergency Department (ED) and the Same Day Emergency Care (SDEC) unit, over the period of the data, and were later admitted. It includes their sex, child status (whether adult or child), and which specialty they were admitted to.
The ed_visits
database contains a set of snapshots of patients who visited the ED and SDEC over the period of the data, including both admitted and discharged patients. Each snapshot includes information known at the time of the snapshot, and excludes anything that was recorded later, except the variables that serve a 'labels' for model training. These are:
- is_admitted
- whether the visit ended in admission to a ward
- final_sequence
- the sequence of consultations the patient had during the visit
- specialty
- the specialty of the admission, if the patient was admitted
See the data dictionaries for detailed information about the variables in the data provided.
Learn more about the data
I recorded a webinar to demonstrate how we converted data from the UCLH Electronic Health Record in a form suitable for this modelling. If you click on the image below, the video will open at the point where I provide detail about the datasets
Set up the notebook environment
# Reload functions every time
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use:
%reload_ext autoreload
from patientflow.load import set_project_root
project_root = set_project_root()
Inferred project root: /Users/zellaking/Repos/patientflow
Load parameters and set file paths
Parameters are set in config.json and (for UCLH implementation in config-uclh.yaml). You can change these for your own purposes. I'll talk more about the role of each parameter as it becomes relevant. Here we are loading the pre-defned training, validation and test set dates.
from patientflow.load import set_file_paths
# set file paths
data_file_path, media_file_path, model_file_path, config_path = set_file_paths(
project_root,
data_folder_name='data-public', # change this to data-synthetic if you don't have the public dataset
verbose=False
)
Load data
This notebook has been run using real data which you can download from Zenodo on request.
Alternatively you can use the synthetic dataset that was generated using a stratified sampling approach based on the distributions reported in the data dictionaries. Two relationships from the original data were preserved: the proportion of admitted versus non-admitted patients and the hourly arrival time patterns. All other variables were sampled independently using summary statistics stratified by admission status. This approach maintains relevant dependencies for admission outcomes while treating other variables as independent of each other.
If you don't have the public data, change the argument in the cell above from data_folder_name='data-public'
to data_folder_name='data-synthetic'
.
import pandas as pd
from patientflow.load import load_data
ed_visits = load_data(data_file_path,
file_name='ed_visits.csv',
index_column = 'snapshot_id',
sort_columns = ["visit_number", "snapshot_date", "prediction_time"],
eval_columns = ["prediction_time", "consultation_sequence", "final_sequence"])
inpatient_arrivals = load_data(data_file_path,
file_name='inpatient_arrivals.csv',
index_column = 'arrival_datetime',)
ed_visits.head()
snapshot_date | prediction_time | elapsed_los | sex | arrival_method | num_obs | num_obs_events | num_obs_types | num_lab_batteries_ordered | has_consultation | ... | visited_waiting | visited_unknown | latest_obs_respirations | latest_obs_temperature | latest_obs_news_score_result | latest_obs_objective_pain_score | visit_number | is_admitted | specialty | final_sequence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
snapshot_id | |||||||||||||||||||||
98242 | 2031-01-14 | (22, 0) | 20740.0 | M | Amb no medic | 74 | 6 | 23 | 8 | True | ... | True | False | 19.0 | 96.8 | 1.0 | NaN | 000019a46d7c | True | surgical | ['surgical'] |
100119 | 2031-01-19 | (15, 30) | 3780.0 | F | Walk-in | 29 | 3 | 23 | 1 | False | ... | True | False | 16.0 | 98.4 | 1.0 | Mild | 00015db18883 | False | NaN | [] |
189750 | 2031-09-29 | (22, 0) | 10466.0 | F | Walk-in | 12 | 2 | 12 | 0 | False | ... | True | False | 32.0 | 97.9 | NaN | Nil | 0001fbabb70e | False | NaN | [] |
192732 | 2031-10-07 | (22, 0) | 13729.0 | F | Walk-in | 14 | 1 | 14 | 7 | True | ... | True | False | NaN | 97.7 | NaN | Mild | 00021c715ac7 | False | NaN | ['surgical'] |
119891 | 2031-03-12 | (6, 0) | 2504.0 | M | Walk-in | 9 | 1 | 9 | 0 | False | ... | True | False | NaN | NaN | NaN | Mild | 0002af190380 | False | NaN | [] |
5 rows × 68 columns
Explore ED visits dataset
Note that each snapshot has a date and a prediction time formatted separately, with the prediction time as a tuple of (hour, minute). All functions in patientflow
expect prediction times in this format. Each record in the snapshots dataframe is indexed by a unique snapshot_id.
ed_visits.head(10)
snapshot_date | prediction_time | elapsed_los | sex | arrival_method | num_obs | num_obs_events | num_obs_types | num_lab_batteries_ordered | has_consultation | ... | visited_waiting | visited_unknown | latest_obs_respirations | latest_obs_temperature | latest_obs_news_score_result | latest_obs_objective_pain_score | visit_number | is_admitted | specialty | final_sequence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
snapshot_id | |||||||||||||||||||||
98242 | 2031-01-14 | (22, 0) | 20740.0 | M | Amb no medic | 74 | 6 | 23 | 8 | True | ... | True | False | 19.0 | 96.8 | 1.0 | NaN | 000019a46d7c | True | surgical | ['surgical'] |
100119 | 2031-01-19 | (15, 30) | 3780.0 | F | Walk-in | 29 | 3 | 23 | 1 | False | ... | True | False | 16.0 | 98.4 | 1.0 | Mild | 00015db18883 | False | NaN | [] |
189750 | 2031-09-29 | (22, 0) | 10466.0 | F | Walk-in | 12 | 2 | 12 | 0 | False | ... | True | False | 32.0 | 97.9 | NaN | Nil | 0001fbabb70e | False | NaN | [] |
192732 | 2031-10-07 | (22, 0) | 13729.0 | F | Walk-in | 14 | 1 | 14 | 7 | True | ... | True | False | NaN | 97.7 | NaN | Mild | 00021c715ac7 | False | NaN | ['surgical'] |
119891 | 2031-03-12 | (6, 0) | 2504.0 | M | Walk-in | 9 | 1 | 9 | 0 | False | ... | True | False | NaN | NaN | NaN | Mild | 0002af190380 | False | NaN | [] |
157035 | 2031-06-27 | (15, 30) | 1548.0 | F | Walk-in | 14 | 1 | 14 | 4 | False | ... | True | False | 18.0 | 97.7 | NaN | Moderate | 00033228d206 | False | NaN | [] |
209659 | 2031-11-30 | (6, 0) | 15020.0 | F | Walk-in | 33 | 3 | 24 | 9 | True | ... | True | False | 17.0 | 98.2 | 0.0 | Nil | 0003d8c503cf | False | NaN | ['obs_gyn'] |
105648 | 2031-02-03 | (12, 0) | 3502.0 | M | Walk-in | 9 | 1 | 9 | 0 | False | ... | True | False | NaN | NaN | NaN | Mild | 00043419ec6b | False | NaN | [] |
172620 | 2031-08-12 | (22, 0) | 7274.0 | F | Walk-in | 14 | 2 | 14 | 0 | False | ... | True | False | NaN | 97.5 | NaN | Mild | 0004c73468a6 | False | NaN | ['obs_gyn'] |
96560 | 2031-01-09 | (22, 0) | 19768.0 | F | Walk-in | 30 | 3 | 21 | 0 | False | ... | True | False | 17.0 | 98.2 | 0.0 | Moderate | 0004eacbae15 | False | NaN | [] |
10 rows × 68 columns
Grouping of columns in ED visits dataset
The ED visits dataset contains variables of different types.
For convenience, I use a function called get_dict_cols()
to organise the charts below into different sections.
from patientflow.load import get_dict_cols
dict_cols = get_dict_cols(ed_visits)
for key, value in dict_cols.items():
print(f"\nColumns in group called {key}:")
print(value)
Columns in group called not used in training:
['snapshot_id', 'snapshot_date', 'prediction_time', 'visit_number', 'training_validation_test', 'random_number']
Columns in group called arrival and demographic:
['elapsed_los', 'sex', 'age_group', 'age_on_arrival', 'arrival_method']
Columns in group called summary:
['num_obs', 'num_obs_events', 'num_obs_types', 'num_lab_batteries_ordered']
Columns in group called location:
['current_location_type', 'total_locations_visited', 'visited_majors', 'visited_otf', 'visited_paeds', 'visited_rat', 'visited_resus', 'visited_sdec', 'visited_sdec_waiting', 'visited_taf', 'visited_utc', 'visited_waiting', 'visited_unknown']
Columns in group called observations:
['num_obs_blood_pressure', 'num_obs_pulse', 'num_obs_air_or_oxygen', 'num_obs_level_of_consciousness', 'num_obs_news_score_result', 'num_obs_temperature', 'num_obs_manchester_triage_acuity', 'num_obs_objective_pain_score', 'num_obs_subjective_pain_score', 'num_obs_glasgow_coma_scale_best_motor_response', 'num_obs_oxygen_delivery_method', 'num_obs_oxygen_flow_rate', 'num_obs_pupil_reaction_right', 'num_obs_uclh_sskin_areas_observed', 'latest_obs_pulse', 'latest_obs_level_of_consciousness', 'latest_obs_manchester_triage_acuity', 'latest_obs_respirations', 'latest_obs_temperature', 'latest_obs_news_score_result', 'latest_obs_objective_pain_score']
Columns in group called lab orders and results:
['lab_orders_bc', 'lab_orders_crp', 'lab_orders_csnf', 'lab_orders_ddit', 'lab_orders_rflu', 'latest_lab_results_crea', 'latest_lab_results_hctu', 'latest_lab_results_k', 'latest_lab_results_lac', 'latest_lab_results_na', 'latest_lab_results_pco2', 'latest_lab_results_ph', 'latest_lab_results_wcc', 'latest_lab_results_htrt', 'latest_lab_results_alb', 'lab_orders_bon', 'lab_orders_ncov', 'lab_orders_xcov']
Columns in group called consults:
['has_consultation', 'consultation_sequence', 'final_sequence', 'specialty']
Columns in group called outcome:
['is_admitted']
Also for the plots, I convert the boolean columns to text values.
# Function to convert boolean columns to text values "true" or "false" - used for plotting format
def bool_to_text(df):
bool_cols = df.select_dtypes(include='bool').columns.drop('is_admitted')
for col in bool_cols:
df[col] = df[col].apply(lambda x: 'true' if x else 'false')
return df
# Apply the function
ed_visits = bool_to_text(ed_visits)
# temporarily add a is_admitted column to arrivals
inpatient_arrivals['is_admitted'] = True
inpatient_arrivals = bool_to_text(inpatient_arrivals)
As some variables are ordinal, I create a dictionary to record the ordering of the values.
ordinal_mappings = {
# age group
"age_group": [
"0-17",
"18-24",
"25-34",
"35-44",
"45-54",
"55-64",
"65-74",
"75-115",
],
# triage score
"latest_obs_manchester_triage_acuity": ["Blue", "Green", "Yellow", "Orange", "Red"],
# pain score
"latest_obs_objective_pain_score": [
r"Nil",
r"Mild",
r"Moderate",
r"Severe\E\Very Severe",
],
# level of consciousness
"latest_obs_level_of_consciousness": [
"A", #alert
"C", #confused
"V", #voice - responds to voice stimulus
"P", #pain - responds to pain stimulus
"U" #unconscious - no response to pain or voice stimulus
]
}
Arrival and demographic variables
Here I import a function called plot_data_distribution
to provide a convenient way of requesting each plot without multiple lines of code for each.
from patientflow.viz.data_distribution import plot_data_distribution
Elapsed Length of Stay
Both admitted and not admitted visits appear to have a long tail of visits lasting more than 24 hours. Any snapshots where the ED visit has lasted more than 72 hours are excluded.
ed_visits['elapsed_los_hrs'] = ed_visits['elapsed_los']/3600
plot_data_distribution(df=ed_visits, col_name='elapsed_los_hrs', grouping_var='is_admitted', grouping_var_name='whether patient admitted', plot_type='both',
title = 'Distribution of elapsed length of stay by whether patient admitted', truncate_outliers=False)
Below, I plot the snapshots where the elapsed visit duration is greater than 24 hours. We can see that the long tail of longer visits is more numerous for discharged than for admitted patients. This could be because patients leave the ED without being recorded as discharged on the system.
if ed_visits[ed_visits.elapsed_los_hrs >= 24].shape[0] > 0:
plot_data_distribution(ed_visits[ed_visits.elapsed_los_hrs >= 24], 'elapsed_los_hrs', 'is_admitted', 'whether patient admitted', plot_type='both',
title = 'Distribution of elapsed length of stay by whether patient admitted (where elapsed length of stay >= 24 hours)', truncate_outliers=False)
Sex, age group and arrival method
The charts below show distributions between admitted and not admitted patients for sex, age group and arrival method. More older people are admitted. Most walk-ins are discharged.
plot_data_distribution(ed_visits, 'sex', 'is_admitted', 'whether patient admitted', plot_type='hist')
if 'age_group' in ed_visits.columns:
plot_data_distribution(ed_visits, 'age_group', 'is_admitted', 'whether patient admitted', plot_type='hist', ordinal_order=ordinal_mappings['age_group'], rotate_x_labels = True)
else:
plot_data_distribution(ed_visits, 'age_on_arrival', 'is_admitted', 'whether patient admitted', plot_type='hist')
plot_data_distribution(ed_visits, 'arrival_method', 'is_admitted', 'whether patient admitted', plot_type='hist', rotate_x_labels = True)
Count variables
The counts variables record the following, up to the moment of the snapshot * the number of observations recorded * the number of events at which observations were recorded (if heart rate and respiratory rate have the same timestamp in the original data, this is one event) * the number of different types of observations (heart rate and respiratory would be two types) * the number of lab test batteries ordered
for col_name in dict_cols['summary']:
plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted', plot_type='hist', is_discrete = False, truncate_outliers=True)
The plots above have been set to exlude outliers, for better readability. However, there are some extreme values of num_obs and num_obs_events. In such cases, you might consider removing the outliers, depending on what model is to be applied to the data
print(ed_visits.num_obs.max())
print(ed_visits.num_obs_events.max())
989
266
Location variables
The variable current_location_type
records the location of the patient at the time of the snapshot. Refer to the data dictionary for more information about what each location type means. Patients who visit the UTC (Urgent Treatment Centre) are more likely to be discharged than admitted. The UTC provides care for patients with minor injuries and illnesses.
plot_data_distribution(ed_visits, 'current_location_type', 'is_admitted', 'whether patient admitted', plot_type='hist', rotate_x_labels = True)
for col_name in dict_cols['location'][1:]:
plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted', plot_type='hist')
Observations variables
The variables in the observations group record vital signs, triage scores, and also the number of times certain observations have been recorded, up to the moment of the snapshot.
dict_cols['observations']
['num_obs_blood_pressure',
'num_obs_pulse',
'num_obs_air_or_oxygen',
'num_obs_level_of_consciousness',
'num_obs_news_score_result',
'num_obs_temperature',
'num_obs_manchester_triage_acuity',
'num_obs_objective_pain_score',
'num_obs_subjective_pain_score',
'num_obs_glasgow_coma_scale_best_motor_response',
'num_obs_oxygen_delivery_method',
'num_obs_oxygen_flow_rate',
'num_obs_pupil_reaction_right',
'num_obs_uclh_sskin_areas_observed',
'latest_obs_pulse',
'latest_obs_level_of_consciousness',
'latest_obs_manchester_triage_acuity',
'latest_obs_respirations',
'latest_obs_temperature',
'latest_obs_news_score_result',
'latest_obs_objective_pain_score']
I first plot the variables that count the number of times something was recorded.
Count variables
for col_name in [item for item in dict_cols['observations'] if str(item).startswith('num')]:
plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted',
plot_type='hist',
is_discrete = True,
truncate_outliers=True)
News Scores and Manchester Triage score values
News Scores are commonly used to track the acuity of a patient, and Manchester Triage scores are used at the door of the ED to prioritise patients
for col_name in [item for item in dict_cols['observations'] if ('manchester' in str(item) ) and str(item).startswith('latest')]:
plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted', plot_type='hist', rotate_x_labels = True, ordinal_order=ordinal_mappings['latest_obs_manchester_triage_acuity'])
plot_data_distribution(ed_visits, 'latest_obs_objective_pain_score', 'is_admitted', 'whether patient admitted',
plot_type='hist',
rotate_x_labels = True,
ordinal_order=ordinal_mappings['latest_obs_objective_pain_score'])
for col_name in [item for item in dict_cols['observations'] if 'news' in str(item) and str(item).startswith('latest')]:
plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted', plot_type='hist', is_discrete = True)
The ACVPU score is commonly used to track states of consciousness
for col_name in [item for item in dict_cols['observations'] if 'consciousness' in str(item) and str(item).startswith('latest')]:
plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted',
plot_type='hist',
ordinal_order=ordinal_mappings['latest_obs_level_of_consciousness'])
Temporarily excluding the most common value of A from the ACVPU score, we can see the spread of other values
for col_name in [item for item in dict_cols['observations'] if 'consciousness' in str(item) and str(item).startswith('latest')]:
plot_data_distribution(ed_visits[~(ed_visits.latest_obs_level_of_consciousness == 'A')].copy(), col_name, 'is_admitted', 'whether patient admitted',
plot_type='hist',
ordinal_order=ordinal_mappings['latest_obs_level_of_consciousness'])
Vital signs values
I now plot the distributions of the vital signs values.
for col_name in [item for item in dict_cols['observations'] if str(item).startswith('latest') and ('pulse' in str(item) or 'resp' in str(item) or 'temp' in str(item))]:
plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted',
plot_type='hist',
is_discrete = False,
truncate_outliers=True)
Lab variables
The lab variables include boolean values for whether a lab battery was ordered, and the results of certain lab test. The data include only a small a subset of the lab battery orders and test results that might be requested for a patient in the ED.
dict_cols['lab orders and results']
['lab_orders_bc',
'lab_orders_crp',
'lab_orders_csnf',
'lab_orders_ddit',
'lab_orders_rflu',
'latest_lab_results_crea',
'latest_lab_results_hctu',
'latest_lab_results_k',
'latest_lab_results_lac',
'latest_lab_results_na',
'latest_lab_results_pco2',
'latest_lab_results_ph',
'latest_lab_results_wcc',
'latest_lab_results_htrt',
'latest_lab_results_alb',
'lab_orders_bon',
'lab_orders_ncov',
'lab_orders_xcov']
Lab orders
It is notable in the charts below, which show whether a lab battery was ordered, that battery CRP (for markers of inflammation) is very commonly ordered for admitted patients; among the patients later admitted the majority have a CRP battery ordered whereas among the non-admitted patients only a minority have it. This difference between admitted and non-admitted (where the majority of admitted have something while the majority of discharged patients do not) only applies to this lab battery order. It will show up later as a strong predictor of admission.
for col_name in [item for item in dict_cols['lab orders and results'] if str(item).startswith('lab') ]:
plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted', plot_type='hist')
Lab results
The plots below show the latest lab values.
for col_name in [item for item in dict_cols['lab orders and results'] if str(item).startswith('latest') ]:
plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted', plot_type='hist', outlier_threshold=3, truncate_outliers=True)
Consults variables
The has_consultation
variable records whether a referral request was made to another service or specialty up to the point of the snapshot. The sequence of referrals up to that point is recorded in consultation_sequence
and the final sequence, at the end of the ED visit in final_sequence
. specialty
records the specialty that the patient was admitted under, if admitted.
The first plot shows that the number of admitted patients with consult requests at the time of the snapshots is about the same as those without. The group without consult requests will have their later in the visit, after the snapshot was recorded.
plot_data_distribution(ed_visits, 'has_consultation', 'is_admitted', 'whether patient admitted', plot_type='hist')
A very small number of non-admitted patients have a specialty of admission recorded. These are most likely patients referred from ED to SDEC, which we don't include in the admitted patients.
plot_data_distribution(ed_visits, 'specialty', 'is_admitted', 'whether patient admitted', plot_type='hist')
Explore inpatient arrivals dataset
The inpatient_arrivals dataset records all of the arrival dates and and times of patients who were later admitted to a ward. Other information is also recorded, such as sex and child status, as will as specialty of admission. This dataset will be used to predict the number of patients yet-to-arrive at the time of prediction.
inpatient_arrivals.head()
specialty | sex | is_child | is_admitted | |
---|---|---|---|---|
arrival_datetime | ||||
2031-02-26 11:28:00+00:00 | medical | M | false | True |
2031-01-14 16:51:00+00:00 | surgical | F | false | True |
2031-02-06 20:51:00+00:00 | surgical | M | false | True |
2031-01-10 13:43:00+00:00 | haem/onc | M | false | True |
2031-01-02 13:55:00+00:00 | haem/onc | F | false | True |
# temporarily add is_admitted column to arrivals dataset, to be able to use the plot_data_distribution function
inpatient_arrivals['is_admitted'] = True
plot_data_distribution(inpatient_arrivals.reset_index().copy(), 'specialty', 'is_admitted', 'whether patient admitted', plot_type='hist')
plot_data_distribution(inpatient_arrivals, 'is_child', 'is_admitted', 'whether patient admitted', plot_type='hist')
Summary
This notebook has shown how to load files that have been provided, and shows some plots of the variables included. This is an illustrative dataset, showing the type of variables that were used for the analysis at UCLH. Other sites will have different data.