2d. Explore the datasets provided

Two datasets have been provided with this repository. - ed_visits.csv - inpatient_arrivals.csv

These accompany my fully worked example of the modelling of emergency demand for beds. And they are also useful to illustrate what patient snapshots might be made up of.

This notebook does some data exploration by plotting charts of all relevant variables in each dataset.

The inpatient_arrivals dataset contains arrival times of all patients who visited the UCLH Emergency Department (ED) and the Same Day Emergency Care (SDEC) unit, over the period of the data, and were later admitted. It includes their sex, child status (whether adult or child), and which specialty they were admitted to.

The ed_visits database contains a set of snapshots of patients who visited the ED and SDEC over the period of the data, including both admitted and discharged patients. Each snapshot includes information known at the time of the snapshot, and excludes anything that was recorded later, except the variables that serve a 'labels' for model training. These are:
- is_admitted - whether the visit ended in admission to a ward - final_sequence - the sequence of consultations the patient had during the visit - specialty - the specialty of the admission, if the patient was admitted

See the data dictionaries for detailed information about the variables in the data provided.

Learn more about the data

I recorded a webinar to demonstrate how we converted data from the UCLH Electronic Health Record in a form suitable for this modelling. If you click on the image below, the video will open at the point where I provide detail about the datasets

Set up the notebook environment

# Reload functions every time
%load_ext autoreload 
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

from patientflow.load import set_project_root
project_root = set_project_root()

Inferred project root: /Users/zellaking/Repos/patientflow

Load parameters and set file paths

Parameters are set in config.json and (for UCLH implementation in config-uclh.yaml). You can change these for your own purposes. I'll talk more about the role of each parameter as it becomes relevant. Here we are loading the pre-defned training, validation and test set dates.

from patientflow.load import set_file_paths

# set file paths
data_file_path, media_file_path, model_file_path, config_path = set_file_paths(
        project_root, 
        data_folder_name='data-public',  # change this to data-synthetic if you don't have the public dataset
        verbose=False
        )

Load data

This notebook has been run using real data which you can download from Zenodo on request.

Alternatively you can use the synthetic dataset that was generated using a stratified sampling approach based on the distributions reported in the data dictionaries. Two relationships from the original data were preserved: the proportion of admitted versus non-admitted patients and the hourly arrival time patterns. All other variables were sampled independently using summary statistics stratified by admission status. This approach maintains relevant dependencies for admission outcomes while treating other variables as independent of each other.

If you don't have the public data, change the argument in the cell above from data_folder_name='data-public' to data_folder_name='data-synthetic'.

import pandas as pd
from patientflow.load import load_data

ed_visits = load_data(data_file_path, 
                    file_name='ed_visits.csv', 
                    index_column = 'snapshot_id',
                    sort_columns = ["visit_number", "snapshot_date", "prediction_time"], 
                    eval_columns = ["prediction_time", "consultation_sequence", "final_sequence"])

inpatient_arrivals = load_data(data_file_path, 
                    file_name='inpatient_arrivals.csv', 
                    index_column = 'arrival_datetime',)

ed_visits.head()

	snapshot_date	prediction_time	elapsed_los	sex	arrival_method	num_obs	num_obs_events	num_obs_types	num_lab_batteries_ordered	has_consultation	...	visited_waiting	visited_unknown	latest_obs_respirations	latest_obs_temperature	latest_obs_news_score_result	latest_obs_objective_pain_score	visit_number	is_admitted	specialty	final_sequence
snapshot_id
98242	2031-01-14	(22, 0)	20740.0	M	Amb no medic	74	6	23	8	True	...	True	False	19.0	96.8	1.0	NaN	000019a46d7c	True	surgical	['surgical']
100119	2031-01-19	(15, 30)	3780.0	F	Walk-in	29	3	23	1	False	...	True	False	16.0	98.4	1.0	Mild	00015db18883	False	NaN	[]
189750	2031-09-29	(22, 0)	10466.0	F	Walk-in	12	2	12	0	False	...	True	False	32.0	97.9	NaN	Nil	0001fbabb70e	False	NaN	[]
192732	2031-10-07	(22, 0)	13729.0	F	Walk-in	14	1	14	7	True	...	True	False	NaN	97.7	NaN	Mild	00021c715ac7	False	NaN	['surgical']
119891	2031-03-12	(6, 0)	2504.0	M	Walk-in	9	1	9	0	False	...	True	False	NaN	NaN	NaN	Mild	0002af190380	False	NaN	[]

5 rows × 68 columns

Explore ED visits dataset

Note that each snapshot has a date and a prediction time formatted separately, with the prediction time as a tuple of (hour, minute). All functions in patientflow expect prediction times in this format. Each record in the snapshots dataframe is indexed by a unique snapshot_id.

ed_visits.head(10)

	snapshot_date	prediction_time	elapsed_los	sex	arrival_method	num_obs	num_obs_events	num_obs_types	num_lab_batteries_ordered	has_consultation	...	visited_waiting	visited_unknown	latest_obs_respirations	latest_obs_temperature	latest_obs_news_score_result	latest_obs_objective_pain_score	visit_number	is_admitted	specialty	final_sequence
snapshot_id
98242	2031-01-14	(22, 0)	20740.0	M	Amb no medic	74	6	23	8	True	...	True	False	19.0	96.8	1.0	NaN	000019a46d7c	True	surgical	['surgical']
100119	2031-01-19	(15, 30)	3780.0	F	Walk-in	29	3	23	1	False	...	True	False	16.0	98.4	1.0	Mild	00015db18883	False	NaN	[]
189750	2031-09-29	(22, 0)	10466.0	F	Walk-in	12	2	12	0	False	...	True	False	32.0	97.9	NaN	Nil	0001fbabb70e	False	NaN	[]
192732	2031-10-07	(22, 0)	13729.0	F	Walk-in	14	1	14	7	True	...	True	False	NaN	97.7	NaN	Mild	00021c715ac7	False	NaN	['surgical']
119891	2031-03-12	(6, 0)	2504.0	M	Walk-in	9	1	9	0	False	...	True	False	NaN	NaN	NaN	Mild	0002af190380	False	NaN	[]
157035	2031-06-27	(15, 30)	1548.0	F	Walk-in	14	1	14	4	False	...	True	False	18.0	97.7	NaN	Moderate	00033228d206	False	NaN	[]
209659	2031-11-30	(6, 0)	15020.0	F	Walk-in	33	3	24	9	True	...	True	False	17.0	98.2	0.0	Nil	0003d8c503cf	False	NaN	['obs_gyn']
105648	2031-02-03	(12, 0)	3502.0	M	Walk-in	9	1	9	0	False	...	True	False	NaN	NaN	NaN	Mild	00043419ec6b	False	NaN	[]
172620	2031-08-12	(22, 0)	7274.0	F	Walk-in	14	2	14	0	False	...	True	False	NaN	97.5	NaN	Mild	0004c73468a6	False	NaN	['obs_gyn']
96560	2031-01-09	(22, 0)	19768.0	F	Walk-in	30	3	21	0	False	...	True	False	17.0	98.2	0.0	Moderate	0004eacbae15	False	NaN	[]

10 rows × 68 columns

Grouping of columns in ED visits dataset

The ED visits dataset contains variables of different types.

For convenience, I use a function called get_dict_cols() to organise the charts below into different sections.

from patientflow.load import get_dict_cols
dict_cols = get_dict_cols(ed_visits)

for key, value in dict_cols.items():
    print(f"\nColumns in group called {key}:")
    print(value)

Columns in group called not used in training:
['snapshot_id', 'snapshot_date', 'prediction_time', 'visit_number', 'training_validation_test', 'random_number']

Columns in group called arrival and demographic:
['elapsed_los', 'sex', 'age_group', 'age_on_arrival', 'arrival_method']

Columns in group called summary:
['num_obs', 'num_obs_events', 'num_obs_types', 'num_lab_batteries_ordered']

Columns in group called location:
['current_location_type', 'total_locations_visited', 'visited_majors', 'visited_otf', 'visited_paeds', 'visited_rat', 'visited_resus', 'visited_sdec', 'visited_sdec_waiting', 'visited_taf', 'visited_utc', 'visited_waiting', 'visited_unknown']

Columns in group called observations:
['num_obs_blood_pressure', 'num_obs_pulse', 'num_obs_air_or_oxygen', 'num_obs_level_of_consciousness', 'num_obs_news_score_result', 'num_obs_temperature', 'num_obs_manchester_triage_acuity', 'num_obs_objective_pain_score', 'num_obs_subjective_pain_score', 'num_obs_glasgow_coma_scale_best_motor_response', 'num_obs_oxygen_delivery_method', 'num_obs_oxygen_flow_rate', 'num_obs_pupil_reaction_right', 'num_obs_uclh_sskin_areas_observed', 'latest_obs_pulse', 'latest_obs_level_of_consciousness', 'latest_obs_manchester_triage_acuity', 'latest_obs_respirations', 'latest_obs_temperature', 'latest_obs_news_score_result', 'latest_obs_objective_pain_score']

Columns in group called lab orders and results:
['lab_orders_bc', 'lab_orders_crp', 'lab_orders_csnf', 'lab_orders_ddit', 'lab_orders_rflu', 'latest_lab_results_crea', 'latest_lab_results_hctu', 'latest_lab_results_k', 'latest_lab_results_lac', 'latest_lab_results_na', 'latest_lab_results_pco2', 'latest_lab_results_ph', 'latest_lab_results_wcc', 'latest_lab_results_htrt', 'latest_lab_results_alb', 'lab_orders_bon', 'lab_orders_ncov', 'lab_orders_xcov']

Columns in group called consults:
['has_consultation', 'consultation_sequence', 'final_sequence', 'specialty']

Columns in group called outcome:
['is_admitted']

Also for the plots, I convert the boolean columns to text values.

# Function to convert boolean columns to text values "true" or "false" - used for plotting format
def bool_to_text(df):
    bool_cols = df.select_dtypes(include='bool').columns.drop('is_admitted')
    for col in bool_cols:
        df[col] = df[col].apply(lambda x: 'true' if x else 'false')
    return df

# Apply the function
ed_visits = bool_to_text(ed_visits)

# temporarily add a is_admitted column to arrivals 
inpatient_arrivals['is_admitted'] = True
inpatient_arrivals = bool_to_text(inpatient_arrivals)

As some variables are ordinal, I create a dictionary to record the ordering of the values.

ordinal_mappings = {
    # age group
    "age_group": [
        "0-17",
        "18-24",
        "25-34",
        "35-44",
        "45-54",
        "55-64",
        "65-74",
        "75-115",
    ],
    # triage score
    "latest_obs_manchester_triage_acuity": ["Blue", "Green", "Yellow", "Orange", "Red"],
    # pain score
    "latest_obs_objective_pain_score": [
        r"Nil",
        r"Mild",
        r"Moderate",
        r"Severe\E\Very Severe",
    ],
    # level of consciousness
    "latest_obs_level_of_consciousness": [
        "A", #alert
        "C", #confused
        "V", #voice - responds to voice stimulus
        "P", #pain - responds to pain stimulus
        "U" #unconscious - no response to pain or voice stimulus
    ]
}

Arrival and demographic variables

Here I import a function called plot_data_distribution to provide a convenient way of requesting each plot without multiple lines of code for each.

from patientflow.viz.data_distribution import plot_data_distribution

Elapsed Length of Stay

Both admitted and not admitted visits appear to have a long tail of visits lasting more than 24 hours. Any snapshots where the ED visit has lasted more than 72 hours are excluded.

ed_visits['elapsed_los_hrs'] = ed_visits['elapsed_los']/3600
plot_data_distribution(df=ed_visits, col_name='elapsed_los_hrs', grouping_var='is_admitted', grouping_var_name='whether patient admitted', plot_type='both',
                        title = 'Distribution of elapsed length of stay by whether patient admitted', truncate_outliers=False)

png

Below, I plot the snapshots where the elapsed visit duration is greater than 24 hours. We can see that the long tail of longer visits is more numerous for discharged than for admitted patients. This could be because patients leave the ED without being recorded as discharged on the system.

if ed_visits[ed_visits.elapsed_los_hrs >= 24].shape[0] > 0:
    plot_data_distribution(ed_visits[ed_visits.elapsed_los_hrs >= 24], 'elapsed_los_hrs', 'is_admitted', 'whether patient admitted', plot_type='both',
                        title = 'Distribution of elapsed length of stay by whether patient admitted (where elapsed length of stay >= 24 hours)', truncate_outliers=False)

png

Sex, age group and arrival method

The charts below show distributions between admitted and not admitted patients for sex, age group and arrival method. More older people are admitted. Most walk-ins are discharged.

plot_data_distribution(ed_visits, 'sex', 'is_admitted', 'whether patient admitted', plot_type='hist')

png

if 'age_group' in ed_visits.columns:
    plot_data_distribution(ed_visits, 'age_group', 'is_admitted', 'whether patient admitted', plot_type='hist', ordinal_order=ordinal_mappings['age_group'], rotate_x_labels = True)
else:
    plot_data_distribution(ed_visits, 'age_on_arrival', 'is_admitted', 'whether patient admitted', plot_type='hist')

png

plot_data_distribution(ed_visits, 'arrival_method', 'is_admitted', 'whether patient admitted', plot_type='hist', rotate_x_labels = True)

png

Count variables

The counts variables record the following, up to the moment of the snapshot * the number of observations recorded * the number of events at which observations were recorded (if heart rate and respiratory rate have the same timestamp in the original data, this is one event) * the number of different types of observations (heart rate and respiratory would be two types) * the number of lab test batteries ordered

for col_name in dict_cols['summary']:
    plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted', plot_type='hist', is_discrete = False, truncate_outliers=True)

png

The plots above have been set to exlude outliers, for better readability. However, there are some extreme values of num_obs and num_obs_events. In such cases, you might consider removing the outliers, depending on what model is to be applied to the data

print(ed_visits.num_obs.max())
print(ed_visits.num_obs_events.max())

989
266

Location variables

The variable current_location_type records the location of the patient at the time of the snapshot. Refer to the data dictionary for more information about what each location type means. Patients who visit the UTC (Urgent Treatment Centre) are more likely to be discharged than admitted. The UTC provides care for patients with minor injuries and illnesses.

plot_data_distribution(ed_visits, 'current_location_type', 'is_admitted', 'whether patient admitted', plot_type='hist', rotate_x_labels = True)

png

for col_name in dict_cols['location'][1:]:
    plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted', plot_type='hist')

png

Observations variables

The variables in the observations group record vital signs, triage scores, and also the number of times certain observations have been recorded, up to the moment of the snapshot.

dict_cols['observations']

['num_obs_blood_pressure',
 'num_obs_pulse',
 'num_obs_air_or_oxygen',
 'num_obs_level_of_consciousness',
 'num_obs_news_score_result',
 'num_obs_temperature',
 'num_obs_manchester_triage_acuity',
 'num_obs_objective_pain_score',
 'num_obs_subjective_pain_score',
 'num_obs_glasgow_coma_scale_best_motor_response',
 'num_obs_oxygen_delivery_method',
 'num_obs_oxygen_flow_rate',
 'num_obs_pupil_reaction_right',
 'num_obs_uclh_sskin_areas_observed',
 'latest_obs_pulse',
 'latest_obs_level_of_consciousness',
 'latest_obs_manchester_triage_acuity',
 'latest_obs_respirations',
 'latest_obs_temperature',
 'latest_obs_news_score_result',
 'latest_obs_objective_pain_score']

I first plot the variables that count the number of times something was recorded.

Count variables

for col_name in [item for item in dict_cols['observations'] if str(item).startswith('num')]:
    plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted', 
                            plot_type='hist', 
                            is_discrete = True, 
                            truncate_outliers=True)

png

News Scores and Manchester Triage score values

News Scores are commonly used to track the acuity of a patient, and Manchester Triage scores are used at the door of the ED to prioritise patients

for col_name in [item for item in dict_cols['observations'] if ('manchester' in str(item) ) and str(item).startswith('latest')]:
    plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted', plot_type='hist', rotate_x_labels = True, ordinal_order=ordinal_mappings['latest_obs_manchester_triage_acuity'])

png

plot_data_distribution(ed_visits, 'latest_obs_objective_pain_score', 'is_admitted', 'whether patient admitted', 
                        plot_type='hist', 
                        rotate_x_labels = True, 
                        ordinal_order=ordinal_mappings['latest_obs_objective_pain_score'])

png


for col_name in [item for item in dict_cols['observations'] if 'news' in str(item) and str(item).startswith('latest')]:
    plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted', plot_type='hist', is_discrete = True)

png

The ACVPU score is commonly used to track states of consciousness

for col_name in [item for item in dict_cols['observations'] if 'consciousness' in str(item) and str(item).startswith('latest')]:
    plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted', 
                            plot_type='hist', 
                            ordinal_order=ordinal_mappings['latest_obs_level_of_consciousness'])

png

Temporarily excluding the most common value of A from the ACVPU score, we can see the spread of other values

for col_name in [item for item in dict_cols['observations'] if 'consciousness' in str(item) and str(item).startswith('latest')]:
    plot_data_distribution(ed_visits[~(ed_visits.latest_obs_level_of_consciousness == 'A')].copy(), col_name, 'is_admitted', 'whether patient admitted', 
                            plot_type='hist', 
                            ordinal_order=ordinal_mappings['latest_obs_level_of_consciousness'])

png

Vital signs values

I now plot the distributions of the vital signs values.

for col_name in [item for item in dict_cols['observations'] if str(item).startswith('latest') and ('pulse' in str(item) or 'resp' in str(item) or 'temp' in str(item))]:
    plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted', 
                            plot_type='hist', 
                            is_discrete = False, 
                            truncate_outliers=True)

png

Lab variables

The lab variables include boolean values for whether a lab battery was ordered, and the results of certain lab test. The data include only a small a subset of the lab battery orders and test results that might be requested for a patient in the ED.

dict_cols['lab orders and results']

['lab_orders_bc',
 'lab_orders_crp',
 'lab_orders_csnf',
 'lab_orders_ddit',
 'lab_orders_rflu',
 'latest_lab_results_crea',
 'latest_lab_results_hctu',
 'latest_lab_results_k',
 'latest_lab_results_lac',
 'latest_lab_results_na',
 'latest_lab_results_pco2',
 'latest_lab_results_ph',
 'latest_lab_results_wcc',
 'latest_lab_results_htrt',
 'latest_lab_results_alb',
 'lab_orders_bon',
 'lab_orders_ncov',
 'lab_orders_xcov']

Lab orders

It is notable in the charts below, which show whether a lab battery was ordered, that battery CRP (for markers of inflammation) is very commonly ordered for admitted patients; among the patients later admitted the majority have a CRP battery ordered whereas among the non-admitted patients only a minority have it. This difference between admitted and non-admitted (where the majority of admitted have something while the majority of discharged patients do not) only applies to this lab battery order. It will show up later as a strong predictor of admission.

for col_name in [item for item in dict_cols['lab orders and results'] if str(item).startswith('lab') ]:
    plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted', plot_type='hist')

png

Lab results

The plots below show the latest lab values.

for col_name in [item for item in dict_cols['lab orders and results'] if str(item).startswith('latest') ]:
    plot_data_distribution(ed_visits, col_name, 'is_admitted', 'whether patient admitted', plot_type='hist', outlier_threshold=3, truncate_outliers=True)

png

Consults variables

The has_consultation variable records whether a referral request was made to another service or specialty up to the point of the snapshot. The sequence of referrals up to that point is recorded in consultation_sequence and the final sequence, at the end of the ED visit in final_sequence. specialty records the specialty that the patient was admitted under, if admitted.

The first plot shows that the number of admitted patients with consult requests at the time of the snapshots is about the same as those without. The group without consult requests will have their later in the visit, after the snapshot was recorded.

plot_data_distribution(ed_visits, 'has_consultation', 'is_admitted', 'whether patient admitted', plot_type='hist')

png

A very small number of non-admitted patients have a specialty of admission recorded. These are most likely patients referred from ED to SDEC, which we don't include in the admitted patients.

plot_data_distribution(ed_visits, 'specialty', 'is_admitted', 'whether patient admitted', plot_type='hist')

png

Explore inpatient arrivals dataset

The inpatient_arrivals dataset records all of the arrival dates and and times of patients who were later admitted to a ward. Other information is also recorded, such as sex and child status, as will as specialty of admission. This dataset will be used to predict the number of patients yet-to-arrive at the time of prediction.

inpatient_arrivals.head()

	specialty	sex	is_child	is_admitted
arrival_datetime
2031-02-26 11:28:00+00:00	medical	M	false	True
2031-01-14 16:51:00+00:00	surgical	F	false	True
2031-02-06 20:51:00+00:00	surgical	M	false	True
2031-01-10 13:43:00+00:00	haem/onc	M	false	True
2031-01-02 13:55:00+00:00	haem/onc	F	false	True

# temporarily add is_admitted column to arrivals dataset, to be able to use the plot_data_distribution function
inpatient_arrivals['is_admitted'] = True
plot_data_distribution(inpatient_arrivals.reset_index().copy(), 'specialty', 'is_admitted', 'whether patient admitted', plot_type='hist')

png

plot_data_distribution(inpatient_arrivals, 'is_child', 'is_admitted', 'whether patient admitted', plot_type='hist')

png

Summary

This notebook has shown how to load files that have been provided, and shows some plots of the variables included. This is an illustrative dataset, showing the type of variables that were used for the analysis at UCLH. Other sites will have different data.