
API reference

PatientFlow: A package for predicting short-term hospital bed demand.

This package provides tools and models for analysing patient flow data and making predictions about emergency demand, elective demand, and hospital discharges.

aggregate

Aggregate Prediction From Patient-Level Probabilities

This submodule provides functions to aggregate patient-level predicted probabilities into a probability distribution. The module uses symbolic mathematics to generate and manipulate expressions, enabling the computation of aggregate probabilities based on individual patient-level predictions.

Functions:

Name Description
create_symbols : function

Generate a sequence of symbolic objects intended for use in mathematical expressions.

compute_core_expression : function

Compute a symbolic expression involving a basic mathematical operation with a symbol and a constant.

build_expression : function

Construct a cumulative product expression by combining individual symbolic expressions.

expression_subs : function

Substitute values into a symbolic expression based on a mapping from symbols to predictions.

return_coeff : function

Extract the coefficient of a specified power from an expanded symbolic expression.

model_input_to_pred_proba : function

Use a predictive model to convert model input data into predicted probabilities.

pred_proba_to_agg_predicted : function

Convert individual probability predictions into aggregate predicted probability distribution using optional weights.

get_prob_dist_for_prediction_moment : function

Calculate both predicted distributions and observed values for a given date using test data.

get_prob_dist : function

Calculate probability distributions for each snapshot date based on given model predictions.

get_prob_dist_using_survival_curve : function

Calculate probability distributions for each snapshot date using an EmpiricalIncomingAdmissionPredictor.
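
Taken together, these functions recover the distribution of the number of admissions among a set of patients from their individual probabilities, by reading off the coefficients of a probability generating function. A minimal sketch (the probabilities are illustrative):

>>> from patientflow.aggregate import (
...     create_symbols, build_expression, expression_subs, return_coeff,
... )
>>> predictions = [0.2, 0.5, 0.9]  # illustrative patient-level probabilities
>>> n = len(predictions)
>>> syms = create_symbols(n)                      # (r0, r1, r2)
>>> expr = build_expression(syms, n)              # product of (1 - r_i) + r_i*s
>>> expr = expression_subs(expr, n, predictions)  # substitute the probabilities
>>> dist = {k: return_coeff(expr, k) for k in range(n + 1)}  # P(exactly k admissions)

Expanding (0.8 + 0.2s)(0.5 + 0.5s)(0.1 + 0.9s) by hand gives masses 0.04, 0.41, 0.46 and 0.09 for k = 0, 1, 2, 3, which sum to 1. In practice pred_proba_to_agg_predicted wraps these steps, switching to a Normal approximation for large numbers of patients.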

build_expression(syms, n)

Construct a cumulative product expression by combining individual symbolic expressions.

Parameters:

syms : iterable
    Iterable containing symbols to use in the expressions. Required.
n : int
    The number of terms to include in the cumulative product. Required.

Returns:

Expr
    The cumulative product of the expressions.

Source code in src/patientflow/aggregate.py
def build_expression(syms, n):
    """
    Construct a cumulative product expression by combining individual symbolic expressions.

    Parameters
    ----------
    syms : iterable
        Iterable containing symbols to use in the expressions.
    n : int
        The number of terms to include in the cumulative product.

    Returns
    -------
    Expr
        The cumulative product of the expressions.

    """
    s = sym.Symbol("s")
    expression = 1
    for i in range(n):
        expression *= compute_core_expression(syms[i], s)
    return expression

compute_core_expression(ri, s)

Compute a symbolic expression involving a basic mathematical operation with a symbol and a constant.

Parameters:

ri : float
    The constant value to substitute into the expression. Required.
s : Symbol
    The symbolic object used in the expression. Required.

Returns:

Expr
    The symbolic expression after substitution.

Source code in src/patientflow/aggregate.py
def compute_core_expression(ri, s):
    """
    Compute a symbolic expression involving a basic mathematical operation with a symbol and a constant.

    Parameters
    ----------
    ri : float
        The constant value to substitute into the expression.
    s : Symbol
        The symbolic object used in the expression.

    Returns
    -------
    Expr
        The symbolic expression after substitution.

    """
    r = sym.Symbol("r")
    core_expression = (1 - r) + r * s
    return core_expression.subs({r: ri})

create_symbols(n)

Generate a sequence of symbolic objects intended for use in mathematical expressions.

Parameters:

n : int
    Number of symbols to create. Required.

Returns:

tuple
    A tuple containing the generated symbolic objects.

Source code in src/patientflow/aggregate.py
def create_symbols(n):
    """
    Generate a sequence of symbolic objects intended for use in mathematical expressions.

    Parameters
    ----------
    n : int
        Number of symbols to create.

    Returns
    -------
    tuple
        A tuple containing the generated symbolic objects.

    """
    return symbols(f"r0:{n}")

expression_subs(expression, n, predictions)

Substitute values into a symbolic expression based on a mapping from symbols to predictions.

Parameters:

expression : Expr
    The symbolic expression to perform substitution on. Required.
n : int
    Number of symbols and corresponding predictions. Required.
predictions : list
    List of numerical predictions to substitute. Required.

Returns:

Expr
    The expression after performing the substitution.

Source code in src/patientflow/aggregate.py
def expression_subs(expression, n, predictions):
    """
    Substitute values into a symbolic expression based on a mapping from symbols to predictions.

    Parameters
    ----------
    expression : Expr
        The symbolic expression to perform substitution on.
    n : int
        Number of symbols and corresponding predictions.
    predictions : list
        List of numerical predictions to substitute.

    Returns
    -------
    Expr
        The expression after performing the substitution.

    """
    syms = create_symbols(n)
    substitution = dict(zip(syms, predictions))
    return expression.subs(substitution)

get_prob_dist(snapshots_dict, X_test, y_test, model, weights=None, verbose=False, category_filter=None, normal_approx_threshold=30)

Calculate probability distributions for each snapshot date based on given model predictions.

Parameters:

snapshots_dict : dict
    A dictionary mapping snapshot dates to indices in X_test and y_test. Must have datetime.date objects as keys and lists of indices as values. Required.
X_test : DataFrame or array-like
    Input test data to be passed to the model. Required.
y_test : array-like
    Observed target values. Required.
model : object or TrainedClassifier
    Either a predictive model which provides a predict_proba method, or a TrainedClassifier object containing a pipeline. Required.
weights : pandas.Series, optional
    A Series containing weights for the test data points, which may influence the prediction. If provided, the weights should be indexed similarly to X_test and y_test. Default: None.
verbose : bool, optional
    If True, print progress information. Default: False.
category_filter : array-like, optional
    Boolean mask indicating which samples belong to the specific outcome category being analyzed. Should be the same length as y_test. Default: None.
normal_approx_threshold : int, optional
    If the number of rows in a snapshot exceeds this threshold, use a Normal distribution approximation. Set to None or a very large number to always use the exact symbolic computation. Default: 30.

Returns:

dict
    A dictionary mapping snapshot dates to probability distributions.

Raises:

ValueError
    If snapshots_dict is not properly formatted or empty.
    If model has no predict_proba method and is not a TrainedClassifier.

Source code in src/patientflow/aggregate.py
def get_prob_dist(
    snapshots_dict,
    X_test,
    y_test,
    model,
    weights=None,
    verbose=False,
    category_filter=None,
    normal_approx_threshold=30,
):
    """
    Calculate probability distributions for each snapshot date based on given model predictions.

    Parameters
    ----------
    snapshots_dict : dict
        A dictionary mapping snapshot dates to indices in `X_test` and `y_test`.
        Must have datetime.date objects as keys and lists of indices as values.
    X_test : DataFrame or array-like
        Input test data to be passed to the model.
    y_test : array-like
        Observed target values.
    model : object or TrainedClassifier
        Either a predictive model which provides a `predict_proba` method,
        or a TrainedClassifier object containing a pipeline.
    weights : pandas.Series, optional
        A Series containing weights for the test data points, which may influence the prediction,
        by default None. If provided, the weights should be indexed similarly to `X_test` and `y_test`.
    verbose : bool, optional (default=False)
        If True, print progress information.
    category_filter : array-like, optional
        Boolean mask indicating which samples belong to the specific outcome category being analyzed.
        Should be the same length as y_test.
    normal_approx_threshold : int, optional (default=30)
        If the number of rows in a snapshot exceeds this threshold, use a Normal distribution approximation.
        Set to None or a very large number to always use the exact symbolic computation.

    Returns
    -------
    dict
        A dictionary mapping snapshot dates to probability distributions.

    Raises
    ------
    ValueError
        If snapshots_dict is not properly formatted or empty.
        If model has no predict_proba method and is not a TrainedClassifier.
    """
    # Validate snapshots_dict format
    if not snapshots_dict:
        raise ValueError("snapshots_dict cannot be empty")

    for dt, indices in snapshots_dict.items():
        if not isinstance(dt, date):
            raise ValueError(
                f"snapshots_dict keys must be datetime.date objects, got {type(dt)}"
            )
        if not isinstance(indices, list):
            raise ValueError(
                f"snapshots_dict values must be lists, got {type(indices)}"
            )
        if indices and not all(isinstance(idx, int) for idx in indices):
            raise ValueError("All indices in snapshots_dict must be integers")

    # Extract pipeline if model is a TrainedClassifier
    if hasattr(model, "calibrated_pipeline") and model.calibrated_pipeline is not None:
        model = model.calibrated_pipeline
    elif hasattr(model, "pipeline"):
        model = model.pipeline
    # Validate that model has predict_proba method
    elif not hasattr(model, "predict_proba"):
        raise ValueError(
            "Model must either be a TrainedClassifier or have a predict_proba method"
        )

    prob_dist_dict = {}
    if verbose:
        print(
            f"Calculating probability distributions for {len(snapshots_dict)} snapshot dates"
        )

        if len(snapshots_dict) > 10:
            print("This may take a minute or more")

    # Initialize a counter for notifying the user every 10 snapshot dates processed
    count = 0

    for dt, snapshots_to_include in snapshots_dict.items():
        if len(snapshots_to_include) == 0:
            # Create an empty dictionary for the current snapshot date
            prob_dist_dict[dt] = {
                "agg_predicted": pd.DataFrame({"agg_proba": [1]}, index=[0]),
                "agg_observed": 0,
            }
        else:
            # Ensure the lengths of test features and outcomes are equal
            assert len(X_test.loc[snapshots_to_include]) == len(
                y_test.loc[snapshots_to_include]
            ), "Mismatch in lengths of X_test and y_test snapshots."

            if weights is None:
                prediction_moment_weights = None
            else:
                prediction_moment_weights = weights.loc[snapshots_to_include].values

            # Apply category filter
            if category_filter is None:
                prediction_moment_category_filter = None
            else:
                prediction_moment_category_filter = category_filter.loc[
                    snapshots_to_include
                ]

            # Pass the normal_approx_threshold to get_prob_dist_for_prediction_moment
            prob_dist_dict[dt] = get_prob_dist_for_prediction_moment(
                X_test=X_test.loc[snapshots_to_include],
                y_test=y_test.loc[snapshots_to_include],
                model=model,
                weights=prediction_moment_weights,
                category_filter=prediction_moment_category_filter,
                normal_approx_threshold=normal_approx_threshold,
            )

        # Increment the counter and notify the user every 10 snapshot dates processed
        count += 1
        if verbose and count % 10 == 0 and count != len(snapshots_dict):
            print(f"Processed {count} snapshot dates")

    if verbose:
        print(f"Processed {len(snapshots_dict)} snapshot dates")

    return prob_dist_dict
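
Example (a minimal sketch; the toy scikit-learn classifier, feature names and snapshot indices are illustrative stand-ins for a real trained pipeline):

>>> import pandas as pd
>>> from datetime import date
>>> from sklearn.linear_model import LogisticRegression
>>> from patientflow.aggregate import get_prob_dist
>>> X_test = pd.DataFrame({"age": [71, 34, 55, 62], "news_score": [5, 1, 3, 4]})
>>> y_test = pd.Series([1, 0, 1, 0])
>>> model = LogisticRegression().fit(X_test, y_test)  # stand-in for a trained model
>>> snapshots_dict = {date(2024, 1, 1): [0, 1], date(2024, 1, 2): [2, 3]}
>>> prob_dist = get_prob_dist(snapshots_dict, X_test, y_test, model)
>>> int(prob_dist[date(2024, 1, 1)]["agg_observed"])  # sum of y_test for that snapshot
1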

get_prob_dist_for_prediction_moment(X_test, model, weights=None, inference_time=False, y_test=None, category_filter=None, normal_approx_threshold=30)

Calculate both predicted distributions and observed values for a given date using test data.

Parameters:

X_test : array-like
    Test features for a specific snapshot date. Required.
model : object or TrainedClassifier
    Either a predictive model which provides a predict_proba method, or a TrainedClassifier object containing a pipeline. Required.
weights : array-like, optional
    Weights to apply to the predictions for aggregate calculation. Default: None.
inference_time : bool, optional
    If True, do not calculate or return the actual aggregate. Default: False.
y_test : array-like, optional
    Actual outcomes corresponding to the test features. Required if inference_time is False. Default: None.
category_filter : array-like, optional
    Boolean mask indicating which samples belong to the specific outcome category being analyzed. Should be the same length as y_test. Default: None.
normal_approx_threshold : int, optional
    If the number of rows in X_test exceeds this threshold, use a Normal distribution approximation. Set to None or a very large number to always use the exact symbolic computation. Default: 30.

Returns:

dict
    A dictionary with keys 'agg_predicted' and, if inference_time is False, 'agg_observed'.

Raises:

ValueError
    If y_test is not provided when inference_time is False.
    If model has no predict_proba method and is not a TrainedClassifier.

Source code in src/patientflow/aggregate.py
def get_prob_dist_for_prediction_moment(
    X_test,
    model,
    weights=None,
    inference_time=False,
    y_test=None,
    category_filter=None,
    normal_approx_threshold=30,
):
    """
    Calculate both predicted distributions and observed values for a given date using test data.

    Parameters
    ----------
    X_test : array-like
        Test features for a specific snapshot date.
    model : object or TrainedClassifier
        Either a predictive model which provides a `predict_proba` method,
        or a TrainedClassifier object containing a pipeline.
    weights : array-like, optional
        Weights to apply to the predictions for aggregate calculation.
    inference_time : bool, optional (default=False)
        If True, do not calculate or return actual aggregate.
    y_test : array-like, optional
        Actual outcomes corresponding to the test features. Required if inference_time is False.
    category_filter : array-like, optional
        Boolean mask indicating which samples belong to the specific outcome category being analyzed.
        Should be the same length as y_test.
    normal_approx_threshold : int, optional (default=30)
        If the number of rows in X_test exceeds this threshold, use a Normal distribution approximation.
        Set to None or a very large number to always use the exact symbolic computation.

    Returns
    -------
    dict
        A dictionary with keys 'agg_predicted' and, if inference_time is False, 'agg_observed'.

    Raises
    ------
    ValueError
        If y_test is not provided when inference_time is False.
        If model has no predict_proba method and is not a TrainedClassifier.
    """
    if not inference_time and y_test is None:
        raise ValueError("y_test must be provided if inference_time is False.")

    # Extract pipeline if model is a TrainedClassifier
    if hasattr(model, "calibrated_pipeline") and model.calibrated_pipeline is not None:
        model = model.calibrated_pipeline
    elif hasattr(model, "pipeline"):
        model = model.pipeline
    # Validate that model has predict_proba method
    elif not hasattr(model, "predict_proba"):
        raise ValueError(
            "Model must either be a TrainedClassifier or have a predict_proba method"
        )

    prediction_moment_dict = {}

    if len(X_test) > 0:
        pred_proba = model_input_to_pred_proba(X_test, model)
        agg_predicted = pred_proba_to_agg_predicted(
            pred_proba, weights, normal_approx_threshold
        )
        prediction_moment_dict["agg_predicted"] = agg_predicted

        if not inference_time:
            # Apply category filter when calculating observed sum
            if category_filter is None:
                prediction_moment_dict["agg_observed"] = sum(y_test)
            else:
                prediction_moment_dict["agg_observed"] = sum(y_test & category_filter)
    else:
        prediction_moment_dict["agg_predicted"] = pd.DataFrame(
            {"agg_proba": [1]}, index=[0]
        )
        if not inference_time:
            prediction_moment_dict["agg_observed"] = 0

    return prediction_moment_dict
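
Example (a minimal sketch with an illustrative one-feature model; at inference time only the predicted distribution is returned):

>>> import pandas as pd
>>> from sklearn.linear_model import LogisticRegression
>>> from patientflow.aggregate import get_prob_dist_for_prediction_moment
>>> X = pd.DataFrame({"triage_score": [1.0, 2.0, 3.0, 4.0]})  # illustrative feature
>>> model = LogisticRegression().fit(X, [0, 0, 1, 1])
>>> result = get_prob_dist_for_prediction_moment(X, model, inference_time=True)
>>> list(result.keys())  # no 'agg_observed' at inference time
['agg_predicted']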

get_prob_dist_using_survival_curve(snapshot_dates, test_visits, category, prediction_time, prediction_window, start_time_col, end_time_col, model, verbose=False)

Calculate probability distributions for each snapshot date using an EmpiricalIncomingAdmissionPredictor.

Parameters:

snapshot_dates : array-like
    Array of dates for which to calculate probability distributions. Required.
test_visits : pandas.DataFrame
    DataFrame containing test visit data. Must have either:
    - start_time_col as a column and end_time_col as a column, or
    - start_time_col as the index and end_time_col as a column
    Required.
category : str
    Category to use for predictions (e.g., 'medical', 'surgical'). Required.
prediction_time : tuple
    Tuple of (hour, minute) representing the time of day for predictions. Required.
prediction_window : timedelta
    The prediction window duration. Required.
start_time_col : str
    Name of the column containing start times (or index name if using index). Required.
end_time_col : str
    Name of the column containing end times. Required.
model : EmpiricalIncomingAdmissionPredictor
    A fitted instance of EmpiricalIncomingAdmissionPredictor. Required.
verbose : bool, optional
    If True, print progress information. Default: False.

Returns:

dict
    A dictionary mapping snapshot dates to probability distributions.

Raises:

ValueError
    If test_visits does not have the required columns or if model is not fitted.

Source code in src/patientflow/aggregate.py
def get_prob_dist_using_survival_curve(
    snapshot_dates: List[date],
    test_visits: pd.DataFrame,
    category: str,
    prediction_time: Tuple[int, int],
    prediction_window: timedelta,
    start_time_col: str,
    end_time_col: str,
    model: EmpiricalIncomingAdmissionPredictor,
    verbose=False,
):
    """
    Calculate probability distributions for each snapshot date using an EmpiricalIncomingAdmissionPredictor.

    Parameters
    ----------
    snapshot_dates : array-like
        Array of dates for which to calculate probability distributions.
    test_visits : pandas.DataFrame
        DataFrame containing test visit data. Must have either:
        - start_time_col as a column and end_time_col as a column, or
        - start_time_col as the index and end_time_col as a column
    category : str
        Category to use for predictions (e.g., 'medical', 'surgical')
    prediction_time : tuple
        Tuple of (hour, minute) representing the time of day for predictions
    prediction_window : timedelta
        The prediction window duration
    start_time_col : str
        Name of the column containing start times (or index name if using index)
    end_time_col : str
        Name of the column containing end times
    model : EmpiricalIncomingAdmissionPredictor
        A fitted instance of EmpiricalIncomingAdmissionPredictor
    verbose : bool, optional (default=False)
        If True, print progress information

    Returns
    -------
    dict
        A dictionary mapping snapshot dates to probability distributions.

    Raises
    ------
    ValueError
        If test_visits does not have the required columns or if model is not fitted.
    """

    # Validate test_visits has required columns
    if start_time_col in test_visits.columns:
        # start_time_col is a regular column
        if end_time_col not in test_visits.columns:
            raise ValueError(f"Column '{end_time_col}' not found in DataFrame")
    else:
        # Check if start_time_col is the index
        if test_visits.index.name != start_time_col:
            raise ValueError(
                f"'{start_time_col}' not found in DataFrame columns or index (index.name is '{test_visits.index.name}')"
            )
        if end_time_col not in test_visits.columns:
            raise ValueError(f"Column '{end_time_col}' not found in DataFrame")

    # Validate model is fitted
    if not hasattr(model, "survival_df") or model.survival_df is None:
        raise ValueError("Model must be fitted before calling get_prob_dist_empirical")

    prob_dist_dict = {}
    if verbose:
        print(
            f"Calculating probability distributions for {len(snapshot_dates)} snapshot dates"
        )

    # Create prediction context that will be the same for all dates
    prediction_context = {category: {"prediction_time": prediction_time}}

    for dt in snapshot_dates:
        # Create prediction moment by combining snapshot date and prediction time
        prediction_moment = datetime.combine(
            dt, time(prediction_time[0], prediction_time[1])
        )
        # Convert to UTC if the test_visits timestamps are timezone-aware
        if start_time_col in test_visits.columns:
            if test_visits[start_time_col].dt.tz is not None:
                prediction_moment = prediction_moment.replace(tzinfo=timezone.utc)
        else:
            if test_visits.index.tz is not None:
                prediction_moment = prediction_moment.replace(tzinfo=timezone.utc)

        # Get predictions from model
        predictions = model.predict(prediction_context)
        prob_dist_dict[dt] = {"agg_predicted": predictions[category]}

        # Calculate observed values
        if start_time_col in test_visits.columns:
            # start_time_col is a regular column
            mask = (test_visits[start_time_col] > prediction_moment) & (
                test_visits[end_time_col] <= prediction_moment + prediction_window
            )
        else:
            # start_time_col is the index
            mask = (test_visits.index > prediction_moment) & (
                test_visits[end_time_col] <= prediction_moment + prediction_window
            )
        nrow = mask.sum()
        prob_dist_dict[dt]["agg_observed"] = int(nrow) if nrow > 0 else 0

    if verbose:
        print(f"Processed {len(snapshot_dates)} snapshot dates")

    return prob_dist_dict

model_input_to_pred_proba(model_input, model)

Use a predictive model to convert model input data into predicted probabilities.

Parameters:

model_input : array-like
    The input data to the model, typically as features used for predictions. Required.
model : object
    A model object with a predict_proba method that computes probability estimates. Required.

Returns:

DataFrame
    A pandas DataFrame containing the predicted probabilities for the positive class, with one column labeled 'pred_proba'.

Source code in src/patientflow/aggregate.py
def model_input_to_pred_proba(model_input, model):
    """
    Use a predictive model to convert model input data into predicted probabilities.

    Parameters
    ----------
    model_input : array-like
        The input data to the model, typically as features used for predictions.
    model : object
        A model object with a `predict_proba` method that computes probability estimates.

    Returns
    -------
    DataFrame
        A pandas DataFrame containing the predicted probabilities for the positive class,
        with one column labeled 'pred_proba'.

    """
    if len(model_input) == 0:
        return pd.DataFrame(columns=["pred_proba"])
    else:
        predictions = model.predict_proba(model_input)[:, 1]
        return pd.DataFrame(
            predictions, index=model_input.index, columns=["pred_proba"]
        )
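
Example (a brief sketch with an illustrative one-feature model):

>>> import pandas as pd
>>> from sklearn.linear_model import LogisticRegression
>>> from patientflow.aggregate import model_input_to_pred_proba
>>> X = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]})
>>> model = LogisticRegression().fit(X, [0, 0, 1, 1])
>>> model_input_to_pred_proba(X, model).columns.tolist()
['pred_proba']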

pred_proba_to_agg_predicted(predictions_proba, weights=None, normal_approx_threshold=30)

Convert individual probability predictions into aggregate predicted probability distribution using optional weights. Uses a Normal approximation for large datasets (> normal_approx_threshold) for better performance.

Parameters:

predictions_proba : DataFrame
    A DataFrame containing the probability predictions; must have a single column named 'pred_proba'. Required.
weights : array-like, optional
    An array of weights, of the same length as the DataFrame rows, to apply to each prediction. Default: None.
normal_approx_threshold : int, optional
    If the number of rows in predictions_proba exceeds this threshold, use a Normal distribution approximation. Set to None or a very large number to always use the exact symbolic computation. Default: 30.

Returns:

DataFrame
    A DataFrame with a single column 'agg_proba' showing the aggregated probability, indexed from 0 to n, where n is the number of predictions.

Source code in src/patientflow/aggregate.py
def pred_proba_to_agg_predicted(
    predictions_proba, weights=None, normal_approx_threshold=30
):
    """
    Convert individual probability predictions into aggregate predicted probability distribution using optional weights.
    Uses a Normal approximation for large datasets (> normal_approx_threshold) for better performance.

    Parameters
    ----------
    predictions_proba : DataFrame
        A DataFrame containing the probability predictions; must have a single column named 'pred_proba'.
    weights : array-like, optional
        An array of weights, of the same length as the DataFrame rows, to apply to each prediction.
    normal_approx_threshold : int, optional (default=30)
        If the number of rows in predictions_proba exceeds this threshold, use a Normal distribution approximation.
        Set to None or a very large number to always use the exact symbolic computation.

    Returns
    -------
    DataFrame
        A DataFrame with a single column 'agg_proba' showing the aggregated probability,
        indexed from 0 to n, where n is the number of predictions.
    """
    n = len(predictions_proba)

    if n == 0:
        agg_predicted_dict = {0: 1}
    elif normal_approx_threshold is not None and n > normal_approx_threshold:
        # Apply a normal approximation for large datasets
        import numpy as np
        from scipy.stats import norm

        # Apply weights if provided
        if weights is not None:
            probs = predictions_proba["pred_proba"].values * weights
        else:
            probs = predictions_proba["pred_proba"].values

        # Calculate mean and variance for the normal approximation
        # For a sum of Bernoulli variables, mean = sum of probabilities
        mean = probs.sum()
        # Variance = sum of p_i * (1-p_i)
        variance = (probs * (1 - probs)).sum()

        # Handle the case where variance is zero (all probabilities are 0 or 1)
        if variance == 0:
            # If variance is zero, all probabilities are the same (either all 0 or all 1)
            # The distribution is deterministic - all probability mass is at the mean
            agg_predicted_dict = {int(round(mean)): 1.0}
        else:
            # Generate probabilities for each possible count using normal approximation
            counts = np.arange(n + 1)
            agg_predicted_dict = {}

            for i in counts:
                # Probability that count = i is the probability that a normal RV falls between i-0.5 and i+0.5
                if i == 0:
                    p = norm.cdf(0.5, loc=mean, scale=np.sqrt(variance))
                elif i == n:
                    p = 1 - norm.cdf(n - 0.5, loc=mean, scale=np.sqrt(variance))
                else:
                    p = norm.cdf(i + 0.5, loc=mean, scale=np.sqrt(variance)) - norm.cdf(
                        i - 0.5, loc=mean, scale=np.sqrt(variance)
                    )
                agg_predicted_dict[i] = p

            # Normalize to ensure the probabilities sum to 1
            total = sum(agg_predicted_dict.values())
            if total > 0:
                for i in agg_predicted_dict:
                    agg_predicted_dict[i] /= total
            else:
                # If all probabilities are zero, set a uniform distribution
                n = len(agg_predicted_dict)
                for i in agg_predicted_dict:
                    agg_predicted_dict[i] = 1.0 / n
    else:
        # Use the original symbolic computation for smaller datasets
        local_proba = predictions_proba.copy()
        if weights is not None:
            local_proba["pred_proba"] *= weights

        syms = create_symbols(n)
        expression = build_expression(syms, n)
        expression = expression_subs(expression, n, local_proba["pred_proba"])
        agg_predicted_dict = {i: return_coeff(expression, i) for i in range(n + 1)}

    agg_predicted = pd.DataFrame.from_dict(
        agg_predicted_dict, orient="index", columns=["agg_proba"]
    )
    return agg_predicted
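
Example (a minimal sketch; the probabilities are illustrative, and the expected masses in the comment are computed by hand from the product (0.8 + 0.2s)(0.5 + 0.5s)(0.1 + 0.9s)):

>>> import pandas as pd
>>> from patientflow.aggregate import pred_proba_to_agg_predicted
>>> pred_proba = pd.DataFrame({"pred_proba": [0.2, 0.5, 0.9]})
>>> agg = pred_proba_to_agg_predicted(pred_proba)  # n = 3 <= 30: exact symbolic route
>>> # agg["agg_proba"] holds P(0) = 0.04, P(1) = 0.41, P(2) = 0.46, P(3) = 0.09
>>> agg_approx = pred_proba_to_agg_predicted(pred_proba, normal_approx_threshold=2)
>>> # n = 3 > 2, so this call takes the Normal approximation route instead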

return_coeff(expression, i)

Extract the coefficient of a specified power from an expanded symbolic expression.

Parameters:

expression : Expr
    The expression to expand and extract from. Required.
i : int
    The power of the term whose coefficient is to be extracted. Required.

Returns:

number
    The coefficient of the specified power in the expression.

Source code in src/patientflow/aggregate.py
def return_coeff(expression, i):
    """
    Extract the coefficient of a specified power from an expanded symbolic expression.

    Parameters
    ----------
    expression : Expr
        The expression to expand and extract from.
    i : int
        The power of the term whose coefficient is to be extracted.

    Returns
    -------
    number
        The coefficient of the specified power in the expression.

    """
    s = sym.Symbol("s")
    return expand(expression).coeff(s, i)

calculate

Calculation module for patient flow metrics.

This module provides functions for calculating various patient flow metrics such as arrival rates and admission probabilities within prediction windows.

admission_in_prediction_window

This module provides functions to model and analyze a curve consisting of an exponential growth segment followed by an exponential decay segment. It includes functions to create the curve, calculate specific points on it, and evaluate probabilities based on its shape.

Its intended use is to derive the probability of a patient being admitted to a hospital within a certain elapsed time after their arrival in the Emergency Department (ED), given the hospital's aspirations for the time it takes patients to be admitted. For this purpose, two points on the curve are required as parameters:

* (x1, y1) : The target proportion of patients y1 (e.g. 76%) who have been admitted or discharged by time x1 (e.g. 4 hours).
* (x2, y2) : The time x2 by which all but a small proportion y2 of patients have been admitted.

It is assumed that y grows exponentially towards (x1, y1) for x < x1, and that at (x1, y1) the curve switches to exponential decay.

Functions:

Name Description
growth_curve : function

Calculate exponential growth at a point where x < x1.

decay_curve : function

Calculate exponential decay at a point where x >= x1.

create_curve : function

Generate a full curve with both growth and decay segments.

get_y_from_aspirational_curve : function

Read from the curve a value for y, the probability of being admitted, for a given moment x hours after arrival.

calculate_probability : function

Compute the probability of a patient being admitted by the end of a prediction window, given how much time has elapsed since their arrival.

get_survival_probability : function

Calculate the probability of a patient still being in the ED after a certain time using survival curve data.

calculate_probability(elapsed_los, prediction_window, x1, y1, x2, y2)

Calculates the probability of an admission occurring within a specified prediction window after the moment of prediction, based on the patient's elapsed time in the ED prior to the moment of prediction and the length of the window.

Parameters:

elapsed_los : timedelta
    The elapsed time since the patient arrived at the ED. Required.
prediction_window : timedelta
    The duration of the prediction window after the point of prediction, for which the probability is calculated. Required.
x1 : float
    The time target for the first key point on the curve. Required.
y1 : float
    The proportion target for the first key point (e.g., 76% of patients admitted by time x1). Required.
x2 : float
    The time target for the second key point on the curve. Required.
y2 : float
    The proportion target for the second key point (e.g., 99% of patients admitted by time x2). Required.

Returns:

float
    The probability of the event occurring within the given prediction window.

Edge Case Handling

When elapsed_los is extremely high, such as values significantly greater than x2, the admission probability prior to the current time (prob_admission_prior_to_now) can reach 1.0 despite the curve being asymptotic. This scenario can cause computational errors when calculating the conditional probability, as it involves a division by zero. In such cases, this function directly returns a probability of 1.0, reflecting certainty of admission.

Example

Calculate the probability that a patient, who has already been in the ED for 3 hours, will be admitted in the next 2 hours. The ED targets that 76% of patients are admitted or discharged within 4 hours, and 99% within 12 hours.

>>> from datetime import timedelta
>>> calculate_probability(timedelta(hours=3), timedelta(hours=2), 4, 0.76, 12, 0.99)

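In terms of the aspirational curve F, the returned value is the conditional probability (F(t + w) - F(t)) / (1 - F(t)), where t is the elapsed time in hours and w is the prediction window. In the example above, t = 3 and w = 2, so the function reads F(3) and F(5) off the curve and conditions on the patient not yet having been admitted at t = 3.
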
Source code in src/patientflow/calculate/admission_in_prediction_window.py
def calculate_probability(
    elapsed_los: timedelta,
    prediction_window: timedelta,
    x1: float,
    y1: float,
    x2: float,
    y2: float,
):
    """
    Calculates the probability of an admission occurring within a specified prediction window after the moment of prediction, based on the patient's elapsed time in the ED prior to the moment of prediction and the length of the window

    Parameters
    ----------
    elapsed_los : timedelta
        The elapsed time since the patient arrived at the ED.
    prediction_window : timedelta
        The duration of the prediction window after the point of prediction, for which the probability is calculated.
    x1 : float
        The time target for the first key point on the curve.
    y1 : float
        The proportion target for the first key point (e.g., 76% of patients admitted by time x1).
    x2 : float
        The time target for the second key point on the curve.
    y2 : float
        The proportion target for the second key point (e.g., 99% of patients admitted by time x2).

    Returns
    -------
    float
        The probability of the event occurring within the given prediction window.

    Edge Case Handling
    ------------------
    When elapsed_los is extremely high, such as values significantly greater than x2, the admission probability prior to the current time (`prob_admission_prior_to_now`) can reach 1.0 despite the curve being asymptotic. This scenario can cause computational errors when calculating the conditional probability, as it involves a division by zero. In such cases, this function directly returns a probability of 1.0, reflecting certainty of admission.

    Example
    -------
    Calculate the probability that a patient, who has already been in the ED for 3 hours, will be admitted in the next 2 hours. The ED targets that 76% of patients are admitted or discharged within 4 hours, and 99% within 12 hours.

    >>> from datetime import timedelta
    >>> calculate_probability(timedelta(hours=3), timedelta(hours=2), 4, 0.76, 12, 0.99)

    """
    # Validate inputs
    if not isinstance(elapsed_los, timedelta):
        raise TypeError("elapsed_los must be a timedelta object")
    if not isinstance(prediction_window, timedelta):
        raise TypeError("prediction_window must be a timedelta object")

    # Convert timedelta to hours
    elapsed_hours = elapsed_los.total_seconds() / 3600
    prediction_window_hours = prediction_window.total_seconds() / 3600

    # Validate elapsed time to ensure it represents a reasonable time value in hours
    if elapsed_hours < 0:
        raise ValueError(
            "elapsed_los must be non-negative (cannot have negative elapsed time)"
        )

    if elapsed_hours > 168:  # 168 hours = 1 week
        warnings.warn(
            "elapsed_los appears to be longer than 168 hours (1 week). "
            "Check that the units of elapsed_los are correct"
        )

    if not np.isfinite(elapsed_hours):
        raise ValueError("elapsed_los must be a finite time duration")

    # Validate prediction window to ensure it represents a reasonable time value in hours
    if prediction_window_hours < 0:
        raise ValueError(
            "prediction_window must be non-negative (cannot have negative prediction window)"
        )

    if prediction_window_hours > 72:  # 72 hours = 3 days
        warnings.warn(
            "prediction_window appears to be longer than 72 hours (3 days). "
            "Check that the units of prediction_window are correct"
        )

    if not np.isfinite(prediction_window_hours):
        raise ValueError("prediction_window must be a finite time duration")

    # probability of still being in the ED now (a function of elapsed time since arrival)
    prob_admission_prior_to_now = get_y_from_aspirational_curve(
        elapsed_hours, x1, y1, x2, y2
    )

    # prob admission when adding the prediction window added to elapsed time since arrival
    prob_admission_by_end_of_window = get_y_from_aspirational_curve(
        elapsed_hours + prediction_window_hours, x1, y1, x2, y2
    )

    # Direct return for edge cases where `prob_admission_prior_to_now` reaches 1.0
    if prob_admission_prior_to_now == 1:
        return 1.0

    # Calculate the conditional probability of admission within the prediction window
    # given that the patient hasn't been admitted yet
    conditional_prob = (
        prob_admission_by_end_of_window - prob_admission_prior_to_now
    ) / (1 - prob_admission_prior_to_now)

    return conditional_prob

create_curve(x1, y1, x2, y2, a=0.01, generate_values=False)

Generates parameters for an exponential growth and decay curve. Optionally generates x-values and corresponding y-values across a default or specified range.

Parameters:

x1 : float
    The x-value where the curve transitions from growth to decay. Required.
y1 : float
    The y-value at the transition point x1. Required.
x2 : float
    The x-value defining the end of the decay curve for calculation purposes. Required.
y2 : float
    The y-value at x2, intended to fine-tune the decay rate. Required.
a : float, optional
    The initial value coefficient for the growth curve. Default: 0.01.
generate_values : bool, optional
    Flag to determine whether to generate x-values and y-values for visualization purposes. Default: False.

Returns:

tuple
    If generate_values is False, returns (gamma, lamda, a).
    If generate_values is True, returns (gamma, lamda, a, x_values, y_values).

Source code in src/patientflow/calculate/admission_in_prediction_window.py
def create_curve(x1, y1, x2, y2, a=0.01, generate_values=False):
    """
    Generates parameters for an exponential growth and decay curve.
    Optionally generates x-values and corresponding y-values across a default or specified range.

    Parameters
    ----------
    x1 : float
        The x-value where the curve transitions from growth to decay.
    y1 : float
        The y-value at the transition point x1.
    x2 : float
        The x-value defining the end of the decay curve for calculation purposes.
    y2 : float
        The y-value at x2, intended to fine-tune the decay rate.
    a : float, optional
        The initial value coefficient for the growth curve, defaults to 0.01.
    generate_values : bool, optional
        Flag to determine whether to generate x-values and y-values for visualization purposes.

    Returns
    -------
    tuple
        If generate_values is False, returns (gamma, lamda, a).
        If generate_values is True, returns (gamma, lamda, a, x_values, y_values).

    """
    # Validate inputs
    if not (x1 < x2):
        raise ValueError("x1 must be less than x2")
    if not (0 < y1 < y2 < 1):
        raise ValueError("y1 must be less than y2, and both must be between 0 and 1")

    # Constants for growth and decay
    gamma = np.log(y1 / a) / x1
    lamda = np.log((1 - y1) / (1 - y2)) / (x2 - x1)

    if generate_values:
        x_values = np.linspace(0, 20, 200)
        y_values = [
            (growth_curve(x, a, gamma) if x <= x1 else decay_curve(x, x1, y1, lamda))
            for x in x_values
        ]
        return gamma, lamda, a, x_values, y_values

    return gamma, lamda, a
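
Example (using the 4-hour/76% and 12-hour/99% targets quoted elsewhere on this page; the values in the comment are hand-computed approximations):

>>> from patientflow.calculate.admission_in_prediction_window import create_curve
>>> gamma, lamda, a = create_curve(4, 0.76, 12, 0.99)
>>> # gamma = ln(0.76/0.01)/4 ≈ 1.083, lamda = ln(0.24/0.01)/8 ≈ 0.397, a = 0.01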

decay_curve(x, x1, y1, lamda)

Calculate the exponential decay value at a given x using specified parameters. The function supports both scalar and array inputs for x.

Parameters:

x : float or np.ndarray
    The x-value(s) at which to evaluate the curve. Required.
x1 : float
    The x-value where the growth curve transitions to the decay curve. Required.
y1 : float
    The y-value at the transition point, where the decay curve starts. Required.
lamda : float
    The decay rate coefficient. Required.

Returns:

float or np.ndarray
    The y-value(s) of the decay curve at x.

Source code in src/patientflow/calculate/admission_in_prediction_window.py
def decay_curve(x, x1, y1, lamda):
    """
    Calculate the exponential decay value at a given x using specified parameters.
    The function supports both scalar and array inputs for x.

    Parameters
    ----------
    x : float or np.ndarray
        The x-value(s) at which to evaluate the curve.
    x1 : float
        The x-value where the growth curve transitions to the decay curve.
    y1 : float
        The y-value at the transition point, where the decay curve starts.
    lamda : float
        The decay rate coefficient.

    Returns
    -------
    float or np.ndarray
        The y-value(s) of the decay curve at x.

    """
    return y1 + (1 - y1) * (1 - np.exp(-lamda * (x - x1)))

get_survival_probability(survival_df, time_hours)

Calculate the probability of a patient still being in the ED after a specified time using survival curve data.

Parameters:

survival_df : pandas.DataFrame
    DataFrame containing survival curve data with columns:
    - time_hours: Time points in hours
    - survival_probability: Probability of still being in ED at each time point
    Required.
time_hours : float
    The time point (in hours) at which to calculate the survival probability. Required.

Returns:

float
    The probability of still being in the ED at the specified time.

Notes
  • If the exact time_hours is not in the survival curve data, the function will interpolate between the nearest time points
  • If time_hours is less than the minimum time in the data, returns 1.0
  • If time_hours is greater than the maximum time in the data, returns the last known survival probability

Examples:

>>> survival_df = pd.DataFrame({
...     'time_hours': [0, 2, 4, 6],
...     'survival_probability': [1.0, 0.8, 0.5, 0.2]
... })
>>> get_survival_probability(survival_df, 3.5)
0.575  # interpolated between 0.8 (at 2 hours) and 0.5 (at 4 hours)
Source code in src/patientflow/calculate/admission_in_prediction_window.py
def get_survival_probability(survival_df, time_hours):
    """
    Calculate the probability of a patient still being in the ED after a specified time
    using survival curve data.

    Parameters
    ----------
    survival_df : pandas.DataFrame
        DataFrame containing survival curve data with columns:
        - time_hours: Time points in hours
        - survival_probability: Probability of still being in ED at each time point
    time_hours : float
        The time point (in hours) at which to calculate the survival probability

    Returns
    -------
    float
        The probability of still being in the ED at the specified time

    Notes
    -----
    - If the exact time_hours is not in the survival curve data, the function will
      interpolate between the nearest time points
    - If time_hours is less than the minimum time in the data, returns 1.0
    - If time_hours is greater than the maximum time in the data, returns the last
      known survival probability

    Examples
    --------
    >>> survival_df = pd.DataFrame({
    ...     'time_hours': [0, 2, 4, 6],
    ...     'survival_probability': [1.0, 0.8, 0.5, 0.2]
    ... })
    >>> get_survival_probability(survival_df, 3.5)
    0.575  # interpolated between 0.8 (at 2 hours) and 0.5 (at 4 hours)
    """
    if time_hours < survival_df["time_hours"].min():
        return 1.0

    if time_hours > survival_df["time_hours"].max():
        return survival_df["survival_probability"].iloc[-1]

    # Find the closest time points for interpolation
    lower_idx = survival_df["time_hours"].searchsorted(time_hours, side="right") - 1
    upper_idx = lower_idx + 1

    if lower_idx < 0:
        return 1.0

    if upper_idx >= len(survival_df):
        return survival_df["survival_probability"].iloc[-1]

    # Get the surrounding points
    t1 = survival_df["time_hours"].iloc[lower_idx]
    t2 = survival_df["time_hours"].iloc[upper_idx]
    p1 = survival_df["survival_probability"].iloc[lower_idx]
    p2 = survival_df["survival_probability"].iloc[upper_idx]

    # Linear interpolation
    return p1 + (p2 - p1) * (time_hours - t1) / (t2 - t1)

get_y_from_aspirational_curve(x, x1, y1, x2, y2)

Calculate the probability y that a patient will have been admitted by a specified time x after their arrival, by reading from the aspirational curve that is constrained to pass through the points (x1, y1) and (x2, y2), with exponential growth where x < x1 and exponential decay where x >= x1.

The function handles scalar or array inputs for x and determines y using either an exponential growth curve (for x < x1) or an exponential decay curve (for x >= x1). The curve parameters are derived to ensure the curve passes through specified points (x1, y1) and (x2, y2).

Parameters:

x : float or np.ndarray
    The x-coordinate(s) at which to calculate the y-value on the curve. Can be a single value or an array of values. Required.
x1 : float
    The x-coordinate of the first key point on the curve, where the growth phase ends and the decay phase begins. Required.
y1 : float
    The y-coordinate of the first key point (x1), representing the target proportion of patients admitted by time x1. Required.
x2 : float
    The x-coordinate of the second key point on the curve, beyond which all but a few patients are expected to be admitted. Required.
y2 : float
    The y-coordinate of the second key point (x2), representing the target proportion of patients admitted by time x2. Required.

Returns:

float or np.ndarray
    The calculated y-value(s) (probability of admission) at the given x. The type of the return matches the input type for x (either scalar or array).

Source code in src/patientflow/calculate/admission_in_prediction_window.py
def get_y_from_aspirational_curve(x, x1, y1, x2, y2):
    """
    Calculate the probability y that a patient will have been admitted by a specified time x after their arrival, by reading from the aspirational curve that is constrained to pass through the points (x1, y1) and (x2, y2), with exponential growth where x < x1 and exponential decay where x >= x1.

    The function handles scalar or array inputs for x and determines y using either an exponential growth curve (for x < x1)
    or an exponential decay curve (for x >= x1). The curve parameters are derived to ensure the curve passes through
    specified points (x1, y1) and (x2, y2).

    Parameters
    ----------
    x : float or np.ndarray
        The x-coordinate(s) at which to calculate the y-value on the curve. Can be a single value or an array of values.
    x1 : float
        The x-coordinate of the first key point on the curve, where the growth phase ends and the decay phase begins.
    y1 : float
        The y-coordinate of the first key point (x1), representing the target proportion of patients admitted by time x1.
    x2 : float
        The x-coordinate of the second key point on the curve, beyond which all but a few patients are expected to be admitted.
    y2 : float
        The y-coordinate of the second key point (x2), representing the target proportion of patients admitted by time x2.

    Returns
    -------
    float or np.ndarray
        The calculated y-value(s) (probability of admission) at the given x. The type of the return matches the input type
        for x (either scalar or array).

    """
    gamma, lamda, a = create_curve(x1, y1, x2, y2)
    y = np.where(x < x1, growth_curve(x, a, gamma), decay_curve(x, x1, y1, lamda))
    return y
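
Example (same targets as above; the values in the comments are approximate):

>>> from patientflow.calculate.admission_in_prediction_window import (
...     get_y_from_aspirational_curve,
... )
>>> y_growth = get_y_from_aspirational_curve(3, 4, 0.76, 12, 0.99)  # ≈ 0.26 (growth segment)
>>> y_decay = get_y_from_aspirational_curve(6, 4, 0.76, 12, 0.99)   # ≈ 0.89 (decay segment)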

growth_curve(x, a, gamma)

Calculate the exponential growth value at a given x using specified parameters. The function supports both scalar and array inputs for x.

Parameters:

x : float or np.ndarray
    The x-value(s) at which to evaluate the curve. Required.
a : float
    The coefficient that defines the starting point of the growth curve when x is 0. Required.
gamma : float
    The growth rate coefficient of the curve. Required.

Returns:

float or np.ndarray
    The y-value(s) of the growth curve at x.

Source code in src/patientflow/calculate/admission_in_prediction_window.py
def growth_curve(x, a, gamma):
    """
    Calculate the exponential growth value at a given x using specified parameters.
    The function supports both scalar and array inputs for x.

    Parameters
    ----------
    x : float or np.ndarray
        The x-value(s) at which to evaluate the curve.
    a : float
        The coefficient that defines the starting point of the growth curve when x is 0.
    gamma : float
        The growth rate coefficient of the curve.

    Returns
    -------
    float or np.ndarray
        The y-value(s) of the growth curve at x.

    """
    return a * np.exp(x * gamma)

arrival_rates

Calculate and process time-varying arrival rates and admission probabilities.

This module provides functions for calculating arrival rates, admission probabilities, and unfettered demand rates for inpatient arrivals using an aspirational approach.

Functions:

Name Description
time_varying_arrival_rates : function

Calculate arrival rates for each time interval across the dataset's date range.

time_varying_arrival_rates_lagged : function

Create lagged arrival rates based on time intervals.

admission_probabilities : function

Compute cumulative and hourly admission probabilities using aspirational curves.

weighted_arrival_rates : function

Aggregate weighted arrival rates for specific time intervals.

unfettered_demand_by_hour : function

Estimate inpatient demand by hour using historical data and aspirational curves.

count_yet_to_arrive : function

Count patients who arrived after prediction times and were admitted within prediction windows.

Notes
  • All times are handled in local timezone
  • Arrival rates are normalized by the number of unique days in the dataset
  • Demand calculations consider both historical patterns and admission probabilities
  • Time intervals must divide evenly into 24 hours
  • Aspirational curves use (x1,y1) and (x2,y2) coordinates to model admission probabilities

Examples:

>>> # Generate random arrival times over a week
>>> np.random.seed(42)  # For reproducibility
>>> n_arrivals = 1000
>>> random_times = [
...     pd.Timestamp('2024-01-01') +
...     pd.Timedelta(days=np.random.randint(0, 7)) +
...     pd.Timedelta(hours=np.random.randint(0, 24)) +
...     pd.Timedelta(minutes=np.random.randint(0, 60))
...     for _ in range(n_arrivals)
... ]
>>> df = pd.DataFrame(index=sorted(random_times))
>>>
>>> # Calculate various rates and demand
>>> rates = time_varying_arrival_rates(df, yta_time_interval=60)
>>> lagged_rates = time_varying_arrival_rates_lagged(df, lagged_by=4)
>>> demand = unfettered_demand_by_hour(df, x1=4, y1=0.8, x2=8, y2=0.95)

admission_probabilities(hours_since_arrival, x1, y1, x2, y2)

Calculate probability of admission for each hour since arrival.

Parameters:

hours_since_arrival : np.ndarray
    Array of hours since arrival. Required.
x1 : float
    First x-coordinate of the aspirational curve. Required.
y1 : float
    First y-coordinate of the aspirational curve. Required.
x2 : float
    Second x-coordinate of the aspirational curve. Required.
y2 : float
    Second y-coordinate of the aspirational curve. Required.

Returns:

Tuple[np.ndarray, np.ndarray]
    A tuple containing:
    - np.ndarray: Cumulative admission probabilities
    - np.ndarray: Hourly admission probabilities

Notes

The aspirational curve is defined by two points (x1,y1) and (x2,y2) and is used to model the probability of admission over time.

Source code in src/patientflow/calculate/arrival_rates.py
def admission_probabilities(
    hours_since_arrival: np.ndarray, x1: float, y1: float, x2: float, y2: float
) -> Tuple[np.ndarray, np.ndarray]:
    """Calculate probability of admission for each hour since arrival.

    Parameters
    ----------
    hours_since_arrival : np.ndarray
        Array of hours since arrival.
    x1 : float
        First x-coordinate of the aspirational curve.
    y1 : float
        First y-coordinate of the aspirational curve.
    x2 : float
        Second x-coordinate of the aspirational curve.
    y2 : float
        Second y-coordinate of the aspirational curve.

    Returns
    -------
    Tuple[np.ndarray, np.ndarray]
        A tuple containing:
        - np.ndarray: Cumulative admission probabilities
        - np.ndarray: Hourly admission probabilities

    Notes
    -----
    The aspirational curve is defined by two points (x1,y1) and (x2,y2) and is used
    to model the probability of admission over time.
    """
    prob_admission_by_hour = np.array(
        [
            get_y_from_aspirational_curve(hour, x1, y1, x2, y2)
            for hour in hours_since_arrival
        ]
    )
    prob_admission_within_hour = np.diff(prob_admission_by_hour)

    return prob_admission_by_hour, prob_admission_within_hour
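
A minimal sketch, reusing the aspirational coordinates from the module examples above (x1=4, y1=0.8, x2=8, y2=0.95) and assuming the function is imported from patientflow.calculate.arrival_rates:

>>> import numpy as np
>>> hours = np.arange(11)  # 0..10 hours since arrival
>>> cumulative, hourly = admission_probabilities(hours, x1=4, y1=0.8, x2=8, y2=0.95)
>>> # cumulative has one entry per hour; hourly is its first difference,
>>> # so len(hourly) == len(cumulative) - 1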

count_yet_to_arrive(df, snapshot_dates, prediction_times, prediction_window_hours)

Count patients who arrived after prediction times and were admitted within prediction windows.

This function counts patients who arrived after specified prediction times and were admitted to a ward within the specified prediction window for each combination of snapshot date and prediction time.

Parameters:

df : DataFrame, required
    A DataFrame containing patient data with 'arrival_datetime', 'admitted_to_ward_datetime', and 'patient_id' columns.
snapshot_dates : list, required
    List of dates (datetime.date objects) to analyze.
prediction_times : list, required
    List of (hour, minute) tuples representing prediction times.
prediction_window_hours : float, required
    Length of prediction window in hours after the prediction time.

Returns:

DataFrame
    DataFrame with columns:
    - 'snapshot_date': The date of the snapshot
    - 'prediction_time': Tuple of (hour, minute) for the prediction time
    - 'count': Number of unique patients who arrived after the prediction time and were admitted within the prediction window

Raises:

TypeError
    If df is not a DataFrame or if required columns are missing.
ValueError
    If prediction_window_hours is not positive.

Notes

This function is useful for analyzing historical patterns of patient arrivals and admissions to inform predictive models for emergency department demand. Only patients with non-null admitted_to_ward_datetime are counted.

Examples:

>>> import pandas as pd
>>> from datetime import date, time
>>> prediction_times = [(12, 0), (15, 30)]
>>> snapshot_dates = [date(2023, 1, 1), date(2023, 1, 2)]
>>> results = count_yet_to_arrive(df, snapshot_dates, prediction_times, 8.0)
Source code in src/patientflow/calculate/arrival_rates.py
def count_yet_to_arrive(
    df: DataFrame,
    snapshot_dates: List,
    prediction_times: List,
    prediction_window_hours: float,
) -> DataFrame:
    """Count patients who arrived after prediction times and were admitted within prediction windows.

    This function counts patients who arrived after specified prediction times and were
    admitted to a ward within the specified prediction window for each combination of
    snapshot date and prediction time.

    Parameters
    ----------
    df : pandas.DataFrame
        A DataFrame containing patient data with 'arrival_datetime',
        'admitted_to_ward_datetime', and 'patient_id' columns.
    snapshot_dates : list
        List of dates (datetime.date objects) to analyze.
    prediction_times : list
        List of (hour, minute) tuples representing prediction times.
    prediction_window_hours : float
        Length of prediction window in hours after the prediction time.

    Returns
    -------
    pandas.DataFrame
        DataFrame with columns:
        - 'snapshot_date': The date of the snapshot
        - 'prediction_time': Tuple of (hour, minute) for the prediction time
        - 'count': Number of unique patients who arrived after prediction time
                  and were admitted within the prediction window

    Raises
    ------
    TypeError
        If df is not a DataFrame or if required columns are missing.
    ValueError
        If prediction_window_hours is not positive.

    Notes
    -----
    This function is useful for analyzing historical patterns of patient arrivals
    and admissions to inform predictive models for emergency department demand.
    Only patients with non-null admitted_to_ward_datetime are counted.

    Examples
    --------
    >>> import pandas as pd
    >>> from datetime import date, time
    >>> prediction_times = [(12, 0), (15, 30)]
    >>> snapshot_dates = [date(2023, 1, 1), date(2023, 1, 2)]
    >>> results = count_yet_to_arrive(df, snapshot_dates, prediction_times, 8.0)
    """
    # Input validation
    if not isinstance(df, DataFrame):
        raise TypeError("The input 'df' must be a pandas DataFrame.")

    required_columns = ["arrival_datetime", "admitted_to_ward_datetime", "patient_id"]
    missing_columns = [col for col in required_columns if col not in df.columns]
    if missing_columns:
        raise TypeError(f"DataFrame missing required columns: {missing_columns}")

    if (
        not isinstance(prediction_window_hours, (int, float))
        or prediction_window_hours <= 0
    ):
        raise ValueError("prediction_window_hours must be a positive number.")

    # Create an empty list to store results
    results = []

    # For each combination of date and time
    for date_val in snapshot_dates:
        for hour, minute in prediction_times:
            # Create the prediction datetime
            prediction_datetime = pd.Timestamp(
                datetime.combine(date_val, time(hour=hour, minute=minute))
            )

            # Calculate the end of the prediction window
            prediction_window_end = prediction_datetime + pd.Timedelta(
                hours=prediction_window_hours
            )

            # Count patients who arrived after prediction time and were admitted within the window
            admitted_within_window = len(
                df[
                    (df["arrival_datetime"] > prediction_datetime)
                    & (df["admitted_to_ward_datetime"] <= prediction_window_end)
                ]
            )

            # Store the result
            results.append(
                {
                    "snapshot_date": date_val,
                    "prediction_time": (hour, minute),
                    "count": admitted_within_window,
                }
            )

    # Convert results to a DataFrame
    results_df = pd.DataFrame(results)

    return results_df
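
A fully self-contained sketch with a hypothetical two-patient dataset (values illustrative):

>>> import pandas as pd
>>> from datetime import date
>>> df = pd.DataFrame({
...     "patient_id": [1, 2],
...     "arrival_datetime": pd.to_datetime(["2023-01-01 13:00", "2023-01-01 22:00"]),
...     "admitted_to_ward_datetime": pd.to_datetime(["2023-01-01 18:00", "2023-01-02 04:00"]),
... })
>>> counts = count_yet_to_arrive(df, [date(2023, 1, 1)], [(12, 0)], 8.0)
>>> # Only patient 1 arrived after 12:00 and was admitted by 20:00,
>>> # so the single result row has count == 1.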

process_arrival_rates(arrival_rates_dict)

Process arrival rates dictionary into formats needed for plotting.

Parameters:

arrival_rates_dict : Dict[datetime.time, float], required
    Mapping of times to arrival rates.

Returns:

Tuple[List[float], List[str], List[int]]
    A tuple containing:
    - List[float]: Arrival rate values
    - List[str]: Formatted hour range strings with embedded line breaks (e.g., "09-\n10")
    - List[int]: Integers for x-axis positioning

Notes

The hour labels are formatted with line breaks for better plot readability.
Source code in src/patientflow/calculate/arrival_rates.py
def process_arrival_rates(
    arrival_rates_dict: Dict[time, float],
) -> Tuple[List[float], List[str], List[int]]:
    """Process arrival rates dictionary into formats needed for plotting.

    Parameters
    ----------
    arrival_rates_dict : Dict[datetime.time, float]
        Mapping of times to arrival rates.

    Returns
    -------
    Tuple[List[float], List[str], List[int]]
        A tuple containing:
        - List[float]: Arrival rate values
        - List[str]: Formatted hour range strings (e.g., "09-\n10")
        - List[int]: Integers for x-axis positioning

    Notes
    -----
    The hour labels are formatted with line breaks for better plot readability.
    """
    # Extract hours and rates
    hours = list(arrival_rates_dict.keys())
    arrival_rates = list(arrival_rates_dict.values())

    # Create formatted hour labels with line breaks for better plot readability
    hour_labels = [
        f'{hour.strftime("%H")}-\n{str((hour.hour + 1) % 24).zfill(2)}'
        for hour in hours
    ]

    # Generate numerical values for x-axis positioning
    hour_values = list(range(len(hour_labels)))

    return arrival_rates, hour_labels, hour_values
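
A minimal sketch with hypothetical rates:

>>> from datetime import time
>>> rates = {time(9, 0): 3.2, time(10, 0): 4.1}
>>> values, labels, positions = process_arrival_rates(rates)
>>> # values == [3.2, 4.1]; labels == ['09-\n10', '10-\n11']; positions == [0, 1]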

time_varying_arrival_rates(df, yta_time_interval, num_days=None, verbose=False)

Calculate the time-varying arrival rates for a dataset indexed by datetime.

This function computes the arrival rates for each time interval specified, across the entire date range present in the dataframe. The arrival rate is calculated as the number of entries in the dataframe for each time interval, divided by the number of days in the dataset's timespan.

Parameters:

df : DataFrame, required
    A DataFrame indexed by datetime, representing the data for which arrival rates are to be calculated. The index of the DataFrame should be of datetime type.
yta_time_interval : int or timedelta, required
    The time interval for which the arrival rates are to be calculated. If int, assumed to be in minutes. If timedelta, will be converted to minutes. For example, if yta_time_interval=60, the function will calculate hourly arrival rates.
num_days : int, default None
    The number of days that the DataFrame spans. If not provided, the number of days is calculated from the date of the min and max arrival datetimes.
verbose : bool, default False
    If True, enable info-level logging.

Returns:

OrderedDict[time, float]
    A dictionary mapping times to arrival rates, where times are datetime.time objects and rates are float values.

Raises:

TypeError
    If 'df' is not a pandas DataFrame, 'yta_time_interval' is not an integer or timedelta, or the DataFrame index is not a DatetimeIndex.
ValueError
    If 'yta_time_interval' is less than or equal to 0 or does not divide evenly into 24 hours.

Notes

The minimum and maximum dates in the dataset are used to determine the timespan if num_days is not provided.

Source code in src/patientflow/calculate/arrival_rates.py
def time_varying_arrival_rates(
    df: DataFrame,
    yta_time_interval: Union[int, timedelta],
    num_days: Optional[int] = None,
    verbose: bool = False,
) -> OrderedDict[time, float]:
    """Calculate the time-varying arrival rates for a dataset indexed by datetime.

    This function computes the arrival rates for each time interval specified, across
    the entire date range present in the dataframe. The arrival rate is calculated as
    the number of entries in the dataframe for each time interval, divided by the
    number of days in the dataset's timespan.

    Parameters
    ----------
    df : pandas.DataFrame
        A DataFrame indexed by datetime, representing the data for which arrival rates
        are to be calculated. The index of the DataFrame should be of datetime type.
    yta_time_interval : int or timedelta
        The time interval for which the arrival rates are to be calculated.
        If int, assumed to be in minutes. If timedelta, will be converted to minutes.
        For example, if `yta_time_interval=60`, the function will calculate hourly
        arrival rates.
    num_days : int, optional
        The number of days that the DataFrame spans. If not provided, the number of
        days is calculated from the date of the min and max arrival datetimes.
    verbose : bool, optional
        If True, enable info-level logging. Defaults to False.

    Returns
    -------
    OrderedDict[datetime.time, float]
        A dictionary mapping times to arrival rates, where times are datetime.time
        objects and rates are float values.

    Raises
    ------
    TypeError
        If 'df' is not a pandas DataFrame, 'yta_time_interval' is not an integer or timedelta,
        or the DataFrame index is not a DatetimeIndex.
    ValueError
        If 'yta_time_interval' is less than or equal to 0 or does not divide evenly
        into 24 hours.

    Notes
    -----
    The minimum and maximum dates in the dataset are used to determine the timespan
    if num_days is not provided.
    """
    import logging
    import sys

    if verbose:
        # Create logger with a unique name
        logger = logging.getLogger(f"{__name__}.time_varying_arrival_rates")

        # Only set up handlers if they don't exist
        if not logger.handlers:
            logger.setLevel(logging.INFO if verbose else logging.WARNING)

            # Create handler that writes to sys.stdout
            handler = logging.StreamHandler(sys.stdout)
            handler.setLevel(logging.INFO if verbose else logging.WARNING)

            # Create a formatting configuration
            formatter = logging.Formatter("%(message)s")
            handler.setFormatter(formatter)

            # Add the handler to the logger
            logger.addHandler(handler)

            # Prevent propagation to root logger
            logger.propagate = False

    # Input validation
    if not isinstance(df, DataFrame):
        raise TypeError("The input 'df' must be a pandas DataFrame.")
    if not isinstance(yta_time_interval, (int, timedelta)):
        raise TypeError(
            "The parameter 'yta_time_interval' must be an integer or timedelta."
        )
    if not isinstance(df.index, pd.DatetimeIndex):
        raise TypeError("The DataFrame index must be a pandas DatetimeIndex.")

    # Handle both timedelta and numeric inputs for yta_time_interval
    if isinstance(yta_time_interval, timedelta):
        yta_time_interval_minutes = int(yta_time_interval.total_seconds() / 60)
    elif isinstance(yta_time_interval, int):
        yta_time_interval_minutes = yta_time_interval
    else:
        raise TypeError("yta_time_interval must be a timedelta object or integer")

    # Validate time interval
    minutes_in_day = 24 * 60
    if yta_time_interval_minutes <= 0:
        raise ValueError("The parameter 'yta_time_interval' must be positive.")
    if minutes_in_day % yta_time_interval_minutes != 0:
        raise ValueError(
            f"Time interval ({yta_time_interval_minutes} minutes) must divide evenly into 24 hours."
        )

    if num_days is None:
        # Calculate total days between first and last date
        if verbose and logger:
            logger.info("Inferring number of days from dataset")
        start_date = df.index.date.min()
        end_date = df.index.date.max()
        num_days = (end_date - start_date).days + 1

    if num_days == 0:
        raise ValueError("DataFrame contains no data.")

    if verbose and logger:
        logger.info(
            f"Calculating time-varying arrival rates for data provided, which spans {num_days} unique dates"
        )

    arrival_rates_dict = OrderedDict()

    # Initialize a time object to iterate through one day in the specified intervals
    _start_datetime = datetime(1970, 1, 1, 0, 0, 0, 0)
    _stop_datetime = _start_datetime + timedelta(days=1)

    # Iterate over each interval in a single day to calculate the arrival rate
    while _start_datetime != _stop_datetime:
        _start_time = _start_datetime.time()
        _end_time = (
            _start_datetime + timedelta(minutes=yta_time_interval_minutes)
        ).time()

        # Filter the dataframe for entries within the current time interval
        _df = df.between_time(_start_time, _end_time, inclusive="left")

        # Calculate and store the arrival rate for the interval
        arrival_rates_dict[_start_time] = _df.shape[0] / num_days

        # Move to the next interval
        _start_datetime = _start_datetime + timedelta(minutes=yta_time_interval_minutes)

    return arrival_rates_dict
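
A worked sketch with three arrivals over two days, so the 09:00 hourly rate is 3 arrivals / 2 days (hypothetical timestamps):

>>> import pandas as pd
>>> from datetime import time
>>> idx = pd.to_datetime(["2024-01-01 09:15", "2024-01-01 09:45", "2024-01-02 09:30"])
>>> rates = time_varying_arrival_rates(pd.DataFrame(index=idx), yta_time_interval=60)
>>> rates[time(9, 0)]   # 1.5 arrivals per hour in the 09:00 interval
>>> rates[time(10, 0)]  # 0.0 for an interval with no arrivals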

time_varying_arrival_rates_lagged(df, lagged_by, num_days=None, yta_time_interval=60)

Calculate lagged time-varying arrival rates for a dataset indexed by datetime.

This function first calculates the basic arrival rates and then adjusts them by a specified lag time, returning the rates sorted by the lagged times.

Parameters:

df : DataFrame, required
    A DataFrame indexed by datetime, representing the data for which arrival rates are to be calculated. The index must be a DatetimeIndex.
lagged_by : int, required
    Number of hours to lag the arrival times.
num_days : int, default None
    The number of days that the DataFrame spans. If not provided, the number of days is calculated from the date of the min and max arrival datetimes.
yta_time_interval : int or timedelta, default 60
    The time interval for which the arrival rates are to be calculated. If int, assumed to be in minutes. If timedelta, will be converted to minutes.

Returns:

OrderedDict[time, float]
    A dictionary mapping lagged times (datetime.time objects) to their corresponding arrival rates.

Raises:

TypeError
    If df is not a DataFrame, lagged_by is not an integer, yta_time_interval is not an integer or timedelta, or the DataFrame index is not a DatetimeIndex.
ValueError
    If lagged_by is negative or yta_time_interval is not positive.

Notes

The lagged times are calculated by adding the specified number of hours to each time in the original arrival rates dictionary.

Source code in src/patientflow/calculate/arrival_rates.py
def time_varying_arrival_rates_lagged(
    df: DataFrame,
    lagged_by: int,
    num_days: Optional[int] = None,
    yta_time_interval: Union[int, timedelta] = 60,
) -> OrderedDict[time, float]:
    """Calculate lagged time-varying arrival rates for a dataset indexed by datetime.

    This function first calculates the basic arrival rates and then adjusts them by
    a specified lag time, returning the rates sorted by the lagged times.

    Parameters
    ----------
    df : pandas.DataFrame
        A DataFrame indexed by datetime, representing the data for which arrival rates
        are to be calculated. The index must be a DatetimeIndex.
    lagged_by : int
        Number of hours to lag the arrival times.
    num_days : int, optional
        The number of days that the DataFrame spans. If not provided, the number of
        days is calculated from the date of the min and max arrival datetimes.
    yta_time_interval : int or timedelta, optional
        The time interval for which the arrival rates are to be calculated.
        If int, assumed to be in minutes. If timedelta, will be converted to minutes.
        Defaults to 60.

    Returns
    -------
    OrderedDict[datetime.time, float]
        A dictionary mapping lagged times (datetime.time objects) to their
        corresponding arrival rates.

    Raises
    ------
    TypeError
        If df is not a DataFrame, lagged_by is not an integer, yta_time_interval is not an integer or timedelta,
        or DataFrame index is not DatetimeIndex.
    ValueError
        If lagged_by is negative or yta_time_interval is not positive.

    Notes
    -----
    The lagged times are calculated by adding the specified number of hours to each
    time in the original arrival rates dictionary.
    """
    # Input validation
    if not isinstance(df, DataFrame):
        raise TypeError("The input 'df' must be a pandas DataFrame.")

    if not isinstance(lagged_by, int):
        raise TypeError("The parameter 'lagged_by' must be an integer.")

    if not isinstance(yta_time_interval, (int, timedelta)):
        raise TypeError(
            "The parameter 'yta_time_interval' must be an integer or timedelta."
        )

    if not isinstance(df.index, pd.DatetimeIndex):
        raise TypeError("The DataFrame index must be a pandas DatetimeIndex.")

    if lagged_by < 0:
        raise ValueError("The parameter 'lagged_by' must be non-negative.")

    # Handle both timedelta and numeric inputs for yta_time_interval
    if isinstance(yta_time_interval, timedelta):
        yta_time_interval_minutes = int(yta_time_interval.total_seconds() / 60)
    elif isinstance(yta_time_interval, int):
        yta_time_interval_minutes = yta_time_interval
    else:
        raise TypeError("yta_time_interval must be a timedelta object or integer")

    if yta_time_interval_minutes <= 0:
        raise ValueError("The parameter 'yta_time_interval' must be positive.")

    # Calculate base arrival rates
    arrival_rates_dict = time_varying_arrival_rates(
        df, yta_time_interval, num_days=num_days
    )

    # Apply lag to the times
    lagged_dict = OrderedDict()
    reference_date = datetime(2000, 1, 1)  # Use arbitrary reference date

    for base_time, rate in arrival_rates_dict.items():
        # Combine with reference date and apply lag
        lagged_datetime = datetime.combine(reference_date, base_time) + timedelta(
            hours=lagged_by
        )
        lagged_dict[lagged_datetime.time()] = rate

    # Sort by lagged times
    return OrderedDict(sorted(lagged_dict.items()))
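
A minimal sketch: a 4-hour lag shifts each rate's key forward by four hours (hypothetical data):

>>> import pandas as pd
>>> from datetime import time
>>> idx = pd.to_datetime(["2024-01-01 09:15", "2024-01-02 09:30"])
>>> lagged = time_varying_arrival_rates_lagged(pd.DataFrame(index=idx), lagged_by=4)
>>> lagged[time(13, 0)]  # the 09:00 rate (1.0 per day here) now appears under 13:00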

unfettered_demand_by_hour(df, x1, y1, x2, y2, yta_time_interval=60, max_hours_since_arrival=10, num_days=None)

Calculate true inpatient demand by hour based on historical arrival data.

This function estimates demand rates using historical arrival data and an aspirational curve for admission probabilities. It takes a DataFrame of historical arrivals and parameters defining an aspirational curve to calculate hourly demand rates.

Parameters:

df : DataFrame, required
    A DataFrame indexed by datetime, representing historical arrival data. The index must be a DatetimeIndex.
x1 : float, required
    First x-coordinate of the aspirational curve.
y1 : float, required
    First y-coordinate of the aspirational curve (0-1).
x2 : float, required
    Second x-coordinate of the aspirational curve.
y2 : float, required
    Second y-coordinate of the aspirational curve (0-1).
yta_time_interval : int or timedelta, default 60
    Time interval for which the arrival rates are to be calculated. If int, assumed to be in minutes. If timedelta, will be converted to minutes.
max_hours_since_arrival : int, default 10
    Maximum hours since arrival to consider.
num_days : int, default None
    The number of days that the DataFrame spans. If not provided, the number of days is calculated from the date of the min and max arrival datetimes.

Returns:

OrderedDict[time, float]
    A dictionary mapping times (datetime.time objects) to their corresponding demand rates.

Raises:

TypeError
    If df is not a DataFrame, coordinates are not floats, or the DataFrame index is not a DatetimeIndex.
ValueError
    If coordinates are outside valid ranges, yta_time_interval is not positive, or it does not divide evenly into 24 hours.

Notes

The function combines historical arrival patterns with admission probabilities to estimate true inpatient demand. The aspirational curve is used to model how admission probabilities change over time.

Source code in src/patientflow/calculate/arrival_rates.py
def unfettered_demand_by_hour(
    df: DataFrame,
    x1: float,
    y1: float,
    x2: float,
    y2: float,
    yta_time_interval: Union[int, timedelta] = 60,
    max_hours_since_arrival: int = 10,
    num_days: Optional[int] = None,
) -> OrderedDict[time, float]:
    """Calculate true inpatient demand by hour based on historical arrival data.

    This function estimates demand rates using historical arrival data and an aspirational
    curve for admission probabilities. It takes a DataFrame of historical arrivals and
    parameters defining an aspirational curve to calculate hourly demand rates.

    Parameters
    ----------
    df : pandas.DataFrame
        A DataFrame indexed by datetime, representing historical arrival data.
        The index must be a DatetimeIndex.
    x1 : float
        First x-coordinate of the aspirational curve.
    y1 : float
        First y-coordinate of the aspirational curve (0-1).
    x2 : float
        Second x-coordinate of the aspirational curve.
    y2 : float
        Second y-coordinate of the aspirational curve (0-1).
    yta_time_interval : int or timedelta, optional
        Time interval for which the arrival rates are to be calculated.
        If int, assumed to be in minutes. If timedelta, will be converted to minutes.
        Defaults to 60.
    max_hours_since_arrival : int, optional
        Maximum hours since arrival to consider. Defaults to 10.
    num_days : int, optional
        The number of days that the DataFrame spans. If not provided, the number of
        days is calculated from the date of the min and max arrival datetimes.

    Returns
    -------
    OrderedDict[datetime.time, float]
        A dictionary mapping times (datetime.time objects) to their corresponding
        demand rates.

    Raises
    ------
    TypeError
        If df is not a DataFrame, coordinates are not floats, or DataFrame index
        is not DatetimeIndex.
    ValueError
        If coordinates are outside valid ranges, yta_time_interval is not positive,
        or doesn't divide evenly into 24 hours.

    Notes
    -----
    The function combines historical arrival patterns with admission probabilities
    to estimate true inpatient demand. The aspirational curve is used to model
    how admission probabilities change over time.
    """
    # Input validation
    if not isinstance(df, DataFrame):
        raise TypeError("The input 'df' must be a pandas DataFrame.")

    if not isinstance(df.index, pd.DatetimeIndex):
        raise TypeError("The DataFrame index must be a pandas DatetimeIndex.")

    if not all(isinstance(x, (int, float)) for x in [x1, y1, x2, y2]):
        raise TypeError("Curve coordinates must be numeric values.")

    if not isinstance(yta_time_interval, (int, timedelta)):
        raise TypeError(
            "The parameter 'yta_time_interval' must be an integer or timedelta."
        )

    if not isinstance(max_hours_since_arrival, int):
        raise TypeError("The parameter 'max_hours_since_arrival' must be an integer.")

    # Handle both timedelta and numeric inputs for yta_time_interval
    if isinstance(yta_time_interval, timedelta):
        yta_time_interval_minutes = int(yta_time_interval.total_seconds() / 60)
    elif isinstance(yta_time_interval, int):
        yta_time_interval_minutes = yta_time_interval
    else:
        raise TypeError("yta_time_interval must be a timedelta object or integer")

    # Validate time interval
    minutes_in_day = 24 * 60
    if yta_time_interval_minutes <= 0:
        raise ValueError("The parameter 'yta_time_interval' must be positive.")
    if minutes_in_day % yta_time_interval_minutes != 0:
        raise ValueError(
            f"Time interval ({yta_time_interval_minutes} minutes) must divide evenly into 24 hours."
        )

    if max_hours_since_arrival <= 0:
        raise ValueError("The parameter 'max_hours_since_arrival' must be positive.")

    if not (0 <= y1 <= 1 and 0 <= y2 <= 1):
        raise ValueError("Y-coordinates must be between 0 and 1.")

    if x1 >= x2:
        raise ValueError("x1 must be less than x2.")

    # Calculate number of intervals in a day
    num_intervals = minutes_in_day // yta_time_interval_minutes

    # Calculate admission probabilities
    hours_since_arrival = np.arange(max_hours_since_arrival + 1)
    _, prob_admission_within_hour = admission_probabilities(
        hours_since_arrival, x1, y1, x2, y2
    )

    # Calculate base arrival rates from historical data
    arrival_rates_dict = time_varying_arrival_rates(
        df, yta_time_interval_minutes, num_days=num_days
    )

    # Convert dict to arrays while preserving order
    hour_keys = list(arrival_rates_dict.keys())
    arrival_rates = np.array([arrival_rates_dict[hour] for hour in hour_keys])

    # Initialize array for weighted arrival rates
    weighted_rates = np.zeros((max_hours_since_arrival, len(arrival_rates)))

    # Calculate weighted arrival rates for each hour and elapsed time
    for hour_idx, _ in enumerate(hour_keys):
        arrival_rate = arrival_rates[hour_idx]
        weighted_rates[:, hour_idx] = (
            arrival_rate * prob_admission_within_hour[:max_hours_since_arrival]
        )

    # Calculate summed demand rates for each hour
    demand_by_hour = OrderedDict()
    elapsed_hours = range(max_hours_since_arrival)

    for hour_idx, hour_key in enumerate(hour_keys):
        demand_by_hour[hour_key] = weighted_arrival_rates(
            weighted_rates, elapsed_hours, hour_idx, num_intervals
        )

    return demand_by_hour
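
A minimal sketch reusing the module-level example coordinates (x1=4, y1=0.8, x2=8, y2=0.95); any DataFrame with a DatetimeIndex of arrivals will do:

>>> import pandas as pd
>>> idx = pd.to_datetime(["2024-01-01 09:15", "2024-01-01 17:40", "2024-01-02 09:30"])
>>> demand = unfettered_demand_by_hour(pd.DataFrame(index=idx), x1=4, y1=0.8, x2=8, y2=0.95)
>>> len(demand)  # one demand rate per hourly interval -> 24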

weighted_arrival_rates(weighted_rates, elapsed_hours, hour_idx, num_intervals)

Calculate sum of weighted arrival rates for a specific time interval.

Parameters:

weighted_rates : ndarray, required
    Array of weighted arrival rates.
elapsed_hours : range, required
    Range of elapsed hours to consider.
hour_idx : int, required
    Current interval index.
num_intervals : int, required
    Total number of intervals in a day.

Returns:

float
    Sum of weighted arrival rates.

Notes

The function calculates the sum of weighted arrival rates by iterating through the elapsed hours and considering the appropriate interval index for each hour.

Source code in src/patientflow/calculate/arrival_rates.py
def weighted_arrival_rates(
    weighted_rates: np.ndarray, elapsed_hours: range, hour_idx: int, num_intervals: int
) -> float:
    """Calculate sum of weighted arrival rates for a specific time interval.

    Parameters
    ----------
    weighted_rates : np.ndarray
        Array of weighted arrival rates.
    elapsed_hours : range
        Range of elapsed hours to consider.
    hour_idx : int
        Current interval index.
    num_intervals : int
        Total number of intervals in a day.

    Returns
    -------
    float
        Sum of weighted arrival rates.

    Notes
    -----
    The function calculates the sum of weighted arrival rates by iterating through
    the elapsed hours and considering the appropriate interval index for each hour.
    """
    total = 0
    for elapsed_hour in elapsed_hours:
        interval_index = (hour_idx - elapsed_hour) % num_intervals
        total += weighted_rates[elapsed_hour][interval_index]
    return total
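
A minimal sketch: with a uniform weighted-rate matrix, the sum over three elapsed hours is simply 3 x 0.1 (hypothetical values):

>>> import numpy as np
>>> weighted = np.full((3, 24), 0.1)  # rows: elapsed hours; columns: 24 hourly intervals
>>> weighted_arrival_rates(weighted, range(3), hour_idx=0, num_intervals=24)
>>> # elapsed hours index backwards into the day modulo num_intervals; result is ~0.3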

survival_curve

calculate_survival_curve(df, start_time_col, end_time_col)

Calculate survival curve data from patient visit data.

Parameters:

df : DataFrame, required
    DataFrame containing patient visit data.
start_time_col : str, required
    Name of the column containing the start time (e.g., arrival_datetime).
end_time_col : str, required
    Name of the column containing the end time (e.g., departure_datetime).

Returns:

DataFrame
    DataFrame with columns:
    - time_hours: Time points in hours
    - survival_probability: Survival probabilities at each time point
    - event_probability: Event probabilities (1 - survival_probability)

Source code in src/patientflow/calculate/survival_curve.py
def calculate_survival_curve(df, start_time_col, end_time_col):
    """Calculate survival curve data from patient visit data.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing patient visit data
    start_time_col : str
        Name of the column containing the start time (e.g., arrival_datetime)
    end_time_col : str
        Name of the column containing the end time (e.g., departure_datetime)

    Returns
    -------
    pandas.DataFrame
        DataFrame with columns:
        - time_hours: Time points in hours
        - survival_probability: Survival probabilities at each time point
        - event_probability: Event probabilities (1 - survival_probability)
    """
    # Calculate the wait time in hours
    df = df.copy()
    df["wait_time_hours"] = (
        df[end_time_col] - df[start_time_col]
    ).dt.total_seconds() / 3600

    # Drop any rows with missing wait times
    df_clean = df.dropna(subset=["wait_time_hours"]).copy()

    # Sort the data by wait time
    df_clean = df_clean.sort_values("wait_time_hours")

    # Calculate the number of patients
    n_patients = len(df_clean)

    # Calculate the survival function manually
    # For each time point, calculate proportion of patients who are still waiting
    unique_times = np.sort(df_clean["wait_time_hours"].unique())
    survival_prob = []

    for t in unique_times:
        # Number of patients who experienced the event after this time point
        n_event_after = sum(df_clean["wait_time_hours"] > t)
        # Proportion of patients still waiting
        survival_prob.append(n_event_after / n_patients)

    # Add zero hours wait time (everyone is waiting at time 0)
    unique_times = np.insert(unique_times, 0, 0)
    survival_prob = np.insert(survival_prob, 0, 1.0)

    # Return structured DataFrame
    return pd.DataFrame(
        {
            "time_hours": unique_times,
            "survival_probability": survival_prob,
            "event_probability": 1 - survival_prob,
        }
    )
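
A minimal sketch with two visits of 4 and 6 hours (hypothetical column names matching the parameters):

>>> import pandas as pd
>>> df = pd.DataFrame({
...     "arrival_datetime": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:00"]),
...     "departure_datetime": pd.to_datetime(["2024-01-01 14:00", "2024-01-01 16:00"]),
... })
>>> curve = calculate_survival_curve(df, "arrival_datetime", "departure_datetime")
>>> # time_hours == [0.0, 4.0, 6.0]; survival_probability == [1.0, 0.5, 0.0]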

errors

Custom exception classes for model loading and validation.

This module defines specialized exceptions raised during model loading and validation.

Classes:

ModelLoadError
    Raised when a model fails to load due to an unspecified error.
MissingKeysError
    Raised when expected keys are missing from a dictionary of special parameters.

MissingKeysError

Bases: ValueError

Exception raised when required keys are missing from special_params.

Parameters:

missing_keys : list or set, required
    The keys that are required but missing from the input dictionary.

Attributes:

missing_keys : list or set
    Stores the missing keys that caused the exception.

Source code in src/patientflow/errors.py
class MissingKeysError(ValueError):
    """
    Exception raised when required keys are missing from special_params.

    Parameters
    ----------
    missing_keys : list or set
        The keys that are required but missing from the input dictionary.

    Attributes
    ----------
    missing_keys : list or set
        Stores the missing keys that caused the exception.
    """

    def __init__(self, missing_keys):
        super().__init__(f"special_params is missing required keys: {missing_keys}")
        self.missing_keys = missing_keys
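
A minimal sketch (the key name shown is hypothetical):

>>> from patientflow.errors import MissingKeysError
>>> try:
...     raise MissingKeysError(["median_wait"])
... except ValueError as exc:  # MissingKeysError subclasses ValueError
...     print(exc)
special_params is missing required keys: ['median_wait']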

ModelLoadError

Bases: Exception

Exception raised when a model fails to load.

This generic exception can be used to signal a failure during the model loading process due to unexpected issues such as file corruption, invalid formats, or unsupported configurations.

Source code in src/patientflow/errors.py
class ModelLoadError(Exception):
    """
    Exception raised when a model fails to load.

    This generic exception can be used to signal a failure during the model
    loading process due to unexpected issues such as file corruption,
    invalid formats, or unsupported configurations.
    """

    pass

evaluate

Patient Flow Evaluation Module

This module provides functions for evaluating and comparing different prediction models for non-clinical outcomes in a healthcare setting. It includes utilities for calculating metrics such as Mean Absolute Error (MAE) and Mean Percentage Error (MPE), as well as functions for predicting admissions based on historical data and combining different prediction models.

Functions:

calculate_results : function
    Calculate evaluation metrics based on expected and observed values.
calc_mae_mpe : function
    Calculate MAE and MPE for probability distribution predictions.
calculate_admission_probs_relative_to_prediction : function
    Calculate admission probabilities for arrivals relative to a prediction time window.
get_arrivals_with_admission_probs : function
    Get arrivals before and after prediction time with their admission probabilities.
calculate_weighted_observed : function
    Calculate actual admissions assuming ED targets are met.
create_time_mask : function
    Create a mask for times before/after a specific hour:minute.
predict_using_previous_weeks : function
    Predict admissions using the average from previous weeks.
evaluate_six_week_average : function
    Evaluate the six-week average prediction model.
combine_distributions : function
    Combine two probability distributions using convolution.
evaluate_combined_model : function
    Evaluate a combined prediction model.

calc_mae_mpe(prob_dist_dict_all, use_most_probable=False)

Calculate MAE and MPE for all prediction times in the given probability distribution dictionary.

Parameters:

prob_dist_dict_all : Dict[Any, Dict[Any, Dict[str, Any]]], required
    Nested dictionary containing probability distributions.
use_most_probable : bool, default False
    Whether to use the most probable value or the mathematical expectation of the distribution.

Returns:

Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]
    Dictionary of results sorted by prediction time, containing:
    - expected : List[Union[int, float]]
        Expected values for each prediction
    - observed : List[float]
        Observed values for each prediction
    - mae : float
        Mean Absolute Error
    - mpe : float
        Mean Percentage Error

Source code in src/patientflow/evaluate.py
def calc_mae_mpe(
    prob_dist_dict_all: Dict[Any, Dict[Any, Dict[str, Any]]],
    use_most_probable: bool = False,
) -> Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]:
    """Calculate MAE and MPE for all prediction times in the given probability distribution dictionary.

    Parameters
    ----------
    prob_dist_dict_all : Dict[Any, Dict[Any, Dict[str, Any]]]
        Nested dictionary containing probability distributions.
    use_most_probable : bool, optional
        Whether to use the most probable value or mathematical expectation of the distribution.
        Default is False.

    Returns
    -------
    Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]
        Dictionary of results sorted by prediction time, containing:
        - expected : List[Union[int, float]]
            Expected values for each prediction
        - observed : List[float]
            Observed values for each prediction
        - mae : float
            Mean Absolute Error
        - mpe : float
            Mean Percentage Error
    """
    # Create temporary results dictionary
    unsorted_results: Dict[Any, Dict[str, Union[List[Union[int, float]], float]]] = {}

    # Process results as before
    for _prediction_time in prob_dist_dict_all.keys():
        expected_values: List[Union[int, float]] = []
        observed_values: List[float] = []

        for dt in prob_dist_dict_all[_prediction_time].keys():
            preds: Dict[str, Any] = prob_dist_dict_all[_prediction_time][dt]

            expected_value: Union[int, float] = (
                int(preds["agg_predicted"].idxmax().values[0])
                if use_most_probable
                else float(
                    np.dot(
                        preds["agg_predicted"].index,
                        preds["agg_predicted"].values.flatten(),
                    )
                )
            )

            observed_value: float = float(preds["agg_observed"])

            expected_values.append(expected_value)
            observed_values.append(observed_value)

        unsorted_results[_prediction_time] = calculate_results(
            expected_values, observed_values
        )

    # Sort results by prediction time
    def get_time_value(key: str) -> int:
        # Extract time from key (e.g., 'admissions_1530' -> 1530)
        time_str = key.split("_")[1]
        return int(time_str)

    # Create sorted dictionary
    sorted_results = dict(
        sorted(unsorted_results.items(), key=lambda x: get_time_value(x[0]))
    )

    return sorted_results
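
A minimal sketch with one model key and one snapshot date; agg_predicted is a one-column DataFrame over admission counts and agg_observed is the observed count (all values hypothetical):

>>> import pandas as pd
>>> dist = pd.DataFrame({"agg_predicted": [0.2, 0.5, 0.3]})  # P(0), P(1), P(2) admissions
>>> prob_dist = {"admissions_0930": {"2024-01-01": {"agg_predicted": dist, "agg_observed": 1}}}
>>> results = calc_mae_mpe(prob_dist)
>>> # expectation = 0*0.2 + 1*0.5 + 2*0.3 = 1.1, so mae is ~0.1 and mpe is ~10.0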

calculate_admission_probs_relative_to_prediction(df, prediction_datetime, prediction_window, x1, y1, x2, y2, is_before=True)

Calculate admission probabilities for arrivals relative to a prediction time window.

Parameters:

df : DataFrame, required
    DataFrame containing arrival_datetime column.
prediction_datetime : datetime, required
    Datetime for prediction window start.
prediction_window : int, required
    Window length in minutes.
x1 : float, required
    First x-coordinate for aspirational curve.
y1 : float, required
    First y-coordinate for aspirational curve.
x2 : float, required
    Second x-coordinate for aspirational curve.
y2 : float, required
    Second y-coordinate for aspirational curve.
is_before : bool, default True
    Boolean indicating if arrivals are before prediction time.

Returns:

DataFrame
    DataFrame with added probability columns:
    - hours_before_pred_window : float
        Hours before prediction window (if is_before=True)
    - hours_after_pred_window : float
        Hours after prediction window (if is_before=False)
    - prob_admission_before_pred_window : float
        Probability of admission before prediction window
    - prob_admission_in_pred_window : float
        Probability of admission within prediction window

Source code in src/patientflow/evaluate.py
def calculate_admission_probs_relative_to_prediction(
    df, prediction_datetime, prediction_window, x1, y1, x2, y2, is_before=True
):
    """Calculate admission probabilities for arrivals relative to a prediction time window.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing arrival_datetime column.
    prediction_datetime : datetime
        Datetime for prediction window start.
    prediction_window : int
        Window length in minutes.
    x1 : float
        First x-coordinate for aspirational curve.
    y1 : float
        First y-coordinate for aspirational curve.
    x2 : float
        Second x-coordinate for aspirational curve.
    y2 : float
        Second y-coordinate for aspirational curve.
    is_before : bool, optional
        Boolean indicating if arrivals are before prediction time.
        Default is True.

    Returns
    -------
    pandas.DataFrame
        DataFrame with added probability columns:
        - hours_before_pred_window : float
            Hours before prediction window (if is_before=True)
        - hours_after_pred_window : float
            Hours after prediction window (if is_before=False)
        - prob_admission_before_pred_window : float
            Probability of admission before prediction window
        - prob_admission_in_pred_window : float
            Probability of admission within prediction window
    """
    result = df.copy()

    if is_before:
        result["hours_before_pred_window"] = result["arrival_datetime"].apply(
            lambda x: (prediction_datetime - x).seconds / 3600
        )
        result["prob_admission_before_pred_window"] = result[
            "hours_before_pred_window"
        ].apply(lambda x: get_y_from_aspirational_curve(x, x1, y1, x2, y2))
        result["prob_admission_in_pred_window"] = result[
            "hours_before_pred_window"
        ].apply(
            lambda x: get_y_from_aspirational_curve(
                x + prediction_window / 60, x1, y1, x2, y2
            )
            - get_y_from_aspirational_curve(x, x1, y1, x2, y2)
        )
    else:
        result["hours_after_pred_window"] = result["arrival_datetime"].apply(
            lambda x: (x - prediction_datetime).seconds / 3600
        )
        result["prob_admission_in_pred_window"] = result[
            "hours_after_pred_window"
        ].apply(
            lambda x: get_y_from_aspirational_curve(
                (prediction_window / 60) - x, x1, y1, x2, y2
            )
        )

    return result
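
A minimal sketch for a single arrival two hours before a 12:00 prediction time with an 8-hour (480-minute) window; the aspirational coordinates are illustrative:

>>> import pandas as pd
>>> from datetime import datetime
>>> df = pd.DataFrame({"arrival_datetime": pd.to_datetime(["2024-01-01 10:00"])})
>>> out = calculate_admission_probs_relative_to_prediction(
...     df, datetime(2024, 1, 1, 12, 0), prediction_window=480,
...     x1=4, y1=0.8, x2=8, y2=0.95, is_before=True)
>>> # out gains hours_before_pred_window (2.0) plus the two probability columns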

calculate_results(expected_values, observed_values)

Calculate evaluation metrics based on expected and observed values.

Parameters:

expected_values : List[Union[int, float]], required
    List of expected values.
observed_values : List[float], required
    List of observed values.

Returns:

Dict[str, Union[List[Union[int, float]], float]]
    Dictionary containing:
    - expected : List[Union[int, float]]
        Original expected values
    - observed : List[float]
        Original observed values
    - mae : float
        Mean Absolute Error
    - mpe : float
        Mean Percentage Error

Source code in src/patientflow/evaluate.py
def calculate_results(
    expected_values: List[Union[int, float]], observed_values: List[float]
) -> Dict[str, Union[List[Union[int, float]], float]]:
    """Calculate evaluation metrics based on expected and observed values.

    Parameters
    ----------
    expected_values : List[Union[int, float]]
        List of expected values.
    observed_values : List[float]
        List of observed values.

    Returns
    -------
    Dict[str, Union[List[Union[int, float]], float]]
        Dictionary containing:
        - expected : List[Union[int, float]]
            Original expected values
        - observed : List[float]
            Original observed values
        - mae : float
            Mean Absolute Error
        - mpe : float
            Mean Percentage Error
    """
    expected_array: np.ndarray = np.array(expected_values)
    observed_array: np.ndarray = np.array(observed_values)

    if len(expected_array) == 0 or len(observed_array) == 0:
        return {
            "expected": expected_values,
            "observed": observed_values,
            "mae": 0.0,
            "mpe": 0.0,
        }

    absolute_errors: np.ndarray = np.abs(expected_array - observed_array)
    mae: float = float(np.mean(absolute_errors)) if len(absolute_errors) > 0 else 0.0

    non_zero_mask: np.ndarray = observed_array != 0
    filtered_absolute_errors: np.ndarray = absolute_errors[non_zero_mask]
    filtered_observed_array: np.ndarray = observed_array[non_zero_mask]

    mpe: float = 0.0
    if len(filtered_absolute_errors) > 0 and len(filtered_observed_array) > 0:
        percentage_errors: np.ndarray = (
            filtered_absolute_errors / filtered_observed_array * 100
        )
        mpe = float(np.mean(percentage_errors))

    return {
        "expected": expected_values,
        "observed": observed_values,
        "mae": mae,
        "mpe": mpe,
    }
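
A minimal sketch (hypothetical values):

>>> res = calculate_results([10, 12], [8, 12])
>>> # absolute errors are [2, 0], so res["mae"] == 1.0
>>> # percentage errors are [25.0, 0.0], so res["mpe"] == 12.5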

calculate_weighted_observed(df, dt, prediction_window, x1, y1, x2, y2, prediction_time)

Calculate weighted observed admissions for a specific date and prediction window.

Parameters:

df : DataFrame, required
    DataFrame with arrival_datetime column.
dt : date, required
    Target date for calculation.
prediction_window : int, required
    Window length in minutes.
x1 : float, required
    First x-coordinate for aspirational curve.
y1 : float, required
    First y-coordinate for aspirational curve.
x2 : float, required
    Second x-coordinate for aspirational curve.
y2 : float, required
    Second y-coordinate for aspirational curve.
prediction_time : tuple, required
    Tuple of (hour, minute) for prediction time.

Returns:

float
    Weighted sum of observed admissions for the specified time period.

Source code in src/patientflow/evaluate.py
def calculate_weighted_observed(
    df, dt, prediction_window, x1, y1, x2, y2, prediction_time
):
    """Calculate weighted observed admissions for a specific date and prediction window.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame with arrival_datetime column.
    dt : datetime.date
        Target date for calculation.
    prediction_window : int
        Window length in minutes.
    x1 : float
        First x-coordinate for aspirational curve.
    y1 : float
        First y-coordinate for aspirational curve.
    x2 : float
        Second x-coordinate for aspirational curve.
    y2 : float
        Second y-coordinate for aspirational curve.
    prediction_time : tuple
        Tuple of (hour, minute) for prediction time.

    Returns
    -------
    float
        Weighted sum of observed admissions for the specified time period.
    """
    # Create prediction datetime
    prediction_datetime = pd.to_datetime(dt).replace(
        hour=prediction_time[0], minute=prediction_time[1]
    )

    # Filter for target date and get arrivals with probabilities
    filtered_df = df[df["arrival_datetime"].dt.date == dt]
    arrived_before, arrived_after = get_arrivals_with_admission_probs(
        filtered_df,
        prediction_datetime,
        prediction_window,
        prediction_time,
        x1,
        y1,
        x2,
        y2,
        target_date=dt,
    )

    # Calculate weighted sum
    weighted_observed = (
        arrived_before["prob_admission_in_pred_window"].sum()
        + arrived_after["prob_admission_in_pred_window"].sum()
    )

    return weighted_observed

combine_distributions(dist1, dist2)

Combine two probability distributions using convolution.

Parameters:

dist1 : DataFrame, required
    First probability distribution.
dist2 : DataFrame, required
    Second probability distribution.

Returns:

DataFrame
    Combined probability distribution with an agg_predicted column of combined probability values.

Source code in src/patientflow/evaluate.py
def combine_distributions(dist1: pd.DataFrame, dist2: pd.DataFrame) -> pd.DataFrame:
    """Combine two probability distributions using convolution.

    Parameters
    ----------
    dist1 : pandas.DataFrame
        First probability distribution.
    dist2 : pandas.DataFrame
        Second probability distribution.

    Returns
    -------
    pandas.DataFrame
        Combined probability distribution with columns:
        - agg_predicted : float
            Combined probability values
    """
    arr1 = dist1.values
    arr2 = dist2.values

    combined = signal.convolve(arr1, arr2)
    new_index = range(len(combined))

    combined_df = pd.DataFrame(combined, index=new_index, columns=["agg_predicted"])
    combined_df["agg_predicted"] = (
        combined_df["agg_predicted"] / combined_df["agg_predicted"].sum()
    )

    return combined_df
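
A minimal sketch: convolving two fair coin-flip distributions over {0, 1} admissions yields {0, 1, 2} with probabilities 0.25/0.5/0.25 (hypothetical inputs):

>>> import pandas as pd
>>> d = pd.DataFrame({"agg_predicted": [0.5, 0.5]})
>>> combined = combine_distributions(d, d)
>>> combined["agg_predicted"].tolist()
[0.25, 0.5, 0.25]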

create_time_mask(df, hour, minute)

Create a mask for times before/after a specific hour:minute.

Parameters:

df : DataFrame, required
    DataFrame containing arrival_datetime column.
hour : int, required
    Target hour (0-23).
minute : int, required
    Target minute (0-59).

Returns:

Series
    Boolean mask indicating times after the specified hour:minute.

Source code in src/patientflow/evaluate.py
def create_time_mask(df, hour, minute):
    """Create a mask for times before/after a specific hour:minute.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing arrival_datetime column.
    hour : int
        Target hour (0-23).
    minute : int
        Target minute (0-59).

    Returns
    -------
    pandas.Series
        Boolean mask indicating times after the specified hour:minute.
    """
    return (df["arrival_datetime"].dt.hour > hour) | (
        (df["arrival_datetime"].dt.hour == hour)
        & (df["arrival_datetime"].dt.minute > minute)
    )
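
A minimal sketch (hypothetical timestamps):

>>> import pandas as pd
>>> df = pd.DataFrame({"arrival_datetime": pd.to_datetime(["2024-01-01 11:59", "2024-01-01 12:01"])})
>>> create_time_mask(df, 12, 0).tolist()
[False, True]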

evaluate_combined_model(prob_dist_dict_all, df, yta_preds, prediction_window, x1, y1, x2, y2, prediction_time, num_weeks, model_name, use_most_probable=True)

Evaluate the combined prediction model.

Parameters:

prob_dist_dict_all : Dict[Any, Dict[Any, Dict[str, Any]]], required
    Nested dictionary containing probability distributions.
df : DataFrame, required
    DataFrame containing patient data.
yta_preds : DataFrame, required
    Yet-to-arrive predictions.
prediction_window : int, required
    Window length in minutes.
x1 : float, required
    First x-coordinate for aspirational curve.
y1 : float, required
    First y-coordinate for aspirational curve.
x2 : float, required
    Second x-coordinate for aspirational curve.
y2 : float, required
    Second y-coordinate for aspirational curve.
prediction_time : Tuple[int, int], required
    Hour and minute of prediction.
num_weeks : int, required
    Number of previous weeks to consider.
model_name : str, required
    Name of the model.
use_most_probable : bool, default True
    Whether to use the most probable value or the expected value.

Returns:

Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]
    Dictionary containing evaluation results:
    - expected : List[Union[int, float]]
        Expected values for each prediction
    - observed : List[float]
        Observed values for each prediction
    - mae : float
        Mean Absolute Error
    - mpe : float
        Mean Percentage Error

Source code in src/patientflow/evaluate.py
def evaluate_combined_model(
    prob_dist_dict_all: Dict[Any, Dict[Any, Dict[str, Any]]],
    df: pd.DataFrame,
    yta_preds: pd.DataFrame,
    prediction_window: int,
    x1: float,
    y1: float,
    x2: float,
    y2: float,
    prediction_time: Tuple[int, int],
    num_weeks: int,
    model_name: str,
    use_most_probable: bool = True,
) -> Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]:
    """Evaluate the combined prediction model.

    Parameters
    ----------
    prob_dist_dict_all : Dict[Any, Dict[Any, Dict[str, Any]]]
        Nested dictionary containing probability distributions.
    df : pandas.DataFrame
        DataFrame containing patient data.
    yta_preds : pandas.DataFrame
        Yet-to-arrive predictions.
    prediction_window : int
        Window length in minutes.
    x1 : float
        First x-coordinate for aspirational curve.
    y1 : float
        First y-coordinate for aspirational curve.
    x2 : float
        Second x-coordinate for aspirational curve.
    y2 : float
        Second y-coordinate for aspirational curve.
    prediction_time : Tuple[int, int]
        Hour and minute of prediction.
    num_weeks : int
        Number of previous weeks to consider.
    model_name : str
        Name of the model.
    use_most_probable : bool, optional
        Whether to use the most probable value or expected value.
        Default is True.

    Returns
    -------
    Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]
        Dictionary containing evaluation results:
        - expected : List[Union[int, float]]
            Expected values for each prediction
        - observed : List[float]
            Observed values for each prediction
        - mae : float
            Mean Absolute Error
        - mpe : float
            Mean Percentage Error
    """
    expected_values: List[Union[int, float]] = []
    observed_values: List[float] = []

    model_name = get_model_key(model_name, prediction_time)

    for dt in prob_dist_dict_all[model_name].keys():
        in_ed_preds: Dict[str, Any] = prob_dist_dict_all[model_name][dt]
        combined = combine_distributions(yta_preds, in_ed_preds["agg_predicted"])

        expected_value: Union[int, float] = (
            int(combined["agg_predicted"].idxmax())
            if use_most_probable
            else float(
                np.dot(
                    combined["agg_predicted"].index,
                    combined["agg_predicted"].values.flatten(),
                )
            )
        )

        observed_value: float = float(
            calculate_weighted_observed(
                df, dt, prediction_window, x1, y1, x2, y2, prediction_time
            )
        )

        expected_values.append(expected_value)
        observed_values.append(observed_value)

    results = {model_name: calculate_results(expected_values, observed_values)}
    return results
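
The use_most_probable flag switches between two point estimates of the combined distribution; a minimal sketch on a hypothetical distribution:

import numpy as np
import pandas as pd

combined = pd.DataFrame({"agg_predicted": [0.1, 0.3, 0.4, 0.2]})  # P(0..3 admissions)
most_probable = int(combined["agg_predicted"].idxmax())  # 2, the mode
expected = float(
    np.dot(combined.index, combined["agg_predicted"].values.flatten())
)  # 1.7, the mean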

evaluate_six_week_average(prob_dist_dict_all, df, prediction_window, x1, y1, x2, y2, prediction_time, num_weeks, model_name)

Evaluate the six-week average prediction model.

Parameters:

Name Type Description Default
prob_dist_dict_all Dict[Any, Dict[Any, Dict[str, Any]]]

Nested dictionary containing probability distributions.

required
df DataFrame

DataFrame containing patient data.

required
prediction_window int

Prediction window in minutes.

required
x1 float

First x-coordinate for aspirational curve.

required
y1 float

First y-coordinate for aspirational curve.

required
x2 float

Second x-coordinate for aspirational curve.

required
y2 float

Second y-coordinate for aspirational curve.

required
prediction_time Tuple[int, int]

Hour and minute of prediction.

required
num_weeks int

Number of previous weeks to consider.

required
model_name str

Name of the model.

required

Returns:

Type Description
Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]

Dictionary containing evaluation results:
- expected : List[Union[int, float]]
    Expected values for each prediction
- observed : List[float]
    Observed values for each prediction

Source code in src/patientflow/evaluate.py
def evaluate_six_week_average(
    prob_dist_dict_all: Dict[Any, Dict[Any, Dict[str, Any]]],
    df: pd.DataFrame,
    prediction_window: int,
    x1: float,
    y1: float,
    x2: float,
    y2: float,
    prediction_time: Tuple[int, int],
    num_weeks: int,
    model_name: str,
) -> Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]:
    """
    Evaluate the six-week average prediction model.

    Parameters
    ----------
    prob_dist_dict_all : Dict[Any, Dict[Any, Dict[str, Any]]]
        Nested dictionary containing probability distributions.
    df : pandas.DataFrame
        DataFrame containing patient data.
    prediction_window : int
        Prediction window in minutes.
    x1 : float
        First x-coordinate for aspirational curve.
    y1 : float
        First y-coordinate for aspirational curve.
    x2 : float
        Second x-coordinate for aspirational curve.
    y2 : float
        Second y-coordinate for aspirational curve.
    prediction_time : Tuple[int, int]
        Hour and minute of prediction.
    num_weeks : int
        Number of previous weeks to consider.
    model_name : str
        Name of the model.

    Returns
    -------
    Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]
        Dictionary containing evaluation results:
        - expected : List[Union[int, float]]
            Expected values for each prediction
        - observed : List[float]
            Observed values for each prediction
    """
    expected_values: List[Union[int, float]] = []
    observed_values: List[float] = []

    model_name = get_model_key(model_name, prediction_time)

    for dt in prob_dist_dict_all[model_name].keys():
        expected_value: float = float(
            predict_using_previous_weeks(
                df, dt, prediction_window, x1, y1, x2, y2, prediction_time, num_weeks
            )
        )
        observed_value: float = float(
            calculate_weighted_observed(
                df, dt, prediction_window, x1, y1, x2, y2, prediction_time
            )
        )

        expected_values.append(expected_value)
        observed_values.append(observed_value)

    results = {model_name: calculate_results(expected_values, observed_values)}
    return results
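
A hedged usage sketch; prob_dist_dict_all and df are assumed to have been prepared as described above, and the aspirational-curve coordinates use the package's configuration defaults (x1=4, y1=0.76, x2=12, y2=0.99; see load_config_file below):

from patientflow.evaluate import evaluate_six_week_average

results = evaluate_six_week_average(
    prob_dist_dict_all,
    df,
    prediction_window=480,
    x1=4, y1=0.76, x2=12, y2=0.99,
    prediction_time=(9, 30),
    num_weeks=6,
    model_name="admissions",  # hypothetical base name; keyed by time via get_model_key
)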

get_arrivals_with_admission_probs(df, prediction_datetime, prediction_window, prediction_time, x1, y1, x2, y2, date_range=None, target_date=None, target_weekday=None)

Get arrivals before and after prediction time with their admission probabilities.

Parameters:

Name Type Description Default
df DataFrame

DataFrame with arrival_datetime column.

required
prediction_datetime datetime

Datetime for prediction window start.

required
prediction_window int

Window length in minutes.

required
prediction_time tuple

Tuple of (hour, minute) for prediction time.

required
x1 float

First x-coordinate for aspirational curve.

required
y1 float

First y-coordinate for aspirational curve.

required
x2 float

Second x-coordinate for aspirational curve.

required
y2 float

Second y-coordinate for aspirational curve.

required
date_range tuple

Optional tuple of (start_date, end_date) to filter data.

None
target_date date

Optional specific date to analyze.

None
target_weekday int

Optional specific weekday to filter for (0-6, where 0 is Monday).

None

Returns:

Type Description
tuple

Tuple of (arrived_before, arrived_after) DataFrames containing:
- arrived_before : pandas.DataFrame
    DataFrame with arrivals before prediction time
- arrived_after : pandas.DataFrame
    DataFrame with arrivals after prediction time

Source code in src/patientflow/evaluate.py
def get_arrivals_with_admission_probs(
    df,
    prediction_datetime,
    prediction_window,
    prediction_time,
    x1,
    y1,
    x2,
    y2,
    date_range=None,
    target_date=None,
    target_weekday=None,
):
    """Get arrivals before and after prediction time with their admission probabilities.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame with arrival_datetime column.
    prediction_datetime : datetime
        Datetime for prediction window start.
    prediction_window : int
        Window length in minutes.
    prediction_time : tuple
        Tuple of (hour, minute) for prediction time.
    x1 : float
        First x-coordinate for aspirational curve.
    y1 : float
        First y-coordinate for aspirational curve.
    x2 : float
        Second x-coordinate for aspirational curve.
    y2 : float
        Second y-coordinate for aspirational curve.
    date_range : tuple, optional
        Optional tuple of (start_date, end_date) to filter data.
    target_date : datetime.date, optional
        Optional specific date to analyze.
    target_weekday : int, optional
        Optional specific weekday to filter for (0-6, where 0 is Monday).

    Returns
    -------
    tuple
        Tuple of (arrived_before, arrived_after) DataFrames containing:
        - arrived_before : pandas.DataFrame
            DataFrame with arrivals before prediction time
        - arrived_after : pandas.DataFrame
            DataFrame with arrivals after prediction time
    """
    hour, minute = prediction_time

    # Create base time masks
    after_mask = create_time_mask(df, hour, minute)
    before_mask = ~after_mask

    # Add date and weekday conditions if specified
    if date_range:
        start_date, end_date = date_range
        date_mask = (df["arrival_datetime"].dt.date >= start_date) & (
            df["arrival_datetime"].dt.date < end_date
        )
        if target_weekday is not None:
            date_mask &= df["arrival_datetime"].dt.weekday == target_weekday

        after_mask &= date_mask
        before_mask &= date_mask

    if target_date:
        target_mask = df["arrival_datetime"].dt.date == target_date
        after_mask &= target_mask
        before_mask &= target_mask

    # Calculate probabilities for filtered groups
    arrived_before = calculate_admission_probs_relative_to_prediction(
        df[before_mask],
        prediction_datetime,
        prediction_window,
        x1,
        y1,
        x2,
        y2,
        is_before=True,
    )

    arrived_after = calculate_admission_probs_relative_to_prediction(
        df[after_mask],
        prediction_datetime,
        prediction_window,
        x1,
        y1,
        x2,
        y2,
        is_before=False,
    )

    return arrived_before, arrived_after
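
A usage sketch for a single target date; the DataFrame, date, and curve coordinates are illustrative assumptions:

from datetime import date, datetime
from patientflow.evaluate import get_arrivals_with_admission_probs

arrived_before, arrived_after = get_arrivals_with_admission_probs(
    df,
    prediction_datetime=datetime(2024, 1, 15, 9, 30),
    prediction_window=480,
    prediction_time=(9, 30),
    x1=4, y1=0.76, x2=12, y2=0.99,
    target_date=date(2024, 1, 15),
)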

predict_using_previous_weeks(df, dt, prediction_window, x1, y1, x2, y2, prediction_time, num_weeks, weighted=True)

Calculate predicted admissions remaining until midnight.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing patient data.

required
dt datetime

Date for prediction.

required
prediction_window int

Window length in minutes.

required
x1 float

First x-coordinate for aspirational curve.

required
y1 float

First y-coordinate for aspirational curve.

required
x2 float

Second x-coordinate for aspirational curve.

required
y2 float

Second y-coordinate for aspirational curve.

required
prediction_time Tuple[int, int]

Hour and minute of prediction.

required
num_weeks int

Number of previous weeks to consider.

required
weighted bool

Whether to weight the numbers according to aspirational ED targets. Default is True.

True

Returns:

Type Description
float

Predicted number of admissions remaining until midnight.

Source code in src/patientflow/evaluate.py
def predict_using_previous_weeks(
    df: pd.DataFrame,
    dt: datetime,
    prediction_window: int,
    x1: float,
    y1: float,
    x2: float,
    y2: float,
    prediction_time: Tuple[int, int],
    num_weeks: int,
    weighted: bool = True,
) -> float:
    """Calculate predicted admissions remaining until midnight.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing patient data.
    dt : datetime
        Date for prediction.
    prediction_window : int
        Window length in minutes.
    x1 : float
        First x-coordinate for aspirational curve.
    y1 : float
        First y-coordinate for aspirational curve.
    x2 : float
        Second x-coordinate for aspirational curve.
    y2 : float
        Second y-coordinate for aspirational curve.
    prediction_time : Tuple[int, int]
        Hour and minute of prediction.
    num_weeks : int
        Number of previous weeks to consider.
    weighted : bool, optional
        Whether to weight the numbers according to aspirational ED targets.
        Default is True.

    Returns
    -------
    float
        Predicted number of admissions remaining until midnight.
    """
    prediction_datetime = pd.to_datetime(dt).replace(
        hour=prediction_time[0], minute=prediction_time[1]
    )
    target_day_of_week = dt.weekday()

    end_date = dt - timedelta(days=1)
    start_date = end_date - timedelta(weeks=num_weeks)

    if weighted:
        # Create mask for historical data
        historical_mask = (
            (df["arrival_datetime"].dt.date >= start_date)
            & (df["arrival_datetime"].dt.date <= end_date)
            & (df["arrival_datetime"].dt.weekday == target_day_of_week)
        )

        # Create explicit copy of filtered data
        historical_data = df[historical_mask].copy()

        # Calculate minutes until midnight
        midnight_times = (
            historical_data["arrival_datetime"].dt.normalize()
            + pd.Timedelta(days=1)
            - pd.Timedelta(minutes=1)
        )
        historical_data.loc[:, "minutes_to_midnight"] = (
            midnight_times - historical_data["arrival_datetime"]
        ).dt.total_seconds() / 60

        # Calculate admission probabilities
        historical_data.loc[:, "admission_probability"] = historical_data[
            "minutes_to_midnight"
        ].apply(lambda x: get_y_from_aspirational_curve(x / 60, x1, y1, x2, y2))

        # Group by date and calculate average
        historical_daily_sums = historical_data.groupby(
            historical_data["arrival_datetime"].dt.date
        )["admission_probability"].sum()
        historical_average = historical_daily_sums.mean()

        # Create mask for today's data
        today_mask = (df["arrival_datetime"].dt.date == dt) & (
            df["arrival_datetime"] < prediction_datetime
        )

        # Create explicit copy of today's filtered data
        today_data = df[today_mask].copy()

        # Calculate minutes until midnight for today's data
        midnight_today = (
            pd.to_datetime(dt).normalize()
            + pd.Timedelta(days=1)
            - pd.Timedelta(minutes=1)
        )
        today_data.loc[:, "minutes_to_midnight"] = (
            midnight_today - today_data["arrival_datetime"]
        ).dt.total_seconds() / 60

        # Calculate admission probabilities for today
        today_data.loc[:, "admission_probability"] = today_data[
            "minutes_to_midnight"
        ].apply(lambda x: get_y_from_aspirational_curve(x / 60, x1, y1, x2, y2))

        today_sum = today_data["admission_probability"].sum()

        still_to_admit = max(historical_average - today_sum, 0)

    else:
        # Original unweighted logic with explicit copies
        historical_mask = (
            (df["arrival_datetime"].dt.date >= start_date)
            & (df["arrival_datetime"].dt.date < end_date)
            & (df["arrival_datetime"].dt.weekday == target_day_of_week)
        )
        historical_df = df[historical_mask].copy()
        average_count = len(historical_df) / num_weeks

        target_mask = (df["arrival_datetime"].dt.date == dt) & (
            df["arrival_datetime"] < prediction_datetime
        )
        target_date_count = len(df[target_mask])

        still_to_admit = max(average_count - target_date_count, 0)

    return still_to_admit
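
In the unweighted case the logic reduces to a same-weekday average minus arrivals already seen: if the previous six same weekdays averaged 40 arrivals across the whole day and 25 patients arrived before the prediction time today, the prediction is max(40 - 25, 0) = 15. A hedged usage sketch (df and the curve coordinates are assumptions):

from datetime import date
from patientflow.evaluate import predict_using_previous_weeks

still_to_admit = predict_using_previous_weeks(
    df,
    dt=date(2024, 1, 15),
    prediction_window=480,
    x1=4, y1=0.76, x2=12, y2=0.99,
    prediction_time=(9, 30),
    num_weeks=6,
)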

generate

Generate fake Emergency Department visit data.

This module provides functions to generate fake datasets for patient visits to an emergency department (ED). It generates arrival and departure times, triage scores, lab orders, and patient admissions. The functions are used for illustrative purposes in some of the notebooks.

Functions:

Name Description
create_fake_finished_visits

Generate synthetic patient visits, triage observations, and lab orders.

create_fake_snapshots

Create patient-level snapshots at specific times with visit, triage, and lab features.

create_fake_finished_visits(start_date, end_date, mean_patients_per_day, admitted_only=False)

Generate synthetic patient visit data for an emergency department.

This function simulates a realistic distribution of patient arrivals, triage scores, lengths of stay, admissions, and lab orders over a specified date range. Some patients may have multiple visits.

Parameters:

Name Type Description Default
start_date str or datetime

The starting date for the simulation (inclusive). Can be a datetime object or a string in 'YYYY-MM-DD' format.

required
end_date str or datetime

The ending date for the simulation (exclusive). Can be a datetime object or a string in 'YYYY-MM-DD' format.

required
mean_patients_per_day float

The average number of patient visits to generate per day.

required
admitted_only bool

If True, only return admitted patients. The mean_patients_per_day will be adjusted to maintain the same total number of admitted patients as would be expected in the full dataset.

False

Returns:

Name Type Description
visits_df DataFrame

DataFrame containing visit records with the following columns:
- 'visit_number'
- 'patient_id'
- 'arrival_datetime'
- 'departure_datetime'
- 'is_admitted'
- 'specialty'
- 'age'

observations_df DataFrame

DataFrame containing triage score observations with columns:
- 'visit_number'
- 'observation_datetime'
- 'triage_score'

lab_orders_df DataFrame

DataFrame containing lab test orders with columns:
- 'visit_number'
- 'order_datetime'
- 'lab_name'

Notes
  • Patients are more likely to arrive during daytime hours.
  • 20% of patients will have more than one visit during the simulation period.
  • Lab test ordering likelihood depends on the severity of the triage score.
  • When admitted_only=True, the mean_patients_per_day is adjusted to maintain the same number of admitted patients as would be expected in the full dataset.
Source code in src/patientflow/generate.py
def create_fake_finished_visits(
    start_date, end_date, mean_patients_per_day, admitted_only=False
):
    """
    Generate synthetic patient visit data for an emergency department.

    This function simulates a realistic distribution of patient arrivals, triage scores, lengths of stay,
    admissions, and lab orders over a specified date range. Some patients may have multiple visits.

    Parameters
    ----------
    start_date : str or datetime
        The starting date for the simulation (inclusive). Can be a datetime object or a string in 'YYYY-MM-DD' format.
    end_date : str or datetime
        The ending date for the simulation (exclusive). Can be a datetime object or a string in 'YYYY-MM-DD' format.
    mean_patients_per_day : float
        The average number of patient visits to generate per day.
    admitted_only : bool, optional
        If True, only return admitted patients. The mean_patients_per_day will be adjusted to maintain
        the same total number of admitted patients as would be expected in the full dataset.

    Returns
    -------
    visits_df : pandas.DataFrame
        DataFrame containing visit records with the following columns:
        - 'visit_number'
        - 'patient_id'
        - 'arrival_datetime'
        - 'departure_datetime'
        - 'is_admitted'
        - 'specialty'
        - 'age'
    observations_df : pandas.DataFrame
        DataFrame containing triage score observations with columns:
        - 'visit_number'
        - 'observation_datetime'
        - 'triage_score'
    lab_orders_df : pandas.DataFrame
        DataFrame containing lab test orders with columns:
        - 'visit_number'
        - 'order_datetime'
        - 'lab_name'

    Notes
    -----
    - Patients are more likely to arrive during daytime hours.
    - 20% of patients will have more than one visit during the simulation period.
    - Lab test ordering likelihood depends on the severity of the triage score.
    - When admitted_only=True, the mean_patients_per_day is adjusted to maintain the same number
      of admitted patients as would be expected in the full dataset.
    """

    # Convert string dates to datetime if needed
    if isinstance(start_date, str):
        start_date = datetime.strptime(start_date, "%Y-%m-%d")
    if isinstance(end_date, str):
        end_date = datetime.strptime(end_date, "%Y-%m-%d")

    # Set random seed for reproducibility
    np.random.seed(42)  # You can change this seed value as needed

    # Define admission probabilities based on triage score
    # Triage 1: 80% admission, Triage 2: 60%, Triage 3: 30%, Triage 4: 10%, Triage 5: 2%
    admission_probabilities = {
        1: 0.80,  # Highest severity - highest admission probability
        2: 0.60,
        3: 0.30,
        4: 0.10,
        5: 0.02,  # Lowest severity - lowest admission probability
    }

    # Define triage score distribution
    # Most common is 3-4, less common are 2 and 5, least common is 1 (most severe)
    triage_probabilities = [0.05, 0.15, 0.35, 0.35, 0.10]  # For scores 1-5

    # Calculate total days in range (changed to exclusive end date)
    days_range = (end_date - start_date).days

    # If admitted_only is True, adjust mean_patients_per_day to maintain the same number of admitted patients
    if admitted_only:
        # Calculate expected admission rate based on triage probabilities and admission probabilities
        expected_admission_rate = sum(
            triage_prob * admission_prob
            for triage_prob, admission_prob in zip(
                triage_probabilities, admission_probabilities.values()
            )
        )
        # Adjust mean_patients_per_day to maintain the same number of admitted patients
        mean_patients_per_day = mean_patients_per_day / expected_admission_rate

    # Generate random number of patients for each day using Poisson distribution
    daily_patients = np.random.poisson(mean_patients_per_day, days_range)

    # Calculate the total number of visits
    total_visits = sum(daily_patients)

    # Calculate approximately how many unique patients we need
    # If 20% of patients have more than one visit (let's assume they have exactly 2),
    # then for N total visits, we need approximately N * 0.8 + (N * 0.2) / 2 unique patients
    # Simplifying: N * (0.8 + 0.1) = N * 0.9 unique patients
    num_unique_patients = int(total_visits * 0.9)

    # Create patient ids
    patient_ids = list(range(1, num_unique_patients + 1))

    # Define common ED lab tests and their ordering probabilities based on triage score
    lab_tests = ["CBC", "BMP", "Troponin", "D-dimer", "Urinalysis"]
    lab_probabilities = {
        # Higher severity -> more likely to get labs
        1: {
            "CBC": 0.95,
            "BMP": 0.95,
            "Troponin": 0.90,
            "D-dimer": 0.70,
            "Urinalysis": 0.60,
        },
        2: {
            "CBC": 0.90,
            "BMP": 0.90,
            "Troponin": 0.80,
            "D-dimer": 0.60,
            "Urinalysis": 0.50,
        },
        3: {
            "CBC": 0.80,
            "BMP": 0.80,
            "Troponin": 0.60,
            "D-dimer": 0.40,
            "Urinalysis": 0.40,
        },
        4: {
            "CBC": 0.60,
            "BMP": 0.60,
            "Troponin": 0.30,
            "D-dimer": 0.20,
            "Urinalysis": 0.30,
        },
        5: {
            "CBC": 0.40,
            "BMP": 0.40,
            "Troponin": 0.15,
            "D-dimer": 0.10,
            "Urinalysis": 0.20,
        },
    }

    visits = []
    observations = []
    lab_orders = []
    visit_number = 1

    # Create a dictionary to track number of visits per patient
    patient_visit_count = {patient_id: 0 for patient_id in patient_ids}

    # Create a pool of patients who will have multiple visits (20% of patients)
    multi_visit_patients = set(
        np.random.choice(
            patient_ids, size=int(num_unique_patients * 0.2), replace=False
        )
    )

    for day_idx, num_patients in enumerate(daily_patients):
        current_date = start_date + timedelta(days=day_idx)

        # Generate patients for this day
        for _ in range(num_patients):
            # Select a patient ID based on our requirements
            # If we haven't assigned all patients yet, use a new one
            # Otherwise, pick from multi-visit patients
            available_new_patients = [
                pid for pid in patient_ids if patient_visit_count[pid] == 0
            ]

            if available_new_patients:
                # Use a new patient
                patient_id = np.random.choice(available_new_patients)
            else:
                # All patients have at least one visit, now use multi-visit patients
                patient_id = np.random.choice(list(multi_visit_patients))

            # Increment the visit count for this patient
            patient_visit_count[patient_id] += 1

            # Random hour for arrival (more likely during daytime)
            arrival_hour = np.random.normal(13, 4)  # Mean at 1 PM, std dev of 4 hours
            arrival_hour = max(0, min(23, int(arrival_hour)))  # Clamp between 0-23

            # Random minutes
            arrival_minute = np.random.randint(0, 60)

            # Create arrival datetime
            arrival_datetime = current_date.replace(
                hour=arrival_hour,
                minute=arrival_minute,
                second=np.random.randint(0, 60),
            )

            # Generate triage score (1-5)
            triage_score = np.random.choice([1, 2, 3, 4, 5], p=triage_probabilities)

            # Generate admission status based on triage score
            admission_prob = admission_probabilities[triage_score]
            is_admitted = np.random.choice(
                [0, 1], p=[1 - admission_prob, admission_prob]
            )

            # Generate specialty for admitted patients
            if is_admitted:
                specialty = np.random.choice(
                    ["medical", "surgical", "haem/onc", "paediatric"],
                    p=[0.65, 0.25, 0.05, 0.05],
                )
            else:
                specialty = None

            # Skip this visit if admitted_only is True and patient is not admitted
            if admitted_only and not is_admitted:
                continue

            # Generate length of stay (in minutes) - log-normal distribution
            # Most visits are 4 to 12 hours, but some can be shorter or longer
            length_of_stay = np.random.lognormal(mean=5.8, sigma=0.5)
            length_of_stay = max(
                60, min(2880, length_of_stay)
            )  # Between 1 hour and 48 hours

            # Make higher triage scores (more severe) stay longer on average
            if triage_score <= 2:
                length_of_stay *= 1.8  # 80% longer stays for more severe cases

            # Calculate departure time
            departure_datetime = arrival_datetime + timedelta(
                minutes=int(length_of_stay)
            )

            # For returning patients, use the same age as their first visit
            if patient_id in [v["patient_id"] for v in visits]:
                # Find the age from a previous visit
                age = next(v["age"] for v in visits if v["patient_id"] == patient_id)
            else:
                # Generate age with a distribution skewed towards older adults
                age = int(
                    np.random.lognormal(mean=3.8, sigma=0.5)
                )  # Centers around 45 years
                age = max(0, min(100, age))  # Clamp between 0-100 years

            # Add visit record (without triage score, but with patient_id)
            visits.append(
                {
                    "patient_id": patient_id,
                    "visit_number": visit_number,
                    "arrival_datetime": arrival_datetime,
                    "departure_datetime": departure_datetime,
                    "age": age,
                    "is_admitted": is_admitted,
                    "specialty": specialty,
                }
            )

            # Generate triage observation within first 10 minutes
            minutes_after_arrival = np.random.uniform(0, 10)
            observation_datetime = arrival_datetime + timedelta(
                minutes=minutes_after_arrival
            )

            observations.append(
                {
                    "visit_number": visit_number,
                    "observation_datetime": observation_datetime,
                    "triage_score": triage_score,
                }
            )

            # Generate lab orders if visit is longer than 2 hours
            if length_of_stay > 120:
                # For each lab test, decide if it should be ordered based on triage score
                for lab_test in lab_tests:
                    if np.random.random() < lab_probabilities[triage_score][lab_test]:
                        # Order time is after triage but within first 90 minutes
                        minutes_after_triage = np.random.uniform(
                            0, 90 - minutes_after_arrival
                        )
                        order_datetime = observation_datetime + timedelta(
                            minutes=minutes_after_triage
                        )

                        lab_orders.append(
                            {
                                "visit_number": visit_number,
                                "order_datetime": order_datetime,
                                "lab_name": lab_test,
                            }
                        )

            visit_number += 1

    # Create DataFrames and sort by time
    visits_df = pd.DataFrame(visits)
    visits_df = visits_df.sort_values("arrival_datetime").reset_index(drop=True)

    observations_df = pd.DataFrame(observations)
    observations_df = observations_df.sort_values("observation_datetime").reset_index(
        drop=True
    )

    lab_orders_df = pd.DataFrame(lab_orders)
    if not lab_orders_df.empty:
        lab_orders_df = lab_orders_df.sort_values("order_datetime").reset_index(
            drop=True
        )

    return visits_df, observations_df, lab_orders_df
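
A usage sketch generating one month of synthetic data (the dates are illustrative):

from patientflow.generate import create_fake_finished_visits

visits_df, observations_df, lab_orders_df = create_fake_finished_visits(
    "2024-01-01", "2024-02-01", mean_patients_per_day=50
)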

create_fake_snapshots(prediction_times, start_date, end_date, df=None, observations_df=None, lab_orders_df=None, mean_patients_per_day=50)

Generate patient-level snapshots at specific times for prediction modeling.

For each specified time on each date in the range, this function returns a snapshot of patients who are currently in the emergency department, along with their visit features, latest triage score, and number of lab tests ordered prior to that time.

Parameters:

Name Type Description Default
prediction_times list of tuple of int

A list of (hour, minute) tuples indicating times of day to create snapshots.

required
start_date str or datetime

The starting date for generating snapshots (inclusive).

required
end_date str or datetime

The ending date for generating snapshots (exclusive).

required
df DataFrame

Patient visit data from create_fake_finished_visits. If None, synthetic data is generated.

None
observations_df DataFrame

Triage score data from create_fake_finished_visits. If None, synthetic data is generated.

None
lab_orders_df DataFrame

Lab order data from create_fake_finished_visits. If None, synthetic data is generated.

None
mean_patients_per_day float

Average number of patients per day (used only if synthetic data is generated).

50

Returns:

Name Type Description
final_df DataFrame

A DataFrame with one row per patient visit present at the snapshot time. Columns include:
- 'snapshot_date'
- 'prediction_time'
- 'patient_id'
- 'visit_number'
- 'is_admitted'
- 'age'
- 'latest_triage_score'
- One column per lab test: 'num_<lab_name>_orders'

Notes
  • Only patients present in the ED at the snapshot time are included.
  • Lab order columns reflect counts of tests ordered before the snapshot time.
  • If no patients are present at a snapshot time, that snapshot is omitted.
Source code in src/patientflow/generate.py
def create_fake_snapshots(
    prediction_times,
    start_date,
    end_date,
    df=None,
    observations_df=None,
    lab_orders_df=None,
    mean_patients_per_day=50,
):
    """
    Generate patient-level snapshots at specific times for prediction modeling.

    For each specified time on each date in the range, this function returns a snapshot of patients
    who are currently in the emergency department, along with their visit features, latest triage score,
    and number of lab tests ordered prior to that time.

    Parameters
    ----------
    prediction_times : list of tuple of int
        A list of (hour, minute) tuples indicating times of day to create snapshots.
    start_date : str or datetime
        The starting date for generating snapshots (inclusive).
    end_date : str or datetime
        The ending date for generating snapshots (exclusive).
    df : pandas.DataFrame, optional
        Patient visit data from `create_fake_finished_visits`. If None, synthetic data is generated.
    observations_df : pandas.DataFrame, optional
        Triage score data from `create_fake_finished_visits`. If None, synthetic data is generated.
    lab_orders_df : pandas.DataFrame, optional
        Lab order data from `create_fake_finished_visits`. If None, synthetic data is generated.
    mean_patients_per_day : float, optional
        Average number of patients per day (used only if synthetic data is generated).

    Returns
    -------
    final_df : pandas.DataFrame
        A DataFrame with one row per patient visit present at the snapshot time. Columns include:
        - 'snapshot_date'
        - 'prediction_time'
        - 'patient_id'
        - 'visit_number'
        - 'is_admitted'
        - 'age'
        - 'latest_triage_score'
        - One column per lab test: 'num_<lab_name>_orders'

    Notes
    -----
    - Only patients present in the ED at the snapshot time are included.
    - Lab order columns reflect counts of tests ordered before the snapshot time.
    - If no patients are present at a snapshot time, that snapshot is omitted.
    """

    # Generate fake data if not provided
    if df is None or observations_df is None or lab_orders_df is None:
        df, observations_df, lab_orders_df = create_fake_finished_visits(
            start_date, end_date, mean_patients_per_day
        )

    # Add date conversion at the start
    if isinstance(start_date, str):
        start_date = datetime.strptime(start_date, "%Y-%m-%d").date()
    elif isinstance(start_date, datetime):
        start_date = start_date.date()

    if isinstance(end_date, str):
        end_date = datetime.strptime(end_date, "%Y-%m-%d").date()
    elif isinstance(end_date, datetime):
        end_date = end_date.date()

    # Create date range (changed to exclusive end date)
    snapshot_dates = []
    current_date = start_date
    while current_date < end_date:  # Changed from <= to <
        snapshot_dates.append(current_date)
        current_date += timedelta(days=1)

    # Get unique lab test names
    lab_tests = lab_orders_df["lab_name"].unique() if not lab_orders_df.empty else []

    # Create empty list to store all results
    all_results = []

    # For each combination of date and time
    for date in snapshot_dates:
        for hour, minute in prediction_times:
            snapshot_datetime = datetime.combine(date, time(hour=hour, minute=minute))

            # Filter dataframe for this snapshot
            mask = (df["arrival_datetime"] <= snapshot_datetime) & (
                df["departure_datetime"] > snapshot_datetime
            )
            snapshot_df = df[mask].copy()  # Create copy to avoid SettingWithCopyWarning

            # Skip if no patients at this time
            if len(snapshot_df) == 0:
                continue

            # Get triage scores recorded before the snapshot time
            valid_observations = observations_df[
                (observations_df["visit_number"].isin(snapshot_df["visit_number"]))
                & (observations_df["observation_datetime"] <= snapshot_datetime)
            ].copy()

            # Keep only the most recent triage score for each visit
            if not valid_observations.empty:
                valid_observations = valid_observations.sort_values(
                    "observation_datetime"
                )
                valid_observations = (
                    valid_observations.groupby("visit_number").last().reset_index()
                )
                valid_observations = valid_observations.rename(
                    columns={"triage_score": "latest_triage_score"}
                )

            # Get lab orders placed before the snapshot time
            valid_orders = lab_orders_df[
                (lab_orders_df["visit_number"].isin(snapshot_df["visit_number"]))
                & (lab_orders_df["order_datetime"] <= snapshot_datetime)
            ].copy()

            # Initialize lab_counts with zeros for all visits in snapshot_df
            lab_counts = pd.DataFrame(
                0,
                index=pd.Index(
                    snapshot_df["visit_number"].unique(), name="visit_number"
                ),
                columns=[f"num_{test.lower()}_orders" for test in lab_tests],
            )

            # Update counts if there are any valid orders
            if not valid_orders.empty:
                order_counts = (
                    valid_orders.groupby(["visit_number", "lab_name"])
                    .size()
                    .unstack(fill_value=0)
                )
                order_counts.columns = [
                    f"num_{test.lower()}_orders" for test in order_counts.columns
                ]
                # Update the counts in lab_counts where we have orders
                lab_counts.update(order_counts)

            lab_counts = lab_counts.reset_index()

            # Add snapshot information columns
            snapshot_df["snapshot_date"] = date
            snapshot_df["prediction_time"] = [(hour, minute)] * len(snapshot_df)

            # Merge with valid observations to get triage scores, handling empty case
            if not valid_observations.empty:
                snapshot_df = pd.merge(
                    snapshot_df,
                    valid_observations[["visit_number", "latest_triage_score"]],
                    on="visit_number",
                    how="left",
                )
            else:
                snapshot_df["latest_triage_score"] = pd.Series(
                    [np.nan] * len(snapshot_df),
                    dtype="float64",
                    index=snapshot_df.index,
                )
            # Merge with lab counts
            snapshot_df = pd.merge(
                snapshot_df, lab_counts, on="visit_number", how="left"
            )

            # Fill NA values in lab count columns with 0
            for col in snapshot_df.columns:
                if col.endswith("_orders"):
                    snapshot_df[col] = snapshot_df[col].fillna(0)
            if not snapshot_df.empty:
                # Optionally check for all-NA in key columns
                snapshot_cols = [
                    "snapshot_date",
                    "prediction_time",
                    "snapshot_datetime",
                ]
                # Only check columns that exist in the DataFrame
                check_cols = [
                    col for col in snapshot_cols if col in snapshot_df.columns
                ]

                if not check_cols or not snapshot_df[check_cols].isna().all().any():
                    all_results.append(snapshot_df)
                else:
                    print(
                        f"Skipping DataFrame with all-NA values in key columns: {check_cols}"
                    )
            else:
                print("Skipping empty DataFrame")

    # Combine all results into single dataframe
    if all_results:
        final_df = pd.concat(all_results, ignore_index=True)

        # Define column order
        snapshot_cols = ["snapshot_date", "prediction_time"]
        visit_cols = [
            "patient_id",
            "visit_number",
            "is_admitted",
            "age",
            "latest_triage_score",
        ]
        lab_cols = [col for col in final_df.columns if col.endswith("_orders")]

        # Ensure all required columns exist
        for col in visit_cols:
            if col not in final_df.columns:
                if col == "latest_triage_score":
                    final_df[col] = pd.NA
                else:
                    final_df[col] = None

        # Reorder columns
        final_df = final_df[snapshot_cols + visit_cols + lab_cols]
    else:
        # Create empty dataframe with correct columns if no results found
        lab_cols = [f"num_{test.lower()}_orders" for test in lab_tests]
        columns = [
            "snapshot_date",
            "prediction_time",
            "visit_number",
            "is_admitted",
            "age",
            "latest_triage_score",
        ] + lab_cols
        final_df = pd.DataFrame(columns=columns)

    # Name the index snapshot_id before returning
    final_df.index.name = "snapshot_id"
    return final_df
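
A usage sketch producing snapshots at two hypothetical prediction times:

from patientflow.generate import create_fake_snapshots

snapshots = create_fake_snapshots(
    prediction_times=[(9, 30), (15, 30)],
    start_date="2024-01-01",
    end_date="2024-02-01",
    mean_patients_per_day=50,
)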

load

This module provides functionality for loading configuration files, data from CSV files, and trained machine learning models.

It includes the following features:

  • Loading Configurations: Parse YAML configuration files and extract necessary parameters for data processing and modeling.
  • Data Handling: Load and preprocess data from CSV files, including optional operations like setting an index, sorting, and applying literal evaluation on columns.
  • Model Management: Load saved machine learning models, customize model filenames based on time, and categorize DataFrame columns into predefined groups for analysis.

The module handles common file and parsing errors, returning appropriate error messages or exceptions.

Functions:

Name Description
parse_args:

Parses command-line arguments for training models.

set_project_root:

Validates project root path from specified environment variable.

load_config_file:

Load a YAML configuration file and extract key parameters.

set_file_paths:

Sets up the file paths based on UCLH-specific or default parameters.

set_data_file_names:

Set file locations based on UCLH-specific or default data sources.

safe_literal_eval:

Safely evaluate string literals into Python objects when loading from csv.

load_data:

Load and preprocess data from a CSV or pickle file.

get_model_key:

Generate a model name based on the time of day.

load_saved_model:

Load a machine learning model saved in a joblib file.

get_dict_cols:

Categorize columns from a DataFrame into predefined groups for analysis.

data_from_csv(csv_path, index_column=None, sort_columns=None, eval_columns=None)

Loads data from a CSV file, with optional transformations. LEGACY!

This function loads a CSV file into a pandas DataFrame and provides the following optional features:
- Setting a specified column as the index.
- Sorting the DataFrame by one or more specified columns.
- Applying safe literal evaluation to specified columns to handle string representations of Python objects.

Parameters:

Name Type Description Default
csv_path str

The relative or absolute path to the CSV file.

required
index_column str

The column to set as the index of the DataFrame. If not provided, no index column is set.

None
sort_columns list of str

A list of columns by which to sort the DataFrame. If not provided, the DataFrame is not sorted.

None
eval_columns list of str

A list of columns to which safe_literal_eval should be applied. This is useful for columns containing string representations of Python data structures (e.g., lists, dictionaries).

None

Returns:

Type Description
DataFrame

A pandas DataFrame containing the loaded data with any specified transformations applied.

Raises:

Type Description
SystemExit

If the file cannot be found or another error occurs during loading or processing.

Notes

The function will terminate the program with a message if the file is not found or if any errors occur while loading the data. If sorting columns or applying safe_literal_eval fails, a warning message is printed, but execution continues.

Source code in src/patientflow/load.py
def data_from_csv(csv_path, index_column=None, sort_columns=None, eval_columns=None):
    """
    Loads data from a CSV file, with optional transformations. LEGACY!

    This function loads a CSV file into a pandas DataFrame and provides the following optional features:
    - Setting a specified column as the index.
    - Sorting the DataFrame by one or more specified columns.
    - Applying safe literal evaluation to specified columns to handle string representations of Python objects.

    Parameters
    ----------
    csv_path : str
        The relative or absolute path to the CSV file.
    index_column : str, optional
        The column to set as the index of the DataFrame. If not provided, no index column is set.
    sort_columns : list of str, optional
        A list of columns by which to sort the DataFrame. If not provided, the DataFrame is not sorted.
    eval_columns : list of str, optional
        A list of columns to which `safe_literal_eval` should be applied. This is useful for columns containing
        string representations of Python data structures (e.g., lists, dictionaries).

    Returns
    -------
    pd.DataFrame
        A pandas DataFrame containing the loaded data with any specified transformations applied.

    Raises
    ------
    SystemExit
        If the file cannot be found or another error occurs during loading or processing.

    Notes
    -----
    The function will terminate the program with a message if the file is not found or if any errors
    occur while loading the data. If sorting columns or applying `safe_literal_eval` fails,
    a warning message is printed, but execution continues.

    """
    path = os.path.join(Path().home(), csv_path)

    if not os.path.exists(path):
        print(f"Data file not found at path: {path}")
        sys.exit(1)

    try:
        df = pd.read_csv(path, parse_dates=True)
    except FileNotFoundError:
        print(f"Data file not found at path: {path}")
        sys.exit(1)
    except Exception as e:
        print(f"Error loading data: {e}")
        sys.exit(1)

    if index_column:
        try:
            if df.index.name != index_column:
                df = df.set_index(index_column)
        except KeyError:
            print(f"Index column '{index_column}' not found in dataframe")

    if sort_columns:
        try:
            df.sort_values(sort_columns, inplace=True)
        except KeyError:
            print("One or more sort columns not found in dataframe")

    if eval_columns:
        for column in eval_columns:
            if column in df.columns:
                try:
                    df[column] = df[column].apply(safe_literal_eval)
                except Exception as e:
                    print(f"Error applying safe_literal_eval to column '{column}': {e}")

    return df
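
A usage sketch; the file path and column names are hypothetical, and the path is resolved relative to the user's home directory:

from patientflow.load import data_from_csv

df = data_from_csv(
    "data/ed_visits.csv",
    index_column="snapshot_id",
    sort_columns=["arrival_datetime"],
    eval_columns=["prediction_time"],
)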

get_dict_cols(df)

Categorize DataFrame columns into predefined groups.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to categorize.

required

Returns:

Type Description
dict

A dictionary where keys are column group names and values are lists of column names in each group.

Source code in src/patientflow/load.py
def get_dict_cols(df):
    """
    Categorize DataFrame columns into predefined groups.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to categorize.

    Returns
    -------
    dict
        A dictionary where keys are column group names and values are lists of column names in each group.
    """
    not_used_in_training_vars = [
        "snapshot_id",
        "snapshot_date",
        "prediction_time",
        "visit_number",
        "training_validation_test",
        "random_number",
    ]
    arrival_and_demographic_vars = [
        "elapsed_los",
        "sex",
        "age_group",
        "age_on_arrival",
        "arrival_method",
    ]
    summary_vars = [
        "num_obs",
        "num_obs_events",
        "num_obs_types",
        "num_lab_batteries_ordered",
    ]

    location_vars = []
    observations_vars = []
    labs_vars = []
    consults_vars = [
        "has_consultation",
        "consultation_sequence",
        "final_sequence",
        "specialty",
    ]
    outcome_vars = ["is_admitted"]

    for col in df.columns:
        if (
            col in not_used_in_training_vars
            or col in arrival_and_demographic_vars
            or col in summary_vars
        ):
            continue
        elif "visited" in col or "location" in col:
            location_vars.append(col)
        elif "num_obs" in col or "latest_obs" in col:
            observations_vars.append(col)
        elif "lab_orders" in col or "latest_lab_results" in col:
            labs_vars.append(col)
        elif col in consults_vars or col in outcome_vars:
            continue  # Already categorized
        else:
            print(f"Column '{col}' did not match any predefined group")

    # Create a list of column groups
    col_group_names = [
        "not used in training",
        "arrival and demographic",
        "summary",
        "location",
        "observations",
        "lab orders and results",
        "consults",
        "outcome",
    ]

    # Create a list of the column names within those groups
    col_groups = [
        not_used_in_training_vars,
        arrival_and_demographic_vars,
        summary_vars,
        location_vars,
        observations_vars,
        labs_vars,
        consults_vars,
        outcome_vars,
    ]

    # Use dictionary to combine them
    dict_col_groups = {
        category: var_list for category, var_list in zip(col_group_names, col_groups)
    }

    return dict_col_groups
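
For example, to inspect how a snapshot DataFrame's columns have been grouped:

from patientflow.load import get_dict_cols

dict_col_groups = get_dict_cols(df)  # df: a snapshots DataFrame as above
for group, cols in dict_col_groups.items():
    print(f"{group}: {len(cols)} columns")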

get_model_key(model_name, prediction_time)

Create a model name based on the time of day.

Parameters:

Name Type Description Default
model_name str

The base name of the model.

required
prediction_time tuple of int

A tuple representing the time of day (hour, minute).

required

Returns:

Type Description
str

A string representing the model name based on the time of day.

Source code in src/patientflow/load.py
def get_model_key(model_name, prediction_time):
    """
    Create a model name based on the time of day.

    Parameters
    ----------
    model_name : str
        The base name of the model.
    prediction_time : tuple of int
        A tuple representing the time of day (hour, minute).

    Returns
    -------
    str
        A string representing the model name based on the time of day.
    """

    hour_, min_ = prediction_time
    # Zero-pad hour and minute so that, e.g., (9, 5) yields "0905" rather than "095"
    model_name = f"{model_name}_{hour_:02}{min_:02}"
    return model_name
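
For example, with a hypothetical base name:

from patientflow.load import get_model_key

get_model_key("admissions", (9, 30))  # "admissions_0930"
get_model_key("admissions", (15, 0))  # "admissions_1500"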

load_config_file(config_file_path, return_start_end_dates=False)

Load configuration from a YAML file.

Parameters:

Name Type Description Default
config_file_path str

The path to the configuration file.

required
return_start_end_dates bool

If True, return only the start and end dates from the file (default is False).

False

Returns:

Type Description
dict or tuple or None

If return_start_end_dates is True, returns a tuple of start and end dates (str). Otherwise, returns a dictionary containing the configuration parameters. Returns None if an error occurs during file reading or parsing.

Source code in src/patientflow/load.py
def load_config_file(
    config_file_path: str, return_start_end_dates: bool = False
) -> Optional[Union[Dict[str, Any], Tuple[str, str]]]:
    """
    Load configuration from a YAML file.

    Parameters
    ----------
    config_file_path : str
        The path to the configuration file.
    return_start_end_dates : bool, optional
        If True, return only the start and end dates from the file (default is False).

    Returns
    -------
    dict or tuple or None
        If `return_start_end_dates` is True, returns a tuple of start and end dates (str).
        Otherwise, returns a dictionary containing the configuration parameters.
        Returns None if an error occurs during file reading or parsing.
    """
    try:
        with open(config_file_path, "r") as file:
            config = yaml.safe_load(file)
    except FileNotFoundError:
        print(f"Error: The file '{config_file_path}' was not found.")
        return None
    except yaml.YAMLError as e:
        print(f"Error parsing YAML file: {e}")
        return None

    try:
        if return_start_end_dates:
            # load the dates used in saved data for uclh versions
            if "file_dates" in config and config["file_dates"]:
                start_date, end_date = [str(item) for item in config["file_dates"]]
                return (start_date, end_date)
            else:
                print(
                    "Error: 'file_dates' key not found or empty in the configuration file."
                )
                return None

        params: Dict[str, Any] = {}

        if "prediction_times" in config:
            params["prediction_times"] = [
                tuple(item) for item in config["prediction_times"]
            ]
        else:
            print("Error: 'prediction_times' key not found in the configuration file.")
            sys.exit(1)

        if "modelling_dates" in config and len(config["modelling_dates"]) == 4:
            (
                params["start_training_set"],
                params["start_validation_set"],
                params["start_test_set"],
                params["end_test_set"],
            ) = [item for item in config["modelling_dates"]]
        else:
            print(
                f"Error: expecting 4 modelling dates and only got {len(config.get('modelling_dates', []))}"
            )
            return None

        params["x1"] = float(config.get("x1", 4))
        params["y1"] = float(config.get("y1", 0.76))
        params["x2"] = float(config.get("x2", 12))
        params["y2"] = float(config.get("y2", 0.99))
        params["prediction_window"] = config.get("prediction_window", 480)
        params["epsilon"] = config.get("epsilon", 10**-7)
        params["yta_time_interval"] = config.get("yta_time_interval", 15)

        return params

    except KeyError as e:
        print(f"Error: Missing key in the configuration file: {e}")
        return None
    except ValueError as e:
        print(f"Error: Invalid value found in the configuration file: {e}")
        return None
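
The following sketch writes a small illustrative configuration file and loads it; every value is an example only, not a default shipped with the package:

from pathlib import Path

config_text = """
prediction_times:
  - [6, 0]
  - [9, 30]
modelling_dates:
  - 2031-01-01
  - 2031-04-01
  - 2031-06-01
  - 2031-09-01
x1: 4
y1: 0.76
x2: 12
y2: 0.99
prediction_window: 480
"""
Path("config.yaml").write_text(config_text)

params = load_config_file("config.yaml")
# params["prediction_times"] == [(6, 0), (9, 30)]
# params["prediction_window"] == 480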

load_data(data_file_path, file_name, index_column=None, sort_columns=None, eval_columns=None, home_path=None, encoding=None)

Loads data from CSV or pickle file with optional transformations.

Parameters:

Name Type Description Default
data_file_path str

Directory path containing the data file

required
file_name str

Name of the CSV or pickle file to load

required
index_column str

Column to set as DataFrame index

None
sort_columns list of str

Columns to sort DataFrame by

None
eval_columns list of str

Columns to apply safe_literal_eval to

None
home_path str or Path

Base path to use instead of user's home directory

None
encoding str

The encoding to use when reading CSV files (e.g., 'utf-8', 'latin1')

None

Returns:

Type Description
DataFrame

Loaded and transformed DataFrame

Raises:

Type Description
FileNotFoundError

If the specified file does not exist

ValueError

If the file format is not supported or other processing errors occur

Source code in src/patientflow/load.py
def load_data(
    data_file_path,
    file_name,
    index_column=None,
    sort_columns=None,
    eval_columns=None,
    home_path=None,
    encoding=None,
):
    """
    Loads data from CSV or pickle file with optional transformations.

    Parameters
    ----------
    data_file_path : str
        Directory path containing the data file
    file_name : str
        Name of the CSV or pickle file to load
    index_column : str, optional
        Column to set as DataFrame index
    sort_columns : list of str, optional
        Columns to sort DataFrame by
    eval_columns : list of str, optional
        Columns to apply safe_literal_eval to
    home_path : str or Path, optional
        Base path to use instead of user's home directory
    encoding : str, optional
        The encoding to use when reading CSV files (e.g., 'utf-8', 'latin1')

    Returns
    -------
    pd.DataFrame
        Loaded and transformed DataFrame

    Raises
    ------
    FileNotFoundError
        If the specified file does not exist
    ValueError
        If the file format is not supported or other processing errors occur
    """
    from pathlib import Path

    # Use provided home_path if available, otherwise default to user's home directory
    base_path = Path(home_path) if home_path else Path.home()
    path = base_path / data_file_path / file_name

    if not path.exists():
        raise FileNotFoundError(f"Data file not found at path: {path}")

    try:
        if path.suffix.lower() == ".csv":
            df = pd.read_csv(path, parse_dates=True, encoding=encoding)
        elif path.suffix.lower() == ".pkl":
            df = pd.read_pickle(path)
        else:
            raise ValueError(
                f"Unsupported file format: {path.suffix}. Must be .csv or .pkl"
            )
    except Exception as e:
        raise ValueError(f"Error loading data: {str(e)}")

    if index_column and df.index.name != index_column:
        try:
            df = df.set_index(index_column)
        except KeyError:
            print(f"Warning: Index column '{index_column}' not found in dataframe")

    if sort_columns:
        try:
            df.sort_values(sort_columns, inplace=True)
        except KeyError:
            print("Warning: One or more sort columns not found in dataframe")

    if eval_columns:
        for column in eval_columns:
            if column in df.columns:
                try:
                    df[column] = df[column].apply(safe_literal_eval)
                except Exception as e:
                    print(
                        f"Warning: Error applying safe_literal_eval to column '{column}': {str(e)}"
                    )

    return df
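
A hedged usage sketch; the folder, file and column names below are placeholders for whatever your extract actually contains, and the file is assumed to exist at that location:

df = load_data(
    data_file_path="patientflow/data-synthetic",  # resolved relative to the home directory
    file_name="ed_visits.csv",
    index_column="snapshot_id",
    sort_columns=["visit_number", "snapshot_date"],
    eval_columns=["consultation_sequence"],
)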

parse_args()

Parse command-line arguments for the training script.

Returns:
    argparse.Namespace: The parsed arguments containing 'data_folder_name' and 'uclh' attributes.

Source code in src/patientflow/load.py
def parse_args() -> argparse.Namespace:
    """
    Parse command-line arguments for the training script.

    Returns:
        argparse.Namespace: The parsed arguments containing 'data_folder_name' and 'uclh' keys.
    """
    parser = argparse.ArgumentParser(description="Train emergency demand models")
    parser.add_argument(
        "--data_folder_name",
        type=str,
        default="data-synthetic",
        help="Location of data for training",
    )
    parser.add_argument(
        "--uclh",
        type=lambda x: x.lower() in ["true", "1", "yes", "y"],
        default=False,
        help="Train using UCLH data (True) or Public data (False)",
    )
    args = parser.parse_args()
    return args
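
For illustration, assuming the function is called from a training script named train.py (a hypothetical name), the flags parse as follows:

# python train.py --data_folder_name data-synthetic          -> uclh=False
# python train.py --data_folder_name data-uclh --uclh yes    -> uclh=True
args = parse_args()
print(args.data_folder_name, args.uclh)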

safe_literal_eval(s)

Safely evaluate a string literal into a Python object. Handles list-like strings by converting them to lists.

Parameters:

Name Type Description Default
s str

The string to evaluate.

required

Returns:

Type Description
Any, list, or None

The evaluated Python object if successful, a list if the input is list-like, or None for empty/null values.

Source code in src/patientflow/load.py
def safe_literal_eval(s):
    """
    Safely evaluate a string literal into a Python object.
    Handles list-like strings by converting them to lists.

    Parameters
    ----------
    s : str
        The string to evaluate.

    Returns
    -------
    Any, list, or None
        The evaluated Python object if successful, a list if the input is list-like,
        or None for empty/null values.
    """
    if pd.isna(s) or str(s).strip().lower() in ["nan", "none", ""]:
        return None

    if isinstance(s, str):
        s = s.strip()
        if s.startswith("[") and s.endswith("]"):
            try:
                # Remove square brackets and split by comma
                items = s[1:-1].split(",")
                # Strip whitespace from each item and remove empty strings
                return [item.strip() for item in items if item.strip()]
            except Exception:
                # If the above fails, fall back to ast.literal_eval
                pass

    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        # If ast.literal_eval fails, return the original string
        return s
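
A few illustrative inputs and the values this returns:

safe_literal_eval("[acute, paeds]")  # ['acute', 'paeds']
safe_literal_eval("('x', 'y')")      # ('x', 'y')
safe_literal_eval("nan")             # None
safe_literal_eval("surgical")        # 'surgical' (falls back to the original string)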

set_data_file_names(uclh, data_file_path, config_file_path=None)

Set file locations based on UCLH or default data source.

Parameters:

Name Type Description Default
uclh bool

If True, use UCLH-specific file locations. If False, use default file locations.

required
data_file_path Path

The base path to the data directory.

required
config_file_path str

The path to the configuration file, required if uclh is True.

None

Returns:

Type Description
tuple

Paths to the required files (visits, arrivals) based on the configuration.

Source code in src/patientflow/load.py
def set_data_file_names(uclh, data_file_path, config_file_path=None):
    """
    Set file locations based on UCLH or default data source.

    Parameters
    ----------
    uclh : bool
        If True, use UCLH-specific file locations. If False, use default file locations.
    data_file_path : Path
        The base path to the data directory.
    config_file_path : str, optional
        The path to the configuration file, required if `uclh` is True.

    Returns
    -------
    tuple
        Paths to the required files (visits, arrivals) based on the configuration.
    """
    if not isinstance(data_file_path, Path):
        data_file_path = Path(data_file_path)

    if not uclh:
        csv_filename = "ed_visits.csv"
        yta_csv_filename = "inpatient_arrivals.csv"

        visits_csv_path = data_file_path / csv_filename
        yta_csv_path = data_file_path / yta_csv_filename

        return visits_csv_path, yta_csv_path

    else:
        start_date, end_date = load_config_file(
            config_file_path, return_start_end_dates=True
        )
        data_filename = (
            "uclh_visits_exc_beds_inc_minority_"
            + str(start_date)
            + "_"
            + str(end_date)
            + ".pickle"
        )
        csv_filename = "uclh_ed_visits.csv"
        yta_filename = (
            "uclh_yet_to_arrive_" + str(start_date) + "_" + str(end_date) + ".pickle"
        )
        yta_csv_filename = "uclh_inpatient_arrivals.csv"

        visits_path = data_file_path / data_filename
        yta_path = data_file_path / yta_filename

        visits_csv_path = data_file_path / csv_filename
        yta_csv_path = data_file_path / yta_csv_filename

    return visits_path, visits_csv_path, yta_path, yta_csv_path
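
A short sketch of both branches; the folder names are placeholders:

from pathlib import Path

# Default (public/synthetic) data: two CSV paths are returned
visits_csv_path, yta_csv_path = set_data_file_names(
    uclh=False, data_file_path=Path("data-synthetic")
)

# UCLH data additionally needs the config file for the saved date range:
# visits_path, visits_csv_path, yta_path, yta_csv_path = set_data_file_names(
#     uclh=True, data_file_path=Path("data-uclh"), config_file_path="config.yaml"
# )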

set_file_paths(project_root, data_folder_name, train_dttm=None, inference_time=False, config_file='config.yaml', prefix=None, verbose=True)

Sets up the file paths

Args:
    project_root (Path): Root path of the project
    data_folder_name (str): Name of the folder where data files are located
    train_dttm (Optional[str], optional): A string representation of the datetime at which training commenced. Defaults to None
    inference_time (bool, optional): A flag indicating whether it is inference time or not. Defaults to False
    config_file (str, optional): Name of config file. Defaults to "config.yaml"
    prefix (Optional[str], optional): String to prefix model folder names. Defaults to None
    verbose (bool, optional): Whether to print path information. Defaults to True

Returns:
    tuple: Contains (data_file_path, media_file_path, model_file_path, config_path)

Source code in src/patientflow/load.py
def set_file_paths(
    project_root: Path,
    data_folder_name: str,
    train_dttm: Optional[str] = None,
    inference_time: bool = False,
    config_file: str = "config.yaml",
    prefix: Optional[str] = None,
    verbose: bool = True,
) -> Tuple[Path, Path, Path, Path]:
    """
    Sets up the file paths

    Args:
        project_root (Path): Root path of the project
        data_folder_name (str): Name of the folder where data files are located
        train_dttm (Optional[str], optional): A string representation of the datetime at which training commenced. Defaults to None
        inference_time (bool, optional): A flag indicating whether it is inference time or not. Defaults to False
        config_file (str, optional): Name of config file. Defaults to "config.yaml"
        prefix (Optional[str], optional): String to prefix model folder names. Defaults to None
        verbose (bool, optional): Whether to print path information. Defaults to True

    Returns:
        tuple: Contains (data_file_path, media_file_path, model_file_path, config_path)
    """

    config_path = Path(project_root) / config_file
    if verbose:
        print(f"Configuration will be loaded from: {config_path}")

    data_file_path = Path(project_root) / data_folder_name
    if verbose:
        print(f"Data files will be loaded from: {data_file_path}")

    model_id = data_folder_name.removeprefix("data-")  # strip the literal 'data-' prefix
    if prefix:
        model_id = f"{prefix}_{model_id}"
    if train_dttm:
        model_id = f"{model_id}_{train_dttm}"

    model_file_path = Path(project_root) / "trained-models" / model_id
    media_file_path = model_file_path / "media"

    if not inference_time:
        if verbose:
            print(f"Trained models will be saved to: {model_file_path}")
        model_file_path.mkdir(parents=True, exist_ok=True)
        (model_file_path / "model-output").mkdir(parents=False, exist_ok=True)
        media_file_path.mkdir(parents=False, exist_ok=True)
        if verbose:
            print(f"Images will be saved to: {media_file_path}")

    return data_file_path, media_file_path, model_file_path, config_path
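
A minimal sketch of the call; the root and folder names are stand-ins, and inference_time=True is used here so that no folders are created:

from pathlib import Path

data_file_path, media_file_path, model_file_path, config_path = set_file_paths(
    project_root=Path.cwd(),      # stand-in for the real project root
    data_folder_name="data-synthetic",
    prefix="demo",
    inference_time=True,          # skip creating the model/media folders in this sketch
)
# model_file_path ends in trained-models/demo_synthetic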

set_project_root(env_var=None)

Sets project root path from environment variable or infers it from current path.

First checks specified environment variable for project root path. If not found, searches current path hierarchy for highest-level 'patientflow' directory.

Args:
    env_var (Optional[str]): Name of environment variable containing project root path

Returns:
    Path: Validated project root path

Raises:
    ValueError: If environment variable not set and 'patientflow' not found in path
    NotADirectoryError: If path doesn't exist
    TypeError: If env_var is not None and not a string

Source code in src/patientflow/load.py
def set_project_root(env_var: Optional[str] = None) -> Path:
    """
    Sets project root path from environment variable or infers it from current path.

    First checks specified environment variable for project root path.
    If not found, searches current path hierarchy for highest-level 'patientflow' directory.

    Args:
        env_var (Optional[str]): Name of environment variable containing project root path

    Returns:
        Path: Validated project root path

    Raises:
        ValueError: If environment variable not set and 'patientflow' not found in path
        NotADirectoryError: If path doesn't exist
        TypeError: If env_var is not None and not a string
    """
    # Only try to get env path if env_var is provided
    env_path: Optional[str] = os.getenv(env_var) if env_var is not None else None
    project_root: Optional[Path] = None

    # Try getting from environment variable first
    if env_path is not None:
        try:
            project_root = Path(env_path)
            if not project_root.is_dir():
                raise NotADirectoryError(f"Path does not exist: {project_root}")
            print(f"Project root from environment: {project_root}")
            return project_root
        except (TypeError, ValueError) as e:
            print(f"Error converting {env_path} to Path: {e}")
            raise
    else:
        # If not in env var, try to infer from current path
        current: Path = Path().absolute()

        # Search through parents to find highest-level 'patientflow' directory
        for parent in [current, *current.parents]:
            if parent.name == "patientflow" and parent.is_dir():
                project_root = parent
                # Continue searching to find highest level

        if project_root:
            print(f"Inferred project root: {project_root}")
            return project_root

        print(
            f"Could not find project root - {env_var} not set and 'patientflow' not found in path"
        )
        print(f"\nCurrent directory: {Path().absolute()}")
        if env_var:
            print(f"\nRun one of these commands in a new cell to set {env_var}:")
            print("# Linux/Mac:")
            print(f"%env {env_var}=/path/to/project")
            print("\n# Windows:")
            print(f"%env {env_var}=C:\\path\\to\\project")
        raise ValueError("Project root not found")
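
A minimal sketch; the environment variable name is arbitrary, and any existing directory satisfies the check:

import os

os.environ["PATIENTFLOW_ROOT"] = os.getcwd()  # any existing directory works here
project_root = set_project_root(env_var="PATIENTFLOW_ROOT")
# With no environment variable set, the current working directory must sit
# somewhere beneath a folder named 'patientflow' for inference to succeed.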

model_artifacts

Model training results containers.

This module defines a set of data classes to organise results from model training, including hyperparameter tuning, cross-validation fold metrics, and final trained classifier artifacts. These classes serve as structured containers for various types of model evaluation outputs and metadata.

Classes:

Name Description
HyperParameterTrial

Container for storing hyperparameter tuning trial results.

FoldResults

Stores evaluation metrics from a single cross-validation fold.

TrainingResults

Encapsulates comprehensive evaluation metrics and metadata from model training.

TrainedClassifier

Container for a trained model and associated training results.

FoldResults dataclass

Store evaluation metrics for a single fold.

Attributes:

Name Type Description
auc float

Area Under the ROC Curve (AUC) for this fold.

logloss float

Logarithmic loss (cross-entropy loss) for this fold.

auprc float

Area Under the Precision-Recall Curve (AUPRC) for this fold.

Source code in src/patientflow/model_artifacts.py
@dataclass
class FoldResults:
    """
    Store evaluation metrics for a single fold.

    Attributes
    ----------
    auc : float
        Area Under the ROC Curve (AUC) for this fold.
    logloss : float
        Logarithmic loss (cross-entropy loss) for this fold.
    auprc : float
        Area Under the Precision-Recall Curve (AUPRC) for this fold.
    """

    auc: float
    logloss: float
    auprc: float

HyperParameterTrial dataclass

Container for a single hyperparameter tuning trial.

Attributes:

Name Type Description
parameters dict of str to Any

Dictionary of hyperparameters used in the trial.

cv_results dict of str to float

Cross-validation metrics obtained using the specified parameters.

Source code in src/patientflow/model_artifacts.py
@dataclass
class HyperParameterTrial:
    """
    Container for a single hyperparameter tuning trial.

    Attributes
    ----------
    parameters : dict of str to Any
        Dictionary of hyperparameters used in the trial.
    cv_results : dict of str to float
        Cross-validation metrics obtained using the specified parameters.
    """

    parameters: Dict[str, Any]
    cv_results: Dict[str, float]

TrainedClassifier dataclass

Container for trained model artifacts and their associated information.

Attributes:

Name Type Description
training_results TrainingResults

Evaluation metrics and training metadata for the classifier.

pipeline (Pipeline or None, optional)

The scikit-learn pipeline representing the trained classifier.

calibrated_pipeline (Pipeline or None, optional)

The calibrated version of the pipeline, if model calibration was performed.

Source code in src/patientflow/model_artifacts.py
@dataclass
class TrainedClassifier:
    """
    Container for trained model artifacts and their associated information.

    Attributes
    ----------
    training_results : TrainingResults
        Evaluation metrics and training metadata for the classifier.
    pipeline : sklearn.pipeline.Pipeline or None, optional
        The scikit-learn pipeline representing the trained classifier.
    calibrated_pipeline : sklearn.pipeline.Pipeline or None, optional
        The calibrated version of the pipeline, if model calibration was performed.
    """

    training_results: TrainingResults
    pipeline: Optional[Pipeline] = None
    calibrated_pipeline: Optional[Pipeline] = None

TrainingResults dataclass

Store comprehensive evaluation metrics and metadata from model training.

Attributes:

Name Type Description
prediction_time tuple of int

The time of day (hour, minute) at which the prediction is made.

training_info dict of str to Any, optional

Metadata or logs collected during training.

calibration_info dict of str to Any, optional

Information about model calibration, if applicable.

test_results dict of str to float, optional

Evaluation metrics computed on the test dataset. None if test evaluation was not performed.

balance_info dict of str to bool or int or float, optional

Information related to class balance (e.g., whether data was balanced, class ratios).

Source code in src/patientflow/model_artifacts.py
@dataclass
class TrainingResults:
    """
    Store comprehensive evaluation metrics and metadata from model training.

    Attributes
    ----------
    prediction_time : tuple of int
        The time of day (hour, minute) at which the prediction is made.
    training_info : dict of str to Any, optional
        Metadata or logs collected during training.
    calibration_info : dict of str to Any, optional
        Information about model calibration, if applicable.
    test_results : dict of str to float, optional
        Evaluation metrics computed on the test dataset. None if test evaluation was not performed.
    balance_info : dict of str to bool or int or float, optional
        Information related to class balance (e.g., whether data was balanced, class ratios).
    """

    prediction_time: Tuple[int, int]
    training_info: Dict[str, Any] = field(default_factory=dict)
    calibration_info: Dict[str, Any] = field(default_factory=dict)
    test_results: Optional[Dict[str, float]] = None
    balance_info: Dict[str, Union[bool, int, float]] = field(default_factory=dict)
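
How the containers fit together; all metric values below are illustrative:

fold = FoldResults(auc=0.80, logloss=0.36, auprc=0.62)

results = TrainingResults(
    prediction_time=(9, 30),  # (hour, minute) of the prediction moment
    test_results={"auc": 0.81, "logloss": 0.35, "auprc": 0.64},
)
model = TrainedClassifier(training_results=results)  # pipeline attached after training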

predict

Prediction module for patient flow forecasting.

This module provides functions for making predictions about future patient flow, including emergency demand forecasting and other predictive analytics.

emergency_demand

Emergency demand prediction module.

This module provides functionality for predicting emergency department demand, including specialty-specific predictions for both current patients and yet-to-arrive patients. It handles probability calculations, model predictions, and threshold-based resource estimation.

The module integrates multiple prediction models:

- Admission prediction classifier
- Specialty sequence predictor
- Yet-to-arrive weighted Poisson predictor

Functions:

Name Description
add_missing_columns : function

Add missing columns required by the prediction pipeline

find_probability_threshold_index : function

Find index where cumulative probability exceeds threshold

get_specialty_probs : function

Calculate specialty probability distributions

create_predictions : function

Create predictions for emergency demand

add_missing_columns(pipeline, df)

Add missing columns required by the prediction pipeline from the training data.

Parameters:

Name Type Description Default
pipeline Pipeline

The trained pipeline containing the feature transformer

required
df DataFrame

Input dataframe that may be missing required columns

required

Returns:

Type Description
DataFrame

DataFrame with missing columns added and filled with appropriate default values

Notes

Adds columns with default values based on column name patterns:

- lab_orders_, visited_, has_ : False
- num_, total_ : 0
- latest_ : pd.NA
- arrival_method : "None"
- others : pd.NA

Source code in src/patientflow/predict/emergency_demand.py
def add_missing_columns(pipeline, df):
    """Add missing columns required by the prediction pipeline from the training data.

    Parameters
    ----------
    pipeline : sklearn.pipeline.Pipeline
        The trained pipeline containing the feature transformer
    df : pandas.DataFrame
        Input dataframe that may be missing required columns

    Returns
    -------
    pandas.DataFrame
        DataFrame with missing columns added and filled with appropriate default values

    Notes
    -----
    Adds columns with default values based on column name patterns:
    - lab_orders_, visited_, has_ : False
    - num_, total_ : 0
    - latest_ : pd.NA
    - arrival_method : "None"
    - others : pd.NA
    """
    # check input data for missing columns
    column_transformer = pipeline.named_steps["feature_transformer"]

    # Function to get feature names before one-hot encoding
    def get_feature_names_before_encoding(column_transformer):
        feature_names = []
        for name, transformer, columns in column_transformer.transformers:
            if isinstance(transformer, OneHotEncoder):
                feature_names.extend(columns)
            elif isinstance(transformer, OrdinalEncoder):
                feature_names.extend(columns)
            elif isinstance(transformer, StandardScaler):
                feature_names.extend(columns)
            else:
                feature_names.extend(columns)
        return feature_names

    feature_names_before_encoding = get_feature_names_before_encoding(
        column_transformer
    )

    added_columns = []
    for missing_col in set(feature_names_before_encoding).difference(set(df.columns)):
        if missing_col.startswith(("lab_orders_", "visited_", "has_")):
            df[missing_col] = False
        elif missing_col.startswith(("num_", "total_")):
            df[missing_col] = 0
        elif missing_col.startswith("latest_"):
            df[missing_col] = pd.NA
        elif missing_col == "arrival_method":
            df[missing_col] = "None"
        else:
            df[missing_col] = pd.NA
        added_columns.append(missing_col)

    if added_columns:
        print(
            f"Warning: The following columns were used in training, but not found in the real-time data. These have been added to the dataframe: {', '.join(added_columns)}"
        )

    return df
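
A self-contained sketch with a toy pipeline. The step name "feature_transformer" is what the function looks up; the column names are placeholders:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

pipe = Pipeline([
    ("feature_transformer", ColumnTransformer([
        ("cat", OneHotEncoder(), ["arrival_method"]),
        ("num", StandardScaler(), ["num_obs", "latest_obs_pulse"]),
    ])),
    ("classifier", LogisticRegression()),
])

df = pd.DataFrame({"num_obs": [3, 7]})
df = add_missing_columns(pipe, df)
# 'arrival_method' is filled with "None"; 'latest_obs_pulse' with pd.NA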

create_predictions(models, prediction_time, prediction_snapshots, specialties, prediction_window, x1, y1, x2, y2, cdf_cut_points, use_admission_in_window_prob=True)

Create predictions for emergency demand for a single prediction moment.

Parameters:

Name Type Description Default
models Tuple[TrainedClassifier, Union[SequenceToOutcomePredictor, ValueToOutcomePredictor], ParametricIncomingAdmissionPredictor]

Tuple containing:

- classifier: TrainedClassifier containing admission predictions
- spec_model: SequenceToOutcomePredictor or ValueToOutcomePredictor for specialty predictions
- yet_to_arrive_model: ParametricIncomingAdmissionPredictor for yet-to-arrive predictions

required
prediction_time Tuple

Hour and minute of time for model inference

required
prediction_snapshots DataFrame

DataFrame containing prediction snapshots. Must have an 'elapsed_los' column of type timedelta.

required
specialties List[str]

List of specialty names for predictions (e.g., ['surgical', 'medical'])

required
prediction_window timedelta

Prediction window as a timedelta object

required
x1 float

X-coordinate of first point for probability curve

required
y1 float

Y-coordinate of first point for probability curve

required
x2 float

X-coordinate of second point for probability curve

required
y2 float

Y-coordinate of second point for probability curve

required
cdf_cut_points List[float]

List of cumulative distribution function cut points (e.g., [0.9, 0.7])

required
use_admission_in_window_prob bool

Whether to use probability calculation for admission within prediction window for patients already in the ED. If False, probability is set to 1.0 for all current ED patients. This parameter does not affect the yet-to-arrive predictions. By default True

True

Returns:

Type Description
Dict[str, Dict[str, List[int]]]

Nested dictionary containing predictions for each specialty:

{
    'specialty_name': {
        'in_ed': [pred1, pred2, ...],
        'yet_to_arrive': [pred1, pred2, ...]
    }
}

Raises:

Type Description
TypeError

If any of the models are not of the expected type or if prediction_window is not a timedelta

ValueError

If models have not been fit or if prediction parameters don't match training parameters If 'elapsed_los' column is missing or not of type timedelta

Notes

The first element of the models tuple must be a TrainedClassifier that contains either a 'pipeline' or 'calibrated_pipeline' attribute. The pipeline is used for making predictions, with calibrated_pipeline taking precedence if both exist.

Source code in src/patientflow/predict/emergency_demand.py
def create_predictions(
    models: Tuple[
        TrainedClassifier,
        Union[SequenceToOutcomePredictor, ValueToOutcomePredictor],
        ParametricIncomingAdmissionPredictor,
    ],
    prediction_time: Tuple,
    prediction_snapshots: pd.DataFrame,
    specialties: List[str],
    prediction_window: timedelta,
    x1: float,
    y1: float,
    x2: float,
    y2: float,
    cdf_cut_points: List[float],
    use_admission_in_window_prob: bool = True,
) -> Dict[str, Dict[str, List[int]]]:
    """Create predictions for emergency demand for a single prediction moment.

    Parameters
    ----------
    models : Tuple[TrainedClassifier, Union[SequenceToOutcomePredictor, ValueToOutcomePredictor], ParametricIncomingAdmissionPredictor]
        Tuple containing:
        - classifier: TrainedClassifier containing admission predictions
        - spec_model: SequenceToOutcomePredictor or ValueToOutcomePredictor for specialty predictions
        - yet_to_arrive_model: ParametricIncomingAdmissionPredictor for yet-to-arrive predictions
    prediction_time : Tuple
        Hour and minute of time for model inference
    prediction_snapshots : pandas.DataFrame
        DataFrame containing prediction snapshots. Must have an 'elapsed_los' column of type timedelta.
    specialties : List[str]
        List of specialty names for predictions (e.g., ['surgical', 'medical'])
    prediction_window : timedelta
        Prediction window as a timedelta object
    x1 : float
        X-coordinate of first point for probability curve
    y1 : float
        Y-coordinate of first point for probability curve
    x2 : float
        X-coordinate of second point for probability curve
    y2 : float
        Y-coordinate of second point for probability curve
    cdf_cut_points : List[float]
        List of cumulative distribution function cut points (e.g., [0.9, 0.7])
    use_admission_in_window_prob : bool, optional
        Whether to use probability calculation for admission within prediction window for patients
        already in the ED. If False, probability is set to 1.0 for all current ED patients.
        This parameter does not affect the yet-to-arrive predictions. By default True

    Returns
    -------
    Dict[str, Dict[str, List[int]]]
        Nested dictionary containing predictions for each specialty:
        {
            'specialty_name': {
                'in_ed': [pred1, pred2, ...],
                'yet_to_arrive': [pred1, pred2, ...]
            }
        }

    Raises
    ------
    TypeError
        If any of the models are not of the expected type or if prediction_window is not a timedelta
    ValueError
        If models have not been fit or if prediction parameters don't match training parameters
        If 'elapsed_los' column is missing or not of type timedelta

    Notes
    -----
    The first element of the models tuple must be a TrainedClassifier
    that contains either a 'pipeline' or 'calibrated_pipeline' attribute. The pipeline
    will be used for making predictions, with calibrated_pipeline taking precedence
    if both exist.
    """
    # Validate model types
    classifier, spec_model, yet_to_arrive_model = models

    if not isinstance(classifier, TrainedClassifier):
        raise TypeError("First model must be of type TrainedClassifier")
    if not isinstance(
        spec_model, (SequenceToOutcomePredictor, ValueToOutcomePredictor)
    ):
        raise TypeError(
            "Second model must be of type SequenceToOutcomePredictor or ValueToOutcomePredictor"
        )
    if not isinstance(yet_to_arrive_model, ParametricIncomingAdmissionPredictor):
        raise TypeError(
            "Third model must be of type ParametricIncomingAdmissionPredictor"
        )
    if "elapsed_los" not in prediction_snapshots.columns:
        raise ValueError("Column 'elapsed_los' not found in prediction_snapshots")
    if not pd.api.types.is_timedelta64_dtype(prediction_snapshots["elapsed_los"]):
        actual_type = prediction_snapshots["elapsed_los"].dtype
        raise ValueError(
            f"Column 'elapsed_los' must be a timedelta column, but found type: {actual_type}"
        )

    # Check that all models have been fit
    if not hasattr(classifier, "pipeline") or classifier.pipeline is None:
        raise ValueError("Classifier model has not been fit")
    if not hasattr(spec_model, "weights") or spec_model.weights is None:
        raise ValueError("Specialty model has not been fit")
    if (
        not hasattr(yet_to_arrive_model, "prediction_window")
        or yet_to_arrive_model.prediction_window is None
    ):
        raise ValueError("Yet-to-arrive model has not been fit")

    # Validate that the correct models have been passed for the requested prediction time and prediction window
    if not classifier.training_results.prediction_time == prediction_time:
        raise ValueError(
            f"Requested prediction time {prediction_time} does not match the prediction time of the trained classifier {classifier.training_results.prediction_time}"
        )

    # Compare prediction windows directly
    if prediction_window != yet_to_arrive_model.prediction_window:
        raise ValueError(
            f"Requested prediction window {prediction_window} does not match the prediction window of the trained yet-to-arrive model {yet_to_arrive_model.prediction_window}"
        )

    if not set(yet_to_arrive_model.filters.keys()) == set(specialties):
        raise ValueError(
            f"Requested specialties {set(specialties)} do not match the specialties of the trained yet-to-arrive model {set(yet_to_arrive_model.filters.keys())}"
        )

    special_params = spec_model.special_params

    if special_params:
        special_category_func = special_params["special_category_func"]
        special_category_dict = special_params["special_category_dict"]
        special_func_map = special_params["special_func_map"]
    else:
        special_category_func = special_category_dict = special_func_map = None

    if special_category_dict is not None and not set(specialties) == set(
        special_category_dict.keys()
    ):
        raise ValueError(
            "Requested specialties do not match the specialty dictionary defined in special_params"
        )

    predictions: Dict[str, Dict[str, List[int]]] = {
        specialty: {"in_ed": [], "yet_to_arrive": []} for specialty in specialties
    }

    # Use calibrated pipeline if available, otherwise use regular pipeline
    if (
        hasattr(classifier, "calibrated_pipeline")
        and classifier.calibrated_pipeline is not None
    ):
        pipeline = classifier.calibrated_pipeline
    else:
        pipeline = classifier.pipeline

    # Add missing columns expected by the model
    prediction_snapshots = add_missing_columns(pipeline, prediction_snapshots)

    # Before we get predictions, we need to create a temp copy with the elapsed_los column in seconds
    prediction_snapshots_temp = prediction_snapshots.copy()
    prediction_snapshots_temp["elapsed_los"] = prediction_snapshots_temp[
        "elapsed_los"
    ].dt.total_seconds()

    # Get predictions of admissions for ED patients
    prob_admission_after_ed = model_input_to_pred_proba(
        prediction_snapshots_temp, pipeline
    )

    # Get predictions of admission to specialty
    prediction_snapshots.loc[:, "specialty_prob"] = get_specialty_probs(
        specialties,
        spec_model,
        prediction_snapshots,
        special_category_func=special_category_func,
        special_category_dict=special_category_dict,
    )

    # Get probability of admission within prediction window for current ED patients
    if use_admission_in_window_prob:
        prob_admission_in_window = prediction_snapshots.apply(
            lambda row: calculate_probability(
                row["elapsed_los"], prediction_window, x1, y1, x2, y2
            ),
            axis=1,
        )
    else:
        prob_admission_in_window = pd.Series(1.0, index=prediction_snapshots.index)

    if special_func_map is None:
        special_func_map = {"default": lambda row: True}

    for specialty in specialties:
        func = special_func_map.get(specialty, special_func_map["default"])
        non_zero_indices = prediction_snapshots[
            prediction_snapshots.apply(func, axis=1)
        ].index

        filtered_prob_admission_after_ed = prob_admission_after_ed.loc[non_zero_indices]
        prob_admission_to_specialty = prediction_snapshots["specialty_prob"].apply(
            lambda x: x[specialty]
        )

        filtered_prob_admission_to_specialty = prob_admission_to_specialty.loc[
            non_zero_indices
        ]
        filtered_prob_admission_in_window = prob_admission_in_window.loc[
            non_zero_indices
        ]

        filtered_weights = (
            filtered_prob_admission_to_specialty * filtered_prob_admission_in_window
        )

        agg_predicted_in_ed = pred_proba_to_agg_predicted(
            filtered_prob_admission_after_ed, weights=filtered_weights
        )

        prediction_context = {specialty: {"prediction_time": prediction_time}}
        agg_predicted_yta = yet_to_arrive_model.predict(
            prediction_context, x1=x1, y1=y1, x2=x2, y2=y2
        )

        predictions[specialty]["in_ed"] = [
            find_probability_threshold_index(
                agg_predicted_in_ed["agg_proba"].values.cumsum(), cut_point
            )
            for cut_point in cdf_cut_points
        ]
        predictions[specialty]["yet_to_arrive"] = [
            find_probability_threshold_index(
                agg_predicted_yta[specialty]["agg_proba"].values.cumsum(), cut_point
            )
            for cut_point in cdf_cut_points
        ]

    return predictions
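
A sketch of the call shape only; classifier, spec_model and yta_model are assumed to be already-fitted objects of the types validated above, so the call itself is left commented out:

from datetime import timedelta

# predictions = create_predictions(
#     models=(classifier, spec_model, yta_model),
#     prediction_time=(9, 30),
#     prediction_snapshots=snapshots_df,     # must include a timedelta 'elapsed_los' column
#     specialties=["medical", "surgical"],
#     prediction_window=timedelta(hours=8),
#     x1=4, y1=0.76, x2=12, y2=0.99,
#     cdf_cut_points=[0.9, 0.7],
# )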

find_probability_threshold_index(sequence, threshold)

Find index where cumulative probability exceeds threshold.

Parameters:

Name Type Description Default
sequence List[float]

The probability mass function (PMF) of resource needs

required
threshold float

The probability threshold (e.g., 0.9 for 90%)

required

Returns:

Type Description
int

The index where the cumulative probability exceeds 1 - threshold, indicating the number of resources needed with the specified probability

Examples:

>>> pmf = [0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05]
>>> find_probability_threshold_index(pmf, 0.9)
1
# This means there is a 90% probability of needing at least 1 bed
Source code in src/patientflow/predict/emergency_demand.py
def find_probability_threshold_index(sequence: List[float], threshold: float) -> int:
    """Find index where cumulative probability exceeds threshold.

    Parameters
    ----------
    sequence : List[float]
        The probability mass function (PMF) of resource needs
    threshold : float
        The probability threshold (e.g., 0.9 for 90%)

    Returns
    -------
    int
        The index where the cumulative probability exceeds 1 - threshold,
        indicating the number of resources needed with the specified probability

    Examples
    --------
    >>> pmf = [0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05]
    >>> find_probability_threshold_index(pmf, 0.9)
    1
    # This means there is a 90% probability of needing at least 1 bed
    """
    cumulative_sum = 0.0
    for i, value in enumerate(sequence):
        cumulative_sum += value
        if cumulative_sum >= 1 - threshold:
            return i
    return len(sequence) - 1  # Return the last index if the threshold isn't reached
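
Tracing a few thresholds against the docstring's PMF shows where the cumulative sum first reaches 1 - threshold:

pmf = [0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05]
for threshold in (0.9, 0.7, 0.5):
    print(threshold, find_probability_threshold_index(pmf, threshold))
# 0.9 -> 1   (cumulative sum reaches 0.1 at index 1)
# 0.7 -> 2   (cumulative sum reaches 0.3 at index 2)
# 0.5 -> 3   (cumulative sum reaches 0.5 at index 3)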

get_specialty_probs(specialties, specialty_model, snapshots_df, special_category_func=None, special_category_dict=None)

Calculate specialty probability distributions for patient visits.

Parameters:

Name Type Description Default
specialties list of str

List of specialty names for which predictions are required

required
specialty_model object

Trained model for making specialty predictions

required
snapshots_df DataFrame

DataFrame containing the data on which predictions are to be made. Must include the input_var column if no special_category_func is applied

required
special_category_func callable

A function that takes a DataFrame row (Series) as input and returns True if the row belongs to a special category that requires a fixed probability distribution

None
special_category_dict dict

A dictionary containing the fixed probability distribution for special category cases. Required if special_category_func is provided

None

Returns:

Type Description
Series

A Series containing dictionaries as values. Each dictionary represents the probability distribution of specialties for each patient visit

Raises:

Type Description
ValueError

If special_category_func is provided but special_category_dict is None

Source code in src/patientflow/predict/emergency_demand.py
def get_specialty_probs(
    specialties,
    specialty_model,
    snapshots_df,
    special_category_func=None,
    special_category_dict=None,
):
    """Calculate specialty probability distributions for patient visits.

    Parameters
    ----------
    specialties : list of str
        List of specialty names for which predictions are required
    specialty_model : object
        Trained model for making specialty predictions
    snapshots_df : pandas.DataFrame
        DataFrame containing the data on which predictions are to be made. Must include
        the input_var column if no special_category_func is applied
    special_category_func : callable, optional
        A function that takes a DataFrame row (Series) as input and returns True if the row
        belongs to a special category that requires a fixed probability distribution
    special_category_dict : dict, optional
        A dictionary containing the fixed probability distribution for special category cases.
        Required if special_category_func is provided

    Returns
    -------
    pandas.Series
        A Series containing dictionaries as values. Each dictionary represents the probability
        distribution of specialties for each patient visit

    Raises
    ------
    ValueError
        If special_category_func is provided but special_category_dict is None

    """

    # Convert input_var to tuple if not already a tuple
    if len(snapshots_df[specialty_model.input_var]) > 0 and not isinstance(
        snapshots_df[specialty_model.input_var].iloc[0], tuple
    ):
        snapshots_df.loc[:, specialty_model.input_var] = snapshots_df[
            specialty_model.input_var
        ].apply(lambda x: tuple(x) if x else ())

    if special_category_func and not special_category_dict:
        raise ValueError(
            "special_category_dict must be provided if special_category_func is specified."
        )

    # Function to determine the specialty probabilities
    def determine_specialty(row):
        if special_category_func and special_category_func(row):
            return special_category_dict
        else:
            return specialty_model.predict(row[specialty_model.input_var])

    # Apply the determine_specialty function to each row
    specialty_prob_series = snapshots_df.apply(determine_specialty, axis=1)

    # Find all unique keys used in any dictionary within the series
    all_keys = set().union(
        *(d.keys() for d in specialty_prob_series if isinstance(d, dict))
    )

    # Combine all_keys with the specialties requested
    all_keys = set(all_keys).union(set(specialties))

    # Ensure each dictionary contains all keys found, with default values of 0 for missing keys
    specialty_prob_series = specialty_prob_series.apply(
        lambda d: (
            {key: d.get(key, 0) for key in all_keys} if isinstance(d, dict) else d
        )
    )

    return specialty_prob_series
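
A minimal sketch using a hypothetical stand-in for a fitted specialty model; the class, column name and probabilities are all illustrative:

import pandas as pd

class StubSpecialtyModel:
    """Hypothetical stand-in for a fitted SequenceToOutcomePredictor."""
    input_var = "consultation_sequence"
    def predict(self, sequence):
        return {"medical": 0.7, "surgical": 0.3}

snapshots = pd.DataFrame({"consultation_sequence": [["acute"], []]})
probs = get_specialty_probs(
    ["medical", "surgical", "paediatric"], StubSpecialtyModel(), snapshots
)
# Each row gets {'medical': 0.7, 'surgical': 0.3, 'paediatric': 0}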

predictors

Predictor models for patient flow analysis.

This module contains various predictor model implementations, including sequence-based predictors and weighted Poisson predictors for modeling patient flow patterns.

incoming_admission_predictors

Hospital Admissions Forecasting Predictors.

This module implements custom predictors to estimate the number of hospital admissions within a specified prediction window using historical admission data. It provides two approaches: parametric curves with Poisson-binomial distributions and empirical survival curves with convolution of Poisson distributions. Both predictors accommodate different data filters for tailored predictions across various hospital settings.

Classes:

Name Description
IncomingAdmissionPredictor : BaseEstimator, TransformerMixin

Base class for admission predictors that handles filtering and arrival rate calculation.

ParametricIncomingAdmissionPredictor : IncomingAdmissionPredictor

Predicts the number of admissions within a given prediction window based on historical data and Poisson-binomial distribution using parametric aspirational curves.

EmpiricalIncomingAdmissionPredictor : IncomingAdmissionPredictor

Predicts the number of admissions using empirical survival curves and convolution of Poisson distributions instead of parametric curves.

Notes

The ParametricIncomingAdmissionPredictor uses a combination of Poisson and binomial distributions to model the probability of admissions within a prediction window using parametric curves defined by transition points (x1, y1, x2, y2).

The EmpiricalIncomingAdmissionPredictor inherits the arrival rate calculation and filtering logic but replaces the parametric approach with empirical survival probabilities and convolution of individual Poisson distributions for each time interval.

Both predictors take into account historical data patterns and can be filtered for specific hospital settings or specialties.

EmpiricalIncomingAdmissionPredictor

Bases: IncomingAdmissionPredictor

A predictor that uses empirical survival curves instead of parameterised curves.

This predictor inherits all the arrival rate calculation and filtering logic from IncomingAdmissionPredictor but uses empirical survival probabilities and convolution of Poisson distributions for prediction instead of the Poisson-binomial approach.

The survival curve is automatically calculated from the training data during the fit process by analysing time-to-admission patterns.

Parameters:

Name Type Description Default
filters dict

Optional filters for data categorization. If None, no filtering is applied.

None
verbose bool

Whether to enable verbose logging.

False

Attributes:

Name Type Description
survival_df DataFrame

The survival data calculated from training data, containing time-to-event information for empirical probability calculations.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
class EmpiricalIncomingAdmissionPredictor(IncomingAdmissionPredictor):
    """A predictor that uses empirical survival curves instead of parameterised curves.

    This predictor inherits all the arrival rate calculation and filtering logic from
    IncomingAdmissionPredictor but uses empirical survival probabilities and convolution
    of Poisson distributions for prediction instead of the Poisson-binomial approach.

    The survival curve is automatically calculated from the training data during the
    fit process by analysing time-to-admission patterns.

    Parameters
    ----------
    filters : dict, optional
        Optional filters for data categorization. If None, no filtering is applied.
    verbose : bool, default=False
        Whether to enable verbose logging.

    Attributes
    ----------
    survival_df : pandas.DataFrame
        The survival data calculated from training data, containing time-to-event
        information for empirical probability calculations.
    """

    def __init__(self, filters=None, verbose=False):
        """Initialize the EmpiricalIncomingAdmissionPredictor."""
        super().__init__(filters, verbose)
        self.survival_df = None

    def fit(
        self,
        train_df: pd.DataFrame,
        prediction_window,
        yta_time_interval,
        prediction_times: List[float],
        num_days: int,
        epsilon=10**-7,
        y=None,
        start_time_col="arrival_datetime",
        end_time_col="departure_datetime",
    ) -> "EmpiricalIncomingAdmissionPredictor":
        """Fit the model to the training data and calculate empirical survival curve.

        Parameters
        ----------
        train_df : pandas.DataFrame
            The training dataset with historical admission data.
            Expected to have start_time_col as the index and end_time_col as a column.
            Alternatively, both can be regular columns.
        prediction_window : int or timedelta
            The prediction window in minutes. If timedelta, will be converted to minutes.
            If int, assumed to be in minutes.
        yta_time_interval : int or timedelta
            The interval in minutes for splitting the prediction window. If timedelta, will be converted to minutes.
            If int, assumed to be in minutes.
        prediction_times : list
            Times of day at which predictions are made, in hours.
        num_days : int
            The number of days that the train_df spans.
        epsilon : float, default=1e-7
            A small value representing acceptable error rate to enable calculation
            of the maximum value of the random variable representing number of beds.
        y : None, optional
            Ignored, present for compatibility with scikit-learn's fit method.
        start_time_col : str, default='arrival_datetime'
            Name of the column containing the start time (e.g., arrival time).
            Expected to be the DataFrame index, but can also be a regular column.
        end_time_col : str, default='departure_datetime'
            Name of the column containing the end time (e.g., departure time).

        Returns
        -------
        EmpiricalIncomingAdmissionPredictor
            The instance itself, fitted with the training data.
        """
        # Calculate survival curve from training data using existing function
        # Handle case where start_time_col is in the index
        if start_time_col in train_df.columns:
            # start_time_col is a regular column
            df_for_survival = train_df
        else:
            # start_time_col is likely the index, reset it to make it a column
            df_for_survival = train_df.reset_index()
            # Verify that start_time_col is now available
            if start_time_col not in df_for_survival.columns:
                raise ValueError(
                    f"Column '{start_time_col}' not found in DataFrame columns or index"
                )

        self.survival_df = calculate_survival_curve(
            df_for_survival, start_time_col=start_time_col, end_time_col=end_time_col
        )

        # Verify survival curve was calculated and saved successfully
        if self.survival_df is None or len(self.survival_df) == 0:
            raise RuntimeError("Failed to calculate survival curve from training data")

        # Ensure train_df has start_time_col as index for parent fit method
        if start_time_col in train_df.columns:
            train_df = train_df.set_index(start_time_col)

        # Call parent fit method to handle arrival rate calculation and validation
        super().fit(
            train_df,
            prediction_window,
            yta_time_interval,
            prediction_times,
            num_days,
            epsilon=epsilon,
            y=y,
        )

        if self.verbose:
            self.logger.info(
                f"EmpiricalIncomingAdmissionPredictor has been fitted with survival curve containing {len(self.survival_df)} time points"
            )

        return self

    def get_survival_curve(self):
        """Get the survival curve calculated during fitting.

        Returns
        -------
        pandas.DataFrame
            DataFrame containing the survival curve with columns:
            - time_hours: Time points in hours
            - survival_probability: Survival probabilities at each time point
            - event_probability: Event probabilities (1 - survival_probability)

        Raises
        ------
        RuntimeError
            If the model has not been fitted yet.
        """
        if self.survival_df is None:
            raise RuntimeError("Model has not been fitted yet. Call fit() first.")
        return self.survival_df.copy()

    def _calculate_survival_probabilities(self, prediction_window, yta_time_interval):
        """Calculate survival probabilities for each time interval.

        Parameters
        ----------
        prediction_window : int or timedelta
            The prediction window.
        yta_time_interval : int or timedelta
            The time interval for splitting the prediction window.

        Returns
        -------
        numpy.ndarray
            Array of admission probabilities for each time interval.
        """
        # Calculate number of time intervals
        if isinstance(prediction_window, timedelta) and isinstance(
            yta_time_interval, timedelta
        ):
            NTimes = int(prediction_window / yta_time_interval)
        elif isinstance(prediction_window, timedelta):
            NTimes = int(prediction_window.total_seconds() / 60 / yta_time_interval)
        elif isinstance(yta_time_interval, timedelta):
            NTimes = int(prediction_window / (yta_time_interval.total_seconds() / 60))
        else:
            NTimes = int(prediction_window / yta_time_interval)

        # Convert to hours for survival probability calculation
        if isinstance(prediction_window, timedelta):
            prediction_window_hours = prediction_window.total_seconds() / 3600
        else:
            prediction_window_hours = prediction_window / 60

        if isinstance(yta_time_interval, timedelta):
            yta_time_interval_hours = yta_time_interval.total_seconds() / 3600
        else:
            yta_time_interval_hours = yta_time_interval / 60

        # Calculate admission probabilities for each time interval
        probabilities = []
        for i in range(NTimes):
            # Time remaining until end of prediction window
            time_remaining = prediction_window_hours - (i * yta_time_interval_hours)

            # Interpolate survival probability from survival curve
            if time_remaining <= 0:
                prob_admission = (
                    1.0  # If time remaining is 0 or negative, probability is 1
                )
            else:
                # Find the survival probability at this time point
                # Linear interpolation between points in survival curve
                survival_curve = self.survival_df
                if time_remaining >= survival_curve["time_hours"].max():
                    # If time is beyond our data, use the last survival probability
                    survival_prob = survival_curve["survival_probability"].iloc[-1]
                elif time_remaining <= survival_curve["time_hours"].min():
                    # If time is before our data, use the first survival probability
                    survival_prob = survival_curve["survival_probability"].iloc[0]
                else:
                    # Interpolate between points
                    survival_prob = np.interp(
                        time_remaining,
                        survival_curve["time_hours"],
                        survival_curve["survival_probability"],
                    )

                # Probability of admission = 1 - survival probability
                prob_admission = 1 - survival_prob

            probabilities.append(prob_admission)

        return np.array(probabilities)

    def _convolve_poisson_distributions(
        self, arrival_rates, probabilities, max_value=20
    ):
        """Convolve Poisson distributions for each time interval.

        Parameters
        ----------
        arrival_rates : numpy.ndarray
            Array of arrival rates for each time interval.
        probabilities : numpy.ndarray
            Array of admission probabilities for each time interval.
        max_value : int, default=20
            Maximum value for the discrete distribution support.

        Returns
        -------
        pandas.DataFrame
            DataFrame with 'sum' and 'agg_proba' columns representing the final distribution.
        """
        from scipy import stats

        # Create weighted Poisson distributions for each time interval
        weighted_rates = arrival_rates * probabilities
        poisson_dists = [stats.poisson(rate) for rate in weighted_rates]

        # Get PMF for each distribution
        x = np.arange(max_value)
        pmfs = [dist.pmf(x) for dist in poisson_dists]

        # Convolve all distributions together
        if len(pmfs) == 0:
            # Handle edge case of no distributions
            combined_pmf = np.zeros(max_value)
            combined_pmf[0] = 1.0  # All probability at 0
        else:
            combined_pmf = pmfs[0]
            for pmf in pmfs[1:]:
                combined_pmf = np.convolve(combined_pmf, pmf)

        # Create result DataFrame
        result_df = pd.DataFrame(
            {"sum": range(len(combined_pmf)), "agg_proba": combined_pmf}
        )

        # Filter out near-zero probabilities and normalize
        result_df = result_df[result_df["agg_proba"] > 1e-10].copy()
        result_df["agg_proba"] = result_df["agg_proba"] / result_df["agg_proba"].sum()

        return result_df.set_index("sum")

    def predict(self, prediction_context: Dict, **kwargs) -> Dict:
        """Predict the number of admissions using empirical survival curves.

        Parameters
        ----------
        prediction_context : dict
            A dictionary defining the context for which predictions are to be made.
            It should specify either a general context or one based on the applied filters.
        **kwargs
            Additional keyword arguments for prediction configuration:

            max_value : int, default=20
                Maximum value for the discrete distribution support.

        Returns
        -------
        dict
            A dictionary with predictions for each specified context.

        Raises
        ------
        ValueError
            If filter key is not recognized or prediction_time is not provided.
        KeyError
            If required keys are missing from the prediction context.
        RuntimeError
            If survival_df was not provided during fitting.
        """
        if self.survival_df is None:
            raise RuntimeError(
                "No survival data available. Please call fit() method first to calculate survival curve from training data."
            )

        # Extract parameters from kwargs with defaults
        max_value = kwargs.get("max_value", 20)

        predictions = {}

        # Calculate survival probabilities once (they're the same for all contexts)
        survival_probabilities = self._calculate_survival_probabilities(
            self.prediction_window, self.yta_time_interval
        )

        for filter_key, filter_values in prediction_context.items():
            try:
                if filter_key not in self.weights:
                    raise ValueError(
                        f"Filter key '{filter_key}' is not recognized in the model weights."
                    )

                prediction_time = filter_values.get("prediction_time")
                if prediction_time is None:
                    raise ValueError(
                        f"No 'prediction_time' provided for filter '{filter_key}'."
                    )

                if prediction_time not in self.prediction_times:
                    prediction_time = find_nearest_previous_prediction_time(
                        prediction_time, self.prediction_times
                    )

                arrival_rates = self.weights[filter_key][prediction_time].get(
                    "arrival_rates"
                )
                if arrival_rates is None:
                    raise ValueError(
                        f"No arrival_rates found for the time of day '{prediction_time}' under filter '{filter_key}'."
                    )

                # Convert arrival rates to numpy array
                arrival_rates = np.array(arrival_rates)

                # Generate prediction using convolution approach
                predictions[filter_key] = self._convolve_poisson_distributions(
                    arrival_rates, survival_probabilities, max_value=max_value
                )

                # if self.verbose:
                #     total_expected = (arrival_rates * survival_probabilities).sum()
                #     self.logger.info(
                #         f"Prediction for {filter_key} at {prediction_time}: "
                #         f"Expected value ≈ {total_expected:.2f}"
                #     )

            except KeyError as e:
                raise KeyError(f"Key error occurred: {e!s}")

        return predictions
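For orientation, a minimal end-to-end sketch of fitting and predicting with this class. This is not a library-provided example: the synthetic DataFrame, its time span, and the chosen prediction times are illustrative assumptions, and the import path is taken from the source locations shown on this page.

import pandas as pd
from datetime import timedelta

# Assumed import path, per the "Source code in ..." notes on this page
from patientflow.predictors.incoming_admission_predictors import (
    EmpiricalIncomingAdmissionPredictor,
)

# Synthetic training data: one row per visit, indexed by arrival time
arrivals = pd.date_range("2024-01-01", periods=500, freq="3h")
train_df = pd.DataFrame(
    {"departure_datetime": arrivals + pd.Timedelta(hours=5)},
    index=arrivals.rename("arrival_datetime"),
)

predictor = EmpiricalIncomingAdmissionPredictor()
predictor.fit(
    train_df,
    prediction_window=timedelta(hours=8),
    yta_time_interval=timedelta(minutes=30),
    prediction_times=[(6, 0), (12, 0), (22, 0)],
    num_days=63,  # approximate span of the synthetic train_df in days
)

# Unfiltered models key their predictions by the generic 'unfiltered' context
result = predictor.predict({"unfiltered": {"prediction_time": (12, 0)}})
print(result["unfiltered"].head())  # distribution over number of admissions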
__init__(filters=None, verbose=False)

Initialize the EmpiricalIncomingAdmissionPredictor.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def __init__(self, filters=None, verbose=False):
    """Initialize the EmpiricalIncomingAdmissionPredictor."""
    super().__init__(filters, verbose)
    self.survival_df = None
fit(train_df, prediction_window, yta_time_interval, prediction_times, num_days, epsilon=10 ** -7, y=None, start_time_col='arrival_datetime', end_time_col='departure_datetime')

Fit the model to the training data and calculate empirical survival curve.

Parameters:

Name Type Description Default
train_df DataFrame

The training dataset with historical admission data. Expected to have start_time_col as the index and end_time_col as a column. Alternatively, both can be regular columns.

required
prediction_window int or timedelta

The prediction window in minutes. If timedelta, will be converted to minutes. If int, assumed to be in minutes.

required
yta_time_interval int or timedelta

The interval in minutes for splitting the prediction window. If timedelta, will be converted to minutes. If int, assumed to be in minutes.

required
prediction_times list

Times of day at which predictions are made, in hours.

required
num_days int

The number of days that the train_df spans.

required
epsilon float

A small value representing acceptable error rate to enable calculation of the maximum value of the random variable representing number of beds.

1e-7
y None

Ignored, present for compatibility with scikit-learn's fit method.

None
start_time_col str

Name of the column containing the start time (e.g., arrival time). Expected to be the DataFrame index, but can also be a regular column.

'arrival_datetime'
end_time_col str

Name of the column containing the end time (e.g., departure time).

'departure_datetime'

Returns:

Type Description
EmpiricalIncomingAdmissionPredictor

The instance itself, fitted with the training data.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def fit(
    self,
    train_df: pd.DataFrame,
    prediction_window,
    yta_time_interval,
    prediction_times: List[float],
    num_days: int,
    epsilon=10**-7,
    y=None,
    start_time_col="arrival_datetime",
    end_time_col="departure_datetime",
) -> "EmpiricalIncomingAdmissionPredictor":
    """Fit the model to the training data and calculate empirical survival curve.

    Parameters
    ----------
    train_df : pandas.DataFrame
        The training dataset with historical admission data.
        Expected to have start_time_col as the index and end_time_col as a column.
        Alternatively, both can be regular columns.
    prediction_window : int or timedelta
        The prediction window in minutes. If timedelta, will be converted to minutes.
        If int, assumed to be in minutes.
    yta_time_interval : int or timedelta
        The interval in minutes for splitting the prediction window. If timedelta, will be converted to minutes.
        If int, assumed to be in minutes.
    prediction_times : list
        Times of day at which predictions are made, in hours.
    num_days : int
        The number of days that the train_df spans.
    epsilon : float, default=1e-7
        A small value representing acceptable error rate to enable calculation
        of the maximum value of the random variable representing number of beds.
    y : None, optional
        Ignored, present for compatibility with scikit-learn's fit method.
    start_time_col : str, default='arrival_datetime'
        Name of the column containing the start time (e.g., arrival time).
        Expected to be the DataFrame index, but can also be a regular column.
    end_time_col : str, default='departure_datetime'
        Name of the column containing the end time (e.g., departure time).

    Returns
    -------
    EmpiricalIncomingAdmissionPredictor
        The instance itself, fitted with the training data.
    """
    # Calculate survival curve from training data using existing function
    # Handle case where start_time_col is in the index
    if start_time_col in train_df.columns:
        # start_time_col is a regular column
        df_for_survival = train_df
    else:
        # start_time_col is likely the index, reset it to make it a column
        df_for_survival = train_df.reset_index()
        # Verify that start_time_col is now available
        if start_time_col not in df_for_survival.columns:
            raise ValueError(
                f"Column '{start_time_col}' not found in DataFrame columns or index"
            )

    self.survival_df = calculate_survival_curve(
        df_for_survival, start_time_col=start_time_col, end_time_col=end_time_col
    )

    # Verify survival curve was calculated and saved successfully
    if self.survival_df is None or len(self.survival_df) == 0:
        raise RuntimeError("Failed to calculate survival curve from training data")

    # Ensure train_df has start_time_col as index for parent fit method
    if start_time_col in train_df.columns:
        train_df = train_df.set_index(start_time_col)

    # Call parent fit method to handle arrival rate calculation and validation
    super().fit(
        train_df,
        prediction_window,
        yta_time_interval,
        prediction_times,
        num_days,
        epsilon=epsilon,
        y=y,
    )

    if self.verbose:
        self.logger.info(
            f"EmpiricalIncomingAdmissionPredictor has been fitted with survival curve containing {len(self.survival_df)} time points"
        )

    return self
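As the docstring notes, start_time_col may be supplied either as the index or as a regular column; a small sketch of the two equivalent layouts (synthetic data, column names assumed):

import pandas as pd

arrivals = pd.date_range("2024-03-01", periods=4, freq="6h")
departures = arrivals + pd.Timedelta(hours=4)

# Layout A: arrival time as the index (the expected layout)
df_indexed = pd.DataFrame(
    {"departure_datetime": departures},
    index=arrivals.rename("arrival_datetime"),
)

# Layout B: arrival time as a regular column; fit() re-indexes it internally
df_columns = df_indexed.reset_index()

# Both layouts are accepted by EmpiricalIncomingAdmissionPredictor.fit()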
get_survival_curve()

Get the survival curve calculated during fitting.

Returns:

Type Description
DataFrame

DataFrame containing the survival curve with columns:

- time_hours: Time points in hours
- survival_probability: Survival probabilities at each time point
- event_probability: Event probabilities (1 - survival_probability)

Raises:

Type Description
RuntimeError

If the model has not been fitted yet.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def get_survival_curve(self):
    """Get the survival curve calculated during fitting.

    Returns
    -------
    pandas.DataFrame
        DataFrame containing the survival curve with columns:
        - time_hours: Time points in hours
        - survival_probability: Survival probabilities at each time point
        - event_probability: Event probabilities (1 - survival_probability)

    Raises
    ------
    RuntimeError
        If the model has not been fitted yet.
    """
    if self.survival_df is None:
        raise RuntimeError("Model has not been fitted yet. Call fit() first.")
    return self.survival_df.copy()
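A short sketch of inspecting the fitted curve. It assumes a `predictor` fitted as in the earlier sketch; the column names are those documented for the returned DataFrame:

import numpy as np

curve = predictor.get_survival_curve()

# Probability that a patient is still waiting (not yet admitted) at 4 hours,
# linearly interpolated between the empirical time points
p_still_waiting = np.interp(
    4.0, curve["time_hours"], curve["survival_probability"]
)
print(f"P(not admitted within 4h) = {p_still_waiting:.2f}")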
predict(prediction_context, **kwargs)

Predict the number of admissions using empirical survival curves.

Parameters:

Name Type Description Default
prediction_context dict

A dictionary defining the context for which predictions are to be made. It should specify either a general context or one based on the applied filters.

required
**kwargs

Additional keyword arguments for prediction configuration:

max_value : int, default=20
    Maximum value for the discrete distribution support.

{}

Returns:

Type Description
dict

A dictionary with predictions for each specified context.

Raises:

Type Description
ValueError

If filter key is not recognized or prediction_time is not provided.

KeyError

If required keys are missing from the prediction context.

RuntimeError

If survival_df was not provided during fitting.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def predict(self, prediction_context: Dict, **kwargs) -> Dict:
    """Predict the number of admissions using empirical survival curves.

    Parameters
    ----------
    prediction_context : dict
        A dictionary defining the context for which predictions are to be made.
        It should specify either a general context or one based on the applied filters.
    **kwargs
        Additional keyword arguments for prediction configuration:

        max_value : int, default=20
            Maximum value for the discrete distribution support.

    Returns
    -------
    dict
        A dictionary with predictions for each specified context.

    Raises
    ------
    ValueError
        If filter key is not recognized or prediction_time is not provided.
    KeyError
        If required keys are missing from the prediction context.
    RuntimeError
        If survival_df was not provided during fitting.
    """
    if self.survival_df is None:
        raise RuntimeError(
            "No survival data available. Please call fit() method first to calculate survival curve from training data."
        )

    # Extract parameters from kwargs with defaults
    max_value = kwargs.get("max_value", 20)

    predictions = {}

    # Calculate survival probabilities once (they're the same for all contexts)
    survival_probabilities = self._calculate_survival_probabilities(
        self.prediction_window, self.yta_time_interval
    )

    for filter_key, filter_values in prediction_context.items():
        try:
            if filter_key not in self.weights:
                raise ValueError(
                    f"Filter key '{filter_key}' is not recognized in the model weights."
                )

            prediction_time = filter_values.get("prediction_time")
            if prediction_time is None:
                raise ValueError(
                    f"No 'prediction_time' provided for filter '{filter_key}'."
                )

            if prediction_time not in self.prediction_times:
                prediction_time = find_nearest_previous_prediction_time(
                    prediction_time, self.prediction_times
                )

            arrival_rates = self.weights[filter_key][prediction_time].get(
                "arrival_rates"
            )
            if arrival_rates is None:
                raise ValueError(
                    f"No arrival_rates found for the time of day '{prediction_time}' under filter '{filter_key}'."
                )

            # Convert arrival rates to numpy array
            arrival_rates = np.array(arrival_rates)

            # Generate prediction using convolution approach
            predictions[filter_key] = self._convolve_poisson_distributions(
                arrival_rates, survival_probabilities, max_value=max_value
            )

            # if self.verbose:
            #     total_expected = (arrival_rates * survival_probabilities).sum()
            #     self.logger.info(
            #         f"Prediction for {filter_key} at {prediction_time}: "
            #         f"Expected value ≈ {total_expected:.2f}"
            #     )

        except KeyError as e:
            raise KeyError(f"Key error occurred: {e!s}")

    return predictions
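When the predictor was constructed with filters, the prediction context is keyed by those filter names rather than 'unfiltered'. A hedged sketch; the filter names and the 'specialty' column are illustrative assumptions:

filters = {
    "medical": {"specialty": "medical"},
    "surgical": {"specialty": "surgical"},
}
predictor = EmpiricalIncomingAdmissionPredictor(filters=filters)
# ... fit as in the earlier sketch; train_df would need a 'specialty' column ...

prediction_context = {
    "medical": {"prediction_time": (12, 0)},
    "surgical": {"prediction_time": (12, 0)},
}
results = predictor.predict(prediction_context, max_value=30)
# results["medical"] and results["surgical"] are separate distributions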

IncomingAdmissionPredictor

Bases: BaseEstimator, TransformerMixin, ABC

Base class for admission predictors that handles filtering and arrival rate calculation.

This abstract base class provides the common functionality for predicting hospital admissions, including data filtering, arrival rate calculation, and basic prediction infrastructure. Subclasses implement specific prediction strategies.

Parameters:

Name Type Description Default
filters dict

Optional filters for data categorization. If None, no filtering is applied.

None
verbose bool

Whether to enable verbose logging.

False

Attributes:

Name Type Description
filters dict

Filters for data categorization.

verbose bool

Verbose logging flag.

metrics dict

Stores metadata about the model and training data.

weights dict

Model parameters computed during fitting.

Notes

The predictor implements scikit-learn's BaseEstimator and TransformerMixin interfaces for compatibility with scikit-learn pipelines.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
class IncomingAdmissionPredictor(BaseEstimator, TransformerMixin, ABC):
    """Base class for admission predictors that handles filtering and arrival rate calculation.

    This abstract base class provides the common functionality for predicting hospital
    admissions, including data filtering, arrival rate calculation, and basic prediction
    infrastructure. Subclasses implement specific prediction strategies.

    Parameters
    ----------
    filters : dict, optional
        Optional filters for data categorization. If None, no filtering is applied.
    verbose : bool, default=False
        Whether to enable verbose logging.

    Attributes
    ----------
    filters : dict
        Filters for data categorization.
    verbose : bool
        Verbose logging flag.
    metrics : dict
        Stores metadata about the model and training data.
    weights : dict
        Model parameters computed during fitting.

    Notes
    -----
    The predictor implements scikit-learn's BaseEstimator and TransformerMixin
    interfaces for compatibility with scikit-learn pipelines.
    """

    def __init__(self, filters=None, verbose=False):
        """
        Initialize the IncomingAdmissionPredictor with optional filters.

        Args:
            filters (dict, optional): A dictionary defining filters for different categories or specialties.
                                    If None or empty, no filtering will be applied.
            verbose (bool, optional): If True, enable info-level logging. Defaults to False.
        """
        self.filters = filters if filters else {}
        self.verbose = verbose
        self.metrics = {}  # Add metrics dictionary to store metadata

        if verbose:
            # Configure logging for Jupyter notebook compatibility
            import logging
            import sys

            # Create logger
            self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")

            # Only set up handlers if they don't exist
            if not self.logger.handlers:
                self.logger.setLevel(logging.INFO if verbose else logging.WARNING)

                # Create handler that writes to sys.stdout
                handler = logging.StreamHandler(sys.stdout)
                handler.setLevel(logging.INFO if verbose else logging.WARNING)

                # Create a formatting configuration
                formatter = logging.Formatter("%(message)s")
                handler.setFormatter(formatter)

                # Add the handler to the logger
                self.logger.addHandler(handler)

                # Prevent propagation to root logger
                self.logger.propagate = False


    def filter_dataframe(self, df: pd.DataFrame, filters: Dict) -> pd.DataFrame:
        """Apply a set of filters to a dataframe.

        Parameters
        ----------
        df : pandas.DataFrame
            The DataFrame to filter.
        filters : dict
            A dictionary where keys are column names and values are the criteria
            or function to filter by.

        Returns
        -------
        pandas.DataFrame
            A filtered DataFrame.
        """
        filtered_df = df
        for column, criteria in filters.items():
            if callable(criteria):  # If the criteria is a function, apply it directly
                filtered_df = filtered_df[filtered_df[column].apply(criteria)]
            else:  # Otherwise, assume the criteria is a value or list of values for equality check
                filtered_df = filtered_df[filtered_df[column] == criteria]
        return filtered_df

    def _calculate_parameters(
        self,
        df,
        prediction_window: timedelta,
        yta_time_interval: timedelta,
        prediction_times,
        num_days,
    ):
        """Calculate parameters required for the model.

        Parameters
        ----------
        df : pandas.DataFrame
            The data frame to process.
        prediction_window : timedelta
            The total duration of the prediction window.
        yta_time_interval : timedelta
            The interval for splitting the prediction window.
        prediction_times : list
            Times of day at which predictions are made.
        num_days : int
            Number of days over which to calculate time-varying arrival rates.

        Returns
        -------
        dict
            Calculated arrival_rates parameters organized by time of day.
        """

        # Calculate Ntimes - Python handles the division naturally
        Ntimes = int(prediction_window / yta_time_interval)

        # Pass original type to time_varying_arrival_rates
        arrival_rates_dict = time_varying_arrival_rates(
            df, yta_time_interval, num_days, verbose=self.verbose
        )
        prediction_time_dict = {}

        for prediction_time_ in prediction_times:
            prediction_time_hr, prediction_time_min = (
                (prediction_time_, 0)
                if isinstance(prediction_time_, int)
                else prediction_time_
            )
            arrival_rates = [
                arrival_rates_dict[
                    (
                        datetime(1970, 1, 1, prediction_time_hr, prediction_time_min)
                        + i * yta_time_interval
                    ).time()
                ]
                for i in range(Ntimes)
            ]
            prediction_time_dict[(prediction_time_hr, prediction_time_min)] = {
                "arrival_rates": arrival_rates
            }

        return prediction_time_dict

    def fit(
        self,
        train_df: pd.DataFrame,
        prediction_window: timedelta,
        yta_time_interval: timedelta,
        prediction_times: List[float],
        num_days: int,
        epsilon: float = 10**-7,
        y: Optional[None] = None,
    ) -> "IncomingAdmissionPredictor":
        """Fit the model to the training data.

        Parameters
        ----------
        train_df : pandas.DataFrame
            The training dataset with historical admission data.
        prediction_window : timedelta
            The prediction window as a timedelta object.
        yta_time_interval : timedelta
            The interval for splitting the prediction window as a timedelta object.
        prediction_times : list
            Times of day at which predictions are made, in hours.
        num_days : int
            The number of days that the train_df spans.
        epsilon : float, default=1e-7
            A small value representing acceptable error rate to enable calculation
            of the maximum value of the random variable representing number of beds.
        y : None, optional
            Ignored, present for compatibility with scikit-learn's fit method.

        Returns
        -------
        IncomingAdmissionPredictor
            The instance itself, fitted with the training data.

        Raises
        ------
        TypeError
            If prediction_window or yta_time_interval are not timedelta objects.
        ValueError
            If prediction_window/yta_time_interval is not greater than 1.
        """

        # Validate inputs
        if not isinstance(prediction_window, timedelta):
            raise TypeError("prediction_window must be a timedelta object")
        if not isinstance(yta_time_interval, timedelta):
            raise TypeError("yta_time_interval must be a timedelta object")

        if prediction_window.total_seconds() <= 0:
            raise ValueError("prediction_window must be positive")
        if yta_time_interval.total_seconds() <= 0:
            raise ValueError("yta_time_interval must be positive")
        if yta_time_interval.total_seconds() > 4 * 3600:  # 4 hours in seconds
            warnings.warn("yta_time_interval appears to be longer than 4 hours")

        # Validate the ratio makes sense
        ratio = prediction_window / yta_time_interval
        if int(ratio) == 0:
            raise ValueError(
                "prediction_window must be significantly larger than yta_time_interval"
            )

        # Store original types
        self.prediction_window = prediction_window
        self.yta_time_interval = yta_time_interval
        self.epsilon = epsilon
        self.prediction_times = [
            tuple(x)
            if isinstance(x, (list, np.ndarray))
            else (x, 0)
            if isinstance(x, (int, float))
            else x
            for x in prediction_times
        ]

        # Initialize the weights dictionary
        self.weights = {}

        # If there are filters specified, calculate and store the parameters directly with the respective spec keys
        if self.filters:
            for spec, filters in self.filters.items():
                self.weights[spec] = self._calculate_parameters(
                    self.filter_dataframe(train_df, filters),
                    prediction_window,
                    yta_time_interval,
                    prediction_times,
                    num_days,
                )
        else:
            # If there are no filters, store the parameters with a generic key of 'unfiltered'
            self.weights["unfiltered"] = self._calculate_parameters(
                train_df,
                prediction_window,
                yta_time_interval,
                prediction_times,
                num_days,
            )

        if self.verbose:
            self.logger.info(
                f"{self.__class__.__name__} trained for these times: {prediction_times}"
            )
            self.logger.info(
                f"using prediction window of {prediction_window} after the time of prediction"
            )
            self.logger.info(
                f"and time interval of {yta_time_interval} within the prediction window."
            )
            self.logger.info(f"The error value for prediction will be {epsilon}")
            self.logger.info(
                "To see the weights saved by this model, used the get_weights() method"
            )

        # Store metrics about the training data
        self.metrics["train_dttm"] = datetime.now().strftime("%Y-%m-%d %H:%M")
        self.metrics["train_set_no"] = len(train_df)
        self.metrics["start_date"] = train_df.index.min().date()
        self.metrics["end_date"] = train_df.index.max().date()
        self.metrics["num_days"] = num_days

        return self

    def get_weights(self):
        """Get the weights computed by the fit method.

        Returns
        -------
        dict
            The weights computed during model fitting.
        """
        return self.weights

    @abstractmethod
    def predict(self, prediction_context: Dict, **kwargs) -> Dict:
        """Predict the number of admissions for the given context.

        This is an abstract method that must be implemented by subclasses.

        Parameters
        ----------
        prediction_context : dict
            A dictionary defining the context for which predictions are to be made.
            It should specify either a general context or one based on the applied filters.
        **kwargs
            Additional keyword arguments specific to the prediction method.

        Returns
        -------
        dict
            A dictionary with predictions for each specified context.

        Raises
        ------
        ValueError
            If filter key is not recognized or prediction_time is not provided.
        KeyError
            If required keys are missing from the prediction context.
        """
        pass
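Because predict() is abstract, IncomingAdmissionPredictor cannot be instantiated directly. A minimal subclass sketch follows; it is illustrative only, not part of the library, and it returns a degenerate distribution purely to show the required shape of the output:

import numpy as np
import pandas as pd

class MeanRateAdmissionPredictor(IncomingAdmissionPredictor):
    """Toy subclass: expected count = sum of interval arrival rates."""

    def predict(self, prediction_context, **kwargs):
        predictions = {}
        for filter_key, filter_values in prediction_context.items():
            prediction_time = filter_values["prediction_time"]
            rates = self.weights[filter_key][prediction_time]["arrival_rates"]
            # Degenerate "distribution": all mass on the rounded expected count
            expected = int(round(float(np.sum(rates))))
            predictions[filter_key] = pd.DataFrame(
                {"agg_proba": [1.0]}, index=pd.Index([expected], name="sum")
            )
        return predictions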
__init__(filters=None, verbose=False)

Initialize the IncomingAdmissionPredictor with optional filters.

Args:
    filters (dict, optional): A dictionary defining filters for different categories or specialties. If None or empty, no filtering will be applied.
    verbose (bool, optional): If True, enable info-level logging. Defaults to False.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def __init__(self, filters=None, verbose=False):
    """
    Initialize the IncomingAdmissionPredictor with optional filters.

    Args:
        filters (dict, optional): A dictionary defining filters for different categories or specialties.
                                If None or empty, no filtering will be applied.
        verbose (bool, optional): If True, enable info-level logging. Defaults to False.
    """
    self.filters = filters if filters else {}
    self.verbose = verbose
    self.metrics = {}  # Add metrics dictionary to store metadata

    if verbose:
        # Configure logging for Jupyter notebook compatibility
        import logging
        import sys

        # Create logger
        self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")

        # Only set up handlers if they don't exist
        if not self.logger.handlers:
            self.logger.setLevel(logging.INFO if verbose else logging.WARNING)

            # Create handler that writes to sys.stdout
            handler = logging.StreamHandler(sys.stdout)
            handler.setLevel(logging.INFO if verbose else logging.WARNING)

            # Create a formatting configuration
            formatter = logging.Formatter("%(message)s")
            handler.setFormatter(formatter)

            # Add the handler to the logger
            self.logger.addHandler(handler)

            # Prevent propagation to root logger
            self.logger.propagate = False

filter_dataframe(df, filters)

Apply a set of filters to a dataframe.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to filter.

required
filters dict

A dictionary where keys are column names and values are the criteria or function to filter by.

required

Returns:

Type Description
DataFrame

A filtered DataFrame.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def filter_dataframe(self, df: pd.DataFrame, filters: Dict) -> pd.DataFrame:
    """Apply a set of filters to a dataframe.

    Parameters
    ----------
    df : pandas.DataFrame
        The DataFrame to filter.
    filters : dict
        A dictionary where keys are column names and values are the criteria
        or function to filter by.

    Returns
    -------
    pandas.DataFrame
        A filtered DataFrame.
    """
    filtered_df = df
    for column, criteria in filters.items():
        if callable(criteria):  # If the criteria is a function, apply it directly
            filtered_df = filtered_df[filtered_df[column].apply(criteria)]
        else:  # Otherwise, assume the criteria is a value or list of values for equality check
            filtered_df = filtered_df[filtered_df[column] == criteria]
    return filtered_df
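As the docstring indicates, filter criteria may be plain values (checked for equality) or callables. A brief sketch, assuming an already-constructed predictor instance and illustrative column names:

import pandas as pd

df = pd.DataFrame(
    {"specialty": ["medical", "surgical", "paediatric"], "age": [70, 45, 9]}
)

# Equality criterion: keep only surgical rows
surgical = predictor.filter_dataframe(df, {"specialty": "surgical"})

# Callable criterion: keep rows where the function returns True
adults = predictor.filter_dataframe(df, {"age": lambda a: a >= 18})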
fit(train_df, prediction_window, yta_time_interval, prediction_times, num_days, epsilon=10 ** -7, y=None)

Fit the model to the training data.

Parameters:

Name Type Description Default
train_df DataFrame

The training dataset with historical admission data.

required
prediction_window timedelta

The prediction window as a timedelta object.

required
yta_time_interval timedelta

The interval for splitting the prediction window as a timedelta object.

required
prediction_times list

Times of day at which predictions are made, in hours.

required
num_days int

The number of days that the train_df spans.

required
epsilon float

A small value representing acceptable error rate to enable calculation of the maximum value of the random variable representing number of beds.

1e-7
y None

Ignored, present for compatibility with scikit-learn's fit method.

None

Returns:

Type Description
IncomingAdmissionPredictor

The instance itself, fitted with the training data.

Raises:

Type Description
TypeError

If prediction_window or yta_time_interval are not timedelta objects.

ValueError

If prediction_window/yta_time_interval is not greater than 1.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def fit(
    self,
    train_df: pd.DataFrame,
    prediction_window: timedelta,
    yta_time_interval: timedelta,
    prediction_times: List[float],
    num_days: int,
    epsilon: float = 10**-7,
    y: Optional[None] = None,
) -> "IncomingAdmissionPredictor":
    """Fit the model to the training data.

    Parameters
    ----------
    train_df : pandas.DataFrame
        The training dataset with historical admission data.
    prediction_window : timedelta
        The prediction window as a timedelta object.
    yta_time_interval : timedelta
        The interval for splitting the prediction window as a timedelta object.
    prediction_times : list
        Times of day at which predictions are made, in hours.
    num_days : int
        The number of days that the train_df spans.
    epsilon : float, default=1e-7
        A small value representing acceptable error rate to enable calculation
        of the maximum value of the random variable representing number of beds.
    y : None, optional
        Ignored, present for compatibility with scikit-learn's fit method.

    Returns
    -------
    IncomingAdmissionPredictor
        The instance itself, fitted with the training data.

    Raises
    ------
    TypeError
        If prediction_window or yta_time_interval are not timedelta objects.
    ValueError
        If prediction_window/yta_time_interval is not greater than 1.
    """

    # Validate inputs
    if not isinstance(prediction_window, timedelta):
        raise TypeError("prediction_window must be a timedelta object")
    if not isinstance(yta_time_interval, timedelta):
        raise TypeError("yta_time_interval must be a timedelta object")

    if prediction_window.total_seconds() <= 0:
        raise ValueError("prediction_window must be positive")
    if yta_time_interval.total_seconds() <= 0:
        raise ValueError("yta_time_interval must be positive")
    if yta_time_interval.total_seconds() > 4 * 3600:  # 4 hours in seconds
        warnings.warn("yta_time_interval appears to be longer than 4 hours")

    # Validate the ratio makes sense
    ratio = prediction_window / yta_time_interval
    if int(ratio) == 0:
        raise ValueError(
            "prediction_window must be significantly larger than yta_time_interval"
        )

    # Store original types
    self.prediction_window = prediction_window
    self.yta_time_interval = yta_time_interval
    self.epsilon = epsilon
    self.prediction_times = [
        tuple(x)
        if isinstance(x, (list, np.ndarray))
        else (x, 0)
        if isinstance(x, (int, float))
        else x
        for x in prediction_times
    ]

    # Initialize the weights dictionary
    self.weights = {}

    # If there are filters specified, calculate and store the parameters directly with the respective spec keys
    if self.filters:
        for spec, filters in self.filters.items():
            self.weights[spec] = self._calculate_parameters(
                self.filter_dataframe(train_df, filters),
                prediction_window,
                yta_time_interval,
                prediction_times,
                num_days,
            )
    else:
        # If there are no filters, store the parameters with a generic key of 'unfiltered'
        self.weights["unfiltered"] = self._calculate_parameters(
            train_df,
            prediction_window,
            yta_time_interval,
            prediction_times,
            num_days,
        )

    if self.verbose:
        self.logger.info(
            f"{self.__class__.__name__} trained for these times: {prediction_times}"
        )
        self.logger.info(
            f"using prediction window of {prediction_window} after the time of prediction"
        )
        self.logger.info(
            f"and time interval of {yta_time_interval} within the prediction window."
        )
        self.logger.info(f"The error value for prediction will be {epsilon}")
        self.logger.info(
            "To see the weights saved by this model, used the get_weights() method"
        )

    # Store metrics about the training data
    self.metrics["train_dttm"] = datetime.now().strftime("%Y-%m-%d %H:%M")
    self.metrics["train_set_no"] = len(train_df)
    self.metrics["start_date"] = train_df.index.min().date()
    self.metrics["end_date"] = train_df.index.max().date()
    self.metrics["num_days"] = num_days

    return self
get_weights()

Get the weights computed by the fit method.

Returns:

Type Description
dict

The weights computed during model fitting.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def get_weights(self):
    """Get the weights computed by the fit method.

    Returns
    -------
    dict
        The weights computed during model fitting.
    """
    return self.weights
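The returned weights are nested by filter key, then by prediction time. A sketch of walking the structure after fitting; the key names follow the fit() logic shown above:

weights = predictor.get_weights()

# Unfiltered models use the single key 'unfiltered'
for filter_key, by_time in weights.items():
    for prediction_time, params in by_time.items():
        rates = params["arrival_rates"]
        print(filter_key, prediction_time, f"{len(rates)} interval rates")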
predict(prediction_context, **kwargs) abstractmethod

Predict the number of admissions for the given context.

This is an abstract method that must be implemented by subclasses.

Parameters:

Name Type Description Default
prediction_context dict

A dictionary defining the context for which predictions are to be made. It should specify either a general context or one based on the applied filters.

required
**kwargs

Additional keyword arguments specific to the prediction method.

{}

Returns:

Type Description
dict

A dictionary with predictions for each specified context.

Raises:

Type Description
ValueError

If filter key is not recognized or prediction_time is not provided.

KeyError

If required keys are missing from the prediction context.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
@abstractmethod
def predict(self, prediction_context: Dict, **kwargs) -> Dict:
    """Predict the number of admissions for the given context.

    This is an abstract method that must be implemented by subclasses.

    Parameters
    ----------
    prediction_context : dict
        A dictionary defining the context for which predictions are to be made.
        It should specify either a general context or one based on the applied filters.
    **kwargs
        Additional keyword arguments specific to the prediction method.

    Returns
    -------
    dict
        A dictionary with predictions for each specified context.

    Raises
    ------
    ValueError
        If filter key is not recognized or prediction_time is not provided.
    KeyError
        If required keys are missing from the prediction context.
    """
    pass

ParametricIncomingAdmissionPredictor

Bases: IncomingAdmissionPredictor

A predictor for estimating hospital admissions using parametric curves.

This predictor uses a combination of Poisson and binomial distributions to forecast future admissions, excluding patients who have already arrived. The prediction is based on historical data and can be filtered for specific hospital settings.

Parameters:

Name Type Description Default
filters dict

Optional filters for data categorization. If None, no filtering is applied.

None
verbose bool

Whether to enable verbose logging.

False

Attributes:

Name Type Description
filters dict

Filters for data categorization.

verbose bool

Verbose logging flag.

metrics dict

Stores metadata about the model and training data.

weights dict

Model parameters computed during fitting.

Notes

The predictor implements scikit-learn's BaseEstimator and TransformerMixin interfaces for compatibility with scikit-learn pipelines.
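A hedged usage sketch for the parametric variant. The curve parameters below are illustrative targets only (e.g. aiming for 76% of patients admitted within 4 hours), and train_df is assumed to be prepared as for the base class:

from datetime import timedelta

predictor = ParametricIncomingAdmissionPredictor()
predictor.fit(
    train_df,  # indexed by arrival datetime, as for the base class
    prediction_window=timedelta(hours=8),
    yta_time_interval=timedelta(minutes=15),
    prediction_times=[(12, 0)],
    num_days=90,
)

# x1/y1 and x2/y2 define the two transition points of the aspirational curve
result = predictor.predict(
    {"unfiltered": {"prediction_time": (12, 0)}},
    x1=4, y1=0.76, x2=12, y2=0.99,
)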

Source code in src/patientflow/predictors/incoming_admission_predictors.py
class ParametricIncomingAdmissionPredictor(IncomingAdmissionPredictor):
    """A predictor for estimating hospital admissions using parametric curves.

    This predictor uses a combination of Poisson and binomial distributions to forecast
    future admissions, excluding patients who have already arrived. The prediction is
    based on historical data and can be filtered for specific hospital settings.

    Parameters
    ----------
    filters : dict, optional
        Optional filters for data categorization. If None, no filtering is applied.
    verbose : bool, default=False
        Whether to enable verbose logging.

    Attributes
    ----------
    filters : dict
        Filters for data categorization.
    verbose : bool
        Verbose logging flag.
    metrics : dict
        Stores metadata about the model and training data.
    weights : dict
        Model parameters computed during fitting.

    Notes
    -----
    The predictor implements scikit-learn's BaseEstimator and TransformerMixin
    interfaces for compatibility with scikit-learn pipelines.
    """

    def predict(self, prediction_context: Dict, **kwargs) -> Dict:
        """Predict the number of admissions for the given context using parametric curves.

        Parameters
        ----------
        prediction_context : dict
            A dictionary defining the context for which predictions are to be made.
            It should specify either a general context or one based on the applied filters.
        **kwargs
            Additional keyword arguments for parametric curve configuration:

            x1 : float
                The x-coordinate of the first transition point on the aspirational curve,
                where the growth phase ends and the decay phase begins.
            y1 : float
                The y-coordinate of the first transition point (x1), representing the target
                proportion of patients admitted by time x1.
            x2 : float
                The x-coordinate of the second transition point on the curve, beyond which
                all but a few patients are expected to be admitted.
            y2 : float
                The y-coordinate of the second transition point (x2), representing the target
                proportion of patients admitted by time x2.

        Returns
        -------
        dict
            A dictionary with predictions for each specified context.

        Raises
        ------
        ValueError
            If filter key is not recognized or prediction_time is not provided.
        KeyError
            If required keys are missing from the prediction context.
        """
        # Extract required parameters from kwargs
        x1 = kwargs.get("x1")
        y1 = kwargs.get("y1")
        x2 = kwargs.get("x2")
        y2 = kwargs.get("y2")

        # Validate that required parameters are provided
        if x1 is None or y1 is None or x2 is None or y2 is None:
            raise ValueError(
                "x1, y1, x2, and y2 parameters are required for parametric prediction"
            )

        predictions = {}

        # Calculate Ntimes
        if isinstance(self.prediction_window, timedelta) and isinstance(
            self.yta_time_interval, timedelta
        ):
            NTimes = int(self.prediction_window / self.yta_time_interval)
        elif isinstance(self.prediction_window, timedelta):
            NTimes = int(
                self.prediction_window.total_seconds() / 60 / self.yta_time_interval
            )
        elif isinstance(self.yta_time_interval, timedelta):
            NTimes = int(
                self.prediction_window / (self.yta_time_interval.total_seconds() / 60)
            )
        else:
            NTimes = int(self.prediction_window / self.yta_time_interval)

        # Convert to hours only for numpy operations (which require numeric types)
        prediction_window_hours = (
            self.prediction_window.total_seconds() / 3600
            if isinstance(self.prediction_window, timedelta)
            else self.prediction_window / 60
        )
        yta_time_interval_hours = (
            self.yta_time_interval.total_seconds() / 3600
            if isinstance(self.yta_time_interval, timedelta)
            else self.yta_time_interval / 60
        )

        # Calculate theta, probability of admission in prediction window
        # for each time interval, calculate time remaining before end of window
        time_remaining_before_end_of_window = prediction_window_hours - np.arange(
            0, prediction_window_hours, yta_time_interval_hours
        )

        theta = get_y_from_aspirational_curve(
            time_remaining_before_end_of_window, x1, y1, x2, y2
        )

        for filter_key, filter_values in prediction_context.items():
            try:
                if filter_key not in self.weights:
                    raise ValueError(
                        f"Filter key '{filter_key}' is not recognized in the model weights."
                    )

                prediction_time = filter_values.get("prediction_time")
                if prediction_time is None:
                    raise ValueError(
                        f"No 'prediction_time' provided for filter '{filter_key}'."
                    )

                if prediction_time not in self.prediction_times:
                    prediction_time = find_nearest_previous_prediction_time(
                        prediction_time, self.prediction_times
                    )

                arrival_rates = self.weights[filter_key][prediction_time].get(
                    "arrival_rates"
                )
                if arrival_rates is None:
                    raise ValueError(
                        f"No arrival_rates found for the time of day '{prediction_time}' under filter '{filter_key}'."
                    )

                predictions[filter_key] = poisson_binom_generating_function(
                    NTimes, arrival_rates, theta, self.epsilon
                )

            except KeyError as e:
                raise KeyError(f"Key error occurred: {e!s}")

        return predictions
predict(prediction_context, **kwargs)

Predict the number of admissions for the given context using parametric curves.

Parameters:

Name Type Description Default
prediction_context dict

A dictionary defining the context for which predictions are to be made. It should specify either a general context or one based on the applied filters.

required
**kwargs

Additional keyword arguments for parametric curve configuration:

x1 : float
    The x-coordinate of the first transition point on the aspirational curve, where the growth phase ends and the decay phase begins.
y1 : float
    The y-coordinate of the first transition point (x1), representing the target proportion of patients admitted by time x1.
x2 : float
    The x-coordinate of the second transition point on the curve, beyond which all but a few patients are expected to be admitted.
y2 : float
    The y-coordinate of the second transition point (x2), representing the target proportion of patients admitted by time x2.

{}

Returns:

Type Description
dict

A dictionary with predictions for each specified context.

Raises:

Type Description
ValueError

If filter key is not recognized or prediction_time is not provided.

KeyError

If required keys are missing from the prediction context.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def predict(self, prediction_context: Dict, **kwargs) -> Dict:
    """Predict the number of admissions for the given context using parametric curves.

    Parameters
    ----------
    prediction_context : dict
        A dictionary defining the context for which predictions are to be made.
        It should specify either a general context or one based on the applied filters.
    **kwargs
        Additional keyword arguments for parametric curve configuration:

        x1 : float
            The x-coordinate of the first transition point on the aspirational curve,
            where the growth phase ends and the decay phase begins.
        y1 : float
            The y-coordinate of the first transition point (x1), representing the target
            proportion of patients admitted by time x1.
        x2 : float
            The x-coordinate of the second transition point on the curve, beyond which
            all but a few patients are expected to be admitted.
        y2 : float
            The y-coordinate of the second transition point (x2), representing the target
            proportion of patients admitted by time x2.

    Returns
    -------
    dict
        A dictionary with predictions for each specified context.

    Raises
    ------
    ValueError
        If filter key is not recognized or prediction_time is not provided.
    KeyError
        If required keys are missing from the prediction context.
    """
    # Extract required parameters from kwargs
    x1 = kwargs.get("x1")
    y1 = kwargs.get("y1")
    x2 = kwargs.get("x2")
    y2 = kwargs.get("y2")

    # Validate that required parameters are provided
    if x1 is None or y1 is None or x2 is None or y2 is None:
        raise ValueError(
            "x1, y1, x2, and y2 parameters are required for parametric prediction"
        )

    predictions = {}

    # Calculate NTimes, the number of time intervals in the prediction window
    if isinstance(self.prediction_window, timedelta) and isinstance(
        self.yta_time_interval, timedelta
    ):
        NTimes = int(self.prediction_window / self.yta_time_interval)
    elif isinstance(self.prediction_window, timedelta):
        NTimes = int(
            self.prediction_window.total_seconds() / 60 / self.yta_time_interval
        )
    elif isinstance(self.yta_time_interval, timedelta):
        NTimes = int(
            self.prediction_window / (self.yta_time_interval.total_seconds() / 60)
        )
    else:
        NTimes = int(self.prediction_window / self.yta_time_interval)

    # Convert to hours only for numpy operations (which require numeric types)
    prediction_window_hours = (
        self.prediction_window.total_seconds() / 3600
        if isinstance(self.prediction_window, timedelta)
        else self.prediction_window / 60
    )
    yta_time_interval_hours = (
        self.yta_time_interval.total_seconds() / 3600
        if isinstance(self.yta_time_interval, timedelta)
        else self.yta_time_interval / 60
    )

    # Calculate theta, probability of admission in prediction window
    # for each time interval, calculate time remaining before end of window
    time_remaining_before_end_of_window = prediction_window_hours - np.arange(
        0, prediction_window_hours, yta_time_interval_hours
    )

    theta = get_y_from_aspirational_curve(
        time_remaining_before_end_of_window, x1, y1, x2, y2
    )

    for filter_key, filter_values in prediction_context.items():
        try:
            if filter_key not in self.weights:
                raise ValueError(
                    f"Filter key '{filter_key}' is not recognized in the model weights."
                )

            prediction_time = filter_values.get("prediction_time")
            if prediction_time is None:
                raise ValueError(
                    f"No 'prediction_time' provided for filter '{filter_key}'."
                )

            if prediction_time not in self.prediction_times:
                prediction_time = find_nearest_previous_prediction_time(
                    prediction_time, self.prediction_times
                )

            arrival_rates = self.weights[filter_key][prediction_time].get(
                "arrival_rates"
            )
            if arrival_rates is None:
                raise ValueError(
                    f"No arrival_rates found for the time of day '{prediction_time}' under filter '{filter_key}'."
                )

            predictions[filter_key] = poisson_binom_generating_function(
                NTimes, arrival_rates, theta, self.epsilon
            )

        except KeyError as e:
            raise KeyError(f"Key error occurred: {e!s}")

    return predictions
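
The sketch below shows one way this method might be called. It is illustrative only: `predictor` is assumed to be an already-fitted instance of the parametric admission predictor documented above, and the 'unfettered' filter key and (15, 30) prediction time are hypothetical values that must match what the model was trained with.

# Hypothetical usage sketch: `predictor` is assumed to be a fitted instance of
# the parametric predictor; the 'unfettered' filter key and (15, 30) prediction
# time are illustrative and must exist in the fitted model's weights.
prediction_context = {"unfettered": {"prediction_time": (15, 30)}}

predictions = predictor.predict(
    prediction_context,
    x1=4, y1=0.76,   # aim for 76% of patients admitted within 4 hours
    x2=12, y2=0.99,  # and 99% within 12 hours
)

# Each value is a DataFrame indexed by 'sum' with an 'agg_proba' column
print(predictions["unfettered"].head())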

aggregate_probabilities(lam, kmax, theta, time_index)

Aggregate probabilities for a range of values using the weighted Poisson-Binomial distribution.

Parameters:

Name Type Description Default
lam ndarray

An array of lambda values for each time interval.

required
kmax int

The maximum number of events to consider.

required
theta ndarray

An array of theta values for each time interval.

required
time_index int

The current time index for which to calculate probabilities.

required

Returns:

Type Description
ndarray

Aggregated probabilities for the given time index.

Raises:

Type Description
ValueError

If kmax < 0, time_index < 0, or array lengths are invalid.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def aggregate_probabilities(lam, kmax, theta, time_index):
    """Aggregate probabilities for a range of values using the weighted Poisson-Binomial distribution.

    Parameters
    ----------
    lam : numpy.ndarray
        An array of lambda values for each time interval.
    kmax : int
        The maximum number of events to consider.
    theta : numpy.ndarray
        An array of theta values for each time interval.
    time_index : int
        The current time index for which to calculate probabilities.

    Returns
    -------
    numpy.ndarray
        Aggregated probabilities for the given time index.

    Raises
    ------
    ValueError
        If kmax < 0, time_index < 0, or array lengths are invalid.
    """
    if kmax < 0 or time_index < 0 or len(lam) <= time_index or len(theta) <= time_index:
        raise ValueError("Invalid kmax, time_index, or array lengths.")

    probabilities_matrix = np.zeros((kmax + 1, kmax + 1))
    for i in range(kmax + 1):
        probabilities_matrix[: i + 1, i] = weighted_poisson_binomial(
            i, lam[time_index], theta[time_index]
        )
    return probabilities_matrix.sum(axis=1)
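
A minimal sketch of calling this function directly; the arrival rates and admission probabilities below are made-up values, not taken from the package:

import numpy as np
from patientflow.predictors.incoming_admission_predictors import (
    aggregate_probabilities,
)

lam = np.array([3.0, 2.5])    # Poisson arrival rate per time interval (illustrative)
theta = np.array([0.8, 0.6])  # admission probability per time interval (illustrative)
kmax = 10                     # consider up to 10 arrivals

# Distribution over 0..kmax admitted patients arising from time interval 0
probs = aggregate_probabilities(lam, kmax, theta, time_index=0)
assert probs.shape == (kmax + 1,)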

convolute_distributions(dist_a, dist_b)

Convolutes two probability distributions represented as dataframes.

Parameters:

Name Type Description Default
dist_a DataFrame

The first distribution with columns ['sum', 'prob'].

required
dist_b DataFrame

The second distribution with columns ['sum', 'prob'].

required

Returns:

Type Description
DataFrame

The convoluted distribution.

Raises:

Type Description
ValueError

If DataFrames do not contain required 'sum' and 'prob' columns.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def convolute_distributions(dist_a, dist_b):
    """Convolutes two probability distributions represented as dataframes.

    Parameters
    ----------
    dist_a : pd.DataFrame
        The first distribution with columns ['sum', 'prob'].
    dist_b : pd.DataFrame
        The second distribution with columns ['sum', 'prob'].

    Returns
    -------
    pd.DataFrame
        The convoluted distribution.

    Raises
    ------
    ValueError
        If DataFrames do not contain required 'sum' and 'prob' columns.
    """
    if not {"sum", "prob"}.issubset(dist_a.columns) or not {
        "sum",
        "prob",
    }.issubset(dist_b.columns):
        raise ValueError("DataFrames must contain 'sum' and 'prob' columns.")

    sums = [x + y for x in dist_a["sum"] for y in dist_b["sum"]]
    probs = [x * y for x in dist_a["prob"] for y in dist_b["prob"]]
    result = pd.DataFrame(zip(sums, probs), columns=["sum", "prob"])
    return result.groupby("sum")["prob"].sum().reset_index()
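
A small worked example, using two toy distributions over counts (the values are illustrative):

import pandas as pd
from patientflow.predictors.incoming_admission_predictors import (
    convolute_distributions,
)

dist_a = pd.DataFrame({"sum": [0, 1], "prob": [0.5, 0.5]})
dist_b = pd.DataFrame({"sum": [0, 1], "prob": [0.9, 0.1]})

combined = convolute_distributions(dist_a, dist_b)
#    sum  prob
# 0    0  0.45   (0.5 * 0.9)
# 1    1  0.50   (0.5 * 0.1 + 0.5 * 0.9)
# 2    2  0.05   (0.5 * 0.1)
print(combined)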

find_nearest_previous_prediction_time(requested_time, prediction_times)

Find the nearest previous time of day in prediction_times relative to requested time.

Parameters:

Name Type Description Default
requested_time tuple

The requested time as (hour, minute).

required
prediction_times list

List of available prediction times.

required

Returns:

Type Description
tuple

The closest previous time of day from prediction_times.

Notes

If the requested time is earlier than all times in prediction_times, returns the latest time in prediction_times.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def find_nearest_previous_prediction_time(requested_time, prediction_times):
    """Find the nearest previous time of day in prediction_times relative to requested time.

    Parameters
    ----------
    requested_time : tuple
        The requested time as (hour, minute).
    prediction_times : list
        List of available prediction times.

    Returns
    -------
    tuple
        The closest previous time of day from prediction_times.

    Notes
    -----
    If the requested time is earlier than all times in prediction_times,
    returns the latest time in prediction_times.
    """
    if requested_time in prediction_times:
        return requested_time

    original_prediction_time = requested_time
    requested_datetime = datetime.strptime(
        f"{requested_time[0]:02d}:{requested_time[1]:02d}", "%H:%M"
    )
    closest_prediction_time = max(
        prediction_times,
        key=lambda prediction_time_time: datetime.strptime(
            f"{prediction_time_time[0]:02d}:{prediction_time_time[1]:02d}",
            "%H:%M",
        ),
    )
    min_diff = float("inf")

    for prediction_time_time in prediction_times:
        prediction_time_datetime = datetime.strptime(
            f"{prediction_time_time[0]:02d}:{prediction_time_time[1]:02d}",
            "%H:%M",
        )
        diff = (requested_datetime - prediction_time_datetime).total_seconds()

        # If the difference is negative, it means the prediction_time_time is ahead of the requested_time,
        # hence we calculate the difference by considering a day's wrap around.
        if diff < 0:
            diff += 24 * 60 * 60  # Add 24 hours in seconds

        if 0 <= diff < min_diff:
            closest_prediction_time = prediction_time_time
            min_diff = diff

    warnings.warn(
        f"Time of day requested of {original_prediction_time} was not in model training. "
        f"Reverting to predictions for {closest_prediction_time}."
    )

    return closest_prediction_time
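
A short sketch of the matching behaviour, using an illustrative list of prediction times:

from patientflow.predictors.incoming_admission_predictors import (
    find_nearest_previous_prediction_time,
)

prediction_times = [(6, 0), (9, 30), (12, 0), (15, 30), (22, 0)]  # illustrative

find_nearest_previous_prediction_time((12, 0), prediction_times)   # exact match: (12, 0)
find_nearest_previous_prediction_time((14, 45), prediction_times)  # warns, returns (12, 0)
find_nearest_previous_prediction_time((5, 0), prediction_times)    # wraps to latest: (22, 0)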

poisson_binom_generating_function(NTimes, arrival_rates, theta, epsilon)

Generate a distribution based on the aggregate of Poisson and Binomial distributions.

Parameters:

Name Type Description Default
NTimes int

The number of time intervals.

required
arrival_rates ndarray

An array of lambda values for each time interval.

required
theta ndarray

An array of theta values for each time interval.

required
epsilon float

The desired error threshold.

required

Returns:

Type Description
DataFrame

The generated distribution.

Raises:

Type Description
ValueError

If NTimes <= 0 or epsilon is not between 0 and 1.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def poisson_binom_generating_function(NTimes, arrival_rates, theta, epsilon):
    """Generate a distribution based on the aggregate of Poisson and Binomial distributions.

    Parameters
    ----------
    NTimes : int
        The number of time intervals.
    arrival_rates : numpy.ndarray
        An array of lambda values for each time interval.
    theta : numpy.ndarray
        An array of theta values for each time interval.
    epsilon : float
        The desired error threshold.

    Returns
    -------
    pd.DataFrame
        The generated distribution.

    Raises
    ------
    ValueError
        If NTimes <= 0 or epsilon is not between 0 and 1.
    """

    if NTimes <= 0 or epsilon <= 0 or epsilon >= 1:
        raise ValueError("Ensure NTimes > 0 and 0 < epsilon < 1.")

    maxlam = max(arrival_rates)
    kmax = int(poisson.ppf(1 - epsilon, maxlam))
    distribution = np.zeros((kmax + 1, NTimes))

    for j in range(NTimes):
        distribution[:, j] = aggregate_probabilities(arrival_rates, kmax, theta, j)

    df_list = [
        pd.DataFrame({"sum": range(kmax + 1), "prob": distribution[:, j]})
        for j in range(NTimes)
    ]
    total_distribution = df_list[0]

    for df in df_list[1:]:
        total_distribution = convolute_distributions(total_distribution, df)

    total_distribution = total_distribution.rename(
        columns={"prob": "agg_proba"}
    ).set_index("sum")

    return total_distribution
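
A minimal sketch of generating the aggregate distribution directly; the rates, probabilities, and error threshold are illustrative:

import numpy as np
from patientflow.predictors.incoming_admission_predictors import (
    poisson_binom_generating_function,
)

NTimes = 4
arrival_rates = np.array([2.0, 3.0, 2.5, 1.5])  # per-interval Poisson rates (illustrative)
theta = np.array([0.9, 0.7, 0.5, 0.3])          # per-interval admission probabilities

dist = poisson_binom_generating_function(NTimes, arrival_rates, theta, epsilon=1e-7)
print(dist.head())              # DataFrame indexed by 'sum' with column 'agg_proba'
print(dist["agg_proba"].sum())  # close to 1 (the tail beyond kmax is truncated)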

weighted_poisson_binomial(i, lam, theta)

Calculate weighted probabilities using Poisson and Binomial distributions.

Parameters:

Name Type Description Default
i int

The upper bound of the range for the binomial distribution.

required
lam float

The lambda parameter for the Poisson distribution.

required
theta float

The probability of success for the binomial distribution.

required

Returns:

Type Description
ndarray

An array of weighted probabilities.

Raises:

Type Description
ValueError

If i < 0, lam < 0, or theta is not between 0 and 1.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def weighted_poisson_binomial(i, lam, theta):
    """Calculate weighted probabilities using Poisson and Binomial distributions.

    Parameters
    ----------
    i : int
        The upper bound of the range for the binomial distribution.
    lam : float
        The lambda parameter for the Poisson distribution.
    theta : float
        The probability of success for the binomial distribution.

    Returns
    -------
    numpy.ndarray
        An array of weighted probabilities.

    Raises
    ------
    ValueError
        If i < 0, lam < 0, or theta is not between 0 and 1.
    """
    if i < 0 or lam < 0 or not 0 <= theta <= 1:
        raise ValueError("Ensure i >= 0, lam >= 0, and 0 <= theta <= 1.")

    arr_seq = np.arange(i + 1)
    probabilities = binom.pmf(arr_seq, i, theta)
    return poisson.pmf(i, lam) * probabilities
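
A quick sanity check of what the function computes, with illustrative parameter values:

import numpy as np
from scipy.stats import poisson
from patientflow.predictors.incoming_admission_predictors import (
    weighted_poisson_binomial,
)

# P(exactly 3 arrivals at rate 2.0), split over k = 0..3 of those arrivals
# being admitted within the window with probability 0.6 each
w = weighted_poisson_binomial(3, 2.0, 0.6)

# The weights sum to the Poisson probability of exactly 3 arrivals
assert np.isclose(w.sum(), poisson.pmf(3, 2.0))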

sequence_to_outcome_predictor

This module implements a SequenceToOutcomePredictor class that models and predicts the probability distribution of sequences in categorical data. The class builds a model based on training data, where input sequences are mapped to specific outcome categories. It provides methods to fit the model, compute sequence-based probabilities, and make predictions on an unseen dataset of input sequences.

Classes:

Name Description
SequenceToOutcomePredictor : sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

A model that predicts the probability of ending in different outcome categories based on input sequences. Note: All sequence inputs are expected to be tuples. Lists will be automatically converted to tuples, and None values will be converted to empty tuples.

SequenceToOutcomePredictor

Bases: BaseEstimator, TransformerMixin

A class to model sequence-based predictions for categorical data using input and grouping sequences. This class implements both the fit and predict methods from the parent sklearn classes.

Parameters:

Name Type Description Default
input_var str

Name of the column representing the input sequence in the DataFrame.

required
grouping_var str

Name of the column representing the grouping sequence in the DataFrame.

required
outcome_var str

Name of the column representing the outcome category in the DataFrame.

required
apply_special_category_filtering bool

Whether to filter out special categories of patients before fitting the model.

True
admit_col str

Name of the column indicating whether a patient was admitted.

'is_admitted'

Attributes:

Name Type Description
weights dict

A dictionary storing the probabilities of different input sequences leading to specific outcome categories.

input_to_grouping_probs DataFrame

A DataFrame that stores the computed probabilities of input sequences being associated with different grouping sequences.

special_params (dict, optional)

The special category parameters used for filtering, only populated if apply_special_category_filtering=True.

metrics dict

A dictionary to store metrics related to the training process.

Source code in src/patientflow/predictors/sequence_to_outcome_predictor.py
class SequenceToOutcomePredictor(BaseEstimator, TransformerMixin):
    """
    A class to model sequence-based predictions for categorical data using input and grouping sequences.
    This class implements both the `fit` and `predict` methods from the parent sklearn classes.

    Parameters
    ----------
    input_var : str
        Name of the column representing the input sequence in the DataFrame.
    grouping_var : str
        Name of the column representing the grouping sequence in the DataFrame.
    outcome_var : str
        Name of the column representing the outcome category in the DataFrame.
    apply_special_category_filtering : bool, default=True
        Whether to filter out special categories of patients before fitting the model.
    admit_col : str, default='is_admitted'
        Name of the column indicating whether a patient was admitted.

    Attributes
    ----------
    weights : dict
        A dictionary storing the probabilities of different input sequences leading to specific outcome categories.
    input_to_grouping_probs : pd.DataFrame
        A DataFrame that stores the computed probabilities of input sequences being associated with different grouping sequences.
    special_params : dict, optional
        The special category parameters used for filtering, only populated if apply_special_category_filtering=True.
    metrics : dict
        A dictionary to store metrics related to the training process.
    """

    def __init__(
        self,
        input_var,
        grouping_var,
        outcome_var,
        apply_special_category_filtering=True,
        admit_col="is_admitted",
    ):
        self.input_var = input_var
        self.grouping_var = grouping_var
        self.outcome_var = outcome_var
        self.apply_special_category_filtering = apply_special_category_filtering
        self.admit_col = admit_col
        self.weights = None
        self.special_params = None
        self.metrics = {}

    def __repr__(self):
        """Return a string representation of the estimator."""
        class_name = self.__class__.__name__
        return (
            f"{class_name}(\n"
            f"    input_var='{self.input_var}',\n"
            f"    grouping_var='{self.grouping_var}',\n"
            f"    outcome_var='{self.outcome_var}',\n"
            f"    apply_special_category_filtering={self.apply_special_category_filtering},\n"
            f"    admit_col='{self.admit_col}'\n"
            f")"
        )

    def _ensure_tuple(self, sequence):
        """
        Convert a sequence to tuple if it's not already a tuple.
        Handles string cleaning to avoid double-quoting issues.

        Parameters
        ----------
        sequence : tuple, list, or None
            The sequence to convert

        Returns
        -------
        tuple
            The input sequence as a tuple, or an empty tuple if input was None
        """
        if sequence is None:
            return ()
        if isinstance(sequence, (list, pd.Series)):
            # Clean any quoted strings in the sequence
            cleaned_sequence = [
                ast.literal_eval(item)
                if isinstance(item, str) and item.startswith("'") and item.endswith("'")
                else item
                for item in sequence
            ]
            return tuple(cleaned_sequence) if cleaned_sequence else ()
        if isinstance(sequence, tuple):
            # Clean any quoted strings in the tuple
            return tuple(
                ast.literal_eval(item)
                if isinstance(item, str) and item.startswith("'") and item.endswith("'")
                else item
                for item in sequence
            )
        return sequence

    def _preprocess_data(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Preprocesses the input data before fitting the model.

        Steps include:
        1. Selecting only admitted patients with a non-null specialty
        2. Optionally filtering out special categories
        3. Converting sequence columns to tuple format if they aren't already

        Parameters
        ----------
        X : pd.DataFrame
            DataFrame containing patient data.

        Returns
        -------
        pd.DataFrame
            Preprocessed DataFrame ready for model fitting.
        """
        # Make a copy to avoid modifying the original
        df = X.copy()

        # Step 1: Select only admitted patients with a non-null specialty
        if self.admit_col in df.columns:
            df = df[df[self.admit_col] & ~df[self.outcome_var].isnull()]

        # Step 2: Optionally apply filtering for special categories
        if self.apply_special_category_filtering:
            # Get configuration for categorizing patients based on columns
            self.special_params = create_special_category_objects(df.columns)

            # Extract function that identifies non-special category patients
            opposite_special_category_func = self.special_params["special_func_map"][
                "default"
            ]

            # Determine which category is the special category
            special_category_key = next(
                key
                for key, value in self.special_params["special_category_dict"].items()
                if value == 1.0
            )

            # Filter out special category patients
            df = df[
                df.apply(opposite_special_category_func, axis=1)
                & (df[self.outcome_var] != special_category_key)
            ]

        # Step 3: Convert sequence columns to tuple format
        if self.input_var in df.columns:
            df[self.input_var] = df[self.input_var].apply(self._ensure_tuple)

        if self.grouping_var in df.columns:
            df[self.grouping_var] = df[self.grouping_var].apply(self._ensure_tuple)

        return df

    def fit(self, X: pd.DataFrame) -> "SequenceToOutcomePredictor":
        """
        Fits the predictor based on training data by computing the proportion of each input variable sequence
        ending in specific outcome variable categories.

        Automatically preprocesses the data before fitting.

        Parameters
        ----------
        X : pd.DataFrame
            A pandas DataFrame containing at least the columns specified by `input_var`, `grouping_var`, and `outcome_var`.

        Returns
        -------
        self : SequenceToOutcomePredictor
            The fitted SequenceToOutcomePredictor model with calculated probabilities for each sequence.
        """
        # Store metrics about the training data
        self.metrics["train_dttm"] = datetime.now().strftime("%Y-%m-%d %H:%M")
        self.metrics["train_set_no"] = len(X)
        if not X.empty:
            self.metrics["start_date"] = X["snapshot_date"].min()
            self.metrics["end_date"] = X["snapshot_date"].max()

        # Preprocess the data
        X = self._preprocess_data(X)

        # derive the names of the observed outcome variables from the data
        prop_keys = X[self.outcome_var].unique()

        # For each sequence count the number of observed categories
        X_grouped = (
            X.groupby(self.grouping_var)[self.outcome_var]
            .value_counts()
            .unstack(fill_value=0)
        )

        # Calculate the total number of times each grouping sequence occurred
        row_totals = X_grouped.sum(axis=1)

        # Calculate, for each grouping sequence, the proportion ending with each observed specialty
        proportions = X_grouped.div(row_totals, axis=0)

        # Calculate the probability of each grouping sequence occurring in the original data
        proportions["probability_of_grouping_sequence"] = row_totals / row_totals.sum()

        # Reweight probabilities of ending with each observed specialty
        # by the likelihood of each grouping sequence occurring
        for col in proportions.columns[
            :-1
        ]:  # Avoid the last column which is the 'probability_of_grouping_sequence'
            proportions[col] *= proportions["probability_of_grouping_sequence"]

        # Convert final sequence to a string in order to conduct string searches on it
        proportions["grouping_sequence_to_string"] = proportions.index.map(
            lambda x: "-".join(map(str, x))
        )

        # Row-wise function to return, for each input sequence,
        # the proportion that end up in each final sequence and thereby
        # the probability of it ending in any observed category
        proportions["prob_input_var_ends_in_observed_specialty"] = proportions[
            "grouping_sequence_to_string"
        ].apply(lambda x: self._string_match_input_var(x, proportions, prop_keys))

        # Convert the prob_input_var_ends_in_observed_specialty column to a dictionary
        result_dict = proportions["prob_input_var_ends_in_observed_specialty"].to_dict()

        # Clean the key to remove excess string quotes
        def clean_tuple_key(key):
            if isinstance(key, tuple):
                return tuple(
                    ast.literal_eval(item)
                    if item.startswith("'") and item.endswith("'")
                    else item
                    for item in key
                )
            return key

        cleaned_dict = {clean_tuple_key(k): v for k, v in result_dict.items()}

        # save prob_input_var_ends_in_observed_specialty as weights within the model
        self.weights = cleaned_dict

        # save the input to grouping probabilities for use as a reference
        self.input_to_grouping_probs = self._probability_of_input_to_grouping_sequence(
            X
        )

        return self

    def _string_match_input_var(self, input_var_string, proportions, prop_keys):
        """
        Matches a given input sequence string with grouped sequences (expressed as strings) in the dataset and aggregates
        their probabilities for each outcome category. This function filters the data to
        match only those rows where the *beginning* of the grouped sequence string
        matches the given input sequence string, allowing for partial matches.
        For instance, the sequence 'medical' will match 'medical, elderly' and 'medical, surgical'
        as well as 'medical' on its own. It computes the total probabilities of any input sequence ending
        in each outcome category, and normalizes these totals if possible.

        Parameters
        ----------
        input_var_string : str
            The sequence of inputs represented as a string, used to match against sequences in the proportions DataFrame.
        proportions : pd.DataFrame
            DataFrame containing proportions data with an additional column 'grouping_sequence_to_string'
            which includes string representations of sequences.
        prop_keys : np.array
            Array of unique outcomes to consider in calculations.

        Returns
        -------
        dict
            A dictionary where keys are outcome names and values are the aggregated and normalized probabilities
            of an input sequence ending in those outcomes.

        """
        # Filter rows where the grouped sequence string starts with the input sequence string
        props = proportions[
            proportions["grouping_sequence_to_string"].str.match("^" + input_var_string)
        ][prop_keys].sum()

        # Sum of all probabilities to normalize them
        props_total = props.sum()

        # Handle cases where the total probability is zero to avoid division by zero
        if props_total > 0:
            normalized_props = props / props_total
        else:
            normalized_props = (
                props * 0
            )  # Returns zero probabilities if no matches found

        return dict(zip(prop_keys, normalized_props))

    def _probability_of_input_to_grouping_sequence(self, X):
        """
        Computes the probabilities of different input sequences leading to specific grouping sequences.

        Parameters
        ----------
        X : pd.DataFrame
            A pandas DataFrame containing at least the columns specified by `input_var` and `grouping_var`.

        Returns
        -------
        pd.DataFrame
            A DataFrame containing the probabilities of input sequences leading to grouping sequences.
        """
        # For each input sequence count the number of grouping sequences
        X_grouped = (
            X.groupby(self.input_var)[self.grouping_var]
            .value_counts()
            .unstack(fill_value=0)
        )

        # Calculate the total number of times each input sequence occurred
        row_totals = X_grouped.sum(axis=1)

        # Calculate, for each input sequence, the proportion ending with each grouping sequence
        proportions = X_grouped.div(row_totals, axis=0)

        # Calculate the probability of each input sequence occurring in the original data
        proportions["probability_of_grouping_sequence"] = row_totals / row_totals.sum()

        return proportions

    def predict(self, input_sequence: tuple[str, ...]) -> Dict[str, float]:
        """
        Predicts the probabilities of ending in various outcome categories for a given input sequence.

        Parameters
        ----------
        input_sequence : tuple[str, ...]
            A tuple containing the categories that have been observed for an entity in the order they
            have been encountered. An empty tuple represents an entity with no observed categories.

        Returns
        -------
        dict
            A dictionary of categories and the probabilities that the input sequence will end in them.
        """
        input_sequence = self._ensure_tuple(input_sequence)

        if input_sequence is None or pd.isna(input_sequence):
            return self.weights.get(tuple(), {})

        # Return a direct lookup of probabilities if possible.
        if input_sequence in self.weights:
            return self.weights[input_sequence]

        # Otherwise, if the sequence has multiple elements, work back looking for a match
        while len(input_sequence) > 1:
            input_sequence_list = list(input_sequence)
            input_sequence = tuple(input_sequence_list[:-1])  # remove last element

            if input_sequence in self.weights:
                return self.weights[input_sequence]

        # If no relevant data is found:
        return self.weights.get(tuple(), {})
__repr__()

Return a string representation of the estimator.

Source code in src/patientflow/predictors/sequence_to_outcome_predictor.py
def __repr__(self):
    """Return a string representation of the estimator."""
    class_name = self.__class__.__name__
    return (
        f"{class_name}(\n"
        f"    input_var='{self.input_var}',\n"
        f"    grouping_var='{self.grouping_var}',\n"
        f"    outcome_var='{self.outcome_var}',\n"
        f"    apply_special_category_filtering={self.apply_special_category_filtering},\n"
        f"    admit_col='{self.admit_col}'\n"
        f")"
    )
fit(X)

Fits the predictor based on training data by computing the proportion of each input variable sequence ending in specific outcome variable categories.

Automatically preprocesses the data before fitting.

Parameters:

Name Type Description Default
X DataFrame

A pandas DataFrame containing at least the columns specified by input_var, grouping_var, and outcome_var.

required

Returns:

Name Type Description
self SequenceToOutcomePredictor

The fitted SequenceToOutcomePredictor model with calculated probabilities for each sequence.

Source code in src/patientflow/predictors/sequence_to_outcome_predictor.py
def fit(self, X: pd.DataFrame) -> "SequenceToOutcomePredictor":
    """
    Fits the predictor based on training data by computing the proportion of each input variable sequence
    ending in specific outcome variable categories.

    Automatically preprocesses the data before fitting.

    Parameters
    ----------
    X : pd.DataFrame
        A pandas DataFrame containing at least the columns specified by `input_var`, `grouping_var`, and `outcome_var`.

    Returns
    -------
    self : SequenceToOutcomePredictor
        The fitted SequenceToOutcomePredictor model with calculated probabilities for each sequence.
    """
    # Store metrics about the training data
    self.metrics["train_dttm"] = datetime.now().strftime("%Y-%m-%d %H:%M")
    self.metrics["train_set_no"] = len(X)
    if not X.empty:
        self.metrics["start_date"] = X["snapshot_date"].min()
        self.metrics["end_date"] = X["snapshot_date"].max()

    # Preprocess the data
    X = self._preprocess_data(X)

    # derive the names of the observed outcome variables from the data
    prop_keys = X[self.outcome_var].unique()

    # For each sequence count the number of observed categories
    X_grouped = (
        X.groupby(self.grouping_var)[self.outcome_var]
        .value_counts()
        .unstack(fill_value=0)
    )

    # Calculate the total number of times each grouping sequence occurred
    row_totals = X_grouped.sum(axis=1)

    # Calculate, for each grouping sequence, the proportion ending with each observed specialty
    proportions = X_grouped.div(row_totals, axis=0)

    # Calculate the probability of each grouping sequence occurring in the original data
    proportions["probability_of_grouping_sequence"] = row_totals / row_totals.sum()

    # Reweight probabilities of ending with each observed specialty
    # by the likelihood of each grouping sequence occurring
    for col in proportions.columns[
        :-1
    ]:  # Avoid the last column which is the 'probability_of_grouping_sequence'
        proportions[col] *= proportions["probability_of_grouping_sequence"]

    # Convert final sequence to a string in order to conduct string searches on it
    proportions["grouping_sequence_to_string"] = proportions.index.map(
        lambda x: "-".join(map(str, x))
    )

    # Row-wise function to return, for each input sequence,
    # the proportion that end up in each final sequence and thereby
    # the probability of it ending in any observed category
    proportions["prob_input_var_ends_in_observed_specialty"] = proportions[
        "grouping_sequence_to_string"
    ].apply(lambda x: self._string_match_input_var(x, proportions, prop_keys))

    # Convert the prob_input_var_ends_in_observed_specialty column to a dictionary
    result_dict = proportions["prob_input_var_ends_in_observed_specialty"].to_dict()

    # Clean the key to remove excess string quotes
    def clean_tuple_key(key):
        if isinstance(key, tuple):
            return tuple(
                ast.literal_eval(item)
                if item.startswith("'") and item.endswith("'")
                else item
                for item in key
            )
        return key

    cleaned_dict = {clean_tuple_key(k): v for k, v in result_dict.items()}

    # save prob_input_var_ends_in_observed_specialty as weights within the model
    self.weights = cleaned_dict

    # save the input to grouping probabilities for use as a reference
    self.input_to_grouping_probs = self._probability_of_input_to_grouping_sequence(
        X
    )

    return self
predict(input_sequence)

Predicts the probabilities of ending in various outcome categories for a given input sequence.

Parameters:

Name Type Description Default
input_sequence tuple[str, ...]

A tuple containing the categories that have been observed for an entity in the order they have been encountered. An empty tuple represents an entity with no observed categories.

required

Returns:

Type Description
dict

A dictionary of categories and the probabilities that the input sequence will end in them.

Source code in src/patientflow/predictors/sequence_to_outcome_predictor.py
def predict(self, input_sequence: tuple[str, ...]) -> Dict[str, float]:
    """
    Predicts the probabilities of ending in various outcome categories for a given input sequence.

    Parameters
    ----------
    input_sequence : tuple[str, ...]
        A tuple containing the categories that have been observed for an entity in the order they
        have been encountered. An empty tuple represents an entity with no observed categories.

    Returns
    -------
    dict
        A dictionary of categories and the probabilities that the input sequence will end in them.
    """
    input_sequence = self._ensure_tuple(input_sequence)

    if input_sequence is None or pd.isna(input_sequence):
        return self.weights.get(tuple(), {})

    # Return a direct lookup of probabilities if possible.
    if input_sequence in self.weights:
        return self.weights[input_sequence]

    # Otherwise, if the sequence has multiple elements, work back looking for a match
    while len(input_sequence) > 1:
        input_sequence_list = list(input_sequence)
        input_sequence = tuple(input_sequence_list[:-1])  # remove last element

        if input_sequence in self.weights:
            return self.weights[input_sequence]

    # If no relevant data is found:
    return self.weights.get(tuple(), {})
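
The end-to-end sketch below fits and queries the predictor on a toy DataFrame. It is illustrative only: the column names ("consult_sequence", "final_sequence", "specialty") and categories are hypothetical, and special-category filtering is disabled so the example does not depend on any site-specific columns.

import pandas as pd
from patientflow.predictors.sequence_to_outcome_predictor import (
    SequenceToOutcomePredictor,
)

# Toy training frame; column names and categories are illustrative only
df = pd.DataFrame({
    "snapshot_date": pd.to_datetime(["2024-01-01"] * 6),
    "consult_sequence": [("medical",), ("medical",), ("surgical",),
                         ("medical", "elderly"), ("surgical",), ()],
    "final_sequence": [("medical",), ("medical", "elderly"), ("surgical",),
                       ("medical", "elderly"), ("surgical",), ("medical",)],
    "specialty": ["medicine", "medicine", "surgery",
                  "medicine", "surgery", "medicine"],
    "is_admitted": [True] * 6,
})

model = SequenceToOutcomePredictor(
    input_var="consult_sequence",
    grouping_var="final_sequence",
    outcome_var="specialty",
    apply_special_category_filtering=False,  # avoid column-specific filtering
)
model.fit(df)

# Direct lookup for a sequence seen in training
print(model.predict(("medical",)))               # {'medicine': 1.0, 'surgery': 0.0}
# Unseen sequences fall back by truncating from the right
print(model.predict(("medical", "paediatric")))  # same result as ("medical",)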

value_to_outcome_predictor

This module implements a ValueToOutcomePredictor class that models and predicts the probability distribution of outcomes based on a single categorical input. The class builds a model based on training data, where input values are mapped to specific outcome categories through an intermediate grouping variable. It provides methods to fit the model, compute probabilities, and make predictions on unseen data.

Classes:

Name Description
ValueToOutcomePredictor : sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

A model that predicts the probability of ending in different outcome categories based on a single input value. Note: All inputs are expected to be strings. None values will be converted to empty strings during preprocessing.

ValueToOutcomePredictor

Bases: BaseEstimator, TransformerMixin

A class to model predictions for categorical data using a single input value and grouping variable. This class implements both the fit and predict methods from the parent sklearn classes.

Parameters:

Name Type Description Default
input_var str

Name of the column representing the input value in the DataFrame.

required
grouping_var str

Name of the column representing the grouping value in the DataFrame.

required
outcome_var str

Name of the column representing the outcome category in the DataFrame.

required
apply_special_category_filtering bool

Whether to filter out special categories of patients before fitting the model.

True
admit_col str

Name of the column indicating whether a patient was admitted.

'is_admitted'

Attributes:

Name Type Description
weights dict

A dictionary storing the probabilities of different input values leading to specific outcome categories.

input_to_grouping_probs DataFrame

A DataFrame that stores the computed probabilities of input values being associated with different grouping values.

special_params (dict, optional)

The special category parameters used for filtering, only populated if apply_special_category_filtering=True.

metrics dict

A dictionary to store metrics related to the training process.

Source code in src/patientflow/predictors/value_to_outcome_predictor.py
class ValueToOutcomePredictor(BaseEstimator, TransformerMixin):
    """
    A class to model predictions for categorical data using a single input value and grouping variable.
    This class implements both the `fit` and `predict` methods from the parent sklearn classes.

    Parameters
    ----------
    input_var : str
        Name of the column representing the input value in the DataFrame.
    grouping_var : str
        Name of the column representing the grouping value in the DataFrame.
    outcome_var : str
        Name of the column representing the outcome category in the DataFrame.
    apply_special_category_filtering : bool, default=True
        Whether to filter out special categories of patients before fitting the model.
    admit_col : str, default='is_admitted'
        Name of the column indicating whether a patient was admitted.

    Attributes
    ----------
    weights : dict
        A dictionary storing the probabilities of different input values leading to specific outcome categories.
    input_to_grouping_probs : pd.DataFrame
        A DataFrame that stores the computed probabilities of input values being associated with different grouping values.
    special_params : dict, optional
        The special category parameters used for filtering, only populated if apply_special_category_filtering=True.
    metrics : dict
        A dictionary to store metrics related to the training process.
    """

    def __init__(
        self,
        input_var,
        grouping_var,
        outcome_var,
        apply_special_category_filtering=True,
        admit_col="is_admitted",
    ):
        self.input_var = input_var
        self.grouping_var = grouping_var
        self.outcome_var = outcome_var
        self.apply_special_category_filtering = apply_special_category_filtering
        self.admit_col = admit_col
        self.weights = None
        self.special_params = None
        self.metrics = {}

    def __repr__(self):
        """Return a string representation of the estimator."""
        class_name = self.__class__.__name__
        return (
            f"{class_name}(\n"
            f"    input_var='{self.input_var}',\n"
            f"    grouping_var='{self.grouping_var}',\n"
            f"    outcome_var='{self.outcome_var}',\n"
            f"    apply_special_category_filtering={self.apply_special_category_filtering},\n"
            f"    admit_col='{self.admit_col}'\n"
            f")"
        )

    def _preprocess_data(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Preprocesses the input data before fitting the model.

        Steps include:
        1. Selecting only admitted patients with a non-null specialty
        2. Optionally filtering out special categories
        3. Converting input values to strings and handling nulls

        Parameters
        ----------
        X : pd.DataFrame
            DataFrame containing patient data.

        Returns
        -------
        pd.DataFrame
            Preprocessed DataFrame ready for model fitting.
        """
        # Make a copy to avoid modifying the original
        df = X.copy()

        # Step 1: Select only admitted patients with a non-null specialty
        if self.admit_col in df.columns:
            df = df[df[self.admit_col] & ~df[self.outcome_var].isnull()]

        # Step 2: Optionally apply filtering for special categories
        if self.apply_special_category_filtering:
            # Get configuration for categorizing patients based on columns
            self.special_params = create_special_category_objects(df.columns)

            # Extract function that identifies non-special category patients
            opposite_special_category_func = self.special_params["special_func_map"][
                "default"
            ]

            # Determine which category is the special category
            special_category_key = next(
                key
                for key, value in self.special_params["special_category_dict"].items()
                if value == 1.0
            )

            # Filter out special category patients
            df = df[
                df.apply(opposite_special_category_func, axis=1)
                & (df[self.outcome_var] != special_category_key)
            ]

        # Step 3: Convert input values to strings and handle nulls
        if self.input_var in df.columns:
            df[self.input_var] = df[self.input_var].fillna("").astype(str)

        if self.grouping_var in df.columns:
            df[self.grouping_var] = df[self.grouping_var].fillna("").astype(str)

        return df

    def fit(self, X: pd.DataFrame) -> "ValueToOutcomePredictor":
        """
        Fits the predictor based on training data by computing the proportion of each input value
        ending in specific outcome variable categories.

        Automatically preprocesses the data before fitting. During preprocessing, any null values in the
        input and grouping variables are converted to empty strings. These empty strings are then used
        as keys in the model's weights dictionary.

        Parameters
        ----------
        X : pd.DataFrame
            A pandas DataFrame containing at least the columns specified by `input_var`, `grouping_var`, and `outcome_var`.

        Returns
        -------
        self : ValueToOutcomePredictor
            The fitted ValueToOutcomePredictor model with calculated probabilities for each input value.
            The weights dictionary will contain an empty string key ('') for any null values from the input data.
        """

        # Store metrics about the training data
        self.metrics["train_dttm"] = datetime.now().strftime("%Y-%m-%d %H:%M")
        self.metrics["train_set_no"] = len(X)
        if not X.empty:
            self.metrics["start_date"] = X["snapshot_date"].min()
            self.metrics["end_date"] = X["snapshot_date"].max()

        # Preprocess the data
        X = self._preprocess_data(X)

        # For each grouping value count the number of observed categories
        X_grouped = (
            X.groupby(self.grouping_var)[self.outcome_var]
            .value_counts()
            .unstack(fill_value=0)
        )

        # Calculate the total number of times each grouping value occurred
        row_totals = X_grouped.sum(axis=1)

        # Calculate, for each grouping value, the proportion ending with each observed specialty
        proportions = X_grouped.div(row_totals, axis=0).fillna(0)

        # Calculate probabilities for each input value
        input_probs = {}
        for input_val in X[self.input_var].unique():
            # Get all grouping values associated with this input value
            grouping_vals = X[X[self.input_var] == input_val][
                self.grouping_var
            ].unique()

            # Calculate probability distribution of grouping values for this input value
            input_to_group_probs = X[X[self.input_var] == input_val][
                self.grouping_var
            ].value_counts(normalize=True)

            # Get the probability distribution of outcomes for all relevant grouping values
            # This includes all rows in proportions where the grouping value appears for this input
            group_to_outcome_probs = proportions.loc[grouping_vals]

            # Ensure the rows are aligned by reindexing group_to_outcome_probs
            aligned_group_to_outcome = group_to_outcome_probs.reindex(
                input_to_group_probs.index
            )

            # Create outer product matrix of probabilities:
            # - Rows represent grouping values
            # - Columns represent outcome categories
            # Each cell contains the joint probability of the grouping value and outcome
            input_to_outcome_probs = pd.DataFrame(
                input_to_group_probs.values.reshape(-1, 1)
                * aligned_group_to_outcome.values,
                index=input_to_group_probs.index,
                columns=group_to_outcome_probs.columns,
            )

            # Sum across grouping values to get final probability distribution for this input value
            input_probs[input_val] = input_to_outcome_probs.sum().to_dict()

        # Clean the keys to remove excess string quotes
        def clean_key(key):
            if isinstance(key, str):
                # Remove surrounding quotes if they exist
                if key.startswith("'") and key.endswith("'"):
                    return key[1:-1]
            return key

        # Note: cleaned_dict will contain an empty string key ('') for any null values from the input data
        # This is because null values are converted to empty strings during preprocessing
        cleaned_dict = {clean_key(k): v for k, v in input_probs.items()}

        # save probabilities as weights within the model
        self.weights = cleaned_dict

        # save the input to grouping probabilities for use as a reference
        self.input_to_grouping_probs = self._probability_of_input_to_grouping_value(X)

        return self

    def _probability_of_input_to_grouping_value(self, X):
        """
        Computes the probabilities of different input values leading to specific grouping values.

        Parameters
        ----------
        X : pd.DataFrame
            A pandas DataFrame containing at least the columns specified by `input_var` and `grouping_var`.

        Returns
        -------
        pd.DataFrame
            A DataFrame containing the probabilities of input values leading to grouping values.
        """
        # For each input value count the number of grouping values
        X_grouped = (
            X.groupby(self.input_var)[self.grouping_var]
            .value_counts()
            .unstack(fill_value=0)
        )

        # Calculate the total number of times each input value occurred
        row_totals = X_grouped.sum(axis=1)

        # Calculate, for each input value, the proportion ending with each grouping value
        proportions = X_grouped.div(row_totals, axis=0)

        # Calculate the probability of each input value occurring in the original data
        proportions["probability_of_input_value"] = row_totals / row_totals.sum()

        return proportions

    def predict(self, input_value: str) -> Dict[str, float]:
        """
        Predicts the probabilities of ending in various outcome categories for a given input value.

        Parameters
        ----------
        input_value : str
            The input value to predict outcomes for. None values will be handled appropriately.

        Returns
        -------
        dict
            A dictionary of categories and the probabilities that the input value will end in them.
        """
        if input_value is None or pd.isna(input_value):
            return self.weights.get("", {})

        # Convert input to string if it isn't already
        input_value = str(input_value)

        # Return a direct lookup of probabilities if possible
        if input_value in self.weights:
            return self.weights[input_value]

        # If no relevant data is found, return an empty distribution
        return self.weights.get(None, {})
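
The sketch below fits and queries the predictor on a toy DataFrame. It is illustrative only: the column names ("triage_category", "final_category", "specialty") and categories are hypothetical, and special-category filtering is disabled so the example does not depend on any site-specific columns.

import pandas as pd
from patientflow.predictors.value_to_outcome_predictor import (
    ValueToOutcomePredictor,
)

# Toy training frame; column names and categories are illustrative only
df = pd.DataFrame({
    "snapshot_date": pd.to_datetime(["2024-01-01"] * 5),
    "triage_category": ["majors", "majors", "resus", "minors", "majors"],
    "final_category": ["majors", "majors-obs", "resus", "minors", "majors"],
    "specialty": ["medicine", "medicine", "surgery", "medicine", "surgery"],
    "is_admitted": [True] * 5,
})

model = ValueToOutcomePredictor(
    input_var="triage_category",
    grouping_var="final_category",
    outcome_var="specialty",
    apply_special_category_filtering=False,  # avoid column-specific filtering
)
model.fit(df)

print(model.predict("majors"))  # e.g. {'medicine': 0.67, 'surgery': 0.33}
print(model.predict(None))      # {} here, as the training data contained no nulls
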
__repr__()

Return a string representation of the estimator.

Source code in src/patientflow/predictors/value_to_outcome_predictor.py
def __repr__(self):
    """Return a string representation of the estimator."""
    class_name = self.__class__.__name__
    return (
        f"{class_name}(\n"
        f"    input_var='{self.input_var}',\n"
        f"    grouping_var='{self.grouping_var}',\n"
        f"    outcome_var='{self.outcome_var}',\n"
        f"    apply_special_category_filtering={self.apply_special_category_filtering},\n"
        f"    admit_col='{self.admit_col}'\n"
        f")"
    )
fit(X)

Fits the predictor based on training data by computing the proportion of each input value ending in specific outcome variable categories.

Automatically preprocesses the data before fitting. During preprocessing, any null values in the input and grouping variables are converted to empty strings. These empty strings are then used as keys in the model's weights dictionary.

Parameters:

Name Type Description Default
X DataFrame

A pandas DataFrame containing at least the columns specified by input_var, grouping_var, and outcome_var.

required

Returns:

Name Type Description
self ValueToOutcomePredictor

The fitted ValueToOutcomePredictor model with calculated probabilities for each input value. The weights dictionary will contain an empty string key ('') for any null values from the input data.

Source code in src/patientflow/predictors/value_to_outcome_predictor.py
def fit(self, X: pd.DataFrame) -> "ValueToOutcomePredictor":
    """
    Fits the predictor based on training data by computing the proportion of each input value
    ending in specific outcome variable categories.

    Automatically preprocesses the data before fitting. During preprocessing, any null values in the
    input and grouping variables are converted to empty strings. These empty strings are then used
    as keys in the model's weights dictionary.

    Parameters
    ----------
    X : pd.DataFrame
        A pandas DataFrame containing at least the columns specified by `input_var`, `grouping_var`, and `outcome_var`.

    Returns
    -------
    self : ValueToOutcomePredictor
        The fitted ValueToOutcomePredictor model with calculated probabilities for each input value.
        The weights dictionary will contain an empty string key ('') for any null values from the input data.
    """

    # Store metrics about the training data
    self.metrics["train_dttm"] = datetime.now().strftime("%Y-%m-%d %H:%M")
    self.metrics["train_set_no"] = len(X)
    if not X.empty:
        self.metrics["start_date"] = X["snapshot_date"].min()
        self.metrics["end_date"] = X["snapshot_date"].max()

    # Preprocess the data
    X = self._preprocess_data(X)

    # For each grouping value count the number of observed categories
    X_grouped = (
        X.groupby(self.grouping_var)[self.outcome_var]
        .value_counts()
        .unstack(fill_value=0)
    )

    # Calculate the total number of times each grouping value occurred
    row_totals = X_grouped.sum(axis=1)

    # Calculate, for each grouping value, the proportion ending with each observed outcome category
    proportions = X_grouped.div(row_totals, axis=0).fillna(0)

    # Calculate probabilities for each input value
    input_probs = {}
    for input_val in X[self.input_var].unique():
        # Get all grouping values associated with this input value
        grouping_vals = X[X[self.input_var] == input_val][
            self.grouping_var
        ].unique()

        # Calculate probability distribution of grouping values for this input value
        input_to_group_probs = X[X[self.input_var] == input_val][
            self.grouping_var
        ].value_counts(normalize=True)

        # Get the probability distribution of outcomes for all relevant grouping values
        # This includes all rows in proportions where the grouping value appears for this input
        group_to_outcome_probs = proportions.loc[grouping_vals]

        # Ensure the rows are aligned by reindexing group_to_outcome_probs
        aligned_group_to_outcome = group_to_outcome_probs.reindex(
            input_to_group_probs.index
        )

        # Create outer product matrix of probabilities:
        # - Rows represent grouping values
        # - Columns represent outcome categories
        # Each cell contains the joint probability of the grouping value and outcome
        input_to_outcome_probs = pd.DataFrame(
            input_to_group_probs.values.reshape(-1, 1)
            * aligned_group_to_outcome.values,
            index=input_to_group_probs.index,
            columns=group_to_outcome_probs.columns,
        )

        # Sum across grouping values to get final probability distribution for this input value
        input_probs[input_val] = input_to_outcome_probs.sum().to_dict()

    # Clean the keys to remove excess string quotes
    def clean_key(key):
        if isinstance(key, str):
            # Remove surrounding quotes if they exist
            if key.startswith("'") and key.endswith("'"):
                return key[1:-1]
        return key

    # Note: cleaned_dict will contain an empty string key ('') for any null values from the input data
    # This is because null values are converted to empty strings during preprocessing
    cleaned_dict = {clean_key(k): v for k, v in input_probs.items()}

    # save probabilities as weights within the model
    self.weights = cleaned_dict

    # save the input to grouping probabilities for use as a reference
    self.input_to_grouping_probs = self._probability_of_input_to_grouping_value(X)

    return self
predict(input_value)

Predicts the probabilities of ending in various outcome categories for a given input value.

Parameters:

Name Type Description Default
input_value str

The input value to predict outcomes for. None or NaN inputs return the distribution stored under the empty-string key.

required

Returns:

Type Description
dict

A dictionary of categories and the probabilities that the input value will end in them.

Source code in src/patientflow/predictors/value_to_outcome_predictor.py
def predict(self, input_value: str) -> Dict[str, float]:
    """
    Predicts the probabilities of ending in various outcome categories for a given input value.

    Parameters
    ----------
    input_value : str
        The input value to predict outcomes for. None or NaN inputs return the distribution stored under the empty-string key.

    Returns
    -------
    dict
        A dictionary of categories and the probabilities that the input value will end in them.
    """
    if input_value is None or pd.isna(input_value):
        return self.weights.get("", {})

    # Convert input to string if it isn't already
    input_value = str(input_value)

    # Return a direct lookup of probabilities if possible
    if input_value in self.weights:
        return self.weights[input_value]

    # If the input value was not seen in training, return an empty distribution
    return self.weights.get(None, {})
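
A minimal usage sketch (hypothetical data and column names; the constructor arguments are inferred from the __repr__ shown above, so check the class signature before relying on them):

>>> import pandas as pd
>>> from patientflow.predictors.value_to_outcome_predictor import ValueToOutcomePredictor
>>> df = pd.DataFrame({
...     "snapshot_date": pd.to_datetime(["2031-01-01", "2031-01-01", "2031-01-02"]),
...     "consultation": ["acute", "acute", "paeds"],          # input_var
...     "final_sequence": ["acute", "acute", "paeds"],        # grouping_var
...     "specialty": ["medical", "surgical", "paediatric"],   # outcome_var
... })
>>> model = ValueToOutcomePredictor(
...     input_var="consultation",
...     grouping_var="final_sequence",
...     outcome_var="specialty",
... )
>>> model = model.fit(df)
>>> model.predict("acute")  # e.g. {'medical': 0.5, 'paediatric': 0.0, 'surgical': 0.5}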

prepare

Module for preparing data, loading models, and organizing snapshots for inference.

This module provides functionality to load a trained model, prepare data for making predictions, calculate arrival rates, and organize snapshot data. It allows for selecting one snapshot per visit, filtering snapshots by prediction time, and mapping snapshot dates to corresponding indices.

Functions:

Name Description
select_one_snapshot_per_visit

Selects one snapshot per visit based on a random number and returns the filtered DataFrame.

prepare_patient_snapshots

Filters the DataFrame by prediction time and optionally selects one snapshot per visit.

prepare_group_snapshot_dict

Prepares a dictionary mapping snapshot dates to their corresponding snapshot indices.

calculate_time_varying_arrival_rates

Calculates the time-varying arrival rates for a dataset indexed by datetime.

SpecialCategoryParams

A picklable implementation of special category parameters for patient classification.

This class identifies pediatric patients based on available age-related columns in the dataset and provides functions to categorise patients accordingly. It's designed to be serializable with pickle by implementing the __reduce__ method.

Parameters:

Name Type Description Default
columns list or Index

Column names from the dataset used to determine the appropriate age identification method

required

Attributes:

Name Type Description
columns list

List of column names from the dataset

method_type str

The method used for age detection ('age_on_arrival' or 'age_group')

special_category_dict dict

Default category values mapping

Raises:

Type Description
ValueError

If neither 'age_on_arrival' nor 'age_group' columns are found

Source code in src/patientflow/prepare.py
class SpecialCategoryParams:
    """A picklable implementation of special category parameters for patient classification.

    This class identifies pediatric patients based on available age-related columns
    in the dataset and provides functions to categorise patients accordingly.
    It's designed to be serializable with pickle by implementing the __reduce__ method.

    Parameters
    ----------
    columns : list or pandas.Index
        Column names from the dataset used to determine the appropriate age identification method

    Attributes
    ----------
    columns : list
        List of column names from the dataset
    method_type : str
        The method used for age detection ('age_on_arrival' or 'age_group')
    special_category_dict : dict
        Default category values mapping

    Raises
    ------
    ValueError
        If neither 'age_on_arrival' nor 'age_group' columns are found
    """

    def __init__(self, columns):
        """Initialize the SpecialCategoryParams object.

        Parameters
        ----------
        columns : list or pandas.Index
            Column names from the dataset used to determine the appropriate age identification method

        Raises
        ------
        ValueError
            If neither 'age_on_arrival' nor 'age_group' columns are found
        """
        self.columns = columns
        self.special_category_dict = {
            "medical": 0.0,
            "surgical": 0.0,
            "haem/onc": 0.0,
            "paediatric": 1.0,
        }

        if "age_on_arrival" in columns:
            self.method_type = "age_on_arrival"
        elif "age_group" in columns:
            self.method_type = "age_group"
        else:
            raise ValueError("Unknown data format: could not find expected age columns")

    def special_category_func(self, row: Union[dict, pd.Series]) -> bool:
        """Identify if a patient is pediatric based on age data.

        Parameters
        ----------
        row : Union[dict, pd.Series]
            A row of patient data containing either 'age_on_arrival' or 'age_group'

        Returns
        -------
        bool
            True if the patient is pediatric (age < 18 or age_group is '0-17'),
            False otherwise
        """
        if self.method_type == "age_on_arrival":
            return row["age_on_arrival"] < 18
        else:  # age_group
            return row["age_group"] == "0-17"

    def opposite_special_category_func(self, row: Union[dict, pd.Series]) -> bool:
        """Identify if a patient is NOT pediatric.

        Parameters
        ----------
        row : Union[dict, pd.Series]
            A row of patient data

        Returns
        -------
        bool
            True if the patient is NOT pediatric, False if they are pediatric
        """
        return not self.special_category_func(row)

    def get_params_dict(
        self,
    ) -> Dict[str, Union[Callable, Dict[str, float], Dict[str, Callable]]]:
        """Get the special parameter dictionary in the format expected by the SequencePredictor.

        Returns
        -------
        Dict[str, Union[Callable, Dict[str, float], Dict[str, Callable]]]
            A dictionary containing:
            - 'special_category_func': Function to identify pediatric patients
            - 'special_category_dict': Default category values (float)
            - 'special_func_map': Mapping of category names to detection functions
        """
        return {
            "special_category_func": self.special_category_func,
            "special_category_dict": self.special_category_dict,
            "special_func_map": {
                "paediatric": self.special_category_func,
                "default": self.opposite_special_category_func,
            },
        }

    def __reduce__(self) -> Tuple[Type["SpecialCategoryParams"], Tuple[list]]:
        """Support for pickle serialization.

        Returns
        -------
        Tuple[Type['SpecialCategoryParams'], Tuple[list]]
            A tuple containing:
            - The class itself (to be called as a function)
            - A tuple of arguments to pass to the class constructor
        """
        return (self.__class__, (self.columns,))
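
Because the class implements __reduce__, instances survive a pickle round trip. A brief sketch (hypothetical column list):

>>> import pickle
>>> params = SpecialCategoryParams(columns=["age_on_arrival", "specialty"])
>>> params.special_category_func({"age_on_arrival": 10})
True
>>> restored = pickle.loads(pickle.dumps(params))
>>> restored.method_type
'age_on_arrival'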

__init__(columns)

Initialize the SpecialCategoryParams object.

Parameters:

Name Type Description Default
columns list or Index

Column names from the dataset used to determine the appropriate age identification method

required

Raises:

Type Description
ValueError

If neither 'age_on_arrival' nor 'age_group' columns are found

Source code in src/patientflow/prepare.py
def __init__(self, columns):
    """Initialize the SpecialCategoryParams object.

    Parameters
    ----------
    columns : list or pandas.Index
        Column names from the dataset used to determine the appropriate age identification method

    Raises
    ------
    ValueError
        If neither 'age_on_arrival' nor 'age_group' columns are found
    """
    self.columns = columns
    self.special_category_dict = {
        "medical": 0.0,
        "surgical": 0.0,
        "haem/onc": 0.0,
        "paediatric": 1.0,
    }

    if "age_on_arrival" in columns:
        self.method_type = "age_on_arrival"
    elif "age_group" in columns:
        self.method_type = "age_group"
    else:
        raise ValueError("Unknown data format: could not find expected age columns")

__reduce__()

Support for pickle serialization.

Returns:

Type Description
Tuple[Type[SpecialCategoryParams], Tuple[list]]

A tuple containing:
- The class itself (to be called as a function)
- A tuple of arguments to pass to the class constructor

Source code in src/patientflow/prepare.py
def __reduce__(self) -> Tuple[Type["SpecialCategoryParams"], Tuple[list]]:
    """Support for pickle serialization.

    Returns
    -------
    Tuple[Type['SpecialCategoryParams'], Tuple[list]]
        A tuple containing:
        - The class itself (to be called as a function)
        - A tuple of arguments to pass to the class constructor
    """
    return (self.__class__, (self.columns,))

get_params_dict()

Get the special parameter dictionary in the format expected by the SequencePredictor.

Returns:

Type Description
Dict[str, Union[Callable, Dict[str, float], Dict[str, Callable]]]

A dictionary containing:
- 'special_category_func': Function to identify pediatric patients
- 'special_category_dict': Default category values (float)
- 'special_func_map': Mapping of category names to detection functions

Source code in src/patientflow/prepare.py
def get_params_dict(
    self,
) -> Dict[str, Union[Callable, Dict[str, float], Dict[str, Callable]]]:
    """Get the special parameter dictionary in the format expected by the SequencePredictor.

    Returns
    -------
    Dict[str, Union[Callable, Dict[str, float], Dict[str, Callable]]]
        A dictionary containing:
        - 'special_category_func': Function to identify pediatric patients
        - 'special_category_dict': Default category values (float)
        - 'special_func_map': Mapping of category names to detection functions
    """
    return {
        "special_category_func": self.special_category_func,
        "special_category_dict": self.special_category_dict,
        "special_func_map": {
            "paediatric": self.special_category_func,
            "default": self.opposite_special_category_func,
        },
    }

opposite_special_category_func(row)

Identify if a patient is NOT pediatric.

Parameters:

Name Type Description Default
row Union[dict, Series]

A row of patient data

required

Returns:

Type Description
bool

True if the patient is NOT pediatric, False if they are pediatric

Source code in src/patientflow/prepare.py
def opposite_special_category_func(self, row: Union[dict, pd.Series]) -> bool:
    """Identify if a patient is NOT pediatric.

    Parameters
    ----------
    row : Union[dict, pd.Series]
        A row of patient data

    Returns
    -------
    bool
        True if the patient is NOT pediatric, False if they are pediatric
    """
    return not self.special_category_func(row)

special_category_func(row)

Identify if a patient is pediatric based on age data.

Parameters:

Name Type Description Default
row Union[dict, Series]

A row of patient data containing either 'age_on_arrival' or 'age_group'

required

Returns:

Type Description
bool

True if the patient is pediatric (age < 18 or age_group is '0-17'), False otherwise

Source code in src/patientflow/prepare.py
def special_category_func(self, row: Union[dict, pd.Series]) -> bool:
    """Identify if a patient is pediatric based on age data.

    Parameters
    ----------
    row : Union[dict, pd.Series]
        A row of patient data containing either 'age_on_arrival' or 'age_group'

    Returns
    -------
    bool
        True if the patient is pediatric (age < 18 or age_group is '0-17'),
        False otherwise
    """
    if self.method_type == "age_on_arrival":
        return row["age_on_arrival"] < 18
    else:  # age_group
        return row["age_group"] == "0-17"

additional_details(column, col_name)

Generate additional statistical details about a column's contents.

Parameters:

Name Type Description Default
column Series

The column to analyze

required
col_name str

Name of the column (used for context)

required

Returns:

Type Description
str

A string containing statistical details about the column's contents, including:
- For dates: Date range
- For categorical data: Frequency of values
- For numeric data: Range, mean, standard deviation, and NA count
- For datetime: Date range with time

Source code in src/patientflow/prepare.py
def additional_details(column, col_name):
    """Generate additional statistical details about a column's contents.

    Parameters
    ----------
    column : pandas.Series
        The column to analyze
    col_name : str
        Name of the column (used for context)

    Returns
    -------
    str
        A string containing statistical details about the column's contents, including:
        - For dates: Date range
        - For categorical data: Frequency of values
        - For numeric data: Range, mean, standard deviation, and NA count
        - For datetime: Date range with time
    """

    def is_date(string):
        try:
            # Try to parse the string using the strptime method
            datetime.strptime(
                string, "%Y-%m-%d"
            )  # You can adjust the format to match your date format
            return True
        except (ValueError, TypeError):
            return False

    # Convert to datetime if it's an object but formatted as a date
    if column.dtype == "object" and all(
        is_date(str(x)) for x in column.dropna().unique()
    ):
        column = pd.to_datetime(column)
        return f"Date Range: {column.min().strftime('%Y-%m-%d')} - {column.max().strftime('%Y-%m-%d')}"

    if column.dtype in ["object", "category", "bool"]:
        # Categorical data: Frequency of unique values
        # Handle enum instances
        try:
            from enum import Enum

            if any(isinstance(x, Enum) for x in column.dropna().unique()):
                # Convert enum instances to their values for counting
                column = column.apply(lambda x: x.value if isinstance(x, Enum) else x)
        except ImportError:
            pass

        if len(column.value_counts()) <= 12:
            value_counts = column.value_counts(dropna=False).to_dict()
            value_counts = dict(sorted(value_counts.items(), key=lambda x: str(x[0])))
            value_counts_formatted = {k: f"{v:,}" for k, v in value_counts.items()}
            return f"Frequencies: {value_counts_formatted}"
        value_counts = column.value_counts(dropna=False)[0:12].to_dict()
        value_counts = dict(sorted(value_counts.items(), key=lambda x: str(x[0])))
        value_counts_formatted = {k: f"{v:,}" for k, v in value_counts.items()}
        return f"Frequencies (highest 12): {value_counts_formatted}"

    if pd.api.types.is_float_dtype(column):
        # Float data: Range with rounding
        na_count = column.isna().sum()
        column = column.dropna()
        return f"Range: {column.min():.2f} - {column.max():.2f},  Mean: {column.mean():.2f}, Std Dev: {column.std():.2f}, NA: {na_count}"
    if pd.api.types.is_integer_dtype(column):
        # Integer data: Range without rounding
        na_count = column.isna().sum()
        column = column.dropna()
        return f"Range: {column.min()} - {column.max()}, Mean: {column.mean():.2f}, Std Dev: {column.std():.2f}, NA: {na_count}"
    if pd.api.types.is_datetime64_any_dtype(column):
        # Datetime data: Minimum and Maximum dates
        return f"Date Range: {column.min().strftime('%Y-%m-%d %H:%M')} - {column.max().strftime('%Y-%m-%d %H:%M')}"
    else:
        return "N/A"

apply_set(row)

Randomly assign a set label based on weighted probabilities.

Parameters:

Name Type Description Default
row Series

Series containing 'training_set', 'validation_set', and 'test_set' weights

required

Returns:

Type Description
str

One of 'train', 'valid', or 'test' based on weighted random choice

Source code in src/patientflow/prepare.py
def apply_set(row: pd.Series) -> str:
    """Randomly assign a set label based on weighted probabilities.

    Parameters
    ----------
    row : pandas.Series
        Series containing 'training_set', 'validation_set', and 'test_set' weights

    Returns
    -------
    str
        One of 'train', 'valid', or 'test' based on weighted random choice
    """
    return random.choices(
        ["train", "valid", "test"],
        weights=[row.training_set, row.validation_set, row.test_set],
    )[0]
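
A brief sketch of the weighted draw (the result is random, so no fixed output is shown):

>>> import pandas as pd
>>> row = pd.Series({"training_set": 7, "validation_set": 3, "test_set": 0})
>>> apply_set(row)  # 'train' with probability 0.7, 'valid' with probability 0.3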

assign_patient_ids(df, start_training_set, start_validation_set, start_test_set, end_test_set, date_col='arrival_datetime', patient_id='mrn', visit_col='encounter', seed=42)

Probabilistically assign patient IDs to train/validation/test sets.

Parameters:

Name Type Description Default
df DataFrame

DataFrame with patient_id, encounter, and temporal columns

required
start_training_set date

Start date for training period

required
start_validation_set date

Start date for validation period

required
start_test_set date

Start date for test period

required
end_test_set date

End date for test period

required
date_col str

Column name for temporal splitting, by default "arrival_datetime"

'arrival_datetime'
patient_id str

Column name for patient identifier, by default "mrn"

'mrn'
visit_col str

Column name for visit identifier, by default "encounter"

'encounter'
seed int

Random seed for reproducible results, by default 42

42

Returns:

Type Description
DataFrame

DataFrame with patient ID assignments based on weighted random sampling

Notes

- Counts encounters in each time period per patient ID
- Randomly assigns each patient ID to one set, weighted by its temporal distribution
- A patient with 70% of encounters in training and 30% in validation has a 70% chance of training assignment
Source code in src/patientflow/prepare.py
def assign_patient_ids(
    df: pd.DataFrame,
    start_training_set: date,
    start_validation_set: date,
    start_test_set: date,
    end_test_set: date,
    date_col: str = "arrival_datetime",
    patient_id: str = "mrn",
    visit_col: str = "encounter",
    seed: int = 42,
) -> pd.DataFrame:
    """Probabilistically assign patient IDs to train/validation/test sets.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame with patient_id, encounter, and temporal columns
    start_training_set : datetime.date
        Start date for training period
    start_validation_set : datetime.date
        Start date for validation period
    start_test_set : datetime.date
        Start date for test period
    end_test_set : datetime.date
        End date for test period
    date_col : str, optional
        Column name for temporal splitting, by default "arrival_datetime"
    patient_id : str, optional
        Column name for patient identifier, by default "mrn"
    visit_col : str, optional
        Column name for visit identifier, by default "encounter"
    seed : int, optional
        Random seed for reproducible results, by default 42

    Returns
    -------
    pandas.DataFrame
        DataFrame with patient ID assignments based on weighted random sampling

    Notes
    -----
    - Counts encounters in each time period per patient ID
    - Randomly assigns each patient ID to one set, weighted by their temporal distribution
    - Patient with 70% encounters in training, 30% in validation has 70% chance of training assignment
    """
    # Set random seed for reproducibility
    random.seed(seed)

    patients: pd.DataFrame = (
        df.groupby([patient_id, visit_col])[date_col].max().reset_index()
    )

    # Handle date_col as string, datetime, or date type
    if pd.api.types.is_datetime64_any_dtype(patients[date_col]):
        # Already datetime, extract date if needed
        if hasattr(patients[date_col].iloc[0], "date"):
            date_series = patients[date_col].dt.date
        else:
            # Already date type
            date_series = patients[date_col]
    else:
        # Try to convert string to datetime
        try:
            patients[date_col] = pd.to_datetime(patients[date_col])
            date_series = patients[date_col].dt.date
        except (TypeError, ValueError) as e:
            raise ValueError(
                f"Could not convert column '{date_col}' to datetime format: {str(e)}"
            )

    # Filter out patient IDs outside temporal bounds
    pre_training_patients = patients[date_series < start_training_set]
    post_test_patients = patients[date_series >= end_test_set]

    if len(pre_training_patients) > 0:
        print(
            f"Filtered out {len(pre_training_patients)} patients with only pre-training visits"
        )
    if len(post_test_patients) > 0:
        print(
            f"Filtered out {len(post_test_patients)} patients with only post-test visits"
        )

    valid_patients = patients[
        (date_series >= start_training_set) & (date_series < end_test_set)
    ]
    patients = valid_patients

    # Use the date_series for set assignment
    patients["training_set"] = (date_series >= start_training_set) & (
        date_series < start_validation_set
    )
    patients["validation_set"] = (date_series >= start_validation_set) & (
        date_series < start_test_set
    )
    patients["test_set"] = (date_series >= start_test_set) & (
        date_series < end_test_set
    )

    patients = patients.groupby(patient_id)[
        ["training_set", "validation_set", "test_set"]
    ].sum()
    patients["training_validation_test"] = patients.apply(apply_set, axis=1)

    print(
        f"\nPatient Set Overlaps (before random assignment):"
        f"\nTrain-Valid: {patients[patients.training_set * patients.validation_set != 0].shape[0]} of {patients[patients.training_set + patients.validation_set > 0].shape[0]}"
        f"\nValid-Test: {patients[patients.validation_set * patients.test_set != 0].shape[0]} of {patients[patients.validation_set + patients.test_set > 0].shape[0]}"
        f"\nTrain-Test: {patients[patients.training_set * patients.test_set != 0].shape[0]} of {patients[patients.training_set + patients.test_set > 0].shape[0]}"
        f"\nAll Sets: {patients[patients.training_set * patients.validation_set * patients.test_set != 0].shape[0]} of {patients.shape[0]} total patients"
    )

    return patients
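
A hedged sketch of a typical call, assuming df is a hypothetical frame holding the default 'mrn', 'encounter' and 'arrival_datetime' columns:

>>> from datetime import date
>>> assignments = assign_patient_ids(
...     df,
...     start_training_set=date(2031, 1, 1),
...     start_validation_set=date(2031, 7, 1),
...     start_test_set=date(2031, 9, 1),
...     end_test_set=date(2032, 1, 1),
... )
>>> assignments["training_validation_test"].value_counts()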

convert_dict_to_values(df, column, prefix)

Convert a column containing dictionaries into separate columns.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing the dictionary column

required
column str

Name of the column containing dictionaries to convert

required
prefix str

Prefix to use for the new column names

required

Returns:

Type Description
DataFrame

DataFrame containing separate columns for each dictionary key, with values extracted from 'value_as_real' or 'value_as_text' if present

Source code in src/patientflow/prepare.py
def convert_dict_to_values(df, column, prefix):
    """Convert a column containing dictionaries into separate columns.

    Parameters
    ----------
    df : pandas.DataFrame
        Input DataFrame containing the dictionary column
    column : str
        Name of the column containing dictionaries to convert
    prefix : str
        Prefix to use for the new column names

    Returns
    -------
    pandas.DataFrame
        DataFrame containing separate columns for each dictionary key,
        with values extracted from 'value_as_real' or 'value_as_text' if present
    """

    def extract_relevant_value(d):
        if isinstance(d, dict):
            if "value_as_real" in d or "value_as_text" in d:
                return (
                    d.get("value_as_real")
                    if d.get("value_as_real") is not None
                    else d.get("value_as_text")
                )
            else:
                return d  # Return the dictionary as is if it does not contain 'value_as_real' or 'value_as_text'
        return d  # Return the value as is if it is not a dictionary

    # Apply the extraction function to each entry in the dictionary column
    extracted_values = df[column].apply(
        lambda x: {k: extract_relevant_value(v) for k, v in x.items()}
    )

    # Create a DataFrame from the processed dictionary column
    dict_df = extracted_values.apply(pd.Series)

    # Add a prefix to the column names
    dict_df.columns = [f"{prefix}_{col}" for col in dict_df.columns]

    return dict_df
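
A minimal sketch with a hypothetical observations column:

>>> import pandas as pd
>>> df = pd.DataFrame({"obs": [
...     {"heart_rate": {"value_as_real": 82.0}, "triage": {"value_as_text": "urgent"}},
... ]})
>>> out = convert_dict_to_values(df, "obs", "latest_obs")
>>> list(out.columns)
['latest_obs_heart_rate', 'latest_obs_triage']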

convert_set_to_dummies(df, column, prefix)

Convert a column containing sets into dummy variables.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing the set column

required
column str

Name of the column containing sets to convert

required
prefix str

Prefix to use for the dummy variable column names

required

Returns:

Type Description
DataFrame

DataFrame containing dummy variables for each unique item in the sets

Source code in src/patientflow/prepare.py
def convert_set_to_dummies(df, column, prefix):
    """Convert a column containing sets into dummy variables.

    Parameters
    ----------
    df : pandas.DataFrame
        Input DataFrame containing the set column
    column : str
        Name of the column containing sets to convert
    prefix : str
        Prefix to use for the dummy variable column names

    Returns
    -------
    pandas.DataFrame
        DataFrame containing dummy variables for each unique item in the sets
    """
    # Explode the set into rows
    exploded_df = df[column].explode().dropna().to_frame()

    # Create dummy variables for each unique item with a specified prefix
    dummies = pd.get_dummies(exploded_df[column], prefix=prefix)

    # Sum the dummies back to the original DataFrame's index
    dummies = dummies.groupby(dummies.index).sum()

    # Convert dummy variables to boolean
    dummies = dummies.astype(bool)

    return dummies
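
A minimal sketch with a hypothetical set-valued column:

>>> import pandas as pd
>>> df = pd.DataFrame({"labs": [{"FBC", "U&E"}, {"FBC"}]})
>>> dummies = convert_set_to_dummies(df, "labs", "lab_orders")
>>> sorted(dummies.columns)
['lab_orders_FBC', 'lab_orders_U&E']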

create_special_category_objects(columns)

Create a configuration for categorising patients with special handling for pediatric cases.

Parameters:

Name Type Description Default
columns list or Index

The column names available in the dataset. Used to determine which age format is present.

required

Returns:

Type Description
dict

A dictionary containing special category configuration with:
- 'special_category_func': Function to identify pediatric patients
- 'special_category_dict': Default category values
- 'special_func_map': Mapping of category names to detection functions

Source code in src/patientflow/prepare.py
def create_special_category_objects(columns):
    """Create a configuration for categorising patients with special handling for pediatric cases.

    Parameters
    ----------
    columns : list or pandas.Index
        The column names available in the dataset. Used to determine which age format is present.

    Returns
    -------
    dict
        A dictionary containing special category configuration with:
        - 'special_category_func': Function to identify pediatric patients
        - 'special_category_dict': Default category values
        - 'special_func_map': Mapping of category names to detection functions
    """
    # Create the class instance and return its parameter dictionary
    params_obj = SpecialCategoryParams(columns)
    return params_obj.get_params_dict()
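
A quick sketch using an age_group column:

>>> params = create_special_category_objects(["age_group", "specialty"])
>>> params["special_category_func"]({"age_group": "0-17"})
True
>>> sorted(params["special_func_map"])
['default', 'paediatric']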

create_temporal_splits(df, start_train, start_valid, start_test, end_test, col_name='arrival_datetime', patient_id='mrn', visit_col='encounter', seed=42)

Split dataset into temporal train/validation/test sets.

Parameters:

Name Type Description Default
df DataFrame

Input dataframe

required
start_train date

Training start (inclusive)

required
start_valid date

Validation start (inclusive)

required
start_test date

Test start (inclusive)

required
end_test date

Test end (exclusive)

required
col_name str

Primary datetime column for splitting, by default "arrival_datetime"

'arrival_datetime'
patient_id str

Column name for patient identifier, by default "mrn"

'mrn'
visit_col str

Column name for visit identifier, by default "encounter"

'encounter'
seed int

Random seed for reproducible results, by default 42

42

Returns:

Type Description
Tuple[DataFrame, DataFrame, DataFrame]

Tuple containing (train_df, valid_df, test_df) split dataframes

Notes

Creates temporal data splits using primary datetime column and optional snapshot dates. Handles patient ID grouping if present to prevent data leakage.

Source code in src/patientflow/prepare.py
def create_temporal_splits(
    df: pd.DataFrame,
    start_train: date,
    start_valid: date,
    start_test: date,
    end_test: date,
    col_name: str = "arrival_datetime",
    patient_id: str = "mrn",
    visit_col: str = "encounter",
    seed: int = 42,
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Split dataset into temporal train/validation/test sets.

    Parameters
    ----------
    df : pandas.DataFrame
        Input dataframe
    start_train : datetime.date
        Training start (inclusive)
    start_valid : datetime.date
        Validation start (inclusive)
    start_test : datetime.date
        Test start (inclusive)
    end_test : datetime.date
        Test end (exclusive)
    col_name : str, optional
        Primary datetime column for splitting, by default "arrival_datetime"
    patient_id : str, optional
        Column name for patient identifier, by default "mrn"
    visit_col : str, optional
        Column name for visit identifier, by default "encounter"
    seed : int, optional
        Random seed for reproducible results, by default 42

    Returns
    -------
    Tuple[pandas.DataFrame, pandas.DataFrame, pandas.DataFrame]
        Tuple containing (train_df, valid_df, test_df) split dataframes

    Notes
    -----
    Creates temporal data splits using primary datetime column and optional snapshot dates.
    Handles patient ID grouping if present to prevent data leakage.
    """

    def get_date_value(series: pd.Series) -> pd.Series:
        """Convert timestamp or date column to date, handling both types.

        Parameters
        ----------
        series : pandas.Series
            Series containing datetime or date values

        Returns
        -------
        pandas.Series
            Series with date values
        """
        try:
            return pd.to_datetime(series).dt.date
        except (AttributeError, TypeError):
            return series

    if patient_id in df.columns:
        set_assignment: pd.DataFrame = assign_patient_ids(
            df,
            start_train,
            start_valid,
            start_test,
            end_test,
            col_name,
            patient_id,
            visit_col,
            seed=seed,
        )
        patient_sets: Dict[str, Set] = {
            k: set(set_assignment[set_assignment.training_validation_test == v].index)
            for k, v in {"train": "train", "valid": "valid", "test": "test"}.items()
        }

    splits: List[pd.DataFrame] = []
    for start, end, set_key in [
        (start_train, start_valid, "train"),
        (start_valid, start_test, "valid"),
        (start_test, end_test, "test"),
    ]:
        mask = (get_date_value(df[col_name]) >= start) & (
            get_date_value(df[col_name]) < end
        )

        if "snapshot_date" in df.columns:
            mask &= (get_date_value(df.snapshot_date) >= start) & (
                get_date_value(df.snapshot_date) < end
            )

        if patient_id in df.columns:
            mask &= df[patient_id].isin(patient_sets[set_key])

        splits.append(df[mask].copy())

    print(f"Split sizes: {[len(split) for split in splits]}")
    return tuple(splits)
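
A hedged sketch of a call (df is a hypothetical frame with the default 'arrival_datetime', 'mrn' and 'encounter' columns; the function prints the split sizes as a side effect):

>>> from datetime import date
>>> train_df, valid_df, test_df = create_temporal_splits(
...     df,
...     start_train=date(2031, 1, 1),
...     start_valid=date(2031, 7, 1),
...     start_test=date(2031, 9, 1),
...     end_test=date(2032, 1, 1),
... )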

create_yta_filters(df)

Create specialty filters for categorizing patients by specialty and age group.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing patient data with columns that include either 'age_on_arrival' or 'age_group' for pediatric classification

required

Returns:

Type Description
dict

A dictionary mapping specialty names to filter configurations. Each configuration contains:
- For pediatric specialty: {"is_child": True}
- For other specialties: {"specialty": specialty_name, "is_child": False}

Examples:

>>> df = pd.DataFrame({'patient_id': [1, 2], 'age_on_arrival': [10, 40]})
>>> filters = create_yta_filters(df)
>>> print(filters['paediatric'])
{'is_child': True}
>>> print(filters['medical'])
{'specialty': 'medical', 'is_child': False}
Source code in src/patientflow/prepare.py
def create_yta_filters(df):
    """Create specialty filters for categorizing patients by specialty and age group.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing patient data with columns that include either
        'age_on_arrival' or 'age_group' for pediatric classification

    Returns
    -------
    dict
        A dictionary mapping specialty names to filter configurations.
        Each configuration contains:
        - For pediatric specialty: {"is_child": True}
        - For other specialties: {"specialty": specialty_name, "is_child": False}

    Examples
    --------
    >>> df = pd.DataFrame({'patient_id': [1, 2], 'age_on_arrival': [10, 40]})
    >>> filters = create_yta_filters(df)
    >>> print(filters['paediatric'])
    {'is_child': True}
    >>> print(filters['medical'])
    {'specialty': 'medical', 'is_child': False}
    """
    # Get the special category parameters using the picklable implementation
    special_params = create_special_category_objects(df.columns)

    # Extract necessary data from the special_params
    special_category_dict = special_params["special_category_dict"]

    # Create the specialty_filters dictionary
    specialty_filters = {}

    for specialty, is_paediatric_flag in special_category_dict.items():
        if is_paediatric_flag == 1.0:
            # For the paediatric specialty, set `is_child` to True
            specialty_filters[specialty] = {"is_child": True}
        else:
            # For other specialties, set `is_child` to False
            specialty_filters[specialty] = {"specialty": specialty, "is_child": False}

    return specialty_filters

find_group_for_colname(column, dict_col_groups)

Find the group name that a column belongs to in the column groups dictionary.

Parameters:

Name Type Description Default
column str

Name of the column to find the group for

required
dict_col_groups dict

Dictionary mapping group names to lists of column names

required

Returns:

Type Description
str or None

The name of the group the column belongs to, or None if not found

Source code in src/patientflow/prepare.py
def find_group_for_colname(column, dict_col_groups):
    """Find the group name that a column belongs to in the column groups dictionary.

    Parameters
    ----------
    column : str
        Name of the column to find the group for
    dict_col_groups : dict
        Dictionary mapping group names to lists of column names

    Returns
    -------
    str or None
        The name of the group the column belongs to, or None if not found
    """
    for key, values_list in dict_col_groups.items():
        if column in values_list:
            return key
    return None
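
A minimal sketch:

>>> groups = {"observations": ["latest_obs_heart_rate"], "labs": ["latest_lab_fbc"]}
>>> find_group_for_colname("latest_obs_heart_rate", groups)
'observations'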

generate_description(col_name)

Generate a description for a column based on its name and manual descriptions.

Parameters:

Name Type Description Default
col_name str

Name of the column to generate a description for

required

Returns:

Type Description
str

A descriptive string explaining the column's purpose and content

Source code in src/patientflow/prepare.py
def generate_description(col_name):
    """Generate a description for a column based on its name and manual descriptions.

    Parameters
    ----------
    col_name : str
        Name of the column to generate a description for

    Returns
    -------
    str
        A descriptive string explaining the column's purpose and content
    """
    manual_descriptions = get_manual_descriptions()

    # Check if manual description is provided
    if col_name in manual_descriptions:
        return manual_descriptions[col_name]

    if (
        col_name.startswith("num")
        and not col_name.startswith("num_obs")
        and not col_name.startswith("num_orders")
    ):
        return "Number of times " + col_name[4:] + " has been recorded"
    if col_name.startswith("num_obs"):
        return "Number of observations of " + col_name[8:]
    if col_name.startswith("latest_obs"):
        return "Latest result for " + col_name[11:]
    if col_name.startswith("latest_lab"):
        return "Latest result for " + col_name[19:]
    if col_name.startswith("lab_orders"):
        return "Request for lab battery " + col_name[11:] + " has been placed"
    if col_name.startswith("visited"):
        return "Patient visited " + col_name[8:] + " previously or is there now"
    else:
        return col_name
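
A quick sketch of the prefix rules (assuming no manual description overrides these names):

>>> generate_description("num_obs_heart_rate")  # 'Number of observations of heart_rate'
>>> generate_description("lab_orders_fbc")      # 'Request for lab battery fbc has been placed'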

prepare_group_snapshot_dict(df, start_dt=None, end_dt=None)

Prepare a dictionary mapping snapshot dates to their corresponding snapshot indices.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing at least a 'snapshot_date' column

required
start_dt date

Start date for filtering snapshots, by default None

None
end_dt date

End date for filtering snapshots, by default None

None

Returns:

Type Description
dict

A dictionary where:
- Keys are dates
- Values are arrays of indices corresponding to each date's snapshots
- Empty arrays for dates with no snapshots (if start_dt and end_dt are provided)

Raises:

Type Description
ValueError

If 'snapshot_date' column is not present in the DataFrame

Source code in src/patientflow/prepare.py
def prepare_group_snapshot_dict(df, start_dt=None, end_dt=None):
    """Prepare a dictionary mapping snapshot dates to their corresponding snapshot indices.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing at least a 'snapshot_date' column
    start_dt : datetime.date, optional
        Start date for filtering snapshots, by default None
    end_dt : datetime.date, optional
        End date for filtering snapshots, by default None

    Returns
    -------
    dict
        A dictionary where:
        - Keys are dates
        - Values are arrays of indices corresponding to each date's snapshots
        - Empty arrays for dates with no snapshots (if start_dt and end_dt are provided)

    Raises
    ------
    ValueError
        If 'snapshot_date' column is not present in the DataFrame
    """
    # Ensure 'snapshot_date' is in the DataFrame
    if "snapshot_date" not in df.columns:
        raise ValueError("DataFrame must include a 'snapshot_date' column")

    # Filter DataFrame to date range if provided
    filtered_df = df.copy()
    if start_dt and end_dt:
        filtered_df = df[
            (df["snapshot_date"] >= start_dt) & (df["snapshot_date"] < end_dt)
        ]

    # Group the DataFrame by 'snapshot_date' and collect the indices for each group
    snapshots_dict = {
        date: group.index.tolist()
        for date, group in filtered_df.groupby("snapshot_date")
    }

    # If start_dt and end_dt are specified, add any missing keys from prediction_dates
    if start_dt:
        prediction_dates = pd.date_range(
            start=start_dt, end=end_dt, freq="D"
        ).date.tolist()[:-1]
        for dt in prediction_dates:
            if dt not in snapshots_dict:
                snapshots_dict[dt] = []

    return snapshots_dict
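
A minimal sketch:

>>> import pandas as pd
>>> from datetime import date
>>> df = pd.DataFrame({"snapshot_date": [date(2031, 3, 1), date(2031, 3, 1), date(2031, 3, 2)]})
>>> snapshots = prepare_group_snapshot_dict(df)
>>> snapshots[date(2031, 3, 1)]
[0, 1]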

prepare_patient_snapshots(df, prediction_time, exclude_columns=[], single_snapshot_per_visit=True, visit_col=None, label_col='is_admitted')

Prepare patient snapshots for model training or prediction.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing patient visit data

required
prediction_time str or datetime

The specific prediction time to filter for

required
exclude_columns list

List of columns to exclude from the final DataFrame, by default []

[]
single_snapshot_per_visit bool

Whether to select only one snapshot per visit, by default True

True
visit_col str

Name of the column containing visit identifiers, required if single_snapshot_per_visit is True

None
label_col str

Name of the column containing the target labels, by default "is_admitted"

'is_admitted'

Returns:

Type Description
Tuple[DataFrame, Series]

A tuple containing:
- DataFrame: Processed DataFrame with features
- Series: Corresponding labels

Raises:

Type Description
ValueError

If single_snapshot_per_visit is True but visit_col is not provided

Source code in src/patientflow/prepare.py
def prepare_patient_snapshots(
    df,
    prediction_time,
    exclude_columns=[],
    single_snapshot_per_visit=True,
    visit_col=None,
    label_col="is_admitted",
) -> Tuple[pd.DataFrame, pd.Series]:
    """Prepare patient snapshots for model training or prediction.

    Parameters
    ----------
    df : pandas.DataFrame
        Input DataFrame containing patient visit data
    prediction_time : str or datetime
        The specific prediction time to filter for
    exclude_columns : list, optional
        List of columns to exclude from the final DataFrame, by default []
    single_snapshot_per_visit : bool, optional
        Whether to select only one snapshot per visit, by default True
    visit_col : str, optional
        Name of the column containing visit identifiers, required if single_snapshot_per_visit is True
    label_col : str, optional
        Name of the column containing the target labels, by default "is_admitted"

    Returns
    -------
    Tuple[pandas.DataFrame, pandas.Series]
        A tuple containing:
        - DataFrame: Processed DataFrame with features
        - Series: Corresponding labels

    Raises
    ------
    ValueError
        If single_snapshot_per_visit is True but visit_col is not provided
    """
    if single_snapshot_per_visit and visit_col is None:
        raise ValueError(
            "visit_col must be provided when single_snapshot_per_visit is True"
        )

    # Filter by the time of day while keeping the original index
    df_tod = df[df["prediction_time"] == prediction_time].copy()

    if single_snapshot_per_visit:
        # Select one row for each visit
        df_single = select_one_snapshot_per_visit(df_tod, visit_col)
        # Create label array with the same index
        y = df_single.pop(label_col).astype(int)
        # Drop specified columns and ensure we do not reset the index
        df_single.drop(columns=exclude_columns, inplace=True)
        return df_single, y
    else:
        # Directly modify df_tod without resetting the index
        df_tod.drop(
            columns=["random_number"] + exclude_columns, inplace=True, errors="ignore"
        )
        y = df_tod.pop(label_col).astype(int)
        return df_tod, y
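
A hedged sketch (df is hypothetical; the prediction_time value must match whatever format the 'prediction_time' column stores, shown here as an (hour, minute) tuple):

>>> X, y = prepare_patient_snapshots(
...     df,
...     prediction_time=(9, 30),
...     exclude_columns=["snapshot_date"],
...     visit_col="visit_number",
... )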

select_one_snapshot_per_visit(df, visit_col, seed=42)

Select one random snapshot per visit from a DataFrame.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing visit snapshots

required
visit_col str

Name of the column containing visit identifiers

required
seed int

Random seed for reproducibility, by default 42

42

Returns:

Type Description
DataFrame

DataFrame containing one randomly selected snapshot per visit

Source code in src/patientflow/prepare.py
def select_one_snapshot_per_visit(df, visit_col, seed=42):
    """Select one random snapshot per visit from a DataFrame.

    Parameters
    ----------
    df : pandas.DataFrame
        Input DataFrame containing visit snapshots
    visit_col : str
        Name of the column containing visit identifiers
    seed : int, optional
        Random seed for reproducibility, by default 42

    Returns
    -------
    pandas.DataFrame
        DataFrame containing one randomly selected snapshot per visit
    """
    # Generate random numbers if not present
    if "random_number" not in df.columns:
        if seed is not None:
            np.random.seed(seed)
        df["random_number"] = np.random.random(size=len(df))

    # Select the row with the maximum random_number for each visit
    max_indices = df.groupby(visit_col)["random_number"].idxmax()
    return df.loc[max_indices].drop(columns=["random_number"])
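
A minimal sketch; note that the function adds a temporary random_number column to the input frame if one is not already present:

>>> import pandas as pd
>>> df = pd.DataFrame({"visit_number": [101, 101, 102], "obs": [1, 2, 3]})
>>> single = select_one_snapshot_per_visit(df, visit_col="visit_number")
>>> len(single)
2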

validate_special_category_objects(special_params)

Validate that a special category parameters dictionary contains all required keys.

Parameters:

Name Type Description Default
special_params Dict[str, Any]

Dictionary of special category parameters to validate

required

Raises:

Type Description
MissingKeysError

If any required keys are missing from the dictionary

Source code in src/patientflow/prepare.py
def validate_special_category_objects(special_params: Dict[str, Any]) -> None:
    """Validate that a special category parameters dictionary contains all required keys.

    Parameters
    ----------
    special_params : Dict[str, Any]
        Dictionary of special category parameters to validate

    Raises
    ------
    MissingKeysError
        If any required keys are missing from the dictionary
    """
    required_keys = [
        "special_category_func",
        "special_category_dict",
        "special_func_map",
    ]
    missing_keys = [key for key in required_keys if key not in special_params]

    if missing_keys:
        raise MissingKeysError(missing_keys)
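
A quick sketch pairing this check with create_special_category_objects:

>>> params = create_special_category_objects(["age_on_arrival"])
>>> validate_special_category_objects(params)  # returns None; raises MissingKeysError if a key is missing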

write_data_dict(df, dict_name, dict_path)

Write a data dictionary for a DataFrame to both Markdown and CSV formats.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame to create a data dictionary for

required
dict_name str

Base name for the output files (without extension)

required
dict_path str or Path

Directory path where the data dictionary files will be written

required

Returns:

Type Description
DataFrame

The created data dictionary as a DataFrame

Notes

Creates two files:
- {dict_name}.md: Markdown format data dictionary
- {dict_name}.csv: CSV format data dictionary

For visit data, includes separate statistics for admitted and non-admitted patients.

Source code in src/patientflow/prepare.py
def write_data_dict(df, dict_name, dict_path):
    """Write a data dictionary for a DataFrame to both Markdown and CSV formats.

    Parameters
    ----------
    df : pandas.DataFrame
        Input DataFrame to create a data dictionary for
    dict_name : str
        Base name for the output files (without extension)
    dict_path : str or pathlib.Path
        Directory path where the data dictionary files will be written

    Returns
    -------
    pandas.DataFrame
        The created data dictionary as a DataFrame

    Notes
    -----
    Creates two files:
    - {dict_name}.md: Markdown format data dictionary
    - {dict_name}.csv: CSV format data dictionary

    For visit data, includes separate statistics for admitted and non-admitted patients.
    """
    cols_to_exclude = ["snapshot_id", "visit_number"]

    df = df.copy(deep=True)

    if "visits" in dict_name:
        df.consultation_sequence = df.consultation_sequence.apply(
            lambda x: str(x)
        ).to_frame()
        df.final_sequence = df.final_sequence.apply(lambda x: str(x)).to_frame()
        df_admitted = df[df.is_admitted]
        df_not_admitted = df[~df.is_admitted]
        dict_col_groups = get_dict_cols(df)

        data_dict = pd.DataFrame(
            {
                "Variable type": [
                    find_group_for_colname(col, dict_col_groups) for col in df.columns
                ],
                "Column Name": df.columns,
                "Data Type": df.dtypes,
                "Description": [generate_description(col) for col in df.columns],
                "Whole dataset": [
                    additional_details(df[col], col)
                    if col not in cols_to_exclude
                    else ""
                    for col in df.columns
                ],
                "Admitted": [
                    additional_details(df_admitted[col], col)
                    if col not in cols_to_exclude
                    else ""
                    for col in df_admitted.columns
                ],
                "Not admitted": [
                    additional_details(df_not_admitted[col], col)
                    if col not in cols_to_exclude
                    else ""
                    for col in df_not_admitted.columns
                ],
            }
        )
        data_dict["Whole dataset"] = data_dict["Whole dataset"].str.replace("'", "")
        data_dict["Admitted"] = data_dict["Admitted"].str.replace("'", "")
        data_dict["Not admitted"] = data_dict["Not admitted"].str.replace("'", "")

    else:
        data_dict = pd.DataFrame(
            {
                "Column Name": df.columns,
                "Data Type": df.dtypes,
                "Description": [generate_description(col) for col in df.columns],
                "Additional Details": [
                    additional_details(df[col], col)
                    if col not in cols_to_exclude
                    else ""
                    for col in df.columns
                ],
            }
        )
        data_dict["Additional Details"] = data_dict["Additional Details"].str.replace(
            "'", ""
        )

    # Export to Markdown and csv for data dictionary
    data_dict.to_markdown(str(dict_path) + "/" + dict_name + ".md", index=False)
    data_dict.to_csv(str(dict_path) + "/" + dict_name + ".csv", index=False)

    return data_dict
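
A minimal usage sketch, assuming `write_data_dict` is importable from `patientflow.prepare` (matching the source location above); the DataFrame columns and output directory are hypothetical:

```python
import pandas as pd
from pathlib import Path

from patientflow.prepare import write_data_dict  # import path assumed from the source location above

# Hypothetical DataFrame; the dict_name does not contain "visits",
# so the single "Additional Details" column is produced
df = pd.DataFrame(
    {
        "age_group": ["18-24", "65-74", "35-44"],
        "num_obs": [3, 7, 5],
    }
)

out_dir = Path("data_dictionaries")  # hypothetical output directory
out_dir.mkdir(exist_ok=True)

# Writes data_dictionaries/example_dict.md and .csv, and returns the dictionary
data_dict = write_data_dict(df, dict_name="example_dict", dict_path=out_dir)
print(data_dict["Column Name"].tolist())
```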

survival_curve

Core survival curve calculation functions for patient flow analysis.

This module provides the mathematical computation functions for survival analysis without visualization dependencies.

Functions:

| Name | Description |
|------|-------------|
| calculate_survival_curve : function | Calculate survival curve data from patient visit data |

calculate_survival_curve(df, start_time_col, end_time_col)

Calculate survival curve data from patient visit data.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| df | DataFrame | DataFrame containing patient visit data | required |
| start_time_col | str | Name of the column containing the start time (e.g., arrival time) | required |
| end_time_col | str | Name of the column containing the end time (e.g., admission time) | required |

Returns:

| Type | Description |
|------|-------------|
| tuple of (numpy.ndarray, numpy.ndarray, pandas.DataFrame) | unique_times: array of time points in hours; survival_prob: array of survival probabilities at each time point; df_clean: cleaned DataFrame with wait_time_hours column added |
Source code in src/patientflow/survival_curve.py
def calculate_survival_curve(df, start_time_col, end_time_col):
    """Calculate survival curve data from patient visit data.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing patient visit data
    start_time_col : str
        Name of the column containing the start time (e.g., arrival time)
    end_time_col : str
        Name of the column containing the end time (e.g., admission time)

    Returns
    -------
    tuple of (numpy.ndarray, numpy.ndarray, pandas.DataFrame)
        - unique_times: Array of time points in hours
        - survival_prob: Array of survival probabilities at each time point
        - df_clean: Cleaned DataFrame with wait_time_hours column added
    """
    # Calculate the wait time in hours
    df = df.copy()
    df["wait_time_hours"] = (
        df[end_time_col] - df[start_time_col]
    ).dt.total_seconds() / 3600

    # Drop any rows with missing wait times
    df_clean = df.dropna(subset=["wait_time_hours"]).copy()

    # Sort the data by wait time
    df_clean = df_clean.sort_values("wait_time_hours")

    # Calculate the number of patients
    n_patients = len(df_clean)

    # Calculate the survival function manually
    # For each time point, calculate proportion of patients who are still waiting
    unique_times = np.sort(df_clean["wait_time_hours"].unique())
    survival_prob = []

    for t in unique_times:
        # Number of patients who experienced the event after this time point
        n_event_after = sum(df_clean["wait_time_hours"] > t)
        # Proportion of patients still waiting
        survival_prob.append(n_event_after / n_patients)

    # Add zero hours wait time (everyone is waiting at time 0)
    unique_times = np.insert(unique_times, 0, 0)
    survival_prob = np.insert(survival_prob, 0, 1.0)

    return unique_times, survival_prob, df_clean
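
A short usage sketch with toy timestamps (import path assumed from the source location above). Each `survival_prob[i]` is the empirical proportion of patients whose wait exceeds `unique_times[i]` hours:

```python
import pandas as pd

from patientflow.survival_curve import calculate_survival_curve  # import path assumed

# Hypothetical visits with arrival and admission timestamps
df = pd.DataFrame(
    {
        "arrival_datetime": pd.to_datetime(
            ["2031-01-01 08:00", "2031-01-01 09:00", "2031-01-01 10:00"]
        ),
        "admission_datetime": pd.to_datetime(
            ["2031-01-01 10:00", "2031-01-01 13:00", "2031-01-01 11:30"]
        ),
    }
)

unique_times, survival_prob, df_clean = calculate_survival_curve(
    df, start_time_col="arrival_datetime", end_time_col="admission_datetime"
)
# Waits are 2.0, 4.0 and 1.5 hours, so:
# unique_times  -> [0.0, 1.5, 2.0, 4.0]
# survival_prob -> [1.0, 0.667, 0.333, 0.0]
print(list(zip(unique_times, survival_prob)))
```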

train

Training module for patient flow models.

This module provides functionality for training various predictive models used in patient flow analysis, including classifiers and demand forecasting models.

classifiers

Machine learning classifiers for patient flow prediction.

This module provides functions for training and evaluating machine learning classifiers for patient admission prediction. It includes utilities for data preparation, model training, hyperparameter tuning, and evaluation using time series cross-validation.

Functions:

| Name | Description |
|------|-------------|
| evaluate_predictions | Calculate multiple metrics (AUC, log loss, AUPRC) for given predictions |
| chronological_cross_validation | Perform time series cross-validation with multiple metrics |
| initialise_model | Initialize a model with given hyperparameters |
| create_column_transformer | Create a column transformer for a dataframe with dynamic column handling |
| calculate_class_balance | Calculate class balance ratios for target labels |
| get_feature_metadata | Extract feature names and importances from pipeline |
| get_dataset_metadata | Get dataset sizes and class balances |
| create_balance_info | Create a dictionary with balance information |
| evaluate_model | Evaluate model on test set |
| train_classifier | Train a single model including data preparation and balancing |
| train_multiple_classifiers | Train admission prediction models for multiple prediction times |

calculate_class_balance(y)

Calculate class balance ratios for target labels.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| y | Series | Target labels | required |

Returns:

| Type | Description |
|------|-------------|
| Dict[Any, float] | Dictionary mapping each class to its proportion |

Source code in src/patientflow/train/classifiers.py
def calculate_class_balance(y: Series) -> Dict[Any, float]:
    """Calculate class balance ratios for target labels.

    Parameters
    ----------
    y : Series
        Target labels

    Returns
    -------
    Dict[Any, float]
        Dictionary mapping each class to its proportion
    """
    counter = Counter(y)
    total = len(y)
    return {cls: count / total for cls, count in counter.items()}
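
For example (import path assumed from the source location above):

```python
import pandas as pd

from patientflow.train.classifiers import calculate_class_balance  # import path assumed

y = pd.Series([1, 0, 0, 0, 1])
print(calculate_class_balance(y))  # {1: 0.4, 0: 0.6}
```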

chronological_cross_validation(pipeline, X, y, n_splits=5)

Perform time series cross-validation with multiple metrics.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| pipeline | Pipeline | Sklearn pipeline to evaluate | required |
| X | DataFrame | Feature matrix | required |
| y | Series | Target labels | required |
| n_splits | int | Number of time series splits | 5 |

Returns:

| Type | Description |
|------|-------------|
| Dict[str, float] | Dictionary containing training and validation metrics |

Source code in src/patientflow/train/classifiers.py
def chronological_cross_validation(
    pipeline: Pipeline, X: DataFrame, y: Series, n_splits: int = 5
) -> Dict[str, float]:
    """Perform time series cross-validation with multiple metrics.

    Parameters
    ----------
    pipeline : Pipeline
        Sklearn pipeline to evaluate
    X : DataFrame
        Feature matrix
    y : Series
        Target labels
    n_splits : int, optional
        Number of time series splits, by default 5

    Returns
    -------
    Dict[str, float]
        Dictionary containing training and validation metrics
    """
    tscv = TimeSeriesSplit(n_splits=n_splits)

    train_metrics: List[FoldResults] = []
    valid_metrics: List[FoldResults] = []

    for train_idx, valid_idx in tscv.split(X):
        X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]

        pipeline.fit(X_train, y_train)
        train_preds = pipeline.predict_proba(X_train)[:, 1]
        valid_preds = pipeline.predict_proba(X_valid)[:, 1]

        train_metrics.append(evaluate_predictions(y_train, train_preds))
        valid_metrics.append(evaluate_predictions(y_valid, valid_preds))

    def aggregate_metrics(metrics_list: List[FoldResults]) -> Dict[str, float]:
        return {
            field: np.mean([getattr(m, field) for m in metrics_list])
            for field in FoldResults.__dataclass_fields__
        }

    train_means = aggregate_metrics(train_metrics)
    valid_means = aggregate_metrics(valid_metrics)

    return {f"train_{metric}": value for metric, value in train_means.items()} | {
        f"valid_{metric}": value for metric, value in valid_means.items()
    }
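
A minimal sketch on synthetic data (import path assumed). Note that the rows must already be in chronological order, since TimeSeriesSplit splits by position:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from patientflow.train.classifiers import chronological_cross_validation  # import path assumed

rng = np.random.default_rng(42)
X = pd.DataFrame({"feature": rng.normal(size=200)})  # rows assumed to be in time order
y = pd.Series(((X["feature"] + rng.normal(scale=0.5, size=200)) > 0).astype(int))

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
metrics = chronological_cross_validation(pipe, X, y, n_splits=5)

# Keys are the FoldResults fields (auc, logloss, auprc), prefixed train_/valid_
print(metrics["valid_auc"], metrics["valid_logloss"])
```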

create_balance_info(is_balanced, original_size, balanced_size, original_positive_rate, balanced_positive_rate, majority_to_minority_ratio)

Create a dictionary with balance information.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| is_balanced | bool | Whether the dataset was balanced | required |
| original_size | int | Original dataset size | required |
| balanced_size | int | Size after balancing | required |
| original_positive_rate | float | Positive class rate before balancing | required |
| balanced_positive_rate | float | Positive class rate after balancing | required |
| majority_to_minority_ratio | float | Ratio of majority to minority class samples | required |

Returns:

| Type | Description |
|------|-------------|
| Dict[str, Union[bool, int, float]] | Dictionary containing balance information |

Source code in src/patientflow/train/classifiers.py
def create_balance_info(
    is_balanced: bool,
    original_size: int,
    balanced_size: int,
    original_positive_rate: float,
    balanced_positive_rate: float,
    majority_to_minority_ratio: float,
) -> Dict[str, Union[bool, int, float]]:
    """Create a dictionary with balance information.

    Parameters
    ----------
    is_balanced : bool
        Whether the dataset was balanced
    original_size : int
        Original dataset size
    balanced_size : int
        Size after balancing
    original_positive_rate : float
        Positive class rate before balancing
    balanced_positive_rate : float
        Positive class rate after balancing
    majority_to_minority_ratio : float
        Ratio of majority to minority class samples

    Returns
    -------
    Dict[str, Union[bool, int, float]]
        Dictionary containing balance information
    """
    return {
        "is_balanced": is_balanced,
        "original_size": original_size,
        "balanced_size": balanced_size,
        "original_positive_rate": original_positive_rate,
        "balanced_positive_rate": balanced_positive_rate,
        "majority_to_minority_ratio": majority_to_minority_ratio,
    }

create_column_transformer(df, ordinal_mappings=None)

Create a column transformer for a dataframe with dynamic column handling.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| df | DataFrame | Input dataframe | required |
| ordinal_mappings | Dict[str, List[Any]] | Mappings for ordinal categorical features | None |

Returns:

| Type | Description |
|------|-------------|
| ColumnTransformer | Configured column transformer |

Source code in src/patientflow/train/classifiers.py
def create_column_transformer(
    df: DataFrame, ordinal_mappings: Optional[Dict[str, List[Any]]] = None
) -> ColumnTransformer:
    """Create a column transformer for a dataframe with dynamic column handling.

    Parameters
    ----------
    df : DataFrame
        Input dataframe
    ordinal_mappings : Dict[str, List[Any]], optional
        Mappings for ordinal categorical features, by default None

    Returns
    -------
    ColumnTransformer
        Configured column transformer
    """
    transformers: List[
        Tuple[str, Union[OrdinalEncoder, OneHotEncoder, StandardScaler], List[str]]
    ] = []

    if ordinal_mappings is None:
        ordinal_mappings = {}

    for col in df.columns:
        if col in ordinal_mappings:
            transformers.append(
                (
                    col,
                    OrdinalEncoder(
                        categories=[ordinal_mappings[col]],
                        handle_unknown="use_encoded_value",
                        unknown_value=np.nan,
                    ),
                    [col],
                )
            )
        elif df[col].dtype == "object" or (
            df[col].dtype == "bool" or df[col].nunique() == 2
        ):
            transformers.append((col, OneHotEncoder(handle_unknown="ignore"), [col]))
        else:
            transformers.append((col, StandardScaler(), [col]))

    return ColumnTransformer(transformers)
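
A small sketch showing all three branches (ordinal, one-hot, scaled); import path assumed and column names hypothetical:

```python
import pandas as pd

from patientflow.train.classifiers import create_column_transformer  # import path assumed

X = pd.DataFrame(
    {
        "age_group": ["18-24", "65-74", "25-34"],             # ordinal-encoded via the mapping
        "arrival_mode": ["ambulance", "walk-in", "walk-in"],  # object dtype -> one-hot encoded
        "num_obs": [3.0, 7.0, 5.0],                           # numeric -> standard-scaled
    }
)

ct = create_column_transformer(
    X, ordinal_mappings={"age_group": ["18-24", "25-34", "65-74"]}
)
print(ct.fit_transform(X))
```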

evaluate_model(pipeline, X_test, y_test)

Evaluate model on test set.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| pipeline | Pipeline | Trained sklearn pipeline | required |
| X_test | DataFrame | Test features | required |
| y_test | Series | Test labels | required |

Returns:

| Type | Description |
|------|-------------|
| Dict[str, float] | Dictionary containing test metrics |

Source code in src/patientflow/train/classifiers.py
def evaluate_model(
    pipeline: Pipeline, X_test: DataFrame, y_test: Series
) -> Dict[str, float]:
    """Evaluate model on test set.

    Parameters
    ----------
    pipeline : Pipeline
        Trained sklearn pipeline
    X_test : DataFrame
        Test features
    y_test : Series
        Test labels

    Returns
    -------
    Dict[str, float]
        Dictionary containing test metrics
    """
    y_test_pred = pipeline.predict_proba(X_test)[:, 1]
    return {
        "test_auc": float(roc_auc_score(y_test, y_test_pred)),
        "test_logloss": float(log_loss(y_test, y_test_pred)),
        "test_auprc": float(average_precision_score(y_test, y_test_pred)),
    }

evaluate_predictions(y_true, y_pred)

Calculate multiple metrics for given predictions.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| y_true | npt.NDArray[np.int_] | True binary labels | required |
| y_pred | npt.NDArray[np.float64] | Predicted probabilities | required |

Returns:

| Type | Description |
|------|-------------|
| FoldResults | Object containing AUC, log loss, and AUPRC metrics |

Source code in src/patientflow/train/classifiers.py
def evaluate_predictions(
    y_true: npt.NDArray[np.int_], y_pred: npt.NDArray[np.float64]
) -> FoldResults:
    """Calculate multiple metrics for given predictions.

    Parameters
    ----------
    y_true : npt.NDArray[np.int_]
        True binary labels
    y_pred : npt.NDArray[np.float64]
        Predicted probabilities

    Returns
    -------
    FoldResults
        Object containing AUC, log loss, and AUPRC metrics
    """
    return FoldResults(
        auc=roc_auc_score(y_true, y_pred),
        logloss=log_loss(y_true, y_pred),
        auprc=average_precision_score(y_true, y_pred),
    )
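
For example (import path assumed from the source location above):

```python
import numpy as np

from patientflow.train.classifiers import evaluate_predictions  # import path assumed

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8])

fold = evaluate_predictions(y_true, y_pred)
print(fold.auc, fold.logloss, fold.auprc)  # FoldResults dataclass fields
```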

get_dataset_metadata(X_train, X_valid, y_train, y_valid, X_test=None, y_test=None)

Get dataset sizes and class balances.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| X_train | DataFrame | Training features | required |
| X_valid | DataFrame | Validation features | required |
| y_train | Series | Training labels | required |
| y_valid | Series | Validation labels | required |
| X_test | DataFrame | Test features. If None, test set information will be set to None. | None |
| y_test | Series | Test labels. If None, test set information will be set to None. | None |

Returns:

| Type | Description |
|------|-------------|
| DatasetMetadata | Dictionary containing dataset sizes and class balances |

Source code in src/patientflow/train/classifiers.py
def get_dataset_metadata(
    X_train: DataFrame,
    X_valid: DataFrame,
    y_train: Series,
    y_valid: Series,
    X_test: Optional[DataFrame] = None,
    y_test: Optional[Series] = None,
) -> DatasetMetadata:
    """Get dataset sizes and class balances.

    Parameters
    ----------
    X_train : DataFrame
        Training features
    X_valid : DataFrame
        Validation features
    y_train : Series
        Training labels
    y_valid : Series
        Validation labels
    X_test : DataFrame, optional
        Test features. If None, test set information will be set to None.
    y_test : Series, optional
        Test labels. If None, test set information will be set to None.

    Returns
    -------
    DatasetMetadata
        Dictionary containing dataset sizes and class balances
    """
    metadata: DatasetMetadata = {
        "train_valid_test_set_no": {
            "train_set_no": len(X_train),
            "valid_set_no": len(X_valid),
            "test_set_no": len(X_test) if X_test is not None else None,
        },
        "train_valid_test_class_balance": {
            "y_train_class_balance": calculate_class_balance(y_train),
            "y_valid_class_balance": calculate_class_balance(y_valid),
            "y_test_class_balance": calculate_class_balance(y_test)
            if y_test is not None
            else None,
        },
    }

    return metadata

get_feature_metadata(pipeline)

Extract feature names and importances from pipeline.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| pipeline | Pipeline | Sklearn pipeline containing feature transformer and classifier | required |

Returns:

| Type | Description |
|------|-------------|
| FeatureMetadata | Dictionary containing feature names and their importance scores (if available) |

Raises:

| Type | Description |
|------|-------------|
| AttributeError | If the classifier doesn't support feature importance |

Source code in src/patientflow/train/classifiers.py
def get_feature_metadata(pipeline: Pipeline) -> FeatureMetadata:
    """
    Extract feature names and importances from pipeline.

    Parameters
    ----------
    pipeline : Pipeline
        Sklearn pipeline containing feature transformer and classifier

    Returns
    -------
    FeatureMetadata
        Dictionary containing feature names and their importance scores (if available)

    Raises
    ------
    AttributeError
        If the classifier doesn't support feature importance
    """
    transformed_cols = pipeline.named_steps[
        "feature_transformer"
    ].get_feature_names_out()
    classifier = pipeline.named_steps["classifier"]

    # Try different common feature importance attributes
    if hasattr(classifier, "feature_importances_"):
        importances = classifier.feature_importances_
    elif hasattr(classifier, "coef_"):
        importances = (
            np.abs(classifier.coef_[0])
            if classifier.coef_.ndim > 1
            else np.abs(classifier.coef_)
        )
    else:
        raise AttributeError("Classifier doesn't provide feature importance scores")

    return {
        "feature_names": [col.split("__")[-1] for col in transformed_cols],
        "feature_importances": importances.tolist(),
    }
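
A minimal sketch (import paths assumed). Note the pipeline steps must be named 'feature_transformer' and 'classifier', matching the pipelines built by train_classifier:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

from patientflow.train.classifiers import (  # import paths assumed
    create_column_transformer,
    get_feature_metadata,
)

# Tiny illustrative dataset
X = pd.DataFrame({"num_obs": [1.0, 2.0, 3.0, 4.0]})
y = pd.Series([0, 0, 1, 1])

pipe = Pipeline(
    [
        ("feature_transformer", create_column_transformer(X)),
        ("classifier", XGBClassifier(n_estimators=5)),
    ]
)
pipe.fit(X, y)

meta = get_feature_metadata(pipe)
print(meta["feature_names"], meta["feature_importances"])
```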

initialise_model(model_class, params, xgb_specific_params={'n_jobs': -1, 'eval_metric': 'logloss', 'enable_categorical': True})

Initialize a model with given hyperparameters.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model_class | Type | The classifier class to instantiate | required |
| params | Dict[str, Any] | Model-specific parameters to set | required |
| xgb_specific_params | Dict[str, Any] | XGBoost-specific default parameters | {'n_jobs': -1, 'eval_metric': 'logloss', 'enable_categorical': True} |

Returns:

| Type | Description |
|------|-------------|
| Any | Initialized model instance |

Source code in src/patientflow/train/classifiers.py
def initialise_model(
    model_class: Type,
    params: Dict[str, Any],
    xgb_specific_params: Dict[str, Any] = {
        "n_jobs": -1,
        "eval_metric": "logloss",
        "enable_categorical": True,
    },
) -> Any:
    """
    Initialize a model with given hyperparameters.

    Parameters
    ----------
    model_class : Type
        The classifier class to instantiate
    params : Dict[str, Any]
        Model-specific parameters to set
    xgb_specific_params : Dict[str, Any], optional
        XGBoost-specific default parameters

    Returns
    -------
    Any
        Initialized model instance
    """
    if model_class == XGBClassifier:
        model = model_class(**xgb_specific_params)
        model.set_params(**params)
    else:
        model = model_class(**params)

    return model
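
For example (import path assumed from the source location above):

```python
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from patientflow.train.classifiers import initialise_model  # import path assumed

# XGBoost gets the XGB-specific defaults first, then the grid parameters on top
xgb = initialise_model(XGBClassifier, {"n_estimators": 30, "subsample": 0.7})

# Any other sklearn-compatible class is constructed from the parameters alone
rf = initialise_model(RandomForestClassifier, {"n_estimators": 100})
```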

train_classifier(train_visits, valid_visits, prediction_time, exclude_from_training_data, grid, ordinal_mappings, test_visits=None, visit_col=None, model_class=XGBClassifier, use_balanced_training=True, majority_to_minority_ratio=1.0, calibrate_probabilities=True, calibration_method='sigmoid', single_snapshot_per_visit=True, label_col='is_admitted', evaluate_on_test=False)

Train a single model including data preparation and balancing.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| train_visits | DataFrame | Training visits dataset | required |
| valid_visits | DataFrame | Validation visits dataset | required |
| prediction_time | Tuple[int, int] | The prediction time point to use | required |
| exclude_from_training_data | List[str] | Columns to exclude from training | required |
| grid | Dict[str, List[Any]] | Parameter grid for hyperparameter tuning | required |
| ordinal_mappings | Dict[str, List[Any]] | Mappings for ordinal categorical features | required |
| test_visits | DataFrame | Test visits dataset. Required only when evaluate_on_test=True. | None |
| visit_col | str | Name of the visit column. Required if single_snapshot_per_visit is True. | None |
| model_class | Type | The classifier class to use. Must be sklearn-compatible with fit() and predict_proba(). | XGBClassifier |
| use_balanced_training | bool | Whether to use balanced training data | True |
| majority_to_minority_ratio | float | Ratio of majority to minority class samples | 1.0 |
| calibrate_probabilities | bool | Whether to apply probability calibration to the best model | True |
| calibration_method | str | Method for probability calibration ('isotonic' or 'sigmoid') | 'sigmoid' |
| single_snapshot_per_visit | bool | Whether to select only one snapshot per visit. If True, visit_col must be provided. | True |
| label_col | str | Name of the column containing the target labels | 'is_admitted' |
| evaluate_on_test | bool | Whether to evaluate the final model on the test set. Set to True only when satisfied with validation performance, to avoid test set contamination. | False |

Returns:

| Type | Description |
|------|-------------|
| TrainedClassifier | Trained model, including metrics and feature information |

Source code in src/patientflow/train/classifiers.py
def train_classifier(
    train_visits: DataFrame,
    valid_visits: DataFrame,
    prediction_time: Tuple[int, int],
    exclude_from_training_data: List[str],
    grid: Dict[str, List[Any]],
    ordinal_mappings: Dict[str, List[Any]],
    test_visits: Optional[DataFrame] = None,
    visit_col: Optional[str] = None,
    model_class: Type = XGBClassifier,
    use_balanced_training: bool = True,
    majority_to_minority_ratio: float = 1.0,
    calibrate_probabilities: bool = True,
    calibration_method: str = "sigmoid",
    single_snapshot_per_visit: bool = True,
    label_col: str = "is_admitted",
    evaluate_on_test: bool = False,
) -> TrainedClassifier:
    """
    Train a single model including data preparation and balancing.

    Parameters
    ----------
    train_visits : DataFrame
        Training visits dataset
    valid_visits : DataFrame
        Validation visits dataset
    prediction_time : Tuple[int, int]
        The prediction time point to use
    exclude_from_training_data : List[str]
        Columns to exclude from training
    grid : Dict[str, List[Any]]
        Parameter grid for hyperparameter tuning
    ordinal_mappings : Dict[str, List[Any]]
        Mappings for ordinal categorical features
    test_visits : DataFrame, optional
        Test visits dataset. Required only when evaluate_on_test=True.
    visit_col : str, optional
        Name of the visit column. Required if single_snapshot_per_visit is True.
    model_class : Type, optional
        The classifier class to use. Must be sklearn-compatible with fit() and predict_proba().
        Defaults to XGBClassifier.
    use_balanced_training : bool, default=True
        Whether to use balanced training data
    majority_to_minority_ratio : float, default=1.0
        Ratio of majority to minority class samples
    calibrate_probabilities : bool, default=True
        Whether to apply probability calibration to the best model
    calibration_method : str, default='sigmoid'
        Method for probability calibration ('isotonic' or 'sigmoid')
    single_snapshot_per_visit : bool, default=True
        Whether to select only one snapshot per visit. If True, visit_col must be provided.
    label_col : str, default="is_admitted"
        Name of the column containing the target labels
    evaluate_on_test : bool, default=False
        Whether to evaluate the final model on the test set. Set to True only when
        satisfied with validation performance to avoid test set contamination.

    Returns
    -------
    TrainedClassifier
        Trained model, including metrics, and feature information

    """
    if single_snapshot_per_visit and visit_col is None:
        raise ValueError(
            "visit_col must be provided when single_snapshot_per_visit is True"
        )

    if evaluate_on_test and test_visits is None:
        raise ValueError("test_visits must be provided when evaluate_on_test=True")

    # Get snapshots for each set
    X_train, y_train = prepare_patient_snapshots(
        train_visits,
        prediction_time,
        exclude_from_training_data,
        visit_col=visit_col,
        single_snapshot_per_visit=single_snapshot_per_visit,
        label_col=label_col,
    )
    X_valid, y_valid = prepare_patient_snapshots(
        valid_visits,
        prediction_time,
        exclude_from_training_data,
        visit_col=visit_col,
        single_snapshot_per_visit=single_snapshot_per_visit,
        label_col=label_col,
    )

    # Only prepare test data if evaluation is requested
    if evaluate_on_test:
        X_test, y_test = prepare_patient_snapshots(
            test_visits,
            prediction_time,
            exclude_from_training_data,
            visit_col=visit_col,
            single_snapshot_per_visit=single_snapshot_per_visit,
            label_col=label_col,
        )
    else:
        X_test, y_test = None, None

    # Get dataset metadata before any balancing
    dataset_metadata = get_dataset_metadata(
        X_train, X_valid, y_train, y_valid, X_test, y_test
    )

    # Store original size and positive rate before any balancing
    original_size = len(X_train)
    original_positive_rate = y_train.mean()

    if use_balanced_training:
        pos_indices = y_train[y_train == 1].index
        neg_indices = y_train[y_train == 0].index

        n_pos = len(pos_indices)
        n_neg = int(n_pos * majority_to_minority_ratio)

        neg_indices_sampled = np.random.choice(
            neg_indices, size=min(n_neg, len(neg_indices)), replace=False
        )

        train_balanced_indices = np.concatenate([pos_indices, neg_indices_sampled])
        np.random.shuffle(train_balanced_indices)

        X_train = X_train.loc[train_balanced_indices]
        y_train = y_train.loc[train_balanced_indices]

    # Create balance info after any balancing is done
    balance_info = create_balance_info(
        is_balanced=use_balanced_training,
        original_size=original_size,
        balanced_size=len(X_train),
        original_positive_rate=original_positive_rate,
        balanced_positive_rate=y_train.mean(),
        majority_to_minority_ratio=majority_to_minority_ratio
        if use_balanced_training
        else 1.0,
    )

    # Initialize best training results with default values
    best_training = TrainingResults(
        prediction_time=prediction_time,
        balance_info=balance_info,
        # Other fields will use their default empty dictionaries
    )

    # Initialize best model container
    best_model = TrainedClassifier(
        training_results=best_training,
        pipeline=None,
        calibrated_pipeline=None,
    )

    trials_list: List[HyperParameterTrial] = []
    best_logloss = float("inf")

    for params in ParameterGrid(grid):
        # Initialize model based on provided class
        model = initialise_model(model_class, params)

        column_transformer = create_column_transformer(X_train, ordinal_mappings)
        pipeline = Pipeline(
            [("feature_transformer", column_transformer), ("classifier", model)]
        )

        cv_results = chronological_cross_validation(
            pipeline, X_train, y_train, n_splits=5
        )
        # Store trial results
        trials_list.append(
            HyperParameterTrial(
                parameters=params.copy(),  # Make a copy to ensure immutability
                cv_results=cv_results,
            )
        )

        if cv_results["valid_logloss"] < best_logloss:
            best_logloss = cv_results["valid_logloss"]
            best_model.pipeline = pipeline

            # Get feature metadata if available
            try:
                feature_metadata = get_feature_metadata(pipeline)
                has_feature_importance = True
            except (AttributeError, NotImplementedError):
                feature_metadata = {
                    "feature_names": column_transformer.get_feature_names_out().tolist(),
                    "feature_importances": [],
                }
                has_feature_importance = False

            # Update training results
            best_training.training_info = {
                "cv_trials": trials_list,
                "features": {
                    "names": feature_metadata["feature_names"],
                    "importances": feature_metadata["feature_importances"],
                    "has_importance_values": has_feature_importance,
                },
                "dataset_info": dataset_metadata,
            }

            if calibrate_probabilities:
                best_training.calibration_info = {"method": calibration_method}

    # Apply probability calibration to the best model if requested
    if calibrate_probabilities and best_model.pipeline is not None:
        best_feature_transformer = best_model.pipeline.named_steps[
            "feature_transformer"
        ]
        best_classifier = best_model.pipeline.named_steps["classifier"]

        X_valid_transformed = best_feature_transformer.transform(X_valid)

        if sk_version >= "1.6.0":
            from sklearn.frozen import FrozenEstimator

            calibrated_classifier = CalibratedClassifierCV(
                estimator=FrozenEstimator(best_classifier),
                method=calibration_method,
            )
        else:
            calibrated_classifier = CalibratedClassifierCV(
                estimator=best_classifier, method=calibration_method, cv="prefit"
            )
        calibrated_classifier.fit(X_valid_transformed, y_valid)

        calibrated_pipeline = Pipeline(
            [
                ("feature_transformer", best_feature_transformer),
                ("classifier", calibrated_classifier),
            ]
        )

        best_model.calibrated_pipeline = calibrated_pipeline

        # Only evaluate on test set if requested
        if evaluate_on_test:
            best_training.test_results = evaluate_model(
                calibrated_pipeline, X_test, y_test
            )
        else:
            best_training.test_results = None

    else:
        # Only evaluate on test set if requested
        if evaluate_on_test:
            best_training.test_results = evaluate_model(
                best_model.pipeline, X_test, y_test
            )
        else:
            best_training.test_results = None

    return best_model
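
A usage sketch, not runnable as-is: it assumes train_visits and valid_visits are already-prepared snapshot DataFrames with 'is_admitted', 'visit_number', 'snapshot_date' and 'prediction_time' columns, and the import path is assumed from the source location above:

```python
from xgboost import XGBClassifier

from patientflow.train.classifiers import train_classifier  # import path assumed

# train_visits and valid_visits are assumed to exist as prepared snapshot DataFrames
trained = train_classifier(
    train_visits=train_visits,
    valid_visits=valid_visits,
    prediction_time=(9, 30),  # the 09:30 prediction time
    exclude_from_training_data=["visit_number", "snapshot_date", "prediction_time"],
    grid={"n_estimators": [30, 50]},
    ordinal_mappings={},
    visit_col="visit_number",
    model_class=XGBClassifier,
)

# With calibrate_probabilities=True (the default), trained.calibrated_pipeline
# holds the calibrated model; trained.pipeline holds the uncalibrated best model
print(trained.training_results.balance_info)
```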

train_multiple_classifiers(train_visits, valid_visits, grid, exclude_from_training_data, ordinal_mappings, prediction_times, test_visits=None, model_name='admissions', visit_col='visit_number', calibrate_probabilities=True, calibration_method='isotonic', use_balanced_training=True, majority_to_minority_ratio=1.0, label_col='is_admitted', evaluate_on_test=False)

Train admission prediction models for multiple prediction times.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| train_visits | DataFrame | Training visits dataset | required |
| valid_visits | DataFrame | Validation visits dataset | required |
| grid | Dict[str, List[Any]] | Parameter grid for hyperparameter tuning | required |
| exclude_from_training_data | List[str] | Columns to exclude from training | required |
| ordinal_mappings | Dict[str, List[Any]] | Mappings for ordinal categorical features | required |
| prediction_times | List[Tuple[int, int]] | List of prediction time points | required |
| test_visits | DataFrame | Test visits dataset | None |
| model_name | str | Name prefix for models | 'admissions' |
| visit_col | str | Name of the visit column | 'visit_number' |
| calibrate_probabilities | bool | Whether to calibrate probabilities | True |
| calibration_method | str | Calibration method | 'isotonic' |
| use_balanced_training | bool | Whether to use balanced training | True |
| majority_to_minority_ratio | float | Ratio for class balancing | 1.0 |
| label_col | str | Name of the label column | 'is_admitted' |
| evaluate_on_test | bool | Whether to evaluate on test set | False |

Returns:

| Type | Description |
|------|-------------|
| Dict[str, TrainedClassifier] | Dictionary mapping model keys to trained classifiers |

Source code in src/patientflow/train/classifiers.py
def train_multiple_classifiers(
    train_visits: DataFrame,
    valid_visits: DataFrame,
    grid: Dict[str, List[Any]],
    exclude_from_training_data: List[str],
    ordinal_mappings: Dict[str, List[Any]],
    prediction_times: List[Tuple[int, int]],
    test_visits: Optional[DataFrame] = None,
    model_name: str = "admissions",
    visit_col: str = "visit_number",
    calibrate_probabilities: bool = True,
    calibration_method: str = "isotonic",
    use_balanced_training: bool = True,
    majority_to_minority_ratio: float = 1.0,
    label_col: str = "is_admitted",
    evaluate_on_test: bool = False,
) -> Dict[str, TrainedClassifier]:
    """Train admission prediction models for multiple prediction times.

    Parameters
    ----------
    train_visits : DataFrame
        Training visits dataset
    valid_visits : DataFrame
        Validation visits dataset
    grid : Dict[str, List[Any]]
        Parameter grid for hyperparameter tuning
    exclude_from_training_data : List[str]
        Columns to exclude from training
    ordinal_mappings : Dict[str, List[Any]]
        Mappings for ordinal categorical features
    prediction_times : List[Tuple[int, int]]
        List of prediction time points
    test_visits : DataFrame, optional
        Test visits dataset, by default None
    model_name : str, optional
        Name prefix for models, by default "admissions"
    visit_col : str, optional
        Name of the visit column, by default "visit_number"
    calibrate_probabilities : bool, optional
        Whether to calibrate probabilities, by default True
    calibration_method : str, optional
        Calibration method, by default "isotonic"
    use_balanced_training : bool, optional
        Whether to use balanced training, by default True
    majority_to_minority_ratio : float, optional
        Ratio for class balancing, by default 1.0
    label_col : str, optional
        Name of the label column, by default "is_admitted"
    evaluate_on_test : bool, optional
        Whether to evaluate on test set, by default False

    Returns
    -------
    Dict[str, TrainedClassifier]
        Dictionary mapping model keys to trained classifiers
    """
    if evaluate_on_test and test_visits is None:
        raise ValueError("test_visits must be provided when evaluate_on_test=True")

    trained_models: Dict[str, TrainedClassifier] = {}

    for prediction_time in prediction_times:
        print(f"\nProcessing: {prediction_time}")
        model_key = get_model_key(model_name, prediction_time)

        # Train model with the new simplified interface
        best_model = train_classifier(
            train_visits,
            valid_visits,
            prediction_time,
            exclude_from_training_data,
            grid,
            ordinal_mappings,
            test_visits,
            visit_col,
            use_balanced_training=use_balanced_training,
            majority_to_minority_ratio=majority_to_minority_ratio,
            calibrate_probabilities=calibrate_probabilities,
            calibration_method=calibration_method,
            label_col=label_col,
            evaluate_on_test=evaluate_on_test,
        )

        trained_models[model_key] = best_model

    return trained_models
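
A sketch, assuming the same prepared train_visits and valid_visits DataFrames as in the train_classifier example above:

```python
from patientflow.train.classifiers import train_multiple_classifiers  # import path assumed

models = train_multiple_classifiers(
    train_visits=train_visits,
    valid_visits=valid_visits,
    grid={"n_estimators": [30]},
    exclude_from_training_data=["visit_number", "snapshot_date", "prediction_time"],
    ordinal_mappings={},
    prediction_times=[(6, 0), (9, 30), (12, 0)],
)

# Keys are produced by get_model_key(model_name, prediction_time)
print(list(models.keys()))
```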

emergency_demand

Emergency demand prediction training module.

This module provides functionality that is specific to the implementation of the patientflow package at University College London Hospitals (UCLH). It trains models to predict emergency bed demand.

The module trains three model types:

1. Admission prediction models (multiple classifiers, one for each prediction time)
2. Specialty prediction models (sequence-based)
3. Yet-to-arrive prediction models (aspirational)

Functions:

| Name | Description |
|------|-------------|
| test_real_time_predictions : Test real-time prediction functionality | Selects random test cases and validates that the trained models can generate predictions as if they were making a real-time prediction. |
| train_all_models : Complete training pipeline | Trains all three model types (admissions, specialty, yet-to-arrive) with proper validation and optional model saving. |
| main : Entry point for training pipeline | Loads configuration, data, and runs the complete training process. |

main(data_folder_name=None)

Main entry point for training patient flow models.

This function orchestrates the complete training pipeline for emergency demand prediction models. It loads configuration, data, and trains all three model types: admission prediction models, specialty prediction models, and yet-to-arrive prediction models.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| data_folder_name | str | Name of the data folder containing the training datasets. If None, will be extracted from command line arguments. | None |

Returns:

| Type | Description |
|------|-------------|
| None | The function trains and optionally saves models but does not return any values. |

Notes

The function performs the following steps:

1. Loads configuration from config.yaml
2. Loads ED visits and inpatient arrivals data
3. Sets up model parameters and hyperparameters
4. Trains admission prediction classifiers
5. Trains specialty prediction sequence model
6. Trains yet-to-arrive prediction model
7. Optionally saves trained models
8. Optionally tests real-time prediction functionality

Source code in src/patientflow/train/emergency_demand.py
def main(data_folder_name=None):
    """
    Main entry point for training patient flow models.

    This function orchestrates the complete training pipeline for emergency demand
    prediction models. It loads configuration, data, and trains all three model
    types: admission prediction models, specialty prediction models, and
    yet-to-arrive prediction models.

    Parameters
    ----------
    data_folder_name : str, optional
        Name of the data folder containing the training datasets.
        If None, will be extracted from command line arguments.

    Returns
    -------
    None
        The function trains and optionally saves models but does not return
        any values.

    Notes
    -----
    The function performs the following steps:
    1. Loads configuration from config.yaml
    2. Loads ED visits and inpatient arrivals data
    3. Sets up model parameters and hyperparameters
    4. Trains admission prediction classifiers
    5. Trains specialty prediction sequence model
    6. Trains yet-to-arrive prediction model
    7. Optionally saves trained models
    8. Optionally tests real-time prediction functionality

    """
    # Parse arguments if not provided
    if data_folder_name is None:
        args = parse_args()
        data_folder_name = (
            data_folder_name if data_folder_name is not None else args.data_folder_name
        )
    print(f"Loading data from folder: {data_folder_name}")

    project_root = set_project_root()

    # Set file locations
    data_file_path, _, model_file_path, config_path = set_file_paths(
        project_root=project_root,
        inference_time=False,
        train_dttm=None,
        data_folder_name=data_folder_name,
        config_file="config.yaml",
    )

    # Load parameters
    config = load_config_file(config_path)

    # Extract parameters
    prediction_times = config["prediction_times"]
    start_training_set = config["start_training_set"]
    start_validation_set = config["start_validation_set"]
    start_test_set = config["start_test_set"]
    end_test_set = config["end_test_set"]
    prediction_window = timedelta(minutes=config["prediction_window"])
    epsilon = float(config["epsilon"])
    yta_time_interval = timedelta(minutes=config["yta_time_interval"])
    x1, y1, x2, y2 = config["x1"], config["y1"], config["x2"], config["y2"]

    # Load data
    ed_visits = load_data(
        data_file_path=data_file_path,
        file_name="ed_visits.csv",
        index_column="snapshot_id",
        sort_columns=["visit_number", "snapshot_date", "prediction_time"],
        eval_columns=["prediction_time", "consultation_sequence", "final_sequence"],
    )
    inpatient_arrivals = load_data(
        data_file_path=data_file_path, file_name="inpatient_arrivals.csv"
    )

    # Create snapshot date
    ed_visits["snapshot_date"] = pd.to_datetime(
        ed_visits["snapshot_date"], dayfirst=True
    ).dt.date

    # Set up model parameters
    grid_params = {"n_estimators": [30], "subsample": [0.7], "colsample_bytree": [0.7]}

    exclude_columns = [
        "visit_number",
        "snapshot_date",
        "prediction_time",
        "specialty",
        "consultation_sequence",
        "final_sequence",
    ]

    ordinal_mappings = {
        "age_group": [
            "0-17",
            "18-24",
            "25-34",
            "35-44",
            "45-54",
            "55-64",
            "65-74",
            "75-115",
        ],
        "latest_acvpu": ["A", "C", "V", "P", "U"],
        "latest_obs_manchester_triage_acuity": [
            "Blue",
            "Green",
            "Yellow",
            "Orange",
            "Red",
        ],
        "latest_obs_objective_pain_score": [
            "Nil",
            "Mild",
            "Moderate",
            "Severe\\E\\Very Severe",
        ],
        "latest_obs_level_of_consciousness": ["A", "C", "V", "P", "U"],
    }

    specialties = ["surgical", "haem/onc", "medical", "paediatric"]
    cdf_cut_points = [0.9, 0.7]
    curve_params = (x1, y1, x2, y2)
    random_seed = 42

    # Call train_all_models with prepared parameters
    train_all_models(
        visits=ed_visits,
        start_training_set=start_training_set,
        start_validation_set=start_validation_set,
        start_test_set=start_test_set,
        end_test_set=end_test_set,
        yta=inpatient_arrivals,
        model_file_path=model_file_path,
        prediction_times=prediction_times,
        prediction_window=prediction_window,
        yta_time_interval=yta_time_interval,
        epsilon=epsilon,
        curve_params=curve_params,
        grid_params=grid_params,
        exclude_columns=exclude_columns,
        ordinal_mappings=ordinal_mappings,
        specialties=specialties,
        cdf_cut_points=cdf_cut_points,
        random_seed=random_seed,
    )

    return
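
A programmatic invocation sketch; 'data-synthetic' is a hypothetical folder name, which would need to contain ed_visits.csv and inpatient_arrivals.csv and be resolvable by set_file_paths alongside config.yaml:

```python
from patientflow.train.emergency_demand import main  # import path assumed

# Equivalent to running the module from the command line with this folder name
main(data_folder_name="data-synthetic")
```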

test_real_time_predictions(visits, models, prediction_window, specialties, cdf_cut_points, curve_params, random_seed)

Test real-time predictions by selecting a random sample from the visits dataset and generating predictions using the trained models.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| visits | DataFrame | DataFrame containing visit data with columns including 'prediction_time', 'snapshot_date', and other required features for predictions. | required |
| models | Tuple[Dict[str, TrainedClassifier], SequenceToOutcomePredictor, ParametricIncomingAdmissionPredictor] | Tuple containing: trained_classifiers (admission prediction models), spec_model (SequenceToOutcomePredictor for specialty predictions), and yet_to_arrive_model (ParametricIncomingAdmissionPredictor for yet-to-arrive predictions). | required |
| prediction_window | int | Size of the prediction window in minutes for which to generate forecasts. | required |
| specialties | list[str] | List of specialty names to generate predictions for (e.g., ['surgical', 'medical', 'paediatric']). | required |
| cdf_cut_points | list[float] | List of probability thresholds for cumulative distribution function cut points (e.g., [0.9, 0.7]). | required |
| curve_params | tuple[float, float, float, float] | Parameters (x1, y1, x2, y2) defining the curve used for predictions. | required |
| random_seed | int | Random seed for reproducible sampling of test cases. | required |

Returns:

| Type | Description |
|------|-------------|
| dict | Dictionary containing 'prediction_time' (str, the time point for which predictions were made), 'prediction_date' (str, the date for which predictions were made), and 'realtime_preds' (dict, the generated predictions for the sample). |

Raises:

| Type | Description |
|------|-------------|
| Exception | If real-time inference fails, with detailed error message printed before system exit. |

Notes

The function selects a single random row from the visits DataFrame and generates predictions for that specific time point using all provided models. The predictions are made using the create_predictions() function with the specified parameters.

Source code in src/patientflow/train/emergency_demand.py
def test_real_time_predictions(
    visits,
    models: Tuple[
        Dict[str, TrainedClassifier],
        SequenceToOutcomePredictor,
        ParametricIncomingAdmissionPredictor,
    ],
    prediction_window,
    specialties,
    cdf_cut_points,
    curve_params,
    random_seed,
):
    """
    Test real-time predictions by selecting a random sample from the visits dataset
    and generating predictions using the trained models.

    Parameters
    ----------
    visits : pd.DataFrame
        DataFrame containing visit data with columns including 'prediction_time',
        'snapshot_date', and other required features for predictions.
    models : Tuple[Dict[str, TrainedClassifier], SequenceToOutcomePredictor, ParametricIncomingAdmissionPredictor]
        Tuple containing:
        - trained_classifiers: TrainedClassifier containing admission predictions
        - spec_model: SequenceToOutcomePredictor for specialty predictions
        - yet_to_arrive_model: ParametricIncomingAdmissionPredictor for yet-to-arrive predictions
    prediction_window : int
        Size of the prediction window in minutes for which to generate forecasts.
    specialties : list[str]
        List of specialty names to generate predictions for (e.g., ['surgical',
        'medical', 'paediatric']).
    cdf_cut_points : list[float]
        List of probability thresholds for cumulative distribution function
        cut points (e.g., [0.9, 0.7]).
    curve_params : tuple[float, float, float, float]
        Parameters (x1, y1, x2, y2) defining the curve used for predictions.
    random_seed : int
        Random seed for reproducible sampling of test cases.

    Returns
    -------
    dict
        Dictionary containing:
        - 'prediction_time': str, The time point for which predictions were made
        - 'prediction_date': str, The date for which predictions were made
        - 'realtime_preds': dict, The generated predictions for the sample

    Raises
    ------
    Exception
        If real-time inference fails, with detailed error message printed before
        system exit.

    Notes
    -----
    The function selects a single random row from the visits DataFrame and
    generates predictions for that specific time point using all provided models.
    The predictions are made using the create_predictions() function with the
    specified parameters.
    """
    # Select random test set row
    random_row = visits.sample(n=1, random_state=random_seed)
    prediction_time = random_row.prediction_time.values[0]
    prediction_date = random_row.snapshot_date.values[0]

    # Get prediction snapshots
    prediction_snapshots = visits[
        (visits.prediction_time == prediction_time)
        & (visits.snapshot_date == prediction_date)
    ]

    trained_classifiers, spec_model, yet_to_arrive_model = models

    # Find the model matching the required prediction time
    classifier = None
    for model_key, trained_model in trained_classifiers.items():
        if trained_model.training_results.prediction_time == prediction_time:
            classifier = trained_model
            break

    if classifier is None:
        raise ValueError(f"No model found for prediction time {prediction_time}")

    try:
        x1, y1, x2, y2 = curve_params
        _ = create_predictions(
            models=(classifier, spec_model, yet_to_arrive_model),
            prediction_time=prediction_time,
            prediction_snapshots=prediction_snapshots,
            specialties=specialties,
            prediction_window=prediction_window,
            cdf_cut_points=cdf_cut_points,
            x1=x1,
            y1=y1,
            x2=x2,
            y2=y2,
        )
        print("Real-time inference ran correctly")
    except Exception as e:
        print(f"Real-time inference failed due to this error: {str(e)}")
        sys.exit(1)

    return

train_all_models(visits, start_training_set, start_validation_set, start_test_set, end_test_set, yta, prediction_times, prediction_window, yta_time_interval, epsilon, grid_params, exclude_columns, ordinal_mappings, random_seed, visit_col='visit_number', specialties=None, cdf_cut_points=None, curve_params=None, model_file_path=None, save_models=True, test_realtime=True)

Train and evaluate patient flow models.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| visits | DataFrame | DataFrame containing visit data. | required |
| yta | DataFrame | DataFrame containing yet-to-arrive data. | required |
| prediction_times | list | List of times for making predictions. | required |
| prediction_window | timedelta | Prediction window size. | required |
| yta_time_interval | timedelta | Interval size for yet-to-arrive predictions. | required |
| epsilon | float | Epsilon parameter for model training. | required |
| grid_params | dict | Hyperparameter grid for model training. | required |
| exclude_columns | list | Columns to exclude during training. | required |
| ordinal_mappings | dict | Ordinal variable mappings for categorical features. | required |
| random_seed | int | Random seed for reproducibility. | required |
| visit_col | str | Name of the column in the dataset that identifies a hospital visit (e.g., visit_number, csn). | 'visit_number' |
| specialties | list | List of specialties to consider. Required if test_realtime is True. | None |
| cdf_cut_points | list | CDF cut points for predictions. Required if test_realtime is True. | None |
| curve_params | tuple | Curve parameters (x1, y1, x2, y2). Required if test_realtime is True. | None |
| model_file_path | Path | Path to save trained models. Required if save_models is True. | None |
| save_models | bool | Whether to save the trained models to disk. | True |
| test_realtime | bool | Whether to run real-time prediction tests. | True |

Returns:

| Type | Description |
|------|-------------|
| None | |

Raises:

| Type | Description |
|------|-------------|
| ValueError | If save_models is True but model_file_path is not provided, or if test_realtime is True but any of specialties, cdf_cut_points, or curve_params are not provided. |

Notes

The function generates model names internally:

- "admissions": "admissions"
- "specialty": "ed_specialty"
- "yet_to_arrive": f"yet_to_arrive_{int(prediction_window.total_seconds()/3600)}_hours"

Source code in src/patientflow/train/emergency_demand.py
def train_all_models(
    visits,
    start_training_set,
    start_validation_set,
    start_test_set,
    end_test_set,
    yta,
    prediction_times,
    prediction_window: timedelta,
    yta_time_interval: timedelta,
    epsilon,
    grid_params,
    exclude_columns,
    ordinal_mappings,
    random_seed,
    visit_col="visit_number",
    specialties=None,
    cdf_cut_points=None,
    curve_params=None,
    model_file_path=None,
    save_models=True,
    test_realtime=True,
):
    """
    Train and evaluate patient flow models.

    Parameters
    ----------
    visits : pd.DataFrame
        DataFrame containing visit data.
    yta : pd.DataFrame
        DataFrame containing yet-to-arrive data.
    prediction_times : list
        List of times for making predictions.
    prediction_window : timedelta
        Prediction window size as a timedelta.
    yta_time_interval : timedelta
        Interval size for yet-to-arrive predictions as a timedelta.
    epsilon : float
        Epsilon parameter for model training.
    grid_params : dict
        Hyperparameter grid for model training.
    exclude_columns : list
        Columns to exclude during training.
    ordinal_mappings : dict
        Ordinal variable mappings for categorical features.
    random_seed : int
        Random seed for reproducibility.
    visit_col : str, optional
        Name of the column that identifies a hospital visit (e.g. visit_number, csn).
    specialties : list, optional
        List of specialties to consider. Required if test_realtime is True.
    cdf_cut_points : list, optional
        CDF cut points for predictions. Required if test_realtime is True.
    curve_params : tuple, optional
        Curve parameters (x1, y1, x2, y2). Required if test_realtime is True.
    model_file_path : Path, optional
        Path to save trained models. Required if save_models is True.
    save_models : bool, optional
        Whether to save the trained models to disk. Defaults to True.
    test_realtime : bool, optional
        Whether to run real-time prediction tests. Defaults to True.

    Returns
    -------
    None

    Raises
    ------
    ValueError
        If save_models is True but model_file_path is not provided,
        or if test_realtime is True but any of specialties, cdf_cut_points, or curve_params are not provided.

    Notes
    -----
    The function generates model names internally:
    - "admissions": "admissions"
    - "specialty": "ed_specialty"
    - "yet_to_arrive": f"yet_to_arrive_{int(prediction_window.total_seconds()/3600)}_hours"
    """
    # Validate parameters
    if save_models and model_file_path is None:
        raise ValueError("model_file_path must be provided when save_models is True")

    if test_realtime:
        if specialties is None:
            raise ValueError("specialties must be provided when test_realtime is True")
        if cdf_cut_points is None:
            raise ValueError(
                "cdf_cut_points must be provided when test_realtime is True"
            )
        if curve_params is None:
            raise ValueError("curve_params must be provided when test_realtime is True")

    # Set random seed
    np.random.seed(random_seed)

    # Define model names internally
    model_names = {
        "admissions": "admissions",
        "specialty": "ed_specialty",
        "yet_to_arrive": f"yet_to_arrive_{int(prediction_window.total_seconds()/3600)}_hours",
    }

    if "arrival_datetime" in visits.columns:
        col_name = "arrival_datetime"
    else:
        col_name = "snapshot_date"

    train_visits, valid_visits, test_visits = create_temporal_splits(
        visits,
        start_training_set,
        start_validation_set,
        start_test_set,
        end_test_set,
        col_name=col_name,
    )

    train_yta, _, _ = create_temporal_splits(
        yta[(~yta.specialty.isnull())],
        start_training_set,
        start_validation_set,
        start_test_set,
        end_test_set,
        col_name="arrival_datetime",
    )

    # Use prediction_times from visits if not explicitly provided
    if prediction_times is None:
        prediction_times = list(visits.prediction_time.unique())

    # Train admission models
    admission_models = train_multiple_classifiers(
        train_visits=train_visits,
        valid_visits=valid_visits,
        test_visits=test_visits,
        grid=grid_params,
        exclude_from_training_data=exclude_columns,
        ordinal_mappings=ordinal_mappings,
        prediction_times=prediction_times,
        model_name=model_names["admissions"],
        visit_col=visit_col,
    )

    # Save admission models if requested
    if save_models:
        save_model(admission_models, model_names["admissions"], model_file_path)

    # Train specialty model
    specialty_model = train_sequence_predictor(
        train_visits=train_visits,
        model_name=model_names["specialty"],
        input_var="consultation_sequence",
        grouping_var="final_sequence",
        outcome_var="specialty",
        visit_col=visit_col,
    )

    # Save specialty model if requested
    if save_models:
        save_model(specialty_model, model_names["specialty"], model_file_path)

    # Train yet-to-arrive model
    yta_model_name = model_names["yet_to_arrive"]

    num_days = (start_validation_set - start_training_set).days

    yta_model = train_parametric_admission_predictor(
        train_visits=train_visits,
        train_yta=train_yta,
        prediction_window=prediction_window,
        yta_time_interval=yta_time_interval,
        prediction_times=prediction_times,
        epsilon=epsilon,
        num_days=num_days,
    )

    # Save yet-to-arrive model if requested
    if save_models:
        save_model(yta_model, yta_model_name, model_file_path)
        print(f"Models have been saved to {model_file_path}")

    # Test real-time predictions if requested
    if test_realtime:
        visits["elapsed_los"] = visits["elapsed_los"].apply(
            lambda x: timedelta(seconds=x)
        )
        test_real_time_predictions(
            visits=visits,
            models=(admission_models, specialty_model, yta_model),
            prediction_window=prediction_window,
            specialties=specialties,
            cdf_cut_points=cdf_cut_points,
            curve_params=curve_params,
            random_seed=random_seed,
        )

    return

incoming_admission_predictor

Training utility for parametric admission prediction models.

This module provides functions for training parametric admission prediction models, specifically for predicting yet-to-arrive (YTA) patient volumes using parametric curves. It includes utilities for creating specialty filters and training parametric admission predictors.

The logic in this module is specific to the implementation at UCLH.

create_yta_filters(df)

Create specialty filters for categorizing patients by specialty and age group.

This function generates a dictionary of filters based on specialty categories, with special handling for pediatric patients. It uses the SpecialCategoryParams class to determine which specialties correspond to pediatric care.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing patient data with columns that include either 'age_on_arrival' or 'age_group' for pediatric classification.

required

Returns:

Type Description
dict

A dictionary mapping specialty names to filter configurations. Each configuration contains:

- For the pediatric specialty: {"is_child": True}
- For other specialties: {"specialty": specialty_name, "is_child": False}
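Examples:

A sketch of the expected output shape, assuming df has an age_on_arrival column and the special (pediatric) category is labelled 'paediatric'; the other specialty names are illustrative:

>>> specialty_filters = create_yta_filters(df)
>>> specialty_filters["paediatric"]
{'is_child': True}
>>> specialty_filters["medical"]
{'specialty': 'medical', 'is_child': False}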

Source code in src/patientflow/train/incoming_admission_predictor.py
def create_yta_filters(df):
    """
    Create specialty filters for categorizing patients by specialty and age group.

    This function generates a dictionary of filters based on specialty categories,
    with special handling for pediatric patients. It uses the SpecialCategoryParams
    class to determine which specialties correspond to pediatric care.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing patient data with columns that include either
        'age_on_arrival' or 'age_group' for pediatric classification.

    Returns
    -------
    dict
        A dictionary mapping specialty names to filter configurations.
        Each configuration contains:
        - For pediatric specialty: {"is_child": True}
        - For other specialties: {"specialty": specialty_name, "is_child": False}

    """
    # Get the special category parameters using the picklable implementation
    special_params = create_special_category_objects(df.columns)

    # Extract necessary data from the special_params
    special_category_dict = special_params["special_category_dict"]

    # Create the specialty_filters dictionary
    specialty_filters = {}

    for specialty, is_paediatric_flag in special_category_dict.items():
        if is_paediatric_flag == 1.0:
            # For the paediatric specialty, set `is_child` to True
            specialty_filters[specialty] = {"is_child": True}
        else:
            # For other specialties, set `is_child` to False
            specialty_filters[specialty] = {"specialty": specialty, "is_child": False}

    return specialty_filters

train_parametric_admission_predictor(train_visits, train_yta, prediction_window, yta_time_interval, prediction_times, num_days, epsilon=1e-06)

Train a parametric yet-to-arrive prediction model.

Parameters:

Name Type Description Default
train_visits DataFrame

Visits dataset (used for identifying special categories).

required
train_yta DataFrame

Training data for yet-to-arrive predictions.

required
prediction_window timedelta

Time window for predictions as a timedelta.

required
yta_time_interval timedelta

Time interval for predictions as a timedelta.

required
prediction_times List[float]

List of prediction times.

required
num_days int

Number of days to consider.

required
epsilon float

Epsilon parameter for the model, by default 1e-6.

1e-06

Returns:

Type Description
ParametricIncomingAdmissionPredictor

Trained ParametricIncomingAdmissionPredictor model.

Raises:

Type Description
TypeError

If prediction_window or yta_time_interval are not timedelta objects.
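Examples:

A minimal sketch with illustrative argument values, assuming train_visits and train_yta are prepared DataFrames and train_yta has (or contains a column for) an arrival_datetime index:

>>> from datetime import timedelta
>>> yta_model = train_parametric_admission_predictor(
...     train_visits=train_visits,
...     train_yta=train_yta,
...     prediction_window=timedelta(hours=8),
...     yta_time_interval=timedelta(minutes=15),
...     prediction_times=[(6, 0), (12, 0), (22, 0)],
...     num_days=180,
... )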

Source code in src/patientflow/train/incoming_admission_predictor.py
def train_parametric_admission_predictor(
    train_visits: DataFrame,
    train_yta: DataFrame,
    prediction_window: timedelta,
    yta_time_interval: timedelta,
    prediction_times: List[float],
    num_days: int,
    epsilon: float = 1e-6,
) -> ParametricIncomingAdmissionPredictor:
    """
    Train a parametric yet-to-arrive prediction model.

    Parameters
    ----------
    train_visits : DataFrame
        Visits dataset (used for identifying special categories).
    train_yta : DataFrame
        Training data for yet-to-arrive predictions.
    prediction_window : timedelta
        Time window for predictions as a timedelta.
    yta_time_interval : timedelta
        Time interval for predictions as a timedelta.
    prediction_times : List[float]
        List of prediction times.
    num_days : int
        Number of days to consider.
    epsilon : float, optional
        Epsilon parameter for the model, by default 1e-6.

    Returns
    -------
    ParametricIncomingAdmissionPredictor
        Trained ParametricIncomingAdmissionPredictor model.

    Raises
    ------
    TypeError
        If prediction_window or yta_time_interval are not timedelta objects.
    """

    if not isinstance(prediction_window, timedelta):
        raise TypeError("prediction_window must be a timedelta object")
    if not isinstance(yta_time_interval, timedelta):
        raise TypeError("yta_time_interval must be a timedelta object")

    if train_yta.index.name is None:
        if "arrival_datetime" in train_yta.columns:
            # Convert to datetime using the actual values, not pandas objects
            train_yta = train_yta.copy()
            train_yta["arrival_datetime"] = pd.to_datetime(
                train_yta["arrival_datetime"].values, utc=True
            )
            train_yta.set_index("arrival_datetime", inplace=True)

    elif train_yta.index.name != "arrival_datetime":
        print("Dataset needs arrival_datetime column")

    specialty_filters = create_yta_filters(train_visits)

    yta_model = ParametricIncomingAdmissionPredictor(filters=specialty_filters)
    yta_model.fit(
        train_df=train_yta,
        prediction_window=prediction_window,
        yta_time_interval=yta_time_interval,
        prediction_times=prediction_times,
        epsilon=epsilon,
        num_days=num_days,
    )

    return yta_model

sequence_predictor

Training utility for sequence prediction models.

This module provides functions for training sequence-based prediction models, specifically for predicting patient outcomes based on visit sequences. It includes utilities for filtering patient data and training specialized sequence predictors.

The logic in this module is specific to the implementation at UCLH.

get_default_visits(admitted)

Filter a dataframe of patient visits to include only non-pediatric patients.

This function identifies and removes pediatric patients from the dataset based on both age criteria and specialty assignment. It automatically detects the appropriate age column format from the provided dataframe.

Parameters:

Name Type Description Default
admitted DataFrame

A pandas DataFrame containing patient visit information. Must include either 'age_on_arrival' or 'age_group' columns, and a 'specialty' column.

required

Returns:

Type Description
DataFrame

A filtered DataFrame containing only non-pediatric patients (adults).

Notes

The function automatically detects which age-related columns are present in the dataframe and configures the appropriate filtering logic. It removes patients who are either:

1. Identified as pediatric based on age criteria, or
2. Assigned to a pediatric specialty
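Examples:

A sketch of typical use, assuming admitted contains 'age_on_arrival' (or 'age_group') and 'specialty' columns and the pediatric specialty is labelled 'paediatric':

>>> adults = get_default_visits(admitted)
>>> (adults["specialty"] == "paediatric").any()
False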

Source code in src/patientflow/train/sequence_predictor.py
def get_default_visits(admitted: DataFrame) -> DataFrame:
    """
    Filter a dataframe of patient visits to include only non-pediatric patients.

    This function identifies and removes pediatric patients from the dataset based on
    both age criteria and specialty assignment. It automatically detects the appropriate
    age column format from the provided dataframe.

    Parameters
    ----------
    admitted : DataFrame
        A pandas DataFrame containing patient visit information. Must include either
        'age_on_arrival' or 'age_group' columns, and a 'specialty' column.

    Returns
    -------
    DataFrame
        A filtered DataFrame containing only non-pediatric patients (adults).

    Notes
    -----
    The function automatically detects which age-related columns are present in the
    dataframe and configures the appropriate filtering logic. It removes patients who
    are either:
    1. Identified as pediatric based on age criteria, or
    2. Assigned to a pediatric specialty

    """
    # Get configuration for categorizing patients based on age columns
    special_params = create_special_category_objects(admitted.columns)

    # Extract function that identifies non-pediatric patients
    opposite_special_category_func = special_params["special_func_map"]["default"]

    # Determine which category is the special category (should be "paediatric")
    special_category_key = next(
        key
        for key, value in special_params["special_category_dict"].items()
        if value == 1.0
    )

    # Filter out pediatric patients based on both age criteria and specialty
    filtered_admitted = admitted[
        admitted.apply(opposite_special_category_func, axis=1)
        & (admitted["specialty"] != special_category_key)
    ]

    return filtered_admitted

train_sequence_predictor(train_visits, model_name, visit_col, input_var, grouping_var, outcome_var)

Train a specialty prediction model.

Parameters:

Name Type Description Default
train_visits DataFrame

Training data containing visit information.

required
model_name str

Name identifier for the model.

required
visit_col str

Column name containing visit identifiers.

required
input_var str

Column name for input sequence.

required
grouping_var str

Column name for grouping sequence.

required
outcome_var str

Column name for target variable.

required

Returns:

Type Description
SequenceToOutcomePredictor

Trained SequenceToOutcomePredictor model.
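Examples:

A minimal sketch, assuming train_visits is a prepared DataFrame containing the listed columns; the column names mirror those used internally by train_all_models:

>>> spec_model = train_sequence_predictor(
...     train_visits=train_visits,
...     model_name="ed_specialty",
...     visit_col="visit_number",
...     input_var="consultation_sequence",
...     grouping_var="final_sequence",
...     outcome_var="specialty",
... )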

Source code in src/patientflow/train/sequence_predictor.py
def train_sequence_predictor(
    train_visits: DataFrame,
    model_name: str,
    visit_col: str,
    input_var: str,
    grouping_var: str,
    outcome_var: str,
) -> SequenceToOutcomePredictor:
    """
    Train a specialty prediction model.

    Parameters
    ----------
    train_visits : DataFrame
        Training data containing visit information.
    model_name : str
        Name identifier for the model.
    visit_col : str
        Column name containing visit identifiers.
    input_var : str
        Column name for input sequence.
    grouping_var : str
        Column name for grouping sequence.
    outcome_var : str
        Column name for target variable.

    Returns
    -------
    SequenceToOutcomePredictor
        Trained SequenceToOutcomePredictor model.
    """
    visits_single = select_one_snapshot_per_visit(train_visits, visit_col)
    admitted = visits_single[
        (visits_single.is_admitted) & ~(visits_single.specialty.isnull())
    ]
    filtered_admitted = get_default_visits(admitted)

    filtered_admitted.loc[:, input_var] = filtered_admitted[input_var].apply(
        lambda x: tuple(x) if x else ()
    )
    filtered_admitted.loc[:, grouping_var] = filtered_admitted[grouping_var].apply(
        lambda x: tuple(x) if x else ()
    )

    spec_model = SequenceToOutcomePredictor(
        input_var=input_var,
        grouping_var=grouping_var,
        outcome_var=outcome_var,
    )
    spec_model.fit(filtered_admitted)

    return spec_model

utils

Utility functions for the training workflow, currently covering saving trained models to disk.

save_model(model, model_name, model_file_path)

Save trained model(s) to disk.

Parameters:

Name Type Description Default
model object or dict

A single model instance or a dictionary of models to save.

required
model_name str

Base name to use for saving the model(s).

required
model_file_path Path

Directory path where the model(s) will be saved.

required

Returns:

Type Description
None
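Examples:

A sketch assuming model is a fitted model (or a dictionary of fitted models) and the target directory already exists:

>>> from pathlib import Path
>>> save_model(model, "admissions", Path("trained-models"))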
Source code in src/patientflow/train/utils.py
def save_model(model, model_name, model_file_path):
    """
    Save trained model(s) to disk.

    Parameters
    ----------
    model : object or dict
        A single model instance or a dictionary of models to save.
    model_name : str
        Base name to use for saving the model(s).
    model_file_path : Path
        Directory path where the model(s) will be saved.

    Returns
    -------
    None
    """

    if isinstance(model, dict):
        # Handle dictionary of models (e.g., admission models)
        for name, m in model.items():
            full_path = model_file_path / name
            full_path = full_path.with_suffix(".joblib")
            dump(m, full_path)
    else:
        # Handle single model (e.g., specialty or yet-to-arrive model)
        full_path = model_file_path / model_name
        full_path = full_path.with_suffix(".joblib")
        dump(model, full_path)

viz

Visualization module for patient flow analysis.

This module provides various plotting and visualization functions for analyzing patient flow data, model results, and evaluation metrics.

arrival_rates

Visualization functions for inpatient arrival rates and cumulative statistics.

This module provides functions to visualize time-varying arrival rates and cumulative arrivals over the course of a day.

Functions:

Name Description
annotate_hour_line : function

Annotate hour lines on a matplotlib plot

plot_arrival_rates : function

Plot arrival rates for one or two datasets

plot_cumulative_arrival_rates : function

Plot cumulative arrival rates with statistical distributions

annotate_hour_line(hour_line, y_value, hour_values, start_plot_index, line_styles, x_margin, annotation_prefix, text_y_offset=1, text_x_position=None, slope=None, x1=None, y1=None)

Annotate hour lines on a matplotlib plot with consistent formatting.

Parameters:

Name Type Description Default
hour_line int

The hour to annotate on the plot.

required
y_value float

The y-coordinate for annotation positioning.

required
hour_values list of int

Hour values corresponding to the x-axis positions.

required
start_plot_index int

Starting index for the plot's data.

required
line_styles dict

Line styles for annotations keyed by hour.

required
x_margin float

Margin added to x-axis for annotation positioning.

required
annotation_prefix str

Prefix for the annotation text (e.g., "On average").

required
text_y_offset float

Vertical offset for the annotation text from the line, by default 1.

1
text_x_position float

Horizontal position for annotation text, by default None.

None
slope float

Slope of a line for extended annotations, by default None.

None
x1 float

Reference x-coordinate for slope-based annotation, by default None.

None
y1 float

Reference y-coordinate for slope-based annotation, by default None.

None
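Examples:

A sketch annotating the 12:00 hour line on the current axes, assuming a matplotlib figure is already open and the x-axis positions run from 0 to 23 (values illustrative):

>>> annotate_hour_line(
...     hour_line=12,
...     y_value=25.0,
...     hour_values=list(range(24)),
...     start_plot_index=0,
...     line_styles={12: "--"},
...     x_margin=0.5,
...     annotation_prefix="On average",
... )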
Source code in src/patientflow/viz/arrival_rates.py
def annotate_hour_line(
    hour_line,
    y_value,
    hour_values,
    start_plot_index,
    line_styles,
    x_margin,
    annotation_prefix,
    text_y_offset=1,
    text_x_position=None,
    slope=None,
    x1=None,
    y1=None,
):
    """Annotate hour lines on a matplotlib plot with consistent formatting.

    Parameters
    ----------
    hour_line : int
        The hour to annotate on the plot.
    y_value : float
        The y-coordinate for annotation positioning.
    hour_values : list of int
        Hour values corresponding to the x-axis positions.
    start_plot_index : int
        Starting index for the plot's data.
    line_styles : dict
        Line styles for annotations keyed by hour.
    x_margin : float
        Margin added to x-axis for annotation positioning.
    annotation_prefix : str
        Prefix for the annotation text (e.g., "On average").
    text_y_offset : float, optional
        Vertical offset for the annotation text from the line, by default 1.
    text_x_position : float, optional
        Horizontal position for annotation text, by default None.
    slope : float, optional
        Slope of a line for extended annotations, by default None.
    x1 : float, optional
        Reference x-coordinate for slope-based annotation, by default None.
    y1 : float, optional
        Reference y-coordinate for slope-based annotation, by default None.
    """
    a = hour_values[hour_line - start_plot_index]
    if slope is not None and x1 is not None:
        y_a = slope * (a - x1) + y1
        plt.plot([a, a], [0, y_a], color="grey", linestyle=line_styles[hour_line])
        plt.plot(
            [0 - x_margin, a],
            [y_a, y_a],
            color="grey",
            linestyle=line_styles[hour_line],
        )
        annotation_text = (
            f"{annotation_prefix}, {int(y_a)} beds needed by {hour_line}:00"
        )
        y_position = y_a + text_y_offset
    else:
        plt.annotate(
            "",
            xy=(a, y_value),
            xytext=(a, 0),
            arrowprops=dict(
                arrowstyle="-", linestyle=line_styles[hour_line], color="grey"
            ),
        )
        plt.annotate(
            "",
            xy=(a, y_value),
            xytext=(hour_values[0] - x_margin, y_value),
            arrowprops=dict(
                arrowstyle="-", linestyle=line_styles[hour_line], color="grey"
            ),
        )
        annotation_text = (
            f"{annotation_prefix}, {int(y_value)} beds needed by {hour_line}:00"
        ).lstrip(", ")  # drop the leading comma and space when annotation_prefix is empty
        y_position = y_value + text_y_offset

    # Use custom text x position if provided, otherwise use default
    x_position = (
        text_x_position if text_x_position is not None else (hour_values[1] - x_margin)
    )

    plt.annotate(
        annotation_text,
        xy=(a / 2 if slope is not None else a, y_value),
        xytext=(x_position, y_position),
        va="bottom",
        ha="left",
        fontsize=10,
    )

draw_window_visualization(ax, hour_values, window_params, annotation_prefix, start_window, end_window)

Draw the window visualization with annotations.

Parameters:

Name Type Description Default
ax Axes

The axes to draw on

required
hour_values array-like

Hour labels for x-axis

required
window_params tuple

(slope, x1, y1, x2, y2) from get_window_parameters

required
annotation_prefix str

Prefix for annotations

required
start_window int

Start hour for window

required
end_window int

End hour for window

required
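Examples:

A sketch with a hand-built window_params tuple of the form returned by get_window_parameters; all values are illustrative:

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> window_params = (10.0, 10, 30, 12, 50)  # (slope, x1, y1, x2, y2)
>>> draw_window_visualization(ax, [8, 9, 10, 11, 12], window_params, "On average", 10, 12)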
Source code in src/patientflow/viz/arrival_rates.py
def draw_window_visualization(
    ax, hour_values, window_params, annotation_prefix, start_window, end_window
):
    """Draw the window visualization with annotations.

    Parameters
    ----------
    ax : matplotlib.axes.Axes
        The axes to draw on
    hour_values : array-like
        Hour labels for x-axis
    window_params : tuple
        (slope, x1, y1, x2, y2) from get_window_parameters
    annotation_prefix : str
        Prefix for annotations
    start_window : int
        Start hour for window
    end_window : int
        End hour for window
    """
    slope, x1, y1, x2, y2 = window_params

    # Draw horizontal line
    ax.hlines(y=y2, xmin=x2, xmax=hour_values[-1], color="blue", linestyle="--")

    # Draw diagonal line
    ax.plot([x1, x2], [y1, y2], color="blue", linestyle="--")

    # Add annotation
    ax.annotate(
        f"{annotation_prefix}, {slope:.0f} beds need to be vacated\n"
        f"each hour between {start_window}:00 and {end_window}:00\n"
        f"to create capacity for all overnight arrivals\n"
        f"by {end_window}:00",
        xy=(hour_values[-1], y2 * 0.25),
        xytext=(hour_values[-1], y2 * 0.25),
        va="top",
        ha="right",
    )

get_window_parameters(data, start_window, end_window, hour_values)

Calculate window parameters for visualization.

Parameters:

Name Type Description Default
data array-like

Reindexed cumulative data

required
start_window int

Start position in reindexed space

required
end_window int

End position in reindexed space

required
hour_values array-like

Original hour values for display

required

Returns:

Type Description
tuple

(slope, x1, y1, x2, y2) where:

- slope: float, the calculated slope of the line
- x1: float, start hour value
- y1: float, start y-value
- x2: float, end hour value
- y2: float, end y-value
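Examples:

A worked example on a toy cumulative series, with the window running from position 2 to position 4; y1 is data[2] = 30, y2 is the final value 50, and the slope is (50 - 30) / (12 - 10) = 10.0:

>>> data = [10, 20, 30, 40, 50]
>>> hour_values = [8, 9, 10, 11, 12]
>>> get_window_parameters(data, 2, 4, hour_values)
(10.0, 10, 30, 12, 50)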

Source code in src/patientflow/viz/arrival_rates.py
def get_window_parameters(data, start_window, end_window, hour_values):
    """Calculate window parameters for visualization.

    Parameters
    ----------
    data : array-like
        Reindexed cumulative data
    start_window : int
        Start position in reindexed space
    end_window : int
        End position in reindexed space
    hour_values : array-like
        Original hour values for display

    Returns
    -------
    tuple
        (slope, x1, y1, x2, y2) where:
        - slope: float, The calculated slope of the line
        - x1: float, Start hour value
        - y1: float, Start y-value
        - x2: float, End hour value
        - y2: float, End y-value
    """
    y1 = data[start_window]
    y2 = data[-1]
    x1 = hour_values[start_window]  # Get display hour
    x2 = hour_values[end_window]  # Get display hour
    slope = (y2 - y1) / (x2 - x1)

    return slope, x1, y1, x2, y2

plot_arrival_rates(inpatient_arrivals, title, inpatient_arrivals_2=None, labels=None, lagged_by=None, curve_params=None, time_interval=60, start_plot_index=0, x_margin=0.5, file_prefix='', media_file_path=None, file_name=None, num_days=None, num_days_2=None, return_figure=False)

Plot arrival rates for one or two datasets with optional lagged and spread rates.

Parameters:

Name Type Description Default
inpatient_arrivals array-like

Primary dataset of inpatient arrivals.

required
title str

Title of the plot.

required
inpatient_arrivals_2 array-like

Optional second dataset for comparison, by default None.

None
labels tuple of str

Labels for the datasets when comparing two datasets, by default None.

None
lagged_by int

Time lag in hours to apply to the arrival rates, by default None.

None
curve_params tuple of float

Parameters for spread arrival rates as (x1, y1, x2, y2), by default None.

None
time_interval int

Time interval in minutes for arrival rate calculations, by default 60.

60
start_plot_index int

Starting hour index for plotting, by default 0.

0
x_margin float

Margin on the x-axis, by default 0.5.

0.5
file_prefix str

Prefix for the saved file name, by default "".

''
media_file_path str or Path

Directory path to save the plot, by default None.

None
file_name str

Custom filename to use when saving the plot. If not provided, uses file_prefix + cleaned title.

None
num_days int

Number of days in the first dataset, by default None.

None
num_days_2 int

Number of days in the second dataset, by default None.

None
return_figure bool

If True, returns the matplotlib figure instead of displaying it, by default False.

False

Returns:

Type Description
Figure or None

Returns the figure if return_figure is True, otherwise displays the plot.
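Examples:

A minimal sketch, assuming inpatient_arrivals is a prepared collection of arrival datetimes (the title and lag are illustrative):

>>> plot_arrival_rates(
...     inpatient_arrivals,
...     title="Arrival rates of admitted patients by hour of day",
...     lagged_by=4,
...     time_interval=60,
... )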

Source code in src/patientflow/viz/arrival_rates.py
def plot_arrival_rates(
    inpatient_arrivals,
    title,
    inpatient_arrivals_2=None,
    labels=None,
    lagged_by=None,
    curve_params=None,
    time_interval=60,
    start_plot_index=0,
    x_margin=0.5,
    file_prefix="",
    media_file_path=None,
    file_name=None,
    num_days=None,
    num_days_2=None,
    return_figure=False,
):
    """Plot arrival rates for one or two datasets with optional lagged and spread rates.

    Parameters
    ----------
    inpatient_arrivals : array-like
        Primary dataset of inpatient arrivals.
    title : str
        Title of the plot.
    inpatient_arrivals_2 : array-like, optional
        Optional second dataset for comparison, by default None.
    labels : tuple of str, optional
        Labels for the datasets when comparing two datasets, by default None.
    lagged_by : int, optional
        Time lag in hours to apply to the arrival rates, by default None.
    curve_params : tuple of float, optional
        Parameters for spread arrival rates as (x1, y1, x2, y2), by default None.
    time_interval : int, optional
        Time interval in minutes for arrival rate calculations, by default 60.
    start_plot_index : int, optional
        Starting hour index for plotting, by default 0.
    x_margin : float, optional
        Margin on the x-axis, by default 0.5.
    file_prefix : str, optional
        Prefix for the saved file name, by default "".
    media_file_path : str or Path, optional
        Directory path to save the plot, by default None.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, uses file_prefix + cleaned title.
    num_days : int, optional
        Number of days in the first dataset, by default None.
    num_days_2 : int, optional
        Number of days in the second dataset, by default None.
    return_figure : bool, optional
        If True, returns the matplotlib figure instead of displaying it, by default False.

    Returns
    -------
    matplotlib.figure.Figure or None
        Returns the figure if return_figure is True, otherwise displays the plot.
    """
    is_dual_plot = inpatient_arrivals_2 is not None
    if is_dual_plot and labels is None:
        labels = ("Dataset 1", "Dataset 2")

    datasets = [(inpatient_arrivals, "C0", "o", num_days)]
    if is_dual_plot:
        datasets.append((inpatient_arrivals_2, "C1", "s", num_days_2))

    # Calculate and process arrival rates for all datasets
    processed_data = []
    max_y_values = []

    for dataset, color, marker, num_days in datasets:
        # Calculate base arrival rates
        arrival_rates_dict = time_varying_arrival_rates(
            dataset, time_interval, num_days=num_days
        )
        arrival_rates, hour_labels, hour_values = process_arrival_rates(
            arrival_rates_dict
        )
        max_y_values.append(max(arrival_rates))

        # Calculate lagged rates if needed
        arrival_rates_lagged = None
        if lagged_by is not None:
            arrival_rates_lagged_dict = time_varying_arrival_rates_lagged(
                dataset, lagged_by, yta_time_interval=time_interval, num_days=num_days
            )
            arrival_rates_lagged, _, _ = process_arrival_rates(
                arrival_rates_lagged_dict
            )
            max_y_values.append(max(arrival_rates_lagged))

        # Calculate spread rates if needed
        arrival_rates_spread = None
        if curve_params is not None:
            x1, y1, x2, y2 = curve_params
            arrival_rates_spread_dict = unfettered_demand_by_hour(
                dataset, x1, y1, x2, y2, num_days=num_days
            )
            arrival_rates_spread, _, _ = process_arrival_rates(
                arrival_rates_spread_dict
            )
            max_y_values.append(max(arrival_rates_spread))

        processed_data.append(
            {
                "arrival_rates": arrival_rates,
                "arrival_rates_lagged": arrival_rates_lagged,
                "arrival_rates_spread": arrival_rates_spread,
                "color": color,
                "marker": marker,
                "dataset_label": labels[len(processed_data)] if is_dual_plot else None,
            }
        )

    # Helper function to create cyclic data
    def get_cyclic_data(data):
        return data[start_plot_index:] + data[0:start_plot_index]

    # Plot setup
    fig = plt.figure(figsize=(10, 6))
    x_values = get_cyclic_data(hour_labels)

    # Plot data for each dataset
    for data in processed_data:
        dataset_suffix = f" ({data['dataset_label']})" if data["dataset_label"] else ""

        # Base arrival rates
        base_label = f"Arrival rates of admitted patients{dataset_suffix}"
        plt.plot(
            x_values,
            get_cyclic_data(data["arrival_rates"]),
            marker="x",
            color=data["color"],
            markersize=4,
            linestyle=":" if (curve_params or lagged_by) else "-",
            linewidth=1 if (curve_params or lagged_by) else None,
            label=base_label,
        )

        if lagged_by is not None:
            # Lagged arrival rates
            lagged_label = f"Average number of beds needed assuming admission\nexactly {lagged_by} hours after arrival{dataset_suffix}"
            plt.plot(
                x_values,
                get_cyclic_data(data["arrival_rates_lagged"]),
                marker="o",
                markersize=4,
                color=data["color"],
                linestyle="--",
                linewidth=1,
                label=lagged_label,
            )

        if curve_params is not None and data["arrival_rates_spread"] is not None:
            # Spread arrival rates
            spread_label = f"Average number of beds applying ED targets of {int(y1*100)}% in {int(x1)} hours{dataset_suffix}"
            plt.plot(
                x_values,
                get_cyclic_data(data["arrival_rates_spread"]),
                marker=data["marker"],  # Keep original dataset marker
                color=data["color"],  # Keep original dataset color
                label=spread_label,
            )

    # Set plot limits and labels
    plt.ylim(0, max(max_y_values) + 0.25)
    plt.xlim(hour_values[0] - x_margin, hour_values[-1] + x_margin)

    plt.xlabel("Hour of day")
    plt.ylabel("Arrival Rate (patients per hour)")
    plt.title(title)
    plt.grid(True, alpha=0.3)

    # Always show legend if there are multiple datasets or multiple rate types
    if is_dual_plot or lagged_by is not None or curve_params is not None:
        plt.legend()

    plt.tight_layout()

    # Save if path provided
    if media_file_path:
        if file_name:
            filename = file_name
        else:
            filename = f"{file_prefix}{clean_title_for_filename(title)}"
        plt.savefig(media_file_path / filename, dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()

plot_cumulative_arrival_rates(inpatient_arrivals, title, curve_params=None, lagged_by=None, time_interval=60, start_plot_index=0, draw_window=None, x_margin=0.5, file_prefix='', set_y_lim=None, hour_lines=[12, 17], line_styles={12: '--', 17: ':', 20: '--'}, annotation_prefix='On average', line_colour='red', media_file_path=None, file_name=None, plot_centiles=False, highlight_centile=0.9, centiles=[0.3, 0.5, 0.7, 0.9, 0.99], markers=['D', 's', '^', 'o', 'v'], line_styles_centiles=['-.', '--', ':', '-', '-'], bed_type_spec='', text_y_offset=1, num_days=None, return_figure=False)

Plot cumulative arrival rates with optional statistical distributions.

Parameters:

Name Type Description Default
inpatient_arrivals array-like

Dataset of inpatient arrivals.

required
title str

Title of the plot.

required
curve_params tuple of float

Parameters for spread rates as (x1, y1, x2, y2), by default None.

None
lagged_by int

Time lag in hours for cumulative rates, by default None.

None
time_interval int

Time interval in minutes for rate calculations, by default 60.

60
start_plot_index int

Starting hour index for plotting, by default 0.

0
draw_window tuple of int

Time window for detailed annotation, by default None.

None
x_margin float

Margin on the x-axis, by default 0.5.

0.5
file_prefix str

Prefix for the saved file name, by default "".

''
set_y_lim float

Upper limit for the y-axis, by default None.

None
hour_lines list of int

Specific hours to annotate, by default [12, 17].

[12, 17]
line_styles dict

Line styles for hour annotations keyed by hour, by default {12: "--", 17: ":", 20: "--"}.

{12: '--', 17: ':', 20: '--'}
annotation_prefix str

Prefix for annotations, by default "On average".

'On average'
line_colour str

Color for the main line plot, by default "red".

'red'
media_file_path str or Path

Directory path to save the plot, by default None.

None
file_name str

Custom filename to use when saving the plot. If not provided, uses file_prefix + cleaned title.

None
plot_centiles bool

Whether to include percentile visualization, by default False.

False
highlight_centile float

Percentile to emphasize, by default 0.9. If 1.0 is provided, will use 0.9999 instead.

0.9
centiles list of float

List of percentiles to calculate, by default [0.3, 0.5, 0.7, 0.9, 0.99].

[0.3, 0.5, 0.7, 0.9, 0.99]
markers list of str

Marker styles for percentile lines, by default ["D", "s", "^", "o", "v"].

['D', 's', '^', 'o', 'v']
line_styles_centiles list of str

Line styles for percentile visualization, by default ["-.", "--", ":", "-", "-"].

['-.', '--', ':', '-', '-']
bed_type_spec str

Specification for bed type in annotations, by default "".

''
text_y_offset float

Vertical offset for text annotations, by default 1.

1
num_days int

Number of days in the dataset, by default None.

None
return_figure bool

If True, returns the matplotlib figure instead of displaying it, by default False.

False

Returns:

Type Description
Figure or None

Returns the figure if return_figure is True, otherwise displays the plot.
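Examples:

A sketch showing cumulative demand with percentile bands, under the same data assumptions as plot_arrival_rates:

>>> plot_cumulative_arrival_rates(
...     inpatient_arrivals,
...     title="Cumulative beds needed by hour of day",
...     plot_centiles=True,
...     highlight_centile=0.9,
... )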

Source code in src/patientflow/viz/arrival_rates.py
def plot_cumulative_arrival_rates(
    inpatient_arrivals,
    title,
    curve_params=None,
    lagged_by=None,
    time_interval=60,
    start_plot_index=0,
    draw_window=None,
    x_margin=0.5,
    file_prefix="",
    set_y_lim=None,
    hour_lines=[12, 17],
    line_styles={12: "--", 17: ":", 20: "--"},
    annotation_prefix="On average",
    line_colour="red",
    media_file_path=None,
    file_name=None,
    plot_centiles=False,
    highlight_centile=0.9,
    centiles=[0.3, 0.5, 0.7, 0.9, 0.99],
    markers=["D", "s", "^", "o", "v"],
    line_styles_centiles=["-.", "--", ":", "-", "-"],
    bed_type_spec="",
    text_y_offset=1,
    num_days=None,
    return_figure=False,
):
    """Plot cumulative arrival rates with optional statistical distributions.

    Parameters
    ----------
    inpatient_arrivals : array-like
        Dataset of inpatient arrivals.
    title : str
        Title of the plot.
    curve_params : tuple of float, optional
        Parameters for spread rates as (x1, y1, x2, y2), by default None.
    lagged_by : int, optional
        Time lag in hours for cumulative rates, by default None.
    time_interval : int, optional
        Time interval in minutes for rate calculations, by default 60.
    start_plot_index : int, optional
        Starting hour index for plotting, by default 0.
    draw_window : tuple of int, optional
        Time window for detailed annotation, by default None.
    x_margin : float, optional
        Margin on the x-axis, by default 0.5.
    file_prefix : str, optional
        Prefix for the saved file name, by default "".
    set_y_lim : float, optional
        Upper limit for the y-axis, by default None.
    hour_lines : list of int, optional
        Specific hours to annotate, by default [12, 17].
    line_styles : dict, optional
        Line styles for hour annotations keyed by hour, by default {12: "--", 17: ":", 20: "--"}.
    annotation_prefix : str, optional
        Prefix for annotations, by default "On average".
    line_colour : str, optional
        Color for the main line plot, by default "red".
    media_file_path : str or Path, optional
        Directory path to save the plot, by default None.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, uses file_prefix + cleaned title.
    plot_centiles : bool, optional
        Whether to include percentile visualization, by default False.
    highlight_centile : float, optional
        Percentile to emphasize, by default 0.9. If 1.0 is provided, will use 0.9999 instead.
    centiles : list of float, optional
        List of percentiles to calculate, by default [0.3, 0.5, 0.7, 0.9, 0.99].
    markers : list of str, optional
        Marker styles for percentile lines, by default ["D", "s", "^", "o", "v"].
    line_styles_centiles : list of str, optional
        Line styles for percentile visualization, by default ["-.", "--", ":", "-", "-"].
    bed_type_spec : str, optional
        Specification for bed type in annotations, by default "".
    text_y_offset : float, optional
        Vertical offset for text annotations, by default 1.
    num_days : int, optional
        Number of days in the dataset, by default None.
    return_figure : bool, optional
        If True, returns the matplotlib figure instead of displaying it, by default False.

    Returns
    -------
    matplotlib.figure.Figure or None
        Returns the figure if return_figure is True, otherwise displays the plot.
    """

    # Handle edge case for highlight_centile = 1.0
    original_highlight_centile = highlight_centile
    if highlight_centile >= 1.0:
        highlight_centile = 0.9999  # Use a very high but not exactly 1.0 value

    # Ensure centiles are all valid (no 1.0 values)
    processed_centiles = [min(c, 0.9999) if c >= 1.0 else c for c in centiles]

    # Data processing
    if curve_params is not None:
        x1, y1, x2, y2 = curve_params
        arrival_rates_dict = unfettered_demand_by_hour(
            inpatient_arrivals, x1, y1, x2, y2, num_days=num_days
        )
    elif lagged_by is not None:
        arrival_rates_dict = time_varying_arrival_rates_lagged(
            inpatient_arrivals, lagged_by, time_interval, num_days=num_days
        )
    else:
        arrival_rates_dict = time_varying_arrival_rates(
            inpatient_arrivals, time_interval, num_days=num_days
        )

    # Process arrival rates
    arrival_rates, hour_labels, hour_values = process_arrival_rates(arrival_rates_dict)

    # Reindex based on start_plot_index
    rates_reindexed = (
        list(arrival_rates)[start_plot_index:] + list(arrival_rates)[0:start_plot_index]
    )
    labels_reindexed = (
        list(hour_labels)[start_plot_index:] + list(hour_labels)[0:start_plot_index]
    )

    # Set up plot
    fig = plt.figure(figsize=(10, 6))
    ax = plt.gca()

    # Plot mean line
    label_suffix = f" {bed_type_spec} beds needed" if bed_type_spec else " beds needed"
    cumsum_rates = np.cumsum(rates_reindexed)

    plt.plot(
        labels_reindexed,
        cumsum_rates,
        marker="o",
        markersize=3,
        color=line_colour,
        linewidth=2,
        alpha=0.7,
        label=f"Average number of{label_suffix}",
    )

    # set max y value assuming centiles not plotted
    max_y = cumsum_rates[-1]

    if plot_centiles:
        # Calculate and plot percentiles
        percentiles = [[] for _ in range(len(processed_centiles))]
        cumulative_value_at_centile = np.zeros(len(processed_centiles))
        highlight_percentile_data = None

        # Find the index of highlight_centile in processed_centiles
        highlight_index = -1
        for i, c in enumerate(processed_centiles):
            if (
                abs(c - highlight_centile) < 0.0001
            ):  # Use small epsilon for float comparison
                highlight_index = i
                break

        # If highlight_centile is not in processed_centiles, add it
        if highlight_index == -1:
            processed_centiles.append(highlight_centile)
            percentiles.append([])
            cumulative_value_at_centile = np.append(cumulative_value_at_centile, 0)

        for hour in range(len(rates_reindexed)):
            for i, centile in enumerate(processed_centiles):
                try:
                    # Add error handling for ppf calculation
                    value_at_centile = stats.poisson.ppf(centile, rates_reindexed[hour])

                    # Apply a reasonable upper limit if the value is extremely large
                    if (
                        np.isinf(value_at_centile)
                        or value_at_centile > 1000 * rates_reindexed[hour]
                    ):
                        value_at_centile = 10 * rates_reindexed[hour]

                    cumulative_value_at_centile[i] += value_at_centile
                    percentiles[i].append(value_at_centile)
                except (ValueError, OverflowError, RuntimeError):
                    # Fallback if calculation fails
                    fallback_value = 10 * rates_reindexed[hour]
                    cumulative_value_at_centile[i] += fallback_value
                    percentiles[i].append(fallback_value)

                # Match the highlight centile to the processed value
                if (
                    abs(centile - highlight_centile) < 0.0001
                ):  # Use a small epsilon for floating point comparison
                    highlight_percentile_data = np.cumsum(percentiles[i])

        # Plot percentile lines
        for i, centile in enumerate(processed_centiles):
            marker = markers[i % len(markers)]
            line_style = line_styles_centiles[i % len(line_styles_centiles)]
            linewidth = 2 if centile == highlight_centile else 1
            alpha = 1.0 if centile == highlight_centile else 0.7

            # If the user requested 1.0, display as 99.99% since a Poisson distribution
            # cannot provide exact 100% probability with any finite value
            display_centile = processed_centiles[i]
            if centile == highlight_centile and original_highlight_centile >= 1.0:
                display_centile = (
                    0.9999  # Use 99.99% as the highest displayable probability
                )

            # Format the label text with appropriate precision
            if display_centile >= 0.999:
                # For very high probabilities, show as 99.9% or 99.99% to avoid implying exact 100%
                label_text = f"{display_centile*100:.2f}% probability"
            else:
                label_text = f"{display_centile*100:.0f}% probability"

            cumsum_percentile = np.cumsum(percentiles[i])
            plt.plot(
                labels_reindexed,
                cumsum_percentile,
                marker=marker,
                markersize=3,
                linestyle=line_style,
                color="C0",
                linewidth=linewidth,
                alpha=alpha,
                label=label_text,
            )
        # update max y
        max_y = max(cumulative_value_at_centile)

        # Draw window visualization if requested
        if draw_window:
            start_window, end_window = draw_window
            reindexed_start = (start_window - start_plot_index) % len(
                highlight_percentile_data
            )
            reindexed_end = (end_window - start_plot_index) % len(
                highlight_percentile_data
            )
            window_params = get_window_parameters(
                highlight_percentile_data, reindexed_start, reindexed_end, hour_values
            )
            draw_window_visualization(
                ax,
                hour_values,
                window_params,
                annotation_prefix,
                start_window,
                end_window,
            )
            slope, x1, y1, x2, y2 = window_params
            for hour_line in hour_lines:
                annotate_hour_line(
                    hour_line=hour_line,
                    y_value=y1,
                    hour_values=hour_values,
                    start_plot_index=start_plot_index,
                    line_styles=line_styles,
                    x_margin=x_margin,
                    annotation_prefix=annotation_prefix,
                    slope=slope,
                    x1=x1,
                    y1=y1,
                )

        else:
            # Regular percentile annotations
            for hour_line in hour_lines:
                # Check if highlight_percentile_data is available
                if highlight_percentile_data is None:
                    # Fall back to mean line if no highlight data
                    cumsum_at_hour = cumsum_rates[hour_line - start_plot_index]
                else:
                    cumsum_at_hour = highlight_percentile_data[
                        hour_line - start_plot_index
                    ]
                annotate_hour_line(
                    hour_line=hour_line,
                    y_value=cumsum_at_hour,
                    hour_values=hour_values,
                    start_plot_index=start_plot_index,
                    line_styles=line_styles,
                    x_margin=x_margin,
                    annotation_prefix=annotation_prefix,
                    text_y_offset=text_y_offset,
                )

        # Reverse legend order
        handles, labels = plt.gca().get_legend_handles_labels()
        plt.legend(handles[::-1], labels[::-1], loc="upper left")
    else:
        plt.legend(loc="upper left")

        if draw_window:
            start_window, end_window = draw_window
            reindexed_start = (start_window - start_plot_index) % len(cumsum_rates)
            reindexed_end = (end_window - start_plot_index) % len(cumsum_rates)
            window_params = get_window_parameters(
                cumsum_rates, reindexed_start, reindexed_end, hour_values
            )
            draw_window_visualization(
                ax,
                hour_values,
                window_params,
                annotation_prefix,
                start_window,
                end_window,
            )
            slope, x1, y1, x2, y2 = window_params
            for hour_line in hour_lines:
                annotate_hour_line(
                    hour_line=hour_line,
                    y_value=y1,
                    hour_values=hour_values,
                    start_plot_index=start_plot_index,
                    line_styles=line_styles,
                    x_margin=x_margin,
                    annotation_prefix=annotation_prefix,
                    slope=slope,
                    x1=x1,
                    y1=y1,
                )
        else:
            # Regular mean line annotations
            for hour_line in hour_lines:
                annotate_hour_line(
                    hour_line=hour_line,
                    y_value=cumsum_rates[hour_line - start_plot_index],
                    hour_values=hour_values,
                    start_plot_index=start_plot_index,
                    line_styles=line_styles,
                    x_margin=x_margin,
                    annotation_prefix=annotation_prefix,
                )

    plt.xlabel("Hour of day")
    plt.ylabel("Cumulative number of beds needed")
    plt.xlim(hour_values[0] - x_margin, hour_values[-1] + x_margin)
    plt.ylim(0, set_y_lim if set_y_lim else max(max_y + 2, max_y * 1.2))
    plt.minorticks_on()
    plt.gca().yaxis.set_minor_locator(plt.MultipleLocator(5))

    plt.title(title)
    plt.tight_layout()

    if media_file_path:
        if file_name:
            filename = file_name
        else:
            filename = f"{file_prefix}{clean_title_for_filename(title)}"
        plt.savefig(media_file_path / filename, dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
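
The modulo arithmetic in the draw_window branch above handles prediction windows that wrap past midnight. A minimal standalone sketch of the reindexing, with hypothetical values (not the package's internal code):

# Hours are plotted from start_plot_index onwards (e.g. 8 = 8am),
# so a window of (22, 2) - 10pm to 2am - wraps past midnight.
start_plot_index = 8
n_hours = 24  # stands in for len(cumsum_rates)
start_window, end_window = 22, 2
reindexed_start = (start_window - start_plot_index) % n_hours  # 14
reindexed_end = (end_window - start_plot_index) % n_hours  # 18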

aspirational_curve

Visualization module for plotting aspirational curves in patient flow analysis.

This module provides functionality for creating and customizing plots of aspirational curves, which represent the probability of admission over time. These curves are useful for setting aspirational targets in healthcare settings.

Functions:

Name Description
plot_curve : function

Plot an aspirational curve with specified points and optional annotations

Examples:

>>> plot_curve(
...     title="Admission Probability Curve",
...     x1=4,
...     y1=0.2,
...     x2=24,
...     y2=0.8,
...     include_titles=True
... )

plot_curve(title, x1, y1, x2, y2, figsize=(10, 5), include_titles=False, text_size=14, media_file_path=None, file_name=None, return_figure=False, annotate_points=False)

Plot an aspirational curve with specified points and optional annotations.

This function creates a plot of an aspirational curve between two points, with options for customization of the visualization including titles, annotations, and saving to a file.

Parameters:

Name Type Description Default
title str

The title of the plot.

required
x1 float

x-coordinate of the first point.

required
y1 float

y-coordinate of the first point (probability value).

required
x2 float

x-coordinate of the second point.

required
y2 float

y-coordinate of the second point (probability value).

required
figsize tuple of int

Figure size in inches (width, height), by default (10, 5).

(10, 5)
include_titles bool

Whether to include axis labels and title, by default False.

False
text_size int

Font size for text elements, by default 14.

14
media_file_path str or Path

Path to save the plot image, by default None.

None
file_name str

Custom filename for saving the plot. If not provided, uses a cleaned version of the title.

None
return_figure bool

Whether to return the figure object instead of displaying it, by default False.

False
annotate_points bool

Whether to add coordinate annotations to the points, by default False.

False

Returns:

Type Description
Figure or None

The figure object if return_figure is True, otherwise None.

Notes

The function creates a curve between two points using the create_curve function and adds various visualization elements including grid lines, annotations, and optional titles.

Source code in src/patientflow/viz/aspirational_curve.py
def plot_curve(
    title,
    x1,
    y1,
    x2,
    y2,
    figsize=(10, 5),
    include_titles=False,
    text_size=14,
    media_file_path=None,
    file_name=None,
    return_figure=False,
    annotate_points=False,
):
    """Plot an aspirational curve with specified points and optional annotations.

    This function creates a plot of an aspirational curve between two points,
    with options for customization of the visualization including titles,
    annotations, and saving to a file.

    Parameters
    ----------
    title : str
        The title of the plot.
    x1 : float
        x-coordinate of the first point.
    y1 : float
        y-coordinate of the first point (probability value).
    x2 : float
        x-coordinate of the second point.
    y2 : float
        y-coordinate of the second point (probability value).
    figsize : tuple of int, optional
        Figure size in inches (width, height), by default (10, 5).
    include_titles : bool, optional
        Whether to include axis labels and title, by default False.
    text_size : int, optional
        Font size for text elements, by default 14.
    media_file_path : str or Path, optional
        Path to save the plot image, by default None.
    file_name : str, optional
        Custom filename for saving the plot. If not provided, uses a cleaned version of the title.
    return_figure : bool, optional
        Whether to return the figure object instead of displaying it, by default False.
    annotate_points : bool, optional
        Whether to add coordinate annotations to the points, by default False.

    Returns
    -------
    matplotlib.figure.Figure or None
        The figure object if return_figure is True, otherwise None.

    Notes
    -----
    The function creates a curve between two points using the create_curve function
    and adds various visualization elements including grid lines, annotations,
    and optional titles.
    """
    gamma, lamda, a, x_values, y_values = create_curve(
        x1, y1, x2, y2, generate_values=True
    )

    # Plot the curve
    fig = plt.figure(figsize=figsize)

    plt.plot(x_values, y_values)
    plt.scatter(x1, y1, color="red")  # Mark the point (x1, y1)
    plt.scatter(x2, y2, color="red")  # Mark the point (x2, y2)

    if annotate_points:
        plt.annotate(
            f"({x1}, {y1:.2f})",
            (x1, y1),
            xytext=(10, -15),
            textcoords="offset points",
            fontsize=text_size,
        )
        plt.annotate(
            f"({x2}, {y2:.2f})",
            (x2, y2),
            xytext=(10, -15),
            textcoords="offset points",
            fontsize=text_size,
        )

    if text_size:
        plt.tick_params(axis="both", which="major", labelsize=text_size)

    x_ticks = np.arange(min(x_values), max(x_values) + 1, 2)
    plt.xticks(x_ticks)

    if include_titles:
        plt.title(title, fontsize=text_size)
        plt.xlabel("Hours since admission", fontsize=text_size)
        plt.ylabel("Probability of admission by this point", fontsize=text_size)

    plt.axhline(y=y1, color="green", linestyle="--", label=f"y = {int(y1*100)}%")
    plt.axvline(x=x1, color="gray", linestyle="--", label=f"x = {x1} hours")
    plt.legend(fontsize=text_size)

    plt.tight_layout()

    if media_file_path:
        os.makedirs(media_file_path, exist_ok=True)
        if file_name:
            filename = file_name
        else:
            filename = clean_title_for_filename(title)
        plt.savefig(media_file_path / filename, dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
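
As a usage sketch, the figure can be kept for further styling and saved in one call (the figures directory and coordinate values here are placeholders):

>>> from pathlib import Path
>>> fig = plot_curve(
...     title="4-hour target curve",
...     x1=4,
...     y1=0.76,
...     x2=12,
...     y2=0.99,
...     media_file_path=Path("figures"),
...     return_figure=True,
... )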

calibration

Calibration plot visualization module.

This module creates calibration plots for trained models, showing how well the predicted probabilities align with actual outcomes.

Functions:

Name Description
plot_calibration : function

Plot calibration curves for multiple models

plot_calibration(trained_models, test_visits, exclude_from_training_data, strategy='uniform', media_file_path=None, file_name=None, suptitle=None, return_figure=False, label_col='is_admitted')

Plot calibration curves for multiple models.

A calibration plot shows how well the predicted probabilities from a model align with the actual outcomes. The plot compares the mean predicted probability with the fraction of positive outcomes for different probability bins.

Parameters:

Name Type Description Default
trained_models list[TrainedClassifier] or dict[str, TrainedClassifier]

List of TrainedClassifier objects or dictionary with TrainedClassifier values.

required
test_visits DataFrame

DataFrame containing test visit data.

required
exclude_from_training_data list

Columns to exclude from the test data.

required
strategy (uniform, quantile)

Strategy for calibration curve binning.
- 'uniform': Bins are of equal width
- 'quantile': Bins have equal number of samples

'uniform'
media_file_path Path

Path where the plot should be saved.

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "calibration_plot.png".

None
suptitle str

Optional super title for the entire figure.

None
return_figure bool

If True, returns the figure instead of displaying it.

False
label_col str

Name of the column containing the target labels.

'is_admitted'

Returns:

Type Description
Figure or None

If return_figure is True, returns the figure object. Otherwise, displays the plot and returns None.

Notes

The function creates a subplot for each trained model, sorted by prediction time. Each subplot shows the calibration curve and a reference line for perfect calibration.

Source code in src/patientflow/viz/calibration.py
def plot_calibration(
    trained_models: list[TrainedClassifier] | dict[str, TrainedClassifier],
    test_visits,
    exclude_from_training_data,
    strategy="uniform",
    media_file_path: Optional[Path] = None,
    file_name=None,
    suptitle=None,
    return_figure=False,
    label_col: str = "is_admitted",
):
    """Plot calibration curves for multiple models.

    A calibration plot shows how well the predicted probabilities from a model
    align with the actual outcomes. The plot compares the mean predicted probability
    with the fraction of positive outcomes for different probability bins.

    Parameters
    ----------
    trained_models : list[TrainedClassifier] or dict[str, TrainedClassifier]
        List of TrainedClassifier objects or dictionary with TrainedClassifier values.
    test_visits : pandas.DataFrame
        DataFrame containing test visit data.
    exclude_from_training_data : list
        Columns to exclude from the test data.
    strategy : {'uniform', 'quantile'}, default='uniform'
        Strategy for calibration curve binning.
        - 'uniform': Bins are of equal width
        - 'quantile': Bins have equal number of samples
    media_file_path : Path, optional
        Path where the plot should be saved.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "calibration_plot.png".
    suptitle : str, optional
        Optional super title for the entire figure.
    return_figure : bool, default=False
        If True, returns the figure instead of displaying it.
    label_col : str, default='is_admitted'
        Name of the column containing the target labels.

    Returns
    -------
    matplotlib.figure.Figure or None
        If return_figure is True, returns the figure object. Otherwise, displays
        the plot and returns None.

    Notes
    -----
    The function creates a subplot for each trained model, sorted by prediction time.
    Each subplot shows the calibration curve and a reference line for perfect calibration.
    """
    # Convert dict to list if needed
    if isinstance(trained_models, dict):
        trained_models = list(trained_models.values())

    # Sort trained_models by prediction time
    trained_models_sorted = sorted(
        trained_models,
        key=lambda x: x.training_results.prediction_time[0] * 60
        + x.training_results.prediction_time[1],
    )
    num_plots = len(trained_models_sorted)
    fig, axs = plt.subplots(1, num_plots, figsize=(num_plots * 5, 4))

    # Handle case of single prediction time
    if num_plots == 1:
        axs = [axs]

    for i, trained_model in enumerate(trained_models_sorted):
        # Use calibrated pipeline if available, otherwise use regular pipeline
        if (
            hasattr(trained_model, "calibrated_pipeline")
            and trained_model.calibrated_pipeline is not None
        ):
            pipeline = trained_model.calibrated_pipeline
        else:
            pipeline = trained_model.pipeline

        prediction_time = trained_model.training_results.prediction_time

        # Get test data for this prediction time
        X_test, y_test = prepare_patient_snapshots(
            df=test_visits,
            prediction_time=prediction_time,
            exclude_columns=exclude_from_training_data,
            single_snapshot_per_visit=False,
            label_col=label_col,
        )

        X_test = add_missing_columns(pipeline, X_test)

        prob_true, prob_pred = calibration_curve(
            y_test, pipeline.predict_proba(X_test)[:, 1], n_bins=10, strategy=strategy
        )

        ax = axs[i]
        hour, minutes = prediction_time

        ax.plot(
            prob_pred,
            prob_true,
            marker="o",
            linewidth=1,
            label="Predictions",
            color=primary_color,
        )
        ax.plot(
            [0, 1],
            [0, 1],
            linestyle="--",
            label="Perfectly calibrated",
            color=secondary_color,
        )
        ax.set_title(f"Calibration Plot for {hour}:{minutes:02}", fontsize=14)
        ax.set_xlabel("Mean Estimated Probability", fontsize=12)
        ax.set_ylabel("Fraction of Positives", fontsize=12)
        ax.legend()

    plt.tight_layout()

    # Add suptitle if provided
    if suptitle:
        plt.suptitle(suptitle, fontsize=16, y=1.05)

    if media_file_path:
        if file_name:
            calib_plot_path = media_file_path / file_name
        else:
            calib_plot_path = media_file_path / "calibration_plot.png"
        plt.savefig(calib_plot_path)

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close()
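
A usage sketch, assuming trained_models and test_visits have been prepared as described above (the excluded column names are placeholders):

>>> fig = plot_calibration(
...     trained_models=trained_models,
...     test_visits=test_visits,
...     exclude_from_training_data=["visit_id", "snapshot_date"],
...     strategy="quantile",
...     return_figure=True,
... )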

data_distribution

Visualisation module for plotting data distributions.

This module provides functions for creating distribution plots of data variables grouped by categories.

Functions:

Name Description
plot_data_distribution : function

Plot distributions of data variables grouped by categories

plot_data_distribution(df, col_name, grouping_var, grouping_var_name, plot_type='both', title=None, rotate_x_labels=False, is_discrete=False, ordinal_order=None, media_file_path=None, file_name=None, return_figure=False, truncate_outliers=True, outlier_method='zscore', outlier_threshold=2.0)

Plot distributions of data variables grouped by categories.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing the data to plot

required
col_name str

Name of the column to plot distributions for

required
grouping_var str

Name of the column to group the data by

required
grouping_var_name str

Display name for the grouping variable

required
plot_type (both, hist, kde)

Type of plot to create. 'both' shows histogram with KDE, 'hist' shows only histogram, 'kde' shows only KDE plot

'both'
title str

Title for the plot

None
rotate_x_labels bool

Whether to rotate x-axis labels by 90 degrees

False
is_discrete bool

Whether the data is discrete

False
ordinal_order list

Order of categories for ordinal data

None
media_file_path Path

Path where the plot should be saved

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "data_distributions.png".

None
return_figure bool

If True, returns the figure instead of displaying it

False
truncate_outliers bool

Whether to truncate the x-axis to exclude extreme outliers

True
outlier_method (iqr, zscore)

Method to detect outliers. 'iqr' uses interquartile range, 'zscore' uses z-score

'zscore'
outlier_threshold float

Threshold for outlier detection. For IQR method, this is the multiplier. For z-score method, this is the number of standard deviations.

2.0

Returns:

Type Description
FacetGrid or None

If return_figure is True, returns the FacetGrid object. Otherwise, displays the plot and returns None.

Raises:

Type Description
ValueError

If plot_type is not one of 'both', 'hist', or 'kde'.
If outlier_method is not one of 'iqr' or 'zscore'.

Source code in src/patientflow/viz/data_distribution.py
def plot_data_distribution(
    df,
    col_name,
    grouping_var,
    grouping_var_name,
    plot_type="both",
    title=None,
    rotate_x_labels=False,
    is_discrete=False,
    ordinal_order=None,
    media_file_path=None,
    file_name=None,
    return_figure=False,
    truncate_outliers=True,
    outlier_method="zscore",
    outlier_threshold=2.0,
):
    """Plot distributions of data variables grouped by categories.

    Parameters
    ----------
    df : pandas.DataFrame
        Input DataFrame containing the data to plot
    col_name : str
        Name of the column to plot distributions for
    grouping_var : str
        Name of the column to group the data by
    grouping_var_name : str
        Display name for the grouping variable
    plot_type : {'both', 'hist', 'kde'}, default='both'
        Type of plot to create. 'both' shows histogram with KDE, 'hist' shows
        only histogram, 'kde' shows only KDE plot
    title : str, optional
        Title for the plot
    rotate_x_labels : bool, default=False
        Whether to rotate x-axis labels by 90 degrees
    is_discrete : bool, default=False
        Whether the data is discrete
    ordinal_order : list, optional
        Order of categories for ordinal data
    media_file_path : Path, optional
        Path where the plot should be saved
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "data_distributions.png".
    return_figure : bool, default=False
        If True, returns the figure instead of displaying it
    truncate_outliers : bool, default=True
        Whether to truncate the x-axis to exclude extreme outliers
    outlier_method : {'iqr', 'zscore'}, default='zscore'
        Method to detect outliers. 'iqr' uses interquartile range, 'zscore' uses z-score
    outlier_threshold : float, default=2.0
        Threshold for outlier detection. For IQR method, this is the multiplier.
        For z-score method, this is the number of standard deviations.

    Returns
    -------
    seaborn.FacetGrid or None
        If return_figure is True, returns the FacetGrid object. Otherwise,
        displays the plot and returns None.

    Raises
    ------
    ValueError
        If plot_type is not one of 'both', 'hist', or 'kde'
        If outlier_method is not one of 'iqr' or 'zscore'
    """
    sns.set_theme(style="whitegrid")

    if ordinal_order is not None:
        df[col_name] = pd.Categorical(
            df[col_name], categories=ordinal_order, ordered=True
        )

    # Calculate outlier bounds if truncation is requested
    x_limits = None
    if truncate_outliers:
        values = df[col_name].dropna()
        if pd.api.types.is_numeric_dtype(values) and len(values) > 0:
            # Check if data is actually discrete (all values are integers)
            is_actually_discrete = np.allclose(values, values.round())

            # Apply outlier truncation to continuous data OR discrete data with outliers
            # For discrete data, we still want to truncate if there are extreme outliers
            if outlier_method == "iqr":
                Q1 = values.quantile(0.25)
                Q3 = values.quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - outlier_threshold * IQR
                upper_bound = Q3 + outlier_threshold * IQR
            elif outlier_method == "zscore":
                mean_val = values.mean()
                std_val = values.std()
                lower_bound = mean_val - outlier_threshold * std_val
                upper_bound = mean_val + outlier_threshold * std_val
            else:
                raise ValueError(
                    "Invalid outlier_method. Choose from 'iqr' or 'zscore'."
                )

            # Only apply truncation if there are actual outliers
            # For discrete data, ensure lower bound is at least 0
            if values.min() < lower_bound or values.max() > upper_bound:
                if is_actually_discrete:
                    # For discrete data, ensure bounds are reasonable
                    lower_bound = max(0, lower_bound)
                x_limits = (lower_bound, upper_bound)

    g = sns.FacetGrid(df, col=grouping_var, height=3, aspect=1.5)

    if is_discrete:
        valid_values = sorted([x for x in df[col_name].unique() if pd.notna(x)])
        min_val = min(valid_values)
        max_val = max(valid_values)
        bins = np.arange(min_val - 0.5, max_val + 1.5, 1)
    else:
        # Handle numeric data
        values = df[col_name].dropna()
        if pd.api.types.is_numeric_dtype(values):
            if np.allclose(values, values.round()):
                bins = np.arange(values.min() - 0.5, values.max() + 1.5, 1)
            else:
                n_bins = min(100, max(10, int(np.sqrt(len(values)))))
                bins = n_bins
        else:
            bins = "auto"

    if plot_type == "both":
        g.map(sns.histplot, col_name, kde=True, bins=bins)
    elif plot_type == "hist":
        g.map(sns.histplot, col_name, kde=False, bins=bins)
    elif plot_type == "kde":
        g.map(sns.kdeplot, col_name, fill=True)
    else:
        raise ValueError("Invalid plot_type. Choose from 'both', 'hist', or 'kde'.")

    g.set_axis_labels(
        col_name, "Frequency" if plot_type != "kde" else "Density", fontsize=10
    )

    # Set facet titles with smaller font
    g.set_titles(col_template=f"{grouping_var}: {{col_name}}", size=11)

    # Add thousands separators to y-axis
    for ax in g.axes.flat:
        ax.yaxis.set_major_formatter(
            plt.FuncFormatter(lambda x, p: format(int(x), ","))
        )

    if rotate_x_labels:
        for ax in g.axes.flat:
            for label in ax.get_xticklabels():
                label.set_rotation(90)

    if is_discrete:
        for ax in g.axes.flat:
            ax.xaxis.set_major_locator(plt.MaxNLocator(integer=True))
            # Apply outlier truncation if available, otherwise use default discrete limits
            if x_limits is not None:
                # Ensure discrete limits are reasonable: min ≥ 0, max ≥ 1, and use integers
                lower_limit = max(0, int(x_limits[0]))
                upper_limit = max(
                    1, int(x_limits[1] + 0.5)
                )  # Round up to ensure we include the max value
                ax.set_xlim(lower_limit - 0.5, upper_limit + 0.5)
            else:
                # Ensure default discrete limits are reasonable: min ≥ 0, max ≥ 1
                # Use the actual min/max values to center the bars properly
                lower_limit = max(0, min_val)
                upper_limit = max(1, max_val)
                ax.set_xlim(lower_limit - 0.5, upper_limit + 0.5)
    elif x_limits is not None:
        # Apply outlier truncation to x-axis
        for ax in g.axes.flat:
            ax.set_xlim(x_limits)
            # Ensure integer tick marks for numeric data with outliers
            ax.xaxis.set_major_locator(plt.MaxNLocator(integer=True))
    else:
        # Let matplotlib auto-scale the x-axis
        pass

    plt.subplots_adjust(top=0.80)
    if title:
        g.figure.suptitle(title, fontsize=14)
    else:
        g.figure.suptitle(
            f"Distribution of {col_name} grouped by {grouping_var_name}", fontsize=14
        )

    if media_file_path:
        if file_name:
            filename = file_name
        else:
            filename = "data_distributions.png"
        plt.savefig(media_file_path / filename, dpi=300)

    if return_figure:
        return g
    else:
        plt.show()
        plt.close()
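
The two truncation rules can be illustrated with a minimal standalone sketch that mirrors the bound calculations above (hypothetical data):

import pandas as pd

values = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])  # 50 is an extreme outlier

# IQR rule: Q1/Q3 -/+ threshold * IQR (threshold=1.5 here)
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr_bounds = (q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1))

# z-score rule: mean -/+ threshold * standard deviation (threshold=2.0, the default)
zscore_bounds = (values.mean() - 2.0 * values.std(), values.mean() + 2.0 * values.std())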

epudd

Generate plots comparing observed values with model predictions for discrete distributions.

An Evaluating Predictions for Unique, Discrete Distributions (EPUDD) plot displays the model's predicted CDF values alongside the actual observed values' positions within their predicted CDF intervals. For discrete distributions, each predicted value has an associated probability, and the CDF is calculated by sorting the values and computing cumulative probabilities.

The plot can show three possible positions for each observation within its predicted interval:

* lower bound of the interval
* midpoint of the interval
* upper bound of the interval

By default, the plot only shows the midpoint of the interval.

For a well-calibrated model, the observed values should fall within their predicted intervals, with the distribution of positions showing appropriate uncertainty.

The visualisation helps assess model calibration by comparing:

1. The predicted cumulative distribution function (CDF) values
2. The actual positions of observations within their predicted intervals
3. The spread and distribution of these positions

Functions:

Name Description
plot_epudd : function

Generates and plots the comparison of model predictions with observed values.

plot_epudd(prediction_times, prob_dist_dict_all, model_name='admissions', return_figure=False, return_dataframe=False, figsize=None, suptitle=None, media_file_path=None, file_name=None, plot_all_bounds=False)

Generates plots comparing model predictions with observed values for discrete distributions.

For discrete distributions, each predicted value has an associated probability. The CDF is calculated by sorting the values and computing cumulative probabilities, normalized by the number of time points.

Parameters:

Name Type Description Default
prediction_times list of tuple

List of (hour, minute) tuples representing times for which predictions were made.

required
prob_dist_dict_all dict

Dictionary of probability distributions keyed by model_key. Each entry contains information about predicted distributions and observed values for different snapshot dates. The predicted distributions should be discrete probability mass functions, with each value having an associated probability.

required
model_name str

Base name of the model to construct model keys, by default "admissions".

'admissions'
return_figure bool

If True, returns the figure object instead of displaying it, by default False.

False
return_dataframe bool

If True, returns a dictionary of observation dataframes by model_key, by default False. The dataframes contain the merged observation and prediction data for analysis.

False
figsize tuple of (float, float)

Size of the figure in inches as (width, height). If None, calculated automatically based on number of plots, by default None.

None
suptitle str

Super title for the entire figure, displayed above all subplots, by default None.

None
media_file_path Path

Path to save the plot, by default None. If provided, saves the plot as a PNG file.

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "plot_epudd.png".

None
plot_all_bounds bool

If True, plots all bounds (lower, mid, upper). If False, only plots mid bounds. By default False.

False

Returns:

Type Description
Figure

The figure object containing the plots, if return_figure is True.

dict

Dictionary of observation dataframes by model_key, if return_dataframe is True.

tuple

Tuple of (figure, dataframes_dict) if both return_figure and return_dataframe are True.

None

If neither return_figure nor return_dataframe is True, displays the plots and returns None.

Notes

For discrete distributions, the CDF is calculated by:

1. Sorting the predicted values
2. Computing cumulative probabilities for each value
3. Normalizing by the number of time points

The plot shows three possible positions for each observation:

* lower_cdf (pink): Uses the lower bound of the CDF interval
* mid_cdf (green): Uses the midpoint of the CDF interval
* upper_cdf (light blue): Uses the upper bound of the CDF interval

The black points represent the model's predicted CDF values, calculated from the sorted values and their associated probabilities, while the colored points show where the actual observations fall within their predicted intervals. For a well-calibrated model, the observed values should fall within their predicted intervals, with the distribution of positions showing appropriate uncertainty.
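
The three interval positions can be derived from a predicted probability mass function in a few lines (a hypothetical PMF, not the module's internal code):

import numpy as np

values = np.array([0, 1, 2, 3, 4])  # possible admission counts
probs = np.array([0.1, 0.3, 0.3, 0.2, 0.1])  # predicted PMF (sums to 1)

upper_cdf = np.cumsum(probs)  # upper bound of each CDF interval
lower_cdf = upper_cdf - probs  # lower bound
mid_cdf = (lower_cdf + upper_cdf) / 2  # midpoint, plotted by default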

Source code in src/patientflow/viz/epudd.py
def plot_epudd(
    prediction_times: List[Tuple[int, int]],
    prob_dist_dict_all: Dict[str, Dict],
    model_name: str = "admissions",
    return_figure: bool = False,
    return_dataframe: bool = False,
    figsize: Optional[Tuple[float, float]] = None,
    suptitle: Optional[str] = None,
    media_file_path: Optional[Path] = None,
    file_name=None,
    plot_all_bounds: bool = False,
) -> Union[
    Figure, Dict[str, pd.DataFrame], Tuple[Figure, Dict[str, pd.DataFrame]], None
]:
    """
    Generates plots comparing model predictions with observed values for discrete distributions.

    For discrete distributions, each predicted value has an associated probability. The CDF
    is calculated by sorting the values and computing cumulative probabilities, normalized
    by the number of time points.

    Parameters
    ----------
    prediction_times : list of tuple
        List of (hour, minute) tuples representing times for which predictions were made.
    prob_dist_dict_all : dict
        Dictionary of probability distributions keyed by model_key. Each entry contains
        information about predicted distributions and observed values for different
        snapshot dates. The predicted distributions should be discrete probability mass
        functions, with each value having an associated probability.
    model_name : str, optional
        Base name of the model to construct model keys, by default "admissions".
    return_figure : bool, optional
        If True, returns the figure object instead of displaying it, by default False.
    return_dataframe : bool, optional
        If True, returns a dictionary of observation dataframes by model_key, by default False.
        The dataframes contain the merged observation and prediction data for analysis.
    figsize : tuple of (float, float), optional
        Size of the figure in inches as (width, height). If None, calculated automatically
        based on number of plots, by default None.
    suptitle : str, optional
        Super title for the entire figure, displayed above all subplots, by default None.
    media_file_path : Path, optional
        Path to save the plot, by default None. If provided, saves the plot as a PNG file.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "plot_epudd.png".
    plot_all_bounds : bool, optional
        If True, plots all bounds (lower, mid, upper). If False, only plots mid bounds.
        By default False.

    Returns
    -------
    matplotlib.figure.Figure
        The figure object containing the plots, if return_figure is True.
    dict
        Dictionary of observation dataframes by model_key, if return_dataframe is True.
    tuple
        Tuple of (figure, dataframes_dict) if both return_figure and return_dataframe are True.
    None
        If neither return_figure nor return_dataframe is True, displays the plots and returns None.

    Notes
    -----
    For discrete distributions, the CDF is calculated by:

        1. Sorting the predicted values
        2. Computing cumulative probabilities for each value
        3. Normalizing by the number of time points

    The plot shows three possible positions for each observation:

        * lower_cdf (pink): Uses the lower bound of the CDF interval
        * mid_cdf (green): Uses the midpoint of the CDF interval
        * upper_cdf (light blue): Uses the upper bound of the CDF interval

    The black points represent the model's predicted CDF values, calculated from the sorted
    values and their associated probabilities, while the colored points show where the actual
    observations fall within their predicted intervals. For a well-calibrated model, the
    observed values should fall within their predicted intervals, with the distribution of
    positions showing appropriate uncertainty.

    """
    # Sort prediction times by converting to minutes since midnight
    prediction_times_sorted: List[Tuple[int, int]] = sorted(
        prediction_times,
        key=lambda x: x[0] * 60 + x[1],
    )

    # Calculate figure parameters
    num_plots: int = len(prediction_times_sorted)
    figsize = figsize or (num_plots * 5, 4)

    # Create subplot layout
    fig: Figure
    axs: np.ndarray
    fig, axs = plt.subplots(1, num_plots, figsize=figsize)
    axs = [axs] if num_plots == 1 else axs

    # Define plotting types and colors
    all_types = ["lower", "mid", "upper"]
    plot_types = all_types if plot_all_bounds else ["mid"]
    colors: Dict[str, str] = {
        "lower": "#FF1493",  # deeppink
        "mid": "#228B22",  # chartreuse4/forest green
        "upper": "#ADD8E6",  # lightblue
    }

    all_obs_dfs: Dict[str, pd.DataFrame] = {}

    # Process each subplot
    for i, prediction_time in enumerate(prediction_times_sorted):
        model_key: str = get_model_key(model_name, prediction_time)
        prob_dist_dict: Dict = prob_dist_dict_all[model_key]

        if not prob_dist_dict:
            continue

        # Create distribution and observation dataframes
        all_distributions = _create_distribution_records(prob_dist_dict, all_types)
        distr_coll: pd.DataFrame = pd.DataFrame(all_distributions)

        all_observations = _create_observation_records(prob_dist_dict)
        adm_coll: pd.DataFrame = pd.DataFrame(all_observations)

        # For each actual observation, find its position in the predicted CDF
        # by matching datetime and admission count to get lower/mid/upper bounds
        merged_df: pd.DataFrame = pd.merge(
            adm_coll,
            distr_coll.rename(
                columns={
                    "num_adm_pred": "num_adm",
                    **{f"{t}_predicted_cdf": f"{t}_observed_cdf" for t in all_types},
                }
            ),
            on=["dt", "num_adm"],
            how="inner",
        )

        if merged_df.empty:
            continue

        all_obs_dfs[model_key] = merged_df
        ax = axs[i]
        num_time_points: int = len(prob_dist_dict)

        # Plot predictions and observations
        _plot_predictions(ax, distr_coll, num_time_points, plot_types)
        _plot_observations(ax, merged_df, plot_types, colors, i == 0)
        _setup_subplot(ax, prediction_time, i == 0)

    # Final plot configuration
    plt.tight_layout()
    if suptitle:
        plt.suptitle(suptitle, fontsize=16, y=1.05)
    if media_file_path:
        if file_name:
            filename = file_name
        else:
            filename = "plot_epudd.png"
        plt.savefig(media_file_path / filename, dpi=300)

    # Return based on flags
    if return_figure and return_dataframe:
        return fig, all_obs_dfs
    elif return_figure:
        return fig
    elif return_dataframe:
        plt.show()
        plt.close()
        return all_obs_dfs
    else:
        plt.show()
        plt.close()
        return None

estimated_probabilities

Visualization module for plotting estimated probabilities from trained models.

This module provides functions for creating distribution plots of estimated probabilities from trained classification models.

Functions:

Name Description
plot_estimated_probabilities : function

Plot estimated probability distributions for multiple models

plot_estimated_probabilities(trained_models, test_visits, exclude_from_training_data, bins=30, media_file_path=None, file_name=None, suptitle=None, return_figure=False, label_col='is_admitted')

Plot estimated probability distributions for multiple models.

Parameters:

Name Type Description Default
trained_models list[TrainedClassifier] or dict[str, TrainedClassifier]

List of TrainedClassifier objects or dict with TrainedClassifier values

required
test_visits DataFrame

DataFrame containing test visit data

required
exclude_from_training_data list

Columns to exclude from the test data

required
bins int

Number of bins for the histograms

30
media_file_path Path

Path where the plot should be saved

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "estimated_probabilities.png".

None
suptitle str

Optional super title for the entire figure

None
return_figure bool

If True, returns the figure instead of displaying it

False
label_col str

Name of the column containing the target labels

"is_admitted"

Returns:

Type Description
Figure or None

If return_figure is True, returns the figure object. Otherwise, displays the plot and returns None.

Source code in src/patientflow/viz/estimated_probabilities.py
def plot_estimated_probabilities(
    trained_models: list[TrainedClassifier] | dict[str, TrainedClassifier],
    test_visits,
    exclude_from_training_data,
    bins=30,
    media_file_path: Optional[Path] = None,
    file_name=None,
    suptitle: Optional[str] = None,
    return_figure=False,
    label_col: str = "is_admitted",
):
    """Plot estimated probability distributions for multiple models.

    Parameters
    ----------
    trained_models : list[TrainedClassifier] or dict[str, TrainedClassifier]
        List of TrainedClassifier objects or dict with TrainedClassifier values
    test_visits : pandas.DataFrame
        DataFrame containing test visit data
    exclude_from_training_data : list
        Columns to exclude from the test data
    bins : int, default=30
        Number of bins for the histograms
    media_file_path : Path, optional
        Path where the plot should be saved
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "estimated_probabilities.png".
    suptitle : str, optional
        Optional super title for the entire figure
    return_figure : bool, default=False
        If True, returns the figure instead of displaying it
    label_col : str, default="is_admitted"
        Name of the column containing the target labels

    Returns
    -------
    matplotlib.figure.Figure or None
        If return_figure is True, returns the figure object. Otherwise, displays
        the plot and returns None.
    """
    # Convert dict to list if needed
    if isinstance(trained_models, dict):
        trained_models = list(trained_models.values())

    # Sort trained_models by prediction time
    trained_models_sorted = sorted(
        trained_models,
        key=lambda x: x.training_results.prediction_time[0] * 60
        + x.training_results.prediction_time[1],
    )
    num_plots = len(trained_models_sorted)
    fig, axs = plt.subplots(1, num_plots, figsize=(num_plots * 5, 4))

    # Handle case of single prediction time
    if num_plots == 1:
        axs = [axs]

    for i, trained_model in enumerate(trained_models_sorted):
        # Use calibrated pipeline if available, otherwise use regular pipeline
        if (
            hasattr(trained_model, "calibrated_pipeline")
            and trained_model.calibrated_pipeline is not None
        ):
            pipeline = trained_model.calibrated_pipeline
        else:
            pipeline = trained_model.pipeline

        prediction_time = trained_model.training_results.prediction_time

        # Get test data for this prediction time
        X_test, y_test = prepare_patient_snapshots(
            df=test_visits,
            prediction_time=prediction_time,
            exclude_columns=exclude_from_training_data,
            single_snapshot_per_visit=False,
            label_col=label_col,
        )

        X_test = add_missing_columns(pipeline, X_test)

        # Get predictions
        y_pred_proba = pipeline.predict_proba(X_test)[:, 1]

        # Separate predictions for positive and negative cases
        pos_preds = y_pred_proba[y_test == 1]
        neg_preds = y_pred_proba[y_test == 0]

        ax = axs[i]
        hour, minutes = prediction_time

        # Plot distributions
        ax.hist(
            neg_preds,
            bins=bins,
            alpha=0.5,
            color=primary_color,
            density=True,
            label="Negative Cases",
            histtype="step",
            linewidth=2,
        )
        ax.hist(
            pos_preds,
            bins=bins,
            alpha=0.5,
            color=secondary_color,
            density=True,
            label="Positive Cases",
            histtype="step",
            linewidth=2,
        )

        # Optional: Fill with lower opacity
        ax.hist(neg_preds, bins=bins, alpha=0.2, color=primary_color, density=True)
        ax.hist(pos_preds, bins=bins, alpha=0.2, color=secondary_color, density=True)

        ax.set_title(
            f"Distribution of Estimated Probabilities at {hour}:{minutes:02}",
            fontsize=14,
        )
        ax.set_xlabel("Estimated Probability", fontsize=12)
        ax.set_ylabel("Density", fontsize=12)
        ax.set_xlim(0, 1)
        ax.legend()

    plt.tight_layout()

    # Add suptitle if provided
    if suptitle is not None:
        plt.suptitle(suptitle, y=1.05, fontsize=16)

    if media_file_path:
        if file_name:
            filename = file_name
        else:
            filename = "estimated_probabilities.png"
        plt.savefig(media_file_path / filename, dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close()
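
A usage sketch with a dictionary of models, as accepted by the trained_models parameter (the key names and excluded columns are placeholders):

>>> plot_estimated_probabilities(
...     trained_models={"admissions_0930": model_0930, "admissions_1530": model_1530},
...     test_visits=test_visits,
...     exclude_from_training_data=["visit_id"],
...     bins=50,
... )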

features

Visualisation module for plotting feature importances from trained models.

This module provides functionality to visualize feature importances from trained classifiers, allowing for comparison across different prediction time points.

Functions:

Name Description
plot_features : function

Plot feature importance for multiple models

plot_features(trained_models, media_file_path=None, file_name=None, top_n=20, suptitle=None, return_figure=False)

Plot feature importance for multiple models.

Parameters:

Name Type Description Default
trained_models list[TrainedClassifier] or dict[str, TrainedClassifier]

List of TrainedClassifier objects or dictionary with TrainedClassifier values.

required
media_file_path Path

Path where the plot should be saved. If None, the plot is only displayed.

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "feature_importance_plots.png".

None
top_n int

Number of top features to display.

20
suptitle str

Super title for the entire figure.

None
return_figure bool

If True, returns the figure instead of displaying it.

False

Returns:

Type Description
Figure or None

The matplotlib figure if return_figure is True, otherwise None.

Notes

The function sorts models by prediction time and creates a horizontal bar plot for each model showing the top N most important features. Feature names are truncated to 25 characters for better display.

Source code in src/patientflow/viz/features.py
def plot_features(
    trained_models: list[TrainedClassifier] | dict[str, TrainedClassifier],
    media_file_path: Optional[Path] = None,
    file_name=None,
    top_n: int = 20,
    suptitle: Optional[str] = None,
    return_figure: bool = False,
) -> Optional[plt.Figure]:
    """Plot feature importance for multiple models.

    Parameters
    ----------
    trained_models : list[TrainedClassifier] or dict[str, TrainedClassifier]
        List of TrainedClassifier objects or dictionary with TrainedClassifier values.
    media_file_path : Path, optional
        Path where the plot should be saved. If None, the plot is only displayed.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "feature_importance_plots.png".
    top_n : int, default=20
        Number of top features to display.
    suptitle : str, optional
        Super title for the entire figure.
    return_figure : bool, default=False
        If True, returns the figure instead of displaying it.

    Returns
    -------
    plt.Figure or None
        The matplotlib figure if return_figure is True, otherwise None.

    Notes
    -----
    The function sorts models by prediction time and creates a horizontal bar plot
    for each model showing the top N most important features. Feature names are
    truncated to 25 characters for better display.
    """
    # Convert dict to list if needed
    if isinstance(trained_models, dict):
        trained_models = list(trained_models.values())

    # Sort trained_models by prediction time
    trained_models_sorted = sorted(
        trained_models,
        key=lambda x: x.training_results.prediction_time[0] * 60
        + x.training_results.prediction_time[1],
    )

    num_plots = len(trained_models_sorted)
    fig, axs = plt.subplots(1, num_plots, figsize=(num_plots * 6, 12))

    # Handle case of single prediction time
    if num_plots == 1:
        axs = [axs]

    for i, trained_model in enumerate(trained_models_sorted):
        # Always use regular pipeline
        pipeline: Pipeline = trained_model.pipeline
        prediction_time = trained_model.training_results.prediction_time

        # Get feature names from the pipeline
        transformed_cols = pipeline.named_steps[
            "feature_transformer"
        ].get_feature_names_out()
        transformed_cols = [col.split("__")[-1] for col in transformed_cols]
        truncated_cols = [col[:25] for col in transformed_cols]

        # Get feature importances
        feature_importances = pipeline.named_steps["classifier"].feature_importances_
        indices = np.argsort(feature_importances)[
            -top_n:
        ]  # Get indices of the top N features

        # Plot for this prediction time
        ax = axs[i]
        hour, minutes = prediction_time
        ax.barh(range(len(indices)), feature_importances[indices], align="center")
        ax.set_yticks(range(len(indices)))
        ax.set_yticklabels(np.array(truncated_cols)[indices])
        ax.set_xlabel("Importance")
        ax.set_ylabel("Features")
        ax.set_title(f"Feature Importances for {hour}:{minutes:02}")

    plt.tight_layout()

    # Add suptitle if provided
    if suptitle is not None:
        plt.suptitle(suptitle, y=1.05, fontsize=16)

    if media_file_path:
        # Save and display plot
        if file_name:
            feature_plot_path = media_file_path / file_name
        else:
            feature_plot_path = media_file_path / "feature_importance_plots.png"
        plt.savefig(feature_plot_path, bbox_inches="tight")

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close()
        return None
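
A usage sketch, assuming each pipeline's classifier step exposes feature_importances_ (as tree-based models do); the suptitle text is a placeholder:

>>> fig = plot_features(
...     trained_models=trained_models,
...     top_n=10,
...     suptitle="Feature importances by prediction time",
...     return_figure=True,
... )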

madcap

Module for generating MADCAP (Model Accuracy and Discriminative Calibration Plots) visualizations.

MADCAP plots compare model-predicted probabilities to observed outcomes, helping to assess model calibration and discrimination. The plots can be generated for individual prediction times or for specific groups (e.g., age groups).

Functions:

Name Description
classify_age : function

Classifies age into categories based on numeric values or age group strings.

plot_madcap : function

Generates MADCAP plots for a list of trained models, comparing estimated probabilities to observed values.

_plot_madcap_subplot : function

Plots a single MADCAP subplot showing cumulative predicted and observed values.

_plot_madcap_by_group_single : function

Generates MADCAP plots for specific groups at a given prediction time.

plot_madcap_by_group : function

Generates MADCAP plots for different groups across multiple prediction times.

plot_madcap_by_group

Generates MADCAP plots for groups (e.g., age groups) across a series of prediction times.

classify_age(age, age_categories=None)

Classify age into categories based on numeric values or age group strings.

Parameters:

Name Type Description Default
age int, float, or str

Age value (e.g., 30) or age group string (e.g., '18-24').

required
age_categories dict

Dictionary defining age categories and their ranges. If not provided, uses DEFAULT_AGE_CATEGORIES. Expected format:

{
    "category_name": {
        "numeric": {"min": min_age, "max": max_age},
        "groups": ["age_group1", "age_group2", ...]
    }
}

None

Returns:

Type Description
str

Category name based on the age or age group, or 'unknown' for unexpected or invalid values.

Examples:

>>> classify_age(25)
'adults'
>>> classify_age('65-74')
'65 or over'
Source code in src/patientflow/viz/madcap.py
def classify_age(age, age_categories=None):
    """Classify age into categories based on numeric values or age group strings.

    Parameters
    ----------
    age : int, float, or str
        Age value (e.g., 30) or age group string (e.g., '18-24').
    age_categories : dict, optional
        Dictionary defining age categories and their ranges. If not provided, uses DEFAULT_AGE_CATEGORIES.
        Expected format:
        {
            "category_name": {
                "numeric": {"min": min_age, "max": max_age},
                "groups": ["age_group1", "age_group2", ...]
            }
        }

    Returns
    -------
    str
        Category name based on the age or age group, or 'unknown' for unexpected or invalid values.

    Examples
    --------
    >>> classify_age(25)
    'adults'
    >>> classify_age('65-74')
    '65 or over'
    """
    if age_categories is None:
        age_categories = DEFAULT_AGE_CATEGORIES

    if isinstance(age, (int, float)):
        for category, rules in age_categories.items():
            numeric_rules = rules.get("numeric", {})
            min_age = numeric_rules.get("min", float("-inf"))
            max_age = numeric_rules.get("max", float("inf"))

            if min_age <= age <= max_age:
                return category
        return "unknown"
    elif isinstance(age, str):
        for category, rules in age_categories.items():
            if age in rules.get("groups", []):
                return category
        return "unknown"
    else:
        return "unknown"

plot_madcap(trained_models, test_visits, exclude_from_training_data, media_file_path=None, file_name=None, suptitle=None, return_figure=False, label_col='is_admitted')

Generate MADCAP plots for a list of trained models.

Parameters:

Name Type Description Default
trained_models list[TrainedClassifier] or dict[str, TrainedClassifier]

List of trained classifier objects or dictionary with TrainedClassifier values.

required
test_visits DataFrame

DataFrame containing test visit data.

required
exclude_from_training_data List[str]

List of columns to exclude from training data.

required
media_file_path Path

Directory path where the generated plots will be saved.

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "madcap_plot.png".

None
suptitle str

Suptitle for the plot.

None
return_figure bool

If True, returns the figure object instead of displaying it.

False
label_col str

Name of the column containing the target labels.

"is_admitted"

Returns:

Type Description
Optional[Figure]

The figure if return_figure is True, None otherwise.

Source code in src/patientflow/viz/madcap.py
def plot_madcap(
    trained_models: list[TrainedClassifier] | dict[str, TrainedClassifier],
    test_visits: pd.DataFrame,
    exclude_from_training_data: List[str],
    media_file_path: Optional[Path] = None,
    file_name: Optional[str] = None,
    suptitle: Optional[str] = None,
    return_figure: bool = False,
    label_col: str = "is_admitted",
) -> Optional[plt.Figure]:
    """Generate MADCAP plots for a list of trained models.

    Parameters
    ----------
    trained_models : list[TrainedClassifier] or dict[str, TrainedClassifier]
        List of trained classifier objects or dictionary with TrainedClassifier values.
    test_visits : pd.DataFrame
        DataFrame containing test visit data.
    exclude_from_training_data : List[str]
        List of columns to exclude from training data.
    media_file_path : Path, optional
        Directory path where the generated plots will be saved.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "madcap_plot.png".
    suptitle : str, optional
        Suptitle for the plot.
    return_figure : bool, default=False
        If True, returns the figure object instead of displaying it.
    label_col : str, default="is_admitted"
        Name of the column containing the target labels.

    Returns
    -------
    Optional[plt.Figure]
        The figure if return_figure is True, None otherwise.
    """
    # Convert dict to list if needed
    if isinstance(trained_models, dict):
        trained_models = list(trained_models.values())

    # Sort trained_models by prediction time
    trained_models_sorted = sorted(
        trained_models,
        key=lambda x: x.training_results.prediction_time[0] * 60
        + x.training_results.prediction_time[1],
    )
    num_plots = len(trained_models_sorted)

    # Calculate the number of rows and columns for the subplots
    num_cols = min(num_plots, 5)  # Maximum 5 columns
    num_rows = math.ceil(num_plots / num_cols)

    fig, axes = plt.subplots(num_rows, num_cols, figsize=(num_cols * 5, num_rows * 4))

    # Handle the case of a single plot differently
    if num_plots == 1:
        # When there's only one plot, axes is a single Axes object, not an array
        trained_model = trained_models_sorted[0]

        # Use calibrated pipeline if available, otherwise use regular pipeline
        if (
            hasattr(trained_model, "calibrated_pipeline")
            and trained_model.calibrated_pipeline is not None
        ):
            pipeline = trained_model.calibrated_pipeline
        else:
            pipeline = trained_model.pipeline

        prediction_time = trained_model.training_results.prediction_time

        # Get test data for this prediction time
        X_test, y_test = prepare_patient_snapshots(
            df=test_visits,
            prediction_time=prediction_time,
            exclude_columns=exclude_from_training_data,
            single_snapshot_per_visit=False,
            label_col=label_col,
        )

        X_test = add_missing_columns(pipeline, X_test)
        predict_proba = pipeline.predict_proba(X_test)[:, 1]

        # Plot directly on the single axes
        _plot_madcap_subplot(predict_proba, y_test, prediction_time, axes)
    else:
        # For multiple plots, ensure axes is always a 2D array
        if num_rows == 1:
            axes = axes.reshape(1, -1)

        for i, trained_model in enumerate(trained_models_sorted):
            # Use calibrated pipeline if available, otherwise use regular pipeline
            if (
                hasattr(trained_model, "calibrated_pipeline")
                and trained_model.calibrated_pipeline is not None
            ):
                pipeline = trained_model.calibrated_pipeline
            else:
                pipeline = trained_model.pipeline

            prediction_time = trained_model.training_results.prediction_time

            # Get test data for this prediction time
            X_test, y_test = prepare_patient_snapshots(
                df=test_visits,
                prediction_time=prediction_time,
                exclude_columns=exclude_from_training_data,
                single_snapshot_per_visit=False,
                label_col=label_col,
            )

            X_test = add_missing_columns(pipeline, X_test)
            predict_proba = pipeline.predict_proba(X_test)[:, 1]

            row = i // num_cols
            col = i % num_cols
            _plot_madcap_subplot(predict_proba, y_test, prediction_time, axes[row, col])

        # Hide any unused subplots
        for j in range(i + 1, num_rows * num_cols):
            row = j // num_cols
            col = j % num_cols
            axes[row, col].axis("off")

    plt.tight_layout()

    # Add suptitle if provided
    if suptitle:
        fig.suptitle(suptitle, fontsize=16, y=1.05)
        # Adjust layout to accommodate suptitle
        plt.subplots_adjust(top=0.85)

    if media_file_path:
        plot_name = file_name if file_name else "madcap_plot.png"
        madcap_plot_path = Path(media_file_path) / plot_name
        plt.savefig(madcap_plot_path, bbox_inches="tight")

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close(fig)
        return None
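
A minimal usage sketch (hedged: `trained_models`, `test_visits`, and the excluded column names below are placeholders for objects produced elsewhere in the package, not values defined by this module):

    from pathlib import Path
    from patientflow.viz.madcap import plot_madcap

    # `trained_models` is assumed to be a dict of TrainedClassifier objects, and
    # `test_visits` a DataFrame of patient snapshots; the excluded columns are
    # hypothetical identifiers, not model features.
    plot_madcap(
        trained_models=trained_models,
        test_visits=test_visits,
        exclude_from_training_data=["visit_number", "arrival_datetime"],
        media_file_path=Path("figures"),  # an existing directory
        suptitle="MADCAP plots by prediction time",
    )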

plot_madcap_by_group(trained_models, test_visits, exclude_from_training_data, grouping_var, grouping_var_name, media_file_path=None, file_name=None, plot_difference=False, return_figure=False, label_col='is_admitted')

Generate MADCAP plots for different groups across multiple prediction times.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| trained_models | list[TrainedClassifier] or dict[str, TrainedClassifier] | List of trained classifier objects or dictionary with TrainedClassifier values. | required |
| test_visits | DataFrame | DataFrame containing the test visit data. | required |
| exclude_from_training_data | List[str] | List of columns to exclude from training data. | required |
| grouping_var | str | The column name in the dataset that defines the grouping variable. | required |
| grouping_var_name | str | A descriptive name for the grouping variable, used in plot titles. | required |
| media_file_path | Path | Directory path where the generated plots will be saved. | None |
| file_name | str | Custom filename to use when saving the plot. If not provided, defaults to a generated name based on group and time. | None |
| plot_difference | bool | If True, includes difference plot between predicted and observed outcomes. | False |
| return_figure | bool | If True, returns a list of figure objects instead of displaying them. | False |
| label_col | str | Name of the column containing the target labels. | "is_admitted" |

Returns:

| Type | Description |
| --- | --- |
| Optional[List[Figure]] | List of figures if return_figure is True, None otherwise. |

Source code in src/patientflow/viz/madcap.py
def plot_madcap_by_group(
    trained_models: list[TrainedClassifier] | dict[str, TrainedClassifier],
    test_visits: pd.DataFrame,
    exclude_from_training_data: List[str],
    grouping_var: str,
    grouping_var_name: str,
    media_file_path: Optional[Path] = None,
    file_name: Optional[str] = None,
    plot_difference: bool = False,
    return_figure: bool = False,
    label_col: str = "is_admitted",
) -> Optional[List[plt.Figure]]:
    """Generate MADCAP plots for different groups across multiple prediction times.

    Parameters
    ----------
    trained_models : list[TrainedClassifier] or dict[str, TrainedClassifier]
        List of trained classifier objects or dictionary with TrainedClassifier values.
    test_visits : pd.DataFrame
        DataFrame containing the test visit data.
    exclude_from_training_data : List[str]
        List of columns to exclude from training data.
    grouping_var : str
        The column name in the dataset that defines the grouping variable.
    grouping_var_name : str
        A descriptive name for the grouping variable, used in plot titles.
    media_file_path : Path, optional
        Directory path where the generated plots will be saved.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to a generated name based on group and time.
    plot_difference : bool, default=False
        If True, includes difference plot between predicted and observed outcomes.
    return_figure : bool, default=False
        If True, returns a list of figure objects instead of displaying them.
    label_col : str, default="is_admitted"
        Name of the column containing the target labels.

    Returns
    -------
    Optional[List[plt.Figure]]
        List of figures if return_figure is True, None otherwise.
    """
    # Convert dict to list if needed
    if isinstance(trained_models, dict):
        trained_models = list(trained_models.values())

    # Sort trained_models by prediction time
    trained_models_sorted = sorted(
        trained_models,
        key=lambda x: x.training_results.prediction_time[0] * 60
        + x.training_results.prediction_time[1],
    )

    figures = []
    for trained_model in trained_models_sorted:
        # Use calibrated pipeline if available, otherwise use regular pipeline
        if (
            hasattr(trained_model, "calibrated_pipeline")
            and trained_model.calibrated_pipeline is not None
        ):
            pipeline = trained_model.calibrated_pipeline
        else:
            pipeline = trained_model.pipeline

        prediction_time = trained_model.training_results.prediction_time

        # Get test data for this prediction time
        X_test, y_test = prepare_patient_snapshots(
            df=test_visits,
            prediction_time=prediction_time,
            exclude_columns=exclude_from_training_data,
            single_snapshot_per_visit=False,
            label_col=label_col,
        )

        # Check if the grouping variable exists in X_test columns
        if grouping_var not in X_test.columns:
            raise ValueError(f"'{grouping_var}' not found in the dataset columns.")

        X_test = add_missing_columns(pipeline, X_test)
        predict_proba = pipeline.predict_proba(X_test)[:, 1]

        # Apply classification based on the grouping variable
        if grouping_var == "age_group":
            group = X_test["age_group"].apply(classify_age)
        elif grouping_var == "age_on_arrival":
            group = X_test["age_on_arrival"].apply(classify_age)
        else:
            group = X_test[grouping_var]

        fig = _plot_madcap_by_group_single(
            predict_proba,
            y_test,
            group,
            prediction_time,
            grouping_var_name,
            media_file_path,
            file_name=file_name,
            plot_difference=plot_difference,
            return_figure=True,
        )
        if return_figure:
            figures.append(fig)

    if return_figure:
        return figures
    else:
        return None
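
A hedged usage sketch, reusing the same assumed `trained_models` and `test_visits` as above; "age_group" is a hypothetical column name for illustration:

    from patientflow.viz.madcap import plot_madcap_by_group

    figures = plot_madcap_by_group(
        trained_models=trained_models,
        test_visits=test_visits,
        exclude_from_training_data=["visit_number", "arrival_datetime"],
        grouping_var="age_group",        # must exist in the test data
        grouping_var_name="Age group",
        plot_difference=True,
        return_figure=True,              # collect one figure per prediction time
    )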

observed_against_expected

Visualisation utilities for evaluating patient flow predictions.

This module provides functions for creating visualisations to evaluate the accuracy and performance of patient flow predictions, with a particular focus on comparing observed against expected values.

Functions:

Name Description
plot_deltas : function

Plot histograms of observed minus expected values

plot_arrival_delta_single_instance : function

Plot comparison between observed arrivals and expected arrival rates

plot_arrival_deltas : function

Plot delta charts for multiple snapshot dates on the same figure

plot_arrival_delta_single_instance(df, prediction_time, snapshot_date, prediction_window, yta_time_interval=timedelta(minutes=15), show_delta=True, show_only_delta=False, media_file_path=None, file_name=None, return_figure=False, fig_size=(10, 4))

Plot comparison between observed arrivals and expected arrival rates.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | DataFrame containing arrival data | required |
| prediction_time | tuple | (hour, minute) of prediction time | required |
| snapshot_date | date | Date to analyze | required |
| prediction_window | timedelta | Length of the prediction window | required |
| yta_time_interval | timedelta | Time interval for calculating arrival rates | timedelta(minutes=15) |
| show_delta | bool | If True, plot the difference between actual and expected arrivals | True |
| show_only_delta | bool | If True, only plot the delta between actual and expected arrivals | False |
| media_file_path | Path | Path to save the plot | None |
| file_name | str | Custom filename to use when saving the plot. If not provided, defaults to "arrival_comparison.png" | None |
| return_figure | bool | If True, returns the figure instead of displaying it | False |
| fig_size | tuple | Figure size as (width, height) in inches | (10, 4) |

Returns:

| Type | Description |
| --- | --- |
| Figure or None | The figure object if return_figure is True, otherwise None |

Source code in src/patientflow/viz/observed_against_expected.py
def plot_arrival_delta_single_instance(
    df,
    prediction_time,
    snapshot_date,
    prediction_window: timedelta,
    yta_time_interval: timedelta = timedelta(minutes=15),
    show_delta=True,
    show_only_delta=False,
    media_file_path=None,
    file_name=None,
    return_figure=False,
    fig_size=(10, 4),
):
    """Plot comparison between observed arrivals and expected arrival rates.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing arrival data
    prediction_time : tuple
        (hour, minute) of prediction time
    snapshot_date : datetime.date
        Date to analyze
    prediction_window : timedelta
        Length of the prediction window
    yta_time_interval : timedelta, default=timedelta(minutes=15)
        Time interval for calculating arrival rates
    show_delta : bool, default=True
        If True, plot the difference between actual and expected arrivals
    show_only_delta : bool, default=False
        If True, only plot the delta between actual and expected arrivals
    media_file_path : Path, optional
        Path to save the plot
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "arrival_comparison.png"
    return_figure : bool, default=False
        If True, returns the figure instead of displaying it
    fig_size : tuple, default=(10, 4)
        Figure size as (width, height) in inches

    Returns
    -------
    matplotlib.figure.Figure or None
        The figure object if return_figure is True, otherwise None
    """
    # Prepare data
    df_copy, snapshot_datetime, default_datetime, prediction_time_obj = (
        _prepare_arrival_data(
            df, prediction_time, snapshot_date, prediction_window, yta_time_interval
        )
    )

    # Get arrivals within the prediction window
    arrivals = df_copy[
        (df_copy.index > snapshot_datetime)
        & (df_copy.index <= snapshot_datetime + prediction_window)
    ]

    # Sort arrivals by time and create cumulative count
    arrivals = arrivals.sort_values("arrival_datetime")
    arrivals["cumulative_count"] = range(1, len(arrivals) + 1)

    # Calculate arrival rates and prepare time points
    mean_arrival_rates = _calculate_arrival_rates(
        df_copy, prediction_time_obj, prediction_window, yta_time_interval
    )

    # Prepare arrival times
    arrival_times_piecewise = _prepare_arrival_times(
        mean_arrival_rates, prediction_time_obj, default_date=datetime(2024, 1, 1)
    )

    # Calculate cumulative rates
    cumulative_rates = _calculate_cumulative_rates(
        arrival_times_piecewise, mean_arrival_rates
    )

    # Create figure with subplots if showing delta
    if show_delta and not show_only_delta:
        fig, (ax1, ax2) = plt.subplots(
            2, 1, figsize=(fig_size[0], fig_size[1] * 2), sharex=True
        )
        ax = ax1
    else:
        fig = plt.figure(figsize=fig_size)  # keep a handle so fig can be returned below
        ax = plt.gca()

    # Ensure arrivals index is timezone-aware
    if arrivals.index.tz is None:
        arrivals.index = arrivals.index.tz_localize("UTC")

    # Convert arrival times to use default date for plotting
    arrival_times_plot = [
        default_datetime + (t - snapshot_datetime) for t in arrivals.index
    ]

    # Create combined timeline
    all_times = _create_combined_timeline(
        default_datetime, arrival_times_plot, prediction_window, arrival_times_piecewise
    )

    # Interpolate both actual and expected to the combined timeline
    actual_counts = np.interp(
        [t.timestamp() for t in all_times],
        [
            t.timestamp()
            for t in [default_datetime]
            + arrival_times_plot
            + [default_datetime + prediction_window]
        ],
        [0]
        + list(arrivals["cumulative_count"])
        + [arrivals["cumulative_count"].iloc[-1] if len(arrivals) > 0 else 0],
    )

    expected_counts = np.interp(
        [t.timestamp() for t in all_times],
        [t.timestamp() for t in arrival_times_piecewise],
        cumulative_rates,
    )

    # Calculate delta
    delta = actual_counts - expected_counts
    delta[0] = 0  # Ensure delta starts at 0

    if not show_only_delta:
        # Plot actual and expected arrivals
        ax.step(
            [default_datetime]
            + arrival_times_plot
            + [default_datetime + prediction_window],
            [0]
            + list(arrivals["cumulative_count"])
            + [arrivals["cumulative_count"].iloc[-1] if len(arrivals) > 0 else 0],
            where="post",
            label="Actual Arrivals",
        )
        ax.scatter(
            arrival_times_piecewise,
            cumulative_rates,
            label="Expected Arrivals",
            color="orange",
        )

        ax.set_xlabel("Time")
        ax.set_title(
            f"Cumulative Arrivals in the {int(prediction_window.total_seconds()/3600)} hours after {format_prediction_time(prediction_time)} on {snapshot_date}"
        )
        ax.legend()

    if show_delta or show_only_delta:
        if show_only_delta:
            _plot_arrival_delta_chart(
                ax, all_times, delta, prediction_time, prediction_window, snapshot_date
            )
        else:
            _plot_arrival_delta_chart(
                ax2, all_times, delta, prediction_time, prediction_window, snapshot_date
            )
        plt.tight_layout()

    # Format time axis for all subplots
    for ax in plt.gcf().get_axes():
        _format_time_axis(ax, all_times)

    if media_file_path:
        filename = file_name if file_name else "arrival_comparison.png"
        plt.savefig(media_file_path / filename, dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close()
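
A minimal usage sketch, assuming `arrivals_df` holds one row per arrival with an `arrival_datetime` column (the exact frame layout is prepared internally by `_prepare_arrival_data`):

    from datetime import date, timedelta
    from patientflow.viz.observed_against_expected import (
        plot_arrival_delta_single_instance,
    )

    plot_arrival_delta_single_instance(
        arrivals_df,
        prediction_time=(9, 30),                # 09:30 snapshot
        snapshot_date=date(2025, 3, 1),         # illustrative date
        prediction_window=timedelta(hours=8),
        yta_time_interval=timedelta(minutes=15),
        show_delta=True,
    )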

plot_arrival_deltas(df, prediction_time, snapshot_dates, prediction_window, yta_time_interval=timedelta(minutes=15), media_file_path=None, file_name=None, return_figure=False, fig_size=(15, 6))

Plot delta charts for multiple snapshot dates on the same figure.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | DataFrame containing arrival data | required |
| prediction_time | tuple | (hour, minute) of prediction time | required |
| snapshot_dates | list | List of datetime.date objects to analyze | required |
| prediction_window | timedelta | Length of the prediction window | required |
| yta_time_interval | timedelta | Time interval for calculating arrival rates | timedelta(minutes=15) |
| media_file_path | Path | Path to save the plot | None |
| file_name | str | Custom filename to use when saving the plot. If not provided, defaults to "multiple_deltas.png" | None |
| return_figure | bool | If True, returns the figure instead of displaying it | False |
| fig_size | tuple | Figure size as (width, height) in inches | (15, 6) |

Returns:

| Type | Description |
| --- | --- |
| Figure or None | The figure object if return_figure is True, otherwise None |

Source code in src/patientflow/viz/observed_against_expected.py
def plot_arrival_deltas(
    df,
    prediction_time,
    snapshot_dates,
    prediction_window: timedelta,
    yta_time_interval: timedelta = timedelta(minutes=15),
    media_file_path=None,
    file_name=None,
    return_figure=False,
    fig_size=(15, 6),
):
    """Plot delta charts for multiple snapshot dates on the same figure.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing arrival data
    prediction_time : tuple
        (hour, minute) of prediction time
    snapshot_dates : list
        List of datetime.date objects to analyze
    prediction_window : timedelta
        Length of the prediction window
    yta_time_interval : timedelta, default=timedelta(minutes=15)
        Time interval for calculating arrival rates
    media_file_path : Path, optional
        Path to save the plot
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "multiple_deltas.png"
    return_figure : bool, default=False
        If True, returns the figure instead of displaying it
    fig_size : tuple, default=(15, 6)
        Figure size as (width, height) in inches

    Returns
    -------
    matplotlib.figure.Figure or None
        The figure object if return_figure is True, otherwise None
    """
    # Create figure with subplots
    fig = plt.figure(figsize=fig_size)
    gs = plt.GridSpec(1, 2, width_ratios=[2, 1])
    ax1 = plt.subplot(gs[0])
    ax2 = plt.subplot(gs[1])

    # Store all deltas for averaging
    all_deltas = []
    all_times_list = []
    final_deltas = []  # Store final delta values for histogram

    # Calculate common values once
    prediction_time_obj, default_datetime = _prepare_common_values(prediction_time)

    for snapshot_date in snapshot_dates:
        # Prepare data for this date
        df_copy, snapshot_datetime, _, _ = _prepare_arrival_data(
            df, prediction_time, snapshot_date, prediction_window, yta_time_interval
        )

        # Get arrivals within the prediction window
        arrivals = df_copy[
            (df_copy.index > snapshot_datetime)
            & (df_copy.index <= snapshot_datetime + pd.Timedelta(prediction_window))
        ]

        if len(arrivals) == 0:
            continue

        # Sort arrivals by time and create cumulative count
        arrivals = arrivals.sort_values("arrival_datetime")
        arrivals["cumulative_count"] = range(1, len(arrivals) + 1)

        # Calculate arrival rates and prepare time points
        mean_arrival_rates = _calculate_arrival_rates(
            df_copy, prediction_time_obj, prediction_window, yta_time_interval
        )

        # Prepare arrival times
        arrival_times_piecewise = _prepare_arrival_times(
            mean_arrival_rates, prediction_time_obj, default_date=datetime(2024, 1, 1)
        )

        # Calculate cumulative rates
        cumulative_rates = _calculate_cumulative_rates(
            arrival_times_piecewise, mean_arrival_rates
        )

        # Convert arrival times to use default date for plotting
        arrival_times_plot = [
            default_datetime + (t - snapshot_datetime) for t in arrivals.index
        ]

        # Create combined timeline
        all_times = _create_combined_timeline(
            default_datetime,
            arrival_times_plot,
            prediction_window,
            arrival_times_piecewise,
        )

        # Interpolate both actual and expected to the combined timeline
        actual_counts = np.interp(
            [t.timestamp() for t in all_times],
            [
                t.timestamp()
                for t in [default_datetime]
                + arrival_times_plot
                + [default_datetime + pd.Timedelta(prediction_window)]
            ],
            [0]
            + list(arrivals["cumulative_count"])
            + [arrivals["cumulative_count"].iloc[-1]],
        )

        expected_counts = np.interp(
            [t.timestamp() for t in all_times],
            [t.timestamp() for t in arrival_times_piecewise],
            cumulative_rates,
        )

        # Calculate delta
        delta = actual_counts - expected_counts
        delta[0] = 0  # Ensure delta starts at 0

        # Store for averaging
        all_deltas.append(delta)
        all_times_list.append(all_times)

        # Store final delta value for histogram
        final_deltas.append(delta[-1])

        # Plot delta for this snapshot date
        ax1.step(all_times, delta, where="post", color="grey", alpha=0.5)

    # Calculate and plot average delta
    if all_deltas:
        # Find the common time points across all dates
        common_times = sorted(set().union(*[set(times) for times in all_times_list]))

        # Interpolate all deltas to common time points
        interpolated_deltas = []
        for times, delta in zip(all_times_list, all_deltas):
            # Only interpolate within the actual time range for each date
            min_time = min(times)
            max_time = max(times)
            valid_times = [t for t in common_times if min_time <= t <= max_time]

            if valid_times:
                interpolated = np.interp(
                    [t.timestamp() for t in valid_times],
                    [t.timestamp() for t in times],
                    delta,
                )
                # Pad with NaN for times outside the valid range
                padded = np.full(len(common_times), np.nan)
                valid_indices = [
                    i for i, t in enumerate(common_times) if t in valid_times
                ]
                padded[valid_indices] = interpolated
                interpolated_deltas.append(padded)

        # Calculate average delta, ignoring NaN values
        avg_delta = np.nanmean(interpolated_deltas, axis=0)

        # Plot average delta as a solid line
        # Only plot where we have valid data (not NaN)
        valid_mask = ~np.isnan(avg_delta)
        if np.any(valid_mask):
            ax1.step(
                [t for t, m in zip(common_times, valid_mask) if m],
                avg_delta[valid_mask],
                where="post",
                color="red",
                linewidth=2,
            )

    # Add horizontal line at y=0
    ax1.axhline(y=0, color="gray", linestyle="--", alpha=0.5)

    # Format the main plot
    ax1.set_xlabel("Time")
    ax1.set_ylabel("Difference (Actual - Expected)")
    ax1.set_title(
        f"Difference Between Actual and Expected Arrivals in the {(int(prediction_window.total_seconds()/3600))} hours after {format_prediction_time(prediction_time)} on all dates"
    )

    # Format time axis
    _format_time_axis(ax1, common_times)

    # Create histogram of final delta values
    if final_deltas:
        # Round values to nearest integer for binning
        rounded_deltas = np.round(final_deltas)
        unique_values = np.unique(rounded_deltas)

        # Create bins centered on integer values
        bin_edges = np.arange(unique_values.min() - 0.5, unique_values.max() + 1.5, 1)

        ax2.hist(final_deltas, bins=bin_edges, color="grey", alpha=0.7)
        ax2.axvline(x=0, color="gray", linestyle="--", alpha=0.5)
        ax2.set_xlabel("Final Difference (Actual - Expected)")
        ax2.set_ylabel("Count")
        ax2.set_title("Distribution of Final Differences")

        # Set x-axis ticks to integer values with appropriate spacing
        value_range = unique_values.max() - unique_values.min()
        step_size = max(1, int(value_range / 10))  # Aim for about 10 ticks
        ax2.set_xticks(
            np.arange(unique_values.min(), unique_values.max() + 1, step_size)
        )

    plt.tight_layout()

    if media_file_path:
        filename = file_name if file_name else "multiple_deltas.png"
        plt.savefig(media_file_path / filename, dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close()
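
A hedged sketch overlaying deltas for a week of snapshot dates (same assumed `arrivals_df` as above):

    from datetime import date, timedelta
    from patientflow.viz.observed_against_expected import plot_arrival_deltas

    snapshot_dates = [date(2025, 3, d) for d in range(1, 8)]  # illustrative dates
    plot_arrival_deltas(
        arrivals_df,
        prediction_time=(9, 30),
        snapshot_dates=snapshot_dates,
        prediction_window=timedelta(hours=8),
    )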

plot_deltas(results1, results2=None, title1=None, title2=None, main_title='Histograms of Observed - Expected Values', xlabel='Observed minus expected', media_file_path=None, file_name=None, return_figure=False)

Plot histograms of observed minus expected values.

Creates a grid of histograms showing the distribution of differences between observed and expected values for different prediction times. Optionally compares two sets of results side by side.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| results1 | dict | First set of results containing observed and expected values for different prediction times. Keys are prediction times, values are dicts with 'observed' and 'expected' arrays. | required |
| results2 | dict | Second set of results for comparison, following the same format as results1. | None |
| title1 | str | Title for the first set of results. | None |
| title2 | str | Title for the second set of results. | None |
| main_title | str | Main title for the entire plot. | "Histograms of Observed - Expected Values" |
| xlabel | str | Label for the x-axis of each histogram. | "Observed minus expected" |
| media_file_path | Path | Path where the plot should be saved. If provided, saves the plot as a PNG file. | None |
| file_name | str | Custom filename to use when saving the plot. If not provided, defaults to "observed_vs_expected.png". | None |
| return_figure | bool | If True, returns the matplotlib figure object instead of displaying it. | False |

Returns:

| Type | Description |
| --- | --- |
| Figure or None | The figure object if return_figure is True, otherwise None. |

Notes

The function creates a grid of histograms with a maximum of 5 columns. Each histogram shows the distribution of differences between observed and expected values for a specific prediction time. A red dashed line at x=0 indicates where observed equals expected.

Source code in src/patientflow/viz/observed_against_expected.py
def plot_deltas(
    results1,
    results2=None,
    title1=None,
    title2=None,
    main_title="Histograms of Observed - Expected Values",
    xlabel="Observed minus expected",
    media_file_path=None,
    file_name=None,
    return_figure=False,
):
    """Plot histograms of observed minus expected values.

    Creates a grid of histograms showing the distribution of differences between
    observed and expected values for different prediction times. Optionally compares
    two sets of results side by side.

    Parameters
    ----------
    results1 : dict
        First set of results containing observed and expected values for different
        prediction times. Keys are prediction times, values are dicts with 'observed'
        and 'expected' arrays.
    results2 : dict, optional
        Second set of results for comparison, following the same format as results1.
    title1 : str, optional
        Title for the first set of results.
    title2 : str, optional
        Title for the second set of results.
    main_title : str, default="Histograms of Observed - Expected Values"
        Main title for the entire plot.
    xlabel : str, default="Observed minus expected"
        Label for the x-axis of each histogram.
    media_file_path : Path, optional
        Path where the plot should be saved. If provided, saves the plot as a PNG file.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "observed_vs_expected.png".
    return_figure : bool, default=False
        If True, returns the matplotlib figure object instead of displaying it.

    Returns
    -------
    matplotlib.figure.Figure or None
        The figure object if return_figure is True, otherwise None.

    Notes
    -----
    The function creates a grid of histograms with a maximum of 5 columns.
    Each histogram shows the distribution of differences between observed and
    expected values for a specific prediction time. A red dashed line at x=0
    indicates where observed equals expected.
    """
    # Calculate the number of subplots needed
    num_plots = len(results1)

    # Calculate the number of rows and columns for the subplots
    num_cols = min(5, num_plots)  # Maximum of 5 columns
    num_rows = math.ceil(num_plots / num_cols)

    if results2:
        num_rows *= 2  # Double the number of rows if we have two result sets

    # Set a minimum width for the figure
    min_width = 8  # minimum width in inches
    width = max(min_width, 4 * num_cols)
    height = 4 * num_rows

    # Create the plot
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(width, height), squeeze=False)
    fig.suptitle(main_title, fontsize=14)

    # Flatten the axes array
    axes = axes.flatten()

    def plot_results(
        results, start_index, result_title, global_min, global_max, max_freq
    ):
        # Convert prediction times to minutes for sorting
        prediction_times_sorted = sorted(
            results.items(),
            key=lambda x: int(x[0].split("_")[-1][:2]) * 60
            + int(x[0].split("_")[-1][2:]),
        )

        # Create symmetric bins around zero
        bins = np.arange(global_min, global_max + 2) - 0.5

        for i, (_prediction_time, values) in enumerate(prediction_times_sorted):
            observed = np.array(values["observed"])
            expected = np.array(values["expected"])
            difference = observed - expected

            ax = axes[start_index + i]

            ax.hist(difference, bins=bins, edgecolor="black", alpha=0.7)
            ax.axvline(x=0, color="r", linestyle="--", linewidth=1)

            # Format the prediction time
            formatted_time = format_prediction_time(_prediction_time)

            # Combine the result_title and formatted_time
            if result_title:
                ax.set_title(f"{result_title} {formatted_time}")
            else:
                ax.set_title(formatted_time)

            ax.set_xlabel(xlabel)
            ax.set_ylabel("Frequency")
            ax.set_xlim(global_min - 0.5, global_max + 0.5)
            ax.set_ylim(0, max_freq)

    # Calculate global min and max differences for consistent x-axis across both result sets
    all_differences = []
    max_counts = []

    # Gather all differences and compute histogram data for both result sets
    for results in [results1] + ([results2] if results2 else []):
        for _, values in results.items():
            observed = np.array(values["observed"])
            expected = np.array(values["expected"])
            differences = observed - expected
            all_differences.extend(differences)
            # Compute histogram data to find maximum frequency
            counts, _ = np.histogram(differences)
            max_counts.append(max(counts))

    # Find the symmetric range around zero
    abs_max = max(abs(min(all_differences)), abs(max(all_differences)))
    global_min = -math.ceil(abs_max)
    global_max = math.ceil(abs_max)

    # Find the maximum frequency across all histograms
    max_freq = math.ceil(max(max_counts) * 1.1)  # Add 10% padding

    # Plot the first results set
    plot_results(results1, 0, title1, global_min, global_max, max_freq)

    # Plot the second results set if provided
    if results2:
        plot_results(results2, num_plots, title2, global_min, global_max, max_freq)

    # Hide any unused subplots
    for j in range(num_plots * (2 if results2 else 1), len(axes)):
        fig.delaxes(axes[j])

    plt.tight_layout()

    if media_file_path:
        if file_name:
            plt.savefig(media_file_path / file_name, dpi=300)
        else:
            plt.savefig(media_file_path / "observed_vs_expected.png", dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close()
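
A hedged sketch of the expected input format: judging by the sorting logic above, keys end in an HHMM time (e.g. "admissions_0930"); the observed and expected numbers here are invented for illustration:

    import numpy as np
    from patientflow.viz.observed_against_expected import plot_deltas

    rng = np.random.default_rng(0)
    results = {
        "admissions_0930": {
            "observed": rng.poisson(12, size=50),  # one value per snapshot date
            "expected": np.full(50, 12.0),
        },
        "admissions_1530": {
            "observed": rng.poisson(18, size=50),
            "expected": np.full(50, 18.0),
        },
    }
    plot_deltas(results, title1="Test set")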

probability_distribution

Module for generating probability distribution visualizations.

Functions:

Name Description
plot_prob_dist : Plot a probability distribution as a bar chart with enhanced plotting options.

plot_prob_dist(prob_dist_data, title, media_file_path=None, figsize=(6, 3), include_titles=False, truncate_at_beds=None, text_size=None, bar_colour='#5B9BD5', file_name=None, probability_thresholds=None, show_probability_thresholds=True, probability_levels=None, plot_bed_base=None, xlabel='Number of beds', return_figure=False)

Plot a probability distribution as a bar chart with enhanced plotting options.

This function generates a bar plot for a given probability distribution, either as a pandas DataFrame, a scipy.stats distribution object (e.g., Poisson), or a dictionary. The plot can be customized with titles, axis labels, markers, and additional visual properties.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prob_dist_data | pandas.DataFrame, dict, scipy.stats distribution, or array-like | The probability distribution data to be plotted: a pandas DataFrame; a dictionary (keys are indices, values are probabilities); a scipy.stats distribution (e.g. Poisson), for which probabilities are computed for integer values within the specified range; or an array-like of probabilities (indices will be 0 to len(array)-1). | required |
| title | str | The title of the plot, used for display and optionally as the file name. | required |
| media_file_path | str or Path | Directory where the plot image will be saved. If not provided, the plot is displayed without saving. | None |
| figsize | tuple of float | The size of the figure, specified as (width, height). | (6, 3) |
| include_titles | bool | Whether to include titles and axis labels in the plot. | False |
| truncate_at_beds | int or tuple of (int, int) | Either a single number specifying the upper bound, or a tuple of (lower_bound, upper_bound) for the x-axis range. If None, the full range of the data will be displayed. | None |
| text_size | int | Font size for plot text, including titles and tick labels. | None |
| bar_colour | str | The color of the bars in the plot. | '#5B9BD5' |
| file_name | str | Custom filename to use when saving the plot. If not provided, defaults to a generated name based on the title. | None |
| probability_thresholds | dict | A dictionary where keys are points on the cumulative distribution function (as decimals, e.g. 0.9 for 90%) and values are the corresponding resource thresholds (bed counts). For example, {0.9: 15} indicates there is a 90% probability that at least 15 beds will be needed (represents the lower tail of the distribution). | None |
| show_probability_thresholds | bool | Whether to show vertical lines indicating the resource requirements at different points on the cumulative distribution function. | True |
| probability_levels | list of float | List of probability levels for automatic threshold calculation. | None |
| plot_bed_base | dict | Dictionary of bed balance lines to plot in red. Keys are labels and values are x-axis positions. | None |
| xlabel | str | A label for the x axis. | 'Number of beds' |
| return_figure | bool | If True, returns the matplotlib figure instead of displaying it. | False |

Returns:

| Type | Description |
| --- | --- |
| Figure or None | Returns the figure if return_figure is True, otherwise displays the plot |

Examples:

Basic usage with an array of probabilities:

>>> probabilities = [0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05]
>>> plot_prob_dist(probabilities, "Bed Demand Distribution")

With thresholds:

>>> thresholds = _calculate_probability_thresholds(probabilities, [0.8, 0.95])
>>> plot_prob_dist(probabilities, "Bed Demand with Confidence Levels",
...                probability_thresholds=thresholds)

Using with a scipy stats distribution:

>>> from scipy import stats
>>> poisson_dist = stats.poisson(mu=5)  # Poisson with mean of 5
>>> plot_prob_dist(poisson_dist, "Poisson Distribution (μ=5)",
...                truncate_at_beds=(0, 15))
Source code in src/patientflow/viz/probability_distribution.py
def plot_prob_dist(
    prob_dist_data,
    title,
    media_file_path=None,
    figsize=(6, 3),
    include_titles=False,
    truncate_at_beds=None,
    text_size=None,
    bar_colour="#5B9BD5",
    file_name=None,
    probability_thresholds=None,
    show_probability_thresholds=True,
    probability_levels=None,
    plot_bed_base=None,
    xlabel="Number of beds",
    return_figure=False,
):
    """Plot a probability distribution as a bar chart with enhanced plotting options.

    This function generates a bar plot for a given probability distribution, either
    as a pandas DataFrame, a scipy.stats distribution object (e.g., Poisson), or a
    dictionary. The plot can be customized with titles, axis labels, markers, and
    additional visual properties.

    Parameters
    ----------
    prob_dist_data : pandas.DataFrame, dict, scipy.stats distribution, or array-like
        The probability distribution data to be plotted. Can be:
        - pandas DataFrame
        - dictionary (keys are indices, values are probabilities)
        - scipy.stats distribution (e.g., Poisson). If a `scipy.stats` distribution is provided,
        the function computes probabilities for integer values within the specified range.
        - array-like of probabilities (indices will be 0 to len(array)-1)
    title : str
        The title of the plot, used for display and optionally as the file name.
    media_file_path : str or pathlib.Path, optional
        Directory where the plot image will be saved. If not provided, the plot is
        displayed without saving.
    figsize : tuple of float, optional
        The size of the figure, specified as (width, height).
        Default is (6, 3)
    include_titles : bool, optional
        Whether to include titles and axis labels in the plot.
        Default is False
    truncate_at_beds : int or tuple of (int, int), optional
        Either a single number specifying the upper bound, or a tuple of
        (lower_bound, upper_bound) for the x-axis range. If None, the full
        range of the data will be displayed.
    text_size : int, optional
        Font size for plot text, including titles and tick labels.
    bar_colour : str, optional
        The color of the bars in the plot.
        Default is "#5B9BD5"
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to a generated name based on the title.
    probability_thresholds : dict, optional
        A dictionary where keys are points on the cumulative distribution function (as decimals, e.g., 0.9 for 90%)
        and values are the corresponding resource thresholds (bed counts).
        For example, {0.9: 15} indicates there is a 90% probability that
        at least 15 beds will be needed (represents the lower tail of the distribution).
    show_probability_thresholds : bool, optional
        Whether to show vertical lines indicating the resource requirements
        at different points on the cumulative distribution function.
        Default is True
    probability_levels : list of float, optional
        List of probability levels for automatic threshold calculation.
    plot_bed_base : dict, optional
        Dictionary of bed balance lines to plot in red.
        Keys are labels and values are x-axis positions.
    xlabel : str, optional
        A label for the x axis.
        Default is "Number of beds"
    return_figure : bool, optional
        If True, returns the matplotlib figure instead of displaying it.
        Default is False

    Returns
    -------
    matplotlib.figure.Figure or None
        Returns the figure if return_figure is True, otherwise displays the plot

    Examples
    --------
    Basic usage with an array of probabilities:

    >>> probabilities = [0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05]
    >>> plot_prob_dist(probabilities, "Bed Demand Distribution")

    With thresholds:

    >>> thresholds = _calculate_probability_thresholds(probabilities, [0.8, 0.95])
    >>> plot_prob_dist(probabilities, "Bed Demand with Confidence Levels",
    ...                probability_thresholds=thresholds)

    Using with a scipy stats distribution:

    >>> from scipy import stats
    >>> poisson_dist = stats.poisson(mu=5)  # Poisson with mean of 5
    >>> plot_prob_dist(poisson_dist, "Poisson Distribution (μ=5)",
    ...                truncate_at_beds=(0, 15))
    """

    # Handle array-like input
    if isinstance(prob_dist_data, (np.ndarray, list)):
        array_length = len(prob_dist_data)
        prob_dist_data = pd.DataFrame(
            {"agg_proba": prob_dist_data}, index=range(array_length)
        )

    # Handle scipy.stats distribution input
    elif hasattr(prob_dist_data, "pmf") and callable(prob_dist_data.pmf):
        # Determine range for the distribution
        if truncate_at_beds is None:
            # Default range for distributions if not specified
            lower_bound = 0
            upper_bound = 20  # Reasonable default for most discrete distributions
        elif isinstance(truncate_at_beds, (int, float)):
            lower_bound = 0
            upper_bound = truncate_at_beds
        else:
            lower_bound, upper_bound = truncate_at_beds

        # Generate x values and probabilities
        x = np.arange(lower_bound, upper_bound + 1)
        probs = prob_dist_data.pmf(x)
        prob_dist_data = pd.DataFrame({"agg_proba": probs}, index=x)

        # No need to filter later
        truncate_at_beds = None

    # Handle dictionary input
    elif isinstance(prob_dist_data, dict):
        prob_dist_data = pd.DataFrame(
            {"agg_proba": list(prob_dist_data.values())},
            index=list(prob_dist_data.keys()),
        )

    # Apply truncation if specified
    if truncate_at_beds is not None:
        # Determine bounds
        if isinstance(truncate_at_beds, (int, float)):
            lower_bound = 0
            upper_bound = truncate_at_beds
        else:
            lower_bound, upper_bound = truncate_at_beds

        # Apply filtering
        mask = (prob_dist_data.index >= lower_bound) & (
            prob_dist_data.index <= upper_bound
        )
        filtered_data = prob_dist_data[mask]
    else:
        # Use all available data
        filtered_data = prob_dist_data

    # Calculate probability thresholds if probability_levels is provided
    if probability_thresholds is None and probability_levels is not None:
        probability_thresholds = _calculate_probability_thresholds(
            filtered_data["agg_proba"].values, probability_levels
        )

    # Create the plot
    fig = plt.figure(figsize=figsize)

    if not file_name:
        file_name = (
            title.replace(" ", "_").replace("\n", "_").replace("%", "percent") + ".png"
        )

    # Plot bars
    plt.bar(
        filtered_data.index,
        filtered_data["agg_proba"].values,
        color=bar_colour,
    )

    # Generate appropriate ticks based on data range
    if len(filtered_data) > 0:
        data_min = min(filtered_data.index)
        data_max = max(filtered_data.index)
        data_range = data_max - data_min

        if data_range <= 10:
            tick_step = 1
        elif data_range <= 50:
            tick_step = 5
        else:
            tick_step = 10

        tick_start = (data_min // tick_step) * tick_step
        tick_end = data_max + 1
        plt.xticks(np.arange(tick_start, tick_end, tick_step))

    # Plot probability threshold lines
    if show_probability_thresholds and probability_thresholds:
        colors = itertools.cycle(
            plt.cm.gray(np.linspace(0.3, 0.7, len(probability_thresholds)))
        )
        for probability, bed_count in probability_thresholds.items():
            plt.axvline(
                x=bed_count,
                linestyle="--",
                linewidth=2,
                color=next(colors),
                label=f"{probability*100:.0f}% probability of needing ≥ {bed_count} beds",
            )
        plt.legend(loc="upper right")

    # Add bed balance lines
    if plot_bed_base:
        for point in plot_bed_base:
            plt.axvline(
                x=plot_bed_base[point],
                linewidth=2,
                color="red",
                label=f"bed balance: {point}",
            )
        plt.legend(loc="upper right")

    # Add text and labels
    if text_size:
        plt.tick_params(axis="both", which="major", labelsize=text_size)
        plt.xlabel(xlabel, fontsize=text_size)
        if include_titles:
            plt.title(title, fontsize=text_size)
            plt.ylabel("Probability", fontsize=text_size)
    else:
        plt.xlabel(xlabel)
        if include_titles:
            plt.title(title)
            plt.ylabel("Probability")

    plt.tight_layout()

    # Save or display the figure
    if media_file_path:
        plt.savefig(media_file_path / file_name.replace(" ", "_"), dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
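
A further sketch using dictionary input and automatic threshold calculation (the probabilities below are invented):

    from patientflow.viz.probability_distribution import plot_prob_dist

    # Hypothetical aggregate bed-demand distribution keyed by bed count.
    prob_dist = {0: 0.05, 1: 0.10, 2: 0.20, 3: 0.30, 4: 0.20, 5: 0.10, 6: 0.05}
    plot_prob_dist(
        prob_dist,
        "Aggregate demand at 09:30",
        include_titles=True,
        probability_levels=[0.9],  # draw the 90% threshold line automatically
    )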

quantile_quantile

Generate Quantile-Quantile (QQ) plots to compare observed values with model predictions.

This module creates QQ plots for healthcare bed demand predictions, comparing observed values with model predictions. A QQ plot is a graphical technique for determining whether two data sets come from populations with a common distribution: if the points fall approximately along the reference line y = x, the distributions are likely similar.

Functions:

Name Description
qq_plot : function

Generate multiple QQ plots comparing observed values with model predictions

Notes

To prepare the predicted distribution:

* Treat the predicted distributions (saved as CDFs) for all time points of interest as if they were one distribution.
* Within this predicted distribution, because each probability covers a discrete rather than continuous set of input values, the upper and lower values of the probability range are saved at each value.
* The midpoint between upper and lower is calculated and saved.
* The distribution of CDF midpoints (one for each horizon date) is sorted by the value of the midpoint, and a CDF of this is calculated (a CDF of CDFs, in effect).
* These are weighted by the probability of each value occurring.

To prepare the observed distribution:

* Take the observed number for each horizon date and save the CDF of that value from its predicted distribution.
* The distribution of CDF values (one per horizon date) is sorted.
* These are weighted by the probability of each value occurring, which is uniform (1 over the number of horizon dates).
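
A minimal sketch of the midpoint construction described above, using names borrowed from the source below and an invented five-point distribution:

    import numpy as np

    # One snapshot date: a discrete predicted distribution and an observed count.
    agg_predicted = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
    agg_observed = 2

    upper = agg_predicted.cumsum()      # P(X <= k), the upper CDF value at each k
    lower = np.hstack((0, upper[:-1]))  # P(X < k), the lower CDF value at each k
    mid = (upper + lower) / 2           # midpoint carried into the CDF of CDFs

    # The observed count contributes the midpoint CDF value at its index
    # (0.5 here, since the example distribution is symmetric about 2).
    observed_cdf = mid[agg_observed]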

qq_plot(prediction_times, prob_dist_dict_all, model_name='admissions', return_figure=False, figsize=None, suptitle=None, media_file_path=None, file_name=None)

Generate multiple QQ plots comparing observed values with model predictions.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prediction_times | list of tuple | List of (hour, minute) tuples for prediction times. | required |
| prob_dist_dict_all | dict | Dictionary of probability distributions keyed by model_key. | required |
| model_name | str | Base name of the model to construct model keys. | "admissions" |
| return_figure | bool | If True, returns the figure object instead of displaying it. | False |
| figsize | tuple of float | Size of the figure in inches as (width, height). If None, calculated automatically based on number of plots. | None |
| suptitle | str | Super title for the entire figure, displayed above all subplots. | None |
| media_file_path | Path | Path to save the plot. | None |
| file_name | str | Custom filename to use when saving the plot. If not provided, defaults to "qq_plot.png". | None |

Returns:

| Type | Description |
| --- | --- |
| Figure or None | Returns the figure if return_figure is True, otherwise displays the plot and returns None. |

Notes

The function creates a QQ plot for each prediction time, comparing the observed distribution with the predicted distribution. Each subplot shows how well the model's predictions match the actual observations.

Source code in src/patientflow/viz/quantile_quantile.py
def qq_plot(
    prediction_times,
    prob_dist_dict_all,
    model_name="admissions",
    return_figure=False,
    figsize=None,
    suptitle=None,
    media_file_path=None,
    file_name=None,
):
    """Generate multiple QQ plots comparing observed values with model predictions.

    Parameters
    ----------
    prediction_times : list of tuple
        List of (hour, minute) tuples for prediction times.
    prob_dist_dict_all : dict
        Dictionary of probability distributions keyed by model_key.
    model_name : str, default="admissions"
        Base name of the model to construct model keys.
    return_figure : bool, default=False
        If True, returns the figure object instead of displaying it.
    figsize : tuple of float, optional
        Size of the figure in inches as (width, height). If None, calculated automatically
        based on number of plots.
    suptitle : str, optional
        Super title for the entire figure, displayed above all subplots.
    media_file_path : Path, optional
        Path to save the plot.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "qq_plot.png".

    Returns
    -------
    matplotlib.figure.Figure or None
        Returns the figure if return_figure is True, otherwise displays the plot and returns None.

    Notes
    -----
    The function creates a QQ plot for each prediction time, comparing the observed
    distribution with the predicted distribution. Each subplot shows how well the
    model's predictions match the actual observations.
    """
    # Sort prediction times by converting to minutes since midnight
    prediction_times_sorted = sorted(
        prediction_times,
        key=lambda x: x[0] * 60
        + x[1],  # Convert (hour, minute) to minutes since midnight
    )

    num_plots = len(prediction_times_sorted)
    if figsize is None:
        figsize = (num_plots * 5, 4)

    # Create subplot layout
    fig, axs = plt.subplots(1, num_plots, figsize=figsize)

    # Handle case of single prediction time
    if num_plots == 1:
        axs = [axs]

    # Loop through each subplot
    for i, prediction_time in enumerate(prediction_times_sorted):
        # Initialize lists to store CDF and observed data
        cdf_data = []
        observed_data = []

        # Get model key and corresponding prob_dist_dict
        model_key = get_model_key(model_name, prediction_time)
        prob_dist_dict = prob_dist_dict_all[model_key]

        # Process data for current subplot
        for dt in prob_dist_dict:
            agg_predicted = np.array(prob_dist_dict[dt]["agg_predicted"])
            agg_observed = prob_dist_dict[dt]["agg_observed"]

            upper = agg_predicted.cumsum()
            lower = np.hstack((0, upper[:-1]))
            mid = (upper + lower) / 2

            cdf_data.append(np.column_stack((upper, lower, mid, agg_predicted)))
            # Round the observed data to nearest integer before using as index
            agg_observed_int = int(round(agg_observed))
            observed_data.append(mid[agg_observed_int])

        if not cdf_data:
            continue

        # Prepare data for plotting
        cdf_data = np.vstack(cdf_data)
        qq_model = pd.DataFrame(
            # Column order matches the column_stack above: (upper, lower, mid, weights)
            cdf_data, columns=["cdf_upper", "cdf_lower", "cdf_mid", "weights"]
        )
        qq_model = qq_model.sort_values("cdf_mid")
        qq_model["cum_weight"] = qq_model["weights"].cumsum()
        qq_model["cum_weight_normed"] = (
            qq_model["cum_weight"] / qq_model["weights"].sum()
        )

        qq_observed = pd.DataFrame(observed_data, columns=["cdf_observed"])
        qq_observed = qq_observed.sort_values("cdf_observed")
        qq_observed["weights"] = 1 / len(observed_data)
        qq_observed["cum_weight_normed"] = qq_observed["weights"].cumsum()

        qq_observed["max_model_cdf_at_this_value"] = qq_observed["cdf_observed"].apply(
            lambda x: qq_model[qq_model["cdf_mid"] <= x]["cum_weight_normed"].max()
        )

        # Plot on current subplot
        ax = axs[i]
        ax.set_aspect("equal")
        ax.set_xlim([0, 1])
        ax.set_ylim([0, 1])

        # Reference line y=x
        ax.plot([0, 1], [0, 1], linestyle="--")

        # Plot QQ data points
        ax.plot(
            qq_observed["max_model_cdf_at_this_value"],
            qq_observed["cum_weight_normed"],
            marker=".",
            linewidth=0,
        )

        # Set labels and title for subplot with hour:minute format
        hour, minutes = prediction_time
        ax.set_xlabel("Cdf of model distribution")
        ax.set_ylabel("Cdf of observed distribution")
        ax.set_title(f"QQ Plot for {hour}:{minutes:02}")

    plt.tight_layout()

    # Add suptitle if provided
    if suptitle:
        plt.suptitle(suptitle, fontsize=16, y=1.05)

    if media_file_path:
        plt.savefig(media_file_path / (file_name or "qq_plot.png"), dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close(fig)
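
To make the mid-point CDF construction above concrete, here is a self-contained sketch (NumPy only) of the quantities computed for each snapshot date; the distribution is invented for illustration:

import numpy as np

# Hypothetical predicted distribution over 0..3 admissions for one snapshot
agg_predicted = np.array([0.1, 0.4, 0.3, 0.2])

upper = agg_predicted.cumsum()      # P(X <= k):     [0.1, 0.5, 0.8, 1.0]
lower = np.hstack((0, upper[:-1]))  # P(X <= k - 1): [0.0, 0.1, 0.5, 0.8]
mid = (upper + lower) / 2           # mid-point CDF: [0.05, 0.3, 0.65, 0.9]

# An observed count of 2 admissions maps to its mid-point CDF value
agg_observed = 2
print(mid[agg_observed])  # ≈ 0.65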

randomised_pit

plot_randomised_pit(prediction_times, prob_dist_dict_all, model_name='admissions', return_figure=False, return_dataframe=False, figsize=None, suptitle=None, media_file_path=None, file_name=None, n_bins=10, seed=42)

Generate randomised PIT histograms for multiple prediction times side by side.

Parameters:

Name Type Description Default
prediction_times list of tuple

List of (hour, minute) tuples representing times for which predictions were made.

required
prob_dist_dict_all dict

Dictionary of probability distributions keyed by model_key. Each entry contains information about predicted distributions and observed values for different snapshot dates.

required
model_name str

Base name of the model to construct model keys, by default "admissions".

'admissions'
return_figure bool

If True, returns the figure object instead of displaying it, by default False.

False
return_dataframe bool

If True, returns a dictionary of PIT values by model_key, by default False.

False
figsize tuple of (float, float)

Size of the figure in inches as (width, height). If None, calculated automatically based on number of plots, by default None.

None
suptitle str

Super title for the entire figure, displayed above all subplots, by default None.

None
media_file_path Path

Path to save the plot, by default None. If provided, saves the plot as a PNG file.

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "plot_randomised_pit.png".

None
n_bins int

Number of histogram bins, by default 10.

10
seed int

Random seed for reproducibility, by default 42.

42

Returns:

Type Description
Figure

The figure object containing the plots, if return_figure is True.

dict

Dictionary of PIT values by model_key, if return_dataframe is True.

tuple

Tuple of (figure, pit_values_dict) if both return_figure and return_dataframe are True.

None

If neither return_figure nor return_dataframe is True, displays the plots and returns None.

Source code in src/patientflow/viz/randomised_pit.py
def plot_randomised_pit(
    prediction_times: List[Tuple[int, int]],
    prob_dist_dict_all: Dict[str, Dict],
    model_name: str = "admissions",
    return_figure: bool = False,
    return_dataframe: bool = False,
    figsize: Optional[Tuple[float, float]] = None,
    suptitle: Optional[str] = None,
    media_file_path: Optional[Path] = None,
    file_name: Optional[str] = None,
    n_bins: int = 10,
    seed: Optional[int] = 42,
) -> Union[
    plt.Figure, Dict[str, List[float]], Tuple[plt.Figure, Dict[str, List[float]]], None
]:
    """
    Generate randomised PIT histograms for multiple prediction times side by side.

    Parameters
    ----------
    prediction_times : list of tuple
        List of (hour, minute) tuples representing times for which predictions were made.
    prob_dist_dict_all : dict
        Dictionary of probability distributions keyed by model_key. Each entry contains
        information about predicted distributions and observed values for different
        snapshot dates.
    model_name : str, optional
        Base name of the model to construct model keys, by default "admissions".
    return_figure : bool, optional
        If True, returns the figure object instead of displaying it, by default False.
    return_dataframe : bool, optional
        If True, returns a dictionary of PIT values by model_key, by default False.
    figsize : tuple of (float, float), optional
        Size of the figure in inches as (width, height). If None, calculated automatically
        based on number of plots, by default None.
    suptitle : str, optional
        Super title for the entire figure, displayed above all subplots, by default None.
    media_file_path : Path, optional
        Path to save the plot, by default None. If provided, saves the plot as a PNG file.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "plot_randomised_pit.png".
    n_bins : int, optional
        Number of histogram bins, by default 10.
    seed : int, optional
        Random seed for reproducibility, by default 42.

    Returns
    -------
    matplotlib.figure.Figure
        The figure object containing the plots, if return_figure is True.
    dict
        Dictionary of PIT values by model_key, if return_dataframe is True.
    tuple
        Tuple of (figure, pit_values_dict) if both return_figure and return_dataframe are True.
    None
        If neither return_figure nor return_dataframe is True, displays the plots and returns None.
    """
    if seed is not None:
        np.random.seed(seed)

    # Sort prediction times by converting to minutes since midnight
    prediction_times_sorted = sorted(
        prediction_times,
        key=lambda x: x[0] * 60 + x[1],
    )

    # Calculate figure parameters
    num_plots = len(prediction_times_sorted)
    figsize = figsize or (num_plots * 5, 4)

    # Create subplot layout
    fig, axs = plt.subplots(1, num_plots, figsize=figsize)
    axs = [axs] if num_plots == 1 else axs

    all_pit_values: Dict[str, List[float]] = {}
    max_density = 0.0  # Track maximum density across all histograms

    # Process each subplot
    for i, prediction_time in enumerate(prediction_times_sorted):
        model_key = get_model_key(model_name, prediction_time)
        prob_dist_dict = prob_dist_dict_all[model_key]

        if not prob_dist_dict:
            continue

        observations = []
        cdf_functions = []

        # Extract data for each date
        for dt in prob_dist_dict:
            try:
                observation = prob_dist_dict[dt]["agg_observed"]
                predicted_dist = prob_dist_dict[dt]["agg_predicted"]["agg_proba"]

                # Convert probability distribution to CDF function
                cdf_func = _prob_to_cdf(predicted_dist)

                observations.append(observation)
                cdf_functions.append(cdf_func)

            except Exception as e:
                print(f"Skipping date {dt} due to error: {e}")
                continue

        if len(observations) == 0:
            continue

        # Generate PIT values
        pit_values = []

        for obs, cdf_func in zip(observations, cdf_functions):
            try:
                # Calculate PIT range bounds
                lower = cdf_func(obs - 1) if obs > 0 else 0.0
                upper = cdf_func(obs)

                # Sample randomly within the range
                pit_value = np.random.uniform(lower, upper)
                pit_values.append(pit_value)

            except Exception as e:
                print(f"Error processing observation {obs}: {e}")
                continue

        all_pit_values[model_key] = pit_values

        # Calculate histogram to get density
        hist, _ = np.histogram(pit_values, bins=n_bins, density=True)
        max_density = max(max_density, np.max(hist))

    # Now plot with consistent y-axis scale
    for i, prediction_time in enumerate(prediction_times_sorted):
        model_key = get_model_key(model_name, prediction_time)
        pit_values = all_pit_values.get(model_key, [])

        if not pit_values:
            continue

        # Plot histogram
        ax = axs[i]
        ax.hist(
            pit_values,
            bins=n_bins,
            density=True,
            alpha=0.7,
            edgecolor="black",
            label="Randomised PIT",
        )

        # Add uniform reference line
        ax.axhline(
            y=1.0, color="red", linestyle="--", linewidth=2, label="Perfect Uniform"
        )

        # Set labels and title
        hour, minutes = prediction_time
        ax.set_xlabel("PIT Value")
        ax.set_ylabel("Density")
        ax.set_title(f"PIT Histogram for {hour}:{minutes:02}")
        ax.set_xlim(0, 1)
        ax.set_ylim(0, max_density * 1.1)  # Add 10% padding
        ax.grid(True, alpha=0.3)

        if i == 0:  # Only show legend on first subplot
            ax.legend()

    # Final plot configuration
    plt.tight_layout()
    if suptitle:
        plt.suptitle(suptitle, fontsize=16, y=1.05)
    if media_file_path:
        plt.savefig(media_file_path / (file_name or "plot_randomised_pit.png"), dpi=300)

    # Return based on flags
    if return_figure and return_dataframe:
        return fig, all_pit_values
    elif return_figure:
        return fig
    elif return_dataframe:
        plt.show()
        plt.close()
        return all_pit_values
    else:
        plt.show()
        plt.close()
        return None
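
The randomised PIT draw at the heart of this function can be shown in isolation. This is a minimal sketch using a made-up predictive distribution, mirroring the lower/upper logic above rather than the module's private _prob_to_cdf helper:

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical predictive distribution over 0..3 admissions
predicted = np.array([0.1, 0.4, 0.3, 0.2])
cdf = predicted.cumsum()  # [0.1, 0.5, 0.8, 1.0]

obs = 1  # observed count for one snapshot date
lower = cdf[obs - 1] if obs > 0 else 0.0  # F(obs - 1) = 0.1
upper = cdf[obs]                          # F(obs)     = 0.5

# For a well-calibrated model, values drawn this way are uniform on [0, 1]
pit_value = rng.uniform(lower, upper)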

shap

SHAP (SHapley Additive exPlanations) visualization module.

This module provides functionality for generating SHAP plots. These are useful for visualizing feature importance and their impact on model decisions.

Functions:

Name Description
plot_shap : function

Generate SHAP plots for multiple trained models.

plot_shap(trained_models, test_visits, exclude_from_training_data, media_file_path=None, file_name=None, return_figure=False, label_col='is_admitted')

Generate SHAP plots for multiple trained models.

This function creates SHAP (SHapley Additive exPlanations) summary plots for each trained model, showing the impact of features on model predictions. The plots can be saved to a specified media file path or displayed directly.

Parameters:

Name Type Description Default
trained_models list[TrainedClassifier] or dict[str, TrainedClassifier]

List of trained classifier objects or dictionary with TrainedClassifier values.

required
test_visits DataFrame

DataFrame containing the test visit data.

required
exclude_from_training_data list[str]

List of columns to exclude from training data.

required
media_file_path Path

Directory path where the generated plots will be saved. If None, plots are only displayed.

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "shap_plot.png".

None
return_figure bool

If True, returns the figure instead of displaying it.

False
label_col str

Name of the column containing the target labels.

"is_admitted"

Returns:

Type Description
Figure or None

If return_figure is True, returns the generated figure. Otherwise, returns None.

Source code in src/patientflow/viz/shap.py
def plot_shap(
    trained_models: list[TrainedClassifier] | dict[str, TrainedClassifier],
    test_visits,
    exclude_from_training_data,
    media_file_path: Optional[Path] = None,
    file_name: Optional[str] = None,
    return_figure=False,
    label_col: str = "is_admitted",
):
    """Generate SHAP plots for multiple trained models.

    This function creates SHAP (SHapley Additive exPlanations) summary plots for each
    trained model, showing the impact of features on model predictions. The plots can
    be saved to a specified media file path or displayed directly.

    Parameters
    ----------
    trained_models : list[TrainedClassifier] or dict[str, TrainedClassifier]
        List of trained classifier objects or dictionary with TrainedClassifier values.
    test_visits : pandas.DataFrame
        DataFrame containing the test visit data.
    exclude_from_training_data : list[str]
        List of columns to exclude from training data.
    media_file_path : Path, optional
        Directory path where the generated plots will be saved. If None, plots are
        only displayed.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "shap_plot.png".
    return_figure : bool, default=False
        If True, returns the figure instead of displaying it.
    label_col : str, default="is_admitted"
        Name of the column containing the target labels.

    Returns
    -------
    matplotlib.figure.Figure or None
        If return_figure is True, returns the generated figure. Otherwise, returns None.
    """
    # Convert dict to list if needed
    if isinstance(trained_models, dict):
        trained_models = list(trained_models.values())

    # Sort trained_models by prediction time
    trained_models_sorted = sorted(
        trained_models,
        key=lambda x: x.training_results.prediction_time[0] * 60
        + x.training_results.prediction_time[1],
    )

    for trained_model in trained_models_sorted:
        fig, ax = plt.subplots(figsize=(8, 12))

        # use non-calibrated pipeline
        pipeline: Pipeline = trained_model.pipeline
        prediction_time = trained_model.training_results.prediction_time

        # Get test data for this prediction time
        X_test, _ = prepare_patient_snapshots(
            df=test_visits,
            prediction_time=prediction_time,
            exclude_columns=exclude_from_training_data,
            single_snapshot_per_visit=False,
            label_col=label_col,
        )

        X_test = add_missing_columns(pipeline, X_test)
        transformed_cols = pipeline.named_steps[
            "feature_transformer"
        ].get_feature_names_out()
        transformed_cols = [col.split("__")[-1] for col in transformed_cols]
        truncated_cols = [col[:45] for col in transformed_cols]

        # Transform features
        X_test = pipeline.named_steps["feature_transformer"].transform(X_test)

        # Create SHAP explainer
        explainer = shap.TreeExplainer(pipeline.named_steps["classifier"])

        # Convert sparse matrix to dense if necessary
        if scipy.sparse.issparse(X_test):
            X_test = X_test.toarray()

        shap_values = explainer.shap_values(X_test)

        # Print prediction distribution
        predictions = pipeline.named_steps["classifier"].predict(X_test)
        print(
            "Predicted classification (not admitted, admitted): ",
            np.bincount(predictions),
        )

        # Print mean SHAP values for each class
        if isinstance(shap_values, list):
            print("SHAP values shape:", [arr.shape for arr in shap_values])
            print("Mean SHAP values (class 0):", np.abs(shap_values[0]).mean(0))
            print("Mean SHAP values (class 1):", np.abs(shap_values[1]).mean(0))

        # Create SHAP summary plot
        rng = np.random.default_rng()
        shap.summary_plot(
            shap_values,
            X_test,
            feature_names=truncated_cols,
            show=False,
            rng=rng,
        )

        hour, minutes = prediction_time
        ax.set_title(f"SHAP Values for Time of Day: {hour}:{minutes:02}")
        ax.set_xlabel("SHAP Value")
        plt.tight_layout()

        if media_file_path:
            # Save plot
            if file_name:
                shap_plot_path = str(media_file_path / file_name)
            else:
                shap_plot_path = str(
                    media_file_path / f"shap_plot_{hour:02}{minutes:02}.png"
                )
            plt.savefig(shap_plot_path)

        if return_figure:
            return fig
        else:
            plt.show()
            plt.close(fig)
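
A minimal usage sketch; the inputs are assumed to come from an upstream patientflow training run, and the excluded column names are hypothetical:

from pathlib import Path

# trained_models: list of TrainedClassifier from a training run (assumed)
# test_visits: DataFrame of patient snapshots (assumed)
plot_shap(
    trained_models,
    test_visits,
    exclude_from_training_data=["visit_id", "snapshot_date"],  # hypothetical columns
    media_file_path=Path("media"),
    label_col="is_admitted",
)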

survival_curve

Visualization tools for patient flow analysis using survival curves.

This module provides functions to create and analyze survival curves for time-to-event analysis.

Functions:

Name Description
plot_admission_time_survival_curve : function

Create single or multiple survival curves for ward admission times

Notes
  • The survival curves show the proportion of patients who have not yet experienced an event (e.g., admission to ward) over time
  • Time is measured in hours from the initial event (e.g., arrival)
  • A 4-hour target line is included by default to show performance against common healthcare targets
  • The curves are created without external survival analysis packages for simplicity and transparency (a minimal sketch of the calculation follows this list)
  • Multiple curves can be plotted on the same figure for comparison
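
As noted above, the curve is just an empirical survival function. For intuition, here is a minimal sketch of the underlying calculation (not the package's calculate_survival_curve implementation), using invented data:

import numpy as np
import pandas as pd

# Hypothetical visit data: two patients leaving 2 and 5 hours after arrival
df = pd.DataFrame(
    {
        "arrival_datetime": pd.to_datetime(["2031-01-01 10:00", "2031-01-01 11:00"]),
        "departure_datetime": pd.to_datetime(["2031-01-01 12:00", "2031-01-01 16:00"]),
    }
)

# Elapsed time to event in hours, sorted ascending
durations = (
    (df["departure_datetime"] - df["arrival_datetime"]).dt.total_seconds() / 3600
).sort_values()

# Proportion of patients who have not yet experienced the event after each time
n = len(durations)
survival_prob = 1 - np.arange(1, n + 1) / n  # [0.5, 0.0]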

plot_admission_time_survival_curve(df, start_time_col='arrival_datetime', end_time_col='departure_datetime', title='Time to Event Survival Curve', target_hours=[4], xlabel='Elapsed time from start', ylabel='Proportion not yet experienced event', annotation_string='{:.1%} experienced event\nwithin {:.0f} hours', labels=None, media_file_path=None, file_name=None, return_figure=False, return_df=False)

Create a survival curve for time-to-event analysis.

This function creates a survival curve showing the proportion of patients
who have not yet experienced an event over time. Can plot single or multiple
survival curves on the same plot.
Parameters:

Name Type Description Default
df DataFrame or list of DataFrame

DataFrame(s) containing patient visit data. If a list is provided, multiple survival curves will be plotted on the same figure.

required
start_time_col str

Name of the column containing the start time (e.g., arrival time).

'arrival_datetime'
end_time_col str

Name of the column containing the end time (e.g., departure or admission time).

'departure_datetime'
title str

Title for the plot.

'Time to Event Survival Curve'
target_hours list of float

List of target times in hours to show on the plot.

[4]
xlabel str

Label for the x-axis.

'Elapsed time from start'
ylabel str

Label for the y-axis.

'Proportion not yet experienced event'
annotation_string str

String template for the text annotation. Use {:.1%} for the proportion and {:.0f} for the hours. Annotations are only shown for the first curve when plotting multiple curves.

'{:.1%} experienced event\nwithin {:.0f} hours'
labels list of str

Labels for each survival curve when plotting multiple curves. If None and multiple dataframes are provided, default labels will be used. Ignored when plotting a single curve.

None
media_file_path Path

Path to save the plot. If None, the plot is not saved.

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "survival_curve.png".

None
return_figure bool

If True, returns the figure instead of displaying it.

False
return_df bool

If True, returns a DataFrame containing the survival curve data. For multiple curves, returns a list of DataFrames.

False

Returns:

Type Description
Figure or DataFrame or list or tuple or None

If return_figure is True and return_df is False, returns the figure object. If return_figure is False and return_df is True, returns the DataFrame(s) with survival curve data. If both are True, returns a tuple of (figure, DataFrame(s)). If both are False, displays the plot and returns None.

Notes

The survival curve shows the proportion of patients who have not yet experienced the event at each time point. Vertical lines are drawn at each target hour to indicate the target times, with the corresponding proportion of patients who experienced the event within these timeframes.

When plotting multiple curves, different colors are automatically assigned and a legend is displayed. Target line annotations are only shown for the first curve to avoid visual clutter.
Source code in src/patientflow/viz/survival_curve.py
def plot_admission_time_survival_curve(
    df,
    start_time_col="arrival_datetime",
    end_time_col="departure_datetime",
    title="Time to Event Survival Curve",
    target_hours=[4],
    xlabel="Elapsed time from start",
    ylabel="Proportion not yet experienced event",
    annotation_string="{:.1%} experienced event\nwithin {:.0f} hours",
    labels=None,
    media_file_path=None,
    file_name=None,
    return_figure=False,
    return_df=False,
):
    """Create a survival curve for time-to-event analysis.

    This function creates a survival curve showing the proportion of patients
    who have not yet experienced an event over time. Can plot single or multiple
    survival curves on the same plot.

    Parameters
    ----------
    df : pandas.DataFrame or list of pandas.DataFrame
        DataFrame(s) containing patient visit data. If a list is provided,
        multiple survival curves will be plotted on the same figure.
    start_time_col : str, default="arrival_datetime"
        Name of the column containing the start time (e.g., arrival time)
    end_time_col : str, default="departure_datetime"
        Name of the column containing the end time (e.g., departure or admission time)
    title : str, default="Time to Event Survival Curve"
        Title for the plot
    target_hours : list of float, default=[4]
        List of target times in hours to show on the plot
    xlabel : str, default="Elapsed time from start"
        Label for the x-axis
    ylabel : str, default="Proportion not yet experienced event"
        Label for the y-axis
    annotation_string : str, default="{:.1%} experienced event\nwithin {:.0f} hours"
        String template for the text annotation. Use {:.1%} for the proportion and {:.0f} for the hours.
        Annotations are only shown for the first curve when plotting multiple curves.
    labels : list of str, optional
        Labels for each survival curve when plotting multiple curves.
        If None and multiple dataframes are provided, default labels will be used.
        Ignored when plotting a single curve.
    media_file_path : pathlib.Path, optional
        Path to save the plot. If None, the plot is not saved.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "survival_curve.png".
    return_figure : bool, default=False
        If True, returns the figure instead of displaying it
    return_df : bool, default=False
        If True, returns a DataFrame containing the survival curve data.
        For multiple curves, returns a list of DataFrames.

    Returns
    -------
    matplotlib.figure.Figure or pandas.DataFrame or list or tuple or None
        - If return_figure is True and return_df is False: returns the figure object
        - If return_figure is False and return_df is True: returns the DataFrame(s) with survival curve data
        - If both return_figure and return_df are True: returns a tuple of (figure, DataFrame(s))
        - If both are False: returns None

    Notes
    -----
    The survival curve shows the proportion of patients who have not yet experienced
    the event at each time point. Vertical lines are drawn at each target hour
    to indicate the target times, with the corresponding proportion of patients
    who experienced the event within these timeframes.

    When plotting multiple curves, different colors are automatically assigned
    and a legend is displayed. Target line annotations are only shown for the
    first curve to avoid visual clutter.
    """
    # Handle single dataframe vs list of dataframes
    if isinstance(df, pd.DataFrame):
        dataframes = [df]
        is_single_curve = True
    else:
        dataframes = df
        is_single_curve = False

    # Handle labels
    if labels is None:
        if is_single_curve:
            curve_labels = [None]
        else:
            curve_labels = [f"Curve {i+1}" for i in range(len(dataframes))]
    else:
        curve_labels = labels

    # Validate inputs
    if len(dataframes) != len(curve_labels):
        raise ValueError("Number of dataframes must match number of labels")

    # Create the plot
    fig = plt.figure(figsize=(10, 6))

    # Define colors for multiple curves
    colors = plt.cm.Set1(np.linspace(0, 1, len(dataframes)))

    survival_dfs = []

    # Process each dataframe
    for idx, (current_df, label) in enumerate(zip(dataframes, curve_labels)):
        # Calculate survival curve using the extracted function
        survival_df = calculate_survival_curve(current_df, start_time_col, end_time_col)

        # Extract arrays for plotting
        unique_times = survival_df["time_hours"].values
        survival_prob = survival_df["survival_probability"].values

        # Store DataFrame if requested
        if return_df:
            survival_dfs.append(survival_df)

        # Plot the survival curve
        color = colors[idx] if not is_single_curve else None
        plt.step(
            unique_times,
            survival_prob,
            where="post",
            color=color,
            label=label if not is_single_curve else None,
        )

        # Plot target lines and annotations only for the first curve (or single curve)
        if idx == 0:
            # Plot target lines for each target hour
            for target_hour in target_hours:
                # Find the survival probability at target hours
                closest_time_idx = np.abs(unique_times - target_hour).argmin()
                if closest_time_idx < len(survival_prob):
                    survival_at_target = survival_prob[closest_time_idx]
                    event_at_target = 1 - survival_at_target

                    # Add text annotation to the plot (only for single curve or first curve)
                    if is_single_curve or len(dataframes) == 1:
                        plt.text(
                            target_hour + 0.5,
                            survival_at_target,
                            annotation_string.format(event_at_target, target_hour),
                            bbox=dict(facecolor="white", alpha=0.8),
                        )

                        # Draw a vertical line from x-axis to the curve at target hours
                        plt.plot(
                            [target_hour, target_hour],
                            [0, survival_at_target],
                            color="grey",
                            linestyle="--",
                            linewidth=2,
                        )

                        # Draw a horizontal line from the curve to the y-axis at the survival probability level
                        plt.plot(
                            [0, target_hour],
                            [survival_at_target, survival_at_target],
                            color="grey",
                            linestyle="--",
                            linewidth=2,
                        )

    # Configure the plot
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.grid(True, alpha=0.3)

    # Make axes meet at the origin
    plt.xlim(left=0)
    plt.ylim(bottom=0)

    # Move spines to the origin
    ax = plt.gca()
    ax.spines["left"].set_position(("data", 0))
    ax.spines["bottom"].set_position(("data", 0))

    # Hide the top and right spines
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)

    # Add legend for multiple curves
    if not is_single_curve:
        plt.legend()

    plt.tight_layout()

    if media_file_path:
        if file_name:
            plt.savefig(media_file_path / file_name, dpi=300)
        else:
            plt.savefig(media_file_path / "survival_curve.png", dpi=300)

    # Handle return values
    return_data = (
        survival_dfs[0]
        if (return_df and is_single_curve)
        else survival_dfs
        if return_df
        else None
    )

    if return_figure and return_df:
        return fig, return_data
    elif return_figure:
        return fig
    elif return_df:
        return return_data
    else:
        plt.show()
        plt.close()
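
A minimal usage sketch comparing two cohorts; the input DataFrames are hypothetical and assumed to contain the default arrival and departure columns:

# admitted_visits and discharged_visits are hypothetical DataFrames
fig, curve_dfs = plot_admission_time_survival_curve(
    [admitted_visits, discharged_visits],
    labels=["Admitted", "Discharged"],
    target_hours=[4, 12],
    return_figure=True,
    return_df=True,
)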

trial_results

Charts for hyperparameter optimisation trials.

This module provides tools to visualise the performance metrics of multiple hyperparameter tuning trials, highlighting the best trials for each metric.

Functions:

Name Description
plot_trial_results : function

Plot selected performance metrics for a list of hyperparameter trials.

plot_trial_results(trials_list, metrics=None, media_file_path=None, file_name=None, return_figure=False)

Plot selected performance metrics from hyperparameter trials as scatter plots.

This function visualizes the performance metrics of a series of hyperparameter trials. It creates scatter plots for each selected metric, with the best-performing trial highlighted and annotated with its hyperparameters.

Optionally, the plot can be saved to disk or returned as a figure object.

Parameters:

Name Type Description Default
trials_list List[HyperParameterTrial]

A list of HyperParameterTrial instances containing validation set results (not cross-validation fold results) and hyperparameter settings. Each trial's cv_results dictionary contains metrics such as 'valid_auc' and 'valid_logloss', which are computed on a held-out validation set for each hyperparameter configuration.

required
metrics List[str]

List of metric names to plot. If None, defaults to ["valid_auc", "valid_logloss"]. Each metric should be a key in the trial's cv_results dictionary.

None
media_file_path Path or None

Directory path where the generated plot image will be saved as "trial_results.png". If None, the plot is not saved.

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "trial_results.png".

None
return_figure bool

If True, the matplotlib figure is returned instead of being displayed directly. Default is False.

False

Returns:

Type Description
Figure or None

The matplotlib figure object if return_figure is True; otherwise, None.

Notes
  • Assumes that each HyperParameterTrial in trials_list has a cv_results dictionary containing the requested metrics, which are computed on the validation set.
  • Parameters from the best-performing trials are shown in the plots.
Source code in src/patientflow/viz/trial_results.py
def plot_trial_results(
    trials_list: List[HyperParameterTrial],
    metrics: Optional[List[str]] = None,
    media_file_path=None,
    file_name=None,
    return_figure=False,
):
    """
    Plot selected performance metrics from hyperparameter trials as scatter plots.

    This function visualizes the performance metrics of a series of hyperparameter trials.
    It creates scatter plots for each selected metric, with the best-performing trial
    highlighted and annotated with its hyperparameters.

    Optionally, the plot can be saved to disk or returned as a figure object.

    Parameters
    ----------
    trials_list : List[HyperParameterTrial]
        A list of `HyperParameterTrial` instances containing validation set results
        (not cross-validation fold results) and hyperparameter settings. Each trial's
        `cv_results` dictionary contains metrics such as 'valid_auc' and 'valid_logloss',
        which are computed on a held-out validation set for each hyperparameter configuration.
    metrics : List[str], optional
        List of metric names to plot. If None, defaults to ["valid_auc", "valid_logloss"].
        Each metric should be a key in the trial's cv_results dictionary.
    media_file_path : pathlib.Path or None, optional
        Directory path where the generated plot image will be saved as "trial_results.png".
        If None, the plot is not saved.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "trial_results.png".
    return_figure : bool, optional
        If True, the matplotlib figure is returned instead of being displayed directly.
        Default is False.

    Returns
    -------
    matplotlib.figure.Figure or None
        The matplotlib figure object if `return_figure` is True; otherwise, None.

    Notes
    -----
    - Assumes that each `HyperParameterTrial` in `trials_list` has a `cv_results` dictionary
      containing the requested metrics, which are computed on the validation set.
    - Parameters from the best-performing trials are shown in the plots.
    """
    # Set default metrics if none provided
    if metrics is None:
        metrics = ["valid_auc", "valid_logloss"]

    # Extract metrics from trials
    metric_values = {
        metric: [trial.cv_results.get(metric, 0) for trial in trials_list]
        for metric in metrics
    }

    # Create trial indices
    trial_indices = list(range(len(trials_list)))

    # Create figure with subplots
    n_metrics = len(metrics)
    fig, axes = plt.subplots(1, n_metrics, figsize=(7 * n_metrics, 6))
    if n_metrics == 1:
        axes = [axes]

    # Plot each metric
    for idx, (metric, values) in enumerate(metric_values.items()):
        ax = axes[idx]

        # Plot metric as dots
        ax.scatter(trial_indices, values, s=50, alpha=0.7)
        ax.set_xlabel("Trial Number")
        ax.set_ylabel(metric.replace("valid_", "").upper())
        ax.set_title(metric.replace("valid_", "").replace("_", " ").title())
        ax.grid(True, linestyle="--", alpha=0.7)

        # Set x-axis to display integers
        ax.set_xticks(trial_indices)
        ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: str(int(x))))

        # Identify the best trial: lowest value for loss metrics, highest otherwise
        if "loss" in metric.lower():
            best_idx = values.index(min(values))
        else:
            best_idx = values.index(max(values))
        ax.set_ylim(bottom=0, top=max(values) * 1.1)

        # Highlight best value
        highlight_color = "green" if "loss" not in metric.lower() else "darkred"
        ax.scatter(
            [best_idx],
            [values[best_idx]],
            color=highlight_color,
            s=150,
            edgecolor="black",
            zorder=5,
        )

        # Add annotation with best parameters
        best_trial = trials_list[best_idx]
        param_text = "\n".join([f"{k}: {v}" for k, v in best_trial.parameters.items()])
        best_value = values[best_idx]
        ax.text(
            0.05,
            0.05,
            f"Best {metric.replace('valid_', '').upper()}: {best_value:.4f}\n\nParameters:\n{param_text}",
            transform=ax.transAxes,
            bbox=dict(facecolor="white", alpha=0.7),
            fontsize=9,
        )

    # Add overall title
    fig.suptitle("Hyperparameter Trial Results", fontsize=14)

    # Adjust layout
    plt.tight_layout()

    if media_file_path:
        if file_name:
            plt.savefig(media_file_path / file_name, dpi=300)
        else:
            plt.savefig(media_file_path / "trial_results.png", dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close()
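
A minimal usage sketch; trials_list is assumed to be the output of a patientflow hyperparameter tuning loop, with each trial exposing cv_results and parameters as described above:

# trials_list: list of HyperParameterTrial (assumed to exist)
fig = plot_trial_results(
    trials_list,
    metrics=["valid_auc", "valid_logloss"],
    return_figure=True,
)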

utils

Utility functions for visualization and data formatting.

This module provides helper functions for cleaning and formatting data for visualization purposes, including filename cleaning and prediction time formatting.

Functions:

Name Description
clean_title_for_filename : function

Clean a title string to make it suitable for use in filenames

format_prediction_time : function

Format prediction time to 'HH:MM' format

clean_title_for_filename(title)

Clean a title string to make it suitable for use in filenames.

Parameters:

Name Type Description Default
title str

The title to clean.

required

Returns:

Type Description
str

The cleaned title, safe for use in filenames.

Source code in src/patientflow/viz/utils.py
def clean_title_for_filename(title):
    """Clean a title string to make it suitable for use in filenames.

    Parameters
    ----------
    title : str
        The title to clean.

    Returns
    -------
    str
        The cleaned title, safe for use in filenames.
    """
    replacements = {" ": "_", "%": "", "\n": "", ",": "", ".": ""}

    clean_title = title
    for old, new in replacements.items():
        clean_title = clean_title.replace(old, new)
    return clean_title
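
For example, applying the replacements above:

clean_title_for_filename("QQ plot, 75.5% coverage")
# 'QQ_plot_755_coverage'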

format_prediction_time(prediction_time)

Format prediction time to 'HH:MM' format.

Parameters:

Name Type Description Default
prediction_time str or tuple

Either: - A string in 'HHMM' format, possibly containing underscores - A tuple of (hour, minute)

required

Returns:

Type Description
str

Formatted time string in 'HH:MM' format.

Source code in src/patientflow/viz/utils.py
def format_prediction_time(prediction_time):
    """Format prediction time to 'HH:MM' format.

    Parameters
    ----------
    prediction_time : str or tuple
        Either:
            - A string in 'HHMM' format, possibly containing underscores
            - A tuple of (hour, minute)

    Returns
    -------
    str
        Formatted time string in 'HH:MM' format.
    """
    if isinstance(prediction_time, tuple):
        hour, minute = prediction_time
        return f"{hour:02d}:{minute:02d}"
    else:
        # Split the string by underscores and take the last element
        last_part = prediction_time.split("_")[-1]
        # Add a colon in the middle
        return f"{last_part[:2]}:{last_part[2:]}"
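
For example, both accepted input forms yield the same result:

format_prediction_time((9, 30))             # '09:30'
format_prediction_time("admissions_0930")   # '09:30'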