
API reference

PatientFlow: A package for predicting short-term hospital bed demand.

This package provides tools and models for analysing patient flow data and making predictions about emergency demand, elective demand, and hospital discharges.

aggregate

Aggregate Prediction From Patient-Level Probabilities

This submodule provides functions to aggregate patient-level predicted probabilities into a probability distribution. The module uses symbolic mathematics to generate and manipulate expressions, enabling the computation of aggregate probabilities based on individual patient-level predictions.

Functions:

Name Description
create_symbols : function

Generate a sequence of symbolic objects intended for use in mathematical expressions.

compute_core_expression : function

Compute a symbolic expression involving a basic mathematical operation with a symbol and a constant.

build_expression : function

Construct a cumulative product expression by combining individual symbolic expressions.

expression_subs : function

Substitute values into a symbolic expression based on a mapping from symbols to predictions.

return_coeff : function

Extract the coefficient of a specified power from an expanded symbolic expression.

model_input_to_pred_proba : function

Use a predictive model to convert model input data into predicted probabilities.

pred_proba_to_agg_predicted : function

Convert individual probability predictions into aggregate predicted probability distribution using optional weights.

get_prob_dist_for_prediction_moment : function

Calculate both predicted distributions and observed values for a given date using test data.

get_prob_dist : function

Calculate probability distributions for each snapshot date based on given model predictions.

get_prob_dist_using_survival_curve : function

Calculate probability distributions for each snapshot date using an EmpiricalIncomingAdmissionPredictor.
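
Taken together, these functions recover the distribution of the number of admissions among a set of patients from their individual probabilities, by reading off the coefficients of a probability generating function. A minimal sketch (the probabilities are illustrative):

>>> from patientflow.aggregate import (
...     create_symbols, build_expression, expression_subs, return_coeff,
... )
>>> predictions = [0.2, 0.5, 0.9]  # illustrative patient-level probabilities
>>> n = len(predictions)
>>> syms = create_symbols(n)                      # (r0, r1, r2)
>>> expr = build_expression(syms, n)              # product of (1 - r_i) + r_i*s
>>> expr = expression_subs(expr, n, predictions)  # substitute the probabilities
>>> dist = {k: return_coeff(expr, k) for k in range(n + 1)}  # P(exactly k admissions)

Expanding (0.8 + 0.2s)(0.5 + 0.5s)(0.1 + 0.9s) by hand gives masses 0.04, 0.41, 0.46 and 0.09 for k = 0, 1, 2, 3, which sum to 1. In practice pred_proba_to_agg_predicted wraps these steps, switching to a Normal approximation for large numbers of patients.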

build_expression(syms, n)

Construct a cumulative product expression by combining individual symbolic expressions.

Parameters:

syms : iterable
    Iterable containing symbols to use in the expressions. Required.
n : int
    The number of terms to include in the cumulative product. Required.

Returns:

Expr
    The cumulative product of the expressions.

Source code in src/patientflow/aggregate.py
def build_expression(syms, n):
    """
    Construct a cumulative product expression by combining individual symbolic expressions.

    Parameters
    ----------
    syms : iterable
        Iterable containing symbols to use in the expressions.
    n : int
        The number of terms to include in the cumulative product.

    Returns
    -------
    Expr
        The cumulative product of the expressions.

    """
    s = sym.Symbol("s")
    expression = 1
    for i in range(n):
        expression *= compute_core_expression(syms[i], s)
    return expression

compute_core_expression(ri, s)

Compute a symbolic expression involving a basic mathematical operation with a symbol and a constant.

Parameters:

ri : float
    The constant value to substitute into the expression. Required.
s : Symbol
    The symbolic object used in the expression. Required.

Returns:

Expr
    The symbolic expression after substitution.

Source code in src/patientflow/aggregate.py
def compute_core_expression(ri, s):
    """
    Compute a symbolic expression involving a basic mathematical operation with a symbol and a constant.

    Parameters
    ----------
    ri : float
        The constant value to substitute into the expression.
    s : Symbol
        The symbolic object used in the expression.

    Returns
    -------
    Expr
        The symbolic expression after substitution.

    """
    r = sym.Symbol("r")
    core_expression = (1 - r) + r * s
    return core_expression.subs({r: ri})

create_symbols(n)

Generate a sequence of symbolic objects intended for use in mathematical expressions.

Parameters:

n : int
    Number of symbols to create. Required.

Returns:

tuple
    A tuple containing the generated symbolic objects.

Source code in src/patientflow/aggregate.py
def create_symbols(n):
    """
    Generate a sequence of symbolic objects intended for use in mathematical expressions.

    Parameters
    ----------
    n : int
        Number of symbols to create.

    Returns
    -------
    tuple
        A tuple containing the generated symbolic objects.

    """
    return symbols(f"r0:{n}")

expression_subs(expression, n, predictions)

Substitute values into a symbolic expression based on a mapping from symbols to predictions.

Parameters:

expression : Expr
    The symbolic expression to perform substitution on. Required.
n : int
    Number of symbols and corresponding predictions. Required.
predictions : list
    List of numerical predictions to substitute. Required.

Returns:

Expr
    The expression after performing the substitution.

Source code in src/patientflow/aggregate.py
def expression_subs(expression, n, predictions):
    """
    Substitute values into a symbolic expression based on a mapping from symbols to predictions.

    Parameters
    ----------
    expression : Expr
        The symbolic expression to perform substitution on.
    n : int
        Number of symbols and corresponding predictions.
    predictions : list
        List of numerical predictions to substitute.

    Returns
    -------
    Expr
        The expression after performing the substitution.

    """
    syms = create_symbols(n)
    substitution = dict(zip(syms, predictions))
    return expression.subs(substitution)

get_prob_dist(snapshots_dict, X_test, y_test, model, weights=None, verbose=False, category_filter=None, normal_approx_threshold=30)

Calculate probability distributions for each snapshot date based on given model predictions.

Parameters:

snapshots_dict : dict
    A dictionary mapping snapshot dates to indices in X_test and y_test. Must have datetime.date objects as keys and lists of indices as values. Required.
X_test : DataFrame or array-like
    Input test data to be passed to the model. Required.
y_test : array-like
    Observed target values. Required.
model : object or TrainedClassifier
    Either a predictive model which provides a predict_proba method, or a TrainedClassifier object containing a pipeline. Required.
weights : pandas.Series, optional
    A Series containing weights for the test data points, which may influence the prediction. If provided, the weights should be indexed similarly to X_test and y_test. Default: None.
verbose : bool, optional
    If True, print progress information. Default: False.
category_filter : array-like, optional
    Boolean mask indicating which samples belong to the specific outcome category being analyzed. Should be the same length as y_test. Default: None.
normal_approx_threshold : int, optional
    If the number of rows in a snapshot exceeds this threshold, use a Normal distribution approximation. Set to None or a very large number to always use the exact symbolic computation. Default: 30.

Returns:

dict
    A dictionary mapping snapshot dates to probability distributions.

Raises:

ValueError
    If snapshots_dict is not properly formatted or empty.
    If model has no predict_proba method and is not a TrainedClassifier.

Source code in src/patientflow/aggregate.py
def get_prob_dist(
    snapshots_dict,
    X_test,
    y_test,
    model,
    weights=None,
    verbose=False,
    category_filter=None,
    normal_approx_threshold=30,
):
    """
    Calculate probability distributions for each snapshot date based on given model predictions.

    Parameters
    ----------
    snapshots_dict : dict
        A dictionary mapping snapshot dates to indices in `X_test` and `y_test`.
        Must have datetime.date objects as keys and lists of indices as values.
    X_test : DataFrame or array-like
        Input test data to be passed to the model.
    y_test : array-like
        Observed target values.
    model : object or TrainedClassifier
        Either a predictive model which provides a `predict_proba` method,
        or a TrainedClassifier object containing a pipeline.
    weights : pandas.Series, optional
        A Series containing weights for the test data points, which may influence the prediction,
        by default None. If provided, the weights should be indexed similarly to `X_test` and `y_test`.
    verbose : bool, optional (default=False)
        If True, print progress information.
    category_filter : array-like, optional
        Boolean mask indicating which samples belong to the specific outcome category being analyzed.
        Should be the same length as y_test.
    normal_approx_threshold : int, optional (default=30)
        If the number of rows in a snapshot exceeds this threshold, use a Normal distribution approximation.
        Set to None or a very large number to always use the exact symbolic computation.

    Returns
    -------
    dict
        A dictionary mapping snapshot dates to probability distributions.

    Raises
    ------
    ValueError
        If snapshots_dict is not properly formatted or empty.
        If model has no predict_proba method and is not a TrainedClassifier.
    """
    # Validate snapshots_dict format
    if not snapshots_dict:
        raise ValueError("snapshots_dict cannot be empty")

    for dt, indices in snapshots_dict.items():
        if not isinstance(dt, date):
            raise ValueError(
                f"snapshots_dict keys must be datetime.date objects, got {type(dt)}"
            )
        if not isinstance(indices, list):
            raise ValueError(
                f"snapshots_dict values must be lists, got {type(indices)}"
            )
        if indices and not all(isinstance(idx, int) for idx in indices):
            raise ValueError("All indices in snapshots_dict must be integers")

    # Extract pipeline if model is a TrainedClassifier
    if hasattr(model, "calibrated_pipeline") and model.calibrated_pipeline is not None:
        model = model.calibrated_pipeline
    elif hasattr(model, "pipeline"):
        model = model.pipeline
    # Validate that model has predict_proba method
    elif not hasattr(model, "predict_proba"):
        raise ValueError(
            "Model must either be a TrainedClassifier or have a predict_proba method"
        )

    prob_dist_dict = {}
    if verbose:
        print(
            f"Calculating probability distributions for {len(snapshots_dict)} snapshot dates"
        )

        if len(snapshots_dict) > 10:
            print("This may take a minute or more")

    # Initialize a counter for notifying the user every 10 snapshot dates processed
    count = 0

    for dt, snapshots_to_include in snapshots_dict.items():
        if len(snapshots_to_include) == 0:
            # Create an empty dictionary for the current snapshot date
            prob_dist_dict[dt] = {
                "agg_predicted": pd.DataFrame({"agg_proba": [1]}, index=[0]),
                "agg_observed": 0,
            }
        else:
            # Ensure the lengths of test features and outcomes are equal
            assert len(X_test.loc[snapshots_to_include]) == len(
                y_test.loc[snapshots_to_include]
            ), "Mismatch in lengths of X_test and y_test snapshots."

            if weights is None:
                prediction_moment_weights = None
            else:
                prediction_moment_weights = weights.loc[snapshots_to_include].values

            # Apply category filter
            if category_filter is None:
                prediction_moment_category_filter = None
            else:
                prediction_moment_category_filter = category_filter.loc[
                    snapshots_to_include
                ]

            # Pass the normal_approx_threshold to get_prob_dist_for_prediction_moment
            prob_dist_dict[dt] = get_prob_dist_for_prediction_moment(
                X_test=X_test.loc[snapshots_to_include],
                y_test=y_test.loc[snapshots_to_include],
                model=model,
                weights=prediction_moment_weights,
                category_filter=prediction_moment_category_filter,
                normal_approx_threshold=normal_approx_threshold,
            )

        # Increment the counter and notify the user every 10 snapshot dates processed
        count += 1
        if verbose and count % 10 == 0 and count != len(snapshots_dict):
            print(f"Processed {count} snapshot dates")

    if verbose:
        print(f"Processed {len(snapshots_dict)} snapshot dates")

    return prob_dist_dict
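
Example (a minimal sketch; the toy scikit-learn classifier, feature names and snapshot indices are illustrative stand-ins for a real trained pipeline):

>>> import pandas as pd
>>> from datetime import date
>>> from sklearn.linear_model import LogisticRegression
>>> from patientflow.aggregate import get_prob_dist
>>> X_test = pd.DataFrame({"age": [71, 34, 55, 62], "news_score": [5, 1, 3, 4]})
>>> y_test = pd.Series([1, 0, 1, 0])
>>> model = LogisticRegression().fit(X_test, y_test)  # stand-in for a trained model
>>> snapshots_dict = {date(2024, 1, 1): [0, 1], date(2024, 1, 2): [2, 3]}
>>> prob_dist = get_prob_dist(snapshots_dict, X_test, y_test, model)
>>> int(prob_dist[date(2024, 1, 1)]["agg_observed"])  # sum of y_test for that snapshot
1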

get_prob_dist_for_prediction_moment(X_test, model, weights=None, inference_time=False, y_test=None, category_filter=None, normal_approx_threshold=30)

Calculate both predicted distributions and observed values for a given date using test data.

Parameters:

X_test : array-like
    Test features for a specific snapshot date. Required.
model : object or TrainedClassifier
    Either a predictive model which provides a predict_proba method, or a TrainedClassifier object containing a pipeline. Required.
weights : array-like, optional
    Weights to apply to the predictions for aggregate calculation. Default: None.
inference_time : bool, optional
    If True, do not calculate or return the actual aggregate. Default: False.
y_test : array-like, optional
    Actual outcomes corresponding to the test features. Required if inference_time is False. Default: None.
category_filter : array-like, optional
    Boolean mask indicating which samples belong to the specific outcome category being analyzed. Should be the same length as y_test. Default: None.
normal_approx_threshold : int, optional
    If the number of rows in X_test exceeds this threshold, use a Normal distribution approximation. Set to None or a very large number to always use the exact symbolic computation. Default: 30.

Returns:

dict
    A dictionary with keys 'agg_predicted' and, if inference_time is False, 'agg_observed'.

Raises:

ValueError
    If y_test is not provided when inference_time is False.
    If model has no predict_proba method and is not a TrainedClassifier.

Source code in src/patientflow/aggregate.py
def get_prob_dist_for_prediction_moment(
    X_test,
    model,
    weights=None,
    inference_time=False,
    y_test=None,
    category_filter=None,
    normal_approx_threshold=30,
):
    """
    Calculate both predicted distributions and observed values for a given date using test data.

    Parameters
    ----------
    X_test : array-like
        Test features for a specific snapshot date.
    model : object or TrainedClassifier
        Either a predictive model which provides a `predict_proba` method,
        or a TrainedClassifier object containing a pipeline.
    weights : array-like, optional
        Weights to apply to the predictions for aggregate calculation.
    inference_time : bool, optional (default=False)
        If True, do not calculate or return actual aggregate.
    y_test : array-like, optional
        Actual outcomes corresponding to the test features. Required if inference_time is False.
    category_filter : array-like, optional
        Boolean mask indicating which samples belong to the specific outcome category being analyzed.
        Should be the same length as y_test.
    normal_approx_threshold : int, optional (default=30)
        If the number of rows in X_test exceeds this threshold, use a Normal distribution approximation.
        Set to None or a very large number to always use the exact symbolic computation.

    Returns
    -------
    dict
        A dictionary with keys 'agg_predicted' and, if inference_time is False, 'agg_observed'.

    Raises
    ------
    ValueError
        If y_test is not provided when inference_time is False.
        If model has no predict_proba method and is not a TrainedClassifier.
    """
    if not inference_time and y_test is None:
        raise ValueError("y_test must be provided if inference_time is False.")

    # Extract pipeline if model is a TrainedClassifier
    if hasattr(model, "calibrated_pipeline") and model.calibrated_pipeline is not None:
        model = model.calibrated_pipeline
    elif hasattr(model, "pipeline"):
        model = model.pipeline
    # Validate that model has predict_proba method
    elif not hasattr(model, "predict_proba"):
        raise ValueError(
            "Model must either be a TrainedClassifier or have a predict_proba method"
        )

    prediction_moment_dict = {}

    if len(X_test) > 0:
        pred_proba = model_input_to_pred_proba(X_test, model)
        agg_predicted = pred_proba_to_agg_predicted(
            pred_proba, weights, normal_approx_threshold
        )
        prediction_moment_dict["agg_predicted"] = agg_predicted

        if not inference_time:
            # Apply category filter when calculating observed sum
            if category_filter is None:
                prediction_moment_dict["agg_observed"] = sum(y_test)
            else:
                prediction_moment_dict["agg_observed"] = sum(y_test & category_filter)
    else:
        prediction_moment_dict["agg_predicted"] = pd.DataFrame(
            {"agg_proba": [1]}, index=[0]
        )
        if not inference_time:
            prediction_moment_dict["agg_observed"] = 0

    return prediction_moment_dict
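
Example (a minimal sketch with an illustrative one-feature model; at inference time only the predicted distribution is returned):

>>> import pandas as pd
>>> from sklearn.linear_model import LogisticRegression
>>> from patientflow.aggregate import get_prob_dist_for_prediction_moment
>>> X = pd.DataFrame({"triage_score": [1.0, 2.0, 3.0, 4.0]})  # illustrative feature
>>> model = LogisticRegression().fit(X, [0, 0, 1, 1])
>>> result = get_prob_dist_for_prediction_moment(X, model, inference_time=True)
>>> list(result.keys())  # no 'agg_observed' at inference time
['agg_predicted']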

get_prob_dist_using_survival_curve(snapshot_dates, test_visits, category, prediction_time, prediction_window, start_time_col, end_time_col, model, verbose=False)

Calculate probability distributions for each snapshot date using an EmpiricalIncomingAdmissionPredictor.

Parameters:

snapshot_dates : array-like
    Array of dates for which to calculate probability distributions. Required.
test_visits : pandas.DataFrame
    DataFrame containing test visit data. Must have either:
    - start_time_col as a column and end_time_col as a column, or
    - start_time_col as the index and end_time_col as a column
    Required.
category : str
    Category to use for predictions (e.g., 'medical', 'surgical'). Required.
prediction_time : tuple
    Tuple of (hour, minute) representing the time of day for predictions. Required.
prediction_window : timedelta
    The prediction window duration. Required.
start_time_col : str
    Name of the column containing start times (or index name if using index). Required.
end_time_col : str
    Name of the column containing end times. Required.
model : EmpiricalIncomingAdmissionPredictor
    A fitted instance of EmpiricalIncomingAdmissionPredictor. Required.
verbose : bool, optional
    If True, print progress information. Default: False.

Returns:

dict
    A dictionary mapping snapshot dates to probability distributions.

Raises:

ValueError
    If test_visits does not have the required columns or if model is not fitted.

Source code in src/patientflow/aggregate.py
def get_prob_dist_using_survival_curve(
    snapshot_dates: List[date],
    test_visits: pd.DataFrame,
    category: str,
    prediction_time: Tuple[int, int],
    prediction_window: timedelta,
    start_time_col: str,
    end_time_col: str,
    model: EmpiricalIncomingAdmissionPredictor,
    verbose=False,
):
    """
    Calculate probability distributions for each snapshot date using an EmpiricalIncomingAdmissionPredictor.

    Parameters
    ----------
    snapshot_dates : array-like
        Array of dates for which to calculate probability distributions.
    test_visits : pandas.DataFrame
        DataFrame containing test visit data. Must have either:
        - start_time_col as a column and end_time_col as a column, or
        - start_time_col as the index and end_time_col as a column
    category : str
        Category to use for predictions (e.g., 'medical', 'surgical')
    prediction_time : tuple
        Tuple of (hour, minute) representing the time of day for predictions
    prediction_window : timedelta
        The prediction window duration
    start_time_col : str
        Name of the column containing start times (or index name if using index)
    end_time_col : str
        Name of the column containing end times
    model : EmpiricalIncomingAdmissionPredictor
        A fitted instance of EmpiricalIncomingAdmissionPredictor
    verbose : bool, optional (default=False)
        If True, print progress information

    Returns
    -------
    dict
        A dictionary mapping snapshot dates to probability distributions.

    Raises
    ------
    ValueError
        If test_visits does not have the required columns or if model is not fitted.
    """

    # Validate test_visits has required columns
    if start_time_col in test_visits.columns:
        # start_time_col is a regular column
        if end_time_col not in test_visits.columns:
            raise ValueError(f"Column '{end_time_col}' not found in DataFrame")
    else:
        # Check if start_time_col is the index
        if test_visits.index.name != start_time_col:
            raise ValueError(
                f"'{start_time_col}' not found in DataFrame columns or index (index.name is '{test_visits.index.name}')"
            )
        if end_time_col not in test_visits.columns:
            raise ValueError(f"Column '{end_time_col}' not found in DataFrame")

    # Validate model is fitted
    if not hasattr(model, "survival_df") or model.survival_df is None:
        raise ValueError("Model must be fitted before calling get_prob_dist_empirical")

    prob_dist_dict = {}
    if verbose:
        print(
            f"Calculating probability distributions for {len(snapshot_dates)} snapshot dates"
        )

    # Create prediction context that will be the same for all dates
    prediction_context = {category: {"prediction_time": prediction_time}}

    for dt in snapshot_dates:
        # Create prediction moment by combining snapshot date and prediction time
        prediction_moment = datetime.combine(
            dt, time(prediction_time[0], prediction_time[1])
        )
        # Convert to UTC if the test_visits timestamps are timezone-aware
        if start_time_col in test_visits.columns:
            if test_visits[start_time_col].dt.tz is not None:
                prediction_moment = prediction_moment.replace(tzinfo=timezone.utc)
        else:
            if test_visits.index.tz is not None:
                prediction_moment = prediction_moment.replace(tzinfo=timezone.utc)

        # Get predictions from model
        predictions = model.predict(prediction_context)
        prob_dist_dict[dt] = {"agg_predicted": predictions[category]}

        # Calculate observed values
        if start_time_col in test_visits.columns:
            # start_time_col is a regular column
            mask = (test_visits[start_time_col] > prediction_moment) & (
                test_visits[end_time_col] <= prediction_moment + prediction_window
            )
        else:
            # start_time_col is the index
            mask = (test_visits.index > prediction_moment) & (
                test_visits[end_time_col] <= prediction_moment + prediction_window
            )
        nrow = mask.sum()
        prob_dist_dict[dt]["agg_observed"] = int(nrow) if nrow > 0 else 0

    if verbose:
        print(f"Processed {len(snapshot_dates)} snapshot dates")

    return prob_dist_dict

model_input_to_pred_proba(model_input, model)

Use a predictive model to convert model input data into predicted probabilities.

Parameters:

model_input : array-like
    The input data to the model, typically as features used for predictions. Required.
model : object
    A model object with a predict_proba method that computes probability estimates. Required.

Returns:

DataFrame
    A pandas DataFrame containing the predicted probabilities for the positive class, with one column labeled 'pred_proba'.

Source code in src/patientflow/aggregate.py
def model_input_to_pred_proba(model_input, model):
    """
    Use a predictive model to convert model input data into predicted probabilities.

    Parameters
    ----------
    model_input : array-like
        The input data to the model, typically as features used for predictions.
    model : object
        A model object with a `predict_proba` method that computes probability estimates.

    Returns
    -------
    DataFrame
        A pandas DataFrame containing the predicted probabilities for the positive class,
        with one column labeled 'pred_proba'.

    """
    if len(model_input) == 0:
        return pd.DataFrame(columns=["pred_proba"])
    else:
        predictions = model.predict_proba(model_input)[:, 1]
        return pd.DataFrame(
            predictions, index=model_input.index, columns=["pred_proba"]
        )
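
Example (a brief sketch with an illustrative one-feature model):

>>> import pandas as pd
>>> from sklearn.linear_model import LogisticRegression
>>> from patientflow.aggregate import model_input_to_pred_proba
>>> X = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]})
>>> model = LogisticRegression().fit(X, [0, 0, 1, 1])
>>> model_input_to_pred_proba(X, model).columns.tolist()
['pred_proba']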

pred_proba_to_agg_predicted(predictions_proba, weights=None, normal_approx_threshold=30)

Convert individual probability predictions into aggregate predicted probability distribution using optional weights. Uses a Normal approximation for large datasets (> normal_approx_threshold) for better performance.

Parameters:

predictions_proba : DataFrame
    A DataFrame containing the probability predictions; must have a single column named 'pred_proba'. Required.
weights : array-like, optional
    An array of weights, of the same length as the DataFrame rows, to apply to each prediction. Default: None.
normal_approx_threshold : int, optional
    If the number of rows in predictions_proba exceeds this threshold, use a Normal distribution approximation. Set to None or a very large number to always use the exact symbolic computation. Default: 30.

Returns:

DataFrame
    A DataFrame with a single column 'agg_proba' showing the aggregated probability, indexed from 0 to n, where n is the number of predictions.

Source code in src/patientflow/aggregate.py
def pred_proba_to_agg_predicted(
    predictions_proba, weights=None, normal_approx_threshold=30
):
    """
    Convert individual probability predictions into aggregate predicted probability distribution using optional weights.
    Uses a Normal approximation for large datasets (> normal_approx_threshold) for better performance.

    Parameters
    ----------
    predictions_proba : DataFrame
        A DataFrame containing the probability predictions; must have a single column named 'pred_proba'.
    weights : array-like, optional
        An array of weights, of the same length as the DataFrame rows, to apply to each prediction.
    normal_approx_threshold : int, optional (default=30)
        If the number of rows in predictions_proba exceeds this threshold, use a Normal distribution approximation.
        Set to None or a very large number to always use the exact symbolic computation.

    Returns
    -------
    DataFrame
        A DataFrame with a single column 'agg_proba' showing the aggregated probability,
        indexed from 0 to n, where n is the number of predictions.
    """
    n = len(predictions_proba)

    if n == 0:
        agg_predicted_dict = {0: 1}
    elif normal_approx_threshold is not None and n > normal_approx_threshold:
        # Apply a normal approximation for large datasets
        import numpy as np
        from scipy.stats import norm

        # Apply weights if provided
        if weights is not None:
            probs = predictions_proba["pred_proba"].values * weights
        else:
            probs = predictions_proba["pred_proba"].values

        # Calculate mean and variance for the normal approximation
        # For a sum of Bernoulli variables, mean = sum of probabilities
        mean = probs.sum()
        # Variance = sum of p_i * (1-p_i)
        variance = (probs * (1 - probs)).sum()

        # Handle the case where variance is zero (all probabilities are 0 or 1)
        if variance == 0:
            # If variance is zero, all probabilities are the same (either all 0 or all 1)
            # The distribution is deterministic - all probability mass is at the mean
            agg_predicted_dict = {int(round(mean)): 1.0}
        else:
            # Generate probabilities for each possible count using normal approximation
            counts = np.arange(n + 1)
            agg_predicted_dict = {}

            for i in counts:
                # Probability that count = i is the probability that a normal RV falls between i-0.5 and i+0.5
                if i == 0:
                    p = norm.cdf(0.5, loc=mean, scale=np.sqrt(variance))
                elif i == n:
                    p = 1 - norm.cdf(n - 0.5, loc=mean, scale=np.sqrt(variance))
                else:
                    p = norm.cdf(i + 0.5, loc=mean, scale=np.sqrt(variance)) - norm.cdf(
                        i - 0.5, loc=mean, scale=np.sqrt(variance)
                    )
                agg_predicted_dict[i] = p

            # Normalize to ensure the probabilities sum to 1
            total = sum(agg_predicted_dict.values())
            if total > 0:
                for i in agg_predicted_dict:
                    agg_predicted_dict[i] /= total
            else:
                # If all probabilities are zero, set a uniform distribution
                n = len(agg_predicted_dict)
                for i in agg_predicted_dict:
                    agg_predicted_dict[i] = 1.0 / n
    else:
        # Use the original symbolic computation for smaller datasets
        local_proba = predictions_proba.copy()
        if weights is not None:
            local_proba["pred_proba"] *= weights

        syms = create_symbols(n)
        expression = build_expression(syms, n)
        expression = expression_subs(expression, n, local_proba["pred_proba"])
        agg_predicted_dict = {i: return_coeff(expression, i) for i in range(n + 1)}

    agg_predicted = pd.DataFrame.from_dict(
        agg_predicted_dict, orient="index", columns=["agg_proba"]
    )
    return agg_predicted
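
Example (a minimal sketch; the probabilities are illustrative, and the expected masses in the comment are computed by hand from the product (0.8 + 0.2s)(0.5 + 0.5s)(0.1 + 0.9s)):

>>> import pandas as pd
>>> from patientflow.aggregate import pred_proba_to_agg_predicted
>>> pred_proba = pd.DataFrame({"pred_proba": [0.2, 0.5, 0.9]})
>>> agg = pred_proba_to_agg_predicted(pred_proba)  # n = 3 <= 30: exact symbolic route
>>> # agg["agg_proba"] holds P(0) = 0.04, P(1) = 0.41, P(2) = 0.46, P(3) = 0.09
>>> agg_approx = pred_proba_to_agg_predicted(pred_proba, normal_approx_threshold=2)
>>> # n = 3 > 2, so this call takes the Normal approximation route instead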

return_coeff(expression, i)

Extract the coefficient of a specified power from an expanded symbolic expression.

Parameters:

expression : Expr
    The expression to expand and extract from. Required.
i : int
    The power of the term whose coefficient is to be extracted. Required.

Returns:

number
    The coefficient of the specified power in the expression.

Source code in src/patientflow/aggregate.py
def return_coeff(expression, i):
    """
    Extract the coefficient of a specified power from an expanded symbolic expression.

    Parameters
    ----------
    expression : Expr
        The expression to expand and extract from.
    i : int
        The power of the term whose coefficient is to be extracted.

    Returns
    -------
    number
        The coefficient of the specified power in the expression.

    """
    s = sym.Symbol("s")
    return expand(expression).coeff(s, i)

calculate

Calculation module for patient flow metrics.

This module provides functions for calculating various patient flow metrics such as arrival rates and admission probabilities within prediction windows.

admission_in_prediction_window

This module provides functions to model and analyze a curve consisting of an exponential growth segment followed by an exponential decay segment. It includes functions to create the curve, calculate specific points on it, and evaluate probabilities based on its shape.

Its intended use is to derive the probability of a patient being admitted to a hospital within a certain elapsed time after their arrival in the Emergency Department (ED), given the hospital's aspirations for the time it takes patients to be admitted. For this purpose, two points on the curve are required as parameters:

* (x1, y1) : The target proportion of patients y1 (e.g. 76%) who have been admitted or discharged by time x1 (e.g. 4 hours).
* (x2, y2) : The time x2 by which all but a small proportion y2 of patients have been admitted.

It is assumed that y grows exponentially towards (x1, y1) for x < x1, and that at (x1, y1) the curve switches to exponential decay.

Functions:

Name Description
growth_curve : function

Calculate exponential growth at a point where x < x1.

decay_curve : function

Calculate exponential decay at a point where x >= x1.

create_curve : function

Generate a full curve with both growth and decay segments.

get_y_from_aspirational_curve : function

Read from the curve a value for y, the probability of being admitted, for a given moment x hours after arrival.

calculate_probability : function

Compute the probability of a patient being admitted by the end of a prediction window, given how much time has elapsed since their arrival.

get_survival_probability : function

Calculate the probability of a patient still being in the ED after a certain time using survival curve data.

calculate_probability(elapsed_los, prediction_window, x1, y1, x2, y2)

Calculates the probability of an admission occurring within a specified prediction window after the moment of prediction, based on the patient's elapsed time in the ED prior to the moment of prediction and the length of the window.

Parameters:

elapsed_los : timedelta
    The elapsed time since the patient arrived at the ED. Required.
prediction_window : timedelta
    The duration of the prediction window after the point of prediction, for which the probability is calculated. Required.
x1 : float
    The time target for the first key point on the curve. Required.
y1 : float
    The proportion target for the first key point (e.g., 76% of patients admitted by time x1). Required.
x2 : float
    The time target for the second key point on the curve. Required.
y2 : float
    The proportion target for the second key point (e.g., 99% of patients admitted by time x2). Required.

Returns:

float
    The probability of the event occurring within the given prediction window.

Edge Case Handling

When elapsed_los is extremely high, such as values significantly greater than x2, the admission probability prior to the current time (prob_admission_prior_to_now) can reach 1.0 despite the curve being asymptotic. This scenario can cause computational errors when calculating the conditional probability, as it involves a division by zero. In such cases, this function directly returns a probability of 1.0, reflecting certainty of admission.

Example

Calculate the probability that a patient, who has already been in the ED for 3 hours, will be admitted in the next 2 hours. The ED targets that 76% of patients are admitted or discharged within 4 hours, and 99% within 12 hours.

>>> from datetime import timedelta
>>> calculate_probability(timedelta(hours=3), timedelta(hours=2), 4, 0.76, 12, 0.99)

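In terms of the aspirational curve F, the returned value is the conditional probability (F(t + w) - F(t)) / (1 - F(t)), where t is the elapsed time in hours and w is the prediction window. In the example above, t = 3 and w = 2, so the function reads F(3) and F(5) off the curve and conditions on the patient not yet having been admitted at t = 3.
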
Source code in src/patientflow/calculate/admission_in_prediction_window.py
def calculate_probability(
    elapsed_los: timedelta,
    prediction_window: timedelta,
    x1: float,
    y1: float,
    x2: float,
    y2: float,
):
    """
    Calculates the probability of an admission occurring within a specified prediction window after the moment of prediction, based on the patient's elapsed time in the ED prior to the moment of prediction and the length of the window

    Parameters
    ----------
    elapsed_los : timedelta
        The elapsed time since the patient arrived at the ED.
    prediction_window : timedelta
        The duration of the prediction window after the point of prediction, for which the probability is calculated.
    x1 : float
        The time target for the first key point on the curve.
    y1 : float
        The proportion target for the first key point (e.g., 76% of patients admitted by time x1).
    x2 : float
        The time target for the second key point on the curve.
    y2 : float
        The proportion target for the second key point (e.g., 99% of patients admitted by time x2).

    Returns
    -------
    float
        The probability of the event occurring within the given prediction window.

    Edge Case Handling
    ------------------
    When elapsed_los is extremely high, such as values significantly greater than x2, the admission probability prior to the current time (`prob_admission_prior_to_now`) can reach 1.0 despite the curve being asymptotic. This scenario can cause computational errors when calculating the conditional probability, as it involves a division by zero. In such cases, this function directly returns a probability of 1.0, reflecting certainty of admission.

    Example
    -------
    Calculate the probability that a patient, who has already been in the ED for 3 hours, will be admitted in the next 2 hours. The ED targets that 76% of patients are admitted or discharged within 4 hours, and 99% within 12 hours.

    >>> from datetime import timedelta
    >>> calculate_probability(timedelta(hours=3), timedelta(hours=2), 4, 0.76, 12, 0.99)

    """
    # Validate inputs
    if not isinstance(elapsed_los, timedelta):
        raise TypeError("elapsed_los must be a timedelta object")
    if not isinstance(prediction_window, timedelta):
        raise TypeError("prediction_window must be a timedelta object")

    # Convert timedelta to hours
    elapsed_hours = elapsed_los.total_seconds() / 3600
    prediction_window_hours = prediction_window.total_seconds() / 3600

    # Validate elapsed time to ensure it represents a reasonable time value in hours
    if elapsed_hours < 0:
        raise ValueError(
            "elapsed_los must be non-negative (cannot have negative elapsed time)"
        )

    if elapsed_hours > 168:  # 168 hours = 1 week
        warnings.warn(
            "elapsed_los appears to be longer than 168 hours (1 week). "
            "Check that the units of elapsed_los are correct"
        )

    if not np.isfinite(elapsed_hours):
        raise ValueError("elapsed_los must be a finite time duration")

    # Validate prediction window to ensure it represents a reasonable time value in hours
    if prediction_window_hours < 0:
        raise ValueError(
            "prediction_window must be non-negative (cannot have negative prediction window)"
        )

    if prediction_window_hours > 72:  # 72 hours = 3 days
        warnings.warn(
            "prediction_window appears to be longer than 72 hours (3 days). "
            "Check that the units of prediction_window are correct"
        )

    if not np.isfinite(prediction_window_hours):
        raise ValueError("prediction_window must be a finite time duration")

    # probability of still being in the ED now (a function of elapsed time since arrival)
    prob_admission_prior_to_now = get_y_from_aspirational_curve(
        elapsed_hours, x1, y1, x2, y2
    )

    # prob admission when adding the prediction window added to elapsed time since arrival
    prob_admission_by_end_of_window = get_y_from_aspirational_curve(
        elapsed_hours + prediction_window_hours, x1, y1, x2, y2
    )

    # Direct return for edge cases where `prob_admission_prior_to_now` reaches 1.0
    if prob_admission_prior_to_now == 1:
        return 1.0

    # Calculate the conditional probability of admission within the prediction window
    # given that the patient hasn't been admitted yet
    conditional_prob = (
        prob_admission_by_end_of_window - prob_admission_prior_to_now
    ) / (1 - prob_admission_prior_to_now)

    return conditional_prob

create_curve(x1, y1, x2, y2, a=0.01, generate_values=False)

Generates parameters for an exponential growth and decay curve. Optionally generates x-values and corresponding y-values across a default or specified range.

Parameters:

x1 : float
    The x-value where the curve transitions from growth to decay. Required.
y1 : float
    The y-value at the transition point x1. Required.
x2 : float
    The x-value defining the end of the decay curve for calculation purposes. Required.
y2 : float
    The y-value at x2, intended to fine-tune the decay rate. Required.
a : float, optional
    The initial value coefficient for the growth curve. Default: 0.01.
generate_values : bool, optional
    Flag to determine whether to generate x-values and y-values for visualization purposes. Default: False.

Returns:

tuple
    If generate_values is False, returns (gamma, lamda, a).
    If generate_values is True, returns (gamma, lamda, a, x_values, y_values).

Source code in src/patientflow/calculate/admission_in_prediction_window.py
def create_curve(x1, y1, x2, y2, a=0.01, generate_values=False):
    """
    Generates parameters for an exponential growth and decay curve.
    Optionally generates x-values and corresponding y-values across a default or specified range.

    Parameters
    ----------
    x1 : float
        The x-value where the curve transitions from growth to decay.
    y1 : float
        The y-value at the transition point x1.
    x2 : float
        The x-value defining the end of the decay curve for calculation purposes.
    y2 : float
        The y-value at x2, intended to fine-tune the decay rate.
    a : float, optional
        The initial value coefficient for the growth curve, defaults to 0.01.
    generate_values : bool, optional
        Flag to determine whether to generate x-values and y-values for visualization purposes.

    Returns
    -------
    tuple
        If generate_values is False, returns (gamma, lamda, a).
        If generate_values is True, returns (gamma, lamda, a, x_values, y_values).

    """
    # Validate inputs
    if not (x1 < x2):
        raise ValueError("x1 must be less than x2")
    if not (0 < y1 < y2 < 1):
        raise ValueError("y1 must be less than y2, and both must be between 0 and 1")

    # Constants for growth and decay
    gamma = np.log(y1 / a) / x1
    lamda = np.log((1 - y1) / (1 - y2)) / (x2 - x1)

    if generate_values:
        x_values = np.linspace(0, 20, 200)
        y_values = [
            (growth_curve(x, a, gamma) if x <= x1 else decay_curve(x, x1, y1, lamda))
            for x in x_values
        ]
        return gamma, lamda, a, x_values, y_values

    return gamma, lamda, a
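
Example (using the 4-hour/76% and 12-hour/99% targets quoted elsewhere on this page; the values in the comment are hand-computed approximations):

>>> from patientflow.calculate.admission_in_prediction_window import create_curve
>>> gamma, lamda, a = create_curve(4, 0.76, 12, 0.99)
>>> # gamma = ln(0.76/0.01)/4 ≈ 1.083, lamda = ln(0.24/0.01)/8 ≈ 0.397, a = 0.01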

decay_curve(x, x1, y1, lamda)

Calculate the exponential decay value at a given x using specified parameters. The function supports both scalar and array inputs for x.

Parameters:

x : float or np.ndarray
    The x-value(s) at which to evaluate the curve. Required.
x1 : float
    The x-value where the growth curve transitions to the decay curve. Required.
y1 : float
    The y-value at the transition point, where the decay curve starts. Required.
lamda : float
    The decay rate coefficient. Required.

Returns:

float or np.ndarray
    The y-value(s) of the decay curve at x.

Source code in src/patientflow/calculate/admission_in_prediction_window.py
def decay_curve(x, x1, y1, lamda):
    """
    Calculate the exponential decay value at a given x using specified parameters.
    The function supports both scalar and array inputs for x.

    Parameters
    ----------
    x : float or np.ndarray
        The x-value(s) at which to evaluate the curve.
    x1 : float
        The x-value where the growth curve transitions to the decay curve.
    y1 : float
        The y-value at the transition point, where the decay curve starts.
    lamda : float
        The decay rate coefficient.

    Returns
    -------
    float or np.ndarray
        The y-value(s) of the decay curve at x.

    """
    return y1 + (1 - y1) * (1 - np.exp(-lamda * (x - x1)))

get_survival_probability(survival_df, time_hours)

Calculate the probability of a patient still being in the ED after a specified time using survival curve data.

Parameters:

survival_df : pandas.DataFrame
    DataFrame containing survival curve data with columns:
    - time_hours: Time points in hours
    - survival_probability: Probability of still being in ED at each time point
    Required.
time_hours : float
    The time point (in hours) at which to calculate the survival probability. Required.

Returns:

float
    The probability of still being in the ED at the specified time.

Notes
  • If the exact time_hours is not in the survival curve data, the function will interpolate between the nearest time points
  • If time_hours is less than the minimum time in the data, returns 1.0
  • If time_hours is greater than the maximum time in the data, returns the last known survival probability

Examples:

>>> survival_df = pd.DataFrame({
...     'time_hours': [0, 2, 4, 6],
...     'survival_probability': [1.0, 0.8, 0.5, 0.2]
... })
>>> get_survival_probability(survival_df, 3.5)
0.575  # interpolated between 0.8 (at 2 hours) and 0.5 (at 4 hours)
Source code in src/patientflow/calculate/admission_in_prediction_window.py
def get_survival_probability(survival_df, time_hours):
    """
    Calculate the probability of a patient still being in the ED after a specified time
    using survival curve data.

    Parameters
    ----------
    survival_df : pandas.DataFrame
        DataFrame containing survival curve data with columns:
        - time_hours: Time points in hours
        - survival_probability: Probability of still being in ED at each time point
    time_hours : float
        The time point (in hours) at which to calculate the survival probability

    Returns
    -------
    float
        The probability of still being in the ED at the specified time

    Notes
    -----
    - If the exact time_hours is not in the survival curve data, the function will
      interpolate between the nearest time points
    - If time_hours is less than the minimum time in the data, returns 1.0
    - If time_hours is greater than the maximum time in the data, returns the last
      known survival probability

    Examples
    --------
    >>> survival_df = pd.DataFrame({
    ...     'time_hours': [0, 2, 4, 6],
    ...     'survival_probability': [1.0, 0.8, 0.5, 0.2]
    ... })
    >>> get_survival_probability(survival_df, 3.5)
    0.575  # interpolated between 0.8 (at 2 hours) and 0.5 (at 4 hours)
    """
    if time_hours < survival_df["time_hours"].min():
        return 1.0

    if time_hours > survival_df["time_hours"].max():
        return survival_df["survival_probability"].iloc[-1]

    # Find the closest time points for interpolation
    lower_idx = survival_df["time_hours"].searchsorted(time_hours, side="right") - 1
    upper_idx = lower_idx + 1

    if lower_idx < 0:
        return 1.0

    if upper_idx >= len(survival_df):
        return survival_df["survival_probability"].iloc[-1]

    # Get the surrounding points
    t1 = survival_df["time_hours"].iloc[lower_idx]
    t2 = survival_df["time_hours"].iloc[upper_idx]
    p1 = survival_df["survival_probability"].iloc[lower_idx]
    p2 = survival_df["survival_probability"].iloc[upper_idx]

    # Linear interpolation
    return p1 + (p2 - p1) * (time_hours - t1) / (t2 - t1)

get_y_from_aspirational_curve(x, x1, y1, x2, y2)

Calculate the probability y that a patient will have been admitted by a specified time x after their arrival, by reading from the aspirational curve that is constrained to pass through the points (x1, y1) and (x2, y2), with exponential growth where x < x1 and exponential decay where x >= x1.

The function handles scalar or array inputs for x and determines y using either an exponential growth curve (for x < x1) or an exponential decay curve (for x >= x1). The curve parameters are derived to ensure the curve passes through specified points (x1, y1) and (x2, y2).

Parameters:

x : float or np.ndarray
    The x-coordinate(s) at which to calculate the y-value on the curve. Can be a single value or an array of values. Required.
x1 : float
    The x-coordinate of the first key point on the curve, where the growth phase ends and the decay phase begins. Required.
y1 : float
    The y-coordinate of the first key point (x1), representing the target proportion of patients admitted by time x1. Required.
x2 : float
    The x-coordinate of the second key point on the curve, beyond which all but a few patients are expected to be admitted. Required.
y2 : float
    The y-coordinate of the second key point (x2), representing the target proportion of patients admitted by time x2. Required.

Returns:

float or np.ndarray
    The calculated y-value(s) (probability of admission) at the given x. The type of the return matches the input type for x (either scalar or array).

Source code in src/patientflow/calculate/admission_in_prediction_window.py
def get_y_from_aspirational_curve(x, x1, y1, x2, y2):
    """
    Calculate the probability y that a patient will have been admitted by a specified time x after their arrival, by reading from the aspirational curve that is constrained to pass through the points (x1, y1) and (x2, y2), with exponential growth where x < x1 and exponential decay where x >= x1.

    The function handles scalar or array inputs for x and determines y using either an exponential growth curve (for x < x1)
    or an exponential decay curve (for x >= x1). The curve parameters are derived to ensure the curve passes through
    specified points (x1, y1) and (x2, y2).

    Parameters
    ----------
    x : float or np.ndarray
        The x-coordinate(s) at which to calculate the y-value on the curve. Can be a single value or an array of values.
    x1 : float
        The x-coordinate of the first key point on the curve, where the growth phase ends and the decay phase begins.
    y1 : float
        The y-coordinate of the first key point (x1), representing the target proportion of patients admitted by time x1.
    x2 : float
        The x-coordinate of the second key point on the curve, beyond which all but a few patients are expected to be admitted.
    y2 : float
        The y-coordinate of the second key point (x2), representing the target proportion of patients admitted by time x2.

    Returns
    -------
    float or np.ndarray
        The calculated y-value(s) (probability of admission) at the given x. The type of the return matches the input type
        for x (either scalar or array).

    """
    gamma, lamda, a = create_curve(x1, y1, x2, y2)
    y = np.where(x < x1, growth_curve(x, a, gamma), decay_curve(x, x1, y1, lamda))
    return y
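
Example (same targets as above; the values in the comments are approximate):

>>> from patientflow.calculate.admission_in_prediction_window import (
...     get_y_from_aspirational_curve,
... )
>>> y_growth = get_y_from_aspirational_curve(3, 4, 0.76, 12, 0.99)  # ≈ 0.26 (growth segment)
>>> y_decay = get_y_from_aspirational_curve(6, 4, 0.76, 12, 0.99)   # ≈ 0.89 (decay segment)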

growth_curve(x, a, gamma)

Calculate the exponential growth value at a given x using specified parameters. The function supports both scalar and array inputs for x.

Parameters:

x : float or np.ndarray
    The x-value(s) at which to evaluate the curve. Required.
a : float
    The coefficient that defines the starting point of the growth curve when x is 0. Required.
gamma : float
    The growth rate coefficient of the curve. Required.

Returns:

float or np.ndarray
    The y-value(s) of the growth curve at x.

Source code in src/patientflow/calculate/admission_in_prediction_window.py
def growth_curve(x, a, gamma):
    """
    Calculate the exponential growth value at a given x using specified parameters.
    The function supports both scalar and array inputs for x.

    Parameters
    ----------
    x : float or np.ndarray
        The x-value(s) at which to evaluate the curve.
    a : float
        The coefficient that defines the starting point of the growth curve when x is 0.
    gamma : float
        The growth rate coefficient of the curve.

    Returns
    -------
    float or np.ndarray
        The y-value(s) of the growth curve at x.

    """
    return a * np.exp(x * gamma)

arrival_rates

Calculate and process time-varying arrival rates and admission probabilities.

This module provides functions for calculating arrival rates, admission probabilities, and unfettered demand rates for inpatient arrivals using an aspirational approach.

Functions:

Name Description
time_varying_arrival_rates : function

Calculate arrival rates for each time interval across the dataset's date range.

time_varying_arrival_rates_lagged : function

Create lagged arrival rates based on time intervals.

admission_probabilities : function

Compute cumulative and hourly admission probabilities using aspirational curves.

weighted_arrival_rates : function

Aggregate weighted arrival rates for specific time intervals.

unfettered_demand_by_hour : function

Estimate inpatient demand by hour using historical data and aspirational curves.

count_yet_to_arrive : function

Count patients who arrived after prediction times and were admitted within prediction windows.

Notes
  • All times are handled in local timezone
  • Arrival rates are normalized by the number of unique days in the dataset
  • Demand calculations consider both historical patterns and admission probabilities
  • Time intervals must divide evenly into 24 hours
  • Aspirational curves use (x1,y1) and (x2,y2) coordinates to model admission probabilities

Examples:

>>> # Generate random arrival times over a week
>>> np.random.seed(42)  # For reproducibility
>>> n_arrivals = 1000
>>> random_times = [
...     pd.Timestamp('2024-01-01') +
...     pd.Timedelta(days=np.random.randint(0, 7)) +
...     pd.Timedelta(hours=np.random.randint(0, 24)) +
...     pd.Timedelta(minutes=np.random.randint(0, 60))
...     for _ in range(n_arrivals)
... ]
>>> df = pd.DataFrame(index=sorted(random_times))
>>>
>>> # Calculate various rates and demand
>>> rates = time_varying_arrival_rates(df, yta_time_interval=60)
>>> lagged_rates = time_varying_arrival_rates_lagged(df, lagged_by=4)
>>> demand = unfettered_demand_by_hour(df, x1=4, y1=0.8, x2=8, y2=0.95)

admission_probabilities(hours_since_arrival, x1, y1, x2, y2)

Calculate probability of admission for each hour since arrival.

Parameters:

hours_since_arrival : np.ndarray
    Array of hours since arrival. Required.
x1 : float
    First x-coordinate of the aspirational curve. Required.
y1 : float
    First y-coordinate of the aspirational curve. Required.
x2 : float
    Second x-coordinate of the aspirational curve. Required.
y2 : float
    Second y-coordinate of the aspirational curve. Required.

Returns:

Tuple[np.ndarray, np.ndarray]
    A tuple containing:
    - np.ndarray: Cumulative admission probabilities
    - np.ndarray: Hourly admission probabilities

Notes

The aspirational curve is defined by two points (x1,y1) and (x2,y2) and is used to model the probability of admission over time.

Source code in src/patientflow/calculate/arrival_rates.py
def admission_probabilities(
    hours_since_arrival: np.ndarray, x1: float, y1: float, x2: float, y2: float
) -> Tuple[np.ndarray, np.ndarray]:
    """Calculate probability of admission for each hour since arrival.

    Parameters
    ----------
    hours_since_arrival : np.ndarray
        Array of hours since arrival.
    x1 : float
        First x-coordinate of the aspirational curve.
    y1 : float
        First y-coordinate of the aspirational curve.
    x2 : float
        Second x-coordinate of the aspirational curve.
    y2 : float
        Second y-coordinate of the aspirational curve.

    Returns
    -------
    Tuple[np.ndarray, np.ndarray]
        A tuple containing:
        - np.ndarray: Cumulative admission probabilities
        - np.ndarray: Hourly admission probabilities

    Notes
    -----
    The aspirational curve is defined by two points (x1,y1) and (x2,y2) and is used
    to model the probability of admission over time.
    """
    prob_admission_by_hour = np.array(
        [
            get_y_from_aspirational_curve(hour, x1, y1, x2, y2)
            for hour in hours_since_arrival
        ]
    )
    prob_admission_within_hour = np.diff(prob_admission_by_hour)

    return prob_admission_by_hour, prob_admission_within_hour
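
A minimal sketch, reusing the aspirational coordinates from the module examples above (x1=4, y1=0.8, x2=8, y2=0.95) and assuming the function is imported from patientflow.calculate.arrival_rates:

>>> import numpy as np
>>> hours = np.arange(11)  # 0..10 hours since arrival
>>> cumulative, hourly = admission_probabilities(hours, x1=4, y1=0.8, x2=8, y2=0.95)
>>> # cumulative has one entry per hour; hourly is its first difference,
>>> # so len(hourly) == len(cumulative) - 1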

count_yet_to_arrive(df, snapshot_dates, prediction_times, prediction_window_hours)

Count patients who arrived after prediction times and were admitted within prediction windows.

This function counts patients who arrived after specified prediction times and were admitted to a ward within the specified prediction window for each combination of snapshot date and prediction time.

Parameters:

df : DataFrame, required
    A DataFrame containing patient data with 'arrival_datetime', 'admitted_to_ward_datetime', and 'patient_id' columns.
snapshot_dates : list, required
    List of dates (datetime.date objects) to analyze.
prediction_times : list, required
    List of (hour, minute) tuples representing prediction times.
prediction_window_hours : float, required
    Length of prediction window in hours after the prediction time.

Returns:

DataFrame
    DataFrame with columns:
    - 'snapshot_date': The date of the snapshot
    - 'prediction_time': Tuple of (hour, minute) for the prediction time
    - 'count': Number of unique patients who arrived after the prediction time and were admitted within the prediction window

Raises:

TypeError
    If df is not a DataFrame or if required columns are missing.
ValueError
    If prediction_window_hours is not positive.

Notes

This function is useful for analyzing historical patterns of patient arrivals and admissions to inform predictive models for emergency department demand. Only patients with non-null admitted_to_ward_datetime are counted.

Examples:

>>> import pandas as pd
>>> from datetime import date, time
>>> prediction_times = [(12, 0), (15, 30)]
>>> snapshot_dates = [date(2023, 1, 1), date(2023, 1, 2)]
>>> results = count_yet_to_arrive(df, snapshot_dates, prediction_times, 8.0)
Source code in src/patientflow/calculate/arrival_rates.py
def count_yet_to_arrive(
    df: DataFrame,
    snapshot_dates: List,
    prediction_times: List,
    prediction_window_hours: float,
) -> DataFrame:
    """Count patients who arrived after prediction times and were admitted within prediction windows.

    This function counts patients who arrived after specified prediction times and were
    admitted to a ward within the specified prediction window for each combination of
    snapshot date and prediction time.

    Parameters
    ----------
    df : pandas.DataFrame
        A DataFrame containing patient data with 'arrival_datetime',
        'admitted_to_ward_datetime', and 'patient_id' columns.
    snapshot_dates : list
        List of dates (datetime.date objects) to analyze.
    prediction_times : list
        List of (hour, minute) tuples representing prediction times.
    prediction_window_hours : float
        Length of prediction window in hours after the prediction time.

    Returns
    -------
    pandas.DataFrame
        DataFrame with columns:
        - 'snapshot_date': The date of the snapshot
        - 'prediction_time': Tuple of (hour, minute) for the prediction time
        - 'count': Number of unique patients who arrived after prediction time
                  and were admitted within the prediction window

    Raises
    ------
    TypeError
        If df is not a DataFrame or if required columns are missing.
    ValueError
        If prediction_window_hours is not positive.

    Notes
    -----
    This function is useful for analyzing historical patterns of patient arrivals
    and admissions to inform predictive models for emergency department demand.
    Only patients with non-null admitted_to_ward_datetime are counted.

    Examples
    --------
    >>> import pandas as pd
    >>> from datetime import date, time
    >>> prediction_times = [(12, 0), (15, 30)]
    >>> snapshot_dates = [date(2023, 1, 1), date(2023, 1, 2)]
    >>> results = count_yet_to_arrive(df, snapshot_dates, prediction_times, 8.0)
    """
    # Input validation
    if not isinstance(df, DataFrame):
        raise TypeError("The input 'df' must be a pandas DataFrame.")

    required_columns = ["arrival_datetime", "admitted_to_ward_datetime", "patient_id"]
    missing_columns = [col for col in required_columns if col not in df.columns]
    if missing_columns:
        raise TypeError(f"DataFrame missing required columns: {missing_columns}")

    if (
        not isinstance(prediction_window_hours, (int, float))
        or prediction_window_hours <= 0
    ):
        raise ValueError("prediction_window_hours must be a positive number.")

    # Create an empty list to store results
    results = []

    # For each combination of date and time
    for date_val in snapshot_dates:
        for hour, minute in prediction_times:
            # Create the prediction datetime
            prediction_datetime = pd.Timestamp(
                datetime.combine(date_val, time(hour=hour, minute=minute))
            )

            # Calculate the end of the prediction window
            prediction_window_end = prediction_datetime + pd.Timedelta(
                hours=prediction_window_hours
            )

            # Count patients who arrived after prediction time and were admitted within the window
            admitted_within_window = len(
                df[
                    (df["arrival_datetime"] > prediction_datetime)
                    & (df["admitted_to_ward_datetime"] <= prediction_window_end)
                ]
            )

            # Store the result
            results.append(
                {
                    "snapshot_date": date_val,
                    "prediction_time": (hour, minute),
                    "count": admitted_within_window,
                }
            )

    # Convert results to a DataFrame
    results_df = pd.DataFrame(results)

    return results_df
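
A fully self-contained sketch with a hypothetical two-patient dataset (values illustrative):

>>> import pandas as pd
>>> from datetime import date
>>> df = pd.DataFrame({
...     "patient_id": [1, 2],
...     "arrival_datetime": pd.to_datetime(["2023-01-01 13:00", "2023-01-01 22:00"]),
...     "admitted_to_ward_datetime": pd.to_datetime(["2023-01-01 18:00", "2023-01-02 04:00"]),
... })
>>> counts = count_yet_to_arrive(df, [date(2023, 1, 1)], [(12, 0)], 8.0)
>>> # Only patient 1 arrived after 12:00 and was admitted by 20:00,
>>> # so the single result row has count == 1.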

process_arrival_rates(arrival_rates_dict)

Process arrival rates dictionary into formats needed for plotting.

Parameters:

arrival_rates_dict : Dict[datetime.time, float], required
    Mapping of times to arrival rates.

Returns:

Tuple[List[float], List[str], List[int]]
    A tuple containing:
    - List[float]: Arrival rate values
    - List[str]: Formatted hour range strings with embedded line breaks (e.g., "09-\n10")
    - List[int]: Integers for x-axis positioning

Notes

The hour labels are formatted with line breaks for better plot readability.
Source code in src/patientflow/calculate/arrival_rates.py
def process_arrival_rates(
    arrival_rates_dict: Dict[time, float],
) -> Tuple[List[float], List[str], List[int]]:
    """Process arrival rates dictionary into formats needed for plotting.

    Parameters
    ----------
    arrival_rates_dict : Dict[datetime.time, float]
        Mapping of times to arrival rates.

    Returns
    -------
    Tuple[List[float], List[str], List[int]]
        A tuple containing:
        - List[float]: Arrival rate values
        - List[str]: Formatted hour range strings (e.g., "09-\n10")
        - List[int]: Integers for x-axis positioning

    Notes
    -----
    The hour labels are formatted with line breaks for better plot readability.
    """
    # Extract hours and rates
    hours = list(arrival_rates_dict.keys())
    arrival_rates = list(arrival_rates_dict.values())

    # Create formatted hour labels with line breaks for better plot readability
    hour_labels = [
        f'{hour.strftime("%H")}-\n{str((hour.hour + 1) % 24).zfill(2)}'
        for hour in hours
    ]

    # Generate numerical values for x-axis positioning
    hour_values = list(range(len(hour_labels)))

    return arrival_rates, hour_labels, hour_values
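
A minimal sketch with hypothetical rates:

>>> from datetime import time
>>> rates = {time(9, 0): 3.2, time(10, 0): 4.1}
>>> values, labels, positions = process_arrival_rates(rates)
>>> # values == [3.2, 4.1]; labels == ['09-\n10', '10-\n11']; positions == [0, 1]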

time_varying_arrival_rates(df, yta_time_interval, num_days=None, verbose=False)

Calculate the time-varying arrival rates for a dataset indexed by datetime.

This function computes the arrival rates for each time interval specified, across the entire date range present in the dataframe. The arrival rate is calculated as the number of entries in the dataframe for each time interval, divided by the number of days in the dataset's timespan.

Parameters:

df : DataFrame, required
    A DataFrame indexed by datetime, representing the data for which arrival rates are to be calculated. The index of the DataFrame should be of datetime type.
yta_time_interval : int or timedelta, required
    The time interval for which the arrival rates are to be calculated. If int, assumed to be in minutes. If timedelta, will be converted to minutes. For example, if yta_time_interval=60, the function will calculate hourly arrival rates.
num_days : int, default None
    The number of days that the DataFrame spans. If not provided, the number of days is calculated from the date of the min and max arrival datetimes.
verbose : bool, default False
    If True, enable info-level logging.

Returns:

OrderedDict[time, float]
    A dictionary mapping times to arrival rates, where times are datetime.time objects and rates are float values.

Raises:

TypeError
    If 'df' is not a pandas DataFrame, 'yta_time_interval' is not an integer or timedelta, or the DataFrame index is not a DatetimeIndex.
ValueError
    If 'yta_time_interval' is less than or equal to 0 or does not divide evenly into 24 hours.

Notes

The minimum and maximum dates in the dataset are used to determine the timespan if num_days is not provided.

Source code in src/patientflow/calculate/arrival_rates.py
def time_varying_arrival_rates(
    df: DataFrame,
    yta_time_interval: Union[int, timedelta],
    num_days: Optional[int] = None,
    verbose: bool = False,
) -> OrderedDict[time, float]:
    """Calculate the time-varying arrival rates for a dataset indexed by datetime.

    This function computes the arrival rates for each time interval specified, across
    the entire date range present in the dataframe. The arrival rate is calculated as
    the number of entries in the dataframe for each time interval, divided by the
    number of days in the dataset's timespan.

    Parameters
    ----------
    df : pandas.DataFrame
        A DataFrame indexed by datetime, representing the data for which arrival rates
        are to be calculated. The index of the DataFrame should be of datetime type.
    yta_time_interval : int or timedelta
        The time interval for which the arrival rates are to be calculated.
        If int, assumed to be in minutes. If timedelta, will be converted to minutes.
        For example, if `yta_time_interval=60`, the function will calculate hourly
        arrival rates.
    num_days : int, optional
        The number of days that the DataFrame spans. If not provided, the number of
        days is calculated from the date of the min and max arrival datetimes.
    verbose : bool, optional
        If True, enable info-level logging. Defaults to False.

    Returns
    -------
    OrderedDict[datetime.time, float]
        A dictionary mapping times to arrival rates, where times are datetime.time
        objects and rates are float values.

    Raises
    ------
    TypeError
        If 'df' is not a pandas DataFrame, 'yta_time_interval' is not an integer or timedelta,
        or the DataFrame index is not a DatetimeIndex.
    ValueError
        If 'yta_time_interval' is less than or equal to 0 or does not divide evenly
        into 24 hours.

    Notes
    -----
    The minimum and maximum dates in the dataset are used to determine the timespan
    if num_days is not provided.
    """
    import logging
    import sys

    if verbose:
        # Create logger with a unique name
        logger = logging.getLogger(f"{__name__}.time_varying_arrival_rates")

        # Only set up handlers if they don't exist
        if not logger.handlers:
            logger.setLevel(logging.INFO if verbose else logging.WARNING)

            # Create handler that writes to sys.stdout
            handler = logging.StreamHandler(sys.stdout)
            handler.setLevel(logging.INFO if verbose else logging.WARNING)

            # Create a formatting configuration
            formatter = logging.Formatter("%(message)s")
            handler.setFormatter(formatter)

            # Add the handler to the logger
            logger.addHandler(handler)

            # Prevent propagation to root logger
            logger.propagate = False

    # Input validation
    if not isinstance(df, DataFrame):
        raise TypeError("The input 'df' must be a pandas DataFrame.")
    if not isinstance(yta_time_interval, (int, timedelta)):
        raise TypeError(
            "The parameter 'yta_time_interval' must be an integer or timedelta."
        )
    if not isinstance(df.index, pd.DatetimeIndex):
        raise TypeError("The DataFrame index must be a pandas DatetimeIndex.")

    # Handle both timedelta and numeric inputs for yta_time_interval
    if isinstance(yta_time_interval, timedelta):
        yta_time_interval_minutes = int(yta_time_interval.total_seconds() / 60)
    elif isinstance(yta_time_interval, int):
        yta_time_interval_minutes = yta_time_interval
    else:
        raise TypeError("yta_time_interval must be a timedelta object or integer")

    # Validate time interval
    minutes_in_day = 24 * 60
    if yta_time_interval_minutes <= 0:
        raise ValueError("The parameter 'yta_time_interval' must be positive.")
    if minutes_in_day % yta_time_interval_minutes != 0:
        raise ValueError(
            f"Time interval ({yta_time_interval_minutes} minutes) must divide evenly into 24 hours."
        )

    if num_days is None:
        # Calculate total days between first and last date
        if verbose and logger:
            logger.info("Inferring number of days from dataset")
        start_date = df.index.date.min()
        end_date = df.index.date.max()
        num_days = (end_date - start_date).days + 1

    if num_days == 0:
        raise ValueError("DataFrame contains no data.")

    if verbose and logger:
        logger.info(
            f"Calculating time-varying arrival rates for data provided, which spans {num_days} unique dates"
        )

    arrival_rates_dict = OrderedDict()

    # Initialize a time object to iterate through one day in the specified intervals
    _start_datetime = datetime(1970, 1, 1, 0, 0, 0, 0)
    _stop_datetime = _start_datetime + timedelta(days=1)

    # Iterate over each interval in a single day to calculate the arrival rate
    while _start_datetime != _stop_datetime:
        _start_time = _start_datetime.time()
        _end_time = (
            _start_datetime + timedelta(minutes=yta_time_interval_minutes)
        ).time()

        # Filter the dataframe for entries within the current time interval
        _df = df.between_time(_start_time, _end_time, inclusive="left")

        # Calculate and store the arrival rate for the interval
        arrival_rates_dict[_start_time] = _df.shape[0] / num_days

        # Move to the next interval
        _start_datetime = _start_datetime + timedelta(minutes=yta_time_interval_minutes)

    return arrival_rates_dict
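
A worked sketch with three arrivals over two days, so the 09:00 hourly rate is 3 arrivals / 2 days (hypothetical timestamps):

>>> import pandas as pd
>>> from datetime import time
>>> idx = pd.to_datetime(["2024-01-01 09:15", "2024-01-01 09:45", "2024-01-02 09:30"])
>>> rates = time_varying_arrival_rates(pd.DataFrame(index=idx), yta_time_interval=60)
>>> rates[time(9, 0)]   # 1.5 arrivals per hour in the 09:00 interval
>>> rates[time(10, 0)]  # 0.0 for an interval with no arrivals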

time_varying_arrival_rates_lagged(df, lagged_by, num_days=None, yta_time_interval=60)

Calculate lagged time-varying arrival rates for a dataset indexed by datetime.

This function first calculates the basic arrival rates and then adjusts them by a specified lag time, returning the rates sorted by the lagged times.

Parameters:

df : DataFrame, required
    A DataFrame indexed by datetime, representing the data for which arrival rates are to be calculated. The index must be a DatetimeIndex.
lagged_by : int, required
    Number of hours to lag the arrival times.
num_days : int, default None
    The number of days that the DataFrame spans. If not provided, the number of days is calculated from the date of the min and max arrival datetimes.
yta_time_interval : int or timedelta, default 60
    The time interval for which the arrival rates are to be calculated. If int, assumed to be in minutes. If timedelta, will be converted to minutes.

Returns:

OrderedDict[time, float]
    A dictionary mapping lagged times (datetime.time objects) to their corresponding arrival rates.

Raises:

TypeError
    If df is not a DataFrame, lagged_by is not an integer, yta_time_interval is not an integer or timedelta, or the DataFrame index is not a DatetimeIndex.
ValueError
    If lagged_by is negative or yta_time_interval is not positive.

Notes

The lagged times are calculated by adding the specified number of hours to each time in the original arrival rates dictionary.

Source code in src/patientflow/calculate/arrival_rates.py
def time_varying_arrival_rates_lagged(
    df: DataFrame,
    lagged_by: int,
    num_days: Optional[int] = None,
    yta_time_interval: Union[int, timedelta] = 60,
) -> OrderedDict[time, float]:
    """Calculate lagged time-varying arrival rates for a dataset indexed by datetime.

    This function first calculates the basic arrival rates and then adjusts them by
    a specified lag time, returning the rates sorted by the lagged times.

    Parameters
    ----------
    df : pandas.DataFrame
        A DataFrame indexed by datetime, representing the data for which arrival rates
        are to be calculated. The index must be a DatetimeIndex.
    lagged_by : int
        Number of hours to lag the arrival times.
    num_days : int, optional
        The number of days that the DataFrame spans. If not provided, the number of
        days is calculated from the date of the min and max arrival datetimes.
    yta_time_interval : int or timedelta, optional
        The time interval for which the arrival rates are to be calculated.
        If int, assumed to be in minutes. If timedelta, will be converted to minutes.
        Defaults to 60.

    Returns
    -------
    OrderedDict[datetime.time, float]
        A dictionary mapping lagged times (datetime.time objects) to their
        corresponding arrival rates.

    Raises
    ------
    TypeError
        If df is not a DataFrame, lagged_by is not an integer, yta_time_interval is not an integer or timedelta,
        or DataFrame index is not DatetimeIndex.
    ValueError
        If lagged_by is negative or yta_time_interval is not positive.

    Notes
    -----
    The lagged times are calculated by adding the specified number of hours to each
    time in the original arrival rates dictionary.
    """
    # Input validation
    if not isinstance(df, DataFrame):
        raise TypeError("The input 'df' must be a pandas DataFrame.")

    if not isinstance(lagged_by, int):
        raise TypeError("The parameter 'lagged_by' must be an integer.")

    if not isinstance(yta_time_interval, (int, timedelta)):
        raise TypeError(
            "The parameter 'yta_time_interval' must be an integer or timedelta."
        )

    if not isinstance(df.index, pd.DatetimeIndex):
        raise TypeError("The DataFrame index must be a pandas DatetimeIndex.")

    if lagged_by < 0:
        raise ValueError("The parameter 'lagged_by' must be non-negative.")

    # Handle both timedelta and numeric inputs for yta_time_interval
    if isinstance(yta_time_interval, timedelta):
        yta_time_interval_minutes = int(yta_time_interval.total_seconds() / 60)
    elif isinstance(yta_time_interval, int):
        yta_time_interval_minutes = yta_time_interval
    else:
        raise TypeError("yta_time_interval must be a timedelta object or integer")

    if yta_time_interval_minutes <= 0:
        raise ValueError("The parameter 'yta_time_interval' must be positive.")

    # Calculate base arrival rates
    arrival_rates_dict = time_varying_arrival_rates(
        df, yta_time_interval, num_days=num_days
    )

    # Apply lag to the times
    lagged_dict = OrderedDict()
    reference_date = datetime(2000, 1, 1)  # Use arbitrary reference date

    for base_time, rate in arrival_rates_dict.items():
        # Combine with reference date and apply lag
        lagged_datetime = datetime.combine(reference_date, base_time) + timedelta(
            hours=lagged_by
        )
        lagged_dict[lagged_datetime.time()] = rate

    # Sort by lagged times
    return OrderedDict(sorted(lagged_dict.items()))
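
A minimal sketch: a 4-hour lag shifts each rate's key forward by four hours (hypothetical data):

>>> import pandas as pd
>>> from datetime import time
>>> idx = pd.to_datetime(["2024-01-01 09:15", "2024-01-02 09:30"])
>>> lagged = time_varying_arrival_rates_lagged(pd.DataFrame(index=idx), lagged_by=4)
>>> lagged[time(13, 0)]  # the 09:00 rate (1.0 per day here) now appears under 13:00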

unfettered_demand_by_hour(df, x1, y1, x2, y2, yta_time_interval=60, max_hours_since_arrival=10, num_days=None)

Calculate true inpatient demand by hour based on historical arrival data.

This function estimates demand rates using historical arrival data and an aspirational curve for admission probabilities. It takes a DataFrame of historical arrivals and parameters defining an aspirational curve to calculate hourly demand rates.

Parameters:

df : DataFrame, required
    A DataFrame indexed by datetime, representing historical arrival data. The index must be a DatetimeIndex.
x1 : float, required
    First x-coordinate of the aspirational curve.
y1 : float, required
    First y-coordinate of the aspirational curve (0-1).
x2 : float, required
    Second x-coordinate of the aspirational curve.
y2 : float, required
    Second y-coordinate of the aspirational curve (0-1).
yta_time_interval : int or timedelta, default 60
    Time interval for which the arrival rates are to be calculated. If int, assumed to be in minutes. If timedelta, will be converted to minutes.
max_hours_since_arrival : int, default 10
    Maximum hours since arrival to consider.
num_days : int, default None
    The number of days that the DataFrame spans. If not provided, the number of days is calculated from the date of the min and max arrival datetimes.

Returns:

OrderedDict[time, float]
    A dictionary mapping times (datetime.time objects) to their corresponding demand rates.

Raises:

TypeError
    If df is not a DataFrame, coordinates are not floats, or the DataFrame index is not a DatetimeIndex.
ValueError
    If coordinates are outside valid ranges, yta_time_interval is not positive, or it does not divide evenly into 24 hours.

Notes

The function combines historical arrival patterns with admission probabilities to estimate true inpatient demand. The aspirational curve is used to model how admission probabilities change over time.

Source code in src/patientflow/calculate/arrival_rates.py
def unfettered_demand_by_hour(
    df: DataFrame,
    x1: float,
    y1: float,
    x2: float,
    y2: float,
    yta_time_interval: Union[int, timedelta] = 60,
    max_hours_since_arrival: int = 10,
    num_days: Optional[int] = None,
) -> OrderedDict[time, float]:
    """Calculate true inpatient demand by hour based on historical arrival data.

    This function estimates demand rates using historical arrival data and an aspirational
    curve for admission probabilities. It takes a DataFrame of historical arrivals and
    parameters defining an aspirational curve to calculate hourly demand rates.

    Parameters
    ----------
    df : pandas.DataFrame
        A DataFrame indexed by datetime, representing historical arrival data.
        The index must be a DatetimeIndex.
    x1 : float
        First x-coordinate of the aspirational curve.
    y1 : float
        First y-coordinate of the aspirational curve (0-1).
    x2 : float
        Second x-coordinate of the aspirational curve.
    y2 : float
        Second y-coordinate of the aspirational curve (0-1).
    yta_time_interval : int or timedelta, optional
        Time interval for which the arrival rates are to be calculated.
        If int, assumed to be in minutes. If timedelta, will be converted to minutes.
        Defaults to 60.
    max_hours_since_arrival : int, optional
        Maximum hours since arrival to consider. Defaults to 10.
    num_days : int, optional
        The number of days that the DataFrame spans. If not provided, the number of
        days is calculated from the date of the min and max arrival datetimes.

    Returns
    -------
    OrderedDict[datetime.time, float]
        A dictionary mapping times (datetime.time objects) to their corresponding
        demand rates.

    Raises
    ------
    TypeError
        If df is not a DataFrame, coordinates are not floats, or DataFrame index
        is not DatetimeIndex.
    ValueError
        If coordinates are outside valid ranges, yta_time_interval is not positive,
        or doesn't divide evenly into 24 hours.

    Notes
    -----
    The function combines historical arrival patterns with admission probabilities
    to estimate true inpatient demand. The aspirational curve is used to model
    how admission probabilities change over time.
    """
    # Input validation
    if not isinstance(df, DataFrame):
        raise TypeError("The input 'df' must be a pandas DataFrame.")

    if not isinstance(df.index, pd.DatetimeIndex):
        raise TypeError("The DataFrame index must be a pandas DatetimeIndex.")

    if not all(isinstance(x, (int, float)) for x in [x1, y1, x2, y2]):
        raise TypeError("Curve coordinates must be numeric values.")

    if not isinstance(yta_time_interval, (int, timedelta)):
        raise TypeError(
            "The parameter 'yta_time_interval' must be an integer or timedelta."
        )

    if not isinstance(max_hours_since_arrival, int):
        raise TypeError("The parameter 'max_hours_since_arrival' must be an integer.")

    # Handle both timedelta and numeric inputs for yta_time_interval
    if isinstance(yta_time_interval, timedelta):
        yta_time_interval_minutes = int(yta_time_interval.total_seconds() / 60)
    elif isinstance(yta_time_interval, int):
        yta_time_interval_minutes = yta_time_interval
    else:
        raise TypeError("yta_time_interval must be a timedelta object or integer")

    # Validate time interval
    minutes_in_day = 24 * 60
    if yta_time_interval_minutes <= 0:
        raise ValueError("The parameter 'yta_time_interval' must be positive.")
    if minutes_in_day % yta_time_interval_minutes != 0:
        raise ValueError(
            f"Time interval ({yta_time_interval_minutes} minutes) must divide evenly into 24 hours."
        )

    if max_hours_since_arrival <= 0:
        raise ValueError("The parameter 'max_hours_since_arrival' must be positive.")

    if not (0 <= y1 <= 1 and 0 <= y2 <= 1):
        raise ValueError("Y-coordinates must be between 0 and 1.")

    if x1 >= x2:
        raise ValueError("x1 must be less than x2.")

    # Calculate number of intervals in a day
    num_intervals = minutes_in_day // yta_time_interval_minutes

    # Calculate admission probabilities
    hours_since_arrival = np.arange(max_hours_since_arrival + 1)
    _, prob_admission_within_hour = admission_probabilities(
        hours_since_arrival, x1, y1, x2, y2
    )

    # Calculate base arrival rates from historical data
    arrival_rates_dict = time_varying_arrival_rates(
        df, yta_time_interval_minutes, num_days=num_days
    )

    # Convert dict to arrays while preserving order
    hour_keys = list(arrival_rates_dict.keys())
    arrival_rates = np.array([arrival_rates_dict[hour] for hour in hour_keys])

    # Initialize array for weighted arrival rates
    weighted_rates = np.zeros((max_hours_since_arrival, len(arrival_rates)))

    # Calculate weighted arrival rates for each hour and elapsed time
    for hour_idx, _ in enumerate(hour_keys):
        arrival_rate = arrival_rates[hour_idx]
        weighted_rates[:, hour_idx] = (
            arrival_rate * prob_admission_within_hour[:max_hours_since_arrival]
        )

    # Calculate summed demand rates for each hour
    demand_by_hour = OrderedDict()
    elapsed_hours = range(max_hours_since_arrival)

    for hour_idx, hour_key in enumerate(hour_keys):
        demand_by_hour[hour_key] = weighted_arrival_rates(
            weighted_rates, elapsed_hours, hour_idx, num_intervals
        )

    return demand_by_hour
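
A minimal sketch reusing the module-level example coordinates (x1=4, y1=0.8, x2=8, y2=0.95); any DataFrame with a DatetimeIndex of arrivals will do:

>>> import pandas as pd
>>> idx = pd.to_datetime(["2024-01-01 09:15", "2024-01-01 17:40", "2024-01-02 09:30"])
>>> demand = unfettered_demand_by_hour(pd.DataFrame(index=idx), x1=4, y1=0.8, x2=8, y2=0.95)
>>> len(demand)  # one demand rate per hourly interval -> 24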

weighted_arrival_rates(weighted_rates, elapsed_hours, hour_idx, num_intervals)

Calculate sum of weighted arrival rates for a specific time interval.

Parameters:

weighted_rates : ndarray, required
    Array of weighted arrival rates.
elapsed_hours : range, required
    Range of elapsed hours to consider.
hour_idx : int, required
    Current interval index.
num_intervals : int, required
    Total number of intervals in a day.

Returns:

float
    Sum of weighted arrival rates.

Notes

The function calculates the sum of weighted arrival rates by iterating through the elapsed hours and considering the appropriate interval index for each hour.

Source code in src/patientflow/calculate/arrival_rates.py
def weighted_arrival_rates(
    weighted_rates: np.ndarray, elapsed_hours: range, hour_idx: int, num_intervals: int
) -> float:
    """Calculate sum of weighted arrival rates for a specific time interval.

    Parameters
    ----------
    weighted_rates : np.ndarray
        Array of weighted arrival rates.
    elapsed_hours : range
        Range of elapsed hours to consider.
    hour_idx : int
        Current interval index.
    num_intervals : int
        Total number of intervals in a day.

    Returns
    -------
    float
        Sum of weighted arrival rates.

    Notes
    -----
    The function calculates the sum of weighted arrival rates by iterating through
    the elapsed hours and considering the appropriate interval index for each hour.
    """
    total = 0
    for elapsed_hour in elapsed_hours:
        interval_index = (hour_idx - elapsed_hour) % num_intervals
        total += weighted_rates[elapsed_hour][interval_index]
    return total
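
A minimal sketch: with a uniform weighted-rate matrix, the sum over three elapsed hours is simply 3 x 0.1 (hypothetical values):

>>> import numpy as np
>>> weighted = np.full((3, 24), 0.1)  # rows: elapsed hours; columns: 24 hourly intervals
>>> weighted_arrival_rates(weighted, range(3), hour_idx=0, num_intervals=24)
>>> # elapsed hours index backwards into the day modulo num_intervals; result is ~0.3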

survival_curve

calculate_survival_curve(df, start_time_col, end_time_col)

Calculate survival curve data from patient visit data.

Parameters:

df : DataFrame, required
    DataFrame containing patient visit data.
start_time_col : str, required
    Name of the column containing the start time (e.g., arrival_datetime).
end_time_col : str, required
    Name of the column containing the end time (e.g., departure_datetime).

Returns:

DataFrame
    DataFrame with columns:
    - time_hours: Time points in hours
    - survival_probability: Survival probabilities at each time point
    - event_probability: Event probabilities (1 - survival_probability)

Source code in src/patientflow/calculate/survival_curve.py
def calculate_survival_curve(df, start_time_col, end_time_col):
    """Calculate survival curve data from patient visit data.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing patient visit data
    start_time_col : str
        Name of the column containing the start time (e.g., arrival_datetime)
    end_time_col : str
        Name of the column containing the end time (e.g., departure_datetime)

    Returns
    -------
    pandas.DataFrame
        DataFrame with columns:
        - time_hours: Time points in hours
        - survival_probability: Survival probabilities at each time point
        - event_probability: Event probabilities (1 - survival_probability)
    """
    # Calculate the wait time in hours
    df = df.copy()
    df["wait_time_hours"] = (
        df[end_time_col] - df[start_time_col]
    ).dt.total_seconds() / 3600

    # Drop any rows with missing wait times
    df_clean = df.dropna(subset=["wait_time_hours"]).copy()

    # Sort the data by wait time
    df_clean = df_clean.sort_values("wait_time_hours")

    # Calculate the number of patients
    n_patients = len(df_clean)

    # Calculate the survival function manually
    # For each time point, calculate proportion of patients who are still waiting
    unique_times = np.sort(df_clean["wait_time_hours"].unique())
    survival_prob = []

    for t in unique_times:
        # Number of patients who experienced the event after this time point
        n_event_after = sum(df_clean["wait_time_hours"] > t)
        # Proportion of patients still waiting
        survival_prob.append(n_event_after / n_patients)

    # Add zero hours wait time (everyone is waiting at time 0)
    unique_times = np.insert(unique_times, 0, 0)
    survival_prob = np.insert(survival_prob, 0, 1.0)

    # Return structured DataFrame
    return pd.DataFrame(
        {
            "time_hours": unique_times,
            "survival_probability": survival_prob,
            "event_probability": 1 - survival_prob,
        }
    )
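
A minimal sketch with two visits of 4 and 6 hours (hypothetical column names matching the parameters):

>>> import pandas as pd
>>> df = pd.DataFrame({
...     "arrival_datetime": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:00"]),
...     "departure_datetime": pd.to_datetime(["2024-01-01 14:00", "2024-01-01 16:00"]),
... })
>>> curve = calculate_survival_curve(df, "arrival_datetime", "departure_datetime")
>>> # time_hours == [0.0, 4.0, 6.0]; survival_probability == [1.0, 0.5, 0.0]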

errors

Custom exception classes for model loading and validation.

This module defines specialized exceptions raised during model loading and validation.

Classes:

ModelLoadError
    Raised when a model fails to load due to an unspecified error.
MissingKeysError
    Raised when expected keys are missing from a dictionary of special parameters.

MissingKeysError

Bases: ValueError

Exception raised when required keys are missing from special_params.

Parameters:

missing_keys : list or set, required
    The keys that are required but missing from the input dictionary.

Attributes:

missing_keys : list or set
    Stores the missing keys that caused the exception.

Source code in src/patientflow/errors.py
class MissingKeysError(ValueError):
    """
    Exception raised when required keys are missing from special_params.

    Parameters
    ----------
    missing_keys : list or set
        The keys that are required but missing from the input dictionary.

    Attributes
    ----------
    missing_keys : list or set
        Stores the missing keys that caused the exception.
    """

    def __init__(self, missing_keys):
        super().__init__(f"special_params is missing required keys: {missing_keys}")
        self.missing_keys = missing_keys
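
A minimal sketch (the key name shown is hypothetical):

>>> from patientflow.errors import MissingKeysError
>>> try:
...     raise MissingKeysError(["median_wait"])
... except ValueError as exc:  # MissingKeysError subclasses ValueError
...     print(exc)
special_params is missing required keys: ['median_wait']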

ModelLoadError

Bases: Exception

Exception raised when a model fails to load.

This generic exception can be used to signal a failure during the model loading process due to unexpected issues such as file corruption, invalid formats, or unsupported configurations.

Source code in src/patientflow/errors.py
class ModelLoadError(Exception):
    """
    Exception raised when a model fails to load.

    This generic exception can be used to signal a failure during the model
    loading process due to unexpected issues such as file corruption,
    invalid formats, or unsupported configurations.
    """

    pass

evaluate

Patient Flow Evaluation Module

This module provides functions for evaluating and comparing different prediction models for non-clinical outcomes in a healthcare setting. It includes utilities for calculating metrics such as Mean Absolute Error (MAE) and Mean Percentage Error (MPE), as well as functions for predicting admissions based on historical data and combining different prediction models.

Functions:

calculate_results : function
    Calculate evaluation metrics based on expected and observed values.
calc_mae_mpe : function
    Calculate MAE and MPE for probability distribution predictions.
calculate_admission_probs_relative_to_prediction : function
    Calculate admission probabilities for arrivals relative to a prediction time window.
get_arrivals_with_admission_probs : function
    Get arrivals before and after prediction time with their admission probabilities.
calculate_weighted_observed : function
    Calculate actual admissions assuming ED targets are met.
create_time_mask : function
    Create a mask for times before/after a specific hour:minute.
predict_using_previous_weeks : function
    Predict admissions using the average from previous weeks.
evaluate_six_week_average : function
    Evaluate the six-week average prediction model.
combine_distributions : function
    Combine two probability distributions using convolution.
evaluate_combined_model : function
    Evaluate a combined prediction model.

calc_mae_mpe(prob_dist_dict_all, use_most_probable=False)

Calculate MAE and MPE for all prediction times in the given probability distribution dictionary.

Parameters:

prob_dist_dict_all : Dict[Any, Dict[Any, Dict[str, Any]]], required
    Nested dictionary containing probability distributions.
use_most_probable : bool, default False
    Whether to use the most probable value or the mathematical expectation of the distribution.

Returns:

Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]
    Dictionary of results sorted by prediction time, containing:
    - expected : List[Union[int, float]]
        Expected values for each prediction
    - observed : List[float]
        Observed values for each prediction
    - mae : float
        Mean Absolute Error
    - mpe : float
        Mean Percentage Error

Source code in src/patientflow/evaluate.py
def calc_mae_mpe(
    prob_dist_dict_all: Dict[Any, Dict[Any, Dict[str, Any]]],
    use_most_probable: bool = False,
) -> Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]:
    """Calculate MAE and MPE for all prediction times in the given probability distribution dictionary.

    Parameters
    ----------
    prob_dist_dict_all : Dict[Any, Dict[Any, Dict[str, Any]]]
        Nested dictionary containing probability distributions.
    use_most_probable : bool, optional
        Whether to use the most probable value or mathematical expectation of the distribution.
        Default is False.

    Returns
    -------
    Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]
        Dictionary of results sorted by prediction time, containing:
        - expected : List[Union[int, float]]
            Expected values for each prediction
        - observed : List[float]
            Observed values for each prediction
        - mae : float
            Mean Absolute Error
        - mpe : float
            Mean Percentage Error
    """
    # Create temporary results dictionary
    unsorted_results: Dict[Any, Dict[str, Union[List[Union[int, float]], float]]] = {}

    # Process results as before
    for _prediction_time in prob_dist_dict_all.keys():
        expected_values: List[Union[int, float]] = []
        observed_values: List[float] = []

        for dt in prob_dist_dict_all[_prediction_time].keys():
            preds: Dict[str, Any] = prob_dist_dict_all[_prediction_time][dt]

            expected_value: Union[int, float] = (
                int(preds["agg_predicted"].idxmax().values[0])
                if use_most_probable
                else float(
                    np.dot(
                        preds["agg_predicted"].index,
                        preds["agg_predicted"].values.flatten(),
                    )
                )
            )

            observed_value: float = float(preds["agg_observed"])

            expected_values.append(expected_value)
            observed_values.append(observed_value)

        unsorted_results[_prediction_time] = calculate_results(
            expected_values, observed_values
        )

    # Sort results by prediction time
    def get_time_value(key: str) -> int:
        # Extract time from key (e.g., 'admissions_1530' -> 1530)
        time_str = key.split("_")[1]
        return int(time_str)

    # Create sorted dictionary
    sorted_results = dict(
        sorted(unsorted_results.items(), key=lambda x: get_time_value(x[0]))
    )

    return sorted_results
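
A minimal sketch with one model key and one snapshot date; agg_predicted is a one-column DataFrame over admission counts and agg_observed is the observed count (all values hypothetical):

>>> import pandas as pd
>>> dist = pd.DataFrame({"agg_predicted": [0.2, 0.5, 0.3]})  # P(0), P(1), P(2) admissions
>>> prob_dist = {"admissions_0930": {"2024-01-01": {"agg_predicted": dist, "agg_observed": 1}}}
>>> results = calc_mae_mpe(prob_dist)
>>> # expectation = 0*0.2 + 1*0.5 + 2*0.3 = 1.1, so mae is ~0.1 and mpe is ~10.0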

calculate_admission_probs_relative_to_prediction(df, prediction_datetime, prediction_window, x1, y1, x2, y2, is_before=True)

Calculate admission probabilities for arrivals relative to a prediction time window.

Parameters:

df : DataFrame, required
    DataFrame containing arrival_datetime column.
prediction_datetime : datetime, required
    Datetime for prediction window start.
prediction_window : int, required
    Window length in minutes.
x1 : float, required
    First x-coordinate for aspirational curve.
y1 : float, required
    First y-coordinate for aspirational curve.
x2 : float, required
    Second x-coordinate for aspirational curve.
y2 : float, required
    Second y-coordinate for aspirational curve.
is_before : bool, default True
    Boolean indicating if arrivals are before prediction time.

Returns:

DataFrame
    DataFrame with added probability columns:
    - hours_before_pred_window : float
        Hours before prediction window (if is_before=True)
    - hours_after_pred_window : float
        Hours after prediction window (if is_before=False)
    - prob_admission_before_pred_window : float
        Probability of admission before prediction window
    - prob_admission_in_pred_window : float
        Probability of admission within prediction window

Source code in src/patientflow/evaluate.py
def calculate_admission_probs_relative_to_prediction(
    df, prediction_datetime, prediction_window, x1, y1, x2, y2, is_before=True
):
    """Calculate admission probabilities for arrivals relative to a prediction time window.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing arrival_datetime column.
    prediction_datetime : datetime
        Datetime for prediction window start.
    prediction_window : int
        Window length in minutes.
    x1 : float
        First x-coordinate for aspirational curve.
    y1 : float
        First y-coordinate for aspirational curve.
    x2 : float
        Second x-coordinate for aspirational curve.
    y2 : float
        Second y-coordinate for aspirational curve.
    is_before : bool, optional
        Boolean indicating if arrivals are before prediction time.
        Default is True.

    Returns
    -------
    pandas.DataFrame
        DataFrame with added probability columns:
        - hours_before_pred_window : float
            Hours before prediction window (if is_before=True)
        - hours_after_pred_window : float
            Hours after prediction window (if is_before=False)
        - prob_admission_before_pred_window : float
            Probability of admission before prediction window
        - prob_admission_in_pred_window : float
            Probability of admission within prediction window
    """
    result = df.copy()

    if is_before:
        result["hours_before_pred_window"] = result["arrival_datetime"].apply(
            lambda x: (prediction_datetime - x).seconds / 3600
        )
        result["prob_admission_before_pred_window"] = result[
            "hours_before_pred_window"
        ].apply(lambda x: get_y_from_aspirational_curve(x, x1, y1, x2, y2))
        result["prob_admission_in_pred_window"] = result[
            "hours_before_pred_window"
        ].apply(
            lambda x: get_y_from_aspirational_curve(
                x + prediction_window / 60, x1, y1, x2, y2
            )
            - get_y_from_aspirational_curve(x, x1, y1, x2, y2)
        )
    else:
        result["hours_after_pred_window"] = result["arrival_datetime"].apply(
            lambda x: (x - prediction_datetime).seconds / 3600
        )
        result["prob_admission_in_pred_window"] = result[
            "hours_after_pred_window"
        ].apply(
            lambda x: get_y_from_aspirational_curve(
                (prediction_window / 60) - x, x1, y1, x2, y2
            )
        )

    return result
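
A minimal sketch for a single arrival two hours before a 12:00 prediction time with an 8-hour (480-minute) window; the aspirational coordinates are illustrative:

>>> import pandas as pd
>>> from datetime import datetime
>>> df = pd.DataFrame({"arrival_datetime": pd.to_datetime(["2024-01-01 10:00"])})
>>> out = calculate_admission_probs_relative_to_prediction(
...     df, datetime(2024, 1, 1, 12, 0), prediction_window=480,
...     x1=4, y1=0.8, x2=8, y2=0.95, is_before=True)
>>> # out gains hours_before_pred_window (2.0) plus the two probability columns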

calculate_results(expected_values, observed_values)

Calculate evaluation metrics based on expected and observed values.

Parameters:

expected_values : List[Union[int, float]], required
    List of expected values.
observed_values : List[float], required
    List of observed values.

Returns:

Dict[str, Union[List[Union[int, float]], float]]
    Dictionary containing:
    - expected : List[Union[int, float]]
        Original expected values
    - observed : List[float]
        Original observed values
    - mae : float
        Mean Absolute Error
    - mpe : float
        Mean Percentage Error

Source code in src/patientflow/evaluate.py
def calculate_results(
    expected_values: List[Union[int, float]], observed_values: List[float]
) -> Dict[str, Union[List[Union[int, float]], float]]:
    """Calculate evaluation metrics based on expected and observed values.

    Parameters
    ----------
    expected_values : List[Union[int, float]]
        List of expected values.
    observed_values : List[float]
        List of observed values.

    Returns
    -------
    Dict[str, Union[List[Union[int, float]], float]]
        Dictionary containing:
        - expected : List[Union[int, float]]
            Original expected values
        - observed : List[float]
            Original observed values
        - mae : float
            Mean Absolute Error
        - mpe : float
            Mean Percentage Error
    """
    expected_array: np.ndarray = np.array(expected_values)
    observed_array: np.ndarray = np.array(observed_values)

    if len(expected_array) == 0 or len(observed_array) == 0:
        return {
            "expected": expected_values,
            "observed": observed_values,
            "mae": 0.0,
            "mpe": 0.0,
        }

    absolute_errors: np.ndarray = np.abs(expected_array - observed_array)
    mae: float = float(np.mean(absolute_errors)) if len(absolute_errors) > 0 else 0.0

    non_zero_mask: np.ndarray = observed_array != 0
    filtered_absolute_errors: np.ndarray = absolute_errors[non_zero_mask]
    filtered_observed_array: np.ndarray = observed_array[non_zero_mask]

    mpe: float = 0.0
    if len(filtered_absolute_errors) > 0 and len(filtered_observed_array) > 0:
        percentage_errors: np.ndarray = (
            filtered_absolute_errors / filtered_observed_array * 100
        )
        mpe = float(np.mean(percentage_errors))

    return {
        "expected": expected_values,
        "observed": observed_values,
        "mae": mae,
        "mpe": mpe,
    }
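
A minimal sketch (hypothetical values):

>>> res = calculate_results([10, 12], [8, 12])
>>> # absolute errors are [2, 0], so res["mae"] == 1.0
>>> # percentage errors are [25.0, 0.0], so res["mpe"] == 12.5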

calculate_weighted_observed(df, dt, prediction_window, x1, y1, x2, y2, prediction_time)

Calculate weighted observed admissions for a specific date and prediction window.

Parameters:

df : DataFrame, required
    DataFrame with arrival_datetime column.
dt : date, required
    Target date for calculation.
prediction_window : int, required
    Window length in minutes.
x1 : float, required
    First x-coordinate for aspirational curve.
y1 : float, required
    First y-coordinate for aspirational curve.
x2 : float, required
    Second x-coordinate for aspirational curve.
y2 : float, required
    Second y-coordinate for aspirational curve.
prediction_time : tuple, required
    Tuple of (hour, minute) for prediction time.

Returns:

float
    Weighted sum of observed admissions for the specified time period.

Source code in src/patientflow/evaluate.py
def calculate_weighted_observed(
    df, dt, prediction_window, x1, y1, x2, y2, prediction_time
):
    """Calculate weighted observed admissions for a specific date and prediction window.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame with arrival_datetime column.
    dt : datetime.date
        Target date for calculation.
    prediction_window : int
        Window length in minutes.
    x1 : float
        First x-coordinate for aspirational curve.
    y1 : float
        First y-coordinate for aspirational curve.
    x2 : float
        Second x-coordinate for aspirational curve.
    y2 : float
        Second y-coordinate for aspirational curve.
    prediction_time : tuple
        Tuple of (hour, minute) for prediction time.

    Returns
    -------
    float
        Weighted sum of observed admissions for the specified time period.
    """
    # Create prediction datetime
    prediction_datetime = pd.to_datetime(dt).replace(
        hour=prediction_time[0], minute=prediction_time[1]
    )

    # Filter for target date and get arrivals with probabilities
    filtered_df = df[df["arrival_datetime"].dt.date == dt]
    arrived_before, arrived_after = get_arrivals_with_admission_probs(
        filtered_df,
        prediction_datetime,
        prediction_window,
        prediction_time,
        x1,
        y1,
        x2,
        y2,
        target_date=dt,
    )

    # Calculate weighted sum
    weighted_observed = (
        arrived_before["prob_admission_in_pred_window"].sum()
        + arrived_after["prob_admission_in_pred_window"].sum()
    )

    return weighted_observed

combine_distributions(dist1, dist2)

Combine two probability distributions using convolution.

Parameters:

dist1 : DataFrame, required
    First probability distribution.
dist2 : DataFrame, required
    Second probability distribution.

Returns:

DataFrame
    Combined probability distribution with an agg_predicted column of combined probability values.

Source code in src/patientflow/evaluate.py
def combine_distributions(dist1: pd.DataFrame, dist2: pd.DataFrame) -> pd.DataFrame:
    """Combine two probability distributions using convolution.

    Parameters
    ----------
    dist1 : pandas.DataFrame
        First probability distribution.
    dist2 : pandas.DataFrame
        Second probability distribution.

    Returns
    -------
    pandas.DataFrame
        Combined probability distribution with columns:
        - agg_predicted : float
            Combined probability values
    """
    arr1 = dist1.values
    arr2 = dist2.values

    combined = signal.convolve(arr1, arr2)
    new_index = range(len(combined))

    combined_df = pd.DataFrame(combined, index=new_index, columns=["agg_predicted"])
    combined_df["agg_predicted"] = (
        combined_df["agg_predicted"] / combined_df["agg_predicted"].sum()
    )

    return combined_df
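
A minimal sketch: convolving two fair coin-flip distributions over {0, 1} admissions yields {0, 1, 2} with probabilities 0.25/0.5/0.25 (hypothetical inputs):

>>> import pandas as pd
>>> d = pd.DataFrame({"agg_predicted": [0.5, 0.5]})
>>> combined = combine_distributions(d, d)
>>> combined["agg_predicted"].tolist()
[0.25, 0.5, 0.25]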

create_time_mask(df, hour, minute)

Create a mask for times before/after a specific hour:minute.

Parameters:

df : DataFrame, required
    DataFrame containing arrival_datetime column.
hour : int, required
    Target hour (0-23).
minute : int, required
    Target minute (0-59).

Returns:

Series
    Boolean mask indicating times after the specified hour:minute.

Source code in src/patientflow/evaluate.py
def create_time_mask(df, hour, minute):
    """Create a mask for times before/after a specific hour:minute.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing arrival_datetime column.
    hour : int
        Target hour (0-23).
    minute : int
        Target minute (0-59).

    Returns
    -------
    pandas.Series
        Boolean mask indicating times after the specified hour:minute.
    """
    return (df["arrival_datetime"].dt.hour > hour) | (
        (df["arrival_datetime"].dt.hour == hour)
        & (df["arrival_datetime"].dt.minute > minute)
    )
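
A minimal sketch (hypothetical timestamps):

>>> import pandas as pd
>>> df = pd.DataFrame({"arrival_datetime": pd.to_datetime(["2024-01-01 11:59", "2024-01-01 12:01"])})
>>> create_time_mask(df, 12, 0).tolist()
[False, True]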

evaluate_combined_model(prob_dist_dict_all, df, yta_preds, prediction_window, x1, y1, x2, y2, prediction_time, num_weeks, model_name, use_most_probable=True)

Evaluate the combined prediction model.

Parameters:

prob_dist_dict_all : Dict[Any, Dict[Any, Dict[str, Any]]], required
    Nested dictionary containing probability distributions.
df : DataFrame, required
    DataFrame containing patient data.
yta_preds : DataFrame, required
    Yet-to-arrive predictions.
prediction_window : int, required
    Window length in minutes.
x1 : float, required
    First x-coordinate for aspirational curve.
y1 : float, required
    First y-coordinate for aspirational curve.
x2 : float, required
    Second x-coordinate for aspirational curve.
y2 : float, required
    Second y-coordinate for aspirational curve.
prediction_time : Tuple[int, int], required
    Hour and minute of prediction.
num_weeks : int, required
    Number of previous weeks to consider.
model_name : str, required
    Name of the model.
use_most_probable : bool, default True
    Whether to use the most probable value or the expected value.

Returns:

Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]
    Dictionary containing evaluation results:
    - expected : List[Union[int, float]]
        Expected values for each prediction
    - observed : List[float]
        Observed values for each prediction
    - mae : float
        Mean Absolute Error
    - mpe : float
        Mean Percentage Error

Source code in src/patientflow/evaluate.py
def evaluate_combined_model(
    prob_dist_dict_all: Dict[Any, Dict[Any, Dict[str, Any]]],
    df: pd.DataFrame,
    yta_preds: pd.DataFrame,
    prediction_window: int,
    x1: float,
    y1: float,
    x2: float,
    y2: float,
    prediction_time: Tuple[int, int],
    num_weeks: int,
    model_name: str,
    use_most_probable: bool = True,
) -> Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]:
    """Evaluate the combined prediction model.

    Parameters
    ----------
    prob_dist_dict_all : Dict[Any, Dict[Any, Dict[str, Any]]]
        Nested dictionary containing probability distributions.
    df : pandas.DataFrame
        DataFrame containing patient data.
    yta_preds : pandas.DataFrame
        Yet-to-arrive predictions.
    prediction_window : int
        Window length in minutes.
    x1 : float
        First x-coordinate for aspirational curve.
    y1 : float
        First y-coordinate for aspirational curve.
    x2 : float
        Second x-coordinate for aspirational curve.
    y2 : float
        Second y-coordinate for aspirational curve.
    prediction_time : Tuple[int, int]
        Hour and minute of prediction.
    num_weeks : int
        Number of previous weeks to consider.
    model_name : str
        Name of the model.
    use_most_probable : bool, optional
        Whether to use the most probable value or expected value.
        Default is True.

    Returns
    -------
    Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]
        Dictionary containing evaluation results:
        - expected : List[Union[int, float]]
            Expected values for each prediction
        - observed : List[float]
            Observed values for each prediction
        - mae : float
            Mean Absolute Error
        - mpe : float
            Mean Percentage Error
    """
    expected_values: List[Union[int, float]] = []
    observed_values: List[float] = []

    model_name = get_model_key(model_name, prediction_time)

    for dt in prob_dist_dict_all[model_name].keys():
        in_ed_preds: Dict[str, Any] = prob_dist_dict_all[model_name][dt]
        combined = combine_distributions(yta_preds, in_ed_preds["agg_predicted"])

        expected_value: Union[int, float] = (
            int(combined["agg_predicted"].idxmax())
            if use_most_probable
            else float(
                np.dot(
                    combined["agg_predicted"].index,
                    combined["agg_predicted"].values.flatten(),
                )
            )
        )

        observed_value: float = float(
            calculate_weighted_observed(
                df, dt, prediction_window, x1, y1, x2, y2, prediction_time
            )
        )

        expected_values.append(expected_value)
        observed_values.append(observed_value)

    results = {model_name: calculate_results(expected_values, observed_values)}
    return results
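
The use_most_probable flag switches between two point estimates of the combined distribution; a minimal sketch on a hypothetical distribution:

import numpy as np
import pandas as pd

combined = pd.DataFrame({"agg_predicted": [0.1, 0.3, 0.4, 0.2]})  # P(0..3 admissions)
most_probable = int(combined["agg_predicted"].idxmax())  # 2, the mode
expected = float(
    np.dot(combined.index, combined["agg_predicted"].values.flatten())
)  # 1.7, the mean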

evaluate_six_week_average(prob_dist_dict_all, df, prediction_window, x1, y1, x2, y2, prediction_time, num_weeks, model_name)

Evaluate the six-week average prediction model.

Parameters:

Name Type Description Default
prob_dist_dict_all Dict[Any, Dict[Any, Dict[str, Any]]]

Nested dictionary containing probability distributions.

required
df DataFrame

DataFrame containing patient data.

required
prediction_window int

Prediction window in minutes.

required
x1 float

First x-coordinate for aspirational curve.

required
y1 float

First y-coordinate for aspirational curve.

required
x2 float

Second x-coordinate for aspirational curve.

required
y2 float

Second y-coordinate for aspirational curve.

required
prediction_time Tuple[int, int]

Hour and minute of prediction.

required
num_weeks int

Number of previous weeks to consider.

required
model_name str

Name of the model.

required

Returns:

Type Description
Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]

Dictionary containing evaluation results:
- expected : List[Union[int, float]]
    Expected values for each prediction
- observed : List[float]
    Observed values for each prediction

Source code in src/patientflow/evaluate.py
def evaluate_six_week_average(
    prob_dist_dict_all: Dict[Any, Dict[Any, Dict[str, Any]]],
    df: pd.DataFrame,
    prediction_window: int,
    x1: float,
    y1: float,
    x2: float,
    y2: float,
    prediction_time: Tuple[int, int],
    num_weeks: int,
    model_name: str,
) -> Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]:
    """
    Evaluate the six-week average prediction model.

    Parameters
    ----------
    prob_dist_dict_all : Dict[Any, Dict[Any, Dict[str, Any]]]
        Nested dictionary containing probability distributions.
    df : pandas.DataFrame
        DataFrame containing patient data.
    prediction_window : int
        Prediction window in minutes.
    x1 : float
        First x-coordinate for aspirational curve.
    y1 : float
        First y-coordinate for aspirational curve.
    x2 : float
        Second x-coordinate for aspirational curve.
    y2 : float
        Second y-coordinate for aspirational curve.
    prediction_time : Tuple[int, int]
        Hour and minute of prediction.
    num_weeks : int
        Number of previous weeks to consider.
    model_name : str
        Name of the model.

    Returns
    -------
    Dict[Any, Dict[str, Union[List[Union[int, float]], float]]]
        Dictionary containing evaluation results:
        - expected : List[Union[int, float]]
            Expected values for each prediction
        - observed : List[float]
            Observed values for each prediction
    """
    expected_values: List[Union[int, float]] = []
    observed_values: List[float] = []

    model_name = get_model_key(model_name, prediction_time)

    for dt in prob_dist_dict_all[model_name].keys():
        expected_value: float = float(
            predict_using_previous_weeks(
                df, dt, prediction_window, x1, y1, x2, y2, prediction_time, num_weeks
            )
        )
        observed_value: float = float(
            calculate_weighted_observed(
                df, dt, prediction_window, x1, y1, x2, y2, prediction_time
            )
        )

        expected_values.append(expected_value)
        observed_values.append(observed_value)

    results = {model_name: calculate_results(expected_values, observed_values)}
    return results
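
A hedged usage sketch; prob_dist_dict_all and df are assumed to have been prepared as described above, and the aspirational-curve coordinates use the package's configuration defaults (x1=4, y1=0.76, x2=12, y2=0.99; see load_config_file below):

from patientflow.evaluate import evaluate_six_week_average

results = evaluate_six_week_average(
    prob_dist_dict_all,
    df,
    prediction_window=480,
    x1=4, y1=0.76, x2=12, y2=0.99,
    prediction_time=(9, 30),
    num_weeks=6,
    model_name="admissions",  # hypothetical base name; keyed by time via get_model_key
)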

get_arrivals_with_admission_probs(df, prediction_datetime, prediction_window, prediction_time, x1, y1, x2, y2, date_range=None, target_date=None, target_weekday=None)

Get arrivals before and after prediction time with their admission probabilities.

Parameters:

Name Type Description Default
df DataFrame

DataFrame with arrival_datetime column.

required
prediction_datetime datetime

Datetime for prediction window start.

required
prediction_window int

Window length in minutes.

required
prediction_time tuple

Tuple of (hour, minute) for prediction time.

required
x1 float

First x-coordinate for aspirational curve.

required
y1 float

First y-coordinate for aspirational curve.

required
x2 float

Second x-coordinate for aspirational curve.

required
y2 float

Second y-coordinate for aspirational curve.

required
date_range tuple

Optional tuple of (start_date, end_date) to filter data.

None
target_date date

Optional specific date to analyze.

None
target_weekday int

Optional specific weekday to filter for (0-6, where 0 is Monday).

None

Returns:

Type Description
tuple

Tuple of (arrived_before, arrived_after) DataFrames containing:
- arrived_before : pandas.DataFrame
    DataFrame with arrivals before prediction time
- arrived_after : pandas.DataFrame
    DataFrame with arrivals after prediction time

Source code in src/patientflow/evaluate.py
def get_arrivals_with_admission_probs(
    df,
    prediction_datetime,
    prediction_window,
    prediction_time,
    x1,
    y1,
    x2,
    y2,
    date_range=None,
    target_date=None,
    target_weekday=None,
):
    """Get arrivals before and after prediction time with their admission probabilities.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame with arrival_datetime column.
    prediction_datetime : datetime
        Datetime for prediction window start.
    prediction_window : int
        Window length in minutes.
    prediction_time : tuple
        Tuple of (hour, minute) for prediction time.
    x1 : float
        First x-coordinate for aspirational curve.
    y1 : float
        First y-coordinate for aspirational curve.
    x2 : float
        Second x-coordinate for aspirational curve.
    y2 : float
        Second y-coordinate for aspirational curve.
    date_range : tuple, optional
        Optional tuple of (start_date, end_date) to filter data.
    target_date : datetime.date, optional
        Optional specific date to analyze.
    target_weekday : int, optional
        Optional specific weekday to filter for (0-6, where 0 is Monday).

    Returns
    -------
    tuple
        Tuple of (arrived_before, arrived_after) DataFrames containing:
        - arrived_before : pandas.DataFrame
            DataFrame with arrivals before prediction time
        - arrived_after : pandas.DataFrame
            DataFrame with arrivals after prediction time
    """
    hour, minute = prediction_time

    # Create base time masks
    after_mask = create_time_mask(df, hour, minute)
    before_mask = ~after_mask

    # Add date and weekday conditions if specified
    if date_range:
        start_date, end_date = date_range
        date_mask = (df["arrival_datetime"].dt.date >= start_date) & (
            df["arrival_datetime"].dt.date < end_date
        )
        if target_weekday is not None:
            date_mask &= df["arrival_datetime"].dt.weekday == target_weekday

        after_mask &= date_mask
        before_mask &= date_mask

    if target_date:
        target_mask = df["arrival_datetime"].dt.date == target_date
        after_mask &= target_mask
        before_mask &= target_mask

    # Calculate probabilities for filtered groups
    arrived_before = calculate_admission_probs_relative_to_prediction(
        df[before_mask],
        prediction_datetime,
        prediction_window,
        x1,
        y1,
        x2,
        y2,
        is_before=True,
    )

    arrived_after = calculate_admission_probs_relative_to_prediction(
        df[after_mask],
        prediction_datetime,
        prediction_window,
        x1,
        y1,
        x2,
        y2,
        is_before=False,
    )

    return arrived_before, arrived_after
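
A usage sketch for a single target date; the DataFrame, date, and curve coordinates are illustrative assumptions:

from datetime import date, datetime
from patientflow.evaluate import get_arrivals_with_admission_probs

arrived_before, arrived_after = get_arrivals_with_admission_probs(
    df,
    prediction_datetime=datetime(2024, 1, 15, 9, 30),
    prediction_window=480,
    prediction_time=(9, 30),
    x1=4, y1=0.76, x2=12, y2=0.99,
    target_date=date(2024, 1, 15),
)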

predict_using_previous_weeks(df, dt, prediction_window, x1, y1, x2, y2, prediction_time, num_weeks, weighted=True)

Calculate predicted admissions remaining until midnight.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing patient data.

required
dt datetime

Date for prediction.

required
prediction_window int

Window length in minutes.

required
x1 float

First x-coordinate for aspirational curve.

required
y1 float

First y-coordinate for aspirational curve.

required
x2 float

Second x-coordinate for aspirational curve.

required
y2 float

Second y-coordinate for aspirational curve.

required
prediction_time Tuple[int, int]

Hour and minute of prediction.

required
num_weeks int

Number of previous weeks to consider.

required
weighted bool

Whether to weight the numbers according to aspirational ED targets. Default is True.

True

Returns:

Type Description
float

Predicted number of admissions remaining until midnight.

Source code in src/patientflow/evaluate.py
def predict_using_previous_weeks(
    df: pd.DataFrame,
    dt: datetime,
    prediction_window: int,
    x1: float,
    y1: float,
    x2: float,
    y2: float,
    prediction_time: Tuple[int, int],
    num_weeks: int,
    weighted: bool = True,
) -> float:
    """Calculate predicted admissions remaining until midnight.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing patient data.
    dt : datetime
        Date for prediction.
    prediction_window : int
        Window length in minutes.
    x1 : float
        First x-coordinate for aspirational curve.
    y1 : float
        First y-coordinate for aspirational curve.
    x2 : float
        Second x-coordinate for aspirational curve.
    y2 : float
        Second y-coordinate for aspirational curve.
    prediction_time : Tuple[int, int]
        Hour and minute of prediction.
    num_weeks : int
        Number of previous weeks to consider.
    weighted : bool, optional
        Whether to weight the numbers according to aspirational ED targets.
        Default is True.

    Returns
    -------
    float
        Predicted number of admissions remaining until midnight.
    """
    prediction_datetime = pd.to_datetime(dt).replace(
        hour=prediction_time[0], minute=prediction_time[1]
    )
    target_day_of_week = dt.weekday()

    end_date = dt - timedelta(days=1)
    start_date = end_date - timedelta(weeks=num_weeks)

    if weighted:
        # Create mask for historical data
        historical_mask = (
            (df["arrival_datetime"].dt.date >= start_date)
            & (df["arrival_datetime"].dt.date <= end_date)
            & (df["arrival_datetime"].dt.weekday == target_day_of_week)
        )

        # Create explicit copy of filtered data
        historical_data = df[historical_mask].copy()

        # Calculate minutes until midnight
        midnight_times = (
            historical_data["arrival_datetime"].dt.normalize()
            + pd.Timedelta(days=1)
            - pd.Timedelta(minutes=1)
        )
        historical_data.loc[:, "minutes_to_midnight"] = (
            midnight_times - historical_data["arrival_datetime"]
        ).dt.total_seconds() / 60

        # Calculate admission probabilities
        historical_data.loc[:, "admission_probability"] = historical_data[
            "minutes_to_midnight"
        ].apply(lambda x: get_y_from_aspirational_curve(x / 60, x1, y1, x2, y2))

        # Group by date and calculate average
        historical_daily_sums = historical_data.groupby(
            historical_data["arrival_datetime"].dt.date
        )["admission_probability"].sum()
        historical_average = historical_daily_sums.mean()

        # Create mask for today's data
        today_mask = (df["arrival_datetime"].dt.date == dt) & (
            df["arrival_datetime"] < prediction_datetime
        )

        # Create explicit copy of today's filtered data
        today_data = df[today_mask].copy()

        # Calculate minutes until midnight for today's data
        midnight_today = (
            pd.to_datetime(dt).normalize()
            + pd.Timedelta(days=1)
            - pd.Timedelta(minutes=1)
        )
        today_data.loc[:, "minutes_to_midnight"] = (
            midnight_today - today_data["arrival_datetime"]
        ).dt.total_seconds() / 60

        # Calculate admission probabilities for today
        today_data.loc[:, "admission_probability"] = today_data[
            "minutes_to_midnight"
        ].apply(lambda x: get_y_from_aspirational_curve(x / 60, x1, y1, x2, y2))

        today_sum = today_data["admission_probability"].sum()

        still_to_admit = max(historical_average - today_sum, 0)

    else:
        # Original unweighted logic with explicit copies
        historical_mask = (
            (df["arrival_datetime"].dt.date >= start_date)
            & (df["arrival_datetime"].dt.date < end_date)
            & (df["arrival_datetime"].dt.weekday == target_day_of_week)
        )
        historical_df = df[historical_mask].copy()
        average_count = len(historical_df) / num_weeks

        target_mask = (df["arrival_datetime"].dt.date == dt) & (
            df["arrival_datetime"] < prediction_datetime
        )
        target_date_count = len(df[target_mask])

        still_to_admit = max(average_count - target_date_count, 0)

    return still_to_admit
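
In the unweighted case the logic reduces to a same-weekday average minus arrivals already seen: if the previous six same weekdays averaged 40 arrivals across the whole day and 25 patients arrived before the prediction time today, the prediction is max(40 - 25, 0) = 15. A hedged usage sketch (df and the curve coordinates are assumptions):

from datetime import date
from patientflow.evaluate import predict_using_previous_weeks

still_to_admit = predict_using_previous_weeks(
    df,
    dt=date(2024, 1, 15),
    prediction_window=480,
    x1=4, y1=0.76, x2=12, y2=0.99,
    prediction_time=(9, 30),
    num_weeks=6,
)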

generate

Generate fake Emergency Department visit data.

This module provides functions to generate fake datasets for patient visits to an emergency department (ED). It generates arrival and departure times, triage scores, lab orders, and patient admissions. The functions are used for illustrative purposes in some of the notebooks.

Functions:

Name Description
create_fake_finished_visits

Generate synthetic patient visits, triage observations, and lab orders.

create_fake_snapshots

Create patient-level snapshots at specific times with visit, triage, and lab features.

create_fake_finished_visits(start_date, end_date, mean_patients_per_day, admitted_only=False)

Generate synthetic patient visit data for an emergency department.

This function simulates a realistic distribution of patient arrivals, triage scores, lengths of stay, admissions, and lab orders over a specified date range. Some patients may have multiple visits.

Parameters:

Name Type Description Default
start_date str or datetime

The starting date for the simulation (inclusive). Can be a datetime object or a string in 'YYYY-MM-DD' format.

required
end_date str or datetime

The ending date for the simulation (exclusive). Can be a datetime object or a string in 'YYYY-MM-DD' format.

required
mean_patients_per_day float

The average number of patient visits to generate per day.

required
admitted_only bool

If True, only return admitted patients. The mean_patients_per_day will be adjusted to maintain the same total number of admitted patients as would be expected in the full dataset.

False

Returns:

Name Type Description
visits_df DataFrame

DataFrame containing visit records with the following columns:
- 'visit_number'
- 'patient_id'
- 'arrival_datetime'
- 'departure_datetime'
- 'is_admitted'
- 'specialty'
- 'age'

observations_df DataFrame

DataFrame containing triage score observations with columns:
- 'visit_number'
- 'observation_datetime'
- 'triage_score'

lab_orders_df DataFrame

DataFrame containing lab test orders with columns:
- 'visit_number'
- 'order_datetime'
- 'lab_name'

Notes
  • Patients are more likely to arrive during daytime hours.
  • 20% of patients will have more than one visit during the simulation period.
  • Lab test ordering likelihood depends on the severity of the triage score.
  • When admitted_only=True, the mean_patients_per_day is adjusted to maintain the same number of admitted patients as would be expected in the full dataset.
Source code in src/patientflow/generate.py
def create_fake_finished_visits(
    start_date, end_date, mean_patients_per_day, admitted_only=False
):
    """
    Generate synthetic patient visit data for an emergency department.

    This function simulates a realistic distribution of patient arrivals, triage scores, lengths of stay,
    admissions, and lab orders over a specified date range. Some patients may have multiple visits.

    Parameters
    ----------
    start_date : str or datetime
        The starting date for the simulation (inclusive). Can be a datetime object or a string in 'YYYY-MM-DD' format.
    end_date : str or datetime
        The ending date for the simulation (exclusive). Can be a datetime object or a string in 'YYYY-MM-DD' format.
    mean_patients_per_day : float
        The average number of patient visits to generate per day.
    admitted_only : bool, optional
        If True, only return admitted patients. The mean_patients_per_day will be adjusted to maintain
        the same total number of admitted patients as would be expected in the full dataset.

    Returns
    -------
    visits_df : pandas.DataFrame
        DataFrame containing visit records with the following columns:
        - 'visit_number'
        - 'patient_id'
        - 'arrival_datetime'
        - 'departure_datetime'
        - 'is_admitted'
        - 'specialty'
        - 'age'
    observations_df : pandas.DataFrame
        DataFrame containing triage score observations with columns:
        - 'visit_number'
        - 'observation_datetime'
        - 'triage_score'
    lab_orders_df : pandas.DataFrame
        DataFrame containing lab test orders with columns:
        - 'visit_number'
        - 'order_datetime'
        - 'lab_name'

    Notes
    -----
    - Patients are more likely to arrive during daytime hours.
    - 20% of patients will have more than one visit during the simulation period.
    - Lab test ordering likelihood depends on the severity of the triage score.
    - When admitted_only=True, the mean_patients_per_day is adjusted to maintain the same number
      of admitted patients as would be expected in the full dataset.
    """

    # Convert string dates to datetime if needed
    if isinstance(start_date, str):
        start_date = datetime.strptime(start_date, "%Y-%m-%d")
    if isinstance(end_date, str):
        end_date = datetime.strptime(end_date, "%Y-%m-%d")

    # Set random seed for reproducibility
    np.random.seed(42)  # You can change this seed value as needed

    # Define admission probabilities based on triage score
    # Triage 1: 80% admission, Triage 2: 60%, Triage 3: 30%, Triage 4: 10%, Triage 5: 2%
    admission_probabilities = {
        1: 0.80,  # Highest severity - highest admission probability
        2: 0.60,
        3: 0.30,
        4: 0.10,
        5: 0.02,  # Lowest severity - lowest admission probability
    }

    # Define triage score distribution
    # Most common is 3-4, less common are 2 and 5, least common is 1 (most severe)
    triage_probabilities = [0.05, 0.15, 0.35, 0.35, 0.10]  # For scores 1-5

    # Calculate total days in range (changed to exclusive end date)
    days_range = (end_date - start_date).days

    # If admitted_only is True, adjust mean_patients_per_day to maintain the same number of admitted patients
    if admitted_only:
        # Calculate expected admission rate based on triage probabilities and admission probabilities
        expected_admission_rate = sum(
            triage_prob * admission_prob
            for triage_prob, admission_prob in zip(
                triage_probabilities, admission_probabilities.values()
            )
        )
        # Adjust mean_patients_per_day to maintain the same number of admitted patients
        mean_patients_per_day = mean_patients_per_day / expected_admission_rate

    # Generate random number of patients for each day using Poisson distribution
    daily_patients = np.random.poisson(mean_patients_per_day, days_range)

    # Calculate the total number of visits
    total_visits = sum(daily_patients)

    # Calculate approximately how many unique patients we need
    # If 20% of patients have more than one visit (let's assume they have exactly 2),
    # then for N total visits, we need approximately N * 0.8 + (N * 0.2) / 2 unique patients
    # Simplifying: N * (0.8 + 0.1) = N * 0.9 unique patients
    num_unique_patients = int(total_visits * 0.9)

    # Create patient ids
    patient_ids = list(range(1, num_unique_patients + 1))

    # Define common ED lab tests and their ordering probabilities based on triage score
    lab_tests = ["CBC", "BMP", "Troponin", "D-dimer", "Urinalysis"]
    lab_probabilities = {
        # Higher severity -> more likely to get labs
        1: {
            "CBC": 0.95,
            "BMP": 0.95,
            "Troponin": 0.90,
            "D-dimer": 0.70,
            "Urinalysis": 0.60,
        },
        2: {
            "CBC": 0.90,
            "BMP": 0.90,
            "Troponin": 0.80,
            "D-dimer": 0.60,
            "Urinalysis": 0.50,
        },
        3: {
            "CBC": 0.80,
            "BMP": 0.80,
            "Troponin": 0.60,
            "D-dimer": 0.40,
            "Urinalysis": 0.40,
        },
        4: {
            "CBC": 0.60,
            "BMP": 0.60,
            "Troponin": 0.30,
            "D-dimer": 0.20,
            "Urinalysis": 0.30,
        },
        5: {
            "CBC": 0.40,
            "BMP": 0.40,
            "Troponin": 0.15,
            "D-dimer": 0.10,
            "Urinalysis": 0.20,
        },
    }

    visits = []
    observations = []
    lab_orders = []
    visit_number = 1

    # Create a dictionary to track number of visits per patient
    patient_visit_count = {patient_id: 0 for patient_id in patient_ids}

    # Create a pool of patients who will have multiple visits (20% of patients)
    multi_visit_patients = set(
        np.random.choice(
            patient_ids, size=int(num_unique_patients * 0.2), replace=False
        )
    )

    for day_idx, num_patients in enumerate(daily_patients):
        current_date = start_date + timedelta(days=day_idx)

        # Generate patients for this day
        for _ in range(num_patients):
            # Select a patient ID based on our requirements
            # If we haven't assigned all patients yet, use a new one
            # Otherwise, pick from multi-visit patients
            available_new_patients = [
                pid for pid in patient_ids if patient_visit_count[pid] == 0
            ]

            if available_new_patients:
                # Use a new patient
                patient_id = np.random.choice(available_new_patients)
            else:
                # All patients have at least one visit, now use multi-visit patients
                patient_id = np.random.choice(list(multi_visit_patients))

            # Increment the visit count for this patient
            patient_visit_count[patient_id] += 1

            # Random hour for arrival (more likely during daytime)
            arrival_hour = np.random.normal(13, 4)  # Mean at 1 PM, std dev of 4 hours
            arrival_hour = max(0, min(23, int(arrival_hour)))  # Clamp between 0-23

            # Random minutes
            arrival_minute = np.random.randint(0, 60)

            # Create arrival datetime
            arrival_datetime = current_date.replace(
                hour=arrival_hour,
                minute=arrival_minute,
                second=np.random.randint(0, 60),
            )

            # Generate triage score (1-5)
            triage_score = np.random.choice([1, 2, 3, 4, 5], p=triage_probabilities)

            # Generate admission status based on triage score
            admission_prob = admission_probabilities[triage_score]
            is_admitted = np.random.choice(
                [0, 1], p=[1 - admission_prob, admission_prob]
            )

            # Generate specialty for admitted patients
            if is_admitted:
                specialty = np.random.choice(
                    ["medical", "surgical", "haem/onc", "paediatric"],
                    p=[0.65, 0.25, 0.05, 0.05],
                )
            else:
                specialty = None

            # Skip this visit if admitted_only is True and patient is not admitted
            if admitted_only and not is_admitted:
                continue

            # Generate length of stay (in minutes) - log-normal distribution
            # Most visits are 4 to 12 hours, but some can be shorter or longer
            length_of_stay = np.random.lognormal(mean=5.8, sigma=0.5)
            length_of_stay = max(
                60, min(2880, length_of_stay)
            )  # Between 1 hour and 48 hours

            # Make higher triage scores (more severe) stay longer on average
            if triage_score <= 2:
                length_of_stay *= 1.8  # 80% longer stays for more severe cases

            # Calculate departure time
            departure_datetime = arrival_datetime + timedelta(
                minutes=int(length_of_stay)
            )

            # For returning patients, use the same age as their first visit
            if patient_id in [v["patient_id"] for v in visits]:
                # Find the age from a previous visit
                age = next(v["age"] for v in visits if v["patient_id"] == patient_id)
            else:
                # Generate age with a distribution skewed towards older adults
                age = int(
                    np.random.lognormal(mean=3.8, sigma=0.5)
                )  # Centers around 45 years
                age = max(0, min(100, age))  # Clamp between 0-100 years

            # Add visit record (without triage score, but with patient_id)
            visits.append(
                {
                    "patient_id": patient_id,
                    "visit_number": visit_number,
                    "arrival_datetime": arrival_datetime,
                    "departure_datetime": departure_datetime,
                    "age": age,
                    "is_admitted": is_admitted,
                    "specialty": specialty,
                }
            )

            # Generate triage observation within first 10 minutes
            minutes_after_arrival = np.random.uniform(0, 10)
            observation_datetime = arrival_datetime + timedelta(
                minutes=minutes_after_arrival
            )

            observations.append(
                {
                    "visit_number": visit_number,
                    "observation_datetime": observation_datetime,
                    "triage_score": triage_score,
                }
            )

            # Generate lab orders if visit is longer than 2 hours
            if length_of_stay > 120:
                # For each lab test, decide if it should be ordered based on triage score
                for lab_test in lab_tests:
                    if np.random.random() < lab_probabilities[triage_score][lab_test]:
                        # Order time is after triage but within first 90 minutes
                        minutes_after_triage = np.random.uniform(
                            0, 90 - minutes_after_arrival
                        )
                        order_datetime = observation_datetime + timedelta(
                            minutes=minutes_after_triage
                        )

                        lab_orders.append(
                            {
                                "visit_number": visit_number,
                                "order_datetime": order_datetime,
                                "lab_name": lab_test,
                            }
                        )

            visit_number += 1

    # Create DataFrames and sort by time
    visits_df = pd.DataFrame(visits)
    visits_df = visits_df.sort_values("arrival_datetime").reset_index(drop=True)

    observations_df = pd.DataFrame(observations)
    observations_df = observations_df.sort_values("observation_datetime").reset_index(
        drop=True
    )

    lab_orders_df = pd.DataFrame(lab_orders)
    if not lab_orders_df.empty:
        lab_orders_df = lab_orders_df.sort_values("order_datetime").reset_index(
            drop=True
        )

    return visits_df, observations_df, lab_orders_df
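
A usage sketch generating one month of synthetic data (the dates are illustrative):

from patientflow.generate import create_fake_finished_visits

visits_df, observations_df, lab_orders_df = create_fake_finished_visits(
    "2024-01-01", "2024-02-01", mean_patients_per_day=50
)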

create_fake_snapshots(prediction_times, start_date, end_date, df=None, observations_df=None, lab_orders_df=None, mean_patients_per_day=50)

Generate patient-level snapshots at specific times for prediction modeling.

For each specified time on each date in the range, this function returns a snapshot of patients who are currently in the emergency department, along with their visit features, latest triage score, and number of lab tests ordered prior to that time.

Parameters:

Name Type Description Default
prediction_times list of tuple of int

A list of (hour, minute) tuples indicating times of day to create snapshots.

required
start_date str or datetime

The starting date for generating snapshots (inclusive).

required
end_date str or datetime

The ending date for generating snapshots (exclusive).

required
df DataFrame

Patient visit data from create_fake_finished_visits. If None, synthetic data is generated.

None
observations_df DataFrame

Triage score data from create_fake_finished_visits. If None, synthetic data is generated.

None
lab_orders_df DataFrame

Lab order data from create_fake_finished_visits. If None, synthetic data is generated.

None
mean_patients_per_day float

Average number of patients per day (used only if synthetic data is generated).

50

Returns:

Name Type Description
final_df DataFrame

A DataFrame with one row per patient visit present at the snapshot time. Columns include:
- 'snapshot_date'
- 'prediction_time'
- 'patient_id'
- 'visit_number'
- 'is_admitted'
- 'age'
- 'latest_triage_score'
- One column per lab test: 'num_<lab_name>_orders'

Notes
  • Only patients present in the ED at the snapshot time are included.
  • Lab order columns reflect counts of tests ordered before the snapshot time.
  • If no patients are present at a snapshot time, that snapshot is omitted.
Source code in src/patientflow/generate.py
def create_fake_snapshots(
    prediction_times,
    start_date,
    end_date,
    df=None,
    observations_df=None,
    lab_orders_df=None,
    mean_patients_per_day=50,
):
    """
    Generate patient-level snapshots at specific times for prediction modeling.

    For each specified time on each date in the range, this function returns a snapshot of patients
    who are currently in the emergency department, along with their visit features, latest triage score,
    and number of lab tests ordered prior to that time.

    Parameters
    ----------
    prediction_times : list of tuple of int
        A list of (hour, minute) tuples indicating times of day to create snapshots.
    start_date : str or datetime
        The starting date for generating snapshots (inclusive).
    end_date : str or datetime
        The ending date for generating snapshots (exclusive).
    df : pandas.DataFrame, optional
        Patient visit data from `create_fake_finished_visits`. If None, synthetic data is generated.
    observations_df : pandas.DataFrame, optional
        Triage score data from `create_fake_finished_visits`. If None, synthetic data is generated.
    lab_orders_df : pandas.DataFrame, optional
        Lab order data from `create_fake_finished_visits`. If None, synthetic data is generated.
    mean_patients_per_day : float, optional
        Average number of patients per day (used only if synthetic data is generated).

    Returns
    -------
    final_df : pandas.DataFrame
        A DataFrame with one row per patient visit present at the snapshot time. Columns include:
        - 'snapshot_date'
        - 'prediction_time'
        - 'patient_id'
        - 'visit_number'
        - 'is_admitted'
        - 'age'
        - 'latest_triage_score'
        - One column per lab test: 'num_<lab_name>_orders'

    Notes
    -----
    - Only patients present in the ED at the snapshot time are included.
    - Lab order columns reflect counts of tests ordered before the snapshot time.
    - If no patients are present at a snapshot time, that snapshot is omitted.
    """

    # Generate fake data if not provided
    if df is None or observations_df is None or lab_orders_df is None:
        df, observations_df, lab_orders_df = create_fake_finished_visits(
            start_date, end_date, mean_patients_per_day
        )

    # Add date conversion at the start
    if isinstance(start_date, str):
        start_date = datetime.strptime(start_date, "%Y-%m-%d").date()
    elif isinstance(start_date, datetime):
        start_date = start_date.date()

    if isinstance(end_date, str):
        end_date = datetime.strptime(end_date, "%Y-%m-%d").date()
    elif isinstance(end_date, datetime):
        end_date = end_date.date()

    # Create date range (changed to exclusive end date)
    snapshot_dates = []
    current_date = start_date
    while current_date < end_date:  # Changed from <= to <
        snapshot_dates.append(current_date)
        current_date += timedelta(days=1)

    # Get unique lab test names
    lab_tests = lab_orders_df["lab_name"].unique() if not lab_orders_df.empty else []

    # Create empty list to store all results
    all_results = []

    # For each combination of date and time
    for date in snapshot_dates:
        for hour, minute in prediction_times:
            snapshot_datetime = datetime.combine(date, time(hour=hour, minute=minute))

            # Filter dataframe for this snapshot
            mask = (df["arrival_datetime"] <= snapshot_datetime) & (
                df["departure_datetime"] > snapshot_datetime
            )
            snapshot_df = df[mask].copy()  # Create copy to avoid SettingWithCopyWarning

            # Skip if no patients at this time
            if len(snapshot_df) == 0:
                continue

            # Get triage scores recorded before the snapshot time
            valid_observations = observations_df[
                (observations_df["visit_number"].isin(snapshot_df["visit_number"]))
                & (observations_df["observation_datetime"] <= snapshot_datetime)
            ].copy()

            # Keep only the most recent triage score for each visit
            if not valid_observations.empty:
                valid_observations = valid_observations.sort_values(
                    "observation_datetime"
                )
                valid_observations = (
                    valid_observations.groupby("visit_number").last().reset_index()
                )
                valid_observations = valid_observations.rename(
                    columns={"triage_score": "latest_triage_score"}
                )

            # Get lab orders placed before the snapshot time
            valid_orders = lab_orders_df[
                (lab_orders_df["visit_number"].isin(snapshot_df["visit_number"]))
                & (lab_orders_df["order_datetime"] <= snapshot_datetime)
            ].copy()

            # Initialize lab_counts with zeros for all visits in snapshot_df
            lab_counts = pd.DataFrame(
                0,
                index=pd.Index(
                    snapshot_df["visit_number"].unique(), name="visit_number"
                ),
                columns=[f"num_{test.lower()}_orders" for test in lab_tests],
            )

            # Update counts if there are any valid orders
            if not valid_orders.empty:
                order_counts = (
                    valid_orders.groupby(["visit_number", "lab_name"])
                    .size()
                    .unstack(fill_value=0)
                )
                order_counts.columns = [
                    f"num_{test.lower()}_orders" for test in order_counts.columns
                ]
                # Update the counts in lab_counts where we have orders
                lab_counts.update(order_counts)

            lab_counts = lab_counts.reset_index()

            # Add snapshot information columns
            snapshot_df["snapshot_date"] = date
            snapshot_df["prediction_time"] = [(hour, minute)] * len(snapshot_df)

            # Merge with valid observations to get triage scores, handling empty case
            if not valid_observations.empty:
                snapshot_df = pd.merge(
                    snapshot_df,
                    valid_observations[["visit_number", "latest_triage_score"]],
                    on="visit_number",
                    how="left",
                )
            else:
                snapshot_df["latest_triage_score"] = pd.Series(
                    [np.nan] * len(snapshot_df),
                    dtype="float64",
                    index=snapshot_df.index,
                )
            # Merge with lab counts
            snapshot_df = pd.merge(
                snapshot_df, lab_counts, on="visit_number", how="left"
            )

            # Fill NA values in lab count columns with 0
            for col in snapshot_df.columns:
                if col.endswith("_orders"):
                    snapshot_df[col] = snapshot_df[col].fillna(0)
            if not snapshot_df.empty:
                # Optionally check for all-NA in key columns
                snapshot_cols = [
                    "snapshot_date",
                    "prediction_time",
                    "snapshot_datetime",
                ]
                # Only check columns that exist in the DataFrame
                check_cols = [
                    col for col in snapshot_cols if col in snapshot_df.columns
                ]

                if not check_cols or not snapshot_df[check_cols].isna().all().any():
                    all_results.append(snapshot_df)
                else:
                    print(
                        f"Skipping DataFrame with all-NA values in key columns: {check_cols}"
                    )
            else:
                print("Skipping empty DataFrame")

    # Combine all results into single dataframe
    if all_results:
        final_df = pd.concat(all_results, ignore_index=True)

        # Define column order
        snapshot_cols = ["snapshot_date", "prediction_time"]
        visit_cols = [
            "patient_id",
            "visit_number",
            "is_admitted",
            "age",
            "latest_triage_score",
        ]
        lab_cols = [col for col in final_df.columns if col.endswith("_orders")]

        # Ensure all required columns exist
        for col in visit_cols:
            if col not in final_df.columns:
                if col == "latest_triage_score":
                    final_df[col] = pd.NA
                else:
                    final_df[col] = None

        # Reorder columns
        final_df = final_df[snapshot_cols + visit_cols + lab_cols]
    else:
        # Create empty dataframe with correct columns if no results found
        lab_cols = [f"num_{test.lower()}_orders" for test in lab_tests]
        columns = [
            "snapshot_date",
            "prediction_time",
            "visit_number",
            "is_admitted",
            "age",
            "latest_triage_score",
        ] + lab_cols
        final_df = pd.DataFrame(columns=columns)

    # Name the index snapshot_id before returning
    final_df.index.name = "snapshot_id"
    return final_df
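
A usage sketch producing snapshots at two hypothetical prediction times:

from patientflow.generate import create_fake_snapshots

snapshots = create_fake_snapshots(
    prediction_times=[(9, 30), (15, 30)],
    start_date="2024-01-01",
    end_date="2024-02-01",
    mean_patients_per_day=50,
)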

load

This module provides functionality for loading configuration files, data from CSV files, and trained machine learning models.

It includes the following features:

  • Loading Configurations: Parse YAML configuration files and extract necessary parameters for data processing and modeling.
  • Data Handling: Load and preprocess data from CSV files, including optional operations like setting an index, sorting, and applying literal evaluation on columns.
  • Model Management: Load saved machine learning models, customize model filenames based on time, and categorize DataFrame columns into predefined groups for analysis.

The module handles common file and parsing errors, returning appropriate error messages or exceptions.

Functions:

Name Description
parse_args:

Parses command-line arguments for training models.

set_project_root:

Validates project root path from specified environment variable.

load_config_file:

Load a YAML configuration file and extract key parameters.

set_file_paths:

Sets up the file paths based on UCLH-specific or default parameters.

set_data_file_names:

Set file locations based on UCLH-specific or default data sources.

safe_literal_eval:

Safely evaluate string literals into Python objects when loading from csv.

load_data:

Load and preprocess data from a CSV or pickle file.

get_model_key:

Generate a model name based on the time of day.

load_saved_model:

Load a machine learning model saved in a joblib file.

get_dict_cols:

Categorize columns from a DataFrame into predefined groups for analysis.

data_from_csv(csv_path, index_column=None, sort_columns=None, eval_columns=None)

Loads data from a CSV file, with optional transformations. LEGACY!

This function loads a CSV file into a pandas DataFrame and provides the following optional features:
- Setting a specified column as the index.
- Sorting the DataFrame by one or more specified columns.
- Applying safe literal evaluation to specified columns to handle string representations of Python objects.

Parameters:

Name Type Description Default
csv_path str

The relative or absolute path to the CSV file.

required
index_column str

The column to set as the index of the DataFrame. If not provided, no index column is set.

None
sort_columns list of str

A list of columns by which to sort the DataFrame. If not provided, the DataFrame is not sorted.

None
eval_columns list of str

A list of columns to which safe_literal_eval should be applied. This is useful for columns containing string representations of Python data structures (e.g., lists, dictionaries).

None

Returns:

Type Description
DataFrame

A pandas DataFrame containing the loaded data with any specified transformations applied.

Raises:

Type Description
SystemExit

If the file cannot be found or another error occurs during loading or processing.

Notes

The function will terminate the program with a message if the file is not found or if any errors occur while loading the data. If sorting columns or applying safe_literal_eval fails, a warning message is printed, but execution continues.

Source code in src/patientflow/load.py
def data_from_csv(csv_path, index_column=None, sort_columns=None, eval_columns=None):
    """
    Loads data from a CSV file, with optional transformations. LEGACY!

    This function loads a CSV file into a pandas DataFrame and provides the following optional features:
    - Setting a specified column as the index.
    - Sorting the DataFrame by one or more specified columns.
    - Applying safe literal evaluation to specified columns to handle string representations of Python objects.

    Parameters
    ----------
    csv_path : str
        The relative or absolute path to the CSV file.
    index_column : str, optional
        The column to set as the index of the DataFrame. If not provided, no index column is set.
    sort_columns : list of str, optional
        A list of columns by which to sort the DataFrame. If not provided, the DataFrame is not sorted.
    eval_columns : list of str, optional
        A list of columns to which `safe_literal_eval` should be applied. This is useful for columns containing
        string representations of Python data structures (e.g., lists, dictionaries).

    Returns
    -------
    pd.DataFrame
        A pandas DataFrame containing the loaded data with any specified transformations applied.

    Raises
    ------
    SystemExit
        If the file cannot be found or another error occurs during loading or processing.

    Notes
    -----
    The function will terminate the program with a message if the file is not found or if any errors
    occur while loading the data. If sorting columns or applying `safe_literal_eval` fails,
    a warning message is printed, but execution continues.

    """
    path = os.path.join(Path().home(), csv_path)

    if not os.path.exists(path):
        print(f"Data file not found at path: {path}")
        sys.exit(1)

    try:
        df = pd.read_csv(path, parse_dates=True)
    except FileNotFoundError:
        print(f"Data file not found at path: {path}")
        sys.exit(1)
    except Exception as e:
        print(f"Error loading data: {e}")
        sys.exit(1)

    if index_column:
        try:
            if df.index.name != index_column:
                df = df.set_index(index_column)
        except KeyError:
            print(f"Index column '{index_column}' not found in dataframe")

    if sort_columns:
        try:
            df.sort_values(sort_columns, inplace=True)
        except KeyError:
            print("One or more sort columns not found in dataframe")

    if eval_columns:
        for column in eval_columns:
            if column in df.columns:
                try:
                    df[column] = df[column].apply(safe_literal_eval)
                except Exception as e:
                    print(f"Error applying safe_literal_eval to column '{column}': {e}")

    return df
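
A usage sketch; the file path and column names are hypothetical, and the path is resolved relative to the user's home directory:

from patientflow.load import data_from_csv

df = data_from_csv(
    "data/ed_visits.csv",
    index_column="snapshot_id",
    sort_columns=["arrival_datetime"],
    eval_columns=["prediction_time"],
)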

get_dict_cols(df)

Categorize DataFrame columns into predefined groups.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to categorize.

required

Returns:

Type Description
dict

A dictionary where keys are column group names and values are lists of column names in each group.

Source code in src/patientflow/load.py
def get_dict_cols(df):
    """
    Categorize DataFrame columns into predefined groups.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to categorize.

    Returns
    -------
    dict
        A dictionary where keys are column group names and values are lists of column names in each group.
    """
    not_used_in_training_vars = [
        "snapshot_id",
        "snapshot_date",
        "prediction_time",
        "visit_number",
        "training_validation_test",
        "random_number",
    ]
    arrival_and_demographic_vars = [
        "elapsed_los",
        "sex",
        "age_group",
        "age_on_arrival",
        "arrival_method",
    ]
    summary_vars = [
        "num_obs",
        "num_obs_events",
        "num_obs_types",
        "num_lab_batteries_ordered",
    ]

    location_vars = []
    observations_vars = []
    labs_vars = []
    consults_vars = [
        "has_consultation",
        "consultation_sequence",
        "final_sequence",
        "specialty",
    ]
    outcome_vars = ["is_admitted"]

    for col in df.columns:
        if (
            col in not_used_in_training_vars
            or col in arrival_and_demographic_vars
            or col in summary_vars
        ):
            continue
        elif "visited" in col or "location" in col:
            location_vars.append(col)
        elif "num_obs" in col or "latest_obs" in col:
            observations_vars.append(col)
        elif "lab_orders" in col or "latest_lab_results" in col:
            labs_vars.append(col)
        elif col in consults_vars or col in outcome_vars:
            continue  # Already categorized
        else:
            print(f"Column '{col}' did not match any predefined group")

    # Create a list of column groups
    col_group_names = [
        "not used in training",
        "arrival and demographic",
        "summary",
        "location",
        "observations",
        "lab orders and results",
        "consults",
        "outcome",
    ]

    # Create a list of the column names within those groups
    col_groups = [
        not_used_in_training_vars,
        arrival_and_demographic_vars,
        summary_vars,
        location_vars,
        observations_vars,
        labs_vars,
        consults_vars,
        outcome_vars,
    ]

    # Use dictionary to combine them
    dict_col_groups = {
        category: var_list for category, var_list in zip(col_group_names, col_groups)
    }

    return dict_col_groups
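
For example, to inspect how a snapshot DataFrame's columns have been grouped:

from patientflow.load import get_dict_cols

dict_col_groups = get_dict_cols(df)  # df: a snapshots DataFrame as above
for group, cols in dict_col_groups.items():
    print(f"{group}: {len(cols)} columns")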

get_model_key(model_name, prediction_time)

Create a model name based on the time of day.

Parameters:

Name Type Description Default
model_name str

The base name of the model.

required
prediction_time tuple of int

A tuple representing the time of day (hour, minute).

required

Returns:

Type Description
str

A string representing the model name based on the time of day.

Source code in src/patientflow/load.py
def get_model_key(model_name, prediction_time):
    """
    Create a model name based on the time of day.

    Parameters
    ----------
    model_name : str
        The base name of the model.
    prediction_time : tuple of int
        A tuple representing the time of day (hour, minute).

    Returns
    -------
    str
        A string representing the model name based on the time of day.
    """

    hour_, min_ = prediction_time
    # Zero-pad hour and minute so that, e.g., (9, 5) yields "0905" rather than "095"
    model_name = f"{model_name}_{hour_:02}{min_:02}"
    return model_name
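
For example, with a hypothetical base name:

from patientflow.load import get_model_key

get_model_key("admissions", (9, 30))  # "admissions_0930"
get_model_key("admissions", (15, 0))  # "admissions_1500"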

load_config_file(config_file_path, return_start_end_dates=False)

Load configuration from a YAML file.

Parameters:

Name Type Description Default
config_file_path str

The path to the configuration file.

required
return_start_end_dates bool

If True, return only the start and end dates from the file (default is False).

False

Returns:

Type Description
dict or tuple or None

If return_start_end_dates is True, returns a tuple of start and end dates (str). Otherwise, returns a dictionary containing the configuration parameters. Returns None if an error occurs during file reading or parsing.

Source code in src/patientflow/load.py
def load_config_file(
    config_file_path: str, return_start_end_dates: bool = False
) -> Optional[Union[Dict[str, Any], Tuple[str, str]]]:
    """
    Load configuration from a YAML file.

    Parameters
    ----------
    config_file_path : str
        The path to the configuration file.
    return_start_end_dates : bool, optional
        If True, return only the start and end dates from the file (default is False).

    Returns
    -------
    dict or tuple or None
        If `return_start_end_dates` is True, returns a tuple of start and end dates (str).
        Otherwise, returns a dictionary containing the configuration parameters.
        Returns None if an error occurs during file reading or parsing.
    """
    try:
        with open(config_file_path, "r") as file:
            config = yaml.safe_load(file)
    except FileNotFoundError:
        print(f"Error: The file '{config_file_path}' was not found.")
        return None
    except yaml.YAMLError as e:
        print(f"Error parsing YAML file: {e}")
        return None

    try:
        if return_start_end_dates:
            # load the dates used in saved data for uclh versions
            if "file_dates" in config and config["file_dates"]:
                start_date, end_date = [str(item) for item in config["file_dates"]]
                return (start_date, end_date)
            else:
                print(
                    "Error: 'file_dates' key not found or empty in the configuration file."
                )
                return None

        params: Dict[str, Any] = {}

        if "prediction_times" in config:
            params["prediction_times"] = [
                tuple(item) for item in config["prediction_times"]
            ]
        else:
            print("Error: 'prediction_times' key not found in the configuration file.")
            sys.exit(1)

        if "modelling_dates" in config and len(config["modelling_dates"]) == 4:
            (
                params["start_training_set"],
                params["start_validation_set"],
                params["start_test_set"],
                params["end_test_set"],
            ) = [item for item in config["modelling_dates"]]
        else:
            print(
                f"Error: expecting 4 modelling dates and only got {len(config.get('modelling_dates', []))}"
            )
            return None

        params["x1"] = float(config.get("x1", 4))
        params["y1"] = float(config.get("y1", 0.76))
        params["x2"] = float(config.get("x2", 12))
        params["y2"] = float(config.get("y2", 0.99))
        params["prediction_window"] = config.get("prediction_window", 480)
        params["epsilon"] = config.get("epsilon", 10**-7)
        params["yta_time_interval"] = config.get("yta_time_interval", 15)

        return params

    except KeyError as e:
        print(f"Error: Missing key in the configuration file: {e}")
        return None
    except ValueError as e:
        print(f"Error: Invalid value found in the configuration file: {e}")
        return None
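
The following sketch writes a small illustrative configuration file and loads it; every value is an example only, not a default shipped with the package:

from pathlib import Path

config_text = """
prediction_times:
  - [6, 0]
  - [9, 30]
modelling_dates:
  - 2031-01-01
  - 2031-04-01
  - 2031-06-01
  - 2031-09-01
x1: 4
y1: 0.76
x2: 12
y2: 0.99
prediction_window: 480
"""
Path("config.yaml").write_text(config_text)

params = load_config_file("config.yaml")
# params["prediction_times"] == [(6, 0), (9, 30)]
# params["prediction_window"] == 480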

load_data(data_file_path, file_name, index_column=None, sort_columns=None, eval_columns=None, home_path=None, encoding=None)

Loads data from CSV or pickle file with optional transformations.

Parameters:

Name Type Description Default
data_file_path str

Directory path containing the data file

required
file_name str

Name of the CSV or pickle file to load

required
index_column str

Column to set as DataFrame index

None
sort_columns list of str

Columns to sort DataFrame by

None
eval_columns list of str

Columns to apply safe_literal_eval to

None
home_path str or Path

Base path to use instead of user's home directory

None
encoding str

The encoding to use when reading CSV files (e.g., 'utf-8', 'latin1')

None

Returns:

Type Description
DataFrame

Loaded and transformed DataFrame

Raises:

Type Description
FileNotFoundError

If the specified file does not exist

ValueError

If the file format is not supported or other processing errors occur

Source code in src/patientflow/load.py
def load_data(
    data_file_path,
    file_name,
    index_column=None,
    sort_columns=None,
    eval_columns=None,
    home_path=None,
    encoding=None,
):
    """
    Loads data from CSV or pickle file with optional transformations.

    Parameters
    ----------
    data_file_path : str
        Directory path containing the data file
    file_name : str
        Name of the CSV or pickle file to load
    index_column : str, optional
        Column to set as DataFrame index
    sort_columns : list of str, optional
        Columns to sort DataFrame by
    eval_columns : list of str, optional
        Columns to apply safe_literal_eval to
    home_path : str or Path, optional
        Base path to use instead of user's home directory
    encoding : str, optional
        The encoding to use when reading CSV files (e.g., 'utf-8', 'latin1')

    Returns
    -------
    pd.DataFrame
        Loaded and transformed DataFrame

    Raises
    ------
    FileNotFoundError
        If the specified file does not exist
    ValueError
        If the file format is not supported or other processing errors occur
    """
    from pathlib import Path

    # Use provided home_path if available, otherwise default to user's home directory
    base_path = Path(home_path) if home_path else Path.home()
    path = base_path / data_file_path / file_name

    if not path.exists():
        raise FileNotFoundError(f"Data file not found at path: {path}")

    try:
        if path.suffix.lower() == ".csv":
            df = pd.read_csv(path, parse_dates=True, encoding=encoding)
        elif path.suffix.lower() == ".pkl":
            df = pd.read_pickle(path)
        else:
            raise ValueError(
                f"Unsupported file format: {path.suffix}. Must be .csv or .pkl"
            )
    except Exception as e:
        raise ValueError(f"Error loading data: {str(e)}")

    if index_column and df.index.name != index_column:
        try:
            df = df.set_index(index_column)
        except KeyError:
            print(f"Warning: Index column '{index_column}' not found in dataframe")

    if sort_columns:
        try:
            df.sort_values(sort_columns, inplace=True)
        except KeyError:
            print("Warning: One or more sort columns not found in dataframe")

    if eval_columns:
        for column in eval_columns:
            if column in df.columns:
                try:
                    df[column] = df[column].apply(safe_literal_eval)
                except Exception as e:
                    print(
                        f"Warning: Error applying safe_literal_eval to column '{column}': {str(e)}"
                    )

    return df
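
A hedged usage sketch; the folder, file and column names below are placeholders for whatever your extract actually contains, and the file is assumed to exist at that location:

df = load_data(
    data_file_path="patientflow/data-synthetic",  # resolved relative to the home directory
    file_name="ed_visits.csv",
    index_column="snapshot_id",
    sort_columns=["visit_number", "snapshot_date"],
    eval_columns=["consultation_sequence"],
)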

parse_args()

Parse command-line arguments for the training script.

Returns:
    argparse.Namespace: The parsed arguments containing 'data_folder_name' and 'uclh' attributes.

Source code in src/patientflow/load.py
def parse_args() -> argparse.Namespace:
    """
    Parse command-line arguments for the training script.

    Returns:
        argparse.Namespace: The parsed arguments containing 'data_folder_name' and 'uclh' keys.
    """
    parser = argparse.ArgumentParser(description="Train emergency demand models")
    parser.add_argument(
        "--data_folder_name",
        type=str,
        default="data-synthetic",
        help="Location of data for training",
    )
    parser.add_argument(
        "--uclh",
        type=lambda x: x.lower() in ["true", "1", "yes", "y"],
        default=False,
        help="Train using UCLH data (True) or Public data (False)",
    )
    args = parser.parse_args()
    return args
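
For illustration, assuming the function is called from a training script named train.py (a hypothetical name), the flags parse as follows:

# python train.py --data_folder_name data-synthetic          -> uclh=False
# python train.py --data_folder_name data-uclh --uclh yes    -> uclh=True
args = parse_args()
print(args.data_folder_name, args.uclh)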

safe_literal_eval(s)

Safely evaluate a string literal into a Python object. Handles list-like strings by converting them to lists.

Parameters:

Name Type Description Default
s str

The string to evaluate.

required

Returns:

Type Description
Any, list, or None

The evaluated Python object if successful, a list if the input is list-like, or None for empty/null values.

Source code in src/patientflow/load.py
def safe_literal_eval(s):
    """
    Safely evaluate a string literal into a Python object.
    Handles list-like strings by converting them to lists.

    Parameters
    ----------
    s : str
        The string to evaluate.

    Returns
    -------
    Any, list, or None
        The evaluated Python object if successful, a list if the input is list-like,
        or None for empty/null values.
    """
    if pd.isna(s) or str(s).strip().lower() in ["nan", "none", ""]:
        return None

    if isinstance(s, str):
        s = s.strip()
        if s.startswith("[") and s.endswith("]"):
            try:
                # Remove square brackets and split by comma
                items = s[1:-1].split(",")
                # Strip whitespace from each item and remove empty strings
                return [item.strip() for item in items if item.strip()]
            except Exception:
                # If the above fails, fall back to ast.literal_eval
                pass

    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        # If ast.literal_eval fails, return the original string
        return s
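
A few illustrative inputs and the values this returns:

safe_literal_eval("[acute, paeds]")  # ['acute', 'paeds']
safe_literal_eval("('x', 'y')")      # ('x', 'y')
safe_literal_eval("nan")             # None
safe_literal_eval("surgical")        # 'surgical' (falls back to the original string)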

set_data_file_names(uclh, data_file_path, config_file_path=None)

Set file locations based on UCLH or default data source.

Parameters:

Name Type Description Default
uclh bool

If True, use UCLH-specific file locations. If False, use default file locations.

required
data_file_path Path

The base path to the data directory.

required
config_file_path str

The path to the configuration file, required if uclh is True.

None

Returns:

Type Description
tuple

Paths to the required files (visits, arrivals) based on the configuration.

Source code in src/patientflow/load.py
def set_data_file_names(uclh, data_file_path, config_file_path=None):
    """
    Set file locations based on UCLH or default data source.

    Parameters
    ----------
    uclh : bool
        If True, use UCLH-specific file locations. If False, use default file locations.
    data_file_path : Path
        The base path to the data directory.
    config_file_path : str, optional
        The path to the configuration file, required if `uclh` is True.

    Returns
    -------
    tuple
        Paths to the required files (visits, arrivals) based on the configuration.
    """
    if not isinstance(data_file_path, Path):
        data_file_path = Path(data_file_path)

    if not uclh:
        csv_filename = "ed_visits.csv"
        yta_csv_filename = "inpatient_arrivals.csv"

        visits_csv_path = data_file_path / csv_filename
        yta_csv_path = data_file_path / yta_csv_filename

        return visits_csv_path, yta_csv_path

    else:
        start_date, end_date = load_config_file(
            config_file_path, return_start_end_dates=True
        )
        data_filename = (
            "uclh_visits_exc_beds_inc_minority_"
            + str(start_date)
            + "_"
            + str(end_date)
            + ".pickle"
        )
        csv_filename = "uclh_ed_visits.csv"
        yta_filename = (
            "uclh_yet_to_arrive_" + str(start_date) + "_" + str(end_date) + ".pickle"
        )
        yta_csv_filename = "uclh_inpatient_arrivals.csv"

        visits_path = data_file_path / data_filename
        yta_path = data_file_path / yta_filename

        visits_csv_path = data_file_path / csv_filename
        yta_csv_path = data_file_path / yta_csv_filename

    return visits_path, visits_csv_path, yta_path, yta_csv_path
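
A short sketch of both branches; the folder names are placeholders:

from pathlib import Path

# Default (public/synthetic) data: two CSV paths are returned
visits_csv_path, yta_csv_path = set_data_file_names(
    uclh=False, data_file_path=Path("data-synthetic")
)

# UCLH data additionally needs the config file for the saved date range:
# visits_path, visits_csv_path, yta_path, yta_csv_path = set_data_file_names(
#     uclh=True, data_file_path=Path("data-uclh"), config_file_path="config.yaml"
# )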

set_file_paths(project_root, data_folder_name, train_dttm=None, inference_time=False, config_file='config.yaml', prefix=None, verbose=True)

Sets up the file paths

Args:
    project_root (Path): Root path of the project
    data_folder_name (str): Name of the folder where data files are located
    train_dttm (Optional[str], optional): A string representation of the datetime at which training commenced. Defaults to None
    inference_time (bool, optional): A flag indicating whether it is inference time or not. Defaults to False
    config_file (str, optional): Name of config file. Defaults to "config.yaml"
    prefix (Optional[str], optional): String to prefix model folder names. Defaults to None
    verbose (bool, optional): Whether to print path information. Defaults to True

Returns:
    tuple: Contains (data_file_path, media_file_path, model_file_path, config_path)

Source code in src/patientflow/load.py
def set_file_paths(
    project_root: Path,
    data_folder_name: str,
    train_dttm: Optional[str] = None,
    inference_time: bool = False,
    config_file: str = "config.yaml",
    prefix: Optional[str] = None,
    verbose: bool = True,
) -> Tuple[Path, Path, Path, Path]:
    """
    Sets up the file paths

    Args:
        project_root (Path): Root path of the project
        data_folder_name (str): Name of the folder where data files are located
        train_dttm (Optional[str], optional): A string representation of the datetime at which training commenced. Defaults to None
        inference_time (bool, optional): A flag indicating whether it is inference time or not. Defaults to False
        config_file (str, optional): Name of config file. Defaults to "config.yaml"
        prefix (Optional[str], optional): String to prefix model folder names. Defaults to None
        verbose (bool, optional): Whether to print path information. Defaults to True

    Returns:
        tuple: Contains (data_file_path, media_file_path, model_file_path, config_path)
    """

    config_path = Path(project_root) / config_file
    if verbose:
        print(f"Configuration will be loaded from: {config_path}")

    data_file_path = Path(project_root) / data_folder_name
    if verbose:
        print(f"Data files will be loaded from: {data_file_path}")

    model_id = data_folder_name.removeprefix("data-")  # strip the literal 'data-' prefix
    if prefix:
        model_id = f"{prefix}_{model_id}"
    if train_dttm:
        model_id = f"{model_id}_{train_dttm}"

    model_file_path = Path(project_root) / "trained-models" / model_id
    media_file_path = model_file_path / "media"

    if not inference_time:
        if verbose:
            print(f"Trained models will be saved to: {model_file_path}")
        model_file_path.mkdir(parents=True, exist_ok=True)
        (model_file_path / "model-output").mkdir(parents=False, exist_ok=True)
        media_file_path.mkdir(parents=False, exist_ok=True)
        if verbose:
            print(f"Images will be saved to: {media_file_path}")

    return data_file_path, media_file_path, model_file_path, config_path
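
A minimal sketch of the call; the root and folder names are stand-ins, and inference_time=True is used here so that no folders are created:

from pathlib import Path

data_file_path, media_file_path, model_file_path, config_path = set_file_paths(
    project_root=Path.cwd(),      # stand-in for the real project root
    data_folder_name="data-synthetic",
    prefix="demo",
    inference_time=True,          # skip creating the model/media folders in this sketch
)
# model_file_path ends in trained-models/demo_synthetic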

set_project_root(env_var=None)

Sets project root path from environment variable or infers it from current path.

First checks specified environment variable for project root path. If not found, searches current path hierarchy for highest-level 'patientflow' directory.

Args:
    env_var (Optional[str]): Name of environment variable containing project root path

Returns:
    Path: Validated project root path

Raises:
    ValueError: If environment variable not set and 'patientflow' not found in path
    NotADirectoryError: If path doesn't exist
    TypeError: If env_var is not None and not a string

Source code in src/patientflow/load.py
def set_project_root(env_var: Optional[str] = None) -> Path:
    """
    Sets project root path from environment variable or infers it from current path.

    First checks specified environment variable for project root path.
    If not found, searches current path hierarchy for highest-level 'patientflow' directory.

    Args:
        env_var (Optional[str]): Name of environment variable containing project root path

    Returns:
        Path: Validated project root path

    Raises:
        ValueError: If environment variable not set and 'patientflow' not found in path
        NotADirectoryError: If path doesn't exist
        TypeError: If env_var is not None and not a string
    """
    # Only try to get env path if env_var is provided
    env_path: Optional[str] = os.getenv(env_var) if env_var is not None else None
    project_root: Optional[Path] = None

    # Try getting from environment variable first
    if env_path is not None:
        try:
            project_root = Path(env_path)
            if not project_root.is_dir():
                raise NotADirectoryError(f"Path does not exist: {project_root}")
            print(f"Project root from environment: {project_root}")
            return project_root
        except (TypeError, ValueError) as e:
            print(f"Error converting {env_path} to Path: {e}")
            raise
    else:
        # If not in env var, try to infer from current path
        current: Path = Path().absolute()

        # Search through parents to find highest-level 'patientflow' directory
        for parent in [current, *current.parents]:
            if parent.name == "patientflow" and parent.is_dir():
                project_root = parent
                # Continue searching to find highest level

        if project_root:
            print(f"Inferred project root: {project_root}")
            return project_root

        print(
            f"Could not find project root - {env_var} not set and 'patientflow' not found in path"
        )
        print(f"\nCurrent directory: {Path().absolute()}")
        if env_var:
            print(f"\nRun one of these commands in a new cell to set {env_var}:")
            print("# Linux/Mac:")
            print(f"%env {env_var}=/path/to/project")
            print("\n# Windows:")
            print(f"%env {env_var}=C:\\path\\to\\project")
        raise ValueError("Project root not found")
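
A minimal sketch; the environment variable name is arbitrary, and any existing directory satisfies the check:

import os

os.environ["PATIENTFLOW_ROOT"] = os.getcwd()  # any existing directory works here
project_root = set_project_root(env_var="PATIENTFLOW_ROOT")
# With no environment variable set, the current working directory must sit
# somewhere beneath a folder named 'patientflow' for inference to succeed.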

model_artifacts

Model training results containers.

This module defines a set of data classes to organise results from model training, including hyperparameter tuning, cross-validation fold metrics, and final trained classifier artifacts. These classes serve as structured containers for various types of model evaluation outputs and metadata.

Classes:

Name Description
HyperParameterTrial

Container for storing hyperparameter tuning trial results.

FoldResults

Stores evaluation metrics from a single cross-validation fold.

TrainingResults

Encapsulates comprehensive evaluation metrics and metadata from model training.

TrainedClassifier

Container for a trained model and associated training results.

FoldResults dataclass

Store evaluation metrics for a single fold.

Attributes:

Name Type Description
auc float

Area Under the ROC Curve (AUC) for this fold.

logloss float

Logarithmic loss (cross-entropy loss) for this fold.

auprc float

Area Under the Precision-Recall Curve (AUPRC) for this fold.

Source code in src/patientflow/model_artifacts.py
@dataclass
class FoldResults:
    """
    Store evaluation metrics for a single fold.

    Attributes
    ----------
    auc : float
        Area Under the ROC Curve (AUC) for this fold.
    logloss : float
        Logarithmic loss (cross-entropy loss) for this fold.
    auprc : float
        Area Under the Precision-Recall Curve (AUPRC) for this fold.
    """

    auc: float
    logloss: float
    auprc: float

HyperParameterTrial dataclass

Container for a single hyperparameter tuning trial.

Attributes:

Name Type Description
parameters dict of str to Any

Dictionary of hyperparameters used in the trial.

cv_results dict of str to float

Cross-validation metrics obtained using the specified parameters.

Source code in src/patientflow/model_artifacts.py
@dataclass
class HyperParameterTrial:
    """
    Container for a single hyperparameter tuning trial.

    Attributes
    ----------
    parameters : dict of str to Any
        Dictionary of hyperparameters used in the trial.
    cv_results : dict of str to float
        Cross-validation metrics obtained using the specified parameters.
    """

    parameters: Dict[str, Any]
    cv_results: Dict[str, float]

TrainedClassifier dataclass

Container for trained model artifacts and their associated information.

Attributes:

Name Type Description
training_results TrainingResults

Evaluation metrics and training metadata for the classifier.

pipeline (Pipeline or None, optional)

The scikit-learn pipeline representing the trained classifier.

calibrated_pipeline (Pipeline or None, optional)

The calibrated version of the pipeline, if model calibration was performed.

Source code in src/patientflow/model_artifacts.py
@dataclass
class TrainedClassifier:
    """
    Container for trained model artifacts and their associated information.

    Attributes
    ----------
    training_results : TrainingResults
        Evaluation metrics and training metadata for the classifier.
    pipeline : sklearn.pipeline.Pipeline or None, optional
        The scikit-learn pipeline representing the trained classifier.
    calibrated_pipeline : sklearn.pipeline.Pipeline or None, optional
        The calibrated version of the pipeline, if model calibration was performed.
    """

    training_results: TrainingResults
    pipeline: Optional[Pipeline] = None
    calibrated_pipeline: Optional[Pipeline] = None

TrainingResults dataclass

Store comprehensive evaluation metrics and metadata from model training.

Attributes:

Name Type Description
prediction_time tuple of int

The time of day (hour, minute) at which the prediction is made.

training_info dict of str to Any, optional

Metadata or logs collected during training.

calibration_info dict of str to Any, optional

Information about model calibration, if applicable.

test_results dict of str to float, optional

Evaluation metrics computed on the test dataset. None if test evaluation was not performed.

balance_info dict of str to bool or int or float, optional

Information related to class balance (e.g., whether data was balanced, class ratios).

Source code in src/patientflow/model_artifacts.py
@dataclass
class TrainingResults:
    """
    Store comprehensive evaluation metrics and metadata from model training.

    Attributes
    ----------
    prediction_time : tuple of int
        The time of day (hour, minute) at which the prediction is made.
    training_info : dict of str to Any, optional
        Metadata or logs collected during training.
    calibration_info : dict of str to Any, optional
        Information about model calibration, if applicable.
    test_results : dict of str to float, optional
        Evaluation metrics computed on the test dataset. None if test evaluation was not performed.
    balance_info : dict of str to bool or int or float, optional
        Information related to class balance (e.g., whether data was balanced, class ratios).
    """

    prediction_time: Tuple[int, int]
    training_info: Dict[str, Any] = field(default_factory=dict)
    calibration_info: Dict[str, Any] = field(default_factory=dict)
    test_results: Optional[Dict[str, float]] = None
    balance_info: Dict[str, Union[bool, int, float]] = field(default_factory=dict)
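
How the containers fit together; all metric values below are illustrative:

fold = FoldResults(auc=0.80, logloss=0.36, auprc=0.62)

results = TrainingResults(
    prediction_time=(9, 30),  # (hour, minute) of the prediction moment
    test_results={"auc": 0.81, "logloss": 0.35, "auprc": 0.64},
)
model = TrainedClassifier(training_results=results)  # pipeline attached after training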

predict

Prediction module for patient flow forecasting.

This module provides functions for making predictions about future patient flow, including emergency demand forecasting and other predictive analytics.

emergency_demand

Emergency demand prediction module.

This module provides functionality for predicting emergency department demand, including specialty-specific predictions for both current patients and yet-to-arrive patients. It handles probability calculations, model predictions, and threshold-based resource estimation.

The module integrates multiple prediction models:

- Admission prediction classifier
- Specialty sequence predictor
- Yet-to-arrive weighted Poisson predictor

Functions:

Name Description
add_missing_columns : function

Add missing columns required by the prediction pipeline

find_probability_threshold_index : function

Find index where cumulative probability exceeds threshold

get_specialty_probs : function

Calculate specialty probability distributions

create_predictions : function

Create predictions for emergency demand

add_missing_columns(pipeline, df)

Add missing columns required by the prediction pipeline from the training data.

Parameters:

Name Type Description Default
pipeline Pipeline

The trained pipeline containing the feature transformer

required
df DataFrame

Input dataframe that may be missing required columns

required

Returns:

Type Description
DataFrame

DataFrame with missing columns added and filled with appropriate default values

Notes

Adds columns with default values based on column name patterns:

- lab_orders_, visited_, has_ : False
- num_, total_ : 0
- latest_ : pd.NA
- arrival_method : "None"
- others : pd.NA

Source code in src/patientflow/predict/emergency_demand.py
def add_missing_columns(pipeline, df):
    """Add missing columns required by the prediction pipeline from the training data.

    Parameters
    ----------
    pipeline : sklearn.pipeline.Pipeline
        The trained pipeline containing the feature transformer
    df : pandas.DataFrame
        Input dataframe that may be missing required columns

    Returns
    -------
    pandas.DataFrame
        DataFrame with missing columns added and filled with appropriate default values

    Notes
    -----
    Adds columns with default values based on column name patterns:
    - lab_orders_, visited_, has_ : False
    - num_, total_ : 0
    - latest_ : pd.NA
    - arrival_method : "None"
    - others : pd.NA
    """
    # check input data for missing columns
    column_transformer = pipeline.named_steps["feature_transformer"]

    # Function to get feature names before one-hot encoding
    def get_feature_names_before_encoding(column_transformer):
        feature_names = []
        for name, transformer, columns in column_transformer.transformers:
            if isinstance(transformer, OneHotEncoder):
                feature_names.extend(columns)
            elif isinstance(transformer, OrdinalEncoder):
                feature_names.extend(columns)
            elif isinstance(transformer, StandardScaler):
                feature_names.extend(columns)
            else:
                feature_names.extend(columns)
        return feature_names

    feature_names_before_encoding = get_feature_names_before_encoding(
        column_transformer
    )

    added_columns = []
    for missing_col in set(feature_names_before_encoding).difference(set(df.columns)):
        if missing_col.startswith(("lab_orders_", "visited_", "has_")):
            df[missing_col] = False
        elif missing_col.startswith(("num_", "total_")):
            df[missing_col] = 0
        elif missing_col.startswith("latest_"):
            df[missing_col] = pd.NA
        elif missing_col == "arrival_method":
            df[missing_col] = "None"
        else:
            df[missing_col] = pd.NA
        added_columns.append(missing_col)

    if added_columns:
        print(
            f"Warning: The following columns were used in training, but not found in the real-time data. These have been added to the dataframe: {', '.join(added_columns)}"
        )

    return df
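
A self-contained sketch with a toy pipeline. The step name "feature_transformer" is what the function looks up; the column names are placeholders:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

pipe = Pipeline([
    ("feature_transformer", ColumnTransformer([
        ("cat", OneHotEncoder(), ["arrival_method"]),
        ("num", StandardScaler(), ["num_obs", "latest_obs_pulse"]),
    ])),
    ("classifier", LogisticRegression()),
])

df = pd.DataFrame({"num_obs": [3, 7]})
df = add_missing_columns(pipe, df)
# 'arrival_method' is filled with "None"; 'latest_obs_pulse' with pd.NA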

create_predictions(models, prediction_time, prediction_snapshots, specialties, prediction_window, x1, y1, x2, y2, cdf_cut_points, use_admission_in_window_prob=True)

Create predictions for emergency demand for a single prediction moment.

Parameters:

Name Type Description Default
models Tuple[TrainedClassifier, Union[SequenceToOutcomePredictor, ValueToOutcomePredictor], ParametricIncomingAdmissionPredictor]

Tuple containing:

- classifier: TrainedClassifier containing admission predictions
- spec_model: SequenceToOutcomePredictor or ValueToOutcomePredictor for specialty predictions
- yet_to_arrive_model: ParametricIncomingAdmissionPredictor for yet-to-arrive predictions

required
prediction_time Tuple

Hour and minute of time for model inference

required
prediction_snapshots DataFrame

DataFrame containing prediction snapshots. Must have an 'elapsed_los' column of type timedelta.

required
specialties List[str]

List of specialty names for predictions (e.g., ['surgical', 'medical'])

required
prediction_window timedelta

Prediction window as a timedelta object

required
x1 float

X-coordinate of first point for probability curve

required
y1 float

Y-coordinate of first point for probability curve

required
x2 float

X-coordinate of second point for probability curve

required
y2 float

Y-coordinate of second point for probability curve

required
cdf_cut_points List[float]

List of cumulative distribution function cut points (e.g., [0.9, 0.7])

required
use_admission_in_window_prob bool

Whether to use probability calculation for admission within prediction window for patients already in the ED. If False, probability is set to 1.0 for all current ED patients. This parameter does not affect the yet-to-arrive predictions. By default True

True

Returns:

Type Description
Dict[str, Dict[str, List[int]]]

Nested dictionary containing predictions for each specialty:

{
    'specialty_name': {
        'in_ed': [pred1, pred2, ...],
        'yet_to_arrive': [pred1, pred2, ...]
    }
}

Raises:

Type Description
TypeError

If any of the models are not of the expected type or if prediction_window is not a timedelta

ValueError

If models have not been fit or if prediction parameters don't match training parameters If 'elapsed_los' column is missing or not of type timedelta

Notes

The first element of the models tuple must be a TrainedClassifier that contains either a 'pipeline' or 'calibrated_pipeline' attribute. The pipeline is used for making predictions, with calibrated_pipeline taking precedence if both exist.

Source code in src/patientflow/predict/emergency_demand.py
def create_predictions(
    models: Tuple[
        TrainedClassifier,
        Union[SequenceToOutcomePredictor, ValueToOutcomePredictor],
        ParametricIncomingAdmissionPredictor,
    ],
    prediction_time: Tuple,
    prediction_snapshots: pd.DataFrame,
    specialties: List[str],
    prediction_window: timedelta,
    x1: float,
    y1: float,
    x2: float,
    y2: float,
    cdf_cut_points: List[float],
    use_admission_in_window_prob: bool = True,
) -> Dict[str, Dict[str, List[int]]]:
    """Create predictions for emergency demand for a single prediction moment.

    Parameters
    ----------
    models : Tuple[TrainedClassifier, Union[SequenceToOutcomePredictor, ValueToOutcomePredictor], ParametricIncomingAdmissionPredictor]
        Tuple containing:
        - classifier: TrainedClassifier containing admission predictions
        - spec_model: SequenceToOutcomePredictor or ValueToOutcomePredictor for specialty predictions
        - yet_to_arrive_model: ParametricIncomingAdmissionPredictor for yet-to-arrive predictions
    prediction_time : Tuple
        Hour and minute of time for model inference
    prediction_snapshots : pandas.DataFrame
        DataFrame containing prediction snapshots. Must have an 'elapsed_los' column of type timedelta.
    specialties : List[str]
        List of specialty names for predictions (e.g., ['surgical', 'medical'])
    prediction_window : timedelta
        Prediction window as a timedelta object
    x1 : float
        X-coordinate of first point for probability curve
    y1 : float
        Y-coordinate of first point for probability curve
    x2 : float
        X-coordinate of second point for probability curve
    y2 : float
        Y-coordinate of second point for probability curve
    cdf_cut_points : List[float]
        List of cumulative distribution function cut points (e.g., [0.9, 0.7])
    use_admission_in_window_prob : bool, optional
        Whether to use probability calculation for admission within prediction window for patients
        already in the ED. If False, probability is set to 1.0 for all current ED patients.
        This parameter does not affect the yet-to-arrive predictions. By default True

    Returns
    -------
    Dict[str, Dict[str, List[int]]]
        Nested dictionary containing predictions for each specialty:
        {
            'specialty_name': {
                'in_ed': [pred1, pred2, ...],
                'yet_to_arrive': [pred1, pred2, ...]
            }
        }

    Raises
    ------
    TypeError
        If any of the models are not of the expected type or if prediction_window is not a timedelta
    ValueError
        If models have not been fit or if prediction parameters don't match training parameters
        If 'elapsed_los' column is missing or not of type timedelta

    Notes
    -----
    The first element of the models tuple must be a TrainedClassifier
    that contains either a 'pipeline' or 'calibrated_pipeline' attribute. The pipeline
    will be used for making predictions, with calibrated_pipeline taking precedence
    if both exist.
    """
    # Validate model types
    classifier, spec_model, yet_to_arrive_model = models

    if not isinstance(classifier, TrainedClassifier):
        raise TypeError("First model must be of type TrainedClassifier")
    if not isinstance(
        spec_model, (SequenceToOutcomePredictor, ValueToOutcomePredictor)
    ):
        raise TypeError(
            "Second model must be of type SequenceToOutcomePredictor or ValueToOutcomePredictor"
        )
    if not isinstance(yet_to_arrive_model, ParametricIncomingAdmissionPredictor):
        raise TypeError(
            "Third model must be of type ParametricIncomingAdmissionPredictor"
        )
    if "elapsed_los" not in prediction_snapshots.columns:
        raise ValueError("Column 'elapsed_los' not found in prediction_snapshots")
    if not pd.api.types.is_timedelta64_dtype(prediction_snapshots["elapsed_los"]):
        actual_type = prediction_snapshots["elapsed_los"].dtype
        raise ValueError(
            f"Column 'elapsed_los' must be a timedelta column, but found type: {actual_type}"
        )

    # Check that all models have been fit
    if not hasattr(classifier, "pipeline") or classifier.pipeline is None:
        raise ValueError("Classifier model has not been fit")
    if not hasattr(spec_model, "weights") or spec_model.weights is None:
        raise ValueError("Specialty model has not been fit")
    if (
        not hasattr(yet_to_arrive_model, "prediction_window")
        or yet_to_arrive_model.prediction_window is None
    ):
        raise ValueError("Yet-to-arrive model has not been fit")

    # Validate that the correct models have been passed for the requested prediction time and prediction window
    if not classifier.training_results.prediction_time == prediction_time:
        raise ValueError(
            f"Requested prediction time {prediction_time} does not match the prediction time of the trained classifier {classifier.training_results.prediction_time}"
        )

    # Compare prediction windows directly
    if prediction_window != yet_to_arrive_model.prediction_window:
        raise ValueError(
            f"Requested prediction window {prediction_window} does not match the prediction window of the trained yet-to-arrive model {yet_to_arrive_model.prediction_window}"
        )

    if not set(yet_to_arrive_model.filters.keys()) == set(specialties):
        raise ValueError(
            f"Requested specialties {set(specialties)} do not match the specialties of the trained yet-to-arrive model {set(yet_to_arrive_model.filters.keys())}"
        )

    special_params = spec_model.special_params

    if special_params:
        special_category_func = special_params["special_category_func"]
        special_category_dict = special_params["special_category_dict"]
        special_func_map = special_params["special_func_map"]
    else:
        special_category_func = special_category_dict = special_func_map = None

    if special_category_dict is not None and not set(specialties) == set(
        special_category_dict.keys()
    ):
        raise ValueError(
            "Requested specialties do not match the specialty dictionary defined in special_params"
        )

    predictions: Dict[str, Dict[str, List[int]]] = {
        specialty: {"in_ed": [], "yet_to_arrive": []} for specialty in specialties
    }

    # Use calibrated pipeline if available, otherwise use regular pipeline
    if (
        hasattr(classifier, "calibrated_pipeline")
        and classifier.calibrated_pipeline is not None
    ):
        pipeline = classifier.calibrated_pipeline
    else:
        pipeline = classifier.pipeline

    # Add missing columns expected by the model
    prediction_snapshots = add_missing_columns(pipeline, prediction_snapshots)

    # Before we get predictions, we need to create a temp copy with the elapsed_los column in seconds
    prediction_snapshots_temp = prediction_snapshots.copy()
    prediction_snapshots_temp["elapsed_los"] = prediction_snapshots_temp[
        "elapsed_los"
    ].dt.total_seconds()

    # Get predictions of admissions for ED patients
    prob_admission_after_ed = model_input_to_pred_proba(
        prediction_snapshots_temp, pipeline
    )

    # Get predictions of admission to specialty
    prediction_snapshots.loc[:, "specialty_prob"] = get_specialty_probs(
        specialties,
        spec_model,
        prediction_snapshots,
        special_category_func=special_category_func,
        special_category_dict=special_category_dict,
    )

    # Get probability of admission within prediction window for current ED patients
    if use_admission_in_window_prob:
        prob_admission_in_window = prediction_snapshots.apply(
            lambda row: calculate_probability(
                row["elapsed_los"], prediction_window, x1, y1, x2, y2
            ),
            axis=1,
        )
    else:
        prob_admission_in_window = pd.Series(1.0, index=prediction_snapshots.index)

    if special_func_map is None:
        special_func_map = {"default": lambda row: True}

    for specialty in specialties:
        func = special_func_map.get(specialty, special_func_map["default"])
        non_zero_indices = prediction_snapshots[
            prediction_snapshots.apply(func, axis=1)
        ].index

        filtered_prob_admission_after_ed = prob_admission_after_ed.loc[non_zero_indices]
        prob_admission_to_specialty = prediction_snapshots["specialty_prob"].apply(
            lambda x: x[specialty]
        )

        filtered_prob_admission_to_specialty = prob_admission_to_specialty.loc[
            non_zero_indices
        ]
        filtered_prob_admission_in_window = prob_admission_in_window.loc[
            non_zero_indices
        ]

        filtered_weights = (
            filtered_prob_admission_to_specialty * filtered_prob_admission_in_window
        )

        agg_predicted_in_ed = pred_proba_to_agg_predicted(
            filtered_prob_admission_after_ed, weights=filtered_weights
        )

        prediction_context = {specialty: {"prediction_time": prediction_time}}
        agg_predicted_yta = yet_to_arrive_model.predict(
            prediction_context, x1=x1, y1=y1, x2=x2, y2=y2
        )

        predictions[specialty]["in_ed"] = [
            find_probability_threshold_index(
                agg_predicted_in_ed["agg_proba"].values.cumsum(), cut_point
            )
            for cut_point in cdf_cut_points
        ]
        predictions[specialty]["yet_to_arrive"] = [
            find_probability_threshold_index(
                agg_predicted_yta[specialty]["agg_proba"].values.cumsum(), cut_point
            )
            for cut_point in cdf_cut_points
        ]

    return predictions
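
A sketch of the call shape only; classifier, spec_model and yta_model are assumed to be already-fitted objects of the types validated above, so the call itself is left commented out:

from datetime import timedelta

# predictions = create_predictions(
#     models=(classifier, spec_model, yta_model),
#     prediction_time=(9, 30),
#     prediction_snapshots=snapshots_df,     # must include a timedelta 'elapsed_los' column
#     specialties=["medical", "surgical"],
#     prediction_window=timedelta(hours=8),
#     x1=4, y1=0.76, x2=12, y2=0.99,
#     cdf_cut_points=[0.9, 0.7],
# )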

find_probability_threshold_index(sequence, threshold)

Find index where cumulative probability exceeds threshold.

Parameters:

Name Type Description Default
sequence List[float]

The probability mass function (PMF) of resource needs

required
threshold float

The probability threshold (e.g., 0.9 for 90%)

required

Returns:

Type Description
int

The index where the cumulative probability exceeds 1 - threshold, indicating the number of resources needed with the specified probability

Examples:

>>> pmf = [0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05]
>>> find_probability_threshold_index(pmf, 0.9)
1
# This means there is a 90% probability of needing at least 1 bed
Source code in src/patientflow/predict/emergency_demand.py
def find_probability_threshold_index(sequence: List[float], threshold: float) -> int:
    """Find index where cumulative probability exceeds threshold.

    Parameters
    ----------
    sequence : List[float]
        The probability mass function (PMF) of resource needs
    threshold : float
        The probability threshold (e.g., 0.9 for 90%)

    Returns
    -------
    int
        The index where the cumulative probability exceeds 1 - threshold,
        indicating the number of resources needed with the specified probability

    Examples
    --------
    >>> pmf = [0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05]
    >>> find_probability_threshold_index(pmf, 0.9)
    1
    # This means there is a 90% probability of needing at least 1 bed
    """
    cumulative_sum = 0.0
    for i, value in enumerate(sequence):
        cumulative_sum += value
        if cumulative_sum >= 1 - threshold:
            return i
    return len(sequence) - 1  # Return the last index if the threshold isn't reached
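
Tracing a few thresholds against the docstring's PMF shows where the cumulative sum first reaches 1 - threshold:

pmf = [0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05]
for threshold in (0.9, 0.7, 0.5):
    print(threshold, find_probability_threshold_index(pmf, threshold))
# 0.9 -> 1   (cumulative sum reaches 0.1 at index 1)
# 0.7 -> 2   (cumulative sum reaches 0.3 at index 2)
# 0.5 -> 3   (cumulative sum reaches 0.5 at index 3)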

get_specialty_probs(specialties, specialty_model, snapshots_df, special_category_func=None, special_category_dict=None)

Calculate specialty probability distributions for patient visits.

Parameters:

Name Type Description Default
specialties list of str

List of specialty names for which predictions are required

required
specialty_model object

Trained model for making specialty predictions

required
snapshots_df DataFrame

DataFrame containing the data on which predictions are to be made. Must include the input_var column if no special_category_func is applied

required
special_category_func callable

A function that takes a DataFrame row (Series) as input and returns True if the row belongs to a special category that requires a fixed probability distribution

None
special_category_dict dict

A dictionary containing the fixed probability distribution for special category cases. Required if special_category_func is provided

None

Returns:

Type Description
Series

A Series containing dictionaries as values. Each dictionary represents the probability distribution of specialties for each patient visit

Raises:

Type Description
ValueError

If special_category_func is provided but special_category_dict is None

Source code in src/patientflow/predict/emergency_demand.py
def get_specialty_probs(
    specialties,
    specialty_model,
    snapshots_df,
    special_category_func=None,
    special_category_dict=None,
):
    """Calculate specialty probability distributions for patient visits.

    Parameters
    ----------
    specialties : list of str
        List of specialty names for which predictions are required
    specialty_model : object
        Trained model for making specialty predictions
    snapshots_df : pandas.DataFrame
        DataFrame containing the data on which predictions are to be made. Must include
        the input_var column if no special_category_func is applied
    special_category_func : callable, optional
        A function that takes a DataFrame row (Series) as input and returns True if the row
        belongs to a special category that requires a fixed probability distribution
    special_category_dict : dict, optional
        A dictionary containing the fixed probability distribution for special category cases.
        Required if special_category_func is provided

    Returns
    -------
    pandas.Series
        A Series containing dictionaries as values. Each dictionary represents the probability
        distribution of specialties for each patient visit

    Raises
    ------
    ValueError
        If special_category_func is provided but special_category_dict is None

    """

    # Convert input_var to tuple if not already a tuple
    if len(snapshots_df[specialty_model.input_var]) > 0 and not isinstance(
        snapshots_df[specialty_model.input_var].iloc[0], tuple
    ):
        snapshots_df.loc[:, specialty_model.input_var] = snapshots_df[
            specialty_model.input_var
        ].apply(lambda x: tuple(x) if x else ())

    if special_category_func and not special_category_dict:
        raise ValueError(
            "special_category_dict must be provided if special_category_func is specified."
        )

    # Function to determine the specialty probabilities
    def determine_specialty(row):
        if special_category_func and special_category_func(row):
            return special_category_dict
        else:
            return specialty_model.predict(row[specialty_model.input_var])

    # Apply the determine_specialty function to each row
    specialty_prob_series = snapshots_df.apply(determine_specialty, axis=1)

    # Find all unique keys used in any dictionary within the series
    all_keys = set().union(
        *(d.keys() for d in specialty_prob_series if isinstance(d, dict))
    )

    # Combine all_keys with the specialties requested
    all_keys = set(all_keys).union(set(specialties))

    # Ensure each dictionary contains all keys found, with default values of 0 for missing keys
    specialty_prob_series = specialty_prob_series.apply(
        lambda d: (
            {key: d.get(key, 0) for key in all_keys} if isinstance(d, dict) else d
        )
    )

    return specialty_prob_series
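
A minimal sketch using a hypothetical stand-in for a fitted specialty model; the class, column name and probabilities are all illustrative:

import pandas as pd

class StubSpecialtyModel:
    """Hypothetical stand-in for a fitted SequenceToOutcomePredictor."""
    input_var = "consultation_sequence"
    def predict(self, sequence):
        return {"medical": 0.7, "surgical": 0.3}

snapshots = pd.DataFrame({"consultation_sequence": [["acute"], []]})
probs = get_specialty_probs(
    ["medical", "surgical", "paediatric"], StubSpecialtyModel(), snapshots
)
# Each row gets {'medical': 0.7, 'surgical': 0.3, 'paediatric': 0}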

predictors

Predictor models for patient flow analysis.

This module contains various predictor model implementations, including sequence-based predictors and weighted Poisson predictors for modeling patient flow patterns.

incoming_admission_predictors

Hospital Admissions Forecasting Predictors.

This module implements custom predictors to estimate the number of hospital admissions within a specified prediction window using historical admission data. It provides two approaches: parametric curves with Poisson-binomial distributions and empirical survival curves with convolution of Poisson distributions. Both predictors accommodate different data filters for tailored predictions across various hospital settings.

Classes:

Name Description
IncomingAdmissionPredictor : BaseEstimator, TransformerMixin

Base class for admission predictors that handles filtering and arrival rate calculation.

ParametricIncomingAdmissionPredictor : IncomingAdmissionPredictor

Predicts the number of admissions within a given prediction window based on historical data and Poisson-binomial distribution using parametric aspirational curves.

EmpiricalIncomingAdmissionPredictor : IncomingAdmissionPredictor

Predicts the number of admissions using empirical survival curves and convolution of Poisson distributions instead of parametric curves.

Notes

The ParametricIncomingAdmissionPredictor uses a combination of Poisson and binomial distributions to model the probability of admissions within a prediction window using parametric curves defined by transition points (x1, y1, x2, y2).

The EmpiricalIncomingAdmissionPredictor inherits the arrival rate calculation and filtering logic but replaces the parametric approach with empirical survival probabilities and convolution of individual Poisson distributions for each time interval.

Both predictors take into account historical data patterns and can be filtered for specific hospital settings or specialties.

EmpiricalIncomingAdmissionPredictor

Bases: IncomingAdmissionPredictor

A predictor that uses empirical survival curves instead of parameterised curves.

This predictor inherits all the arrival rate calculation and filtering logic from IncomingAdmissionPredictor but uses empirical survival probabilities and convolution of Poisson distributions for prediction instead of the Poisson-binomial approach.

The survival curve is automatically calculated from the training data during the fit process by analysing time-to-admission patterns.

Parameters:

Name Type Description Default
filters dict

Optional filters for data categorization. If None, no filtering is applied.

None
verbose bool

Whether to enable verbose logging.

False

Attributes:

Name Type Description
survival_df DataFrame

The survival data calculated from training data, containing time-to-event information for empirical probability calculations.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
class EmpiricalIncomingAdmissionPredictor(IncomingAdmissionPredictor):
    """A predictor that uses empirical survival curves instead of parameterised curves.

    This predictor inherits all the arrival rate calculation and filtering logic from
    IncomingAdmissionPredictor but uses empirical survival probabilities and convolution
    of Poisson distributions for prediction instead of the Poisson-binomial approach.

    The survival curve is automatically calculated from the training data during the
    fit process by analysing time-to-admission patterns.

    Parameters
    ----------
    filters : dict, optional
        Optional filters for data categorization. If None, no filtering is applied.
    verbose : bool, default=False
        Whether to enable verbose logging.

    Attributes
    ----------
    survival_df : pandas.DataFrame
        The survival data calculated from training data, containing time-to-event
        information for empirical probability calculations.
    """

    def __init__(self, filters=None, verbose=False):
        """Initialize the EmpiricalIncomingAdmissionPredictor."""
        super().__init__(filters, verbose)
        self.survival_df = None

    def fit(
        self,
        train_df: pd.DataFrame,
        prediction_window,
        yta_time_interval,
        prediction_times: List[float],
        num_days: int,
        epsilon=10**-7,
        y=None,
        start_time_col="arrival_datetime",
        end_time_col="departure_datetime",
    ) -> "EmpiricalIncomingAdmissionPredictor":
        """Fit the model to the training data and calculate empirical survival curve.

        Parameters
        ----------
        train_df : pandas.DataFrame
            The training dataset with historical admission data.
            Expected to have start_time_col as the index and end_time_col as a column.
            Alternatively, both can be regular columns.
        prediction_window : int or timedelta
            The prediction window in minutes. If timedelta, will be converted to minutes.
            If int, assumed to be in minutes.
        yta_time_interval : int or timedelta
            The interval in minutes for splitting the prediction window. If timedelta, will be converted to minutes.
            If int, assumed to be in minutes.
        prediction_times : list
            Times of day at which predictions are made, in hours.
        num_days : int
            The number of days that the train_df spans.
        epsilon : float, default=1e-7
            A small value representing acceptable error rate to enable calculation
            of the maximum value of the random variable representing number of beds.
        y : None, optional
            Ignored, present for compatibility with scikit-learn's fit method.
        start_time_col : str, default='arrival_datetime'
            Name of the column containing the start time (e.g., arrival time).
            Expected to be the DataFrame index, but can also be a regular column.
        end_time_col : str, default='departure_datetime'
            Name of the column containing the end time (e.g., departure time).

        Returns
        -------
        EmpiricalIncomingAdmissionPredictor
            The instance itself, fitted with the training data.
        """
        # Calculate survival curve from training data using existing function
        # Handle case where start_time_col is in the index
        if start_time_col in train_df.columns:
            # start_time_col is a regular column
            df_for_survival = train_df
        else:
            # start_time_col is likely the index, reset it to make it a column
            df_for_survival = train_df.reset_index()
            # Verify that start_time_col is now available
            if start_time_col not in df_for_survival.columns:
                raise ValueError(
                    f"Column '{start_time_col}' not found in DataFrame columns or index"
                )

        self.survival_df = calculate_survival_curve(
            df_for_survival, start_time_col=start_time_col, end_time_col=end_time_col
        )

        # Verify survival curve was calculated and saved successfully
        if self.survival_df is None or len(self.survival_df) == 0:
            raise RuntimeError("Failed to calculate survival curve from training data")

        # Ensure train_df has start_time_col as index for parent fit method
        if start_time_col in train_df.columns:
            train_df = train_df.set_index(start_time_col)

        # Call parent fit method to handle arrival rate calculation and validation
        super().fit(
            train_df,
            prediction_window,
            yta_time_interval,
            prediction_times,
            num_days,
            epsilon=epsilon,
            y=y,
        )

        if self.verbose:
            self.logger.info(
                f"EmpiricalIncomingAdmissionPredictor has been fitted with survival curve containing {len(self.survival_df)} time points"
            )

        return self

    def get_survival_curve(self):
        """Get the survival curve calculated during fitting.

        Returns
        -------
        pandas.DataFrame
            DataFrame containing the survival curve with columns:
            - time_hours: Time points in hours
            - survival_probability: Survival probabilities at each time point
            - event_probability: Event probabilities (1 - survival_probability)

        Raises
        ------
        RuntimeError
            If the model has not been fitted yet.
        """
        if self.survival_df is None:
            raise RuntimeError("Model has not been fitted yet. Call fit() first.")
        return self.survival_df.copy()

    def _calculate_survival_probabilities(self, prediction_window, yta_time_interval):
        """Calculate survival probabilities for each time interval.

        Parameters
        ----------
        prediction_window : int or timedelta
            The prediction window.
        yta_time_interval : int or timedelta
            The time interval for splitting the prediction window.

        Returns
        -------
        numpy.ndarray
            Array of admission probabilities for each time interval.
        """
        # Calculate number of time intervals
        if isinstance(prediction_window, timedelta) and isinstance(
            yta_time_interval, timedelta
        ):
            NTimes = int(prediction_window / yta_time_interval)
        elif isinstance(prediction_window, timedelta):
            NTimes = int(prediction_window.total_seconds() / 60 / yta_time_interval)
        elif isinstance(yta_time_interval, timedelta):
            NTimes = int(prediction_window / (yta_time_interval.total_seconds() / 60))
        else:
            NTimes = int(prediction_window / yta_time_interval)

        # Convert to hours for survival probability calculation
        if isinstance(prediction_window, timedelta):
            prediction_window_hours = prediction_window.total_seconds() / 3600
        else:
            prediction_window_hours = prediction_window / 60

        if isinstance(yta_time_interval, timedelta):
            yta_time_interval_hours = yta_time_interval.total_seconds() / 3600
        else:
            yta_time_interval_hours = yta_time_interval / 60

        # Calculate admission probabilities for each time interval
        probabilities = []
        for i in range(NTimes):
            # Time remaining until end of prediction window
            time_remaining = prediction_window_hours - (i * yta_time_interval_hours)

            # Interpolate survival probability from survival curve
            if time_remaining <= 0:
                prob_admission = (
                    1.0  # If time remaining is 0 or negative, probability is 1
                )
            else:
                # Find the survival probability at this time point
                # Linear interpolation between points in survival curve
                survival_curve = self.survival_df
                if time_remaining >= survival_curve["time_hours"].max():
                    # If time is beyond our data, use the last survival probability
                    survival_prob = survival_curve["survival_probability"].iloc[-1]
                elif time_remaining <= survival_curve["time_hours"].min():
                    # If time is before our data, use the first survival probability
                    survival_prob = survival_curve["survival_probability"].iloc[0]
                else:
                    # Interpolate between points
                    survival_prob = np.interp(
                        time_remaining,
                        survival_curve["time_hours"],
                        survival_curve["survival_probability"],
                    )

                # Probability of admission = 1 - survival probability
                prob_admission = 1 - survival_prob

            probabilities.append(prob_admission)

        return np.array(probabilities)

    def _convolve_poisson_distributions(
        self, arrival_rates, probabilities, max_value=20
    ):
        """Convolve Poisson distributions for each time interval.

        Parameters
        ----------
        arrival_rates : numpy.ndarray
            Array of arrival rates for each time interval.
        probabilities : numpy.ndarray
            Array of admission probabilities for each time interval.
        max_value : int, default=20
            Maximum value for the discrete distribution support.

        Returns
        -------
        pandas.DataFrame
            DataFrame with 'sum' and 'agg_proba' columns representing the final distribution.
        """
        from scipy import stats

        # Create weighted Poisson distributions for each time interval
        weighted_rates = arrival_rates * probabilities
        poisson_dists = [stats.poisson(rate) for rate in weighted_rates]

        # Get PMF for each distribution
        x = np.arange(max_value)
        pmfs = [dist.pmf(x) for dist in poisson_dists]

        # Convolve all distributions together
        if len(pmfs) == 0:
            # Handle edge case of no distributions
            combined_pmf = np.zeros(max_value)
            combined_pmf[0] = 1.0  # All probability at 0
        else:
            combined_pmf = pmfs[0]
            for pmf in pmfs[1:]:
                combined_pmf = np.convolve(combined_pmf, pmf)

        # Create result DataFrame
        result_df = pd.DataFrame(
            {"sum": range(len(combined_pmf)), "agg_proba": combined_pmf}
        )

        # Filter out near-zero probabilities and normalize
        result_df = result_df[result_df["agg_proba"] > 1e-10].copy()
        result_df["agg_proba"] = result_df["agg_proba"] / result_df["agg_proba"].sum()

        return result_df.set_index("sum")

    def predict(self, prediction_context: Dict, **kwargs) -> Dict:
        """Predict the number of admissions using empirical survival curves.

        Parameters
        ----------
        prediction_context : dict
            A dictionary defining the context for which predictions are to be made.
            It should specify either a general context or one based on the applied filters.
        **kwargs
            Additional keyword arguments for prediction configuration:

            max_value : int, default=20
                Maximum value for the discrete distribution support.

        Returns
        -------
        dict
            A dictionary with predictions for each specified context.

        Raises
        ------
        ValueError
            If filter key is not recognized or prediction_time is not provided.
        KeyError
            If required keys are missing from the prediction context.
        RuntimeError
            If survival_df was not provided during fitting.
        """
        if self.survival_df is None:
            raise RuntimeError(
                "No survival data available. Please call fit() method first to calculate survival curve from training data."
            )

        # Extract parameters from kwargs with defaults
        max_value = kwargs.get("max_value", 20)

        predictions = {}

        # Calculate survival probabilities once (they're the same for all contexts)
        survival_probabilities = self._calculate_survival_probabilities(
            self.prediction_window, self.yta_time_interval
        )

        for filter_key, filter_values in prediction_context.items():
            try:
                if filter_key not in self.weights:
                    raise ValueError(
                        f"Filter key '{filter_key}' is not recognized in the model weights."
                    )

                prediction_time = filter_values.get("prediction_time")
                if prediction_time is None:
                    raise ValueError(
                        f"No 'prediction_time' provided for filter '{filter_key}'."
                    )

                if prediction_time not in self.prediction_times:
                    prediction_time = find_nearest_previous_prediction_time(
                        prediction_time, self.prediction_times
                    )

                arrival_rates = self.weights[filter_key][prediction_time].get(
                    "arrival_rates"
                )
                if arrival_rates is None:
                    raise ValueError(
                        f"No arrival_rates found for the time of day '{prediction_time}' under filter '{filter_key}'."
                    )

                # Convert arrival rates to numpy array
                arrival_rates = np.array(arrival_rates)

                # Generate prediction using convolution approach
                predictions[filter_key] = self._convolve_poisson_distributions(
                    arrival_rates, survival_probabilities, max_value=max_value
                )

                # if self.verbose:
                #     total_expected = (arrival_rates * survival_probabilities).sum()
                #     self.logger.info(
                #         f"Prediction for {filter_key} at {prediction_time}: "
                #         f"Expected value ≈ {total_expected:.2f}"
                #     )

            except KeyError as e:
                raise KeyError(f"Key error occurred: {e!s}")

        return predictions
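For orientation, a minimal end-to-end sketch of fitting and predicting with this class. This is not a library-provided example: the synthetic DataFrame, its time span, and the chosen prediction times are illustrative assumptions, and the import path is taken from the source locations shown on this page.

import pandas as pd
from datetime import timedelta

# Assumed import path, per the "Source code in ..." notes on this page
from patientflow.predictors.incoming_admission_predictors import (
    EmpiricalIncomingAdmissionPredictor,
)

# Synthetic training data: one row per visit, indexed by arrival time
arrivals = pd.date_range("2024-01-01", periods=500, freq="3h")
train_df = pd.DataFrame(
    {"departure_datetime": arrivals + pd.Timedelta(hours=5)},
    index=arrivals.rename("arrival_datetime"),
)

predictor = EmpiricalIncomingAdmissionPredictor()
predictor.fit(
    train_df,
    prediction_window=timedelta(hours=8),
    yta_time_interval=timedelta(minutes=30),
    prediction_times=[(6, 0), (12, 0), (22, 0)],
    num_days=63,  # approximate span of the synthetic train_df in days
)

# Unfiltered models key their predictions by the generic 'unfiltered' context
result = predictor.predict({"unfiltered": {"prediction_time": (12, 0)}})
print(result["unfiltered"].head())  # distribution over number of admissions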
__init__(filters=None, verbose=False)

Initialize the EmpiricalIncomingAdmissionPredictor.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def __init__(self, filters=None, verbose=False):
    """Initialize the EmpiricalIncomingAdmissionPredictor."""
    super().__init__(filters, verbose)
    self.survival_df = None
fit(train_df, prediction_window, yta_time_interval, prediction_times, num_days, epsilon=10 ** -7, y=None, start_time_col='arrival_datetime', end_time_col='departure_datetime')

Fit the model to the training data and calculate empirical survival curve.

Parameters:

Name Type Description Default
train_df DataFrame

The training dataset with historical admission data. Expected to have start_time_col as the index and end_time_col as a column. Alternatively, both can be regular columns.

required
prediction_window int or timedelta

The prediction window in minutes. If timedelta, will be converted to minutes. If int, assumed to be in minutes.

required
yta_time_interval int or timedelta

The interval in minutes for splitting the prediction window. If timedelta, will be converted to minutes. If int, assumed to be in minutes.

required
prediction_times list

Times of day at which predictions are made, in hours.

required
num_days int

The number of days that the train_df spans.

required
epsilon float

A small value representing acceptable error rate to enable calculation of the maximum value of the random variable representing number of beds.

1e-7
y None

Ignored, present for compatibility with scikit-learn's fit method.

None
start_time_col str

Name of the column containing the start time (e.g., arrival time). Expected to be the DataFrame index, but can also be a regular column.

'arrival_datetime'
end_time_col str

Name of the column containing the end time (e.g., departure time).

'departure_datetime'

Returns:

Type Description
EmpiricalIncomingAdmissionPredictor

The instance itself, fitted with the training data.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def fit(
    self,
    train_df: pd.DataFrame,
    prediction_window,
    yta_time_interval,
    prediction_times: List[float],
    num_days: int,
    epsilon=10**-7,
    y=None,
    start_time_col="arrival_datetime",
    end_time_col="departure_datetime",
) -> "EmpiricalIncomingAdmissionPredictor":
    """Fit the model to the training data and calculate empirical survival curve.

    Parameters
    ----------
    train_df : pandas.DataFrame
        The training dataset with historical admission data.
        Expected to have start_time_col as the index and end_time_col as a column.
        Alternatively, both can be regular columns.
    prediction_window : int or timedelta
        The prediction window in minutes. If timedelta, will be converted to minutes.
        If int, assumed to be in minutes.
    yta_time_interval : int or timedelta
        The interval in minutes for splitting the prediction window. If timedelta, will be converted to minutes.
        If int, assumed to be in minutes.
    prediction_times : list
        Times of day at which predictions are made, in hours.
    num_days : int
        The number of days that the train_df spans.
    epsilon : float, default=1e-7
        A small value representing acceptable error rate to enable calculation
        of the maximum value of the random variable representing number of beds.
    y : None, optional
        Ignored, present for compatibility with scikit-learn's fit method.
    start_time_col : str, default='arrival_datetime'
        Name of the column containing the start time (e.g., arrival time).
        Expected to be the DataFrame index, but can also be a regular column.
    end_time_col : str, default='departure_datetime'
        Name of the column containing the end time (e.g., departure time).

    Returns
    -------
    EmpiricalIncomingAdmissionPredictor
        The instance itself, fitted with the training data.
    """
    # Calculate survival curve from training data using existing function
    # Handle case where start_time_col is in the index
    if start_time_col in train_df.columns:
        # start_time_col is a regular column
        df_for_survival = train_df
    else:
        # start_time_col is likely the index, reset it to make it a column
        df_for_survival = train_df.reset_index()
        # Verify that start_time_col is now available
        if start_time_col not in df_for_survival.columns:
            raise ValueError(
                f"Column '{start_time_col}' not found in DataFrame columns or index"
            )

    self.survival_df = calculate_survival_curve(
        df_for_survival, start_time_col=start_time_col, end_time_col=end_time_col
    )

    # Verify survival curve was calculated and saved successfully
    if self.survival_df is None or len(self.survival_df) == 0:
        raise RuntimeError("Failed to calculate survival curve from training data")

    # Ensure train_df has start_time_col as index for parent fit method
    if start_time_col in train_df.columns:
        train_df = train_df.set_index(start_time_col)

    # Call parent fit method to handle arrival rate calculation and validation
    super().fit(
        train_df,
        prediction_window,
        yta_time_interval,
        prediction_times,
        num_days,
        epsilon=epsilon,
        y=y,
    )

    if self.verbose:
        self.logger.info(
            f"EmpiricalIncomingAdmissionPredictor has been fitted with survival curve containing {len(self.survival_df)} time points"
        )

    return self
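As the docstring notes, start_time_col may be supplied either as the index or as a regular column; a small sketch of the two equivalent layouts (synthetic data, column names assumed):

import pandas as pd

arrivals = pd.date_range("2024-03-01", periods=4, freq="6h")
departures = arrivals + pd.Timedelta(hours=4)

# Layout A: arrival time as the index (the expected layout)
df_indexed = pd.DataFrame(
    {"departure_datetime": departures},
    index=arrivals.rename("arrival_datetime"),
)

# Layout B: arrival time as a regular column; fit() re-indexes it internally
df_columns = df_indexed.reset_index()

# Both layouts are accepted by EmpiricalIncomingAdmissionPredictor.fit()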
get_survival_curve()

Get the survival curve calculated during fitting.

Returns:

Type Description
DataFrame

DataFrame containing the survival curve with columns:

- time_hours: Time points in hours
- survival_probability: Survival probabilities at each time point
- event_probability: Event probabilities (1 - survival_probability)

Raises:

Type Description
RuntimeError

If the model has not been fitted yet.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def get_survival_curve(self):
    """Get the survival curve calculated during fitting.

    Returns
    -------
    pandas.DataFrame
        DataFrame containing the survival curve with columns:
        - time_hours: Time points in hours
        - survival_probability: Survival probabilities at each time point
        - event_probability: Event probabilities (1 - survival_probability)

    Raises
    ------
    RuntimeError
        If the model has not been fitted yet.
    """
    if self.survival_df is None:
        raise RuntimeError("Model has not been fitted yet. Call fit() first.")
    return self.survival_df.copy()
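A short sketch of inspecting the fitted curve. It assumes a `predictor` fitted as in the earlier sketch; the column names are those documented for the returned DataFrame:

import numpy as np

curve = predictor.get_survival_curve()

# Probability that a patient is still waiting (not yet admitted) at 4 hours,
# linearly interpolated between the empirical time points
p_still_waiting = np.interp(
    4.0, curve["time_hours"], curve["survival_probability"]
)
print(f"P(not admitted within 4h) = {p_still_waiting:.2f}")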
predict(prediction_context, **kwargs)

Predict the number of admissions using empirical survival curves.

Parameters:

Name Type Description Default
prediction_context dict

A dictionary defining the context for which predictions are to be made. It should specify either a general context or one based on the applied filters.

required
**kwargs

Additional keyword arguments for prediction configuration:

max_value : int, default=20
    Maximum value for the discrete distribution support.

{}

Returns:

Type Description
dict

A dictionary with predictions for each specified context.

Raises:

Type Description
ValueError

If filter key is not recognized or prediction_time is not provided.

KeyError

If required keys are missing from the prediction context.

RuntimeError

If survival_df was not provided during fitting.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def predict(self, prediction_context: Dict, **kwargs) -> Dict:
    """Predict the number of admissions using empirical survival curves.

    Parameters
    ----------
    prediction_context : dict
        A dictionary defining the context for which predictions are to be made.
        It should specify either a general context or one based on the applied filters.
    **kwargs
        Additional keyword arguments for prediction configuration:

        max_value : int, default=20
            Maximum value for the discrete distribution support.

    Returns
    -------
    dict
        A dictionary with predictions for each specified context.

    Raises
    ------
    ValueError
        If filter key is not recognized or prediction_time is not provided.
    KeyError
        If required keys are missing from the prediction context.
    RuntimeError
        If survival_df was not provided during fitting.
    """
    if self.survival_df is None:
        raise RuntimeError(
            "No survival data available. Please call fit() method first to calculate survival curve from training data."
        )

    # Extract parameters from kwargs with defaults
    max_value = kwargs.get("max_value", 20)

    predictions = {}

    # Calculate survival probabilities once (they're the same for all contexts)
    survival_probabilities = self._calculate_survival_probabilities(
        self.prediction_window, self.yta_time_interval
    )

    for filter_key, filter_values in prediction_context.items():
        try:
            if filter_key not in self.weights:
                raise ValueError(
                    f"Filter key '{filter_key}' is not recognized in the model weights."
                )

            prediction_time = filter_values.get("prediction_time")
            if prediction_time is None:
                raise ValueError(
                    f"No 'prediction_time' provided for filter '{filter_key}'."
                )

            if prediction_time not in self.prediction_times:
                prediction_time = find_nearest_previous_prediction_time(
                    prediction_time, self.prediction_times
                )

            arrival_rates = self.weights[filter_key][prediction_time].get(
                "arrival_rates"
            )
            if arrival_rates is None:
                raise ValueError(
                    f"No arrival_rates found for the time of day '{prediction_time}' under filter '{filter_key}'."
                )

            # Convert arrival rates to numpy array
            arrival_rates = np.array(arrival_rates)

            # Generate prediction using convolution approach
            predictions[filter_key] = self._convolve_poisson_distributions(
                arrival_rates, survival_probabilities, max_value=max_value
            )

            # if self.verbose:
            #     total_expected = (arrival_rates * survival_probabilities).sum()
            #     self.logger.info(
            #         f"Prediction for {filter_key} at {prediction_time}: "
            #         f"Expected value ≈ {total_expected:.2f}"
            #     )

        except KeyError as e:
            raise KeyError(f"Key error occurred: {e!s}")

    return predictions
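When the predictor was constructed with filters, the prediction context is keyed by those filter names rather than 'unfiltered'. A hedged sketch; the filter names and the 'specialty' column are illustrative assumptions:

filters = {
    "medical": {"specialty": "medical"},
    "surgical": {"specialty": "surgical"},
}
predictor = EmpiricalIncomingAdmissionPredictor(filters=filters)
# ... fit as in the earlier sketch; train_df would need a 'specialty' column ...

prediction_context = {
    "medical": {"prediction_time": (12, 0)},
    "surgical": {"prediction_time": (12, 0)},
}
results = predictor.predict(prediction_context, max_value=30)
# results["medical"] and results["surgical"] are separate distributions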

IncomingAdmissionPredictor

Bases: BaseEstimator, TransformerMixin, ABC

Base class for admission predictors that handles filtering and arrival rate calculation.

This abstract base class provides the common functionality for predicting hospital admissions, including data filtering, arrival rate calculation, and basic prediction infrastructure. Subclasses implement specific prediction strategies.

Parameters:

Name Type Description Default
filters dict

Optional filters for data categorization. If None, no filtering is applied.

None
verbose bool

Whether to enable verbose logging.

False

Attributes:

Name Type Description
filters dict

Filters for data categorization.

verbose bool

Verbose logging flag.

metrics dict

Stores metadata about the model and training data.

weights dict

Model parameters computed during fitting.

Notes

The predictor implements scikit-learn's BaseEstimator and TransformerMixin interfaces for compatibility with scikit-learn pipelines.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
class IncomingAdmissionPredictor(BaseEstimator, TransformerMixin, ABC):
    """Base class for admission predictors that handles filtering and arrival rate calculation.

    This abstract base class provides the common functionality for predicting hospital
    admissions, including data filtering, arrival rate calculation, and basic prediction
    infrastructure. Subclasses implement specific prediction strategies.

    Parameters
    ----------
    filters : dict, optional
        Optional filters for data categorization. If None, no filtering is applied.
    verbose : bool, default=False
        Whether to enable verbose logging.

    Attributes
    ----------
    filters : dict
        Filters for data categorization.
    verbose : bool
        Verbose logging flag.
    metrics : dict
        Stores metadata about the model and training data.
    weights : dict
        Model parameters computed during fitting.

    Notes
    -----
    The predictor implements scikit-learn's BaseEstimator and TransformerMixin
    interfaces for compatibility with scikit-learn pipelines.
    """

    def __init__(self, filters=None, verbose=False):
        """
        Initialize the IncomingAdmissionPredictor with optional filters.

        Args:
            filters (dict, optional): A dictionary defining filters for different categories or specialties.
                                    If None or empty, no filtering will be applied.
            verbose (bool, optional): If True, enable info-level logging. Defaults to False.
        """
        self.filters = filters if filters else {}
        self.verbose = verbose
        self.metrics = {}  # Add metrics dictionary to store metadata

        if verbose:
            # Configure logging for Jupyter notebook compatibility
            import logging
            import sys

            # Create logger
            self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")

            # Only set up handlers if they don't exist
            if not self.logger.handlers:
                self.logger.setLevel(logging.INFO if verbose else logging.WARNING)

                # Create handler that writes to sys.stdout
                handler = logging.StreamHandler(sys.stdout)
                handler.setLevel(logging.INFO if verbose else logging.WARNING)

                # Create a formatting configuration
                formatter = logging.Formatter("%(message)s")
                handler.setFormatter(formatter)

                # Add the handler to the logger
                self.logger.addHandler(handler)

                # Prevent propagation to root logger
                self.logger.propagate = False


    def filter_dataframe(self, df: pd.DataFrame, filters: Dict) -> pd.DataFrame:
        """Apply a set of filters to a dataframe.

        Parameters
        ----------
        df : pandas.DataFrame
            The DataFrame to filter.
        filters : dict
            A dictionary where keys are column names and values are the criteria
            or function to filter by.

        Returns
        -------
        pandas.DataFrame
            A filtered DataFrame.
        """
        filtered_df = df
        for column, criteria in filters.items():
            if callable(criteria):  # If the criteria is a function, apply it directly
                filtered_df = filtered_df[filtered_df[column].apply(criteria)]
            else:  # Otherwise, assume the criteria is a value or list of values for equality check
                filtered_df = filtered_df[filtered_df[column] == criteria]
        return filtered_df

    def _calculate_parameters(
        self,
        df,
        prediction_window: timedelta,
        yta_time_interval: timedelta,
        prediction_times,
        num_days,
    ):
        """Calculate parameters required for the model.

        Parameters
        ----------
        df : pandas.DataFrame
            The data frame to process.
        prediction_window : timedelta
            The total duration of the prediction window.
        yta_time_interval : timedelta
            The interval for splitting the prediction window.
        prediction_times : list
            Times of day at which predictions are made.
        num_days : int
            Number of days over which to calculate time-varying arrival rates.

        Returns
        -------
        dict
            Calculated arrival_rates parameters organized by time of day.
        """

        # Calculate Ntimes - Python handles the division naturally
        Ntimes = int(prediction_window / yta_time_interval)

        # Pass original type to time_varying_arrival_rates
        arrival_rates_dict = time_varying_arrival_rates(
            df, yta_time_interval, num_days, verbose=self.verbose
        )
        prediction_time_dict = {}

        for prediction_time_ in prediction_times:
            prediction_time_hr, prediction_time_min = (
                (prediction_time_, 0)
                if isinstance(prediction_time_, int)
                else prediction_time_
            )
            arrival_rates = [
                arrival_rates_dict[
                    (
                        datetime(1970, 1, 1, prediction_time_hr, prediction_time_min)
                        + i * yta_time_interval
                    ).time()
                ]
                for i in range(Ntimes)
            ]
            prediction_time_dict[(prediction_time_hr, prediction_time_min)] = {
                "arrival_rates": arrival_rates
            }

        return prediction_time_dict

    def fit(
        self,
        train_df: pd.DataFrame,
        prediction_window: timedelta,
        yta_time_interval: timedelta,
        prediction_times: List[float],
        num_days: int,
        epsilon: float = 10**-7,
        y: Optional[None] = None,
    ) -> "IncomingAdmissionPredictor":
        """Fit the model to the training data.

        Parameters
        ----------
        train_df : pandas.DataFrame
            The training dataset with historical admission data.
        prediction_window : timedelta
            The prediction window as a timedelta object.
        yta_time_interval : timedelta
            The interval for splitting the prediction window as a timedelta object.
        prediction_times : list
            Times of day at which predictions are made, in hours.
        num_days : int
            The number of days that the train_df spans.
        epsilon : float, default=1e-7
            A small value representing acceptable error rate to enable calculation
            of the maximum value of the random variable representing number of beds.
        y : None, optional
            Ignored, present for compatibility with scikit-learn's fit method.

        Returns
        -------
        IncomingAdmissionPredictor
            The instance itself, fitted with the training data.

        Raises
        ------
        TypeError
            If prediction_window or yta_time_interval are not timedelta objects.
        ValueError
            If prediction_window/yta_time_interval is not greater than 1.
        """

        # Validate inputs
        if not isinstance(prediction_window, timedelta):
            raise TypeError("prediction_window must be a timedelta object")
        if not isinstance(yta_time_interval, timedelta):
            raise TypeError("yta_time_interval must be a timedelta object")

        if prediction_window.total_seconds() <= 0:
            raise ValueError("prediction_window must be positive")
        if yta_time_interval.total_seconds() <= 0:
            raise ValueError("yta_time_interval must be positive")
        if yta_time_interval.total_seconds() > 4 * 3600:  # 4 hours in seconds
            warnings.warn("yta_time_interval appears to be longer than 4 hours")

        # Validate the ratio makes sense
        ratio = prediction_window / yta_time_interval
        if int(ratio) == 0:
            raise ValueError(
                "prediction_window must be significantly larger than yta_time_interval"
            )

        # Store original types
        self.prediction_window = prediction_window
        self.yta_time_interval = yta_time_interval
        self.epsilon = epsilon
        self.prediction_times = [
            tuple(x)
            if isinstance(x, (list, np.ndarray))
            else (x, 0)
            if isinstance(x, (int, float))
            else x
            for x in prediction_times
        ]

        # Initialize the weights dictionary
        self.weights = {}

        # If there are filters specified, calculate and store the parameters directly with the respective spec keys
        if self.filters:
            for spec, filters in self.filters.items():
                self.weights[spec] = self._calculate_parameters(
                    self.filter_dataframe(train_df, filters),
                    prediction_window,
                    yta_time_interval,
                    prediction_times,
                    num_days,
                )
        else:
            # If there are no filters, store the parameters with a generic key of 'unfiltered'
            self.weights["unfiltered"] = self._calculate_parameters(
                train_df,
                prediction_window,
                yta_time_interval,
                prediction_times,
                num_days,
            )

        if self.verbose:
            self.logger.info(
                f"{self.__class__.__name__} trained for these times: {prediction_times}"
            )
            self.logger.info(
                f"using prediction window of {prediction_window} after the time of prediction"
            )
            self.logger.info(
                f"and time interval of {yta_time_interval} within the prediction window."
            )
            self.logger.info(f"The error value for prediction will be {epsilon}")
            self.logger.info(
                "To see the weights saved by this model, used the get_weights() method"
            )

        # Store metrics about the training data
        self.metrics["train_dttm"] = datetime.now().strftime("%Y-%m-%d %H:%M")
        self.metrics["train_set_no"] = len(train_df)
        self.metrics["start_date"] = train_df.index.min().date()
        self.metrics["end_date"] = train_df.index.max().date()
        self.metrics["num_days"] = num_days

        return self

    def get_weights(self):
        """Get the weights computed by the fit method.

        Returns
        -------
        dict
            The weights computed during model fitting.
        """
        return self.weights

    @abstractmethod
    def predict(self, prediction_context: Dict, **kwargs) -> Dict:
        """Predict the number of admissions for the given context.

        This is an abstract method that must be implemented by subclasses.

        Parameters
        ----------
        prediction_context : dict
            A dictionary defining the context for which predictions are to be made.
            It should specify either a general context or one based on the applied filters.
        **kwargs
            Additional keyword arguments specific to the prediction method.

        Returns
        -------
        dict
            A dictionary with predictions for each specified context.

        Raises
        ------
        ValueError
            If filter key is not recognized or prediction_time is not provided.
        KeyError
            If required keys are missing from the prediction context.
        """
        pass
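Because predict() is abstract, IncomingAdmissionPredictor cannot be instantiated directly. A minimal subclass sketch follows; it is illustrative only, not part of the library, and it returns a degenerate distribution purely to show the required shape of the output:

import numpy as np
import pandas as pd

class MeanRateAdmissionPredictor(IncomingAdmissionPredictor):
    """Toy subclass: expected count = sum of interval arrival rates."""

    def predict(self, prediction_context, **kwargs):
        predictions = {}
        for filter_key, filter_values in prediction_context.items():
            prediction_time = filter_values["prediction_time"]
            rates = self.weights[filter_key][prediction_time]["arrival_rates"]
            # Degenerate "distribution": all mass on the rounded expected count
            expected = int(round(float(np.sum(rates))))
            predictions[filter_key] = pd.DataFrame(
                {"agg_proba": [1.0]}, index=pd.Index([expected], name="sum")
            )
        return predictions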
__init__(filters=None, verbose=False)

Initialize the IncomingAdmissionPredictor with optional filters.

Args:
    filters (dict, optional): A dictionary defining filters for different categories or specialties. If None or empty, no filtering will be applied.
    verbose (bool, optional): If True, enable info-level logging. Defaults to False.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def __init__(self, filters=None, verbose=False):
    """
    Initialize the IncomingAdmissionPredictor with optional filters.

    Args:
        filters (dict, optional): A dictionary defining filters for different categories or specialties.
                                If None or empty, no filtering will be applied.
        verbose (bool, optional): If True, enable info-level logging. Defaults to False.
    """
    self.filters = filters if filters else {}
    self.verbose = verbose
    self.metrics = {}  # Add metrics dictionary to store metadata

    if verbose:
        # Configure logging for Jupyter notebook compatibility
        import logging
        import sys

        # Create logger
        self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")

        # Only set up handlers if they don't exist
        if not self.logger.handlers:
            self.logger.setLevel(logging.INFO if verbose else logging.WARNING)

            # Create handler that writes to sys.stdout
            handler = logging.StreamHandler(sys.stdout)
            handler.setLevel(logging.INFO if verbose else logging.WARNING)

            # Create a formatting configuration
            formatter = logging.Formatter("%(message)s")
            handler.setFormatter(formatter)

            # Add the handler to the logger
            self.logger.addHandler(handler)

            # Prevent propagation to root logger
            self.logger.propagate = False

filter_dataframe(df, filters)

Apply a set of filters to a dataframe.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to filter.

required
filters dict

A dictionary where keys are column names and values are the criteria or function to filter by.

required

Returns:

Type Description
DataFrame

A filtered DataFrame.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def filter_dataframe(self, df: pd.DataFrame, filters: Dict) -> pd.DataFrame:
    """Apply a set of filters to a dataframe.

    Parameters
    ----------
    df : pandas.DataFrame
        The DataFrame to filter.
    filters : dict
        A dictionary where keys are column names and values are the criteria
        or function to filter by.

    Returns
    -------
    pandas.DataFrame
        A filtered DataFrame.
    """
    filtered_df = df
    for column, criteria in filters.items():
        if callable(criteria):  # If the criteria is a function, apply it directly
            filtered_df = filtered_df[filtered_df[column].apply(criteria)]
        else:  # Otherwise, assume the criteria is a value or list of values for equality check
            filtered_df = filtered_df[filtered_df[column] == criteria]
    return filtered_df
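As the docstring indicates, filter criteria may be plain values (checked for equality) or callables. A brief sketch, assuming an already-constructed predictor instance and illustrative column names:

import pandas as pd

df = pd.DataFrame(
    {"specialty": ["medical", "surgical", "paediatric"], "age": [70, 45, 9]}
)

# Equality criterion: keep only surgical rows
surgical = predictor.filter_dataframe(df, {"specialty": "surgical"})

# Callable criterion: keep rows where the function returns True
adults = predictor.filter_dataframe(df, {"age": lambda a: a >= 18})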
fit(train_df, prediction_window, yta_time_interval, prediction_times, num_days, epsilon=10 ** -7, y=None)

Fit the model to the training data.

Parameters:

Name Type Description Default
train_df DataFrame

The training dataset with historical admission data.

required
prediction_window timedelta

The prediction window as a timedelta object.

required
yta_time_interval timedelta

The interval for splitting the prediction window as a timedelta object.

required
prediction_times list

Times of day at which predictions are made, in hours.

required
num_days int

The number of days that the train_df spans.

required
epsilon float

A small value representing acceptable error rate to enable calculation of the maximum value of the random variable representing number of beds.

1e-7
y None

Ignored, present for compatibility with scikit-learn's fit method.

None

Returns:

Type Description
IncomingAdmissionPredictor

The instance itself, fitted with the training data.

Raises:

Type Description
TypeError

If prediction_window or yta_time_interval are not timedelta objects.

ValueError

If prediction_window/yta_time_interval is not greater than 1.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def fit(
    self,
    train_df: pd.DataFrame,
    prediction_window: timedelta,
    yta_time_interval: timedelta,
    prediction_times: List[float],
    num_days: int,
    epsilon: float = 10**-7,
    y: Optional[None] = None,
) -> "IncomingAdmissionPredictor":
    """Fit the model to the training data.

    Parameters
    ----------
    train_df : pandas.DataFrame
        The training dataset with historical admission data.
    prediction_window : timedelta
        The prediction window as a timedelta object.
    yta_time_interval : timedelta
        The interval for splitting the prediction window as a timedelta object.
    prediction_times : list
        Times of day at which predictions are made, in hours.
    num_days : int
        The number of days that the train_df spans.
    epsilon : float, default=1e-7
        A small value representing acceptable error rate to enable calculation
        of the maximum value of the random variable representing number of beds.
    y : None, optional
        Ignored, present for compatibility with scikit-learn's fit method.

    Returns
    -------
    IncomingAdmissionPredictor
        The instance itself, fitted with the training data.

    Raises
    ------
    TypeError
        If prediction_window or yta_time_interval are not timedelta objects.
    ValueError
        If prediction_window/yta_time_interval is not greater than 1.
    """

    # Validate inputs
    if not isinstance(prediction_window, timedelta):
        raise TypeError("prediction_window must be a timedelta object")
    if not isinstance(yta_time_interval, timedelta):
        raise TypeError("yta_time_interval must be a timedelta object")

    if prediction_window.total_seconds() <= 0:
        raise ValueError("prediction_window must be positive")
    if yta_time_interval.total_seconds() <= 0:
        raise ValueError("yta_time_interval must be positive")
    if yta_time_interval.total_seconds() > 4 * 3600:  # 4 hours in seconds
        warnings.warn("yta_time_interval appears to be longer than 4 hours")

    # Validate the ratio makes sense
    ratio = prediction_window / yta_time_interval
    if int(ratio) == 0:
        raise ValueError(
            "prediction_window must be significantly larger than yta_time_interval"
        )

    # Store original types
    self.prediction_window = prediction_window
    self.yta_time_interval = yta_time_interval
    self.epsilon = epsilon
    self.prediction_times = [
        tuple(x)
        if isinstance(x, (list, np.ndarray))
        else (x, 0)
        if isinstance(x, (int, float))
        else x
        for x in prediction_times
    ]

    # Initialize the weights dictionary
    self.weights = {}

    # If there are filters specified, calculate and store the parameters directly with the respective spec keys
    if self.filters:
        for spec, filters in self.filters.items():
            self.weights[spec] = self._calculate_parameters(
                self.filter_dataframe(train_df, filters),
                prediction_window,
                yta_time_interval,
                prediction_times,
                num_days,
            )
    else:
        # If there are no filters, store the parameters with a generic key of 'unfiltered'
        self.weights["unfiltered"] = self._calculate_parameters(
            train_df,
            prediction_window,
            yta_time_interval,
            prediction_times,
            num_days,
        )

    if self.verbose:
        self.logger.info(
            f"{self.__class__.__name__} trained for these times: {prediction_times}"
        )
        self.logger.info(
            f"using prediction window of {prediction_window} after the time of prediction"
        )
        self.logger.info(
            f"and time interval of {yta_time_interval} within the prediction window."
        )
        self.logger.info(f"The error value for prediction will be {epsilon}")
        self.logger.info(
            "To see the weights saved by this model, used the get_weights() method"
        )

    # Store metrics about the training data
    self.metrics["train_dttm"] = datetime.now().strftime("%Y-%m-%d %H:%M")
    self.metrics["train_set_no"] = len(train_df)
    self.metrics["start_date"] = train_df.index.min().date()
    self.metrics["end_date"] = train_df.index.max().date()
    self.metrics["num_days"] = num_days

    return self
get_weights()

Get the weights computed by the fit method.

Returns:

Type Description
dict

The weights computed during model fitting.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def get_weights(self):
    """Get the weights computed by the fit method.

    Returns
    -------
    dict
        The weights computed during model fitting.
    """
    return self.weights
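The returned weights are nested by filter key, then by prediction time. A sketch of walking the structure after fitting; the key names follow the fit() logic shown above:

weights = predictor.get_weights()

# Unfiltered models use the single key 'unfiltered'
for filter_key, by_time in weights.items():
    for prediction_time, params in by_time.items():
        rates = params["arrival_rates"]
        print(filter_key, prediction_time, f"{len(rates)} interval rates")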
predict(prediction_context, **kwargs) abstractmethod

Predict the number of admissions for the given context.

This is an abstract method that must be implemented by subclasses.

Parameters:

Name Type Description Default
prediction_context dict

A dictionary defining the context for which predictions are to be made. It should specify either a general context or one based on the applied filters.

required
**kwargs

Additional keyword arguments specific to the prediction method.

{}

Returns:

Type Description
dict

A dictionary with predictions for each specified context.

Raises:

Type Description
ValueError

If filter key is not recognized or prediction_time is not provided.

KeyError

If required keys are missing from the prediction context.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
@abstractmethod
def predict(self, prediction_context: Dict, **kwargs) -> Dict:
    """Predict the number of admissions for the given context.

    This is an abstract method that must be implemented by subclasses.

    Parameters
    ----------
    prediction_context : dict
        A dictionary defining the context for which predictions are to be made.
        It should specify either a general context or one based on the applied filters.
    **kwargs
        Additional keyword arguments specific to the prediction method.

    Returns
    -------
    dict
        A dictionary with predictions for each specified context.

    Raises
    ------
    ValueError
        If filter key is not recognized or prediction_time is not provided.
    KeyError
        If required keys are missing from the prediction context.
    """
    pass

ParametricIncomingAdmissionPredictor

Bases: IncomingAdmissionPredictor

A predictor for estimating hospital admissions using parametric curves.

This predictor uses a combination of Poisson and binomial distributions to forecast future admissions, excluding patients who have already arrived. The prediction is based on historical data and can be filtered for specific hospital settings.

Parameters:

Name Type Description Default
filters dict

Optional filters for data categorization. If None, no filtering is applied.

None
verbose bool

Whether to enable verbose logging.

False

Attributes:

Name Type Description
filters dict

Filters for data categorization.

verbose bool

Verbose logging flag.

metrics dict

Stores metadata about the model and training data.

weights dict

Model parameters computed during fitting.

Notes

The predictor implements scikit-learn's BaseEstimator and TransformerMixin interfaces for compatibility with scikit-learn pipelines.
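A hedged usage sketch for the parametric variant. The curve parameters below are illustrative targets only (e.g. aiming for 76% of patients admitted within 4 hours), and train_df is assumed to be prepared as for the base class:

from datetime import timedelta

predictor = ParametricIncomingAdmissionPredictor()
predictor.fit(
    train_df,  # indexed by arrival datetime, as for the base class
    prediction_window=timedelta(hours=8),
    yta_time_interval=timedelta(minutes=15),
    prediction_times=[(12, 0)],
    num_days=90,
)

# x1/y1 and x2/y2 define the two transition points of the aspirational curve
result = predictor.predict(
    {"unfiltered": {"prediction_time": (12, 0)}},
    x1=4, y1=0.76, x2=12, y2=0.99,
)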

Source code in src/patientflow/predictors/incoming_admission_predictors.py
class ParametricIncomingAdmissionPredictor(IncomingAdmissionPredictor):
    """A predictor for estimating hospital admissions using parametric curves.

    This predictor uses a combination of Poisson and binomial distributions to forecast
    future admissions, excluding patients who have already arrived. The prediction is
    based on historical data and can be filtered for specific hospital settings.

    Parameters
    ----------
    filters : dict, optional
        Optional filters for data categorization. If None, no filtering is applied.
    verbose : bool, default=False
        Whether to enable verbose logging.

    Attributes
    ----------
    filters : dict
        Filters for data categorization.
    verbose : bool
        Verbose logging flag.
    metrics : dict
        Stores metadata about the model and training data.
    weights : dict
        Model parameters computed during fitting.

    Notes
    -----
    The predictor implements scikit-learn's BaseEstimator and TransformerMixin
    interfaces for compatibility with scikit-learn pipelines.
    """

    def predict(self, prediction_context: Dict, **kwargs) -> Dict:
        """Predict the number of admissions for the given context using parametric curves.

        Parameters
        ----------
        prediction_context : dict
            A dictionary defining the context for which predictions are to be made.
            It should specify either a general context or one based on the applied filters.
        **kwargs
            Additional keyword arguments for parametric curve configuration:

            x1 : float
                The x-coordinate of the first transition point on the aspirational curve,
                where the growth phase ends and the decay phase begins.
            y1 : float
                The y-coordinate of the first transition point (x1), representing the target
                proportion of patients admitted by time x1.
            x2 : float
                The x-coordinate of the second transition point on the curve, beyond which
                all but a few patients are expected to be admitted.
            y2 : float
                The y-coordinate of the second transition point (x2), representing the target
                proportion of patients admitted by time x2.

        Returns
        -------
        dict
            A dictionary with predictions for each specified context.

        Raises
        ------
        ValueError
            If filter key is not recognized or prediction_time is not provided.
        KeyError
            If required keys are missing from the prediction context.
        """
        # Extract required parameters from kwargs
        x1 = kwargs.get("x1")
        y1 = kwargs.get("y1")
        x2 = kwargs.get("x2")
        y2 = kwargs.get("y2")

        # Validate that required parameters are provided
        if x1 is None or y1 is None or x2 is None or y2 is None:
            raise ValueError(
                "x1, y1, x2, and y2 parameters are required for parametric prediction"
            )

        predictions = {}

        # Calculate Ntimes
        if isinstance(self.prediction_window, timedelta) and isinstance(
            self.yta_time_interval, timedelta
        ):
            NTimes = int(self.prediction_window / self.yta_time_interval)
        elif isinstance(self.prediction_window, timedelta):
            NTimes = int(
                self.prediction_window.total_seconds() / 60 / self.yta_time_interval
            )
        elif isinstance(self.yta_time_interval, timedelta):
            NTimes = int(
                self.prediction_window / (self.yta_time_interval.total_seconds() / 60)
            )
        else:
            NTimes = int(self.prediction_window / self.yta_time_interval)

        # Convert to hours only for numpy operations (which require numeric types)
        prediction_window_hours = (
            self.prediction_window.total_seconds() / 3600
            if isinstance(self.prediction_window, timedelta)
            else self.prediction_window / 60
        )
        yta_time_interval_hours = (
            self.yta_time_interval.total_seconds() / 3600
            if isinstance(self.yta_time_interval, timedelta)
            else self.yta_time_interval / 60
        )

        # Calculate theta, probability of admission in prediction window
        # for each time interval, calculate time remaining before end of window
        time_remaining_before_end_of_window = prediction_window_hours - np.arange(
            0, prediction_window_hours, yta_time_interval_hours
        )

        theta = get_y_from_aspirational_curve(
            time_remaining_before_end_of_window, x1, y1, x2, y2
        )

        for filter_key, filter_values in prediction_context.items():
            try:
                if filter_key not in self.weights:
                    raise ValueError(
                        f"Filter key '{filter_key}' is not recognized in the model weights."
                    )

                prediction_time = filter_values.get("prediction_time")
                if prediction_time is None:
                    raise ValueError(
                        f"No 'prediction_time' provided for filter '{filter_key}'."
                    )

                if prediction_time not in self.prediction_times:
                    prediction_time = find_nearest_previous_prediction_time(
                        prediction_time, self.prediction_times
                    )

                arrival_rates = self.weights[filter_key][prediction_time].get(
                    "arrival_rates"
                )
                if arrival_rates is None:
                    raise ValueError(
                        f"No arrival_rates found for the time of day '{prediction_time}' under filter '{filter_key}'."
                    )

                predictions[filter_key] = poisson_binom_generating_function(
                    NTimes, arrival_rates, theta, self.epsilon
                )

            except KeyError as e:
                raise KeyError(f"Key error occurred: {e!s}")

        return predictions
predict(prediction_context, **kwargs)

Predict the number of admissions for the given context using parametric curves.

Parameters:

Name Type Description Default
prediction_context dict

A dictionary defining the context for which predictions are to be made. It should specify either a general context or one based on the applied filters.

required
**kwargs

Additional keyword arguments for parametric curve configuration:

x1 : float
    The x-coordinate of the first transition point on the aspirational curve, where the growth phase ends and the decay phase begins.
y1 : float
    The y-coordinate of the first transition point (x1), representing the target proportion of patients admitted by time x1.
x2 : float
    The x-coordinate of the second transition point on the curve, beyond which all but a few patients are expected to be admitted.
y2 : float
    The y-coordinate of the second transition point (x2), representing the target proportion of patients admitted by time x2.

{}

Returns:

Type Description
dict

A dictionary with predictions for each specified context.

Raises:

Type Description
ValueError

If filter key is not recognized or prediction_time is not provided.

KeyError

If required keys are missing from the prediction context.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def predict(self, prediction_context: Dict, **kwargs) -> Dict:
    """Predict the number of admissions for the given context using parametric curves.

    Parameters
    ----------
    prediction_context : dict
        A dictionary defining the context for which predictions are to be made.
        It should specify either a general context or one based on the applied filters.
    **kwargs
        Additional keyword arguments for parametric curve configuration:

        x1 : float
            The x-coordinate of the first transition point on the aspirational curve,
            where the growth phase ends and the decay phase begins.
        y1 : float
            The y-coordinate of the first transition point (x1), representing the target
            proportion of patients admitted by time x1.
        x2 : float
            The x-coordinate of the second transition point on the curve, beyond which
            all but a few patients are expected to be admitted.
        y2 : float
            The y-coordinate of the second transition point (x2), representing the target
            proportion of patients admitted by time x2.

    Returns
    -------
    dict
        A dictionary with predictions for each specified context.

    Raises
    ------
    ValueError
        If filter key is not recognized or prediction_time is not provided.
    KeyError
        If required keys are missing from the prediction context.
    """
    # Extract required parameters from kwargs
    x1 = kwargs.get("x1")
    y1 = kwargs.get("y1")
    x2 = kwargs.get("x2")
    y2 = kwargs.get("y2")

    # Validate that required parameters are provided
    if x1 is None or y1 is None or x2 is None or y2 is None:
        raise ValueError(
            "x1, y1, x2, and y2 parameters are required for parametric prediction"
        )

    predictions = {}

    # Calculate NTimes, the number of time intervals in the prediction window
    if isinstance(self.prediction_window, timedelta) and isinstance(
        self.yta_time_interval, timedelta
    ):
        NTimes = int(self.prediction_window / self.yta_time_interval)
    elif isinstance(self.prediction_window, timedelta):
        NTimes = int(
            self.prediction_window.total_seconds() / 60 / self.yta_time_interval
        )
    elif isinstance(self.yta_time_interval, timedelta):
        NTimes = int(
            self.prediction_window / (self.yta_time_interval.total_seconds() / 60)
        )
    else:
        NTimes = int(self.prediction_window / self.yta_time_interval)

    # Convert to hours only for numpy operations (which require numeric types)
    prediction_window_hours = (
        self.prediction_window.total_seconds() / 3600
        if isinstance(self.prediction_window, timedelta)
        else self.prediction_window / 60
    )
    yta_time_interval_hours = (
        self.yta_time_interval.total_seconds() / 3600
        if isinstance(self.yta_time_interval, timedelta)
        else self.yta_time_interval / 60
    )

    # Calculate theta, probability of admission in prediction window
    # for each time interval, calculate time remaining before end of window
    time_remaining_before_end_of_window = prediction_window_hours - np.arange(
        0, prediction_window_hours, yta_time_interval_hours
    )

    theta = get_y_from_aspirational_curve(
        time_remaining_before_end_of_window, x1, y1, x2, y2
    )

    for filter_key, filter_values in prediction_context.items():
        try:
            if filter_key not in self.weights:
                raise ValueError(
                    f"Filter key '{filter_key}' is not recognized in the model weights."
                )

            prediction_time = filter_values.get("prediction_time")
            if prediction_time is None:
                raise ValueError(
                    f"No 'prediction_time' provided for filter '{filter_key}'."
                )

            if prediction_time not in self.prediction_times:
                prediction_time = find_nearest_previous_prediction_time(
                    prediction_time, self.prediction_times
                )

            arrival_rates = self.weights[filter_key][prediction_time].get(
                "arrival_rates"
            )
            if arrival_rates is None:
                raise ValueError(
                    f"No arrival_rates found for the time of day '{prediction_time}' under filter '{filter_key}'."
                )

            predictions[filter_key] = poisson_binom_generating_function(
                NTimes, arrival_rates, theta, self.epsilon
            )

        except KeyError as e:
            raise KeyError(f"Key error occurred: {e!s}")

    return predictions
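
The sketch below shows one way this method might be called. It is illustrative only: `predictor` is assumed to be an already-fitted instance of the parametric admission predictor documented above, and the 'unfettered' filter key and (15, 30) prediction time are hypothetical values that must match what the model was trained with.

# Hypothetical usage sketch: `predictor` is assumed to be a fitted instance of
# the parametric predictor; the 'unfettered' filter key and (15, 30) prediction
# time are illustrative and must exist in the fitted model's weights.
prediction_context = {"unfettered": {"prediction_time": (15, 30)}}

predictions = predictor.predict(
    prediction_context,
    x1=4, y1=0.76,   # aim for 76% of patients admitted within 4 hours
    x2=12, y2=0.99,  # and 99% within 12 hours
)

# Each value is a DataFrame indexed by 'sum' with an 'agg_proba' column
print(predictions["unfettered"].head())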

aggregate_probabilities(lam, kmax, theta, time_index)

Aggregate probabilities for a range of values using the weighted Poisson-Binomial distribution.

Parameters:

Name Type Description Default
lam ndarray

An array of lambda values for each time interval.

required
kmax int

The maximum number of events to consider.

required
theta ndarray

An array of theta values for each time interval.

required
time_index int

The current time index for which to calculate probabilities.

required

Returns:

Type Description
ndarray

Aggregated probabilities for the given time index.

Raises:

Type Description
ValueError

If kmax < 0, time_index < 0, or array lengths are invalid.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def aggregate_probabilities(lam, kmax, theta, time_index):
    """Aggregate probabilities for a range of values using the weighted Poisson-Binomial distribution.

    Parameters
    ----------
    lam : numpy.ndarray
        An array of lambda values for each time interval.
    kmax : int
        The maximum number of events to consider.
    theta : numpy.ndarray
        An array of theta values for each time interval.
    time_index : int
        The current time index for which to calculate probabilities.

    Returns
    -------
    numpy.ndarray
        Aggregated probabilities for the given time index.

    Raises
    ------
    ValueError
        If kmax < 0, time_index < 0, or array lengths are invalid.
    """
    if kmax < 0 or time_index < 0 or len(lam) <= time_index or len(theta) <= time_index:
        raise ValueError("Invalid kmax, time_index, or array lengths.")

    probabilities_matrix = np.zeros((kmax + 1, kmax + 1))
    for i in range(kmax + 1):
        probabilities_matrix[: i + 1, i] = weighted_poisson_binomial(
            i, lam[time_index], theta[time_index]
        )
    return probabilities_matrix.sum(axis=1)
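
A minimal sketch of calling this function directly; the arrival rates and admission probabilities below are made-up values, not taken from the package:

import numpy as np
from patientflow.predictors.incoming_admission_predictors import (
    aggregate_probabilities,
)

lam = np.array([3.0, 2.5])    # Poisson arrival rate per time interval (illustrative)
theta = np.array([0.8, 0.6])  # admission probability per time interval (illustrative)
kmax = 10                     # consider up to 10 arrivals

# Distribution over 0..kmax admitted patients arising from time interval 0
probs = aggregate_probabilities(lam, kmax, theta, time_index=0)
assert probs.shape == (kmax + 1,)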

convolute_distributions(dist_a, dist_b)

Convolutes two probability distributions represented as dataframes.

Parameters:

Name Type Description Default
dist_a DataFrame

The first distribution with columns ['sum', 'prob'].

required
dist_b DataFrame

The second distribution with columns ['sum', 'prob'].

required

Returns:

Type Description
DataFrame

The convoluted distribution.

Raises:

Type Description
ValueError

If DataFrames do not contain required 'sum' and 'prob' columns.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def convolute_distributions(dist_a, dist_b):
    """Convolutes two probability distributions represented as dataframes.

    Parameters
    ----------
    dist_a : pd.DataFrame
        The first distribution with columns ['sum', 'prob'].
    dist_b : pd.DataFrame
        The second distribution with columns ['sum', 'prob'].

    Returns
    -------
    pd.DataFrame
        The convoluted distribution.

    Raises
    ------
    ValueError
        If DataFrames do not contain required 'sum' and 'prob' columns.
    """
    if not {"sum", "prob"}.issubset(dist_a.columns) or not {
        "sum",
        "prob",
    }.issubset(dist_b.columns):
        raise ValueError("DataFrames must contain 'sum' and 'prob' columns.")

    sums = [x + y for x in dist_a["sum"] for y in dist_b["sum"]]
    probs = [x * y for x in dist_a["prob"] for y in dist_b["prob"]]
    result = pd.DataFrame(zip(sums, probs), columns=["sum", "prob"])
    return result.groupby("sum")["prob"].sum().reset_index()
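
A small worked example, using two toy distributions over counts (the values are illustrative):

import pandas as pd
from patientflow.predictors.incoming_admission_predictors import (
    convolute_distributions,
)

dist_a = pd.DataFrame({"sum": [0, 1], "prob": [0.5, 0.5]})
dist_b = pd.DataFrame({"sum": [0, 1], "prob": [0.9, 0.1]})

combined = convolute_distributions(dist_a, dist_b)
#    sum  prob
# 0    0  0.45   (0.5 * 0.9)
# 1    1  0.50   (0.5 * 0.1 + 0.5 * 0.9)
# 2    2  0.05   (0.5 * 0.1)
print(combined)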

find_nearest_previous_prediction_time(requested_time, prediction_times)

Find the nearest previous time of day in prediction_times relative to requested time.

Parameters:

Name Type Description Default
requested_time tuple

The requested time as (hour, minute).

required
prediction_times list

List of available prediction times.

required

Returns:

Type Description
tuple

The closest previous time of day from prediction_times.

Notes

If the requested time is earlier than all times in prediction_times, returns the latest time in prediction_times.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def find_nearest_previous_prediction_time(requested_time, prediction_times):
    """Find the nearest previous time of day in prediction_times relative to requested time.

    Parameters
    ----------
    requested_time : tuple
        The requested time as (hour, minute).
    prediction_times : list
        List of available prediction times.

    Returns
    -------
    tuple
        The closest previous time of day from prediction_times.

    Notes
    -----
    If the requested time is earlier than all times in prediction_times,
    returns the latest time in prediction_times.
    """
    if requested_time in prediction_times:
        return requested_time

    original_prediction_time = requested_time
    requested_datetime = datetime.strptime(
        f"{requested_time[0]:02d}:{requested_time[1]:02d}", "%H:%M"
    )
    closest_prediction_time = max(
        prediction_times,
        key=lambda prediction_time_time: datetime.strptime(
            f"{prediction_time_time[0]:02d}:{prediction_time_time[1]:02d}",
            "%H:%M",
        ),
    )
    min_diff = float("inf")

    for prediction_time_time in prediction_times:
        prediction_time_datetime = datetime.strptime(
            f"{prediction_time_time[0]:02d}:{prediction_time_time[1]:02d}",
            "%H:%M",
        )
        diff = (requested_datetime - prediction_time_datetime).total_seconds()

        # If the difference is negative, it means the prediction_time_time is ahead of the requested_time,
        # hence we calculate the difference by considering a day's wrap around.
        if diff < 0:
            diff += 24 * 60 * 60  # Add 24 hours in seconds

        if 0 <= diff < min_diff:
            closest_prediction_time = prediction_time_time
            min_diff = diff

    warnings.warn(
        f"Time of day requested of {original_prediction_time} was not in model training. "
        f"Reverting to predictions for {closest_prediction_time}."
    )

    return closest_prediction_time
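
A short sketch of the matching behaviour, using an illustrative list of prediction times:

from patientflow.predictors.incoming_admission_predictors import (
    find_nearest_previous_prediction_time,
)

prediction_times = [(6, 0), (9, 30), (12, 0), (15, 30), (22, 0)]  # illustrative

find_nearest_previous_prediction_time((12, 0), prediction_times)   # exact match: (12, 0)
find_nearest_previous_prediction_time((14, 45), prediction_times)  # warns, returns (12, 0)
find_nearest_previous_prediction_time((5, 0), prediction_times)    # wraps to latest: (22, 0)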

poisson_binom_generating_function(NTimes, arrival_rates, theta, epsilon)

Generate a distribution based on the aggregate of Poisson and Binomial distributions.

Parameters:

Name Type Description Default
NTimes int

The number of time intervals.

required
arrival_rates ndarray

An array of lambda values for each time interval.

required
theta ndarray

An array of theta values for each time interval.

required
epsilon float

The desired error threshold.

required

Returns:

Type Description
DataFrame

The generated distribution.

Raises:

Type Description
ValueError

If NTimes <= 0 or epsilon is not between 0 and 1.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def poisson_binom_generating_function(NTimes, arrival_rates, theta, epsilon):
    """Generate a distribution based on the aggregate of Poisson and Binomial distributions.

    Parameters
    ----------
    NTimes : int
        The number of time intervals.
    arrival_rates : numpy.ndarray
        An array of lambda values for each time interval.
    theta : numpy.ndarray
        An array of theta values for each time interval.
    epsilon : float
        The desired error threshold.

    Returns
    -------
    pd.DataFrame
        The generated distribution.

    Raises
    ------
    ValueError
        If NTimes <= 0 or epsilon is not between 0 and 1.
    """

    if NTimes <= 0 or epsilon <= 0 or epsilon >= 1:
        raise ValueError("Ensure NTimes > 0 and 0 < epsilon < 1.")

    maxlam = max(arrival_rates)
    kmax = int(poisson.ppf(1 - epsilon, maxlam))
    distribution = np.zeros((kmax + 1, NTimes))

    for j in range(NTimes):
        distribution[:, j] = aggregate_probabilities(arrival_rates, kmax, theta, j)

    df_list = [
        pd.DataFrame({"sum": range(kmax + 1), "prob": distribution[:, j]})
        for j in range(NTimes)
    ]
    total_distribution = df_list[0]

    for df in df_list[1:]:
        total_distribution = convolute_distributions(total_distribution, df)

    total_distribution = total_distribution.rename(
        columns={"prob": "agg_proba"}
    ).set_index("sum")

    return total_distribution
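
A minimal sketch of generating the aggregate distribution directly; the rates, probabilities, and error threshold are illustrative:

import numpy as np
from patientflow.predictors.incoming_admission_predictors import (
    poisson_binom_generating_function,
)

NTimes = 4
arrival_rates = np.array([2.0, 3.0, 2.5, 1.5])  # per-interval Poisson rates (illustrative)
theta = np.array([0.9, 0.7, 0.5, 0.3])          # per-interval admission probabilities

dist = poisson_binom_generating_function(NTimes, arrival_rates, theta, epsilon=1e-7)
print(dist.head())              # DataFrame indexed by 'sum' with column 'agg_proba'
print(dist["agg_proba"].sum())  # close to 1 (the tail beyond kmax is truncated)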

weighted_poisson_binomial(i, lam, theta)

Calculate weighted probabilities using Poisson and Binomial distributions.

Parameters:

Name Type Description Default
i int

The upper bound of the range for the binomial distribution.

required
lam float

The lambda parameter for the Poisson distribution.

required
theta float

The probability of success for the binomial distribution.

required

Returns:

Type Description
ndarray

An array of weighted probabilities.

Raises:

Type Description
ValueError

If i < 0, lam < 0, or theta is not between 0 and 1.

Source code in src/patientflow/predictors/incoming_admission_predictors.py
def weighted_poisson_binomial(i, lam, theta):
    """Calculate weighted probabilities using Poisson and Binomial distributions.

    Parameters
    ----------
    i : int
        The upper bound of the range for the binomial distribution.
    lam : float
        The lambda parameter for the Poisson distribution.
    theta : float
        The probability of success for the binomial distribution.

    Returns
    -------
    numpy.ndarray
        An array of weighted probabilities.

    Raises
    ------
    ValueError
        If i < 0, lam < 0, or theta is not between 0 and 1.
    """
    if i < 0 or lam < 0 or not 0 <= theta <= 1:
        raise ValueError("Ensure i >= 0, lam >= 0, and 0 <= theta <= 1.")

    arr_seq = np.arange(i + 1)
    probabilities = binom.pmf(arr_seq, i, theta)
    return poisson.pmf(i, lam) * probabilities
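
A quick sanity check of what the function computes, with illustrative parameter values:

import numpy as np
from scipy.stats import poisson
from patientflow.predictors.incoming_admission_predictors import (
    weighted_poisson_binomial,
)

# P(exactly 3 arrivals at rate 2.0), split over k = 0..3 of those arrivals
# being admitted within the window with probability 0.6 each
w = weighted_poisson_binomial(3, 2.0, 0.6)

# The weights sum to the Poisson probability of exactly 3 arrivals
assert np.isclose(w.sum(), poisson.pmf(3, 2.0))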

sequence_to_outcome_predictor

This module implements a SequenceToOutcomePredictor class that models and predicts the probability distribution of sequences in categorical data. The class builds a model based on training data, where input sequences are mapped to specific outcome categories. It provides methods to fit the model, compute sequence-based probabilities, and make predictions on an unseen dataset of input sequences.

Classes:

Name Description
SequenceToOutcomePredictor : sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

A model that predicts the probability of ending in different outcome categories based on input sequences. Note: All sequence inputs are expected to be tuples. Lists will be automatically converted to tuples, and None values will be converted to empty tuples.

SequenceToOutcomePredictor

Bases: BaseEstimator, TransformerMixin

A class to model sequence-based predictions for categorical data using input and grouping sequences. This class implements both the fit and predict methods from the parent sklearn classes.

Parameters:

Name Type Description Default
input_var str

Name of the column representing the input sequence in the DataFrame.

required
grouping_var str

Name of the column representing the grouping sequence in the DataFrame.

required
outcome_var str

Name of the column representing the outcome category in the DataFrame.

required
apply_special_category_filtering bool

Whether to filter out special categories of patients before fitting the model.

True
admit_col str

Name of the column indicating whether a patient was admitted.

'is_admitted'

Attributes:

Name Type Description
weights dict

A dictionary storing the probabilities of different input sequences leading to specific outcome categories.

input_to_grouping_probs DataFrame

A DataFrame that stores the computed probabilities of input sequences being associated with different grouping sequences.

special_params (dict, optional)

The special category parameters used for filtering, only populated if apply_special_category_filtering=True.

metrics dict

A dictionary to store metrics related to the training process.

Source code in src/patientflow/predictors/sequence_to_outcome_predictor.py
class SequenceToOutcomePredictor(BaseEstimator, TransformerMixin):
    """
    A class to model sequence-based predictions for categorical data using input and grouping sequences.
    This class implements both the `fit` and `predict` methods from the parent sklearn classes.

    Parameters
    ----------
    input_var : str
        Name of the column representing the input sequence in the DataFrame.
    grouping_var : str
        Name of the column representing the grouping sequence in the DataFrame.
    outcome_var : str
        Name of the column representing the outcome category in the DataFrame.
    apply_special_category_filtering : bool, default=True
        Whether to filter out special categories of patients before fitting the model.
    admit_col : str, default='is_admitted'
        Name of the column indicating whether a patient was admitted.

    Attributes
    ----------
    weights : dict
        A dictionary storing the probabilities of different input sequences leading to specific outcome categories.
    input_to_grouping_probs : pd.DataFrame
        A DataFrame that stores the computed probabilities of input sequences being associated with different grouping sequences.
    special_params : dict, optional
        The special category parameters used for filtering, only populated if apply_special_category_filtering=True.
    metrics : dict
        A dictionary to store metrics related to the training process.
    """

    def __init__(
        self,
        input_var,
        grouping_var,
        outcome_var,
        apply_special_category_filtering=True,
        admit_col="is_admitted",
    ):
        self.input_var = input_var
        self.grouping_var = grouping_var
        self.outcome_var = outcome_var
        self.apply_special_category_filtering = apply_special_category_filtering
        self.admit_col = admit_col
        self.weights = None
        self.special_params = None
        self.metrics = {}

    def __repr__(self):
        """Return a string representation of the estimator."""
        class_name = self.__class__.__name__
        return (
            f"{class_name}(\n"
            f"    input_var='{self.input_var}',\n"
            f"    grouping_var='{self.grouping_var}',\n"
            f"    outcome_var='{self.outcome_var}',\n"
            f"    apply_special_category_filtering={self.apply_special_category_filtering},\n"
            f"    admit_col='{self.admit_col}'\n"
            f")"
        )

    def _ensure_tuple(self, sequence):
        """
        Convert a sequence to tuple if it's not already a tuple.
        Handles string cleaning to avoid double-quoting issues.

        Parameters
        ----------
        sequence : tuple, list, or None
            The sequence to convert

        Returns
        -------
        tuple
            The input sequence as a tuple, or an empty tuple if input was None
        """
        if sequence is None:
            return ()
        if isinstance(sequence, (list, pd.Series)):
            # Clean any quoted strings in the sequence
            cleaned_sequence = [
                ast.literal_eval(item)
                if isinstance(item, str) and item.startswith("'") and item.endswith("'")
                else item
                for item in sequence
            ]
            return tuple(cleaned_sequence) if cleaned_sequence else ()
        if isinstance(sequence, tuple):
            # Clean any quoted strings in the tuple
            return tuple(
                ast.literal_eval(item)
                if isinstance(item, str) and item.startswith("'") and item.endswith("'")
                else item
                for item in sequence
            )
        return sequence

    def _preprocess_data(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Preprocesses the input data before fitting the model.

        Steps include:
        1. Selecting only admitted patients with a non-null specialty
        2. Optionally filtering out special categories
        3. Converting sequence columns to tuple format if they aren't already

        Parameters
        ----------
        X : pd.DataFrame
            DataFrame containing patient data.

        Returns
        -------
        pd.DataFrame
            Preprocessed DataFrame ready for model fitting.
        """
        # Make a copy to avoid modifying the original
        df = X.copy()

        # Step 1: Select only admitted patients with a non-null specialty
        if self.admit_col in df.columns:
            df = df[df[self.admit_col] & ~df[self.outcome_var].isnull()]

        # Step 2: Optionally apply filtering for special categories
        if self.apply_special_category_filtering:
            # Get configuration for categorizing patients based on columns
            self.special_params = create_special_category_objects(df.columns)

            # Extract function that identifies non-special category patients
            opposite_special_category_func = self.special_params["special_func_map"][
                "default"
            ]

            # Determine which category is the special category
            special_category_key = next(
                key
                for key, value in self.special_params["special_category_dict"].items()
                if value == 1.0
            )

            # Filter out special category patients
            df = df[
                df.apply(opposite_special_category_func, axis=1)
                & (df[self.outcome_var] != special_category_key)
            ]

        # Step 3: Convert sequence columns to tuple format
        if self.input_var in df.columns:
            df[self.input_var] = df[self.input_var].apply(self._ensure_tuple)

        if self.grouping_var in df.columns:
            df[self.grouping_var] = df[self.grouping_var].apply(self._ensure_tuple)

        return df

    def fit(self, X: pd.DataFrame) -> "SequenceToOutcomePredictor":
        """
        Fits the predictor based on training data by computing the proportion of each input variable sequence
        ending in specific outcome variable categories.

        Automatically preprocesses the data before fitting.

        Parameters
        ----------
        X : pd.DataFrame
            A pandas DataFrame containing at least the columns specified by `input_var`, `grouping_var`, and `outcome_var`.

        Returns
        -------
        self : SequenceToOutcomePredictor
            The fitted SequenceToOutcomePredictor model with calculated probabilities for each sequence.
        """
        # Store metrics about the training data
        self.metrics["train_dttm"] = datetime.now().strftime("%Y-%m-%d %H:%M")
        self.metrics["train_set_no"] = len(X)
        if not X.empty:
            self.metrics["start_date"] = X["snapshot_date"].min()
            self.metrics["end_date"] = X["snapshot_date"].max()

        # Preprocess the data
        X = self._preprocess_data(X)

        # derive the names of the observed outcome variables from the data
        prop_keys = X[self.outcome_var].unique()

        # For each sequence count the number of observed categories
        X_grouped = (
            X.groupby(self.grouping_var)[self.outcome_var]
            .value_counts()
            .unstack(fill_value=0)
        )

        # Calculate the total number of times each grouping sequence occurred
        row_totals = X_grouped.sum(axis=1)

        # Calculate, for each grouping sequence, the proportion ending with each observed specialty
        proportions = X_grouped.div(row_totals, axis=0)

        # Calculate the probability of each grouping sequence occurring in the original data
        proportions["probability_of_grouping_sequence"] = row_totals / row_totals.sum()

        # Reweight probabilities of ending with each observed specialty
        # by the likelihood of each grouping sequence occurring
        for col in proportions.columns[
            :-1
        ]:  # Avoid the last column which is the 'probability_of_grouping_sequence'
            proportions[col] *= proportions["probability_of_grouping_sequence"]

        # Convert final sequence to a string in order to conduct string searches on it
        proportions["grouping_sequence_to_string"] = proportions.index.map(
            lambda x: "-".join(map(str, x))
        )

        # Row-wise function to return, for each input sequence,
        # the proportion that end up in each final sequence and thereby
        # the probability of it ending in any observed category
        proportions["prob_input_var_ends_in_observed_specialty"] = proportions[
            "grouping_sequence_to_string"
        ].apply(lambda x: self._string_match_input_var(x, proportions, prop_keys))

        # Convert the prob_input_var_ends_in_observed_specialty column to a dictionary
        result_dict = proportions["prob_input_var_ends_in_observed_specialty"].to_dict()

        # Clean the key to remove excess string quotes
        def clean_tuple_key(key):
            if isinstance(key, tuple):
                return tuple(
                    ast.literal_eval(item)
                    if item.startswith("'") and item.endswith("'")
                    else item
                    for item in key
                )
            return key

        cleaned_dict = {clean_tuple_key(k): v for k, v in result_dict.items()}

        # save prob_input_var_ends_in_observed_specialty as weights within the model
        self.weights = cleaned_dict

        # save the input to grouping probabilities for use as a reference
        self.input_to_grouping_probs = self._probability_of_input_to_grouping_sequence(
            X
        )

        return self

    def _string_match_input_var(self, input_var_string, proportions, prop_keys):
        """
        Matches a given input sequence string with grouped sequences (expressed as strings) in the dataset and aggregates
        their probabilities for each outcome category. This function filters the data to
        match only those rows where the *beginning* of the grouped sequence string
        matches the given input sequence string, allowing for partial matches.
        For instance, the sequence 'medical' will match 'medical, elderly' and 'medical, surgical'
        as well as 'medical' on its own. It computes the total probabilities of any input sequence ending
        in each outcome category, and normalizes these totals if possible.

        Parameters
        ----------
        input_var_string : str
            The sequence of inputs represented as a string, used to match against sequences in the proportions DataFrame.
        proportions : pd.DataFrame
            DataFrame containing proportions data with an additional column 'grouping_sequence_to_string'
            which includes string representations of sequences.
        prop_keys : np.array
            Array of unique outcomes to consider in calculations.

        Returns
        -------
        dict
            A dictionary where keys are outcome names and values are the aggregated and normalized probabilities
            of an input sequence ending in those outcomes.

        """
        # Filter rows where the grouped sequence string starts with the input sequence string
        props = proportions[
            proportions["grouping_sequence_to_string"].str.match("^" + input_var_string)
        ][prop_keys].sum()

        # Sum of all probabilities to normalize them
        props_total = props.sum()

        # Handle cases where the total probability is zero to avoid division by zero
        if props_total > 0:
            normalized_props = props / props_total
        else:
            normalized_props = (
                props * 0
            )  # Returns zero probabilities if no matches found

        return dict(zip(prop_keys, normalized_props))

    def _probability_of_input_to_grouping_sequence(self, X):
        """
        Computes the probabilities of different input sequences leading to specific grouping sequences.

        Parameters
        ----------
        X : pd.DataFrame
            A pandas DataFrame containing at least the columns specified by `input_var` and `grouping_var`.

        Returns
        -------
        pd.DataFrame
            A DataFrame containing the probabilities of input sequences leading to grouping sequences.
        """
        # For each input sequence count the number of grouping sequences
        X_grouped = (
            X.groupby(self.input_var)[self.grouping_var]
            .value_counts()
            .unstack(fill_value=0)
        )

        # Calculate the total number of times each input sequence occurred
        row_totals = X_grouped.sum(axis=1)

        # Calculate, for each input sequence, the proportion ending with each grouping sequence
        proportions = X_grouped.div(row_totals, axis=0)

        # Calculate the probability of each input sequence occurring in the original data
        proportions["probability_of_grouping_sequence"] = row_totals / row_totals.sum()

        return proportions

    def predict(self, input_sequence: tuple[str, ...]) -> Dict[str, float]:
        """
        Predicts the probabilities of ending in various outcome categories for a given input sequence.

        Parameters
        ----------
        input_sequence : tuple[str, ...]
            A tuple containing the categories that have been observed for an entity in the order they
            have been encountered. An empty tuple represents an entity with no observed categories.

        Returns
        -------
        dict
            A dictionary of categories and the probabilities that the input sequence will end in them.
        """
        input_sequence = self._ensure_tuple(input_sequence)

        if input_sequence is None or pd.isna(input_sequence):
            return self.weights.get(tuple(), {})

        # Return a direct lookup of probabilities if possible.
        if input_sequence in self.weights:
            return self.weights[input_sequence]

        # Otherwise, if the sequence has multiple elements, work back looking for a match
        while len(input_sequence) > 1:
            input_sequence_list = list(input_sequence)
            input_sequence = tuple(input_sequence_list[:-1])  # remove last element

            if input_sequence in self.weights:
                return self.weights[input_sequence]

        # If no relevant data is found:
        return self.weights.get(tuple(), {})
__repr__()

Return a string representation of the estimator.

Source code in src/patientflow/predictors/sequence_to_outcome_predictor.py
def __repr__(self):
    """Return a string representation of the estimator."""
    class_name = self.__class__.__name__
    return (
        f"{class_name}(\n"
        f"    input_var='{self.input_var}',\n"
        f"    grouping_var='{self.grouping_var}',\n"
        f"    outcome_var='{self.outcome_var}',\n"
        f"    apply_special_category_filtering={self.apply_special_category_filtering},\n"
        f"    admit_col='{self.admit_col}'\n"
        f")"
    )
fit(X)

Fits the predictor based on training data by computing the proportion of each input variable sequence ending in specific outcome variable categories.

Automatically preprocesses the data before fitting.

Parameters:

Name Type Description Default
X DataFrame

A pandas DataFrame containing at least the columns specified by input_var, grouping_var, and outcome_var.

required

Returns:

Name Type Description
self SequenceToOutcomePredictor

The fitted SequenceToOutcomePredictor model with calculated probabilities for each sequence.

Source code in src/patientflow/predictors/sequence_to_outcome_predictor.py
def fit(self, X: pd.DataFrame) -> "SequenceToOutcomePredictor":
    """
    Fits the predictor based on training data by computing the proportion of each input variable sequence
    ending in specific outcome variable categories.

    Automatically preprocesses the data before fitting.

    Parameters
    ----------
    X : pd.DataFrame
        A pandas DataFrame containing at least the columns specified by `input_var`, `grouping_var`, and `outcome_var`.

    Returns
    -------
    self : SequenceToOutcomePredictor
        The fitted SequenceToOutcomePredictor model with calculated probabilities for each sequence.
    """
    # Store metrics about the training data
    self.metrics["train_dttm"] = datetime.now().strftime("%Y-%m-%d %H:%M")
    self.metrics["train_set_no"] = len(X)
    if not X.empty:
        self.metrics["start_date"] = X["snapshot_date"].min()
        self.metrics["end_date"] = X["snapshot_date"].max()

    # Preprocess the data
    X = self._preprocess_data(X)

    # derive the names of the observed outcome variables from the data
    prop_keys = X[self.outcome_var].unique()

    # For each sequence count the number of observed categories
    X_grouped = (
        X.groupby(self.grouping_var)[self.outcome_var]
        .value_counts()
        .unstack(fill_value=0)
    )

    # Calculate the total number of times each grouping sequence occurred
    row_totals = X_grouped.sum(axis=1)

    # Calculate, for each grouping sequence, the proportion ending with each observed specialty
    proportions = X_grouped.div(row_totals, axis=0)

    # Calculate the probability of each grouping sequence occurring in the original data
    proportions["probability_of_grouping_sequence"] = row_totals / row_totals.sum()

    # Reweight probabilities of ending with each observed specialty
    # by the likelihood of each grouping sequence occurring
    for col in proportions.columns[
        :-1
    ]:  # Avoid the last column which is the 'probability_of_grouping_sequence'
        proportions[col] *= proportions["probability_of_grouping_sequence"]

    # Convert final sequence to a string in order to conduct string searches on it
    proportions["grouping_sequence_to_string"] = proportions.index.map(
        lambda x: "-".join(map(str, x))
    )

    # Row-wise function to return, for each input sequence,
    # the proportion that end up in each final sequence and thereby
    # the probability of it ending in any observed category
    proportions["prob_input_var_ends_in_observed_specialty"] = proportions[
        "grouping_sequence_to_string"
    ].apply(lambda x: self._string_match_input_var(x, proportions, prop_keys))

    # Convert the prob_input_var_ends_in_observed_specialty column to a dictionary
    result_dict = proportions["prob_input_var_ends_in_observed_specialty"].to_dict()

    # Clean the key to remove excess string quotes
    def clean_tuple_key(key):
        if isinstance(key, tuple):
            return tuple(
                ast.literal_eval(item)
                if item.startswith("'") and item.endswith("'")
                else item
                for item in key
            )
        return key

    cleaned_dict = {clean_tuple_key(k): v for k, v in result_dict.items()}

    # save prob_input_var_ends_in_observed_specialty as weights within the model
    self.weights = cleaned_dict

    # save the input to grouping probabilities for use as a reference
    self.input_to_grouping_probs = self._probability_of_input_to_grouping_sequence(
        X
    )

    return self
predict(input_sequence)

Predicts the probabilities of ending in various outcome categories for a given input sequence.

Parameters:

Name Type Description Default
input_sequence tuple[str, ...]

A tuple containing the categories that have been observed for an entity in the order they have been encountered. An empty tuple represents an entity with no observed categories.

required

Returns:

Type Description
dict

A dictionary of categories and the probabilities that the input sequence will end in them.

Source code in src/patientflow/predictors/sequence_to_outcome_predictor.py
def predict(self, input_sequence: tuple[str, ...]) -> Dict[str, float]:
    """
    Predicts the probabilities of ending in various outcome categories for a given input sequence.

    Parameters
    ----------
    input_sequence : tuple[str, ...]
        A tuple containing the categories that have been observed for an entity in the order they
        have been encountered. An empty tuple represents an entity with no observed categories.

    Returns
    -------
    dict
        A dictionary of categories and the probabilities that the input sequence will end in them.
    """
    input_sequence = self._ensure_tuple(input_sequence)

    if input_sequence is None or pd.isna(input_sequence):
        return self.weights.get(tuple(), {})

    # Return a direct lookup of probabilities if possible.
    if input_sequence in self.weights:
        return self.weights[input_sequence]

    # Otherwise, if the sequence has multiple elements, work back looking for a match
    while len(input_sequence) > 1:
        input_sequence_list = list(input_sequence)
        input_sequence = tuple(input_sequence_list[:-1])  # remove last element

        if input_sequence in self.weights:
            return self.weights[input_sequence]

    # If no relevant data is found:
    return self.weights.get(tuple(), {})
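
The end-to-end sketch below fits and queries the predictor on a toy DataFrame. It is illustrative only: the column names ("consult_sequence", "final_sequence", "specialty") and categories are hypothetical, and special-category filtering is disabled so the example does not depend on any site-specific columns.

import pandas as pd
from patientflow.predictors.sequence_to_outcome_predictor import (
    SequenceToOutcomePredictor,
)

# Toy training frame; column names and categories are illustrative only
df = pd.DataFrame({
    "snapshot_date": pd.to_datetime(["2024-01-01"] * 6),
    "consult_sequence": [("medical",), ("medical",), ("surgical",),
                         ("medical", "elderly"), ("surgical",), ()],
    "final_sequence": [("medical",), ("medical", "elderly"), ("surgical",),
                       ("medical", "elderly"), ("surgical",), ("medical",)],
    "specialty": ["medicine", "medicine", "surgery",
                  "medicine", "surgery", "medicine"],
    "is_admitted": [True] * 6,
})

model = SequenceToOutcomePredictor(
    input_var="consult_sequence",
    grouping_var="final_sequence",
    outcome_var="specialty",
    apply_special_category_filtering=False,  # avoid column-specific filtering
)
model.fit(df)

# Direct lookup for a sequence seen in training
print(model.predict(("medical",)))               # {'medicine': 1.0, 'surgery': 0.0}
# Unseen sequences fall back by truncating from the right
print(model.predict(("medical", "paediatric")))  # same result as ("medical",)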

value_to_outcome_predictor

This module implements a ValueToOutcomePredictor class that models and predicts the probability distribution of outcomes based on a single categorical input. The class builds a model based on training data, where input values are mapped to specific outcome categories through an intermediate grouping variable. It provides methods to fit the model, compute probabilities, and make predictions on unseen data.

Classes:

Name Description
ValueToOutcomePredictor : sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

A model that predicts the probability of ending in different outcome categories based on a single input value. Note: All inputs are expected to be strings. None values will be converted to empty strings during preprocessing.

ValueToOutcomePredictor

Bases: BaseEstimator, TransformerMixin

A class to model predictions for categorical data using a single input value and grouping variable. This class implements both the fit and predict methods from the parent sklearn classes.

Parameters:

Name Type Description Default
input_var str

Name of the column representing the input value in the DataFrame.

required
grouping_var str

Name of the column representing the grouping value in the DataFrame.

required
outcome_var str

Name of the column representing the outcome category in the DataFrame.

required
apply_special_category_filtering bool

Whether to filter out special categories of patients before fitting the model.

True
admit_col str

Name of the column indicating whether a patient was admitted.

'is_admitted'

Attributes:

Name Type Description
weights dict

A dictionary storing the probabilities of different input values leading to specific outcome categories.

input_to_grouping_probs DataFrame

A DataFrame that stores the computed probabilities of input values being associated with different grouping values.

special_params (dict, optional)

The special category parameters used for filtering, only populated if apply_special_category_filtering=True.

metrics dict

A dictionary to store metrics related to the training process.

Source code in src/patientflow/predictors/value_to_outcome_predictor.py
class ValueToOutcomePredictor(BaseEstimator, TransformerMixin):
    """
    A class to model predictions for categorical data using a single input value and grouping variable.
    This class implements both the `fit` and `predict` methods from the parent sklearn classes.

    Parameters
    ----------
    input_var : str
        Name of the column representing the input value in the DataFrame.
    grouping_var : str
        Name of the column representing the grouping value in the DataFrame.
    outcome_var : str
        Name of the column representing the outcome category in the DataFrame.
    apply_special_category_filtering : bool, default=True
        Whether to filter out special categories of patients before fitting the model.
    admit_col : str, default='is_admitted'
        Name of the column indicating whether a patient was admitted.

    Attributes
    ----------
    weights : dict
        A dictionary storing the probabilities of different input values leading to specific outcome categories.
    input_to_grouping_probs : pd.DataFrame
        A DataFrame that stores the computed probabilities of input values being associated with different grouping values.
    special_params : dict, optional
        The special category parameters used for filtering, only populated if apply_special_category_filtering=True.
    metrics : dict
        A dictionary to store metrics related to the training process.
    """

    def __init__(
        self,
        input_var,
        grouping_var,
        outcome_var,
        apply_special_category_filtering=True,
        admit_col="is_admitted",
    ):
        self.input_var = input_var
        self.grouping_var = grouping_var
        self.outcome_var = outcome_var
        self.apply_special_category_filtering = apply_special_category_filtering
        self.admit_col = admit_col
        self.weights = None
        self.special_params = None
        self.metrics = {}

    def __repr__(self):
        """Return a string representation of the estimator."""
        class_name = self.__class__.__name__
        return (
            f"{class_name}(\n"
            f"    input_var='{self.input_var}',\n"
            f"    grouping_var='{self.grouping_var}',\n"
            f"    outcome_var='{self.outcome_var}',\n"
            f"    apply_special_category_filtering={self.apply_special_category_filtering},\n"
            f"    admit_col='{self.admit_col}'\n"
            f")"
        )

    def _preprocess_data(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Preprocesses the input data before fitting the model.

        Steps include:
        1. Selecting only admitted patients with a non-null specialty
        2. Optionally filtering out special categories
        3. Converting input values to strings and handling nulls

        Parameters
        ----------
        X : pd.DataFrame
            DataFrame containing patient data.

        Returns
        -------
        pd.DataFrame
            Preprocessed DataFrame ready for model fitting.
        """
        # Make a copy to avoid modifying the original
        df = X.copy()

        # Step 1: Select only admitted patients with a non-null specialty
        if self.admit_col in df.columns:
            df = df[df[self.admit_col] & ~df[self.outcome_var].isnull()]

        # Step 2: Optionally apply filtering for special categories
        if self.apply_special_category_filtering:
            # Get configuration for categorizing patients based on columns
            self.special_params = create_special_category_objects(df.columns)

            # Extract function that identifies non-special category patients
            opposite_special_category_func = self.special_params["special_func_map"][
                "default"
            ]

            # Determine which category is the special category
            special_category_key = next(
                key
                for key, value in self.special_params["special_category_dict"].items()
                if value == 1.0
            )

            # Filter out special category patients
            df = df[
                df.apply(opposite_special_category_func, axis=1)
                & (df[self.outcome_var] != special_category_key)
            ]

        # Step 3: Convert input values to strings and handle nulls
        if self.input_var in df.columns:
            df[self.input_var] = df[self.input_var].fillna("").astype(str)

        if self.grouping_var in df.columns:
            df[self.grouping_var] = df[self.grouping_var].fillna("").astype(str)

        return df

    def fit(self, X: pd.DataFrame) -> "ValueToOutcomePredictor":
        """
        Fits the predictor based on training data by computing the proportion of each input value
        ending in specific outcome variable categories.

        Automatically preprocesses the data before fitting. During preprocessing, any null values in the
        input and grouping variables are converted to empty strings. These empty strings are then used
        as keys in the model's weights dictionary.

        Parameters
        ----------
        X : pd.DataFrame
            A pandas DataFrame containing at least the columns specified by `input_var`, `grouping_var`, and `outcome_var`.

        Returns
        -------
        self : ValueToOutcomePredictor
            The fitted ValueToOutcomePredictor model with calculated probabilities for each input value.
            The weights dictionary will contain an empty string key ('') for any null values from the input data.
        """

        # Store metrics about the training data
        self.metrics["train_dttm"] = datetime.now().strftime("%Y-%m-%d %H:%M")
        self.metrics["train_set_no"] = len(X)
        if not X.empty:
            self.metrics["start_date"] = X["snapshot_date"].min()
            self.metrics["end_date"] = X["snapshot_date"].max()

        # Preprocess the data
        X = self._preprocess_data(X)

        # For each grouping value count the number of observed categories
        X_grouped = (
            X.groupby(self.grouping_var)[self.outcome_var]
            .value_counts()
            .unstack(fill_value=0)
        )

        # Calculate the total number of times each grouping value occurred
        row_totals = X_grouped.sum(axis=1)

        # Calculate, for each grouping value, the proportion ending with each observed specialty
        proportions = X_grouped.div(row_totals, axis=0).fillna(0)

        # Calculate probabilities for each input value
        input_probs = {}
        for input_val in X[self.input_var].unique():
            # Get all grouping values associated with this input value
            grouping_vals = X[X[self.input_var] == input_val][
                self.grouping_var
            ].unique()

            # Calculate probability distribution of grouping values for this input value
            input_to_group_probs = X[X[self.input_var] == input_val][
                self.grouping_var
            ].value_counts(normalize=True)

            # Get the probability distribution of outcomes for all relevant grouping values
            # This includes all rows in proportions where the grouping value appears for this input
            group_to_outcome_probs = proportions.loc[grouping_vals]

            # Ensure the rows are aligned by reindexing group_to_outcome_probs
            aligned_group_to_outcome = group_to_outcome_probs.reindex(
                input_to_group_probs.index
            )

            # Create outer product matrix of probabilities:
            # - Rows represent grouping values
            # - Columns represent outcome categories
            # Each cell contains the joint probability of the grouping value and outcome
            input_to_outcome_probs = pd.DataFrame(
                input_to_group_probs.values.reshape(-1, 1)
                * aligned_group_to_outcome.values,
                index=input_to_group_probs.index,
                columns=group_to_outcome_probs.columns,
            )

            # Sum across grouping values to get final probability distribution for this input value
            input_probs[input_val] = input_to_outcome_probs.sum().to_dict()

        # Clean the keys to remove excess string quotes
        def clean_key(key):
            if isinstance(key, str):
                # Remove surrounding quotes if they exist
                if key.startswith("'") and key.endswith("'"):
                    return key[1:-1]
            return key

        # Note: cleaned_dict will contain an empty string key ('') for any null values from the input data
        # This is because null values are converted to empty strings during preprocessing
        cleaned_dict = {clean_key(k): v for k, v in input_probs.items()}

        # save probabilities as weights within the model
        self.weights = cleaned_dict

        # save the input to grouping probabilities for use as a reference
        self.input_to_grouping_probs = self._probability_of_input_to_grouping_value(X)

        return self

    def _probability_of_input_to_grouping_value(self, X):
        """
        Computes the probabilities of different input values leading to specific grouping values.

        Parameters
        ----------
        X : pd.DataFrame
            A pandas DataFrame containing at least the columns specified by `input_var` and `grouping_var`.

        Returns
        -------
        pd.DataFrame
            A DataFrame containing the probabilities of input values leading to grouping values.
        """
        # For each input value count the number of grouping values
        X_grouped = (
            X.groupby(self.input_var)[self.grouping_var]
            .value_counts()
            .unstack(fill_value=0)
        )

        # Calculate the total number of times each input value occurred
        row_totals = X_grouped.sum(axis=1)

        # Calculate, for each input value, the proportion ending with each grouping value
        proportions = X_grouped.div(row_totals, axis=0)

        # Calculate the probability of each input value occurring in the original data
        proportions["probability_of_input_value"] = row_totals / row_totals.sum()

        return proportions

    def predict(self, input_value: str) -> Dict[str, float]:
        """
        Predicts the probabilities of ending in various outcome categories for a given input value.

        Parameters
        ----------
        input_value : str
            The input value to predict outcomes for. None values will be handled appropriately.

        Returns
        -------
        dict
            A dictionary of categories and the probabilities that the input value will end in them.
        """
        if input_value is None or pd.isna(input_value):
            return self.weights.get("", {})

        # Convert input to string if it isn't already
        input_value = str(input_value)

        # Return a direct lookup of probabilities if possible
        if input_value in self.weights:
            return self.weights[input_value]

        # If no relevant data is found, return an empty distribution
        return self.weights.get(None, {})
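
The sketch below fits and queries the predictor on a toy DataFrame. It is illustrative only: the column names ("triage_category", "final_category", "specialty") and categories are hypothetical, and special-category filtering is disabled so the example does not depend on any site-specific columns.

import pandas as pd
from patientflow.predictors.value_to_outcome_predictor import (
    ValueToOutcomePredictor,
)

# Toy training frame; column names and categories are illustrative only
df = pd.DataFrame({
    "snapshot_date": pd.to_datetime(["2024-01-01"] * 5),
    "triage_category": ["majors", "majors", "resus", "minors", "majors"],
    "final_category": ["majors", "majors-obs", "resus", "minors", "majors"],
    "specialty": ["medicine", "medicine", "surgery", "medicine", "surgery"],
    "is_admitted": [True] * 5,
})

model = ValueToOutcomePredictor(
    input_var="triage_category",
    grouping_var="final_category",
    outcome_var="specialty",
    apply_special_category_filtering=False,  # avoid column-specific filtering
)
model.fit(df)

print(model.predict("majors"))  # e.g. {'medicine': 0.67, 'surgery': 0.33}
print(model.predict(None))      # {} here, as the training data contained no nulls
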
__repr__()

Return a string representation of the estimator.

Source code in src/patientflow/predictors/value_to_outcome_predictor.py
def __repr__(self):
    """Return a string representation of the estimator."""
    class_name = self.__class__.__name__
    return (
        f"{class_name}(\n"
        f"    input_var='{self.input_var}',\n"
        f"    grouping_var='{self.grouping_var}',\n"
        f"    outcome_var='{self.outcome_var}',\n"
        f"    apply_special_category_filtering={self.apply_special_category_filtering},\n"
        f"    admit_col='{self.admit_col}'\n"
        f")"
    )
fit(X)

Fits the predictor based on training data by computing the proportion of each input value ending in specific outcome variable categories.

Automatically preprocesses the data before fitting. During preprocessing, any null values in the input and grouping variables are converted to empty strings. These empty strings are then used as keys in the model's weights dictionary.

Parameters:

Name Type Description Default
X DataFrame

A pandas DataFrame containing at least the columns specified by input_var, grouping_var, and outcome_var.

required

Returns:

Name Type Description
self ValueToOutcomePredictor

The fitted ValueToOutcomePredictor model with calculated probabilities for each input value. The weights dictionary will contain an empty string key ('') for any null values from the input data.

Source code in src/patientflow/predictors/value_to_outcome_predictor.py
def fit(self, X: pd.DataFrame) -> "ValueToOutcomePredictor":
    """
    Fits the predictor based on training data by computing the proportion of each input value
    ending in specific outcome variable categories.

    Automatically preprocesses the data before fitting. During preprocessing, any null values in the
    input and grouping variables are converted to empty strings. These empty strings are then used
    as keys in the model's weights dictionary.

    Parameters
    ----------
    X : pd.DataFrame
        A pandas DataFrame containing at least the columns specified by `input_var`, `grouping_var`, and `outcome_var`.

    Returns
    -------
    self : ValueToOutcomePredictor
        The fitted ValueToOutcomePredictor model with calculated probabilities for each input value.
        The weights dictionary will contain an empty string key ('') for any null values from the input data.
    """

    # Store metrics about the training data
    self.metrics["train_dttm"] = datetime.now().strftime("%Y-%m-%d %H:%M")
    self.metrics["train_set_no"] = len(X)
    if not X.empty:
        self.metrics["start_date"] = X["snapshot_date"].min()
        self.metrics["end_date"] = X["snapshot_date"].max()

    # Preprocess the data
    X = self._preprocess_data(X)

    # For each grouping value count the number of observed categories
    X_grouped = (
        X.groupby(self.grouping_var)[self.outcome_var]
        .value_counts()
        .unstack(fill_value=0)
    )

    # Calculate the total number of times each grouping value occurred
    row_totals = X_grouped.sum(axis=1)

    # Calculate, for each grouping value, the proportion ending with each observed outcome category
    proportions = X_grouped.div(row_totals, axis=0).fillna(0)

    # Calculate probabilities for each input value
    input_probs = {}
    for input_val in X[self.input_var].unique():
        # Get all grouping values associated with this input value
        grouping_vals = X[X[self.input_var] == input_val][
            self.grouping_var
        ].unique()

        # Calculate probability distribution of grouping values for this input value
        input_to_group_probs = X[X[self.input_var] == input_val][
            self.grouping_var
        ].value_counts(normalize=True)

        # Get the probability distribution of outcomes for all relevant grouping values
        # This includes all rows in proportions where the grouping value appears for this input
        group_to_outcome_probs = proportions.loc[grouping_vals]

        # Ensure the rows are aligned by reindexing group_to_outcome_probs
        aligned_group_to_outcome = group_to_outcome_probs.reindex(
            input_to_group_probs.index
        )

        # Create outer product matrix of probabilities:
        # - Rows represent grouping values
        # - Columns represent outcome categories
        # Each cell contains the joint probability of the grouping value and outcome
        input_to_outcome_probs = pd.DataFrame(
            input_to_group_probs.values.reshape(-1, 1)
            * aligned_group_to_outcome.values,
            index=input_to_group_probs.index,
            columns=group_to_outcome_probs.columns,
        )

        # Sum across grouping values to get final probability distribution for this input value
        input_probs[input_val] = input_to_outcome_probs.sum().to_dict()

    # Clean the keys to remove excess string quotes
    def clean_key(key):
        if isinstance(key, str):
            # Remove surrounding quotes if they exist
            if key.startswith("'") and key.endswith("'"):
                return key[1:-1]
        return key

    # Note: cleaned_dict will contain an empty string key ('') for any null values from the input data
    # This is because null values are converted to empty strings during preprocessing
    cleaned_dict = {clean_key(k): v for k, v in input_probs.items()}

    # save probabilities as weights within the model
    self.weights = cleaned_dict

    # save the input to grouping probabilities for use as a reference
    self.input_to_grouping_probs = self._probability_of_input_to_grouping_value(X)

    return self
predict(input_value)

Predicts the probabilities of ending in various outcome categories for a given input value.

Parameters:

Name Type Description Default
input_value str

The input value to predict outcomes for. None or NaN inputs return the distribution stored under the empty-string key.

required

Returns:

Type Description
dict

A dictionary of categories and the probabilities that the input value will end in them.

Source code in src/patientflow/predictors/value_to_outcome_predictor.py
def predict(self, input_value: str) -> Dict[str, float]:
    """
    Predicts the probabilities of ending in various outcome categories for a given input value.

    Parameters
    ----------
    input_value : str
        The input value to predict outcomes for. None or NaN inputs return the distribution stored under the empty-string key.

    Returns
    -------
    dict
        A dictionary of categories and the probabilities that the input value will end in them.
    """
    if input_value is None or pd.isna(input_value):
        return self.weights.get("", {})

    # Convert input to string if it isn't already
    input_value = str(input_value)

    # Return a direct lookup of probabilities if possible
    if input_value in self.weights:
        return self.weights[input_value]

    # If the input value was not seen in training, return an empty distribution
    return self.weights.get(None, {})
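
A minimal usage sketch (hypothetical data and column names; the constructor arguments are inferred from the __repr__ shown above, so check the class signature before relying on them):

>>> import pandas as pd
>>> from patientflow.predictors.value_to_outcome_predictor import ValueToOutcomePredictor
>>> df = pd.DataFrame({
...     "snapshot_date": pd.to_datetime(["2031-01-01", "2031-01-01", "2031-01-02"]),
...     "consultation": ["acute", "acute", "paeds"],          # input_var
...     "final_sequence": ["acute", "acute", "paeds"],        # grouping_var
...     "specialty": ["medical", "surgical", "paediatric"],   # outcome_var
... })
>>> model = ValueToOutcomePredictor(
...     input_var="consultation",
...     grouping_var="final_sequence",
...     outcome_var="specialty",
... )
>>> model = model.fit(df)
>>> model.predict("acute")  # e.g. {'medical': 0.5, 'paediatric': 0.0, 'surgical': 0.5}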

prepare

Module for preparing data, loading models, and organizing snapshots for inference.

This module provides functionality to load a trained model, prepare data for making predictions, calculate arrival rates, and organize snapshot data. It allows for selecting one snapshot per visit, filtering snapshots by prediction time, and mapping snapshot dates to corresponding indices.

Functions:

Name Description
select_one_snapshot_per_visit

Selects one snapshot per visit based on a random number and returns the filtered DataFrame.

prepare_patient_snapshots

Filters the DataFrame by prediction time and optionally selects one snapshot per visit.

prepare_group_snapshot_dict

Prepares a dictionary mapping snapshot dates to their corresponding snapshot indices.

calculate_time_varying_arrival_rates

Calculates the time-varying arrival rates for a dataset indexed by datetime.

SpecialCategoryParams

A picklable implementation of special category parameters for patient classification.

This class identifies pediatric patients based on available age-related columns in the dataset and provides functions to categorise patients accordingly. It's designed to be serializable with pickle by implementing the __reduce__ method.

Parameters:

Name Type Description Default
columns list or Index

Column names from the dataset used to determine the appropriate age identification method

required

Attributes:

Name Type Description
columns list

List of column names from the dataset

method_type str

The method used for age detection ('age_on_arrival' or 'age_group')

special_category_dict dict

Default category values mapping

Raises:

Type Description
ValueError

If neither 'age_on_arrival' nor 'age_group' columns are found

Source code in src/patientflow/prepare.py
class SpecialCategoryParams:
    """A picklable implementation of special category parameters for patient classification.

    This class identifies pediatric patients based on available age-related columns
    in the dataset and provides functions to categorise patients accordingly.
    It's designed to be serializable with pickle by implementing the __reduce__ method.

    Parameters
    ----------
    columns : list or pandas.Index
        Column names from the dataset used to determine the appropriate age identification method

    Attributes
    ----------
    columns : list
        List of column names from the dataset
    method_type : str
        The method used for age detection ('age_on_arrival' or 'age_group')
    special_category_dict : dict
        Default category values mapping

    Raises
    ------
    ValueError
        If neither 'age_on_arrival' nor 'age_group' columns are found
    """

    def __init__(self, columns):
        """Initialize the SpecialCategoryParams object.

        Parameters
        ----------
        columns : list or pandas.Index
            Column names from the dataset used to determine the appropriate age identification method

        Raises
        ------
        ValueError
            If neither 'age_on_arrival' nor 'age_group' columns are found
        """
        self.columns = columns
        self.special_category_dict = {
            "medical": 0.0,
            "surgical": 0.0,
            "haem/onc": 0.0,
            "paediatric": 1.0,
        }

        if "age_on_arrival" in columns:
            self.method_type = "age_on_arrival"
        elif "age_group" in columns:
            self.method_type = "age_group"
        else:
            raise ValueError("Unknown data format: could not find expected age columns")

    def special_category_func(self, row: Union[dict, pd.Series]) -> bool:
        """Identify if a patient is pediatric based on age data.

        Parameters
        ----------
        row : Union[dict, pd.Series]
            A row of patient data containing either 'age_on_arrival' or 'age_group'

        Returns
        -------
        bool
            True if the patient is pediatric (age < 18 or age_group is '0-17'),
            False otherwise
        """
        if self.method_type == "age_on_arrival":
            return row["age_on_arrival"] < 18
        else:  # age_group
            return row["age_group"] == "0-17"

    def opposite_special_category_func(self, row: Union[dict, pd.Series]) -> bool:
        """Identify if a patient is NOT pediatric.

        Parameters
        ----------
        row : Union[dict, pd.Series]
            A row of patient data

        Returns
        -------
        bool
            True if the patient is NOT pediatric, False if they are pediatric
        """
        return not self.special_category_func(row)

    def get_params_dict(
        self,
    ) -> Dict[str, Union[Callable, Dict[str, float], Dict[str, Callable]]]:
        """Get the special parameter dictionary in the format expected by the SequencePredictor.

        Returns
        -------
        Dict[str, Union[Callable, Dict[str, float], Dict[str, Callable]]]
            A dictionary containing:
            - 'special_category_func': Function to identify pediatric patients
            - 'special_category_dict': Default category values (float)
            - 'special_func_map': Mapping of category names to detection functions
        """
        return {
            "special_category_func": self.special_category_func,
            "special_category_dict": self.special_category_dict,
            "special_func_map": {
                "paediatric": self.special_category_func,
                "default": self.opposite_special_category_func,
            },
        }

    def __reduce__(self) -> Tuple[Type["SpecialCategoryParams"], Tuple[list]]:
        """Support for pickle serialization.

        Returns
        -------
        Tuple[Type['SpecialCategoryParams'], Tuple[list]]
            A tuple containing:
            - The class itself (to be called as a function)
            - A tuple of arguments to pass to the class constructor
        """
        return (self.__class__, (self.columns,))
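
Because the class implements __reduce__, instances survive a pickle round trip. A brief sketch (hypothetical column list):

>>> import pickle
>>> params = SpecialCategoryParams(columns=["age_on_arrival", "specialty"])
>>> params.special_category_func({"age_on_arrival": 10})
True
>>> restored = pickle.loads(pickle.dumps(params))
>>> restored.method_type
'age_on_arrival'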

__init__(columns)

Initialize the SpecialCategoryParams object.

Parameters:

Name Type Description Default
columns list or Index

Column names from the dataset used to determine the appropriate age identification method

required

Raises:

Type Description
ValueError

If neither 'age_on_arrival' nor 'age_group' columns are found

Source code in src/patientflow/prepare.py
def __init__(self, columns):
    """Initialize the SpecialCategoryParams object.

    Parameters
    ----------
    columns : list or pandas.Index
        Column names from the dataset used to determine the appropriate age identification method

    Raises
    ------
    ValueError
        If neither 'age_on_arrival' nor 'age_group' columns are found
    """
    self.columns = columns
    self.special_category_dict = {
        "medical": 0.0,
        "surgical": 0.0,
        "haem/onc": 0.0,
        "paediatric": 1.0,
    }

    if "age_on_arrival" in columns:
        self.method_type = "age_on_arrival"
    elif "age_group" in columns:
        self.method_type = "age_group"
    else:
        raise ValueError("Unknown data format: could not find expected age columns")

__reduce__()

Support for pickle serialization.

Returns:

Type Description
Tuple[Type[SpecialCategoryParams], Tuple[list]]

A tuple containing:
- The class itself (to be called as a function)
- A tuple of arguments to pass to the class constructor

Source code in src/patientflow/prepare.py
def __reduce__(self) -> Tuple[Type["SpecialCategoryParams"], Tuple[list]]:
    """Support for pickle serialization.

    Returns
    -------
    Tuple[Type['SpecialCategoryParams'], Tuple[list]]
        A tuple containing:
        - The class itself (to be called as a function)
        - A tuple of arguments to pass to the class constructor
    """
    return (self.__class__, (self.columns,))

get_params_dict()

Get the special parameter dictionary in the format expected by the SequencePredictor.

Returns:

Type Description
Dict[str, Union[Callable, Dict[str, float], Dict[str, Callable]]]

A dictionary containing:
- 'special_category_func': Function to identify pediatric patients
- 'special_category_dict': Default category values (float)
- 'special_func_map': Mapping of category names to detection functions

Source code in src/patientflow/prepare.py
def get_params_dict(
    self,
) -> Dict[str, Union[Callable, Dict[str, float], Dict[str, Callable]]]:
    """Get the special parameter dictionary in the format expected by the SequencePredictor.

    Returns
    -------
    Dict[str, Union[Callable, Dict[str, float], Dict[str, Callable]]]
        A dictionary containing:
        - 'special_category_func': Function to identify pediatric patients
        - 'special_category_dict': Default category values (float)
        - 'special_func_map': Mapping of category names to detection functions
    """
    return {
        "special_category_func": self.special_category_func,
        "special_category_dict": self.special_category_dict,
        "special_func_map": {
            "paediatric": self.special_category_func,
            "default": self.opposite_special_category_func,
        },
    }

opposite_special_category_func(row)

Identify if a patient is NOT pediatric.

Parameters:

Name Type Description Default
row Union[dict, Series]

A row of patient data

required

Returns:

Type Description
bool

True if the patient is NOT pediatric, False if they are pediatric

Source code in src/patientflow/prepare.py
def opposite_special_category_func(self, row: Union[dict, pd.Series]) -> bool:
    """Identify if a patient is NOT pediatric.

    Parameters
    ----------
    row : Union[dict, pd.Series]
        A row of patient data

    Returns
    -------
    bool
        True if the patient is NOT pediatric, False if they are pediatric
    """
    return not self.special_category_func(row)

special_category_func(row)

Identify if a patient is pediatric based on age data.

Parameters:

Name Type Description Default
row Union[dict, Series]

A row of patient data containing either 'age_on_arrival' or 'age_group'

required

Returns:

Type Description
bool

True if the patient is pediatric (age < 18 or age_group is '0-17'), False otherwise

Source code in src/patientflow/prepare.py
def special_category_func(self, row: Union[dict, pd.Series]) -> bool:
    """Identify if a patient is pediatric based on age data.

    Parameters
    ----------
    row : Union[dict, pd.Series]
        A row of patient data containing either 'age_on_arrival' or 'age_group'

    Returns
    -------
    bool
        True if the patient is pediatric (age < 18 or age_group is '0-17'),
        False otherwise
    """
    if self.method_type == "age_on_arrival":
        return row["age_on_arrival"] < 18
    else:  # age_group
        return row["age_group"] == "0-17"

additional_details(column, col_name)

Generate additional statistical details about a column's contents.

Parameters:

Name Type Description Default
column Series

The column to analyze

required
col_name str

Name of the column (used for context)

required

Returns:

Type Description
str

A string containing statistical details about the column's contents, including:
- For dates: Date range
- For categorical data: Frequency of values
- For numeric data: Range, mean, standard deviation, and NA count
- For datetime: Date range with time

Source code in src/patientflow/prepare.py
def additional_details(column, col_name):
    """Generate additional statistical details about a column's contents.

    Parameters
    ----------
    column : pandas.Series
        The column to analyze
    col_name : str
        Name of the column (used for context)

    Returns
    -------
    str
        A string containing statistical details about the column's contents, including:
        - For dates: Date range
        - For categorical data: Frequency of values
        - For numeric data: Range, mean, standard deviation, and NA count
        - For datetime: Date range with time
    """

    def is_date(string):
        try:
            # Try to parse the string using the strptime method
            datetime.strptime(
                string, "%Y-%m-%d"
            )  # You can adjust the format to match your date format
            return True
        except (ValueError, TypeError):
            return False

    # Convert to datetime if it's an object but formatted as a date
    if column.dtype == "object" and all(
        is_date(str(x)) for x in column.dropna().unique()
    ):
        column = pd.to_datetime(column)
        return f"Date Range: {column.min().strftime('%Y-%m-%d')} - {column.max().strftime('%Y-%m-%d')}"

    if column.dtype in ["object", "category", "bool"]:
        # Categorical data: Frequency of unique values
        # Handle enum instances
        try:
            from enum import Enum

            if any(isinstance(x, Enum) for x in column.dropna().unique()):
                # Convert enum instances to their values for counting
                column = column.apply(lambda x: x.value if isinstance(x, Enum) else x)
        except ImportError:
            pass

        if len(column.value_counts()) <= 12:
            value_counts = column.value_counts(dropna=False).to_dict()
            value_counts = dict(sorted(value_counts.items(), key=lambda x: str(x[0])))
            value_counts_formatted = {k: f"{v:,}" for k, v in value_counts.items()}
            return f"Frequencies: {value_counts_formatted}"
        value_counts = column.value_counts(dropna=False)[0:12].to_dict()
        value_counts = dict(sorted(value_counts.items(), key=lambda x: str(x[0])))
        value_counts_formatted = {k: f"{v:,}" for k, v in value_counts.items()}
        return f"Frequencies (highest 12): {value_counts_formatted}"

    if pd.api.types.is_float_dtype(column):
        # Float data: Range with rounding
        na_count = column.isna().sum()
        column = column.dropna()
        return f"Range: {column.min():.2f} - {column.max():.2f},  Mean: {column.mean():.2f}, Std Dev: {column.std():.2f}, NA: {na_count}"
    if pd.api.types.is_integer_dtype(column):
        # Integer data: Range without rounding
        na_count = column.isna().sum()
        column = column.dropna()
        return f"Range: {column.min()} - {column.max()}, Mean: {column.mean():.2f}, Std Dev: {column.std():.2f}, NA: {na_count}"
    if pd.api.types.is_datetime64_any_dtype(column):
        # Datetime data: Minimum and Maximum dates
        return f"Date Range: {column.min().strftime('%Y-%m-%d %H:%M')} - {column.max().strftime('%Y-%m-%d %H:%M')}"
    else:
        return "N/A"

apply_set(row)

Randomly assign a set label based on weighted probabilities.

Parameters:

Name Type Description Default
row Series

Series containing 'training_set', 'validation_set', and 'test_set' weights

required

Returns:

Type Description
str

One of 'train', 'valid', or 'test' based on weighted random choice

Source code in src/patientflow/prepare.py
def apply_set(row: pd.Series) -> str:
    """Randomly assign a set label based on weighted probabilities.

    Parameters
    ----------
    row : pandas.Series
        Series containing 'training_set', 'validation_set', and 'test_set' weights

    Returns
    -------
    str
        One of 'train', 'valid', or 'test' based on weighted random choice
    """
    return random.choices(
        ["train", "valid", "test"],
        weights=[row.training_set, row.validation_set, row.test_set],
    )[0]
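
A brief sketch of the weighted draw (the result is random, so no fixed output is shown):

>>> import pandas as pd
>>> row = pd.Series({"training_set": 7, "validation_set": 3, "test_set": 0})
>>> apply_set(row)  # 'train' with probability 0.7, 'valid' with probability 0.3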

assign_patient_ids(df, start_training_set, start_validation_set, start_test_set, end_test_set, date_col='arrival_datetime', patient_id='mrn', visit_col='encounter', seed=42)

Probabilistically assign patient IDs to train/validation/test sets.

Parameters:

Name Type Description Default
df DataFrame

DataFrame with patient_id, encounter, and temporal columns

required
start_training_set date

Start date for training period

required
start_validation_set date

Start date for validation period

required
start_test_set date

Start date for test period

required
end_test_set date

End date for test period

required
date_col str

Column name for temporal splitting, by default "arrival_datetime"

'arrival_datetime'
patient_id str

Column name for patient identifier, by default "mrn"

'mrn'
visit_col str

Column name for visit identifier, by default "encounter"

'encounter'
seed int

Random seed for reproducible results, by default 42

42

Returns:

Type Description
DataFrame

DataFrame with patient ID assignments based on weighted random sampling

Notes

- Counts encounters in each time period per patient ID
- Randomly assigns each patient ID to one set, weighted by its temporal distribution
- A patient with 70% of encounters in training and 30% in validation has a 70% chance of training assignment
Source code in src/patientflow/prepare.py
def assign_patient_ids(
    df: pd.DataFrame,
    start_training_set: date,
    start_validation_set: date,
    start_test_set: date,
    end_test_set: date,
    date_col: str = "arrival_datetime",
    patient_id: str = "mrn",
    visit_col: str = "encounter",
    seed: int = 42,
) -> pd.DataFrame:
    """Probabilistically assign patient IDs to train/validation/test sets.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame with patient_id, encounter, and temporal columns
    start_training_set : datetime.date
        Start date for training period
    start_validation_set : datetime.date
        Start date for validation period
    start_test_set : datetime.date
        Start date for test period
    end_test_set : datetime.date
        End date for test period
    date_col : str, optional
        Column name for temporal splitting, by default "arrival_datetime"
    patient_id : str, optional
        Column name for patient identifier, by default "mrn"
    visit_col : str, optional
        Column name for visit identifier, by default "encounter"
    seed : int, optional
        Random seed for reproducible results, by default 42

    Returns
    -------
    pandas.DataFrame
        DataFrame with patient ID assignments based on weighted random sampling

    Notes
    -----
    - Counts encounters in each time period per patient ID
    - Randomly assigns each patient ID to one set, weighted by their temporal distribution
    - Patient with 70% encounters in training, 30% in validation has 70% chance of training assignment
    """
    # Set random seed for reproducibility
    random.seed(seed)

    patients: pd.DataFrame = (
        df.groupby([patient_id, visit_col])[date_col].max().reset_index()
    )

    # Handle date_col as string, datetime, or date type
    if pd.api.types.is_datetime64_any_dtype(patients[date_col]):
        # Already datetime, extract date if needed
        if hasattr(patients[date_col].iloc[0], "date"):
            date_series = patients[date_col].dt.date
        else:
            # Already date type
            date_series = patients[date_col]
    else:
        # Try to convert string to datetime
        try:
            patients[date_col] = pd.to_datetime(patients[date_col])
            date_series = patients[date_col].dt.date
        except (TypeError, ValueError) as e:
            raise ValueError(
                f"Could not convert column '{date_col}' to datetime format: {str(e)}"
            )

    # Filter out patient IDs outside temporal bounds
    pre_training_patients = patients[date_series < start_training_set]
    post_test_patients = patients[date_series >= end_test_set]

    if len(pre_training_patients) > 0:
        print(
            f"Filtered out {len(pre_training_patients)} patients with only pre-training visits"
        )
    if len(post_test_patients) > 0:
        print(
            f"Filtered out {len(post_test_patients)} patients with only post-test visits"
        )

    valid_patients = patients[
        (date_series >= start_training_set) & (date_series < end_test_set)
    ]
    patients = valid_patients

    # Use the date_series for set assignment
    patients["training_set"] = (date_series >= start_training_set) & (
        date_series < start_validation_set
    )
    patients["validation_set"] = (date_series >= start_validation_set) & (
        date_series < start_test_set
    )
    patients["test_set"] = (date_series >= start_test_set) & (
        date_series < end_test_set
    )

    patients = patients.groupby(patient_id)[
        ["training_set", "validation_set", "test_set"]
    ].sum()
    patients["training_validation_test"] = patients.apply(apply_set, axis=1)

    print(
        f"\nPatient Set Overlaps (before random assignment):"
        f"\nTrain-Valid: {patients[patients.training_set * patients.validation_set != 0].shape[0]} of {patients[patients.training_set + patients.validation_set > 0].shape[0]}"
        f"\nValid-Test: {patients[patients.validation_set * patients.test_set != 0].shape[0]} of {patients[patients.validation_set + patients.test_set > 0].shape[0]}"
        f"\nTrain-Test: {patients[patients.training_set * patients.test_set != 0].shape[0]} of {patients[patients.training_set + patients.test_set > 0].shape[0]}"
        f"\nAll Sets: {patients[patients.training_set * patients.validation_set * patients.test_set != 0].shape[0]} of {patients.shape[0]} total patients"
    )

    return patients
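
A hedged sketch of a typical call, assuming df is a hypothetical frame holding the default 'mrn', 'encounter' and 'arrival_datetime' columns:

>>> from datetime import date
>>> assignments = assign_patient_ids(
...     df,
...     start_training_set=date(2031, 1, 1),
...     start_validation_set=date(2031, 7, 1),
...     start_test_set=date(2031, 9, 1),
...     end_test_set=date(2032, 1, 1),
... )
>>> assignments["training_validation_test"].value_counts()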

convert_dict_to_values(df, column, prefix)

Convert a column containing dictionaries into separate columns.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing the dictionary column

required
column str

Name of the column containing dictionaries to convert

required
prefix str

Prefix to use for the new column names

required

Returns:

Type Description
DataFrame

DataFrame containing separate columns for each dictionary key, with values extracted from 'value_as_real' or 'value_as_text' if present

Source code in src/patientflow/prepare.py
def convert_dict_to_values(df, column, prefix):
    """Convert a column containing dictionaries into separate columns.

    Parameters
    ----------
    df : pandas.DataFrame
        Input DataFrame containing the dictionary column
    column : str
        Name of the column containing dictionaries to convert
    prefix : str
        Prefix to use for the new column names

    Returns
    -------
    pandas.DataFrame
        DataFrame containing separate columns for each dictionary key,
        with values extracted from 'value_as_real' or 'value_as_text' if present
    """

    def extract_relevant_value(d):
        if isinstance(d, dict):
            if "value_as_real" in d or "value_as_text" in d:
                return (
                    d.get("value_as_real")
                    if d.get("value_as_real") is not None
                    else d.get("value_as_text")
                )
            else:
                return d  # Return the dictionary as is if it does not contain 'value_as_real' or 'value_as_text'
        return d  # Return the value as is if it is not a dictionary

    # Apply the extraction function to each entry in the dictionary column
    extracted_values = df[column].apply(
        lambda x: {k: extract_relevant_value(v) for k, v in x.items()}
    )

    # Create a DataFrame from the processed dictionary column
    dict_df = extracted_values.apply(pd.Series)

    # Add a prefix to the column names
    dict_df.columns = [f"{prefix}_{col}" for col in dict_df.columns]

    return dict_df
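
A minimal sketch with a hypothetical observations column:

>>> import pandas as pd
>>> df = pd.DataFrame({"obs": [
...     {"heart_rate": {"value_as_real": 82.0}, "triage": {"value_as_text": "urgent"}},
... ]})
>>> out = convert_dict_to_values(df, "obs", "latest_obs")
>>> list(out.columns)
['latest_obs_heart_rate', 'latest_obs_triage']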

convert_set_to_dummies(df, column, prefix)

Convert a column containing sets into dummy variables.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing the set column

required
column str

Name of the column containing sets to convert

required
prefix str

Prefix to use for the dummy variable column names

required

Returns:

Type Description
DataFrame

DataFrame containing dummy variables for each unique item in the sets

Source code in src/patientflow/prepare.py
def convert_set_to_dummies(df, column, prefix):
    """Convert a column containing sets into dummy variables.

    Parameters
    ----------
    df : pandas.DataFrame
        Input DataFrame containing the set column
    column : str
        Name of the column containing sets to convert
    prefix : str
        Prefix to use for the dummy variable column names

    Returns
    -------
    pandas.DataFrame
        DataFrame containing dummy variables for each unique item in the sets
    """
    # Explode the set into rows
    exploded_df = df[column].explode().dropna().to_frame()

    # Create dummy variables for each unique item with a specified prefix
    dummies = pd.get_dummies(exploded_df[column], prefix=prefix)

    # Sum the dummies back to the original DataFrame's index
    dummies = dummies.groupby(dummies.index).sum()

    # Convert dummy variables to boolean
    dummies = dummies.astype(bool)

    return dummies
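
A minimal sketch with a hypothetical set-valued column:

>>> import pandas as pd
>>> df = pd.DataFrame({"labs": [{"FBC", "U&E"}, {"FBC"}]})
>>> dummies = convert_set_to_dummies(df, "labs", "lab_orders")
>>> sorted(dummies.columns)
['lab_orders_FBC', 'lab_orders_U&E']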

create_special_category_objects(columns)

Create a configuration for categorising patients with special handling for pediatric cases.

Parameters:

Name Type Description Default
columns list or Index

The column names available in the dataset. Used to determine which age format is present.

required

Returns:

Type Description
dict

A dictionary containing special category configuration with:
- 'special_category_func': Function to identify pediatric patients
- 'special_category_dict': Default category values
- 'special_func_map': Mapping of category names to detection functions

Source code in src/patientflow/prepare.py
def create_special_category_objects(columns):
    """Create a configuration for categorising patients with special handling for pediatric cases.

    Parameters
    ----------
    columns : list or pandas.Index
        The column names available in the dataset. Used to determine which age format is present.

    Returns
    -------
    dict
        A dictionary containing special category configuration with:
        - 'special_category_func': Function to identify pediatric patients
        - 'special_category_dict': Default category values
        - 'special_func_map': Mapping of category names to detection functions
    """
    # Create the class instance and return its parameter dictionary
    params_obj = SpecialCategoryParams(columns)
    return params_obj.get_params_dict()
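
A quick sketch using an age_group column:

>>> params = create_special_category_objects(["age_group", "specialty"])
>>> params["special_category_func"]({"age_group": "0-17"})
True
>>> sorted(params["special_func_map"])
['default', 'paediatric']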

create_temporal_splits(df, start_train, start_valid, start_test, end_test, col_name='arrival_datetime', patient_id='mrn', visit_col='encounter', seed=42)

Split dataset into temporal train/validation/test sets.

Parameters:

Name Type Description Default
df DataFrame

Input dataframe

required
start_train date

Training start (inclusive)

required
start_valid date

Validation start (inclusive)

required
start_test date

Test start (inclusive)

required
end_test date

Test end (exclusive)

required
col_name str

Primary datetime column for splitting, by default "arrival_datetime"

'arrival_datetime'
patient_id str

Column name for patient identifier, by default "mrn"

'mrn'
visit_col str

Column name for visit identifier, by default "encounter"

'encounter'
seed int

Random seed for reproducible results, by default 42

42

Returns:

Type Description
Tuple[DataFrame, DataFrame, DataFrame]

Tuple containing (train_df, valid_df, test_df) split dataframes

Notes

Creates temporal data splits using primary datetime column and optional snapshot dates. Handles patient ID grouping if present to prevent data leakage.

Source code in src/patientflow/prepare.py
def create_temporal_splits(
    df: pd.DataFrame,
    start_train: date,
    start_valid: date,
    start_test: date,
    end_test: date,
    col_name: str = "arrival_datetime",
    patient_id: str = "mrn",
    visit_col: str = "encounter",
    seed: int = 42,
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Split dataset into temporal train/validation/test sets.

    Parameters
    ----------
    df : pandas.DataFrame
        Input dataframe
    start_train : datetime.date
        Training start (inclusive)
    start_valid : datetime.date
        Validation start (inclusive)
    start_test : datetime.date
        Test start (inclusive)
    end_test : datetime.date
        Test end (exclusive)
    col_name : str, optional
        Primary datetime column for splitting, by default "arrival_datetime"
    patient_id : str, optional
        Column name for patient identifier, by default "mrn"
    visit_col : str, optional
        Column name for visit identifier, by default "encounter"
    seed : int, optional
        Random seed for reproducible results, by default 42

    Returns
    -------
    Tuple[pandas.DataFrame, pandas.DataFrame, pandas.DataFrame]
        Tuple containing (train_df, valid_df, test_df) split dataframes

    Notes
    -----
    Creates temporal data splits using primary datetime column and optional snapshot dates.
    Handles patient ID grouping if present to prevent data leakage.
    """

    def get_date_value(series: pd.Series) -> pd.Series:
        """Convert timestamp or date column to date, handling both types.

        Parameters
        ----------
        series : pandas.Series
            Series containing datetime or date values

        Returns
        -------
        pandas.Series
            Series with date values
        """
        try:
            return pd.to_datetime(series).dt.date
        except (AttributeError, TypeError):
            return series

    if patient_id in df.columns:
        set_assignment: pd.DataFrame = assign_patient_ids(
            df,
            start_train,
            start_valid,
            start_test,
            end_test,
            col_name,
            patient_id,
            visit_col,
            seed=seed,
        )
        patient_sets: Dict[str, Set] = {
            k: set(set_assignment[set_assignment.training_validation_test == v].index)
            for k, v in {"train": "train", "valid": "valid", "test": "test"}.items()
        }

    splits: List[pd.DataFrame] = []
    for start, end, set_key in [
        (start_train, start_valid, "train"),
        (start_valid, start_test, "valid"),
        (start_test, end_test, "test"),
    ]:
        mask = (get_date_value(df[col_name]) >= start) & (
            get_date_value(df[col_name]) < end
        )

        if "snapshot_date" in df.columns:
            mask &= (get_date_value(df.snapshot_date) >= start) & (
                get_date_value(df.snapshot_date) < end
            )

        if patient_id in df.columns:
            mask &= df[patient_id].isin(patient_sets[set_key])

        splits.append(df[mask].copy())

    print(f"Split sizes: {[len(split) for split in splits]}")
    return tuple(splits)
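
A hedged sketch of a call (df is a hypothetical frame with the default 'arrival_datetime', 'mrn' and 'encounter' columns; the function prints the split sizes as a side effect):

>>> from datetime import date
>>> train_df, valid_df, test_df = create_temporal_splits(
...     df,
...     start_train=date(2031, 1, 1),
...     start_valid=date(2031, 7, 1),
...     start_test=date(2031, 9, 1),
...     end_test=date(2032, 1, 1),
... )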

create_yta_filters(df)

Create specialty filters for categorizing patients by specialty and age group.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing patient data with columns that include either 'age_on_arrival' or 'age_group' for pediatric classification

required

Returns:

Type Description
dict

A dictionary mapping specialty names to filter configurations. Each configuration contains:
- For pediatric specialty: {"is_child": True}
- For other specialties: {"specialty": specialty_name, "is_child": False}

Examples:

>>> df = pd.DataFrame({'patient_id': [1, 2], 'age_on_arrival': [10, 40]})
>>> filters = create_yta_filters(df)
>>> print(filters['paediatric'])
{'is_child': True}
>>> print(filters['medical'])
{'specialty': 'medical', 'is_child': False}
Source code in src/patientflow/prepare.py
def create_yta_filters(df):
    """Create specialty filters for categorizing patients by specialty and age group.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing patient data with columns that include either
        'age_on_arrival' or 'age_group' for pediatric classification

    Returns
    -------
    dict
        A dictionary mapping specialty names to filter configurations.
        Each configuration contains:
        - For pediatric specialty: {"is_child": True}
        - For other specialties: {"specialty": specialty_name, "is_child": False}

    Examples
    --------
    >>> df = pd.DataFrame({'patient_id': [1, 2], 'age_on_arrival': [10, 40]})
    >>> filters = create_yta_filters(df)
    >>> print(filters['paediatric'])
    {'is_child': True}
    >>> print(filters['medical'])
    {'specialty': 'medical', 'is_child': False}
    """
    # Get the special category parameters using the picklable implementation
    special_params = create_special_category_objects(df.columns)

    # Extract necessary data from the special_params
    special_category_dict = special_params["special_category_dict"]

    # Create the specialty_filters dictionary
    specialty_filters = {}

    for specialty, is_paediatric_flag in special_category_dict.items():
        if is_paediatric_flag == 1.0:
            # For the paediatric specialty, set `is_child` to True
            specialty_filters[specialty] = {"is_child": True}
        else:
            # For other specialties, set `is_child` to False
            specialty_filters[specialty] = {"specialty": specialty, "is_child": False}

    return specialty_filters

find_group_for_colname(column, dict_col_groups)

Find the group name that a column belongs to in the column groups dictionary.

Parameters:

Name Type Description Default
column str

Name of the column to find the group for

required
dict_col_groups dict

Dictionary mapping group names to lists of column names

required

Returns:

Type Description
str or None

The name of the group the column belongs to, or None if not found

Source code in src/patientflow/prepare.py
def find_group_for_colname(column, dict_col_groups):
    """Find the group name that a column belongs to in the column groups dictionary.

    Parameters
    ----------
    column : str
        Name of the column to find the group for
    dict_col_groups : dict
        Dictionary mapping group names to lists of column names

    Returns
    -------
    str or None
        The name of the group the column belongs to, or None if not found
    """
    for key, values_list in dict_col_groups.items():
        if column in values_list:
            return key
    return None
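
A minimal sketch:

>>> groups = {"observations": ["latest_obs_heart_rate"], "labs": ["latest_lab_fbc"]}
>>> find_group_for_colname("latest_obs_heart_rate", groups)
'observations'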

generate_description(col_name)

Generate a description for a column based on its name and manual descriptions.

Parameters:

Name Type Description Default
col_name str

Name of the column to generate a description for

required

Returns:

Type Description
str

A descriptive string explaining the column's purpose and content

Source code in src/patientflow/prepare.py
def generate_description(col_name):
    """Generate a description for a column based on its name and manual descriptions.

    Parameters
    ----------
    col_name : str
        Name of the column to generate a description for

    Returns
    -------
    str
        A descriptive string explaining the column's purpose and content
    """
    manual_descriptions = get_manual_descriptions()

    # Check if manual description is provided
    if col_name in manual_descriptions:
        return manual_descriptions[col_name]

    if (
        col_name.startswith("num")
        and not col_name.startswith("num_obs")
        and not col_name.startswith("num_orders")
    ):
        return "Number of times " + col_name[4:] + " has been recorded"
    if col_name.startswith("num_obs"):
        return "Number of observations of " + col_name[8:]
    if col_name.startswith("latest_obs"):
        return "Latest result for " + col_name[11:]
    if col_name.startswith("latest_lab"):
        return "Latest result for " + col_name[19:]
    if col_name.startswith("lab_orders"):
        return "Request for lab battery " + col_name[11:] + " has been placed"
    if col_name.startswith("visited"):
        return "Patient visited " + col_name[8:] + " previously or is there now"
    else:
        return col_name
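
A quick sketch of the prefix rules (assuming no manual description overrides these names):

>>> generate_description("num_obs_heart_rate")  # 'Number of observations of heart_rate'
>>> generate_description("lab_orders_fbc")      # 'Request for lab battery fbc has been placed'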

prepare_group_snapshot_dict(df, start_dt=None, end_dt=None)

Prepare a dictionary mapping snapshot dates to their corresponding snapshot indices.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing at least a 'snapshot_date' column

required
start_dt date

Start date for filtering snapshots, by default None

None
end_dt date

End date for filtering snapshots, by default None

None

Returns:

Type Description
dict

A dictionary where:
- Keys are dates
- Values are arrays of indices corresponding to each date's snapshots
- Empty arrays for dates with no snapshots (if start_dt and end_dt are provided)

Raises:

Type Description
ValueError

If 'snapshot_date' column is not present in the DataFrame

Source code in src/patientflow/prepare.py
def prepare_group_snapshot_dict(df, start_dt=None, end_dt=None):
    """Prepare a dictionary mapping snapshot dates to their corresponding snapshot indices.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing at least a 'snapshot_date' column
    start_dt : datetime.date, optional
        Start date for filtering snapshots, by default None
    end_dt : datetime.date, optional
        End date for filtering snapshots, by default None

    Returns
    -------
    dict
        A dictionary where:
        - Keys are dates
        - Values are arrays of indices corresponding to each date's snapshots
        - Empty arrays for dates with no snapshots (if start_dt and end_dt are provided)

    Raises
    ------
    ValueError
        If 'snapshot_date' column is not present in the DataFrame
    """
    # Ensure 'snapshot_date' is in the DataFrame
    if "snapshot_date" not in df.columns:
        raise ValueError("DataFrame must include a 'snapshot_date' column")

    # Filter DataFrame to date range if provided
    filtered_df = df.copy()
    if start_dt and end_dt:
        filtered_df = df[
            (df["snapshot_date"] >= start_dt) & (df["snapshot_date"] < end_dt)
        ]

    # Group the DataFrame by 'snapshot_date' and collect the indices for each group
    snapshots_dict = {
        date: group.index.tolist()
        for date, group in filtered_df.groupby("snapshot_date")
    }

    # If start_dt and end_dt are specified, add any missing keys from prediction_dates
    if start_dt:
        prediction_dates = pd.date_range(
            start=start_dt, end=end_dt, freq="D"
        ).date.tolist()[:-1]
        for dt in prediction_dates:
            if dt not in snapshots_dict:
                snapshots_dict[dt] = []

    return snapshots_dict
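
A minimal sketch:

>>> import pandas as pd
>>> from datetime import date
>>> df = pd.DataFrame({"snapshot_date": [date(2031, 3, 1), date(2031, 3, 1), date(2031, 3, 2)]})
>>> snapshots = prepare_group_snapshot_dict(df)
>>> snapshots[date(2031, 3, 1)]
[0, 1]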

prepare_patient_snapshots(df, prediction_time, exclude_columns=[], single_snapshot_per_visit=True, visit_col=None, label_col='is_admitted')

Prepare patient snapshots for model training or prediction.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing patient visit data

required
prediction_time str or datetime

The specific prediction time to filter for

required
exclude_columns list

List of columns to exclude from the final DataFrame, by default []

[]
single_snapshot_per_visit bool

Whether to select only one snapshot per visit, by default True

True
visit_col str

Name of the column containing visit identifiers, required if single_snapshot_per_visit is True

None
label_col str

Name of the column containing the target labels, by default "is_admitted"

'is_admitted'

Returns:

Type Description
Tuple[DataFrame, Series]

A tuple containing:
- DataFrame: Processed DataFrame with features
- Series: Corresponding labels

Raises:

Type Description
ValueError

If single_snapshot_per_visit is True but visit_col is not provided

Source code in src/patientflow/prepare.py
def prepare_patient_snapshots(
    df,
    prediction_time,
    exclude_columns=[],
    single_snapshot_per_visit=True,
    visit_col=None,
    label_col="is_admitted",
) -> Tuple[pd.DataFrame, pd.Series]:
    """Prepare patient snapshots for model training or prediction.

    Parameters
    ----------
    df : pandas.DataFrame
        Input DataFrame containing patient visit data
    prediction_time : str or datetime
        The specific prediction time to filter for
    exclude_columns : list, optional
        List of columns to exclude from the final DataFrame, by default []
    single_snapshot_per_visit : bool, optional
        Whether to select only one snapshot per visit, by default True
    visit_col : str, optional
        Name of the column containing visit identifiers, required if single_snapshot_per_visit is True
    label_col : str, optional
        Name of the column containing the target labels, by default "is_admitted"

    Returns
    -------
    Tuple[pandas.DataFrame, pandas.Series]
        A tuple containing:
        - DataFrame: Processed DataFrame with features
        - Series: Corresponding labels

    Raises
    ------
    ValueError
        If single_snapshot_per_visit is True but visit_col is not provided
    """
    if single_snapshot_per_visit and visit_col is None:
        raise ValueError(
            "visit_col must be provided when single_snapshot_per_visit is True"
        )

    # Filter by the time of day while keeping the original index
    df_tod = df[df["prediction_time"] == prediction_time].copy()

    if single_snapshot_per_visit:
        # Select one row for each visit
        df_single = select_one_snapshot_per_visit(df_tod, visit_col)
        # Create label array with the same index
        y = df_single.pop(label_col).astype(int)
        # Drop specified columns and ensure we do not reset the index
        df_single.drop(columns=exclude_columns, inplace=True)
        return df_single, y
    else:
        # Directly modify df_tod without resetting the index
        df_tod.drop(
            columns=["random_number"] + exclude_columns, inplace=True, errors="ignore"
        )
        y = df_tod.pop(label_col).astype(int)
        return df_tod, y
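
A hedged sketch (df is hypothetical; the prediction_time value must match whatever format the 'prediction_time' column stores, shown here as an (hour, minute) tuple):

>>> X, y = prepare_patient_snapshots(
...     df,
...     prediction_time=(9, 30),
...     exclude_columns=["snapshot_date"],
...     visit_col="visit_number",
... )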

select_one_snapshot_per_visit(df, visit_col, seed=42)

Select one random snapshot per visit from a DataFrame.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing visit snapshots

required
visit_col str

Name of the column containing visit identifiers

required
seed int

Random seed for reproducibility, by default 42

42

Returns:

Type Description
DataFrame

DataFrame containing one randomly selected snapshot per visit

Source code in src/patientflow/prepare.py
def select_one_snapshot_per_visit(df, visit_col, seed=42):
    """Select one random snapshot per visit from a DataFrame.

    Parameters
    ----------
    df : pandas.DataFrame
        Input DataFrame containing visit snapshots
    visit_col : str
        Name of the column containing visit identifiers
    seed : int, optional
        Random seed for reproducibility, by default 42

    Returns
    -------
    pandas.DataFrame
        DataFrame containing one randomly selected snapshot per visit
    """
    # Generate random numbers if not present
    if "random_number" not in df.columns:
        if seed is not None:
            np.random.seed(seed)
        df["random_number"] = np.random.random(size=len(df))

    # Select the row with the maximum random_number for each visit
    max_indices = df.groupby(visit_col)["random_number"].idxmax()
    return df.loc[max_indices].drop(columns=["random_number"])
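
A minimal sketch; note that the function adds a temporary random_number column to the input frame if one is not already present:

>>> import pandas as pd
>>> df = pd.DataFrame({"visit_number": [101, 101, 102], "obs": [1, 2, 3]})
>>> single = select_one_snapshot_per_visit(df, visit_col="visit_number")
>>> len(single)
2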

validate_special_category_objects(special_params)

Validate that a special category parameters dictionary contains all required keys.

Parameters:

Name Type Description Default
special_params Dict[str, Any]

Dictionary of special category parameters to validate

required

Raises:

Type Description
MissingKeysError

If any required keys are missing from the dictionary

Source code in src/patientflow/prepare.py
def validate_special_category_objects(special_params: Dict[str, Any]) -> None:
    """Validate that a special category parameters dictionary contains all required keys.

    Parameters
    ----------
    special_params : Dict[str, Any]
        Dictionary of special category parameters to validate

    Raises
    ------
    MissingKeysError
        If any required keys are missing from the dictionary
    """
    required_keys = [
        "special_category_func",
        "special_category_dict",
        "special_func_map",
    ]
    missing_keys = [key for key in required_keys if key not in special_params]

    if missing_keys:
        raise MissingKeysError(missing_keys)
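
A quick sketch pairing this check with create_special_category_objects:

>>> params = create_special_category_objects(["age_on_arrival"])
>>> validate_special_category_objects(params)  # returns None; raises MissingKeysError if a key is missing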

write_data_dict(df, dict_name, dict_path)

Write a data dictionary for a DataFrame to both Markdown and CSV formats.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame to create a data dictionary for

required
dict_name str

Base name for the output files (without extension)

required
dict_path str or Path

Directory path where the data dictionary files will be written

required

Returns:

Type Description
DataFrame

The created data dictionary as a DataFrame

Notes

Creates two files:
- {dict_name}.md: Markdown format data dictionary
- {dict_name}.csv: CSV format data dictionary

For visit data, includes separate statistics for admitted and non-admitted patients.

Source code in src/patientflow/prepare.py
def write_data_dict(df, dict_name, dict_path):
    """Write a data dictionary for a DataFrame to both Markdown and CSV formats.

    Parameters
    ----------
    df : pandas.DataFrame
        Input DataFrame to create a data dictionary for
    dict_name : str
        Base name for the output files (without extension)
    dict_path : str or pathlib.Path
        Directory path where the data dictionary files will be written

    Returns
    -------
    pandas.DataFrame
        The created data dictionary as a DataFrame

    Notes
    -----
    Creates two files:
    - {dict_name}.md: Markdown format data dictionary
    - {dict_name}.csv: CSV format data dictionary

    For visit data, includes separate statistics for admitted and non-admitted patients.
    """
    cols_to_exclude = ["snapshot_id", "visit_number"]

    df = df.copy(deep=True)

    if "visits" in dict_name:
        df.consultation_sequence = df.consultation_sequence.apply(
            lambda x: str(x)
        ).to_frame()
        df.final_sequence = df.final_sequence.apply(lambda x: str(x)).to_frame()
        df_admitted = df[df.is_admitted]
        df_not_admitted = df[~df.is_admitted]
        dict_col_groups = get_dict_cols(df)

        data_dict = pd.DataFrame(
            {
                "Variable type": [
                    find_group_for_colname(col, dict_col_groups) for col in df.columns
                ],
                "Column Name": df.columns,
                "Data Type": df.dtypes,
                "Description": [generate_description(col) for col in df.columns],
                "Whole dataset": [
                    additional_details(df[col], col)
                    if col not in cols_to_exclude
                    else ""
                    for col in df.columns
                ],
                "Admitted": [
                    additional_details(df_admitted[col], col)
                    if col not in cols_to_exclude
                    else ""
                    for col in df_admitted.columns
                ],
                "Not admitted": [
                    additional_details(df_not_admitted[col], col)
                    if col not in cols_to_exclude
                    else ""
                    for col in df_not_admitted.columns
                ],
            }
        )
        data_dict["Whole dataset"] = data_dict["Whole dataset"].str.replace("'", "")
        data_dict["Admitted"] = data_dict["Admitted"].str.replace("'", "")
        data_dict["Not admitted"] = data_dict["Not admitted"].str.replace("'", "")

    else:
        data_dict = pd.DataFrame(
            {
                "Column Name": df.columns,
                "Data Type": df.dtypes,
                "Description": [generate_description(col) for col in df.columns],
                "Additional Details": [
                    additional_details(df[col], col)
                    if col not in cols_to_exclude
                    else ""
                    for col in df.columns
                ],
            }
        )
        data_dict["Additional Details"] = data_dict["Additional Details"].str.replace(
            "'", ""
        )

    # Export to Markdown and csv for data dictionary
    data_dict.to_markdown(str(dict_path) + "/" + dict_name + ".md", index=False)
    data_dict.to_csv(str(dict_path) + "/" + dict_name + ".csv", index=False)

    return data_dict
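
A minimal usage sketch, assuming `write_data_dict` is importable from `patientflow.prepare` (matching the source location above); the DataFrame columns and output directory are hypothetical:

```python
import pandas as pd
from pathlib import Path

from patientflow.prepare import write_data_dict  # import path assumed from the source location above

# Hypothetical DataFrame; the dict_name does not contain "visits",
# so the single "Additional Details" column is produced
df = pd.DataFrame(
    {
        "age_group": ["18-24", "65-74", "35-44"],
        "num_obs": [3, 7, 5],
    }
)

out_dir = Path("data_dictionaries")  # hypothetical output directory
out_dir.mkdir(exist_ok=True)

# Writes data_dictionaries/example_dict.md and .csv, and returns the dictionary
data_dict = write_data_dict(df, dict_name="example_dict", dict_path=out_dir)
print(data_dict["Column Name"].tolist())
```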

survival_curve

Core survival curve calculation functions for patient flow analysis.

This module provides the mathematical computation functions for survival analysis without visualization dependencies.

Functions:

| Name | Description |
|------|-------------|
| calculate_survival_curve : function | Calculate survival curve data from patient visit data |

calculate_survival_curve(df, start_time_col, end_time_col)

Calculate survival curve data from patient visit data.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| df | DataFrame | DataFrame containing patient visit data | required |
| start_time_col | str | Name of the column containing the start time (e.g., arrival time) | required |
| end_time_col | str | Name of the column containing the end time (e.g., admission time) | required |

Returns:

| Type | Description |
|------|-------------|
| tuple of (numpy.ndarray, numpy.ndarray, pandas.DataFrame) | unique_times: array of time points in hours; survival_prob: array of survival probabilities at each time point; df_clean: cleaned DataFrame with wait_time_hours column added |
Source code in src/patientflow/survival_curve.py
def calculate_survival_curve(df, start_time_col, end_time_col):
    """Calculate survival curve data from patient visit data.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing patient visit data
    start_time_col : str
        Name of the column containing the start time (e.g., arrival time)
    end_time_col : str
        Name of the column containing the end time (e.g., admission time)

    Returns
    -------
    tuple of (numpy.ndarray, numpy.ndarray, pandas.DataFrame)
        - unique_times: Array of time points in hours
        - survival_prob: Array of survival probabilities at each time point
        - df_clean: Cleaned DataFrame with wait_time_hours column added
    """
    # Calculate the wait time in hours
    df = df.copy()
    df["wait_time_hours"] = (
        df[end_time_col] - df[start_time_col]
    ).dt.total_seconds() / 3600

    # Drop any rows with missing wait times
    df_clean = df.dropna(subset=["wait_time_hours"]).copy()

    # Sort the data by wait time
    df_clean = df_clean.sort_values("wait_time_hours")

    # Calculate the number of patients
    n_patients = len(df_clean)

    # Calculate the survival function manually
    # For each time point, calculate proportion of patients who are still waiting
    unique_times = np.sort(df_clean["wait_time_hours"].unique())
    survival_prob = []

    for t in unique_times:
        # Number of patients who experienced the event after this time point
        n_event_after = sum(df_clean["wait_time_hours"] > t)
        # Proportion of patients still waiting
        survival_prob.append(n_event_after / n_patients)

    # Add zero hours wait time (everyone is waiting at time 0)
    unique_times = np.insert(unique_times, 0, 0)
    survival_prob = np.insert(survival_prob, 0, 1.0)

    return unique_times, survival_prob, df_clean
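
A short usage sketch with toy timestamps (import path assumed from the source location above). Each `survival_prob[i]` is the empirical proportion of patients whose wait exceeds `unique_times[i]` hours:

```python
import pandas as pd

from patientflow.survival_curve import calculate_survival_curve  # import path assumed

# Hypothetical visits with arrival and admission timestamps
df = pd.DataFrame(
    {
        "arrival_datetime": pd.to_datetime(
            ["2031-01-01 08:00", "2031-01-01 09:00", "2031-01-01 10:00"]
        ),
        "admission_datetime": pd.to_datetime(
            ["2031-01-01 10:00", "2031-01-01 13:00", "2031-01-01 11:30"]
        ),
    }
)

unique_times, survival_prob, df_clean = calculate_survival_curve(
    df, start_time_col="arrival_datetime", end_time_col="admission_datetime"
)
# Waits are 2.0, 4.0 and 1.5 hours, so:
# unique_times  -> [0.0, 1.5, 2.0, 4.0]
# survival_prob -> [1.0, 0.667, 0.333, 0.0]
print(list(zip(unique_times, survival_prob)))
```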

train

Training module for patient flow models.

This module provides functionality for training various predictive models used in patient flow analysis, including classifiers and demand forecasting models.

classifiers

Machine learning classifiers for patient flow prediction.

This module provides functions for training and evaluating machine learning classifiers for patient admission prediction. It includes utilities for data preparation, model training, hyperparameter tuning, and evaluation using time series cross-validation.

Functions:

| Name | Description |
|------|-------------|
| evaluate_predictions | Calculate multiple metrics (AUC, log loss, AUPRC) for given predictions |
| chronological_cross_validation | Perform time series cross-validation with multiple metrics |
| initialise_model | Initialize a model with given hyperparameters |
| create_column_transformer | Create a column transformer for a dataframe with dynamic column handling |
| calculate_class_balance | Calculate class balance ratios for target labels |
| get_feature_metadata | Extract feature names and importances from pipeline |
| get_dataset_metadata | Get dataset sizes and class balances |
| create_balance_info | Create a dictionary with balance information |
| evaluate_model | Evaluate model on test set |
| train_classifier | Train a single model including data preparation and balancing |
| train_multiple_classifiers | Train admission prediction models for multiple prediction times |

calculate_class_balance(y)

Calculate class balance ratios for target labels.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| y | Series | Target labels | required |

Returns:

| Type | Description |
|------|-------------|
| Dict[Any, float] | Dictionary mapping each class to its proportion |

Source code in src/patientflow/train/classifiers.py
def calculate_class_balance(y: Series) -> Dict[Any, float]:
    """Calculate class balance ratios for target labels.

    Parameters
    ----------
    y : Series
        Target labels

    Returns
    -------
    Dict[Any, float]
        Dictionary mapping each class to its proportion
    """
    counter = Counter(y)
    total = len(y)
    return {cls: count / total for cls, count in counter.items()}
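
For example (import path assumed from the source location above):

```python
import pandas as pd

from patientflow.train.classifiers import calculate_class_balance  # import path assumed

y = pd.Series([1, 0, 0, 0, 1])
print(calculate_class_balance(y))  # {1: 0.4, 0: 0.6}
```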

chronological_cross_validation(pipeline, X, y, n_splits=5)

Perform time series cross-validation with multiple metrics.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| pipeline | Pipeline | Sklearn pipeline to evaluate | required |
| X | DataFrame | Feature matrix | required |
| y | Series | Target labels | required |
| n_splits | int | Number of time series splits | 5 |

Returns:

| Type | Description |
|------|-------------|
| Dict[str, float] | Dictionary containing training and validation metrics |

Source code in src/patientflow/train/classifiers.py
def chronological_cross_validation(
    pipeline: Pipeline, X: DataFrame, y: Series, n_splits: int = 5
) -> Dict[str, float]:
    """Perform time series cross-validation with multiple metrics.

    Parameters
    ----------
    pipeline : Pipeline
        Sklearn pipeline to evaluate
    X : DataFrame
        Feature matrix
    y : Series
        Target labels
    n_splits : int, optional
        Number of time series splits, by default 5

    Returns
    -------
    Dict[str, float]
        Dictionary containing training and validation metrics
    """
    tscv = TimeSeriesSplit(n_splits=n_splits)

    train_metrics: List[FoldResults] = []
    valid_metrics: List[FoldResults] = []

    for train_idx, valid_idx in tscv.split(X):
        X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]

        pipeline.fit(X_train, y_train)
        train_preds = pipeline.predict_proba(X_train)[:, 1]
        valid_preds = pipeline.predict_proba(X_valid)[:, 1]

        train_metrics.append(evaluate_predictions(y_train, train_preds))
        valid_metrics.append(evaluate_predictions(y_valid, valid_preds))

    def aggregate_metrics(metrics_list: List[FoldResults]) -> Dict[str, float]:
        return {
            field: np.mean([getattr(m, field) for m in metrics_list])
            for field in FoldResults.__dataclass_fields__
        }

    train_means = aggregate_metrics(train_metrics)
    valid_means = aggregate_metrics(valid_metrics)

    return {f"train_{metric}": value for metric, value in train_means.items()} | {
        f"valid_{metric}": value for metric, value in valid_means.items()
    }
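
A minimal sketch on synthetic data (import path assumed). Note that the rows must already be in chronological order, since TimeSeriesSplit splits by position:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from patientflow.train.classifiers import chronological_cross_validation  # import path assumed

rng = np.random.default_rng(42)
X = pd.DataFrame({"feature": rng.normal(size=200)})  # rows assumed to be in time order
y = pd.Series(((X["feature"] + rng.normal(scale=0.5, size=200)) > 0).astype(int))

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
metrics = chronological_cross_validation(pipe, X, y, n_splits=5)

# Keys are the FoldResults fields (auc, logloss, auprc), prefixed train_/valid_
print(metrics["valid_auc"], metrics["valid_logloss"])
```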

create_balance_info(is_balanced, original_size, balanced_size, original_positive_rate, balanced_positive_rate, majority_to_minority_ratio)

Create a dictionary with balance information.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| is_balanced | bool | Whether the dataset was balanced | required |
| original_size | int | Original dataset size | required |
| balanced_size | int | Size after balancing | required |
| original_positive_rate | float | Positive class rate before balancing | required |
| balanced_positive_rate | float | Positive class rate after balancing | required |
| majority_to_minority_ratio | float | Ratio of majority to minority class samples | required |

Returns:

| Type | Description |
|------|-------------|
| Dict[str, Union[bool, int, float]] | Dictionary containing balance information |

Source code in src/patientflow/train/classifiers.py
def create_balance_info(
    is_balanced: bool,
    original_size: int,
    balanced_size: int,
    original_positive_rate: float,
    balanced_positive_rate: float,
    majority_to_minority_ratio: float,
) -> Dict[str, Union[bool, int, float]]:
    """Create a dictionary with balance information.

    Parameters
    ----------
    is_balanced : bool
        Whether the dataset was balanced
    original_size : int
        Original dataset size
    balanced_size : int
        Size after balancing
    original_positive_rate : float
        Positive class rate before balancing
    balanced_positive_rate : float
        Positive class rate after balancing
    majority_to_minority_ratio : float
        Ratio of majority to minority class samples

    Returns
    -------
    Dict[str, Union[bool, int, float]]
        Dictionary containing balance information
    """
    return {
        "is_balanced": is_balanced,
        "original_size": original_size,
        "balanced_size": balanced_size,
        "original_positive_rate": original_positive_rate,
        "balanced_positive_rate": balanced_positive_rate,
        "majority_to_minority_ratio": majority_to_minority_ratio,
    }

create_column_transformer(df, ordinal_mappings=None)

Create a column transformer for a dataframe with dynamic column handling.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| df | DataFrame | Input dataframe | required |
| ordinal_mappings | Dict[str, List[Any]] | Mappings for ordinal categorical features | None |

Returns:

| Type | Description |
|------|-------------|
| ColumnTransformer | Configured column transformer |

Source code in src/patientflow/train/classifiers.py
def create_column_transformer(
    df: DataFrame, ordinal_mappings: Optional[Dict[str, List[Any]]] = None
) -> ColumnTransformer:
    """Create a column transformer for a dataframe with dynamic column handling.

    Parameters
    ----------
    df : DataFrame
        Input dataframe
    ordinal_mappings : Dict[str, List[Any]], optional
        Mappings for ordinal categorical features, by default None

    Returns
    -------
    ColumnTransformer
        Configured column transformer
    """
    transformers: List[
        Tuple[str, Union[OrdinalEncoder, OneHotEncoder, StandardScaler], List[str]]
    ] = []

    if ordinal_mappings is None:
        ordinal_mappings = {}

    for col in df.columns:
        if col in ordinal_mappings:
            transformers.append(
                (
                    col,
                    OrdinalEncoder(
                        categories=[ordinal_mappings[col]],
                        handle_unknown="use_encoded_value",
                        unknown_value=np.nan,
                    ),
                    [col],
                )
            )
        elif df[col].dtype == "object" or (
            df[col].dtype == "bool" or df[col].nunique() == 2
        ):
            transformers.append((col, OneHotEncoder(handle_unknown="ignore"), [col]))
        else:
            transformers.append((col, StandardScaler(), [col]))

    return ColumnTransformer(transformers)
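
A small sketch showing all three branches (ordinal, one-hot, scaled); import path assumed and column names hypothetical:

```python
import pandas as pd

from patientflow.train.classifiers import create_column_transformer  # import path assumed

X = pd.DataFrame(
    {
        "age_group": ["18-24", "65-74", "25-34"],             # ordinal-encoded via the mapping
        "arrival_mode": ["ambulance", "walk-in", "walk-in"],  # object dtype -> one-hot encoded
        "num_obs": [3.0, 7.0, 5.0],                           # numeric -> standard-scaled
    }
)

ct = create_column_transformer(
    X, ordinal_mappings={"age_group": ["18-24", "25-34", "65-74"]}
)
print(ct.fit_transform(X))
```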

evaluate_model(pipeline, X_test, y_test)

Evaluate model on test set.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| pipeline | Pipeline | Trained sklearn pipeline | required |
| X_test | DataFrame | Test features | required |
| y_test | Series | Test labels | required |

Returns:

| Type | Description |
|------|-------------|
| Dict[str, float] | Dictionary containing test metrics |

Source code in src/patientflow/train/classifiers.py
def evaluate_model(
    pipeline: Pipeline, X_test: DataFrame, y_test: Series
) -> Dict[str, float]:
    """Evaluate model on test set.

    Parameters
    ----------
    pipeline : Pipeline
        Trained sklearn pipeline
    X_test : DataFrame
        Test features
    y_test : Series
        Test labels

    Returns
    -------
    Dict[str, float]
        Dictionary containing test metrics
    """
    y_test_pred = pipeline.predict_proba(X_test)[:, 1]
    return {
        "test_auc": float(roc_auc_score(y_test, y_test_pred)),
        "test_logloss": float(log_loss(y_test, y_test_pred)),
        "test_auprc": float(average_precision_score(y_test, y_test_pred)),
    }

evaluate_predictions(y_true, y_pred)

Calculate multiple metrics for given predictions.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| y_true | npt.NDArray[np.int_] | True binary labels | required |
| y_pred | npt.NDArray[np.float64] | Predicted probabilities | required |

Returns:

| Type | Description |
|------|-------------|
| FoldResults | Object containing AUC, log loss, and AUPRC metrics |

Source code in src/patientflow/train/classifiers.py
def evaluate_predictions(
    y_true: npt.NDArray[np.int_], y_pred: npt.NDArray[np.float64]
) -> FoldResults:
    """Calculate multiple metrics for given predictions.

    Parameters
    ----------
    y_true : npt.NDArray[np.int_]
        True binary labels
    y_pred : npt.NDArray[np.float64]
        Predicted probabilities

    Returns
    -------
    FoldResults
        Object containing AUC, log loss, and AUPRC metrics
    """
    return FoldResults(
        auc=roc_auc_score(y_true, y_pred),
        logloss=log_loss(y_true, y_pred),
        auprc=average_precision_score(y_true, y_pred),
    )
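
For example (import path assumed from the source location above):

```python
import numpy as np

from patientflow.train.classifiers import evaluate_predictions  # import path assumed

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8])

fold = evaluate_predictions(y_true, y_pred)
print(fold.auc, fold.logloss, fold.auprc)  # FoldResults dataclass fields
```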

get_dataset_metadata(X_train, X_valid, y_train, y_valid, X_test=None, y_test=None)

Get dataset sizes and class balances.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| X_train | DataFrame | Training features | required |
| X_valid | DataFrame | Validation features | required |
| y_train | Series | Training labels | required |
| y_valid | Series | Validation labels | required |
| X_test | DataFrame | Test features. If None, test set information will be set to None. | None |
| y_test | Series | Test labels. If None, test set information will be set to None. | None |

Returns:

| Type | Description |
|------|-------------|
| DatasetMetadata | Dictionary containing dataset sizes and class balances |

Source code in src/patientflow/train/classifiers.py
def get_dataset_metadata(
    X_train: DataFrame,
    X_valid: DataFrame,
    y_train: Series,
    y_valid: Series,
    X_test: Optional[DataFrame] = None,
    y_test: Optional[Series] = None,
) -> DatasetMetadata:
    """Get dataset sizes and class balances.

    Parameters
    ----------
    X_train : DataFrame
        Training features
    X_valid : DataFrame
        Validation features
    y_train : Series
        Training labels
    y_valid : Series
        Validation labels
    X_test : DataFrame, optional
        Test features. If None, test set information will be set to None.
    y_test : Series, optional
        Test labels. If None, test set information will be set to None.

    Returns
    -------
    DatasetMetadata
        Dictionary containing dataset sizes and class balances
    """
    metadata: DatasetMetadata = {
        "train_valid_test_set_no": {
            "train_set_no": len(X_train),
            "valid_set_no": len(X_valid),
            "test_set_no": len(X_test) if X_test is not None else None,
        },
        "train_valid_test_class_balance": {
            "y_train_class_balance": calculate_class_balance(y_train),
            "y_valid_class_balance": calculate_class_balance(y_valid),
            "y_test_class_balance": calculate_class_balance(y_test)
            if y_test is not None
            else None,
        },
    }

    return metadata

get_feature_metadata(pipeline)

Extract feature names and importances from pipeline.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| pipeline | Pipeline | Sklearn pipeline containing feature transformer and classifier | required |

Returns:

| Type | Description |
|------|-------------|
| FeatureMetadata | Dictionary containing feature names and their importance scores (if available) |

Raises:

| Type | Description |
|------|-------------|
| AttributeError | If the classifier doesn't support feature importance |

Source code in src/patientflow/train/classifiers.py
def get_feature_metadata(pipeline: Pipeline) -> FeatureMetadata:
    """
    Extract feature names and importances from pipeline.

    Parameters
    ----------
    pipeline : Pipeline
        Sklearn pipeline containing feature transformer and classifier

    Returns
    -------
    FeatureMetadata
        Dictionary containing feature names and their importance scores (if available)

    Raises
    ------
    AttributeError
        If the classifier doesn't support feature importance
    """
    transformed_cols = pipeline.named_steps[
        "feature_transformer"
    ].get_feature_names_out()
    classifier = pipeline.named_steps["classifier"]

    # Try different common feature importance attributes
    if hasattr(classifier, "feature_importances_"):
        importances = classifier.feature_importances_
    elif hasattr(classifier, "coef_"):
        importances = (
            np.abs(classifier.coef_[0])
            if classifier.coef_.ndim > 1
            else np.abs(classifier.coef_)
        )
    else:
        raise AttributeError("Classifier doesn't provide feature importance scores")

    return {
        "feature_names": [col.split("__")[-1] for col in transformed_cols],
        "feature_importances": importances.tolist(),
    }
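
A minimal sketch (import paths assumed). Note the pipeline steps must be named 'feature_transformer' and 'classifier', matching the pipelines built by train_classifier:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

from patientflow.train.classifiers import (  # import paths assumed
    create_column_transformer,
    get_feature_metadata,
)

# Tiny illustrative dataset
X = pd.DataFrame({"num_obs": [1.0, 2.0, 3.0, 4.0]})
y = pd.Series([0, 0, 1, 1])

pipe = Pipeline(
    [
        ("feature_transformer", create_column_transformer(X)),
        ("classifier", XGBClassifier(n_estimators=5)),
    ]
)
pipe.fit(X, y)

meta = get_feature_metadata(pipe)
print(meta["feature_names"], meta["feature_importances"])
```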

initialise_model(model_class, params, xgb_specific_params={'n_jobs': -1, 'eval_metric': 'logloss', 'enable_categorical': True})

Initialize a model with given hyperparameters.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model_class | Type | The classifier class to instantiate | required |
| params | Dict[str, Any] | Model-specific parameters to set | required |
| xgb_specific_params | Dict[str, Any] | XGBoost-specific default parameters | {'n_jobs': -1, 'eval_metric': 'logloss', 'enable_categorical': True} |

Returns:

| Type | Description |
|------|-------------|
| Any | Initialized model instance |

Source code in src/patientflow/train/classifiers.py
def initialise_model(
    model_class: Type,
    params: Dict[str, Any],
    xgb_specific_params: Dict[str, Any] = {
        "n_jobs": -1,
        "eval_metric": "logloss",
        "enable_categorical": True,
    },
) -> Any:
    """
    Initialize a model with given hyperparameters.

    Parameters
    ----------
    model_class : Type
        The classifier class to instantiate
    params : Dict[str, Any]
        Model-specific parameters to set
    xgb_specific_params : Dict[str, Any], optional
        XGBoost-specific default parameters

    Returns
    -------
    Any
        Initialized model instance
    """
    if model_class == XGBClassifier:
        model = model_class(**xgb_specific_params)
        model.set_params(**params)
    else:
        model = model_class(**params)

    return model
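
For example (import path assumed from the source location above):

```python
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from patientflow.train.classifiers import initialise_model  # import path assumed

# XGBoost gets the XGB-specific defaults first, then the grid parameters on top
xgb = initialise_model(XGBClassifier, {"n_estimators": 30, "subsample": 0.7})

# Any other sklearn-compatible class is constructed from the parameters alone
rf = initialise_model(RandomForestClassifier, {"n_estimators": 100})
```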

train_classifier(train_visits, valid_visits, prediction_time, exclude_from_training_data, grid, ordinal_mappings, test_visits=None, visit_col=None, model_class=XGBClassifier, use_balanced_training=True, majority_to_minority_ratio=1.0, calibrate_probabilities=True, calibration_method='sigmoid', single_snapshot_per_visit=True, label_col='is_admitted', evaluate_on_test=False)

Train a single model including data preparation and balancing.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| train_visits | DataFrame | Training visits dataset | required |
| valid_visits | DataFrame | Validation visits dataset | required |
| prediction_time | Tuple[int, int] | The prediction time point to use | required |
| exclude_from_training_data | List[str] | Columns to exclude from training | required |
| grid | Dict[str, List[Any]] | Parameter grid for hyperparameter tuning | required |
| ordinal_mappings | Dict[str, List[Any]] | Mappings for ordinal categorical features | required |
| test_visits | DataFrame | Test visits dataset. Required only when evaluate_on_test=True. | None |
| visit_col | str | Name of the visit column. Required if single_snapshot_per_visit is True. | None |
| model_class | Type | The classifier class to use. Must be sklearn-compatible with fit() and predict_proba(). | XGBClassifier |
| use_balanced_training | bool | Whether to use balanced training data | True |
| majority_to_minority_ratio | float | Ratio of majority to minority class samples | 1.0 |
| calibrate_probabilities | bool | Whether to apply probability calibration to the best model | True |
| calibration_method | str | Method for probability calibration ('isotonic' or 'sigmoid') | 'sigmoid' |
| single_snapshot_per_visit | bool | Whether to select only one snapshot per visit. If True, visit_col must be provided. | True |
| label_col | str | Name of the column containing the target labels | 'is_admitted' |
| evaluate_on_test | bool | Whether to evaluate the final model on the test set. Set to True only when satisfied with validation performance, to avoid test set contamination. | False |

Returns:

| Type | Description |
|------|-------------|
| TrainedClassifier | Trained model, including metrics and feature information |

Source code in src/patientflow/train/classifiers.py
def train_classifier(
    train_visits: DataFrame,
    valid_visits: DataFrame,
    prediction_time: Tuple[int, int],
    exclude_from_training_data: List[str],
    grid: Dict[str, List[Any]],
    ordinal_mappings: Dict[str, List[Any]],
    test_visits: Optional[DataFrame] = None,
    visit_col: Optional[str] = None,
    model_class: Type = XGBClassifier,
    use_balanced_training: bool = True,
    majority_to_minority_ratio: float = 1.0,
    calibrate_probabilities: bool = True,
    calibration_method: str = "sigmoid",
    single_snapshot_per_visit: bool = True,
    label_col: str = "is_admitted",
    evaluate_on_test: bool = False,
) -> TrainedClassifier:
    """
    Train a single model including data preparation and balancing.

    Parameters
    ----------
    train_visits : DataFrame
        Training visits dataset
    valid_visits : DataFrame
        Validation visits dataset
    prediction_time : Tuple[int, int]
        The prediction time point to use
    exclude_from_training_data : List[str]
        Columns to exclude from training
    grid : Dict[str, List[Any]]
        Parameter grid for hyperparameter tuning
    ordinal_mappings : Dict[str, List[Any]]
        Mappings for ordinal categorical features
    test_visits : DataFrame, optional
        Test visits dataset. Required only when evaluate_on_test=True.
    visit_col : str, optional
        Name of the visit column. Required if single_snapshot_per_visit is True.
    model_class : Type, optional
        The classifier class to use. Must be sklearn-compatible with fit() and predict_proba().
        Defaults to XGBClassifier.
    use_balanced_training : bool, default=True
        Whether to use balanced training data
    majority_to_minority_ratio : float, default=1.0
        Ratio of majority to minority class samples
    calibrate_probabilities : bool, default=True
        Whether to apply probability calibration to the best model
    calibration_method : str, default='sigmoid'
        Method for probability calibration ('isotonic' or 'sigmoid')
    single_snapshot_per_visit : bool, default=True
        Whether to select only one snapshot per visit. If True, visit_col must be provided.
    label_col : str, default="is_admitted"
        Name of the column containing the target labels
    evaluate_on_test : bool, default=False
        Whether to evaluate the final model on the test set. Set to True only when
        satisfied with validation performance to avoid test set contamination.

    Returns
    -------
    TrainedClassifier
        Trained model, including metrics, and feature information

    """
    if single_snapshot_per_visit and visit_col is None:
        raise ValueError(
            "visit_col must be provided when single_snapshot_per_visit is True"
        )

    if evaluate_on_test and test_visits is None:
        raise ValueError("test_visits must be provided when evaluate_on_test=True")

    # Get snapshots for each set
    X_train, y_train = prepare_patient_snapshots(
        train_visits,
        prediction_time,
        exclude_from_training_data,
        visit_col=visit_col,
        single_snapshot_per_visit=single_snapshot_per_visit,
        label_col=label_col,
    )
    X_valid, y_valid = prepare_patient_snapshots(
        valid_visits,
        prediction_time,
        exclude_from_training_data,
        visit_col=visit_col,
        single_snapshot_per_visit=single_snapshot_per_visit,
        label_col=label_col,
    )

    # Only prepare test data if evaluation is requested
    if evaluate_on_test:
        X_test, y_test = prepare_patient_snapshots(
            test_visits,
            prediction_time,
            exclude_from_training_data,
            visit_col=visit_col,
            single_snapshot_per_visit=single_snapshot_per_visit,
            label_col=label_col,
        )
    else:
        X_test, y_test = None, None

    # Get dataset metadata before any balancing
    dataset_metadata = get_dataset_metadata(
        X_train, X_valid, y_train, y_valid, X_test, y_test
    )

    # Store original size and positive rate before any balancing
    original_size = len(X_train)
    original_positive_rate = y_train.mean()

    if use_balanced_training:
        pos_indices = y_train[y_train == 1].index
        neg_indices = y_train[y_train == 0].index

        n_pos = len(pos_indices)
        n_neg = int(n_pos * majority_to_minority_ratio)

        neg_indices_sampled = np.random.choice(
            neg_indices, size=min(n_neg, len(neg_indices)), replace=False
        )

        train_balanced_indices = np.concatenate([pos_indices, neg_indices_sampled])
        np.random.shuffle(train_balanced_indices)

        X_train = X_train.loc[train_balanced_indices]
        y_train = y_train.loc[train_balanced_indices]

    # Create balance info after any balancing is done
    balance_info = create_balance_info(
        is_balanced=use_balanced_training,
        original_size=original_size,
        balanced_size=len(X_train),
        original_positive_rate=original_positive_rate,
        balanced_positive_rate=y_train.mean(),
        majority_to_minority_ratio=majority_to_minority_ratio
        if use_balanced_training
        else 1.0,
    )

    # Initialize best training results with default values
    best_training = TrainingResults(
        prediction_time=prediction_time,
        balance_info=balance_info,
        # Other fields will use their default empty dictionaries
    )

    # Initialize best model container
    best_model = TrainedClassifier(
        training_results=best_training,
        pipeline=None,
        calibrated_pipeline=None,
    )

    trials_list: List[HyperParameterTrial] = []
    best_logloss = float("inf")

    for params in ParameterGrid(grid):
        # Initialize model based on provided class
        model = initialise_model(model_class, params)

        column_transformer = create_column_transformer(X_train, ordinal_mappings)
        pipeline = Pipeline(
            [("feature_transformer", column_transformer), ("classifier", model)]
        )

        cv_results = chronological_cross_validation(
            pipeline, X_train, y_train, n_splits=5
        )
        # Store trial results
        trials_list.append(
            HyperParameterTrial(
                parameters=params.copy(),  # Make a copy to ensure immutability
                cv_results=cv_results,
            )
        )

        if cv_results["valid_logloss"] < best_logloss:
            best_logloss = cv_results["valid_logloss"]
            best_model.pipeline = pipeline

            # Get feature metadata if available
            try:
                feature_metadata = get_feature_metadata(pipeline)
                has_feature_importance = True
            except (AttributeError, NotImplementedError):
                feature_metadata = {
                    "feature_names": column_transformer.get_feature_names_out().tolist(),
                    "feature_importances": [],
                }
                has_feature_importance = False

            # Update training results
            best_training.training_info = {
                "cv_trials": trials_list,
                "features": {
                    "names": feature_metadata["feature_names"],
                    "importances": feature_metadata["feature_importances"],
                    "has_importance_values": has_feature_importance,
                },
                "dataset_info": dataset_metadata,
            }

            if calibrate_probabilities:
                best_training.calibration_info = {"method": calibration_method}

    # Apply probability calibration to the best model if requested
    if calibrate_probabilities and best_model.pipeline is not None:
        best_feature_transformer = best_model.pipeline.named_steps[
            "feature_transformer"
        ]
        best_classifier = best_model.pipeline.named_steps["classifier"]

        X_valid_transformed = best_feature_transformer.transform(X_valid)

        if sk_version >= "1.6.0":
            from sklearn.frozen import FrozenEstimator

            calibrated_classifier = CalibratedClassifierCV(
                estimator=FrozenEstimator(best_classifier),
                method=calibration_method,
            )
        else:
            calibrated_classifier = CalibratedClassifierCV(
                estimator=best_classifier, method=calibration_method, cv="prefit"
            )
        calibrated_classifier.fit(X_valid_transformed, y_valid)

        calibrated_pipeline = Pipeline(
            [
                ("feature_transformer", best_feature_transformer),
                ("classifier", calibrated_classifier),
            ]
        )

        best_model.calibrated_pipeline = calibrated_pipeline

        # Only evaluate on test set if requested
        if evaluate_on_test:
            best_training.test_results = evaluate_model(
                calibrated_pipeline, X_test, y_test
            )
        else:
            best_training.test_results = None

    else:
        # Only evaluate on test set if requested
        if evaluate_on_test:
            best_training.test_results = evaluate_model(
                best_model.pipeline, X_test, y_test
            )
        else:
            best_training.test_results = None

    return best_model
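
A usage sketch, not runnable as-is: it assumes train_visits and valid_visits are already-prepared snapshot DataFrames with 'is_admitted', 'visit_number', 'snapshot_date' and 'prediction_time' columns, and the import path is assumed from the source location above:

```python
from xgboost import XGBClassifier

from patientflow.train.classifiers import train_classifier  # import path assumed

# train_visits and valid_visits are assumed to exist as prepared snapshot DataFrames
trained = train_classifier(
    train_visits=train_visits,
    valid_visits=valid_visits,
    prediction_time=(9, 30),  # the 09:30 prediction time
    exclude_from_training_data=["visit_number", "snapshot_date", "prediction_time"],
    grid={"n_estimators": [30, 50]},
    ordinal_mappings={},
    visit_col="visit_number",
    model_class=XGBClassifier,
)

# With calibrate_probabilities=True (the default), trained.calibrated_pipeline
# holds the calibrated model; trained.pipeline holds the uncalibrated best model
print(trained.training_results.balance_info)
```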

train_multiple_classifiers(train_visits, valid_visits, grid, exclude_from_training_data, ordinal_mappings, prediction_times, test_visits=None, model_name='admissions', visit_col='visit_number', calibrate_probabilities=True, calibration_method='isotonic', use_balanced_training=True, majority_to_minority_ratio=1.0, label_col='is_admitted', evaluate_on_test=False)

Train admission prediction models for multiple prediction times.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| train_visits | DataFrame | Training visits dataset | required |
| valid_visits | DataFrame | Validation visits dataset | required |
| grid | Dict[str, List[Any]] | Parameter grid for hyperparameter tuning | required |
| exclude_from_training_data | List[str] | Columns to exclude from training | required |
| ordinal_mappings | Dict[str, List[Any]] | Mappings for ordinal categorical features | required |
| prediction_times | List[Tuple[int, int]] | List of prediction time points | required |
| test_visits | DataFrame | Test visits dataset | None |
| model_name | str | Name prefix for models | 'admissions' |
| visit_col | str | Name of the visit column | 'visit_number' |
| calibrate_probabilities | bool | Whether to calibrate probabilities | True |
| calibration_method | str | Calibration method | 'isotonic' |
| use_balanced_training | bool | Whether to use balanced training | True |
| majority_to_minority_ratio | float | Ratio for class balancing | 1.0 |
| label_col | str | Name of the label column | 'is_admitted' |
| evaluate_on_test | bool | Whether to evaluate on test set | False |

Returns:

| Type | Description |
|------|-------------|
| Dict[str, TrainedClassifier] | Dictionary mapping model keys to trained classifiers |

Source code in src/patientflow/train/classifiers.py
def train_multiple_classifiers(
    train_visits: DataFrame,
    valid_visits: DataFrame,
    grid: Dict[str, List[Any]],
    exclude_from_training_data: List[str],
    ordinal_mappings: Dict[str, List[Any]],
    prediction_times: List[Tuple[int, int]],
    test_visits: Optional[DataFrame] = None,
    model_name: str = "admissions",
    visit_col: str = "visit_number",
    calibrate_probabilities: bool = True,
    calibration_method: str = "isotonic",
    use_balanced_training: bool = True,
    majority_to_minority_ratio: float = 1.0,
    label_col: str = "is_admitted",
    evaluate_on_test: bool = False,
) -> Dict[str, TrainedClassifier]:
    """Train admission prediction models for multiple prediction times.

    Parameters
    ----------
    train_visits : DataFrame
        Training visits dataset
    valid_visits : DataFrame
        Validation visits dataset
    grid : Dict[str, List[Any]]
        Parameter grid for hyperparameter tuning
    exclude_from_training_data : List[str]
        Columns to exclude from training
    ordinal_mappings : Dict[str, List[Any]]
        Mappings for ordinal categorical features
    prediction_times : List[Tuple[int, int]]
        List of prediction time points
    test_visits : DataFrame, optional
        Test visits dataset, by default None
    model_name : str, optional
        Name prefix for models, by default "admissions"
    visit_col : str, optional
        Name of the visit column, by default "visit_number"
    calibrate_probabilities : bool, optional
        Whether to calibrate probabilities, by default True
    calibration_method : str, optional
        Calibration method, by default "isotonic"
    use_balanced_training : bool, optional
        Whether to use balanced training, by default True
    majority_to_minority_ratio : float, optional
        Ratio for class balancing, by default 1.0
    label_col : str, optional
        Name of the label column, by default "is_admitted"
    evaluate_on_test : bool, optional
        Whether to evaluate on test set, by default False

    Returns
    -------
    Dict[str, TrainedClassifier]
        Dictionary mapping model keys to trained classifiers
    """
    if evaluate_on_test and test_visits is None:
        raise ValueError("test_visits must be provided when evaluate_on_test=True")

    trained_models: Dict[str, TrainedClassifier] = {}

    for prediction_time in prediction_times:
        print(f"\nProcessing: {prediction_time}")
        model_key = get_model_key(model_name, prediction_time)

        # Train model with the new simplified interface
        best_model = train_classifier(
            train_visits,
            valid_visits,
            prediction_time,
            exclude_from_training_data,
            grid,
            ordinal_mappings,
            test_visits,
            visit_col,
            use_balanced_training=use_balanced_training,
            majority_to_minority_ratio=majority_to_minority_ratio,
            calibrate_probabilities=calibrate_probabilities,
            calibration_method=calibration_method,
            label_col=label_col,
            evaluate_on_test=evaluate_on_test,
        )

        trained_models[model_key] = best_model

    return trained_models
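
A sketch, assuming the same prepared train_visits and valid_visits DataFrames as in the train_classifier example above:

```python
from patientflow.train.classifiers import train_multiple_classifiers  # import path assumed

models = train_multiple_classifiers(
    train_visits=train_visits,
    valid_visits=valid_visits,
    grid={"n_estimators": [30]},
    exclude_from_training_data=["visit_number", "snapshot_date", "prediction_time"],
    ordinal_mappings={},
    prediction_times=[(6, 0), (9, 30), (12, 0)],
)

# Keys are produced by get_model_key(model_name, prediction_time)
print(list(models.keys()))
```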

emergency_demand

Emergency demand prediction training module.

This module provides functionality that is specific to the implementation of the patientflow package at University College London Hospitals (UCLH). It trains models to predict emergency bed demand.

The module trains three model types:

1. Admission prediction models (multiple classifiers, one for each prediction time)
2. Specialty prediction models (sequence-based)
3. Yet-to-arrive prediction models (aspirational)

Functions:

| Name | Description |
|------|-------------|
| test_real_time_predictions : Test real-time prediction functionality | Selects random test cases and validates that the trained models can generate predictions as if they were making a real-time prediction. |
| train_all_models : Complete training pipeline | Trains all three model types (admissions, specialty, yet-to-arrive) with proper validation and optional model saving. |
| main : Entry point for training pipeline | Loads configuration, data, and runs the complete training process. |

main(data_folder_name=None)

Main entry point for training patient flow models.

This function orchestrates the complete training pipeline for emergency demand prediction models. It loads configuration, data, and trains all three model types: admission prediction models, specialty prediction models, and yet-to-arrive prediction models.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| data_folder_name | str | Name of the data folder containing the training datasets. If None, will be extracted from command line arguments. | None |

Returns:

| Type | Description |
|------|-------------|
| None | The function trains and optionally saves models but does not return any values. |

Notes

The function performs the following steps:

1. Loads configuration from config.yaml
2. Loads ED visits and inpatient arrivals data
3. Sets up model parameters and hyperparameters
4. Trains admission prediction classifiers
5. Trains specialty prediction sequence model
6. Trains yet-to-arrive prediction model
7. Optionally saves trained models
8. Optionally tests real-time prediction functionality

Source code in src/patientflow/train/emergency_demand.py
def main(data_folder_name=None):
    """
    Main entry point for training patient flow models.

    This function orchestrates the complete training pipeline for emergency demand
    prediction models. It loads configuration, data, and trains all three model
    types: admission prediction models, specialty prediction models, and
    yet-to-arrive prediction models.

    Parameters
    ----------
    data_folder_name : str, optional
        Name of the data folder containing the training datasets.
        If None, will be extracted from command line arguments.

    Returns
    -------
    None
        The function trains and optionally saves models but does not return
        any values.

    Notes
    -----
    The function performs the following steps:
    1. Loads configuration from config.yaml
    2. Loads ED visits and inpatient arrivals data
    3. Sets up model parameters and hyperparameters
    4. Trains admission prediction classifiers
    5. Trains specialty prediction sequence model
    6. Trains yet-to-arrive prediction model
    7. Optionally saves trained models
    8. Optionally tests real-time prediction functionality

    """
    # Parse arguments if not provided
    if data_folder_name is None:
        args = parse_args()
        data_folder_name = (
            data_folder_name if data_folder_name is not None else args.data_folder_name
        )
    print(f"Loading data from folder: {data_folder_name}")

    project_root = set_project_root()

    # Set file locations
    data_file_path, _, model_file_path, config_path = set_file_paths(
        project_root=project_root,
        inference_time=False,
        train_dttm=None,
        data_folder_name=data_folder_name,
        config_file="config.yaml",
    )

    # Load parameters
    config = load_config_file(config_path)

    # Extract parameters
    prediction_times = config["prediction_times"]
    start_training_set = config["start_training_set"]
    start_validation_set = config["start_validation_set"]
    start_test_set = config["start_test_set"]
    end_test_set = config["end_test_set"]
    prediction_window = timedelta(minutes=config["prediction_window"])
    epsilon = float(config["epsilon"])
    yta_time_interval = timedelta(minutes=config["yta_time_interval"])
    x1, y1, x2, y2 = config["x1"], config["y1"], config["x2"], config["y2"]

    # Load data
    ed_visits = load_data(
        data_file_path=data_file_path,
        file_name="ed_visits.csv",
        index_column="snapshot_id",
        sort_columns=["visit_number", "snapshot_date", "prediction_time"],
        eval_columns=["prediction_time", "consultation_sequence", "final_sequence"],
    )
    inpatient_arrivals = load_data(
        data_file_path=data_file_path, file_name="inpatient_arrivals.csv"
    )

    # Create snapshot date
    ed_visits["snapshot_date"] = pd.to_datetime(
        ed_visits["snapshot_date"], dayfirst=True
    ).dt.date

    # Set up model parameters
    grid_params = {"n_estimators": [30], "subsample": [0.7], "colsample_bytree": [0.7]}

    exclude_columns = [
        "visit_number",
        "snapshot_date",
        "prediction_time",
        "specialty",
        "consultation_sequence",
        "final_sequence",
    ]

    ordinal_mappings = {
        "age_group": [
            "0-17",
            "18-24",
            "25-34",
            "35-44",
            "45-54",
            "55-64",
            "65-74",
            "75-115",
        ],
        "latest_acvpu": ["A", "C", "V", "P", "U"],
        "latest_obs_manchester_triage_acuity": [
            "Blue",
            "Green",
            "Yellow",
            "Orange",
            "Red",
        ],
        "latest_obs_objective_pain_score": [
            "Nil",
            "Mild",
            "Moderate",
            "Severe\\E\\Very Severe",
        ],
        "latest_obs_level_of_consciousness": ["A", "C", "V", "P", "U"],
    }

    specialties = ["surgical", "haem/onc", "medical", "paediatric"]
    cdf_cut_points = [0.9, 0.7]
    curve_params = (x1, y1, x2, y2)
    random_seed = 42

    # Call train_all_models with prepared parameters
    train_all_models(
        visits=ed_visits,
        start_training_set=start_training_set,
        start_validation_set=start_validation_set,
        start_test_set=start_test_set,
        end_test_set=end_test_set,
        yta=inpatient_arrivals,
        model_file_path=model_file_path,
        prediction_times=prediction_times,
        prediction_window=prediction_window,
        yta_time_interval=yta_time_interval,
        epsilon=epsilon,
        curve_params=curve_params,
        grid_params=grid_params,
        exclude_columns=exclude_columns,
        ordinal_mappings=ordinal_mappings,
        specialties=specialties,
        cdf_cut_points=cdf_cut_points,
        random_seed=random_seed,
    )

    return
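
A programmatic invocation sketch; 'data-synthetic' is a hypothetical folder name, which would need to contain ed_visits.csv and inpatient_arrivals.csv and be resolvable by set_file_paths alongside config.yaml:

```python
from patientflow.train.emergency_demand import main  # import path assumed

# Equivalent to running the module from the command line with this folder name
main(data_folder_name="data-synthetic")
```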

test_real_time_predictions(visits, models, prediction_window, specialties, cdf_cut_points, curve_params, random_seed)

Test real-time predictions by selecting a random sample from the visits dataset and generating predictions using the trained models.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| visits | DataFrame | DataFrame containing visit data with columns including 'prediction_time', 'snapshot_date', and other required features for predictions. | required |
| models | Tuple[Dict[str, TrainedClassifier], SequenceToOutcomePredictor, ParametricIncomingAdmissionPredictor] | Tuple containing: trained_classifiers (admission prediction models), spec_model (SequenceToOutcomePredictor for specialty predictions), and yet_to_arrive_model (ParametricIncomingAdmissionPredictor for yet-to-arrive predictions). | required |
| prediction_window | int | Size of the prediction window in minutes for which to generate forecasts. | required |
| specialties | list[str] | List of specialty names to generate predictions for (e.g., ['surgical', 'medical', 'paediatric']). | required |
| cdf_cut_points | list[float] | List of probability thresholds for cumulative distribution function cut points (e.g., [0.9, 0.7]). | required |
| curve_params | tuple[float, float, float, float] | Parameters (x1, y1, x2, y2) defining the curve used for predictions. | required |
| random_seed | int | Random seed for reproducible sampling of test cases. | required |

Returns:

| Type | Description |
|------|-------------|
| dict | Dictionary containing 'prediction_time' (str, the time point for which predictions were made), 'prediction_date' (str, the date for which predictions were made), and 'realtime_preds' (dict, the generated predictions for the sample). |

Raises:

| Type | Description |
|------|-------------|
| Exception | If real-time inference fails, with detailed error message printed before system exit. |

Notes

The function selects a single random row from the visits DataFrame and generates predictions for that specific time point using all provided models. The predictions are made using the create_predictions() function with the specified parameters.

Source code in src/patientflow/train/emergency_demand.py
def test_real_time_predictions(
    visits,
    models: Tuple[
        Dict[str, TrainedClassifier],
        SequenceToOutcomePredictor,
        ParametricIncomingAdmissionPredictor,
    ],
    prediction_window,
    specialties,
    cdf_cut_points,
    curve_params,
    random_seed,
):
    """
    Test real-time predictions by selecting a random sample from the visits dataset
    and generating predictions using the trained models.

    Parameters
    ----------
    visits : pd.DataFrame
        DataFrame containing visit data with columns including 'prediction_time',
        'snapshot_date', and other required features for predictions.
    models : Tuple[Dict[str, TrainedClassifier], SequenceToOutcomePredictor, ParametricIncomingAdmissionPredictor]
        Tuple containing:
        - trained_classifiers: TrainedClassifier containing admission predictions
        - spec_model: SequenceToOutcomePredictor for specialty predictions
        - yet_to_arrive_model: ParametricIncomingAdmissionPredictor for yet-to-arrive predictions
    prediction_window : int
        Size of the prediction window in minutes for which to generate forecasts.
    specialties : list[str]
        List of specialty names to generate predictions for (e.g., ['surgical',
        'medical', 'paediatric']).
    cdf_cut_points : list[float]
        List of probability thresholds for cumulative distribution function
        cut points (e.g., [0.9, 0.7]).
    curve_params : tuple[float, float, float, float]
        Parameters (x1, y1, x2, y2) defining the curve used for predictions.
    random_seed : int
        Random seed for reproducible sampling of test cases.

    Returns
    -------
    dict
        Dictionary containing:
        - 'prediction_time': str, The time point for which predictions were made
        - 'prediction_date': str, The date for which predictions were made
        - 'realtime_preds': dict, The generated predictions for the sample

    Raises
    ------
    Exception
        If real-time inference fails, with detailed error message printed before
        system exit.

    Notes
    -----
    The function selects a single random row from the visits DataFrame and
    generates predictions for that specific time point using all provided models.
    The predictions are made using the create_predictions() function with the
    specified parameters.
    """
    # Select random test set row
    random_row = visits.sample(n=1, random_state=random_seed)
    prediction_time = random_row.prediction_time.values[0]
    prediction_date = random_row.snapshot_date.values[0]

    # Get prediction snapshots
    prediction_snapshots = visits[
        (visits.prediction_time == prediction_time)
        & (visits.snapshot_date == prediction_date)
    ]

    trained_classifiers, spec_model, yet_to_arrive_model = models

    # Find the model matching the required prediction time
    classifier = None
    for model_key, trained_model in trained_classifiers.items():
        if trained_model.training_results.prediction_time == prediction_time:
            classifier = trained_model
            break

    if classifier is None:
        raise ValueError(f"No model found for prediction time {prediction_time}")

    try:
        x1, y1, x2, y2 = curve_params
        _ = create_predictions(
            models=(classifier, spec_model, yet_to_arrive_model),
            prediction_time=prediction_time,
            prediction_snapshots=prediction_snapshots,
            specialties=specialties,
            prediction_window=prediction_window,
            cdf_cut_points=cdf_cut_points,
            x1=x1,
            y1=y1,
            x2=x2,
            y2=y2,
        )
        print("Real-time inference ran correctly")
    except Exception as e:
        print(f"Real-time inference failed due to this error: {str(e)}")
        sys.exit(1)

    return

train_all_models(visits, start_training_set, start_validation_set, start_test_set, end_test_set, yta, prediction_times, prediction_window, yta_time_interval, epsilon, grid_params, exclude_columns, ordinal_mappings, random_seed, visit_col='visit_number', specialties=None, cdf_cut_points=None, curve_params=None, model_file_path=None, save_models=True, test_realtime=True)

Train and evaluate patient flow models.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| visits | DataFrame | DataFrame containing visit data. | required |
| yta | DataFrame | DataFrame containing yet-to-arrive data. | required |
| prediction_times | list | List of times for making predictions. | required |
| prediction_window | timedelta | Prediction window size. | required |
| yta_time_interval | timedelta | Interval size for yet-to-arrive predictions. | required |
| epsilon | float | Epsilon parameter for model training. | required |
| grid_params | dict | Hyperparameter grid for model training. | required |
| exclude_columns | list | Columns to exclude during training. | required |
| ordinal_mappings | dict | Ordinal variable mappings for categorical features. | required |
| random_seed | int | Random seed for reproducibility. | required |
| visit_col | str | Name of the column in the dataset that identifies a hospital visit (e.g., visit_number, csn). | 'visit_number' |
| specialties | list | List of specialties to consider. Required if test_realtime is True. | None |
| cdf_cut_points | list | CDF cut points for predictions. Required if test_realtime is True. | None |
| curve_params | tuple | Curve parameters (x1, y1, x2, y2). Required if test_realtime is True. | None |
| model_file_path | Path | Path to save trained models. Required if save_models is True. | None |
| save_models | bool | Whether to save the trained models to disk. | True |
| test_realtime | bool | Whether to run real-time prediction tests. | True |

Returns:

| Type | Description |
|------|-------------|
| None | |

Raises:

| Type | Description |
|------|-------------|
| ValueError | If save_models is True but model_file_path is not provided, or if test_realtime is True but any of specialties, cdf_cut_points, or curve_params are not provided. |

Notes

The function generates model names internally:

- "admissions": "admissions"
- "specialty": "ed_specialty"
- "yet_to_arrive": f"yet_to_arrive_{int(prediction_window.total_seconds()/3600)}_hours"

Source code in src/patientflow/train/emergency_demand.py
def train_all_models(
    visits,
    start_training_set,
    start_validation_set,
    start_test_set,
    end_test_set,
    yta,
    prediction_times,
    prediction_window: timedelta,
    yta_time_interval: timedelta,
    epsilon,
    grid_params,
    exclude_columns,
    ordinal_mappings,
    random_seed,
    visit_col="visit_number",
    specialties=None,
    cdf_cut_points=None,
    curve_params=None,
    model_file_path=None,
    save_models=True,
    test_realtime=True,
):
    """
    Train and evaluate patient flow models.

    Parameters
    ----------
    visits : pd.DataFrame
        DataFrame containing visit data.
    yta : pd.DataFrame
        DataFrame containing yet-to-arrive data.
    prediction_times : list
        List of times for making predictions.
    prediction_window : timedelta
        Prediction window size as a timedelta.
    yta_time_interval : timedelta
        Interval size for yet-to-arrive predictions as a timedelta.
    epsilon : float
        Epsilon parameter for model training.
    grid_params : dict
        Hyperparameter grid for model training.
    exclude_columns : list
        Columns to exclude during training.
    ordinal_mappings : dict
        Ordinal variable mappings for categorical features.
    random_seed : int
        Random seed for reproducibility.
    visit_col : str, optional
        Name of the column that identifies a hospital visit (e.g. visit_number, csn).
    specialties : list, optional
        List of specialties to consider. Required if test_realtime is True.
    cdf_cut_points : list, optional
        CDF cut points for predictions. Required if test_realtime is True.
    curve_params : tuple, optional
        Curve parameters (x1, y1, x2, y2). Required if test_realtime is True.
    model_file_path : Path, optional
        Path to save trained models. Required if save_models is True.
    save_models : bool, optional
        Whether to save the trained models to disk. Defaults to True.
    test_realtime : bool, optional
        Whether to run real-time prediction tests. Defaults to True.

    Returns
    -------
    None

    Raises
    ------
    ValueError
        If save_models is True but model_file_path is not provided,
        or if test_realtime is True but any of specialties, cdf_cut_points, or curve_params are not provided.

    Notes
    -----
    The function generates model names internally:
    - "admissions": "admissions"
    - "specialty": "ed_specialty"
    - "yet_to_arrive": f"yet_to_arrive_{int(prediction_window.total_seconds()/3600)}_hours"
    """
    # Validate parameters
    if save_models and model_file_path is None:
        raise ValueError("model_file_path must be provided when save_models is True")

    if test_realtime:
        if specialties is None:
            raise ValueError("specialties must be provided when test_realtime is True")
        if cdf_cut_points is None:
            raise ValueError(
                "cdf_cut_points must be provided when test_realtime is True"
            )
        if curve_params is None:
            raise ValueError("curve_params must be provided when test_realtime is True")

    # Set random seed
    np.random.seed(random_seed)

    # Define model names internally
    model_names = {
        "admissions": "admissions",
        "specialty": "ed_specialty",
        "yet_to_arrive": f"yet_to_arrive_{int(prediction_window.total_seconds()/3600)}_hours",
    }

    if "arrival_datetime" in visits.columns:
        col_name = "arrival_datetime"
    else:
        col_name = "snapshot_date"

    train_visits, valid_visits, test_visits = create_temporal_splits(
        visits,
        start_training_set,
        start_validation_set,
        start_test_set,
        end_test_set,
        col_name=col_name,
    )

    train_yta, _, _ = create_temporal_splits(
        yta[(~yta.specialty.isnull())],
        start_training_set,
        start_validation_set,
        start_test_set,
        end_test_set,
        col_name="arrival_datetime",
    )

    # Use prediction_times from visits if not explicitly provided
    if prediction_times is None:
        prediction_times = list(visits.prediction_time.unique())

    # Train admission models
    admission_models = train_multiple_classifiers(
        train_visits=train_visits,
        valid_visits=valid_visits,
        test_visits=test_visits,
        grid=grid_params,
        exclude_from_training_data=exclude_columns,
        ordinal_mappings=ordinal_mappings,
        prediction_times=prediction_times,
        model_name=model_names["admissions"],
        visit_col=visit_col,
    )

    # Save admission models if requested
    if save_models:
        save_model(admission_models, model_names["admissions"], model_file_path)

    # Train specialty model
    specialty_model = train_sequence_predictor(
        train_visits=train_visits,
        model_name=model_names["specialty"],
        input_var="consultation_sequence",
        grouping_var="final_sequence",
        outcome_var="specialty",
        visit_col=visit_col,
    )

    # Save specialty model if requested
    if save_models:
        save_model(specialty_model, model_names["specialty"], model_file_path)

    # Train yet-to-arrive model
    yta_model_name = model_names["yet_to_arrive"]

    num_days = (start_validation_set - start_training_set).days

    yta_model = train_parametric_admission_predictor(
        train_visits=train_visits,
        train_yta=train_yta,
        prediction_window=prediction_window,
        yta_time_interval=yta_time_interval,
        prediction_times=prediction_times,
        epsilon=epsilon,
        num_days=num_days,
    )

    # Save yet-to-arrive model if requested
    if save_models:
        save_model(yta_model, yta_model_name, model_file_path)
        print(f"Models have been saved to {model_file_path}")

    # Test real-time predictions if requested
    if test_realtime:
        visits["elapsed_los"] = visits["elapsed_los"].apply(
            lambda x: timedelta(seconds=x)
        )
        test_real_time_predictions(
            visits=visits,
            models=(admission_models, specialty_model, yta_model),
            prediction_window=prediction_window,
            specialties=specialties,
            cdf_cut_points=cdf_cut_points,
            curve_params=curve_params,
            random_seed=random_seed,
        )

    return

incoming_admission_predictor

Training utility for parametric admission prediction models.

This module provides functions for training parametric admission prediction models, specifically for predicting yet-to-arrive (YTA) patient volumes using parametric curves. It includes utilities for creating specialty filters and training parametric admission predictors.

The logic in this module is specific to the implementation at UCLH.

create_yta_filters(df)

Create specialty filters for categorizing patients by specialty and age group.

This function generates a dictionary of filters based on specialty categories, with special handling for pediatric patients. It uses the SpecialCategoryParams class to determine which specialties correspond to pediatric care.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing patient data with columns that include either 'age_on_arrival' or 'age_group' for pediatric classification.

required

Returns:

Type Description
dict

A dictionary mapping specialty names to filter configurations. Each configuration contains:

- For the pediatric specialty: {"is_child": True}
- For other specialties: {"specialty": specialty_name, "is_child": False}
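Examples:

A sketch of the expected output shape, assuming df has an age_on_arrival column and the special (pediatric) category is labelled 'paediatric'; the other specialty names are illustrative:

>>> specialty_filters = create_yta_filters(df)
>>> specialty_filters["paediatric"]
{'is_child': True}
>>> specialty_filters["medical"]
{'specialty': 'medical', 'is_child': False}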

Source code in src/patientflow/train/incoming_admission_predictor.py
def create_yta_filters(df):
    """
    Create specialty filters for categorizing patients by specialty and age group.

    This function generates a dictionary of filters based on specialty categories,
    with special handling for pediatric patients. It uses the SpecialCategoryParams
    class to determine which specialties correspond to pediatric care.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing patient data with columns that include either
        'age_on_arrival' or 'age_group' for pediatric classification.

    Returns
    -------
    dict
        A dictionary mapping specialty names to filter configurations.
        Each configuration contains:
        - For pediatric specialty: {"is_child": True}
        - For other specialties: {"specialty": specialty_name, "is_child": False}

    """
    # Get the special category parameters using the picklable implementation
    special_params = create_special_category_objects(df.columns)

    # Extract necessary data from the special_params
    special_category_dict = special_params["special_category_dict"]

    # Create the specialty_filters dictionary
    specialty_filters = {}

    for specialty, is_paediatric_flag in special_category_dict.items():
        if is_paediatric_flag == 1.0:
            # For the paediatric specialty, set `is_child` to True
            specialty_filters[specialty] = {"is_child": True}
        else:
            # For other specialties, set `is_child` to False
            specialty_filters[specialty] = {"specialty": specialty, "is_child": False}

    return specialty_filters

train_parametric_admission_predictor(train_visits, train_yta, prediction_window, yta_time_interval, prediction_times, num_days, epsilon=1e-06)

Train a parametric yet-to-arrive prediction model.

Parameters:

Name Type Description Default
train_visits DataFrame

Visits dataset (used for identifying special categories).

required
train_yta DataFrame

Training data for yet-to-arrive predictions.

required
prediction_window timedelta

Time window for predictions as a timedelta.

required
yta_time_interval timedelta

Time interval for predictions as a timedelta.

required
prediction_times List[float]

List of prediction times.

required
num_days int

Number of days to consider.

required
epsilon float

Epsilon parameter for the model, by default 1e-6.

1e-06

Returns:

Type Description
ParametricIncomingAdmissionPredictor

Trained ParametricIncomingAdmissionPredictor model.

Raises:

Type Description
TypeError

If prediction_window or yta_time_interval are not timedelta objects.
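Examples:

A minimal sketch with illustrative argument values, assuming train_visits and train_yta are prepared DataFrames and train_yta has (or contains a column for) an arrival_datetime index:

>>> from datetime import timedelta
>>> yta_model = train_parametric_admission_predictor(
...     train_visits=train_visits,
...     train_yta=train_yta,
...     prediction_window=timedelta(hours=8),
...     yta_time_interval=timedelta(minutes=15),
...     prediction_times=[(6, 0), (12, 0), (22, 0)],
...     num_days=180,
... )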

Source code in src/patientflow/train/incoming_admission_predictor.py
def train_parametric_admission_predictor(
    train_visits: DataFrame,
    train_yta: DataFrame,
    prediction_window: timedelta,
    yta_time_interval: timedelta,
    prediction_times: List[float],
    num_days: int,
    epsilon: float = 1e-6,
) -> ParametricIncomingAdmissionPredictor:
    """
    Train a parametric yet-to-arrive prediction model.

    Parameters
    ----------
    train_visits : DataFrame
        Visits dataset (used for identifying special categories).
    train_yta : DataFrame
        Training data for yet-to-arrive predictions.
    prediction_window : timedelta
        Time window for predictions as a timedelta.
    yta_time_interval : timedelta
        Time interval for predictions as a timedelta.
    prediction_times : List[float]
        List of prediction times.
    num_days : int
        Number of days to consider.
    epsilon : float, optional
        Epsilon parameter for the model, by default 1e-6.

    Returns
    -------
    ParametricIncomingAdmissionPredictor
        Trained ParametricIncomingAdmissionPredictor model.

    Raises
    ------
    TypeError
        If prediction_window or yta_time_interval are not timedelta objects.
    """

    if not isinstance(prediction_window, timedelta):
        raise TypeError("prediction_window must be a timedelta object")
    if not isinstance(yta_time_interval, timedelta):
        raise TypeError("yta_time_interval must be a timedelta object")

    if train_yta.index.name is None:
        if "arrival_datetime" in train_yta.columns:
            # Convert to datetime using the actual values, not pandas objects
            train_yta = train_yta.copy()
            train_yta["arrival_datetime"] = pd.to_datetime(
                train_yta["arrival_datetime"].values, utc=True
            )
            train_yta.set_index("arrival_datetime", inplace=True)

    elif train_yta.index.name != "arrival_datetime":
        print("Dataset needs arrival_datetime column")

    specialty_filters = create_yta_filters(train_visits)

    yta_model = ParametricIncomingAdmissionPredictor(filters=specialty_filters)
    yta_model.fit(
        train_df=train_yta,
        prediction_window=prediction_window,
        yta_time_interval=yta_time_interval,
        prediction_times=prediction_times,
        epsilon=epsilon,
        num_days=num_days,
    )

    return yta_model

sequence_predictor

Training utility for sequence prediction models.

This module provides functions for training sequence-based prediction models, specifically for predicting patient outcomes based on visit sequences. It includes utilities for filtering patient data and training specialized sequence predictors.

The logic in this module is specific to the implementation at UCLH.

get_default_visits(admitted)

Filter a dataframe of patient visits to include only non-pediatric patients.

This function identifies and removes pediatric patients from the dataset based on both age criteria and specialty assignment. It automatically detects the appropriate age column format from the provided dataframe.

Parameters:

Name Type Description Default
admitted DataFrame

A pandas DataFrame containing patient visit information. Must include either 'age_on_arrival' or 'age_group' columns, and a 'specialty' column.

required

Returns:

Type Description
DataFrame

A filtered DataFrame containing only non-pediatric patients (adults).

Notes

The function automatically detects which age-related columns are present in the dataframe and configures the appropriate filtering logic. It removes patients who are either:

1. Identified as pediatric based on age criteria, or
2. Assigned to a pediatric specialty
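Examples:

A sketch of typical use, assuming admitted contains 'age_on_arrival' (or 'age_group') and 'specialty' columns and the pediatric specialty is labelled 'paediatric':

>>> adults = get_default_visits(admitted)
>>> (adults["specialty"] == "paediatric").any()
False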

Source code in src/patientflow/train/sequence_predictor.py
def get_default_visits(admitted: DataFrame) -> DataFrame:
    """
    Filter a dataframe of patient visits to include only non-pediatric patients.

    This function identifies and removes pediatric patients from the dataset based on
    both age criteria and specialty assignment. It automatically detects the appropriate
    age column format from the provided dataframe.

    Parameters
    ----------
    admitted : DataFrame
        A pandas DataFrame containing patient visit information. Must include either
        'age_on_arrival' or 'age_group' columns, and a 'specialty' column.

    Returns
    -------
    DataFrame
        A filtered DataFrame containing only non-pediatric patients (adults).

    Notes
    -----
    The function automatically detects which age-related columns are present in the
    dataframe and configures the appropriate filtering logic. It removes patients who
    are either:
    1. Identified as pediatric based on age criteria, or
    2. Assigned to a pediatric specialty

    """
    # Get configuration for categorizing patients based on age columns
    special_params = create_special_category_objects(admitted.columns)

    # Extract function that identifies non-pediatric patients
    opposite_special_category_func = special_params["special_func_map"]["default"]

    # Determine which category is the special category (should be "paediatric")
    special_category_key = next(
        key
        for key, value in special_params["special_category_dict"].items()
        if value == 1.0
    )

    # Filter out pediatric patients based on both age criteria and specialty
    filtered_admitted = admitted[
        admitted.apply(opposite_special_category_func, axis=1)
        & (admitted["specialty"] != special_category_key)
    ]

    return filtered_admitted

train_sequence_predictor(train_visits, model_name, visit_col, input_var, grouping_var, outcome_var)

Train a specialty prediction model.

Parameters:

Name Type Description Default
train_visits DataFrame

Training data containing visit information.

required
model_name str

Name identifier for the model.

required
visit_col str

Column name containing visit identifiers.

required
input_var str

Column name for input sequence.

required
grouping_var str

Column name for grouping sequence.

required
outcome_var str

Column name for target variable.

required

Returns:

Type Description
SequenceToOutcomePredictor

Trained SequenceToOutcomePredictor model.
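Examples:

A minimal sketch, assuming train_visits is a prepared DataFrame containing the listed columns; the column names mirror those used internally by train_all_models:

>>> spec_model = train_sequence_predictor(
...     train_visits=train_visits,
...     model_name="ed_specialty",
...     visit_col="visit_number",
...     input_var="consultation_sequence",
...     grouping_var="final_sequence",
...     outcome_var="specialty",
... )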

Source code in src/patientflow/train/sequence_predictor.py
def train_sequence_predictor(
    train_visits: DataFrame,
    model_name: str,
    visit_col: str,
    input_var: str,
    grouping_var: str,
    outcome_var: str,
) -> SequenceToOutcomePredictor:
    """
    Train a specialty prediction model.

    Parameters
    ----------
    train_visits : DataFrame
        Training data containing visit information.
    model_name : str
        Name identifier for the model.
    visit_col : str
        Column name containing visit identifiers.
    input_var : str
        Column name for input sequence.
    grouping_var : str
        Column name for grouping sequence.
    outcome_var : str
        Column name for target variable.

    Returns
    -------
    SequenceToOutcomePredictor
        Trained SequenceToOutcomePredictor model.
    """
    visits_single = select_one_snapshot_per_visit(train_visits, visit_col)
    admitted = visits_single[
        (visits_single.is_admitted) & ~(visits_single.specialty.isnull())
    ]
    filtered_admitted = get_default_visits(admitted)

    filtered_admitted.loc[:, input_var] = filtered_admitted[input_var].apply(
        lambda x: tuple(x) if x else ()
    )
    filtered_admitted.loc[:, grouping_var] = filtered_admitted[grouping_var].apply(
        lambda x: tuple(x) if x else ()
    )

    spec_model = SequenceToOutcomePredictor(
        input_var=input_var,
        grouping_var=grouping_var,
        outcome_var=outcome_var,
    )
    spec_model.fit(filtered_admitted)

    return spec_model

utils

Utility functions for the training workflow, currently covering saving trained models to disk.

save_model(model, model_name, model_file_path)

Save trained model(s) to disk.

Parameters:

Name Type Description Default
model object or dict

A single model instance or a dictionary of models to save.

required
model_name str

Base name to use for saving the model(s).

required
model_file_path Path

Directory path where the model(s) will be saved.

required

Returns:

Type Description
None
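Examples:

A sketch assuming model is a fitted model (or a dictionary of fitted models) and the target directory already exists:

>>> from pathlib import Path
>>> save_model(model, "admissions", Path("trained-models"))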
Source code in src/patientflow/train/utils.py
def save_model(model, model_name, model_file_path):
    """
    Save trained model(s) to disk.

    Parameters
    ----------
    model : object or dict
        A single model instance or a dictionary of models to save.
    model_name : str
        Base name to use for saving the model(s).
    model_file_path : Path
        Directory path where the model(s) will be saved.

    Returns
    -------
    None
    """

    if isinstance(model, dict):
        # Handle dictionary of models (e.g., admission models)
        for name, m in model.items():
            full_path = model_file_path / name
            full_path = full_path.with_suffix(".joblib")
            dump(m, full_path)
    else:
        # Handle single model (e.g., specialty or yet-to-arrive model)
        full_path = model_file_path / model_name
        full_path = full_path.with_suffix(".joblib")
        dump(model, full_path)

viz

Visualization module for patient flow analysis.

This module provides various plotting and visualization functions for analyzing patient flow data, model results, and evaluation metrics.

arrival_rates

Visualization functions for inpatient arrival rates and cumulative statistics.

This module provides functions to visualize time-varying arrival rates and cumulative arrivals over the course of a day.

Functions:

Name Description
annotate_hour_line : function

Annotate hour lines on a matplotlib plot

plot_arrival_rates : function

Plot arrival rates for one or two datasets

plot_cumulative_arrival_rates : function

Plot cumulative arrival rates with statistical distributions

annotate_hour_line(hour_line, y_value, hour_values, start_plot_index, line_styles, x_margin, annotation_prefix, text_y_offset=1, text_x_position=None, slope=None, x1=None, y1=None)

Annotate hour lines on a matplotlib plot with consistent formatting.

Parameters:

Name Type Description Default
hour_line int

The hour to annotate on the plot.

required
y_value float

The y-coordinate for annotation positioning.

required
hour_values list of int

Hour values corresponding to the x-axis positions.

required
start_plot_index int

Starting index for the plot's data.

required
line_styles dict

Line styles for annotations keyed by hour.

required
x_margin float

Margin added to x-axis for annotation positioning.

required
annotation_prefix str

Prefix for the annotation text (e.g., "On average").

required
text_y_offset float

Vertical offset for the annotation text from the line, by default 1.

1
text_x_position float

Horizontal position for annotation text, by default None.

None
slope float

Slope of a line for extended annotations, by default None.

None
x1 float

Reference x-coordinate for slope-based annotation, by default None.

None
y1 float

Reference y-coordinate for slope-based annotation, by default None.

None
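Examples:

A sketch annotating the 12:00 hour line on the current axes, assuming a matplotlib figure is already open and the x-axis positions run from 0 to 23 (values illustrative):

>>> annotate_hour_line(
...     hour_line=12,
...     y_value=25.0,
...     hour_values=list(range(24)),
...     start_plot_index=0,
...     line_styles={12: "--"},
...     x_margin=0.5,
...     annotation_prefix="On average",
... )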
Source code in src/patientflow/viz/arrival_rates.py
def annotate_hour_line(
    hour_line,
    y_value,
    hour_values,
    start_plot_index,
    line_styles,
    x_margin,
    annotation_prefix,
    text_y_offset=1,
    text_x_position=None,
    slope=None,
    x1=None,
    y1=None,
):
    """Annotate hour lines on a matplotlib plot with consistent formatting.

    Parameters
    ----------
    hour_line : int
        The hour to annotate on the plot.
    y_value : float
        The y-coordinate for annotation positioning.
    hour_values : list of int
        Hour values corresponding to the x-axis positions.
    start_plot_index : int
        Starting index for the plot's data.
    line_styles : dict
        Line styles for annotations keyed by hour.
    x_margin : float
        Margin added to x-axis for annotation positioning.
    annotation_prefix : str
        Prefix for the annotation text (e.g., "On average").
    text_y_offset : float, optional
        Vertical offset for the annotation text from the line, by default 1.
    text_x_position : float, optional
        Horizontal position for annotation text, by default None.
    slope : float, optional
        Slope of a line for extended annotations, by default None.
    x1 : float, optional
        Reference x-coordinate for slope-based annotation, by default None.
    y1 : float, optional
        Reference y-coordinate for slope-based annotation, by default None.
    """
    a = hour_values[hour_line - start_plot_index]
    if slope is not None and x1 is not None:
        y_a = slope * (a - x1) + y1
        plt.plot([a, a], [0, y_a], color="grey", linestyle=line_styles[hour_line])
        plt.plot(
            [0 - x_margin, a],
            [y_a, y_a],
            color="grey",
            linestyle=line_styles[hour_line],
        )
        annotation_text = (
            f"{annotation_prefix}, {int(y_a)} beds needed by {hour_line}:00"
        )
        y_position = y_a + text_y_offset
    else:
        plt.annotate(
            "",
            xy=(a, y_value),
            xytext=(a, 0),
            arrowprops=dict(
                arrowstyle="-", linestyle=line_styles[hour_line], color="grey"
            ),
        )
        plt.annotate(
            "",
            xy=(a, y_value),
            xytext=(hour_values[0] - x_margin, y_value),
            arrowprops=dict(
                arrowstyle="-", linestyle=line_styles[hour_line], color="grey"
            ),
        )
        annotation_text = (
            f"{annotation_prefix}, {int(y_value)} beds needed by {hour_line}:00"
        ).lstrip(", ")  # drop the leading comma and space when annotation_prefix is empty
        y_position = y_value + text_y_offset

    # Use custom text x position if provided, otherwise use default
    x_position = (
        text_x_position if text_x_position is not None else (hour_values[1] - x_margin)
    )

    plt.annotate(
        annotation_text,
        xy=(a / 2 if slope is not None else a, y_value),
        xytext=(x_position, y_position),
        va="bottom",
        ha="left",
        fontsize=10,
    )

draw_window_visualization(ax, hour_values, window_params, annotation_prefix, start_window, end_window)

Draw the window visualization with annotations.

Parameters:

Name Type Description Default
ax Axes

The axes to draw on

required
hour_values array-like

Hour labels for x-axis

required
window_params tuple

(slope, x1, y1, x2, y2) from get_window_parameters

required
annotation_prefix str

Prefix for annotations

required
start_window int

Start hour for window

required
end_window int

End hour for window

required
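Examples:

A sketch with a hand-built window_params tuple of the form returned by get_window_parameters; all values are illustrative:

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> window_params = (10.0, 10, 30, 12, 50)  # (slope, x1, y1, x2, y2)
>>> draw_window_visualization(ax, [8, 9, 10, 11, 12], window_params, "On average", 10, 12)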
Source code in src/patientflow/viz/arrival_rates.py
def draw_window_visualization(
    ax, hour_values, window_params, annotation_prefix, start_window, end_window
):
    """Draw the window visualization with annotations.

    Parameters
    ----------
    ax : matplotlib.axes.Axes
        The axes to draw on
    hour_values : array-like
        Hour labels for x-axis
    window_params : tuple
        (slope, x1, y1, x2, y2) from get_window_parameters
    annotation_prefix : str
        Prefix for annotations
    start_window : int
        Start hour for window
    end_window : int
        End hour for window
    """
    slope, x1, y1, x2, y2 = window_params

    # Draw horizontal line
    ax.hlines(y=y2, xmin=x2, xmax=hour_values[-1], color="blue", linestyle="--")

    # Draw diagonal line
    ax.plot([x1, x2], [y1, y2], color="blue", linestyle="--")

    # Add annotation
    ax.annotate(
        f"{annotation_prefix}, {slope:.0f} beds need to be vacated\n"
        f"each hour between {start_window}:00 and {end_window}:00\n"
        f"to create capacity for all overnight arrivals\n"
        f"by {end_window}:00",
        xy=(hour_values[-1], y2 * 0.25),
        xytext=(hour_values[-1], y2 * 0.25),
        va="top",
        ha="right",
    )

get_window_parameters(data, start_window, end_window, hour_values)

Calculate window parameters for visualization.

Parameters:

Name Type Description Default
data array-like

Reindexed cumulative data

required
start_window int

Start position in reindexed space

required
end_window int

End position in reindexed space

required
hour_values array-like

Original hour values for display

required

Returns:

Type Description
tuple

(slope, x1, y1, x2, y2) where:

- slope: float, the calculated slope of the line
- x1: float, start hour value
- y1: float, start y-value
- x2: float, end hour value
- y2: float, end y-value
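Examples:

A worked example on a toy cumulative series, with the window running from position 2 to position 4; y1 is data[2] = 30, y2 is the final value 50, and the slope is (50 - 30) / (12 - 10) = 10.0:

>>> data = [10, 20, 30, 40, 50]
>>> hour_values = [8, 9, 10, 11, 12]
>>> get_window_parameters(data, 2, 4, hour_values)
(10.0, 10, 30, 12, 50)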

Source code in src/patientflow/viz/arrival_rates.py
def get_window_parameters(data, start_window, end_window, hour_values):
    """Calculate window parameters for visualization.

    Parameters
    ----------
    data : array-like
        Reindexed cumulative data
    start_window : int
        Start position in reindexed space
    end_window : int
        End position in reindexed space
    hour_values : array-like
        Original hour values for display

    Returns
    -------
    tuple
        (slope, x1, y1, x2, y2) where:
        - slope: float, The calculated slope of the line
        - x1: float, Start hour value
        - y1: float, Start y-value
        - x2: float, End hour value
        - y2: float, End y-value
    """
    y1 = data[start_window]
    y2 = data[-1]
    x1 = hour_values[start_window]  # Get display hour
    x2 = hour_values[end_window]  # Get display hour
    slope = (y2 - y1) / (x2 - x1)

    return slope, x1, y1, x2, y2

plot_arrival_rates(inpatient_arrivals, title, inpatient_arrivals_2=None, labels=None, lagged_by=None, curve_params=None, time_interval=60, start_plot_index=0, x_margin=0.5, file_prefix='', media_file_path=None, file_name=None, num_days=None, num_days_2=None, return_figure=False)

Plot arrival rates for one or two datasets with optional lagged and spread rates.

Parameters:

Name Type Description Default
inpatient_arrivals array-like

Primary dataset of inpatient arrivals.

required
title str

Title of the plot.

required
inpatient_arrivals_2 array-like

Optional second dataset for comparison, by default None.

None
labels tuple of str

Labels for the datasets when comparing two datasets, by default None.

None
lagged_by int

Time lag in hours to apply to the arrival rates, by default None.

None
curve_params tuple of float

Parameters for spread arrival rates as (x1, y1, x2, y2), by default None.

None
time_interval int

Time interval in minutes for arrival rate calculations, by default 60.

60
start_plot_index int

Starting hour index for plotting, by default 0.

0
x_margin float

Margin on the x-axis, by default 0.5.

0.5
file_prefix str

Prefix for the saved file name, by default "".

''
media_file_path str or Path

Directory path to save the plot, by default None.

None
file_name str

Custom filename to use when saving the plot. If not provided, uses file_prefix + cleaned title.

None
num_days int

Number of days in the first dataset, by default None.

None
num_days_2 int

Number of days in the second dataset, by default None.

None
return_figure bool

If True, returns the matplotlib figure instead of displaying it, by default False.

False

Returns:

Type Description
Figure or None

Returns the figure if return_figure is True, otherwise displays the plot.
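Examples:

A minimal sketch, assuming inpatient_arrivals is a prepared collection of arrival datetimes (the title and lag are illustrative):

>>> plot_arrival_rates(
...     inpatient_arrivals,
...     title="Arrival rates of admitted patients by hour of day",
...     lagged_by=4,
...     time_interval=60,
... )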

Source code in src/patientflow/viz/arrival_rates.py
def plot_arrival_rates(
    inpatient_arrivals,
    title,
    inpatient_arrivals_2=None,
    labels=None,
    lagged_by=None,
    curve_params=None,
    time_interval=60,
    start_plot_index=0,
    x_margin=0.5,
    file_prefix="",
    media_file_path=None,
    file_name=None,
    num_days=None,
    num_days_2=None,
    return_figure=False,
):
    """Plot arrival rates for one or two datasets with optional lagged and spread rates.

    Parameters
    ----------
    inpatient_arrivals : array-like
        Primary dataset of inpatient arrivals.
    title : str
        Title of the plot.
    inpatient_arrivals_2 : array-like, optional
        Optional second dataset for comparison, by default None.
    labels : tuple of str, optional
        Labels for the datasets when comparing two datasets, by default None.
    lagged_by : int, optional
        Time lag in hours to apply to the arrival rates, by default None.
    curve_params : tuple of float, optional
        Parameters for spread arrival rates as (x1, y1, x2, y2), by default None.
    time_interval : int, optional
        Time interval in minutes for arrival rate calculations, by default 60.
    start_plot_index : int, optional
        Starting hour index for plotting, by default 0.
    x_margin : float, optional
        Margin on the x-axis, by default 0.5.
    file_prefix : str, optional
        Prefix for the saved file name, by default "".
    media_file_path : str or Path, optional
        Directory path to save the plot, by default None.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, uses file_prefix + cleaned title.
    num_days : int, optional
        Number of days in the first dataset, by default None.
    num_days_2 : int, optional
        Number of days in the second dataset, by default None.
    return_figure : bool, optional
        If True, returns the matplotlib figure instead of displaying it, by default False.

    Returns
    -------
    matplotlib.figure.Figure or None
        Returns the figure if return_figure is True, otherwise displays the plot.
    """
    is_dual_plot = inpatient_arrivals_2 is not None
    if is_dual_plot and labels is None:
        labels = ("Dataset 1", "Dataset 2")

    datasets = [(inpatient_arrivals, "C0", "o", num_days)]
    if is_dual_plot:
        datasets.append((inpatient_arrivals_2, "C1", "s", num_days_2))

    # Calculate and process arrival rates for all datasets
    processed_data = []
    max_y_values = []

    for dataset, color, marker, num_days in datasets:
        # Calculate base arrival rates
        arrival_rates_dict = time_varying_arrival_rates(
            dataset, time_interval, num_days=num_days
        )
        arrival_rates, hour_labels, hour_values = process_arrival_rates(
            arrival_rates_dict
        )
        max_y_values.append(max(arrival_rates))

        # Calculate lagged rates if needed
        arrival_rates_lagged = None
        if lagged_by is not None:
            arrival_rates_lagged_dict = time_varying_arrival_rates_lagged(
                dataset, lagged_by, yta_time_interval=time_interval, num_days=num_days
            )
            arrival_rates_lagged, _, _ = process_arrival_rates(
                arrival_rates_lagged_dict
            )
            max_y_values.append(max(arrival_rates_lagged))

        # Calculate spread rates if needed
        arrival_rates_spread = None
        if curve_params is not None:
            x1, y1, x2, y2 = curve_params
            arrival_rates_spread_dict = unfettered_demand_by_hour(
                dataset, x1, y1, x2, y2, num_days=num_days
            )
            arrival_rates_spread, _, _ = process_arrival_rates(
                arrival_rates_spread_dict
            )
            max_y_values.append(max(arrival_rates_spread))

        processed_data.append(
            {
                "arrival_rates": arrival_rates,
                "arrival_rates_lagged": arrival_rates_lagged,
                "arrival_rates_spread": arrival_rates_spread,
                "color": color,
                "marker": marker,
                "dataset_label": labels[len(processed_data)] if is_dual_plot else None,
            }
        )

    # Helper function to create cyclic data
    def get_cyclic_data(data):
        return data[start_plot_index:] + data[0:start_plot_index]

    # Plot setup
    fig = plt.figure(figsize=(10, 6))
    x_values = get_cyclic_data(hour_labels)

    # Plot data for each dataset
    for data in processed_data:
        dataset_suffix = f" ({data['dataset_label']})" if data["dataset_label"] else ""

        # Base arrival rates
        base_label = f"Arrival rates of admitted patients{dataset_suffix}"
        plt.plot(
            x_values,
            get_cyclic_data(data["arrival_rates"]),
            marker="x",
            color=data["color"],
            markersize=4,
            linestyle=":" if (curve_params or lagged_by) else "-",
            linewidth=1 if (curve_params or lagged_by) else None,
            label=base_label,
        )

        if lagged_by is not None:
            # Lagged arrival rates
            lagged_label = f"Average number of beds needed assuming admission\nexactly {lagged_by} hours after arrival{dataset_suffix}"
            plt.plot(
                x_values,
                get_cyclic_data(data["arrival_rates_lagged"]),
                marker="o",
                markersize=4,
                color=data["color"],
                linestyle="--",
                linewidth=1,
                label=lagged_label,
            )

        if curve_params is not None and data["arrival_rates_spread"] is not None:
            # Spread arrival rates
            spread_label = f"Average number of beds applying ED targets of {int(y1*100)}% in {int(x1)} hours{dataset_suffix}"
            plt.plot(
                x_values,
                get_cyclic_data(data["arrival_rates_spread"]),
                marker=data["marker"],  # Keep original dataset marker
                color=data["color"],  # Keep original dataset color
                label=spread_label,
            )

    # Set plot limits and labels
    plt.ylim(0, max(max_y_values) + 0.25)
    plt.xlim(hour_values[0] - x_margin, hour_values[-1] + x_margin)

    plt.xlabel("Hour of day")
    plt.ylabel("Arrival Rate (patients per hour)")
    plt.title(title)
    plt.grid(True, alpha=0.3)

    # Always show legend if there are multiple datasets or multiple rate types
    if is_dual_plot or lagged_by is not None or curve_params is not None:
        plt.legend()

    plt.tight_layout()

    # Save if path provided
    if media_file_path:
        if file_name:
            filename = file_name
        else:
            filename = f"{file_prefix}{clean_title_for_filename(title)}"
        plt.savefig(media_file_path / filename, dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()

plot_cumulative_arrival_rates(inpatient_arrivals, title, curve_params=None, lagged_by=None, time_interval=60, start_plot_index=0, draw_window=None, x_margin=0.5, file_prefix='', set_y_lim=None, hour_lines=[12, 17], line_styles={12: '--', 17: ':', 20: '--'}, annotation_prefix='On average', line_colour='red', media_file_path=None, file_name=None, plot_centiles=False, highlight_centile=0.9, centiles=[0.3, 0.5, 0.7, 0.9, 0.99], markers=['D', 's', '^', 'o', 'v'], line_styles_centiles=['-.', '--', ':', '-', '-'], bed_type_spec='', text_y_offset=1, num_days=None, return_figure=False)

Plot cumulative arrival rates with optional statistical distributions.

Parameters:

Name Type Description Default
inpatient_arrivals array-like

Dataset of inpatient arrivals.

required
title str

Title of the plot.

required
curve_params tuple of float

Parameters for spread rates as (x1, y1, x2, y2), by default None.

None
lagged_by int

Time lag in hours for cumulative rates, by default None.

None
time_interval int

Time interval in minutes for rate calculations, by default 60.

60
start_plot_index int

Starting hour index for plotting, by default 0.

0
draw_window tuple of int

Time window for detailed annotation, by default None.

None
x_margin float

Margin on the x-axis, by default 0.5.

0.5
file_prefix str

Prefix for the saved file name, by default "".

''
set_y_lim float

Upper limit for the y-axis, by default None.

None
hour_lines list of int

Specific hours to annotate, by default [12, 17].

[12, 17]
line_styles dict

Line styles for hour annotations keyed by hour, by default {12: "--", 17: ":", 20: "--"}.

{12: '--', 17: ':', 20: '--'}
annotation_prefix str

Prefix for annotations, by default "On average".

'On average'
line_colour str

Color for the main line plot, by default "red".

'red'
media_file_path str or Path

Directory path to save the plot, by default None.

None
file_name str

Custom filename to use when saving the plot. If not provided, uses file_prefix + cleaned title.

None
plot_centiles bool

Whether to include percentile visualization, by default False.

False
highlight_centile float

Percentile to emphasize, by default 0.9. If 1.0 is provided, will use 0.9999 instead.

0.9
centiles list of float

List of percentiles to calculate, by default [0.3, 0.5, 0.7, 0.9, 0.99].

[0.3, 0.5, 0.7, 0.9, 0.99]
markers list of str

Marker styles for percentile lines, by default ["D", "s", "^", "o", "v"].

['D', 's', '^', 'o', 'v']
line_styles_centiles list of str

Line styles for percentile visualization, by default ["-.", "--", ":", "-", "-"].

['-.', '--', ':', '-', '-']
bed_type_spec str

Specification for bed type in annotations, by default "".

''
text_y_offset float

Vertical offset for text annotations, by default 1.

1
num_days int

Number of days in the dataset, by default None.

None
return_figure bool

If True, returns the matplotlib figure instead of displaying it, by default False.

False

Returns:

Type Description
Figure or None

Returns the figure if return_figure is True, otherwise displays the plot.
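Examples:

A sketch showing cumulative demand with percentile bands, under the same data assumptions as plot_arrival_rates:

>>> plot_cumulative_arrival_rates(
...     inpatient_arrivals,
...     title="Cumulative beds needed by hour of day",
...     plot_centiles=True,
...     highlight_centile=0.9,
... )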

Source code in src/patientflow/viz/arrival_rates.py
def plot_cumulative_arrival_rates(
    inpatient_arrivals,
    title,
    curve_params=None,
    lagged_by=None,
    time_interval=60,
    start_plot_index=0,
    draw_window=None,
    x_margin=0.5,
    file_prefix="",
    set_y_lim=None,
    hour_lines=[12, 17],
    line_styles={12: "--", 17: ":", 20: "--"},
    annotation_prefix="On average",
    line_colour="red",
    media_file_path=None,
    file_name=None,
    plot_centiles=False,
    highlight_centile=0.9,
    centiles=[0.3, 0.5, 0.7, 0.9, 0.99],
    markers=["D", "s", "^", "o", "v"],
    line_styles_centiles=["-.", "--", ":", "-", "-"],
    bed_type_spec="",
    text_y_offset=1,
    num_days=None,
    return_figure=False,
):
    """Plot cumulative arrival rates with optional statistical distributions.

    Parameters
    ----------
    inpatient_arrivals : array-like
        Dataset of inpatient arrivals.
    title : str
        Title of the plot.
    curve_params : tuple of float, optional
        Parameters for spread rates as (x1, y1, x2, y2), by default None.
    lagged_by : int, optional
        Time lag in hours for cumulative rates, by default None.
    time_interval : int, optional
        Time interval in minutes for rate calculations, by default 60.
    start_plot_index : int, optional
        Starting hour index for plotting, by default 0.
    draw_window : tuple of int, optional
        Time window for detailed annotation, by default None.
    x_margin : float, optional
        Margin on the x-axis, by default 0.5.
    file_prefix : str, optional
        Prefix for the saved file name, by default "".
    set_y_lim : float, optional
        Upper limit for the y-axis, by default None.
    hour_lines : list of int, optional
        Specific hours to annotate, by default [12, 17].
    line_styles : dict, optional
        Line styles for hour annotations keyed by hour, by default {12: "--", 17: ":", 20: "--"}.
    annotation_prefix : str, optional
        Prefix for annotations, by default "On average".
    line_colour : str, optional
        Color for the main line plot, by default "red".
    media_file_path : str or Path, optional
        Directory path to save the plot, by default None.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, uses file_prefix + cleaned title.
    plot_centiles : bool, optional
        Whether to include percentile visualization, by default False.
    highlight_centile : float, optional
        Percentile to emphasize, by default 0.9. If 1.0 is provided, will use 0.9999 instead.
    centiles : list of float, optional
        List of percentiles to calculate, by default [0.3, 0.5, 0.7, 0.9, 0.99].
    markers : list of str, optional
        Marker styles for percentile lines, by default ["D", "s", "^", "o", "v"].
    line_styles_centiles : list of str, optional
        Line styles for percentile visualization, by default ["-.", "--", ":", "-", "-"].
    bed_type_spec : str, optional
        Specification for bed type in annotations, by default "".
    text_y_offset : float, optional
        Vertical offset for text annotations, by default 1.
    num_days : int, optional
        Number of days in the dataset, by default None.
    return_figure : bool, optional
        If True, returns the matplotlib figure instead of displaying it, by default False.

    Returns
    -------
    matplotlib.figure.Figure or None
        Returns the figure if return_figure is True, otherwise displays the plot.
    """

    # Handle edge case for highlight_centile = 1.0
    original_highlight_centile = highlight_centile
    if highlight_centile >= 1.0:
        highlight_centile = 0.9999  # Use a very high but not exactly 1.0 value

    # Ensure centiles are all valid (no 1.0 values)
    processed_centiles = [min(c, 0.9999) if c >= 1.0 else c for c in centiles]

    # Data processing
    if curve_params is not None:
        x1, y1, x2, y2 = curve_params
        arrival_rates_dict = unfettered_demand_by_hour(
            inpatient_arrivals, x1, y1, x2, y2, num_days=num_days
        )
    elif lagged_by is not None:
        arrival_rates_dict = time_varying_arrival_rates_lagged(
            inpatient_arrivals, lagged_by, time_interval, num_days=num_days
        )
    else:
        arrival_rates_dict = time_varying_arrival_rates(
            inpatient_arrivals, time_interval, num_days=num_days
        )

    # Process arrival rates
    arrival_rates, hour_labels, hour_values = process_arrival_rates(arrival_rates_dict)

    # Reindex based on start_plot_index
    rates_reindexed = (
        list(arrival_rates)[start_plot_index:] + list(arrival_rates)[0:start_plot_index]
    )
    labels_reindexed = (
        list(hour_labels)[start_plot_index:] + list(hour_labels)[0:start_plot_index]
    )

    # Set up plot
    fig = plt.figure(figsize=(10, 6))
    ax = plt.gca()

    # Plot mean line
    label_suffix = f" {bed_type_spec} beds needed" if bed_type_spec else " beds needed"
    cumsum_rates = np.cumsum(rates_reindexed)

    plt.plot(
        labels_reindexed,
        cumsum_rates,
        marker="o",
        markersize=3,
        color=line_colour,
        linewidth=2,
        alpha=0.7,
        label=f"Average number of{label_suffix}",
    )

    # set max y value assuming centiles not plotted
    max_y = cumsum_rates[-1]

    if plot_centiles:
        # Calculate and plot percentiles
        percentiles = [[] for _ in range(len(processed_centiles))]
        cumulative_value_at_centile = np.zeros(len(processed_centiles))
        highlight_percentile_data = None

        # Find the index of highlight_centile in processed_centiles
        highlight_index = -1
        for i, c in enumerate(processed_centiles):
            if (
                abs(c - highlight_centile) < 0.0001
            ):  # Use small epsilon for float comparison
                highlight_index = i
                break

        # If highlight_centile is not in processed_centiles, add it
        if highlight_index == -1:
            processed_centiles.append(highlight_centile)
            percentiles.append([])
            cumulative_value_at_centile = np.append(cumulative_value_at_centile, 0)

        for hour in range(len(rates_reindexed)):
            for i, centile in enumerate(processed_centiles):
                try:
                    # Add error handling for ppf calculation
                    value_at_centile = stats.poisson.ppf(centile, rates_reindexed[hour])

                    # Apply a reasonable upper limit if the value is extremely large
                    if (
                        np.isinf(value_at_centile)
                        or value_at_centile > 1000 * rates_reindexed[hour]
                    ):
                        value_at_centile = 10 * rates_reindexed[hour]

                    cumulative_value_at_centile[i] += value_at_centile
                    percentiles[i].append(value_at_centile)
                except (ValueError, OverflowError, RuntimeError):
                    # Fallback if calculation fails
                    fallback_value = 10 * rates_reindexed[hour]
                    cumulative_value_at_centile[i] += fallback_value
                    percentiles[i].append(fallback_value)

                # Match the highlight centile to the processed value
                if (
                    abs(centile - highlight_centile) < 0.0001
                ):  # Use a small epsilon for floating point comparison
                    highlight_percentile_data = np.cumsum(percentiles[i])

        # Plot percentile lines
        for i, centile in enumerate(processed_centiles):
            marker = markers[i % len(markers)]
            line_style = line_styles_centiles[i % len(line_styles_centiles)]
            linewidth = 2 if centile == highlight_centile else 1
            alpha = 1.0 if centile == highlight_centile else 0.7

            # If the user requested 1.0, display as 99.99% since a Poisson distribution
            # cannot provide exact 100% probability with any finite value
            display_centile = processed_centiles[i]
            if centile == highlight_centile and original_highlight_centile >= 1.0:
                display_centile = (
                    0.9999  # Use 99.99% as the highest displayable probability
                )

            # Format the label text with appropriate precision
            if display_centile >= 0.999:
                # For very high probabilities, show as 99.9% or 99.99% to avoid implying exact 100%
                label_text = f"{display_centile*100:.2f}% probability"
            else:
                label_text = f"{display_centile*100:.0f}% probability"

            cumsum_percentile = np.cumsum(percentiles[i])
            plt.plot(
                labels_reindexed,
                cumsum_percentile,
                marker=marker,
                markersize=3,
                linestyle=line_style,
                color="C0",
                linewidth=linewidth,
                alpha=alpha,
                label=label_text,
            )
        # update max y
        max_y = max(cumulative_value_at_centile)

        # Draw window visualization if requested
        if draw_window:
            start_window, end_window = draw_window
            reindexed_start = (start_window - start_plot_index) % len(
                highlight_percentile_data
            )
            reindexed_end = (end_window - start_plot_index) % len(
                highlight_percentile_data
            )
            window_params = get_window_parameters(
                highlight_percentile_data, reindexed_start, reindexed_end, hour_values
            )
            draw_window_visualization(
                ax,
                hour_values,
                window_params,
                annotation_prefix,
                start_window,
                end_window,
            )
            slope, x1, y1, x2, y2 = window_params
            for hour_line in hour_lines:
                annotate_hour_line(
                    hour_line=hour_line,
                    y_value=y1,
                    hour_values=hour_values,
                    start_plot_index=start_plot_index,
                    line_styles=line_styles,
                    x_margin=x_margin,
                    annotation_prefix=annotation_prefix,
                    slope=slope,
                    x1=x1,
                    y1=y1,
                )

        else:
            # Regular percentile annotations
            for hour_line in hour_lines:
                # Check if highlight_percentile_data is available
                if highlight_percentile_data is None:
                    # Fall back to mean line if no highlight data
                    cumsum_at_hour = cumsum_rates[hour_line - start_plot_index]
                else:
                    cumsum_at_hour = highlight_percentile_data[
                        hour_line - start_plot_index
                    ]
                annotate_hour_line(
                    hour_line=hour_line,
                    y_value=cumsum_at_hour,
                    hour_values=hour_values,
                    start_plot_index=start_plot_index,
                    line_styles=line_styles,
                    x_margin=x_margin,
                    annotation_prefix=annotation_prefix,
                    text_y_offset=text_y_offset,
                )

        # Reverse legend order
        handles, labels = plt.gca().get_legend_handles_labels()
        plt.legend(handles[::-1], labels[::-1], loc="upper left")
    else:
        plt.legend(loc="upper left")

        if draw_window:
            start_window, end_window = draw_window
            reindexed_start = (start_window - start_plot_index) % len(cumsum_rates)
            reindexed_end = (end_window - start_plot_index) % len(cumsum_rates)
            window_params = get_window_parameters(
                cumsum_rates, reindexed_start, reindexed_end, hour_values
            )
            draw_window_visualization(
                ax,
                hour_values,
                window_params,
                annotation_prefix,
                start_window,
                end_window,
            )
            slope, x1, y1, x2, y2 = window_params
            for hour_line in hour_lines:
                annotate_hour_line(
                    hour_line=hour_line,
                    y_value=y1,
                    hour_values=hour_values,
                    start_plot_index=start_plot_index,
                    line_styles=line_styles,
                    x_margin=x_margin,
                    annotation_prefix=annotation_prefix,
                    slope=slope,
                    x1=x1,
                    y1=y1,
                )
        else:
            # Regular mean line annotations
            for hour_line in hour_lines:
                annotate_hour_line(
                    hour_line=hour_line,
                    y_value=cumsum_rates[hour_line - start_plot_index],
                    hour_values=hour_values,
                    start_plot_index=start_plot_index,
                    line_styles=line_styles,
                    x_margin=x_margin,
                    annotation_prefix=annotation_prefix,
                )

    plt.xlabel("Hour of day")
    plt.ylabel("Cumulative number of beds needed")
    plt.xlim(hour_values[0] - x_margin, hour_values[-1] + x_margin)
    plt.ylim(0, set_y_lim if set_y_lim else max(max_y + 2, max_y * 1.2))
    plt.minorticks_on()
    plt.gca().yaxis.set_minor_locator(plt.MultipleLocator(5))

    plt.title(title)
    plt.tight_layout()

    if media_file_path:
        if file_name:
            filename = file_name
        else:
            filename = f"{file_prefix}{clean_title_for_filename(title)}"
        plt.savefig(media_file_path / filename, dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
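
The modulo arithmetic in the draw_window branch above handles prediction windows that wrap past midnight. A minimal standalone sketch of the reindexing, with hypothetical values (not the package's internal code):

# Hours are plotted from start_plot_index onwards (e.g. 8 = 8am),
# so a window of (22, 2) - 10pm to 2am - wraps past midnight.
start_plot_index = 8
n_hours = 24  # stands in for len(cumsum_rates)
start_window, end_window = 22, 2
reindexed_start = (start_window - start_plot_index) % n_hours  # 14
reindexed_end = (end_window - start_plot_index) % n_hours  # 18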

aspirational_curve

Visualization module for plotting aspirational curves in patient flow analysis.

This module provides functionality for creating and customizing plots of aspirational curves, which represent the probability of admission over time. These curves are useful for setting aspirational targets in healthcare settings.

Functions:

Name Description
plot_curve : function

Plot an aspirational curve with specified points and optional annotations

Examples:

>>> plot_curve(
...     title="Admission Probability Curve",
...     x1=4,
...     y1=0.2,
...     x2=24,
...     y2=0.8,
...     include_titles=True
... )

plot_curve(title, x1, y1, x2, y2, figsize=(10, 5), include_titles=False, text_size=14, media_file_path=None, file_name=None, return_figure=False, annotate_points=False)

Plot an aspirational curve with specified points and optional annotations.

This function creates a plot of an aspirational curve between two points, with options for customization of the visualization including titles, annotations, and saving to a file.

Parameters:

Name Type Description Default
title str

The title of the plot.

required
x1 float

x-coordinate of the first point.

required
y1 float

y-coordinate of the first point (probability value).

required
x2 float

x-coordinate of the second point.

required
y2 float

y-coordinate of the second point (probability value).

required
figsize tuple of int

Figure size in inches (width, height), by default (10, 5).

(10, 5)
include_titles bool

Whether to include axis labels and title, by default False.

False
text_size int

Font size for text elements, by default 14.

14
media_file_path str or Path

Path to save the plot image, by default None.

None
file_name str

Custom filename for saving the plot. If not provided, uses a cleaned version of the title.

None
return_figure bool

Whether to return the figure object instead of displaying it, by default False.

False
annotate_points bool

Whether to add coordinate annotations to the points, by default False.

False

Returns:

Type Description
Figure or None

The figure object if return_figure is True, otherwise None.

Notes

The function creates a curve between two points using the create_curve function and adds various visualization elements including grid lines, annotations, and optional titles.

Source code in src/patientflow/viz/aspirational_curve.py
def plot_curve(
    title,
    x1,
    y1,
    x2,
    y2,
    figsize=(10, 5),
    include_titles=False,
    text_size=14,
    media_file_path=None,
    file_name=None,
    return_figure=False,
    annotate_points=False,
):
    """Plot an aspirational curve with specified points and optional annotations.

    This function creates a plot of an aspirational curve between two points,
    with options for customization of the visualization including titles,
    annotations, and saving to a file.

    Parameters
    ----------
    title : str
        The title of the plot.
    x1 : float
        x-coordinate of the first point.
    y1 : float
        y-coordinate of the first point (probability value).
    x2 : float
        x-coordinate of the second point.
    y2 : float
        y-coordinate of the second point (probability value).
    figsize : tuple of int, optional
        Figure size in inches (width, height), by default (10, 5).
    include_titles : bool, optional
        Whether to include axis labels and title, by default False.
    text_size : int, optional
        Font size for text elements, by default 14.
    media_file_path : str or Path, optional
        Path to save the plot image, by default None.
    file_name : str, optional
        Custom filename for saving the plot. If not provided, uses a cleaned version of the title.
    return_figure : bool, optional
        Whether to return the figure object instead of displaying it, by default False.
    annotate_points : bool, optional
        Whether to add coordinate annotations to the points, by default False.

    Returns
    -------
    matplotlib.figure.Figure or None
        The figure object if return_figure is True, otherwise None.

    Notes
    -----
    The function creates a curve between two points using the create_curve function
    and adds various visualization elements including grid lines, annotations,
    and optional titles.
    """
    gamma, lamda, a, x_values, y_values = create_curve(
        x1, y1, x2, y2, generate_values=True
    )

    # Plot the curve
    fig = plt.figure(figsize=figsize)

    plt.plot(x_values, y_values)
    plt.scatter(x1, y1, color="red")  # Mark the point (x1, y1)
    plt.scatter(x2, y2, color="red")  # Mark the point (x2, y2)

    if annotate_points:
        plt.annotate(
            f"({x1}, {y1:.2f})",
            (x1, y1),
            xytext=(10, -15),
            textcoords="offset points",
            fontsize=text_size,
        )
        plt.annotate(
            f"({x2}, {y2:.2f})",
            (x2, y2),
            xytext=(10, -15),
            textcoords="offset points",
            fontsize=text_size,
        )

    if text_size:
        plt.tick_params(axis="both", which="major", labelsize=text_size)

    x_ticks = np.arange(min(x_values), max(x_values) + 1, 2)
    plt.xticks(x_ticks)

    if include_titles:
        plt.title(title, fontsize=text_size)
        plt.xlabel("Hours since admission", fontsize=text_size)
        plt.ylabel("Probability of admission by this point", fontsize=text_size)

    plt.axhline(y=y1, color="green", linestyle="--", label=f"y = {int(y1*100)}%")
    plt.axvline(x=x1, color="gray", linestyle="--", label=f"x = {x1} hours")
    plt.legend(fontsize=text_size)

    plt.tight_layout()

    if media_file_path:
        os.makedirs(media_file_path, exist_ok=True)
        if file_name:
            filename = file_name
        else:
            filename = clean_title_for_filename(title)
        plt.savefig(media_file_path / filename, dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
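
As a usage sketch, the figure can be kept for further styling and saved in one call (the figures directory and coordinate values here are placeholders):

>>> from pathlib import Path
>>> fig = plot_curve(
...     title="4-hour target curve",
...     x1=4,
...     y1=0.76,
...     x2=12,
...     y2=0.99,
...     media_file_path=Path("figures"),
...     return_figure=True,
... )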

calibration

Calibration plot visualization module.

This module creates calibration plots for trained models, showing how well the predicted probabilities align with actual outcomes.

Functions:

Name Description
plot_calibration : function

Plot calibration curves for multiple models

plot_calibration(trained_models, test_visits, exclude_from_training_data, strategy='uniform', media_file_path=None, file_name=None, suptitle=None, return_figure=False, label_col='is_admitted')

Plot calibration curves for multiple models.

A calibration plot shows how well the predicted probabilities from a model align with the actual outcomes. The plot compares the mean predicted probability with the fraction of positive outcomes for different probability bins.

Parameters:

Name Type Description Default
trained_models list[TrainedClassifier] or dict[str, TrainedClassifier]

List of TrainedClassifier objects or dictionary with TrainedClassifier values.

required
test_visits DataFrame

DataFrame containing test visit data.

required
exclude_from_training_data list

Columns to exclude from the test data.

required
strategy (uniform, quantile)

Strategy for calibration curve binning.
- 'uniform': Bins are of equal width
- 'quantile': Bins have equal number of samples

'uniform'
media_file_path Path

Path where the plot should be saved.

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "calibration_plot.png".

None
suptitle str

Optional super title for the entire figure.

None
return_figure bool

If True, returns the figure instead of displaying it.

False
label_col str

Name of the column containing the target labels.

'is_admitted'

Returns:

Type Description
Figure or None

If return_figure is True, returns the figure object. Otherwise, displays the plot and returns None.

Notes

The function creates a subplot for each trained model, sorted by prediction time. Each subplot shows the calibration curve and a reference line for perfect calibration.

Source code in src/patientflow/viz/calibration.py
def plot_calibration(
    trained_models: list[TrainedClassifier] | dict[str, TrainedClassifier],
    test_visits,
    exclude_from_training_data,
    strategy="uniform",
    media_file_path: Optional[Path] = None,
    file_name=None,
    suptitle=None,
    return_figure=False,
    label_col: str = "is_admitted",
):
    """Plot calibration curves for multiple models.

    A calibration plot shows how well the predicted probabilities from a model
    align with the actual outcomes. The plot compares the mean predicted probability
    with the fraction of positive outcomes for different probability bins.

    Parameters
    ----------
    trained_models : list[TrainedClassifier] or dict[str, TrainedClassifier]
        List of TrainedClassifier objects or dictionary with TrainedClassifier values.
    test_visits : pandas.DataFrame
        DataFrame containing test visit data.
    exclude_from_training_data : list
        Columns to exclude from the test data.
    strategy : {'uniform', 'quantile'}, default='uniform'
        Strategy for calibration curve binning.
        - 'uniform': Bins are of equal width
        - 'quantile': Bins have equal number of samples
    media_file_path : Path, optional
        Path where the plot should be saved.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "calibration_plot.png".
    suptitle : str, optional
        Optional super title for the entire figure.
    return_figure : bool, default=False
        If True, returns the figure instead of displaying it.
    label_col : str, default='is_admitted'
        Name of the column containing the target labels.

    Returns
    -------
    matplotlib.figure.Figure or None
        If return_figure is True, returns the figure object. Otherwise, displays
        the plot and returns None.

    Notes
    -----
    The function creates a subplot for each trained model, sorted by prediction time.
    Each subplot shows the calibration curve and a reference line for perfect calibration.
    """
    # Convert dict to list if needed
    if isinstance(trained_models, dict):
        trained_models = list(trained_models.values())

    # Sort trained_models by prediction time
    trained_models_sorted = sorted(
        trained_models,
        key=lambda x: x.training_results.prediction_time[0] * 60
        + x.training_results.prediction_time[1],
    )
    num_plots = len(trained_models_sorted)
    fig, axs = plt.subplots(1, num_plots, figsize=(num_plots * 5, 4))

    # Handle case of single prediction time
    if num_plots == 1:
        axs = [axs]

    for i, trained_model in enumerate(trained_models_sorted):
        # Use calibrated pipeline if available, otherwise use regular pipeline
        if (
            hasattr(trained_model, "calibrated_pipeline")
            and trained_model.calibrated_pipeline is not None
        ):
            pipeline = trained_model.calibrated_pipeline
        else:
            pipeline = trained_model.pipeline

        prediction_time = trained_model.training_results.prediction_time

        # Get test data for this prediction time
        X_test, y_test = prepare_patient_snapshots(
            df=test_visits,
            prediction_time=prediction_time,
            exclude_columns=exclude_from_training_data,
            single_snapshot_per_visit=False,
            label_col=label_col,
        )

        X_test = add_missing_columns(pipeline, X_test)

        prob_true, prob_pred = calibration_curve(
            y_test, pipeline.predict_proba(X_test)[:, 1], n_bins=10, strategy=strategy
        )

        ax = axs[i]
        hour, minutes = prediction_time

        ax.plot(
            prob_pred,
            prob_true,
            marker="o",
            linewidth=1,
            label="Predictions",
            color=primary_color,
        )
        ax.plot(
            [0, 1],
            [0, 1],
            linestyle="--",
            label="Perfectly calibrated",
            color=secondary_color,
        )
        ax.set_title(f"Calibration Plot for {hour}:{minutes:02}", fontsize=14)
        ax.set_xlabel("Mean Estimated Probability", fontsize=12)
        ax.set_ylabel("Fraction of Positives", fontsize=12)
        ax.legend()

    plt.tight_layout()

    # Add suptitle if provided
    if suptitle:
        plt.suptitle(suptitle, fontsize=16, y=1.05)

    if media_file_path:
        if file_name:
            calib_plot_path = media_file_path / file_name
        else:
            calib_plot_path = media_file_path / "calibration_plot.png"
        plt.savefig(calib_plot_path)

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close()
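
A usage sketch, assuming trained_models and test_visits have been prepared as described above (the excluded column names are placeholders):

>>> fig = plot_calibration(
...     trained_models=trained_models,
...     test_visits=test_visits,
...     exclude_from_training_data=["visit_id", "snapshot_date"],
...     strategy="quantile",
...     return_figure=True,
... )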

data_distribution

Visualisation module for plotting data distributions.

This module provides functions for creating distribution plots of data variables grouped by categories.

Functions:

Name Description
plot_data_distribution : function

Plot distributions of data variables grouped by categories

plot_data_distribution(df, col_name, grouping_var, grouping_var_name, plot_type='both', title=None, rotate_x_labels=False, is_discrete=False, ordinal_order=None, media_file_path=None, file_name=None, return_figure=False, truncate_outliers=True, outlier_method='zscore', outlier_threshold=2.0)

Plot distributions of data variables grouped by categories.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing the data to plot

required
col_name str

Name of the column to plot distributions for

required
grouping_var str

Name of the column to group the data by

required
grouping_var_name str

Display name for the grouping variable

required
plot_type (both, hist, kde)

Type of plot to create. 'both' shows histogram with KDE, 'hist' shows only histogram, 'kde' shows only KDE plot

'both'
title str

Title for the plot

None
rotate_x_labels bool

Whether to rotate x-axis labels by 90 degrees

False
is_discrete bool

Whether the data is discrete

False
ordinal_order list

Order of categories for ordinal data

None
media_file_path Path

Path where the plot should be saved

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "data_distributions.png".

None
return_figure bool

If True, returns the figure instead of displaying it

False
truncate_outliers bool

Whether to truncate the x-axis to exclude extreme outliers

True
outlier_method (iqr, zscore)

Method to detect outliers. 'iqr' uses interquartile range, 'zscore' uses z-score

'zscore'
outlier_threshold float

Threshold for outlier detection. For IQR method, this is the multiplier. For z-score method, this is the number of standard deviations.

2.0

Returns:

Type Description
FacetGrid or None

If return_figure is True, returns the FacetGrid object. Otherwise, displays the plot and returns None.

Raises:

Type Description
ValueError

If plot_type is not one of 'both', 'hist', or 'kde'.
If outlier_method is not one of 'iqr' or 'zscore'.

Source code in src/patientflow/viz/data_distribution.py
def plot_data_distribution(
    df,
    col_name,
    grouping_var,
    grouping_var_name,
    plot_type="both",
    title=None,
    rotate_x_labels=False,
    is_discrete=False,
    ordinal_order=None,
    media_file_path=None,
    file_name=None,
    return_figure=False,
    truncate_outliers=True,
    outlier_method="zscore",
    outlier_threshold=2.0,
):
    """Plot distributions of data variables grouped by categories.

    Parameters
    ----------
    df : pandas.DataFrame
        Input DataFrame containing the data to plot
    col_name : str
        Name of the column to plot distributions for
    grouping_var : str
        Name of the column to group the data by
    grouping_var_name : str
        Display name for the grouping variable
    plot_type : {'both', 'hist', 'kde'}, default='both'
        Type of plot to create. 'both' shows histogram with KDE, 'hist' shows
        only histogram, 'kde' shows only KDE plot
    title : str, optional
        Title for the plot
    rotate_x_labels : bool, default=False
        Whether to rotate x-axis labels by 90 degrees
    is_discrete : bool, default=False
        Whether the data is discrete
    ordinal_order : list, optional
        Order of categories for ordinal data
    media_file_path : Path, optional
        Path where the plot should be saved
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "data_distributions.png".
    return_figure : bool, default=False
        If True, returns the figure instead of displaying it
    truncate_outliers : bool, default=True
        Whether to truncate the x-axis to exclude extreme outliers
    outlier_method : {'iqr', 'zscore'}, default='zscore'
        Method to detect outliers. 'iqr' uses interquartile range, 'zscore' uses z-score
    outlier_threshold : float, default=2.0
        Threshold for outlier detection. For IQR method, this is the multiplier.
        For z-score method, this is the number of standard deviations.

    Returns
    -------
    seaborn.FacetGrid or None
        If return_figure is True, returns the FacetGrid object. Otherwise,
        displays the plot and returns None.

    Raises
    ------
    ValueError
        If plot_type is not one of 'both', 'hist', or 'kde'
        If outlier_method is not one of 'iqr' or 'zscore'
    """
    sns.set_theme(style="whitegrid")

    if ordinal_order is not None:
        df[col_name] = pd.Categorical(
            df[col_name], categories=ordinal_order, ordered=True
        )

    # Calculate outlier bounds if truncation is requested
    x_limits = None
    if truncate_outliers:
        values = df[col_name].dropna()
        if pd.api.types.is_numeric_dtype(values) and len(values) > 0:
            # Check if data is actually discrete (all values are integers)
            is_actually_discrete = np.allclose(values, values.round())

            # Apply outlier truncation to continuous data OR discrete data with outliers
            # For discrete data, we still want to truncate if there are extreme outliers
            if outlier_method == "iqr":
                Q1 = values.quantile(0.25)
                Q3 = values.quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - outlier_threshold * IQR
                upper_bound = Q3 + outlier_threshold * IQR
            elif outlier_method == "zscore":
                mean_val = values.mean()
                std_val = values.std()
                lower_bound = mean_val - outlier_threshold * std_val
                upper_bound = mean_val + outlier_threshold * std_val
            else:
                raise ValueError(
                    "Invalid outlier_method. Choose from 'iqr' or 'zscore'."
                )

            # Only apply truncation if there are actual outliers
            # For discrete data, ensure lower bound is at least 0
            if values.min() < lower_bound or values.max() > upper_bound:
                if is_actually_discrete:
                    # For discrete data, ensure bounds are reasonable
                    lower_bound = max(0, lower_bound)
                x_limits = (lower_bound, upper_bound)

    g = sns.FacetGrid(df, col=grouping_var, height=3, aspect=1.5)

    if is_discrete:
        valid_values = sorted([x for x in df[col_name].unique() if pd.notna(x)])
        min_val = min(valid_values)
        max_val = max(valid_values)
        bins = np.arange(min_val - 0.5, max_val + 1.5, 1)
    else:
        # Handle numeric data
        values = df[col_name].dropna()
        if pd.api.types.is_numeric_dtype(values):
            if np.allclose(values, values.round()):
                bins = np.arange(values.min() - 0.5, values.max() + 1.5, 1)
            else:
                n_bins = min(100, max(10, int(np.sqrt(len(values)))))
                bins = n_bins
        else:
            bins = "auto"

    if plot_type == "both":
        g.map(sns.histplot, col_name, kde=True, bins=bins)
    elif plot_type == "hist":
        g.map(sns.histplot, col_name, kde=False, bins=bins)
    elif plot_type == "kde":
        g.map(sns.kdeplot, col_name, fill=True)
    else:
        raise ValueError("Invalid plot_type. Choose from 'both', 'hist', or 'kde'.")

    g.set_axis_labels(
        col_name, "Frequency" if plot_type != "kde" else "Density", fontsize=10
    )

    # Set facet titles with smaller font
    g.set_titles(col_template=f"{grouping_var}: {{col_name}}", size=11)

    # Add thousands separators to y-axis
    for ax in g.axes.flat:
        ax.yaxis.set_major_formatter(
            plt.FuncFormatter(lambda x, p: format(int(x), ","))
        )

    if rotate_x_labels:
        for ax in g.axes.flat:
            for label in ax.get_xticklabels():
                label.set_rotation(90)

    if is_discrete:
        for ax in g.axes.flat:
            ax.xaxis.set_major_locator(plt.MaxNLocator(integer=True))
            # Apply outlier truncation if available, otherwise use default discrete limits
            if x_limits is not None:
                # Ensure discrete limits are reasonable: min ≥ 0, max ≥ 1, and use integers
                lower_limit = max(0, int(x_limits[0]))
                upper_limit = max(
                    1, int(x_limits[1] + 0.5)
                )  # Round up to ensure we include the max value
                ax.set_xlim(lower_limit - 0.5, upper_limit + 0.5)
            else:
                # Ensure default discrete limits are reasonable: min ≥ 0, max ≥ 1
                # Use the actual min/max values to center the bars properly
                lower_limit = max(0, min_val)
                upper_limit = max(1, max_val)
                ax.set_xlim(lower_limit - 0.5, upper_limit + 0.5)
    elif x_limits is not None:
        # Apply outlier truncation to x-axis
        for ax in g.axes.flat:
            ax.set_xlim(x_limits)
            # Ensure integer tick marks for numeric data with outliers
            ax.xaxis.set_major_locator(plt.MaxNLocator(integer=True))
    else:
        # Let matplotlib auto-scale the x-axis
        pass

    plt.subplots_adjust(top=0.80)
    if title:
        g.figure.suptitle(title, fontsize=14)
    else:
        g.figure.suptitle(
            f"Distribution of {col_name} grouped by {grouping_var_name}", fontsize=14
        )

    if media_file_path:
        if file_name:
            filename = file_name
        else:
            filename = "data_distributions.png"
        plt.savefig(media_file_path / filename, dpi=300)

    if return_figure:
        return g
    else:
        plt.show()
        plt.close()
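
The two truncation rules can be illustrated with a minimal standalone sketch that mirrors the bound calculations above (hypothetical data):

import pandas as pd

values = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])  # 50 is an extreme outlier

# IQR rule: Q1/Q3 -/+ threshold * IQR (threshold=1.5 here)
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr_bounds = (q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1))

# z-score rule: mean -/+ threshold * standard deviation (threshold=2.0, the default)
zscore_bounds = (values.mean() - 2.0 * values.std(), values.mean() + 2.0 * values.std())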

epudd

Generate plots comparing observed values with model predictions for discrete distributions.

An Evaluating Predictions for Unique, Discrete Distributions (EPUDD) plot displays the model's predicted CDF values alongside the actual observed values' positions within their predicted CDF intervals. For discrete distributions, each predicted value has an associated probability, and the CDF is calculated by sorting the values and computing cumulative probabilities.

The plot can show three possible positions for each observation within its predicted interval:

* lower bound of the interval
* midpoint of the interval
* upper bound of the interval

By default, the plot only shows the midpoint of the interval.

For a well-calibrated model, the observed values should fall within their predicted intervals, with the distribution of positions showing appropriate uncertainty.

The visualisation helps assess model calibration by comparing:

1. The predicted cumulative distribution function (CDF) values
2. The actual positions of observations within their predicted intervals
3. The spread and distribution of these positions

Functions:

Name Description
plot_epudd : function

Generates and plots the comparison of model predictions with observed values.

plot_epudd(prediction_times, prob_dist_dict_all, model_name='admissions', return_figure=False, return_dataframe=False, figsize=None, suptitle=None, media_file_path=None, file_name=None, plot_all_bounds=False)

Generates plots comparing model predictions with observed values for discrete distributions.

For discrete distributions, each predicted value has an associated probability. The CDF is calculated by sorting the values and computing cumulative probabilities, normalized by the number of time points.

Parameters:

Name Type Description Default
prediction_times list of tuple

List of (hour, minute) tuples representing times for which predictions were made.

required
prob_dist_dict_all dict

Dictionary of probability distributions keyed by model_key. Each entry contains information about predicted distributions and observed values for different snapshot dates. The predicted distributions should be discrete probability mass functions, with each value having an associated probability.

required
model_name str

Base name of the model to construct model keys, by default "admissions".

'admissions'
return_figure bool

If True, returns the figure object instead of displaying it, by default False.

False
return_dataframe bool

If True, returns a dictionary of observation dataframes by model_key, by default False. The dataframes contain the merged observation and prediction data for analysis.

False
figsize tuple of (float, float)

Size of the figure in inches as (width, height). If None, calculated automatically based on number of plots, by default None.

None
suptitle str

Super title for the entire figure, displayed above all subplots, by default None.

None
media_file_path Path

Path to save the plot, by default None. If provided, saves the plot as a PNG file.

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "plot_epudd.png".

None
plot_all_bounds bool

If True, plots all bounds (lower, mid, upper). If False, only plots mid bounds. By default False.

False

Returns:

Type Description
Figure

The figure object containing the plots, if return_figure is True.

dict

Dictionary of observation dataframes by model_key, if return_dataframe is True.

tuple

Tuple of (figure, dataframes_dict) if both return_figure and return_dataframe are True.

None

If neither return_figure nor return_dataframe is True, displays the plots and returns None.

Notes

For discrete distributions, the CDF is calculated by:

1. Sorting the predicted values
2. Computing cumulative probabilities for each value
3. Normalizing by the number of time points

The plot shows three possible positions for each observation:

* lower_cdf (pink): Uses the lower bound of the CDF interval
* mid_cdf (green): Uses the midpoint of the CDF interval
* upper_cdf (light blue): Uses the upper bound of the CDF interval

The black points represent the model's predicted CDF values, calculated from the sorted values and their associated probabilities, while the colored points show where the actual observations fall within their predicted intervals. For a well-calibrated model, the observed values should fall within their predicted intervals, with the distribution of positions showing appropriate uncertainty.
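
The three interval positions can be derived from a predicted probability mass function in a few lines (a hypothetical PMF, not the module's internal code):

import numpy as np

values = np.array([0, 1, 2, 3, 4])  # possible admission counts
probs = np.array([0.1, 0.3, 0.3, 0.2, 0.1])  # predicted PMF (sums to 1)

upper_cdf = np.cumsum(probs)  # upper bound of each CDF interval
lower_cdf = upper_cdf - probs  # lower bound
mid_cdf = (lower_cdf + upper_cdf) / 2  # midpoint, plotted by default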

Source code in src/patientflow/viz/epudd.py
def plot_epudd(
    prediction_times: List[Tuple[int, int]],
    prob_dist_dict_all: Dict[str, Dict],
    model_name: str = "admissions",
    return_figure: bool = False,
    return_dataframe: bool = False,
    figsize: Optional[Tuple[float, float]] = None,
    suptitle: Optional[str] = None,
    media_file_path: Optional[Path] = None,
    file_name=None,
    plot_all_bounds: bool = False,
) -> Union[
    Figure, Dict[str, pd.DataFrame], Tuple[Figure, Dict[str, pd.DataFrame]], None
]:
    """
    Generates plots comparing model predictions with observed values for discrete distributions.

    For discrete distributions, each predicted value has an associated probability. The CDF
    is calculated by sorting the values and computing cumulative probabilities, normalized
    by the number of time points.

    Parameters
    ----------
    prediction_times : list of tuple
        List of (hour, minute) tuples representing times for which predictions were made.
    prob_dist_dict_all : dict
        Dictionary of probability distributions keyed by model_key. Each entry contains
        information about predicted distributions and observed values for different
        snapshot dates. The predicted distributions should be discrete probability mass
        functions, with each value having an associated probability.
    model_name : str, optional
        Base name of the model to construct model keys, by default "admissions".
    return_figure : bool, optional
        If True, returns the figure object instead of displaying it, by default False.
    return_dataframe : bool, optional
        If True, returns a dictionary of observation dataframes by model_key, by default False.
        The dataframes contain the merged observation and prediction data for analysis.
    figsize : tuple of (float, float), optional
        Size of the figure in inches as (width, height). If None, calculated automatically
        based on number of plots, by default None.
    suptitle : str, optional
        Super title for the entire figure, displayed above all subplots, by default None.
    media_file_path : Path, optional
        Path to save the plot, by default None. If provided, saves the plot as a PNG file.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "plot_epudd.png".
    plot_all_bounds : bool, optional
        If True, plots all bounds (lower, mid, upper). If False, only plots mid bounds.
        By default False.

    Returns
    -------
    matplotlib.figure.Figure
        The figure object containing the plots, if return_figure is True.
    dict
        Dictionary of observation dataframes by model_key, if return_dataframe is True.
    tuple
        Tuple of (figure, dataframes_dict) if both return_figure and return_dataframe are True.
    None
        If neither return_figure nor return_dataframe is True, displays the plots and returns None.

    Notes
    -----
    For discrete distributions, the CDF is calculated by:

        1. Sorting the predicted values
        2. Computing cumulative probabilities for each value
        3. Normalizing by the number of time points

    The plot shows three possible positions for each observation:

        * lower_cdf (pink): Uses the lower bound of the CDF interval
        * mid_cdf (green): Uses the midpoint of the CDF interval
        * upper_cdf (light blue): Uses the upper bound of the CDF interval

    The black points represent the model's predicted CDF values, calculated from the sorted
    values and their associated probabilities, while the colored points show where the actual
    observations fall within their predicted intervals. For a well-calibrated model, the
    observed values should fall within their predicted intervals, with the distribution of
    positions showing appropriate uncertainty.

    """
    # Sort prediction times by converting to minutes since midnight
    prediction_times_sorted: List[Tuple[int, int]] = sorted(
        prediction_times,
        key=lambda x: x[0] * 60 + x[1],
    )

    # Calculate figure parameters
    num_plots: int = len(prediction_times_sorted)
    figsize = figsize or (num_plots * 5, 4)

    # Create subplot layout
    fig: Figure
    axs: np.ndarray
    fig, axs = plt.subplots(1, num_plots, figsize=figsize)
    axs = [axs] if num_plots == 1 else axs

    # Define plotting types and colors
    all_types = ["lower", "mid", "upper"]
    plot_types = all_types if plot_all_bounds else ["mid"]
    colors: Dict[str, str] = {
        "lower": "#FF1493",  # deeppink
        "mid": "#228B22",  # chartreuse4/forest green
        "upper": "#ADD8E6",  # lightblue
    }

    all_obs_dfs: Dict[str, pd.DataFrame] = {}

    # Process each subplot
    for i, prediction_time in enumerate(prediction_times_sorted):
        model_key: str = get_model_key(model_name, prediction_time)
        prob_dist_dict: Dict = prob_dist_dict_all[model_key]

        if not prob_dist_dict:
            continue

        # Create distribution and observation dataframes
        all_distributions = _create_distribution_records(prob_dist_dict, all_types)
        distr_coll: pd.DataFrame = pd.DataFrame(all_distributions)

        all_observations = _create_observation_records(prob_dist_dict)
        adm_coll: pd.DataFrame = pd.DataFrame(all_observations)

        # For each actual observation, find its position in the predicted CDF
        # by matching datetime and admission count to get lower/mid/upper bounds
        merged_df: pd.DataFrame = pd.merge(
            adm_coll,
            distr_coll.rename(
                columns={
                    "num_adm_pred": "num_adm",
                    **{f"{t}_predicted_cdf": f"{t}_observed_cdf" for t in all_types},
                }
            ),
            on=["dt", "num_adm"],
            how="inner",
        )

        if merged_df.empty:
            continue

        all_obs_dfs[model_key] = merged_df
        ax = axs[i]
        num_time_points: int = len(prob_dist_dict)

        # Plot predictions and observations
        _plot_predictions(ax, distr_coll, num_time_points, plot_types)
        _plot_observations(ax, merged_df, plot_types, colors, i == 0)
        _setup_subplot(ax, prediction_time, i == 0)

    # Final plot configuration
    plt.tight_layout()
    if suptitle:
        plt.suptitle(suptitle, fontsize=16, y=1.05)
    if media_file_path:
        if file_name:
            filename = file_name
        else:
            filename = "plot_epudd.png"
        plt.savefig(media_file_path / filename, dpi=300)

    # Return based on flags
    if return_figure and return_dataframe:
        return fig, all_obs_dfs
    elif return_figure:
        return fig
    elif return_dataframe:
        plt.show()
        plt.close()
        return all_obs_dfs
    else:
        plt.show()
        plt.close()
        return None

estimated_probabilities

Visualization module for plotting estimated probabilities from trained models.

This module provides functions for creating distribution plots of estimated probabilities from trained classification models.

Functions:

Name Description
plot_estimated_probabilities : function

Plot estimated probability distributions for multiple models

plot_estimated_probabilities(trained_models, test_visits, exclude_from_training_data, bins=30, media_file_path=None, file_name=None, suptitle=None, return_figure=False, label_col='is_admitted')

Plot estimated probability distributions for multiple models.

Parameters:

Name Type Description Default
trained_models list[TrainedClassifier] or dict[str, TrainedClassifier]

List of TrainedClassifier objects or dict with TrainedClassifier values

required
test_visits DataFrame

DataFrame containing test visit data

required
exclude_from_training_data list

Columns to exclude from the test data

required
bins int

Number of bins for the histograms

30
media_file_path Path

Path where the plot should be saved

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "estimated_probabilities.png".

None
suptitle str

Optional super title for the entire figure

None
return_figure bool

If True, returns the figure instead of displaying it

False
label_col str

Name of the column containing the target labels

"is_admitted"

Returns:

Type Description
Figure or None

If return_figure is True, returns the figure object. Otherwise, displays the plot and returns None.

Source code in src/patientflow/viz/estimated_probabilities.py
def plot_estimated_probabilities(
    trained_models: list[TrainedClassifier] | dict[str, TrainedClassifier],
    test_visits,
    exclude_from_training_data,
    bins=30,
    media_file_path: Optional[Path] = None,
    file_name=None,
    suptitle: Optional[str] = None,
    return_figure=False,
    label_col: str = "is_admitted",
):
    """Plot estimated probability distributions for multiple models.

    Parameters
    ----------
    trained_models : list[TrainedClassifier] or dict[str, TrainedClassifier]
        List of TrainedClassifier objects or dict with TrainedClassifier values
    test_visits : pandas.DataFrame
        DataFrame containing test visit data
    exclude_from_training_data : list
        Columns to exclude from the test data
    bins : int, default=30
        Number of bins for the histograms
    media_file_path : Path, optional
        Path where the plot should be saved
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "estimated_probabilities.png".
    suptitle : str, optional
        Optional super title for the entire figure
    return_figure : bool, default=False
        If True, returns the figure instead of displaying it
    label_col : str, default="is_admitted"
        Name of the column containing the target labels

    Returns
    -------
    matplotlib.figure.Figure or None
        If return_figure is True, returns the figure object. Otherwise, displays
        the plot and returns None.
    """
    # Convert dict to list if needed
    if isinstance(trained_models, dict):
        trained_models = list(trained_models.values())

    # Sort trained_models by prediction time
    trained_models_sorted = sorted(
        trained_models,
        key=lambda x: x.training_results.prediction_time[0] * 60
        + x.training_results.prediction_time[1],
    )
    num_plots = len(trained_models_sorted)
    fig, axs = plt.subplots(1, num_plots, figsize=(num_plots * 5, 4))

    # Handle case of single prediction time
    if num_plots == 1:
        axs = [axs]

    for i, trained_model in enumerate(trained_models_sorted):
        # Use calibrated pipeline if available, otherwise use regular pipeline
        if (
            hasattr(trained_model, "calibrated_pipeline")
            and trained_model.calibrated_pipeline is not None
        ):
            pipeline = trained_model.calibrated_pipeline
        else:
            pipeline = trained_model.pipeline

        prediction_time = trained_model.training_results.prediction_time

        # Get test data for this prediction time
        X_test, y_test = prepare_patient_snapshots(
            df=test_visits,
            prediction_time=prediction_time,
            exclude_columns=exclude_from_training_data,
            single_snapshot_per_visit=False,
            label_col=label_col,
        )

        X_test = add_missing_columns(pipeline, X_test)

        # Get predictions
        y_pred_proba = pipeline.predict_proba(X_test)[:, 1]

        # Separate predictions for positive and negative cases
        pos_preds = y_pred_proba[y_test == 1]
        neg_preds = y_pred_proba[y_test == 0]

        ax = axs[i]
        hour, minutes = prediction_time

        # Plot distributions
        ax.hist(
            neg_preds,
            bins=bins,
            alpha=0.5,
            color=primary_color,
            density=True,
            label="Negative Cases",
            histtype="step",
            linewidth=2,
        )
        ax.hist(
            pos_preds,
            bins=bins,
            alpha=0.5,
            color=secondary_color,
            density=True,
            label="Positive Cases",
            histtype="step",
            linewidth=2,
        )

        # Optional: Fill with lower opacity
        ax.hist(neg_preds, bins=bins, alpha=0.2, color=primary_color, density=True)
        ax.hist(pos_preds, bins=bins, alpha=0.2, color=secondary_color, density=True)

        ax.set_title(
            f"Distribution of Estimated Probabilities at {hour}:{minutes:02}",
            fontsize=14,
        )
        ax.set_xlabel("Estimated Probability", fontsize=12)
        ax.set_ylabel("Density", fontsize=12)
        ax.set_xlim(0, 1)
        ax.legend()

    plt.tight_layout()

    # Add suptitle if provided
    if suptitle is not None:
        plt.suptitle(suptitle, y=1.05, fontsize=16)

    if media_file_path:
        if file_name:
            filename = file_name
        else:
            filename = "estimated_probabilities.png"
        plt.savefig(media_file_path / filename, dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close()
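
A usage sketch with a dictionary of models, as accepted by the trained_models parameter (the key names and excluded columns are placeholders):

>>> plot_estimated_probabilities(
...     trained_models={"admissions_0930": model_0930, "admissions_1530": model_1530},
...     test_visits=test_visits,
...     exclude_from_training_data=["visit_id"],
...     bins=50,
... )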

features

Visualisation module for plotting feature importances from trained models.

This module provides functionality to visualize feature importances from trained classifiers, allowing for comparison across different prediction time points.

Functions:

Name Description
plot_features : function

Plot feature importance for multiple models

plot_features(trained_models, media_file_path=None, file_name=None, top_n=20, suptitle=None, return_figure=False)

Plot feature importance for multiple models.

Parameters:

Name Type Description Default
trained_models list[TrainedClassifier] or dict[str, TrainedClassifier]

List of TrainedClassifier objects or dictionary with TrainedClassifier values.

required
media_file_path Path

Path where the plot should be saved. If None, the plot is only displayed.

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "feature_importance_plots.png".

None
top_n int

Number of top features to display.

20
suptitle str

Super title for the entire figure.

None
return_figure bool

If True, returns the figure instead of displaying it.

False

Returns:

Type Description
Figure or None

The matplotlib figure if return_figure is True, otherwise None.

Notes

The function sorts models by prediction time and creates a horizontal bar plot for each model showing the top N most important features. Feature names are truncated to 25 characters for better display.

Source code in src/patientflow/viz/features.py
def plot_features(
    trained_models: list[TrainedClassifier] | dict[str, TrainedClassifier],
    media_file_path: Optional[Path] = None,
    file_name=None,
    top_n: int = 20,
    suptitle: Optional[str] = None,
    return_figure: bool = False,
) -> Optional[plt.Figure]:
    """Plot feature importance for multiple models.

    Parameters
    ----------
    trained_models : list[TrainedClassifier] or dict[str, TrainedClassifier]
        List of TrainedClassifier objects or dictionary with TrainedClassifier values.
    media_file_path : Path, optional
        Path where the plot should be saved. If None, the plot is only displayed.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "feature_importance_plots.png".
    top_n : int, default=20
        Number of top features to display.
    suptitle : str, optional
        Super title for the entire figure.
    return_figure : bool, default=False
        If True, returns the figure instead of displaying it.

    Returns
    -------
    plt.Figure or None
        The matplotlib figure if return_figure is True, otherwise None.

    Notes
    -----
    The function sorts models by prediction time and creates a horizontal bar plot
    for each model showing the top N most important features. Feature names are
    truncated to 25 characters for better display.
    """
    # Convert dict to list if needed
    if isinstance(trained_models, dict):
        trained_models = list(trained_models.values())

    # Sort trained_models by prediction time
    trained_models_sorted = sorted(
        trained_models,
        key=lambda x: x.training_results.prediction_time[0] * 60
        + x.training_results.prediction_time[1],
    )

    num_plots = len(trained_models_sorted)
    fig, axs = plt.subplots(1, num_plots, figsize=(num_plots * 6, 12))

    # Handle case of single prediction time
    if num_plots == 1:
        axs = [axs]

    for i, trained_model in enumerate(trained_models_sorted):
        # Always use regular pipeline
        pipeline: Pipeline = trained_model.pipeline
        prediction_time = trained_model.training_results.prediction_time

        # Get feature names from the pipeline
        transformed_cols = pipeline.named_steps[
            "feature_transformer"
        ].get_feature_names_out()
        transformed_cols = [col.split("__")[-1] for col in transformed_cols]
        truncated_cols = [col[:25] for col in transformed_cols]

        # Get feature importances
        feature_importances = pipeline.named_steps["classifier"].feature_importances_
        indices = np.argsort(feature_importances)[
            -top_n:
        ]  # Get indices of the top N features

        # Plot for this prediction time
        ax = axs[i]
        hour, minutes = prediction_time
        ax.barh(range(len(indices)), feature_importances[indices], align="center")
        ax.set_yticks(range(len(indices)))
        ax.set_yticklabels(np.array(truncated_cols)[indices])
        ax.set_xlabel("Importance")
        ax.set_ylabel("Features")
        ax.set_title(f"Feature Importances for {hour}:{minutes:02}")

    plt.tight_layout()

    # Add suptitle if provided
    if suptitle is not None:
        plt.suptitle(suptitle, y=1.05, fontsize=16)

    if media_file_path:
        # Save and display plot
        if file_name:
            feature_plot_path = media_file_path / file_name
        else:
            feature_plot_path = media_file_path / "feature_importance_plots.png"
        plt.savefig(feature_plot_path, bbox_inches="tight")

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close()
        return None
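
A usage sketch, assuming each pipeline's classifier step exposes feature_importances_ (as tree-based models do); the suptitle text is a placeholder:

>>> fig = plot_features(
...     trained_models=trained_models,
...     top_n=10,
...     suptitle="Feature importances by prediction time",
...     return_figure=True,
... )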

madcap

Module for generating MADCAP (Model Accuracy and Discriminative Calibration Plots) visualizations.

MADCAP plots compare model-predicted probabilities to observed outcomes, helping to assess model calibration and discrimination. The plots can be generated for individual prediction times or for specific groups (e.g., age groups).

Functions:

Name Description
classify_age : function

Classifies age into categories based on numeric values or age group strings.

plot_madcap : function

Generates MADCAP plots for a list of trained models, comparing estimated probabilities to observed values.

_plot_madcap_subplot : function

Plots a single MADCAP subplot showing cumulative predicted and observed values.

_plot_madcap_by_group_single : function

Generates MADCAP plots for specific groups at a given prediction time.

plot_madcap_by_group : function

Generates MADCAP plots for different groups across multiple prediction times.

plot_madcap_by_group

Generates MADCAP plots for groups (e.g., age groups) across a series of prediction times.

classify_age(age, age_categories=None)

Classify age into categories based on numeric values or age group strings.

Parameters:

Name Type Description Default
age int, float, or str

Age value (e.g., 30) or age group string (e.g., '18-24').

required
age_categories dict

Dictionary defining age categories and their ranges. If not provided, uses DEFAULT_AGE_CATEGORIES. Expected format:

{
    "category_name": {
        "numeric": {"min": min_age, "max": max_age},
        "groups": ["age_group1", "age_group2", ...]
    }
}

None

Returns:

Type Description
str

Category name based on the age or age group, or 'unknown' for unexpected or invalid values.

Examples:

>>> classify_age(25)
'adults'
>>> classify_age('65-74')
'65 or over'
Source code in src/patientflow/viz/madcap.py
def classify_age(age, age_categories=None):
    """Classify age into categories based on numeric values or age group strings.

    Parameters
    ----------
    age : int, float, or str
        Age value (e.g., 30) or age group string (e.g., '18-24').
    age_categories : dict, optional
        Dictionary defining age categories and their ranges. If not provided, uses DEFAULT_AGE_CATEGORIES.
        Expected format:
        {
            "category_name": {
                "numeric": {"min": min_age, "max": max_age},
                "groups": ["age_group1", "age_group2", ...]
            }
        }

    Returns
    -------
    str
        Category name based on the age or age group, or 'unknown' for unexpected or invalid values.

    Examples
    --------
    >>> classify_age(25)
    'adults'
    >>> classify_age('65-74')
    '65 or over'
    """
    if age_categories is None:
        age_categories = DEFAULT_AGE_CATEGORIES

    if isinstance(age, (int, float)):
        for category, rules in age_categories.items():
            numeric_rules = rules.get("numeric", {})
            min_age = numeric_rules.get("min", float("-inf"))
            max_age = numeric_rules.get("max", float("inf"))

            if min_age <= age <= max_age:
                return category
        return "unknown"
    elif isinstance(age, str):
        for category, rules in age_categories.items():
            if age in rules.get("groups", []):
                return category
        return "unknown"
    else:
        return "unknown"

plot_madcap(trained_models, test_visits, exclude_from_training_data, media_file_path=None, file_name=None, suptitle=None, return_figure=False, label_col='is_admitted')

Generate MADCAP plots for a list of trained models.

Parameters:

Name Type Description Default
trained_models list[TrainedClassifier] or dict[str, TrainedClassifier]

List of trained classifier objects or dictionary with TrainedClassifier values.

required
test_visits DataFrame

DataFrame containing test visit data.

required
exclude_from_training_data List[str]

List of columns to exclude from training data.

required
media_file_path Path

Directory path where the generated plots will be saved.

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "madcap_plot.png".

None
suptitle str

Suptitle for the plot.

None
return_figure bool

If True, returns the figure object instead of displaying it.

False
label_col str

Name of the column containing the target labels.

"is_admitted"

Returns:

Type Description
Optional[Figure]

The figure if return_figure is True, None otherwise.

Source code in src/patientflow/viz/madcap.py
def plot_madcap(
    trained_models: list[TrainedClassifier] | dict[str, TrainedClassifier],
    test_visits: pd.DataFrame,
    exclude_from_training_data: List[str],
    media_file_path: Optional[Path] = None,
    file_name: Optional[str] = None,
    suptitle: Optional[str] = None,
    return_figure: bool = False,
    label_col: str = "is_admitted",
) -> Optional[plt.Figure]:
    """Generate MADCAP plots for a list of trained models.

    Parameters
    ----------
    trained_models : list[TrainedClassifier] or dict[str, TrainedClassifier]
        List of trained classifier objects or dictionary with TrainedClassifier values.
    test_visits : pd.DataFrame
        DataFrame containing test visit data.
    exclude_from_training_data : List[str]
        List of columns to exclude from training data.
    media_file_path : Path, optional
        Directory path where the generated plots will be saved.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "madcap_plot.png".
    suptitle : str, optional
        Suptitle for the plot.
    return_figure : bool, default=False
        If True, returns the figure object instead of displaying it.
    label_col : str, default="is_admitted"
        Name of the column containing the target labels.

    Returns
    -------
    Optional[plt.Figure]
        The figure if return_figure is True, None otherwise.
    """
    # Convert dict to list if needed
    if isinstance(trained_models, dict):
        trained_models = list(trained_models.values())

    # Sort trained_models by prediction time
    trained_models_sorted = sorted(
        trained_models,
        key=lambda x: x.training_results.prediction_time[0] * 60
        + x.training_results.prediction_time[1],
    )
    num_plots = len(trained_models_sorted)

    # Calculate the number of rows and columns for the subplots
    num_cols = min(num_plots, 5)  # Maximum 5 columns
    num_rows = math.ceil(num_plots / num_cols)

    fig, axes = plt.subplots(num_rows, num_cols, figsize=(num_cols * 5, num_rows * 4))

    # Handle the case of a single plot differently
    if num_plots == 1:
        # When there's only one plot, axes is a single Axes object, not an array
        trained_model = trained_models_sorted[0]

        # Use calibrated pipeline if available, otherwise use regular pipeline
        if (
            hasattr(trained_model, "calibrated_pipeline")
            and trained_model.calibrated_pipeline is not None
        ):
            pipeline = trained_model.calibrated_pipeline
        else:
            pipeline = trained_model.pipeline

        prediction_time = trained_model.training_results.prediction_time

        # Get test data for this prediction time
        X_test, y_test = prepare_patient_snapshots(
            df=test_visits,
            prediction_time=prediction_time,
            exclude_columns=exclude_from_training_data,
            single_snapshot_per_visit=False,
            label_col=label_col,
        )

        X_test = add_missing_columns(pipeline, X_test)
        predict_proba = pipeline.predict_proba(X_test)[:, 1]

        # Plot directly on the single axes
        _plot_madcap_subplot(predict_proba, y_test, prediction_time, axes)
    else:
        # For multiple plots, ensure axes is always a 2D array
        if num_rows == 1:
            axes = axes.reshape(1, -1)

        for i, trained_model in enumerate(trained_models_sorted):
            # Use calibrated pipeline if available, otherwise use regular pipeline
            if (
                hasattr(trained_model, "calibrated_pipeline")
                and trained_model.calibrated_pipeline is not None
            ):
                pipeline = trained_model.calibrated_pipeline
            else:
                pipeline = trained_model.pipeline

            prediction_time = trained_model.training_results.prediction_time

            # Get test data for this prediction time
            X_test, y_test = prepare_patient_snapshots(
                df=test_visits,
                prediction_time=prediction_time,
                exclude_columns=exclude_from_training_data,
                single_snapshot_per_visit=False,
                label_col=label_col,
            )

            X_test = add_missing_columns(pipeline, X_test)
            predict_proba = pipeline.predict_proba(X_test)[:, 1]

            row = i // num_cols
            col = i % num_cols
            _plot_madcap_subplot(predict_proba, y_test, prediction_time, axes[row, col])

        # Hide any unused subplots
        for j in range(i + 1, num_rows * num_cols):
            row = j // num_cols
            col = j % num_cols
            axes[row, col].axis("off")

    plt.tight_layout()

    # Add suptitle if provided
    if suptitle:
        fig.suptitle(suptitle, fontsize=16, y=1.05)
        # Adjust layout to accommodate suptitle
        plt.subplots_adjust(top=0.85)

    if media_file_path:
        plot_name = file_name if file_name else "madcap_plot.png"
        madcap_plot_path = Path(media_file_path) / plot_name
        plt.savefig(madcap_plot_path, bbox_inches="tight")

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close(fig)
        return None
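
A minimal usage sketch (hedged: `trained_models`, `test_visits`, and the excluded column names below are placeholders for objects produced elsewhere in the package, not values defined by this module):

    from pathlib import Path
    from patientflow.viz.madcap import plot_madcap

    # `trained_models` is assumed to be a dict of TrainedClassifier objects, and
    # `test_visits` a DataFrame of patient snapshots; the excluded columns are
    # hypothetical identifiers, not model features.
    plot_madcap(
        trained_models=trained_models,
        test_visits=test_visits,
        exclude_from_training_data=["visit_number", "arrival_datetime"],
        media_file_path=Path("figures"),  # an existing directory
        suptitle="MADCAP plots by prediction time",
    )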

plot_madcap_by_group(trained_models, test_visits, exclude_from_training_data, grouping_var, grouping_var_name, media_file_path=None, file_name=None, plot_difference=False, return_figure=False, label_col='is_admitted')

Generate MADCAP plots for different groups across multiple prediction times.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| trained_models | list[TrainedClassifier] or dict[str, TrainedClassifier] | List of trained classifier objects or dictionary with TrainedClassifier values. | required |
| test_visits | DataFrame | DataFrame containing the test visit data. | required |
| exclude_from_training_data | List[str] | List of columns to exclude from training data. | required |
| grouping_var | str | The column name in the dataset that defines the grouping variable. | required |
| grouping_var_name | str | A descriptive name for the grouping variable, used in plot titles. | required |
| media_file_path | Path | Directory path where the generated plots will be saved. | None |
| file_name | str | Custom filename to use when saving the plot. If not provided, defaults to a generated name based on group and time. | None |
| plot_difference | bool | If True, includes difference plot between predicted and observed outcomes. | False |
| return_figure | bool | If True, returns a list of figure objects instead of displaying them. | False |
| label_col | str | Name of the column containing the target labels. | "is_admitted" |

Returns:

| Type | Description |
| --- | --- |
| Optional[List[Figure]] | List of figures if return_figure is True, None otherwise. |

Source code in src/patientflow/viz/madcap.py
def plot_madcap_by_group(
    trained_models: list[TrainedClassifier] | dict[str, TrainedClassifier],
    test_visits: pd.DataFrame,
    exclude_from_training_data: List[str],
    grouping_var: str,
    grouping_var_name: str,
    media_file_path: Optional[Path] = None,
    file_name: Optional[str] = None,
    plot_difference: bool = False,
    return_figure: bool = False,
    label_col: str = "is_admitted",
) -> Optional[List[plt.Figure]]:
    """Generate MADCAP plots for different groups across multiple prediction times.

    Parameters
    ----------
    trained_models : list[TrainedClassifier] or dict[str, TrainedClassifier]
        List of trained classifier objects or dictionary with TrainedClassifier values.
    test_visits : pd.DataFrame
        DataFrame containing the test visit data.
    exclude_from_training_data : List[str]
        List of columns to exclude from training data.
    grouping_var : str
        The column name in the dataset that defines the grouping variable.
    grouping_var_name : str
        A descriptive name for the grouping variable, used in plot titles.
    media_file_path : Path, optional
        Directory path where the generated plots will be saved.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to a generated name based on group and time.
    plot_difference : bool, default=False
        If True, includes difference plot between predicted and observed outcomes.
    return_figure : bool, default=False
        If True, returns a list of figure objects instead of displaying them.
    label_col : str, default="is_admitted"
        Name of the column containing the target labels.

    Returns
    -------
    Optional[List[plt.Figure]]
        List of figures if return_figure is True, None otherwise.
    """
    # Convert dict to list if needed
    if isinstance(trained_models, dict):
        trained_models = list(trained_models.values())

    # Sort trained_models by prediction time
    trained_models_sorted = sorted(
        trained_models,
        key=lambda x: x.training_results.prediction_time[0] * 60
        + x.training_results.prediction_time[1],
    )

    figures = []
    for trained_model in trained_models_sorted:
        # Use calibrated pipeline if available, otherwise use regular pipeline
        if (
            hasattr(trained_model, "calibrated_pipeline")
            and trained_model.calibrated_pipeline is not None
        ):
            pipeline = trained_model.calibrated_pipeline
        else:
            pipeline = trained_model.pipeline

        prediction_time = trained_model.training_results.prediction_time

        # Get test data for this prediction time
        X_test, y_test = prepare_patient_snapshots(
            df=test_visits,
            prediction_time=prediction_time,
            exclude_columns=exclude_from_training_data,
            single_snapshot_per_visit=False,
            label_col=label_col,
        )

        # Check if the grouping variable exists in X_test columns
        if grouping_var not in X_test.columns:
            raise ValueError(f"'{grouping_var}' not found in the dataset columns.")

        X_test = add_missing_columns(pipeline, X_test)
        predict_proba = pipeline.predict_proba(X_test)[:, 1]

        # Apply classification based on the grouping variable
        if grouping_var == "age_group":
            group = X_test["age_group"].apply(classify_age)
        elif grouping_var == "age_on_arrival":
            group = X_test["age_on_arrival"].apply(classify_age)
        else:
            group = X_test[grouping_var]

        fig = _plot_madcap_by_group_single(
            predict_proba,
            y_test,
            group,
            prediction_time,
            grouping_var_name,
            media_file_path,
            file_name=file_name,
            plot_difference=plot_difference,
            return_figure=True,
        )
        if return_figure:
            figures.append(fig)

    if return_figure:
        return figures
    else:
        return None
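
A hedged usage sketch, reusing the same assumed `trained_models` and `test_visits` as above; "age_group" is a hypothetical column name for illustration:

    from patientflow.viz.madcap import plot_madcap_by_group

    figures = plot_madcap_by_group(
        trained_models=trained_models,
        test_visits=test_visits,
        exclude_from_training_data=["visit_number", "arrival_datetime"],
        grouping_var="age_group",        # must exist in the test data
        grouping_var_name="Age group",
        plot_difference=True,
        return_figure=True,              # collect one figure per prediction time
    )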

observed_against_expected

Visualisation utilities for evaluating patient flow predictions.

This module provides functions for creating visualisations to evaluate the accuracy and performance of patient flow predictions, with a particular focus on comparing observed against expected values.

Functions:

Name Description
plot_deltas : function

Plot histograms of observed minus expected values

plot_arrival_delta_single_instance : function

Plot comparison between observed arrivals and expected arrival rates

plot_arrival_deltas : function

Plot delta charts for multiple snapshot dates on the same figure

plot_arrival_delta_single_instance(df, prediction_time, snapshot_date, prediction_window, yta_time_interval=timedelta(minutes=15), show_delta=True, show_only_delta=False, media_file_path=None, file_name=None, return_figure=False, fig_size=(10, 4))

Plot comparison between observed arrivals and expected arrival rates.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | DataFrame containing arrival data | required |
| prediction_time | tuple | (hour, minute) of prediction time | required |
| snapshot_date | date | Date to analyze | required |
| prediction_window | timedelta | Length of the prediction window | required |
| yta_time_interval | timedelta | Time interval for calculating arrival rates | timedelta(minutes=15) |
| show_delta | bool | If True, plot the difference between actual and expected arrivals | True |
| show_only_delta | bool | If True, only plot the delta between actual and expected arrivals | False |
| media_file_path | Path | Path to save the plot | None |
| file_name | str | Custom filename to use when saving the plot. If not provided, defaults to "arrival_comparison.png" | None |
| return_figure | bool | If True, returns the figure instead of displaying it | False |
| fig_size | tuple | Figure size as (width, height) in inches | (10, 4) |

Returns:

| Type | Description |
| --- | --- |
| Figure or None | The figure object if return_figure is True, otherwise None |

Source code in src/patientflow/viz/observed_against_expected.py
def plot_arrival_delta_single_instance(
    df,
    prediction_time,
    snapshot_date,
    prediction_window: timedelta,
    yta_time_interval: timedelta = timedelta(minutes=15),
    show_delta=True,
    show_only_delta=False,
    media_file_path=None,
    file_name=None,
    return_figure=False,
    fig_size=(10, 4),
):
    """Plot comparison between observed arrivals and expected arrival rates.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing arrival data
    prediction_time : tuple
        (hour, minute) of prediction time
    snapshot_date : datetime.date
        Date to analyze
    prediction_window : timedelta
        Length of the prediction window
    yta_time_interval : timedelta, default=timedelta(minutes=15)
        Time interval for calculating arrival rates
    show_delta : bool, default=True
        If True, plot the difference between actual and expected arrivals
    show_only_delta : bool, default=False
        If True, only plot the delta between actual and expected arrivals
    media_file_path : Path, optional
        Path to save the plot
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "arrival_comparison.png"
    return_figure : bool, default=False
        If True, returns the figure instead of displaying it
    fig_size : tuple, default=(10, 4)
        Figure size as (width, height) in inches

    Returns
    -------
    matplotlib.figure.Figure or None
        The figure object if return_figure is True, otherwise None
    """
    # Prepare data
    df_copy, snapshot_datetime, default_datetime, prediction_time_obj = (
        _prepare_arrival_data(
            df, prediction_time, snapshot_date, prediction_window, yta_time_interval
        )
    )

    # Get arrivals within the prediction window
    arrivals = df_copy[
        (df_copy.index > snapshot_datetime)
        & (df_copy.index <= snapshot_datetime + prediction_window)
    ]

    # Sort arrivals by time and create cumulative count
    arrivals = arrivals.sort_values("arrival_datetime")
    arrivals["cumulative_count"] = range(1, len(arrivals) + 1)

    # Calculate arrival rates and prepare time points
    mean_arrival_rates = _calculate_arrival_rates(
        df_copy, prediction_time_obj, prediction_window, yta_time_interval
    )

    # Prepare arrival times
    arrival_times_piecewise = _prepare_arrival_times(
        mean_arrival_rates, prediction_time_obj, default_date=datetime(2024, 1, 1)
    )

    # Calculate cumulative rates
    cumulative_rates = _calculate_cumulative_rates(
        arrival_times_piecewise, mean_arrival_rates
    )

    # Create figure with subplots if showing delta
    if show_delta and not show_only_delta:
        fig, (ax1, ax2) = plt.subplots(
            2, 1, figsize=(fig_size[0], fig_size[1] * 2), sharex=True
        )
        ax = ax1
    else:
        fig = plt.figure(figsize=fig_size)  # keep a handle so fig can be returned below
        ax = plt.gca()

    # Ensure arrivals index is timezone-aware
    if arrivals.index.tz is None:
        arrivals.index = arrivals.index.tz_localize("UTC")

    # Convert arrival times to use default date for plotting
    arrival_times_plot = [
        default_datetime + (t - snapshot_datetime) for t in arrivals.index
    ]

    # Create combined timeline
    all_times = _create_combined_timeline(
        default_datetime, arrival_times_plot, prediction_window, arrival_times_piecewise
    )

    # Interpolate both actual and expected to the combined timeline
    actual_counts = np.interp(
        [t.timestamp() for t in all_times],
        [
            t.timestamp()
            for t in [default_datetime]
            + arrival_times_plot
            + [default_datetime + prediction_window]
        ],
        [0]
        + list(arrivals["cumulative_count"])
        + [arrivals["cumulative_count"].iloc[-1] if len(arrivals) > 0 else 0],
    )

    expected_counts = np.interp(
        [t.timestamp() for t in all_times],
        [t.timestamp() for t in arrival_times_piecewise],
        cumulative_rates,
    )

    # Calculate delta
    delta = actual_counts - expected_counts
    delta[0] = 0  # Ensure delta starts at 0

    if not show_only_delta:
        # Plot actual and expected arrivals
        ax.step(
            [default_datetime]
            + arrival_times_plot
            + [default_datetime + prediction_window],
            [0]
            + list(arrivals["cumulative_count"])
            + [arrivals["cumulative_count"].iloc[-1] if len(arrivals) > 0 else 0],
            where="post",
            label="Actual Arrivals",
        )
        ax.scatter(
            arrival_times_piecewise,
            cumulative_rates,
            label="Expected Arrivals",
            color="orange",
        )

        ax.set_xlabel("Time")
        ax.set_title(
            f"Cumulative Arrivals in the {int(prediction_window.total_seconds()/3600)} hours after {format_prediction_time(prediction_time)} on {snapshot_date}"
        )
        ax.legend()

    if show_delta or show_only_delta:
        if show_only_delta:
            _plot_arrival_delta_chart(
                ax, all_times, delta, prediction_time, prediction_window, snapshot_date
            )
        else:
            _plot_arrival_delta_chart(
                ax2, all_times, delta, prediction_time, prediction_window, snapshot_date
            )
        plt.tight_layout()

    # Format time axis for all subplots
    for ax in plt.gcf().get_axes():
        _format_time_axis(ax, all_times)

    if media_file_path:
        filename = file_name if file_name else "arrival_comparison.png"
        plt.savefig(media_file_path / filename, dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close()
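
A minimal usage sketch, assuming `arrivals_df` holds one row per arrival with an `arrival_datetime` column (the exact frame layout is prepared internally by `_prepare_arrival_data`):

    from datetime import date, timedelta
    from patientflow.viz.observed_against_expected import (
        plot_arrival_delta_single_instance,
    )

    plot_arrival_delta_single_instance(
        arrivals_df,
        prediction_time=(9, 30),                # 09:30 snapshot
        snapshot_date=date(2025, 3, 1),         # illustrative date
        prediction_window=timedelta(hours=8),
        yta_time_interval=timedelta(minutes=15),
        show_delta=True,
    )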

plot_arrival_deltas(df, prediction_time, snapshot_dates, prediction_window, yta_time_interval=timedelta(minutes=15), media_file_path=None, file_name=None, return_figure=False, fig_size=(15, 6))

Plot delta charts for multiple snapshot dates on the same figure.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | DataFrame containing arrival data | required |
| prediction_time | tuple | (hour, minute) of prediction time | required |
| snapshot_dates | list | List of datetime.date objects to analyze | required |
| prediction_window | timedelta | Length of the prediction window | required |
| yta_time_interval | timedelta | Time interval for calculating arrival rates | timedelta(minutes=15) |
| media_file_path | Path | Path to save the plot | None |
| file_name | str | Custom filename to use when saving the plot. If not provided, defaults to "multiple_deltas.png" | None |
| return_figure | bool | If True, returns the figure instead of displaying it | False |
| fig_size | tuple | Figure size as (width, height) in inches | (15, 6) |

Returns:

| Type | Description |
| --- | --- |
| Figure or None | The figure object if return_figure is True, otherwise None |

Source code in src/patientflow/viz/observed_against_expected.py
def plot_arrival_deltas(
    df,
    prediction_time,
    snapshot_dates,
    prediction_window: timedelta,
    yta_time_interval: timedelta = timedelta(minutes=15),
    media_file_path=None,
    file_name=None,
    return_figure=False,
    fig_size=(15, 6),
):
    """Plot delta charts for multiple snapshot dates on the same figure.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing arrival data
    prediction_time : tuple
        (hour, minute) of prediction time
    snapshot_dates : list
        List of datetime.date objects to analyze
    prediction_window : timedelta
        Length of the prediction window
    yta_time_interval : timedelta, default=timedelta(minutes=15)
        Time interval for calculating arrival rates
    media_file_path : Path, optional
        Path to save the plot
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "multiple_deltas.png"
    return_figure : bool, default=False
        If True, returns the figure instead of displaying it
    fig_size : tuple, default=(15, 6)
        Figure size as (width, height) in inches

    Returns
    -------
    matplotlib.figure.Figure or None
        The figure object if return_figure is True, otherwise None
    """
    # Create figure with subplots
    fig = plt.figure(figsize=fig_size)
    gs = plt.GridSpec(1, 2, width_ratios=[2, 1])
    ax1 = plt.subplot(gs[0])
    ax2 = plt.subplot(gs[1])

    # Store all deltas for averaging
    all_deltas = []
    all_times_list = []
    final_deltas = []  # Store final delta values for histogram

    # Calculate common values once
    prediction_time_obj, default_datetime = _prepare_common_values(prediction_time)

    for snapshot_date in snapshot_dates:
        # Prepare data for this date
        df_copy, snapshot_datetime, _, _ = _prepare_arrival_data(
            df, prediction_time, snapshot_date, prediction_window, yta_time_interval
        )

        # Get arrivals within the prediction window
        arrivals = df_copy[
            (df_copy.index > snapshot_datetime)
            & (df_copy.index <= snapshot_datetime + pd.Timedelta(prediction_window))
        ]

        if len(arrivals) == 0:
            continue

        # Sort arrivals by time and create cumulative count
        arrivals = arrivals.sort_values("arrival_datetime")
        arrivals["cumulative_count"] = range(1, len(arrivals) + 1)

        # Calculate arrival rates and prepare time points
        mean_arrival_rates = _calculate_arrival_rates(
            df_copy, prediction_time_obj, prediction_window, yta_time_interval
        )

        # Prepare arrival times
        arrival_times_piecewise = _prepare_arrival_times(
            mean_arrival_rates, prediction_time_obj, default_date=datetime(2024, 1, 1)
        )

        # Calculate cumulative rates
        cumulative_rates = _calculate_cumulative_rates(
            arrival_times_piecewise, mean_arrival_rates
        )

        # Convert arrival times to use default date for plotting
        arrival_times_plot = [
            default_datetime + (t - snapshot_datetime) for t in arrivals.index
        ]

        # Create combined timeline
        all_times = _create_combined_timeline(
            default_datetime,
            arrival_times_plot,
            prediction_window,
            arrival_times_piecewise,
        )

        # Interpolate both actual and expected to the combined timeline
        actual_counts = np.interp(
            [t.timestamp() for t in all_times],
            [
                t.timestamp()
                for t in [default_datetime]
                + arrival_times_plot
                + [default_datetime + pd.Timedelta(prediction_window)]
            ],
            [0]
            + list(arrivals["cumulative_count"])
            + [arrivals["cumulative_count"].iloc[-1]],
        )

        expected_counts = np.interp(
            [t.timestamp() for t in all_times],
            [t.timestamp() for t in arrival_times_piecewise],
            cumulative_rates,
        )

        # Calculate delta
        delta = actual_counts - expected_counts
        delta[0] = 0  # Ensure delta starts at 0

        # Store for averaging
        all_deltas.append(delta)
        all_times_list.append(all_times)

        # Store final delta value for histogram
        final_deltas.append(delta[-1])

        # Plot delta for this snapshot date
        ax1.step(all_times, delta, where="post", color="grey", alpha=0.5)

    # Calculate and plot average delta
    if all_deltas:
        # Find the common time points across all dates
        common_times = sorted(set().union(*[set(times) for times in all_times_list]))

        # Interpolate all deltas to common time points
        interpolated_deltas = []
        for times, delta in zip(all_times_list, all_deltas):
            # Only interpolate within the actual time range for each date
            min_time = min(times)
            max_time = max(times)
            valid_times = [t for t in common_times if min_time <= t <= max_time]

            if valid_times:
                interpolated = np.interp(
                    [t.timestamp() for t in valid_times],
                    [t.timestamp() for t in times],
                    delta,
                )
                # Pad with NaN for times outside the valid range
                padded = np.full(len(common_times), np.nan)
                valid_indices = [
                    i for i, t in enumerate(common_times) if t in valid_times
                ]
                padded[valid_indices] = interpolated
                interpolated_deltas.append(padded)

        # Calculate average delta, ignoring NaN values
        avg_delta = np.nanmean(interpolated_deltas, axis=0)

        # Plot average delta as a solid line
        # Only plot where we have valid data (not NaN)
        valid_mask = ~np.isnan(avg_delta)
        if np.any(valid_mask):
            ax1.step(
                [t for t, m in zip(common_times, valid_mask) if m],
                avg_delta[valid_mask],
                where="post",
                color="red",
                linewidth=2,
            )

    # Add horizontal line at y=0
    ax1.axhline(y=0, color="gray", linestyle="--", alpha=0.5)

    # Format the main plot
    ax1.set_xlabel("Time")
    ax1.set_ylabel("Difference (Actual - Expected)")
    ax1.set_title(
        f"Difference Between Actual and Expected Arrivals in the {(int(prediction_window.total_seconds()/3600))} hours after {format_prediction_time(prediction_time)} on all dates"
    )

    # Format time axis
    _format_time_axis(ax1, common_times)

    # Create histogram of final delta values
    if final_deltas:
        # Round values to nearest integer for binning
        rounded_deltas = np.round(final_deltas)
        unique_values = np.unique(rounded_deltas)

        # Create bins centered on integer values
        bin_edges = np.arange(unique_values.min() - 0.5, unique_values.max() + 1.5, 1)

        ax2.hist(final_deltas, bins=bin_edges, color="grey", alpha=0.7)
        ax2.axvline(x=0, color="gray", linestyle="--", alpha=0.5)
        ax2.set_xlabel("Final Difference (Actual - Expected)")
        ax2.set_ylabel("Count")
        ax2.set_title("Distribution of Final Differences")

        # Set x-axis ticks to integer values with appropriate spacing
        value_range = unique_values.max() - unique_values.min()
        step_size = max(1, int(value_range / 10))  # Aim for about 10 ticks
        ax2.set_xticks(
            np.arange(unique_values.min(), unique_values.max() + 1, step_size)
        )

    plt.tight_layout()

    if media_file_path:
        filename = file_name if file_name else "multiple_deltas.png"
        plt.savefig(media_file_path / filename, dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close()
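
A hedged sketch overlaying deltas for a week of snapshot dates (same assumed `arrivals_df` as above):

    from datetime import date, timedelta
    from patientflow.viz.observed_against_expected import plot_arrival_deltas

    snapshot_dates = [date(2025, 3, d) for d in range(1, 8)]  # illustrative dates
    plot_arrival_deltas(
        arrivals_df,
        prediction_time=(9, 30),
        snapshot_dates=snapshot_dates,
        prediction_window=timedelta(hours=8),
    )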

plot_deltas(results1, results2=None, title1=None, title2=None, main_title='Histograms of Observed - Expected Values', xlabel='Observed minus expected', media_file_path=None, file_name=None, return_figure=False)

Plot histograms of observed minus expected values.

Creates a grid of histograms showing the distribution of differences between observed and expected values for different prediction times. Optionally compares two sets of results side by side.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| results1 | dict | First set of results containing observed and expected values for different prediction times. Keys are prediction times, values are dicts with 'observed' and 'expected' arrays. | required |
| results2 | dict | Second set of results for comparison, following the same format as results1. | None |
| title1 | str | Title for the first set of results. | None |
| title2 | str | Title for the second set of results. | None |
| main_title | str | Main title for the entire plot. | "Histograms of Observed - Expected Values" |
| xlabel | str | Label for the x-axis of each histogram. | "Observed minus expected" |
| media_file_path | Path | Path where the plot should be saved. If provided, saves the plot as a PNG file. | None |
| file_name | str | Custom filename to use when saving the plot. If not provided, defaults to "observed_vs_expected.png". | None |
| return_figure | bool | If True, returns the matplotlib figure object instead of displaying it. | False |

Returns:

| Type | Description |
| --- | --- |
| Figure or None | The figure object if return_figure is True, otherwise None. |

Notes

The function creates a grid of histograms with a maximum of 5 columns. Each histogram shows the distribution of differences between observed and expected values for a specific prediction time. A red dashed line at x=0 indicates where observed equals expected.

Source code in src/patientflow/viz/observed_against_expected.py
def plot_deltas(
    results1,
    results2=None,
    title1=None,
    title2=None,
    main_title="Histograms of Observed - Expected Values",
    xlabel="Observed minus expected",
    media_file_path=None,
    file_name=None,
    return_figure=False,
):
    """Plot histograms of observed minus expected values.

    Creates a grid of histograms showing the distribution of differences between
    observed and expected values for different prediction times. Optionally compares
    two sets of results side by side.

    Parameters
    ----------
    results1 : dict
        First set of results containing observed and expected values for different
        prediction times. Keys are prediction times, values are dicts with 'observed'
        and 'expected' arrays.
    results2 : dict, optional
        Second set of results for comparison, following the same format as results1.
    title1 : str, optional
        Title for the first set of results.
    title2 : str, optional
        Title for the second set of results.
    main_title : str, default="Histograms of Observed - Expected Values"
        Main title for the entire plot.
    xlabel : str, default="Observed minus expected"
        Label for the x-axis of each histogram.
    media_file_path : Path, optional
        Path where the plot should be saved. If provided, saves the plot as a PNG file.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "observed_vs_expected.png".
    return_figure : bool, default=False
        If True, returns the matplotlib figure object instead of displaying it.

    Returns
    -------
    matplotlib.figure.Figure or None
        The figure object if return_figure is True, otherwise None.

    Notes
    -----
    The function creates a grid of histograms with a maximum of 5 columns.
    Each histogram shows the distribution of differences between observed and
    expected values for a specific prediction time. A red dashed line at x=0
    indicates where observed equals expected.
    """
    # Calculate the number of subplots needed
    num_plots = len(results1)

    # Calculate the number of rows and columns for the subplots
    num_cols = min(5, num_plots)  # Maximum of 5 columns
    num_rows = math.ceil(num_plots / num_cols)

    if results2:
        num_rows *= 2  # Double the number of rows if we have two result sets

    # Set a minimum width for the figure
    min_width = 8  # minimum width in inches
    width = max(min_width, 4 * num_cols)
    height = 4 * num_rows

    # Create the plot
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(width, height), squeeze=False)
    fig.suptitle(main_title, fontsize=14)

    # Flatten the axes array
    axes = axes.flatten()

    def plot_results(
        results, start_index, result_title, global_min, global_max, max_freq
    ):
        # Convert prediction times to minutes for sorting
        prediction_times_sorted = sorted(
            results.items(),
            key=lambda x: int(x[0].split("_")[-1][:2]) * 60
            + int(x[0].split("_")[-1][2:]),
        )

        # Create symmetric bins around zero
        bins = np.arange(global_min, global_max + 2) - 0.5

        for i, (_prediction_time, values) in enumerate(prediction_times_sorted):
            observed = np.array(values["observed"])
            expected = np.array(values["expected"])
            difference = observed - expected

            ax = axes[start_index + i]

            ax.hist(difference, bins=bins, edgecolor="black", alpha=0.7)
            ax.axvline(x=0, color="r", linestyle="--", linewidth=1)

            # Format the prediction time
            formatted_time = format_prediction_time(_prediction_time)

            # Combine the result_title and formatted_time
            if result_title:
                ax.set_title(f"{result_title} {formatted_time}")
            else:
                ax.set_title(formatted_time)

            ax.set_xlabel(xlabel)
            ax.set_ylabel("Frequency")
            ax.set_xlim(global_min - 0.5, global_max + 0.5)
            ax.set_ylim(0, max_freq)

    # Calculate global min and max differences for consistent x-axis across both result sets
    all_differences = []
    max_counts = []

    # Gather all differences and compute histogram data for both result sets
    for results in [results1] + ([results2] if results2 else []):
        for _, values in results.items():
            observed = np.array(values["observed"])
            expected = np.array(values["expected"])
            differences = observed - expected
            all_differences.extend(differences)
            # Compute histogram data to find maximum frequency
            counts, _ = np.histogram(differences)
            max_counts.append(max(counts))

    # Find the symmetric range around zero
    abs_max = max(abs(min(all_differences)), abs(max(all_differences)))
    global_min = -math.ceil(abs_max)
    global_max = math.ceil(abs_max)

    # Find the maximum frequency across all histograms
    max_freq = math.ceil(max(max_counts) * 1.1)  # Add 10% padding

    # Plot the first results set
    plot_results(results1, 0, title1, global_min, global_max, max_freq)

    # Plot the second results set if provided
    if results2:
        plot_results(results2, num_plots, title2, global_min, global_max, max_freq)

    # Hide any unused subplots
    for j in range(num_plots * (2 if results2 else 1), len(axes)):
        fig.delaxes(axes[j])

    plt.tight_layout()

    if media_file_path:
        if file_name:
            plt.savefig(media_file_path / file_name, dpi=300)
        else:
            plt.savefig(media_file_path / "observed_vs_expected.png", dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close()
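
A hedged sketch of the expected input format: judging by the sorting logic above, keys end in an HHMM time (e.g. "admissions_0930"); the observed and expected numbers here are invented for illustration:

    import numpy as np
    from patientflow.viz.observed_against_expected import plot_deltas

    rng = np.random.default_rng(0)
    results = {
        "admissions_0930": {
            "observed": rng.poisson(12, size=50),  # one value per snapshot date
            "expected": np.full(50, 12.0),
        },
        "admissions_1530": {
            "observed": rng.poisson(18, size=50),
            "expected": np.full(50, 18.0),
        },
    }
    plot_deltas(results, title1="Test set")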

probability_distribution

Module for generating probability distribution visualizations.

Functions:

Name Description
plot_prob_dist : Plot a probability distribution as a bar chart with enhanced plotting options.

plot_prob_dist(prob_dist_data, title, media_file_path=None, figsize=(6, 3), include_titles=False, truncate_at_beds=None, text_size=None, bar_colour='#5B9BD5', file_name=None, probability_thresholds=None, show_probability_thresholds=True, probability_levels=None, plot_bed_base=None, xlabel='Number of beds', return_figure=False)

Plot a probability distribution as a bar chart with enhanced plotting options.

This function generates a bar plot for a given probability distribution, either as a pandas DataFrame, a scipy.stats distribution object (e.g., Poisson), or a dictionary. The plot can be customized with titles, axis labels, markers, and additional visual properties.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prob_dist_data | pandas.DataFrame, dict, scipy.stats distribution, or array-like | The probability distribution data to be plotted: a pandas DataFrame; a dictionary (keys are indices, values are probabilities); a scipy.stats distribution (e.g. Poisson), for which probabilities are computed for integer values within the specified range; or an array-like of probabilities (indices will be 0 to len(array)-1). | required |
| title | str | The title of the plot, used for display and optionally as the file name. | required |
| media_file_path | str or Path | Directory where the plot image will be saved. If not provided, the plot is displayed without saving. | None |
| figsize | tuple of float | The size of the figure, specified as (width, height). | (6, 3) |
| include_titles | bool | Whether to include titles and axis labels in the plot. | False |
| truncate_at_beds | int or tuple of (int, int) | Either a single number specifying the upper bound, or a tuple of (lower_bound, upper_bound) for the x-axis range. If None, the full range of the data will be displayed. | None |
| text_size | int | Font size for plot text, including titles and tick labels. | None |
| bar_colour | str | The color of the bars in the plot. | '#5B9BD5' |
| file_name | str | Custom filename to use when saving the plot. If not provided, defaults to a generated name based on the title. | None |
| probability_thresholds | dict | A dictionary where keys are points on the cumulative distribution function (as decimals, e.g. 0.9 for 90%) and values are the corresponding resource thresholds (bed counts). For example, {0.9: 15} indicates there is a 90% probability that at least 15 beds will be needed (represents the lower tail of the distribution). | None |
| show_probability_thresholds | bool | Whether to show vertical lines indicating the resource requirements at different points on the cumulative distribution function. | True |
| probability_levels | list of float | List of probability levels for automatic threshold calculation. | None |
| plot_bed_base | dict | Dictionary of bed balance lines to plot in red. Keys are labels and values are x-axis positions. | None |
| xlabel | str | A label for the x axis. | 'Number of beds' |
| return_figure | bool | If True, returns the matplotlib figure instead of displaying it. | False |

Returns:

| Type | Description |
| --- | --- |
| Figure or None | Returns the figure if return_figure is True, otherwise displays the plot |

Examples:

Basic usage with an array of probabilities:

>>> probabilities = [0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05]
>>> plot_prob_dist(probabilities, "Bed Demand Distribution")

With thresholds:

>>> thresholds = _calculate_probability_thresholds(probabilities, [0.8, 0.95])
>>> plot_prob_dist(probabilities, "Bed Demand with Confidence Levels",
...                probability_thresholds=thresholds)

Using with a scipy stats distribution:

>>> from scipy import stats
>>> poisson_dist = stats.poisson(mu=5)  # Poisson with mean of 5
>>> plot_prob_dist(poisson_dist, "Poisson Distribution (μ=5)",
...                truncate_at_beds=(0, 15))
Source code in src/patientflow/viz/probability_distribution.py
def plot_prob_dist(
    prob_dist_data,
    title,
    media_file_path=None,
    figsize=(6, 3),
    include_titles=False,
    truncate_at_beds=None,
    text_size=None,
    bar_colour="#5B9BD5",
    file_name=None,
    probability_thresholds=None,
    show_probability_thresholds=True,
    probability_levels=None,
    plot_bed_base=None,
    xlabel="Number of beds",
    return_figure=False,
):
    """Plot a probability distribution as a bar chart with enhanced plotting options.

    This function generates a bar plot for a given probability distribution, either
    as a pandas DataFrame, a scipy.stats distribution object (e.g., Poisson), or a
    dictionary. The plot can be customized with titles, axis labels, markers, and
    additional visual properties.

    Parameters
    ----------
    prob_dist_data : pandas.DataFrame, dict, scipy.stats distribution, or array-like
        The probability distribution data to be plotted. Can be:
        - pandas DataFrame
        - dictionary (keys are indices, values are probabilities)
        - scipy.stats distribution (e.g., Poisson). If a `scipy.stats` distribution is provided,
        the function computes probabilities for integer values within the specified range.
        - array-like of probabilities (indices will be 0 to len(array)-1)
    title : str
        The title of the plot, used for display and optionally as the file name.
    media_file_path : str or pathlib.Path, optional
        Directory where the plot image will be saved. If not provided, the plot is
        displayed without saving.
    figsize : tuple of float, optional
        The size of the figure, specified as (width, height).
        Default is (6, 3)
    include_titles : bool, optional
        Whether to include titles and axis labels in the plot.
        Default is False
    truncate_at_beds : int or tuple of (int, int), optional
        Either a single number specifying the upper bound, or a tuple of
        (lower_bound, upper_bound) for the x-axis range. If None, the full
        range of the data will be displayed.
    text_size : int, optional
        Font size for plot text, including titles and tick labels.
    bar_colour : str, optional
        The color of the bars in the plot.
        Default is "#5B9BD5"
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to a generated name based on the title.
    probability_thresholds : dict, optional
        A dictionary where keys are points on the cumulative distribution function (as decimals, e.g., 0.9 for 90%)
        and values are the corresponding resource thresholds (bed counts).
        For example, {0.9: 15} indicates there is a 90% probability that
        at least 15 beds will be needed (represents the lower tail of the distribution).
    show_probability_thresholds : bool, optional
        Whether to show vertical lines indicating the resource requirements
        at different points on the cumulative distribution function.
        Default is True
    probability_levels : list of float, optional
        List of probability levels for automatic threshold calculation.
    plot_bed_base : dict, optional
        Dictionary of bed balance lines to plot in red.
        Keys are labels and values are x-axis positions.
    xlabel : str, optional
        A label for the x axis.
        Default is "Number of beds"
    return_figure : bool, optional
        If True, returns the matplotlib figure instead of displaying it.
        Default is False

    Returns
    -------
    matplotlib.figure.Figure or None
        Returns the figure if return_figure is True, otherwise displays the plot

    Examples
    --------
    Basic usage with an array of probabilities:

    >>> probabilities = [0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05]
    >>> plot_prob_dist(probabilities, "Bed Demand Distribution")

    With thresholds:

    >>> thresholds = _calculate_probability_thresholds(probabilities, [0.8, 0.95])
    >>> plot_prob_dist(probabilities, "Bed Demand with Confidence Levels",
    ...                probability_thresholds=thresholds)

    Using with a scipy stats distribution:

    >>> from scipy import stats
    >>> poisson_dist = stats.poisson(mu=5)  # Poisson with mean of 5
    >>> plot_prob_dist(poisson_dist, "Poisson Distribution (μ=5)",
    ...                truncate_at_beds=(0, 15))
    """

    # Handle array-like input
    if isinstance(prob_dist_data, (np.ndarray, list)):
        array_length = len(prob_dist_data)
        prob_dist_data = pd.DataFrame(
            {"agg_proba": prob_dist_data}, index=range(array_length)
        )

    # Handle scipy.stats distribution input
    elif hasattr(prob_dist_data, "pmf") and callable(prob_dist_data.pmf):
        # Determine range for the distribution
        if truncate_at_beds is None:
            # Default range for distributions if not specified
            lower_bound = 0
            upper_bound = 20  # Reasonable default for most discrete distributions
        elif isinstance(truncate_at_beds, (int, float)):
            lower_bound = 0
            upper_bound = truncate_at_beds
        else:
            lower_bound, upper_bound = truncate_at_beds

        # Generate x values and probabilities
        x = np.arange(lower_bound, upper_bound + 1)
        probs = prob_dist_data.pmf(x)
        prob_dist_data = pd.DataFrame({"agg_proba": probs}, index=x)

        # No need to filter later
        truncate_at_beds = None

    # Handle dictionary input
    elif isinstance(prob_dist_data, dict):
        prob_dist_data = pd.DataFrame(
            {"agg_proba": list(prob_dist_data.values())},
            index=list(prob_dist_data.keys()),
        )

    # Apply truncation if specified
    if truncate_at_beds is not None:
        # Determine bounds
        if isinstance(truncate_at_beds, (int, float)):
            lower_bound = 0
            upper_bound = truncate_at_beds
        else:
            lower_bound, upper_bound = truncate_at_beds

        # Apply filtering
        mask = (prob_dist_data.index >= lower_bound) & (
            prob_dist_data.index <= upper_bound
        )
        filtered_data = prob_dist_data[mask]
    else:
        # Use all available data
        filtered_data = prob_dist_data

    # Calculate probability thresholds if probability_levels is provided
    if probability_thresholds is None and probability_levels is not None:
        probability_thresholds = _calculate_probability_thresholds(
            filtered_data["agg_proba"].values, probability_levels
        )

    # Create the plot
    fig = plt.figure(figsize=figsize)

    if not file_name:
        file_name = (
            title.replace(" ", "_").replace("\n", "_").replace("%", "percent") + ".png"
        )

    # Plot bars
    plt.bar(
        filtered_data.index,
        filtered_data["agg_proba"].values,
        color=bar_colour,
    )

    # Generate appropriate ticks based on data range
    if len(filtered_data) > 0:
        data_min = min(filtered_data.index)
        data_max = max(filtered_data.index)
        data_range = data_max - data_min

        if data_range <= 10:
            tick_step = 1
        elif data_range <= 50:
            tick_step = 5
        else:
            tick_step = 10

        tick_start = (data_min // tick_step) * tick_step
        tick_end = data_max + 1
        plt.xticks(np.arange(tick_start, tick_end, tick_step))

    # Plot probability threshold lines
    if show_probability_thresholds and probability_thresholds:
        colors = itertools.cycle(
            plt.cm.gray(np.linspace(0.3, 0.7, len(probability_thresholds)))
        )
        for probability, bed_count in probability_thresholds.items():
            plt.axvline(
                x=bed_count,
                linestyle="--",
                linewidth=2,
                color=next(colors),
                label=f"{probability*100:.0f}% probability of needing ≥ {bed_count} beds",
            )
        plt.legend(loc="upper right")

    # Add bed balance lines
    if plot_bed_base:
        for point in plot_bed_base:
            plt.axvline(
                x=plot_bed_base[point],
                linewidth=2,
                color="red",
                label=f"bed balance: {point}",
            )
        plt.legend(loc="upper right")

    # Add text and labels
    if text_size:
        plt.tick_params(axis="both", which="major", labelsize=text_size)
        plt.xlabel(xlabel, fontsize=text_size)
        if include_titles:
            plt.title(title, fontsize=text_size)
            plt.ylabel("Probability", fontsize=text_size)
    else:
        plt.xlabel(xlabel)
        if include_titles:
            plt.title(title)
            plt.ylabel("Probability")

    plt.tight_layout()

    # Save or display the figure
    if media_file_path:
        plt.savefig(media_file_path / file_name.replace(" ", "_"), dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
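
A further sketch using dictionary input and automatic threshold calculation (the probabilities below are invented):

    from patientflow.viz.probability_distribution import plot_prob_dist

    # Hypothetical aggregate bed-demand distribution keyed by bed count.
    prob_dist = {0: 0.05, 1: 0.10, 2: 0.20, 3: 0.30, 4: 0.20, 5: 0.10, 6: 0.05}
    plot_prob_dist(
        prob_dist,
        "Aggregate demand at 09:30",
        include_titles=True,
        probability_levels=[0.9],  # draw the 90% threshold line automatically
    )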

quantile_quantile

Generate Quantile-Quantile (QQ) plots to compare observed values with model predictions.

This module creates QQ plots for healthcare bed demand predictions, comparing observed values with model predictions. A QQ plot is a graphical technique for determining whether two data sets come from populations with a common distribution: if the points fall approximately along the reference line y = x, the distributions are likely similar.

Functions:

Name Description
qq_plot : function

Generate multiple QQ plots comparing observed values with model predictions

Notes

To prepare the predicted distribution:

* Treat the predicted distributions (saved as CDFs) for all time points of interest as if they were one distribution.
* Within this predicted distribution, because each probability covers a discrete rather than continuous set of input values, the upper and lower values of the probability range are saved at each value.
* The midpoint between upper and lower is calculated and saved.
* The distribution of CDF midpoints (one for each horizon date) is sorted by the value of the midpoint, and a CDF of this is calculated (a CDF of CDFs, in effect).
* These are weighted by the probability of each value occurring.

To prepare the observed distribution:

* Take the observed number for each horizon date and save the CDF of that value from its predicted distribution.
* The distribution of CDF values (one per horizon date) is sorted.
* These are weighted by the probability of each value occurring, which is uniform (1 over the number of horizon dates).
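
A minimal sketch of the midpoint construction described above, using names borrowed from the source below and an invented five-point distribution:

    import numpy as np

    # One snapshot date: a discrete predicted distribution and an observed count.
    agg_predicted = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
    agg_observed = 2

    upper = agg_predicted.cumsum()      # P(X <= k), the upper CDF value at each k
    lower = np.hstack((0, upper[:-1]))  # P(X < k), the lower CDF value at each k
    mid = (upper + lower) / 2           # midpoint carried into the CDF of CDFs

    # The observed count contributes the midpoint CDF value at its index
    # (0.5 here, since the example distribution is symmetric about 2).
    observed_cdf = mid[agg_observed]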

qq_plot(prediction_times, prob_dist_dict_all, model_name='admissions', return_figure=False, figsize=None, suptitle=None, media_file_path=None, file_name=None)

Generate multiple QQ plots comparing observed values with model predictions.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prediction_times | list of tuple | List of (hour, minute) tuples for prediction times. | required |
| prob_dist_dict_all | dict | Dictionary of probability distributions keyed by model_key. | required |
| model_name | str | Base name of the model to construct model keys. | "admissions" |
| return_figure | bool | If True, returns the figure object instead of displaying it. | False |
| figsize | tuple of float | Size of the figure in inches as (width, height). If None, calculated automatically based on number of plots. | None |
| suptitle | str | Super title for the entire figure, displayed above all subplots. | None |
| media_file_path | Path | Path to save the plot. | None |
| file_name | str | Custom filename to use when saving the plot. If not provided, defaults to "qq_plot.png". | None |

Returns:

| Type | Description |
| --- | --- |
| Figure or None | Returns the figure if return_figure is True, otherwise displays the plot and returns None. |

Notes

The function creates a QQ plot for each prediction time, comparing the observed distribution with the predicted distribution. Each subplot shows how well the model's predictions match the actual observations.

Source code in src/patientflow/viz/quantile_quantile.py
def qq_plot(
    prediction_times,
    prob_dist_dict_all,
    model_name="admissions",
    return_figure=False,
    figsize=None,
    suptitle=None,
    media_file_path=None,
    file_name=None,
):
    """Generate multiple QQ plots comparing observed values with model predictions.

    Parameters
    ----------
    prediction_times : list of tuple
        List of (hour, minute) tuples for prediction times.
    prob_dist_dict_all : dict
        Dictionary of probability distributions keyed by model_key.
    model_name : str, default="admissions"
        Base name of the model to construct model keys.
    return_figure : bool, default=False
        If True, returns the figure object instead of displaying it.
    figsize : tuple of float, optional
        Size of the figure in inches as (width, height). If None, calculated automatically
        based on number of plots.
    suptitle : str, optional
        Super title for the entire figure, displayed above all subplots.
    media_file_path : Path, optional
        Path to save the plot.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "qq_plot.png".

    Returns
    -------
    matplotlib.figure.Figure or None
        Returns the figure if return_figure is True, otherwise displays the plot and returns None.

    Notes
    -----
    The function creates a QQ plot for each prediction time, comparing the observed
    distribution with the predicted distribution. Each subplot shows how well the
    model's predictions match the actual observations.
    """
    # Sort prediction times by converting to minutes since midnight
    prediction_times_sorted = sorted(
        prediction_times,
        key=lambda x: x[0] * 60
        + x[1],  # Convert (hour, minute) to minutes since midnight
    )

    num_plots = len(prediction_times_sorted)
    if figsize is None:
        figsize = (num_plots * 5, 4)

    # Create subplot layout
    fig, axs = plt.subplots(1, num_plots, figsize=figsize)

    # Handle case of single prediction time
    if num_plots == 1:
        axs = [axs]

    # Loop through each subplot
    for i, prediction_time in enumerate(prediction_times_sorted):
        # Initialize lists to store CDF and observed data
        cdf_data = []
        observed_data = []

        # Get model key and corresponding prob_dist_dict
        model_key = get_model_key(model_name, prediction_time)
        prob_dist_dict = prob_dist_dict_all[model_key]

        # Process data for current subplot
        for dt in prob_dist_dict:
            agg_predicted = np.array(prob_dist_dict[dt]["agg_predicted"])
            agg_observed = prob_dist_dict[dt]["agg_observed"]

            upper = agg_predicted.cumsum()
            lower = np.hstack((0, upper[:-1]))
            mid = (upper + lower) / 2

            cdf_data.append(np.column_stack((upper, lower, mid, agg_predicted)))
            # Round the observed data to nearest integer before using as index
            agg_observed_int = int(round(agg_observed))
            observed_data.append(mid[agg_observed_int])

        if not cdf_data:
            continue

        # Prepare data for plotting
        cdf_data = np.vstack(cdf_data)
        qq_model = pd.DataFrame(
            # Column order matches the column_stack above: (upper, lower, mid, weights)
            cdf_data, columns=["cdf_upper", "cdf_lower", "cdf_mid", "weights"]
        )
        qq_model = qq_model.sort_values("cdf_mid")
        qq_model["cum_weight"] = qq_model["weights"].cumsum()
        qq_model["cum_weight_normed"] = (
            qq_model["cum_weight"] / qq_model["weights"].sum()
        )

        qq_observed = pd.DataFrame(observed_data, columns=["cdf_observed"])
        qq_observed = qq_observed.sort_values("cdf_observed")
        qq_observed["weights"] = 1 / len(observed_data)
        qq_observed["cum_weight_normed"] = qq_observed["weights"].cumsum()

        qq_observed["max_model_cdf_at_this_value"] = qq_observed["cdf_observed"].apply(
            lambda x: qq_model[qq_model["cdf_mid"] <= x]["cum_weight_normed"].max()
        )

        # Plot on current subplot
        ax = axs[i]
        ax.set_aspect("equal")
        ax.set_xlim([0, 1])
        ax.set_ylim([0, 1])

        # Reference line y=x
        ax.plot([0, 1], [0, 1], linestyle="--")

        # Plot QQ data points
        ax.plot(
            qq_observed["max_model_cdf_at_this_value"],
            qq_observed["cum_weight_normed"],
            marker=".",
            linewidth=0,
        )

        # Set labels and title for subplot with hour:minute format
        hour, minutes = prediction_time
        ax.set_xlabel("Cdf of model distribution")
        ax.set_ylabel("Cdf of observed distribution")
        ax.set_title(f"QQ Plot for {hour}:{minutes:02}")

    plt.tight_layout()

    # Add suptitle if provided
    if suptitle:
        plt.suptitle(suptitle, fontsize=16, y=1.05)

    if media_file_path:
        plt.savefig(media_file_path / (file_name or "qq_plot.png"), dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close(fig)
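
To make the mid-point CDF construction above concrete, here is a self-contained sketch (NumPy only) of the quantities computed for each snapshot date; the distribution is invented for illustration:

import numpy as np

# Hypothetical predicted distribution over 0..3 admissions for one snapshot
agg_predicted = np.array([0.1, 0.4, 0.3, 0.2])

upper = agg_predicted.cumsum()      # P(X <= k):     [0.1, 0.5, 0.8, 1.0]
lower = np.hstack((0, upper[:-1]))  # P(X <= k - 1): [0.0, 0.1, 0.5, 0.8]
mid = (upper + lower) / 2           # mid-point CDF: [0.05, 0.3, 0.65, 0.9]

# An observed count of 2 admissions maps to its mid-point CDF value
agg_observed = 2
print(mid[agg_observed])  # ≈ 0.65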

randomised_pit

plot_randomised_pit(prediction_times, prob_dist_dict_all, model_name='admissions', return_figure=False, return_dataframe=False, figsize=None, suptitle=None, media_file_path=None, file_name=None, n_bins=10, seed=42)

Generate randomised PIT histograms for multiple prediction times side by side.

Parameters:

Name Type Description Default
prediction_times list of tuple

List of (hour, minute) tuples representing times for which predictions were made.

required
prob_dist_dict_all dict

Dictionary of probability distributions keyed by model_key. Each entry contains information about predicted distributions and observed values for different snapshot dates.

required
model_name str

Base name of the model to construct model keys, by default "admissions".

'admissions'
return_figure bool

If True, returns the figure object instead of displaying it, by default False.

False
return_dataframe bool

If True, returns a dictionary of PIT values by model_key, by default False.

False
figsize tuple of (float, float)

Size of the figure in inches as (width, height). If None, calculated automatically based on number of plots, by default None.

None
suptitle str

Super title for the entire figure, displayed above all subplots, by default None.

None
media_file_path Path

Path to save the plot, by default None. If provided, saves the plot as a PNG file.

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "plot_randomised_pit.png".

None
n_bins int

Number of histogram bins, by default 10.

10
seed int

Random seed for reproducibility, by default 42.

42

Returns:

Type Description
Figure

The figure object containing the plots, if return_figure is True.

dict

Dictionary of PIT values by model_key, if return_dataframe is True.

tuple

Tuple of (figure, pit_values_dict) if both return_figure and return_dataframe are True.

None

If neither return_figure nor return_dataframe is True, displays the plots and returns None.

Source code in src/patientflow/viz/randomised_pit.py
def plot_randomised_pit(
    prediction_times: List[Tuple[int, int]],
    prob_dist_dict_all: Dict[str, Dict],
    model_name: str = "admissions",
    return_figure: bool = False,
    return_dataframe: bool = False,
    figsize: Optional[Tuple[float, float]] = None,
    suptitle: Optional[str] = None,
    media_file_path: Optional[Path] = None,
    file_name: Optional[str] = None,
    n_bins: int = 10,
    seed: Optional[int] = 42,
) -> Union[
    plt.Figure, Dict[str, List[float]], Tuple[plt.Figure, Dict[str, List[float]]], None
]:
    """
    Generate randomised PIT histograms for multiple prediction times side by side.

    Parameters
    ----------
    prediction_times : list of tuple
        List of (hour, minute) tuples representing times for which predictions were made.
    prob_dist_dict_all : dict
        Dictionary of probability distributions keyed by model_key. Each entry contains
        information about predicted distributions and observed values for different
        snapshot dates.
    model_name : str, optional
        Base name of the model to construct model keys, by default "admissions".
    return_figure : bool, optional
        If True, returns the figure object instead of displaying it, by default False.
    return_dataframe : bool, optional
        If True, returns a dictionary of PIT values by model_key, by default False.
    figsize : tuple of (float, float), optional
        Size of the figure in inches as (width, height). If None, calculated automatically
        based on number of plots, by default None.
    suptitle : str, optional
        Super title for the entire figure, displayed above all subplots, by default None.
    media_file_path : Path, optional
        Path to save the plot, by default None. If provided, saves the plot as a PNG file.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "plot_randomised_pit.png".
    n_bins : int, optional
        Number of histogram bins, by default 10.
    seed : int, optional
        Random seed for reproducibility, by default 42.

    Returns
    -------
    matplotlib.figure.Figure
        The figure object containing the plots, if return_figure is True.
    dict
        Dictionary of PIT values by model_key, if return_dataframe is True.
    tuple
        Tuple of (figure, pit_values_dict) if both return_figure and return_dataframe are True.
    None
        If neither return_figure nor return_dataframe is True, displays the plots and returns None.
    """
    if seed is not None:
        np.random.seed(seed)

    # Sort prediction times by converting to minutes since midnight
    prediction_times_sorted = sorted(
        prediction_times,
        key=lambda x: x[0] * 60 + x[1],
    )

    # Calculate figure parameters
    num_plots = len(prediction_times_sorted)
    figsize = figsize or (num_plots * 5, 4)

    # Create subplot layout
    fig, axs = plt.subplots(1, num_plots, figsize=figsize)
    axs = [axs] if num_plots == 1 else axs

    all_pit_values: Dict[str, List[float]] = {}
    max_density = 0.0  # Track maximum density across all histograms

    # Process each subplot
    for i, prediction_time in enumerate(prediction_times_sorted):
        model_key = get_model_key(model_name, prediction_time)
        prob_dist_dict = prob_dist_dict_all[model_key]

        if not prob_dist_dict:
            continue

        observations = []
        cdf_functions = []

        # Extract data for each date
        for dt in prob_dist_dict:
            try:
                observation = prob_dist_dict[dt]["agg_observed"]
                predicted_dist = prob_dist_dict[dt]["agg_predicted"]["agg_proba"]

                # Convert probability distribution to CDF function
                cdf_func = _prob_to_cdf(predicted_dist)

                observations.append(observation)
                cdf_functions.append(cdf_func)

            except Exception as e:
                print(f"Skipping date {dt} due to error: {e}")
                continue

        if len(observations) == 0:
            continue

        # Generate PIT values
        pit_values = []

        for obs, cdf_func in zip(observations, cdf_functions):
            try:
                # Calculate PIT range bounds
                lower = cdf_func(obs - 1) if obs > 0 else 0.0
                upper = cdf_func(obs)

                # Sample randomly within the range
                pit_value = np.random.uniform(lower, upper)
                pit_values.append(pit_value)

            except Exception as e:
                print(f"Error processing observation {obs}: {e}")
                continue

        all_pit_values[model_key] = pit_values

        # Calculate histogram to get density
        hist, _ = np.histogram(pit_values, bins=n_bins, density=True)
        max_density = max(max_density, np.max(hist))

    # Now plot with consistent y-axis scale
    for i, prediction_time in enumerate(prediction_times_sorted):
        model_key = get_model_key(model_name, prediction_time)
        pit_values = all_pit_values.get(model_key, [])

        if not pit_values:
            continue

        # Plot histogram
        ax = axs[i]
        ax.hist(
            pit_values,
            bins=n_bins,
            density=True,
            alpha=0.7,
            edgecolor="black",
            label="Randomised PIT",
        )

        # Add uniform reference line
        ax.axhline(
            y=1.0, color="red", linestyle="--", linewidth=2, label="Perfect Uniform"
        )

        # Set labels and title
        hour, minutes = prediction_time
        ax.set_xlabel("PIT Value")
        ax.set_ylabel("Density")
        ax.set_title(f"PIT Histogram for {hour}:{minutes:02}")
        ax.set_xlim(0, 1)
        ax.set_ylim(0, max_density * 1.1)  # Add 10% padding
        ax.grid(True, alpha=0.3)

        if i == 0:  # Only show legend on first subplot
            ax.legend()

    # Final plot configuration
    plt.tight_layout()
    if suptitle:
        plt.suptitle(suptitle, fontsize=16, y=1.05)
    if media_file_path:
        plt.savefig(media_file_path / (file_name or "plot_randomised_pit.png"), dpi=300)

    # Return based on flags
    if return_figure and return_dataframe:
        return fig, all_pit_values
    elif return_figure:
        return fig
    elif return_dataframe:
        plt.show()
        plt.close()
        return all_pit_values
    else:
        plt.show()
        plt.close()
        return None
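
The randomised PIT draw at the heart of this function can be shown in isolation. This is a minimal sketch using a made-up predictive distribution, mirroring the lower/upper logic above rather than the module's private _prob_to_cdf helper:

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical predictive distribution over 0..3 admissions
predicted = np.array([0.1, 0.4, 0.3, 0.2])
cdf = predicted.cumsum()  # [0.1, 0.5, 0.8, 1.0]

obs = 1  # observed count for one snapshot date
lower = cdf[obs - 1] if obs > 0 else 0.0  # F(obs - 1) = 0.1
upper = cdf[obs]                          # F(obs)     = 0.5

# For a well-calibrated model, values drawn this way are uniform on [0, 1]
pit_value = rng.uniform(lower, upper)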

shap

SHAP (SHapley Additive exPlanations) visualization module.

This module provides functionality for generating SHAP plots. These are useful for visualizing feature importance and their impact on model decisions.

Functions:

Name Description
plot_shap : function

Generate SHAP plots for multiple trained models.

plot_shap(trained_models, test_visits, exclude_from_training_data, media_file_path=None, file_name=None, return_figure=False, label_col='is_admitted')

Generate SHAP plots for multiple trained models.

This function creates SHAP (SHapley Additive exPlanations) summary plots for each trained model, showing the impact of features on model predictions. The plots can be saved to a specified media file path or displayed directly.

Parameters:

Name Type Description Default
trained_models list[TrainedClassifier] or dict[str, TrainedClassifier]

List of trained classifier objects or dictionary with TrainedClassifier values.

required
test_visits DataFrame

DataFrame containing the test visit data.

required
exclude_from_training_data list[str]

List of columns to exclude from training data.

required
media_file_path Path

Directory path where the generated plots will be saved. If None, plots are only displayed.

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "shap_plot.png".

None
return_figure bool

If True, returns the figure instead of displaying it.

False
label_col str

Name of the column containing the target labels.

"is_admitted"

Returns:

Type Description
Figure or None

If return_figure is True, returns the generated figure. Otherwise, returns None.

Source code in src/patientflow/viz/shap.py
def plot_shap(
    trained_models: list[TrainedClassifier] | dict[str, TrainedClassifier],
    test_visits,
    exclude_from_training_data,
    media_file_path: Optional[Path] = None,
    file_name: Optional[str] = None,
    return_figure=False,
    label_col: str = "is_admitted",
):
    """Generate SHAP plots for multiple trained models.

    This function creates SHAP (SHapley Additive exPlanations) summary plots for each
    trained model, showing the impact of features on model predictions. The plots can
    be saved to a specified media file path or displayed directly.

    Parameters
    ----------
    trained_models : list[TrainedClassifier] or dict[str, TrainedClassifier]
        List of trained classifier objects or dictionary with TrainedClassifier values.
    test_visits : pandas.DataFrame
        DataFrame containing the test visit data.
    exclude_from_training_data : list[str]
        List of columns to exclude from training data.
    media_file_path : Path, optional
        Directory path where the generated plots will be saved. If None, plots are
        only displayed.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "shap_plot.png".
    return_figure : bool, default=False
        If True, returns the figure instead of displaying it.
    label_col : str, default="is_admitted"
        Name of the column containing the target labels.

    Returns
    -------
    matplotlib.figure.Figure or None
        If return_figure is True, returns the generated figure. Otherwise, returns None.
    """
    # Convert dict to list if needed
    if isinstance(trained_models, dict):
        trained_models = list(trained_models.values())

    # Sort trained_models by prediction time
    trained_models_sorted = sorted(
        trained_models,
        key=lambda x: x.training_results.prediction_time[0] * 60
        + x.training_results.prediction_time[1],
    )

    for trained_model in trained_models_sorted:
        fig, ax = plt.subplots(figsize=(8, 12))

        # use non-calibrated pipeline
        pipeline: Pipeline = trained_model.pipeline
        prediction_time = trained_model.training_results.prediction_time

        # Get test data for this prediction time
        X_test, _ = prepare_patient_snapshots(
            df=test_visits,
            prediction_time=prediction_time,
            exclude_columns=exclude_from_training_data,
            single_snapshot_per_visit=False,
            label_col=label_col,
        )

        X_test = add_missing_columns(pipeline, X_test)
        transformed_cols = pipeline.named_steps[
            "feature_transformer"
        ].get_feature_names_out()
        transformed_cols = [col.split("__")[-1] for col in transformed_cols]
        truncated_cols = [col[:45] for col in transformed_cols]

        # Transform features
        X_test = pipeline.named_steps["feature_transformer"].transform(X_test)

        # Create SHAP explainer
        explainer = shap.TreeExplainer(pipeline.named_steps["classifier"])

        # Convert sparse matrix to dense if necessary
        if scipy.sparse.issparse(X_test):
            X_test = X_test.toarray()

        shap_values = explainer.shap_values(X_test)

        # Print prediction distribution
        predictions = pipeline.named_steps["classifier"].predict(X_test)
        print(
            "Predicted classification (not admitted, admitted): ",
            np.bincount(predictions),
        )

        # Print mean SHAP values for each class
        if isinstance(shap_values, list):
            print("SHAP values shape:", [arr.shape for arr in shap_values])
            print("Mean SHAP values (class 0):", np.abs(shap_values[0]).mean(0))
            print("Mean SHAP values (class 1):", np.abs(shap_values[1]).mean(0))

        # Create SHAP summary plot
        rng = np.random.default_rng()
        shap.summary_plot(
            shap_values,
            X_test,
            feature_names=truncated_cols,
            show=False,
            rng=rng,
        )

        hour, minutes = prediction_time
        ax.set_title(f"SHAP Values for Time of Day: {hour}:{minutes:02}")
        ax.set_xlabel("SHAP Value")
        plt.tight_layout()

        if media_file_path:
            # Save plot
            if file_name:
                shap_plot_path = str(media_file_path / file_name)
            else:
                shap_plot_path = str(
                    media_file_path / f"shap_plot_{hour:02}{minutes:02}.png"
                )
            plt.savefig(shap_plot_path)

        if return_figure:
            return fig
        else:
            plt.show()
            plt.close(fig)
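
A minimal usage sketch; the inputs are assumed to come from an upstream patientflow training run, and the excluded column names are hypothetical:

from pathlib import Path

# trained_models: list of TrainedClassifier from a training run (assumed)
# test_visits: DataFrame of patient snapshots (assumed)
plot_shap(
    trained_models,
    test_visits,
    exclude_from_training_data=["visit_id", "snapshot_date"],  # hypothetical columns
    media_file_path=Path("media"),
    label_col="is_admitted",
)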

survival_curve

Visualization tools for patient flow analysis using survival curves.

This module provides functions to create and analyze survival curves for time-to-event analysis.

Functions:

Name Description
plot_admission_time_survival_curve : function

Create single or multiple survival curves for ward admission times

Notes
  • The survival curves show the proportion of patients who have not yet experienced an event (e.g., admission to ward) over time
  • Time is measured in hours from the initial event (e.g., arrival)
  • A 4-hour target line is included by default to show performance against common healthcare targets
  • The curves are created without external survival analysis packages for simplicity and transparency (a minimal sketch of the calculation follows this list)
  • Multiple curves can be plotted on the same figure for comparison
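
As noted above, the curve is just an empirical survival function. For intuition, here is a minimal sketch of the underlying calculation (not the package's calculate_survival_curve implementation), using invented data:

import numpy as np
import pandas as pd

# Hypothetical visit data: two patients leaving 2 and 5 hours after arrival
df = pd.DataFrame(
    {
        "arrival_datetime": pd.to_datetime(["2031-01-01 10:00", "2031-01-01 11:00"]),
        "departure_datetime": pd.to_datetime(["2031-01-01 12:00", "2031-01-01 16:00"]),
    }
)

# Elapsed time to event in hours, sorted ascending
durations = (
    (df["departure_datetime"] - df["arrival_datetime"]).dt.total_seconds() / 3600
).sort_values()

# Proportion of patients who have not yet experienced the event after each time
n = len(durations)
survival_prob = 1 - np.arange(1, n + 1) / n  # [0.5, 0.0]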

plot_admission_time_survival_curve(df, start_time_col='arrival_datetime', end_time_col='departure_datetime', title='Time to Event Survival Curve', target_hours=[4], xlabel='Elapsed time from start', ylabel='Proportion not yet experienced event', annotation_string='{:.1%} experienced event\nwithin {:.0f} hours', labels=None, media_file_path=None, file_name=None, return_figure=False, return_df=False)

Create a survival curve for time-to-event analysis.

This function creates a survival curve showing the proportion of patients
who have not yet experienced an event over time. Can plot single or multiple
survival curves on the same plot.
Parameters:

Name Type Description Default
df DataFrame or list of DataFrame

DataFrame(s) containing patient visit data. If a list is provided, multiple survival curves will be plotted on the same figure.

required
start_time_col str

Name of the column containing the start time (e.g., arrival time).

'arrival_datetime'
end_time_col str

Name of the column containing the end time (e.g., departure or admission time).

'departure_datetime'
title str

Title for the plot.

'Time to Event Survival Curve'
target_hours list of float

List of target times in hours to show on the plot.

[4]
xlabel str

Label for the x-axis.

'Elapsed time from start'
ylabel str

Label for the y-axis.

'Proportion not yet experienced event'
annotation_string str

String template for the text annotation. Use {:.1%} for the proportion and {:.0f} for the hours. Annotations are only shown for the first curve when plotting multiple curves.

'{:.1%} experienced event\nwithin {:.0f} hours'
labels list of str

Labels for each survival curve when plotting multiple curves. If None and multiple dataframes are provided, default labels will be used. Ignored when plotting a single curve.

None
media_file_path Path

Path to save the plot. If None, the plot is not saved.

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "survival_curve.png".

None
return_figure bool

If True, returns the figure instead of displaying it.

False
return_df bool

If True, returns a DataFrame containing the survival curve data. For multiple curves, returns a list of DataFrames.

False

Returns:

Type Description
Figure or DataFrame or list or tuple or None

If return_figure is True and return_df is False, returns the figure object. If return_figure is False and return_df is True, returns the DataFrame(s) with survival curve data. If both are True, returns a tuple of (figure, DataFrame(s)). If both are False, displays the plot and returns None.

Notes

The survival curve shows the proportion of patients who have not yet experienced the event at each time point. Vertical lines are drawn at each target hour to indicate the target times, with the corresponding proportion of patients who experienced the event within these timeframes.

When plotting multiple curves, different colors are automatically assigned and a legend is displayed. Target line annotations are only shown for the first curve to avoid visual clutter.
Source code in src/patientflow/viz/survival_curve.py
def plot_admission_time_survival_curve(
    df,
    start_time_col="arrival_datetime",
    end_time_col="departure_datetime",
    title="Time to Event Survival Curve",
    target_hours=[4],
    xlabel="Elapsed time from start",
    ylabel="Proportion not yet experienced event",
    annotation_string="{:.1%} experienced event\nwithin {:.0f} hours",
    labels=None,
    media_file_path=None,
    file_name=None,
    return_figure=False,
    return_df=False,
):
    """Create a survival curve for time-to-event analysis.

    This function creates a survival curve showing the proportion of patients
    who have not yet experienced an event over time. Can plot single or multiple
    survival curves on the same plot.

    Parameters
    ----------
    df : pandas.DataFrame or list of pandas.DataFrame
        DataFrame(s) containing patient visit data. If a list is provided,
        multiple survival curves will be plotted on the same figure.
    start_time_col : str, default="arrival_datetime"
        Name of the column containing the start time (e.g., arrival time)
    end_time_col : str, default="departure_datetime"
        Name of the column containing the end time (e.g., departure or admission time)
    title : str, default="Time to Event Survival Curve"
        Title for the plot
    target_hours : list of float, default=[4]
        List of target times in hours to show on the plot
    xlabel : str, default="Elapsed time from start"
        Label for the x-axis
    ylabel : str, default="Proportion not yet experienced event"
        Label for the y-axis
    annotation_string : str, default="{:.1%} experienced event\nwithin {:.0f} hours"
        String template for the text annotation. Use {:.1%} for the proportion and {:.0f} for the hours.
        Annotations are only shown for the first curve when plotting multiple curves.
    labels : list of str, optional
        Labels for each survival curve when plotting multiple curves.
        If None and multiple dataframes are provided, default labels will be used.
        Ignored when plotting a single curve.
    media_file_path : pathlib.Path, optional
        Path to save the plot. If None, the plot is not saved.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "survival_curve.png".
    return_figure : bool, default=False
        If True, returns the figure instead of displaying it
    return_df : bool, default=False
        If True, returns a DataFrame containing the survival curve data.
        For multiple curves, returns a list of DataFrames.

    Returns
    -------
    matplotlib.figure.Figure or pandas.DataFrame or list or tuple or None
        - If return_figure is True and return_df is False: returns the figure object
        - If return_figure is False and return_df is True: returns the DataFrame(s) with survival curve data
        - If both return_figure and return_df are True: returns a tuple of (figure, DataFrame(s))
        - If both are False: returns None

    Notes
    -----
    The survival curve shows the proportion of patients who have not yet experienced
    the event at each time point. Vertical lines are drawn at each target hour
    to indicate the target times, with the corresponding proportion of patients
    who experienced the event within these timeframes.

    When plotting multiple curves, different colors are automatically assigned
    and a legend is displayed. Target line annotations are only shown for the
    first curve to avoid visual clutter.
    """
    # Handle single dataframe vs list of dataframes
    if isinstance(df, pd.DataFrame):
        dataframes = [df]
        is_single_curve = True
    else:
        dataframes = df
        is_single_curve = False

    # Handle labels
    if labels is None:
        if is_single_curve:
            curve_labels = [None]
        else:
            curve_labels = [f"Curve {i+1}" for i in range(len(dataframes))]
    else:
        curve_labels = labels

    # Validate inputs
    if len(dataframes) != len(curve_labels):
        raise ValueError("Number of dataframes must match number of labels")

    # Create the plot
    fig = plt.figure(figsize=(10, 6))

    # Define colors for multiple curves
    colors = plt.cm.Set1(np.linspace(0, 1, len(dataframes)))

    survival_dfs = []

    # Process each dataframe
    for idx, (current_df, label) in enumerate(zip(dataframes, curve_labels)):
        # Calculate survival curve using the extracted function
        survival_df = calculate_survival_curve(current_df, start_time_col, end_time_col)

        # Extract arrays for plotting
        unique_times = survival_df["time_hours"].values
        survival_prob = survival_df["survival_probability"].values

        # Store DataFrame if requested
        if return_df:
            survival_dfs.append(survival_df)

        # Plot the survival curve
        color = colors[idx] if not is_single_curve else None
        plt.step(
            unique_times,
            survival_prob,
            where="post",
            color=color,
            label=label if not is_single_curve else None,
        )

        # Plot target lines and annotations only for the first curve (or single curve)
        if idx == 0:
            # Plot target lines for each target hour
            for target_hour in target_hours:
                # Find the survival probability at target hours
                closest_time_idx = np.abs(unique_times - target_hour).argmin()
                if closest_time_idx < len(survival_prob):
                    survival_at_target = survival_prob[closest_time_idx]
                    event_at_target = 1 - survival_at_target

                    # Add text annotation to the plot (only for single curve or first curve)
                    if is_single_curve or len(dataframes) == 1:
                        plt.text(
                            target_hour + 0.5,
                            survival_at_target,
                            annotation_string.format(event_at_target, target_hour),
                            bbox=dict(facecolor="white", alpha=0.8),
                        )

                        # Draw a vertical line from x-axis to the curve at target hours
                        plt.plot(
                            [target_hour, target_hour],
                            [0, survival_at_target],
                            color="grey",
                            linestyle="--",
                            linewidth=2,
                        )

                        # Draw a horizontal line from the curve to the y-axis at the survival probability level
                        plt.plot(
                            [0, target_hour],
                            [survival_at_target, survival_at_target],
                            color="grey",
                            linestyle="--",
                            linewidth=2,
                        )

    # Configure the plot
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.grid(True, alpha=0.3)

    # Make axes meet at the origin
    plt.xlim(left=0)
    plt.ylim(bottom=0)

    # Move spines to the origin
    ax = plt.gca()
    ax.spines["left"].set_position(("data", 0))
    ax.spines["bottom"].set_position(("data", 0))

    # Hide the top and right spines
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)

    # Add legend for multiple curves
    if not is_single_curve:
        plt.legend()

    plt.tight_layout()

    if media_file_path:
        if file_name:
            plt.savefig(media_file_path / file_name, dpi=300)
        else:
            plt.savefig(media_file_path / "survival_curve.png", dpi=300)

    # Handle return values
    return_data = (
        survival_dfs[0]
        if (return_df and is_single_curve)
        else survival_dfs
        if return_df
        else None
    )

    if return_figure and return_df:
        return fig, return_data
    elif return_figure:
        return fig
    elif return_df:
        return return_data
    else:
        plt.show()
        plt.close()
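
A minimal usage sketch comparing two cohorts; the input DataFrames are hypothetical and assumed to contain the default arrival and departure columns:

# admitted_visits and discharged_visits are hypothetical DataFrames
fig, curve_dfs = plot_admission_time_survival_curve(
    [admitted_visits, discharged_visits],
    labels=["Admitted", "Discharged"],
    target_hours=[4, 12],
    return_figure=True,
    return_df=True,
)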

trial_results

Charts for hyperparameter optimisation trials.

This module provides tools to visualise the performance metrics of multiple hyperparameter tuning trials, highlighting the best trials for each metric.

Functions:

Name Description
plot_trial_results : function

Plot selected performance metrics for a list of hyperparameter trials.

plot_trial_results(trials_list, metrics=None, media_file_path=None, file_name=None, return_figure=False)

Plot selected performance metrics from hyperparameter trials as scatter plots.

This function visualizes the performance metrics of a series of hyperparameter trials. It creates scatter plots for each selected metric, with the best-performing trial highlighted and annotated with its hyperparameters.

Optionally, the plot can be saved to disk or returned as a figure object.

Parameters:

Name Type Description Default
trials_list List[HyperParameterTrial]

A list of HyperParameterTrial instances containing validation set results (not cross-validation fold results) and hyperparameter settings. Each trial's cv_results dictionary contains metrics such as 'valid_auc' and 'valid_logloss', which are computed on a held-out validation set for each hyperparameter configuration.

required
metrics List[str]

List of metric names to plot. If None, defaults to ["valid_auc", "valid_logloss"]. Each metric should be a key in the trial's cv_results dictionary.

None
media_file_path Path or None

Directory path where the generated plot image will be saved as "trial_results.png". If None, the plot is not saved.

None
file_name str

Custom filename to use when saving the plot. If not provided, defaults to "trial_results.png".

None
return_figure bool

If True, the matplotlib figure is returned instead of being displayed directly. Default is False.

False

Returns:

Type Description
Figure or None

The matplotlib figure object if return_figure is True; otherwise, None.

Notes
  • Assumes that each HyperParameterTrial in trials_list has a cv_results dictionary containing the requested metrics, which are computed on the validation set.
  • Parameters from the best-performing trials are shown in the plots.
Source code in src/patientflow/viz/trial_results.py
def plot_trial_results(
    trials_list: List[HyperParameterTrial],
    metrics: Optional[List[str]] = None,
    media_file_path=None,
    file_name=None,
    return_figure=False,
):
    """
    Plot selected performance metrics from hyperparameter trials as scatter plots.

    This function visualizes the performance metrics of a series of hyperparameter trials.
    It creates scatter plots for each selected metric, with the best-performing trial
    highlighted and annotated with its hyperparameters.

    Optionally, the plot can be saved to disk or returned as a figure object.

    Parameters
    ----------
    trials_list : List[HyperParameterTrial]
        A list of `HyperParameterTrial` instances containing validation set results
        (not cross-validation fold results) and hyperparameter settings. Each trial's
        `cv_results` dictionary contains metrics such as 'valid_auc' and 'valid_logloss',
        which are computed on a held-out validation set for each hyperparameter configuration.
    metrics : List[str], optional
        List of metric names to plot. If None, defaults to ["valid_auc", "valid_logloss"].
        Each metric should be a key in the trial's cv_results dictionary.
    media_file_path : pathlib.Path or None, optional
        Directory path where the generated plot image will be saved as "trial_results.png".
        If None, the plot is not saved.
    file_name : str, optional
        Custom filename to use when saving the plot. If not provided, defaults to "trial_results.png".
    return_figure : bool, optional
        If True, the matplotlib figure is returned instead of being displayed directly.
        Default is False.

    Returns
    -------
    matplotlib.figure.Figure or None
        The matplotlib figure object if `return_figure` is True; otherwise, None.

    Notes
    -----
    - Assumes that each `HyperParameterTrial` in `trials_list` has a `cv_results` dictionary
      containing the requested metrics, which are computed on the validation set.
    - Parameters from the best-performing trials are shown in the plots.
    """
    # Set default metrics if none provided
    if metrics is None:
        metrics = ["valid_auc", "valid_logloss"]

    # Extract metrics from trials
    metric_values = {
        metric: [trial.cv_results.get(metric, 0) for trial in trials_list]
        for metric in metrics
    }

    # Create trial indices
    trial_indices = list(range(len(trials_list)))

    # Create figure with subplots
    n_metrics = len(metrics)
    fig, axes = plt.subplots(1, n_metrics, figsize=(7 * n_metrics, 6))
    if n_metrics == 1:
        axes = [axes]

    # Plot each metric
    for idx, (metric, values) in enumerate(metric_values.items()):
        ax = axes[idx]

        # Plot metric as dots
        ax.scatter(trial_indices, values, s=50, alpha=0.7)
        ax.set_xlabel("Trial Number")
        ax.set_ylabel(metric.replace("valid_", "").upper())
        ax.set_title(metric.replace("valid_", "").replace("_", " ").title())
        ax.grid(True, linestyle="--", alpha=0.7)

        # Set x-axis to display integers
        ax.set_xticks(trial_indices)
        ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: str(int(x))))

        # Identify the best trial: lowest value for loss metrics, highest otherwise
        if "loss" in metric.lower():
            best_idx = values.index(min(values))
        else:
            best_idx = values.index(max(values))
        ax.set_ylim(bottom=0, top=max(values) * 1.1)

        # Highlight best value
        highlight_color = "green" if "loss" not in metric.lower() else "darkred"
        ax.scatter(
            [best_idx],
            [values[best_idx]],
            color=highlight_color,
            s=150,
            edgecolor="black",
            zorder=5,
        )

        # Add annotation with best parameters
        best_trial = trials_list[best_idx]
        param_text = "\n".join([f"{k}: {v}" for k, v in best_trial.parameters.items()])
        best_value = values[best_idx]
        ax.text(
            0.05,
            0.05,
            f"Best {metric.replace('valid_', '').upper()}: {best_value:.4f}\n\nParameters:\n{param_text}",
            transform=ax.transAxes,
            bbox=dict(facecolor="white", alpha=0.7),
            fontsize=9,
        )

    # Add overall title
    fig.suptitle("Hyperparameter Trial Results", fontsize=14)

    # Adjust layout
    plt.tight_layout()

    if media_file_path:
        if file_name:
            plt.savefig(media_file_path / file_name, dpi=300)
        else:
            plt.savefig(media_file_path / "trial_results.png", dpi=300)

    if return_figure:
        return fig
    else:
        plt.show()
        plt.close()
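
A minimal usage sketch; trials_list is assumed to be the output of a patientflow hyperparameter tuning loop, with each trial exposing cv_results and parameters as described above:

# trials_list: list of HyperParameterTrial (assumed to exist)
fig = plot_trial_results(
    trials_list,
    metrics=["valid_auc", "valid_logloss"],
    return_figure=True,
)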

utils

Utility functions for visualization and data formatting.

This module provides helper functions for cleaning and formatting data for visualization purposes, including filename cleaning and prediction time formatting.

Functions:

Name Description
clean_title_for_filename : function

Clean a title string to make it suitable for use in filenames

format_prediction_time : function

Format prediction time to 'HH:MM' format

clean_title_for_filename(title)

Clean a title string to make it suitable for use in filenames.

Parameters:

Name Type Description Default
title str

The title to clean.

required

Returns:

Type Description
str

The cleaned title, safe for use in filenames.

Source code in src/patientflow/viz/utils.py
def clean_title_for_filename(title):
    """Clean a title string to make it suitable for use in filenames.

    Parameters
    ----------
    title : str
        The title to clean.

    Returns
    -------
    str
        The cleaned title, safe for use in filenames.
    """
    replacements = {" ": "_", "%": "", "\n": "", ",": "", ".": ""}

    clean_title = title
    for old, new in replacements.items():
        clean_title = clean_title.replace(old, new)
    return clean_title
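
For example, applying the replacements above:

clean_title_for_filename("QQ plot, 75.5% coverage")
# 'QQ_plot_755_coverage'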

format_prediction_time(prediction_time)

Format prediction time to 'HH:MM' format.

Parameters:

Name Type Description Default
prediction_time str or tuple

Either: - A string in 'HHMM' format, possibly containing underscores - A tuple of (hour, minute)

required

Returns:

Type Description
str

Formatted time string in 'HH:MM' format.

Source code in src/patientflow/viz/utils.py
def format_prediction_time(prediction_time):
    """Format prediction time to 'HH:MM' format.

    Parameters
    ----------
    prediction_time : str or tuple
        Either:
            - A string in 'HHMM' format, possibly containing underscores
            - A tuple of (hour, minute)

    Returns
    -------
    str
        Formatted time string in 'HH:MM' format.
    """
    if isinstance(prediction_time, tuple):
        hour, minute = prediction_time
        return f"{hour:02d}:{minute:02d}"
    else:
        # Split the string by underscores and take the last element
        last_part = prediction_time.split("_")[-1]
        # Add a colon in the middle
        return f"{last_part[:2]}:{last_part[2:]}"
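
For example, both accepted input forms yield the same result:

format_prediction_time((9, 30))             # '09:30'
format_prediction_time("admissions_0930")   # '09:30'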