API Reference


All of the caproj modules outlined below were developed for use in an interactive environment, such as a Jupyter notebook.

Please note that no testing is in place for any of the modules outlined here. If you use functions or classes contained in any of these modules, expect that your mileage may vary.


This module contains functions for generating the interval metrics used in modeling for each unique capital project

Module variables:


List of column names containing info for each project’s end-state


Dictionary for mapping members of endstate_columns to new column names


List of column names containing descriptive info for each project


Dictionary for mapping members of info_columns to new column names

Module functions:

print_record_project_count(dataframe[, dataset])

Prints summary of records and unique projects in dataframe

generate_interval_data(data[, …])

Generates a project analysis dataset for the specified interval

print_interval_dict([datadict_dir, …])

Prints summary of data dictionary for the generate_interval_data output


Calculates interval change metrics for each PID and appends the dataset


df – pd.DataFrame containing joined project interval data output from the join_data_endstate() function


Copy of input pd.DataFrame with the new metrics appended as additional columns

caproj.datagen.endstate_column_rename_dict = {'Budget_Forecast': 'Budget_End', 'Change_Years': 'Final_Change_Years', 'Current_Phase': 'Phase_End', 'Date_Reported_As_Of': 'Final_Change_Date', 'Forecast_Completion': 'Schedule_End', 'PID_Index': 'Number_Changes'}

Dictionary for mapping members of endstate_columns to new column names

caproj.datagen.endstate_columns = ['Date_Reported_As_Of', 'Change_Years', 'PID', 'Current_Phase', 'Budget_Forecast', 'Forecast_Completion', 'PID_Index']

List of column names containing info for each project’s end-state
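As a small illustration of how these two module variables relate, the sketch below applies the rename mapping to the column list using plain Python. The values are copied from the listings above; the name `renamed` is only illustrative, and in practice the mapping would more likely be passed to a call such as `DataFrame.rename(columns=...)`.

```python
# Values copied from the module variable listings in this reference
endstate_columns = [
    "Date_Reported_As_Of", "Change_Years", "PID", "Current_Phase",
    "Budget_Forecast", "Forecast_Completion", "PID_Index",
]
endstate_column_rename_dict = {
    "Budget_Forecast": "Budget_End",
    "Change_Years": "Final_Change_Years",
    "Current_Phase": "Phase_End",
    "Date_Reported_As_Of": "Final_Change_Date",
    "Forecast_Completion": "Schedule_End",
    "PID_Index": "Number_Changes",
}

# Columns missing from the mapping (here only "PID") keep their names
renamed = [endstate_column_rename_dict.get(col, col) for col in endstate_columns]
print(renamed)
```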


Ensures datetime columns are formatted correctly and changes are sorted


df – pd.DataFrame of the cleaned capital projects change records data


Original pd.DataFrame with datetime columns formatted and records sorted

caproj.datagen.extract_project_details(df, copy_columns=['PID', 'Project_Name', 'Description', 'Category', 'Borough', 'Managing_Agency', 'Client_Agency', 'Current_Phase', 'Current_Project_Years', 'Current_Project_Year', 'Design_Start', 'Original_Budget', 'Original_Schedule'], column_rename_dict={'Current_Phase': 'Phase_Start', 'Original_Budget': 'Budget_Start', 'Original_Schedule': 'Schedule_Start'}, use_record=0, record_index='PID_Index')

Generates a dataframe with project details for each unique PID

  • df (pd.DataFrame) – The cleaned capital projects change records data

  • copy_columns (list, optional) – list of the names of columns that should be copied containing primary information about each project, defaults to info_columns

  • column_rename_dict (dict, optional) – dict of column name mappings to rename copied columns, defaults to info_column_rename_dict

  • use_record (int, optional) – integer record_index value to use as the basis for the resulting project info, defaults to 0 (indicating that the first chronological record for each project will be used)

  • record_index (str, optional) – indicates the column name to use for the record_index referenced by use_record, defaults to “PID_Index”


dataframe containing the primary project details for each unique PID, and the PID is set as the index
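The behavior described above can be approximated with a few pandas calls. This is a minimal sketch under stated assumptions, not the library's implementation: the `records` dataframe is made-up sample data, and only a subset of the default `copy_columns` is used.

```python
import pandas as pd

# Hypothetical change records: two records for PID 1, one for PID 2
records = pd.DataFrame({
    "PID": [1, 1, 2],
    "PID_Index": [0, 1, 0],
    "Current_Phase": ["Design", "Construction", "Design"],
    "Original_Budget": [100.0, 100.0, 250.0],
})

copy_columns = ["PID", "Current_Phase", "Original_Budget"]
column_rename_dict = {
    "Current_Phase": "Phase_Start",
    "Original_Budget": "Budget_Start",
}
use_record = 0
record_index = "PID_Index"

# Keep the record whose record_index equals use_record for each project,
# copy the requested columns, rename them, and index the result by PID
selected = records.loc[records[record_index] == use_record, copy_columns]
details = selected.rename(columns=column_rename_dict).set_index("PID")
print(details)
```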

Return type


caproj.datagen.find_max_record_indices(df, record_index='PID_Index')

Creates a list of Record_ID values of the max record ID for each PID

  • df – pd.DataFrame containing the cleaned capital project change records

  • record_index – string name of column containing PID ordinal indices (default record_index=’PID_Index’)


list of max Record_ID values for each PID
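One way to obtain such a list with pandas is a grouped `idxmax`. The sketch below assumes the dataframe is indexed by Record_ID (the sample labels such as "1-0" are invented for illustration) and may differ from how the function is actually written.

```python
import pandas as pd

# Hypothetical change records, indexed by a Record_ID-style label
records = pd.DataFrame(
    {"PID": [1, 1, 2, 2, 2], "PID_Index": [0, 1, 0, 1, 2]},
    index=pd.Index(["1-0", "1-1", "2-0", "2-1", "2-2"], name="Record_ID"),
)

# idxmax returns the index label of the row holding each group's maximum,
# which here is the Record_ID of the last change record for each PID
max_record_ids = records.groupby("PID")["PID_Index"].idxmax().tolist()
print(max_record_ids)
```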

caproj.datagen.generate_interval_data(data, change_year_interval=None, inclusive_stop=True, to_csv=False, save_dir='../data/interim/', custom_filename=None, verbose=1, return_df=True)

Generates a project analysis dataset for the specified interval


If you specify to_csv=True, the default behavior will be to save the resulting dataframe as:


or if change_year_interval=None:


The save_dir and custom_filename arguments allow you to change this to_csv behavior, however using them is not recommended for the sake of file naming consistency in this project.

  • data – pd.DataFrame of the cleaned capital projects change records data

  • change_year_interval – integer or None representing the maximum year from which to include changes for each project, if None, then all years’ worth of changes included (default change_year_interval=None)

  • inclusive_stop – boolean, indicating whether projects to be included in the subset dataframe need to be older than the change_year_interval year or can be equal-to-or-older-than the change_year_interval year. If True, >= is used for subsetting, if False > is used (default inclusive_stop=True)

  • to_csv – boolean, indicating whether or not the resulting dataframe should be saved to disk (default to_csv=False)

  • save_dir – string, indicating the directory in which the resulting .csv file is saved when to_csv=True (default save_dir=’../data/interim/’)

  • custom_filename – string or None, indicating whether to name the resulting .csv file something other than the name ‘NYC_capital_projects_{interval}yr.csv’ (default custom_filename=None)

  • verbose – integer, default verbose=1 prints the number of projects remaining in the resulting dataframe, otherwise that information is not printed

  • return_df – boolean, determines whether the resulting pd.DataFrame object is returned (default return_df=True)


pd.DataFrame containing the summary change data for each unique project matching the specified change_year_interval

caproj.datagen.info_column_rename_dict = {'Current_Phase': 'Phase_Start', 'Original_Budget': 'Budget_Start', 'Original_Schedule': 'Schedule_Start'}

Dictionary for mapping members of info_columns to new column names

caproj.datagen.info_columns = ['PID', 'Project_Name', 'Description', 'Category', 'Borough', 'Managing_Agency', 'Client_Agency', 'Current_Phase', 'Current_Project_Years', 'Current_Project_Year', 'Design_Start', 'Original_Budget', 'Original_Schedule']

List of column names containing descriptive info for each project

caproj.datagen.join_data_endstate(df_details, df_endstate, how='inner')

Creates dataframe joining the df_details and df_endstate dataframes by PID

  • df_details – pd.DataFrame output from the extract_project_details() function

  • df_endstate – pd.DataFrame output from the project_interval_endstate() function

  • how – string passed to the pd.merge method indicating the type of join to perform (default how=’inner’)


pd.DataFrame containing the join results, the index is reset
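A join of this kind can be sketched with `DataFrame.merge`. The two small dataframes below are invented stand-ins for the outputs of extract_project_details() and project_interval_endstate(), both assumed to be indexed by PID as their descriptions state.

```python
import pandas as pd

# Hypothetical stand-ins for the two PID-indexed input dataframes
df_details = pd.DataFrame(
    {"Budget_Start": [100.0, 250.0, 75.0]},
    index=pd.Index([1, 2, 3], name="PID"),
)
df_endstate = pd.DataFrame(
    {"Budget_End": [120.0, 240.0]},
    index=pd.Index([1, 2], name="PID"),
)

# Move PID back into a column on both sides, then join on it;
# with how="inner", PID 3 is dropped because it has no end-state row
joined = df_details.reset_index().merge(
    df_endstate.reset_index(), on="PID", how="inner"
)
print(joined)
```

Passing how="left" instead would keep PID 3, with a missing value in the Budget_End column.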

caproj.datagen.print_interval_dict(datadict_dir='../references/data_dicts/', datadict_filename='data_dict_interval.csv')

Prints summary of data dictionary for the generate_interval_data output

  • datadict_dir – optional string indicating directory location of target data dictionary (default ../references/data_dicts/)

  • datadict_filename – optional string indicating filename of target data dict (default data_dict_interval.csv)


No objects are returned, printed output only

caproj.datagen.print_record_project_count(dataframe, dataset='full')

Prints summary of records and unique projects in dataframe

  • dataframe – pd.DataFrame object for the version of the NYC capital projects data you wish to summarize

  • dataset – string, accepts ‘full’, ‘all’, ‘training’, or ‘test’ (default ‘full’)


prints to standard output, no objects returned
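The two quantities being summarized can be computed directly with pandas, as in the sketch below. The sample data and the wording of the printed lines are assumptions; the function's actual output format is not documented here.

```python
import pandas as pd

# Hypothetical change records: six records across three projects
dataframe = pd.DataFrame({"PID": [1, 1, 2, 3, 3, 3]})

n_records = len(dataframe)                # one row per change record
n_projects = dataframe["PID"].nunique()   # one PID per unique project

print(f"Number of records: {n_records}")
print(f"Number of unique projects: {n_projects}")
```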

caproj.datagen.project_interval_endstate(df, keep_columns=['Date_Reported_As_Of', 'Change_Years', 'PID', 'Current_Phase', 'Budget_Forecast', 'Forecast_Completion', 'PID_Index'], column_rename_dict={'Budget_Forecast': 'Budget_End', 'Change_Years': 'Final_Change_Years', 'Current_Phase': 'Phase_End', 'Date_Reported_As_Of': 'Final_Change_Date', 'Forecast_Completion': 'Schedule_End', 'PID_Index': 'Number_Changes'}, change_year_interval=None, record_index='PID_Index', change_col='Change_Year', project_age_col='Current_Project_Year', inclusive_stop=True)

Generates a dataframe of endstate data for each unique PID given the specified analysis interval

  • df – pd.DataFrame of the cleaned capital projects change records data

  • keep_columns – list of column names for columns that should be kept as part of the resulting dataframe (default keep_columns=endstate_columns module variable)

  • column_rename_dict – dict mapping existing column names to the new names to which they should be named (default column_rename_dict=endstate_column_rename_dict module variable)

  • change_year_interval – integer or None representing the maximum year from which to include changes for each project, if None, then all years’ worth of changes included (default change_year_interval=None)

  • record_index – string name of column containing PID ordinal indices (default record_index=’PID_Index’)

  • change_col – string, name of column containing change year indicators (default change_col=’Change_Year’)

  • project_age_col – string, name of column containing current age of each project at the time the dataset was compiled (default project_age_col=’Current_Project_Year’)

  • inclusive_stop – boolean, indicating whether projects to be included in the subset dataframe need to be older than the change_year_interval year or can be equal-to-or-older-than the change_year_interval year. If True, >= is used for subsetting, if False > is used (default inclusive_stop=True)


pd.DataFrame containing endstate data for each unique project, the index is set to the PID

caproj.datagen.subset_project_changes(df, change_year_interval=3, change_col='Change_Year', project_age_col='Current_Project_Year', inclusive_stop=True)

Generates a subsetted dataframe with only the change records that occur in or before the specified max interval year

  • df – pd.DataFrame of the cleaned capital projects change records data

  • change_year_interval – integer representing the maximum year from which to include changes for each project (default change_year_interval=3)

  • change_col – string, name of column containing change year indicators (default change_col=’Change_Year’)

  • project_age_col – string, name of column containing current age of each project at the time the dataset was compiled (default project_age_col=’Current_Project_Year’)

  • inclusive_stop – boolean, indicating whether projects to be included in the subset dataframe need to be older than the change_year_interval year or can be equal-to-or-older-than the change_year_interval year. If True, >= is used for subsetting, if False > is used (default inclusive_stop=True)


pd.DataFrame of the subsetted data, the index is set to each record’s ‘Record_ID’ value
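The effect of inclusive_stop is easiest to see on a small example. The sketch below is one reading of the parameter descriptions above, not the function's actual code: a project is kept only if its age meets the interval (>= or >), and only change records at or before the interval year are retained. The helper `subset_sketch` and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical change records for three projects of ages 5, 3, and 2 years
records = pd.DataFrame({
    "PID": [1, 1, 1, 2, 2, 3],
    "Change_Year": [1, 2, 4, 1, 3, 1],
    "Current_Project_Year": [5, 5, 5, 3, 3, 2],
})


def subset_sketch(df, change_year_interval=3, inclusive_stop=True):
    """Illustrative stand-in for the subsetting logic described above."""
    age = df["Current_Project_Year"]
    # inclusive_stop=True uses >=, inclusive_stop=False uses >
    if inclusive_stop:
        old_enough = age >= change_year_interval
    else:
        old_enough = age > change_year_interval
    # Keep only changes made in or before the interval year
    in_window = df["Change_Year"] <= change_year_interval
    return df[old_enough & in_window]


inclusive = subset_sketch(records, inclusive_stop=True)
exclusive = subset_sketch(records, inclusive_stop=False)
print(inclusive["PID"].tolist())
print(exclusive["PID"].tolist())
```

Under this reading, the 3-year-old project (PID 2) is included only when inclusive_stop=True, and the year-4 change for PID 1 is excluded in both cases.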


This module contains functions for scaling features of an X features design matrix and for encoding categorical variables

Module functions:

encode_categories(data, colname[, one_hot, …])

Encodes categorical variable column and appends values to dataframe

scale_features(train_df, val_df[, …])

Scales val_df features based on train_df and returns scaled dataframe


Efficient numpy sigmoid transformation of dataframe, array, or matrix


Adds 1 to input data and then applies Log transformation to those values

caproj.scale.encode_categories(data, colname, one_hot=True, drop_cat=None, cat_list=None, drop_original_col=False, append_colname=None)

Encodes categorical variable column and appends values to dataframe

This function offers the option to either one-hot-encode (0,1) or LabelEncode (as consecutive integers (0, n)) categorical values by setting one_hot to either True or False.

  • data – The pd.dataframe object containing the column you wish to encode

  • colname – string indicating name of column you wish to encode

  • one_hot – boolean indicating whether you wish to one-hot-encode the categories. If False, the values are simply encoded to a set of consecutive integers. (default one_hot=True)

  • drop_cat – None or category value you wish to drop from your one-hot-encoded variable columns. If None and one_hot=True, no variable columns are dropped. If one_hot=False, any category value passed to drop_cat will ensure that value is sorted to the last place position in the resulting encoded integer values (default drop_cat=None)

  • cat_list – None or list specifying the full set of category values contained in your target column. The benefit of providing your own list is that it allows you to provide a custom ordering of categories to the encoder. If None, the categories will default to alphabetical order. (default cat_list=None)

  • drop_original_col – Boolean indicating whether the original category column specified by colname will be dropped from the resulting dataframe

  • append_colname – None or string, indicating what should be appended to one hot encoded column names. This is useful in instances where multiple columns have identical category names within them or, a category name matches an existing column. None will result in no string being added. (default append_colname=None)


pd.DataFrame of the original input dataframe with the additional encoded category column(s) appended to it.
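The difference between the two encoding modes can be sketched with plain pandas. This is an illustration of the idea only: the function's internals are not shown in this reference, and the sample column and category list are invented.

```python
import pandas as pd

data = pd.DataFrame({"Borough": ["Queens", "Bronx", "Queens", "Brooklyn"]})

# Explicit category order, playing the role of the cat_list argument
cat_list = ["Bronx", "Brooklyn", "Queens"]
categorical = pd.Series(
    pd.Categorical(data["Borough"], categories=cat_list), index=data.index
)

# Analogue of one_hot=True: one 0/1 column per category
one_hot = pd.get_dummies(categorical).astype(int)
one_hot.columns = one_hot.columns.astype(str)

# Analogue of one_hot=False: consecutive integers following cat_list order
label_codes = categorical.cat.codes

# Append the encoded columns to the original dataframe
encoded = pd.concat([data, one_hot], axis=1)
print(encoded)
print(label_codes.tolist())
```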


Adds 1 to input data and then applies Log transformation to those values


x – data to undergo transformation (datatypes accepted include, pandas DataFrames and Series, numpy matrices and arrays, or single int or float values x)


The transformed dataframe, series, array, or value depending on the type of original input x object
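The transformation described is log(x + 1). A minimal numpy sketch, assuming the natural logarithm is intended, is shown below; `np.log1p` computes the same quantity.

```python
import numpy as np

values = np.array([0.0, 1.0, 9.0])

# Add 1, then take the natural log; zero inputs map to zero
transformed = np.log(values + 1)
print(transformed)

# np.log1p is numpy's built-in equivalent of the same transformation
print(np.allclose(transformed, np.log1p(values)))
```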

caproj.scale.scale_features(train_df, val_df, exclude_scale_cols=[], scaler=<class 'sklearn.preprocessing._data.RobustScaler'>, scale_before_func=None, scale_after_func=None, reapply_scaler=False, **kwargs)

Scales val_df features based on train_df and returns scaled dataframe

Accepts various sklearn scalers and allows you to specify features you do not want affected by scaling by using the exclude_scale_cols parameter.


Be certain to reset the index of your accompanying y_train and y_test dataframes, or you will risk running into potential indexing errors while working with your scaled X dataframes

  • train_df – The training data

  • val_df – Your test/validation data

  • exclude_scale_cols – Optional list containing names of columns we do not wish to scale, default=[]

  • scaler – The sklearn scaler method used to fit the data (e.g. StandardScaler, MinMaxScaler, RobustScaler, etc.), default=RobustScaler

  • scale_before_func – Optional function (e.g. np.log, a sigmoid function, or a custom function) to be applied to train and val dfs prior to the scaler fitting and scaling val_df, default=None

  • scale_after_func – Optional function (e.g. np.log, a sigmoid function, or a custom function) to be applied to val_df after the scaler has scaled the dataframe

  • reapply_scaler – Boolean, if set to True, the scaler is fitted a second time after the scale_after_func is applied (useful if using MinMaxScaler and you wish to maintain a 0 to 1 scale after applying a secondary transformation to the data), default is reapply_scaler=False

  • kwargs – Any additional arguments are passed as parameters to the selected scaler (for instance feature_range=(-1,1) would be an appropriate argument if scaler is set to MinMaxScaler)


a feature-scaled version of the val_df dataframe, and a list of fitted sklearn scaler objects that were used to scale values (for later use in case original values need to be restored), list will either be of length 1 or 2 depending on whether reapply_scaler was set to True
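The core pattern, fitting the scaler on the training data only and then transforming the validation data while leaving excluded columns untouched, can be sketched as follows. The sample dataframes and column names are invented, and the scale_before_func, scale_after_func, and reapply_scaler options are not reproduced here.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical training and validation data
train_df = pd.DataFrame({
    "Budget_Start": [1.0, 2.0, 3.0, 4.0, 5.0],
    "is_bridge": [0, 1, 0, 1, 0],
})
val_df = pd.DataFrame({"Budget_Start": [3.0, 7.0], "is_bridge": [1, 0]})
exclude_scale_cols = ["is_bridge"]

# Only columns not listed in exclude_scale_cols are scaled
scale_cols = [c for c in train_df.columns if c not in exclude_scale_cols]

# Fit on the training data, then apply the fitted scaler to val_df
scaler = RobustScaler().fit(train_df[scale_cols])
val_scaled = val_df.copy()
val_scaled[scale_cols] = scaler.transform(val_df[scale_cols])
print(val_scaled)
```

Keeping the fitted `scaler` object around makes it possible to call its `inverse_transform` method later if the original values need to be restored.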


Efficient numpy sigmoid transformation of dataframe, array, or matrix


x – data to undergo transformation (datatypes accepted include, pandas DataFrames and Series, numpy matrices and arrays, or single int or float values x)


The transformed dataframe, series, array, or value depending on the type of original input x object
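For reference, the standard logistic sigmoid is 1 / (1 + exp(-x)). The sketch below uses a hypothetical helper name, `sigmoid_sketch`, and relies on numpy broadcasting so the same expression works for scalars and arrays.

```python
import numpy as np


def sigmoid_sketch(x):
    """Logistic sigmoid; a stand-in for the function described above."""
    return 1 / (1 + np.exp(-x))


values = np.array([-2.0, 0.0, 2.0])
transformed = sigmoid_sketch(values)
print(transformed)
```

Outputs always fall strictly between 0 and 1, and an input of 0 maps to 0.5.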


This module contains functions for generating fitted models and summarizing the results

Module functions:

generate_model_dict(model, model_descr, …)

Fits the specified model type and generates a dictionary of results

print_model_results(model_dict[, score])

Prints a model results summary from the model dictionary generated using the generate_model_dict() function

caproj.model.generate_model_dict(model, model_descr, X_train, X_test, y_train, y_test, multioutput=True, verbose=False, predictions=True, scores=True, model_api='sklearn', sm_formulas=None, y_stored=True, **kwargs)

Fits the specified model type and generates a dictionary of results

This function works for fitting and generating predictions for sklearn, keras, and statsmodels models. PyGam models typically also work by specifying the ‘sklearn’ model_api. For statsmodels models, only those that depend on the statsmodels.formula.api work.

The returned output dictionary follows this structure:

    'description': model_descr_string
    'model': fitted model object
    'y_variables': [y1_varname_string, y2_varname_string]
    'formulas': [y1_formula_string, y2_formula_string]
                empty list if statsmodel api is not used
    'y_values': {
        'train': y_train array,
        'test': y_test array,
    }
    'predictions': {
        'train': train_predictions array,
        'test': test_predictions array,
    }
    'score': {
        'train': training r2_score array,
        'test': test r2_score array
    }

  • model – the uninitialized sklearn, pygam, or statsmodels regression model object, or a previously compiled keras model

  • model_descr – a brief string describing the model (cannot exceed 80 characters)

  • X_train, X_test, y_train, y_test – the datasets on which to fit and evaluate the model

  • multioutput – Boolean, if True and sklearn model_api, will attempt fitting a single multioutput model, if False or ‘statsmodel’ model_api fits separate models for each output

  • verbose – if True, prints resulting fitted model object (default=False)

  • predictions – if True the dict stores model.predict() predictions for both the X_train and X_test input dataframes

  • scores – if True, metrics scores are calculated and stored in the resulting dict for both the train and test predictions

  • model_api – specifies the api-type required for the input model, options include ‘sklearn’, ‘keras’, or ‘statsmodels’ (default=’sklearn’)

  • sm_formulas – list of statsmodels formulas defining model for each output y (include only the right-hand-side predictor variables, such as x1 + x2 + x3 instead of y ~ x1 + x2 + x3), default is None

  • y_stored – boolean, determines whether the true y values are stored in the resulting dictionary. It is convenient to keep these stored alongside the predictions for easier evaluation later (default is y_stored=True)

  • kwargs – optional arguments that pass directly to the model object at time of initialization, or in the case of the ‘keras’ model api, they pass to the keras model’s fit() method


returns a dictionary object containing the resulting fitted model object, resulting predictions, and train and test scores (if specified as True)
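To make the nesting of the returned dictionary concrete, the sketch below builds a hand-written dictionary with the keys listed in the structure above and reads values back out. All values are placeholders invented for illustration; nothing here is produced by a fitted model.

```python
# Placeholder dictionary mirroring the documented key structure
model_dict = {
    "description": "example two-output linear model",
    "model": None,  # a fitted model object would be stored here
    "y_variables": ["y1", "y2"],
    "formulas": [],  # empty because the statsmodels api is not used
    "y_values": {
        "train": [[1.0, 2.0], [3.0, 4.0]],
        "test": [[5.0, 6.0]],
    },
    "predictions": {
        "train": [[1.1, 1.9], [2.9, 4.2]],
        "test": [[4.8, 6.3]],
    },
    "score": {
        "train": [0.95, 0.90],  # one score per output variable
        "test": [0.80, 0.75],
    },
}

# Look up the test score for each output variable by name
test_scores = dict(zip(model_dict["y_variables"], model_dict["score"]["test"]))
print(test_scores)
```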

caproj.model.print_model_results(model_dict, score='both')

Prints a model results summary from the model dictionary generated using the generate_model_dict() function

  • model_dict – dict, output dictionary from the generate_model_dict() function

  • score – None, ‘both’, ‘test’, or ‘train’ parameters accepted, identifies which results to print for this particular metric (default ‘both’)


nothing is returned, this function just prints summary output


This module contains functions for visualizing data and model results

Module functions:

plot_value_counts(value_counts[, figsize, color])

Generates barplot from pandas value_counts series

plot_barplot(value_counts, title[, height, …])

Generates a horizontal barplot from a pandas value_counts series

plot_hist_comps(df, metric_1, metric_2[, …])

Plots side-by-side histograms for comparison with log yscale option

plot_line(x_vals, y_vals, title, x_label, …)

Generates line plot given input x, y values

plot_2d_embed_scatter(data1, data2, title, …)

Plots 2D scatterplot of dimension-reduced embeddings for train and test

plot_true_pred([model_dict, dataset, …])

Plots model prediction results directly from model_dict or input arrays

plot_bdgt_sched_scaled(X, X_scaled, scale_descr)

Plots original vs scaled versions of budget and schedule input data

plot_change_trend(trend_data, pid_data, pid)

Plots 4 subplots showing project budget and duration forecast change trend

plot_gam_by_predictor(model_dict, …[, …])

Calculates and plots the partial dependence and 95% CIs for a GAM model

plot_coefficients(model_dict[, subplots, …])

Plots coefficients from statsmodels linear regression model


Loads an image from file, converts it to np.array and returns the array

plot_jpg(filepath, title[, figsize])

Plots a jpeg image from file


Loads an image from file, converts it to np.array and returns the array


filepath (str) – path to image file


numpy representation of image

Return type


caproj.visualize.plot_2d_embed_scatter(data1, data2, title, xlabel, ylabel, data1_name='training obs', data2_name='TEST obs', height=5, point_size=None)

Plots 2D scatterplot of dimension-reduced embeddings for train and test

2D matplotlib scatterplot, no objects are returned.

NOTE: This function assumes the data inputs are 2D np.array objects of shape (n, 2), and that two separate sets of encoded embeddings are going to be plotted together (i.e. the train and the test observations). 2D pd.DataFrame objects can be passed, and are converted to np.array within the plotting function.

  • data1 – np.array 2D containing 2 encoded dimensions

  • data2 – a second np.array 2D containing 2 encoded dimensions

  • title – str, text used for plot title

  • xlabel – string representing the label for the x axis

  • ylabel – string representing the label for the y axis

  • data1_name – string representing the name of the first dataset, this will be the label given to those points in the plot’s legend (default ‘training obs’)

  • data2_name – string representing the name of the second dataset, this will be the label given to those points in the plot’s legend (default ‘TEST obs’)

  • height – integer that determines the height of the plot (default is 5)

  • point_size – integer or None, default of None will revert to matplotlib scatter default, integer entered will override the default marker size

caproj.visualize.plot_barplot(value_counts, title, height=6, varname=None, color='k', label_space=0.01)

Generates a horizontal barplot from a pandas value_counts series

  • value_counts – pd.Series object generated by pandas value_counts() method

  • title – string, the printed title of the plot

  • height – integer, the desired height of the plot (default is 6)

  • varname – string or None, text to print for plot’s y-axis title

  • color – string, the matplotlib color name for the color you would like for the plotted bars (default is ‘k’ or black)

  • label_space – float, a coefficient used to space the count label an appropriate distance from the plotted bar (default is 0.01)


a matplotlib plot. No objects are returned

caproj.visualize.plot_bdgt_sched_scaled(X, X_scaled, scale_descr, X_test=None, X_test_scaled=None, bdgt_col='Budget_Start', sched_col='Duration_Start')

Plots original vs scaled versions of budget and schedule input data

Generates 1x2 subplotted scatterplots, no objects returned

  • X – Dataframe or 2D array with original budget and schedule train data

  • X_scaled – Dataframe or 2D array with scaled budget and schedule train data

  • scale_descr – Short string description of scaling transformation used to title scaled data plot (e.g. ‘Sigmoid Standardized’)

  • X_test – Optional, Dataframe or 2D array with original test data, which will plot test data as overlay with training data (default is X_test=None, which does not plot any overlay)

  • X_test_scaled – Optional, Dataframe or 2D array with scaled test data, which plots overlay similar to X_test (default is X_test_scaled=None)

  • bdgt_col – string name of budget values column for input dataframes (default bdgt_col=’Budget_Start’)

  • sched_col – string name of schedule values column for input dataframes (default sched_col=’Duration_Start’)

caproj.visualize.plot_change_trend(trend_data, pid_data, pid, interval=None)

Plots 4 subplots showing project budget and duration forecast change trend

Generates image of 4 subplots, no objects are returned.

  • trend_data – pd.DataFrame, the cleaned dataset of all project change records (i.e. ‘Capital_Projects_clean.csv’ dataframe)

  • pid_data – pd.DataFrame, the prediction_interval dataframe produced using this project’s data generator function (i.e. ‘NYC_Capital_Projects_3yr.csv’ dataframe)

  • pid – integer, the PID for the project you wish to plot

  • interval – integer or None, indicating the max Change_Year you wish to plot, if None all change records are plotted for the specified pid (default, interval=None)

caproj.visualize.plot_coefficients(model_dict, subplots=(1, 2), fig_height=8, suptitle_spacing=1)

Plots coefficients from statsmodels linear regression model

Generates a plotted series of subplots illustrating estimated coefficients and 95% CIs. No objects are returned

  • model_dict (dict) – model dictionary object from the generate_model_dict() function, containing fitted Statsmodels linear regression model objects (NOTE: this function is compatible with statsmodels models only)

  • subplots (tuple) – to plot each of the 2 predicted y variables, provides the dimension of subplots for the figure (NOTE: currently this function is only configured to plot 2 columns of subplots, therefore no value other than two is accepted for the subplots width dimension), defaults to (1, 2)

  • fig_height (int or float) – this value is passed directly to the figsize parameter of plt.subplots() and determines the overall height of your plot, defaults to 8

  • suptitle_spacing (float) – this value is passed to the ‘y’ parameter for plt.suptitle(), defaults to 1.10

caproj.visualize.plot_gam_by_predictor(model_dict, model_index, X_data, y_data, dataset='train', suptitle_y=1)

Calculates and plots the partial dependence and 95% CIs for a GAM model

Plots a set of subplots for each predictor contained in your X data. No objects are returned.

  • model_dict – model dictionary containing the fitted PyGAM models you wish to plot

  • model_index – integer indicating the index of the model stored in your model_dict that you wish to plot

  • X_data – pd.DataFrame containing the matching predictor set you wish to plot beneath your predictor contribution lines

  • y_data – pd.DataFrame containing the matching outcome set you wish to plot beneath your predictor contribution lines

  • dataset – string, ‘train’ or ‘test’ indicating the type of X and y data you have entered for the X_data and y_data arguments (default=’train’)

  • suptitle_y – float > 1.00 indicating the spacing required to prevent your plot from overlapping your title text (default=1.04)

caproj.visualize.plot_hist_comps(df, metric_1, metric_2, y_log=False, bins=20)

Plots side-by-side histograms for comparison with log yscale option

Plots 2 subplots, no objects are returned

  • df – pd.DataFrame object containing the data you wish to plot

  • metric_1 – string, name of column containing data for the first plot

  • metric_2 – string, name of column containing data for second plot

  • y_log – boolean, indicating whether the y-axis should be plotted with a log scale (default False)

  • bins – integer, the number of bins to use for the histogram (default 20)

caproj.visualize.plot_jpg(filepath, title, figsize=(16, 12))

Plots a jpeg image from file

  • filepath (str) – path to file for plotting

  • title (str) – plot title text

  • figsize (tuple) – dimensions of resulting plot, defaults to (16, 12)

caproj.visualize.plot_line(x_vals, y_vals, title, x_label, y_label, height=3.5)

Generates line plot given input x, y values

caproj.visualize.plot_true_pred(model_dict=None, dataset='train', y_true=None, y_pred=None, model_descr=None, y1_label=None, y2_label=None)

Plots model prediction results directly from model_dict or input arrays

Generates 5 subplots, (1) true values with predicted values overlay, each y variable on its own axis, (2) output variable 1 true vs. predicted on each axis,(3) output variable 2 true vs. predicted on each axis (4) output variable 1 true vs. residuals, (5) output variable 2 true vs. residuals (no objects are returned)

This plotting function only requires that a model_dict from the generate_model_dict() function be used as input. However, through use of the y_true, y_pred, model_descr, and y1 and y2 label parameters, predictions stored in a shape (n,2) array can be plotted directly without the use of a model_dict.

NOTE: This plotting function requires y to consist of 2 output variables. Therefore, it will not work with y data not of shape=(n, 2).

  • model_dict – dictionary or None, if model results from the generate_model_dict func is used, function defaults to data from that dict for plot, if None plot expects y_true, y_pred, model_descr, and y1/y2 label inputs for plotting

  • dataset – string, ‘train’ or ‘test’, indicates whether to plot training or test results if using model_dict as data source, and labels plots accordingly if y_pred and y_true inputs are used (default is ‘train’)

  • y_true, y_pred – None or pd.DataFrame and np.array shape=(n,2) data sources accepted and used for plotting if model_dict=None (default for both is None)

  • model_descr – None or string of max length 80 used to describe model in title. If None, model_descr defaults to description in model_dict, if string is entered, that string overrides the description in model_dict, if using y_true/y_pred as data source model_descr must be specified as a string (default is None)

  • y1_label, y2_label – None or string of max length 40 used to describe the 2 output y variables being plotted. These values appear along the plot axes and in the titles of subplots. If None, the y_variables names from the model_dict are used. If strings are entered, those strings are used to override the model_dict values. If using y_true/y_pred as data source, these values must be specified (default is None for both labels)

caproj.visualize.plot_value_counts(value_counts, figsize=(9, 3), color='tab:blue')

Generates barplot from pandas value_counts series

  • value_counts (DataFrame) – pandas DataFrame generated using the pandas value_counts method

  • figsize (tuple, optional) – dimensions of resulting plot, defaults to (9, 3)

  • color (str, optional) – color of resulting plotted bars, defaults to “tab:blue”


This module contains functions for visualizing data and model results

Module classes:

UMAP_embedder(scaler, final_cols, …)

Class methods for generating UMAP embedding and HDBSCAN clusters

Module functions:

silplot(X, cluster_labels, clusterer[, …])

Generates silhouette subplot of kmeans clusters alongside PCA n=2

display_gapstat_with_errbars(gap_df[, height])

Generates plots of gap stats with error bars for each number of clusters

fit_neighbors(data, min_samples)

Fits n nearest neighbors based on min samples and returns distances

plot_epsilon(distances, min_samples[, height])

Plot epsilon by index sorted by increasing distance

silscore_dbscan(data, labels, clustered_bool)

Generates silhouette score omitting observations not assigned to any cluster by DBSCAN

fit_dbscan(data, min_samples, eps)

Fits dbscan and returns dictionary of results including model, labels, indices


Prints summary results of fitted DBSCAN results dictionary

plot_dendrogram(linkage_data, method_name[, …])

Plots a dendrogram given a set of input hierarchy linkage data

plot_cluster_hist(data, title, metric[, …])

Requires melted dataframe as input and plots histograms by cluster

plot_umap_scatter(x, y, color, title, scale_var)

Plots scatterplot with color scale

plot_category_scatter(data, x_col, y_col, …)

Plots scatterplot with category colors

class caproj.cluster.UMAP_embedder(scaler, final_cols, mapper_dict, clusterer, bert_embedding)

Class methods for generating UMAP embedding and HDBSCAN clusters


Returns HDBSCAN cluster labels

get_full_df(df, dimensions='all')

Returns UMAP full dataframe

get_mapping_attributes(df, return_extra=False, dimensions='all')
If return_extra=True, returns 3 objects:
  1. mapping

  2. columns needed to be added to harmonize with entire data

  3. dummified df before adding columns of [1]

get_mapping_description(df, dimensions='all')

Returns UMAP final dataframe

caproj.cluster.display_gapstat_with_errbars(gap_df, height=4)

Generates plots of gap stats with error bars for each number of clusters

  • gap_df (DataFrame) – dataframe attribute of a fitted gap_statistic.OptimalK object for plotting (i.e. OptimalK.gap_df)

  • height (int, optional) – height of the resulting plot, defaults to 4

caproj.cluster.fit_dbscan(data, min_samples, eps)

Fits dbscan and returns dictionary of results including model, labels, indices

  • data (array-like) – original data to be fitted using sklearn.cluster.DBSCAN

  • min_samples (int) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.

  • eps (int or float) – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.


dictionary of results and important characteristics of the fitted DBSCAN algorithm (see NOTE below)

Return type


NOTE: Dictionary returned includes the following items:

    "model": DBSCAN(eps=eps, min_samples=min_samples).fit(data),
    "n_clusters": sum([i != -1 for i in set(model.labels_)]),
    "labels": model.labels_,
    "core_sample_indices": model.core_sample_indices_,
    "clustered_bool": [i != -1 for i in labels],
    "cluster_counts": pd.Series(labels).value_counts(),
    "sil_score": silscore_dbscan(data, labels, clustered_bool),
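
The label-derived items in the dictionary above can be illustrated without fitting a model. The following is a minimal sketch, assuming that noise points carry the label -1 (as in sklearn.cluster.DBSCAN); the helper name summarize_labels is hypothetical, and a collections.Counter stands in for the pandas value_counts series.

```python
from collections import Counter


def summarize_labels(labels):
    """Derive the label-based summary items from DBSCAN-style cluster labels.

    Hypothetical helper for illustration; assumes noise is labeled -1.
    """
    labels = list(labels)
    return {
        # number of distinct clusters, excluding the noise label
        "n_clusters": sum(label != -1 for label in set(labels)),
        # True for every observation that was assigned to a cluster
        "clustered_bool": [label != -1 for label in labels],
        # observations per label (Counter used here in place of value_counts)
        "cluster_counts": Counter(labels),
    }


example_labels = [0, 0, 1, -1, 1, 1, -1]
summary = summarize_labels(example_labels)
print(summary["n_clusters"])  # 2
print(summary["cluster_counts"][-1])  # 2 noise points
```

Note that the noise label is counted in cluster_counts but excluded from n_clusters.
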
caproj.cluster.fit_neighbors(data, min_samples)

Fits n nearest neighbors based on min samples and returns distances

This is a simple wrapper around sklearn.neighbors.NearestNeighbors that returns the distance results produced by the fitted neighbors object

  • data (dataframe or array) – data on which to perform nearest neighbors algorithm

  • min_samples (int) – number of neighbors to use by default for kneighbors queries


array representing the lengths to points

Return type


caproj.cluster.make_spider(mean_peaks_per_cluster, row, name, color)

Generate spider plot showing attributes of a single cluster

caproj.cluster.plot_category_scatter(data, x_col, y_col, cat_col, title, colormap='Paired', xlabel='1st dimension', ylabel='2nd dimension')

Plots scatterplot with category colors

caproj.cluster.plot_cluster_hist(data, title, metric, cluster_col='cluster', val_col='Standardized Metric Value', metric_col='Metric', cmap='Paired', bins=6)

Requires melted dataframe as input and plots histograms by cluster

caproj.cluster.plot_dendrogram(linkage_data, method_name, yticks=16, ytick_interval=1, height=4.5)

Plots a dendrogram given a set of input hierarchy linkage data

  • linkage_data – np.array output from scipy.cluster.hierarchy, which should have been applied to a distance matrix to convert it to linkage data

  • method_name – string describing the linkage method used, should be fewer than 30 characters

  • yticks – integer, the number of desired y tick labels for the resulting plot

  • ytick_interval – integer, the desired interval for the resulting y ticks

  • height – float, the desired height of the resulting plot

return: plots dendrogram, no objects are returned

caproj.cluster.plot_epsilon(distances, min_samples, height=5)

Plot epsilon by index sorted by increasing distance

Generates a line plot of epsilon with observations sorted by increasing distances

  • distances (array) – distances generated by fit_neighbors()

  • min_samples (int) – number of neighbors used to generate distances

  • height (int, optional) – height of plot, defaults to 5
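
The idea behind an epsilon plot can be sketched in pure Python. This is only an illustration under stated assumptions: the hypothetical helper kth_neighbor_distances computes each point's Euclidean distance to its k-th nearest other point and sorts the results, which is the curve whose "elbow" is commonly read off as a candidate eps. It excludes the point itself from its own neighbors, which may differ from how fit_neighbors() counts neighbors.

```python
import math


def kth_neighbor_distances(points, k):
    """Return the sorted distances from each point to its k-th nearest other point.

    Hypothetical helper for illustration; brute force, so only suited to small inputs.
    """
    result = []
    for i, p in enumerate(points):
        # distances from p to every other point, nearest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


demo_points = [(0, 0), (0, 1), (1, 0), (10, 10)]
curve = kth_neighbor_distances(demo_points, k=2)
print(curve)  # the isolated point (10, 10) produces the large final value
```
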

caproj.cluster.plot_spider_clusters(title, mean_peaks_per_cluster)

Generate spider plot subplots for all input clusters

caproj.cluster.plot_umap_scatter(x, y, color, title, scale_var, colormap='Reds', xlabel='1st dimension', ylabel='2nd dimension')

Plots scatterplot with color scale


Prints summary results of fitted DBSCAN results dictionary

Provides printed summary and plotted value counts by cluster


dbscan_dict (dict) – returned output dictionary from fit_dbscan() function

caproj.cluster.silplot(X, cluster_labels, clusterer, pointlabels=None, height=6)

Generates silhouette subplot of kmeans clusters alongside PCA n=2

Two side-by-side subplots are generated showing (1) the silhouette plot of the clusterer’s results and (2) the PCA 2-dimensional reduction of the input data, color-coded by cluster.

Source: The majority of the code from this function was provided as a helper function from the CS109b staff in HW2

The original code authored by the cs109b teaching staff is modified from: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

  • X (pandas.DataFrame) – original multi-dimensional data values against which you are plotting

  • cluster_labels (list or array) – list of labels for each observation in X

  • clusterer (sklearn.cluster.KMeans object) – fitted sklearn kmeans clustering object

  • pointlabels (list or None, optional) – list of labels for each point, defaults to None

  • height (int, optional) – height of resulting subplots, defaults to 6

caproj.cluster.silscore_dbscan(data, labels, clustered_bool)

Generates sil score omitting observations not assigned to any cluster by dbscan

  • data (array or dataframe) – original data used for dbscan clustering

  • labels (array) – cluster label for each observation in data

  • clustered_bool (list or 1-d array) – boolean value for each observation indicating whether it had been clustered by dbscan


silhouette score

Return type
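
The omission step described for silscore_dbscan can be sketched as a simple mask. The helper drop_unclustered below is hypothetical and only shows how clustered_bool could be used to filter data and labels. The filtered values could then be passed to a silhouette scorer such as sklearn.metrics.silhouette_score, although whether caproj does exactly this is an assumption.

```python
def drop_unclustered(data, labels, clustered_bool):
    """Keep only the observations flagged as clustered (hypothetical helper)."""
    kept_data = [row for row, keep in zip(data, clustered_bool) if keep]
    kept_labels = [label for label, keep in zip(labels, clustered_bool) if keep]
    return kept_data, kept_labels


# Illustrative values only; -1 marks an observation left unclustered
data = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [9.0, -3.0], [5.2, 4.9]]
labels = [0, 0, 1, -1, 1]
clustered_bool = [label != -1 for label in labels]

kept_data, kept_labels = drop_unclustered(data, labels, clustered_bool)
print(kept_labels)  # [0, 0, 1, 1]
```
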



This module contains functions for visualizing data and model results

Module variables:


Import and set seed for reproducible results

Module functions:

build_dense_ae_architecture(input_dim, …)

Builds and compiles a tensorflow.keras dense autoencoder network

plot_history(history, title[, val_name, …])

Plot training and validation loss using keras history object

caproj.autoencoder.build_dense_ae_architecture(input_dim, encoding_dim, droprate, learning_rate, name)

Builds and compiles a tensorflow.keras dense autoencoder network


This network architecture was designed for the specific purpose of encoding a dataset of 1D embeddings. Therefore, the input dimension must be 1D with a length that equals the number of values in any single observation’s embedding

  • input_dim – integer, the length of each embedding (must all be of the same length)

  • encoding_dim – integer, the desired bottleneck dimension for the encoder network

  • droprate – float greater than 0 and less than 1, passed to the rate argument of the dropout layers between each dense layer

  • learning_rate – float, the desired learning rate for the Adam optimizer used while compiling the model

  • name – string, the desired name of the resulting network


tuple of 3 tf.keras model objects: [0] full autoencoder model, [1] encoder model, [2] decoder model

caproj.autoencoder.plot_history(history, title, val_name='validation', loss_type='MSE')

Plot training and validation loss using keras history object

  • history – keras training history object or dict. If a dict is used, it must have two keys named ‘loss’ and ‘val_loss’ for which the corresponding values must be lists or arrays with float values

  • title – string, the title of the resulting plot

  • val_name – string, the name for the val_loss line in the plot legend (default ‘validation’)

  • loss_type – string, the loss type name to be printed as the y axis label (default ‘MSE’)


a line plot illustrating model training history, no objects are returned
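
When a plain dict is passed as history, its expected shape can be checked before plotting. The following is a small sketch only: is_valid_history_dict is a hypothetical helper (not part of caproj), and the loss values are illustrative rather than taken from a real training run.

```python
def is_valid_history_dict(history):
    """Check for 'loss' and 'val_loss' keys holding non-empty float sequences."""
    for key in ("loss", "val_loss"):
        values = history.get(key)
        if values is None or len(values) == 0:
            return False
        if not all(isinstance(v, float) for v in values):
            return False
    return True


# Illustrative per-epoch losses, one entry per epoch
history_dict = {
    "loss": [0.92, 0.61, 0.48, 0.41],
    "val_loss": [0.95, 0.70, 0.59, 0.57],
}
print(is_valid_history_dict(history_dict))  # True
```
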

caproj.autoencoder.random_seed = 109

Import and set seed for reproducible results

caproj.autoencoder.seed(self, seed=None)

Reseed a legacy MT19937 BitGenerator


This is a convenience, legacy function.

The best practice is to not reseed a BitGenerator, rather to recreate a new one. This method is here for legacy reasons. This example demonstrates best practice.

>>> from numpy.random import MT19937
>>> from numpy.random import RandomState, SeedSequence
>>> rs = RandomState(MT19937(SeedSequence(123456789)))
# Later, you want to restart the stream
>>> rs = RandomState(MT19937(SeedSequence(987654321)))


This module contains utility functions for performing HDBSCAN and UMAP analyses


Documentation is currently incomplete for each function in this module.

Module functions:

predict_ensemble(ensemble, X)

Run data through each classifier in ensemble list to get predicted probabilities

adjusted_classes(y_scores, t)

Adjust class predictions based on the prediction threshold (t)

print_report(m, X_valid, y_valid[, t, …])

Print a comprehensive classification report on both validation and training set

draw_umap(data[, n_neighbors, min_dist, c, …])

Generate plot of UMAP algorithm results based on specified arguments

cluster_hdbscan(clusterable_embedding, …)

Generate plot of HDBSCAN algorithm results based on specified arguments

caproj.utils.adjusted_classes(y_scores, t)

Adjust class predictions based on the prediction threshold (t)

  • Will only work for binary classification problems.
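
Threshold-based class adjustment for a binary problem can be sketched as follows. The function adjusted_classes_sketch is a hypothetical stand-in, and it assumes that scores greater than or equal to t map to the positive class; the actual comparison used by adjusted_classes may differ.

```python
def adjusted_classes_sketch(y_scores, t):
    """Map positive-class scores to 0/1 predictions using threshold t."""
    return [1 if score >= t else 0 for score in y_scores]


scores = [0.10, 0.45, 0.50, 0.80]
print(adjusted_classes_sketch(scores, 0.5))  # [0, 0, 1, 1]
print(adjusted_classes_sketch(scores, 0.3))  # [0, 1, 1, 1]
```

Lowering t labels more observations as positive, which typically trades precision for recall.
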

caproj.utils.cluster_hdbscan(clusterable_embedding, min_cluster_size, viz_embedding_list)

Generate plot of HDBSCAN algorithm results based on specified arguments

caproj.utils.draw_umap(data, n_neighbors=15, min_dist=0.1, c=None, n_components=2, metric='euclidean', title='', plot=True, cmap=None, use_plotly=False, **kwargs)

Generate plot of UMAP algorithm results based on specified arguments

caproj.utils.predict_ensemble(ensemble, X)

Run data through each classifier in ensemble list to get predicted probabilities

  • Those are then averaged out across all classifiers.
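
The averaging step can be illustrated with stub classifiers. This is a sketch under assumptions: each member of the ensemble is assumed to expose a predict_proba(X) method returning one row of class probabilities per observation (the scikit-learn convention), and both ConstantClassifier and predict_ensemble_sketch are hypothetical names used only for this example.

```python
class ConstantClassifier:
    """Stub classifier that always predicts the same positive-class probability."""

    def __init__(self, p_positive):
        self.p_positive = p_positive

    def predict_proba(self, X):
        return [[1.0 - self.p_positive, self.p_positive] for _ in X]


def predict_ensemble_sketch(ensemble, X):
    """Average predicted probabilities across all classifiers in the ensemble."""
    all_probs = [clf.predict_proba(X) for clf in ensemble]
    n_models = len(all_probs)
    averaged = []
    # rows holds one probability row per model for the same observation
    for rows in zip(*all_probs):
        averaged.append([sum(col) / n_models for col in zip(*rows)])
    return averaged


ensemble = [ConstantClassifier(0.2), ConstantClassifier(0.6)]
X_demo = [[0], [1]]
averaged = predict_ensemble_sketch(ensemble, X_demo)
print(averaged)  # each row is approximately [0.6, 0.4]
```
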

caproj.utils.print_report(m, X_valid, y_valid, t=0.5, X_train=None, y_train=None, show_output=True)

Print a comprehensive classification report on both validation and training set

  • The metrics returned are AUC, F1, Precision, Recall and Confusion Matrix.

  • It accepts both single classifiers and ensembles.

  • Results are dependent on probability threshold applied to individual predictions.
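
How the reported metrics depend on the probability threshold can be seen in a small pure-Python sketch. The helper threshold_metrics is hypothetical, it omits AUC, and it assumes the confusion matrix layout [[TN, FP], [FN, TP]] used by scikit-learn; the labels and scores are illustrative only.

```python
def threshold_metrics(y_true, y_scores, t=0.5):
    """Compute confusion matrix, precision, recall and F1 at threshold t."""
    y_pred = [1 if s >= t else 0 for s in y_scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for yt, yp in pairs if yt == 1 and yp == 1)
    fp = sum(1 for yt, yp in pairs if yt == 0 and yp == 1)
    fn = sum(1 for yt, yp in pairs if yt == 1 and yp == 0)
    tn = sum(1 for yt, yp in pairs if yt == 0 and yp == 0)
    # guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true_demo = [0, 0, 1, 1, 1]
y_scores_demo = [0.2, 0.6, 0.4, 0.7, 0.9]
report_default = threshold_metrics(y_true_demo, y_scores_demo, t=0.5)
report_lower = threshold_metrics(y_true_demo, y_scores_demo, t=0.3)
print(report_default["confusion_matrix"])  # [[1, 1], [1, 2]]
print(report_lower["confusion_matrix"])  # [[1, 1], [0, 3]]
```
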


This module contains functions for generating and analyzing trees and tree ensemble models and visualizing the model results

Module variables:


sets default depths for comparison in cross validation


sets cross-validation kfold parameter

Module functions:

generate_adaboost_staged_scores(model_dict, …)

Generates adaboost staged scores in order to find ideal number of iterations

plot_adaboost_staged_scores(model_dict, …)

Plots the adaboost staged scores for each y variable’s predictions and iteration

calc_meanstd_logistic(X_tr, y_tr, X_te, y_te)

Fits and generates tree classifier results, iterated for each input depth

calc_meanstd_regression(X_tr, y_tr, X_te, y_te)

Fits and generates tree regressor results, iterated for each input depth


plot the best depth finder for decision tree model

calculate(data_train, data_test, categories, …)

Calculate decision tree results using a particular set of X features

calc_models(data_train, data_test, …[, …])

Iterate over all combinations of attributes to return lists of resulting models

caproj.trees.calc_meanstd_logistic(X_tr, y_tr, X_te, y_te, depths: list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], cv: int = 5)

Fits and generates tree classifier results, iterated for each input depth

  • X_tr (array-like) – Training data X values

  • y_tr (array-like) – Training data y values

  • X_te (array-like) – Test data X values

  • y_te (array-like) – Test data y values

  • depths (list, optional) – List of depths for each iterated decision tree classifier, defaults to depths

  • cv (int, optional) – Number of k-folds used for cross-validation, defaults to cv


Five arrays are returned (1) mean cross-validation scores for each iteration, (2) standard deviation of each cross-validation score, (3) each training observation’s ROC AUC score, (4) each test observation’s ROC AUC score, (5) each fitted classifier’s model object

Return type
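
The per-depth mean and standard deviation of cross-validation scores can be sketched with the standard library. The helper summarize_cv_scores is hypothetical, the fold scores are illustrative rather than real results, and the use of the population standard deviation (numpy's default) is an assumption about how caproj computes it.

```python
from statistics import mean, pstdev


def summarize_cv_scores(scores_by_depth):
    """Return depths, per-depth mean scores, and per-depth standard deviations."""
    depths_tried = sorted(scores_by_depth)
    cv_means = [mean(scores_by_depth[d]) for d in depths_tried]
    # pstdev is the population standard deviation (assumed; matches numpy's default)
    cv_stds = [pstdev(scores_by_depth[d]) for d in depths_tried]
    return depths_tried, cv_means, cv_stds


# Illustrative fold scores for cv=5, keyed by tree depth
fold_scores = {
    1: [0.60, 0.62, 0.58, 0.61, 0.59],
    2: [0.70, 0.72, 0.68, 0.71, 0.69],
    3: [0.66, 0.75, 0.55, 0.70, 0.64],
}
depths_tried, cv_means, cv_stds = summarize_cv_scores(fold_scores)

# depth with the highest mean cross-validation score
best_depth = depths_tried[cv_means.index(max(cv_means))]
print(best_depth)  # 2
```

A depth with a slightly lower mean but a much smaller standard deviation may still be the more stable choice, which is why both arrays are useful when comparing depths.
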


caproj.trees.calc_meanstd_regression(X_tr, y_tr, X_te, y_te, depths: list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], cv: int = 5)

Fits and generates tree regressor results, iterated for each input depth

  • X_tr (array-like) – Training data X values

  • y_tr (array-like) – Training data y values

  • X_te (array-like) – Test data X values

  • y_te (array-like) – Test data y values

  • depths (list, optional) – List of depths for each iterated decision tree regressor, defaults to depths

  • cv (int, optional) – Number of k-folds used for cross-validation, defaults to cv


Five arrays are returned (1) mean cross-validation scores for each iteration, (2) standard deviation of each cross-validation score, (3) each training observation’s \(R^2\) score, (4) each test observation’s \(R^2\) score, (5) each fitted regressor’s model object

Return type


caproj.trees.calc_models(data_train, data_test, categories, nondescr_attrbutes, descr_attributes, responses_list, logistic=True)

Iterate over all combinations of attributes to return lists of resulting models

  • data_train (array-like) – Training dataset

  • data_test (array-like) – Test dataset

  • categories (list) – List of project categories as they appear in the data

  • nondescr_attrbutes (list) – Column names of all features not consisting of those engineered from project descriptions

  • descr_attributes (list) – Column names of features engineered using project descriptions

  • responses_list (list) – Column names of model responses (i.e. each different y variable)

  • logistic (bool, optional) – Indicates whether to use decision tree classifier (i.e. logistic=True) or regressor (i.e. logistic=False), defaults to True


Two list objects containing (1) lists of dictionaries of model results and (2) lists of fitted model dictionaries for each iterated model

Return type


caproj.trees.calculate(data_train, data_test, categories, attributes: list, responses_list: list, logistic=True)

Calculate decision tree results using a particular set of X features

  • data_train (array-like) – Training dataset

  • data_test (array-like) – Test dataset

  • categories (list) – List of project categories as they appear in the data

  • attributes (list) – Column names of feature columns (i.e. each different X variable under consideration)

  • responses_list (list) – Column names of model responses (i.e. each different y variable)

  • logistic (bool, optional) – Indicates whether to use decision tree classifier (i.e. logistic=True) or regressor (i.e. logistic=False), defaults to True


Two lists containing (1) dictionaries of model results and (2) fitted model dictionaries, one dictionary for each response variable

Return type


caproj.trees.cv = 5

sets cross-validation kfold parameter

caproj.trees.depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

sets default depths for comparison in cross validation

caproj.trees.generate_adaboost_staged_scores(model_dict, X_train, X_test, y_train, y_test)

Generates adaboost staged scores in order to find ideal number of iterations

  • model_dict (dict) – Output fitted model dictionary generated using caproj.model.generate_model_dict()

  • X_train (array-like) – Training data X values

  • X_test (array-like) – Test data X values

  • y_train (array-like) – Training data y values

  • y_test (array-like) – Test data y values


tuple of 2D numpy arrays for adaboost staged scores at each iteration and each response variable, one array for training scores and one for test

Return type


caproj.trees.plot_adaboost_staged_scores(model_dict, X_train, X_test, y_train, y_test, height=4)

Plots the adaboost staged scores for each y variable’s predictions and iteration

  • model_dict (dict) – Output fitted model dictionary generated using caproj.model.generate_model_dict()

  • X_train (array-like) – Training data X values

  • X_test (array-like) – Test data X values

  • y_train (array-like) – Training data y values

  • y_test (array-like) – Test data y values

  • height (int, optional) – Height dimension of resulting plot, defaults to 4


plot the best depth finder for decision tree model


result (dict) – Dictionary returned from the calculate() function