API Reference

caproj.data

This submodule contains the BaseData class, which aggregates caproj.data mixin classes and BaseDataOps class instantiation functionality.

class caproj.data.BaseData(df_input, copy_input)

Inherit core class functionality from the BaseDataOps parent class and act as the core data operations class in which methods of specialized mixin classes are combined.

caproj.data.Mixins = [<class 'caproj.data.clean.CleanMixin'>]

List of mixin classes inherited by the BaseData class
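
The mixin aggregation described above can be sketched with the standard library alone. This is a minimal illustration, not the caproj implementation: the names BaseOpsSketch, CleanMixinSketch, MIXINS, and DataSketch are hypothetical, and a list of dicts stands in for the pandas.DataFrame.

```python
class BaseOpsSketch:
    """Hypothetical stand-in for a base read/write class."""

    def __init__(self, records, copy_input=False):
        self.records = [dict(r) for r in records]
        # Keep an untouched copy of the input only when requested.
        self.records_input = [dict(r) for r in records] if copy_input else None


class CleanMixinSketch:
    """Hypothetical mixin: contributes a method and holds no state itself."""

    def remove_missing_records(self, column):
        self.records = [r for r in self.records if r.get(column) is not None]


# A list of mixins, analogous in spirit to caproj.data.Mixins.
MIXINS = [CleanMixinSketch]


class DataSketch(BaseOpsSketch, *MIXINS):
    """Combine the base class with every mixin listed in MIXINS."""


data = DataSketch([{"PID": 1}, {"PID": None}], copy_input=True)
data.remove_missing_records("PID")
print(len(data.records), len(data.records_input))  # 1 2
```

Unpacking the list in the class statement means a new mixin only needs to be appended to the list to become available on the combined class.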

caproj.data.base

This module contains the core caproj.data read and write functionality

Module classes:

BaseDataOps(df_input, copy_input)

Manage base read/write operations for caproj.data module classes

Module variables:

log

logging.getLogger instance for module


class caproj.data.base.BaseDataOps(df_input, copy_input)

Manage base read/write operations for caproj.data module classes

Class methods:

BaseDataOps.from_file(filename[, copy_input])

Invoke BaseData class and read csv into pandas.DataFrame

BaseDataOps.from_object(input_object[, …])

Invoke BaseData and read dataframe from in-memory object

BaseDataOps.to_file(target_filename, …)

Save current version of BaseDataOps.df to file in .csv format

BaseDataOps.log_record_count([id_col])

Log number of records and unique projects in BaseDataOps.df

BaseDataOps.lint_colnames()

Normalize column name format using underscore (‘_’) as a separator

BaseDataOps.rename_columns([map_dict, json_path])

Map existing column names to new names based on input dictionary

BaseDataOps.set_dtypes([map_dict, …])

Map and convert columns to specified data types

classmethod from_file(filename, copy_input=False, **read_kwargs)

Invoke BaseData class and read csv into pandas.DataFrame

Parameters
  • filename – str filename of .csv file to be read

  • copy_input – bool to specify whether self.df_input persists

  • read_kwargs – optional keyword arguments passed to pandas.read_csv() or pandas.read_excel()

Returns

pandas.DataFrame and copy_input bool as class attributes

Raises

TypeError – if the filename is not a .csv filetype
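
A minimal sketch of a from_file-style alternate constructor, assuming the file type is judged by the filename extension alone. FileDataSketch is a hypothetical stand-in, not the caproj class, and the usecols keyword is just one example of a read_kwargs value forwarded to pandas.read_csv().

```python
import os
import tempfile

import pandas as pd


class FileDataSketch:
    """Hypothetical stand-in illustrating a from_file-style constructor."""

    def __init__(self, df_input, copy_input=False):
        self.df = df_input.copy()
        # Assumption: the original input is kept only when copy_input is True.
        self.df_input = df_input if copy_input else None

    @classmethod
    def from_file(cls, filename, copy_input=False, **read_kwargs):
        # Assumption: only the extension is checked, not the file contents.
        if not filename.lower().endswith(".csv"):
            raise TypeError("filename must be a .csv file")
        return cls(pd.read_csv(filename, **read_kwargs), copy_input)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "projects.csv")
    pd.DataFrame({"PID": [1, 2], "Budget": [10.0, 20.0]}).to_csv(path, index=False)
    # Extra keyword arguments are forwarded unchanged to pandas.read_csv().
    data = FileDataSketch.from_file(path, copy_input=True, usecols=["PID"])

print(list(data.df.columns))  # ['PID']
```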

classmethod from_object(input_object, copy_input=False)

Invoke BaseData and read dataframe from in-memory object

Input objects can be either (a) an existing BaseData object, in which case the pandas.DataFrame stored within that object will be read, or (b) a simple pandas.DataFrame object.

Parameters
  • input_object – object to be read into BaseData

  • copy_input – bool to specify whether self.df_input persists

Returns

pandas.DataFrame and copy_input bool as class variables

Raises

Exceptions – if the input_object is neither a pandas.DataFrame nor a BaseData object with an existing BaseDataOps.df attribute
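
The two accepted input types can be sketched as a simple type dispatch. ObjectDataSketch is a hypothetical stand-in, and raising TypeError for unsupported inputs is an assumption, since the reference above does not name the exception type.

```python
import pandas as pd


class ObjectDataSketch:
    """Hypothetical stand-in illustrating a from_object-style constructor."""

    def __init__(self, df_input, copy_input=False):
        self.df = df_input.copy()
        self.df_input = df_input if copy_input else None

    @classmethod
    def from_object(cls, input_object, copy_input=False):
        # Case (b): a plain DataFrame is used directly.
        if isinstance(input_object, pd.DataFrame):
            return cls(input_object, copy_input)
        # Case (a): an existing instance that already stores a DataFrame in .df.
        stored = getattr(input_object, "df", None)
        if isinstance(input_object, cls) and isinstance(stored, pd.DataFrame):
            return cls(stored, copy_input)
        # Assumption: TypeError is used here; the real exception may differ.
        raise TypeError("input_object must be a DataFrame or hold one in .df")


frame = pd.DataFrame({"PID": [1, 2]})
first = ObjectDataSketch.from_object(frame)
second = ObjectDataSketch.from_object(first)
print(second.df.equals(frame))  # True
```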

lint_colnames()

Normalize column name format using underscore (‘_’) as a separator
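
One plausible normalization rule is sketched below. The exact rules used by lint_colnames() are not stated in this reference, so lowercasing, collapsing runs of non-alphanumeric characters, and stripping edge underscores are all assumptions, and lint_name is a hypothetical helper.

```python
import re


def lint_name(name):
    """Return a normalized column name using '_' as the separator."""
    # Assumption: any run of non-alphanumeric characters becomes one "_".
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    # Assumption: edge underscores are dropped and the result is lowercased.
    return cleaned.strip("_").lower()


names = ["Project Name", "Budget-Forecast ($)", " PID "]
print([lint_name(n) for n in names])
# ['project_name', 'budget_forecast', 'pid']
```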

log_record_count(id_col='PID')

Log number of records and unique projects in BaseDataOps.df

rename_columns(map_dict=None, json_path=None)

Map existing column names to new names based on input dictionary

A simple wrapper for the pandas DataFrame.rename method

Parameters
  • map_dict (dict, optional) – column name mapping {current_value: new_value}, defaults to None

  • json_path (str, optional) – file path to json file storing the desired map_dict, defaults to None
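
A minimal sketch of the two ways to supply the mapping, written as a standalone function rather than a method. rename_columns_sketch is hypothetical, and giving map_dict precedence over json_path is an assumption.

```python
import json
import os
import tempfile

import pandas as pd


def rename_columns_sketch(df, map_dict=None, json_path=None):
    """Rename columns from a dict, or from a JSON file holding that dict."""
    # Assumption: an explicit map_dict takes precedence over json_path.
    if map_dict is None and json_path is not None:
        with open(json_path) as fp:
            map_dict = json.load(fp)
    return df.rename(columns=map_dict or {})


df = pd.DataFrame({"Project Id": [1], "Budget": [10.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "colnames.json")
    with open(path, "w") as fp:
        json.dump({"Project Id": "PID"}, fp)
    from_json = rename_columns_sketch(df, json_path=path)

from_dict = rename_columns_sketch(df, map_dict={"Budget": "budget"})
print(list(from_json.columns))  # ['PID', 'Budget']
print(list(from_dict.columns))  # ['Project Id', 'budget']
```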

set_dtypes(map_dict=None, json_path=None, coerce=False)

Map and convert columns to specified data types

Internally, this function uses the pandas .to_* data type conversion methods.

Parameters
  • map_dict ([type], optional) – [description], defaults to None

  • json_path ([type], optional) – [description], defaults to None

  • coerce (bool, optional) – [description], defaults to False
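
Because the parameter descriptions above are still placeholders, the following is only a guess at how such a conversion could work with the pandas to_* functions. set_dtypes_sketch, the "numeric" and "datetime" labels, and the reading of coerce as errors="coerce" are all assumptions.

```python
import pandas as pd

# Assumption: map_dict values are labels that select a pandas to_* function.
CONVERTERS = {"numeric": pd.to_numeric, "datetime": pd.to_datetime}


def set_dtypes_sketch(df, map_dict, coerce=False):
    """Convert each column named in map_dict with the matching converter."""
    # Assumption: coerce=True turns unparseable values into NaN/NaT.
    errors = "coerce" if coerce else "raise"
    out = df.copy()
    for column, kind in map_dict.items():
        out[column] = CONVERTERS[kind](out[column], errors=errors)
    return out


df = pd.DataFrame(
    {"budget": ["100", "n/a"], "start": ["2020-01-01", "2020-02-15"]}
)
typed = set_dtypes_sketch(
    df, {"budget": "numeric", "start": "datetime"}, coerce=True
)
print(typed["budget"].isna().tolist())  # [False, True]
```

With coerce=False, the same call would raise a ValueError on the "n/a" entry instead of replacing it with a missing value.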

sort_values(by, ascending=True, na_position='last', ignore_index=False)

Sort dataframe records by specified columns

A simple wrapper for the pandas DataFrame.sort_values method. The sort is performed in place on the stored BaseDataOps.df attribute, so no objects are returned.

Parameters
  • by (str or list of str) – Name or list of names to sort by.

  • ascending (bool or list of bool, optional) – Sort ascending vs. descending. Specify a list for multiple sort orders; a list of bools must match the length of by. Defaults to True

  • na_position ({‘first’, ‘last’}, optional) – Puts NaNs at the beginning if first; last puts NaNs at the end, defaults to ‘last’

  • ignore_index (bool, optional) – If True, the resulting axis will be labeled 0, 1, …, n - 1; defaults to False
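
The effect of these parameters can be seen by calling pandas directly. That the wrapper uses inplace=True internally is an assumption based on the description above; the data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"PID": [3, 1, 2], "Budget": [5.0, None, 1.0]})

# With inplace=True, pandas modifies df and returns None.
result = df.sort_values(
    by="Budget",
    ascending=False,
    na_position="first",
    ignore_index=True,
    inplace=True,
)

print(result)               # None
print(df["PID"].tolist())   # [1, 3, 2]
print(df.index.tolist())    # [0, 1, 2]
```

The record with the missing Budget comes first because of na_position="first", and ignore_index=True relabels the rows from 0 after sorting.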

to_file(target_filename, **to_csv_kwargs)

Save current version of BaseDataOps.df to file in .csv format

Parameters
  • target_filename – str filename to which csv should be written

  • to_csv_kwargs – optional args to pandas.DataFrame.to_csv()

caproj.data.base.log = <Logger caproj.data.base (WARNING)>

logging.getLogger instance for module

caproj.data.clean

This module contains BaseData mixin class to clean the NYC Capital Projects dataset

Module classes:

CleanMixin()

BaseData mixin class methods for cleansing the NYC capital projects dataset

Module variables:

log

logging.getLogger instance for module


class caproj.data.clean.CleanMixin

BaseData mixin class methods for cleansing the NYC capital projects dataset

check_duplicates(columns)

Check column or combination of columns for duplicate records

Todo

  • use pandas duplicated method with keep=False to identify dupe rows

  • log # of duplicated rows and indices of duplicated values

  • use value_counts to identify how many dupes per value > 1 occurrence
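
The approach outlined in the Todo items could look roughly like the sketch below. It is not the implemented method; the data and variable names are hypothetical, and only standard pandas calls are used.

```python
import pandas as pd

df = pd.DataFrame({"PID": [1, 2, 2, 3, 3, 3], "Phase": list("abcdef")})
columns = ["PID"]

# keep=False marks every member of a duplicated group, not just the repeats.
dupe_mask = df.duplicated(subset=columns, keep=False)
dupe_rows = df[dupe_mask]

# Count occurrences per value and keep only values seen more than once.
counts = df[columns[0]].value_counts()
counts = counts[counts > 1]

print(len(dupe_rows), dupe_rows.index.tolist())  # 5 [1, 2, 3, 4, 5]
print(counts.to_dict())                          # {3: 3, 2: 2}
```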

concat_values(columns, to_colname)

Add new column of values generated by concatenating other columns’ values

Todo

  • Add option to set desired length for each column’s values

  • Add option to add separator between values

Parameters
  • columns (str or list of str) – name(s) of column(s) for which values should be concatenated

  • to_colname (str) – name of resulting column containing the concatenated values
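
A minimal sketch of the concatenation, written as a standalone function. concat_values_sketch is hypothetical; converting every value to a string and returning a new dataframe instead of modifying a stored one are assumptions.

```python
import pandas as pd


def concat_values_sketch(df, columns, to_colname):
    """Add a column whose values join the given columns' values as strings."""
    if isinstance(columns, str):
        columns = [columns]
    out = df.copy()
    # Assumption: values are cast to str and joined with no separator.
    out[to_colname] = out[columns].astype(str).agg("".join, axis=1)
    return out


df = pd.DataFrame({"PID": [1, 2], "Phase": ["A", "B"]})
out = concat_values_sketch(df, ["PID", "Phase"], "record_id")
print(out["record_id"].tolist())  # ['1A', '2B']
```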

remove_missing_records(columns)

Delete records with missing values in specified columns

Parameters

columns (str or list of str) – name(s) of the column(s) to check for blank values; any record with a blank value in these columns is dropped from the dataframe
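
The dropping behavior can be sketched with pandas dropna. remove_missing_records_sketch is hypothetical, and treating "blank" as NaN or None only (not empty strings) is an assumption.

```python
import pandas as pd


def remove_missing_records_sketch(df, columns):
    """Drop rows that have a missing value in any of the given columns."""
    if isinstance(columns, str):
        columns = [columns]
    # Assumption: only NaN/None count as blank; empty strings are kept.
    return df.dropna(subset=columns)


df = pd.DataFrame({"PID": [1, None, 3], "Budget": [10.0, 20.0, None]})
print(len(remove_missing_records_sketch(df, "PID")))              # 2
print(len(remove_missing_records_sketch(df, ["PID", "Budget"])))  # 1
```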

caproj.data.clean.log = <Logger caproj.data.clean (WARNING)>

logging.getLogger instance for module

caproj.features

This module contains functionality for generating engineered features to be used in modeling for this project.

Module functions:

placeholder()

Placeholder function to illustrate testing and docs generation


caproj.features.placeholder()

Placeholder function to illustrate testing and docs generation

caproj.models

This module contains functionality for generating engineered features to be used in modeling for this project.

Module functions:

placeholder()

Placeholder function to illustrate testing and docs generation


caproj.models.placeholder()

Placeholder function to illustrate testing and docs generation

caproj.visualizations

This module contains functions for visualizing data and model results

Module functions:

save_plot()

Save a matplotlib plot to file

plot_barplot()

Generate a horizontal barplot from a pandas value_counts series

caproj.visualizations.plot_barplot(value_counts, title, height=6, varname=None, color='k', label_space=0.01, savepath=None)

Generate a horizontal barplot from a pandas value_counts series

Parameters
  • value_counts – pd.Series object generated by pandas value_counts() method

  • title – string, the printed title of the plot

  • height – integer, the desired height of the plot (default is 6)

  • color – string, the matplotlib color name for the color you would like for the plotted bars (default is ‘k’ or black)

  • label_space – float, a coefficient used to space the count label an appropriate distance from the plotted bar (default is 0.01)

  • savepath – string or none, specifies filepath at which to save the resulting barplot. If None, nothing is saved. (Default is None)

Returns

a matplotlib plot is generated; no objects are returned
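
A rough sketch of how a horizontal barplot with count labels can be built from a value_counts series. barplot_sketch is hypothetical and differs from the documented function: it returns the figure and axes so the result can be inspected, fixes the width at 8, and omits varname, whose role is not described above.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless

import matplotlib.pyplot as plt
import pandas as pd


def barplot_sketch(value_counts, title, height=6, color="k",
                   label_space=0.01, savepath=None):
    """Draw one horizontal bar per category and label it with its count."""
    fig, ax = plt.subplots(figsize=(8, height))  # width of 8 is an assumption
    ordered = value_counts.sort_values()
    ax.barh(ordered.index.astype(str), ordered.values, color=color)
    # Space each label away from its bar by a fraction of the largest count.
    offset = label_space * ordered.max()
    for position, count in enumerate(ordered.values):
        ax.text(count + offset, position, str(count), va="center")
    ax.set_title(title)
    if savepath is not None:
        fig.savefig(savepath)
    return fig, ax


counts = pd.Series(["Roads", "Parks", "Roads", "Schools"]).value_counts()
fig, ax = barplot_sketch(counts, "Projects by category")
print(len(ax.patches))  # 3
```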

caproj.visualizations.save_plot(plt_object, savepath=None)

Save a matplotlib plot to file

Parameters
  • plt_object – matplotlib.pyplot plot object

  • savepath – string or None, specifies filepath at which to save the matplotlib plot. If None, nothing is saved. (Default is None)

Returns

No objects are returned

caproj.logger

This module contains logging-related features for the caproj package

Module functions:

logfunc([orig_func, log, funcname, argvals, …])

Wrap function call to provide log information when function is called

start_logging([default_path, default_level, …])

Set up logging configuration for caproj package


caproj.logger.logfunc(orig_func=None, log=None, funcname=False, argvals=False, docdescr=False, runtime=False)

Wrap function call to provide log information when function is called

This function is a decorator, built on functools.wraps, that logs details of the decorated function or method each time it is called

Parameters
  • orig_func – NoneType placeholder parameter

  • log – logging.getLogger object for logging, default is None

  • funcname – boolean indicating whether to log name of function, default is False

  • argvals – boolean indicating whether to log function arguments, default is False

  • docdescr – boolean indicating whether to log function docstring short description, default is False

  • runtime – boolean indicating whether to log function execution runtime in seconds, default is False

Returns

functools.wraps wrapper function

Example

import logging

from caproj.logger import logfunc

log = logging.getLogger(__name__)

@logfunc(log=log, funcname=True, runtime=True)
def some_function(arg1, **kwargs):
    pass

Note

All logfunc logs are generated at the ‘INFO’ logging level
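
The orig_func=None placeholder suggests a decorator that works both bare and with arguments. The sketch below shows that general pattern with the standard library; logfunc_sketch is hypothetical, covers only the funcname and runtime options, and is not the caproj implementation.

```python
import functools
import logging
import time


def logfunc_sketch(orig_func=None, log=None, funcname=False, runtime=False):
    """Decorator usable as @logfunc_sketch or @logfunc_sketch(...)."""

    def decorate(func):
        logger = log or logging.getLogger(func.__module__)

        @functools.wraps(func)  # preserve the wrapped function's metadata
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            if funcname:
                logger.info("called %s", func.__name__)
            if runtime:
                elapsed = time.perf_counter() - start
                logger.info("%s ran in %.4f s", func.__name__, elapsed)
            return result

        return wrapper

    # Bare use passes the function as orig_func; parameterized use leaves
    # orig_func as None and returns the decorator itself.
    if orig_func is not None:
        return decorate(orig_func)
    return decorate


@logfunc_sketch(funcname=True, runtime=True)
def add(a, b):
    """Add two numbers."""
    return a + b


@logfunc_sketch
def double(x):
    return 2 * x


print(add(1, 2), double(4), add.__name__)  # 3 8 add
```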

caproj.logger.start_logging(default_path='logging.json', default_level='INFO', env_key='LOG_CFG')

Set up logging configuration for caproj package

Parameters
  • default_path – string file path for json formatted logging configuration file (default is ‘logging.json’)

  • default_level – string indicating the default level for logging, accepts the following values: ‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’, ‘CRITICAL’ (default is ‘INFO’)

  • env_key – string indicating environment key if one exists (default is ‘LOG_CFG’)
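
A common way to combine these three parameters is sketched below, using only the standard library. start_logging_sketch is hypothetical; letting the environment variable override default_path, falling back to logging.basicConfig when no file exists, and the returned label are all assumptions made for illustration.

```python
import json
import logging
import logging.config
import os


def start_logging_sketch(default_path="logging.json", default_level="INFO",
                         env_key="LOG_CFG"):
    """Configure logging from a JSON file if present, else use a default."""
    # Assumption: the environment variable, when set, overrides default_path.
    path = os.getenv(env_key) or default_path
    if os.path.exists(path):
        with open(path) as fp:
            logging.config.dictConfig(json.load(fp))
        return "file"
    # Assumption: without a config file, fall back to a basic configuration.
    logging.basicConfig(level=getattr(logging, default_level))
    return "basic"


mode = start_logging_sketch(
    default_path="does_not_exist.json",
    env_key="CAPROJ_SKETCH_LOG_CFG_UNSET",
)
print(mode)  # basic
```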

caproj.cli

This module contains the command line app for caproj.

Why does this file exist, and why not put this in __main__?

You might be tempted to import things from __main__ later, but that will cause problems because the code will get executed twice:

  • When you run python -m caproj, python will execute __main__.py as a script. That means there won’t be any caproj.__main__ in sys.modules.

  • When you import __main__ it will get executed again (as a module) because there’s no caproj.__main__ in sys.modules.

  • Also see (1) from http://click.pocoo.org/5/setuptools/#setuptools-integration

Module functions:

main([args])

Placeholder function for testing and to illustrate caproj cli


caproj.cli.main(args=None)

Placeholder function for testing and to illustrate caproj cli

Prints list of input arguments

Parameters

args – str or NoneType, default is None

Example:

>> python -m caproj foo bar
['foo', 'bar']
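
A minimal sketch consistent with the example above, assuming the function falls back to the command line arguments when args is None. This is an illustration of the pattern, not the caproj source.

```python
import sys


def main(args=None):
    """Print the list of input arguments."""
    # Assumption: with no explicit args, use those passed on the command line.
    if args is None:
        args = sys.argv[1:]
    print(args)


main(["foo", "bar"])  # ['foo', 'bar']
```

Keeping main() in its own module lets a __main__.py simply import and call it, which avoids the double execution described earlier in this section.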