API Reference¶

Contents

caproj.data
caproj.data.base
caproj.data.clean
caproj.features
caproj.models
caproj.visualizations
caproj.logger
caproj.cli

caproj.data ¶

This submodule contains the BaseData class, which aggregates caproj.data mixin classes and BaseDataOps class instantiation functionality.

class caproj.data.BaseData(df_input, copy_input)¶: Inherit core class functionality from the BaseDataOps parent class and act as the core data operations class in which methods of specialized mixin classes are combined.

caproj.data.Mixins = [<class 'caproj.data.clean.CleanMixin'>]¶: List of mixin classes inherited by the BaseData class
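The way a list of mixins can be folded into one combined class may be easier to see in a small sketch. The example below does not use caproj itself; `BaseOps`, `CleanMixinSketch`, `MIXINS`, and `Data` are hypothetical stand-ins, and the records are plain dictionaries rather than a pandas.DataFrame.

```python
class BaseOps:
    """Hypothetical stand-in for BaseDataOps: holds the working records."""

    def __init__(self, records):
        self.records = list(records)


class CleanMixinSketch:
    """Hypothetical mixin that contributes one cleaning method."""

    def remove_missing(self, key):
        # Drop records whose value for `key` is None or an empty string
        self.records = [r for r in self.records if r.get(key) not in (None, "")]


# List of mixins to combine, analogous in spirit to caproj.data.Mixins
MIXINS = [CleanMixinSketch]


class Data(BaseOps, *MIXINS):
    """Combined class: base operations plus every listed mixin."""


data = Data([{"PID": 1, "name": "a"}, {"PID": 2, "name": ""}])
data.remove_missing("name")
```

Adding a new mixin to the list is then enough to expose its methods on the combined class, which is presumably the motivation for keeping the mixins in a module-level list.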

caproj.data.base ¶

This module contains the core caproj.data read and write functionality

Module classes:

BaseDataOps(df_input, copy_input)

Manage base read/write operations for caproj.data module classes

Module variables:

log

logging.getLogger instance for module

class caproj.data.base.BaseDataOps(df_input, copy_input)¶

Manage base read/write operations for caproj.data module classes

Variables

df – pandas.DataFrame working copy, read either from a file or from an existing object by using the initializing class methods BaseDataOps.from_file() or BaseDataOps.from_object()
df_input – pandas.DataFrame original input copy, not operated upon by any class methods, and only created if the copy_input parameter is set to True when the class is created with BaseDataOps.from_file() or BaseDataOps.from_object()

Class methods:

`BaseDataOps.from_file`(filename[, copy_input])	Invoke BaseData class and read csv into pandas.DataFrame
`BaseDataOps.from_object`(input_object[, …])	Invoke BaseData and read dataframe from in-memory object
`BaseDataOps.to_file`(target_filename, …)	Save current version of BaseDataOps.df to file in .csv format
`BaseDataOps.log_record_count`([id_col])	Log number of records and unique projects in BaseDataOps.df
`BaseDataOps.lint_colnames`()	Normalize column name format using underscore (‘_’) as a separator
`BaseDataOps.rename_columns`([map_dict, json_path])	Map existing column names to new names based on input dictionary
`BaseDataOps.set_dtypes`([map_dict, …])	Map and convert columns to specified data types

classmethod from_file(filename, copy_input=False, **read_kwargs)¶

Invoke BaseData class and read csv into pandas.DataFrame

Parameters

filename – str filename of .csv file to be read
copy_input – bool to specify whether self.df_input persists
read_kwargs – optional args to pandas.read_csv() or pandas.read_excel()

Returns

pandas.DataFrame and copy_input bool as class attributes

Raises

TypeError – if the filename is not a .csv filetype
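The alternate-constructor pattern described above can be sketched with the standard library alone. This is not the caproj implementation: `OpsSketch` is a hypothetical class, rows are read with `csv.DictReader` into a list of dictionaries instead of a pandas.DataFrame, and the exact wording of the error is an assumption.

```python
import copy
import csv
from pathlib import Path


class OpsSketch:
    """Hypothetical stand-in for BaseDataOps, holding rows as a list of dicts."""

    def __init__(self, df_input, copy_input):
        self.df = df_input
        if copy_input:
            # Keep an untouched copy only when the caller asks for it
            self.df_input = copy.deepcopy(df_input)

    @classmethod
    def from_file(cls, filename, copy_input=False, **read_kwargs):
        # Reject anything that is not a .csv file before trying to read it
        if Path(filename).suffix.lower() != ".csv":
            raise TypeError(f"expected a .csv file, got {filename!r}")
        with open(filename, newline="") as handle:
            rows = list(csv.DictReader(handle, **read_kwargs))
        return cls(rows, copy_input)
```

Calling `OpsSketch.from_file("projects.csv", copy_input=True)` on a hypothetical file would return an instance with both `df` and `df_input` set, while a non-.csv filename raises TypeError before any file is opened.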

classmethod from_object(input_object, copy_input=False)¶

Invoke BaseData and read dataframe from in-memory object

Input objects can be either (a) an existing BaseData object, in which case the pandas.DataFrame stored within that object will be read, or (b) a simple pandas.DataFrame object.

Parameters

input_object – object to be read into BaseData
copy_input – bool to specify whether self.df_input persists

Returns

pandas.DataFrame and copy_input bool as class variables

Raises

Exceptions – if the input_object is neither a pandas.DataFrame nor a BaseData object with an existing BaseDataOps.df attribute

lint_colnames()¶: Normalize column name format using underscore (‘_’) as a separator
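As a rough illustration of what underscore normalization could look like, the sketch below cleans a single column name with a regular expression. The function name `lint_colname` and the exact rules (collapsing every run of non-alphanumeric characters, trimming leading and trailing underscores, leaving case unchanged) are assumptions, not the documented behavior of lint_colnames().

```python
import re


def lint_colname(name):
    """Return `name` with runs of non-alphanumeric characters replaced by '_'."""
    # Collapse each run of spaces, punctuation, and so on into one underscore
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    # Drop underscores left over at either end
    return cleaned.strip("_")


columns = ["Project Name", " Budget ($) ", "phase-start  date"]
linted = [lint_colname(col) for col in columns]
```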

log_record_count(id_col='PID')¶: Log number of records and unique projects in BaseDataOps.df

rename_columns(map_dict=None, json_path=None)¶

Map existing column names to new names based on input dictionary

A simple wrapper for the pandas DataFrame.rename method

Parameters

map_dict (dict, optional) – column name mapping {current_value: new_value}, defaults to None
json_path (str, optional) – file path to json file storing the desired map_dict, defaults to None
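The two parameters suggest that the mapping can come either from a dictionary or from a JSON file. The sketch below shows one way to resolve them; `resolve_map` and `rename` are hypothetical helpers, and giving map_dict precedence over json_path is an assumption. With pandas, the resolved mapping would be passed to `DataFrame.rename(columns=mapping)`; here a plain list of names is used so the example runs without pandas.

```python
import json


def resolve_map(map_dict=None, json_path=None):
    """Return the mapping from map_dict, or load it from json_path."""
    if map_dict is not None:
        return dict(map_dict)
    if json_path is not None:
        with open(json_path, encoding="utf-8") as handle:
            return json.load(handle)
    raise ValueError("provide either map_dict or json_path")


def rename(columns, mapping):
    """Apply {current_value: new_value} to a list of column names."""
    # Names missing from the mapping are left unchanged
    return [mapping.get(name, name) for name in columns]


mapping = resolve_map(map_dict={"Project Name": "project_name"})
renamed = rename(["PID", "Project Name"], mapping)
```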

set_dtypes(map_dict=None, json_path=None, coerce=False)¶

Map and convert columns to specified data types

Internally, this function uses the pandas .to_* data type conversion methods.

Parameters

map_dict ([type], optional) – [description], defaults to None
json_path ([type], optional) – [description], defaults to None
coerce (bool, optional) – [description], defaults to False

sort_values(by, ascending=True, na_position='last', ignore_index=False)¶

Sort dataframe records by specified columns

A simple application of the pandas DataFrame.sort_values method. The sort is performed in place on the stored BaseDataOps.df attribute, so no objects are returned.

Parameters

by (str or list of str) – Name or list of names to sort by.
ascending (bool or list of bool, optional) – Sort ascending vs. descending. Specify a list for multiple sort orders. If this is a list of bools, its length must match the length of by; defaults to True
na_position ({‘first’, ‘last’}, optional) – Puts NaNs at the beginning if first; last puts NaNs at the end, defaults to ‘last’
ignore_index (bool, optional) – If True, the resulting axis will be labeled 0, 1, …, n - 1; defaults to False

to_file(target_filename, **to_csv_kwargs)¶

Save current version of BaseDataOps.df to file in .csv format

Parameters

target_filename – str filename to which csv should be written
to_csv_kwargs – optional args to pandas.DataFrame.to_csv()

caproj.data.base.log = <Logger caproj.data.base (WARNING)>¶: logging.getLogger instance for module

caproj.data.clean ¶

This module contains BaseData mixin class to clean the NYC Capital Projects dataset

Module classes:

CleanMixin()

BaseData mixin class methods for cleansing the NYC capital projects dataset

Module variables:

log

logging.getLogger instance for module

class caproj.data.clean.CleanMixin¶

BaseData mixin class methods for cleansing the NYC capital projects dataset

check_duplicates(columns)¶

Check column or combination of columns for duplicate records

Todo

use pandas duplicated method with keep=False to identify dupe rows
log # of duplicated rows and indices of duplicated values
use value_counts to count the duplicates for each value with more than one occurrence
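The Todo items above describe the intended logic in pandas terms. A standard-library sketch of the same idea is given below, assuming records are dictionaries; `find_duplicates` is a hypothetical helper, not part of CleanMixin. It flags every row whose key occurs more than once (the behavior of `duplicated` with keep=False) and counts occurrences per duplicated key.

```python
from collections import Counter


def find_duplicates(records, columns):
    """Return (indices of duplicated rows, count per duplicated key)."""
    # Build one hashable key per record from the selected columns
    keys = [tuple(record[col] for col in columns) for record in records]
    counts = Counter(keys)
    # Every row whose key appears more than once is flagged, not just repeats
    dup_indices = [i for i, key in enumerate(keys) if counts[key] > 1]
    dup_counts = {key: n for key, n in counts.items() if n > 1}
    return dup_indices, dup_counts


records = [
    {"PID": 1, "phase": "a"},
    {"PID": 2, "phase": "b"},
    {"PID": 1, "phase": "a"},
]
indices, per_key = find_duplicates(records, ["PID", "phase"])
```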

concat_values(columns, to_colname)¶

Add new column of values generated by concatenating other columns’ values

Todo

Add option to set desired length for each column’s values
Add option to add separator between values

Parameters

columns (str or list of str) – name(s) of column(s) for which values should be concatenated
to_colname (str) – name of resulting column containing the concatenated values
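A minimal sketch of the concatenation, assuming records are dictionaries and that values are converted to strings before joining (both assumptions; the real method operates on BaseDataOps.df). `concat_values_sketch` is a hypothetical name.

```python
def concat_values_sketch(records, columns, to_colname):
    """Add `to_colname` to each record by joining the values of `columns`."""
    # Accept a single column name as well as a list of names
    if isinstance(columns, str):
        columns = [columns]
    for record in records:
        record[to_colname] = "".join(str(record[col]) for col in columns)
    return records


rows = [{"PID": 3, "phase": "design"}, {"PID": 7, "phase": "build"}]
concat_values_sketch(rows, ["PID", "phase"], "record_id")
```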

remove_missing_records(columns)¶

Delete records with missing values in specified columns

Parameters: columns (str or list of str) – column name(s) for columns to check for blank values, any record with a blank value in the columns will be dropped from dataframe

caproj.data.clean.log = <Logger caproj.data.clean (WARNING)>¶: logging.getLogger instance for module

caproj.features ¶

This module contains functionality for generating engineered features to be used in modeling for this project.

Module functions:

placeholder()

Placeholder function to illustrate testing and docs generation

caproj.features.placeholder()¶: Placeholder function to illustrate testing and docs generation

caproj.models ¶

This module contains functionality for generating engineered features to be used in modeling for this project.

Module functions:

placeholder()

Placeholder function to illustrate testing and docs generation

caproj.models.placeholder()¶: Placeholder function to illustrate testing and docs generation

caproj.visualizations ¶

This module contains functions for visualizing data and model results

Module functions:

save_plot()

Save a matplotlib plot to file

plot_barplot()

Generate a horizontal barplot from a pandas value_counts series

caproj.visualizations.plot_barplot(value_counts, title, height=6, varname=None, color='k', label_space=0.01, savepath=None)¶

Generate a horizontal barplot from a pandas value_counts series

Parameters

value_counts – pd.Series object generated by pandas value_counts() method
title – string, the printed title of the plot
height – integer, the desired height of the plot (default is 6)
color – string, the matplotlib color name for the color you would like for the plotted bars (default is ‘k’ or black)
label_space – float, a coefficient used to space the count label an appropriate distance from the plotted bar (default is 0.01)
savepath – string or none, specifies filepath at which to save the resulting barplot. If None, nothing is saved. (Default is None)

Returns

a matplotlib plot is generated; no objects are returned

caproj.visualizations.save_plot(plt_object, savepath=None)¶

Save a matplotlib plot to file

Parameters

plt_object – matplotlib.pyplot plot object
savepath – string or None, specifies filepath at which to save the matplotlib plot. If None, nothing is saved. (Default is None)

Returns

No objects are returned

caproj.logger ¶

This module contains logging-related features for the caproj package

Module functions:

`logfunc`([orig_func, log, funcname, argvals, …])	Wrap function call to provide log information when function is called
`start_logging`([default_path, default_level, …])	Set up logging configuration for `caproj` package

caproj.logger.logfunc(orig_func=None, log=None, funcname=False, argvals=False, docdescr=False, runtime=False)¶

Wrap function call to provide log information when function is called

This function is a decorator built on functools.wraps. Applied to a function or method, it logs the selected details of each call to the decorated function.

Parameters

orig_func – NoneType placeholder parameter
log – logging.getLogger object for logging, default is None
funcname – boolean indicating whether to log name of function, default is False
argvals – boolean indicating whether to log function arguments, default is False
docdescr – boolean indicating whether to log function docstring short description, default is False
runtime – boolean indicating whether to log function execution runtime in seconds, default is False

Returns

functools.wraps wrapper function

Example

import logging

from caproj.logger import logfunc

log = logging.getLogger(__name__)

@logfunc(log=log, funcname=True, runtime=True)
def some_function(arg1, **kwargs):
    pass

Note

All logfunc logs are generated at the ‘INFO’ logging level
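The orig_func placeholder parameter is typical of decorators that can be used both with and without arguments. The sketch below shows that general pattern; it is not the caproj source. `logfunc_sketch` is a hypothetical name, only the funcname and runtime options are implemented, and the log message formats are assumptions.

```python
import functools
import logging
import time


def logfunc_sketch(orig_func=None, log=None, funcname=False, runtime=False):
    """Decorator usable as @logfunc_sketch or @logfunc_sketch(...)."""
    if orig_func is None:
        # Called with keyword arguments only: return a decorator that
        # remembers them and waits for the function to wrap
        return functools.partial(
            logfunc_sketch, log=log, funcname=funcname, runtime=runtime
        )

    logger = log or logging.getLogger(orig_func.__module__)

    @functools.wraps(orig_func)
    def wrapper(*args, **kwargs):
        if funcname:
            logger.info("calling %s", orig_func.__name__)
        start = time.perf_counter()
        result = orig_func(*args, **kwargs)
        if runtime:
            elapsed = time.perf_counter() - start
            logger.info("%s ran in %.4f seconds", orig_func.__name__, elapsed)
        return result

    return wrapper
```

Because functools.wraps copies the wrapped function's name and docstring onto the wrapper, the decorated function still reports its own `__name__`.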

caproj.logger.start_logging(default_path='logging.json', default_level='INFO', env_key='LOG_CFG')¶

Set up logging configuration for caproj package

Parameters

default_path – string file path for json formatted logging configuration file (default is ‘logging.json’)
default_level – string indicating the default level for logging, accepts the following values: ‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’, ‘CRITICAL’ (default is ‘INFO’)
env_key – string indicating environment key if one exists (default is ‘LOG_CFG’)
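The three parameters resemble a common logging setup recipe, sketched below under that assumption: the environment variable named by env_key, if set, overrides default_path; a JSON file found at that path is passed to `logging.config.dictConfig`; otherwise `logging.basicConfig` is used at default_level. `start_logging_sketch` and its return value are hypothetical and may differ from what caproj actually does.

```python
import json
import logging
import logging.config
import os


def start_logging_sketch(
    default_path="logging.json", default_level="INFO", env_key="LOG_CFG"
):
    """Configure logging and return the config path used, or None."""
    # An environment variable, when set, overrides the default config path
    path = os.getenv(env_key) or default_path
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            logging.config.dictConfig(json.load(handle))
        return path
    # No config file found: fall back to a basic setup at the default level
    logging.basicConfig(level=default_level)
    return None
```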

caproj.cli ¶

This module contains the command line app for caproj.

Why does this file exist, and why not put this in __main__?

You might be tempted to import things from __main__ later, but that will cause problems because the code will get executed twice:

When you run python -m caproj, python will execute __main__.py as a script. That means there won’t be any caproj.__main__ in sys.modules.
When you import __main__ it will get executed again (as a module) because there’s no caproj.__main__ in sys.modules.
Also see (1) from http://click.pocoo.org/5/setuptools/#setuptools-integration

Module functions:

main([args])

Placeholder function for testing and to illustrate caproj cli

caproj.cli.main(args=None)¶

Placeholder function for testing and to illustrate caproj cli

Prints list of input arguments

Parameters: args – str or NoneType, default is None

Example:

>> python -m caproj foo bar
['foo', 'bar']
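A placeholder entry point with this behavior can be sketched in a few lines. The fallback to `sys.argv[1:]` when args is None is an assumption about how the real function obtains its arguments; `main_sketch` is a hypothetical name.

```python
import sys


def main_sketch(args=None):
    """Print the list of input arguments."""
    # Fall back to the process arguments when none are passed explicitly
    if args is None:
        args = sys.argv[1:]
    print(list(args))


main_sketch(["foo", "bar"])
```

Keeping the function in its own module and calling it from __main__.py means other code can import it without triggering the double execution described above.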
