API Reference
caproj.data
This submodule contains the BaseData class, which aggregates caproj.data
mixin classes and BaseDataOps class instantiation functionality.
class caproj.data.BaseData(df_input, copy_input)
Inherit core class functionality from the BaseDataOps parent class and act as the core data operations class in which methods of specialized mixin classes are combined.
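caproj.data.base
This module contains the core caproj.data read and write functionality.
Module classes:
BaseDataOps(df_input, copy_input) – Manage base read/write operations for caproj.data module classes
Module variables:
log – logging.getLogger instance for module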
class caproj.data.base.BaseDataOps(df_input, copy_input)
Manage base read/write operations for caproj.data module classes
- Variables
df – pandas.DataFrame working copy, either read in from file or from an existing object by using the BaseData initializing class methods BaseDataOps.from_file() or BaseDataOps.from_object()
df_input – pandas.DataFrame original input copy, not operated upon by any class methods, and only created if the copy_input parameter is set to True during BaseDataOps.from_file() or from_object() class creation
Class methods:
BaseDataOps.from_file(filename[, copy_input]) – Invoke BaseData class and read csv into pandas.DataFrame
BaseDataOps.from_object(input_object[, …]) – Invoke BaseData and read dataframe from in-memory object
BaseDataOps.to_file(target_filename, …) – Save current version of BaseDataOps.df to file in .csv format
BaseDataOps.log_record_count([id_col]) – Log number of records and unique projects in BaseDataOps.df
BaseDataOps.lint_colnames() – Normalize column name format using underscore (‘_’) as a separator
BaseDataOps.rename_columns([map_dict, json_path]) – Map existing column names to new names based on input dictionary
BaseDataOps.set_dtypes([map_dict, …]) – Map and convert columns to specified data types
classmethod from_file(filename, copy_input=False, **read_kwargs)
Invoke BaseData class and read csv into pandas.DataFrame
- Parameters
filename – str filename of .csv file to be read
copy_input – bool to specify whether self.df_input persists
read_kwargs – optional args to pandas.read_csv() or pandas.read_excel()
- Returns
pandas.DataFrame and copy_input bool as class attributes
- Raises
TypeError – if the filename is not a .csv filetype
classmethod from_object(input_object, copy_input=False)
Invoke BaseData and read dataframe from in-memory object
Input objects can be either (a) an existing BaseData object, in which case the pandas.DataFrame stored within that object will be read, or (b) a simple pandas.DataFrame object.
- Parameters
input_object – object to be read into BaseData
copy_input – bool to specify whether self.df_input persists
- Returns
pandas.DataFrame and copy_input bool as class variables
- Raises
Exceptions – if the input_object is neither a pandas.DataFrame nor a BaseData object with an existing BaseDataOps.df attribute
lint_colnames()
Normalize column name format using underscore (‘_’) as a separator
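The kind of normalization described here can be sketched with the standard library alone. This is only an illustration, not the caproj implementation: the helper name lint_colname is hypothetical, and both the lowercasing and the rule that any run of non-alphanumeric characters becomes a single underscore are assumptions.

```python
import re


def lint_colname(name):
    """Hypothetical helper: normalize one column name (assumed rules)."""
    # Collapse every run of non-alphanumeric characters into one underscore
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    # Drop leading/trailing underscores; lowercasing is an assumption
    return cleaned.strip("_").lower()


columns = ["Project ID", "Budget-Start", " Phase.Name "]
linted = [lint_colname(col) for col in columns]
print(linted)
```

With a pandas.DataFrame, the same function could be applied to each entry of df.columns.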
log_record_count(id_col='PID')
Log number of records and unique projects in BaseDataOps.df
rename_columns(map_dict=None, json_path=None)
Map existing column names to new names based on input dictionary
A simple wrapper for the pandas DataFrame.rename method
- Parameters
map_dict (dict, optional) – column name mapping {current_value: new_value}, defaults to None
json_path (str, optional) – file path to json file storing the desired map_dict, defaults to None
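As a rough illustration of how the two parameters might relate, the sketch below loads a {current_value: new_value} mapping either from a dictionary or from a JSON file and applies it to a plain list of column names. The helper names load_rename_map and rename, the sample column names, and the choice to let map_dict take precedence over json_path are all assumptions made for this example; the actual method operates on BaseDataOps.df.

```python
import json
import os
import tempfile


def load_rename_map(map_dict=None, json_path=None):
    """Hypothetical helper: return a {current_name: new_name} mapping."""
    if map_dict is not None:
        # Assumption: an explicit dictionary wins over a JSON file
        return dict(map_dict)
    if json_path is not None:
        with open(json_path, encoding="utf-8") as fp:
            return json.load(fp)
    return {}


def rename(columns, mapping):
    # Names absent from the mapping are left unchanged, as with DataFrame.rename
    return [mapping.get(col, col) for col in columns]


# Write a small JSON mapping file to a temporary directory for the demo
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "colnames.json")
    with open(path, "w", encoding="utf-8") as fp:
        json.dump({"PID": "project_id", "Budget": "budget_usd"}, fp)
    mapping = load_rename_map(json_path=path)

renamed = rename(["PID", "Budget", "Phase"], mapping)
print(renamed)
```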
set_dtypes(map_dict=None, json_path=None, coerce=False)
Map and convert columns to specified data types
Internally, this function uses the pandas.to_* data type conversion methods.
- Parameters
map_dict ([type], optional) – [description], defaults to None
json_path ([type], optional) – [description], defaults to None
coerce (bool, optional) – [description], defaults to False
sort_values(by, ascending=True, na_position='last', ignore_index=False)
Sort dataframe records by specified columns
A simple implementation of the pandas sort_values method. This operation is performed in place on the stored BaseDataOps.df attribute, so no objects are returned.
- Parameters
by (str or list of str) – Name or list of names to sort by.
ascending (bool or list of bool, optional) – Sort ascending vs. descending. Specify a list for multiple sort orders. A list of bools must match the length of by; defaults to True
na_position ({‘first’, ‘last’}, optional) – Puts NaNs at the beginning if first; last puts NaNs at the end, defaults to ‘last’
ignore_index (bool, optional) – If True, the resulting axis will be labeled 0, 1, …, n - 1; defaults to False
to_file(target_filename, **to_csv_kwargs)
Save current version of BaseDataOps.df to file in .csv format
- Parameters
target_filename – str filename to which csv should be written
to_csv_kwargs – optional args to pandas.DataFrame.to_csv()
caproj.data.base.log = <Logger caproj.data.base (WARNING)>
logging.getLogger instance for module
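caproj.data.clean
This module contains the BaseData mixin class used to clean the NYC Capital Projects dataset.
Module classes:
CleanMixin – BaseData mixin class methods for cleansing the NYC capital projects dataset
Module variables:
log – logging.getLogger instance for module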
class caproj.data.clean.CleanMixin
BaseData mixin class methods for cleansing the NYC capital projects dataset
check_duplicates(columns)
Check column or combination of columns for duplicate records
Todo
- use pandas duplicated method with keep=False to identify dupe rows
- log # of duplicated rows and indices of duplicated values
- use value_counts to identify how many dupes exist per value with more than one occurrence
concat_values(columns, to_colname)
Add new column of values generated by concatenating other columns’ values
Todo
- Add option to set desired length for each column’s values
- Add option to add separator between values
- Parameters
columns (str or list of str) – name(s) of column(s) for which values should be concatenated
to_colname (str) – name of resulting column containing the concatenated values
remove_missing_records(columns)
Delete records with missing values in specified columns
- Parameters
columns (str or list of str) – column name(s) for columns to check for blank values, any record with a blank value in the columns will be dropped from dataframe
caproj.data.clean.log = <Logger caproj.data.clean (WARNING)>
logging.getLogger instance for module
caproj.features
This module contains functionality for generating engineered features to be used in modeling for this project.
Module functions:
placeholder() – Placeholder function to illustrate testing and docs generation
caproj.features.placeholder()
Placeholder function to illustrate testing and docs generation
caproj.models
This module contains functionality for generating engineered features to be used in modeling for this project.
Module functions:
placeholder() – Placeholder function to illustrate testing and docs generation
caproj.models.placeholder()
Placeholder function to illustrate testing and docs generation
caproj.visualizations
This module contains functions for visualizing data and model results.
FUNCTIONS
- save_plot()
Save a matplotlib plot to file
- plot_barplot()
Generate a horizontal barplot from a pandas value_counts series
caproj.visualizations.plot_barplot(value_counts, title, height=6, varname=None, color='k', label_space=0.01, savepath=None)
Generate a horizontal barplot from a pandas value_counts series
- Parameters
value_counts – pd.Series object generated by pandas value_counts() method
title – string, the printed title of the plot
height – integer, the desired height of the plot (default is 6)
color – string, the matplotlib color name for the color you would like for the plotted bars (default is ‘k’ or black)
label_space – float, a coefficient used to space the count label an appropriate distance from the plotted bar (default is 0.01)
savepath – string or none, specifies filepath at which to save the resulting barplot. If None, nothing is saved. (Default is None)
- Returns
a matplotlib plot is generated; no objects are returned
caproj.visualizations.save_plot(plt_object, savepath=None)
Save a matplotlib plot to file
- Parameters
plt_object – matplotlib.pyplot plot object
savepath – string or None, specifies filepath at which to save the matplotlib plot. If None, nothing is saved. (Default is None)
- Returns
No objects are returned
caproj.logger
This module contains logging-related features for the caproj package.
Module functions:
logfunc – Wrap function call to provide log information when function is called
start_logging – Set up logging configuration for caproj package
caproj.logger.logfunc(orig_func=None, log=None, funcname=False, argvals=False, docdescr=False, runtime=False)
Wrap function call to provide log information when function is called
This function acts as a functools.wraps decorator for functions or methods, logging details of the decorated function when it is called
- Parameters
orig_func – NoneType placeholder parameter
log – logging.getLogger object for logging, default is None
funcname – boolean indicating whether to log name of function, default is False
argvals – boolean indicating whether to log function arguments, default is False
docdescr – boolean indicating whether to log function docstring short description, default is False
runtime – boolean indicating whether to log function execution runtime in seconds, default is False
- Returns
functools.wraps wrapper function
- Example
    log = logging.getLogger(__name__)

    @logfunc(log=log, funcname=True, runtime=True)
    def some_function(arg1, **kwargs):
        pass
Note
All logfunc logs are generated at the ‘INFO’ logging level
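To illustrate why an orig_func placeholder parameter is useful, here is a minimal sketch of a decorator that works both bare and with keyword arguments. It is not the caproj source: the name logfunc_sketch is hypothetical, only the funcname and runtime options are modeled, the log message wording is invented, and falling back to a logger named after the function's module when log is None is an assumption.

```python
import functools
import logging
import time


def logfunc_sketch(orig_func=None, log=None, funcname=False, runtime=False):
    """Illustrative stand-in for logfunc; not the caproj implementation."""

    def decorate(func):
        # Assumption: fall back to a module-named logger when none is given
        logger = log or logging.getLogger(func.__module__)

        @functools.wraps(func)  # preserves __name__ and __doc__ of func
        def wrapper(*args, **kwargs):
            if funcname:
                logger.info("calling %s", func.__name__)
            start = time.perf_counter()
            result = func(*args, **kwargs)
            if runtime:
                elapsed = time.perf_counter() - start
                logger.info("%s ran in %.4f seconds", func.__name__, elapsed)
            return result

        return wrapper

    # Bare use (@logfunc_sketch) passes the function in as orig_func;
    # parameterized use (@logfunc_sketch(...)) leaves orig_func as None.
    if orig_func is None:
        return decorate
    return decorate(orig_func)


logging.basicConfig(level=logging.INFO)
log = logging.getLogger("sketch")


@logfunc_sketch(log=log, funcname=True, runtime=True)
def add(a, b):
    """Add two numbers."""
    return a + b


result = add(2, 3)
```

Because the wrapper is built with functools.wraps, the decorated function keeps its original name and docstring.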
caproj.logger.start_logging(default_path='logging.json', default_level='INFO', env_key='LOG_CFG')
Set up logging configuration for caproj package
- Parameters
default_path – string file path for json formatted logging configuration file (default is ‘logging.json’)
default_level – string indicating the default level for logging, accepts the following values: ‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’, ‘CRITICAL’ (default is ‘INFO’)
env_key – string indicating environment key if one exists (default is ‘LOG_CFG’)
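The three parameters suggest a common setup pattern: an environment variable may point at a JSON configuration file, and a basic configuration is used when no file is found. The sketch below shows that pattern under those assumptions; it is not the caproj implementation. The name start_logging_sketch, the returned boolean, and the demo's file name and environment key are hypothetical.

```python
import json
import logging
import logging.config
import os


def start_logging_sketch(default_path="logging.json", default_level="INFO",
                         env_key="LOG_CFG"):
    """Illustrative stand-in; returns True if a JSON config file was applied."""
    # Assumption: a path set in the environment overrides the default path
    path = os.getenv(env_key) or default_path
    if os.path.isfile(path):
        with open(path, "rt", encoding="utf-8") as fp:
            logging.config.dictConfig(json.load(fp))
        return True
    # No config file found: fall back to a basic configuration
    logging.basicConfig(level=getattr(logging, default_level))
    return False


# Hypothetical path and environment key, chosen so that no file is found
used_file = start_logging_sketch(
    default_path="no_such_logging_config.json",
    env_key="CAPROJ_SKETCH_LOG_CFG",
)
```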
caproj.cli
This module contains the command line app for caproj.
Why does this file exist, and why not put this in __main__?
You might be tempted to import things from __main__ later, but that will cause problems: the code will get executed twice:
- When you run python -m caproj, python will execute __main__.py as a script. That means there won’t be any caproj.__main__ in sys.modules.
- When you import __main__ it will get executed again (as a module) because there’s no caproj.__main__ in sys.modules.
Also see (1) from http://click.pocoo.org/5/setuptools/#setuptools-integration
Module functions:
main([args]) – Placeholder function for testing and to illustrate caproj cli
caproj.cli.main(args=None)
Placeholder function for testing and to illustrate caproj cli
Prints list of input arguments
- Parameters
args – str or NoneType, default is None
Example:
>> python -m caproj foo bar
['foo', 'bar']
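A placeholder entry point with this behavior could look like the sketch below. It is an assumption-laden illustration rather than the caproj source: falling back to sys.argv[1:] when args is None and returning the list in addition to printing it are choices made for this example.

```python
import sys


def main(args=None):
    """Illustrative stand-in: print and return the list of input arguments."""
    # Assumption: with no explicit args, use the command-line arguments
    if args is None:
        args = sys.argv[1:]
    arg_list = list(args)
    print(arg_list)
    return arg_list


main(["foo", "bar"])
```

Keeping this function in a module other than __main__.py lets other code import it without running the package a second time, which is the concern raised above.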