API Reference
caproj.data
This submodule contains the BaseData class, which aggregates caproj.data mixin classes and BaseDataOps class instantiation functionality.
class caproj.data.BaseData(df_input, copy_input)
Inherit core class functionality from the BaseDataOps parent class and act as the core data operations class in which methods of specialized mixin classes are combined.
caproj.data.base
This module contains the core caproj.data read and write functionality.
Module classes:
BaseDataOps | Manage base read/write operations for caproj.data module classes
Module variables:
log | logging.getLogger instance for module
class caproj.data.base.BaseDataOps(df_input, copy_input)
Manage base read/write operations for caproj.data module classes
- Variables
df – pandas.DataFrame working copy, either read in from file or taken from an existing object by using the BaseData initializing class methods BaseDataOps.from_file() or BaseDataOps.from_object()
df_input – pandas.DataFrame original input copy, not operated upon by any class methods, and only created if the copy_input parameter is set to True during BaseDataOps.from_file() or from_object() class creation
Class methods:
BaseDataOps.from_file(filename[, copy_input]) | Invoke BaseData class and read csv into pandas.DataFrame
BaseDataOps.from_object(input_object[, …]) | Invoke BaseData and read dataframe from in-memory object
BaseDataOps.to_file(target_filename, …) | Save current version of BaseDataOps.df to file in .csv format
BaseDataOps.log_record_count([id_col]) | Log number of records and unique projects in BaseDataOps.df
BaseDataOps.lint_colnames() | Normalize column name format using underscore (‘_’) as a separator
BaseDataOps.rename_columns([map_dict, json_path]) | Map existing column names to new names based on input dictionary
BaseDataOps.set_dtypes([map_dict, …]) | Map and convert columns to specified data types
classmethod from_file(filename, copy_input=False, **read_kwargs)
Invoke BaseData class and read csv into pandas.DataFrame
- Parameters
filename – str filename of .csv file to be read
copy_input – bool to specify whether self.df_input persists
read_kwargs – optional args to pandas.read_csv() or pandas.read_excel()
- Returns
pandas.DataFrame and copy_input bool as class attributes
- Raises
TypeError – if the filename is not a .csv filetype
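The alternative-constructor pattern described for from_file() can be sketched as follows. This is a minimal illustration, not the caproj source: the class name DataSketch is hypothetical, and the extension check and the handling of copy_input are assumptions inferred from the parameter descriptions above.

```python
import os
import tempfile

import pandas as pd


class DataSketch:
    """Hypothetical stand-in for BaseDataOps; not the caproj source."""

    def __init__(self, df_input, copy_input):
        # Working copy that later methods would operate on
        self.df = df_input.copy()
        # Untouched original, kept only when requested (assumed behavior)
        self.df_input = df_input if copy_input else None

    @classmethod
    def from_file(cls, filename, copy_input=False, **read_kwargs):
        # Reject anything that is not a .csv file, as the Raises entry describes
        if not str(filename).lower().endswith(".csv"):
            raise TypeError("filename must be a .csv file")
        return cls(pd.read_csv(filename, **read_kwargs), copy_input)


# Usage: write a tiny csv to a temporary directory and read it back
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "projects.csv")
    pd.DataFrame({"PID": [1, 2], "Budget": [10.0, 20.0]}).to_csv(path, index=False)
    data = DataSketch.from_file(path, copy_input=True)
```

Returning cls(...) rather than a hard-coded class name means a subclass that inherits from_file() gets an instance of itself, which is what allows a parent class method to "invoke" the child class.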
classmethod from_object(input_object, copy_input=False)
Invoke BaseData and read dataframe from in-memory object
Input objects can be either (a) an existing BaseData object, in which case the pandas.DataFrame stored within that object will be read, or (b) a simple pandas.DataFrame object.
- Parameters
input_object – object to be read into BaseData
copy_input – bool to specify whether self.df_input persists
- Returns
pandas.DataFrame and copy_input bool as class variables
- Raises
Exceptions – if the input_object is neither a pandas.DataFrame nor a BaseData object with an existing BaseDataOps.df attribute
lint_colnames()
Normalize column name format using underscore (‘_’) as a separator
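The exact normalization rules applied by lint_colnames() are not documented here. The sketch below shows one plausible interpretation, in which every run of non-alphanumeric characters becomes a single underscore. The helper name lint_colname and the sample column names are hypothetical.

```python
import re


def lint_colname(name):
    """Hypothetical normalizer; the rules caproj actually applies may differ."""
    # Replace every run of non-alphanumeric characters with a single underscore
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    # Drop leading/trailing underscores left over from punctuation
    return cleaned.strip("_")


columns = ["Project Name", "Budget-Forecast", " Date Reported As Of "]
linted = [lint_colname(col) for col in columns]
print(linted)
```

With a pandas.DataFrame, the same helper could be applied by assigning the list comprehension result back to df.columns.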
log_record_count(id_col='PID')
Log number of records and unique projects in BaseDataOps.df
rename_columns(map_dict=None, json_path=None)
Map existing column names to new names based on input dictionary
A simple wrapper for the pandas DataFrame.rename method
- Parameters
map_dict (dict, optional) – column name mapping {current_value: new_value}, defaults to None
json_path (str, optional) – file path to json file storing the desired map_dict, defaults to None
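Because rename_columns() is described as a wrapper around DataFrame.rename, the two input routes can be illustrated with plain pandas. This is a sketch under assumptions: the column names and the json file name are made up, and the way caproj resolves map_dict versus json_path internally is not shown in this reference.

```python
import json
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"PID": [1, 2], "Budget Forecast": [10.0, 20.0]})

# Route 1: pass the mapping directly as {current_value: new_value}
renamed = df.rename(columns={"Budget Forecast": "Budget_Forecast"})

# Route 2: store the same mapping in a json file and load it first
with tempfile.TemporaryDirectory() as tmpdir:
    json_path = os.path.join(tmpdir, "colnames.json")
    with open(json_path, "w") as fp:
        json.dump({"Budget Forecast": "Budget_Forecast"}, fp)
    with open(json_path) as fp:
        map_dict = json.load(fp)

renamed_from_json = df.rename(columns=map_dict)
```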
set_dtypes(map_dict=None, json_path=None, coerce=False)
Map and convert columns to specified data types
Internally, this function uses the pandas .to_* data type conversion methods.
- Parameters
map_dict ([type], optional) – [description], defaults to None
json_path ([type], optional) – [description], defaults to None
coerce (bool, optional) – [description], defaults to False
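The parameters of set_dtypes() are undocumented above, so the following is only a guess at the idea. It assumes a mapping of the form {column: type keyword} and shows how the pandas to_numeric and to_datetime functions behave with errors="coerce". The keywords "numeric" and "datetime" are hypothetical and are not taken from caproj.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Budget": ["100", "250", "n/a"],
        "Start": ["2019-01-01", "2019-02-15", "2019-03-01"],
    }
)

# Assumed mapping format: {column name: target type keyword}
map_dict = {"Budget": "numeric", "Start": "datetime"}
converters = {"numeric": pd.to_numeric, "datetime": pd.to_datetime}

coerce = True
for column, kind in map_dict.items():
    # errors="coerce" turns unparseable values into NaN/NaT instead of raising
    errors = "coerce" if coerce else "raise"
    df[column] = converters[kind](df[column], errors=errors)
```

With coerce set to False in this sketch, the "n/a" entry would raise a ValueError rather than silently becoming NaN.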
sort_values(by, ascending=True, na_position='last', ignore_index=False)
Sort dataframe records by specified columns
A simple implementation of the pandas sort_values method. This operation is performed in place on the BaseDataOps.df stored attribute. Therefore, no objects are returned.
- Parameters
by (str or list of str) – Name or list of names to sort by.
ascending (bool or list of bool, optional) – Sort ascending vs. descending. Specify a list for multiple sort orders. If this is a list of bools, it must match the length of by; defaults to True
na_position ({‘first’, ‘last’}, optional) – ‘first’ puts NaNs at the beginning; ‘last’ puts NaNs at the end; defaults to ‘last’
ignore_index (bool, optional) – If True, the resulting axis will be labeled 0, 1, …, n - 1; defaults to False
to_file(target_filename, **to_csv_kwargs)
Save current version of BaseDataOps.df to file in .csv format
- Parameters
target_filename – str filename to which csv should be written
to_csv_kwargs – optional args to pandas.DataFrame.to_csv()
caproj.data.base.log = <Logger caproj.data.base (WARNING)>
logging.getLogger instance for module
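caproj.data.clean
This module contains the BaseData mixin class used to clean the NYC Capital Projects dataset
Module classes:
CleanMixin | BaseData mixin class methods for cleansing the NYC capital projects dataset
Module variables:
log | logging.getLogger instance for module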
class caproj.data.clean.CleanMixin
BaseData mixin class methods for cleansing the NYC capital projects dataset
check_duplicates(columns)
Check column or combination of columns for duplicate records
Todo
use pandas duplicated method with keep=False to identify dupe rows
log # of duplicated rows and indices of duplicated values
use value_counts to identify how many dupes exist per value with more than 1 occurrence
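The Todo items above can be sketched with plain pandas. This illustrates the listed steps only and is not the planned caproj implementation. The sample data and variable names are hypothetical, and a print call stands in for the logging step.

```python
import pandas as pd

df = pd.DataFrame({"PID": [1, 1, 2, 3, 3, 3], "Phase": list("abcdef")})
columns = ["PID"]

# keep=False flags every member of a duplicated group, not just the repeats
dupe_mask = df.duplicated(subset=columns, keep=False)
dupe_rows = df[dupe_mask]

# Count occurrences per value and keep only values seen more than once
counts = df[columns[0]].value_counts()
dupe_counts = counts[counts > 1]

print(f"{len(dupe_rows)} duplicated rows at indices {list(dupe_rows.index)}")
```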
concat_values(columns, to_colname)
Add new column of values generated by concatenating other columns’ values
Todo
Add option to set desired length for each column’s values
Add option to add separator between values
- Parameters
columns (str or list of str) – name(s) of column(s) for which values should be concatenated
to_colname (str) – name of resulting column containing the concatenated values
remove_missing_records(columns)
Delete records with missing values in specified columns
- Parameters
columns (str or list of str) – name(s) of the column(s) to check for blank values; any record with a blank value in these columns will be dropped from the dataframe
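One way to get the behavior described for remove_missing_records() is DataFrame.dropna with a subset. The sketch below assumes that "blank" means NaN or None. Whether caproj also treats empty strings as blank is not stated here. The sample data is hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"PID": [1, 2, 3], "Budget": [10.0, None, 30.0], "Notes": [None, "ok", "ok"]}
)

columns = "Budget"
# Accept a single name or a list of names, as the parameter description allows
subset = [columns] if isinstance(columns, str) else list(columns)

# Drop any record with a missing value in the checked columns only
cleaned = df.dropna(subset=subset)
```

The first record survives even though its Notes value is missing, because only the columns named in subset are checked.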
caproj.data.clean.log = <Logger caproj.data.clean (WARNING)>
logging.getLogger instance for module
caproj.features
This module contains functionality for generating engineered features to be used in modeling for this project.
Module functions:
placeholder | Placeholder function to illustrate testing and docs generation
caproj.features.placeholder()
Placeholder function to illustrate testing and docs generation
caproj.models
This module contains functionality for generating engineered features to be used in modeling for this project.
Module functions:
placeholder | Placeholder function to illustrate testing and docs generation
caproj.models.placeholder()
Placeholder function to illustrate testing and docs generation
caproj.visualizations
This module contains functions for visualizing data and model results
FUNCTIONS
- save_plot()
Save a matplotlib plot to file
- plot_barplot()
Generate a horizontal barplot from a pandas value_counts series
caproj.visualizations.plot_barplot(value_counts, title, height=6, varname=None, color='k', label_space=0.01, savepath=None)
Generate a horizontal barplot from a pandas value_counts series
- Parameters
value_counts – pd.Series object generated by pandas value_counts() method
title – string, the printed title of the plot
height – integer, the desired height of the plot (default is 6)
color – string, the matplotlib color name for the plotted bars (default is ‘k’, or black)
label_space – float, a coefficient used to space the count label an appropriate distance from the plotted bar (default is 0.01)
savepath – string or None, specifies filepath at which to save the resulting barplot. If None, nothing is saved. (Default is None)
- Returns
a matplotlib plot is generated; no objects are returned
caproj.visualizations.save_plot(plt_object, savepath=None)
Save a matplotlib plot to file
- Parameters
plt_object – matplotlib.pyplot plot object
savepath – string or None, specifies filepath at which to save the matplotlib plot. If None, nothing is saved. (Default is None)
- Returns
No objects are returned
caproj.logger
This module contains logging-related features for the caproj package
Module functions:
logfunc | Wrap function call to provide log information when function is called
start_logging | Set up logging configuration for caproj package
caproj.logger.logfunc(orig_func=None, log=None, funcname=False, argvals=False, docdescr=False, runtime=False)
Wrap function call to provide log information when function is called
This function acts as a functools.wraps decorator for functions or methods, logging details of the decorated function each time it is called
- Parameters
orig_func – NoneType placeholder parameter
log – logging.getLogger object for logging, default is None
funcname – boolean indicating whether to log name of function, default is False
argvals – boolean indicating whether to log function arguments, default is False
docdescr – boolean indicating whether to log function docstring short description, default is False
runtime – boolean indicating whether to log function execution runtime in seconds, default is False
- Returns
functools.wraps wrapper function
- Example
    import logging
    from caproj.logger import logfunc

    log = logging.getLogger(__name__)

    @logfunc(log=log, funcname=True, runtime=True)
    def some_function(arg1, **kwargs):
        pass
Note
All logfunc logs are generated at the ‘INFO’ logging level
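The orig_func=None placeholder suggests the common pattern for decorators that work both with and without arguments. The sketch below is a reduced, hypothetical version (logfunc_sketch) that covers only funcname and runtime. It is not the caproj source, and the log message wording is invented.

```python
import functools
import logging
import time


def logfunc_sketch(orig_func=None, log=None, funcname=False, runtime=False):
    """Hypothetical, reduced version of logfunc; not the caproj source."""

    def decorator(func):
        logger = log or logging.getLogger(func.__module__)

        @functools.wraps(func)  # preserve the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            if funcname:
                logger.info("calling %s", func.__name__)
            start = time.perf_counter()
            result = func(*args, **kwargs)
            if runtime:
                elapsed = time.perf_counter() - start
                logger.info("%s ran in %.4f seconds", func.__name__, elapsed)
            return result

        return wrapper

    # Used as @logfunc_sketch (no parentheses): orig_func is the function itself
    if orig_func is not None:
        return decorator(orig_func)
    # Used as @logfunc_sketch(...): return the decorator to be applied next
    return decorator


@logfunc_sketch(funcname=True, runtime=True)
def add(a, b):
    """Add two numbers."""
    return a + b


@logfunc_sketch
def double(x):
    return 2 * x
```

Both decoration styles return a wrapper that forwards the original return value, so decorated functions can be used exactly as before.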
caproj.logger.start_logging(default_path='logging.json', default_level='INFO', env_key='LOG_CFG')
Set up logging configuration for caproj package
- Parameters
default_path – string file path for json formatted logging configuration file (default is ‘logging.json’)
default_level – string indicating the default level for logging, accepts the following values: ‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’, ‘CRITICAL’ (default is ‘INFO’)
env_key – string indicating environment key if one exists (default is ‘LOG_CFG’)
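The three parameters match a widely used json-configuration recipe, in which an environment variable may override the config path and basic logging is the fallback. The sketch below assumes that recipe and has not been checked against the caproj source. The function name start_logging_sketch, the returned mode strings, and the CAPROJ_SKETCH_LOG_CFG key are hypothetical and exist only to make the behavior observable.

```python
import json
import logging
import logging.config
import os


def start_logging_sketch(
    default_path="logging.json", default_level="INFO", env_key="LOG_CFG"
):
    """Hypothetical sketch of a json-config logging setup; not the caproj source."""
    # An environment variable, when set, overrides the default config path
    path = os.getenv(env_key) or default_path
    if os.path.exists(path):
        with open(path) as fp:
            logging.config.dictConfig(json.load(fp))
        return "file"
    # No config file found: fall back to basic logging at the default level
    logging.basicConfig(level=getattr(logging, default_level))
    return "basic"


# Usage: no such config file exists here, so the fallback branch is taken
mode = start_logging_sketch(
    default_path="does_not_exist.json", env_key="CAPROJ_SKETCH_LOG_CFG"
)
```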
caproj.cli
This module contains the command line app for caproj.
Why does this file exist, and why not put this in __main__?
You might be tempted to import things from __main__ later, but that will cause problems: the code will get executed twice:
- When you run python -m caproj, python will execute __main__.py as a script. That means there won’t be any caproj.__main__ in sys.modules.
- When you import __main__ it will get executed again (as a module) because there’s no caproj.__main__ in sys.modules.
Also see (1) from http://click.pocoo.org/5/setuptools/#setuptools-integration
Module functions:
main | Placeholder function for testing and to illustrate caproj cli
caproj.cli.main(args=None)
Placeholder function for testing and to illustrate caproj cli
Prints list of input arguments
- Parameters
args – str or NoneType, default is None
Example:
    >> python -m caproj foo bar
    ['foo', 'bar']
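A placeholder entry point with this behavior can be sketched as below. The name main_sketch, the fallback to sys.argv, and the integer return code are assumptions made for illustration. Only the printing of the argument list comes from the description above.

```python
import sys


def main_sketch(args=None):
    """Hypothetical stand-in for caproj.cli.main; not the caproj source."""
    # Fall back to the real command line when no arguments are passed in
    if args is None:
        args = sys.argv[1:]
    print(list(args))
    # Conventional success exit code (an assumption, not documented behavior)
    return 0


main_sketch(["foo", "bar"])
```

Accepting args as a parameter instead of reading sys.argv directly lets tests call the function with an explicit argument list.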