basedata.ops package¶

Submodules¶

basedata.ops.base module¶

This module contains the BaseDataClass parent class and common functions that are reused across basedata.ops submodules.

class basedata.ops.base.BaseDataClass(input_df, copy_input)¶

Bases: object

BaseDataClass manages base read/write operations and instantiates self.df for child classes across the basedata.ops submodules.

classmethod from_file(filename, copy_input=False, **read_kwargs)¶

Invokes BaseDataClass and reads an input CSV or Excel file from disk into a pandas.DataFrame object.

Parameters

filename – str filename of .csv, .xls, or .xlsx file to be read
copy_input – bool to specify whether self.input_df persists
read_kwargs – optional args to pandas.read_csv() or pandas.read_excel()

Returns

pandas.DataFrame and copy_input bool as class variables

classmethod from_object(input_object, copy_input=False)¶

Invokes BaseDataClass and reads input df from similar BaseData class instance or pandas.DataFrame object.

Parameters

input_object – object to be read into BaseDataClass
copy_input – bool to specify whether self.input_df persists

Returns

pandas.DataFrame and copy_input bool as class variables

to_file(target_filename, **to_csv_kwargs)¶

Saves current version of self.df to file in csv format

Parameters

target_filename – str filename to which csv should be written
to_csv_kwargs – optional args to pandas.DataFrame.to_csv()
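
The read and write steps above can be sketched with plain pandas calls. This is a minimal illustration, assuming from_file forwards read_kwargs to pandas.read_csv() and to_file forwards to_csv_kwargs to DataFrame.to_csv(); the file name and column names are hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"id": ["001", "002"], "n": [1, 2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")  # hypothetical file name

    # Analogue of to_file(path, index=False)
    df.to_csv(path, index=False)

    # Analogue of from_file(path): pandas infers integer ids, dropping leading zeros
    default = pd.read_csv(path)

    # Analogue of from_file(path, dtype={"id": str}): ids stay as strings
    loaded = pd.read_csv(path, dtype={"id": str})
```

Passing dtype through the read keyword arguments is one way to keep zero-padded ID values intact on load.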

basedata.ops.base.inplace_return_series(dataframe, column, series, inplace, return_series, target_column=None)¶

Helper function reused throughout the library. It applies the logic for performing inplace series transformations and for returning copies of modified series.

Parameters

dataframe – pandas.DataFrame for which we are modifying series
column – str name of target column for our series
series – pandas.Series
inplace – bool whether we wish to overwrite existing column with series
return_series – bool whether we wish to return a copy of the pandas.Series object

Returns

pandas.Series
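
The inplace and return logic described above might look roughly like the following. This is a sketch only, not the library's source: the function name is hypothetical, and the handling of target_column is an assumption based on how that parameter is documented for the mixin methods.

```python
import pandas as pd


def inplace_return_series_sketch(dataframe, column, series, inplace,
                                 return_series, target_column=None):
    """Hypothetical stand-in for inplace_return_series."""
    if inplace:
        # Assumption: write to target_column when given, else overwrite column
        destination = target_column if target_column is not None else column
        dataframe[destination] = series
    if return_series:
        return series.copy()
    return None


df = pd.DataFrame({"id": ["1", "2"]})
padded = df["id"].str.zfill(3)
out = inplace_return_series_sketch(
    df, "id", padded, inplace=True, return_series=True, target_column="id_padded"
)
```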

basedata.ops.base.regex_replace_value(val, val_new, pattern, val_exception=nan)¶

Replaces string value if Regex pattern is not satisfied by the input value

Parameters

val – str input value to be evaluated by Regex re.match function
pattern – str regex pattern used to identify values to replace
val_new – str replacement value if input pattern is not satisfied
val_exception – str or np.nan value to return if the input value raises an exception (default=np.nan)
val_none – str or np.nan value to return if the input value is None or np.nan (default=np.nan)

Returns

str output value based on input parameters
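
A standard-library sketch of this behavior, assuming the function relies on re.match (which anchors only at the start of the string) and treats a TypeError from non-string input as the exception case. The function name is hypothetical and the real implementation may differ.

```python
import math
import re


def regex_replace_value_sketch(val, val_new, pattern, val_exception=math.nan):
    """Return val when it matches pattern, otherwise val_new."""
    try:
        return val if re.match(pattern, val) else val_new
    except TypeError:
        # Non-string input such as None or a float cannot be matched
        return val_exception


kept = regex_replace_value_sketch("12345678", "invalid", r"[0-9]{8}$")
replaced = regex_replace_value_sketch("12-45", "invalid", r"[0-9]{8}$")
failed = regex_replace_value_sketch(None, "invalid", r"[0-9]{8}$")
```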

basedata.ops.base.regex_sub_value(val, pattern, val_sub='', val_exception=nan, val_none=nan)¶

Replaces characters in a string value based on specified Regex pattern

Parameters

val – str input value to be evaluated by Regex re.sub function
pattern – str regex pattern used to identify characters to substitute
val_sub – str value to substitute for specified input characters
val_exception – str or np.nan value to return if the input value raises an exception (default=np.nan)
val_none – str or np.nan value to return if the input value is None or np.nan (default=np.nan)

Returns

str value based on input parameters
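
The substitution logic can be sketched with re.sub from the standard library. The function name is hypothetical, and the order of the None check and the exception handling is an assumption.

```python
import math
import re


def regex_sub_value_sketch(val, pattern, val_sub="",
                           val_exception=math.nan, val_none=math.nan):
    """Substitute val_sub for every match of pattern in val."""
    # Treat None and float NaN as missing input
    if val is None or (isinstance(val, float) and math.isnan(val)):
        return val_none
    try:
        return re.sub(pattern, val_sub, val)
    except TypeError:
        # Non-string input such as an int cannot be passed to re.sub
        return val_exception


digits_only = regex_sub_value_sketch("A-12 34", r"[^0-9]")
missing = regex_sub_value_sketch(None, r"[^0-9]", val_none="missing")
bad_type = regex_sub_value_sketch(42, r"[^0-9]", val_exception="bad")
```

With the default val_sub of '', the call strips every character matched by the pattern, which is the behavior strip_nonnumeric builds on.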

basedata.ops.cols module¶

This submodule contains basedata.ops mixin classes for manipulating column values and column names.

The functionality of these mixin classes is aggregated in the basedata.ops BaseDataOps class.

class basedata.ops.cols.ColumnConversionsMixin¶

Bases: object

Mixin class methods and associated tools for converting column values to specific types

add_column(column, value)¶

Adds column to self.df and populates that column with a static value

Parameters

column – str name of new column
value – object to populate each row of the new column

Returns

None, self.df is updated inplace

apply_function(column_list, function, target_column, inplace=True, return_series=False, **kwargs)¶

Applies function to dataframe object, using pandas.DataFrame.apply() method.

Parameters

column_list – list column name(s) against which to apply function
function – function to apply to dataframe object
target_column – None or string name of new column created, if inplace=True target_column name must be specified
inplace – bool whether to make changes to self.df in place, default=True
return_series – bool whether to return modified pandas.Series object, default=False
kwargs – optional keyword args for the pandas apply method. axis=1 is required whenever the function is applied to multiple input columns

Returns

pandas.Series if return_series is specified as True
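
The axis=1 requirement for multi-column input can be seen with pandas.DataFrame.apply directly. The column names below are hypothetical, and the snippet shows only the underlying pandas call, not the method itself.

```python
import pandas as pd

df = pd.DataFrame({"first": ["Ada", "Alan"], "last": ["Lovelace", "Turing"]})

# axis=1 passes each row (as a Series) to the function, so it can read
# several columns at once
df["full"] = df[["first", "last"]].apply(
    lambda row: f"{row['first']} {row['last']}", axis=1
)
```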

check_datetime(column, dropna=False, **kwargs)¶

returns a value_counts series reporting all column values that cannot be directly converted to a datetime data type

Parameters

column – str name of column to check for datetime conversion
dropna – bool optional, whether to drop na values from resulting value_count series, default=False
kwargs – additional arguments for pandas value_counts method

Returns

pandas.Series object

check_nonnumeric(column, dropna=False, **kwargs)¶

returns a value_counts series reporting all column values that cannot be directly converted to numeric data types int or float

Parameters

column – str name of column to check for nonnumeric
dropna – bool optional, whether to drop na values from resulting value_count series, default=False
kwargs – additional arguments for pandas value_counts method

Returns

pandas.Series object
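
One way to produce such a report with plain pandas is to coerce the column with pandas.to_numeric and count the values that failed. This is a sketch under that assumption; the library may detect nonnumeric values differently.

```python
import pandas as pd

s = pd.Series(["10", "2.5", "abc", "abc", None, "7x"])

# Values that cannot be parsed become NaN under errors="coerce"
converted = pd.to_numeric(s, errors="coerce")

# Count the original values at the positions where conversion failed
nonnumeric = s[converted.isna()].value_counts(dropna=False)
```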

map_column_names(map_dict, inplace=True)¶

maps existing column names to new names based on input dictionary

acts as a simple wrapper for the pandas DataFrame.rename method, when columns are specified in that function's parameters

Parameters

map_dict – dict mapping {current_value: new_value}
inplace – bool whether to make changes to self.df in place, default=True

Returns

pandas.DataFrame if inplace is specified as False

map_values(column, map_dict, na_action=None, exhaustive=False, inplace=True, return_series=False, target_column=None)¶

maps existing column values to new value based on input dictionary

acts as a simple wrapper for the pandas Series.map method

Parameters

column – str name of column in which value will be mapped
map_dict – dict mapping {current_value: new_value}
na_action – None or ‘ignore’, if ‘ignore’ propagate NaN values without passing them to the mapping correspondence, default=None
exhaustive – bool whether or not value map is expected to affect all values in the series, if True, any values not matching map_dict will be converted to np.nan, if False, they retain their original values. default=False
inplace – bool whether to make changes to self.df in place, default=True
return_series – bool whether to return modified pandas.Series object, default=False
target_column – None or string name of new column created, if None and inplace=True, modified series replaces original column, default=None

Returns

pandas.Series if return_series is specified as True
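
The difference between the two exhaustive settings can be illustrated with pandas.Series.map. How the method implements exhaustive=False is not stated here, so the fallback lookup below is an assumption.

```python
import pandas as pd

s = pd.Series(["y", "n", "maybe"])
map_dict = {"y": True, "n": False}

# exhaustive=True analogue: values missing from map_dict become NaN
exhaustive = s.map(map_dict)

# exhaustive=False analogue: values missing from map_dict are kept as they are
non_exhaustive = s.map(lambda value: map_dict.get(value, value))
```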

report_values(column, dropna=False, **kwargs)¶

returns a value_counts series reporting all unique column values

Parameters

column – str name of column to check for unique values
dropna – bool optional, whether to drop na values from resulting value_count series, default=False
kwargs – additional arguments for pandas value_counts method

Returns

pandas.Series object

substitute_chars(column, pattern, val_sub, val_exception=nan, val_none=nan, inplace=True, return_series=False, target_column=None)¶

Strips or replaces characters from column values.

When val_sub is ‘’, this method strips the characters specified by the regex pattern. To replace the specified characters instead, pass the desired replacement characters as val_sub (e.g. val_sub=’substring’).

Parameters

column – str name of column on which to apply this operation
pattern – str Regex pattern specifying which types of characters to substitute with val_sub str
val_sub – str character(s) with which to replace pattern values
val_exception – str or numpy.nan value to return for instances where an exception is raised, default=numpy.nan
val_none – str or numpy.nan value to return for instances of either None or np.nan, default=numpy.nan
inplace – bool whether to make changes to self.df in place, default=True
return_series – bool whether to return modified pandas.Series object, default=False
target_column – None or string name of new column created, if None and inplace=True, modified series replaces original column, default=None

Returns

pandas.Series if return_series is specified as True

to_datetime(column, coerce=True, inplace=True, return_series=False, target_column=None)¶

wrapper for pandas to_datetime method, which converts column values to a datetime data type

Parameters

column – str name of column to convert to datetime
coerce – bool optional, specifies whether to ‘coerce’ non-convertible values to numpy.nan if True or to leave those values as is if False, default=True
inplace – bool whether to make changes to self.df in place, default=True
return_series – bool whether to return modified pandas.Series object, default=False
target_column – None or string name of new column created, if None and inplace=True, modified series replaces original column, default=None

Returns

pandas.Series if return_series is specified as True
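
The coerce=True case corresponds to calling pandas.to_datetime with errors="coerce", sketched below. Note that pandas itself marks unparseable datetime values as NaT, its datetime counterpart of NaN; how the wrapper maps the coerce flag onto pandas arguments is an assumption.

```python
import pandas as pd

s = pd.Series(["2021-01-15", "not a date", None])

# Unparseable and missing values become NaT instead of raising an error
parsed = pd.to_datetime(s, errors="coerce")
```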

to_numeric(column, coerce=True, inplace=True, return_series=False, target_column=None)¶

wrapper for pandas to_numeric method, which converts column values to a numeric data type (int or float)

Parameters

column – str name of column to convert to numeric
coerce – bool optional, specifies whether to ‘coerce’ non-convertible values to numpy.nan if True or to leave those values as is if False, default=True
inplace – bool whether to make changes to self.df in place, default=True
return_series – bool whether to return modified pandas.Series object, default=False
target_column – None or string name of new column created, if None and inplace=True, modified series replaces original column, default=None

Returns

pandas.Series if return_series is specified as True

basedata.ops.ids module¶

This submodule contains basedata.ops mixin classes for cleaning unique ID values.

The functionality of these mixin classes is aggregated in the basedata.ops BaseDataOps class.

class basedata.ops.ids.DedupeMixin¶

Bases: object

Mixin class methods used to inspect dataframe objects for duplicate key values, analyze duplicate key records, and to remove duplicate key records once identified.

drop_dupes(column, index_list, validate=True)¶

Drops rows in self.df based on the input index_list values and prints a message if any duplicate values remain in the specified column.

Parameters

index_list – list indices to be dropped
validate – bool raises exception if duplicates still remain

flush_duperecords()¶

Deletes self.duperecords dictionary from class __dict__ to free memory

report_dupes(column, to_file=None, return_df=True)¶

Builds a dataframe consisting of the records associated with duplicate values in the specified column and saves a csv of that dataframe to file if specified.

In the saved csv, duplicate record indices are saved in the column ‘index_id’.

Parameters

column – str name of column to check for duplicate values
to_file – str optional filename if a csv of the duplicates dataframe should be saved. Default is None.
return_df – bool indicates whether or not to return dataframe

Returns

pandas.DataFrame of all records associated with column dupes
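
A report-then-drop workflow of this kind can be sketched with plain pandas. The data is hypothetical, and the use of duplicated(keep=False) and rename_axis to build the ‘index_id’ column is an assumption about one reasonable implementation, not the library's code.

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "b", "a", "c", "b"], "val": [1, 2, 3, 4, 5]})

# keep=False flags every member of a duplicate group, not only the repeats
dupes = df[df["id"].duplicated(keep=False)].sort_values("id", kind="stable")

# Carry the original row index into a column named index_id
report = dupes.rename_axis("index_id").reset_index()

# After reviewing the report, drop the unwanted rows by index
deduped = df.drop(index=[2, 4])

# Validation step: count duplicates that remain in the key column
remaining = deduped["id"].duplicated().sum()
```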

class basedata.ops.ids.ValidIDsMixin¶

Bases: object

functions for validating and modifying ID values

drop_blankID_rows(column)¶

drops all rows from self.df where the chosen column value is a nan value

changes to self.df are made inplace and the df index is reset to contiguous values 0-n.
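
In plain pandas, the equivalent operation is a dropna on the ID column followed by an index reset. The column names below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", None, "c"], "val": [1, 2, 3]})

# Drop rows with a missing id, then renumber the index from 0
cleaned = df.dropna(subset=["id"]).reset_index(drop=True)
```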

remove_offlenIDs(column, target_len=8, pattern='[0-9]', val_new=nan, val_exception=nan, inplace=True, return_series=False, target_column=None)¶

removes all IDs not matching the desired character length and replaces them with a chosen replacement value

Parameters

target_len – int specifying length of a valid id, default=8
pattern – str Regex pattern specifying which types of characters to substitute with val_sub str, default=’[0-9]’
val_new – str or numpy.nan value with which to replace values of incorrect length, default=numpy.nan
val_exception – str or numpy.nan value to return for instances where an exception is raised, default=numpy.nan
inplace – bool whether to make changes to self.df in place, default=True
return_series – bool whether to return modified pandas.Series object, default=False

Returns

pandas.Series of column values after replacing offlenIDs

replace_blankIDs(column, replace_col, inplace=True, return_series=False, target_column=None)¶

replaces all blank ID values with the corresponding values from a different column in the same dataframe

Parameters

replace_col – str name of column with replacement values
inplace – bool make changes to self.df inplace, default=True
return_series – bool whether to return modified pandas.Series object, default=False
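
Filling blank IDs from a second column maps naturally onto pandas.Series.fillna with another Series. This sketch assumes "blank" means a missing (NaN or None) value; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", None, "c"], "alt_id": ["x", "y", "z"]})

# fillna aligns on the index, so each blank id takes the alt_id from its own row
filled = df["id"].fillna(df["alt_id"])
```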

report_offlenIDs(column, target_len=8, dropna=False)¶

generate a value_counts report with all of the column values with number of characters not matching the specified target length

Parameters

target_len – int specifying length of a valid id, default=8
dropna – bool indicating whether to include np.nan values in the output value_counts series, default=False

Returns

pandas.Series of the IDs not matching the target_len
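
A length check of this kind can be sketched with the pandas string accessor. The sample IDs are hypothetical, and the library may compute lengths differently.

```python
import pandas as pd

ids = pd.Series(["12345678", "1234", "123456789", "1234"])
target_len = 8

# Select values whose character count differs from target_len, then count them
off_len = ids[ids.str.len() != target_len].value_counts(dropna=False)
```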

strip_nonnumeric(column, pattern='[^0-9]', val_sub='', val_exception=nan, val_none=nan, inplace=True, return_series=False, target_column=None)¶

strips nonnumeric characters from column values

Parameters

pattern – str Regex pattern specifying which types of characters to substitute with val_sub str, optional, default=’[^0-9]’
val_sub – str character(s) with which to replace pattern values, default = ‘’
val_exception – str or numpy.nan value to return for instances where an exception is raised
val_none – str or numpy.nan value to return for instances of either None or np.nan
inplace – bool whether to make changes to self.df in place, default=True
return_series – bool whether to return modified pandas.Series object, default=False

Returns

pandas.Series if return_series is specified as True

Module contents¶

This submodule contains the BaseDataOps class, which aggregates basedata.ops mixin classes and BaseDataClass functionality.

class basedata.ops.BaseDataOps(input_df, copy_input)¶

Bases: basedata.ops.cols.ColumnConversionsMixin, basedata.ops.ids.DedupeMixin, basedata.ops.ids.ValidIDsMixin, basedata.ops.base.BaseDataClass

The BaseDataOps class inherits core class functionality from the BaseDataClass parent class and acts as the core data operations class in which the methods of the many specialized mixin classes are combined.
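
The aggregation pattern itself can be shown with a small, library-independent sketch. All class names below are hypothetical stand-ins; the only detail taken from the documentation is the base order, with mixins listed first and the data-holding base class last.

```python
class BaseSketch:
    """Stands in for BaseDataClass: owns the shared state."""

    def __init__(self, records):
        self.records = list(records)


class AddColumnMixinSketch:
    """Mixin that adds behavior but defines no state of its own."""

    def add_column(self, column, value):
        for record in self.records:
            record[column] = value


class CountMixinSketch:
    def report_values(self, column):
        counts = {}
        for record in self.records:
            key = record.get(column)
            counts[key] = counts.get(key, 0) + 1
        return counts


class OpsSketch(AddColumnMixinSketch, CountMixinSketch, BaseSketch):
    """Mixins first, base class last, mirroring the Bases order above."""


ops = OpsSketch([{"id": "a"}, {"id": "a"}, {"id": "b"}])
ops.add_column("source", "demo")
counts = ops.report_values("id")
```

Because the mixins only read and write state created by the base class, each one can be developed and tested separately and then combined in a single class.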
