basedata.ops package

Submodules

basedata.ops.base module

This module contains the BaseDataClass parent class and common functions that are reused across basedata.ops submodules.

class basedata.ops.base.BaseDataClass(input_df, copy_input)

Bases: object

BaseDataClass manages base read/write operations and instantiates self.df for child classes across basedata.ops submodule classes

classmethod from_file(filename, copy_input=False, **read_kwargs)

Invokes BaseDataClass and reads input csv or excel from disk into a pandas.DataFrame object.

Parameters
  • filename – str filename of .csv, .xls, or .xlsx file to be read

  • copy_input – bool to specify whether self.input_df persists

  • read_kwargs – optional args to pandas.read_csv() or pandas.read_excel()

Returns

pandas.DataFrame and copy_input bool as class variables

classmethod from_object(input_object, copy_input=False)

Invokes BaseDataClass and reads input df from similar BaseData class instance or pandas.DataFrame object.

Parameters
  • input_object – object to be read into BaseDataClass

  • copy_input – bool to specify whether self.input_df persists

Returns

pandas.DataFrame and copy_input bool as class variables

to_file(target_filename, **to_csv_kwargs)

Saves current version of self.df to file in csv format

Parameters
  • target_filename – str filename to which csv should be written

  • to_csv_kwargs – optional args to pandas.DataFrame.to_csv()
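The read/write round trip described above can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: the helper name read_table and the assumption that the reader is chosen from the file extension are hypothetical.

```python
import os
import tempfile

import pandas as pd


def read_table(filename, **read_kwargs):
    """Hypothetical helper: pick a pandas reader from the file extension."""
    extension = os.path.splitext(filename)[1].lower()
    if extension == ".csv":
        return pd.read_csv(filename, **read_kwargs)
    if extension in (".xls", ".xlsx"):
        return pd.read_excel(filename, **read_kwargs)
    raise ValueError(f"unsupported file type: {extension!r}")


# Round trip through a temporary csv file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "records.csv")
    original = pd.DataFrame({"id": ["001", "002"], "score": [1.5, 2.0]})
    # Analogous to to_file(target_filename, **to_csv_kwargs).
    original.to_csv(path, index=False)
    # Analogous to from_file(filename, **read_kwargs); dtype keeps leading zeros.
    restored = read_table(path, dtype={"id": str})
```

Passing dtype through the keyword arguments matters for ID columns, since the default type inference would otherwise read "001" as the integer 1.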

basedata.ops.base.inplace_return_series(dataframe, column, series, inplace, return_series, target_column=None)

Helper function reused throughout the library. It applies the logic for performing inplace series transformations and returning copies of modified series.

Parameters
  • dataframe – pandas.DataFrame for which we are modifying series

  • column – str name of target column for our series

  • series – pandas.Series

  • inplace – bool whether we wish to overwrite existing column with series

  • return_series – bool whether we wish to return a copy of the pandas.Series object

Returns

pandas.Series
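One way the inplace and return_series flags might interact is sketched here. The function is a hypothetical re-creation written for illustration; in particular, treating target_column as the name of the column to write when it is given is an assumption, since the parameter is not described above.

```python
import pandas as pd


def inplace_return_series_sketch(dataframe, column, series, inplace,
                                 return_series, target_column=None):
    """Hypothetical re-creation of the inplace / return_series logic."""
    if inplace:
        # Assumption: target_column, when given, names the column to write;
        # otherwise the original column is overwritten.
        dataframe[target_column or column] = series
    if return_series:
        # Return a copy so later changes do not touch the dataframe.
        return series.copy()
    return None


df = pd.DataFrame({"id": [" a", "b "]})
stripped = df["id"].str.strip()
result = inplace_return_series_sketch(
    df, "id", stripped, inplace=True, return_series=True,
    target_column="id_clean",
)
```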

basedata.ops.base.regex_replace_value(val, val_new, pattern, val_exception=nan)

Replaces string value if Regex pattern is not satisfied by the input value

Parameters
  • val – str input value to be evaluated by Regex re.match function

  • pattern – str regex pattern used to identify values to replace

  • val_new – str replacement value if input pattern is not satisfied

  • val_exception – str or np.nan value to return if input value raises an exception (default=np.nan)

  • val_none – str or np.nan value to return if input value is none or np.nan (default=np.nan)

Returns

str output value based on input parameters
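The behavior described for this function can be approximated with the standard library alone. The sketch below is an assumption about the logic, not the library's code; it follows the documented signature, which has no val_none argument, so a None input is simply treated as an exception case.

```python
import math
import re


def regex_replace_value_sketch(val, val_new, pattern, val_exception=math.nan):
    """Keep val when it matches pattern, otherwise return val_new."""
    try:
        # re.match anchors at the start of the string only.
        return val if re.match(pattern, val) else val_new
    except Exception:
        # Non-string input (None, numbers) makes re.match raise TypeError.
        return val_exception


kept = regex_replace_value_sketch("12345678", "INVALID", r"[0-9]{8}$")
replaced = regex_replace_value_sketch("12A45678", "INVALID", r"[0-9]{8}$")
failed = regex_replace_value_sketch(None, "INVALID", r"[0-9]{8}$")
```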

basedata.ops.base.regex_sub_value(val, pattern, val_sub='', val_exception=nan, val_none=nan)

Replaces characters in a string value based on specified Regex pattern

Parameters
  • val – str input value to be evaluated by Regex re.sub function

  • pattern – str regex pattern used to identify characters to substitute

  • val_sub – str value to substitute for specified input characters

  • val_exception – str or np.nan value to return if input value raises an exception (default=np.nan)

  • val_none – str or np.nan value to return if input value is none or np.nan (default=np.nan)

Returns

str value based on input parameters
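A minimal sketch of the three outcomes listed in the parameters (substituted value, val_none, val_exception), using only re.sub from the standard library. The function name and the order of the checks are assumptions.

```python
import math
import re


def regex_sub_value_sketch(val, pattern, val_sub="",
                           val_exception=math.nan, val_none=math.nan):
    """Substitute val_sub for every match of pattern in val."""
    # None and float NaN are handled before any regex work.
    if val is None or (isinstance(val, float) and math.isnan(val)):
        return val_none
    try:
        return re.sub(pattern, val_sub, val)
    except Exception:
        # For example, an int input makes re.sub raise TypeError.
        return val_exception


digits_only = regex_sub_value_sketch("ID-123 456", r"[^0-9]")
missing = regex_sub_value_sketch(None, r"[^0-9]")
bad_type = regex_sub_value_sketch(12345, r"[^0-9]", val_exception="ERROR")
```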

basedata.ops.cols module

This submodule contains basedata.ops mixin classes for manipulating column values and column names.

The functionality of these mixin classes is aggregated in the basedata.ops BaseDataOps class.

class basedata.ops.cols.ColumnConversionsMixin

Bases: object

Mixin class methods and associated tools for converting column values to specific types

add_column(column, value)

Adds column to self.df and populates that column with a static value

Parameters
  • column – str name of new column

  • value – object to populate each row of the new column

Returns

None, self.df is updated inplace

apply_function(column_list, function, target_column, inplace=True, return_series=False, **kwargs)

Applies function to dataframe object, using pandas.DataFrame.apply() method.

Parameters
  • column_list – list column name(s) against which to apply function

  • function – function to apply to dataframe object

  • target_column – None or string name of new column created, if inplace=True target_column name must be specified

  • inplace – bool whether to make changes to self.df in place, default=True

  • return_series – bool whether to return modified pandas.Series object, default=False

  • kwargs – optional keyword args for the pandas apply method. axis=1 is required whenever the function is applied to multiple input columns

Returns

pandas.Series if return_series is specified as True
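The axis=1 requirement can be seen with plain pandas. This sketch calls pandas.DataFrame.apply directly rather than the method documented above; the column names and the join_names function are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"first": ["Ada", "Alan"], "last": ["Lovelace", "Turing"]})


def join_names(row):
    """Combine two columns of a single row into one string."""
    return f"{row['first']} {row['last']}"


# axis=1 passes one row at a time, so the function can read several columns.
# Without it, apply would pass each column as a whole Series instead.
df["full_name"] = df[["first", "last"]].apply(join_names, axis=1)
```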

check_datetime(column, dropna=False, **kwargs)

returns a value_counts series reporting all column values that cannot be directly converted to a datetime data type

Parameters
  • column – str name of column to check for datetime conversion

  • dropna – bool optional, whether to drop na values from resulting value_count series, default=False

  • kwargs – additional arguments for pandas value_counts method

Returns

pandas.Series object

check_nonnumeric(column, dropna=False, **kwargs)

returns a value_counts series reporting all column values that cannot be directly converted to numeric data types int or float

Parameters
  • column – str name of column to check for nonnumeric

  • dropna – bool optional, whether to drop na values from resulting value_count series, default=False

  • kwargs – additional arguments for pandas value_counts method

Returns

pandas.Series object
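A report of this kind can be approximated with plain pandas, as a sketch under the assumption that "cannot be directly converted" means pandas.to_numeric fails on the value. Missing values are counted too, because they also come back as NaN from the coerced conversion.

```python
import pandas as pd

values = pd.Series(["10", "2.5", "n/a", "n/a", "12a", None])

# Values that cannot be parsed become NaN when errors="coerce".
converted = pd.to_numeric(values, errors="coerce")

# Keep the original values wherever conversion failed, then count them.
report = values[converted.isna()].value_counts(dropna=False)
```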

map_column_names(map_dict, inplace=True)

maps existing column names to new names based on input dictionary

acts as a simple wrapper for the pandas DataFrame.rename method, when columns are specified in that function's parameters

Parameters
  • map_dict – dict mapping {current_value: new_value}

  • inplace – bool whether to make changes to self.df in place, default=True

Returns

pandas.DataFrame if inplace is specified as False

map_values(column, map_dict, na_action=None, exhaustive=False, inplace=True, return_series=False, target_column=None)

maps existing column values to new value based on input dictionary

acts as a simple wrapper for the pandas Series.map method

Parameters
  • column – str name of column in which value will be mapped

  • map_dict – dict mapping {current_value: new_value}

  • na_action – None or ‘ignore’, if ‘ignore’ propagate NaN values without passing them to the mapping correspondence, default=None

  • exhaustive – bool whether or not value map is expected to affect all values in the series, if True, any values not matching map_dict will be converted to np.nan, if False, they retain their original values. default=False

  • inplace – bool whether to make changes to self.df in place, default=True

  • return_series – bool whether to return modified pandas.Series object, default=False

  • target_column – None or string name of new column created, if None and inplace=True, modified series replaces original column, default=None

Returns

pandas.Series if return_series is specified as True
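The difference between exhaustive=True and exhaustive=False can be sketched with pandas.Series.map. The fallback shown for the non-exhaustive case is one possible way to retain unmapped values and is an assumption about the behavior, not the library's code.

```python
import pandas as pd

answers = pd.Series(["Y", "N", "maybe"])
map_dict = {"Y": "yes", "N": "no"}

# Exhaustive behavior: Series.map turns values missing from map_dict into NaN.
exhaustive = answers.map(map_dict)

# Non-exhaustive behavior: keep the mapped value where a key matched,
# otherwise fall back to the original value.
non_exhaustive = exhaustive.where(answers.isin(list(map_dict)), answers)
```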

report_values(column, dropna=False, **kwargs)

returns a value_counts series reporting all unique column values

Parameters
  • column – str name of column to check for unique values

  • dropna – bool optional, whether to drop na values from resulting value_count series, default=False

  • kwargs – additional arguments for pandas value_counts method

Returns

pandas.Series object

substitute_chars(column, pattern, val_sub, val_exception=nan, val_none=nan, inplace=True, return_series=False, target_column=None)

Strips or replaces characters from column values.

When val_sub is ‘’, this method strips the characters specified by the regex pattern. If the objective is to replace the specified characters, specify the desired replacement characters using val_sub (e.g. val_sub=’substring’)

Parameters
  • column – str name of column on which to apply this operation

  • pattern – str Regex pattern specifying which types of characters to substitute with val_sub str

  • val_sub – str character(s) with which to replace pattern values

  • val_exception – str or numpy.nan value to return for instances where an exception is raised, default=numpy.nan

  • val_none – str or numpy.nan value to return for instances of either None or np.nan, default=numpy.nan

  • inplace – bool whether to make changes to self.df in place, default=True

  • return_series – bool whether to return modified pandas.Series object, default=False

  • target_column – None or string name of new column created, if None and inplace=True, modified series replaces original column, default=None

Returns

pandas.Series if return_series is specified as True

to_datetime(column, coerce=True, inplace=True, return_series=False, target_column=None)

wrapper for pandas to_datetime method, which converts column values to a datetime data type

Parameters
  • column – str name of column to convert to datetime

  • coerce – bool optional, specifies whether to ‘coerce’ non-convertible values to numpy.nan if True or to leave those values as is if False, default=True

  • inplace – bool whether to make changes to self.df in place, default=True

  • return_series – bool whether to return modified pandas.Series object, default=False

  • target_column – None or string name of new column created, if None and inplace=True, modified series replaces original column, default=None

Returns

pandas.Series if return_series is specified as True

to_numeric(column, coerce=True, inplace=True, return_series=False, target_column=None)

wrapper for pandas to_numeric method, which converts column values to a numeric data type (int or float)

Parameters
  • column – str name of column to convert to numeric

  • coerce – bool optional, specifies whether to ‘coerce’ non-convertible values to numpy.nan if True or to leave those values as is if False, default=True

  • inplace – bool whether to make changes to self.df in place, default=True

  • return_series – bool whether to return modified pandas.Series object, default=False

  • target_column – None or string name of new column created, if None and inplace=True, modified series replaces original column, default=None

Returns

pandas.Series if return_series is specified as True
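The coercing conversions that to_datetime and to_numeric wrap can be tried directly in pandas. This sketch assumes coerce=True corresponds to errors="coerce" in the underlying pandas functions; note that pandas marks a failed datetime conversion with NaT rather than a float NaN.

```python
import pandas as pd

raw_numbers = pd.Series(["1", "2.5", "abc"])
raw_dates = pd.Series(["2021-03-04", "not a date"])

# Unparseable entries become NaN instead of raising an error.
numbers = pd.to_numeric(raw_numbers, errors="coerce")

# Unparseable entries become NaT, the datetime missing-value marker.
dates = pd.to_datetime(raw_dates, errors="coerce")
```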

basedata.ops.ids module

This submodule contains basedata.ops mixin classes for cleaning unique ID values.

The functionality of these mixin classes is aggregated in the basedata.ops BaseDataOps class.

class basedata.ops.ids.DedupeMixin

Bases: object

Mixin class methods used to inspect dataframe objects for duplicate key values, analyze duplicate key records, and to remove duplicate key records once identified.

drop_dupes(column, index_list, validate=True)

Drops rows in self.df based on input index_list values and prints a message if any duplicate values remain in the specified column.

Parameters
  • index_list – list indices to be dropped

  • validate – bool raises exception if duplicates still remain

flush_duperecords()

Deletes self.duperecords dictionary from class __dict__ to free memory

report_dupes(column, to_file=None, return_df=True)

Builds a dataframe consisting of the records associated with duplicate values in the specified column, and saves a csv of that dataframe to file if specified.

In the saved csv, the indices of the duplicate records are saved in the column ‘index_id’

Parameters
  • column – str name of column to check for duplicate values

  • to_file – str optional filename if a csv of the duplicates dataframe should be saved. Default is None.

  • return_df – bool indicates whether or not to return dataframe

Returns

pandas.DataFrame of all records associated with column dupes
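The report-then-drop workflow of this mixin can be sketched with plain pandas. The example data is invented, and the way the original index is exposed as ‘index_id’ is an assumption about how the report is assembled.

```python
import pandas as pd

df = pd.DataFrame({"id": ["a1", "b2", "a1", "c3"], "value": [1, 2, 3, 4]})

# keep=False flags every record that shares a key, not just the repeats.
dupes = df[df.duplicated("id", keep=False)]

# Expose the original row index as a regular column named index_id.
report = dupes.reset_index().rename(columns={"index": "index_id"})

# After inspecting the report, drop the unwanted record by its index.
deduped = df.drop(index=[2])
```

Because the report carries the original indices, the list passed to the drop step can be chosen by inspecting the duplicate records side by side.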

class basedata.ops.ids.ValidIDsMixin

Bases: object

functions for validating and modifying ID values

drop_blankID_rows(column)

drops all rows from self.df where the chosen column value is a nan value

changes to self.df are made inplace and the df index is reset to contiguous values 0-n.

remove_offlenIDs(column, target_len=8, pattern='[0-9]', val_new=nan, val_exception=nan, inplace=True, return_series=False, target_column=None)

removes all IDs not matching the desired character length and replaces them with a chosen replacement value

Parameters
  • target_len – int specifying length of a valid id, default=8

  • pattern – str Regex pattern specifying which types of characters to substitute with val_sub str, default=’[0-9]’

  • val_new – str or numpy.nan value with which to replace values of incorrect length, default=numpy.nan

  • val_exception – str or numpy.nan value to return for instances where an exception is raised, default=numpy.nan

  • inplace – bool whether to make changes to self.df in place, default=True

  • return_series – bool whether to return modified pandas.Series object, default=False

Returns

pandas.Series of column values after replacing offlenIDs
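For a single value, the length check might look like the sketch below. How target_len and pattern are combined is not stated above, so the rule used here (a valid id is exactly target_len characters, each matching pattern) is an assumption, as is the function name.

```python
import math
import re


def remove_offlen_sketch(val, target_len=8, pattern="[0-9]",
                         val_new=math.nan, val_exception=math.nan):
    """Return val if it has the expected length, otherwise val_new."""
    try:
        # Assumption: every character must match pattern, target_len times.
        valid = re.fullmatch(f"(?:{pattern}){{{target_len}}}", val)
        return val if valid else val_new
    except Exception:
        # Non-string input such as None raises TypeError in re.fullmatch.
        return val_exception


kept = remove_offlen_sketch("12345678")
too_short = remove_offlen_sketch("1234", val_new="")
not_a_string = remove_offlen_sketch(None, val_exception="ERROR")
```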

replace_blankIDs(column, replace_col, inplace=True, return_series=False, target_column=None)

replaces all blank ID values with the corresponding values from a different column in the same dataframe

Parameters
  • replace_col – str name of column with replacement values

  • inplace – bool make changes to self.df inplace, default=True

  • return_series – bool whether to return modified pandas.Series object, default=False
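A comparable fill can be written with pandas.Series.fillna. This sketch assumes a blank ID means a missing (NaN or None) value; the column names are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a1", None, "c3"],
    "legacy_id": ["x1", "x2", "x3"],
})

# Rows with a missing id take the value from the same row of legacy_id.
filled = df["id"].fillna(df["legacy_id"])
```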

report_offlenIDs(column, target_len=8, dropna=False)

generate a value_counts report with all of the column values with number of characters not matching the specified target length

Parameters
  • target_len – int specifying length of a valid id, default=8

  • dropna – bool indicating whether to include np.nan values in the output value_counts series, default=False

Returns

pandas.Series of the IDs not matching the target_len
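A report of this kind can be approximated with the pandas string accessor. The sketch assumes the IDs are stored as strings; missing values have no length, so they land in the report unless they are dropped.

```python
import pandas as pd

ids = pd.Series(["12345678", "1234", "123456789", None])
target_len = 8

# str.len() returns NaN for missing values, which never equals target_len.
off_length = ids[ids.str.len() != target_len]

# Count each offending value, keeping the missing ones in the report.
report = off_length.value_counts(dropna=False)
```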

strip_nonnumeric(column, pattern='[^0-9]', val_sub='', val_exception=nan, val_none=nan, inplace=True, return_series=False, target_column=None)

strips nonnumeric characters from column values

Parameters
  • pattern – str Regex pattern specifying which types of characters to substitute with val_sub str, optional, default=’[^0-9]’

  • val_sub – str character(s) with which to replace pattern values, default = ‘’

  • val_exception – str or numpy.nan value to return for instances where an exception is raised

  • val_none – str or numpy.nan value to return for instances of either None or np.nan

  • inplace – bool whether to make changes to self.df in place, default=True

  • return_series – bool whether to return modified pandas.Series object, default=False

Returns

pandas.Series if return_series is specified as True

Module contents

This submodule contains the BaseDataOps class, which aggregates basedata.ops mixin classes and BaseDataClass functionality.

class basedata.ops.BaseDataOps(input_df, copy_input)

Bases: basedata.ops.cols.ColumnConversionsMixin, basedata.ops.ids.DedupeMixin, basedata.ops.ids.ValidIDsMixin, basedata.ops.base.BaseDataClass

The BaseDataOps class inherits core class functionality from the BaseDataClass parent class and acts as the core data operations class in which the methods of the many specialized mixin classes are combined.
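The mixin aggregation pattern itself can be shown in plain Python. All class names in this sketch are hypothetical stand-ins and do not reproduce the library's classes; the point is only the ordering of the bases, with mixins first and the state-holding base class last.

```python
class BaseSketch:
    """Stand-in for a base class that owns the shared state."""

    def __init__(self, records):
        self.records = list(records)


class UppercaseMixin:
    """Mixin that relies on self.records being set by the base class."""

    def uppercase(self):
        self.records = [record.upper() for record in self.records]


class DedupeSketchMixin:
    """Mixin that removes repeated records while keeping first-seen order."""

    def dedupe(self):
        self.records = list(dict.fromkeys(self.records))


class OpsSketch(UppercaseMixin, DedupeSketchMixin, BaseSketch):
    """Aggregates the mixin methods on top of the base class."""


ops = OpsSketch(["a", "A", "b"])
ops.uppercase()
ops.dedupe()
mro_names = [cls.__name__ for cls in OpsSketch.__mro__]
```

The mixins define no __init__ of their own, so construction falls through to the base class and every mixin method can assume the shared attribute exists.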