basedata.ops package¶
Submodules¶
basedata.ops.base module¶
This module contains the BaseDataClass parent class and common functions that are reused across basedata.ops submodules.
-
class
basedata.ops.base.
BaseDataClass
(input_df, copy_input)¶ Bases:
object
BaseDataClass manages base read/write operations and instantiates self.df for child classes across basedata.ops submodule classes
-
classmethod
from_file
(filename, copy_input=False, **read_kwargs)¶ Invokes BaseDataClass and reads input csv or excel from disk into a pandas.DataFrame object.
- Parameters
filename – str filename of .csv, .xls, or .xlsx file to be read
copy_input – bool to specify whether self.input_df persists
read_kwargs – optional args to pandas.read_csv() or pandas.read_excel()
- Returns
pandas.DataFrame and copy_input bool as class variables
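As a rough illustration of the extension-based read described above, the following sketch dispatches to pandas.read_csv() or pandas.read_excel() by file suffix. The function name read_tabular and the error raised for other suffixes are assumptions made for this example, not part of basedata.

```python
from pathlib import Path

import pandas as pd


def read_tabular(filename, **read_kwargs):
    """Hypothetical helper: read a .csv, .xls, or .xlsx file into a DataFrame."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(filename, **read_kwargs)
    if suffix in {".xls", ".xlsx"}:
        return pd.read_excel(filename, **read_kwargs)
    # Assumption: unsupported extensions are rejected rather than guessed at.
    raise ValueError(f"unsupported file type: {suffix!r}")
```

Extra keyword arguments are passed straight through, so options such as dtype or usecols reach the underlying pandas reader unchanged.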
-
classmethod
from_object
(input_object, copy_input=False)¶ Invokes BaseDataClass and reads input df from similar BaseData class instance or pandas.DataFrame object.
- Parameters
input_object – object to be read into BaseDataClass
copy_input – bool to specify whether self.input_df persists
- Returns
pandas.DataFrame and copy_input bool as class variables
-
to_file
(target_filename, **to_csv_kwargs)¶ Saves current version of self.df to file in csv format
- Parameters
target_filename – str filename to which csv should be written
to_csv_kwargs – optional args to pandas.DataFrame.to_csv()
-
basedata.ops.base.
inplace_return_series
(dataframe, column, series, inplace, return_series, target_column=None)¶ Helper function reused throughout the library. It applies the shared logic for writing a transformed series back to the dataframe in place and for returning a copy of the modified series
- Parameters
dataframe – pandas.DataFrame for which we are modifying series
column – str name of target column for our series
series – pandas.Series
inplace – bool whether we wish to overwrite existing column with series
return_series – bool whether we wish to return a copy of the pandas.Series object
- Returns
pandas.Series
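The inplace/return_series pattern can be sketched as below. This is a guess at the general shape of the logic, not the library's source: the name apply_series is hypothetical, and the fallback from target_column to column is an assumption based on how target_column is described for other methods in this package.

```python
import pandas as pd


def apply_series(dataframe, column, series, inplace, return_series, target_column=None):
    """Hypothetical stand-in for the inplace/return_series helper logic."""
    if inplace:
        # Assumption: write to target_column when given, otherwise overwrite column.
        dataframe[target_column or column] = series
    if return_series:
        return series.copy()
    return None


df = pd.DataFrame({"a": [1, 2]})
result = apply_series(df, "a", df["a"] * 10, inplace=True, return_series=True, target_column="a10")
```

Keeping the two flags independent lets a caller update the dataframe, receive the series, do both, or do neither.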
-
basedata.ops.base.
regex_replace_value
(val, val_new, pattern, val_exception=nan)¶ Replaces string value if Regex pattern is not satisfied by the input value
- Parameters
val – str input value to be evaluated by Regex re.match function
pattern – str regex pattern used to identify values to replace
val_new – str replacement value if input pattern is not satisfied
val_exception – str or np.nan value to return if the input value raises an exception (default=np.nan)
val_none – str or np.nan value to return if input value is none or np.nan (default=np.nan)
- Returns
str output value based on input parameters
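A minimal sketch of the "replace unless the pattern matches" behaviour, using only the standard library. The function name replace_if_no_match is hypothetical, and treating a TypeError from re.match as the exception case is an assumption.

```python
import math
import re


def replace_if_no_match(val, val_new, pattern, val_exception=math.nan):
    """Hypothetical stand-in: keep val when pattern matches, else return val_new."""
    try:
        return val if re.match(pattern, val) else val_new
    except TypeError:
        # re.match raises TypeError for non-string input such as None or a float.
        return val_exception


kept = replace_if_no_match("12345678", "invalid", r"[0-9]{8}$")
replaced = replace_if_no_match("12a", "invalid", r"[0-9]{8}$")
```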
-
basedata.ops.base.
regex_sub_value
(val, pattern, val_sub='', val_exception=nan, val_none=nan)¶ Replaces characters in a string value based on specified Regex pattern
- Parameters
val – str input value to be evaluated by Regex re.sub function
pattern – str regex pattern used to identify characters to substitute
val_sub – str value to substitute for specified input characters
val_exception – str or np.nan value to return if the input value raises an exception (default=np.nan)
val_none – str or np.nan value to return if input value is none or np.nan (default=np.nan)
- Returns
str value based on input parameters
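The substitution behaviour described above can be approximated with re.sub as follows. The name sub_chars is hypothetical, and the order of the None check and the exception handling is an assumption.

```python
import math
import re


def sub_chars(val, pattern, val_sub="", val_exception=math.nan, val_none=math.nan):
    """Hypothetical stand-in: substitute characters matching pattern with val_sub."""
    if val is None or (isinstance(val, float) and math.isnan(val)):
        return val_none
    try:
        return re.sub(pattern, val_sub, val)
    except TypeError:
        # Non-string input (for example an int) cannot be passed to re.sub.
        return val_exception


digits_only = sub_chars("AB-123", r"[^0-9]")
```

With the default val_sub of '', matching characters are stripped; any other val_sub replaces each match instead.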
basedata.ops.cols module¶
This submodule contains basedata.ops mixin classes for manipulating column values and column names.
The functionality of these mixin classes is aggregated in the basedata.ops BaseDataOps class.
-
class
basedata.ops.cols.
ColumnConversionsMixin
¶ Bases:
object
Mixin class methods and associated tools for converting column values to specific types
-
add_column
(column, value)¶ Adds column to self.df and populates that column with a static value
- Parameters
column – str name of new column
value – object to populate each row of the new column
- Returns
None, self.df is updated inplace
-
apply_function
(column_list, function, target_column, inplace=True, return_series=False, **kwargs)¶ Applies function to dataframe object, using pandas.DataFrame.apply() method.
- Parameters
column_list – list column name(s) against which to apply function
function – function to apply to dataframe object
target_column – None or string name of new column created, if inplace=True target_column name must be specified
inplace – bool whether to make changes to self.df in place, default=True
return_series – bool whether to return modified pandas.Series object, default=False
kwargs – optional keyword args for the pandas apply method. axis=1 is required whenever the function is applied to multiple input columns
- Returns
pandas.Series if return_series is specified as True
-
check_datetime
(column, dropna=False, **kwargs)¶ returns a value_counts series reporting all column values that cannot be directly converted to a datetime data type
- Parameters
column – str name of column to check for datetime conversion
dropna – bool optional, whether to drop na values from resulting value_count series, default=False
kwargs – additional arguments for pandas value_counts method
- Returns
pandas.Series object
-
check_nonnumeric
(column, dropna=False, **kwargs)¶ returns a value_counts series reporting all column values that cannot be directly converted to numeric data types int or float
- Parameters
column – str name of column to check for nonnumeric
dropna – bool optional, whether to drop na values from resulting value_count series, default=False
kwargs – additional arguments for pandas value_counts method
- Returns
pandas.Series object
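One way to produce such a report with plain pandas is to coerce the column with pandas.to_numeric and count the values that failed to convert. This is a sketch of the idea, not the method's source; the sample data and the choice to exclude missing values from the report are assumptions.

```python
import pandas as pd

values = pd.Series(["1", "2.5", "abc", None, "abc"])

# errors="coerce" turns anything that cannot be parsed into NaN.
converted = pd.to_numeric(values, errors="coerce")

# Keep only values that were present but failed to convert, then count them.
failed = converted.isna() & values.notna()
nonnumeric_report = values[failed].value_counts()
```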
-
map_column_names
(map_dict, inplace=True)¶ maps existing column names to new names based on input dictionary
acts as a simple wrapper for the pandas DataFrame.rename method, when columns are specified in that function's parameters
- Parameters
map_dict – dict mapping {current_value: new_value}
inplace – bool whether to make changes to self.df in place, default=True
- Returns
pandas.DataFrame if inplace is specified as False
-
map_values
(column, map_dict, na_action=None, exhaustive=False, inplace=True, return_series=False, target_column=None)¶ maps existing column values to new value based on input dictionary
acts as a simple wrapper for the pandas Series.map method
- Parameters
column – str name of column in which value will be mapped
map_dict – dict mapping {current_value: new_value}
na_action – None or ‘ignore’, if ‘ignore’ propagate NaN values without passing them to the mapping correspondence, default=None
exhaustive – bool whether or not value map is expected to affect all values in the series, if True, any values not matching map_dict will be converted to np.nan, if False, they retain their original values. default=False
inplace – bool whether to make changes to self.df in place, default=True
return_series – bool whether to return modified pandas.Series object, default=False
target_column – None or string name of new column created, if None and inplace=True, modified series replaces original column, default=None
- Returns
pandas.Series if return_series is specified as True
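The difference between an exhaustive and a non-exhaustive mapping can be shown with pandas.Series.map directly. The sample data is made up, and the lambda fallback is just one way to keep unmapped values; the library may implement it differently.

```python
import pandas as pd

responses = pd.Series(["Y", "N", "maybe"])
map_dict = {"Y": "yes", "N": "no"}

# Exhaustive behaviour: values missing from map_dict become NaN.
exhaustive = responses.map(map_dict)

# Non-exhaustive behaviour: values missing from map_dict keep their original value.
non_exhaustive = responses.map(lambda value: map_dict.get(value, value))
```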
-
report_values
(column, dropna=False, **kwargs)¶ returns a value_counts series reporting all unique column values
- Parameters
column – str name of column to check for unique values
dropna – bool optional, whether to drop na values from resulting value_count series, default=False
kwargs – additional arguments for pandas value_counts method
- Returns
pandas.Series object
-
substitute_chars
(column, pattern, val_sub, val_exception=nan, val_none=nan, inplace=True, return_series=False, target_column=None)¶ Strips or replaces characters from column values.
When val_sub is ‘’, this method strips the characters specified by the regex pattern. If the objective is to replace the specified characters, specify the desired replacement characters using val_sub (e.g. val_sub=’substring’)
- Parameters
column – str name of column on which to apply this operation
pattern – str Regex pattern specifying which types of characters to substitute with val_sub str
val_sub – str character(s) with which to replace pattern values
val_exception – str or numpy.nan value to return for instances where an exception is raised, default=numpy.nan
val_none – str or numpy.nan value to return for instances of either None or np.nan, default=numpy.nan
inplace – bool whether to make changes to self.df in place, default=True
return_series – bool whether to return modified pandas.Series object, default=False
target_column – None or string name of new column created, if None and inplace=True, modified series replaces original column, default=None
- Returns
pandas.Series if return_series is specified as True
-
to_datetime
(column, coerce=True, inplace=True, return_series=False, target_column=None)¶ wrapper for pandas to_datetime method, which converts column values to a datetime data type
- Parameters
column – str name of column to convert to datetime
coerce – bool optional, specifies whether to ‘coerce’ non-convertible values to numpy.nan if True or to leave those values as is if False, default=True
inplace – bool whether to make changes to self.df in place, default=True
return_series – bool whether to return modified pandas.Series object, default=False
target_column – None or string name of new column created, if None and inplace=True, modified series replaces original column, default=None
- Returns
pandas.Series if return_series is specified as True
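For reference, coercion in the underlying pandas call looks like the sketch below. It uses pandas.to_datetime directly with made-up sample values; note that pandas itself represents a coerced datetime value as NaT, which isna() reports as missing.

```python
import pandas as pd

raw_dates = pd.Series(["2021-01-05", "not a date"])

# errors="coerce" replaces unparseable entries with NaT instead of raising.
parsed = pd.to_datetime(raw_dates, errors="coerce")
```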
-
to_numeric
(column, coerce=True, inplace=True, return_series=False, target_column=None)¶ wrapper for pandas to_numeric method, which converts column values to a numeric data type (int or float)
- Parameters
column – str name of column to convert to numeric
coerce – bool optional, specifies whether to ‘coerce’ non-convertible values to numpy.nan if True or to leave those values as is if False, default=True
inplace – bool whether to make changes to self.df in place, default=True
return_series – bool whether to return modified pandas.Series object, default=False
target_column – None or string name of new column created, if None and inplace=True, modified series replaces original column, default=None
- Returns
pandas.Series if return_series is specified as True
-
basedata.ops.ids module¶
This submodule contains basedata.ops mixin classes for cleaning unique ID values.
The functionality of these mixin classes is aggregated in the basedata.ops BaseDataOps class.
-
class
basedata.ops.ids.
DedupeMixin
¶ Bases:
object
Mixin class methods used to inspect dataframe objects for duplicate key values, analyze duplicate key records, and to remove duplicate key records once identified.
-
drop_dupes
(column, index_list, validate=True)¶ Drops rows in self.df based on input index_list values and prints a message if any duplicate values remain in the specified column.
- Parameters
index_list – list indices to be dropped
validate – bool raises exception if duplicates still remain
-
flush_duperecords
()¶ Deletes self.duperecords dictionary from class __dict__ to free memory
-
report_dupes
(column, to_file=None, return_df=True)¶ Builds a dataframe consisting of records associated with duplicate values in the specified column and saves a csv of that dataframe to file if specified.
In the saved csv, the indices of the duplicate records are saved in column ‘index_id’
- Parameters
column – str name of column to check for duplicate values
to_file – str optional filename if a csv of the duplicates dataframe should be saved. Default is None.
return_df – bool indicates whether or not to return dataframe
- Returns
pandas.DataFrame of all records associated with column dupes
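A duplicate report of this kind can be approximated in plain pandas with Series.duplicated(keep=False), which flags every record that shares a key value. The sketch below uses made-up data and is not the library's implementation; it also shows dropping one of the reported indices afterwards.

```python
import pandas as pd

records = pd.DataFrame({"id": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})

# keep=False marks all members of each duplicate group, not only the repeats.
dupes = records[records["id"].duplicated(keep=False)]

# Expose the original row index as an 'index_id' column, as in the saved csv.
dupe_report = dupes.rename_axis("index_id").reset_index()

# Drop one of the reported rows by index and confirm no duplicates remain.
deduped = records.drop(index=[2])
remaining_dupes = int(deduped["id"].duplicated().sum())
```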
-
-
class
basedata.ops.ids.
ValidIDsMixin
¶ Bases:
object
functions for validating and modifying ID values
-
drop_blankID_rows
(column)¶ drops all rows from self.df where the chosen column value is a nan value
changes to self.df are made inplace and the df index is reset to contiguous values 0-n.
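The equivalent operation in plain pandas is a dropna on the chosen column followed by an index reset. This is a sketch with made-up data, not the method's source.

```python
import pandas as pd

people = pd.DataFrame({"id": ["A1", None, "C3"], "score": [1, 2, 3]})

# Drop rows whose id is missing, then renumber the index from 0.
people_clean = people.dropna(subset=["id"]).reset_index(drop=True)
```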
-
remove_offlenIDs
(column, target_len=8, pattern='[0-9]', val_new=nan, val_exception=nan, inplace=True, return_series=False, target_column=None)¶ removes all IDs not matching the desired character length and replaces them with a chosen replacement value
- Parameters
target_len – int specifying length of a valid id, default=8
pattern – str Regex pattern specifying which types of characters to substitute with val_sub str, default=’[0-9]’
val_new – str or numpy.nan value with which to replace values of incorrect length, default=numpy.nan
val_exception – str or numpy.nan value to return for instances where an exception is raised, default=numpy.nan
inplace – bool whether to make changes to self.df in place, default=True
return_series – bool whether to return modified pandas.Series object, default=False
- Returns
pandas.Series of column values after replacing offlenIDs
-
replace_blankIDs
(column, replace_col, inplace=True, return_series=False, target_column=None)¶ replaces all blank ID values with the corresponding values from a different column in the same dataframe
- Parameters
replace_col – str name of column with replacement values
inplace – bool make changes to self.df inplace, default=True
return_series – bool whether to return modified pandas.Series object, default=False
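In plain pandas, filling blank IDs from another column can be done with Series.fillna. The sketch assumes that "blank" means a missing value (None or NaN); the column names are made up.

```python
import pandas as pd

accounts = pd.DataFrame({"id": ["A1", None, "C3"], "alt_id": ["X1", "B2", "X3"]})

# Missing ids take the value from alt_id in the same row; existing ids are kept.
filled_ids = accounts["id"].fillna(accounts["alt_id"])
```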
-
report_offlenIDs
(column, target_len=8, dropna=False)¶ generate a value_counts report with all of the column values with number of characters not matching the specified target length
- Parameters
target_len – int specifying length of a valid id, default=8
dropna – bool indicating whether to include np.nan values in the output value_counts series, default=False
- Returns
pandas.Series of the IDs not matching the target_len
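The idea behind the off-length report can be illustrated without pandas by tallying values whose length differs from the target. The helper name count_off_length is hypothetical, and skipping missing values is an assumption made to keep the example small.

```python
from collections import Counter


def count_off_length(values, target_len=8):
    """Hypothetical helper: tally non-missing values whose length differs from target_len."""
    return Counter(v for v in values if v is not None and len(str(v)) != target_len)


sample_ids = ["12345678", "1234", "123456789", "1234", None]
off_length_report = count_off_length(sample_ids)
```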
-
strip_nonnumeric
(column, pattern='[^0-9]', val_sub='', val_exception=nan, val_none=nan, inplace=True, return_series=False, target_column=None)¶ strips nonnumeric characters from column values
- Parameters
pattern – str Regex pattern specifying which types of characters to substitute with val_sub str, optional, default=’[^0-9]’
val_sub – str character(s) with which to replace pattern values, default = ‘’
val_exception – str or numpy.nan value to return for instances where an exception is raised
val_none – str or numpy.nan value to return for instances of either None or np.nan
inplace – bool whether to make changes to self.df in place, default=True
return_series – bool whether to return modified pandas.Series object, default=False
- Returns
pandas.Series if return_series is specified as True
-
Module contents¶
This submodule contains the BaseDataOps class, which aggregates basedata.ops mixin classes and BaseDataClass functionality.
-
class
basedata.ops.
BaseDataOps
(input_df, copy_input)¶ Bases:
basedata.ops.cols.ColumnConversionsMixin
,basedata.ops.ids.DedupeMixin
,basedata.ops.ids.ValidIDsMixin
,basedata.ops.base.BaseDataClass
The BaseDataOps class inherits core class functionality from the BaseDataClass parent class and acts as the core data operations class in which the methods of the many specialized mixin classes are combined.
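The mixin aggregation pattern described here can be sketched in a few lines of plain Python. All class and method names below are hypothetical stand-ins; the only point is that the mixins are listed before the base class, so the combined class exposes every mixin method while the base class supplies the shared state.

```python
class Base:
    """Stand-in for a base class that owns the shared data."""

    def __init__(self, rows):
        self.rows = list(rows)


class ReportMixin:
    """Stand-in mixin: reads the state created by Base."""

    def count(self):
        return len(self.rows)


class CleanMixin:
    """Stand-in mixin: modifies the state created by Base."""

    def drop_blank(self):
        self.rows = [row for row in self.rows if row]


class Ops(ReportMixin, CleanMixin, Base):
    """Aggregates the mixins and the base class, with mixins listed first."""


ops = Ops(["a", "", None, "b"])
ops.drop_blank()
remaining = ops.count()
```

Because the mixins define no __init__ of their own, construction falls through to Base, which is what lets each mixin assume the shared attribute exists.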