nucml.model package
Submodules
nucml.model.building_utils module
-
nucml.model.building_utils.compile_and_fit(model, name, x_train, y_train, x_test, y_test, BATCH_SIZE=120, max_epochs=5, DECAY_EPOCHS=10, lr_method='plateau', initial_epoch=0, logs_dir_name='logs', append_wandb=False, verbose=0, comet=False, comet_exp=None)
Compiles and fits a TensorFlow model.
- Parameters
model (object) – TensorFlow model object.
name (str) – Name by which the TensorFlow model will be saved.
x_train (np.array) – The X-train numpy array set.
y_train (np.array) – The y-train numpy array.
x_test (np.array) – The X-test numpy array set.
y_test (np.array) – The y-test numpy array.
BATCH_SIZE (int, optional) – Batch size for the tensorflow dataset. Defaults to 120.
max_epochs (int, optional) – Max number of epochs to train for. Defaults to 5.
DECAY_EPOCHS (int, optional) – Number of epochs before slightly decreasing the learning rate. Defaults to 10.
lr_method (str, optional) – Type of learning rate adjustment method. Defaults to “plateau”.
initial_epoch (int, optional) – Initial epoch of the provided model. Defaults to 0.
logs_dir_name (str, optional) – Name of the directory where the logs will be stored. Defaults to “logs”.
append_wandb (bool, optional) – If True, the WANDB callback will be appended. Defaults to False.
verbose (int, optional) – See the TensorFlow verbosity for more information. Defaults to 0.
comet (bool, optional) – If True, the training will be under a Comet experiment. Defaults to False.
comet_exp (object, optional) – The Comet experiment by which the training will happen. Defaults to None.
- Returns
Training history object.
- Return type
object
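The lr_method="plateau" option is not described further in this reference. Assuming it refers to the common reduce-on-plateau strategy (lower the learning rate once the validation loss stops improving for a number of epochs), the idea can be sketched in plain Python. The helper name plateau_schedule and the factor and patience values are illustrative assumptions, not nucml's implementation.

```python
def plateau_schedule(val_losses, initial_lr=1e-3, factor=0.5, patience=2):
    """Return the learning rate in effect at each epoch under a reduce-on-plateau rule.

    Hypothetical helper: `factor` and `patience` are illustrative values.
    """
    lr = initial_lr
    best = float("inf")
    stale = 0      # epochs since the last improvement
    history = []
    for loss in val_losses:
        history.append(lr)          # rate used during this epoch
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:   # plateau detected: shrink the rate
                lr *= factor
                stale = 0
    return history


losses = [1.0, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7]
rates = plateau_schedule(losses)
print(rates)
```

With these inputs, the rate stays at its initial value for the first four epochs and is halved once two consecutive epochs pass without improvement.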
-
nucml.model.building_utils.compile_and_fit_lw(model, name, x_train, y_train, x_test, y_test, BATCH_SIZE=120, max_epochs=5, DECAY_EPOCHS=10, lr_method='plateau', initial_epoch=0)
Compiles and fits a TensorFlow model.
- Parameters
model (object) – TensorFlow model to fit.
name (str) – Name by which the model will be saved.
x_train (np.array) – The X-train numpy array set.
y_train (np.array) – The y-train numpy array.
x_test (np.array) – The X-test numpy array set.
y_test (np.array) – The y-test numpy array.
BATCH_SIZE (int, optional) – Batch size for the tensorflow dataset. Defaults to 120.
max_epochs (int, optional) – Max number of epochs to train for. Defaults to 5.
DECAY_EPOCHS (int, optional) – Number of epochs before slightly decreasing the learning rate. Defaults to 10.
lr_method (str, optional) – Type of learning rate adjustment method. Defaults to “plateau”.
initial_epoch (int, optional) – Initial epoch of the provided model. Defaults to 0.
- Returns
Training history object.
- Return type
object
-
nucml.model.building_utils.get_callbacks(name, logs_dir_name='logs', lr_method='plateau', patience_epochs=10, save_freq=37655, append_wandb=False)
Gets a callback list for TensorFlow training.
- Parameters
name (str) – Name of the model.
logs_dir_name (str, optional) – Name of the directory where the callback logs will be stored. Defaults to “logs”.
lr_method (str, optional) – Learning rate method to implement. Defaults to “plateau”.
patience_epochs (int, optional) – Number of epochs to wait before stopping training due to lack of validation progress. Defaults to 10.
save_freq (int, optional) – Determines how many steps are needed to create a checkpoint. Defaults to 37655 (7531*5).
append_wandb (bool, optional) – If True, the WANDB callback will be added. Defaults to False.
- Returns
List containing all TensorFlow callbacks.
- Return type
list
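The save_freq default is documented as 7531*5, which equals the 37655 shown in the signature. One possible reading, stated here as an assumption, is that 7531 is the number of batches per epoch for a particular dataset and 5 is the number of epochs between checkpoints. Under that assumption, an equivalent value for a different dataset could be derived as follows; checkpoint_steps is a hypothetical helper and not part of nucml.

```python
import math


def checkpoint_steps(n_samples, batch_size=120, every_n_epochs=5):
    """Number of training steps between checkpoints (hypothetical helper)."""
    # The last batch of an epoch may be partial, hence the ceiling.
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch * every_n_epochs


# The documented default written out as a product.
default_save_freq = 7531 * 5

# Example with a made-up dataset size of 10,000 samples.
custom_save_freq = checkpoint_steps(10_000)

print(default_save_freq, custom_save_freq)
```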
nucml.model.model_building module
-
nucml.model.model_building.compile_and_fit(model, name, x_train, y_train, x_test, y_test, BATCH_SIZE=120, max_epochs=5, DECAY_EPOCHS=10, lr_method='plateau', initial_epoch=0, logs_dir_name='logs', append_wandb=False, verbose=0, comet=False, comet_exp=None)
Compiles and fits a TensorFlow model.
- Parameters
model (object) – TensorFlow model object.
name (str) – Name by which the TensorFlow model will be saved.
x_train (np.array) – The X-train numpy array set.
y_train (np.array) – The y-train numpy array.
x_test (np.array) – The X-test numpy array set.
y_test (np.array) – The y-test numpy array.
BATCH_SIZE (int, optional) – Batch size for the tensorflow dataset. Defaults to 120.
max_epochs (int, optional) – Max number of epochs to train for. Defaults to 5.
DECAY_EPOCHS (int, optional) – Number of epochs before slightly decreasing the learning rate. Defaults to 10.
lr_method (str, optional) – Type of learning rate adjustment method. Defaults to “plateau”.
initial_epoch (int, optional) – Initial epoch of the provided model. Defaults to 0.
logs_dir_name (str, optional) – Name of the directory where the logs will be stored. Defaults to “logs”.
append_wandb (bool, optional) – If True, the WANDB callback will be appended. Defaults to False.
verbose (int, optional) – See the TensorFlow verbosity for more information. Defaults to 0.
comet (bool, optional) – If True, the training will be under a Comet experiment. Defaults to False.
comet_exp (object, optional) – The Comet experiment by which the training will happen. Defaults to None.
- Returns
Training history object.
- Return type
object
-
nucml.model.model_building.compile_and_fit_lw(model, name, x_train, y_train, x_test, y_test, BATCH_SIZE=120, max_epochs=5, DECAY_EPOCHS=10, lr_method='plateau', initial_epoch=0)
Compiles and fits a TensorFlow model.
- Parameters
model (object) – TensorFlow model to fit.
name (str) – Name by which the model will be saved.
x_train (np.array) – The X-train numpy array set.
y_train (np.array) – The y-train numpy array.
x_test (np.array) – The X-test numpy array set.
y_test (np.array) – The y-test numpy array.
BATCH_SIZE (int, optional) – Batch size for the tensorflow dataset. Defaults to 120.
max_epochs (int, optional) – Max number of epochs to train for. Defaults to 5.
DECAY_EPOCHS (int, optional) – Number of epochs before slightly decreasing the learning rate. Defaults to 10.
lr_method (str, optional) – Type of learning rate adjustment method. Defaults to “plateau”.
initial_epoch (int, optional) – Initial epoch of the provided model. Defaults to 0.
- Returns
Training history object.
- Return type
object
-
nucml.model.model_building.get_callbacks(name, logs_dir_name='logs', lr_method='plateau', patience_epochs=10, save_freq=37655, append_wandb=False)
Gets a callback list for TensorFlow training.
- Parameters
name (str) – Name of the model.
logs_dir_name (str, optional) – Name of the directory where the callback logs will be stored. Defaults to “logs”.
lr_method (str, optional) – Learning rate method to implement. Defaults to “plateau”.
patience_epochs (int, optional) – Number of epochs to wait before stopping training due to lack of validation progress. Defaults to 10.
save_freq (int, optional) – Determines how many steps are needed to create a checkpoint. Defaults to 37655 (7531*5).
append_wandb (bool, optional) – If True, the WANDB callback will be added. Defaults to False.
- Returns
List containing all TensorFlow callbacks.
- Return type
list
nucml.model.plot module
-
nucml.model.plot.dt_training(results_df, param_1='max_depth', param_2='msl', train_metric='train_mae', test_metric='test_mae', save=False, save_dir='', show=True)
Plots both the train and test loss as a function of a second feature (e.g. training steps).
- Parameters
results_df (pd.DataFrame) – Pandas DataFrame containing the train and test metric information.
param_1 (str) – Feature containing the information for a given parameter to plot.
param_2 (str) – Feature containing the information for a second parameter to plot.
train_metric (str) – Name of the feature containing the train performance metric.
test_metric (str) – Name of the feature containing the test performance metric.
save (bool, optional) – If True, the figure will be saved. Defaults to False.
save_dir (str, optional) – Path-like string where the resulting figure will be saved. Defaults to ‘’.
show (bool, optional) – If True, the image is rendered. Defaults to True.
- Returns
Plotly figure object.
- Return type
object
-
nucml.model.plot.knn_training(results_df, x_feature='id', train_metric='train_mae', val_metric='val_mae', test_metric='test_mae', save=False, save_path='', show=True)
Plots the train, validation, and test loss as a function of a given parameter (e.g. number of neighbors).
- Parameters
results_df (pd.DataFrame) – Pandas DataFrame containing the train, val, and test metric information.
x_feature (str) – Feature containing the x-axis information. Can contain information such as the training steps or parameters such as k-number.
train_metric (str) – Name of the feature containing the train performance metric.
val_metric (str) – Name of the feature containing the validation performance metric.
test_metric (str) – Name of the feature containing the test performance metric.
save (bool, optional) – If True, the figure will be saved. Defaults to False.
save_path (str, optional) – Path-like string where the resulting figure will be saved. Defaults to ‘’.
show (bool, optional) – If True, the image is rendered. Defaults to True.
- Returns
Plotly figure object.
- Return type
object
-
nucml.model.plot.train_test(df, x_feature, train_metric, test_metric, save=False, save_dir='', render_browser=False, paper=False)
Plots both the train and test loss as a function of a second feature (e.g. training steps).
- Parameters
df (pd.DataFrame) – Pandas DataFrame containing the train and test metric information.
x_feature (str) – Feature containing the x-axis information. Can contain information such as the training steps or parameters such as k-number, number of estimators, etc.
train_metric (str) – Name of the feature containing the train performance metric.
test_metric (str) – Name of the feature containing the test performance metric.
save (bool, optional) – If True, the figure will be saved. Defaults to False.
save_dir (str, optional) – Path-like string where the resulting figure will be saved. Defaults to ‘’.
render_browser (bool, optional) – If True, the figure will be rendered in a new browser tab. Defaults to False.
paper (bool, optional) – If True, the figure will be resized to fit into two-column documents. Defaults to False.
- Returns
Plotly figure object.
- Return type
object
-
nucml.model.plot.xgb_training(dictionary, save=False, show=True, title='', save_dir='')
Plots the Loss vs Number of Estimators resulting from an XGBoost training process.
- Parameters
dictionary (dict) – Dictionary generated from the XGBoost training process.
save (bool, optional) – If True, the image is saved. Defaults to False.
show (bool, optional) – If True, the image is rendered. Defaults to True.
title (str, optional) – Title to render above the plot. Defaults to “”.
save_dir (str, optional) – Path-like string where the figure will be saved. Defaults to “”.
- Returns
None
nucml.model.training module
-
nucml.model.training.train_dt(x_train, y_train, x_test, y_test, parameters_dict, save_models=False, save_dir='.')
Trains multiple DT models according to the parameters given in the parameters_dict argument. Useful for quick experimentation. For a more efficient and advanced DT training method, see the dt.py script.
- Parameters
x_train (DataFrame or np.array) – Training data.
y_train (DataFrame or np.array) – Training labels.
x_test (DataFrame or np.array) – Testing data.
y_test (DataFrame or np.array) – Testing labels.
parameters_dict (dict) – Dictionary object. Keys must be “max_depth_list”, “min_split_list”, and “min_leaf_split”. Values must be lists of parameters to test.
save_models (bool, optional) – If True, the trained models will be saved. Defaults to False.
save_dir (str, optional) – Path-like string where the trained models will be saved. Defaults to “.”.
- Returns
Contains the performance metrics for all trained models.
- Return type
DataFrame
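The parameters_dict layout described for train_dt can be illustrated with a small sketch. The key names are taken from the parameter description; the values are placeholders, and enumerating every combination with itertools.product is an assumption about how such a grid would be covered, not a statement about the internals of train_dt.

```python
from itertools import product

# Keys follow the parameter description; the values are placeholder choices.
parameters_dict = {
    "max_depth_list": [10, 20],
    "min_split_list": [2, 5],
    "min_leaf_split": [1, 3],
}

# Every combination of the three lists, in the order the keys are listed.
grid = list(product(
    parameters_dict["max_depth_list"],
    parameters_dict["min_split_list"],
    parameters_dict["min_leaf_split"],
))

for max_depth, min_split, min_leaf in grid:
    print(f"max_depth={max_depth}, min_split={min_split}, min_leaf={min_leaf}")
```

Because the number of combinations is the product of the list lengths, the lists are best kept short for quick experimentation.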
-
nucml.model.training.train_knn(x_train, y_train, x_test, y_test, k_list, save_models=False, save_dir='.')
Trains multiple KNN models given a list of k-values. Useful for quick experimentation. For a more efficient and advanced KNN training method, see the knn.py script.
- Parameters
x_train (DataFrame or np.array) – Training data.
y_train (DataFrame or np.array) – Training labels.
x_test (DataFrame or np.array) – Testing data.
y_test (DataFrame or np.array) – Testing labels.
k_list (list) – List of k-values to iterate through.
save_models (bool, optional) – If True, the trained models will be saved. Defaults to False.
save_dir (str, optional) – Path-like string where the trained models will be saved. Defaults to “.”.
- Returns
Contains performance metrics for all trained models.
- Return type
DataFrame
-
nucml.model.training.train_xgb(x_train, y_train, x_test, y_test, parameters_dict, save_models=False, save_dir='./')
Trains multiple XGBoost models according to the parameters given in the parameters_dict argument. Useful for quick experimentation. For a more efficient and advanced XGBoost training method, see the xgb.py script.
- Parameters
x_train (DataFrame or np.array) – Training data.
y_train (DataFrame or np.array) – Training labels.
x_test (DataFrame or np.array) – Testing data.
y_test (DataFrame or np.array) – Testing labels.
parameters_dict (dict) – Dictionary object. Keys must be “max_depth_list”, “num_estimator_list”, and “learning_rate_list”. Values must be lists of parameters to test.
save_models (bool, optional) – If True, the trained models will be saved. Defaults to False.
save_dir (str, optional) – Path-like string where the trained models will be saved. Defaults to “./”.
- Returns
Contains the performance metrics for all trained models.
- Return type
DataFrame
nucml.model.utilities module
-
nucml.model.utilities.cleanup_model_dir(results_df, model_dir, keep_best=True, keep_first=False)
Deletes unwanted models and scalers. Optionally keeps the best models based on training, validation, and testing performance.
- Parameters
results_df (DataFrame) – The loaded results data file created by the various training scripts.
model_dir (str) – Path-like string where all model directories are stored.
keep_best (bool, optional) – If True, it will keep three or more models based on performance. Defaults to True.
keep_first (bool, optional) – If True, it will keep the first appearance in case of duplicate rows. Defaults to False.
- Returns
None
-
nucml.model.utilities.create_error_df(identifier, error_metrics_dict)
Creates a simple dataframe from the performance metrics dictionary yielded by the regression_error_metrics() function.
- Parameters
identifier (str or int or float) – String or number used for identifying the created dataframe row.
error_metrics_dict (dict) – Dictionary containing the performance metrics.
- Returns
DataFrame
-
nucml.model.utilities.create_train_test_error_df(identifier, train_error_metrics, test_error_metrics, val_error_metrics=None)
Creates a pandas DataFrame containing the error metrics provided by both the train and test dictionaries generated by the regression_error_metrics() function. A validation error metrics dictionary can also be provided.
- Parameters
identifier (str, int) – Label used for identification of the row.
train_error_metrics (dict) – Dictionary containing the error metrics for the train set.
test_error_metrics (dict) – Dictionary containing the error metrics for the test set.
val_error_metrics (dict, optional) – Dictionary containing the error metrics for the val set. Defaults to None.
- Returns
DataFrame
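A sketch of how the train, test, and optional validation dictionaries could be combined into a single row. The prefixed column names (train_mae, val_mae, test_mae) and the id column mirror the default arguments of the plotting functions in nucml.model.plot; the helper combine_error_metrics and the exact column layout are assumptions, and a plain dict stands in for the DataFrame row.

```python
def combine_error_metrics(identifier, train, test, val=None):
    """Merge per-split metric dicts into one flat row (hypothetical helper)."""
    row = {"id": identifier}
    splits = {"train": train, "val": val, "test": test}
    for split, metrics in splits.items():
        if metrics is None:   # the validation split is optional
            continue
        for name, value in metrics.items():
            row[f"{split}_{name}"] = value
    return row


row = combine_error_metrics(
    "k=5",
    train={"mae": 0.10, "r2": 0.95},
    test={"mae": 0.14, "r2": 0.91},
)
print(row)
```

A row of this shape can be wrapped in a list and passed to pandas.DataFrame to obtain a one-row DataFrame.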
-
nucml.model.utilities.get_best_models_df(results_df, keep_first=False)
Returns a DataFrame with at least three rows containing the best models based on training, validation, and testing performance metrics. The results_df argument is based on the file generated by the Python training scripts (including knn.py, dt.py, and xgb.py), which includes results for all training iterations along with stored model and scaler paths.
- Parameters
results_df (DataFrame) – Results dataframe created by the model training scripts.
keep_first (bool, optional) – In some cases there might be duplicates. If True, this will keep the first instance of a duplicate value. Defaults to False.
- Returns
DataFrame
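The selection described for get_best_models_df can be sketched without pandas. The sketch assumes that "best" means the lowest value of an error metric such as MAE and that the columns are named train_mae, val_mae, and test_mae; best_rows is a hypothetical helper, and the way ties are handled here is only one plausible reading of keep_first.

```python
def best_rows(results, metrics=("train_mae", "val_mae", "test_mae"), keep_first=False):
    """Pick the rows with the lowest value of each metric (hypothetical helper)."""
    selected = []
    for metric in metrics:
        best = min(row[metric] for row in results)
        ties = [row for row in results if row[metric] == best]
        # With keep_first, only the first of several tied rows is kept.
        selected.extend(ties[:1] if keep_first else ties)
    return selected


results = [
    {"id": 1, "train_mae": 0.05, "val_mae": 0.20, "test_mae": 0.22},
    {"id": 2, "train_mae": 0.08, "val_mae": 0.15, "test_mae": 0.18},
    {"id": 3, "train_mae": 0.08, "val_mae": 0.15, "test_mae": 0.16},
]

all_ties = [row["id"] for row in best_rows(results)]
first_only = [row["id"] for row in best_rows(results, keep_first=True)]
print(all_ties, first_only)
```

In this made-up data two rows tie on the validation metric, so the result has four rows unless keep_first is set, which matches the "at least three rows" wording.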
-
nucml.model.utilities.load_model_and_scaler(model_scaler_info, df=True, model_only=False)
Loads both the model and scaler given a dataframe with the paths specified.
- Parameters
model_scaler_info (DataFrame or dict) – Must contain a “model_path” and a “scaler_path” feature if a DataFrame is passed. Else, it must contain the “model_path” and “scaler_path” as keys in a dictionary.
df (bool, optional) – If True, the model_scaler_info variable must be a DataFrame. If False, it must be a Python dictionary. Defaults to True.
model_only (bool, optional) – If True, the scaler will not be loaded. Only the model will be loaded. Defaults to False.
- Returns
The loaded model and scaler.
- Return type
object, object
-
nucml.model.utilities.make_predictions(data, model, model_type)
Makes predictions using a trained model. Currently handles TensorFlow, XGBoost, and scikit-learn models.
- Parameters
data (np.array) – Numpy matrix needed for model predictions. The data will be prepared using tf.data.Dataset for TensorFlow, xgb.DMatrix for xgboost, and passed as is for sklearn models.
model (object) – Trained machine learning model.
model_type (str) – Type of model being provided. Options include “tf” for TensorFlow, “xgb” for XGBoost, and “sk” for sklearn models.
- Returns
Object containing the model predictions. The type depends on the model type.
- Return type
object
-
nucml.model.utilities.regression_error_metrics(v1, v2)
Calculates the MAE, MSE, EVS, MAEM, and R2 between two vectors.
- Parameters
v1 (np.array) – First array.
v2 (np.array) – Second array.
- Returns
Dictionary containing all 5 error metrics in key:value pairs.
- Return type
dict
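The five metrics can be written out with the standard library alone. This is an illustrative sketch rather than the library's code: it assumes that MAEM denotes the median absolute error, that EVS is the explained variance score, and that the first vector holds the reference values. The function name error_metrics and the dictionary keys are hypothetical.

```python
from statistics import mean, median, pvariance


def error_metrics(y_true, y_pred):
    """Compute MAE, MSE, EVS, median absolute error, and R2 (illustrative)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    abs_res = [abs(r) for r in residuals]
    sq_res = [r ** 2 for r in residuals]
    y_mean = mean(y_true)
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return {
        "mae": mean(abs_res),
        "mse": mean(sq_res),
        # Explained variance: 1 - Var(residuals) / Var(y_true).
        "evs": 1 - pvariance(residuals) / pvariance(y_true),
        "mae_m": median(abs_res),   # assumed meaning of "MAEM"
        # Coefficient of determination: 1 - SS_res / SS_tot.
        "r2": 1 - sum(sq_res) / ss_tot,
    }


metrics = error_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```

EVS and R2 coincide in this example because the residuals have zero mean; they differ when the predictions carry a systematic offset.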
-
nucml.model.utilities.remove_unused_models(model_results_path, acedate_directory)
Finds best models in terms of train, validation and testing sets and deletes all others. It also keeps the best models in terms of multiplication factor.
WARNING: Once deleted, other models will not be accessible.
- Parameters
model_results_path (str) – Filepath to model training results CSV file generated using the model training scripts.
acedate_directory (str) – Path to the relevant directory where all models for a given algorithm are stored.
- Returns
None