nucml.model package

Submodules

nucml.model.building_utils module

nucml.model.building_utils.compile_and_fit(model, name, x_train, y_train, x_test, y_test, BATCH_SIZE=120, max_epochs=5, DECAY_EPOCHS=10, lr_method='plateau', initial_epoch=0, logs_dir_name='logs', append_wandb=False, verbose=0, comet=False, comet_exp=None)[source]

Compiles and fits a TensorFlow model.

Parameters
  • model (object) – TensorFlow model object.

  • name (str) – Name by which the TensorFlow model will be saved.

  • x_train (np.array) – The X-train numpy array set.

  • y_train (np.array) – The y-train numpy array.

  • x_test (np.array) – The X-test numpy array set.

  • y_test (np.array) – The y-test numpy array.

  • BATCH_SIZE (int, optional) – Batch size for the tensorflow dataset. Defaults to 120.

  • max_epochs (int, optional) – Max number of epochs to train for. Defaults to 5.

  • DECAY_EPOCHS (int, optional) – Number of epochs before slightly decreasing the learning rate. Defaults to 10.

  • lr_method (str, optional) – Type of learning rate adjustment method. Defaults to “plateau”.

  • initial_epoch (int, optional) – Initial epoch of the provided model. Defaults to 0.

  • logs_dir_name (str, optional) – Name of the directory where the logs will be stored. Defaults to “logs”.

  • append_wandb (bool, optional) – If True, the WANDB callback will be appended. Defaults to False.

  • verbose (int, optional) – See the TensorFlow verbosity for more information. Defaults to 0.

  • comet (bool, optional) – If True, the training will be under a Comet experiment. Defaults to False.

  • comet_exp (object, optional) – The Comet experiment by which the training will happen. Defaults to None.

Returns

Training history object.

Return type

object
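
A minimal usage sketch, assuming TensorFlow and nucml are installed. The helper name fit_small_network, the layer sizes, and the model name "small_dense" are illustrative assumptions; only the compile_and_fit signature is taken from the listing above.

```python
def fit_small_network(x_train, y_train, x_test, y_test):
    """Hypothetical helper: build a tiny Keras regressor and pass it to compile_and_fit."""
    # Imports are deferred so the sketch can be read or imported
    # without TensorFlow or nucml being available.
    import tensorflow as tf
    from nucml.model import building_utils

    # Illustrative architecture; any Keras model could take its place.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

    # Positional arguments follow the documented order; keyword values
    # repeat the documented defaults except for verbose.
    return building_utils.compile_and_fit(
        model, "small_dense", x_train, y_train, x_test, y_test,
        BATCH_SIZE=120, max_epochs=5, lr_method="plateau", verbose=1,
    )
```

The return value is the training history object described above.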

nucml.model.building_utils.compile_and_fit_lw(model, name, x_train, y_train, x_test, y_test, BATCH_SIZE=120, max_epochs=5, DECAY_EPOCHS=10, lr_method='plateau', initial_epoch=0)[source]

Compiles and fits a TensorFlow model.

Parameters
  • model (object) – TensorFlow model to fit.

  • name (str) – Name by which the model will be saved.

  • x_train (np.array) – The X-train numpy array set.

  • y_train (np.array) – The y-train numpy array.

  • x_test (np.array) – The X-test numpy array set.

  • y_test (np.array) – The y-test numpy array.

  • BATCH_SIZE (int, optional) – Batch size for the tensorflow dataset. Defaults to 120.

  • max_epochs (int, optional) – Max number of epochs to train for. Defaults to 5.

  • DECAY_EPOCHS (int, optional) – Number of epochs before slightly decreasing the learning rate. Defaults to 10.

  • lr_method (str, optional) – Type of learning rate adjustment method. Defaults to “plateau”.

  • initial_epoch (int, optional) – Initial epoch of the provided model. Defaults to 0.

Returns

Training history object.

Return type

object

nucml.model.building_utils.get_callbacks(name, logs_dir_name='logs', lr_method='plateau', patience_epochs=10, save_freq=37655, append_wandb=False)[source]

Gets a callback list for TensorFlow training.

Parameters
  • name (str) – Name of the model.

  • logs_dir_name (str, optional) – Name of the directory where the callback logs will be stored. Defaults to “logs”.

  • lr_method (str, optional) – Learning rate method to implement. Defaults to “plateau”.

  • patience_epochs (int, optional) – Number of epochs to wait before stopping training due to lack of validation progress. Defaults to 10.

  • save_freq (int, optional) – Determines how many steps are needed to create a checkpoint. Defaults to 37655 (7531*5).

  • append_wandb (bool, optional) – If True, the WANDB callback will be added. Defaults to False.

Returns

List containing all TensorFlow callbacks.

Return type

list
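
The default save_freq of 37655 equals 7531*5. One possible reading, offered here as an assumption rather than something the source states, is that 7531 is the number of batches per epoch for the authors' dataset, so that a checkpoint is written every five epochs. The hypothetical helper below recomputes a comparable value for a different dataset size or batch size.

```python
import math


def save_freq_for(n_train_samples, batch_size=120, epochs_between_checkpoints=5):
    """Return a step count that spans the requested number of epochs.

    Hypothetical helper, not part of nucml.
    """
    # Number of batches needed to see every training sample once.
    steps_per_epoch = math.ceil(n_train_samples / batch_size)
    return steps_per_epoch * epochs_between_checkpoints


# The documented default expressed as a product.
default_save_freq = 7531 * 5
```

The result could then be passed as get_callbacks(name, save_freq=...).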

nucml.model.building_utils.get_optimizer(lr_schedule)[source]

nucml.model.building_utils.get_xgboost_params(eta=0.5, gamma=0, l2=0, max_depth=30, grow_policy='depthwise', max_bin=256, determ_hist='true', objective='rmse', resume=False, gpu_id=0)[source]

nucml.model.building_utils.tf_dataset_gen(x, y, xt, yt, BUFFER_SIZE, BATCH_SIZE, N_TRAIN, gpu=False, multiplier=2, cache=False)[source]

nucml.model.model_building module

nucml.model.model_building.compile_and_fit(model, name, x_train, y_train, x_test, y_test, BATCH_SIZE=120, max_epochs=5, DECAY_EPOCHS=10, lr_method='plateau', initial_epoch=0, logs_dir_name='logs', append_wandb=False, verbose=0, comet=False, comet_exp=None)[source]

Compiles and fits a TensorFlow model.

Parameters
  • model (object) – TensorFlow model object.

  • name (str) – Name by which the TensorFlow model will be saved.

  • x_train (np.array) – The X-train numpy array set.

  • y_train (np.array) – The y-train numpy array.

  • x_test (np.array) – The X-test numpy array set.

  • y_test (np.array) – The y-test numpy array.

  • BATCH_SIZE (int, optional) – Batch size for the tensorflow dataset. Defaults to 120.

  • max_epochs (int, optional) – Max number of epochs to train for. Defaults to 5.

  • DECAY_EPOCHS (int, optional) – Number of epochs before slightly decreasing the learning rate. Defaults to 10.

  • lr_method (str, optional) – Type of learning rate adjustment method. Defaults to “plateau”.

  • initial_epoch (int, optional) – Initial epoch of the provided model. Defaults to 0.

  • logs_dir_name (str, optional) – Name of the directory where the logs will be stored. Defaults to “logs”.

  • append_wandb (bool, optional) – If True, the WANDB callback will be appended. Defaults to False.

  • verbose (int, optional) – See the TensorFlow verbosity for more information. Defaults to 0.

  • comet (bool, optional) – If True, the training will be under a Comet experiment. Defaults to False.

  • comet_exp (object, optional) – The Comet experiment by which the training will happen. Defaults to None.

Returns

Training history object.

Return type

object

nucml.model.model_building.compile_and_fit_lw(model, name, x_train, y_train, x_test, y_test, BATCH_SIZE=120, max_epochs=5, DECAY_EPOCHS=10, lr_method='plateau', initial_epoch=0)[source]

Compiles and fits a TensorFlow model.

Parameters
  • model (object) – TensorFlow model to fit.

  • name (str) – Name by which the model will be saved.

  • x_train (np.array) – The X-train numpy array set.

  • y_train (np.array) – The y-train numpy array.

  • x_test (np.array) – The X-test numpy array set.

  • y_test (np.array) – The y-test numpy array.

  • BATCH_SIZE (int, optional) – Batch size for the tensorflow dataset. Defaults to 120.

  • max_epochs (int, optional) – Max number of epochs to train for. Defaults to 5.

  • DECAY_EPOCHS (int, optional) – Number of epochs before slightly decreasing the learning rate. Defaults to 10.

  • lr_method (str, optional) – Type of learning rate adjustment method. Defaults to “plateau”.

  • initial_epoch (int, optional) – Initial epoch of the provided model. Defaults to 0.

Returns

Training history object.

Return type

object

nucml.model.model_building.get_callbacks(name, logs_dir_name='logs', lr_method='plateau', patience_epochs=10, save_freq=37655, append_wandb=False)[source]

Gets a callback list for TensorFlow training.

Parameters
  • name (str) – Name of the model.

  • logs_dir_name (str, optional) – Name of the directory where the callback logs will be stored. Defaults to “logs”.

  • lr_method (str, optional) – Learning rate method to implement. Defaults to “plateau”.

  • patience_epochs (int, optional) – Number of epochs to wait before stopping training due to lack of validation progress. Defaults to 10.

  • save_freq (int, optional) – Determines how many steps are needed to create a checkpoint. Defaults to 37655 (7531*5).

  • append_wandb (bool, optional) – If True, the WANDB callback will be added. Defaults to False.

Returns

List containing all TensorFlow callbacks.

Return type

list

nucml.model.plot module

nucml.model.plot.dt_training(results_df, param_1='max_depth', param_2='msl', train_metric='train_mae', test_metric='test_mae', save=False, save_dir='', show=True)[source]

Plots both the train and test loss as a function of a second feature (e.g. training steps).

Parameters
  • results_df (pd.DataFrame) – Pandas DataFrame containing the train and test metric information.

  • param_1 (str) – Feature containing the information for a given parameter to plot.

  • param_2 (str) – Feature containing the information for a second parameter to plot.

  • train_metric (str) – Name of the feature containing the train performance metric.

  • test_metric (str) – Name of the feature containing the test performance metric.

  • save (bool, optional) – If True, the figure will be saved. Defaults to False.

  • save_dir (str, optional) – Path-like string where the resulting figure will be saved. Defaults to ‘’.

  • show (bool, optional) – If True, the image is rendered. Defaults to True.

Returns

Plotly figure object.

Return type

object

nucml.model.plot.knn_training(results_df, x_feature='id', train_metric='train_mae', val_metric='val_mae', test_metric='test_mae', save=False, save_path='', show=True)[source]

Plots the train, validation, and test loss as a function of a given parameter (e.g. number of neighbors).

Parameters
  • results_df (pd.DataFrame) – Pandas DataFrame containing the train, val, and test metric information.

  • x_feature (str) – Feature containing the x-axis information. Can contain information such as the training steps or parameters such as k-number.

  • train_metric (str) – Name of the feature containing the train performance metric.

  • val_metric (str) – Name of the feature containing the validation performance metric.

  • test_metric (str) – Name of the feature containing the test performance metric.

  • save (bool, optional) – If True, the figure will be saved. Defaults to False.

  • save_path (str, optional) – Path-like string where the resulting figure will be saved. Defaults to ‘’.

  • show (bool, optional) – If True, the image is rendered. Defaults to True.

Returns

Plotly figure object.

Return type

object

nucml.model.plot.train_test(df, x_feature, train_metric, test_metric, save=False, save_dir='', render_browser=False, paper=False)[source]

Plots both the train and test loss as a function of a second feature (e.g. training steps).

Parameters
  • df (pd.DataFrame) – Pandas DataFrame containing the train and test metric information.

  • x_feature (str) – Feature containing the x-axis information. Can contain information such as the training steps or parameters such as k-number, number of estimators, etc.

  • train_metric (str) – Name of the feature containing the train performance metric.

  • test_metric (str) – Name of the feature containing the test performance metric.

  • save (bool, optional) – If True, the figure will be saved. Defaults to False.

  • save_dir (str, optional) – Path-like string where the resulting figure will be saved. Defaults to ‘’.

  • render_browser (bool, optional) – If True, the figure will be rendered in a new browser tab. Defaults to False.

  • paper (bool, optional) – If True, the figure will be resized to fit into two-column documents. Defaults to False.

Returns

Plotly figure object.

Return type

object

nucml.model.plot.xgb_training(dictionary, save=False, show=True, title='', save_dir='')[source]

Plots the Loss vs Number of Estimators resulting from an XGBoost training process.

Parameters
  • dictionary (dict) – dictionary generated from the XGBoost training process.

  • save (bool, optional) – If True, the image is saved. Defaults to False.

  • show (bool, optional) – If True, the image is rendered. Defaults to True.

  • title (str, optional) – Title to render above the plot. Defaults to “”.

  • save_dir (str, optional) – Path-like string where the figure will be saved. Defaults to “”.

Returns

None

nucml.model.plot.xgb_training_w_path(path_to_csv, save=False, saving_path='xgb_training.png')[source]

nucml.model.training module

nucml.model.training.train_dt(x_train, y_train, x_test, y_test, parameters_dict, save_models=False, save_dir='.')[source]

Trains multiple DT models according to the parameters given in the parameters_dict argument. Useful for quick experimentation. For a more efficient and advanced DT training method, see the dt.py script.

Parameters
  • x_train (DataFrame or np.array) – Training data.

  • y_train (DataFrame or np.array) – Training labels.

  • x_test (DataFrame or np.array) – Testing data.

  • y_test (DataFrame or np.array) – Testing labels.

  • parameters_dict (dict) – Dictionary object. Keys must be “max_depth_list”, “min_split_list”, and “min_leaf_split”. Values must be lists of parameters to test.

  • save_models (bool, optional) – If True, the trained models will be saved. Defaults to False.

  • save_dir (str, optional) – Path-like string where the trained models will be saved. Defaults to “.”.

Returns

Contains the performance metrics for all trained models.

Return type

DataFrame
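
The parameters_dict layout can be illustrated with a short sketch. The keys are the ones required above; the values are illustrative, and the assumption that train_dt evaluates every combination of the three lists is not confirmed by the source. The helper expand_grid is hypothetical and not part of nucml.

```python
from itertools import product

# Keys as required by the documentation; values chosen for illustration.
parameters_dict = {
    "max_depth_list": [10, 20, 30],
    "min_split_list": [2, 5],
    "min_leaf_split": [1, 3],
}


def expand_grid(params):
    """Return one dict per combination of the listed values (hypothetical helper)."""
    keys = list(params)
    combos = product(*(params[key] for key in keys))
    return [dict(zip(keys, combo)) for combo in combos]


grid = expand_grid(parameters_dict)
# Under the full-grid assumption, the number of models to train is
# the product of the list lengths: 3 * 2 * 2.
n_models = len(grid)
```

Counting combinations this way before a call gives a rough idea of how long a training sweep may take.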

nucml.model.training.train_knn(x_train, y_train, x_test, y_test, k_list, save_models=False, save_dir='.')[source]

Trains multiple KNN models given a list of k-values. Useful for quick experimentation. For a more efficient and advanced KNN training method, see the knn.py script.

Parameters
  • x_train (DataFrame or np.array) – Training data.

  • y_train (DataFrame or np.array) – Training labels.

  • x_test (DataFrame or np.array) – Testing data.

  • y_test (DataFrame or np.array) – Testing labels.

  • k_list (list) – List of k-values to iterate through.

  • save_models (bool, optional) – If True, the trained models will be saved. Defaults to False.

  • save_dir (str, optional) – Path-like string where the trained models will be saved. Defaults to “.”.

Returns

Contains performance metrics for all trained models.

Return type

DataFrame

nucml.model.training.train_xgb(x_train, y_train, x_test, y_test, parameters_dict, save_models=False, save_dir='./')[source]

Trains multiple XGBoost models according to the parameters given in the parameters_dict argument. Useful for quick experimentation. For a more efficient and advanced XGBoost training method, see the xgb.py script.

Parameters
  • x_train (DataFrame or np.array) – Training data.

  • y_train (DataFrame or np.array) – Training labels.

  • x_test (DataFrame or np.array) – Testing data.

  • y_test (DataFrame or np.array) – Testing labels.

  • parameters_dict (dict) – Dictionary object. Keys must be “max_depth_list”, “num_estimator_list”, and “learning_rate_list”. Values must be lists of parameters to test.

  • save_models (bool, optional) – If True, the trained models will be saved. Defaults to False.

  • save_dir (str, optional) – Path-like string where the trained models will be saved. Defaults to “./”.

Returns

Contains the performance metrics for all trained models.

Return type

DataFrame

nucml.model.utilities module

nucml.model.utilities.cleanup_model_dir(results_df, model_dir, keep_best=True, keep_first=False)[source]

Deletes unwanted models and scalers. Optionally keeps the best models based on training, validation, and testing performance.

Parameters
  • results_df (DataFrame) – The loaded results data file created by the various training scripts.

  • model_dir (str) – Path-like string where all model directories are stored.

  • keep_best (bool, optional) – If True, it will keep three or more models based on performance. Defaults to True.

  • keep_first (bool, optional) – If True, it will keep the first appearance in case of duplicate rows. Defaults to False.

Returns

None

nucml.model.utilities.create_error_df(identifier, error_metrics_dict)[source]

Creates a simple dataframe from the performance metrics dictionary yielded by the regression_error_metrics() function.

Parameters
  • identifier (str or int or float) – String or number used for identifying the created dataframe row.

  • error_metrics_dict (dict) – Dictionary containing the performance metrics.

Returns

DataFrame

nucml.model.utilities.create_train_test_error_df(identifier, train_error_metrics, test_error_metrics, val_error_metrics=None)[source]

Creates a pandas DataFrame containing the error metrics provided by both the train and test dictionaries generated by the regression_error_metrics() function. A validation error metrics dictionary can also be provided.

Parameters
  • identifier (str, int) – Label used to identify the row.

  • train_error_metrics (dict) – Dictionary containing the error metrics for the train set.

  • test_error_metrics (dict) – Dictionary containing the error metrics for the test set.

  • val_error_metrics (dict, optional) – Dictionary containing the error metrics for the val set. Defaults to None.

Returns

DataFrame
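
As a rough illustration of the idea, the sketch below merges per-split metric dictionaries into a single flat record using plain Python. The column naming scheme ("train_mae", "test_mae", and so on), the "id" key, and the helper name merge_error_metrics are assumptions for illustration; the actual function returns a pandas DataFrame whose exact columns are not specified here.

```python
def merge_error_metrics(identifier, train_metrics, test_metrics, val_metrics=None):
    """Flatten per-split metric dicts into one record (hypothetical helper)."""
    row = {"id": identifier}
    splits = [("train", train_metrics), ("val", val_metrics), ("test", test_metrics)]
    for prefix, metrics in splits:
        if metrics is None:
            # The validation dictionary is optional.
            continue
        for name, value in metrics.items():
            # Prefix each metric with the split it belongs to.
            row[f"{prefix}_{name}"] = value
    return row


row = merge_error_metrics(5, {"mae": 0.10, "mse": 0.02}, {"mae": 0.20, "mse": 0.06})
```

One record per trained model, stacked row by row, gives a table of the kind the plotting functions in nucml.model.plot consume.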

nucml.model.utilities.filter_by_parameters(results_df, param_dict)[source]

nucml.model.utilities.get_best_models_df(results_df, keep_first=False)[source]

Returns a dataframe of at least three rows containing the best models based on training, validation, and testing performance metrics. The results_df argument is based on the file generated by the Python training scripts, including knn.py, dt.py, and xgb.py, which contains the results for all training iterations along with the stored model and scaler paths.

Parameters
  • results_df (DataFrame) – Results dataframe created by the model training scripts.

  • keep_first (bool, optional) – In some cases there might be duplicates. If True, this will keep the first instance of a duplicate value. Defaults to False.

Returns

DataFrame
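
The selection described above can be sketched in plain Python: for each of the train, validation, and test metrics, take the row with the lowest value. The metric names follow the defaults used in nucml.model.plot ("train_mae", "val_mae", "test_mae"); the assumption that lower is better, the sample values, and the helper name best_rows are illustrative. Duplicate handling via keep_first is not modeled.

```python
def best_rows(results, metrics=("train_mae", "val_mae", "test_mae")):
    """Return one row per metric: the row with the lowest value (hypothetical helper)."""
    return [min(results, key=lambda row: row[metric]) for metric in metrics]


# Illustrative results table, one dict per trained model.
results = [
    {"id": 1, "train_mae": 0.05, "val_mae": 0.30, "test_mae": 0.32},
    {"id": 5, "train_mae": 0.12, "val_mae": 0.20, "test_mae": 0.25},
    {"id": 9, "train_mae": 0.18, "val_mae": 0.22, "test_mae": 0.21},
]

best = best_rows(results)
best_ids = [row["id"] for row in best]
```

Here each split favors a different model, so three distinct rows are kept. When a single model wins on every split, this sketch returns the same row three times.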

nucml.model.utilities.get_parameters_from_line(results_df, model='knn')[source]

nucml.model.utilities.load_model_and_scaler(model_scaler_info, df=True, model_only=False)[source]

Loads both the model and scaler from the paths specified in a DataFrame or dictionary.

Parameters
  • model_scaler_info (DataFrame) – Must contain a “model_path” and a “scaler_path” feature if a DataFrame is passed. Else, it must contain the “model_path” and “scaler_path” as keys in a dictionary.

  • df (bool, optional) – If True, the model_scaler_info variable must be a DataFrame. If False, it must be a python dictionary.

  • model_only (bool, optional) – If True, the scaler will not be loaded. Only the model will be loaded.

Returns

The loaded model and scaler.

Return type

object, object

nucml.model.utilities.make_predictions(data, model, model_type)[source]

Makes predictions using a trained model. Currently handles TensorFlow, XGBoost, and scikit-learn models.

Parameters
  • data (np.array) – Numpy matrix needed for model predictions. The data will be prepared using tf.data.Dataset for TensorFlow, xgb.DMatrix for xgboost, and passed as is for sklearn models.

  • model (object) – Trained machine learning model.

  • model_type (str) – Type of model being provided. Options include “tf” for TensorFlow, “xgb” for XGBoost, and “sk” for sklearn models.

Returns

Object containing the model predictions. The type depends on the model type.

Return type

object
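
The dispatch on model_type described above might look roughly like the following sketch. It is not the nucml implementation: the function name predict_with, the batch_size argument, and the MeanModel stand-in are assumptions, and the "xgb" branch assumes a native xgboost Booster. Library imports are deferred so that only the branch in use needs its dependency.

```python
def predict_with(data, model, model_type, batch_size=1024):
    """Route data to the model in the form each library expects (hypothetical helper)."""
    if model_type == "tf":
        import tensorflow as tf
        # Wrap the array in a batched tf.data.Dataset before predicting.
        dataset = tf.data.Dataset.from_tensor_slices(data).batch(batch_size)
        return model.predict(dataset)
    if model_type == "xgb":
        import xgboost as xgb
        # Native boosters predict on a DMatrix rather than a raw array.
        return model.predict(xgb.DMatrix(data))
    if model_type == "sk":
        # scikit-learn estimators take the array as is.
        return model.predict(data)
    raise ValueError(f"Unknown model_type: {model_type!r}")


class MeanModel:
    """Stand-in with a scikit-learn-style predict method, for demonstration only."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


predictions = predict_with([[1.0, 3.0], [2.0, 6.0]], MeanModel(), "sk")
```

Raising on an unknown model_type is a design choice of this sketch; how the library itself reacts to an unsupported value is not documented here.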

nucml.model.utilities.regression_error_metrics(v1, v2)[source]

Calculates the MAE, MSE, EVS, MAEM, and R2 between two vectors.

Parameters
  • v1 (np.array) – First array.

  • v2 (np.array) – Second array.

Returns

Dictionary containing all 5 error metrics in key:value pairs.

Return type

dict
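
For reference, the five metrics can be written out with the standard library alone. This is a sketch under stated assumptions: v1 is treated as the true values and v2 as the predictions, MAEM is read as the median absolute error, and the dictionary keys are hypothetical. The library's own keys and underlying implementation may differ.

```python
import statistics


def regression_metrics_sketch(v1, v2):
    """Compute MAE, MSE, EVS, median absolute error, and R2 (illustrative only)."""
    errors = [a - b for a, b in zip(v1, v2)]
    abs_errors = [abs(e) for e in errors]
    # Residual and total sums of squares for R2.
    ss_res = sum(e * e for e in errors)
    mean_v1 = statistics.fmean(v1)
    ss_tot = sum((a - mean_v1) ** 2 for a in v1)
    return {
        "mae": statistics.fmean(abs_errors),
        "mse": ss_res / len(errors),
        # Explained variance: 1 - Var(error) / Var(true values).
        "evs": 1 - statistics.pvariance(errors) / statistics.pvariance(v1),
        "mae_m": statistics.median(abs_errors),
        "r2": 1 - ss_res / ss_tot,
    }


metrics = regression_metrics_sketch([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

EVS and R2 differ only when the errors have a nonzero mean, which is the case in this small example because the single error is one-sided.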

nucml.model.utilities.remove_unused_models(model_results_path, acedate_directory)[source]

Finds best models in terms of train, validation and testing sets and deletes all others. It also keeps the best models in terms of multiplication factor.

WARNING: Once deleted, other models will not be accessible.

Parameters
  • model_results_path (str) – Filepath to model training results CSV file generated using the model training scripts.

  • acedate_directory (str) – Path to the relevant directory where all models for a given algorithm are stored.

Returns

None

Module contents