.. _basic-walkthrough-label:

.. note:: Documentation in progress.

Basic Walkthrough
=================

The cloned repository contains not only the information necessary to start generating the datasets but also a variety of tutorial notebooks that demonstrate the capabilities of NucML. Be sure to check them out. For more information on the contents, please refer to the repository README file on GitHub.

If you haven't installed NucML yet, please follow the instructions in the :ref:`Installation Guide`.

In this section we summarize the main pipeline steps for a complete nuclear data neutron-induced cross section evaluation, including loading the data, modeling, and validating using criticality benchmarks.

1. Load the Neutrons EXFOR Data
-------------------------------

The `nucml.datasets` module offers the capability of easily loading various datasets from AME, ENSDF, ENDF, and EXFOR. In this example we load the EXFOR/AME dataset for neutron-induced reactions.

.. code-block:: python

    # first we import the datasets module
    import nucml.datasets as nuc_data

    # we can then load the EXFOR data
    data, x_train, x_test, y_train, y_test, to_scale, scaler = nuc_data.load_exfor(
        mode="neutrons", log=True, low_en=True, max_en=2.0E7, num=True, basic=0, normalize=True)

There are a couple of things going on when loading the data. Setting `mode="neutrons"` means we only want datapoints for neutron-induced reactions. The `log` argument specifies that we want the Cross Section and Incident Energy data in log form; these are highly skewed features that benefit from this transformation. The `low_en` argument tells the loading function that we only want low-energy points (up to 2.0E7 eV).

The EXFOR dataset contains a lot of information including Author, Institutions, Dates, and more. This is generally not useful for an ML model. By setting `num=True` we tell the loader that we only want useful numerical and categorical data. This will automatically one-hot encode categorical variables. The `normalize=True` argument makes sure that the data is normalized using a standard scaler; other transformers are available.

The `basic` argument specifies the complexity of the dataset in terms of features included. See the documentation for more information. In this example, `basic=0` means that we only want the most basic features (Energy, Data, Z, N, A, MT, Center of Mass Flag, Target Flag).

The loader returns seven objects:

- data: The original numerical dataset.
- x_train and y_train: The training data and labels.
- x_test and y_test: The testing data and labels.
- to_scale: A list of the features subject to normalization by the scaler.
- scaler: A scikit-learn scaler object. This is the transformer used to normalize the data.

2. Building the ML Models
-------------------------

Building the model is user dependent. There are a variety of models and an infinite number of hyperparameter combinations. Tuning these models is outside the scope of this utility, but here a simple Decision Tree model is trained for demonstration purposes.

.. code-block:: python

    # importing the tree model from sklearn
    from sklearn import tree

    # initializing the model instance
    dt_model = tree.DecisionTreeRegressor(max_depth=20, min_samples_split=3, min_samples_leaf=2)

    # training the model
    dt_model.fit(x_train, y_train)
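The same training split can be reused with any other scikit-learn regressor. As a minimal sketch (the k-nearest-neighbors model and the `n_neighbors=5` value below are illustrative choices, not tuned recommendations):

.. code-block:: python

    # illustrative alternative model; n_neighbors=5 is an arbitrary, untuned value
    from sklearn.neighbors import KNeighborsRegressor

    knn_model = KNeighborsRegressor(n_neighbors=5)
    knn_model.fit(x_train, y_train)

Any model trained this way can be evaluated and saved with the same utilities shown next.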
You can evaluate the model with various performance metrics. NucML provides a convenience function that allows you to calculate all of them easily.

.. code-block:: python

    # import nucml modelling utilities
    import nucml.model.model_utilities as model_utils

    # making model predictions to calculate metrics
    y_hat_train = dt_model.predict(x_train)
    y_hat_test = dt_model.predict(x_test)

    # getting performance metrics using NucML
    train_error_metrics = model_utils.regression_error_metrics(y_hat_train, y_train)
    test_error_metrics = model_utils.regression_error_metrics(y_hat_test, y_test)

After training the model, make sure to save it for later use in the validation phase. Remember to also save the scaler. This latter object is often forgotten but is needed to transform the same or new data in later work. Make sure the model name is unique, since it is probable you will train various models of the same type.

.. note:: The model name will be used to create a directory with the same name.

.. code-block:: python

    import os

    # import joblib.dump to save the model and scaler
    from joblib import dump

    # specify the path and names for the model and scaler
    model_saving_directory = "path/to/saving/directory/"
    model_name = "dt_model_mss3_msl2_maxdepth20.joblib"
    model_saving_path = os.path.join(model_saving_directory, model_name)
    scaler_saving_path = os.path.join(model_saving_directory, 'scaler.pkl')

    # save the model and scaler
    dump(dt_model, model_saving_path)
    dump(scaler, scaler_saving_path)
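When you return to these artifacts later (for example during the validation phase), they can be restored with `joblib.load`. A minimal sketch, reusing the paths defined above; the `loaded_` names are illustrative:

.. code-block:: python

    # reload the persisted model and scaler in a later session
    # (sketch; reuses model_saving_path and scaler_saving_path from above)
    from joblib import load

    loaded_model = load(model_saving_path)
    loaded_scaler = load(scaler_saving_path)

    # the reloaded scaler can then transform the to_scale features of new data
    # exactly as it did during training
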
In the next section we need to present a `DataFrame` to the ACE utilities to create model-dependent benchmark files. The ACE module expects a certain format containing at least the performance error metrics calculated in the previous code snippet and the paths to the saved model and scaler. In this case the results `DataFrame` has only one row, since we trained only one model, but in practice it will include many more models.

.. code-block:: python

    # transform the obtained error metrics using the model utilities
    dt_results = model_utils.create_train_test_error_df(0, train_error_metrics, test_error_metrics)

    # adding the paths to dt_results
    dt_results["model_path"] = os.path.abspath(model_saving_path)
    dt_results["scaler_path"] = os.path.abspath(scaler_saving_path)

    # you can also append extra information
    dt_results["normalizer"] = "standard_scaler"
    dt_results["max_depth"] = dt_model.get_depth()
    dt_results["mss"] = 3
    dt_results["msl"] = 2

    # save the results
    results_filepath = "path/to/saving/dir/dt_results.csv"
    dt_results.to_csv(results_filepath, index=False)

There are some Python scripts included in the ML_Nuclear_Data repository to help you get started training scikit-learn and XGBoost models easily. They are a great way to experience some of the modeling capabilities of NucML and include all the code necessary to automate more of these tasks.

4. Generating Benchmark Files
-----------------------------

While the benchmark library is small, NucML allows the user to add more benchmark files by following a set of instructions. It is best practice to create benchmark files for all available benchmarks. For demonstration we use only the U-233 Jezebel criticality benchmark.

.. code-block:: python

    # import the ace utilities
    import nucml.ace.data_utilities as ace_utils

    # 1) specify the directory where all benchmark files will be created
    dt_ml_ace_dir = "DT_B0/"

    # 2) use the dt_results dataframe to generate benchmark files
    ace_utils.generate_bench_ml_xs(data, dt_results, "U233_MET_FAST_001", to_scale, dt_ml_ace_dir)

Under the hood, `generate_bench_ml_xs()` performs several steps:

1. Creates a new directory named after `model_name` within `dt_ml_ace_dir`.
2. Searches for the queried benchmark template.
3. Reads the composition and extracts the isotopes for which ML cross sections are needed.
4. Loads the model and scaler using the paths in `dt_results`.
5. Generates and processes the cross sections.
6. Creates the `.ace` files for the ML-generated cross sections and copies the other needed isotopes from the main ACE directory.
7. Creates the `.xsdir` file needed by SERPENT2.

.. note:: The benchmark name is based on the Benchmark Catalog. If a benchmark of choice is not included, you can add your own by following the instructions in the Benchmarks documentation.

Next, we can generate a `bash` script that will allow us to run all benchmark input files (in this case, just one) and transform the resulting `.m` file into a `.mat` file.

.. code-block:: python

    ace_utils.generate_serpent_bash(dt_ml_ace_dir)

The generated bash script will be saved in the same directory (`dt_ml_ace_dir`).

5. Validating the ML Models
---------------------------

.. note:: Both SERPENT2 and MATLAB need to be installed for this section to work. MCNP is not supported.

You can run your SERPENT2 calculations by running the generated `bash` script.

.. code-block:: bash

    # navigate to the path where the bash script is
    cd path/to/dt_ml_ace_dir/

    # run the script
    ./serpent_script.sh

Once finished, you can gather all results using a simple `ace_utils` method.

.. code-block:: python

    # gather results
    dt_jezebel_results = ace_utils.gather_benchmark_results(dt_ml_ace_dir)

The resulting `DataFrame` contains the model name, the benchmark name, the multiplication factor, and the error.

6. Next Steps
-------------

Congratulations, you have performed an end-to-end ML-enhanced nuclear data evaluation using the U-233 Jezebel benchmark. These are just some of the general submodules that NucML offers to help you navigate the evaluation pipeline. Try going through the :ref:`Navigating the NDE` section for more information and tutorials.