
Capstone Project - Azure Machine Learning Engineer

Index

  • 1.1 Overview of the project
  • 1.2 Architectural Diagram
  • 1.3 Dataset
  • 1.4 Automated ML
  • 1.5 Further Improvements for automl model
  • 1.6 Hyperparameter Tuning
  • 1.7 HyperDrive Configuration
  • 1.8 Further Improvements for hyperdrive model
  • 1.9 Model Deployment
  • 1.10 Consume Web Service
  • 1.11 Future improvements for project
  • 1.12 Screenshots
  • 1.13 Screen Recording

1.1 Overview of the project

This project is part of the Udacity Azure ML Nanodegree. In it, we build a machine learning model using the Python SDK and a provided Scikit-learn model. Two models are created: one using Automated ML and one customized model whose hyperparameters are tuned with HyperDrive. The best performing model is deployed as a web service to an Azure Container Instances (ACI) endpoint, which can be consumed by sending sample requests to the service.

1.2 Architectural Diagram

1.3 Dataset

The dataset used in this notebook is heart_failure_clinical_records_dataset.csv, an external dataset available on Kaggle. It contains records of 299 patients and 12 features useful for predicting an individual's mortality due to heart failure.

Overview of Dataset

  • Number of patient records collected: 299
  • Input variables or features: age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction, high_blood_pressure, platelets, serum_creatinine, serum_sodium, sex, smoking, time
  • Output/target variable: DEATH_EVENT

Task

In this project, we create a classification model that predicts mortality (the DEATH_EVENT target variable) caused by heart failure, using an external dataset.
Two trained models are created: one using Automated ML and one customized model whose hyperparameters are tuned with HyperDrive. The performance of the two models is then compared, and the best performing model is deployed as a web service whose endpoint can be consumed.

This project also demonstrates the ability to use an external dataset in the workspace.

Accessing the data

  • Import the data from the CSV file (heart_failure_clinical_records_dataset) using Azure ML's TabularDatasetFactory class.
  • Create a tabular dataset.
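
A sketch of this step, assuming the azureml-core package is installed; the CSV URL below is a placeholder for wherever the Kaggle file is hosted:

```python
# Sketch: import the CSV as a tabular dataset. The URL is a placeholder.
from azureml.data.dataset_factory import TabularDatasetFactory

data_url = "https://example.com/heart_failure_clinical_records_dataset.csv"

dataset = TabularDatasetFactory.from_delimited_files(path=data_url)
df = dataset.to_pandas_dataframe()  # materialize for inspection
```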

1.4 Automated ML

AutoML Settings - AUC_weighted is used as the primary metric for optimisation during model training, as it optimizes well for small datasets; featurisation is set to auto; max_concurrent_iterations is set to 4, the maximum number of iterations to execute in parallel; and the verbosity level is left at its default, logging.INFO, for writing to the log file.

AutoML Configuration - A classification experiment with experiment_timeout_minutes set to 15 and 2 cross-validation folds. XGBoostClassifier is a blocked model, and values for the training data and label column name parameters are also supplied.
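
A sketch of these settings as an AutoMLConfig, assuming the azureml-train-automl package is installed; train_data and compute_target are placeholders, and parameter names can vary slightly across SDK versions:

```python
# Sketch of the AutoML settings and configuration described above.
# `train_data` and `compute_target` are placeholders for objects
# created earlier in the notebook.
import logging
from azureml.train.automl import AutoMLConfig

automl_settings = {
    "primary_metric": "AUC_weighted",   # optimizes well for small datasets
    "featurization": "auto",
    "max_concurrent_iterations": 4,     # iterations executed in parallel
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(
    task="classification",
    experiment_timeout_minutes=15,
    n_cross_validations=2,
    blocked_models=["XGBoostClassifier"],
    training_data=train_data,           # tabular dataset from the import step
    label_column_name="DEATH_EVENT",
    compute_target=compute_target,
    **automl_settings,
)
```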

Results

AutoML generated a Voting Ensemble as the best model, with a best metric score of 0.9. The ensembled algorithms of this model include 'GradientBoosting', 'RandomForest' and 'LightGBM'. The best ensemble weight for these algorithms is 0.1, and the best individual pipeline score is 0.885.

Screenshot of models generated with AutoML Run and Best model shown as Voting Ensemble.

A prefitted soft voting classifier is applied: every individual classifier provides a probability value, the predictions are weighted according to each classifier's importance and summed, and the class with the greatest sum of weighted probabilities wins the vote.
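
The soft-voting rule described above can be illustrated with a small plain-Python sketch (the probability values and weights below are invented for illustration, not taken from the actual run):

```python
# Toy illustration of soft voting: each classifier outputs class
# probabilities, which are weighted and summed; the class with the
# largest weighted sum wins. All numbers here are invented.

# [P(class=0), P(class=1)] from three fitted classifiers
probabilities = {
    "GradientBoosting": [0.30, 0.70],
    "RandomForest":     [0.45, 0.55],
    "LightGBM":         [0.20, 0.80],
}
weights = {"GradientBoosting": 0.4, "RandomForest": 0.3, "LightGBM": 0.3}

n_classes = 2
weighted_sums = [
    sum(weights[name] * probs[c] for name, probs in probabilities.items())
    for c in range(n_classes)
]
prediction = max(range(n_classes), key=lambda c: weighted_sums[c])
print(prediction)  # class 1 wins the soft vote here
```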

Parameters of gradientboostingclassifier are:

  • learning_rate : Learning rate shrinks the contribution of each tree
  • max_depth : Maximum depth limits the number of nodes in the tree
  • loss : Loss function (deviance) to optimize, for classification with probabilistic outputs

Parameters of randomforestclassifier are:

  • max_features : Number of features to consider when looking for the best split
  • min_samples_leaf : Minimum number of samples required to be at a leaf node
  • n_jobs : Number of jobs to run in parallel

Parameters of lightgbmclassifier are:

  • importance_type : Type of feature importance to fill into feature_importances_; options are 'split' and 'gain'
  • n_estimators : Number of boosted trees to fit (default 100)
  • num_leaves : Maximum tree leaves for base learners

1.5 Further Improvements for automl model

This model can be improved further by specifying additional parameters for automl configuration and settings that contribute for its better performance. Some of the ways to improve could be:

  • Using an appropriate number of cross validations, which reduces bias and improves the model's ability to generalize.
  • Specifying custom ensemble behavior in the AutoMLConfig object using the ensemble parameters.
  • Enabling XGBoostClassifier, which is blocked in the AutoMLConfig due to an incompatible dependency in the SDK version (1.22.0) used for this project; XGBoost can improve performance, as it uses a more regularized model formalization that controls over-fitting.
  1. Screenshot of Run Details Widget of AutoML Run.

  2. Screenshot of model showing its RunID.

1.6 Hyperparameter Tuning

The model used for this experiment is Logistic Regression from the scikit-learn library. Logistic regression is a simple and effective algorithm for binary classification tasks, as in this case where we predict the DEATH_EVENT variable for an individual, and it gives good accuracy on simple datasets. Logistic regression uses Maximum Likelihood Estimation (MLE) to obtain the model coefficients that relate the predictors to the target.
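
As a self-contained illustration of MLE (the one-feature dataset and coefficients below are invented), the log-likelihood that logistic regression maximizes can be computed directly:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(X, y, w, b):
    """Sum of log P(y_i | x_i) under a logistic model with weights w, bias b."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        total += y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return total

# Tiny invented dataset: one feature, binary label
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

# A better-fitting coefficient yields a higher log-likelihood;
# MLE searches for the coefficients that maximize it.
print(log_likelihood(X, y, w=[2.0], b=-3.0))  # fits the data well
print(log_likelihood(X, y, w=[0.0], b=0.0))   # chance model: 4 * log(0.5)
```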

Train script python file

This Python script is used to run the hyperdrive experiment and includes a custom-coded Logistic Regression model built with sklearn.

Data Import : A tabular dataset is created from the heart_failure_clinical_records_dataset CSV file using Azure ML's TabularDatasetFactory class.

Clean data : A clean_data function is used to clean the dataset, which is then normalised.

Split Data : The data is split into train and test subsets using the train_test_split function with a fixed random state (random_state=42) and test set size (test_size=0.33).

Script arguments : The LogisticRegression class is regularised by specifying parameters such as the regularization strength (C) and the maximum number of iterations (max_iter).
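
A minimal sketch of the script's argument handling (the default values are assumptions; the flag names match the --C and --max_iter flags reported for the best run):

```python
import argparse

parser = argparse.ArgumentParser(description="Train a LogisticRegression model")
parser.add_argument("--C", type=float, default=1.0,
                    help="Inverse of regularization strength")
parser.add_argument("--max_iter", type=int, default=100,
                    help="Maximum number of iterations for the solver to converge")

# In train.py this would be parser.parse_args(); HyperDrive passes the
# sampled values on the command line, e.g.:
args = parser.parse_args(["--C", "0.89", "--max_iter", "140"])
print(args.C, args.max_iter)
```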

1.7 HyperDrive Configuration

  1. Screenshot showing the Sampling method, Early Termination Policy, and Hyperparameter Search space used for the hyperdrive run.

Types of parameters:

The configuration for the hyperdrive run specifies parameters such as the maximum total number of runs, the maximum number of runs to execute concurrently, the name of the primary metric, and the primary metric goal, along with the following hyperparameter settings:

Parameter sampler :

The parameter sampler is specified using the RandomParameterSampling class, which enables random sampling over a hyperparameter search space of discrete or continuous values (C and max_iter).

Random Parameter Sampler is used as:

  • Random parameter sampling supports both discrete and continuous hyperparameter values.
  • Helps to identify low performing runs and thereby helps in early termination.
  • Low bias in random sampling as hyperparameter values are randomly selected from the defined search space and have equal probability of being selected.
  • choice function helps to sample from only specified set of values.
  • uniform function helps to maintain uniform distribution of samples taken.
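
A minimal sketch of the sampler, assuming the azureml-train-core package is installed; the exact value ranges are assumptions, not necessarily the project's actual search space:

```python
# Sketch: random sampling over one continuous and one discrete range.
from azureml.train.hyperdrive import RandomParameterSampling, choice, uniform

param_sampling = RandomParameterSampling({
    "--C": uniform(0.1, 1.0),                  # continuous: regularization strength
    "--max_iter": choice(50, 100, 140, 200),   # discrete: iteration budget
})
```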

Termination Policy :

Specifies an early termination policy for early stopping, with the required evaluation_interval, slack_factor, and delay_evaluation values.

Bandit Policy is used as:

  • It is an early termination policy that terminates low performing runs, freeing compute for the remaining runs.
  • It terminates runs if primary metric is not within specified slack factor in comparison to the best performing run.
  • This policy is based on slack factor and evaluation interval.
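
The slack-factor rule can be checked with a little arithmetic in plain Python (the metric values below are invented):

```python
# Bandit policy rule: with a maximized primary metric, a run is terminated
# at an evaluation interval when its metric falls below
# best_metric / (1 + slack_factor). Values here are invented.
def should_terminate(run_metric, best_metric, slack_factor):
    return run_metric < best_metric / (1 + slack_factor)

best = 0.90
slack_factor = 0.1                       # allowed slack relative to the best run
threshold = best / (1 + slack_factor)    # about 0.818

print(should_terminate(0.75, best, slack_factor))  # well below: terminated
print(should_terminate(0.85, best, slack_factor))  # within slack: kept
```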

Results

The hyperdrive model, whose hyperparameters were optimized for the Logistic Regression model, produced an accuracy of 0.727272. The best run used a --C value of 0.89 and a --max_iter value of 140.

Screenshot of Run Details Widget showing C and max_iter values for hyperdrive runs.

The hyperdrive optimized logistic regression model produced an accuracy of roughly 0.72. Parameters of the model include:

  • C : Inverse of regularization strength where smaller values specify stronger regularization
  • max_iter : Maximum number of iterations taken for the solvers to converge
  • penalty : Used to specify the norm used in the penalization
  • fit_intercept : Specifies if a constant(bias or intercept) should be added to the decision function
  • intercept_scaling : A synthetic feature with constant value equal to intercept_scaling is appended to the instance vector
  • tol : Tolerance for stopping criteria
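
These parameters map directly onto scikit-learn's LogisticRegression. A minimal sketch using the best run's C and max_iter values (the toy data is invented; the remaining parameters are scikit-learn defaults written out explicitly):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny invented dataset standing in for the cleaned heart-failure features
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # linearly separable labels

# C and max_iter match the best HyperDrive run; the others are defaults.
model = LogisticRegression(C=0.89, max_iter=140, penalty="l2",
                           fit_intercept=True, intercept_scaling=1, tol=1e-4)
model.fit(X, y)
preds = model.predict(X)
print(model.score(X, y))  # training accuracy
```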

1.8 Further Improvements for hyperdrive model

The hyperdrive model can be improved further by tuning different hyperparameters that contribute to better performance. We can also improve scoring by optimizing other metrics such as Log Loss and F1-Score, using more appropriate parameters in the hyperdrive configuration settings, and increasing the maximum total and concurrent run counts.

  1. Screenshot of Run Details Widget of Hyperdrive Run.

  2. Screenshot of hyperdrive with its RunID.

1.9 Model Deployment

The model with the best accuracy is deployed as a web service; this is the AutoML model, which performs better with an accuracy of 0.9.

  • A model is deployed to Azure Container Instances(ACI), by creating a deployment configuration that describes the compute resources required(like number of cores and memory).
  • Creating an inference configuration, which describes the environment needed to host the model and web service.
  • AzureML-AutoML environment is used which is a curated environment available in Azure Machine Learning workspace.
  • Deploying an Azure Machine Learning model as a web service creates a REST API endpoint. This project uses key-based authentication, and a Swagger URI is generated through the inference schema in the score.py script file.
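
A sketch of the deployment steps above, assuming the azureml-core package is installed; the workspace, registered model name, and service name are placeholders:

```python
# Sketch: deploy a registered model to ACI with key-based authentication.
from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="best-automl-model")        # hypothetical model name

env = Environment.get(ws, name="AzureML-AutoML")   # curated environment
inference_config = InferenceConfig(entry_script="score.py", environment=env)

aci_config = AciWebservice.deploy_configuration(
    cpu_cores=1, memory_gb=1,      # compute resources for the container
    auth_enabled=True,             # key-based authentication
)

service = Model.deploy(ws, "heart-failure-service", [model],
                       inference_config, aci_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri, service.swagger_uri)
```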

Screenshot of deployed model showing REST Endpoint URL, Key based authentication and Swagger URI.

1.10 Consume Web Service

  • This endpoint can be used to consume the web service using the scoring endpoint URL and the Primary Key. We can send data in a request to this endpoint and receive the prediction returned by the model.

Screenshot showing Consume section of deployed model that has Endpoint URL and Primary Key for authentication.

  • In this project, data is sent to the endpoint through the endpoint.py script, which contains sample input from the dataset, and also via the Python SDK, where an input JSON payload of two sets of data instances is used.
  • A request (query) is sent to the endpoint using the scoring_uri and the primary key for authentication. The endpoint returns a JSON response as output.
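
A sketch of building such a request (the scoring URI, primary key, and feature values are placeholders; the actual POST is shown commented out since it needs a live endpoint):

```python
import json

# Placeholders: substitute the real scoring URI and primary key
scoring_uri = "http://<your-aci-endpoint>/score"
primary_key = "<primary-key>"

# Two data instances covering the dataset's 12 input features;
# the values here are invented sample inputs.
payload = {"data": [
    {"age": 60, "anaemia": 0, "creatinine_phosphokinase": 500, "diabetes": 1,
     "ejection_fraction": 38, "high_blood_pressure": 0, "platelets": 260000,
     "serum_creatinine": 1.1, "serum_sodium": 137, "sex": 1, "smoking": 0,
     "time": 130},
    {"age": 75, "anaemia": 1, "creatinine_phosphokinase": 250, "diabetes": 0,
     "ejection_fraction": 20, "high_blood_pressure": 1, "platelets": 210000,
     "serum_creatinine": 1.9, "serum_sodium": 130, "sex": 0, "smoking": 1,
     "time": 10},
]}
body = json.dumps(payload)
headers = {"Content-Type": "application/json",
           "Authorization": f"Bearer {primary_key}"}

# With a live endpoint, the query would be sent as:
#   import requests
#   response = requests.post(scoring_uri, data=body, headers=headers)
#   print(response.json())   # JSON predictions from the model
```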

Screenshot of endpoint.py script run that shows response from model.

1.11 Future improvements for project

  • Enable deep learning when specifying the classification task type for AutoML; it applies default techniques depending on the number of rows in the provided training dataset, and applies a train/validation split with the required number of cross validations without these being provided explicitly.
  • Use the iterations parameter of the AutoMLConfig class, which enables testing of different algorithm and parameter combinations during an automated ML experiment, and increase experiment_timeout_minutes.
  • Use Azure Kubernetes Service (AKS) instead of Azure Container Instances (ACI), as AKS helps minimize infrastructure maintenance and provides automated upgrades, repairs, monitoring, and scaling. This leads to faster development and integration.
  • Use dedicated virtual machines instead of low-priority ones, as low-priority VMs do not guarantee compute nodes.
  • Use GPU compute instead of CPU, as it greatly increases training speed.

1.12 Screenshots

  1. Screenshot showing automl model as Registered.

  2. Screenshot showing hyperdrive model as Registered.

  3. Screenshot showing Deployment status of endpoint model as Healthy.

  4. Screenshot showing Webservice being deleted and Shutdown of computes used.

Enabling logging with logs.py script file

  1. Screenshot showing False for Application insights before logging.

  2. Screenshot showing script run of the python logs file.

  3. Screenshot showing Application insights enabled to True.

1.13 Screen Recording

Link to screencast : https://drive.google.com/file/d/1sSQcEJRL6a4sbfy2R8nU7wpQdqyQLEaJ/view?usp=sharing

The screencast demonstrates a working model, the deployed model, a sample request sent to the endpoint and its response, and an additional feature of the model: enabling Application Insights through the logs.py script file.
Logging for the deployed model is important, as it helps detect anomalies, provides analytic tools, allows retrieving logs from a deployed model, and plays a vital role in debugging problems in production environments.
