
Capstone Project - Azure Machine Learning Engineer

Index

  • 1.1 Overview of the project
  • 1.2 Architectural Diagram
  • 1.3 Dataset
  • 1.4 Automated ML
  • 1.5 Further Improvements for automl model
  • 1.6 Hyperparameter Tuning
  • 1.7 HyperDrive Configuration
  • 1.8 Further Improvements for hyperdrive model
  • 1.9 Model Deployment
  • 1.10 Consume Web Service
  • 1.11 Future improvements for project
  • 1.12 Screenshots
  • 1.13 Screen Recording

1.1 Overview of the project

This project is part of the Udacity Azure ML Nanodegree. In it, we build a machine learning model using the Python SDK and a provided Scikit-learn model. Two models are created: one using Automated ML and one customized model whose hyperparameters are tuned with HyperDrive. The best performing model is deployed as a web service to an Azure Container Instances (ACI) endpoint, which can be consumed by sending sample requests to the service.

1.2 Architectural Diagram

1.3 Dataset

The dataset used in this notebook is heart_failure_clinical_records_dataset.csv, an external dataset available on Kaggle. It contains records of 299 patients and 12 features useful for predicting an individual's mortality due to heart failure.

Overview of Dataset

  • Number of patient records collected: 299
  • Input variables or features: age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction, high_blood_pressure, platelets, serum_creatinine, serum_sodium, sex, smoking, time
  • Output/target variable: DEATH_EVENT

Task

In this project, we create a classification model that predicts mortality (the DEATH_EVENT target variable) caused by heart failure, using an external dataset.
Two trained models are created: one using Automated ML and one customized model whose hyperparameters are tuned with HyperDrive. The performance of the two models is then compared, and the best performing model is deployed as a web service whose endpoint can be consumed.

This project also demonstrates the ability to use an external dataset in the workspace.

Accessing the data

  • Import the data from the CSV file (heart_failure_clinical_records_dataset) using Azure ML's TabularDatasetFactory class.
  • Create a tabular dataset.
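
A sketch of this step, assuming the azureml-core package is installed; the CSV URL below is a placeholder for wherever the Kaggle file is hosted:

```python
# Sketch: import the CSV as a tabular dataset. The URL is a placeholder.
from azureml.data.dataset_factory import TabularDatasetFactory

data_url = "https://example.com/heart_failure_clinical_records_dataset.csv"

dataset = TabularDatasetFactory.from_delimited_files(path=data_url)
df = dataset.to_pandas_dataframe()  # materialize for inspection
```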

1.4 Automated ML

AutoML Settings - AUC_weighted is used as the primary metric for optimisation during model training, as it optimizes well for small datasets; featurisation is set to auto; max_concurrent_iterations is set to 4, the maximum number of iterations to execute in parallel; and the verbosity level is left at its default, logging.INFO, for writing to the log file.

AutoML Configuration - A classification experiment with experiment_timeout_minutes set to 15 and 2 cross-validation folds. XGBoostClassifier is a blocked model, and values for the training data and label column name parameters are also supplied.
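
A sketch of these settings as an AutoMLConfig, assuming the azureml-train-automl package is installed; train_data and compute_target are placeholders, and parameter names can vary slightly across SDK versions:

```python
# Sketch of the AutoML settings and configuration described above.
# `train_data` and `compute_target` are placeholders for objects
# created earlier in the notebook.
import logging
from azureml.train.automl import AutoMLConfig

automl_settings = {
    "primary_metric": "AUC_weighted",   # optimizes well for small datasets
    "featurization": "auto",
    "max_concurrent_iterations": 4,     # iterations executed in parallel
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(
    task="classification",
    experiment_timeout_minutes=15,
    n_cross_validations=2,
    blocked_models=["XGBoostClassifier"],
    training_data=train_data,           # tabular dataset from the import step
    label_column_name="DEATH_EVENT",
    compute_target=compute_target,
    **automl_settings,
)
```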

Results

AutoML generated a Voting Ensemble as the best model, with a best metric score of 0.9. The ensembled algorithms of this model include 'GradientBoosting', 'RandomForest' and 'LightGBM'. The best ensemble weight for these algorithms is 0.1, and the best individual pipeline score is 0.885.

Screenshot of models generated with AutoML Run and Best model shown as Voting Ensemble.

A prefitted soft voting classifier is applied: every individual classifier provides a probability value, the predictions are weighted according to each classifier's importance and summed, and the class with the greatest sum of weighted probabilities wins the vote.
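
The soft-voting rule described above can be illustrated with a small plain-Python sketch (the probability values and weights below are invented for illustration, not taken from the actual run):

```python
# Toy illustration of soft voting: each classifier outputs class
# probabilities, which are weighted and summed; the class with the
# largest weighted sum wins. All numbers here are invented.

# [P(class=0), P(class=1)] from three fitted classifiers
probabilities = {
    "GradientBoosting": [0.30, 0.70],
    "RandomForest":     [0.45, 0.55],
    "LightGBM":         [0.20, 0.80],
}
weights = {"GradientBoosting": 0.4, "RandomForest": 0.3, "LightGBM": 0.3}

n_classes = 2
weighted_sums = [
    sum(weights[name] * probs[c] for name, probs in probabilities.items())
    for c in range(n_classes)
]
prediction = max(range(n_classes), key=lambda c: weighted_sums[c])
print(prediction)  # class 1 wins the soft vote here
```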

Parameters of gradientboostingclassifier are:

  • learning_rate : Learning rate shrinks the contribution of each tree
  • max_depth : Maximum depth limits the number of nodes in the tree
  • loss : Loss function (deviance) to optimize, for classification with probabilistic outputs

Parameters of randomforestclassifier are:

  • max_features : Number of features to consider when looking for the best split
  • min_samples_leaf : Minimum number of samples required to be at a leaf node
  • n_jobs : Number of jobs to run in parallel

Parameters of lightgbmclassifier are:

  • importance_type : Type of feature importance to fill into feature_importances_; options are 'split' and 'gain'
  • n_estimators : Number of boosted trees to fit (default 100)
  • num_leaves : Maximum tree leaves for base learners

1.5 Further Improvements for automl model

This model can be improved further by specifying additional parameters for automl configuration and settings that contribute for its better performance. Some of the ways to improve could be:

  • Using an appropriate number of cross validations, which reduces bias and improves the model's ability to generalize.
  • Specifying custom ensemble behavior in the AutoMLConfig object using the ensemble parameters.
  • Enabling XGBoostClassifier, which is blocked in the AutoMLConfig due to an incompatible dependency in the SDK version (1.22.0) used for this project; XGBoost can improve performance, as it uses a more regularized model formalization that controls over-fitting.
  1. Screenshot of Run Details Widget of AutoML Run.

  2. Screenshot of model showing its RunID.

1.6 Hyperparameter Tuning

The model used for this experiment is Logistic Regression from the scikit-learn library. Logistic regression is a simple and effective algorithm for binary classification tasks, as in this case where we predict the DEATH_EVENT variable for an individual, and it gives good accuracy on simple datasets. Logistic regression uses Maximum Likelihood Estimation (MLE) to obtain the model coefficients that relate the predictors to the target.
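
As a self-contained illustration of MLE (the one-feature dataset and coefficients below are invented), the log-likelihood that logistic regression maximizes can be computed directly:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(X, y, w, b):
    """Sum of log P(y_i | x_i) under a logistic model with weights w, bias b."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        total += y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return total

# Tiny invented dataset: one feature, binary label
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

# A better-fitting coefficient yields a higher log-likelihood;
# MLE searches for the coefficients that maximize it.
print(log_likelihood(X, y, w=[2.0], b=-3.0))  # fits the data well
print(log_likelihood(X, y, w=[0.0], b=0.0))   # chance model: 4 * log(0.5)
```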

Train script python file

This Python script is used to run the hyperdrive experiment and includes a custom-coded Logistic Regression model built with sklearn.

Data Import : A tabular dataset is created from the heart_failure_clinical_records_dataset CSV file using Azure ML's TabularDatasetFactory class.

Clean data : A clean_data function is used to clean the dataset, which is then normalised.

Split Data : The data is split into train and test subsets using the train_test_split function with a fixed random state (random_state=42) and test set size (test_size=0.33).

Script arguments : The LogisticRegression class is regularised by specifying parameters such as the regularization strength (C) and the maximum number of iterations (max_iter).
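
A minimal sketch of the script's argument handling (the default values are assumptions; the flag names match the --C and --max_iter flags reported for the best run):

```python
import argparse

parser = argparse.ArgumentParser(description="Train a LogisticRegression model")
parser.add_argument("--C", type=float, default=1.0,
                    help="Inverse of regularization strength")
parser.add_argument("--max_iter", type=int, default=100,
                    help="Maximum number of iterations for the solver to converge")

# In train.py this would be parser.parse_args(); HyperDrive passes the
# sampled values on the command line, e.g.:
args = parser.parse_args(["--C", "0.89", "--max_iter", "140"])
print(args.C, args.max_iter)
```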

1.7 HyperDrive Configuration

  1. Screenshot showing the Sampling method, Early Termination Policy, and Hyperparameter Search space used for the hyperdrive run.

Types of parameters:

The configuration for the hyperdrive run specifies parameters such as the maximum total number of runs, the maximum number of runs to execute concurrently, the name of the primary metric, and the primary metric goal, along with the following hyperparameter settings:

Parameter sampler :

The parameter sampler is specified using the RandomParameterSampling class, which enables random sampling over a hyperparameter search space of discrete or continuous values (C and max_iter).

Random Parameter Sampler is used as:

  • Random parameter sampling supports both discrete and continuous hyperparameter values.
  • Helps to identify low performing runs and thereby helps in early termination.
  • Low bias in random sampling as hyperparameter values are randomly selected from the defined search space and have equal probability of being selected.
  • choice function helps to sample from only specified set of values.
  • uniform function helps to maintain uniform distribution of samples taken.
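
A minimal sketch of the sampler, assuming the azureml-train-core package is installed; the exact value ranges are assumptions, not necessarily the project's actual search space:

```python
# Sketch: random sampling over one continuous and one discrete range.
from azureml.train.hyperdrive import RandomParameterSampling, choice, uniform

param_sampling = RandomParameterSampling({
    "--C": uniform(0.1, 1.0),                  # continuous: regularization strength
    "--max_iter": choice(50, 100, 140, 200),   # discrete: iteration budget
})
```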

Termination Policy :

Specifies an early termination policy for early stopping, with the required evaluation_interval, slack_factor, and delay_evaluation values.

Bandit Policy is used as:

  • It is an early termination policy that terminates low performing runs, freeing compute for the remaining runs.
  • It terminates runs if primary metric is not within specified slack factor in comparison to the best performing run.
  • This policy is based on slack factor and evaluation interval.
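
The slack-factor rule can be checked with a little arithmetic in plain Python (the metric values below are invented):

```python
# Bandit policy rule: with a maximized primary metric, a run is terminated
# at an evaluation interval when its metric falls below
# best_metric / (1 + slack_factor). Values here are invented.
def should_terminate(run_metric, best_metric, slack_factor):
    return run_metric < best_metric / (1 + slack_factor)

best = 0.90
slack_factor = 0.1                       # allowed slack relative to the best run
threshold = best / (1 + slack_factor)    # about 0.818

print(should_terminate(0.75, best, slack_factor))  # well below: terminated
print(should_terminate(0.85, best, slack_factor))  # within slack: kept
```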

Results

The hyperdrive model, whose hyperparameters were optimized for the Logistic Regression model, produced an accuracy of 0.727272. The best run used a --C value of 0.89 and a --max_iter value of 140.

Screenshot of Run Details Widget showing C and max_iter values for hyperdrive runs.

The hyperdrive optimized logistic regression model produced an accuracy of roughly 0.72. Parameters of the model include:

  • C : Inverse of regularization strength where smaller values specify stronger regularization
  • max_iter : Maximum number of iterations taken for the solvers to converge
  • penalty : Used to specify the norm used in the penalization
  • fit_intercept : Specifies if a constant(bias or intercept) should be added to the decision function
  • intercept_scaling : A synthetic feature with constant value equal to intercept_scaling is appended to the instance vector
  • tol : Tolerance for stopping criteria
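
These parameters map directly onto scikit-learn's LogisticRegression. A minimal sketch using the best run's C and max_iter values (the toy data is invented; the remaining parameters are scikit-learn defaults written out explicitly):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny invented dataset standing in for the cleaned heart-failure features
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # linearly separable labels

# C and max_iter match the best HyperDrive run; the others are defaults.
model = LogisticRegression(C=0.89, max_iter=140, penalty="l2",
                           fit_intercept=True, intercept_scaling=1, tol=1e-4)
model.fit(X, y)
preds = model.predict(X)
print(model.score(X, y))  # training accuracy
```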

1.8 Further Improvements for hyperdrive model

The hyperdrive model can be improved further by tuning different hyperparameters that contribute to better performance. We can also improve scoring by optimizing other metrics such as Log Loss and F1-Score, using more appropriate parameters in the hyperdrive configuration settings, and increasing the maximum total and concurrent run counts.

  1. Screenshot of Run Details Widget of Hyperdrive Run.

  2. Screenshot of hyperdrive with its RunID.

1.9 Model Deployment

The model with the best accuracy is deployed as a web service; this is the AutoML model, which performs better with an accuracy of 0.9.

  • A model is deployed to Azure Container Instances(ACI), by creating a deployment configuration that describes the compute resources required(like number of cores and memory).
  • Creating an inference configuration, which describes the environment needed to host the model and web service.
  • AzureML-AutoML environment is used which is a curated environment available in Azure Machine Learning workspace.
  • Deploying an Azure Machine Learning model as a web service creates a REST API endpoint. This project uses key-based authentication, and a Swagger URI is generated through the inference schema in the score.py script file.
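
A sketch of the deployment steps above, assuming the azureml-core package is installed; the workspace, registered model name, and service name are placeholders:

```python
# Sketch: deploy a registered model to ACI with key-based authentication.
from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="best-automl-model")        # hypothetical model name

env = Environment.get(ws, name="AzureML-AutoML")   # curated environment
inference_config = InferenceConfig(entry_script="score.py", environment=env)

aci_config = AciWebservice.deploy_configuration(
    cpu_cores=1, memory_gb=1,      # compute resources for the container
    auth_enabled=True,             # key-based authentication
)

service = Model.deploy(ws, "heart-failure-service", [model],
                       inference_config, aci_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri, service.swagger_uri)
```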

Screenshot of deployed model showing REST Endpoint URL, Key based authentication and Swagger URI.

1.10 Consume Web Service

  • This endpoint can be used to consume the web service using the scoring endpoint URL and the Primary Key. We can send data in a request to this endpoint and receive the prediction returned by the model.

Screenshot showing Consume section of deployed model that has Endpoint URL and Primary Key for authentication.

  • In this project, data is sent to the endpoint through the endpoint.py script, which contains sample input from the dataset, and also via the Python SDK, where an input JSON payload of two sets of data instances is used.
  • A request (query) is sent to the endpoint using the scoring_uri and the primary key for authentication. The endpoint returns a JSON response as output.
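
A sketch of building such a request (the scoring URI, primary key, and feature values are placeholders; the actual POST is shown commented out since it needs a live endpoint):

```python
import json

# Placeholders: substitute the real scoring URI and primary key
scoring_uri = "http://<your-aci-endpoint>/score"
primary_key = "<primary-key>"

# Two data instances covering the dataset's 12 input features;
# the values here are invented sample inputs.
payload = {"data": [
    {"age": 60, "anaemia": 0, "creatinine_phosphokinase": 500, "diabetes": 1,
     "ejection_fraction": 38, "high_blood_pressure": 0, "platelets": 260000,
     "serum_creatinine": 1.1, "serum_sodium": 137, "sex": 1, "smoking": 0,
     "time": 130},
    {"age": 75, "anaemia": 1, "creatinine_phosphokinase": 250, "diabetes": 0,
     "ejection_fraction": 20, "high_blood_pressure": 1, "platelets": 210000,
     "serum_creatinine": 1.9, "serum_sodium": 130, "sex": 0, "smoking": 1,
     "time": 10},
]}
body = json.dumps(payload)
headers = {"Content-Type": "application/json",
           "Authorization": f"Bearer {primary_key}"}

# With a live endpoint, the query would be sent as:
#   import requests
#   response = requests.post(scoring_uri, data=body, headers=headers)
#   print(response.json())   # JSON predictions from the model
```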

Screenshot of endpoint.py script run that shows response from model.

1.11 Future improvements for project

  • Enable deep learning when specifying the classification task type for AutoML; it applies default techniques depending on the number of rows in the provided training dataset, and applies a train/validation split with the required number of cross validations without these being provided explicitly.
  • Use the iterations parameter of the AutoMLConfig class, which enables testing of different algorithm and parameter combinations during an automated ML experiment, and increase experiment_timeout_minutes.
  • Use Azure Kubernetes Service (AKS) instead of Azure Container Instances (ACI), as AKS helps minimize infrastructure maintenance and provides automated upgrades, repairs, monitoring, and scaling. This leads to faster development and integration.
  • Use dedicated virtual machines instead of low-priority ones, as low-priority VMs do not guarantee compute nodes.
  • Use GPU compute instead of CPU, as it greatly increases training speed.

1.12 Screenshots

  1. Screenshot showing automl model as Registered.

  2. Screenshot showing hyperdrive model as Registered.

  3. Screenshot showing Deployment status of endpoint model as Healthy.

  4. Screenshot showing Webservice being deleted and Shutdown of computes used.

Enabling logging with logs.py script file

  1. Screenshot showing False for Application insights before logging.

  2. Screenshot showing script run of the python logs file.

  3. Screenshot showing Application insights enabled to True.

1.13 Screen Recording

Link to screencast : https://drive.google.com/file/d/1sSQcEJRL6a4sbfy2R8nU7wpQdqyQLEaJ/view?usp=sharing

The screencast demonstrates a working model, the deployed model, a sample request sent to the endpoint and its response, and an additional feature of the model: enabling Application Insights through the logs.py script file.
Logging for the deployed model is important, as it helps detect anomalies, provides analytic tools, allows retrieving logs from a deployed model, and plays a vital role in debugging problems in production environments.
