Data Science Checklist: 10 Weeks Free Bootcamp

The only concepts you need to review to ace your Data Science interview

Week 1: Basics

  1. Introduction to Data Science
    Data Science is an interdisciplinary field related to many other emerging fields, and a typical project moves through several distinct phases.
  2. Mathematics for Data Science
    Mathematics is widely used in Data Science. This article gives an overview of the mathematical topics that are used most heavily.
  3. Over and under sampling
    Oversampling and undersampling are two ways to randomly resample an imbalanced dataset to make it balanced; see the sketch after this list.
  4. Supervised, Unsupervised and Semi-Supervised Learning
    In machine learning, tasks are broadly categorized into supervised, unsupervised, and semi-supervised learning, which form the foundation of our understanding of machine learning.
  5. Neural Network and Deep learning
    Deep learning is a subset of machine learning. It extensively uses neural networks to imitate the learning techniques of the human brain. There are different types of neural networks available.
  6. Beginner's Guide to Google Colaboratory
    Google Colaboratory is a free, web-based Jupyter notebook environment. It allows you to write and execute Python code, document your code using Markdown, and visualize datasets, and is an excellent tool for data scientists.
  7. Data analysis tools
    Data analysis is the process of collecting, organizing, transforming, and modeling data to draw conclusions, make predictions, and make informed decisions. Data scientists mostly use Python for data analysis, along with tools like Tableau for visualization.
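
As a quick illustration of the oversampling idea above, here is a minimal sketch using scikit-learn's `resample`. The class sizes and feature values are made-up assumptions for the example, not from any real dataset.

```python
import numpy as np
from sklearn.utils import resample

# Illustrative imbalanced dataset: 90 majority samples vs. 10 minority samples
rng = np.random.default_rng(42)
X_major = rng.normal(0, 1, size=(90, 2))
X_minor = rng.normal(3, 1, size=(10, 2))

# Oversample the minority class (sampling with replacement)
# until it matches the majority class size
X_minor_over = resample(X_minor, replace=True,
                        n_samples=len(X_major), random_state=42)

X_balanced = np.vstack([X_major, X_minor_over])
print(X_balanced.shape)  # (180, 2): the two classes are now the same size
```

Undersampling works the same way in reverse: the majority class is sampled down (with `replace=False`) to the minority class size.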

Week 2: Machine Learning Basics

  1. Feature engineering
    Feature engineering is done to make the data better suited to the problem you are trying to solve using machine learning. LASSO is a popular technique used to select features.
  2. Regularization
    Regularization is a method that reduces the variance of your model at the cost of a small increase in bias, which helps prevent overfitting. L1 and L2 regularization are two of the most widely used techniques. Note that regularization is different from techniques like standardization.
  3. Frequently used terminologies
    Some of the frequently used terms in ML are normalization, latency, throughput, quantization, pruning, bias and early stopping.
  4. Model evaluation
    In machine learning, model evaluation is used to find which algorithm is best suited to solve our problem. It is done by calculating performance metrics, such as precision, recall, sensitivity, and specificity.
  5. Hyperparameters
    Hyperparameters express “higher-level” properties of the model, such as its complexity or how fast it should learn, and are usually fixed before training. The learning rate is one example. Hyperparameter tuning is the process of choosing a set of optimal hyperparameters for a learning algorithm.
  6. Gradient descent
    Gradient Descent is an essential optimization algorithm that helps us find the optimum parameters of our machine learning models. It has several variants, of which stochastic gradient descent is the most widely used. The reverse process is called gradient ascent. A sketch combining gradient descent with L2 regularization appears after this list.
  7. Ensemble methods
    Ensemble methods combine predictions from several models into a single one. Boosting, stacking, and voting classifiers are some ensemble techniques.
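
Tying together the regularization and gradient descent items above, here is a minimal NumPy sketch of batch gradient descent on an L2-regularized (ridge) linear regression. The synthetic data, learning rate, and regularization strength are illustrative assumptions.

```python
import numpy as np

# Loss: (1/n) * ||Xw - y||^2 + lam * ||w||^2  (ridge regression)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr, lam = 0.1, 0.01  # learning rate and L2 strength: both are hyperparameters
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w  # gradient of the loss
    w -= lr * grad  # step against the gradient (descent); '+=' would be ascent

print(np.round(w, 2))  # recovers weights close to [2.0, -1.0, 0.5]
```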

Week 3: Classification

  1. Classification
    Classification is categorizing data into different classes. This is based on making predictions using past examples. We feed some examples where we know what the correct prediction is into the model and the model learns from these examples to make accurate predictions in the future.
  2. Logistic Regression
    Logistic Regression is an efficient algorithm that aims to predict categorical values, often binary. It has its own advantages and disadvantages. It can be implemented using scikit-learn and TensorFlow; see the sketch after this list.
  3. K-Nearest Neighbours
    K-Nearest Neighbours is an algorithm which is used for classification and regression and is based on the idea of considering the nearest K data points for calculations. This example uses KNN for text classification.
  4. Decision tree
    The decision tree is a popular machine learning algorithm mainly used for classification. Usually, the ID3 algorithm is used to build a decision tree.
  5. Support Vector Machine
    SVMs are a particularly powerful and flexible class of supervised algorithms for both classification and regression. They have many advantages and applications and can be implemented easily.
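
As a concrete example of a classifier from this week, here is a minimal logistic regression sketch with scikit-learn, using one of its built-in datasets. The `max_iter` value is just a convenient assumption so the solver converges.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary classification on scikit-learn's built-in breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)  # extra iterations so the solver converges
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out test split
```

Swapping in `KNeighborsClassifier`, `DecisionTreeClassifier`, or `SVC` exercises the other algorithms from this week with the same fit/score interface.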

Week 4: Regression

  1. Regression
    Regression is a statistical method used in various fields to find out how strong the relationship is between a dependent variable and one or more independent variables.
  2. Linear Regression
    Linear Regression is a regression technique that models the relationship between a dependent variable and one or more independent variables using a linear approach. It has its own advantages and disadvantages. It can be implemented using scikit-learn and TensorFlow.
  3. Random forest
    Random forests are an ensemble learning method for classification and regression. They have various applications. This example uses random forests for regression.
  4. Polynomial regression
    Polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial rather than a straight line.
  5. Elastic Net regression
    Elastic Net regression uses Elastic Net regularization, which combines the L1 and L2 penalties.
  6. Ridge and Lasso regression
    Ridge and LASSO regression use the L2 and L1 regularizations that we saw previously; see the sketch after this list.
  7. Data analysis using regression techniques
    This article explains how regression analysis is done.
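
To see plain, ridge, and LASSO regression side by side, here is a minimal scikit-learn sketch. The toy data and the `alpha` values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Toy data: y = 3x + 1 plus a little noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X[:, 0] + 1 + rng.normal(scale=0.5, size=50)

# Ridge adds an L2 penalty, Lasso an L1 penalty; alpha sets its strength
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__,
          np.round(model.coef_, 2), round(float(model.intercept_), 2))
```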

Week 5: Unsupervised learning

  1. K-means clustering
    K-means clustering is a prime example of unsupervised learning and partitional clustering. An improved version of this is the K+ means clustering algorithm. A minimal K-means sketch appears after this list.
  2. DBSCAN clustering
    DBSCAN clustering is a density-based clustering method that identifies clusters in the dataset by finding regions that are more densely populated than others.
  3. Spectral clustering
    Spectral clustering is a technique with roots in graph theory, where the approach is used to identify communities of nodes in a graph based on the edges connecting them.
  4. Apriori algorithm
    The Apriori algorithm is an association rule learning algorithm that is generally used in data mining.
  5. Manifold learning
    Manifold learning is the process of modeling manifolds where the data lies. It is a technique used for dimensionality reduction.
  6. Principal component analysis
    Principal component analysis is a technique to bring out strong patterns in a dataset by suppressing variations. You can check out why PCA works to get a basic idea behind its working. KPCA is a variant of PCA.
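
A minimal K-means sketch with scikit-learn; the two synthetic blobs are made-up data, so the expected cluster centres are known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of 50 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.round(1))  # roughly (0, 0) and (5, 5)
print(km.labels_[:5])                # cluster assignment of the first 5 points
```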

Week 6: Deep Learning

  1. Different layers and activation functions
    There are many different layers in a deep learning model, such as the fully connected layer. Hidden layers are the most intriguing ones. We also have activation functions that decide the output of each neuron.
  2. Top Deep learning frameworks
    Top deep learning frameworks include TensorFlow, Keras, Caffe2, PyTorch, and many more.
  3. Commonly Used Neural Networks
    Commonly used neural networks include RBFNN, KNN, and the Hopfield Network.
  4. CNN models
    CNN models have evolved over time, and some of the commonly used models now include AlexNet, ResNet, GoogLeNet, Xception, DenseNet, and many more. CNNs are widely used in image recognition and classification; see the sketch after this list.
  5. Data Augmentation
    Data augmentation is the technique of increasing the amount of data used for training a model, typically by applying transformations to existing samples.
  6. GAN
    Generative Adversarial Networks (GANs) are an architecture for training a generative model. There are many types of GANs, like SRGAN, Deep Convolutional GAN, CycleGAN, and [Conditional GAN](https://iq.opengenus.org/conditional-generative-adversarial-net/).
  7. Inception models
    The Inception architecture is an important milestone in the development of CNN classifiers. It has several versions, such as Inception V3, Inception V4, and Inception-ResNet V1.
  8. VGG models
    VGG came into the picture to address the depth of CNNs. It comes in several configurations, such as VGG-11, VGG-16, and VGG-19.
  9. Boltzmann Machines
    Boltzmann Machines are models used to discover features in datasets composed of binary vectors. A Restricted Boltzmann Machine is a variant in which a visible node is not connected to any other visible node and is used in deep belief networks.
  10. YOLO
    YOLO is an object detection algorithm with variants like YOLOv3, YOLOv4, Scaled YOLOv4, YOLOR, and YOLOv5.
  11. SSD
    Single Shot Detection (SSD) is an object detection algorithm whose architecture is a modified version of VGG. It is used in the SSD MobileNetV1 and RefineDet models.
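
To make the layer vocabulary from this week concrete, here is a minimal Keras CNN sketch for 28x28 grayscale images. The layer sizes are illustrative choices, not a reference architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A tiny CNN for 28x28 grayscale inputs (e.g. MNIST-style digits)
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, 3, activation="relu"),  # convolution layer
    layers.MaxPooling2D(),                    # downsampling
    layers.Flatten(),
    layers.Dense(64, activation="relu"),      # fully connected hidden layer
    layers.Dense(10, activation="softmax"),   # one output per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # prints each layer and its parameter count
```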

Week 7: NLP

  1. Introduction
    NLP refers to the ability of computers to understand human speech or text as it is spoken or written. Some core topics are listed here. TF-IDF is an important metric in NLP, mostly used to find similarities between documents; see the sketch after this list.
  2. NLP models
    There are different types of NLP models present. Some of them are BERT, GPT, XLNet, RoBERTa and ALBERT.
  3. Text Preprocessing
    Text preprocessing is the process of converting human language text into machine-interpretable text for further use. Stemming (e.g. the Porter Stemmer algorithm) is one example.
  4. Text summarization
    Text summarization is the process of creating a compact yet accurate summary of text documents. Some techniques include Luhn's Heuristic Method, the Edmundson Heuristic Method, the SumBasic algorithm, KL-Sum, LexRank, TextRank, Reduction, Latent Semantic Analysis, and the use of RNNs.
  5. Topic Modelling
    There are different techniques for topic modelling. Some include Latent Dirichlet Allocation, Non-Negative Matrix Factorization, the Pachinko Allocation Model, and Latent Semantic Analysis.
  6. Information Retrieval
    Information Retrieval can be defined as finding material of an unstructured nature that satisfies an information need from within large collections. It uses the concept of indexing. The PageRank algorithm, used by the Google search engine, ranks web pages.
  7. Sentiment analysis
    There are various techniques to perform sentiment analysis. Using Naive Bayes classifier, Lexicon-based techniques, ML approaches and LSTM are some of them.
  8. Miscellaneous
    Some other important topics in NLP are document clustering, language identification techniques, spell correction, word embedding, word representations and byte pair encoding.
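
A minimal sketch of the TF-IDF document-similarity idea mentioned in the introduction item, using scikit-learn. The three short documents are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "data science uses statistics and machine learning",
    "machine learning models learn patterns from data",
    "cooking recipes for a quick weeknight dinner",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # one TF-IDF vector per document
sim = cosine_similarity(tfidf)
print(sim.round(2))  # the first two documents score higher with each other
```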

Week 8: Time series

  1. Introduction to Time Series Data
    In time series data, we have a collection of observations of a single entity at different time intervals. Weather records, economic indicators, and patient health metrics are all time series data.
  2. Basics of Time Series Prediction
    Time series prediction involves concepts like stationarity, moving averages, seasonality, and many more, which you should be familiar with in order to better understand time series forecasting; see the sketch after this list.
  3. Time series forecasting models and techniques
    Future trend prediction is made by discovering and analyzing underlying patterns in the time series data. Various methods and models are used for the same.
  4. Time series prediction techniques
    Various artificial neural network models are put to use when performing a time series prediction. This article elaborates on a few models.
  5. Time series forecasting-Example
    This is an example of time series forecasting where we put into use the techniques we saw in the previous articles.
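
As a tiny taste of the moving-average and stationarity ideas above, here is a pandas sketch; the series values are invented for illustration.

```python
import pandas as pd

# Ten days of a made-up daily series
ts = pd.Series([10, 12, 13, 12, 15, 16, 18, 17, 19, 21],
               index=pd.date_range("2023-01-01", periods=10, freq="D"))

ma = ts.rolling(window=3).mean()  # 3-day moving average smooths out noise
diff = ts.diff()                  # first difference, a common step toward stationarity
print(pd.DataFrame({"value": ts, "ma3": ma, "diff": diff}))
```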

Week 9: Statistics and probability

  1. Statistical features
    Statistical features are those features of a dataset that can be defined and calculated via statistical analysis. They are probably the most used statistical concepts in data science.
  2. Types of hypotheses
    A hypothesis is a precise, testable statement of what a researcher predicts will be the outcome of an experiment or study. There are different types of hypotheses that are widely used.
  3. Hypothesis testing
    Hypothesis testing is used to determine whether there is enough evidence in a sample to infer that a certain condition is true for the entire population. The F-test is one such hypothesis test.
  4. CLT and LLN
    The Central Limit Theorem and the Law of Large Numbers are two important statistical results that are often put to use in Data Science.
  5. Confidence intervals
    A confidence interval expresses a range of values within which we are reasonably confident that the population parameter lies.
  6. Bayesian model
    A Bayesian model is a statistical model where we use probability to represent all uncertainty within the model, both the uncertainty regarding the output and the uncertainty regarding the input to the model.
  7. Markov model
    A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Another variant of this is the Hidden Markov Model.
  8. A/B testing
    A/B testing is a famous testing technique used to compare two variants to determine the best of the two based on user experience. It is a randomized experimentation process.
  9. Simulated annealing
    Simulated annealing is a probabilistic optimization algorithm inspired by the physical annealing process used in metallurgy.
  10. Monte carlo sampling techniques
    Monte Carlo techniques are a group of computational algorithms that estimate numerical results by repeated random sampling of probability distributions; see the sketch after this list.
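
A classic minimal Monte Carlo sketch: estimating pi by random sampling, which also shows the Law of Large Numbers at work. The sample size is an arbitrary choice.

```python
import numpy as np

# Sample points uniformly in the unit square and count the fraction
# that lands inside the quarter circle of radius 1
rng = np.random.default_rng(0)
n = 1_000_000
pts = rng.uniform(0, 1, size=(n, 2))
inside = (pts ** 2).sum(axis=1) <= 1.0
print(4 * inside.mean())  # converges to pi as n grows (Law of Large Numbers)
```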

Week 10: Projects

  1. Project ideas
    This article contains a list of unique data science project ideas that you can explore.
  2. Face recognition
    Face recognition can be implemented in Python using Eigenfaces or Fisherfaces; see the sketch after this list.
  3. Fraud detection
    Fraud detection is the process of detecting fraudulent activity in credit card transactions and can be framed as an anomaly detection problem.
  4. Native Language Identification
    Native language identification is the task of determining an author's native language based only on their writings or speeches in a second language.
  5. Person re-identification
    Person re-identification is the task of using a picture of a person to identify the presence of the same person in a set of images or a video. It is used to identify a person in CCTV footage.
  6. Hindi Optical Character Recognition
    Hindi OCR is basically a model which is used to recognize handwritten Hindi (Devanagari) characters.
  7. Face reconstruction
    In this project, we find the set of faces that, when combined, result in the face of person 'A', using machine learning techniques like PCA and more.
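
For the face recognition project, here is a minimal Eigenfaces-style sketch using scikit-learn's PCA on its Olivetti faces dataset (downloaded on first run). The number of components is an illustrative assumption.

```python
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

# 400 face images of 40 people, each 64x64 pixels flattened to 4096 features
faces = fetch_olivetti_faces()

# The principal components of face images are the "eigenfaces";
# projecting a face onto them gives a compact representation
pca = PCA(n_components=50, whiten=True).fit(faces.data)
eigenfaces = pca.components_.reshape((50, 64, 64))
projected = pca.transform(faces.data)
print(projected.shape)  # (400, 50): each face as 50 eigenface coefficients
```

A simple classifier (e.g. nearest neighbours) trained on `projected` completes the recognition pipeline.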

Practice interview questions

  1. Basic data science questions
    This article contains a list of basic data science interview questions.
  2. Advanced data science questions
    This article contains a list of advanced data science interview questions.
  3. Python
    Python is the most-tested programming language during data science interviews.
  4. Machine learning
    Knowledge of various ML topics such as TensorFlow (basic level), TensorFlow (advanced level), convolution, regression, random forests, and PCA is widely tested.
  5. Deep learning
    Deep learning topics such as RNNs, fully connected layers, convolution layers, GANs, and autoencoders are frequently tested.
  6. NLP
    NLP topics such as text summarization, transformers, and BERT are important for interviews.

Generated by OpenGenus. Updated on 2023-11-27