Data Science Checklist: 10 Weeks Free Bootcamp

The only concepts you need to review to ace your Data Science interview

Week 1: Basics

  1. Introduction to Data Science
    Data Science is an interdisciplinary field related to many other emerging fields, and a typical project moves through several distinct phases.
  2. Mathematics for Data Science
    Mathematics is widely used in Data Science. This article gives an overview of the mathematical topics that are used most heavily.
  3. Over and under sampling
    Oversampling and undersampling are two ways to randomly resample an imbalanced dataset to make it balanced; see the sketch after this list.
  4. Supervised, Unsupervised and Semi-Supervised Learning
    In machine learning, tasks are broadly categorized into supervised, unsupervised, and semi-supervised learning, which form the foundation of our understanding of machine learning.
  5. Neural Network and Deep learning
    Deep learning is a subset of machine learning. It extensively uses neural networks to imitate the learning techniques of the human brain. There are different types of neural networks available.
  6. Beginner's Guide to Google Colaboratory
    Google Colaboratory is a free, web-based Jupyter notebook environment. It allows you to write and execute Python code, document your code using Markdown, and visualize datasets, and is an excellent tool for data scientists.
  7. Data analysis tools
    Data analysis is the process of collecting, organizing, transforming, and modeling data to draw conclusions, make predictions, and make informed decisions. Data scientists mostly use Python for data analysis, along with tools like Tableau for visualization.
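
As a quick illustration of the oversampling idea above, here is a minimal sketch using scikit-learn's `resample`. The class sizes and feature values are made-up assumptions for the example, not from any real dataset.

```python
import numpy as np
from sklearn.utils import resample

# Illustrative imbalanced dataset: 90 majority samples vs. 10 minority samples
rng = np.random.default_rng(42)
X_major = rng.normal(0, 1, size=(90, 2))
X_minor = rng.normal(3, 1, size=(10, 2))

# Oversample the minority class (sampling with replacement)
# until it matches the majority class size
X_minor_over = resample(X_minor, replace=True,
                        n_samples=len(X_major), random_state=42)

X_balanced = np.vstack([X_major, X_minor_over])
print(X_balanced.shape)  # (180, 2): the two classes are now the same size
```

Undersampling works the same way in reverse: the majority class is sampled down (with `replace=False`) to the minority class size.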

Week 2: Machine Learning Basics

  1. Feature engineering
    Feature engineering is done to make the data better suited to the problem you are trying to solve using machine learning. LASSO is a popular technique used to select features.
  2. Regularization
    Regularization is a method that reduces the variance of your model at the cost of a small increase in bias, which helps prevent overfitting. L1 and L2 regularization are two of the most widely used techniques. Note that regularization is different from techniques like standardization.
  3. Frequently used terminologies
    Some of the frequently used terms in ML are normalization, latency, throughput, quantization, pruning, bias and early stopping.
  4. Model evaluation
    In machine learning, model evaluation is used to find which algorithm is best suited to solve our problem. It is done by calculating performance metrics, such as precision, recall, sensitivity, and specificity.
  5. Hyperparameters
    Hyperparameters express “higher-level” properties of the model, such as its complexity or how fast it should learn, and are usually fixed before training. The learning rate is one example. Hyperparameter tuning is the process of choosing a set of optimal hyperparameters for a learning algorithm.
  6. Gradient descent
    Gradient Descent is an essential optimization algorithm that helps us find the optimum parameters of our machine learning models. It has several variants, of which stochastic gradient descent is the most widely used. The reverse process is called gradient ascent. A sketch combining gradient descent with L2 regularization appears after this list.
  7. Ensemble methods
    Ensemble methods combine predictions from several models into a single one. Boosting, stacking, and voting classifiers are some ensemble techniques.
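
Tying together the regularization and gradient descent items above, here is a minimal NumPy sketch of batch gradient descent on an L2-regularized (ridge) linear regression. The synthetic data, learning rate, and regularization strength are illustrative assumptions.

```python
import numpy as np

# Loss: (1/n) * ||Xw - y||^2 + lam * ||w||^2  (ridge regression)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr, lam = 0.1, 0.01  # learning rate and L2 strength: both are hyperparameters
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w  # gradient of the loss
    w -= lr * grad  # step against the gradient (descent); '+=' would be ascent

print(np.round(w, 2))  # recovers weights close to [2.0, -1.0, 0.5]
```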

Week 3: Classification

  1. Classification
    Classification is categorizing data into different classes. This is based on making predictions using past examples. We feed some examples where we know what the correct prediction is into the model and the model learns from these examples to make accurate predictions in the future.
  2. Logistic Regression
    Logistic Regression is an efficient algorithm that aims to predict categorical values, often binary. It has its own advantages and disadvantages. It can be implemented using scikit-learn and TensorFlow; see the sketch after this list.
  3. K-Nearest Neighbours
    K-Nearest Neighbours is an algorithm which is used for classification and regression and is based on the idea of considering the nearest K data points for calculations. This example uses KNN for text classification.
  4. Decision tree
    The decision tree is a popular machine learning algorithm mainly used for classification. Usually, the ID3 algorithm is used to build a decision tree.
  5. Support Vector Machine
    SVMs are a particularly powerful and flexible class of supervised algorithms for both classification and regression. They have many advantages and applications and can be implemented easily.
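
As a concrete example of a classifier from this week, here is a minimal logistic regression sketch with scikit-learn, using one of its built-in datasets. The `max_iter` value is just a convenient assumption so the solver converges.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary classification on scikit-learn's built-in breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)  # extra iterations so the solver converges
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out test split
```

Swapping in `KNeighborsClassifier`, `DecisionTreeClassifier`, or `SVC` exercises the other algorithms from this week with the same fit/score interface.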

Week 4: Regression

  1. Regression
    Regression is a statistical method used in various fields to find out how strong the relationship is between a dependent variable and one or more independent variables.
  2. Linear Regression
    Linear Regression is a regression technique that models the relationship between a dependent variable and one or more independent variables using a linear approach. It has its own advantages and disadvantages. It can be implemented using scikit-learn and TensorFlow.
  3. Random forest
    Random forests are an ensemble learning method for classification and regression. They have various applications. This example uses random forests for regression.
  4. Polynomial regression
    Polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial rather than a straight line.
  5. Elastic Net regression
    Elastic Net regression uses Elastic Net regularization, which combines the L1 and L2 penalties.
  6. Ridge and Lasso regression
    Ridge and LASSO regression use the L2 and L1 regularizations that we saw previously; see the sketch after this list.
  7. Data analysis using regression techniques
    This article explains how regression analysis is done.
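
To see plain, ridge, and LASSO regression side by side, here is a minimal scikit-learn sketch. The toy data and the `alpha` values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Toy data: y = 3x + 1 plus a little noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X[:, 0] + 1 + rng.normal(scale=0.5, size=50)

# Ridge adds an L2 penalty, Lasso an L1 penalty; alpha sets its strength
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__,
          np.round(model.coef_, 2), round(float(model.intercept_), 2))
```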

Week 5: Unsupervised learning

  1. K-means clustering
    K-means clustering is a prime example of unsupervised learning and partitional clustering. An improved version of this is the K+ means clustering algorithm. A minimal K-means sketch appears after this list.
  2. DBSCAN clustering
    DBSCAN clustering is a density-based clustering method that identifies clusters in the dataset by finding regions that are more densely populated than others.
  3. Spectral clustering
    Spectral clustering is a technique with roots in graph theory, where the approach is used to identify communities of nodes in a graph based on the edges connecting them.
  4. Apriori algorithm
    The Apriori algorithm is an association rule learning algorithm that is generally used in data mining.
  5. Manifold learning
    Manifold learning is the process of modeling manifolds where the data lies. It is a technique used for dimensionality reduction.
  6. Principal component analysis
    Principal component analysis is a technique to bring out strong patterns in a dataset by suppressing variations. You can check out why PCA works to get a basic idea behind its working. KPCA is a variant of PCA.
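
A minimal K-means sketch with scikit-learn; the two synthetic blobs are made-up data, so the expected cluster centres are known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of 50 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.round(1))  # roughly (0, 0) and (5, 5)
print(km.labels_[:5])                # cluster assignment of the first 5 points
```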

Week 6: Deep Learning

  1. Different layers and activation functions
    There are many different layers in a deep learning model, such as the fully connected layer. Hidden layers are the most intriguing ones. We also have activation functions that decide the output of each neuron.
  2. Top Deep learning frameworks
    Top deep learning frameworks include TensorFlow, Keras, Caffe2, PyTorch, and many more.
  3. Commonly Used Neural Networks
    Commonly used neural networks include RBFNN, KNN, and the Hopfield Network.
  4. CNN models
    CNN models have evolved over time, and some of the commonly used models now include AlexNet, ResNet, GoogLeNet, Xception, DenseNet, and many more. CNNs are widely used in image recognition and classification; see the sketch after this list.
  5. Data Augmentation
    Data augmentation is the technique of increasing the amount of data used for training a model, typically by applying transformations to existing samples.
  6. GAN
    Generative Adversarial Networks (GANs) are an architecture for training a generative model. There are many types of GANs, like SRGAN, Deep Convolutional GAN, CycleGAN, and [Conditional GAN](https://iq.opengenus.org/conditional-generative-adversarial-net/).
  7. Inception models
    The Inception architecture is an important milestone in the development of CNN classifiers. It has several versions, such as Inception V3, Inception V4, and Inception-ResNet V1.
  8. VGG models
    VGG came into the picture to address the depth of CNNs. It comes in several configurations, such as VGG-11, VGG-16, and VGG-19.
  9. Boltzmann Machines
    Boltzmann Machines are models used to discover features in datasets composed of binary vectors. A Restricted Boltzmann Machine is a variant in which a visible node is not connected to any other visible node and is used in deep belief networks.
  10. YOLO
    YOLO is an object detection algorithm with variants like YOLOv3, YOLOv4, Scaled YOLOv4, YOLOR, and YOLOv5.
  11. SSD
    Single Shot Detection (SSD) is an object detection algorithm whose architecture is a modified version of VGG. It is used in the SSD MobileNetV1 and RefineDet models.
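
To make the layer vocabulary from this week concrete, here is a minimal Keras CNN sketch for 28x28 grayscale images. The layer sizes are illustrative choices, not a reference architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A tiny CNN for 28x28 grayscale inputs (e.g. MNIST-style digits)
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, 3, activation="relu"),  # convolution layer
    layers.MaxPooling2D(),                    # downsampling
    layers.Flatten(),
    layers.Dense(64, activation="relu"),      # fully connected hidden layer
    layers.Dense(10, activation="softmax"),   # one output per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # prints each layer and its parameter count
```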

Week 7: NLP

  1. Introduction
    NLP refers to the ability of computers to understand human speech or text as it is spoken or written. Some core topics are listed here. TF-IDF is an important metric in NLP, mostly used to find similarities between documents; see the sketch after this list.
  2. NLP models
    There are different types of NLP models present. Some of them are BERT, GPT, XLNet, RoBERTa and ALBERT.
  3. Text Preprocessing
    Text preprocessing is the process of converting human language text into machine-interpretable text for further use. Stemming (e.g. the Porter Stemmer algorithm) is one example.
  4. Text summarization
    Text summarization is the process of creating a compact yet accurate summary of text documents. Some techniques include Luhn's Heuristic Method, the Edmundson Heuristic Method, the SumBasic algorithm, KL-Sum, LexRank, TextRank, Reduction, Latent Semantic Analysis, and the use of RNNs.
  5. Topic Modelling
    There are different techniques for topic modelling. Some include Latent Dirichlet Allocation, Non-Negative Matrix Factorization, the Pachinko Allocation Model, and Latent Semantic Analysis.
  6. Information Retrieval
    Information Retrieval can be defined as finding material of an unstructured nature that satisfies an information need from within large collections. It uses the concept of indexing. The PageRank algorithm, used by the Google search engine, ranks web pages.
  7. Sentiment analysis
    There are various techniques to perform sentiment analysis. Using Naive Bayes classifier, Lexicon-based techniques, ML approaches and LSTM are some of them.
  8. Miscellaneous
    Some other important topics in NLP are document clustering, language identification techniques, spell correction, word embedding, word representations and byte pair encoding.
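
A minimal sketch of the TF-IDF document-similarity idea mentioned in the introduction item, using scikit-learn. The three short documents are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "data science uses statistics and machine learning",
    "machine learning models learn patterns from data",
    "cooking recipes for a quick weeknight dinner",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # one TF-IDF vector per document
sim = cosine_similarity(tfidf)
print(sim.round(2))  # the first two documents score higher with each other
```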

Week 8: Time series

  1. Introduction to Time Series Data
    In time series data, we have a collection of observations of a single entity at different time intervals. Weather records, economic indicators, and patient health metrics are all time series data.
  2. Basics of Time Series Prediction
    Time series prediction involves concepts like stationarity, moving averages, seasonality, and many more, which you should be familiar with in order to better understand time series forecasting; see the sketch after this list.
  3. Time series forecasting models and techniques
    Future trend prediction is made by discovering and analyzing underlying patterns in the time series data. Various methods and models are used for the same.
  4. Time series prediction techniques
    Various artificial neural network models are put to use when performing a time series prediction. This article elaborates on a few models.
  5. Time series forecasting-Example
    This is an example of time series forecasting where we put into use the techniques we saw in the previous articles.
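
As a tiny taste of the moving-average and stationarity ideas above, here is a pandas sketch; the series values are invented for illustration.

```python
import pandas as pd

# Ten days of a made-up daily series
ts = pd.Series([10, 12, 13, 12, 15, 16, 18, 17, 19, 21],
               index=pd.date_range("2023-01-01", periods=10, freq="D"))

ma = ts.rolling(window=3).mean()  # 3-day moving average smooths out noise
diff = ts.diff()                  # first difference, a common step toward stationarity
print(pd.DataFrame({"value": ts, "ma3": ma, "diff": diff}))
```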

Week 9: Statistics and probability

  1. Statistical features
    Statistical features are those features of a dataset that can be defined and calculated via statistical analysis. They are probably the most used statistical concepts in data science.
  2. Types of hypotheses
    A hypothesis is a precise, testable statement of what a researcher predicts will be the outcome of an experiment or study. There are different types of hypotheses that are widely used.
  3. Hypothesis testing
    Hypothesis testing is used to determine whether there is enough evidence in a sample to infer that a certain condition is true for the entire population. The F-test is one such hypothesis test.
  4. CLT and LLN
    The Central Limit Theorem and the Law of Large Numbers are two important statistical results that are often put to use in Data Science.
  5. Confidence intervals
    A confidence interval expresses a range of values within which we are reasonably confident that the population parameter lies.
  6. Bayesian model
    A Bayesian model is a statistical model where we use probability to represent all uncertainty within the model, both the uncertainty regarding the output and the uncertainty regarding the input to the model.
  7. Markov model
    A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Another variant of this is the Hidden Markov Model.
  8. A/B testing
    A/B testing is a famous testing technique used to compare two variants to determine the best of the two based on user experience. It is a randomized experimentation process.
  9. Simulated annealing
    Simulated annealing is a probabilistic optimization algorithm inspired by the physical annealing process used in metallurgy.
  10. Monte carlo sampling techniques
    Monte Carlo techniques are a group of computational algorithms that estimate numerical results by repeated random sampling of probability distributions; see the sketch after this list.
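
A classic minimal Monte Carlo sketch: estimating pi by random sampling, which also shows the Law of Large Numbers at work. The sample size is an arbitrary choice.

```python
import numpy as np

# Sample points uniformly in the unit square and count the fraction
# that lands inside the quarter circle of radius 1
rng = np.random.default_rng(0)
n = 1_000_000
pts = rng.uniform(0, 1, size=(n, 2))
inside = (pts ** 2).sum(axis=1) <= 1.0
print(4 * inside.mean())  # converges to pi as n grows (Law of Large Numbers)
```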

Week 10: Projects

  1. Project ideas
    This article contains a list of unique data science project ideas that you can explore.
  2. Face recognition
    Face recognition can be implemented in Python using Eigenfaces or Fisherfaces; see the sketch after this list.
  3. Fraud detection
    Fraud detection is the process of detecting fraudulent activity in credit card transactions and can be framed as an anomaly detection problem.
  4. Native Language Identification
    Native language identification is the task of determining an author's native language based only on their writings or speeches in a second language.
  5. Person re-identification
    Person re-identification is the task of using a picture of a person to identify the presence of the same person in a set of images or a video. It is used to identify a person in CCTV footage.
  6. Hindi Optical Character Recognition
    Hindi OCR is basically a model which is used to recognize handwritten Hindi (Devanagari) characters.
  7. Face reconstruction
    In this project, we find the set of faces that, when combined, result in the face of person 'A', using machine learning techniques like PCA and more.
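
For the face recognition project, here is a minimal Eigenfaces-style sketch using scikit-learn's PCA on its Olivetti faces dataset (downloaded on first run). The number of components is an illustrative assumption.

```python
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

# 400 face images of 40 people, each 64x64 pixels flattened to 4096 features
faces = fetch_olivetti_faces()

# The principal components of face images are the "eigenfaces";
# projecting a face onto them gives a compact representation
pca = PCA(n_components=50, whiten=True).fit(faces.data)
eigenfaces = pca.components_.reshape((50, 64, 64))
projected = pca.transform(faces.data)
print(projected.shape)  # (400, 50): each face as 50 eigenface coefficients
```

A simple classifier (e.g. nearest neighbours) trained on `projected` completes the recognition pipeline.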

Practice interview questions

  1. Basic data science questions
    This article contains a list of basic data science interview questions.
  2. Advanced data science questions
    This article contains a list of advanced data science interview questions.
  3. Python
    Python is the most-tested programming language during data science interviews.
  4. Machine learning
    Knowledge of various ML topics such as TensorFlow (basic level), TensorFlow (advanced level), convolution, regression, random forests, and PCA is widely tested.
  5. Deep learning
    Deep learning topics such as RNNs, fully connected layers, convolution layers, GANs, and autoencoders are frequently tested.
  6. NLP
    NLP topics such as text summarization, transformers, and BERT are important for interviews.

Generated by OpenGenus. Updated on 2023-11-27