- Email is the preferred method of communication. A class mailing list will be created as PHBS.MLF@allmail.net. Announcements, however, will be made in the DingTalk group chat.
- Course slides: Intro | Regression | SVM/KNN/Tree | SVD/PCA/LDA | Hyperparameter | Neural Network | Graphical Model
- Project: Current | 2019 | 2018 | 2017 | 2016
- Past years' exams: 2019 (online take-home) | 2018 | 2017 | Exams from Tom Mitchell's ML course (Carnegie Mellon University)
No | Date | Contents |
---|---|---|
01 | 9.06 Mon | Course overview (Syllabus); Required software (Python, Github, PyCharm); Python crash course (Basic, Numpy, Notebook Shortcut Keys) |
02 | 9.09 Thur | Intro (Slides, Reading: PML Ch. 1); Notations, Regression, Weight update (Slides) |
03 | 9.13 Mon | PML Ch. 2. Perceptron, Adaline, Gradient descent, Stochastic Gradient Descent |
04 | 9.16 Thur | PML Ch. 3. Logistic Regression (LR) (Slides) and Support Vector Machine (SVM) (Slides) |
05 | 9.20 Mon | Pandas crash course (Notebook. Also see Datacamp, CheatSheet); KNN (Slides, Reading: PML Ch. 3) |
06 | 9.23 Thur | PML Ch. 3 Code. Decision Tree (Slides, Reading: PML Ch. 3). |
07 | 9.27 Mon | Data Preprocessing (Reading: PML Ch. 4), SVD/PCA (Slides, Reading: PML Ch. 5) |
08 | 10.11 Mon | LDA (Slides, Reading: PML Ch. 5), Hyperparameters (Slides, Reading: PML Ch. 6) |
09 | 10.13 Wed | Bias-Variance, Cross-validation (Slides, Reading: PML Ch. 6) |
10 | 10.14 Thur | Evaluation Metric (Slides, Reading: PML Ch. 6), Ensemble (Reading: PML Ch. 7) |
11 | 10.18 Mon | Neural Network, Deep Learning, CNN (Reading: Ch. 12-15) |
12 | 10.21 Thur | Practical issues of applying ML to the real world. |
13 | 10.25 Mon | Topics in ML in Finance |
14 | 10.28 Thur | Topics in ML in Finance |
15 | 11.01 Mon | Midterm Exam |
16 | 11.04 Thur | HSBC Guest Lecture [1/2]; Midterm exam review |
17 | 11.08 Mon | HSBC Guest Lecture [2/2] |
18 | 11.11 Thur | Course Project Presentation |
-
- Register on Github.com and let the TA know your ID (by DingTalk). Make sure to use your full real name in your profile. Accept the invitation to the PHBS organization from the TA.
- Create a designated repository `GITHUB_ID/PHBS_MLF_2021` for your HW and project. Tick `Initialize this repository with a README` and select `python` under `.gitignore`.
- Fork PML repository to your repository.
- Install Github Desktop. Then clone the PML repository to your local storage.
- Install Anaconda Python distribution (3.X version, not 2.X version). Anaconda distribution is core Python + useful scientific computation libraries (e.g., numpy, scipy, pandas) + package management system (pip or conda)
- Install PyCharm Community version. (Or Professional version after applying for free student license)
- Send to TA the screenshots of (1) Github Desktop (showing the PML repository) (2) Jupyter Notebook (Anaconda) (3) PyCharm (See my example).
-
- The goal of this HW is to be familiar with the `pandas` package and dataframes. Due to limited time, I cannot cover pandas in class. You need to teach yourself. Remember that there are many ways to do the tasks below. Use your own way.
- For this HW, we will use the Polish companies bankruptcy data Data Set from the UCI Machine Learning Repository. Download the dataset and put the 4th year file (`4year.arff`) in your `YOUR_GITHUB_ID/PHBS_MLF_2021/data/`.
- I did basic processing of the data (loading to a dataframe and creating the `bankruptcy` column). See my github.
- We are going to use the following 4 features: `X1 net profit / total assets`, `X2 total liabilities / total assets`, `X7 EBIT / total assets`, `X10 equity / total assets`, and `class`.
- Create a new dataframe with only the 4 features (and `Bankruptcy`). Properly rename the columns to `X1`, `X2`, `X7`, and `X10`.
- Fill in the missing values (`nan`) with the column means. (Use `DataFrame.fillna()` or see Ch. 4 of `PML`.)
- Find the mean and std of the 4 features among all, bankrupt, and still-operating companies (3 groups).
- How many companies satisfy the condition `X1 < mean(X1) - std(X1)` AND `X10 < mean(X10) - std(X10)`?
- What is the ratio of the bankrupt companies among the sub-groups above?
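The pandas steps in this set can be sketched roughly as follows. This is only an illustration under stated assumptions, not a reference solution: it runs on synthetic data instead of `4year.arff`, and the raw column names (`Attr1`, `Attr2`, `Attr7`, `Attr10`) and the 0/1 coding of `bankruptcy` are assumed here rather than taken from the course notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the loaded data (made-up values; column names are assumptions).
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame({
    "Attr1": rng.normal(0.05, 0.2, n),
    "Attr2": rng.normal(0.5, 0.3, n),
    "Attr7": rng.normal(0.06, 0.2, n),
    "Attr10": rng.normal(0.5, 0.3, n),
    "bankruptcy": rng.integers(0, 2, n),
})
raw.loc[raw.index[::25], "Attr1"] = np.nan  # inject a few missing values

# Keep the 4 features plus the label, and rename the feature columns.
cols = {"Attr1": "X1", "Attr2": "X2", "Attr7": "X7", "Attr10": "X10"}
df = raw[list(cols) + ["bankruptcy"]].rename(columns=cols)
features = ["X1", "X2", "X7", "X10"]

# Fill missing values with the column means.
df[features] = df[features].fillna(df[features].mean())

# Mean and std over all companies, and separately per bankruptcy group.
stats_all = df[features].agg(["mean", "std"])
stats_by_group = df.groupby("bankruptcy")[features].agg(["mean", "std"])

# Companies with X1 < mean(X1) - std(X1) AND X10 < mean(X10) - std(X10).
mask = (
    (df["X1"] < df["X1"].mean() - df["X1"].std())
    & (df["X10"] < df["X10"].mean() - df["X10"].std())
)
n_selected = int(mask.sum())

# Share of bankrupt companies within that sub-group (NaN if the sub-group is empty).
bankrupt_ratio = df.loc[mask, "bankruptcy"].mean() if n_selected else float("nan")

print(stats_all)
print(stats_by_group)
print(n_selected, bankrupt_ratio)
```

Because the label is coded 0/1 in this sketch, the mean of `bankruptcy` over the masked rows is the ratio of bankrupt companies. Other approaches (for example `DataFrame.query` or `describe`) would work equally well.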
-
- The goal of this HW is to be familiar with the basic classifiers in PML Ch. 3.
- For this HW, we continue to use the Polish companies bankruptcy data Data Set from the UCI Machine Learning Repository. Download the dataset and put the 4th year file (`4year.arff`) in your `YOUR_GITHUB_ID/PHBS_MLF_2021/HW2/`.
- I did basic processing of the data (loading to a dataframe, creating the `bankruptcy` column, changing column names, filling in `na` values, training-vs-test split, standardization, etc.). See my github.
- Select the 2 most important features using LogisticRegression with L1 penalty. (Adjust C until you see 2 features.)
- Using the 2 selected features, apply LR / SVM / decision tree. Try your own hyperparameters (C, gamma, tree depth, etc.) to maximize the prediction accuracy. (Just try several values. You don't need to show your answer is the maximum.)
- Visualize your classifiers using the `plot_decision_regions` function from PML Ch. 3.
- Put your result in `YOUR_GITHUB_ID/PHBS_MLF_2021/HW2/Classifiers.ipynb`.
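The feature-selection and classification steps could look roughly like the sketch below. It is a minimal illustration on synthetic data from `make_classification`, not the bankruptcy data, and the helper `select_two_features`, the grid of `C` values, and the hyperparameters are all hypothetical choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed data (an assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train_std, X_test_std = scaler.transform(X_train), scaler.transform(X_test)


def select_two_features(X, y):
    """Hypothetical helper: strengthen the L1 penalty (smaller C) until
    at most two coefficients remain non-zero."""
    best = None
    for C in np.logspace(1, -3, 60):  # from weak to strong regularization
        lr = LogisticRegression(penalty="l1", C=C, solver="liblinear")
        coef = lr.fit(X, y).coef_[0]
        nonzero = np.flatnonzero(coef)
        if len(nonzero) >= 2:
            # Remember the two largest |coefficients| of the sparsest fit so far.
            best = nonzero[np.argsort(-np.abs(coef[nonzero]))[:2]]
        if len(nonzero) <= 2:
            break
    return np.sort(best)


selected = select_two_features(X_train_std, y_train)
X2_train, X2_test = X_train_std[:, selected], X_test_std[:, selected]

# Illustrative hyperparameters only; the HW asks you to try your own.
models = {
    "LR": LogisticRegression(C=1.0),
    "SVM": SVC(kernel="rbf", C=1.0, gamma=0.5),
    "Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}
accuracy = {name: model.fit(X2_train, y_train).score(X2_test, y_test)
            for name, model in models.items()}
print(selected, accuracy)
```

Each fitted model, together with `X2_train` and `y_train`, could then be handed to `plot_decision_regions` from PML Ch. 3 for the visualization step.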
-
- The goal of this HW is to be familiar with PCA (feature extraction), grid search, pipeline, etc.
- For this HW, we continue to use the Polish companies bankruptcy data Data Set from the UCI Machine Learning Repository. Download the dataset and put the 4th year file (`4year.arff`) in your `YOUR_GITHUB_ID/PHBS_MLF_2021/HW3/`.
- Use the same preprocessing provided in Set 2 (loading to a dataframe, creating the `bankruptcy` column, changing column names, filling in `na` values, training-vs-test split, standardization, etc.). See my github.
- Extract 3 features using the PCA method.
- Using the selected features from above, we are going to apply LR / SVM / decision tree.
- Implement the methods using pipeline. (PML p185)
- Use grid search for finding optimal hyperparameters. (PML p199). In the search, apply 10-fold cross-validation.
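One way to combine PCA, a pipeline, and grid search with 10-fold cross-validation is sketched below. It assumes scikit-learn and uses synthetic data; the parameter grids are arbitrary examples, and the step names (`logisticregression`, `svc`, `decisiontreeclassifier`) are the ones `make_pipeline` generates from the class names.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bankruptcy data (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Classifier plus an illustrative hyperparameter grid for each method.
candidates = {
    "LR": (LogisticRegression(max_iter=1000),
           {"logisticregression__C": [0.01, 0.1, 1, 10]}),
    "SVM": (SVC(),
            {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}),
    "Tree": (DecisionTreeClassifier(random_state=0),
             {"decisiontreeclassifier__max_depth": [2, 4, 6]}),
}

results = {}
for name, (clf, grid) in candidates.items():
    # Standardize -> extract 3 principal components -> classify.
    pipe = make_pipeline(StandardScaler(), PCA(n_components=3), clf)
    gs = GridSearchCV(pipe, grid, scoring="accuracy", cv=10)  # 10-fold CV
    gs.fit(X_train, y_train)
    results[name] = {
        "best_params": gs.best_params_,
        "cv_accuracy": gs.best_score_,
        "test_accuracy": gs.score(X_test, y_test),
    }

for name, res in results.items():
    print(name, res)
```

Putting the scaler and PCA inside the pipeline means both are refit on the training folds only during cross-validation, so the validation fold does not leak into the preprocessing.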
- Lectures: Monday & Thursday 1:30 – 3:20 PM
- Venue: PHBS Building, Room 231
Instructor: Jaehyuk Choi
- Office: PHBS Building, Room 755
- Phone: 86-755-2603-0568
- Email: jaehyuk@phbs.pku.edu.cn
- Office Hour: TBA
- Email: pkuscc@stu.pku.edu.cn
- TA Office Hour (Room 213/214): Monday & Thursday 7-9 PM
With the advent of computation power and big data, machine learning (ML) has recently become one of the most spotlighted research fields in industry and academia. This course provides a broad introduction to ML from theoretical and practical perspectives. Through this course, students will learn the intuition and implementation behind the popular ML methods and gain hands-on experience using ML software packages such as scikit-learn and TensorFlow. This course will also explore the possibility of applying ML to finance and business. Each student is required to complete a final course project. This year, the compliance analytics team at HSBC bank (Guangzhou) will give 2 guest lectures to demonstrate how ML is developed and shared in the banking industry.
This course assumes prior knowledge of probability/statistics and experience in Python. It is recommended for those who have taken introductory ML/AI courses in an undergraduate program.
- PML (primary textbook): Python Machine Learning 3rd Ed. by Sebastian Raschka.
- Github (PHBS fork)
- ISLR: An Introduction to Statistical Learning (with Applications in R) by James, Witten, Hastie, and Tibshirani
- Python Implementation: PHBS/ISLR-python (PHBS fork)
- Bishop: Pattern Recognition and Machine Learning by Bishop (Microsoft)
- ESL: The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman
- CML: Coursera Machine Learning by Andrew Ng
- DL: Deep Learning by Goodfellow, Bengio, and Courville
- AFML: Advances in Financial Machine Learning by López de Prado
- Attendance 20%, Mid-term exam 30%, Assignments 20%, Course Project 30%
- Attendance: checked randomly. The score is calculated as `20 – 2 x (# of absences)`. Leave requests should be made 24 hours in advance with supporting documents, except for emergencies. A job interview or internship is not a valid reason for leave.
- Mid-term exam: 4.7 Tues. In-class open-book without computer/phone/calculator
- Course project: Data Proposal and Presentation. Group of up to ?? people.
- Grade in letters (e.g., A+, A-, ..., D+, D, F). A- or above < 30% and B- or below > 10%.
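The attendance rule and the grading weights above come down to simple arithmetic, sketched here. The function names are hypothetical, and flooring the attendance score at 0 and treating each component as a 0-100 percentage are assumptions that the syllabus does not state.

```python
# Component weights from the syllabus (in points out of 100).
WEIGHTS = {"attendance": 20, "midterm": 30, "assignments": 20, "project": 30}


def attendance_score(absences: int) -> int:
    """20 - 2 x (# of absences); the floor at 0 is an assumption."""
    return max(0, WEIGHTS["attendance"] - 2 * absences)


def total_score(absences: int, midterm_pct: float,
                assignments_pct: float, project_pct: float) -> float:
    """Weighted total, assuming each *_pct is a 0-100 percentage."""
    return (
        attendance_score(absences)
        + WEIGHTS["midterm"] * midterm_pct / 100
        + WEIGHTS["assignments"] * assignments_pct / 100
        + WEIGHTS["project"] * project_pct / 100
    )


print(attendance_score(3))
print(total_score(3, midterm_pct=80, assignments_pct=90, project_pct=85))
```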