Training Manual


English | 中文版

Introduction

The whole training process is not end-to-end; it must be split into several phases to get the best performance. Reviewing the paper and the code structure will help you understand the training phases better.

As the paper shows, the whole model consists of two separate sub-models: EdgeModel and InpaintingModel. In practice, however, training these two models requires three phases, which yields the best results but can make the procedure confusing.

IMPORTANT: The three training phases I define here are called model in the original code (the --model flag), which should not be confused with EdgeModel and InpaintingModel.

| Phase | Command | Model | Input | Output | Description |
| --- | --- | --- | --- | --- | --- |
| 1st | `--model 1` | EdgeModel | Masked grayscale image + masked edge + mask | Full edge | Train EdgeModel alone |
| 2nd | `--model 2` | InpaintingModel | Masked image + full Canny edge from the original image + mask | Full image | Pre-train InpaintingModel alone so it learns the importance of edges |
| 3rd | `--model 3` | InpaintingModel | Masked image + full edge from the phase-1 output + mask | Full image | Actually train InpaintingModel with the edges predicted in phase 1 |
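
To make the data flow concrete, here is a minimal sketch of how the two sub-models chain together in phase 3. The `edge_model`/`inpaint_model` stand-ins are illustrative placeholders, not the project's actual classes:

```python
import torch

# Placeholders for the two trained sub-models; the real classes in the
# codebase have their own constructors and forward signatures.
edge_model = torch.nn.Identity()
inpaint_model = torch.nn.Identity()

def phase3_step(masked_gray, masked_edge, masked_image, mask):
    """Phase 3: InpaintingModel consumes edges predicted by EdgeModel."""
    # The phase-1 model reconstructs the full edge map from masked inputs.
    predicted_edge = edge_model(torch.cat([masked_gray, masked_edge, mask], dim=1))
    # Unlike phase 2 (which uses Canny edges of the original image),
    # phase 3 feeds the *predicted* edges to the inpainting network.
    return inpaint_model(torch.cat([masked_image, predicted_edge, mask], dim=1))
```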

Dataset

1. Prepare both an image dataset and a mask dataset.
   - Mask dataset:
     - The Irregular Mask Dataset (download link) provided by Liu et al. is recommended for handling typical irregular defects.
     - Block masks need no dataset; they are generated randomly by the code.
   - Image dataset:
     - The Places2, CelebA and Paris Street-View datasets are here.
     - The anime face dataset from getchu.com that I used: ANIME305
2. Split the whole image dataset into train/validation/test parts:
```bash
python scripts/flist_train_split.py --path <your dataset directory> --output <output path> --train 28 --val 1 --test 1
```

This script splits every 30 images into 28 for training, 1 for validation and 1 for testing. Images are split in filename order rather than shuffled, so that each split is evenly distributed over time. There should now be three .flist files in your <output path>, containing absolute image paths.
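
For reference, here is a minimal sketch of what such an ordered, round-robin split does; the real scripts/flist_train_split.py may differ in argument handling and file naming:

```python
import argparse
import os

# Minimal sketch of an ordered (non-shuffled) train/val/test split.
parser = argparse.ArgumentParser()
parser.add_argument('--path', required=True)
parser.add_argument('--output', required=True)
parser.add_argument('--train', type=int, default=28)
parser.add_argument('--val', type=int, default=1)
parser.add_argument('--test', type=int, default=1)
args = parser.parse_args()

# Absolute paths, sorted by filename (no shuffle).
files = sorted(
    os.path.join(os.path.abspath(args.path), f)
    for f in os.listdir(args.path)
)
cycle = args.train + args.val + args.test   # e.g. 30 images per round
splits = {'train': [], 'val': [], 'test': []}
for i, f in enumerate(files):
    r = i % cycle                            # position within each round
    if r < args.train:
        splits['train'].append(f)
    elif r < args.train + args.val:
        splits['val'].append(f)
    else:
        splits['test'].append(f)

os.makedirs(args.output, exist_ok=True)
for name, paths in splits.items():
    with open(os.path.join(args.output, name + '.flist'), 'w') as fh:
        fh.write('\n'.join(paths))
```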

3. Copy config.yml.example from the root directory into your model path, rename it to config.yml and edit it. These key parameters relate to the dataset:
   - Set MASK: 3 (recommended, as above; 4 is also feasible).
   - Set TRAIN_FLIST, VAL_FLIST and TEST_FLIST to the .flist paths produced in step 2.
   - Set TRAIN_MASK_FLIST, VAL_MASK_FLIST and TEST_MASK_FLIST to the mask dataset path from step 1.

Now my config.yml is:

```yaml
MODE: 1             # 1: train, 2: test, 3: eval
MODEL: 1            # 1: edge model, 2: inpaint model, 3: edge-inpaint model, 4: joint model
MASK: 3             # 1: random block, 2: half, 3: external, 4: (external, random block), 5: (external, random block, half)
EDGE: 1             # 1: canny, 2: external
NMS: 1              # 0: no non-max-suppression, 1: applies non-max-suppression on the external edges by multiplying by Canny
SEED: 10            # random seed
DEVICE: 1           # 0: CPU, 1: GPU
GPU: [0]            # list of gpu ids
DEBUG: 1            # turns on debugging mode
VERBOSE: 0          # turns on verbose mode in the output console
SKIP_PHASE2: 1      # normally phases 2 and 3 (model 2 -> model 3) run in order; 1 merges phase 2 into phase 3 to speed up training (at some cost in quality)

TRAIN_FLIST: <your path>/train.flist
VAL_FLIST: <your path>/val.flist
TEST_FLIST: <your path>/test.flist

TRAIN_EDGE_FLIST: ./
VAL_EDGE_FLIST: ./
TEST_EDGE_FLIST: ./

# the three options below can be the same
TRAIN_MASK_FLIST: <your mask dataset path>
VAL_MASK_FLIST: <your mask dataset path>
TEST_MASK_FLIST: <your mask dataset path>
```
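
If you want to inspect these settings programmatically, the YAML can be read back with PyYAML; a minimal sketch (the project's own config loader may wrap this differently):

```python
import yaml  # pip install pyyaml

# Load the training configuration; keys mirror the config.yml above.
with open('config.yml') as f:
    config = yaml.safe_load(f)

print(config['MODEL'], config['MASK'], config['TRAIN_FLIST'])
```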

Training Preparation

- Download the pretrained weight files, which are available on my page and from edge-connect.
- I strongly recommend starting transfer learning from these weight files. Otherwise, training from scratch takes about 10 days and 2 million iterations to converge.
- Create a model directory containing config.yml and the four .pth weight files.
- Edit the training-related options in config.yml:
  - Set DEVICE: 1 (a new option) to choose whether to use the GPU.
  - Set GPU: [0] to a list of GPU ids for multi-GPU training.
  - Set INPUT_SIZE to define the size input images are resized to.
  - Set BATCH_SIZE to fit your GPU memory.
  - Adjust the following options as you wish; a sketch of how these intervals gate the training loop follows the snippet:
```yaml
SAVE_INTERVAL: 1000           # how many iterations to wait before saving model (0: never)
SAMPLE_INTERVAL: 200          # how many iterations to wait before sampling (0: never)
SAMPLE_SIZE: 12               # number of images to sample
EVAL_INTERVAL: 0              # how many iterations to wait before model evaluation (0: never)
LOG_INTERVAL: 1000            # how many iterations to wait before logging training status (0: never)
PRINT_INTERVAL: 20            # how many iterations to wait before terminal prints training status (0: never)
```
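
These interval options all share the same semantics: the action fires whenever the iteration count reaches a multiple of the interval, and 0 disables it. A minimal sketch of that gating logic, with hypothetical `train_step`/`save_model`/`sample_images` helpers standing in for the project's real functions:

```python
# Hypothetical helpers; the real project wires these up differently.
def train_step(): pass
def save_model(): print('checkpoint saved')
def sample_images(n): print(f'sampled {n} images')

SAVE_INTERVAL, SAMPLE_INTERVAL, SAMPLE_SIZE = 1000, 200, 12

iteration = 0
while iteration < 5000:                 # stand-in for the real stopping rule
    train_step()
    iteration += 1
    # An interval of 0 means "never"; otherwise fire every N iterations.
    if SAVE_INTERVAL and iteration % SAVE_INTERVAL == 0:
        save_model()
    if SAMPLE_INTERVAL and iteration % SAMPLE_INTERVAL == 0:
        sample_images(SAMPLE_SIZE)
```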
    

Training

Before training, there are two optimizations in this work you should know about:

- An optional skip-phase-2 mode that merges phases 2 and 3 to speed up training. If this is unclear, refer to the Introduction above.
- Don't worry about checkpoint management: new checkpoint files are saved in the same model path, named with an iteration suffix, e.g. InpaintingModel_dis_2074000.pth. The latest checkpoints (identified by file name) are loaded automatically when training begins, as the sketch below illustrates.
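
A minimal sketch of how picking the latest checkpoint by file name can work (illustrative only; the actual loading logic lives in the project's model code):

```python
import glob
import re

def latest_checkpoint(model_dir, prefix='InpaintingModel_dis'):
    """Pick the checkpoint with the highest iteration suffix in its name."""
    paths = glob.glob(f'{model_dir}/{prefix}_*.pth')
    def iteration(path):
        m = re.search(r'_(\d+)\.pth$', path)
        return int(m.group(1)) if m else -1
    return max(paths, key=iteration, default=None)

# e.g. latest_checkpoint('<your model dir path>')
#   -> '<your model dir path>/InpaintingModel_dis_2074000.pth'
```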

Faster Training Steps

1. Train phase 1, which trains the EdgeModel:

```bash
python train.py --model 1 --path <your model dir path>
```

Check the samples from time to time and stop the training manually.

2. Train phases 2 and 3 together, which trains the InpaintingModel using the EdgeModel trained in step 1.

IMPORTANT: SKIP_PHASE2 must be 1 in config.yml!

```bash
python train.py --model 3 --path <your model dir path>
```

Check the samples from time to time and stop the training manually. That's all!

(Optional) Advanced Training Steps

- You can set SKIP_PHASE2 to 0 in config.yml to train phase 2 (with --model 2) and phase 3 separately, in any order.
- You can also stop training, change SIGMA in config.yml, and then resume training. This trick is rather delicate; a sketch follows below.
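
If you prefer to script the SIGMA change, a minimal sketch using PyYAML is below; note that round-tripping the file this way drops its comments, so hand-editing is often the safer choice:

```python
import yaml

# Adjust the Canny edge sigma between training runs, then restart train.py.
path = 'config.yml'
with open(path) as f:
    config = yaml.safe_load(f)

config['SIGMA'] = 1.5   # example value; pick what suits your data
with open(path, 'w') as f:
    yaml.safe_dump(config, f)
```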

Training Guide 🇨🇳

Introduction (Must Read)

The whole training process is not end-to-end; following the paper, it is split into several phases to get the best results. This is a bit complicated, so understanding the paper and reading the code structure will help you follow along.

The paper says the whole model consists of two sub-models: EdgeModel and InpaintingModel. But according to the code, getting the best results actually means training these two sub-models across three training phases, followed by test and eval, so everything becomes confusing. Don't worry, this manual explains it all clearly~

IMPORTANT: What I call a phase here is called model in the original code. Because that name is easy to confuse with EdgeModel and InpaintingModel, I use "phase" instead.

e.g. the --model argument on the training command line specifies what I call the phase.

| Phase | Command | Model | Input | Output | Description |
| --- | --- | --- | --- | --- | --- |
| 1st | `--model 1` | EdgeModel | Masked grayscale image + masked edge + mask | Full edge | Train EdgeModel alone |
| 2nd | `--model 2` | InpaintingModel | Masked image + full Canny edge from the original image + mask | Full image | Pre-train InpaintingModel alone so it learns the importance of edges |
| 3rd | `--model 3` | InpaintingModel | Masked image + full edge from the phase-1 output + mask | Full image | Actually train InpaintingModel, using the edges output by phase 1 |

Dataset Preparation

1. We need to prepare both an image dataset and a mask dataset:
   - Mask dataset:
     - The Irregular Mask Dataset (download link) from Liu et al. is recommended for handling irregular image defects.
     - Regular block masks need no dataset; they can be generated by the code.
   - Image dataset:
     - The Places2, CelebA and Paris Street-View datasets are here.
     - The anime face dataset from getchu.com is at ANIME305.
2. Next, split the image dataset into train/validation/test parts (the mask dataset does not need splitting):
```bash
python scripts/flist_train_split.py --path <your dataset directory> --output <output path> --train 28 --val 1 --test 1
```

By default, this script splits every 30 images into 28 for training, 1 for validation and 1 for testing. Note that the images are not shuffled: they are distributed round by round in filename order, because the anime face dataset is sorted by year and we want each split to be evenly distributed over time. Modify the script to fit your own dataset. There should now be three .flist files under <output path> containing the absolute image paths.

3. Copy config.yml.example from the root directory into your model directory, rename it to config.yml and edit it. The following dataset-related options need to be changed:
   - Set MASK: 3 (4 is also recommended).
   - Set TRAIN_FLIST, VAL_FLIST and TEST_FLIST to your .flist paths.
   - Set TRAIN_MASK_FLIST, VAL_MASK_FLIST and TEST_MASK_FLIST to your mask dataset path (all three the same).

So far my config.yml looks like this:

```yaml
MODE: 1             # 1: train, 2: test, 3: eval
MODEL: 1            # 1: edge model, 2: inpaint model, 3: edge-inpaint model, 4: joint model
MASK: 3             # 1: random block, 2: half, 3: external, 4: (external, random block), 5: (external, random block, half)
EDGE: 1             # 1: canny, 2: external
NMS: 1              # 0: no non-max-suppression, 1: applies non-max-suppression on the external edges by multiplying by Canny
SEED: 10            # random seed
DEVICE: 1           # 0: CPU, 1: GPU
GPU: [0]            # list of gpu ids
DEBUG: 1            # turns on debugging mode
VERBOSE: 0          # turns on verbose mode in the output console
SKIP_PHASE2: 1      # normally phases 2 and 3 (model 2 -> model 3) run in order; 1 merges phase 2 into phase 3 to speed up training (at some cost in quality)

TRAIN_FLIST: <your path>/train.flist
VAL_FLIST: <your path>/val.flist
TEST_FLIST: <your path>/test.flist

TRAIN_EDGE_FLIST: ./
VAL_EDGE_FLIST: ./
TEST_EDGE_FLIST: ./

# the three options below can be the same
TRAIN_MASK_FLIST: <your mask dataset path>
VAL_MASK_FLIST: <your mask dataset path>
TEST_MASK_FLIST: <your mask dataset path>
```

Training Preparation

- Download the pretrained model files from my page and edge-connect.
- I strongly recommend doing transfer learning on top of the pretrained files. Training from scratch takes about 10 days and 2 million iterations to converge (transfer learning takes roughly a tenth of that time).
- Put your config.yml and the four .pth weight files in the same model directory.
- Edit the training-related options in config.yml:
  - Set DEVICE: 1 to choose whether to use the GPU.
  - Set GPU: [0] if you have multiple GPUs for parallel training.
  - Set INPUT_SIZE to define the crop size of input images.
  - Set BATCH_SIZE to fit your GPU memory.
  - Adjust the following training options:

```yaml
SAVE_INTERVAL: 1000           # how many iterations to wait before saving model (0: never)
SAMPLE_INTERVAL: 200          # how many iterations to wait before sampling (0: never)
SAMPLE_SIZE: 12               # number of images to sample
EVAL_INTERVAL: 0              # how many iterations to wait before model evaluation (0: never)
LOG_INTERVAL: 1000            # how many iterations to wait before logging training status (0: never)
PRINT_INTERVAL: 20            # how many iterations to wait before terminal prints training status (0: never)
```
    

Start Training!

Before training, there are two optimizations in this project you must know about:

- To speed up training, a mode that skips phase 2 (in fact, it merges phases 2 and 3) is provided, controlled by the SKIP_PHASE2 option. If this is unclear, go back to the phase descriptions in the introduction.
- Don't worry about checkpoint storage:
  - New checkpoints are saved in the model directory, with the iteration count appended to the name, e.g. InpaintingModel_dis_2074000.pth.
  - When training starts, the latest .pth model files (judged by file name) are loaded automatically.

Quick Training with Two Commands

1. Train phase 1, which corresponds to the EdgeModel:

```bash
python train.py --model 1 --path <your model dir path>
```

Check the samples from time to time and stop the training manually.

2. Train phase 3, which corresponds to the InpaintingModel and uses the EdgeModel .pth trained in the previous step.

IMPORTANT: Since we skip phase 2 (it is actually merged into phase 3), SKIP_PHASE2 must be set to 1 in config.yml!

```bash
python train.py --model 3 --path <your model dir path>
```

Check the samples from time to time and stop manually. Training done~

(Optional) Advanced Training Techniques

- Set SKIP_PHASE2 to 0 to train phase 2 (with --model 2); phases 2 and 3 can then be trained alternately in any order, e.g. one day of phase 2, one day of phase 3, then phase 2 again, and so on. You don't need to worry about the checkpoint files.
- Interrupt training, adjust SIGMA in config.yml, then resume training. (Tricky.)