MLPerf™ HPC Training Rules

Table of Contents

1. Overview
2. Benchmarks
3. Divisions
- 3.1. Closed Division
- 3.2. Open Division
4. Exceptions to the MLPerf™ Training Rules
5. Data Set
- 5.1. Data State at Start of Run
6. Training Loop
- 6.1. Hyperparameters and Optimizer
7. Run Results
8. Benchmark Results

Version 3.0 May 15, 2023

Points of contact: Andreas Prodromou (aprodromou@nvidia.com), Murali Emani (memani@anl.gov), David Kanter (david@mlcommons.org)

1. Overview

All rules are taken from the MLPerf™ Training Rules, the version at the time of HPC rules freeze, (https://github.com/mlcommons/training_policies/commit/5a910009c991cb3b98448c9baf60b7ba018c3c06) except for those that are overridden here.

The MLPerf name and logo are trademarks of the MLCommons® Association ("MLCommons"). In order to refer to a result using the MLPerf name, the result must conform to the letter and spirit of the rules specified in this document. MLCommons reserves the right to solely determine if a use of its name or logos is acceptable.

2. Benchmarks

The benchmark suite consists of the benchmarks shown in the following table.

Problem	Dataset	Quality Target
Climate segmentation	CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background)	IOU 0.82
Cosmological parameter prediction	CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets	Mean average error 0.124
Modeling catalysts	Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set	Forces mean absolute error 0.036
Protein Structure Prediction (OpenFold)	OpenProteinSet and Protein Data Bank	Local Distance Difference Test (lDDT-Cα) >= 0.8

3. Divisions

There are two divisions of the HPC benchmark suite, the Closed division and the Open division.

3.1. Closed Division

The Closed division requires using the same preprocessing, model, and training method as the reference implementation.

The closed division models are:

Problem	Model
Climate segmentation	https://github.com/mlcommons/hpc/tree/main/deepcam
Cosmological parameter prediction	https://github.com/mlcommons/hpc/tree/main/cosmoflow
Modeling catalysts	https://github.com/mlcommons/hpc/tree/main/open_catalyst
Protein Structure Prediction	https://github.com/mlcommons/hpc/tree/main/openfold

3.2. Open Division

The open division enables novel implementations of the benchmarks to improve the metrics. The newer implementations will still need to be mathematically equivalent to the model reference implementation. Hyperparameters and optimizer may be freely changed.

4. Exceptions to the MLPerf™ Training Rules

In OpenFold benchmark it is allowed to use provided InitialTrainingDataloaderPQ class in the training loop. This custom dataloader uses non-blocking priority queue to enable higher throughput at the cost of non-deterministic order of samples.

5. Data Set

5.1. Data State at Start of Run

Each reference implementation includes a download script or broadly available method to acquire and verify the dataset.

Starting with submission round v3.0 (October 2023), data state at start of run follows the the same rules as MLPerf Training submissions.

Data can start on any durable storage system such as local disks and cloud storage systems. This explicitly excludes RAM.

— MLPerf Training Rules (Section 6.1)
as of May 15 2023

Submissions prior to v3.0 required data to start on a parallel, persistent file system. This requirement is no longer in effect.

6. Training Loop

6.1. Hyperparameters and Optimizer

CLOSED:

Allowed hyperparameter and optimizer settings are specified here. For anything not explicitly mentioned here, submissions must match the behavior and settings of the reference implementations.

Model	Name	Constraint	Definition	Reference Code
CosmoFlow	global_batch_size	unconstrained	the global batch size for training	local `batch_size` (`--batch-size`) times number of workers. Baseline config is 64
CosmoFlow	opt_name	"sgd"	the optimizer name	`--optimizer` or config
CosmoFlow	sgd_opt_momentum	0.9	SGD momentum	config
CosmoFlow	opt_base_learning_rate	unconstrained	The base learning rate	`base_lr` times scaling factor, e.g. `global_batch_size/base_batch_size` if scaling="linear". Config
CosmoFlow	opt_learning_rate_warmup_epochs	unconstrained	the number of epochs for learning rate to warm up to base value	config
CosmoFlow	opt_learning_rate_warmup_factor	unconstrained	the constant factor applied at learning rate warm up	scaled learning rate / `base_lr`
CosmoFlow	opt_learning_rate_decay_boundary_epochs	list of positive integers	Epochs at which learning rate decays	config
CosmoFlow	opt_learning_rate_decay_factor	`0 < value < 1`, and you may use a different value for each decay	the learning rate decay factor(s) at the decay boundary epochs	config
CosmoFlow	dropout	`0 ⇐ value < 1`	Dropout regularization probability for the dense layers	`dropout` setting in config
CosmoFlow	opt_weight_decay	`value >= 0`	L2 regularization parameter for the dense layers	`l2` setting in config
DeepCAM	global_batch_size	unconstrained	the global batch size for training	`--local_batch_size` times number of workers
DeepCAM	batchnorm_group_size	`value >= 1`	Determines how many ranks participate in the batchnorm	`--batchnorm_group_size`
DeepCAM	opt_name	Adam, AdamW, or LAMB	the optimizer name	`--optimizer`
DeepCAM	opt_eps	1e-6	epsilon for Adam	`--adam_eps`
DeepCAM	opt_betas	unconstrained	Momentum terms for Adam-type optimizers	`--optimizer_betas`
DeepCAM	opt_weight_decay	`value >= 0`	L2 weight regularization	`--weight_decay`
DeepCAM	opt_lr	unconstrained	the base learning rate	`--start_lr` times warmup factor
DeepCAM	scheduler_lr_warmup_steps	`value >= 0`	the number of epochs for learning rate to warm up to base value	`--lr_warmup_steps`
DeepCAM	scheduler_lr_warmup_factor	`value >= 1`	When warmup is used, the target learning_rate will be lr_warmup_factor * start_lr	`--lr_warmup_factor`
DeepCAM	scheduler_type	multistep or cosine_annealing	Specifies the learning rate schedule	`--lr_schedule`
DeepCAM	scheduler_milestones	unconstrained	If multistep, the steps at which learning rate is decayed	milestones in `--lr_schedule type="multistep",milestones="3000 10000",decay_rate="0.1"`
DeepCAM	scheduler_decay_rate	unconstrained	If multistep, the learning rate decay factor	decay_rate in `--lr_schedule type="multistep",milestones="15000 25000",decay_rate="0.1"`
DeepCAM	scheduler_t_max	`value >= 0`	For cosine_annealing, period length in steps	`--lr_schedule`
DeepCAM	scheduler_eta_min	`value >= 0`	For cosine_annealing, sets the minimal LR	`--lr_schedule`
DeepCAM	gradient_accumulation_frequency	`value >= 1`	Specifies the number of gradient accumulation steps before a weight update is performed	`--gradient_accumulation_frequency`
OpenCatalyst	global_batch_size	`value >= 1`	the global batch size	`batch_size` times number of GPUs
OpenCatalyst	opt_name	AdamW	the optimizer name	config setting `optim` `name`
OpenCatalyst	opt_base_learning_rate	`value > 0`	the base learning rate	config setting `lr_initial`
OpenCatalyst	opt_learning_rate_warmup_steps	`value >= 0`	the number of steps for learning rate to warm up to base value	`warmup_steps`
OpenCatalyst	opt_learning_rate_warmup_factor	`0 ⇐ value ⇐ 1`	the factor applied to the learning rate at the start of warmup	`warmup_factor`
OpenCatalyst	opt_learning_rate_decay_boundary_steps	list of positive integers	The steps at which learning rate is decayed	`lr_milestones`
OpenCatalyst	opt_learning_rate_decay_factor	`0 ⇐ value ⇐ 1`	the factor applied to decay the learning rate at each decay boundary step	`lr_gamma`
OpenFold	global_batch_size	`value >= 1`	the global batch size	`batch_size` times number of GPUs
OpenFold	opt_name	Adam	the optimizer name	not configurable
OpenFold	opt_base_learning_rate	`value >= 0.0`	Base learning rate value	`--base_lr`
OpenFold	opt_learning_rate_warmup_init	`value >= 0.0`	Warm-up initial learning rate value	`--warmup_lr_init`
OpenFold	opt_learning_rate_warmup_steps	`value >= 0`	Num iterations for learning rate warm-up	`--warmup_lr_iters`
OpenFold	initial_training_dataloader_type	InitialTrainingDataloaderPT or InitialTrainingDataloaderPQ	Which dataloader type use for training. PT - standard PyTorch DataLoader. PQ - custom dataloader with higher throughput	`--initial_training_dataloader_type`

OPEN: Hyperparameters and optimizer may be freely changed.

7. Run Results

Note: Starting with submission round v3.0 (October 2023), we are transitioning to more descriptive metric names. "Strong Scaling" is now Time To Solution (TTS) and "Weak Scaling" is now Throughput. Rules regarding these two metrics rename unchanged.

MLPerf HPC submissions consist of the following two metrics: Time To Solution (TTS) and Throughput. TTS is mandatory for a compliant submission whereas Throughput is optional:

7.1. Time To Solution (TTS)

This is a mandatory metric: see MLPerf Training Run Results for reference. The same rules apply here.

7.2. Throughput

This is an optional metric. It was designed to test the training capacity of a system.

Measurement: we will define 3 important parameters first.

number of models M: number of model instances which are going to be trained in this benchmark.
instance scale S: each individual model instance will be trained at this scale.
total utilized scale T: the total scale used for running this benchmark. For example, if all M models are trained concurrently, then T=M*S. More generally we can write that S⇐T⇐M*S if (some of) the models are trained sequentially.

Notes:

All three numbers M,S,T are chosen by the submitter. This allows the submitter to accomodate their submission to available machine resources, i.e. compute capacity and compute time.
S and T should be in units of compute resources, e.g. nodes, GPUs or other accelerators. This choice should be aligned with the HPC system description. For example, if the systems descriptions table lists number GPUs to define the scale of the system, then S should be specified in numbers of GPUs.
S and T can be chosen independently of the submission for metric 1 (strong scaling). We encourage to choose T as large as possible, ideally full system scale, but this is not required.

The submitter then trains M models on the resource partitioning (S,T) as defined above to convergence.

We define a Time-To-Train-all (TTTa) number by computing the difference between the end time of the instance which needs longest time to converge and the start time of the instance which starts up fastest. Mathematically this can be expressed as

TTTa = max(run_stop) - min(run_start) where the max/min are taken over all instances M.

Note: the submitter is allowed to prune this number by removing results from individual training instances. As long as the minimum number of models rule is satisfied (see section Benchmark Results below), the submission is valid. They then use a modified number of models M'⇐M and computes TTTa over the reduced set. This allows the submitter to remove occasional outliers or stragglers which would otherwise reduce the score disproportionally.

Reporting: the submitter reports the the tuple (T, S, M', TTTa). It is required to submit a separate MLLOG file for each of the training instances, so that reviewers can verify the quoted numbers. It is not allowed to merge logging files for individual instances.

Restrictions:

Due to large number of simultaneously-trained instances it’s possible that some random seeds will match. Runs with identical seeds must be pruned from final results. Submitters can avoid issue by choosing non-matching seeds for their runs.
The submitter must not report this score on its own. It has to be reported in conjunction with at least one score from Time To Solution (TTS) from the same benchmark.
this score does not allow for extrapolation. All reported M' training instances must have converged and it is not allowed to extrapolate results in S or T.
Due to large scale of weakly-scaled submissions, it’s possible that hardware failures can occur during training. Although unfortunate, this issue is not a sufficient reason to request a post-deadline re-run and re-submission. Submitters are responsible to plan ahead and give themselves enough time to overcome any challenges that may cause them to miss the submission deadline.
Pruned logs should be valid and compliant with the HPC benchmark rules, i.e. should pass the compliance checker script.

In case of Throughput resubmission due to HP borrowing: Due to the high overhead of these runs, submitters are not obligated to replace their original results. Instead they can opt to keep both sets of results (pre- and post- HP borrowing).

In case of a re-submission caused by HP borrowing: Resubmission scale can at most be the proven scale of the original submission. Max scale is proved with submitted log files, including log files of pruned results which should be submitted alongside non-pruned results, using the directory structure defined in submission rules.

Power

Optional power measurement numbers can also be reported by the submitters. These are to be captured from run_start to run_stop with node-level power sampling. The rules for measureing and reporting power efficiency numbers are followed by the training policies and adapted to the HPC group, if any, as listed here explicitly.

8. Benchmark Results

We follow MLPerf Training Benchmark Results rule along with the following required number of runs per benchmark. Note that since run-to-run variability is already captured by spatial multiplexing in case of metric 3, we use the adjusted requirement that the number of trained instances has to be at least equal to the number of runs for metric 1 and 2.

Benchmark	Number of Runs (Metric 1, 2)	M' (Metric 3)
DeepCAM	5	>=5
CosmoFlow	10	>=10
OpenCatalyst	5	>=5
OpenFold	10	>=10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hpc_training_rules.adoc

hpc_training_rules.adoc

MLPerf™ HPC Training Rules

1. Overview

2. Benchmarks

3. Divisions

3.1. Closed Division

3.2. Open Division

4. Exceptions to the MLPerf™ Training Rules

5. Data Set

5.1. Data State at Start of Run

6. Training Loop

6.1. Hyperparameters and Optimizer

7. Run Results

7.1. Time To Solution (TTS)

7.2. Throughput

Power

8. Benchmark Results

Files

hpc_training_rules.adoc

Latest commit

History

hpc_training_rules.adoc

File metadata and controls

MLPerf™ HPC Training Rules

1. Overview

2. Benchmarks

3. Divisions

3.1. Closed Division

3.2. Open Division

4. Exceptions to the MLPerf™ Training Rules

5. Data Set

5.1. Data State at Start of Run

6. Training Loop

6.1. Hyperparameters and Optimizer

7. Run Results

7.1. Time To Solution (TTS)

7.2. Throughput

Power

8. Benchmark Results