Build Instructions for E3SM-IO Benchmark

Software Requirements

  • Autotools utility
    • autoconf 2.69
    • automake 1.16.1
    • libtool 2.4.6
    • m4 1.4.18
  • MPI C and C++ compilers
    • Configured with a C++11-compliant compiler (supporting constant initializers)
  • (Optional) PnetCDF 1.12.3
  • (Optional) HDF5 1.13.2
    • Configured with parallel I/O support (configured with --enable-parallel is required)
  • (Optional) HDF5 Log-based VOL
    • Experimental software developed as part of the Datalib project
  • (Optional) ADIOS 2.8.1
    • Configured with parallel I/O support (cmake with -DADIOS2_USE_MPI=ON is required)
  • (Optional) NetCDF-C 4.9.0
    • Configured with parallel HDF5 support (i.e. --enable-netcdf4)
    • Note that the benchmark fails to run with NetCDF-C versions prior to 4.9.0 because of a bug related to dimension scales.

Instructions for Building Dependent I/O Libraries

  • Build PnetCDF
    • Download an official PnetCDF release
    • Configure PnetCDF with MPI C compiler
    • Run make install
    • Example build commands are given below. This example will install the PnetCDF library under folder ${HOME}/PnetCDF/1.12.3.
      % wget https://parallel-netcdf.github.io/Release/pnetcdf-1.12.3.tar.gz
      % tar -zxf pnetcdf-1.12.3.tar.gz
      % cd pnetcdf-1.12.3
      % ./configure --prefix=${HOME}/PnetCDF/1.12.3 CC=mpicc
      % make -j 4 install
      
  • Build HDF5 with parallel I/O support
    • Download an official HDF5 release.
    • Configure HDF5 with parallel I/O enabled.
    • Run make install
    • Example build commands are given below. This example will install the HDF5 library under the folder ${HOME}/HDF5/1.13.2.
      % wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.13/hdf5-1.13.2/src/hdf5-1.13.2.tar.gz
      % tar -zxf hdf5-1.13.2.tar.gz
      % cd hdf5-1.13.2
      % ./configure --prefix=${HOME}/HDF5/1.13.2 --enable-parallel CC=mpicc
      % make -j 4 install
      
  • Build HDF5 log-based VOL plugin.
    • Download an official release of the log-based VOL.
    • Configure log-based VOL
      • Enable shared library support (--enable-shared)
      • Compile with zlib library to enable metadata compression (--enable-zlib)
    • Example commands are given below.
      % wget https://github.com/DataLib-ECP/vol-log-based/archive/refs/tags/logvol.1.2.0.tar.gz
      % tar -zxf logvol.1.2.0.tar.gz
      % cd vol-log-based-logvol.1.2.0
      % ./configure --prefix=${HOME}/Log_VOL/1.2.0 --with-hdf5=${HOME}/HDF5/1.13.2 --enable-shared CC=mpicc
      % make -j 4 install
      
  • Build ADIOS with parallel I/O support
    • Download and extract the ADIOS source code
    • Configure ADIOS with MPI support enabled (-DADIOS2_USE_MPI=ON)
    • Run make install
    • Example build commands are given below. This example will install the ADIOS2 library under the folder ${HOME}/ADIOS2/2.8.1.
      % wget https://github.com/ornladios/ADIOS2/archive/refs/tags/v2.8.1.tar.gz
      % tar -zxf v2.8.1.tar.gz
      % mkdir ADIOS2_BUILD
      % cd ADIOS2_BUILD
      % cmake -DCMAKE_INSTALL_PREFIX=${HOME}/ADIOS2/2.8.1 -DADIOS2_USE_MPI=ON ../ADIOS2-2.8.1
      % make -j 4 install
      
  • Build NetCDF-C
    • Download an official NetCDF-C release.
    • Configure NetCDF-C with parallel HDF5 I/O enabled.
    • Run make install
    • Example build commands are given below. This example will install the NetCDF library under the folder ${HOME}/NetCDF/4.9.0.
      % wget https://github.com/Unidata/netcdf-c/archive/refs/tags/v4.9.0.tar.gz
      % tar -zxf v4.9.0.tar.gz
      % cd netcdf-c-4.9.0
      % ./configure --prefix=${HOME}/NetCDF/4.9.0 \
                    CC=mpicc \
                    CPPFLAGS=-I${HOME}/HDF5/1.13.2/include \
                    LDFLAGS=-L${HOME}/HDF5/1.13.2/lib \
                    LIBS=-lhdf5
      % make -j 4 install
      

Build E3SM-IO

  • Clone this E3SM-I/O benchmark repository
  • Run command autoreconf -i
  • Configure the E3SM-I/O benchmark with MPI C and C++ compilers
    • Add PnetCDF installation path (--with-pnetcdf=/path/to/implementation) that contains the PnetCDF library. This is required when running the benchmark with PnetCDF I/O methods.
    • Add HDF5 installation path (--with-hdf5=/path/to/implementation) that contains the HDF5 library. This is required when running the benchmark with HDF5 based I/O methods.
    • Add HDF5 log-based VOL installation path (--with-logvol=/path/to/implementation). This is required when running the benchmark with command-line option -a hdf5_log -x log.
    • Add ADIOS installation path (--with-adios2=/path/to/implementation) to enable ADIOS API support. This is required when running the benchmark with command-line option -a adios -x blob.
    • Add NetCDF4 installation path (--with-netcdf4=/path/to/implementation) that contains the NetCDF4 library. This is required when running the benchmark with NetCDF4 I/O methods.
  • Run make
  • Example commands are given below.
    % git clone https://github.com/Parallel-NetCDF/E3SM-IO.git
    % cd E3SM-IO
    % autoreconf -i
    % ./configure --with-pnetcdf=${HOME}/PnetCDF/1.12.3 \
                  --with-hdf5=${HOME}/HDF5/1.13.2 \
                  --with-logvol=${HOME}/Log_VOL/1.2.0 \
                  --with-adios2=${HOME}/ADIOS2/2.8.1 \
                  --with-netcdf4=${HOME}/NetCDF/4.9.0 \
                  CC=mpicc CXX=mpicxx
    % make -j 8
    
  • The executable file, named 'e3sm_io', is created in folder 'src'.
  • Note that the make command can take a long time to finish, because about 1000 climate variables in total must be defined across all F/G/I cases, each with several attributes.

Prepare the Data Decomposition Map Files

  • Data decomposition maps generated by the PIO library are in text format (with file extension name ".dat"). The decomposition maps must first be combined and converted into a NetCDF file to be read in parallel as the input file by this benchmark program. For the F, G, and I cases, there are 3, 6, and 5 data decomposition text files, respectively.
  • See utils/README.md for instructions to run utility programs
    • dat2nc converts the decomposition map .dat files to NetCDF CDF-5 files.
    • dat2decomp is a more general utility program that can convert the decomposition map .dat files in text format to a CDF5/HDF5/NetCDF-4/BP file.
    • decomp_copy copies and converts a decomposition map file in an HDF5/NetCDF-4/BP format to a different format.

Run command

  • Example run commands using mpiexec and 16 MPI processes are given below.

    • Run the write test with default settings, i.e. using PnetCDF library and producing files storing variables in a canonical data layout.
      % mpiexec -n 16 src/e3sm_io -o can_F_out.nc datasets/map_f_case_16p.nc
      
  • The number of MPI processes used to run this benchmark can differ from the number used when creating the decomposition maps, i.e. the value of variable decomp_nprocs stored in the decomposition NetCDF file. For example, in file datasets/map_f_case_16p.nc, the value of scalar variable decomp_nprocs is 16, which is the number of MPI processes originally used to generate the decomposition .dat files. When running this benchmark using a smaller number of MPI processes, the I/O workload will be divided among all the allocated MPI processes. When using more processes than decomp_nprocs, the processes with MPI ranks greater than or equal to decomp_nprocs will have no data to write but still participate in the collective I/O in the benchmark.
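As a small illustration of the last case, the sketch below lists the ranks that would have no data to write when more processes are launched than decomp_nprocs. The helper name idle_ranks is hypothetical and not part of the benchmark; it only restates the rule given above.

```shell
# Hypothetical helper: print the MPI ranks that have no data to write,
# i.e. ranks greater than or equal to decomp_nprocs.
idle_ranks() {
    nprocs=$1          # number of MPI processes launched
    decomp_nprocs=$2   # value stored in the decomposition file
    rank=$decomp_nprocs
    while [ "$rank" -lt "$nprocs" ]; do
        echo "$rank"
        rank=$((rank + 1))
    done
}

# 20 processes on a map generated with 16: ranks 16 to 19 are idle.
idle_ranks 20 16
```

When the number of launched processes does not exceed decomp_nprocs, the helper prints nothing.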

  • Command-line Options:

      % ./e3sm_io -h
      Usage: ./e3sm_io [OPTION] FILE
         [-h] Print this help message
         [-v] Verbose mode
         [-k] Keep the output files when program exits (default: deleted)
         [-m] Run test using noncontiguous write buffer (default: contiguous)
         [-f num] Output history files h0 or h1: 0 for h0 only, 1 for h1 only,
                  -1 for both. Affect only F and I cases. (default: -1)
         [-r num] Number of time records/steps written in F case h1 file and I
                  case h0 file (default: 1)
         [-y num] Data flush frequency. (1: flush every time step, the default,
                  and -1: flush once for all time steps. (No effect on ADIOS
                  and HDF5 blob I/O options, which always flushes at file close).
         [-s num] Stride interval of ranks for selecting MPI processes to perform
                  I/O tasks (default: 1, i.e. all MPI processes).
         [-g num] Number of subfiles, used by ADIOS I/O only (default: 1).
         [-t time] Add sleep time to emulate the computation in order to 
                   overlapping I/O when Async VOL is used.
         [-o path] Output file path (folder name when subfiling is used, file
                   name otherwise).
         [-a api]  I/O library name
             pnetcdf:   PnetCDF library (default)
             netcdf4:   NetCDF-4 library
             hdf5:      HDF5 library
             hdf5_md:   HDF5 library using multi-dataset I/O APIs
             hdf5_log:  HDF5 library with Log-based VOL
             adios:     ADIOS library using BP3 format
         [-x strategy] I/O strategy
             canonical: Store variables in the canonical layout (default).
             log:       Store variables in the log-based storage layout.
             blob:      Pack and store all data written locally in a contiguous
                        block (blob), ignoring variable's canonical order.
         FILE: Name of input file storing data decomposition maps
    
  • Both F and I cases create two history files, referred to as 'h0' and 'h1' files. The file name supplied in option -o will be used to construct the output file names by inserting/appending strings "_h0" and "_h1" to indicate the two history files. If the supplied path contains file extension .nc or .h5, "_h0" and "_h1" will be inserted before the file extension. Otherwise, they will be appended at the end. See examples in "Output files" section below.
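The naming rule above can be sketched as a shell function. The name history_name is a hypothetical helper used only for illustration; the benchmark constructs these names internally, and this sketch covers only the .nc and .h5 extensions named in the rule.

```shell
# Hypothetical helper mirroring the naming rule described above.
# $1 is the path given to option -o, $2 is the history tag (h0 or h1).
history_name() {
    path=$1
    tag=$2
    case "$path" in
        *.nc) printf '%s_%s.nc\n' "${path%.nc}" "$tag" ;;   # insert before .nc
        *.h5) printf '%s_%s.h5\n' "${path%.h5}" "$tag" ;;   # insert before .h5
        *)    printf '%s_%s\n' "$path" "$tag" ;;            # otherwise append
    esac
}

history_name can_F_out.nc h0
history_name can_F_out.h5 h1
history_name blob_F_out h0
```

The first call yields can_F_out_h0.nc, the same history file name reported in the example screen output of this document.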

  • When using HDF5 API (i.e. "-a hdf5" or "-a hdf5_log"), the environment variable HDF5_VOL_CONNECTOR, if used, must be set to match the I/O strategy used.

    • If I/O strategy is canonical (i.e. "-a hdf5 -x canonical"), HDF5_VOL_CONNECTOR must not be set to use log-based VOL.
    • If I/O strategy is log ("-x log"), HDF5_VOL_CONNECTOR must be set to use log-based VOL.
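The two rules above can be checked before launching a run. The sketch below is a hypothetical pre-flight helper, not part of e3sm_io. It assumes that a connector string starting with "LOG" selects the log-based VOL, as in the netcdf4 log example elsewhere in this document.

```shell
# Hypothetical pre-flight check for HDF5_VOL_CONNECTOR.
# Assumption: a value starting with "LOG" selects the log-based VOL.
# $1 is the I/O strategy passed to option -x.
check_vol_env() {
    strategy=$1
    case "${HDF5_VOL_CONNECTOR:-}" in
        LOG*) uses_log=yes ;;
        *)    uses_log=no ;;
    esac
    case "$strategy" in
        canonical) [ "$uses_log" = no ] ;;    # must not select log-based VOL
        log)       [ "$uses_log" = yes ] ;;   # must select log-based VOL
        *)         return 0 ;;                # no rule stated for other strategies
    esac
}

HDF5_VOL_CONNECTOR="LOG under_vol=0;under_info={}"
if check_vol_env log; then
    echo "environment matches -x log"
fi
if ! check_vol_env canonical; then
    echo "unset HDF5_VOL_CONNECTOR before using -x canonical"
fi
```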

Current supported APIs (option -a) and I/O strategies (option -x)

  • Table below lists the supported combinations.

                 pnetcdf   hdf5   hdf5_log   netcdf4*   adios
    canonical    yes       yes    no         yes        no
    log          no        yes    yes        yes        no
    blob         yes       yes    no         no         yes

    * NetCDF-C version 4.9.0 or newer is required.
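For scripted runs, the table above can be encoded as a lookup so that unsupported combinations are rejected before mpiexec is called. The function name is_supported is a hypothetical helper; it covers only the five APIs listed in the table.

```shell
# Hypothetical helper encoding the support table above.
# $1 is the API (option -a), $2 is the I/O strategy (option -x).
is_supported() {
    case "$1:$2" in
        pnetcdf:canonical|pnetcdf:blob)    return 0 ;;
        hdf5:canonical|hdf5:log|hdf5:blob) return 0 ;;
        hdf5_log:log)                      return 0 ;;
        netcdf4:canonical|netcdf4:log)     return 0 ;;
        adios:blob)                        return 0 ;;
        *)                                 return 1 ;;
    esac
}

if is_supported pnetcdf blob; then
    echo "-a pnetcdf -x blob: supported"
fi
if ! is_supported adios canonical; then
    echo "-a adios -x canonical: not supported"
fi
```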

  • -a pnetcdf -x canonical

    • A single NetCDF file in the classic CDF5 format will be created. All variables stored in the file are in the canonical order and understandable by NetCDF and its third-party software.
    • If the output file system allows users to customize the file striping configuration, as Lustre does, writing to a folder with a high file striping count is recommended to obtain good I/O performance.
    • Example run command:
      mpiexec -n 16 src/e3sm_io datasets/map_f_case_16p.nc -k -o can_F_out.nc -a pnetcdf -x canonical -r 25
      
  • -a pnetcdf -x blob

    • Multiple subfiles in the NetCDF format will be created. The files conform with NetCDF file format specification.
    • There will be one subfile per compute node used.
    • File name provided in option -o will be used as a base to create the subfile names, which have the numerical IDs appended as the suffix.
    • Because all variables are stored in a blob fashion in the files, the subfiles altogether can only be understood by the conversion utility tool utils/pnetcdf_blob_replay.c, which is developed to run off-line after the completion of an E3SM run to convert the subfiles into a single regular NetCDF file.
    • The blobs are per-record based, which means all write requests to the same variable by different MPI processes are packed and stored in a contiguous file space, called blob. Within that blob, data layout follows the process rank order.
    • Example run command:
      mpiexec -n 16 src/e3sm_io datasets/map_f_case_16p.nc -k -o blob_F_out.nc -a pnetcdf -x blob -r 25
      
  • -a hdf5 -x canonical

    • This option writes/reads data using HDF5 APIs H5Dwrite/H5Dread.
    • The data layout of datasets stored in the output file is in a canonical order.
    • This option will ignore environment variables HDF5_VOL_CONNECTOR and HDF5_PLUGIN_PATH.
    • Example run command:
      mpiexec -n 16 src/e3sm_io datasets/map_f_case_16p.h5 -k -o can_F_out.h5 -a hdf5 -x canonical -r 25
      
  • -a hdf5_md -x canonical

    • This option writes/reads data using HDF5 multi-dataset APIs H5Dwrite_multi/H5Dread_multi.
    • The data layout of datasets stored in the output file is in a canonical order.
    • This option will ignore environment variables HDF5_VOL_CONNECTOR and HDF5_PLUGIN_PATH.
    • Example run command:
      mpiexec -n 16 src/e3sm_io datasets/map_f_case_16p.h5 -k -o can_F_out.h5 -a hdf5_md -x canonical -r 25
      
  • -a hdf5 -x blob

    • This is the blob I/O implementation using the HDF5 library. Different from the PnetCDF blob I/O, this implementation uses the per-process based blob I/O strategy, in which each process writes only one blob at file close time, no matter how many datasets/variables are written. All write requests to all variables by a process are first cached in memory until file close time, at which point they are packed into a contiguous buffer, a blob, to be flushed out in a single write call. There is an additional write for the header data blob written by the root process only. This per-process based strategy is the same one used by ADIOS.
    • Multiple subfiles in HDF5 format will be created.
    • There will be one subfile per compute node used.
    • File name provided in option -o will be used as a base to create the subfile names, which have the numerical IDs appended as the suffix.
    • The HDF5 subfiles cannot be understood by the traditional HDF5 software. A utility tool program will be developed in the future to convert the subfiles into a single regular HDF5 file.
    • Example run command:
      mpiexec -n 16 src/e3sm_io datasets/map_f_case_16p.h5 -k -o blob_F_out.h5 -a hdf5 -x blob -r 25
      
  • -a hdf5 -x log

    • This option writes data using the HDF5 log-based VOL.
    • API "H5Dwrite" is used to write the datasets.
    • The log-based VOL stores data in a log layout, rather than a canonical layout. The output file is a valid HDF5 file but requires the log-based VOL to read and understand the data structures.
    • This option will ignore environment variables HDF5_VOL_CONNECTOR and HDF5_PLUGIN_PATH.
    • Example run command:
      mpiexec -n 16 src/e3sm_io datasets/map_f_case_16p.h5 -k -o can_F_out.h5 -a hdf5 -x log -r 25
      
  • -a hdf5_log -x log

    • This option writes data using the HDF5 log-based VOL and specifically makes use of the new API, "H5Dwrite_n", to allow writing multiple subarrays of a dataset in a single API call.
    • This option will ignore environment variables HDF5_VOL_CONNECTOR and HDF5_PLUGIN_PATH.
    • Example run command:
      mpiexec -n 16 src/e3sm_io datasets/map_f_case_16p.h5 -k -o log_F_out.h5 -a hdf5_log -x log -r 25
      
  • -a netcdf4 -x canonical

    • This option writes data using the NetCDF-4 library.
    • The output files are in HDF5 format.
    • The data layout of datasets stored in the output file is in a canonical order.
    • Because the number of write requests differs among processes, the independent I/O mode is used when writing the data to files.
    • Example run command:
      mpiexec -n 16 src/e3sm_io datasets/map_f_case_16p.nc4 -k -o can_F_out.nc4 -a netcdf4 -x canonical -r 25
      
    • If environment variables HDF5_VOL_CONNECTOR and HDF5_PLUGIN_PATH are set to use log-based VOL, then the execution will abort, as this option is equivalent to the below one, i.e. -a netcdf4 -x log.
  • -a netcdf4 -x log

    • This option writes data using the NetCDF-4 library which calls the HDF5 log-based VOL underneath.
    • The log-based VOL stores data in a log layout, rather than a canonical layout. The output file is a valid HDF5 file but requires the log-based VOL to read and understand the data structures.
    • Requirements - The two environment variables HDF5_VOL_CONNECTOR and HDF5_PLUGIN_PATH must be set to use log-based VOL in order to run. The e3sm_io program will check and error out if they are not set.
    • Example run command:
      export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${HOME}/Log_VOL/1.2.0/lib
      export HDF5_PLUGIN_PATH=${HOME}/Log_VOL/1.2.0/lib
      export HDF5_VOL_CONNECTOR="LOG under_vol=0;under_info={}"
      mpiexec -n 16 src/e3sm_io datasets/map_f_case_16p.nc4 -k -o log_F_out.nc4 -a netcdf4 -x log -r 25
      
  • -a adios -x blob

    • This option writes data using the ADIOS library.
    • Multiple subfiles in BP format will be created.
    • The number of subfiles is determined by command-line option -g.
    • File name provided in option -o will be used as a base to create the folder names which store the subfiles. The folder names will have suffix ".bp.dir" appended. Each subfile name in its folder will have ".bp" and a numerical ID appended.
    • Because all variables are stored in a blob fashion in the files, the subfiles can only be understood by Scorpio's conversion utility tool, adios2pio-nm, which is developed to run off-line to convert the subfiles into a single regular NetCDF file.
    • This option requires the original PIO decomposition maps in the text format. They can be included in the converted NetCDF decomposition file by adding a command-line option -r when running dat2nc. See README file in folder utils for more instructions. If the original decomposition map is not in the decomposition file, the E3SM benchmark will create it by expanding the offset and length pairs in the converted decomposition map into a list of offsets accessed.
    • Example run command:
      mpiexec -n 16 src/e3sm_io datasets/map_f_case_16p.bp -k -o blob_F_out -a adios -x blob -r 25
      

Example input and job script files

  • Three small-size decomposition map files are available for testing. They are generated from E3SM runs on 16 MPI processes.
    • F case uses 3 decomposition maps.
      • File datasets/map_f_case_16p.nc is in NetCDF classic CDF-5 format
      • File datasets/map_f_case_16p.h5 is in HDF5 format
      • File datasets/map_f_case_16p.nc4 is in NetCDF4 format
      • File datasets/map_f_case_16p.bp is in ADIOS BP format
    • G case uses 6 decomposition maps.
      • File datasets/map_g_case_16p.nc is in NetCDF classic CDF-5 format
      • File datasets/map_g_case_16p.h5 is in HDF5 format
      • File datasets/map_g_case_16p.nc4 is in NetCDF4 format
      • File datasets/map_g_case_16p.bp is in ADIOS BP format
    • I case uses 5 decomposition maps.
      • File datasets/map_i_case_16p.nc is in NetCDF classic CDF-5 format
      • File datasets/map_i_case_16p.h5 is in HDF5 format
      • File datasets/map_i_case_16p.nc4 is in NetCDF4 format
      • File datasets/map_i_case_16p.bp is in ADIOS BP format
  • File datasets/f_case_48602x72_512p.nc contains 3 decomposition maps for a medium-size F case produced from a 512-process run.
  • Three large decomposition files are available upon request.
    • f_case_21600p.nc (266 MB) for F case produced from 21600 processes.
    • g_case_9600p.nc (303 MB) for G case produced from 9600 processes.
    • i_case_1344p.nc (12 MB) for I case produced from 1344 processes.
  • An example batch script file for running a job on Cori @NERSC with 8 KNL nodes, 64 MPI processes per node, is provided in slurm.knl.

Example Output Shown on Screen

  % mpiexec -n 16 src/e3sm_io -o can_F_out.nc datasets/map_f_case_16p.nc
  ==== Benchmarking F case =============================
  Total number of MPI processes      = 16
  Number of IO processes             = 16
  Input decomposition file           = datasets/map_f_case_16p.nc
  Number of decompositions           = 3
  Output file/directory              = can_F_out.nc
  Using noncontiguous write buffer   = no
  Variable write order: same as variables are defined
  ==== PnetCDF canonical I/O using varn API ============
  History output file                = can_F_out_h0.nc
  No. variables use no decomposition =     27
  No. variables use decomposition D0 =      1
  No. variables use decomposition D1 =    323
  No. variables use decomposition D2 =     63
  Total no. climate variables        =    414
  Total no. attributes               =   1421
  Total no. noncontiguous requests   = 1977687
  Max   no. noncontiguous requests   = 189503
  Min   no. noncontiguous requests   =  63170
  Write no. records (time dim)       =      1
  I/O flush frequency                =      1
  No. I/O flush calls                =      1
  -----------------------------------------------------------
  Total write amount                         = 16.16 MiB = 0.02 GiB
  Time of I/O preparing              min/max =   0.0008 /   0.0013
  Time of file open/create           min/max =   0.0005 /   0.0006
  Time of define variables           min/max =   0.0031 /   0.0033
  Time of posting write requests     min/max =   0.0124 /   0.0257
  Time of write flushing             min/max =   0.2817 /   0.2837
  Time of close                      min/max =   0.0029 /   0.0029
  end-to-end time                    min/max =   0.3175 /   0.3176
  Emulate computation time (sleep)   min/max =   0.0000 /   0.0000
  I/O bandwidth in MiB/sec (write-only)      = 56.9648
  I/O bandwidth in MiB/sec (open-to-close)   = 50.8962
  -----------------------------------------------------------
  ==== Benchmarking F case =============================
  Total number of MPI processes      = 16
  Number of IO processes             = 16
  Input decomposition file           = datasets/map_f_case_16p.nc
  Number of decompositions           = 3
  Output file/directory              = can_F_out.nc
  Using noncontiguous write buffer   = no
  Variable write order: same as variables are defined
  ==== PnetCDF canonical I/O using varn API ============
  History output file                = can_F_out_h1.nc
  No. variables use no decomposition =     27
  No. variables use decomposition D0 =      1
  No. variables use decomposition D1 =     22
  No. variables use decomposition D2 =      1
  Total no. climate variables        =     51
  Total no. attributes               =    142
  Total no. noncontiguous requests   =  38332
  Max   no. noncontiguous requests   =   3668
  Min   no. noncontiguous requests   =   1225
  Write no. records (time dim)       =      1
  I/O flush frequency                =      1
  No. I/O flush calls                =      1
  -----------------------------------------------------------
  Total write amount                         = 0.34 MiB = 0.00 GiB
  Time of I/O preparing              min/max =   0.0000 /   0.0000
  Time of file open/create           min/max =   0.0005 /   0.0005
  Time of define variables           min/max =   0.0002 /   0.0003
  Time of posting write requests     min/max =   0.0002 /   0.0004
  Time of write flushing             min/max =   0.0034 /   0.0034
  Time of close                      min/max =   0.0002 /   0.0003
  end-to-end time                    min/max =   0.0049 /   0.0049
  Emulate computation time (sleep)   min/max =   0.0000 /   0.0000
  I/O bandwidth in MiB/sec (write-only)      = 98.0321
  I/O bandwidth in MiB/sec (open-to-close)   = 68.3491
  -----------------------------------------------------------
  read_decomp=0.00 e3sm_io_core=0.32 MPI init-to-finalize=0.33
  -----------------------------------------------------------
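The reported bandwidths can be roughly cross-checked by hand. In the h0 report above, the numbers appear consistent with dividing the total write amount by the maximum time of the corresponding phase; this is inferred from the printed values, not taken from the benchmark source. The helper bandwidth below is hypothetical, and because the printed write amount is rounded to two decimals, the results only approximate the reported figures.

```shell
# Hypothetical helper: bandwidth in MiB/sec from an amount (MiB) and a time (sec).
bandwidth() {
    awk -v amount="$1" -v seconds="$2" 'BEGIN { printf "%.2f\n", amount / seconds }'
}

# Values copied from the h0 report above.
write_mib=16.16     # Total write amount
flush_max=0.2837    # Time of write flushing, max
e2e_max=0.3176      # end-to-end time, max

echo "write-only    ~ $(bandwidth "$write_mib" "$flush_max") MiB/sec"
echo "open-to-close ~ $(bandwidth "$write_mib" "$e2e_max") MiB/sec"
```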

Output Files

  • Command-line option -k keeps the output files; by default they are deleted when the program exits. For the F case, each run of e3sm_io produces two history output files whose names are created by inserting "_h0" and "_h1" into the user-supplied file name. The headers of the F case files produced by running the provided decomposition file map_f_case_16p.nc with PnetCDF, as printed by the command ncmpidump -h, are available in datasets/f_case_h0.txt and datasets/f_case_h1.txt.
  • The G case only creates one output file. When using the PnetCDF I/O method and the provided decomposition file map_g_case_16p.nc to run, the header of the output file can be found in datasets/g_case_hist.txt.
  • The option '-a adios' automatically appends the ".bp.dir" extension to the user-provided output path and creates two folders for the F and I cases (one for the G case).
    • The names of output subfiles will be appended with file extension ".bp.dir".

  • Copyright (C) 2021, Northwestern University.
  • See COPYRIGHT notice in top-level directory.