# -*- Mode: sh; sh-basic-offset:2 ; indent-tabs-mode:nil -*-
#
# Copyright (c) 2014-2017 Los Alamos National Security, LLC. All rights
# reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
HIO Readme
==========
Last updated 2017-01-12
libHIO is a flexible, high-performance parallel IO package developed at LANL.
libHIO supports IO to either a conventional parallel file system (PFS) or to
DataWarp, including management of DataWarp space and stage-in/stage-out
between DataWarp and the PFS.
libHIO has been released as open source and is available at:
https://github.com/hpc/libhio
For more information on using libHIO, see the GitHub repository, in particular:
README
libhio_api.pdf
hio_example.c
README.datawarp
See file NEWS for a description of changes to hio. Note that this README
refers to various LANL clusters that have been used for testing HIO. Using
HIO in other environments may require some adjustments.
Building
--------
HIO uses a standard autoconf/automake build process. To build:
1) Untar
2) cd to root of tarball
3) module load needed compiler or MPI environment
4) ./configure
5) make
Additional generally useful make targets include clean and docs. make docs
will build the HIO API document, but it requires doxygen and various latex
packages to run, so you may prefer to use the document distributed in file
design/libhio_api.pdf.
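The steps above can be sketched as a shell session; the tarball version and
module name below are illustrative assumptions, not prescriptive:

```shell
# Unpack, configure, and build libHIO (version and module are site-specific examples)
tar -xzf libhio-1.3.0.6.tar.gz
cd libhio-1.3.0.6
module load PrgEnv-gnu   # or whatever compiler/MPI environment your site provides
./configure
make
make docs                # optional; requires doxygen and various latex packages
```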
Our target build environments include gcc with OpenMPI on Mac OS for unit
testing, and the gcc, Intel, and Cray compilers with Cray MPI or OpenMPI on
LANL Cray systems and TOSS clusters.
Included with HIO is a build script named hiobuild. It will perform all of
the above steps in one invocation. The HIO development team uses it to launch
builds on remote systems. You may find it useful; a typical invocation might
look like:
./hiobuild -c -s PrgEnv-intel,PrgEnv-gnu
hiobuild will also create a small script named hiobuild.modules.bash that
can be sourced to recreate the module environment used for build.
API Example
-----------
The HIO distribution contains a sample program, test/hio_example.c. This
is built along with libHIO. The script hio_example.sh will run the sample
program.
Simple DataWarp Test Job
------------------------
The HIO source contains a script test/dw_simple_sub.sh that will submit a
simple, small scale test job on a system with Moab/DataWarp integration. See
the comments in the file for instructions and a more detailed description.
Testing
-------
HIO's tests are in the test subdirectory. There is a simple API test named
test01 which can also serve as a coding example. Additionally, other tests
are named run02, run03, etc. These tests are able to run in a variety of
environments:
1) On Mac OS for unit testing
2) On a non-DataWarp cluster in interactive or batch mode
3) On one of the Trinity systems with DataWarp in interactive or batch mode
run02 and run03 are N-N and N-1 tests (respectively). Help for the options
can be displayed by invoking a test with the -h option. These tests use a
common script
named run_setup to process options and establish the testing environment.
They invoke hio using a program named xexec which is driven by command strings
contained in the runxx test scripts.
A typical usage to submit a test DataWarp batch job on the small LANL test system
named buffy might look like:
cd <tarball>/test
./run02 -s m -r 32 -n 2 -b
Options used:
-s m ---> Size medium (200 MB per rank)
-r 32 ---> Use 32 ranks
-n 2 ---> Use 2 nodes
-b ---> Submit a batch job
The runxx tests will use the hiobuild.modules.bash files saved by hiobuild
(if available) to reestablish the same module environment used at build
time.
A multi-job submission script to facilitate running a large number of tests
with one command is available. A typical usage for a fairly thorough test
on a large system like Trinity might look like:
run_combo -t ./run02 ./run03 ./run12 -s x y z -n 32 64 128 256 512 1024 -p 32 -b
This will submit 54 jobs (3 x 3 x 6) with all combinations of the specified
tests and parameters. The job scripts and output will be in the test/run
subdirectory.
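The 54-job figure is simply the size of the cross product of the parameter
lists (the -p value sets ranks per node and does not multiply the count); a
sketch using the values from the command above:

```shell
# Every combination of test script, size, and node count becomes one job
tests="run02 run03 run12"
sizes="x y z"
nodes="32 64 128 256 512 1024"
count=0
for t in $tests; do
  for s in $sizes; do
    for n in $nodes; do
      count=$((count + 1))
    done
  done
done
echo "$count jobs"   # 3 x 3 x 6 = 54
```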
Step-by-step procedure for building and running HIO tests on the LANL system Trinitite:
---------------------------------------------------------------------------------
This procedure is accurate as of 2017-01-12 with HIO 1.3.0.6.
1) Get the distribution tarball libhio-1.3.0.6.tar.gz from github at
https://github.com/hpc/libhio/releases
2) Untar
3) cd <dir>/libhio-1.3 ( <dir> is where you untarred HIO )
4) ./hiobuild -cf -s PrgEnv-intel,PrgEnv-gnu
At the end of the build you will see:
tt-fey1 ====[HIOBUILD_RESULT_START]===()===========================================
tt-fey1 hiobuild : Checking /users/cornell/tmp/libhio-1.3/hiobuild.out for build problems
24:configure: WARNING: using cross tools not prefixed with host triplet
259:Warning:
tt-fey1 hiobuild : Checking for build target files
tt-fey1 hiobuild : Build errors found, see above.
tt-fey1 ====[HIOBUILD_RESULT_END]===()=============================================
Ideally, the two warning messages would not be present, but at the moment, they can be ignored.
5) cd test
6) ./run_combo -t ./run02 ./run03 ./run12 ./run20 -s s m -n 1 2 4 -p 32 -b
This will create 24 job scripts in the libhio-1.3/test/run directory and submit the jobs.
Msub messages are in the corresponding .jobid files in the same directory. Job output is
directed to corresponding .out files. The number and mix of jobs is controlled by the
parameters. Issue run_combo -h for more information.
7) After the jobs complete, issue the following:
grep -c "RESULT: SUCCESS" run/*.out
If all jobs ran OK, grep should show 24 files, each with a count of 1, like this:
cornell@tr-login1:~/pgm/hio/tr-gnu/libhio-1.3/test> grep -c "RESULT: SUCCESS" run/*.out
run/job.20170108.080917.out:1
run/job.20170108.080927.out:1
run/job.20170108.080936.out:1
run/job.20170108.081422.out:1
. . . .
run/job.20170108.082133.out:1
run/job.20170108.082141.out:1
Investigate any missing job output or counts of 0.
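A sketch of that check, run against a throwaway directory with fabricated
.out files (the directory, file names, and failure text here are illustrative
only):

```shell
# Simulate a test run directory with one passing and one failing job output
dir=$(mktemp -d)
printf 'RESULT: SUCCESS\n' > "$dir/job.1.out"
printf 'job incomplete\n'  > "$dir/job.2.out"

# Per-file success counts, as in step 7 above
grep -c "RESULT: SUCCESS" "$dir"/*.out

# Files with no success line at all are the ones to investigate
fails=$(grep -L "RESULT: SUCCESS" "$dir"/*.out || true)
echo "investigate: $fails"
```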
8) Alternatively, cd to the libhio-1.3/test directory and run the script
./check_test
This will show how many jobs are queued and currently running and how many
output files are incomplete or have failures.
9) Resources for better understanding and/or modifying these procedures:
libhio-1.3/README
libhio-1.3/README.datawarp
libhio-1.3/hiobuild -h
libhio-1.3/test/run_combo -h
libhio-1.3/test/run_setup -h
libhio-1.3/test/run02, run03, run12, run20
libhio-1.3/test/xexec -h
libhio-1.3/design/libhio_api.pdf
libhio-1.3/test/hio_example.c
10) Additional test commands; check the results the same way as above.
Very simple small single job Moab/DataWarp test:
./run02 -s t -n 1 -r 1 -b
Alternate multi job test suitable for a large system like Trinity:
./run_combo -t ./run02 ./run03 ./run12 ./run20 -s l x -n 1024 512 256 128 64 -p 32 -b
Additional many-job submission contention test:
./run90 -p 5 -s t -n 1 -b
This test submits two jobs that each submit two additional jobs. Job
submission continues until the -p parameter is exhausted. Since the job
count doubles at each level, the total number of jobs is (2^p) - 2. Be
cautious about increasing
the -p parameter. Since this is only a job submission test, the normal
scan for RESULT: SUCCESS is not applicable. Simply wait for the queue to
empty and look for the expected number of .sh and .out files in the run
directory. If there are any .sh files without corresponding .out files,
look for errors via checkjob -v on the job IDs in the .jobid file.
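Since each round of submissions doubles the job count, the total is a
geometric sum, 2 + 4 + ... + 2^(p-1) = (2^p) - 2. A quick sanity check for
the -p 5 example above (the interpretation of -p as the number of doubling
levels is an assumption):

```shell
# Job count for run90 -p 5: a doubling submission tree, 2 + 4 + 8 + 16
p=5
total=$(( (1 << p) - 2 ))   # 2^p - 2
echo "$total"               # 30
```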
DataWarp stage-out can impose a significant load on the scratch file system.
To inhibit stage-out (which will reduce test coverage) set the environment
variable:
export HIO_datawarp_stage_mode=disable