# -*- Mode: sh; sh-basic-offset:2 ; indent-tabs-mode:nil -*-
#
# Copyright (c) 2014-2017 Los Alamos National Security, LLC. All rights
# reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
HIO Readme
==========
Last updated 2017-01-12
libHIO is a flexible, high-performance parallel IO package developed at LANL.
libHIO supports IO to either a conventional parallel file system (PFS) or to
DataWarp, including management of DataWarp space and stage-in/stage-out
between DataWarp and the PFS.
libHIO has been released as open source and is available at:
https://github.com/hpc/libhio
For more information on using libHIO, see the GitHub repository, in particular:
README
libhio_api.pdf
hio_example.c
README.datawarp
See file NEWS for a description of changes to hio. Note that this README
refers to various LANL clusters that have been used for testing HIO. Using
HIO in other environments may require some adjustments.
Building
--------
HIO uses a standard autoconf/automake build process. To build:
1) Untar
2) cd to root of tarball
3) module load needed compiler or MPI environment
4) ./configure
5) make
Additional generally useful make targets include clean and docs. make docs
will build the HIO API document, but it requires doxygen and various latex
packages to run, so you may prefer to use the document distributed in file
design/libhio_api.pdf.
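The steps above can be sketched as a shell session; the tarball version and
module name below are illustrative assumptions, not prescriptive:

```shell
# Unpack, configure, and build libHIO (version and module are site-specific examples)
tar -xzf libhio-1.3.0.6.tar.gz
cd libhio-1.3.0.6
module load PrgEnv-gnu   # or whatever compiler/MPI environment your site provides
./configure
make
make docs                # optional; requires doxygen and various latex packages
```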
Our target build environments include gcc with OpenMPI on Mac OS for unit
testing, and the gcc, Intel, and Cray compilers with Cray MPI or OpenMPI on
LANL Cray systems and TOSS clusters.
Included with HIO is a build script named hiobuild. It will perform all of
the above steps in one invocation. The HIO development team uses it to launch
builds on remote systems. You may find it useful; a typical invocation might
look like:
./hiobuild -c -s PrgEnv-intel,PrgEnv-gnu
hiobuild will also create a small script named hiobuild.modules.bash that
can be sourced to recreate the module environment used for build.
API Example
-----------
The HIO distribution contains a sample program, test/hio_example.c. This
is built along with libHIO. The script hio_example.sh will run the sample
program.
Simple DataWarp Test Job
------------------------
The HIO source contains a script test/dw_simple_sub.sh that will submit a
simple, small scale test job on a system with Moab/DataWarp integration. See
the comments in the file for instructions and a more detailed description.
Testing
-------
HIO's tests are in the test subdirectory. There is a simple API test named
test01 which can also serve as a coding example. Additionally, other tests
are named run02, run03, etc. These tests are able to run in a variety of
environments:
1) On Mac OS for unit testing
2) On a non-DataWarp cluster in interactive or batch mode
3) On one of the Trinity systems with DataWarp in interactive or batch mode
run02 and run03 are N-N and N-1 tests (respectively). Help for the options
can be displayed by invoking a test with the -h option. These tests use a
common script
named run_setup to process options and establish the testing environment.
They invoke hio using a program named xexec which is driven by command strings
contained in the runxx test scripts.
A typical usage to submit a test DataWarp batch job on the small LANL test system
named buffy might look like:
cd <tarball>/test
./run02 -s m -r 32 -n 2 -b
Options used:
-s m ---> Size medium (200 MB per rank)
-r 32 ---> Use 32 ranks
-n 2 ---> Use 2 nodes
-b ---> Submit a batch job
The runxx tests will use the hiobuild.modules.bash files saved by hiobuild
(if available) to reestablish the same module environment used at build
time.
A multi-job submission script to facilitate running a large number of tests
with one command is available. A typical usage for a fairly thorough test
on a large system like Trinity might look like:
run_combo -t ./run02 ./run03 ./run12 -s x y z -n 32 64 128 256 512 1024 -p 32 -b
This will submit 54 jobs (3 x 3 x 6) with all combinations of the specified
tests and parameters. The job scripts and output will be in the test/run
subdirectory.
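The 54-job figure is simply the size of the cross product of the parameter
lists (the -p value sets ranks per node and does not multiply the count); a
sketch using the values from the command above:

```shell
# Every combination of test script, size, and node count becomes one job
tests="run02 run03 run12"
sizes="x y z"
nodes="32 64 128 256 512 1024"
count=0
for t in $tests; do
  for s in $sizes; do
    for n in $nodes; do
      count=$((count + 1))
    done
  done
done
echo "$count jobs"   # 3 x 3 x 6 = 54
```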
Step-by-step procedure for building and running HIO tests on the LANL system Trinitite:
---------------------------------------------------------------------------------
This procedure is accurate as of 2017-01-12 with HIO 1.3.0.6.
1) Get the distribution tarball libhio-1.3.0.6.tar.gz from github at
https://github.com/hpc/libhio/releases
2) Untar
3) cd <dir>/libhio-1.3 ( <dir> is where you untarred HIO )
4) ./hiobuild -cf -s PrgEnv-intel,PrgEnv-gnu
At the end of the build you will see:
tt-fey1 ====[HIOBUILD_RESULT_START]===()===========================================
tt-fey1 hiobuild : Checking /users/cornell/tmp/libhio-1.3/hiobuild.out for build problems
24:configure: WARNING: using cross tools not prefixed with host triplet
259:Warning:
tt-fey1 hiobuild : Checking for build target files
tt-fey1 hiobuild : Build errors found, see above.
tt-fey1 ====[HIOBUILD_RESULT_END]===()=============================================
Ideally, the two warning messages would not be present, but at the moment, they can be ignored.
5) cd test
6) ./run_combo -t ./run02 ./run03 ./run12 ./run20 -s s m -n 1 2 4 -p 32 -b
This will create 24 job scripts in the libhio-1.3/test/run directory and submit the jobs.
Msub messages are in the corresponding .jobid files in the same directory. Job output is
directed to corresponding .out files. The number and mix of jobs is controlled by the
parameters. Issue run_combo -h for more information.
7) After the jobs complete, issue the following:
grep -c "RESULT: SUCCESS" run/*.out
If all jobs ran OK, grep should show 24 files, each with a count of 1, like this:
cornell@tr-login1:~/pgm/hio/tr-gnu/libhio-1.3/test> grep -c "RESULT: SUCCESS" run/*.out
run/job.20170108.080917.out:1
run/job.20170108.080927.out:1
run/job.20170108.080936.out:1
run/job.20170108.081422.out:1
. . . .
run/job.20170108.082133.out:1
run/job.20170108.082141.out:1
Investigate any missing job output or counts of 0.
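A sketch of that check, run against a throwaway directory with fabricated
.out files (the directory, file names, and failure text here are illustrative
only):

```shell
# Simulate a test run directory with one passing and one failing job output
dir=$(mktemp -d)
printf 'RESULT: SUCCESS\n' > "$dir/job.1.out"
printf 'job incomplete\n'  > "$dir/job.2.out"

# Per-file success counts, as in step 7 above
grep -c "RESULT: SUCCESS" "$dir"/*.out

# Files with no success line at all are the ones to investigate
fails=$(grep -L "RESULT: SUCCESS" "$dir"/*.out || true)
echo "investigate: $fails"
```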
8) Alternatively, cd to the libhio-1.3/test directory and run the script
./check_test
This will show how many jobs are queued and currently running and how many
output files are incomplete or have failures.
9) Resources for better understanding and/or modifying these procedures:
libhio-1.3/README
libhio-1.3/README.datawarp
libhio-1.3/hiobuild -h
libhio-1.3/test/run_combo -h
libhio-1.3/test/run_setup -h
libhio-1.3/test/run02, run03, run12, run20
libhio-1.3/test/xexec -h
libhio-1.3/design/libhio_api.pdf
libhio-1.3/test/hio_example.c
10) Additional test commands; check the results the same way as above.
Very simple small single job Moab/DataWarp test:
./run02 -s t -n 1 -r 1 -b
Alternate multi job test suitable for a large system like Trinity:
./run_combo -t ./run02 ./run03 ./run12 ./run20 -s l x -n 1024 512 256 128 64 -p 32 -b
Additional many-job submission contention test:
./run90 -p 5 -s t -n 1 -b
This test submits two jobs that each submit two additional jobs. Job
submission continues until the -p parameter is exhausted. Since the job
count doubles at each level, the total number of jobs is (2^p) - 2. Be
cautious about increasing
the -p parameter. Since this is only a job submission test, the normal
scan for RESULT: SUCCESS is not applicable. Simply wait for the queue to
empty and look for the expected number of .sh and .out files in the run
directory. If there are any .sh files without corresponding .out files,
look for errors via checkjob -v on the job IDs in the .jobid file.
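Since each round of submissions doubles the job count, the total is a
geometric sum, 2 + 4 + ... + 2^(p-1) = (2^p) - 2. A quick sanity check for
the -p 5 example above (the interpretation of -p as the number of doubling
levels is an assumption):

```shell
# Job count for run90 -p 5: a doubling submission tree, 2 + 4 + 8 + 16
p=5
total=$(( (1 << p) - 2 ))   # 2^p - 2
echo "$total"               # 30
```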
DataWarp stage-out can impose a significant load on the scratch file system.
To inhibit stage-out (which will reduce test coverage) set the environment
variable:
export HIO_datawarp_stage_mode=disable