Usage¶

jrnr is a Python library currently configured to work on systems using the Slurm workload manager. If your computing workflows can be parallelized, jrnr can help.

jrnr is an application that relies on click, the Python command-line library.

At the top of your Python module, add this to the import section:

from jrnr.jrnr import slurm_runner
Interactive mode¶

Frequently, you’ll want to do some basic debugging and iteration to make sure your batch jobs will run as expected. To assist this process, jrnr has an interactive mode that allows you to run a single job in an IPython session.
In [1]: import tas
In [2]: tas.make_tas.run_interactive(42)
2018-01-10 17:01:55,001 Beginning job
kwargs: { 'model': 'NorESM1-M', 'scenario': 'rcp45', 'year': '2054'}
2018-01-10 17:02:43,733 beginning
2018-01-10 17:02:43,733 producing_tas
Out[2]:
<xarray.Dataset>
Dimensions: (lat: 720, lon: 1440, time: 365)
Coordinates:
* lon (lon) float32 -179.875 -179.625 -179.375 -179.125 -178.875 ...
* time (time) datetime64[ns] 2054-01-01T12:00:00 2054-01-02T12:00:00 ...
* lat (lat) float32 -89.875 -89.625 -89.375 -89.125 -88.875 -88.625 ...
Data variables:
tas (time, lat, lon) float32 272.935 272.937 272.931 272.911 ...
Attributes:
version: 1.0
repo: https://gitlab.com/ClimateImpactLab/tas/
frequency: annual
oneline: Average Daily Temperature, tavg
file: tas.py
year: 2054
write_variable: tas
description: Average Daily Temperature, tavg\n\n Average Daily Temper...
execute: python tas.py run
project: gcp
team: climate
dependencies: ['/global/scratch/groups/co_laika/gcp/climate/nasa_bcsd/...
model: NorESM1-M
As you can see, if you set up logging, the logging information will print wherever you direct stdout. In this case, in interactive mode, it prints to the IPython terminal. In batch mode, jrnr logs can be found in the directory you specified, in files named run-{job_name}-{job_id}-{task-id}.log.
Running your job in batch mode¶

The slurm_runner decorator function in jrnr acts as a wrapper around your main function. Make sure that you have added @slurm_runner() above your main function. With this enabled, you can use the command line to launch your jobs on the slurm workload manager.
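A minimal sketch of a decorated module is below. The bare @slurm_runner() call mirrors the description above; in a real script the decorator and the job function take additional arguments (for example, the job specification), so treat the function name and signature here as placeholders and refer to the Example jrnr script for the structure jrnr actually expects.

# tas.py -- illustrative sketch only; see the Example jrnr script for the
# arguments that slurm_runner and your job function actually take.
from jrnr.jrnr import slurm_runner

@slurm_runner()
def make_tas(*args, **kwargs):
    # One task's parameters (for example model, scenario, and year in the
    # interactive session above) arrive as keyword arguments.
    ...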
Make sure you are in the directory where your Python module is located. Let’s say we are running the job specified in Example jrnr script. Let’s look at what the --help option shows.
$ python tas.py --help
Usage: tas.py [OPTIONS] COMMAND [ARGS]...
Options:
  --help  Show this message and exit.

Commands:
  cleanup
  do_job
  prep
  run
  status
  wait
We can see that this gives us the list of available commands. Let’s look at run.

run¶

Let’s first have a look at the options for the run command.
$ python tas.py run --help
Usage: tas.py run [OPTIONS]
Options:
  -l, --limit INTEGER          Number of iterations to run
  -n, --jobs_per_node INTEGER  Number of jobs to run per node
  -x, --maxnodes INTEGER       Number of nodes to request for this job
  -j, --jobname TEXT           name of the job
  -p, --partition TEXT         resource on which to run
  -d, --dependency INTEGER
  -L, --logdir TEXT            Directory to write log files
  -u, --uniqueid TEXT          Unique job pool id
  --help                       Show this message and exit.
The most important options are -u, -j, and -L. To specify a job you need -u and -j, since these parameters uniquely identify a job and allow you to track the progress of your jobs. An example command is below:
$ python tas.py run -u 001 -j tas
This creates a job with a unique id of 001 and a job name of tas.
By specifying some of the options listed above, you can adjust the behavior of your slurm jobs. For example, you can put your log files in a specific directory by passing a value for the -L argument. Additionally, if you want to use a specific partition on your cluster, you can specify the -p option. Similarly, if your job is particularly compute intensive, you can adjust the number of jobs per node with -n.
$ python tas.py run -u 001 -j tas -L /logs/tas/ -p savio2_bigmem -n 10
It’s important to note that, by default, log files are written to the directory from which you execute the file. Depending on how large your job is, you may want to put these log files elsewhere.
If you want to take full advantage of BRC’s computing capacity, you can run the same job pool on several partitions at once:
$ python tas.py run -u 001 -j tas -L /logs/tas/ -p savio_bigmem -n 10
run job: 98
on-finish job: 99
$ python tas.py run -u 001 -j tas -L /logs/tas/ -p savio2_bigmem -n 10
run job: 100
on-finish job: 101
$ python tas.py run -u 001 -j tas -L /logs/tas/ -p savio2 -n 5
run job: 104
on-finish job: 105
$ python tas.py run -u 001 -j tas -L /logs/tas/ -p savio -n 5
run job: 106
on-finish job: 107
How many jobs should you run on each node? To determine this, divide the amount of memory per node by the amount of memory required by each job; the per-node memory for each partition is listed in the Savio user guide. For example, suppose a job requires 6GB of RAM and runs on the savio2_bigmem partition, which has 128GB of RAM per node. Adding a 2GB buffer to the 6GB requirement gives 8GB per job, and 128/8 = 16 jobs per node.
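As a quick sanity check, the arithmetic looks like this (the 128GB node size and 2GB buffer are the assumptions from the example above; substitute the numbers for your own partition and job):

# Hypothetical numbers from the example above; check the Savio user guide
# for the memory available on your partition.
node_memory_gb = 128   # savio2_bigmem node
job_memory_gb = 6 + 2  # per-job requirement plus a 2GB buffer

jobs_per_node = node_memory_gb // job_memory_gb
print(jobs_per_node)   # 16 -- the value to pass with -n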
status¶

You launched your job 10 minutes ago and you want to check on the status of your jobs. We can check with the status command. Let’s look again at our tas.py file.
$ python tas.py status -u 001 -j tas
jobs: 4473
done: 3000
in progress: 1470
errored: 3
Notice that we use the unique id 001 and the jobname tas that we used when we created the job. You must use these values; otherwise jrnr cannot compute the progress of your job.
Technical note¶
How does jrnr track the status of my jobs?¶

In the directory where you are running your job, jrnr creates a locks directory. In this locks directory, a file is created for each job in your set of batch jobs, named with the structure {job_name}-{unique_id}-{job_index}. When a node is working on a job, it adds the .lck file extension to the file. When the job is completed, the .lck extension is converted to a .done extension. If, for some reason, the job encounters an error, the extension shifts to .err. When you call the status command, jrnr simply displays the count of files with each extension in the locks directory.
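To make this concrete, here is a rough sketch of that count for the job pool created above (jobname tas, unique id 001). The directory name and glob pattern follow the naming scheme just described and are illustrative, not jrnr’s actual implementation:

# Sketch only: tally lock-file extensions for one job pool, assuming a
# locks/ directory in the current working directory.
from collections import Counter
from pathlib import Path

counts = Counter(p.suffix for p in Path('locks').glob('tas-001-*'))
print('done:       ', counts['.done'])
print('in progress:', counts['.lck'])
print('errored:    ', counts['.err'])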
How does jrnr construct a job specification?¶

Each jrnr job can be specified with arguments drawn from key, value dictionaries. Since these arguments are taken from a set of known possible inputs, we can take each key and its associated set of possible values and compute the cartesian product of every key, value combination. Behind the scenes, jrnr takes lists of dictionaries and uses the Python function itertools.product to specify the superset of possible batch jobs. A demonstration is below:
In [1]: import itertools
   ...:
   ...: def generate_jobs(job_spec):
   ...:     for specs in itertools.product(*job_spec):
   ...:         yield _unpack_job(specs)
   ...:

In [2]: def _unpack_job(specs):
   ...:     job = {}
   ...:     for spec in specs:
   ...:         job.update(spec)
   ...:     return job
   ...:
In [3]: MODELS = list(map(lambda x: dict(model=x), [
   ...:     'ACCESS1-0',
   ...:     'bcc-csm1-1',
   ...:     'BNU-ESM',
   ...:     'CanESM2',
   ...: ]))

In [4]: PERIODS = (
   ...:     [dict(scenario='historical', year=y) for y in range(1981, 2006)] +
   ...:     [dict(scenario='rcp45', year=y) for y in range(2006, 2100)])
In [5]: job_spec = [PERIODS, MODELS]
In [6]: jobs = list(generate_jobs(job_spec))
In [7]: jobs[:100:10]
Out[7]:
[{'model': 'ACCESS1-0', 'scenario': 'historical', 'year': 1981},
{'model': 'BNU-ESM', 'scenario': 'historical', 'year': 1983},
{'model': 'ACCESS1-0', 'scenario': 'historical', 'year': 1986},
{'model': 'BNU-ESM', 'scenario': 'historical', 'year': 1988},
{'model': 'ACCESS1-0', 'scenario': 'historical', 'year': 1991},
{'model': 'BNU-ESM', 'scenario': 'historical', 'year': 1993},
{'model': 'ACCESS1-0', 'scenario': 'historical', 'year': 1996},
{'model': 'BNU-ESM', 'scenario': 'historical', 'year': 1998},
{'model': 'ACCESS1-0', 'scenario': 'historical', 'year': 2001},
{'model': 'BNU-ESM', 'scenario': 'historical', 'year': 2003}]