The job engine is a key part of OneFS and is
responsible for maintaining the health of your cluster.
On a healthy cluster, the job engine should load at
boot time and remain active unless manually disabled. You can check that
the job service is enabled with the following command:
zeus-1# isi services isi_job_d
Service 'isi_job_d' is enabled.
The output above shows that the service is enabled. However,
running isi services by itself will not report the status of this service; to
see (or modify) isi_job_d you need to specify the -a option so that all
services are returned.
The -a option is a little verbose, returning 58 services
as opposed to the default view of just 18, so you might want to pipe
the output through grep.
zeus-1# isi services -a | grep isi_job_d
isi_job_d            Job Daemon                               Enabled
The following commands can be used to stop and start
the job engine.
zeus-1# isi services -a isi_job_d disable
The service 'isi_job_d' has been disabled.
zeus-1# isi services -a isi_job_d enable
The service 'isi_job_d' has been enabled.
The isi job list command can be used to see all
defined job engine jobs. The following are the more common jobs that you may
see running.
Name            Policy  Description
--------------------------------------------------------------------------------------
AutoBalance     LOW     Balance free space in the cluster.
Collect         LOW     Reclaims space that couldn't be freed due to node or disk issues.
FlexProtect     MEDIUM  Reprotect the file system.
MediaScan       LOW     Scrub disks for media-level errors.
MultiScan       LOW     Runs Collect and AutoBalance jobs concurrently.
QuotaScan       LOW     Update quota accounting for existing files.
SmartPools      LOW     Enforces SmartPools file policies.
SnapshotDelete  MEDIUM  Free space associated with deleted snapshots.
TreeDelete      HIGH    Delete a path in /ifs.
In the table above, the Policy column refers to the
schedule of the job and also its impact (the amount of CPU it can utilise).
Running isi job policy list will return the default scheduling.
zeus-1# isi job policy list
Job Policies:
Name            Start        End          Impact
--------------- ------------ ------------ ----------
HIGH            Sun 00:00    Sat 23:59    High
One of the key things to remember is that OneFS can
only execute one job at a time. As each job is scheduled, it is assigned
a job ID. The job engine then executes the job with the lowest
(integer) priority value; when two jobs have the same priority, the job with the
lowest job ID is executed first.
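The selection rule described above (lowest priority value first, ties broken by lowest job ID) can be modelled in a few lines of Python. This is only an illustrative sketch, not how the job engine is implemented: the Job class and next_job function are hypothetical names, and the priorities for SmartPools and TreeDelete are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    job_id: int
    priority: int  # lower value is executed first


def next_job(queue):
    """Return the job that would be executed next, or None if nothing is queued."""
    if not queue:
        return None
    # Sort key: priority value first, then job ID as the tie-breaker.
    return min(queue, key=lambda job: (job.priority, job.job_id))


queue = [
    Job("AutoBalance", job_id=8, priority=4),
    Job("SmartPools", job_id=9, priority=4),   # hypothetical priority
    Job("TreeDelete", job_id=10, priority=2),  # hypothetical priority
]
print(next_job(queue).name)
```

With this queue the job with priority 2 is picked even though it has the highest job ID; if it is removed, the two priority-4 jobs are ordered by job ID.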
If for any reason you need a job other than the
currently running job to execute, you can either start the job with a lower
priority value than any currently scheduled job, or you can pause all currently
scheduled jobs apart from the one you want to run.
To see the currently running job, you can execute isi
job status. This will also return information on paused, failed and recent
jobs.
zeus-1# isi job status
Running jobs:
Job                        Impact Pri Policy     Phase Run Time
-------------------------- ------ --- ---------- ----- ----------
AutoBalance[8]             Low    4   LOW        1/3   0:00:01
No paused or waiting jobs.
No failed jobs.
Recent job results:
Time            Job                        Event
--------------- -------------------------- ------------------------------
10/17 15:38:15  MultiScan[1]               Succeeded (LOW)
When Jobs Fail To Run
There are a few situations where jobs won't run.
The first is when there are no scheduled jobs. This is common on newly
commissioned clusters where there is little or no data. You can manually
kick off a job to ensure everything runs as expected.
If you have jobs that are scheduled but are not
running, then one of the following may be the reason.
A node is offline or has just rebooted.
The job engine will only run when all nodes are
available. If a node has gone offline or has only just booted, then
you may find that no jobs are running.
Coordinator node is unavailable.
The job engine relies on one of the nodes acting as
a job coordinator node (this is usually the first node in the cluster). If this
node is unreachable, heavily loaded or read-only, then the jobs will be
suspended. You can identify the coordinator node and its health by
running the following command.
zeus-1# isi job status -r
coordinator.connected=True
coordinator.devid=1
coordinator.down_or_read_only=False
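If you want to check these values from a script, the key=value lines are easy to parse. The Python sketch below works on a captured copy of the output shown above; parse_status and coordinator_healthy are hypothetical helper names, and the health rule (connected is True and down_or_read_only is False) is an assumption drawn from the field names, not documented behaviour.

```python
def parse_status(raw):
    """Turn 'key=value' lines into a dict, ignoring lines without '='."""
    fields = {}
    for line in raw.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            fields[key] = value
    return fields


def coordinator_healthy(fields):
    """Assumed rule: the coordinator is connected and not down or read-only."""
    return (
        fields.get("coordinator.connected") == "True"
        and fields.get("coordinator.down_or_read_only") == "False"
    )


# Sample text copied from the output above.
sample = """coordinator.connected=True
coordinator.devid=1
coordinator.down_or_read_only=False"""

fields = parse_status(sample)
print("coordinator devid:", fields["coordinator.devid"])
print("healthy:", coordinator_healthy(fields))
```

Missing keys are treated as unhealthy, which is the safer default when the output format differs from the sample.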
The isi job history command can be used to
confirm when jobs last ran and how long they took.
--limit [number of jobs to return, 0 returns all]
-v [verbose output]
--job [return information about a particular job type]
zeus-1# isi job history --limit=0 --job=AutoBalance -v
Job events:
Time            Job                        Event
--------------- -------------------------- ------------------------------
10/18 16:37:25  AutoBalance[8]             Waiting
10/18 16:37:25  AutoBalance[8]             Running (LOW)
10/18 16:37:25  AutoBalance[8]             Phase 1: begin drive scan
10/18 16:37:26  AutoBalance[8]             Phase 1: end drive scan
                                               Elapsed time: 1 second
                                               Errors:       0
                                               Drives:       4
                                               LINs:         3
                                               Size:         0
                                               Eccs:         0
10/18 16:37:27  AutoBalance[8]             Phase 2: begin rebalance
10/18 16:37:27  AutoBalance[8]             Phase 2: end rebalance
                                               Elapsed time: 1 second
                                               Errors:       0
                                               LINs:         169
                                               Zombies:      0
10/18 16:37:28  AutoBalance[8]             Phase 3: begin check
10/18 16:37:28  AutoBalance[8]             Phase 3: end check
                                               Elapsed time: 1 second
                                               Errors:       0
                                               Drives:       0
                                               LINs:         0
                                               Size:         0
                                               Eccs:         0
10/18 16:37:28  AutoBalance[8]             Succeeded (LOW)
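To work out how long a job took from this history, you can take the difference between its first and last event timestamps. The Python sketch below assumes each event sits on one line in the form "MM/DD HH:MM:SS Job[ID] Event"; parse_events and run_seconds are hypothetical helper names. The output carries no year, so the parser uses a fixed placeholder year, which means a run spanning New Year would not be handled.

```python
import re
from datetime import datetime

# One event per line: "MM/DD HH:MM:SS  JobName[ID]  Event text"
EVENT = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+(\w+)\[(\d+)\]\s+(.*)$")


def parse_events(raw, year=2000):
    """Return (timestamp, job name, job ID, event) tuples; other lines are skipped."""
    events = []
    for line in raw.splitlines():
        match = EVENT.match(line.strip())
        if match:
            stamp = datetime.strptime(f"{year}/{match.group(1)}", "%Y/%m/%d %H:%M:%S")
            events.append((stamp, match.group(2), int(match.group(3)), match.group(4).strip()))
    return events


def run_seconds(events, job_id):
    """Seconds between the first and last event of a job, or None if it has no events."""
    times = [stamp for stamp, _, jid, _ in events if jid == job_id]
    return (max(times) - min(times)).total_seconds() if times else None


# Abbreviated sample based on the history output above.
sample = """10/18 16:37:25  AutoBalance[8]  Waiting
10/18 16:37:25  AutoBalance[8]  Running (LOW)
    Elapsed time: 1 second
10/18 16:37:28  AutoBalance[8]  Succeeded (LOW)"""

events = parse_events(sample)
print(run_seconds(events, 8))
```

Detail lines such as "Elapsed time" do not start with a timestamp, so the parser ignores them rather than treating them as events.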
The job engine may log information to
/var/log/messages and /var/log/isi_job_d.log.