Wednesday, 7 November 2012

The Job Engine - Good, But Not Great


In my last post I wrote about the virtues of the Job Engine, and as good as it is, it suffers from one frailty: it can only execute one job at a time.

So what do you do if your cluster is 90% full and a drive fails...

Well, firstly, your cluster is now going to be more than 90% full and the FlexProtect job is going to kick in at the highest priority.

Depending on the load on your cluster, the number of nodes and the type of drives, FlexProtect could be running for a couple of days.  So what happens to all those snapshots that would normally be deleted each night? That's right, they start to build up and your cluster's free space depletes even faster.
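
A quick way to confirm what is going on is to check which job the Job Engine is currently busy with, and how much space the expired snapshots are holding on to (both commands are covered in more detail further down):

zeus-1# isi job status
zeus-1# isi snap usage | grep snapid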


Manual Snapshot Delete


Being able to manually clear up expired snapshots is something you may need to do from time to time, whilst the Job Engine is off doing something else.

If you run isi snap list, you will get a list of all current snapshots, but not those which have expired.

zeus-1# isi snap list
SIQ-571-latest                      31G     n/a (R)    0.01% (T)
SIQ-6ab-latest                      12G     n/a (R)    0.00% (T)
SIQ-e2f-latest                      21G     n/a (R)    0.01% (T)
SIQ-e2f-new                        597M     n/a (R)    0.00% (T)
...

To get a complete list of snapshots, run isi snap usage; this will show the expired snapshots as well.

zeus-1# isi snap usage
SIQ-571-latest                      31G     n/a (R)    0.01% (T)
SIQ-6ab-latest                      12G     n/a (R)    0.00% (T)
SIQ-e2f-latest                      21G     n/a (R)    0.01% (T)
SIQ-e2f-new                        597M     n/a (R)    0.00% (T)
[snapid 30171, delete pending]      15T     n/a (R)    6.21% (T)
[snapid 30186, delete pending]     1.6T     n/a (R)    0.67% (T)
[snapid 30283, delete pending]      31G     n/a (R)    0.01% (T)
...

As you can see from the above there is a significant amount of space that can be freed up by deleting the expired snapshots.
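
If you want to put a rough figure on how much space a clean-up would return, a little awk over the same output can total the size column of the expired entries. This is only a quick sketch, and it assumes the size is always the fifth field of each [snapid ...] line and only ever carries an M, G or T suffix. For the expired snapshots listed above it would report something like:

zeus-1# isi snap usage | grep snapid | awk '{s=$5; u=substr(s,length(s)); v=substr(s,1,length(s)-1); if(u=="T")v*=1024; if(u=="M")v/=1024; t+=v} END{printf "%.1f GB reclaimable across %d expired snapshots\n", t, NR}'
17029.4 GB reclaimable across 3 expired snapshots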

To delete a snapshot, run isi_stf_test delete snapid (replacing snapid with the numeric snapshot ID).

zeus-1# isi_stf_test delete 30171

The manual snapshot clean-up will be much slower than the Job Engine, as the manual process will only execute on a single node.
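
Once a manual delete has finished, re-running isi snap usage should confirm that the expired entry has gone; no output from the grep below means snapshot 30171 has been fully cleaned up.

zeus-1# isi snap usage | grep 30171
zeus-1#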

Manually cleaning up multiple snapshots


Here is a very quick one-liner to script the clean-up of your expired snapshots.

isi snap usage | grep snapid | sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}'

This should generate output similar to the below; you can then either copy and paste it at the command line to run the clean-up or incorporate it into a script.

zeus-1# isi snap usage | grep snapid | sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}'
isi_stf_test delete 30171
isi_stf_test delete 30186
isi_stf_test delete 30283

To write the list of delete commands to /ifs/delete-me.sh and then execute the script in one go:

isi snap usage | grep snapid | sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}' > /ifs/delete-me.sh && sh /ifs/delete-me.sh

Want to take it one step further?

isi snap usage | grep snapid | sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}' > /ifs/delete-me.sh && isi_for_array "sh /ifs/delete-me.sh"

The above will now execute the script on all nodes.  If multiple nodes attempt to delete the same snapshot then the first one will win; the remaining nodes will report something like the below and move on to the next snapshot in your list.

zeus-1# isi_stf_test delete 30171
zeus-2# truncate:  ifs_snap_delete_lin: Stale NFS File handle
zeus-3# truncate:  ifs_snap_delete_lin: Stale NFS File handle
zeus-2# isi_stf_test delete 30186
zeus-3# truncate:  ifs_snap_delete_lin: Stale NFS File handle
zeus-3# isi_stf_test delete 30283

If the SnapshotDelete Job Engine job does start to run then it will clean up all remaining expired snapshots.  If you are manually deleting a snapshot and the Job Engine attempts to delete the same snapshot then you'll get the Stale NFS error at the command line and the Job Engine will take over the clean-up of that particular snapshot - the Job Engine outranks the command line :-)

Tuesday, 23 October 2012

Job Engine


The job engine is a key part of OneFS and is responsible for maintaining the health of your cluster.

On a healthy cluster, the job engine should load at boot time and remain active unless manually disabled.  You can check that the job service is running via the below:

zeus-1# isi services isi_job_d
Service 'isi_job_d' is enabled.

The above shows that the service is enabled; however, running isi services by itself will not report the status of this service. To see (or modify) isi_job_d you will need to specify the -a option so that all services are returned.

The -a option is a little verbose and returns 58 services as opposed to the default view of just 18, so you might want to pipe the output through grep.

zeus-1# isi services -a | grep isi_job_d
isi_job_d            Job Daemon         Enabled

The below commands can be used to stop and start the job engine.

zeus-1# isi services -a isi_job_d disable
The service 'isi_job_d' has been disabled.

zeus-1# isi services -a isi_job_d enable
The service 'isi_job_d' has been enabled.
  
The isi job list command can be used to see all defined job engine jobs.  The below are the more common jobs that you may see running.

Name             Policy     Description                   
--------------------------------------------------------------------------------------
AutoBalance      LOW        Balance free space in the cluster.
Collect          LOW        Reclaims space that couldn't be freed due to node or disk issues.
FlexProtect      MEDIUM     Reprotect the file system.
MediaScan        LOW        Scrub disks for media-level errors.
MultiScan        LOW        Runs Collect and AutoBalance jobs concurrently.
QuotaScan        LOW        Update quota accounting for existing files.
SmartPools       LOW        Enforces SmartPools file policies.
SnapshotDelete   MEDIUM     Free space associated with deleted snapshots.
TreeDelete       HIGH       Delete a path in /ifs. 

In the above, the Policy column refers to the schedule of the job and also its impact (the amount of CPU it can utilise).  Running isi job policy list will return the default scheduling.

zeus-1# isi job policy list
Job Policies:                                                                  
Name            Start        End          Impact    
--------------- ------------ ------------ ----------
HIGH            Sun 00:00    Sat 23:59    High
...

One of the key things to remember is that OneFS can only execute one job at a time. As each job is scheduled, a job ID is assigned to it.  The job engine then executes the job with the lowest priority value (the lower the number, the higher the priority).  When two jobs have the same priority, the job with the lowest job ID is executed first.

If for any reason you need a job other than the currently running job to execute, you can either start the job with a lower priority value than any currently scheduled, or you can pause all currently scheduled jobs apart from the one you want to run.
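
For example, to get SnapshotDelete running ahead of a job that is already active, something along the lines of the below should do it. This is only a sketch: isi job start, pause and resume exist on this generation of OneFS, but the exact arguments (job name versus numeric job ID) can vary between versions, so check the isi job help on your cluster first.

zeus-1# isi job pause MediaScan            # pause the job that is currently running
zeus-1# isi job start SnapshotDelete       # start the job you actually need
zeus-1# isi job resume MediaScan           # let the paused job carry on afterwards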

To see the currently running job, you can execute isi job status; this will also return information on paused, failed and recent jobs.

zeus-1# isi job status          

Running jobs:                                                                  
Job                        Impact Pri Policy     Phase Run Time  
-------------------------- ------ --- ---------- ----- ----------
AutoBalance[8]             Low    4   LOW        1/3   0:00:01

No paused or waiting jobs.
No failed jobs.

Recent job results:                                                                                                                                                                                                                        
Time            Job                        Event                         
--------------- -------------------------- ------------------------------
10/17 15:38:15  MultiScan[1]               Succeeded (LOW) 

When Jobs Fail To Run

There are a few situations where jobs won't run.  Firstly, there may simply be no scheduled jobs; this is common on newly commissioned clusters where there is little or no data.  You can manually kick off a job to ensure everything runs as expected, as sketched below.
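
For example, kicking off MultiScan and then checking that it shows up under isi job status makes a reasonable smoke test (the same caveat as above applies to the exact isi job start syntax on your version of OneFS):

zeus-1# isi job start MultiScan
zeus-1# isi job status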

If you have jobs that are scheduled but are not running, then one of the below may be the reason.

A node is offline or has just rebooted.

The job engine will only run when all nodes are available; if a node has gone offline, or has only just booted, you may find that no jobs are running.
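
A quick isi status will show whether any node is down or read-only (the layout of the output varies a little between OneFS versions):

zeus-1# isi status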

Coordinator node is unavailable.

The job engine relies on one of the nodes acting as the job coordinator (this is usually the first node in the cluster). If this node is unreachable, heavily loaded or read-only then jobs will be suspended.  You can identify the coordinator node and its health by running the below.

zeus-1# isi job status -r       
coordinator.connected=True
coordinator.devid=1
coordinator.down_or_read_only=False

The isi job history command can be used to confirm when jobs last ran and how long they took. Useful options include:

--limit   [number of jobs to return, 0 returns all]
-v        [verbose output]
--job     [return information about a particular job type]
  
zeus-1# isi job history --limit=0 --job=AutoBalance -v
Job events:                                                                                                                                                                                                                               
Time            Job                        Event                         
--------------- -------------------------- ------------------------------
10/18 16:37:25  AutoBalance[8]             Waiting
10/18 16:37:25  AutoBalance[8]             Running (LOW)
10/18 16:37:25  AutoBalance[8]             Phase 1: begin drive scan
10/18 16:37:26  AutoBalance[8]             Phase 1: end drive scan
        Elapsed time:                        1 second
        Errors:                              0
        Drives:                              4
        LINs:                                3
        Size:                                0
        Eccs:                                0
10/18 16:37:27  AutoBalance[8]             Phase 2: begin rebalance
10/18 16:37:27  AutoBalance[8]             Phase 2: end rebalance
        Elapsed time:                        1 second
        Errors:                              0
        LINs:                              169
        Zombies:                             0
10/18 16:37:28  AutoBalance[8]             Phase 3: begin check
10/18 16:37:28  AutoBalance[8]             Phase 3: end check
        Elapsed time:                        1 second
        Errors:                              0
        Drives:                              0
        LINs:                                0
        Size:                                0
        Eccs:                                0
10/18 16:37:28  AutoBalance[8]             Succeeded (LOW) 

The job engine may log information to /var/log/messages and /var/log/isi_job_d.log.
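
If you need to dig deeper, the isi_for_array trick from the snapshot post above works just as well for pulling the tail of the job log from every node at once:

zeus-1# isi_for_array "tail -n 20 /var/log/isi_job_d.log"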