In my last post I wrote about the virtues of the Job Engine, and as good as it is, it suffers from one frailty: it can only execute one job at a time.
So what do you do if your cluster is 90% full and a drive fails... Well, firstly your cluster is now going to be more than 90% full, and the FlexProtect job is going to kick in at the highest priority.
Depending on the loading of your cluster, the number of nodes and the type of drives, FlexProtect could be running for a couple of days. So what happens to all those snapshots that would normally be deleted each night? That's right, they start to build up and your cluster's free space depletes even faster.
Manual Snapshot Delete
Being able to manually clear up expired snapshots is something you may need to do from time to time, whilst the Job Engine is off doing something else.
If you run isi snap list, you will get a list of all current snapshots, but not those which have expired.
zeus-1# isi snap list
SIQ-571-latest    31G     n/a (R)    0.01% (T)
SIQ-6ab-latest    12G     n/a (R)    0.00% (T)
SIQ-e2f-latest    21G     n/a (R)    0.01% (T)
SIQ-e2f-new       597M    n/a (R)    0.00% (T)
...
To get a complete list of snapshots, run isi snap usage; this will show the expired snapshots as well.
zeus-1# isi snap usage
SIQ-571-latest    31G     n/a (R)    0.01% (T)
SIQ-6ab-latest    12G     n/a (R)    0.00% (T)
SIQ-e2f-latest    21G     n/a (R)    0.01% (T)
SIQ-e2f-new       597M    n/a (R)    0.00% (T)
[snapid 30171, delete pending]    15T     n/a (R)    6.21% (T)
[snapid 30186, delete pending]    1.6T    n/a (R)    0.67% (T)
[snapid 30283, delete pending]    31G     n/a (R)    0.01% (T)
...
As you can see from the above, there is a significant amount of space that can be freed up by deleting the expired snapshots - the three delete-pending snapshots alone account for well over 16 TB.
To delete a snapshot, run isi_stf_test delete snapid (replacing snapid with the numeric snapshot id).
zeus-1# isi_stf_test delete 30171
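If you want to confirm a snapshot has actually gone, grepping the isi snap usage output for its snapid is a simple check (my own habit rather than part of the original procedure); once the delete has completed the grep should return nothing.
zeus-1# isi snap usage | grep 30171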
The manual snapshot clean-up will be much slower than the Job Engine, as the manual process will only execute on a single node.
Manually cleaning up multiple snapshots
Here is a very quick one-liner to script the clean-up of your expired snapshots.
isi snap usage | grep snapid | sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}'
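Breaking that one-liner down stage by stage (each line below can be run on its own, so you can sanity-check the intermediate output; the field positions assume the isi snap usage layout shown earlier):
isi snap usage                                   # every snapshot, including the expired "delete pending" ones
isi snap usage | grep snapid                     # just the "[snapid NNNNN, delete pending]" lines
isi snap usage | grep snapid | sed -e 's/\,//'   # drop the comma so the snapid is a bare number in field 2
The final awk stage then turns each of those snapids ($2, the second whitespace-separated field) into an isi_stf_test delete command.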
This should generate an output similar to the below; you can then either copy and paste it at the command line to run the clean-up, or incorporate it into a script.
zeus-1# isi snap usage | grep snapid | sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}'
isi_stf_test delete 30171
isi_stf_test delete 30186
isi_stf_test delete 30283
To generate the list of expired snapshots, write it to /ifs/delete-me.sh and then execute /ifs/delete-me.sh:
isi snap usage | grep snapid | sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}' > /ifs/delete-me.sh && sh /ifs/delete-me.sh
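A slightly more cautious variant (my own preference, not part of the original recipe) is to build the script first, eyeball the snapids it is about to delete, and only then run it:
isi snap usage | grep snapid | sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}' > /ifs/delete-me.sh   # build the list
cat /ifs/delete-me.sh      # review the snapids before anything is deleted
sh /ifs/delete-me.sh       # run the clean-up once you are happy with the list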
Want to take it one step further?
isi snap usage | grep snapid | sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}' > /ifs/delete-me.sh && isi_for_array "sh /ifs/delete-me.sh"
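For anyone who hasn't come across it before, isi_for_array simply runs the quoted command on every node in the cluster; a harmless way to see it in action (the hostname command here is just for illustration) is:
zeus-1# isi_for_array "hostname"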
The isi_for_array version above will now execute the script on all nodes. If multiple nodes attempt to delete the same snapshot then the first one will win, and the remaining nodes will report something like the below and move on to the next snapshot in the list.
zeus-1# isi_stf_test delete 30171
zeus-2# truncate: ifs_snap_delete_lin: Stale NFS File handle
zeus-3# truncate: ifs_snap_delete_lin: Stale NFS File handle
zeus-2# isi_stf_test delete 30186
zeus-3# truncate: ifs_snap_delete_lin: Stale NFS File handle
zeus-3# isi_stf_test delete 30283
If the SnapshotDelete Job Engine job does start to run then it will clean up all remaining expired snapshots. If you are manually deleting a snapshot and the Job Engine attempts to delete the same snapshot, then you'll get the Stale NFS error at the command line and the Job Engine will take over the clean-up of that particular snapshot - the Job Engine outranks the command line :-)
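If you want to check whether the Job Engine has already picked the work back up before you start deleting by hand, the job list will tell you whether SnapshotDelete (or FlexProtect) is currently running. The exact syntax depends on your OneFS release; on the older releases this post applies to I'd expect something along the lines of:
zeus-1# isi job status
Newer releases expose the same information via isi job jobs list, so check the syntax for your version.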