Friday 19 October 2012

Upgrades, Removing Nodes & Re-image

OS Updates


With a monthly release schedule, upgrading the OneFS operating system could be seen as a regular task.  Fortunately, Isilon support aren't insistent on customers running the latest build.  As with most things, reviewing the build release notes will give you a good view as to whether a build should be deployed.

The upgrade process is very straightforward and can be driven from either the WebUI or the command line, both offering the same upgrade options.

The following covers the upgrade process from the command line:

  • Request build from Isilon support
  • ssh to the cluster, download build (via ftp) to any directory below /ifs
  • cd into the directory containing the build and run isi update

Update Options


-r            [Nodes upgraded & restarted individually (rolling) so cluster remains online]
--manual      [Prompt before rebooting each node following the upgrade]
--drain-time  [Specify how long to give clients to disconnect before rebooting the node]

(Drain time is specified in seconds by default, but can be given in hours with h, days with d and weeks with w)
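Putting those options together, a rolling upgrade that prompts before each node reboot and gives clients an hour to drain might look like this (a sketch; the build directory is illustrative):

```shell
# Illustrative invocation combining the options described above
cd /ifs/build
isi update -r --manual --drain-time 1h
```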


If you don't go with the rolling reboot, then you will be prompted to reboot the cluster once the update process has completed.

If you do go with a rolling upgrade then the upgrade process will loop through the nodes in sequential order, starting with the node after the one you are on - so if you have a four node cluster and run the upgrade from node 2, the upgrade will run in this order: 3, 4, 1, 2.
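That ordering is just modular arithmetic over the node numbers; a quick sketch in plain shell (nothing Isilon-specific, node count and start node are illustrative):

```shell
# Illustrative: compute the rolling-upgrade order for a cluster of
# $nodes nodes when the upgrade is launched from node $start
nodes=4     # cluster size (example)
start=2     # node the upgrade is run from
order=""
i=1
while [ "$i" -le "$nodes" ]; do
    order="$order $(( (start + i - 1) % nodes + 1 ))"
    i=$(( i + 1 ))
done
echo "Upgrade order:$order"   # -> Upgrade order: 3 4 1 2
```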

zeus-1# isi update 
Connecting to remote and local upgrade processes...
         successfully connected to node [  1].
Loading image...

Please specify the image to update. You can specify the image from:
-- an absolute path (i.e. /usr/images/my.tar)
-- http (i.e. http://host/images/my.tar)
Please specify the image to update:/ifs/build/OneFS_v6.5.4.4_Install.tar.gz
Node version :v6.5.3.1032 B_6_5_3_QFE_1032(RELEASE) (0x605030000040800)
Image version:6.5.4.4 B_6_5_4_76(RELEASE) (0x60504000040004c)
Are you sure you wish to upgrade (yes/no)?yes
Please wait, updating...
Initiating IMDD...
         node[  1] initialized.
Verifying md5...
Installing image...
         node[  1] installed.
Restoring user changes...
         node[  1] restored.
Checking for Firmware Updates...
Firmware update checks skipped...
         node[  1] Firmware check phase completed.
Updating Firmware...
Firmware updates skipped...
         node[  1] Firmware update phase completed.
Upgrade installed successfully.
Reboot to complete the process? (yes/no [yes])yes
Shutting down services.
         node[  1] Services shutdown.

Update Failures


If the upgrade fails because the update process times out when a node fails to shut down / reboot cleanly, you can restart the update process.  Just manually reboot the node that failed and launch the update once more; OneFS will be intelligent enough to continue through the nodes it missed.

Three update logs are written to /var/log for each attempt to upgrade a cluster: update_engine, upgrade_engine and update_proxy, with update_engine probably being the most useful.
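If you need to dig into a failed attempt, the logs can be inspected directly on a node (a sketch):

```shell
# Illustrative: follow progress on the node driving the upgrade
tail -f /var/log/update_engine

# after a failed attempt, sweep all three logs for errors
grep -i error /var/log/update_engine /var/log/upgrade_engine /var/log/update_proxy
```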


Removing Nodes


One of the great things about the Isilon platform is that you can add in nodes of different types (as long as you have a minimum of three of the same type).  This can make replacing older hardware with newer hardware generations much easier.

Removing a node is referred to as SmartFailing.  Multiple nodes can be SmartFailed at the same time, providing that at least 50% + 1 nodes remain within the cluster.  For safety, 50% + 2 would be a better limit, as you could then survive a further node failure during the SmartFail without the cluster going read-only.
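The quorum arithmetic above is easy to check in plain shell (the 8-node cluster size is illustrative):

```shell
# Illustrative quorum arithmetic for a hypothetical 8-node cluster
total=8
must_remain=$(( total / 2 + 1 ))            # 50% + 1 -> 5 nodes must stay in
max_smartfail=$(( total - must_remain ))    # so up to 3 can be SmartFailed at once
safer_limit=$(( total - (total / 2 + 2) ))  # 50% + 2 rule -> only 2, with headroom
echo "$max_smartfail $safer_limit"          # -> 3 2
```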

SmartFail

zeus-1# isi devices -a smartfail -d 1

!! Node 1 is currently healthy and in use. Do you want to initiate the
!! smartfail process on node 1? (yes, [no])

>>> yes

A FlexProtect job will start at priority 1, which will cause any other running jobs to pause until the SmartFail process completes.  The time to SmartFail a node will depend on a number of variables such as: node type, amount of data on the node(s), capacity within the cluster, average file size, cluster load and job impact setting.  Finger in the air would suggest 1 - 2 days per node.

isi status displays an S next to any nodes that are SmartFailing:

hermese-1# isi stat -q
Cluster Name: hermese
Cluster Health:     [ATTN]
Cluster Storage:  HDD                 SSD           
Size:             13G (13G Raw)       0             
VHS Size:         0                  
Used:             159M (1%)           0 (n/a)       
Avail:            13G (99%)           0 (n/a)       

                   Health Throughput (bps)    HDD Storage      SSD Storage
ID |IP Address     |DASR|  In   Out  Total| Used / Size      |Used / Size
---+---------------+----+-----+-----+-----+------------------+-----------------
  1|172.30.0.100   |--S-|    0|    0|    0|  159M/  13G(  1%)|    (No SSDs)   
------------------------+-----+-----+-----+------------------+-----------------

 Cluster Totals:        |    0|    0|    0|  159M/  13G(  1%)|    (No SSDs)    


Failing The SmartFail

If during the SmartFail process you decide that you no longer want to fail the node, you can cancel the process by executing a StopFail.

hermese-1# isi devices -a stopfail -d 1

!! This node is currently in the process of being smartfailed. We
!! recommend that you allow the process to complete. Do you want to
!! abort the smartfail process for this node? (yes, [no])

>>> yes
'stopfail' action succeeded.


Re-image / Re-format

In certain scenarios (single-node test clusters, for example) you might want to re-image a node; the isi_reimage command can be used to accomplish this.  When used in conjunction with the -b option, it is possible to re-image the node with any build you have media for.

isi_reimage -b OneFS_v5.5.4.21_Install.tar.gz

The isi_reformat_node command can be used to reset the configuration on a node, format the drives and re-image.  The command performs a variety of functions, such as checking wear on SSD drives before proceeding with the reformat.

isi_reformat_node with the --factory option will format / re-image the node, turn off the NVRAM battery and power off the node.  Useful if you are pulling a node for long-term storage or shipping it to another site.
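For example (a sketch - only ever on a node that is not part of a multi-node cluster):

```shell
# Illustrative: wipe and re-image the node, disable the NVRAM battery
# and power off, ready for long-term storage or shipping
isi_reformat_node --factory
```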

Needless to say, you don't want to run either of these commands on a node that is a member of a multi-node cluster.


3 comments:

  1. How often would you say a typical customer would need to go with the cluster reboot? Also generally how long of an outage is that?

  2. If you can get away with the downtime then the upgrade all nodes + reboot is much faster.

    Whereas a rolling reboot on a 20 node cluster may take you a couple of hours, the upgrade all + reboot should complete in 20 - 30 mins (assuming no nodes get stuck and need a manual reboot).

    Node reboots will become faster in OneFS 7.

  3. Is there any ability to remove a node from cluster which has "down" state and unable to be smartfailed ?

    Is there any ability to do the factory reset for the cluster ?

    Thank you in advance
