Monday, 2 June 2014

OneFS 6.x - So Long Dear Friend

As many will have seen, EMC have given notice that OneFS 6.x and 6.5.x will be going end-of-life during the next twelve months.

Having worked with Isilon storage since OneFS3.something_nasty, OneFS6.5 has long been my favourite outing of the scale-out filesystem.  In fact, 6.5 has been such a good product for me that I completely skipped 7.0.x and have only recently truly embraced OneFS 7.1.x.

The OneFS 6 product line gave us so many features, key among them SmartPools, Kerberised NFS v3 and a much improved SyncIQ.

As I still have a number of systems running 6.5, I intend to keep blogging about OneFS6.5 until it becomes irrelevant, but will also be providing more posts on 7.1.x.

OneFS7.1 is slowly becoming a mature and stable storage OS that will, I'm sure, take the crown of my favourite scale-out filesystem very soon.


Friday, 30 May 2014

The Long Road Back

It has been a little over a year since my last blog post.  This was not intentional, but the result of a family bereavement and me taking time to reflect on life.

I am still a supporter of the Isilon technology and a keen believer in the importance of sharing knowledge and experience, so am pleased to make my return to this blog.

Saturday, 20 April 2013

A Node By Any Other Name


One of the great things about the Isilon architecture is that you can add and remove nodes from your cluster.  

Let’s say you have a cluster of three 12000X nodes and you want to replace them with three new X200 nodes.  Now you could leave the original nodes in the cluster as a lower / slower tier of storage and make use of the SmartPools technology to place your different data types on the most appropriate nodes, or you could simply replace your old nodes with new ones.

  • Suppose my cluster has three 12000X nodes zeus-1, zeus-2 and zeus-3.
  • I add three X200 nodes into the cluster, which are assigned the names zeus-4, zeus-5 and zeus-6.
  • I decide to retire / SmartFail the 12000X nodes and now have a cluster with just three nodes named zeus-4, zeus-5 and zeus-6. 

I could leave things exactly as they are, but I’d rather have my three nodes with names zeus-1, zeus-2 and zeus-3; no problem, I can rename them (without downtime) using the isi conf command.

From an ssh window, launch isi conf

zeus-4# isi conf
  
zeus >>> lnnset 4 1
  
Node 4 changed to Node 1. Change will be applied on 'commit'

zeus >>> commit
 

Commit succeeded.
zeus-4# 
  
As you can see in the above, the prompt still shows the old name; you may need to reconnect your ssh session before the new node name is reflected.

zeus-4#

zeus-4# hostname

zeus-1


Saturday, 23 March 2013

You don’t have to be a Maverick to adopt OneFS7, but it helps

OneFS 7

Isilon’s latest operating system emerged from beta in Q4 2012.  During the beta Isilon chose the name Mavericks, which was a departure from many years of naming their betas after chillies.

As OneFS 7 has now reached the level of maturity where it will ship by default on new hardware, it seems a good time to see if OneFS 7 is a scorching Naga or a tepid Bell Pepper.

As someone who has used OneFS for many years, there are a number of key features in OneFS7 that jump out at me:

Role-based Administration – Think of the current implementation very much as version 1; Isilon will build on this implementation and refine the granularity of permissions further with future releases, but even in its current incarnation, I am more than pleased to see it in the product.

Fast snapshot restoration, writable snapshots and file cloning.  All “enterprise” features that improve the ability for Isilon to compete in general purpose storage deployments against NetApp and others.

The excellent SyncIQ replication technology has been enhanced through the implementation of push-button failover / failback (this is something I need to spend more time playing with, but early signs are promising).

The IO capabilities of the filesystem have taken a step forward with much focus being placed on concurrent and sequential file access, but also improvements to IO latency – motivated in no small part by hopes of hosting VMware deployments on the platform.  The term “Endurant Cache” has been coined to cover Isilon’s new approach to caching and I am looking forward to delving into this in more detail in the near future. 

Unsurprisingly, tighter integration with VMware is a much-highlighted feature of OneFS 7.  Take the VAAI and VASA APIs, sprinkle in some Endurant Cache, Metadata Acceleration (via SSDs) and per-file cloning and you have a platform that is better placed to handle VMware workloads.  But critically there is still no deduplication or read acceleration to see off boot storms (think NetApp’s PAM cards), which would put their VMware solution on a par with NetApp.

It would also be unfair not to comment on the continued progress being made by Likewise, the company acquired by Isilon in early 2012.

Likewise have done a great job in making the SMB/SMB2 implementation on Isilon far better than it was a couple of major revisions back.  There is still some way to go until the “unified” protocol access to the Isilon is as good as on some other platforms, but things are certainly going in the right direction.

Wrapping It Up

So where are we with OneFS7?  That probably depends on your use case.

In a Windows-centric environment, OneFS7 is probably a better fit than 6.5.x.  SMB / SMB2 performance should be much better and more tuneable.

In a UNIX-heavy environment, I would still favour OneFS 6.5.x.  Endurant Cache will help in many circumstances, but I’ve also heard that some operations may be slower in the current 7.0.x builds.

Undoubtedly OneFS7 is an improvement on OneFS6.5 and is testament to the significant investment that EMC continue to make into this true scale-out storage platform, but in my opinion it's still a little too new for prime-time.  There will certainly be deployment scenarios where you could deploy and run today (Windows-centric or VMware), outside of those, you’d need to be a Maverick to deploy into production so soon after launch. 

Based on previous Isilon release schedules, I would expect a substantial OneFS 7 point release by Q4 2013 that will bring with it additional features and stability.  In the meantime, I would recommend that Isilon customers grab a copy of the new OS and try it where they can.

I'll post in more detail about the new features in future blog posts.

Thursday, 7 March 2013

Sometimes you need to flush

One of the great things about the 200-series nodes (X and S) is that you can specify how much memory or how many SSDs you want to add into a node.  Fantastic! I can put 48GB of RAM and 2 SSDs (for metadata acceleration) in an X200 node to host my commodity data and 96GB RAM and 4 SSDs in an S200 node to support my high-performance storage requirements.

The issue here is that you could potentially be the first / only customer running a particular config.

So what happens when you send a shutdown command to a node with 96GB RAM running OneFS 6.5.x?  Well, from some testing I ran at the start of this year, it looked about 70 / 30 that the nodes would shut down as expected.  In the minority of cases the shutdown is aborted, due to a timeout flushing data from memory.

To work around this issue you can run isi_flush before issuing the shutdown command.  Testing of the flush before shutdown proved to increase success to 100%, so we have a workaround until we have a fix.


As you might expect, you can run isi_flush through isi_for_array to flush all nodes in a cluster prior to a shutdown.

isi_for_array "isi_flush"

Interestingly, only the shutdown command is impacted by the memory flush; reboots always work - go figure.

Wednesday, 7 November 2012

The Job Engine - Good, But Not Great


In my last post I wrote about the virtues of the Job Engine, and as good as it is, it suffers from one frailty: it can only execute one job at a time.

So what do you do if your cluster is 90% full and a drive fails...

Well, firstly, your cluster is now going to be more than 90% full and the FlexProtect job is going to kick in at the highest priority.

Depending on the loading of your cluster, the number of nodes and the type of drives, the FlexProtect could be running for a couple of days.  So what happens to all those snapshots that would normally be deleted each night?  That's right, they start to build up and your cluster's free space depletes faster.


Manual Snapshot Delete


Being able to manually clear up expired snapshots is something you may need to do from time to time, whilst the Job Engine is off doing something else.

If you run isi snap list, you will get a list of all current snapshots, but not those which have expired.

zeus-1# isi snap list
SIQ-571-latest                      31G     n/a (R)    0.01% (T)
SIQ-6ab-latest                      12G     n/a (R)    0.00% (T)
SIQ-e2f-latest                      21G     n/a (R)    0.01% (T)
SIQ-e2f-new                        597M     n/a (R)    0.00% (T)
...

To get a complete list of snapshots run isi snap usage; this will show the expired snapshots as well.

zeus-1# isi snap usage
SIQ-571-latest                      31G     n/a (R)    0.01% (T)
SIQ-6ab-latest                      12G     n/a (R)    0.00% (T)
SIQ-e2f-latest                      21G     n/a (R)    0.01% (T)
SIQ-e2f-new                        597M     n/a (R)    0.00% (T)
[snapid 30171, delete pending]      15T     n/a (R)    6.21% (T)
[snapid 30186, delete pending]     1.6T     n/a (R)    0.67% (T)
[snapid 30283, delete pending]      31G     n/a (R)    0.01% (T)
...

As you can see from the above there is a significant amount of space that can be freed up by deleting the expired snapshots.

To delete a snapshot, run isi_stf_test delete snapid (replacing snapid with the numeric snapshot id).

zeus-1# isi_stf_test delete 30171

The manual snapshot clean-up will be much slower than the Job Engine, as the manual process will only execute on a single node.

Manually cleaning up multiple snapshots


Here is a very quick one-liner to script the clean-up of your expired snapshots.

isi snap usage | grep snapid |sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}'

This should generate an output similar to the below; you can then either copy + paste it at the command line to run the clean-up or incorporate it into a script.

zeus-1# isi snap usage | grep snapid |sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}'

zeus-1# isi_stf_test delete 30171
zeus-1# isi_stf_test delete 30186
zeus-1# isi_stf_test delete 30183
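
If you want to sanity-check the text processing away from a live cluster, the same pipeline can be dry-run against a few captured lines of isi snap usage output (the sample file below just reuses lines from the listing earlier in this post):

```shell
# Dry-run of the clean-up one-liner against captured sample output;
# no cluster is needed to verify the grep/sed/awk stages behave.
cat <<'EOF' > /tmp/snap-usage.txt
SIQ-571-latest                      31G     n/a (R)    0.01% (T)
[snapid 30171, delete pending]      15T     n/a (R)    6.21% (T)
[snapid 30186, delete pending]     1.6T     n/a (R)    0.67% (T)
EOF

# Same stages as the one-liner: keep expired lines, strip the comma,
# print a delete command for each numeric snapshot id ($2).
cmds=$(grep snapid /tmp/snap-usage.txt | sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}')
echo "$cmds"
# isi_stf_test delete 30171
# isi_stf_test delete 30186
```

Once you are happy with the generated commands, point the front of the pipeline back at the real isi snap usage output.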

To generate a list of expired snapshots, write them to /ifs/delete-me.sh and then execute /ifs/delete-me.sh

isi snap usage | grep snapid |sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}' > /ifs/delete-me.sh && sh /ifs/delete-me.sh

Want to take it one step further?

isi snap usage | grep snapid |sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}' > /ifs/delete-me.sh && isi_for_array "sh /ifs/delete-me.sh"

The above will now execute the script on all nodes.  If multiple nodes attempt to delete the same snapshot then the first one will win and the remaining nodes will report something like the below and move on to the next snapshot in your list.

zeus-1# isi_stf_test delete 30183
zeus-2# truncate:  ifs_snap_delete_lin: Stale NFS File handle
zeus-3# truncate:  ifs_snap_delete_lin: Stale NFS File handle
zeus-2# isi_stf_test delete 30186
zeus-3# truncate:  ifs_snap_delete_lin: Stale NFS File handle
zeus-3# isi_stf_test delete 30183

If the SnapshotDelete Job Engine job does start to run then it will clean up all remaining expired snapshots.  If you are manually deleting a snapshot and the Job Engine attempts to delete the same snapshot then you'll get the Stale NFS error at the command line and the Job Engine will take over the clean-up of that particular snapshot - the Job Engine outranks the command line :-)

Tuesday, 23 October 2012

Job Engine


The job engine is a key part of OneFS and is responsible for maintaining the health of your cluster.

On a healthy cluster, the job engine should load at boot time and remain active unless manually disabled.  You can check that the job service is running via the below:

zeus-1# isi services isi_job_d
Service 'isi_job_d' is enabled.

The above shows that the service is enabled; however, running isi services by itself will not report the status of this service.  To see (or modify) isi_job_d you will need to specify the -a option so that all services are returned. 

The -a option is a little verbose and returns 58 services as opposed to the default view of just 18, so you might want to pipe the output through grep.

zeus-1# isi services -a | grep isi_job_d
isi_job_d            Job Daemon         Enabled

The below commands can be used to stop and start the job engine.

zeus-1# isi services -a isi_job_d disable
The service 'isi_job_d' has been disabled.

zeus-1# isi services -a isi_job_d enable
The service 'isi_job_d' has been enabled.
  
The isi job list command can be used to see all defined job engine jobs.  The below are the more common jobs that you may see running.

Name             Policy     Description                   
--------------------------------------------------------------------------------------
AutoBalance      LOW        Balance free space in the cluster.
Collect          LOW        Reclaims space that couldn't be freed due to node or disk issues.
FlexProtect      MEDIUM     Reprotect the file system.
MediaScan        LOW        Scrub disks for media-level errors.
MultiScan        LOW        Runs Collect and AutoBalance jobs concurrently.
QuotaScan        LOW        Update quota accounting for existing files.
SmartPools       LOW        Enforces SmartPools file policies.
SnapshotDelete   MEDIUM     Free space associated with deleted snapshots.
TreeDelete       HIGH       Delete a path in /ifs. 

In the above the Policy column refers to the schedule of the job and also the impact (the amount of CPU it can utilise).  Running isi job policy list will return the default scheduling.

zeus-1# isi job policy list
Job Policies:                                                                  
Name            Start        End          Impact    
--------------- ------------ ------------ ----------
HIGH            Sun 00:00    Sat 23:59    High

One of the key things to remember is that OneFS can only execute one job at a time.  As each job is scheduled, a job ID is assigned to the job.  The job engine then executes the job with the lowest (integer) priority.  When two jobs have the same priority, the job with the lowest job ID is executed first.
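
That selection rule can be illustrated with a toy example.  The queue below is invented purely for the demonstration (the "priority jobid name" layout is mine, not isi job output):

```shell
# Toy illustration of the scheduling rule: the lowest priority number
# runs first, and a tie is broken by the lowest job ID.
queue='4 12 AutoBalance
1 15 FlexProtect
4 9 MediaScan'

# Numeric sort on priority (field 1), then job ID (field 2);
# the first line of the sorted output is the job that runs next.
next=$(printf '%s\n' "$queue" | sort -k1,1n -k2,2n | head -n 1)
echo "$next"   # FlexProtect wins: priority 1 beats the two priority-4 jobs
```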

If for any reason you need a job other than the currently running job to execute, you can either start a job with a lower priority number than any currently scheduled, or you can pause all currently scheduled jobs apart from the one you want to run.

To see the currently running job you can execute isi job status; this will also return information on paused, failed and recent jobs.

zeus-1# isi job status          

Running jobs:                                                                  
Job                        Impact Pri Policy     Phase Run Time  
-------------------------- ------ --- ---------- ----- ----------
AutoBalance[8]             Low    4   LOW        1/3   0:00:01

No paused or waiting jobs.
No failed jobs.

Recent job results:                                                                                                                                                                                                                        
Time            Job                        Event                         
--------------- -------------------------- ------------------------------
10/17 15:38:15  MultiScan[1]               Succeeded (LOW) 

When Jobs Fail To Run

There are a few situations where jobs won't run.  Firstly, if there are no scheduled jobs; this is common on newly commissioned clusters where there is little or no data.  You can manually kick off a job to ensure everything runs as expected.

If you have jobs that are scheduled but are not running then one of the below may be the reason.

A node is offline or has just rebooted.

The job engine will only run when all nodes are available; if a node has gone offline, or if a node has only just booted, then you may find that no jobs are running.

Coordinator node is unavailable.

The job engine relies on one of the nodes acting as a job coordinator (this is usually the first node in the cluster); if this node is unreachable, heavily loaded or read-only then jobs will be suspended.  You can identify the coordinator node and its health by running the below.

zeus-1# isi job status -r       
coordinator.connected=True
coordinator.devid=1
coordinator.down_or_read_only=False

The isi job history command can be used to confirm when jobs last ran and how long they took.

--limit     [number of jobs to return, 0 returns all]
-v           [verbose output]
--job       [return information about a particular job type]
  
zeus-1# isi job history --limit=0 --job=AutoBalance -v
Job events:                                                                                                                                                                                                                               
Time            Job                        Event                         
--------------- -------------------------- ------------------------------
10/18 16:37:25  AutoBalance[8]             Waiting
10/18 16:37:25  AutoBalance[8]             Running (LOW)
10/18 16:37:25  AutoBalance[8]             Phase 1: begin drive scan
10/18 16:37:26  AutoBalance[8]             Phase 1: end drive scan
        Elapsed time:                        1 second
        Errors:                              0
        Drives:                              4
        LINs:                                3
        Size:                                0
        Eccs:                                0
10/18 16:37:27  AutoBalance[8]             Phase 2: begin rebalance
10/18 16:37:27  AutoBalance[8]             Phase 2: end rebalance
        Elapsed time:                        1 second
        Errors:                              0
        LINs:                              169
        Zombies:                             0
10/18 16:37:28  AutoBalance[8]             Phase 3: begin check
10/18 16:37:28  AutoBalance[8]             Phase 3: end check
        Elapsed time:                        1 second
        Errors:                              0
        Drives:                              0
        LINs:                                0
        Size:                                0
        Eccs:                                0
10/18 16:37:28  AutoBalance[8]             Succeeded (LOW) 

The job engine may log information to /var/log/messages and /var/log/isi_job_d.log.