As many will have seen, EMC have given notice that OneFS 6.x and 6.5.x will be going end-of-life during the next twelve months.
Having worked with Isilon storage since OneFS3.something_nasty, I have long considered OneFS6.5 my favourite outing of the scale-out filesystem. In fact, 6.5 has been such a good product for me that I completely skipped 7.0.x and have only recently truly embraced OneFS 7.1.x.
The OneFS 6 product line gave us so many features, key among them SmartPools, Kerberised NFS v3 and a much improved SyncIQ.
As I still have a number of systems running 6.5, I intend to keep blogging about OneFS6.5 until it becomes irrelevant, but will also be providing more posts on 7.1.x.
OneFS7.1 is slowly becoming a mature and stable storage OS that will, I'm sure, take the crown of my favourite scale-out filesystem very soon.
Isi Blogging?
Oh the fun you can have running Big Data storage in a small storage team.
Monday 2 June 2014
Friday 30 May 2014
The Long Road Back
It has been a little over a year since my last blog post. This was not intentional, but the result of a family bereavement and my taking time to reflect on life.
I am still a supporter of the Isilon technology and a keen believer in the importance of sharing knowledge and experience, so am pleased to make my return to this blog.
Saturday 20 April 2013
A Node By Any Other Name
One of the great things about the Isilon architecture is that you can add and remove nodes from your cluster.
Let’s say you have a cluster of three 12000X nodes and you want to replace them with three new X200 nodes. You could leave the original nodes in the cluster as a lower / slower tier of storage and make use of the SmartPools technology to place your different data types on the most appropriate nodes, or you could simply replace your old nodes with new ones.
- Suppose my cluster has three 12000X nodes zeus-1, zeus-2 and zeus-3.
- I add three X200 nodes into the cluster, which are assigned the names zeus-4, zeus-5 and zeus-6.
- I decide to retire / SmartFail the 12000X nodes and now have a cluster with just three nodes named zeus-4, zeus-5 and zeus-6.
I could leave things exactly as they are, but I’d rather have my three nodes with names zeus-1, zeus-2 and zeus-3. No problem, I can rename them (without downtime) using the isi conf command.
From an ssh window, launch isi conf
zeus-4# isi conf
zeus >>> lnnset 4 1
Node 4 changed to Node 1. Change will be applied on 'commit'
zeus >>> commit
Commit succeeded.
zeus-4#
As you can see above, the prompt still shows the old node name after the commit; you may need to reconnect your ssh session before the prompt picks up the new name. The hostname itself has already changed:
zeus-4#
zeus-4# hostname
zeus-1
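To renumber all three replacement nodes, the same lnnset line would be repeated for each node. A minimal sketch that only prints the lines to type at the isi conf prompt; lnn_plan is a hypothetical helper name, and it assumes several lnnset changes can be staged before a single commit, which the session above only demonstrates for one node:

```shell
#!/bin/sh
# Hypothetical helper: print the lnnset lines needed to shift a run of
# logical node numbers. Nothing is executed against the cluster.
# Usage: lnn_plan FIRST_OLD LAST_OLD FIRST_NEW
lnn_plan() {
    old=$1
    new=$3
    while [ "$old" -le "$2" ]; do
        echo "lnnset $old $new"
        old=$((old + 1))
        new=$((new + 1))
    done
    # Assumption: one commit applies all staged changes.
    echo "commit"
}

# Nodes 4, 5 and 6 become nodes 1, 2 and 3.
lnn_plan 4 6 1
```

Review the printed lines before typing them into isi conf; the old node numbers must already be free (the 12000X nodes SmartFailed out) for the renumbering to make sense.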
Saturday 23 March 2013
You don’t have to be a Maverick to adopt OneFS7, but it helps
OneFS 7
Isilon’s latest operating system emerged from beta in Q4 2012. During the beta Isilon chose the name Mavericks, which was a departure from many years of naming their betas after chillis.
As OneFS 7 has now reached the level of maturity where it
will ship by default on new hardware, it seems a good time to see if OneFS 7 is
a scorching Naga or a tepid Bell Pepper.
As someone who has used OneFS for many years, there are a
number of key features in OneFS7 that jump out at me:
Roles-based Administration: think of the current implementation very much as version 1. Isilon will build on this implementation and refine the granularity of permissions further with future releases, but even in its current incarnation I am more than pleased to see it in the product.
Fast snapshot restoration, writable snapshots and file cloning. All “enterprise” features that improve the ability for Isilon to compete in general purpose storage deployments against NetApp and others.
The excellent SyncIQ replication technology has been enhanced through the implementation of push-button failover / failback (this is something I need to spend more time playing with, but early signs are promising).
The IO capabilities of the filesystem have taken a step
forward with much focus being placed on concurrent and sequential file access,
but also improvements to IO latency – motivated in no small part by hopes of
hosting VMware deployments on the platform.
The term “Endurant Cache” has been coined to cover Isilon’s new approach
to caching and I am looking forward to delving into this in more detail in the
near future.
Unsurprisingly, tighter integration with VMware is a much-highlighted feature of OneFS 7. Take VAAI and VASA APIs, sprinkle in some Endurant Cache, Metadata Acceleration (via SSDs) and per-file cloning and you have a platform that is better placed to handle VMware workloads. Critically, though, there is still no deduplication and no read acceleration (think PAM cards) to see off boot storms, either of which would help put their VMware solution on a par with NetApp.
It would also be unfair not to comment on the continued progress being made by Likewise, the company acquired by Isilon in early 2012. Likewise have done a great job in making the SMB/SMB2 implementation on Isilon far better than it was a couple of major revisions back. There is still some way to go until the “unified” protocol access to the Isilon is as good as some other platforms, but things are certainly going in the right direction.
Wrapping It Up
So where are we with OneFS7? That probably depends on your use case.
In a Windows-centric environment, OneFS7 is probably a better fit than 6.5.x. SMB / SMB2 performance should be much better and more tuneable.
In a UNIX-heavy environment, I would still favour OneFS 6.5.x. Endurant Cache will help in many circumstances, but I’ve also heard that some operations may be slower in the current 7.0.x builds.
Undoubtedly OneFS7 is an improvement on OneFS6.5 and is testament to the significant investment that EMC continue to make into this true scale-out storage platform, but in my opinion it's still a little too new for prime-time. There will certainly be deployment scenarios where you could deploy and run it today (Windows-centric or VMware). Outside of those, you’d need to be a Maverick to deploy into production so soon after launch.
Based on previous Isilon release schedules, I would expect a substantial OneFS 7 point release by Q4 2013 that will bring with it additional features and stability. In the meantime, I would recommend that Isilon customers grab a copy of the new OS and try it where they can.
I'll post in more detail about the new features in future blog posts.
Thursday 7 March 2013
Sometimes you need to flush
One of the great things about the 200-series nodes (X and S) is that you can specify how much memory and how many SSDs you want in a node. Fantastic! I can put 48GB of RAM and 2 SSDs (for metadata acceleration) in an X200 node to host my commodity data, and 96GB of RAM and 4 SSDs in an S200 node to support my high-performance storage requirements.
The issue here is that you could potentially be the first / only customer running a particular config.
So what happens when you send a shut down command to a node with 96GB RAM running OneFS 6.5.x? Well, from some testing I ran at the start of this year, it looks about 70 / 30 that the node will shut down as expected. In the minority of cases the shut down is aborted, due to a timeout flushing data from memory.
To work around this issue you can run isi_flush before issuing the shutdown command. Testing of the flush before shut down proved to increase success to 100%, so we have a fix until we have a fix.
As you might expect, you can run isi_flush through isi_for_array to flush all nodes in a cluster prior to a shut down.
isi_for_array "isi_flush"
Interestingly, only the shut down command is impacted by the memory flush, reboots always work - go figure.
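The flush-first workaround can be wrapped so that the follow-up step only runs once the flush has returned cleanly. This is a rough sketch, not a tested procedure: flush_then and FLUSH_CMD are names made up here, the shutdown command itself is deliberately left to the caller, and it assumes isi_flush exits non-zero when the flush fails, which I have not verified.

```shell
#!/bin/sh
# FLUSH_CMD defaults to the cluster-wide flush shown above.
# Override it (for example with "true") to dry-run away from a cluster.
FLUSH_CMD=${FLUSH_CMD:-'isi_for_array "isi_flush"'}

# Hypothetical wrapper: run the flush, then run the command given as
# arguments only if the flush reported success.
flush_then() {
    if sh -c "$FLUSH_CMD"; then
        "$@"
    else
        echo "flush failed, not running: $*" >&2
        return 1
    fi
}
```

For a dry run, set FLUSH_CMD=true and call something harmless such as flush_then echo "flush ok, safe to shut down"; on a cluster you would pass whatever shutdown command you normally use in place of the echo.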
Wednesday 7 November 2012
The Job Engine - Good, But Not Great
In my last post I wrote about the virtues of the Job Engine. As good as it is, it suffers from one frailty: it can only execute one job at a time.
So what do you do if your cluster is 90% full and a drive fails?
Well, firstly your cluster is now going to be more than 90% full and the FlexProtect job is going to kick in at the highest priority.
Depending on the loading of your cluster, the number of nodes and the type of drives, FlexProtect could be running for a couple of days. So what happens to all those snapshots that would normally be deleted each night? That's right, they start to build up and your cluster's free space depletes faster.
Manual Snapshot Delete
Being able to
manually clear up expired snapshots is something you may need to do from time
to time, whilst the Job Engine is off doing something else.
If you run isi
snap list, you will get a list of all current snapshots, but not those which
have expired.
zeus-1# isi snap list
SIQ-571-latest                   31G   n/a (R)  0.01% (T)
SIQ-6ab-latest                   12G   n/a (R)  0.00% (T)
SIQ-e2f-latest                   21G   n/a (R)  0.01% (T)
SIQ-e2f-new                      597M  n/a (R)  0.00% (T)
...
To get a complete list of snapshots, run isi snap usage; this will show the expired snapshots as well.
zeus-1# isi snap usage
SIQ-571-latest                   31G   n/a (R)  0.01% (T)
SIQ-6ab-latest                   12G   n/a (R)  0.00% (T)
SIQ-e2f-latest                   21G   n/a (R)  0.01% (T)
SIQ-e2f-new                      597M  n/a (R)  0.00% (T)
[snapid 30171, delete pending]   15T   n/a (R)  6.21% (T)
[snapid 30186, delete pending]   1.6T  n/a (R)  0.67% (T)
[snapid 30283, delete pending]   31G   n/a (R)  0.01% (T)
...
As you can see
from the above there is a significant amount of space that can be freed up by
deleting the expired snapshots.
To delete a
snapshot run isi_stf_test delete snapid (replacing snapid with the numeric
snapshot id)
zeus-1# isi_stf_test delete 30171
The manual
snapshot clean-up will be much slower than the Job Engine, as the manual
process will only execute on a single node.
Manually cleaning up multiple snapshots
Here is a very
quick one-liner to script the clean-up of your expired snapshots.
isi snap usage | grep snapid | sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}'
This should generate output similar to the below. You can then either copy + paste it at the command line to run the clean-up or incorporate it into a script.
zeus-1# isi snap usage | grep snapid | sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}'
isi_stf_test delete 30171
isi_stf_test delete 30186
isi_stf_test delete 30183
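The grep / sed / awk chain keys on the word snapid anywhere in a line. A slightly stricter variation is sketched below: it only accepts rows of the form [snapid N, delete pending] and prints the commands for review rather than running them. expired_snapids is a name invented for this sketch, and the sample rows are copied from the isi snap usage listing earlier in this post so the parsing can be tried away from a cluster.

```shell
#!/bin/sh
# Read "isi snap usage" style text on stdin and print one expired
# snapshot ID per line. Only "[snapid N, delete pending]" rows match.
expired_snapids() {
    sed -n 's/^.*\[snapid \([0-9][0-9]*\), delete pending].*$/\1/p'
}

# Sample rows taken from the listing earlier in this post.
sample='SIQ-e2f-new                      597M  n/a (R)  0.00% (T)
[snapid 30171, delete pending]   15T   n/a (R)  6.21% (T)
[snapid 30186, delete pending]   1.6T  n/a (R)  0.67% (T)
[snapid 30283, delete pending]   31G   n/a (R)  0.01% (T)'

# Print the delete commands so they can be checked before being run.
printf '%s\n' "$sample" | expired_snapids | while read -r id; do
    echo "isi_stf_test delete $id"
done
```

On a live cluster you would feed it real output instead of the sample, i.e. isi snap usage | expired_snapids.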
To generate a list of expired snapshots, write them to /ifs/delete-me.sh and then execute /ifs/delete-me.sh:
isi snap usage | grep snapid | sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}' > /ifs/delete-me.sh && sh /ifs/delete-me.sh
Want to take it
one step further?
isi snap usage | grep snapid | sed -e 's/\,//' | awk '{print "isi_stf_test delete " $2}' > /ifs/delete-me.sh && isi_for_array "sh /ifs/delete-me.sh"
The above will
now execute the script on all nodes. If
multiple nodes attempt to delete the same snapshot then the first one will win
and the remaining nodes will report something like the below and move onto the next
snapshot in your list.
zeus-1# isi_stf_test delete 30183
zeus-2# truncate: ifs_snap_delete_lin: Stale NFS File handle
zeus-3# truncate: ifs_snap_delete_lin: Stale NFS File handle
zeus-2# isi_stf_test delete 30186
zeus-3# truncate: ifs_snap_delete_lin: Stale NFS File handle
zeus-3# isi_stf_test delete 30183
If the SnapshotDelete Job Engine job does start to run then it will clean up all remaining expired snapshots. If you are manually deleting a snapshot and the Job Engine attempts to delete the same snapshot then you'll get the Stale NFS error at the command line and the Job Engine will take over the clean-up of that particular snapshot - the Job Engine outranks the command line :-)
Tuesday 23 October 2012
Job Engine
The job engine is a key part of OneFS and is
responsible for maintaining the health of your cluster.
On a healthy cluster, the job engine should load at
boot time and remain active unless manually disabled. You can check that
the job service is running via the below:
zeus-1# isi services isi_job_d
Service 'isi_job_d' is enabled.
The above shows that the service is enabled. However, running isi services by itself will not report the status of this service; to see (or modify) isi_job_d you will need to specify the -a option so that all services are returned.
The -a option is a little verbose and returns 58 services as opposed to the default view of just 18, so you might want to pipe the output through grep.
zeus-1# isi services -a | grep isi_job_d
isi_job_d            Job Daemon                               Enabled
The below commands can be used to stop and start
the job engine.
zeus-1# isi services -a isi_job_d disable
The service 'isi_job_d' has been disabled.
zeus-1# isi services -a isi_job_d enable
The service 'isi_job_d' has been enabled.
The isi job list command can be used to see all
defined job engine jobs. The below are the more common jobs that you may
see running.
Name             Policy   Description
--------------------------------------------------------------------------------------
AutoBalance      LOW      Balance free space in the cluster.
Collect          LOW      Reclaims space that couldn't be freed due to node or disk issues.
FlexProtect      MEDIUM   Reprotect the file system.
MediaScan        LOW      Scrub disks for media-level errors.
MultiScan        LOW      Runs Collect and AutoBalance jobs concurrently.
QuotaScan        LOW      Update quota accounting for existing files.
SmartPools       LOW      Enforces SmartPools file policies.
SnapshotDelete   MEDIUM   Free space associated with deleted snapshots.
TreeDelete       HIGH     Delete a path in /ifs.
In the above the Policy column refers to the
schedule of the job and also the impact (the amount of CPU it can utilise).
Running isi job policy list will return the default scheduling.
zeus-1# isi job policy list
Job Policies:
Name            Start        End          Impact
--------------- ------------ ------------ ----------
HIGH            Sun 00:00    Sat 23:59    High
One of the key things to remember is that OneFS can only execute one job at a time. As each job is scheduled, a job ID is assigned to it. The job engine then executes the job with the lowest (integer) priority value. When two jobs have the same priority, the job with the lowest job ID is executed first.
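The ordering rule (lowest priority number first, lowest job ID as the tie-breaker) can be illustrated with a plain sort. This is only a model of the rule as described here, not how the job engine is implemented; the "NAME JOB_ID PRIORITY" layout and every value except AutoBalance's priority of 4 are invented for the example.

```shell
#!/bin/sh
# Input: one queued job per line as "NAME JOB_ID PRIORITY"
# (a made-up layout for this sketch).
# Output: the job that would run next under the rule described above:
# lowest priority number first, then lowest job ID.
next_job() {
    sort -k3,3n -k2,2n | head -n 1
}

# Hypothetical queue; only AutoBalance's priority of 4 comes from real output.
printf '%s\n' \
    'SmartPools 12 6' \
    'SnapshotDelete 14 2' \
    'FlexProtect 15 1' \
    'AutoBalance 9 4' | next_job
```

With this queue the sketch prints FlexProtect 15 1, since 1 is the lowest priority number even though its job ID is the highest.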
If for any reason you need a job other than the currently running job to execute, you can either start it with a lower priority value than any job currently scheduled, or you can pause all currently scheduled jobs apart from the one you want to run.
To see the currently running job, you can execute isi job status. This will also return information on paused, failed and recent jobs.
zeus-1# isi job status
Running jobs:
Job                        Impact Pri Policy     Phase Run Time
-------------------------- ------ --- ---------- ----- ----------
AutoBalance[8]             Low    4   LOW        1/3   0:00:01
No paused or waiting jobs.
No failed jobs.
Recent job results:
Time            Job                        Event
--------------- -------------------------- ------------------------------
10/17 15:38:15  MultiScan[1]               Succeeded (LOW)
When Jobs Fail To Run
There are a few situations where jobs won't run. Firstly, there may be no scheduled jobs; this is common on newly commissioned clusters where there is little or no data. You can manually kick off a job to ensure everything runs as expected.
If you have jobs that are scheduled but are not running then one of the below may be the reason.
A node is offline or has just rebooted.
The job engine will only run when all nodes are available. If a node has gone offline or has only just booted then you may find that no jobs are running.
Coordinator node is unavailable.
The job engine relies on one of the nodes acting as a job coordinator (this is usually the first node in the cluster). If this node is unreachable, heavily loaded or read-only then the jobs will be suspended. You can identify the coordinator node and its health by running the below.
zeus-1# isi job status -r
coordinator.connected=True
coordinator.devid=1
coordinator.down_or_read_only=False
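Because the -r output is plain key=value text, it is easy to check from a script. A minimal sketch follows, assuming the three keys appear exactly as shown above; coordinator_ok is a hypothetical helper name and the sample input is the output from this post, so nothing here needs a cluster to try out.

```shell
#!/bin/sh
# Read key=value lines (as printed by "isi job status -r" above) on stdin.
# Exit 0 when the coordinator is connected and not down or read-only,
# exit 1 otherwise (including when either key is missing).
coordinator_ok() {
    awk -F= '
        $1 == "coordinator.connected"         { connected = $2 }
        $1 == "coordinator.down_or_read_only" { down = $2 }
        END { exit !(connected == "True" && down == "False") }
    '
}

# Sample input copied from the output above.
printf '%s\n' \
    'coordinator.connected=True' \
    'coordinator.devid=1' \
    'coordinator.down_or_read_only=False' | coordinator_ok && echo "coordinator healthy"
```

On a cluster the same function could be fed directly, e.g. isi job status -r | coordinator_ok || echo "check the coordinator node".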
The isi job history command can be used to confirm when jobs last ran and how long they took.
--limit  [number of jobs to return, 0 returns all]
-v       [verbose output]
--job    [return information about a particular job type]
zeus-1# isi job history --limit=0 --job=AutoBalance -v
Job events:
Time            Job                        Event
--------------- -------------------------- ------------------------------
10/18 16:37:25  AutoBalance[8]             Waiting
10/18 16:37:25  AutoBalance[8]             Running (LOW)
10/18 16:37:25  AutoBalance[8]             Phase 1: begin drive scan
10/18 16:37:26  AutoBalance[8]             Phase 1: end drive scan
                                               Elapsed time: 1 second
                                               Errors:       0
                                               Drives:       4
                                               LINs:         3
                                               Size:         0
                                               Eccs:         0
10/18 16:37:27  AutoBalance[8]             Phase 2: begin rebalance
10/18 16:37:27  AutoBalance[8]             Phase 2: end rebalance
                                               Elapsed time: 1 second
                                               Errors:       0
                                               LINs:         169
                                               Zombies:      0
10/18 16:37:28  AutoBalance[8]             Phase 3: begin check
10/18 16:37:28  AutoBalance[8]             Phase 3: end check
                                               Elapsed time: 1 second
                                               Errors:       0
                                               Drives:       0
                                               LINs:         0
                                               Size:         0
                                               Eccs:         0
10/18 16:37:28  AutoBalance[8]             Succeeded (LOW)
The job engine may log information to /var/log/messages and /var/log/isi_job_d.log