Digital Edge uses a proprietary methodology when analyzing any type of enterprise storage. We want to make feedback from industry professionals available to the public. This could be a very useful guide for IT experts who are not deeply involved in storage but would like a high-level understanding of storage health, capacity and performance conditions. The proposed methodology is solely Digital Edge's approach to assessing enterprise storage and is not tied to any particular manufacturer or storage brand.
Before we begin, we want to remind you that storage is not just capacity technology, but capacity AND performance technology, and the two must be evaluated together. While capacity is very easy to analyze, the performance parameters may be confusing and not so obvious.
Areas to be analyzed:
1. Capacity allocation and expected IOPS.
2. Expected IOPS and load from servers.
3. Network expected performance and network load from servers.
4. System errors and warnings.
5. Patch levels and recommendations.
6. Conclusion.
Here is a brief description of the information collected and analyzed for each item. This description also explains why we believe our methodology is both valid and convenient for a high-level assessment. This methodology may not produce completely accurate troubleshooting-ready statistics; instead it assesses conditions and indicators for further tuning and troubleshooting.
Some fundamental statements to simplify our analysis:
- Enterprise storage could be SAN, NAS or a unified platform playing the roles of SAN and NAS at the same time.
- Enterprise-class NAS is a SAN with servers attached to the SAN infrastructure that expose SAN storage to servers over NAS protocols. In EMC terminology those servers are called "data movers." They are attached to the SAN through fiber channel interfaces. From the SAN's perspective, data movers are the same clients as any other servers connected to it.
- It is relatively easy for clients to build those servers without purchasing them from hardware manufacturers. However, servers pre-configured by manufacturers with high availability and a management interface may be beneficial.
- A SAN consists of controllers that are connected to the Storage Area Network through multiple fiber channel and/or iSCSI interfaces on the frontend and to disk trays on the backend.
- Capacity is provided by disks.
- Performance is a function of the performance parameters of the disks themselves, the controllers and the network.
- Each disk has pre-defined performance parameters. The faster the disk, the faster it can perform an I/O operation.
- The more disks participating in the I/O load, the better the performance of the system.
- Disks are aggregated into RAID groups. Performance of the SAN disks is a function of the configuration of the RAID groups. Performance of a RAID group depends on the number of disks included in the group, their speed and type, and the RAID penalty.
- RAID groups are carved into LUNs. LUNs are exposed to servers.
- As the performance of storage depends on RAID group configuration, LUNs on the same RAID group will affect each other, while LUNs on separate RAID groups will not. This is true as long as network I/O is not a bottleneck.
- Network performance is a function of the types and number of links to the Storage Area Network and the processing power of the controllers.
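The disk-count, disk-speed and RAID-penalty relationship described above can be sketched in a few lines of Python. The per-disk IOPS figures and write penalties below are common rule-of-thumb values, not vendor specifications, and the function name is our own:

```python
# Sketch of the expected-IOPS reasoning described above. Per-disk
# IOPS and RAID write penalties are common rule-of-thumb values
# (our assumptions), not figures from any specific vendor.

DISK_IOPS = {"15K_FC": 180, "10K_FC": 140, "SATA": 90}   # per spindle
RAID_WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}

def expected_iops(disks: int, disk_type: str, raid: str,
                  read_ratio: float = 1.0) -> float:
    """Host-visible IOPS of a RAID group.

    Raw IOPS scale with the number of spindles; writes are divided
    by the RAID penalty (extra backend I/Os per host write).
    """
    raw = disks * DISK_IOPS[disk_type]
    write_ratio = 1.0 - read_ratio
    return raw * read_ratio + raw * write_ratio / RAID_WRITE_PENALTY[raid]

# 4 x 15K FC disks in RAID 10, pure reads: the full raw spindle IOPS.
print(expected_iops(4, "15K_FC", "RAID10"))   # 720.0
# 5 SATA disks in RAID 5 with a 70/30 read/write mix lose IOPS
# to the write penalty:
print(round(expected_iops(5, "SATA", "RAID5", read_ratio=0.7), 2))
```

This is why the same number of spindles yields very different expected IOPS depending on RAID type and workload mix.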
Logical View of SAN
1. Capacity Allocation and Expected IOPS.
Capacity analysis can easily be presented in a capacity report. Capacity is shown by RAID group, along with how those RAID groups are carved into LUNs. The total expected I/O performance is displayed per RAID group.
Disks 0/0 - 0/4
RAID Group 0 | RAID 5 | Drive Type: FC | Capacity: 286GB | Percent Full: 99% | IOPS: 900
  LUN 61 PROD-ORACLE-Data | Size: 286GB | Host: NYORAN1/2 | Type: Oracle ASM | Used: 192GB (51%) | Free: 94GB

Disks 3/0 - 3/5
RAID Group 3 | RAID 5 | Drive Type: SATA | Capacity: 11005.93GB | Percent Full: 99% | Free: 0.928GB | IOPS: 630
  LUN 16 PROD-VMStore1 | Size: 2048GB | Host: ESXi1/2/3/4/5 | Type: VM Datastore | Used: 1.4TB | Free: 614GB
  LUN 29 PROD-VMStore5 | Size: 1TB | Host: ESXi1/2/3/4/5 | Type: VM Datastore | Used: 969GB | Free: 55GB
  LUN 30 PROD-SQLCLUSTER_DATA | Size: 500GB | Host: NYSQL1/2 | Type: Windows | Used: 299GB | Free: 201GB
  LUN 0 Place Holder | Size: 1GB | Host: None

Disks 5/1 - 5/5
RAID Group 9 | RAID 5 | Drive Type: SATA | Capacity: 5502GB | Percent Full: 83% | Free: 927GB | IOPS: 360
  LUN 41 PROD-ORACLE-LOGS | Size: 500GB | Host: NYORA1/2 | Type: Oracle ASM | Used: 47GB | Free: 453GB
  LUN 42 PROD-SQLCLUSTER_LOG | Size: 500GB | Host: NYSQL1/2 | Type: Windows | Used: 136GB | Free: 453GB
  LUN 45 PROD-EXCH-DATA | Size: 500GB | Host: EX1/2 | Type: Windows | Used: 284GB | Free: 216GB
  LUN 46 PROD-EXCH_LOG | Size: 1.4TB | Host: EX1/2 | Type: Windows | Used: 699GB | Free: 1.3TB
  LUN 49 QA-VMDATASTORE | Size: 325GB | Host: ESXi6/7/8 | Type: VM Datastore | Used: 123GB | Free: 202GB
  LUN 58 PROD-EXCH-DATA-II | Size: 500GB | Host: EX1/2 | Type: Windows | Used: 166GB | Free: 334GB
  LUN 68 QA-SQL-DATA-2 | Size: 300GB | Host: QASQL1/2 | Type: Windows | Used: Unmounted | Free: Unmounted | IOPS avg/max: 0/0
The first row includes disk information and the disk's position in the disk tray. The RAID group information includes RAID type, disk type, total capacity, free space and expected IOPS. Expected IOPS are calculated based on the number of disks in the group, disk speed and RAID type.
LUN information includes total capacity, the host(s) that mount the LUN, and the space used by the host(s).
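The hierarchy behind such a report (RAID groups carved into LUNs, LUNs mounted by hosts) can be modeled directly. Here is a minimal sketch using figures from the sample report above; the class and field names are our own invention:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Lun:
    name: str
    size_gb: float
    host: str = "None"
    used_gb: float = 0.0

    @property
    def free_gb(self) -> float:
        # Free space as seen by the host(s) mounting the LUN
        return self.size_gb - self.used_gb

@dataclass
class RaidGroup:
    name: str
    raid_type: str
    drive_type: str
    capacity_gb: float
    expected_iops: int
    luns: List[Lun] = field(default_factory=list)

    @property
    def percent_full(self) -> float:
        # How much of the group's capacity is already carved into LUNs
        return 100.0 * sum(l.size_gb for l in self.luns) / self.capacity_gb

# Figures from the sample report above:
rg0 = RaidGroup("RAID Group 0", "RAID 5", "FC", 286, 900)
rg0.luns.append(Lun("PROD-ORACLE-Data", 286, "NYORAN1/2", 192))
print(f"{rg0.name}: {rg0.percent_full:.0f}% full, expected IOPS: {rg0.expected_iops}")
```

Walking such a structure and printing each level produces exactly the kind of nested report shown above.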
2. Expected IOPS and Load from Servers.
In contrast to capacity, performance is difficult to assess. Therefore we offer a method that allows assessing SAN performance and illuminates potential problem spots. IT professionals can then use different techniques to go deeper into actual performance tuning and troubleshooting.
We often see examples of a mistaken vision when people think about SANs. People tend to think that the more load you put on a SAN, the slower the SAN will work. That is wrong! A SAN will work according to the parameters it was built with. If you configure a RAID group to provide 900 IOPS, it will deliver those expected IOPS. The applications on servers that are pushing I/O to the SAN may slow down, however, when the SAN cannot satisfy all of the requests. In such a case, requests will be queued on the server and the end user will begin to feel the SAN performing slower. In actuality, the SAN is working at the same speed; it just has more requests waiting for each other to finish.
SAN baseline performance can easily be tested with tools like iometer. After the Storage Area Network connectivity and RAID groups are set up, the performance of the SAN itself should remain constant. Performance might be affected by degraded RAID groups, mismatched hot spare disks or RAID rebuilding. Under normal circumstances, however, the SAN will not slow down.
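As a concrete illustration, a repeatable baseline test comparable to an iometer run can be expressed as a job file for fio, a widely used open-source I/O benchmark (fio is our substitution for iometer here; the block size, queue depth and device path are assumptions, and the path must point at a dedicated test LUN):

```ini
; baseline-randread.fio -- sketch of a SAN baseline job (assumptions:
; 8k random reads, queue depth 32; the filename is a placeholder).
[global]
ioengine=libaio
direct=1
runtime=60
time_based

[randread-8k]
rw=randread
bs=8k
iodepth=32
filename=/dev/sdX
```

Re-running the same job after any SAN change gives a like-for-like IOPS comparison against the recorded baseline.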
To assess SAN performance, we evaluate the expected IOPS provided by the RAID groups. Then we compare this value with the aggregated average and maximum IOPS from servers on all LUNs of the analyzed RAID group. Here's what it may look like:
Disks 1/0, 1/1, 2/0, 2/1
RAID Group 1 | RAID 10 | Drive Type: FC | Capacity: 1,073GB | Percent Full: 99% | IOPS: 720
  LUN 1 UAT-ORACLE-FILES | Size: 200GB | Host: NYORAUAT1/2 | Type: Oracle | Used: 181GB | Free: 19GB | IOPS (avg/max): 45/354
  LUN 8 PROD-ORACLE-FILES | Size: 200GB | Host: NYORAPROD1/2 | Type: Oracle | Used: 150GB | Free: 50GB | IOPS (avg/max): 52/643
  LUN 43 PROD-SQL-SERVER-DB | Size: 260GB | Host: NYSQLPROD | Type: Windows | Used: 184GB | Free: 76GB | IOPS (avg/max): 198/2077
  LUN 44 PROD-SQLUAT-SERVER-DB | Size: 260GB | Host: none | Free: 260GB
TOTAL EXPECTED: 720
TOTAL PUSHED (avg/max): 295/3074
In this example, RAID Group 1 is a RAID 10 group built on 15,000 RPM Fiber Channel disks. The expected performance of such a configuration is 720 IOPS. The I/O is measured for LUN 1, LUN 8 and LUN 43 from the server side using host built-in tools like PerfMon or iostat. Average and maximum values are recorded and then the totals are compared.
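The expected-versus-pushed comparison can be scripted. Here is a minimal Python sketch using the per-LUN figures from the example above; the threshold logic and variable names are ours:

```python
# Compare expected RAID-group IOPS with the aggregated per-LUN load
# measured on the hosts (e.g. with PerfMon or iostat). The per-LUN
# figures below come from the sample report; the code is a sketch.

RAID_GROUP_1_EXPECTED = 720

lun_iops = {            # (avg, max) IOPS pushed per LUN
    "LUN 1":  (45, 354),
    "LUN 8":  (52, 643),
    "LUN 43": (198, 2077),
}

total_avg = sum(avg for avg, _ in lun_iops.values())
total_max = sum(mx for _, mx in lun_iops.values())

print(f"TOTAL PUSHED (avg/max): {total_avg}/{total_max}")
if total_avg > RAID_GROUP_1_EXPECTED:
    print("Oversubscribed on average -- investigate LUN placement")
elif total_max > RAID_GROUP_1_EXPECTED:
    print("Spiky load -- check whether the spikes overlap in time")
```

For RAID Group 1 the average (295) is comfortably below 720, but the maximum (3074) is not, which is exactly the spike case discussed in the next section.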
In the end, a follow-up report is created for the entire SAN:
RAID GROUP # | IOPS EXPECTED | AGGREGATED IOPS PUSHED (avg/max) | AGGREGATED WAITS (avg/max) ms
RAID Group 0  | 900 | 233/376    | 1.25/23
RAID Group 1  | 720 | 295/3074   | 7.21/865.06
RAID Group 2  | 360 | 2/134      | 315.04/29329.85
RAID Group 3  | 630 | 1233/6103  | 4302.15/27239
RAID Group 4  | 270 | 106/3160   | 580.97/14342.94
RAID Group 5  | 180 | 4.26/250   | 2546.45/51500
RAID Group 6  | 720 | 31.38/602  | 6462.16/224913.33
RAID Group 9  | 360 | 3145/29233 | 885.42/23764.33
RAID Group 10 | 720 | 6.6/305    | 4838.33/126350
RAID Group 11 | 720 | 45/2875    | 4958.44/160240.66
RAID Group 14 | 630 | 264/2696   | 1320/4030
RAID Group 15 | 900 | 164/1903   | 837.21/2990
RAID Group 16 | 720 | 23/2262    | 2.258/50
RAID Group 17 | 720 | 2.371/377  | 1.154/49
RAID Group 18 | 360 | 147/1394   | 1510/9170
RAID Group 19 | 360 | 35/2571    | 4.21/69.95
RAID Group 20 | 180 | 6.9/80     | 6.88/135.75
RAID Group 21 | 180 | 150/790    | 3.905/38.09
RAID Group 22 | 180 | 9/224      | 9.91/131.6
RAID Group 23 | 180 | 0          | 0/0
RAID Group 24 | 720 | 1335/15037 | 0.92/26.22
RAID Group 25 | 720 | 2290/10973 | 2.08/29.625
RAID Group 26 | 180 | 55/239     | 1970/6280
META          |     |            | 1660.17/12966.29
RAID groups whose average pushed IOPS exceed the expected IOPS (marked in red in our reports) are oversubscribed on average: hosts are trying to push many more I/O requests than the RAID group can handle. This can be demonstrated using the average and maximum waits (the last column). These are assessment indicators that tell the storage admin to take a closer look at the LUNs. The reason for oversubscription could be a constant load, when applications are "frying" the disks and desperately need more I/O. In that case, more I/O can be gained by spreading the load across more physical spindles, introducing flash disks, adding caching, and so on.
High load indicators could also be the result of spikes. In that event, the time factors, the nature of the spikes and their time span should be reviewed and analyzed. It may be that RAID Group X, with 900 expected IOPS, has LUNs A, B and C, and the report from all LUNs shows about 800 IOPS on average with a maximum of about 1000. In some cases the load is well balanced between the LUNs and the RAID group: the maximum I/O is produced in different time frames, and the overall average does not exceed what the RAID group was provisioned for.
A deeper analysis of the READ/WRITE, WAITS and DISK QUEUE graphs should correlate, showing spikes at the same time. Sometimes spikes are caused by backups and can be ignored entirely if they do not occur during production hours.
Storage pools could be analyzed using the same logic.
3. Network Expected Performance and Network Load from Servers
Next, an assessment report is created based on network statistics collected on the SAN, the switches and the hosts.
Backend  | Expected Speed (Mbps) | Actual Avg/Max Load on switch (Mbps) | Aggregated load from servers (Mbps, avg/max) | Aggregated waits from servers (ms)
SPA Nic1 | 10,000 | 86/213 | 173/422 (SPA total) | 1342/28938 (SAN total)
SPA Nic2 | 10,000 | 87/209 |                     |
SPB Nic1 | 10,000 | 67/209 | 112/405 (SPB total) |
SPB Nic2 | 10,000 | 45/196 |                     |
In this situation we have a SAN with 4x10G iSCSI connections. Based on the average and maximum load from the switches, we see that we are far from saturation. The large waits are a function of IOPS: they accumulate while hosts wait for read/write operations from the SAN. The network does not contribute to the waits in this case.
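The saturation check behind that conclusion is simply peak load versus link speed. A minimal sketch using the switch figures from the table; the 80% warning threshold is our assumption:

```python
# Per-port switch load versus link speed, using the figures from the
# table above. The 80% warning threshold is our assumption.
LINK_SPEED_MBPS = 10_000

port_load = {            # (avg, max) Mbps observed on the switch
    "SPA Nic1": (86, 213),
    "SPA Nic2": (87, 209),
    "SPB Nic1": (67, 209),
    "SPB Nic2": (45, 196),
}

for port, (avg, peak) in port_load.items():
    util = 100.0 * peak / LINK_SPEED_MBPS
    flag = "SATURATING" if util > 80 else "ok"
    print(f"{port}: peak {peak} Mbps = {util:.2f}% of link -- {flag}")
```

Here every port peaks at roughly 2% of a 10G link, so the waits must come from the disks, not the network.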
4. System Errors and Warnings
System errors and warnings are collected on the SPA and SPB controllers. In most cases we find out about errors through our Enterprise Storage Monitoring system. However, for a complete report we assemble all the logs and load them into our database. Next, we group them by type and determine whether anything should be reported or taken under closer consideration.
Any database engine can be used to semi-automate the analysis of large amounts of data.
Over time, we have accumulated many SQL stored procedures and statements through our log analyses. These procedures and statements help us complete the analyses faster.
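The group-by step can be reproduced with any database engine. A self-contained sketch using SQLite from Python; the log rows, timestamps and severities are invented examples, not real SAN logs:

```python
import sqlite3

# Load collected controller logs into a table, then group by type.
# The sample rows are invented; real logs come from SPA/SPB collection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ts TEXT, severity TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?)",
    [
        ("2016-01-10 02:00", "WARN",  "Hot spare invoked"),
        ("2016-01-10 02:05", "ERROR", "Disk 3/4 soft media error"),
        ("2016-01-11 02:00", "ERROR", "Disk 3/4 soft media error"),
        ("2016-01-12 09:30", "INFO",  "NDU health check passed"),
    ],
)

# Group identical messages to see what repeats and should be escalated.
rows = conn.execute(
    "SELECT severity, message, COUNT(*) AS hits "
    "FROM logs GROUP BY severity, message "
    "ORDER BY hits DESC, severity"
).fetchall()
for severity, message, hits in rows:
    print(f"{hits:>3}  {severity:5}  {message}")
```

Recurring errors float to the top of the grouped output, which is what tells the admin where to look first.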
5. Patch Levels and Recommendations
We review the level of the management software in comparison to the current versions. Then we classify each patch with the following classification:
- Critical – Data Loss or Downtime
- Critical – Security
- Non-critical
We also check EOL (End of Life) or EOW (End of Warranty) dates and provide recommendations for our clients.
6. Conclusion.
Digital Edge believes that the preceding methodology should be practiced on storage devices at least once every quarter or six months. We believe that "even hardware should go for a blood test from time to time."
This gives enterprise IT groups assurance that everything is working as it is supposed to, that nothing is oversubscribed, and that applications are not "frying" HDDs.
We understand that enterprise IT groups have their own expert methods of using and configuring storage. Our methodology can be used by any of them, or the Digital Edge Enterprise Storage team can be engaged to provide an independent audit and assessment.