Thursday, May 08, 2014

Linux - Disk I/O

Disk I/O encompasses the input/output operations on a physical disk. If you are reading data from a file on disk, the processor has to wait for the file to be read. The real killer is disk access time: the time required for the computer to process a data request from the processor and then retrieve the data from the storage device. A hard disk is mechanical and much slower than the CPU; you have to wait for the platter to rotate to the required disk sector.

Hard disk latency is around 13 ms, which is 1.3e+7 nanoseconds (it can be smaller), while RAM latency is only about 83 nanoseconds. The difference in latency between RAM and a hard disk is enormous.
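
A rough way to feel that gap on a Linux box is to force every write to hit the disk and compare it with writes that are absorbed by the page cache in RAM. This is only a sketch; the file path, block size, and count are arbitrary choices:

$ dd if=/dev/zero of=/tmp/ddtest bs=4k count=1000 oflag=dsync
$ dd if=/dev/zero of=/tmp/ddtest bs=4k count=1000
$ rm /tmp/ddtest

The first run pays the disk latency on every 4 KB write and usually reports far lower throughput; the second mostly writes to RAM and finishes almost instantly.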

%iowait is usually used as an indicator of I/O performance, but it can be misleading: it is possible to have a healthy system with nearly 100% iowait, or a disk bottleneck with 0% iowait. %iowait is a CPU metric: it measures the percentage of time the CPU is idle while waiting for an outstanding I/O request to complete. Because it measures CPU time rather than the I/O itself, it is only indirectly related to I/O performance, and under random I/O workloads in particular it can paint a misleading picture.
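
The same iowait figure shows up in several tools; top and vmstat both report it alongside the other CPU states. A quick way to glance at it (the interval and count below are just illustrative):

$ vmstat 1 5

The "wa" column in the cpu section is the iowait percentage.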

How to monitor IOWait:

You can use the "iostat" command to check IOWait.
For example:
$ iostat 1

Linux 2.6.32-358.2.1.el6.x86_64 (xxx.xxx.com)  05/07/2014  _x86_64_ (8 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.41    0.00    0.61    0.03    0.00   97.95
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdb               0.65         0.01         3.36     226092   70440785
sda               2.12        10.25        55.19  214858840 1157345956
sdc              73.89        11.17       546.11  234341572 11452453033
sdd               0.20         0.30         2.20    6327596   46053137

The "1" here means "iostat" refreshes everysecond, you can change this value if you want it over a longer period.

tps: transfers per second, the number of I/O requests per second issued to the device.
Blk_read/s: the number of blocks per second read from the device.
Blk_wrtn/s: the number of blocks per second written to the device.
Blk_read: the total number of blocks read.
Blk_wrtn: the total number of blocks written.
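
With 2.4 and later kernels, a block in this output is a 512-byte sector, so sdc above is writing roughly 546.11 x 512 bytes, about 273 KB/s. If you would rather skip the conversion, iostat can report in kilobytes per second directly:

$ iostat -k 1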

If you want to get a more detailed output, use "iostat -x".
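
The extended report adds per-device columns such as await (the average time, in milliseconds, each request spends waiting for and being serviced by the device) and %util (how busy the device is). For example (the interval is again arbitrary):

$ iostat -x 1

On a saturated disk you will typically see await climbing well above the normal service time and %util approaching 100.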

But iowait could be misleading:

High %iowait is becoming more common as processor speeds increase. While processor performance doubles every 12 to 18 months, disk performance remains relatively constant. This imbalance has resulted in a trend toward higher %iowait on healthy systems.

The following example shows how a faster CPU can increase %iowait.

Before CPU speed increases:
CPU time = 60 ms
IO time = 20 ms
Total transaction time = CPU + IO = 60 + 20 = 80 ms
%iowait = IO time/total time = 20/80 = 25%

After CPU speed increase 4 times:
CPU time = 60 ms/4 = 15 ms
IO time = 20 ms
Total transaction time = CPU + IO = 15 + 20 = 35 ms
%iowait = IO time/total time = 20/35 = 57%

In this example, %iowait more than doubled (25% to 57%) even though the transaction actually completed more than twice as fast (80 ms down to 35 ms).

How do you identify an I/O problem?

The best way to identify an I/O problem is to trace it with a tool such as filemon or GLSOF and look at the actual request times. As a basic rule, read/write times should average 15-20 ms on non-cached disk subsystems; on cached disk subsystems, reads should average 5-20 ms and writes 2-3 ms. Unfortunately filemon is not available on Linux; it comes from AIX (the example trace below is from an AIX system), and similar tools exist for Windows and OpenBSD.
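
On Linux, a rough substitute for that kind of per-process I/O visibility (just commonly available tools, not the filemon workflow shown below) is pidstat, from the same sysstat package as iostat, or iotop:

$ pidstat -d 1
$ iotop -o

pidstat -d shows kB_rd/s and kB_wr/s per process every second; iotop -o (run as root) lists only the processes currently doing I/O.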

Here is an example of filemon:
A 90-second filemon trace from an actual customer system that was heavily utilized. The filemon command was:

# filemon -o /tmp/filemon.out -O lv,pv -T 320000; sleep 90; trcstop

The output is in /tmp/filemon.out. From the Detailed Physical Volume Section:
VOLUME: /dev/hdisk60  description: EMC Symmetrix FCP Raid1
reads:     9217   (0 errs)
  read sizes (blks):  avg    71.8 min       8 max     256 sdev    93.4
  read times (msec): avg  61.515 min   0.011 max 1643.486 sdev 130.135
  read sequences:  6249
  read seq. lengths: avg   105.9 min       8 max    3920 sdev   309.8
writes:    7023   (0 errs)
  write sizes (blks):  avg    43.0 min       8 max     256 sdev    37.6
  write times (msec): avg  40.651 min   0.003 max 1544.865 sdev  88.734
  write sequences:  6939
  write seq. lengths: avg    76.6 min       8 max    1696 sdev    88.9

seeks:       10188 (62.7%)
  seek dist (blks): init 0, avg 16784566.3 min       8 max 78792992 sdev 19185871.9
  seek dist (%tot blks):init 0.00000, avg 17.80295   min 0.00001 max 83.57367 sdev 20.34995
time to next req(msec): avg  22.074 min   0.006 max 2042.710 sdev  54.050
throughput:    1598.1 KB/sec
utilization:   0.73

What's good or bad?

If the first time you look at these numbers is when you are already in trouble, they are much less helpful. You should know how much I/O your server typically does (for critical servers, set up a cron job and save the output to a file). Once you know what normal disk I/O looks like for a server, you have a baseline to compare against when something goes wrong.
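
A minimal sketch of such a cron job, assuming the sysstat package is installed; the schedule, sample counts, and log path are arbitrary choices (note that % has to be escaped in a crontab entry):

0 * * * * /usr/bin/iostat -xk 60 59 >> /var/log/iostat-$(date +\%Y\%m\%d).log 2>&1

This takes one-minute extended samples for 59 minutes every hour and appends them to a dated log, which gives you a baseline to diff against during an incident.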

What you should expect:
As a rule of thumb, a single disk can deliver roughly the following (a rough derivation follows the list):

  • 7.2k RPM ~100 IOPS (tps)
  • 10k RPM  ~150 IOPS (tps)
  • 15k RPM  ~200 IOPS (tps)
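
These figures follow from the drive mechanics. A back-of-the-envelope estimate, using typical published seek times rather than measured values:

IOPS ≈ 1000 ms / (average seek time + average rotational latency)
7.2k RPM: rotational latency = (60000/7200)/2 ≈ 4.2 ms; with ~9 ms seeks, 1000/(9 + 4.2) ≈ 75 IOPS
15k RPM:  rotational latency = (60000/15000)/2 = 2.0 ms; with ~3.5 ms seeks, 1000/(3.5 + 2.0) ≈ 180 IOPS
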
Reference: Extreme Linux Performance Monitoring and Tuning
