I/O Queue Depth Strategy for Peak Performance

​Download a PDF of this Article

 

 

SQL Server’s simple I/O queue depth model leaves much on the table. Here’s research-based evidence on how to improve it with a more comprehensive model of disk, SSD, and storage system performance.

I am finally getting around to setting up a solid-state drive (SSD) array, starting with a few devices, then working up to maybe 20 units across two controllers and 4x4 serial-attached SCSI (SAS) ports. During initial testing, I observed very high disk latency, in the range of 100ms+ for reads and up to 400ms+ for writes in certain operations. This occurs during periods of very high disk queue depth. So the questions arise:

1. Do high queue-depth and latency improve performance?
2. Do they cause problems, including slowing responsiveness, with other operations?
3. And can these problems be avoided without giving up performance?

Let’s start our search for answers with single disk drive performance principles, then proceed to disk arrays, caching RAID controllers, and finally to the complete storage system, including storage area networks (SANs) with consideration for SSD characteristics. We can trace the origin of common storage performance advice often cited without reference or context as to scope. And with this insight, we can understand why the simple I/O queue depth model used by SQL Server leaves much on the table and could be improved with a more comprehensive model of disk, SSD, and storage system performance.

 

Hard Disk IOPS Theory

The standard theory of disk drive random I/O rate is that average access time is the sum of the rotational latency, the average seek time, the transfer time, command overhead, and propagation delays. For small block I/O, only the first two items make significant contribution. The average rotational latency for a 15K disk drive is 2ms, and the typical 15K 3.5-inch disk cites an average seek time of 3.4ms. The two main contributors plus others add up to a total average access time of around 5.5ms. The two key qualifiers are:

1. For random accesses to data distributed over the entire disk
2. At queue depth 1, one I/O is issued; the next is issued after the completion of the first

A 15K disk is capable of 180 I/O operations per second (IOPS) at queue depth 1 for random accesses distributed across the entire disk. Too many people neglect to mention the two key qualifiers.

When the data accesses are to a limited range of disk cylinders, the average seek time is less, and queue depth 1 IOPS performance is higher.

 

Hard Disk Random Read IOPS vs. Queue Depth

Now consider the implication of the second qualifier: random-read IOPS versus queue depth. At higher queue depth, there are multiple I/Os to disk outstanding, either issued simultaneously or subsequent I/O issued before previously issued I/O are completed. The controller on the disk drive itself can reorder the I/O, with a net effect of reducing time between each I/O to increase the IOPS—at the expense of higher latency for each individual I/O.

There is a small gain at queue depth 2, perhaps to 200 IOPS; larger gains to queue depth 4 at 240 IOPS; and about 40-50 IOPS with each doubling of queue depth to 32, with small gain to queue depth 64. (The disk drives from 2005 have a 64-deep task queue; current ones have a 128-deep queue.) With each doubling of the queue depth, the latency almost doubles.

Figure 1 shows both the short-effect and queue-depth impact on IOPS. Increasing queue depth for I/O distributed across the entire disk improves IOPS to 400 at queue depth 64. The short-stroke effect improves performance at queue depth 1 to nearly 300 IOPS with 2.8% utilization. When the two effects are combined, it is possible drive over 600 IOPS per disk.

 

Figure 1 IOPS versus queue depth for various disk space utilizations.png 

Figure 1: IOPS versus queue depth for various disk space utilizations

 

Figure 2 shows the short-stroke effect and queue-depth versus access latency. Increasing queue depth for data accesses distributed across the entire disk has a high price in latency. Restricting data accesses to a narrow range significantly mitigates the high queue depth latency penalty.

 

Figure 2 Latency versus queue depth for various disk space utilizations.png 

Figure 2: Latency versus queue depth for various disk space utilizations

 

Individually, both effects improve hard disk performance, but the two combined produce more dramatic benefits. In online transaction processing (OLTP), response time (and hence disk I/O latency) is as important as throughput performance. Thus, the industry adopted the common rule to keep (time) averaged queue depth below 2 per disk (disregarding transient spikes) for OLTP. This rule was popular in the days of 5400rpm and 7200rpm drives.

In batch processing, where there’s not a live person waiting for each transaction to complete, the strategy is to drive queue depth high for the improved throughput performance. However, in data warehousing and decision-support systems (DSSs), all such rules go out the window. If there is unused throughput not being utilized, then that’s throughput squandered. Note the importance of the qualifiers, and yet people still feel that it’s acceptable to state the rule of “queue depth per disk below 2” without qualification.

 

RAID Controllers and Disk Arrays

In earlier days, we had just a bunch of disks (JBOD). The major RDBMS handled this situation by supporting multiple files, and multiple filegroups as well, for each database. Then there was light (sorry, wrong book) RAID and RAID controllers. And people saw that RAID was good, with fewer "disks" to manage at the operating system and database levels. A disk array appears to the operating system as a single disk, and performance counters were usually read from the operating system, not the storage system.

The rule of queue depth 2 per disk (and accompanying qualifiers) does not translate directly to the queue depth determined from the operating system performance counter. So it became popular to cite a latency rule that the data should not be higher than roughly 10ms-20ms, corresponding to the rule of queue depth 2 per disk (probably on 7200rpm and 10Krpm disks), having long since forgotten that there were additional qualifiers.

In general, data access latency below 10ms is usually an indication that transaction response time should be very good. Latency in the range 10ms-20ms should support acceptable transaction response time. Latency over 20ms corresponds to a heavily loaded disk system. More important, any transient surge would push the disk I/O into the very high queue depth range, with sharp spikes in response time. So even if average transaction response time is deemed acceptable, there could be a noticeable distribution tail experiencing very poor responsiveness.

 

RAID Group Random Read IOPS versus Queue Depth

So what should the I/O characteristics of a set of disks be in a RAID group? If disk read I/O were issued one at a time (i.e., queue depth 1), then one disk in the group would get the I/O. The other disks would be idle, for 180 IOPS delivered from the RAID group. At queue depth 2, it is likely the two I/O would go to different disks, so two disks would operate at queue depth 1, with the remaining disks idle. Only when the average queue depth is one per disk in the group would IOPS reach 180 per 15K disk, with some distribution probability adjustments.

For queue depth at some whole integer multiple-of-one per disk, the expected IOPS per disk is similar to the single-disk IOPS versus queue depth. The important point is that at queue depth 1, a given thread will experience single-disk IOPS, not disk-group IOPS performance.

 

Log Write Latency

In the old direct-attach storage days, the advice was to provision one dedicated RAID 1 disk pair for each high-transaction-volume database log. It was rarely mentioned, but the pure sequential small-block log write I/O could achieve latency on the order of 0.3ms, and around 3000-5000 IOPS.

SAN vendors frequently suggest not bothering with dedicated physical disks for each high-transaction-volume log. Everything will be fine, trust them. When dedicated disks are provisioned, and perhaps even with a dedicated service processor, the SAN still could not achieve very low log write latency. As SAN storage systems became pervasive, Microsoft changed SQL Server to allow more log writes to be in flight. (SQL Server 2000 SP4 and 2005 RTM allowed 8 log I/O outstanding per database. SQL Server 2005 SP1 allows 8 for 32-bit and 32 for 64-bit SQL Server and 480KB. SQL Server 2008 allows 3840KB.)

 

RAID Small-Block Random Write

RAID-level write overhead is outside the scope of this article. However, note that although people like to cite RAID 5 and 10 rules without qualification, the rule commonly cited concerning RAID 5 write performance only applies to small-block random writes. On a non-caching controller, we would expect write I/O to have similar IOPS characteristics as read I/O, adjusted for RAID-level overhead, both theoretical and controller specific.

 

Caching RAID Controllers and Read I/O

In other discussions on my blog and website, I’ve explained why read cache is counter-productive. In essence, the database engine itself is a data cache that is much closer and less expensive to access than cache on the storage controller. In addition, on a properly configured system, the database engine should have a much larger buffer cache than on the storage system. It is very unlikely that anything in the storage controller cache will be accessed again. Finally, the overhead of read-caching is significant in a storage system configured for high IOPS performance. Read-caching on the storage controllers incurs overhead for blocks that will almost never be accessed again.

Read-caching is usually disabled in TPC benchmark systems for the reasons just cited. One reputable source stated that a small, 2MB (not GB!) read cache per LUN to enable read-ahead is the preferred strategy. I recall someone saying that a specific server system with 48GB of memory showed I/O performance improvement when SAN cache was increased from 80GB to 120GB. What this result proves could be argued along more than one angle.

 

Caching RAID Controllers and Write I/O

And now to the random write I/O performance characteristics on a caching RAID controller. We have mostly skirted around write I/O until now. There is a reason for this. Figure 3 shows the small-block random write IOPS pattern with a caching RAID controller.

 

Figure 3 Small-block random write IOPS pattern with a caching RAID controller.png 

Figure 3: Small-block random write IOPS pattern with a caching RAID controller

 

When SQL Server or the operating systems sends one or more write I/Os to the RAID controller, the I/O is written to the controller cache, and a completion signal is sent back to the source. The next I/O is then sent. There is nearly no variation in IOPS versus queue depth. Latency is very low until the write volume reaches the IOPS limit. Beyond this, the write cache fills up, and latency becomes high until the source throttles back on write I/O volume.

 

Large Storage Systems with Large RAID Groups

As systems became more powerful, with compute performance growing 40% per year and hard disk performance averaging less than 10% per year (7.2K to 10K to 15K, then nothing until SSD), it was necessary to build storage systems with very large numbers of disks. During this period, SAN systems became pervasive, especially for large storage systems.

It was soon noticed that the SAN could not deliver anywhere near the expected IOPS based on the number of disks. One cause was traced to the Fibre Channel Host Bus Adapter (FC HBA) default queue depth setting of 32 (per adapter, now per target). The reasoning behind the default was the SAN-vendor doctrine of shared storage. To prevent one host from generating too much load, the I/O was throttled with the HBA queue depth setting so that all hosts could get a share of I/O volume.

If you were to measure IOPS versus the HBA queue depth setting on LUNs comprised of many disks, you should find that IOPS performance increases with the higher queue depth—all the way up to the maximum. This behavior is the same as described earlier in the section on RAID group IOPS versus queue depth.

 

FC HBA Queue Depth Setting

Note that in the early days, the HBA queue depth setting applied to either the HBA or each HBA FC port. On the more recent Emulex FC HBA, the default is now queue depth 32 per LUN, with the option of per-LUN or for the entire target. (QLogic uses the term Execution Throttle.) I suppose that in one of the few TPC-C benchmark reports with SAN storage system, a reference was made to changing the HBA queue depth from 32 to 254 without addition explanation.

The TPC-C systems all have very large disk arrays. Of course, it is proper to dial the HBA queue depth setting to maximum. Eventually, this was noted, and the recommendation to change HBA queue depth from 32 to 255 made its way into various Microsoft documents. The ones I have seen gave no explanation to the underlying cause and effect and supporting measurements.

So now what about a SAN with a small disk array—especially if the queue depth setting is per LUN and each LUN is comprised of four disks? Should the queue depth setting be increased to 254? I suggest following my guidelines for IOPS and latency versus queue depth, with adjustments for the number of disks per LUN. And weigh this all against whether your objective is OLTP responsiveness or batch/DSS pure throughput.

 

Sequential I/O

All of the IOPS-versus-queue-depth discussion so far has not pertained to sequential disk I/O. For large-block sequential I/O, a queue depth of 1 per LUN could be sufficient to generate maximum I/O bandwidth—if the I/O were large enough to span all disks in the LUN. I’m inclined to think that the theory is that the I/O size x queue depth should be larger than the number of disks in the array x the RAID stripe size. The reasoning is that each disk will have I/O to process, but I have not verified this hypothesis.

Increasing queue depth beyond the minimum necessary to achieve near maximum bandwidth will only serve to increase latency. In a mixed small- and large-block I/O workload, perhaps a higher queue depth on the large block might improve the large-block portion of throughput, but this hasn’t been studied. In a SAN, there are some suggestions that a higher queue depth may be necessary to reach maximum sequential bandwidth, along with multiple LUNs per RAID group. A satisfactory plausible explanation has not been provided.

 

SQL Server I/O Characteristics

There are several Microsoft documents that describe SQL Server I/O in detail. A selection include:

  1. • The CSS SQL Server Engineers blog How It Works: Bob Dorr's SQL Server I/O  
  2. • The SQL Server Technical Article by Emily Wilson, Mike Ruthruff, Thomas
  3. • SQL Server 2008R2 Books Online has the following under Buffer Management,
  4.    and Craig Freedman's Random Prefetching discussion on asynchronous I/O

In brief, in a table-scan operation, SQL Server will issue I/O to try to stay 1024 pages ahead of the scan with Enterprise Edition and 128 pages ahead in Standard Edition.

 

SQL Server Synchronous and Asynchronous I/O

In the random-like 8KB accesses for key lookup and loop join inner source rows, SQL Server switches from synchronous to asynchronous I/O at an estimated 25 rows, according to source.

Now consider the situation of a transaction-processing system that also handles reports. The transactions consist of several serially issued I/O at queue depth 1. The report is a single query that generates several hundred I/O issued asynchronously at high queue depth. Suppose that with just transactions being processed, the average queue depth per disk is 1, and the average latency is 5ms. And let’s say that a transaction requires 20 synchronous I/O completes in 100ms—a reasonable response time. Now the report runs, generating asynchronous I/O and driving queue depth 8 per disk and latency to 30ms. The report runs fine because the storage system is delivering 350 IOPS per disk. But the transaction with 20 serially issued I/O now takes 600ms to complete. In essence, the report has the effect of having higher priority than transaction processing!

 

What about Tempdb?

SQL Server I/O to tempdb frequently occurs at high queue depth. This is because the query has large hash or sort operations. If it were not large, then the hash or sort would be done in memory. So tempdb activity is frequently from large queries that generate asynchronous I/O at high queue depth.

If you only knew of the simple rule of I/O latency below 20ms, you might draw the conclusion that the tempdb disks are overloaded because the average latencies are very high. In fact, what is happening is that SQL Server is simply following the strategy for best performance with a throughput-oriented metric. The proper metric is whether tempdb can deliver sufficient I/O volume, not that tempdb I/O needs to be low.

 

High Queue Depth SQL on SSD

In a table-scan query without lock hints, a read queue depth of over 1300 was observed. The I/O size was 8KB, and read latency went above 200ms even on SSD storage. With the table lock, the I/O size was around 500K (probably mostly 512K plus a few small-block I/O), disk latency was less than 50ms, and queue depth was around 40.

For key lookup, we saw 8KB of I/O, and queue depth was around 160 with 7ms latency. With HDD storage and 20 or so disk, queue depth of 160 works out to 8 per disk, a reasonable number for good I/O but not excessive latency.

Marc Bevand on the Zorinaq's blog pointed out that IOPS at queue depth 1 is essentially a measure of latency. Suppose an SSD is rated at 100μs latency, and 30K IOPS for 8KB IO (30K x 8KB = 240MB). Then the queue depth 1 IOPS should be 10K (1,000,000 μs/s / 100μs). So the theory is that queue depth 3 or higher may be required to reach 30K IOPS. Keeping queue depth at the bare minimum necessary for maximum IOPS does not degrade performance for the query generating the huge I/O volume and does provide good responsiveness for other concurrent queries.

 

High Write Latency in Creating Clustered Indexes

The CREATE CLUSTERED INDEX command was observed to generate very high write latency. Queue depth was 500, latency was 600ms+, and I/O size averaged 100KB. Since this high latency should not occur during the work day, it’s a cause for concern. Still, there is no point issuing so many outstanding I/O. With either a caching RAID controller or SSD, write I/O bandwidth can be saturated even at low queue depth. Driving I/O so high only makes the system highly unresponsive for any function requiring I/O to the affected drives.

 

Wait Stat Tuning and Asynchronous Operations

Before wrapping up our look at I/O queue depth, I want to briefly bring up the subject of wait stat tuning. Many people today are tuning solely on wait statistics as the metric.

Consider the following example. Our storage system is comprised of 100 disks. A query generates 1 million I/O operations. If the I/O is issued synchronously at queue depth 1 per disk or 100 to the entire storage system, then the IOPS is 200 per disk or 20,000 for the storage system and disk latency is 5ms. The query should take 50 seconds to complete. The total wait time is 5ms per I/O, or 5,000 seconds for 1M I/O.

Now consider asynchronous I/O, driving queue depth per disk to 16, for 400 IOPS per disk and 40ms latency. The query now completes 1M I/O in 25 sec, but the total wait time is 40,000 seconds.

It is important to stay focused on the true metric and always evaluate system performance counters, not just wait time stats.

 

I/O Queue Depth Summary

We have explored in brief the key components that are impacted by IO queue depth strategy. The following are the main points to consider.

  1. • Random-read IOPS on hard disks can improve from operating at higher queue
  2.    depth at the expense of latency.
    • Sequential I/O does not or should not need high queue depth operation
  3.    beyond what is necessary to keep all disks busy. Staying 1024 pages ahead on
  4.    a table scan seems reasonable for large-block I/O, but I wouldn’t flood the
  5.    queue with 8KB I/O. This strategy should be adjusted based on I/O size, or
  6.    perhaps we should ask why sometimes 8K I/O is issued if the table is not
  7.    fragmented.
    • Random writes to a RAID controller with write cache do not need deep queue
  8.    depth for best performance.
    • SSD storage systems do not need very high queue depth for maximum
  9.    performance.

SQL Server appears to follow a set strategy on I/O queue depth, dependent only on edition (Standard or Enterprise). The number of disks behind each LUN is not considered; the usage model (OLTP versus data warehouse/DSS) is not considered.

However, here is a proposed strategy to follow to optimize performance:

  1. 1. Sequential I/O should not try for 1024 pages ahead if I/O size is 8KB.
    2. Write I/O to controllers with write cache should use lower queue depth.
    3. It is important to adjust random-read I/O queue depth based on the type of
  2.     storage: HDD or SSD.
    4. It is helpful to adjust HDD random-read I/O by usage model: OLTP or
  3.     DW/DSS.
    5. It is helpful to adjust HDD random-read I/O based on disks per LUN.

Some of the above adjustments could be detected automatically. Others might require a parameter setting, such as using sp_configure. As much as we would like a universal answer applied to the system independent of user action, making customized adjustments for your needs could greatly improve the usability of SQL Server. Today, a number of operations can render the SQL Server system completely unresponsive due to disk queue flooding, even with SSD storage. Only very large, perfectly configured storage systems would be immune.

 

About the Author

 

 

Joe Chang (blog | website) is a SQL Server consultant with Solid Quality Mentors and a tools developer specializing in quantitative database performance modelling and analysis, benchmarking, and storage and I/O performance. He has developed several free tools, including ExecStats for cross-referencing execution plans to index usage, SQL Trace for parsing SQL Server Profiler traces, SQL Clone for distribution statistics, and SQLSystem for system performance monitoring.

 

Follow us on: