Packing of intervals is a fundamental operation in both the relational model and in SQL. Most treatments of intervals in SQL concern temporal ones, where each interval has start and end points in time; but the packing techniques covered in this article can be used with other types of intervals as well, such as spatial intervals on a line. Packing intervals means grouping each maximal set of contiguous intervals that overlap or abut (are adjacent), such that no interval outside the group overlaps or is adjacent to any interval in it, and returning the minimum start and maximum end for each group. Often, packing problems in SQL also involve a partitioning element (e.g., user, application), with the packing done independently for each partition.
This article introduces new, highly efficient (good performance, close to linear scaling), set-based solutions using standard SQL constructs. Some of the techniques do make use of T-SQL specific constructs, but those constructs have standard parallels so adaptation to other platforms should be fairly straightforward. For example, one of the solutions makes use of the T-SQL-specific construct APPLY. Standard SQL has a similar construct called LATERAL correlation.
I provided the packing intervals problem as a challenge in my blog in SQL Server Magazine (http://www.sqlmag.com/blogs/puzzled-by-t-sql.aspx), and I do recommend looking at the different solutions people suggested. Here I’m going to focus and expand on my solutions.
I’d like to thank a few people first. One of the techniques I used to address the interval packing problem was inspired by a technique used by Ben Flanaghan to address a different problem called Maximum Concurrent Sessions. I’ll tell you a little secret: when I saw Ben’s solution to that problem, I nearly cried; and having seen lots of very creative SQL techniques in my life, that’s saying something. Thanks to Alejandro Mesa for helping to verify the validity of the solutions. And also thanks to Herbert Albert, Gianluca Hotz, and Dejan Sarka for editing the article and providing input.
Sample Data and Desired Result
The scenario I’ll use to demonstrate my solutions involves user sessions against some application or service. Use the code in Listing 1 (on Microsoft SQL Server) to create a table called Sessions, two indexes to support the solutions, and sample data to test the solutions’ validity.
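For reference, a minimal sketch of the objects Listing 1 creates might look like the following. The exact data types, constraints, and sample rows are assumptions on my part; the index names match the ones referenced later in the article:

```sql
-- Sketch of the Sessions table and the two supporting indexes.
-- Data types and constraints are assumed; adjust to match Listing 1.
CREATE TABLE dbo.Sessions
(
  id        INT         NOT NULL,
  username  VARCHAR(14) NOT NULL,
  starttime DATETIME    NOT NULL,
  endtime   DATETIME    NOT NULL,
  CONSTRAINT PK_Sessions PRIMARY KEY(id),
  CONSTRAINT CHK_endtime CHECK(endtime >= starttime)
);

-- indexes that allow ordered scans of the events per user
CREATE INDEX idx_user_start_id ON dbo.Sessions(username, starttime, id);
CREATE INDEX idx_user_end_id   ON dbo.Sessions(username, endtime, id);
```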
Use the code in Listing 2 to create and populate the Sessions table on Oracle to test the portability of the solutions.
The desired result for the small set of sample data is shown in Table 1.
Figure 1 has a graphical depiction of both the original intervals from the Sessions table (green bars), as well as the packed intervals (red segments).
Concerning the grain, or precision, used for the start and end points of intervals: I’ll first present solutions that assume that if the end of one interval is less than the start of another, the two do not overlap or abut (are not adjacent). Later I’ll also provide a solution for cases where a gap of up to a certain length is to be ignored.
You can use the code in Listing 3 to populate the Sessions table in Microsoft SQL Server with a large set of sample data to test the performance of the solutions.
The code in Listing 3 populates the Sessions table with 5,000,000 rows. I filled it with data for 2,000 users, each with 2,500 sessions, during a period of a week, with each session lasting up to one hour. But the code allows you to change any element that you like to test the scaling of the solutions. The performance measures I’ll mention in the article are for the sample data with the current values in Listing 3. I tested the queries on a Dell Alienware M15x laptop with a Core i7 processor (quad-core with hyperthreading, i.e., 8 logical CPUs) and 4 GB of 1333 MHz RAM, against hot cache.
Note that if you’re not getting parallel plans where I do, it’s probably because your machine has fewer logical CPUs than mine. To get plans similar to mine, just for test purposes, you can use the SQL Server service startup option -P8, which causes SQL Server to use 8 schedulers, as in an environment with 8 logical CPUs. The -P startup parameter is not officially documented, so use it only to mimic a machine with a desired number of CPUs for test purposes, never in production.
Solution 1, Using Row Numbers
The first solution that I’ll cover is a standard solution that uses the ROW_NUMBER function. It runs without any alteration on both SQL Server and Oracle. The complete solution is shown in Listing 4. It runs for 18 seconds on my laptop against the sample data provided in Listing 3.
The code in the CTE called C1 unifies start events with end events into one chronological sequence of events (start or end). Start events are marked with a +1 event type since they increase the count of active sessions, and end events are marked with a -1 event type since they decrease the count of active sessions. Figure 2 shows the chronological sequence of unified events sorted by username, ts, type DESC, id, with green bars representing how many sessions are active before and after each event.
Observe that a packed interval always starts when the number of active sessions prior to a start event was zero, and ends when the number of active sessions after an end event is zero. Therefore, with respect to each start event you need to know how many sessions were active prior to it, and with respect to each end event you need to know how many sessions are active after it. This information is calculated in steps. Observe that the code in the CTE C1 calculates start ordinals for start events (attribute called s), with NULLs used as placeholders in that attribute for end events, and end ordinals for end events (attribute called e), with NULLs used as placeholders in that attribute for start events. The code in the CTE C2 then simply adds an ordinal for start or end events (attribute called se), partitioned by username, sorted by ts, type DESC, id. Table 2 has the output of the code in C2, sorted by username, ts, type DESC, id. For readability, I marked the start event types as +1 instead of just 1 and replaced NULLs with blanks.
Then the code in the CTE C3 is where most of the magic is done. For each start event you know how many sessions started so far (s), and you know how many sessions either started or ended so far (se). Therefore, you can easily calculate how many sessions ended so far (se - s). Now that you know how many sessions started and how many sessions ended, you can calculate how many sessions are active after the start event: s - (se - s). Think of it just like calculating how many people are in a room if x people enter the room and y people leave the room. Finally, to tell how many sessions were active prior to the start event, simply subtract 1 from the calculation: s - (se - s) - 1.
In a very similar way you can calculate the number of active sessions after each end event. Having both the number of sessions that ended thus far (e) and the number of sessions that either started or ended (se), you can calculate how many sessions started as: se - e. Then the number of active sessions is (se - e) - e.
Now, remember that you want to filter only start events where the number of active sessions prior to the event was zero, and end events where the number of active sessions after the event was zero. You can generalize the two filters into one:
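Based on the active-session calculations just described, the generalized filter can be reconstructed along these lines (a sketch on my part; since s is NULL for end events and e is NULL for start events, COALESCE picks the relevant expression for each row):

```sql
-- active prior to a start event: s - (se - s) - 1  (s is NULL for end events)
-- active after an end event:     (se - e) - e      (e is NULL for start events)
WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0
```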
What you have left post filtering are pairs of adjacent start-end events, each representing the start and end of a packed interval. So you need to assign a group identifier to each pair in order to be able to later pivot each pair into one row. This can be achieved by assigning row numbers (call it n), and applying the calculation (n - 1) / 2 + 1, where / represents integer division. For n values 1, 2, 3, 4, … you get a result of 1, 1, 2, 2, …
In SQL Server the arithmetic operator / represents integer division when the operands are integers, but in Oracle you get a decimal division, so I added a FLOOR function so that the code would run correctly on both platforms. So the code in the CTE C3 generates the output shown in Table 3.
What’s left to the outer query to do is group the rows from C3 by username and grpnum, and return the minimum ts as the packed interval’s start time, and the maximum ts as the end time.
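Putting the steps together, a reconstruction of the Listing 4 pattern might look like the following (a sketch based on the walkthrough above; ordering tiebreakers and other details may differ slightly from the actual listing):

```sql
WITH C1 AS
(
  -- unify start (+1) and end (-1) events; s/e number them per user
  SELECT id, username, starttime AS ts, +1 AS type,
         ROW_NUMBER() OVER(PARTITION BY username
                           ORDER BY starttime, id) AS s,
         NULL AS e
  FROM Sessions

  UNION ALL

  SELECT id, username, endtime AS ts, -1 AS type,
         NULL AS s,
         ROW_NUMBER() OVER(PARTITION BY username
                           ORDER BY endtime, id) AS e
  FROM Sessions
),
C2 AS
(
  -- se: ordinal among all start/end events per user
  SELECT C1.*,
         ROW_NUMBER() OVER(PARTITION BY username
                           ORDER BY ts, type DESC, id) AS se
  FROM C1
),
C3 AS
(
  -- keep only events that open or close a packed interval,
  -- then pair adjacent rows via (n - 1) / 2 + 1
  SELECT username, ts,
         FLOOR((ROW_NUMBER() OVER(PARTITION BY username
                                  ORDER BY ts, type DESC, id) - 1) / 2) + 1
           AS grpnum
  FROM C2
  WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0
)
SELECT username, MIN(ts) AS starttime, MAX(ts) AS endtime
FROM C3
GROUP BY username, grpnum;
```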
The plan generated by SQL Server’s optimizer for this query is highly efficient, given that you create the aforementioned indexes: idx_user_start_id and idx_user_end_id. The plan is shown in Figure 3.
What’s amazing about this plan is that it applies two ordered scans of the indexes created to support this solution (idx_user_start_id and idx_user_end_id), and relies on the ordered scans to (take a deep breath now): calculate the row numbers for the start ordinals (s) and for the end ordinals (e), perform a merge join to unify the results, calculate the start-or-end ordinals (se) on the unified set, and calculate the row numbers used to produce grpnum post filtering; and all this without requiring even one sort operator! It’s truly remarkable to see an optimizer that so beautifully understands the concept of order preservation. Big kudos to the SQL Server optimization team for this design. Finally, a hash aggregate is used to group the remaining rows (post filtering) by username and grpnum. Since most of the operations used in this plan have linear complexity, this solution should scale close to linearly.
So in total, this plan performs only two scans of the data (one of each index), in index order. As mentioned, this solution runs on my laptop for 18 seconds. The one thing that this solution doesn’t exploit well is parallelism. That’s where the second solution excels…
Solution 2, Using Row Numbers and APPLY to Exploit Parallelism
To exploit parallelism well, what you want is to encapsulate the logic from the solution in Listing 4 in a table expression that operates on a single user, and apply the table expression to each user. I’m assuming here that you have a table holding the distinct users, which is a fair assumption to make. The code in Listing 5 creates a table called Users and fills it with the same username values as the ones that appear in the Sessions table.
It is then convenient to encapsulate the logic from the solution in Listing 4 for a single user in an inline table function, as shown in Listing 6.
And then finally, use the CROSS APPLY operator (similar to LATERAL correlation in the standard) to apply the function to each user from the Users table as shown in Listing 7.
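To illustrate, here is a sketch of what Listings 6 and 7 might look like (the function name dbo.UserIntervals is my own; the body is simply the Solution 1 logic with the PARTITION BY username clauses dropped and a filter on the input user added):

```sql
CREATE FUNCTION dbo.UserIntervals(@user AS VARCHAR(14)) RETURNS TABLE
AS
RETURN
  WITH C1 AS
  (
    SELECT starttime AS ts, +1 AS type, id,
           ROW_NUMBER() OVER(ORDER BY starttime, id) AS s, NULL AS e
    FROM dbo.Sessions WHERE username = @user
    UNION ALL
    SELECT endtime, -1, id,
           NULL, ROW_NUMBER() OVER(ORDER BY endtime, id)
    FROM dbo.Sessions WHERE username = @user
  ),
  C2 AS
  (
    SELECT C1.*,
           ROW_NUMBER() OVER(ORDER BY ts, type DESC, id) AS se
    FROM C1
  ),
  C3 AS
  (
    -- integer division; no FLOOR needed in T-SQL
    SELECT ts,
           (ROW_NUMBER() OVER(ORDER BY ts, type DESC, id) - 1) / 2 + 1 AS grpnum
    FROM C2
    WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0
  )
  SELECT MIN(ts) AS starttime, MAX(ts) AS endtime
  FROM C3
  GROUP BY grpnum;
GO

-- apply the function to each user; each invocation handles one user's rows
SELECT U.username, P.starttime, P.endtime
FROM dbo.Users AS U
  CROSS APPLY dbo.UserIntervals(U.username) AS P;
```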
SQL Server generates a very efficient parallel plan for this query. The plan for this solution is shown in Figure 4.
As you can see, the plan uses a parallel scan of the clustered index on the Users table, and then performs the work for each user in the inner branch of the Nested Loops join. The work done in this inner branch should look familiar; it’s very similar to the work done in the plan shown in Figure 3, only this time for the data associated with one user. This inner branch is, of course, executed in parallel by multiple threads. This solution runs for only 2 seconds on my laptop.
Ignoring Gaps of Up To a Certain Length
In case you want the solution to ignore gaps of up to a certain length (e.g. 10 seconds), you can use the following approach:
• Add a computed column for endtime + <duration_to_ignore>; call that column endtime2
• Create an index on username, endtime2
• Use one of the aforementioned solutions with a couple of revisions:
o Operate on endtime2 instead of endtime
o In the outer query subtract <duration_to_ignore> from the result endtime
So, suppose you want to ignore gaps of up to 10 seconds. Use the following code to add the computed column endtime2 and the index:
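On SQL Server, the computed column and index could be created along these lines (the index name idx_user_end2_id is my own, following the article’s naming pattern):

```sql
-- endtime2 = endtime + 10 seconds; indexed so the solutions can
-- scan (username, endtime2, id) in order
ALTER TABLE dbo.Sessions ADD endtime2 AS DATEADD(second, 10, endtime);
CREATE INDEX idx_user_end2_id ON dbo.Sessions(username, endtime2, id);
```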
The code in Listing 8 then shows the revised solution from Listing 4 with the two aforementioned revisions.
The performance is similar to that of the solution in Listing 4. And naturally, also here you can use the APPLY approach for efficient use of parallelism.
Solution 3, Using a Window Aggregate
Solution 3 is shown in Listing 9. It relies on a window aggregate function with elements that are standard but not supported by SQL Server as of version 2008 R2. The solution is supported by Oracle, and hopefully the missing functionality will be added to SQL Server in a future version.
The solution in Listing 9 uses very similar principles to those used by Solution 1, only instead of using row numbers to calculate the number of active sessions at any given point, it uses a window SUM aggregate. A running sum of the type attribute (recall +1 represents a start event and -1 an end event), partitioned by username, in chronological order, gives the number of active sessions at any given point. Now, remember that for start events you need the number of active sessions prior to the event, and for end events you need the number after the event. Therefore you need to subtract 1 from the count for start events, and nothing for end events. So the solution generates an attribute called sub with 1 for start events and 0 for end events, and then subtracts that value from the running total, using the following expression:
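The expression can be reconstructed along these lines (a sketch; the exact frame specification in Listing 9 may be written differently):

```sql
-- running SUM of type (+1/-1) gives the active-session count after
-- the event; subtracting sub (1 for starts, 0 for ends) yields the
-- count prior to start events and after end events, which is the
-- value the filter compares to zero
SUM(type) OVER(PARTITION BY username
               ORDER BY ts, type DESC, id
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) - sub
```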
The rest is very similar to the logic in Solution 1.
Conclusion
In this article I presented new solutions to the fundamental interval packing problem. The first solution was based mainly on using row numbers to calculate ordinals for interval start and end events and, based on those, the count of active sessions. The second solution implemented the same logic for a single user and applied it to each user with the APPLY operator, thereby utilizing parallelism better. I also showed how to adapt the solutions when a gap of up to a certain length is to be ignored. Finally, I showed a third, more concise solution based on a standard window aggregate function with elements that are not yet supported by Microsoft SQL Server (as of version 2008 R2), but hopefully will be added in the future.
About the Author
Itzik Ben-Gan is a Mentor and Co-Founder of Solid Quality Mentors. A Microsoft SQL Server MVP (Most Valuable Professional) since 1999, Itzik has delivered numerous training events around the world focused on T-SQL querying, query tuning, and programming. Itzik is the author of several books, including Microsoft SQL Server 2008: T-SQL Fundamentals, Inside Microsoft SQL Server 2008: T-SQL Querying, and Inside Microsoft SQL Server 2008: T-SQL Programming.