MythBusters: Select * Redux

So a couple days ago I posted a blog about the myth that the wild card (*) in a SQL query degrades the performance of the query in comparison to a comparable query using explicit column declarations (for all the columns). As with many of the MythBusters TV episodes it seems I was a bit hasty to declare the myth busted. And as with the TV series I'm now going back and doing more testing. Will I eat crow? Find out in this thrilling new installment!

A fan writes:

In the extremely small test bed you created, you will not see any performance difference. But when working with tables with millions of rows, the performance increase gained by explicitly naming the columns versus using SELECT * can be very pronounced. I know this from personal experience, but that experience is only on MS-SQL 2K & 2K5. I do know on those platforms that if you use SELECT *, the DB engine will NOT be able to create an optimized, possibly pre-cached execution plan. On an extremely busy server, this can make a big difference in performance solely because it is waiting for CPU time to create the plan.

...

I have seen negative effects of SELECT * in as little as 100K rows. Using SELECT * also changes the way ODBC/JDBC drivers create the actual database calls, which can negatively impact performance as well.

If a DB Server is very active, and the main bottleneck is the CPU, then SELECT * is VERY detrimental to Query Performance. This is because the Execution Plan must be created on-the-fly for every single call of that query. That eats up CPU time, and it will also cause Lightweight SQL locks that can prevent other DB actions from happening until the locks are released.

And as I tried to illustrate in my post, there are things like MS-SQL Views that will flat-out BREAK if you use SELECT * and make ANY changes to the table columns or their "order" in the table design. Making this problem even worse, it does not even throw an error when this happens! MS-SQL will simply return 0 ROWS, with no error code, every time the View is called until it is manually recompiled by the user.

Okay, so lets review the myth. I was commenting on statements like one found on the SixSigns blog in which someone said "there is no excuse for using the *". The myth here is that irrespective of the width or volume of the table or the individual use case, the mere presence of a wild-card (*) will automatically degrade performance (as was stated in the previous quoted example) and is a big red warning flag indicating that the person who wrote the query is incompetent.

I'll reiterate here by saying that this myth is NOT that a query will return faster if it uses fewer columns, but that performance is automatically degaded when the same columns are returned using a wild card.

(I also didn't address the issue with views needing to be rebuilt, because that's not part of the myth. Moreover if you've worked with views SQL Server at all, you're well aware of this issue (one that Microsoft is in no hurry to fix even though they could) and hopefully have adopted a strategy of addressing it every time you alter a table. At several of the companies I've worked for this was done by maintaining a script with all the views in order of dependency and executing that script after any change to tables. If you have the proper permissions you can actually create a trigger on the sysobjects table that will recompile the views any time a column is added. It doesn't work so well if a column is dropped, but it's really poor design in SQL Server to begin with. And many people don't work with views at all, in which case this isn't an issue anyway, so it yet again is far from a reason for the wild-card to set off a big red flag.)

So following the previous reader comment, I removed most of the columns from my table, stripping it down to just four columns. I then duplicated all the data in my data set until I had 22 million records. Even with just this one very tiny table with two integers and two 35-character strings, this one database is now about 8 GIGABYTES in size. I then turned off all other applications (browsers, IM clients, etc.), disabled several non-essential Windows pocesses (including ColdFusion) and executed this query several times against SQL Server 2005 Express in SQL Studio:

select *
from MyTestTable
where col02 = 'UUID'

Using a notebook with 1gb RAM and a 1ghz dual-core processor, this query returned 2k records in an average of 3 minutes. I then replaced the wild card character with an explicit column list and executed the query several more times.

select col01, col02, col03, col04
from MyTestTable
where col02 = 'UUID'

Using the explicit column names the query executed also in an average of 3 minutes. It's interesting to note that the fastest time of 2:30 came from a query using the wild-card, while the slowest time of 5:00 came from a query using the explicit columns.

With 22 million records I ran out of drive space to create an index that would cover all four columns. I was curious to find out however in addition to the myth, what the performance difference would be between a 35-character varchar column and an integer column. I removed half of the records so I had about 12 million records and then created an index with all four columns. This time when I filtered using one of the varchar columns, the query returned in about 90 seconds. Then I filtered using one of the integer columns like this:

select *
from MyTestTable
where col01 = 4801

Without the index this query took the same 90 seconds as the varchar filter. When I used the integer column with the index however this query executed in less than a second, even with 11 million records in the table. However there is still no difference between the query with the wild-card and the query with explicit columns. They perform the same.

My inclination is to say based on these results that the SQL Server 2005 is expanding the columns in the table, caching the expanded list and then making the substitution before caching the execution plan. I say this because when I've restarted all the services and run the query with the wild-card, the first iteration of the query takes some time to complete and subsequent queries use the cached execution plan as expected. However when I switch between the wild-card and the query using the explicit columns, there is no initial load while it caches a new execution plan - it seems to use the same execution plan whether it's using the wild-card or explicit columns.

What remains is for me to test a "high traffic" situation.

I don't have a lot of equipment available at my disposal, so my test here is likely to be somewhat primitive. I am however using ColdFusion 8, which means that even with the developer edition on my notebook, I can use cfthread to simulate simultaneous requests. I'm limited to two additional threads per request, which means I can have at most three threads executing at a time (the main thread plus two additional threads).

I've created a page that loops over a list: 1,5,10,50. This is used to fetch that number of UUIDs from the table for a subsequent fetch operation. The page then loops over my choice of column lists with or without the wild-card and spawns two additional threads. In each thread (including the main thread), I then execute a cfhttp request to a second page which inserts a new record into the same table and performs a select statement to fetch records with the provided UUID's in the col02 column. It does this in a loop for ten iterations, just to make sure there's a good long overlap where all three threads are querying the database simultaneously with both inserts and selects. So as the data is being fetched, it's also being changed and the index is being updated.

I can tell you that throughout the duration of this test, both cores of my processor were pegged at 98-100% utilization except for very brief downward spikes. Checking the processes, SQL Server was consuming the majority of that CPU usage. So there should be no question that this is a "busy SQL Server".

When the page is asked to fetch records for only one UUID, it returns a little under two thousand records in the result and execution time for the query averages between 300 and 600. What remains is no movement on the performance between using the wild-card and using explicit columns. As the number of UUIDs passed to the querying page increase, the times to execute those queries also increase. With 5 IDs it's fetching about 9k records in about 1.5 seconds and the wild-card still has no effect. With 10 IDs it's fetching about 17k records in about 1.5-2.5 seconds and the wild-card still has no effect. Finally with 50 IDs it's fetching about 90k records in about 10 seconds and the performance difference? Still bupkiss.

To be honest I expected if there was going to be a performance difference it would be with the smaller queries. I had also run this test initially using the integer columns instead of the UUID columns as filters, which meant queries that executed much faster, but still failed to produce the desired result. And based on what I know about the caching of execution plans in SQL server, I would think actually that to run into a problem you would need to be executing a very wide variety of queries, which isn't happening here. But I think this pretty well proves that the mere number or size of the queries won't cause any wild-card problems related to the caching of execution plans... that is at least not unless you've already got much bigger problems to begin with.

The way to simulate problems with execution plan caching really would be to create a variety of random tables and then execute queries with random column lists and random selection criteria against them to produce the maximum number of possibly unique queries. The objective would be to create a system in which you had more than 10k (or whatever the limit is) unique SQL statements being executed. Note that this is different than the number of queries being executed. An individual query may be executed 4-billion times, it's still only ONE unique SQL statement (and using cfqueryparam tags will help to reduce the number of cached statements to improve performance). My guess based on what I've seen from these tests, is that in an environment in which you were executing enough unique SQL statements to cause SQL Server to be unable to handle caching its execution plans, the presence of wild-cards would be one of your smallest problems.

So is it plausible that a wild-card could cause a performance problem in lieu of selecting all the table's columns explicitly? Yes it's plausible... But that's not the myth.

What's the myth? The myth is that the mere presence of a wild-card character (*) in a select query is an automatic performance/scalability problem and that using a wild-card in production is a sign of an incompetent programmer.

That's not what we've seen however. What we've seen is that, with quite a bit of effort, I've been totally unable to reproduce a situation in which a wild-card performed any differently than explicitly declaring all columns. It seems that in a scenario in which all the columns are actually being used, creating a performance/scalability problem with a wild-card character is actually quite a challenge.

Myth BUSTED.

NOTE: New code for these tests can be downloaded below. These files should never be placed on a production server. I deliberately avoided using cfqueryparam tags to try and produce performance issues, which means these files are vulnerable to SQL injection. You've been warned! :)

Related Blog Entries

Comments
Gerald Guido's Gravatar The moral of the story? Don't use SQL server.

I just pulled 50,000 rows out of mysql (using SELECT * or the other way--matters not) in 0.07s.
# Posted By Gerald Guido | 10/22/08 4:47 PM
Troy Allen's Gravatar @Ike: Wow…that is some great testing and analysis…you really put some time into trying to bust this myth. For those readers who hit this post without reading the previous one, I am the "Fan" that Ike has quoted. It appears that you have changed the myth a bit from:

Using "select *" in the syntax of a query will slow the query down.

…to:

The mere presence of a wild-card character (*) in a select query is an automatic performance/scalability problem and that using a wild-card in production is a sign of an incompetent programmer.

So I will try to make sure I address the revised version of the myth. As I have already conceded, “never” is too strong of a word for this. Heck, in our business most things are rarely “never” or “always”…there is usually a spectrum, and there are almost always edge cases. But as good as your test scenario would appear to be thought out and executed, it is unfortunately still a test case and does not represent “real-world” usage of a busy MS-SQL Server (pure CPU load alone is not enough…you have to look at Latches, Locks, and a whole host of other indicators). I was trying to keep my previous posts quick and to-the-point without any real background information, but I guess I need to provide more details to make my point.

Before I do, however, in case no one reading this really cares about this extremely anal and granular level of detail about this “issue”, let me state one more time for the record: I agree with Ike’s overall point of “never use SELECT *” being OVERSTATED. There are cases when you can use it with no ill effects and in most cases it will not greatly affect performance. But Ike did add the whole “sign of an incompetent programmer” to the end of the myth, which makes my earlier point about breaking SQL Views a little more germane again. I also think for good SQL code “self-documentation”, explicitly listing the columns is also courteous to future developers, if nothing else.

Now on to the promised picking of the nit. I spent three years in charge of an absolute “hot house” case for SQL performance tuning and problems. This company was running 20 databases on one MS-SQL 2000 box. There were hundreds of Stored Procedures and Views (maybe over a thousand…I can’t remember the exact numbers). There were also hundreds of SQL Agent jobs running, including some that “ran constantly” and others that ran every minute. Some of the tables had millions of rows in them, and the internal and Web users combined meant average concurrent users in the thousands. This poor server was on its knees, completely overworked. When I took over, we immediately invested over $25K in hardware and software for a dedicated server to run Enterprise SQL Monitoring Tools from Quest and Idera. We had to know where the bottlenecks were, and what we should attack first. The first thing we discovered was that we were disk-bound, as we suspected. So we upgraded the pathetic single RAID-5 array (that’s right…one 6-disk Array for everything…YUCK) storage to multiple RAID 1+0 arrays to separate logs, data, and OS into their own arrays. This unclogged the data read/writes, so we were no longer disk bound.

As usually happens in this situation, once the drives could feed data fast enough to the server, we then became CPU bound. I discovered that all of the stored procedures names were prefixed with sp_ , which forces a recompile on every call of the SP (this is another rant, but if you have SPs named like this, CHANGE THEM ALL as soon as you can). We renamed all of the SPs, starting with the ones that ran most often, and noticed improvements immediately. But even once we finished that task, we were still CPU bound, and still getting Cache Misses. We dug down into the debugging logs of the tools, and noticed that almost all of the Cache Misses were for Execution Plans for SELECT * queries. So we again started with the “biggest offenders” and replaced * with the column names. We immediately saw positive improvements, and when we had finished them all, our server was no longer constantly CPU-bound.

We had the live monitoring tools, mainly Spotlight on SQL Server from Quest, to see the negative impact of SELECT * as it happened. We could watch the SQL requests for compiling execution plans and stored procedures/views getting backed up in the queue waiting on SELECT * plans to be created that should have already been in the cache.

Again, I know this is a pretty extreme case, and I have never seen anything near as bad (before or since). So if you want to totally dismiss this as an “edge case”, then fine. But it does clearly illustrate that there is the POTENTIAL for negative effects on performance by using SELECT *. And as I have said before, I am still waiting to hear someone give me a really good reason WHY to use SELECT * in the face of these potential issues.

I really enjoy your blog, Ike, and this exchange has been enjoyable. Keep “pushing the edges” of what are supposedly “common perceptions”…it is ALWAYS good to make people think about what they are doing and why they are doing it!
# Posted By Troy Allen | 10/23/08 2:37 PM
ike's Gravatar Thanks Troy.

I probably shouldn't have described it as a "common perception" -- it's common enough that I've run into it from a variety of people I've met in the IT industry, although that doesn't necessarily make it "common". Maybe I should have said "common enough". :P

So yes it happens. I just have kind of a knee-jerk response when I see comments like the one on the Six Signs blog. For my part I tend to feel that it's just not that important to consider performance on this scale (read: things you can't reasonably test for) until you actually have a performance issue. Once you've got issues with a server having problems like you described, then you can put analysis tools on it, fix those problems and move on. And preferably you would do that several times over the course of a few years and never wind up with the kind of large batch of problems you saw at that particular company.

I do think it would be interesting to know what it would take to create a test case that would actually cause those problems. And I say that in spite of the fact that given your back story my first inclination is to wonder about adding more hardware and load-balancing on the SQL server. That is assuming that you couldn't just move half those catalogs to a separate SQL server and run them in segregation -- which was how one of the SaaS companies I worked for handled it when each client had their own catalog.

It actually wasn't my intent to "change the myth". So it's "confirmed but unlikely" that it will cause these kinds of problems for equivalent queries.

I'm actually not married to using select * either, my comments about "why increase the maintenance cost" were more a matter of playing devil's advocate than anything else.
# Posted By ike | 10/23/08 4:47 PM
BlogCFC was created by Raymond Camden. This blog is running version 5.5.006. | Protected by Akismet | Blog with WordPress