So a couple days ago I posted a blog about the myth that the wild card (*) in a SQL query degrades the performance of the query in comparison to a comparable query using explicit column declarations (for all the columns). As with many of the MythBusters TV episodes it seems I was a bit hasty to declare the myth busted. And as with the TV series I'm now going back and doing more testing. Will I eat crow? Find out in this thrilling new installment!
A fan writes:
In the extremely small test bed you created, you will not see any performance difference. But when working with tables with millions of rows, the performance increase gained by explicitly naming the columns versus using SELECT * can be very pronounced. I know this from personal experience, but that experience is only on MS-SQL 2K & 2K5. I do know on those platforms that if you use SELECT *, the DB engine will NOT be able to create an optimized, possibly pre-cached execution plan. On an extremely busy server, this can make a big difference in performance solely because it is waiting for CPU time to create the plan....
I have seen negative effects of SELECT * in as little as 100K rows. Using SELECT * also changes the way ODBC/JDBC drivers create the actual database calls, which can negatively impact performance as well.
If a DB Server is very active, and the main bottleneck is the CPU, then SELECT * is VERY detrimental to Query Performance. This is because the Execution Plan must be created on-the-fly for every single call of that query. That eats up CPU time, and it will also cause Lightweight SQL locks that can prevent other DB actions from happening until the locks are released.
And as I tried to illustrate in my post, there are things like MS-SQL Views that will flat-out BREAK if you use SELECT * and make ANY changes to the table columns or their "order" in the table design. Making this problem even worse, it does not even throw an error when this happens! MS-SQL will simply return 0 ROWS, with no error code, every time the View is called until it is manually recompiled by the user.
Okay, so lets review the myth. I was commenting on statements like one found on the SixSigns blog in which someone said "there is no excuse for using the *". The myth here is that irrespective of the width or volume of the table or the individual use case, the mere presence of a wild-card (*) will automatically degrade performance (as was stated in the previous quoted example) and is a big red warning flag indicating that the person who wrote the query is incompetent.
I'll reiterate here by saying that this myth is NOT that a query will return faster if it uses fewer columns, but that performance is automatically degaded when the same columns are returned using a wild card.
(I also didn't address the issue with views needing to be rebuilt, because that's not part of the myth. Moreover if you've worked with views SQL Server at all, you're well aware of this issue (one that Microsoft is in no hurry to fix even though they could) and hopefully have adopted a strategy of addressing it every time you alter a table. At several of the companies I've worked for this was done by maintaining a script with all the views in order of dependency and executing that script after any change to tables. If you have the proper permissions you can actually create a trigger on the sysobjects table that will recompile the views any time a column is added. It doesn't work so well if a column is dropped, but it's really poor design in SQL Server to begin with. And many people don't work with views at all, in which case this isn't an issue anyway, so it yet again is far from a reason for the wild-card to set off a big red flag.)
So following the previous reader comment, I removed most of the columns from my table, stripping it down to just four columns. I then duplicated all the data in my data set until I had 22 million records. Even with just this one very tiny table with two integers and two 35-character strings, this one database is now about 8 GIGABYTES in size. I then turned off all other applications (browsers, IM clients, etc.), disabled several non-essential Windows pocesses (including ColdFusion) and executed this query several times against SQL Server 2005 Express in SQL Studio:
select *
from MyTestTable
where col02 = 'UUID'
Using a notebook with 1gb RAM and a 1ghz dual-core processor, this query returned 2k records in an average of 3 minutes. I then replaced the wild card character with an explicit column list and executed the query several more times.
select col01, col02, col03, col04
from MyTestTable
where col02 = 'UUID'
Using the explicit column names the query executed also in an average of 3 minutes. It's interesting to note that the fastest time of 2:30 came from a query using the wild-card, while the slowest time of 5:00 came from a query using the explicit columns.
With 22 million records I ran out of drive space to create an index that would cover all four columns. I was curious to find out however in addition to the myth, what the performance difference would be between a 35-character varchar column and an integer column. I removed half of the records so I had about 12 million records and then created an index with all four columns. This time when I filtered using one of the varchar columns, the query returned in about 90 seconds. Then I filtered using one of the integer columns like this:
select *
from MyTestTable
where col01 = 4801
Without the index this query took the same 90 seconds as the varchar filter. When I used the integer column with the index however this query executed in less than a second, even with 11 million records in the table. However there is still no difference between the query with the wild-card and the query with explicit columns. They perform the same.
My inclination is to say based on these results that the SQL Server 2005 is expanding the columns in the table, caching the expanded list and then making the substitution before caching the execution plan. I say this because when I've restarted all the services and run the query with the wild-card, the first iteration of the query takes some time to complete and subsequent queries use the cached execution plan as expected. However when I switch between the wild-card and the query using the explicit columns, there is no initial load while it caches a new execution plan - it seems to use the same execution plan whether it's using the wild-card or explicit columns.
What remains is for me to test a "high traffic" situation.
I don't have a lot of equipment available at my disposal, so my test here is likely to be somewhat primitive. I am however using ColdFusion 8, which means that even with the developer edition on my notebook, I can use cfthread to simulate simultaneous requests. I'm limited to two additional threads per request, which means I can have at most three threads executing at a time (the main thread plus two additional threads).
I've created a page that loops over a list: 1,5,10,50. This is used to fetch that number of UUIDs from the table for a subsequent fetch operation. The page then loops over my choice of column lists with or without the wild-card and spawns two additional threads. In each thread (including the main thread), I then execute a cfhttp request to a second page which inserts a new record into the same table and performs a select statement to fetch records with the provided UUID's in the col02 column. It does this in a loop for ten iterations, just to make sure there's a good long overlap where all three threads are querying the database simultaneously with both inserts and selects. So as the data is being fetched, it's also being changed and the index is being updated.
I can tell you that throughout the duration of this test, both cores of my processor were pegged at 98-100% utilization except for very brief downward spikes. Checking the processes, SQL Server was consuming the majority of that CPU usage. So there should be no question that this is a "busy SQL Server".
When the page is asked to fetch records for only one UUID, it returns a little under two thousand records in the result and execution time for the query averages between 300 and 600. What remains is no movement on the performance between using the wild-card and using explicit columns. As the number of UUIDs passed to the querying page increase, the times to execute those queries also increase. With 5 IDs it's fetching about 9k records in about 1.5 seconds and the wild-card still has no effect. With 10 IDs it's fetching about 17k records in about 1.5-2.5 seconds and the wild-card still has no effect. Finally with 50 IDs it's fetching about 90k records in about 10 seconds and the performance difference? Still bupkiss.
To be honest I expected if there was going to be a performance difference it would be with the smaller queries. I had also run this test initially using the integer columns instead of the UUID columns as filters, which meant queries that executed much faster, but still failed to produce the desired result. And based on what I know about the caching of execution plans in SQL server, I would think actually that to run into a problem you would need to be executing a very wide variety of queries, which isn't happening here. But I think this pretty well proves that the mere number or size of the queries won't cause any wild-card problems related to the caching of execution plans... that is at least not unless you've already got much bigger problems to begin with.
The way to simulate problems with execution plan caching really would be to create a variety of random tables and then execute queries with random column lists and random selection criteria against them to produce the maximum number of possibly unique queries. The objective would be to create a system in which you had more than 10k (or whatever the limit is) unique SQL statements being executed. Note that this is different than the number of queries being executed. An individual query may be executed 4-billion times, it's still only ONE unique SQL statement (and using cfqueryparam tags will help to reduce the number of cached statements to improve performance). My guess based on what I've seen from these tests, is that in an environment in which you were executing enough unique SQL statements to cause SQL Server to be unable to handle caching its execution plans, the presence of wild-cards would be one of your smallest problems.
So is it plausible that a wild-card could cause a performance problem in lieu of selecting all the table's columns explicitly? Yes it's plausible... But that's not the myth.
What's the myth? The myth is that the mere presence of a wild-card character (*) in a select query is an automatic performance/scalability problem and that using a wild-card in production is a sign of an incompetent programmer.
That's not what we've seen however. What we've seen is that, with quite a bit of effort, I've been totally unable to reproduce a situation in which a wild-card performed any differently than explicitly declaring all columns. It seems that in a scenario in which all the columns are actually being used, creating a performance/scalability problem with a wild-card character is actually quite a challenge.
Myth BUSTED.
NOTE: New code for these tests can be downloaded below. These files should never be placed on a production server. I deliberately avoided using cfqueryparam tags to try and produce performance issues, which means these files are vulnerable to SQL injection. You've been warned! :)