Today on MythBusters: Select *
Yesterday I noticed an entry on the Six Signs blog about some security issues on a fairly well known server. I don't know who posted it unfortunately because their blog doesn't show the author's name. He pointed out several rather glaring security issues on the site in question like not scoping and the lack of cfqueryparam tags. There is however one comment he made which I feel is something of a myth in the software community and I'd like to take a minute to address that myth.
The myth is this: that using "select *" in the syntax of a query will slow the query down.
Here's the quote from SixSigns:
First off, one should never use the "*" for getting back all the columns of the database. This is slowing the database server down and is a poor man coding style. In todays tool environment there is no excuse for using the "*" anymore (I am talking for production use here).
Now I believe this is actually a pretty subtle myth. Like most myths I think there's an element of truth to it. The issue is that the phrasing of the statement is misleading. The way it was described on the SixSigns blog as in many cases would lead you to believe that the wild card is slower than a comparable query in which you explicitly declared all the columns in the table. In other words, the myth is not that selecting a smaller number of columns is faster, but that selecting the same columns explicitly is faster than selecting them with the wild card.
So to test this I created a new table in a SQL Server database I had handy. I chose to make it a particularly wide table (at least for me) with twenty columns. For odd numbered columns I created an integer column and inserted a random number. For even numbered columns I created a 35-character string and inserted a UUID. I then inserted about 8 thousand records into my new table and I executed a few select statements against the data. These select statements had the following syntax:
from TestTable
[where col01 < 5000]
I alternated the list of columns in the select clause using either the wild card (*), the list of all 20 columns or a list of just the first 10 columns. I then executed each query 100 times and displayed the average time for each query. I added the where clause in a second round of tests.
The end result is conclusive. A query using the wildcard (*) executes at exactly the same speed as a comparable query (in my test this was about 200ms). Even after adding indexes, the only thing that sped up this query was reducing the number of columns or rows returned. Returning only the first 10 columns resulted in a query that executed in exactly half the time (100ms). As you might expect, the overhead is similar for each column. (Note that I didn't test CLOBs and the overhead for those would probably be much more significant.) Reducing the number of rows returned scaled the query down in a similarly linear fashion, cutting the time in half again when returning only half the records.
So if you're fetching data from a given table and you really do need all the columns (as is often the case for the CRUD forms an ORM tool might be designed to manage), then there's no reason to avoid using the wild card (*). Most of the time there's also no specific reason to avoid using the wild card if you're returning data even from a big query if you're going to be using most of the columns in the table (i.e. listing names and phone numbers of contacts). There might be some exceptions in cases where the table has a lot of columns but just one or two of those columns happens to be a CLOB that you're not going to use. However especially for tables with small numbers of columns, such as look-up tables (or for example category tables), there's no inherent advantage to explicitly declared columns. They aren't any faster, meaning that select * is viable even in a production environment in many cases.
This myth is BUSTED.
The case in which you want to make sure that you do explicitly declare your columns is when you only want a small amount of the data in a given table. If you only need a name and a date from a table with lots of columns or if there's a CLOB you won't be using, then yes absolutely, explicitly declare the columns you want.
EDIT: Apparently I declared this busted a bit too early. See the continuation here.
p.s. For those who might be interested, I'm adding the CF script I used to test this myth as an enclosure on this entry. You should see a download link just below this text.

If you have 40 columns in your table and you only need 5, then doing a select * will cause your query to return 35 columns of data that you don't need and may well slow down the response time, especially if some of the columns you are returning are huge text or blob columns.
but in your wording you are right, there is no performance difference between doing a select * or a qualified select if both return all the columns in the table.
I was going to say the same. However, I also think it's not nearly said often enough. I've received the same advice (don't use select *) in a situation where I explicitly said I was needing all the columns (in an ORM).
Another note on performance that hasn't been mentioned is covered queries. The more columns being returned means a greater chance of bookmarked lookups. You probably won't notice the difference south of a few hundred thousand records, but it creeps up on you.
I think it is good that you have spelled out when there WILL be a difference in performance, but I also think you are reaching to say that it is a "Busted Myth". It is ALWAYS good practice to avoid using SELECT * for all of the reasons you have mentioned plus more, including the important case where SELECT * should never be used in Views because they will have to be recompiled if the columns change (or even if the column ORDER is changed).
If the only reasons you have to use SELECT * is that it "does not really hurt performance when returning all/most columns" and "it is quicker to code", I personally feel that those are not good enough reasons given the other risks associated with doing so.
I'm not saying it isn't so. I'm just wondering why they don't, or perhaps they might and we don't know it.
But the reason it can not pre-build/cache the execution plan is because by using SELECT *, you are actually saying "give me all of the columns as they are defined right now". That tells the DB engine that the columns may change, so don't cache the plan. By explicitly declaring the columns, however, you are telling the DB that "all I care about are these columns, even if someone adds one at a later time". So it can confidently create and cache your execution plan. This is because just like Query Caching in CF, the DB needs to know the parameters with which it should "map" the cached plan. These include the Returned, Join, and WHERE clause columns. Changing any of those items will usually cause a new Execution Plan.
I hope my explanation makes sense...
Furthermore, if you want to argue that in tables with smaller row sets, any performance gain is negligible, I will concede that point as well. But how do you know in advance how large that table may grow? Why not always build for scalability? And for the record, I said Millions of rows, not *Billions* as you stated in your reply. Please do not increase my examples by a factor of 1,000. <grin> I have seen negative effects of SELECT * in as little as 100K rows. Using SELECT * also changes the way ODBC/JDBC drivers create the actual database calls, which can negatively impact performance as well.
I also think you missed my point about how it can greatly negatively affect busy SQL servers. If a DB Server is very active, and the main bottleneck is the CPU, then SELECT * is VERY detrimental to Query Performance. This is because the Execution Plan must be created on-the-fly for every single call of that query. That eats up CPU time, and it will also cause Lightweight SQL locks that can prevent other DB actions from happening until the locks are released.
And as I tried to illustrate in my post, there are things like MS-SQL Views that will flat-out BREAK if you use SELECT * and make ANY changes to the table columns or their "order" in the table design. Making this problem even worse, it does not even throw an error when this happens! MS-SQL will simply return 0 ROWS, with no error code, every time the View is called until it is manually recompiled by the user.
So seeing that there are risks, both to performance (even if it is only in the “edge cases”) and more concerning, to code breakage, why take that risk? What does SELECT * save you? How does it “save maintenance costs” as you said? If you are using an ORM, than the ORM handles that “maintenance” for you (and I have already conceded that this is one time that SELECT * makes sense). Any other time you add or remove a column to a table, you are going to need to add/remove it in your code/presentation as well. How much time are you saving by not having to add/remove it in your CFQUERY tag(s)?
I am genuinely open to hearing about the possible time/maintenance savings using SELECT * may provide, but right now I am just not seeing it.