Playing with MySQL’s index merge
So, I mentioned before that I found out about index_merge at the MySQL Conference. I was wondering why I had not heard more about it since it came out in 5.0.3. When talking with some MySQL people about it, I received mixed results. So, I decided to kind of run my own tests on some data and see what I could figure out.
I apologize for WordPress’ bad output. =(
The Data
I created a table with 5 million rows. Early tests with MySQL’s Harrison Fisk (HarrisonF) over my shoulder with small data sets showed MySQL would optimize out the indexes in favor of table scans. I wanted to avoid that. This is my table schema:
CREATE TABLE `test2` (
`id1` int(10) unsigned NOT NULL default ‘0′,
`id2` int(10) unsigned NOT NULL default ‘0′,
`id3` int(10) unsigned NOT NULL default ‘0′,
`dt` datetime NOT NULL default ‘0000-00-00 00:00:00′,
`somevar` varchar(255) NOT NULL default ”,
KEY `id1` (`id1`),
KEY `id2` (`id2`)
) ENGINE=MyISAM
The field id1 was filled with random vaules between 1 and 5000. I filled id2 with random values between 1 and 100, except that about half the data has the value 999 in it. This was to emulate the issue we were seeing on the smaller table. We found that if a value was in more than n% of the rows, the optimizer would skip the index. I wanted to test that on larger data sets. id3 was filled with random values between 1 and 1000000. dt was a random date/time between 1999 and 2008. and somevar was a random string chars.
Intersect Merges
mysql> explain select count(*) from test2 where id2=99 and id1=4795;
+—-+————-+——-+————-+—————+———+———+——+——+—————————————————-+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+—-+————-+——-+————-+—————+———+———+——+——+—————————————————-+
| 1 | SIMPLE | test2 | index_merge | id1,id2 | id1,id2 | 4,4 | NULL | 3 | Using intersect(id1,id2); Using where; Using index |
+—-+————-+——-+————-+—————+———+———+——+——+—————————————————-+
This is the most basic of example. MySQL uses the two indexes, finds where they intersect and merges the data together. This query is quite fast, although a key on the two together would be faster. If you have this showing up a lot, you probably need to combine the two keys into one. I should also note that in this example, only the keys are needed, no data from the tables. This is important.
mysql> explain select sql_no_cache somevar from test2 where id2=99 and id1=4795;
+—-+————-+——-+——+—————+——+———+——-+——+————-+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+—-+————-+——-+——+—————+——+———+——-+——+————-+
| 1 | SIMPLE | test2 | ref | id1,id2 | id1 | 4 | const | 930 | Using where |
+—-+————-+——-+——+—————+——+———+——-+——+————-+
As you see, as soon as we ask for data that is not in the indexes, our intersect is dropped in favor of using the key with the least values and simply scanning on those to match the rest of the where clause. This was the case pretty much every time I tried it. I was never able to use an index_merge with intersect when requesting data not available in the key.
Union Merges
explain select sql_no_cache somevar from test2 where id2=99 or id1=4795;
+—-+————-+——-+————-+—————+———+———+——+——-+———————————–+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+—-+————-+——-+————-+—————+———+———+——+——-+———————————–+
| 1 | SIMPLE | test2 | index_merge | id1,id2 | id2,id1 | 4,4 | NULL | 27219 | Using union(id2,id1); Using where |
+—-+————-+——-+————-+—————+———+———+——+——-+———————————–+
mysql> select sql_no_cache somevar from test2 where id2=99 or id1=4795;
26237 rows in set (0.20 sec)
This merge type takes to keys involved in an OR and then merges the data much like a UNION statement would. As you can see, in this case, it did use the index even though we requested `somevar` that is not in the index.
To show the alternative to this, I selected using id3 instead of id1. id3 has no index.
mysql> explain select sql_no_cache somevar from test2 where id2=99 or id3=266591;
+—-+————-+——-+——+—————+——+———+——+———+————-+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+—-+————-+——-+——+—————+——+———+——+———+————-+
| 1 | SIMPLE | test2 | ALL | id2 | NULL | NULL | NULL | 5000000 | Using where |
+—-+————-+——-+——+—————+——+———+——+———+————-+
mysql> select sql_no_cache somevar from test2 where id2=99 or id3=266591;
25252 rows in set (26.01 sec)
As you can see, this does a table scan even though there is a key on id2. It does you know good.
Sort Union Merge
mysql> explain select sql_no_cache id1, id2 from test2 where id2=99 or id1 between 4999 and 5000;
+—-+————-+——-+————-+—————+———+———+——+——-+—————————————-+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+—-+————-+——-+————-+—————+———+———+——+——-+—————————————-+
| 1 | SIMPLE | test2 | index_merge | id1,id2 | id2,id1 | 4,4 | NULL | 44571 | Using sort_union(id2,id1); Using where |
+—-+————-+——-+————-+—————+———+———+——+——-+—————————————-+
mysql> select sql_no_cache somevar from test2 where id2=99 or id1 between 4999 and 5000;
27295 rows in set (0.19 sec)
This behaves much like the union merge. However, because one index is using a range, MySQL must first sort one index and then merge the two. Again, if I switch this to an AND instead of an OR, index_merge is not used in favor of scanning the id2 indexed data for matches to the rest of the where clause.
Conclusion
Hmm, after all this, I see why this was not a big announcement. It can only make bad SQL and tables better. Tables and queries that are already optimized using composite indexes will see no benefit from this. At best this will help me with some one off queries or reports that are only run monthly where I don’t want to pollute the indexes with special cases just for those queries.
2008 MySQL Conference, part 1
It is always surprising what I learn when I go to a conference these days. Years ago, I could go to any talk and just suck it all in. Now, it is the little nuggets. The topics as a whole do more to confirm what I have already developed while running the Phorum project and building the infastructure for dealnews.com. That confirmation is still nice. You know you are not the only one that thought a particular solution was a good idea.
One of the confirmations I have had is that the big sites like Flickr, Wikipedia, Facebook and others don’t use exotic setups when it comes to their hardware and OS. During a keynote panel, they all commented that they did not do any virtualization on their servers. Most did not use SANs. Some ran older MySQL versions but some were running quite recent versions. I have kept thinking that I did not have the desire to get to fancy with that stuff and clearly I am not the only one.
One of the little nuggets that will likely change my world is index_merge in MySQL. I feel silly as this has been around since 5.0.3 but I was not aware of it. Basically MySQL will now use more than one key to resolve a where clause and possibly an order by depending on the query. This could lead to me removing several keys from tables in both Phorum and at dealnews.
There were others, but I am tired and trying to get OpenID into the Phorum trunk right now so I will have to think of more later.
Phorum turns 10
So, I am at the MySQL Conference this week with my Phorum co-developers. We got to talking last night about how old Phorum is. We knew it was about 10 years. We pulled up some old archived zip file of version 1.5 and found in the this in the comment block.
* Created 04/16/1998
Whoa! That means that yesterday was the 10th birthday of the Phorum project. I would guess that is the date I originally put the code up on my personal web site for people to download. I remember sending that email to the PHP General mailing list. I told people they could have the code if they would help debug it. Later I officially made a GPL license and then a BSD style license as I became more knowledgeable about the open source and free software world.
So, for kicks we decided to install version 1.6 on the phorum.org site. Keep in mind the release date for that was March 30, 1999. The only hurdles were a default value on an auto increment column in the .sql file, needing register_globals and adding .php3 to be parsed as PHP. That got it up and running. I had hoped to post the URL for fun, but sadly, 5 lines in were sql injection vulnerabilties. Ah, the good ol’ days.
Sadly, I don’t have my emails from 1998. I lost everything in 2001 due to either a hard drive crash or some shady deal I had with someone hosting the Phorum site at the time. I can’t remember. If anyone happens to have UseNet archives or mailing list archives of the PHP General list from April 1998, please let me know. I would love to have that old stuff.
IRC for MySQL Conference #mysqlconf
I always enjoy #oscon at the O’Reilly Open Source Convention every year. I propose we use #mysqlconf on freenode during the 2008 MySQL Conference. I am there. Come join me.
What to do in the Bay Area?
So, as I said before, I will be at the MySQL Conference next week. I am renting a car this year so I don’t have to wait on cabs or deal with them at all. So, I am mobile and being from a modern Southern US city, used to driving 30 minutes just to go to dinner. So, where should we go? Anything good in San Jose? Should I go all the way to San Francisco? I am willing to go where ever. Help me locals, you are my only hope!
2008 MySQL Conference
In just two weeks I will be heading to the 2008 MySQL Conference. I will be speaking this year. My two talks are:
MySQL Hacks and Tricks to Make Phorum Fast
04/16/2008 4:25pm PDT Room: Ballroom A
From One Server to a Cluster
04/16/2008 5:15pm PDT Room: Ballroom C
I have to pull back to back talks. *PHEW* I hope I can hold up. To make it worse, they did not put me in the same room. If I remember right though, those are really close to each other.
Of course, the Phorum team will be in the Expo Hall in the DotOrg pavilion. Just look for the big dog.
