MySQL Cluster SQL Tips

So, I mentioned in my MySQL Cluster post that I found out that cluster and joins don’t get a long too well. There are a couple of tips I have for using joins or replacing them with other techniques that I thought I would share.

Here are some things I keep in mind when using the NDB engine. These may apply to the other engines as well.

  1. Test using the NDB engine. Testing against another engine is not adequate.
  2. Avoid JOIN if you can, test often if you can’t
  3. Avoid subqueries always
  4. Select using primary keys when at all possible. Always select using a key.
  5. Be sure to include both your where and order by columns in your keys.

Try your join from several directions.

Sometimes, the way you join can affect the engine. This is more apparent with NDB than MyISAM or InnoDB. Have a look at these two queries:

select pub_articles.article_id
from pub_articles
inner join article_edition_lookup on
pub_articles.article_id=article_edition_lookup.article_id and
pub_articles.publication_id=article_edition_lookup.publication_id and
publication_date='2007-06-07'
where pub_articles.publication_id=2;

select article_id
from article_edition_lookup
inner join pub_articles using (article_id, publication_id)
where publication_id=2 and publication_date='2007-06-07';

They look similar. Same tables. Same basic criteria. In fact the explain output looks nearly identical. But, the first takes 20 seconds on cluster and the second takes .02 seconds. Now, some MySQL internals person may be able to look at this and know why, but to me, I just have to test them to see. (If anyone cares, the primary key on pub_articles is publication_id, article_id and the PK on article_edition_lookup is publication_id, publication_date, article_id).

Try using temporary tables.

If you can’t get the join to be fast no matter what, try temporary tables. I have had some success using the HEAP engine to store data and then joining the two tables together. I don’t have an example of that from any code I have currently written. I have mostly just used it at the mysql prompt to get data. Here is an example that uses the same tables from above and retrieves the same data using a temporary table.

create temporary table if not exists foo engine=heap
select article_id from article_edition_lookup where publication_id=2 and publication_date='2007-06-08';

select article_id
from pub_articles use key (primary)
inner join foo using (article_id)
where publication_id=2 order by first_publish_time desc;

Together that all happens in about .06 seconds. You need to realize that this does create a table, in memory, on the API node. So, this data will only be good for this one connection. If your temporary table has lots of rows, you may need to add a key to it when you create it. I am not sure I would use this method for a query that is going to be run a lot. Chances are there is a better solution if you have to resort to this technique.

Subqueries

I have not found a way to make subqueries fast. Even simple queries like the following are slow.

select article_id from pub_articles where publication_id=2 and article_id in (select article_id from article_edition_lookup where publication_id=2 and publication_date='2007-06-07' );

That is a primary key lookup on both of those tables. Yet the query takes 30 seconds to run. Just stay away from them altogether. For what it’s worth, the same query takes 2 seconds on the InnoDB engine. I don’t think subqueries are optimized in 5.0. This may be a known issue. I got used to not having them for so long that I never think to use them now.

Unions

I have not used UNION much with cluster. When I have it has worked well. I would use caution when using it. Be sure to test the queries before putting them in a production environment.

About these ads

8 Responses to MySQL Cluster SQL Tips

  1. Regarding the subqueries, isn’t that one of those cases where MySQL does not optimize the subquery and we have to rewrite it as a join or possibly a derived table?

    How about something like this?

    SELECT p.article_id
    FROM pub_articles p
    JOIN article_edition_lookup l ON l.publication_id=2 AND l.publication_date=’2007-06-07′
    WHERE p.publication_id=2;

    SELECT article_id
    FROM ( SELECT article_id FROM article_edition_lookup ON publication_id=2 AND publication_date=’2007-06-07′ ) AS tmp
    JOIN pub_articles USING (publication_id);

  2. “Try your join from several directions.”

    There is at least on other difference. The first statement includes the

    publication_date=’2007-06-07′

    bit as part of the join condition – the second statement puts that in the where.

    I think it is quite likely this may be causing the difference in performance, because there is a difference in evaluation order.

  3. doughboy says:

    Sure, you can rewrite it. But, if I you can rewrite it that easily, why can’t the mysql optimizer deal with it like that?

  4. doughboy says:

    Ronald:

    My point is not about optimizing joins. That join runs in .04 seconds on my InnoDB server. The point is that joins on cluster are very finicky.

  5. spetrunia says:

    Re the ” join from several directions”:

    It is surprising that this happens, the optimizer is expected to convert the two mentioned queries into the same internal representation. Could you please submit this example as a bug at bugs.mysql.com (specify bug category as Optimizer, not NDB)?

    There is room for improvement in how MySQL runs joins over NDB tables, but those issues should not result in such effects.

    Sergey Petrunia
    MySQL AB, Query optimizer team

  6. [...] stored procedures in MySQL is very noteworthy. MySQL guys should think twice before using them. MySQL Cluster SQL Tips by Brian Moon seems also to be useful for using joins on a MySQL cluster. Xaprb published a fast, [...]

  7. ak says:

    1. temporary table won’t ever help – they will always take more to execute. You can use temporary tables when you need to watch for RAM size while making a lot of intermediate actions with DB before the final commit.

    2. All operators like JOIN, DISTINCT, GROUPING, UNIONS make query run slower much slower – you need always to think if your queries containing this operator are maximum optimized.

    3. Denormalization can help in some situations.

    4. What for types of columns when doing additional operations with select query

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: