Problem description
Two tables.
emails: id (int10) | ownership (int10)
messages: emailid (int10, indexed) | message (mediumtext)
Subquery (which is terrible in MySQL):
SELECT COUNT(*) FROM messages WHERE message LIKE '%word%' AND emailid IN (SELECT id FROM emails WHERE ownership = 32)
The usage here is that I run a search on emails (obviously simplified in the sample above) that generates a list of, say, 3,000 email IDs. I then want to search messages, because I need to do a text match, but only against the messages belonging to those 3,000 emails.
The query against messages is expensive (message is not indexed), but this is fine because it would only ever be checking a few rows.
Thoughts:
i) A join. My attempts at this so far have not worked and have resulted in full table scans of the messages table (i.e. the emailid index is not used).
ii) A temporary table. This could work, I think.
iii) Cache the ids in the client and run 2 queries. This does work, but it is not elegant.
iv) A subquery. MySQL subqueries run the 2nd query each time, so this does not work. Maybe fixed in MySQL 6.
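Option ii) could be sketched roughly as follows. This is only a sketch against the simplified emails and messages tables above: the temporary table name matching_emails is hypothetical, and the int10 columns are assumed to be plain INT.

```sql
-- Hypothetical scratch table holding the ids found by the first search.
CREATE TEMPORARY TABLE matching_emails (
    id INT NOT NULL PRIMARY KEY
);

-- Step 1: the cheap search on emails.
INSERT INTO matching_emails (id)
SELECT id FROM emails WHERE ownership = 32;

-- Step 2: text match only against messages belonging to those emails.
SELECT COUNT(*)
FROM matching_emails
JOIN messages ON messages.emailid = matching_emails.id
WHERE messages.message LIKE '%word%';

DROP TEMPORARY TABLE matching_emails;
```

Whether MySQL actually drives the join from the small temporary table is still up to the optimizer, so it is worth checking the plan with EXPLAIN.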
OK, here is what I have so far. These are the actual field names (I had simplified them a bit in the question).
Query:
SELECT COUNT(*) FROM ticket LEFT JOIN ticket_subject
ON (ticket_subject.ticketid = ticket.id)
WHERE category IN (1)
AND ticket_subject.subject LIKE "%about%"
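Result:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | ticket | ref | PRIMARY,category | category | 4 | const | 28874 |
1 | SIMPLE | ticket_subject | eq_ref | PRIMARY | PRIMARY | 4 | deskpro.ticket.id | 1 | Using where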
It takes 0.41 seconds and returns a COUNT(*) of 113.
Running:
SELECT COUNT(*) FROM ticket WHERE category IN (1)
Takes 0.01 seconds and finds 33,000 results.
Running:
SELECT COUNT(*) FROM ticket_subject WHERE subject LIKE "%about%"
Takes 0.14 seconds and finds 1,300 results.
Both the ticket table and ticket_subject table have 300,000 rows.
There are indexes on ticket_subject.ticketid and ticket.category.
I realise now that using the LIKE syntax was a mistake, as it has turned into a bit of a red herring about FULLTEXT. This is not the issue. The issue is:
1) Table A: very fast query, run on an index. 0.001 seconds.
2) Table B: moderate to slow query, no index, does a full table scan. 0.1 seconds.
Both of these results are fine. The problem is that I have to JOIN them, and then the search takes 0.3 seconds. To me this makes no sense: the slow part of the combined query, on Table B, should be quicker, because we are now only searching a fraction of that table. That is, it should not be doing a full table scan, because the field being joined on is indexed.
Recommended answer
Remember to take advantage of Boolean short-circuit evaluation:
SELECT COUNT(*)
FROM messages
join emails ON emails.id = messages.emailid
WHERE ownership = 32 AND message LIKE '%word%'
This filters by ownership before it evaluates the LIKE predicate. Always put your cheaper expressions on the left.
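To check that the join really uses the emailid index rather than scanning messages, the plan can be inspected with EXPLAIN. A minimal sketch against the simplified emails and messages tables from the question; the table aliases e and m are only there for illustration:

```sql
-- Ask MySQL how it would execute the join, without running it.
EXPLAIN
SELECT COUNT(*)
FROM emails AS e
JOIN messages AS m ON m.emailid = e.id
WHERE e.ownership = 32
  AND m.message LIKE '%word%';
```

In the output, the row for messages should show type ref (or eq_ref) with the emailid index in the key column. If it shows type ALL instead, the table is being scanned.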
Also, I agree with @Martin Smith and @MJB that you should consider using MySQL's FULLTEXT indexing to make this faster.
Re your comment and additional information, here's some analysis:
explain SELECT COUNT(*) FROM ticket WHERE category IN (1)\G
id: 1
select_type: SIMPLE
table: ticket
type: ref
possible_keys: category
key: category
key_len: 4
ref: const
rows: 1
Extra: Using index
The note "Using index" is a good thing to see because it means it can satisfy the query just by reading the index data structure, not even touching the data of the table. This is certain to run very fast.
explain SELECT COUNT(*) FROM ticket_subject WHERE subject LIKE '%about%'\G
id: 1
select_type: SIMPLE
table: ticket_subject
type: ALL
possible_keys: NULL <---- no possible keys
key: NULL
key_len: NULL
ref: NULL
rows: 1
Extra: Using where
This shows that there are no possible keys that can benefit the wildcard LIKE predicate. It uses the condition in the WHERE clause, but it has to evaluate it by running a table scan.
explain SELECT COUNT(*) FROM ticket LEFT JOIN ticket_subject
ON (ticket_subject.ticketid = ticket.id)
WHERE category IN (1)
AND ticket_subject.subject LIKE '%about%'\G
id: 1
select_type: SIMPLE
table: ticket
type: ref
possible_keys: PRIMARY,category
key: category
key_len: 4
ref: const
rows: 1
Extra: Using index
id: 1
select_type: SIMPLE
table: ticket_subject
type: ref
possible_keys: ticketid
key: ticketid
key_len: 4
ref: test.ticket.id
rows: 1
Extra: Using where
Likewise, accessing the ticket table is quick, but that's spoiled by the table scan incurred by the LIKE condition.
ALTER TABLE ticket_subject ENGINE=MyISAM;
CREATE FULLTEXT INDEX ticket_subject_fulltext ON ticket_subject(subject);
explain SELECT COUNT(*) FROM ticket JOIN ticket_subject
ON (ticket_subject.ticketid = ticket.id)
WHERE category IN (1) AND MATCH(ticket_subject.subject) AGAINST('about')
id: 1
select_type: SIMPLE
table: ticket
type: ref
possible_keys: PRIMARY,category
key: category
key_len: 4
ref: const
rows: 1
Extra: Using index
id: 1
select_type: SIMPLE
table: ticket_subject
type: fulltext
possible_keys: ticketid,ticket_subject_fulltext
key: ticket_subject_fulltext <---- now it uses an index
key_len: 0
ref:
rows: 1
Extra: Using where
You're never going to make LIKE perform well. See my presentation Practical Full-Text Search in MySQL.
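On MySQL 5.6 and later, InnoDB tables also support FULLTEXT indexes, so the switch to MyISAM may not be necessary. The following is a minimal sketch under that assumption, used instead of the ALTER TABLE and CREATE FULLTEXT INDEX statements above; the search word 'refund' is a hypothetical stand-in. Keep in mind that full-text search matches whole words rather than substrings, and very common words (I believe 'about' is on the default stopword lists) are ignored, so the counts will not necessarily agree with LIKE '%about%'.

```sql
-- Assumes MySQL 5.6+ with ticket_subject stored in InnoDB.
ALTER TABLE ticket_subject
    ADD FULLTEXT INDEX ticket_subject_fulltext (subject);

-- 'refund' is a hypothetical search word chosen to avoid stopwords.
SELECT COUNT(*)
FROM ticket
JOIN ticket_subject ON ticket_subject.ticketid = ticket.id
WHERE ticket.category IN (1)
  AND MATCH(ticket_subject.subject) AGAINST ('refund' IN NATURAL LANGUAGE MODE);
```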
Re your comment: Okay, I've done some experiments on a dataset of similar size (the Users and Badges tables in the Stack Overflow data dump :-). Here's what I found:
select count(*) from users
where reputation > 50000
+----------+
| count(*) |
+----------+
| 37 |
+----------+
1 row in set (0.00 sec)
That's really fast, because I have an index on the reputation column.
id: 1
select_type: SIMPLE
table: users
type: range
possible_keys: users_reputation_userid_displayname
key: users_reputation_userid_displayname
key_len: 4
ref: NULL
rows: 37
Extra: Using where; Using index
select count(*) from badges
where badges.creationdate like '%06-24%'
+----------+
| count(*) |
+----------+
| 1319 |
+----------+
1 row in set, 1 warning (0.63 sec)
That's as expected, since the table has 700k rows, and it has to do a table-scan. Now let's do the join:
select count(*) from users join badges using (userid)
where users.reputation > 50000 and badges.creationdate like '%06-24%'
+----------+
| count(*) |
+----------+
| 19 |
+----------+
1 row in set, 1 warning (0.03 sec)
That doesn't seem so bad. Here's the explain report:
id: 1
select_type: SIMPLE
table: users
type: range
possible_keys: PRIMARY,users_reputation_userid_displayname
key: users_reputation_userid_displayname
key_len: 4
ref: NULL
rows: 37
Extra: Using where; Using index
id: 1
select_type: SIMPLE
table: badges
type: ref
possible_keys: badges_userid
key: badges_userid
key_len: 8
ref: testpattern.users.UserId
rows: 1
Extra: Using where
This does seem like it's using indexes intelligently for the join, and it helps that I have a compound index including userid and reputation. Remember that MySQL can generally use only one index per table in a query, so it's important to define the right compound indexes for the queries you need to run.
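For reference, the indexes that appear in these explain reports could be created along the following lines. The column order is an assumption inferred from the index name, not something the output shows:

```sql
-- Column order guessed from the index name: reputation first so the range
-- condition can use it, with userid and displayname carried along so the
-- users side of the join can be read from the index alone ("Using index").
CREATE INDEX users_reputation_userid_displayname
    ON users (reputation, userid, displayname);

-- Single-column index used to look up the badges of each matching user.
CREATE INDEX badges_userid ON badges (userid);
```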
Re your comment: OK, I've tried this where reputation > 5000, and where reputation > 500, and where reputation > 50. These should match a much larger set of users.
select count(*) from users join badges using (userid)
where users.reputation > 5000 and badges.creationdate like '%06-24%'
+----------+
| count(*) |
+----------+
| 194 |
+----------+
1 row in set, 1 warning (0.27 sec)
select count(*) from users join badges using (userid)
where users.reputation > 500 and badges.creationdate like '%06-24%'
+----------+
| count(*) |
+----------+
| 624 |
+----------+
1 row in set, 1 warning (0.93 sec)
select count(*) from users join badges using (userid)
where users.reputation > 50 and badges.creationdate like '%06-24%'
+----------+
| count(*) |
+----------+
| 1067 |
+----------+
1 row in set, 1 warning (1.72 sec)
The explain report is the same in all cases, but if the query finds more matching rows in the Users table, then it naturally has to evaluate the LIKE predicate against a lot more matching rows in the Badges table.
It's true that there is some cost to doing a join. It's a little surprising that it's so dramatically expensive. But this can be mitigated if you use indexes.
I know you said you have a query that can't use an index, but perhaps it's time to consider creating a redundant column with some transformed version of the data in your original column, so you can index it. In the example above, I might create a column creationdate_day and populate it from DAYOFYEAR(creationdate).
Here's what I mean:
ALTER TABLE Badges ADD COLUMN creationdate_day SMALLINT;
UPDATE Badges SET creationdate_day = DAYOFYEAR(creationdate);
CREATE INDEX badge_creationdate_day ON Badges(creationdate_day);
select count(*) from users join badges using (userid)
where users.reputation > 50 and badges.creationdate_day = dayofyear('2010-06-24')
+----------+
| count(*) |
+----------+
| 1067 |
+----------+
1 row in set, 1 warning (0.01 sec) <---- not too shabby!
Here's the explain report:
id: 1
select_type: SIMPLE
table: badges
type: ref
possible_keys: badges_userid,badge_creationdate_day
key: badge_creationdate_day <---- here is our new index
key_len: 3
ref: const
rows: 1318
Extra: Using where
id: 1
select_type: SIMPLE
table: users
type: eq_ref
possible_keys: PRIMARY,users_reputation_userid_displayname
key: PRIMARY
key_len: 8
ref: testpattern.badges.UserId
rows: 1
Extra: Using where
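On MySQL 5.7 and later, a stored generated column could keep the redundant value in sync automatically, taking the place of the separate ADD COLUMN, UPDATE and CREATE INDEX steps. This is a minimal sketch, assuming such a version and that creationdate is a DATE or DATETIME column; the column and index names are reused from the statements above. One caveat: DAYOFYEAR shifts by one after February in leap years, so a given day number is not the same calendar date in every year.

```sql
-- Assumes MySQL 5.7+; MySQL maintains the column whenever creationdate changes.
ALTER TABLE Badges
    ADD COLUMN creationdate_day SMALLINT
        GENERATED ALWAYS AS (DAYOFYEAR(creationdate)) STORED,
    ADD INDEX badge_creationdate_day (creationdate_day);
```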