Table with 80 million records and adding an index takes more than 18 hours (or forever)! Now what?

Question

A short recap of what happened. I am working with 71 million records (not much compared to billions of records processed by others). On a different thread, someone suggested that the current setup of my cluster is not suitable for my needs. My table structure is:

CREATE TABLE `IPAddresses` (
  `id` int(11) unsigned NOT NULL auto_increment,
  `ipaddress` bigint(20) unsigned default NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM;

And I added the 71 million records and then did a:

ALTER TABLE IPAddresses ADD INDEX(ipaddress);

It's been 14 hours and the operation is still not complete. Upon Googling, I found that there is a well-known approach to solving this problem: partitioning. I understand that I now need to partition my table based on the ipaddress, but can I do this without recreating the entire table? I mean, through an ALTER statement? If yes, there was one requirement saying that the column to be partitioned on should be part of the primary key. I will be using the id of this ipaddress in constructing a different table, so ipaddress is not my primary key. How do I partition my table given this scenario?

Answer

OK, it turns out this problem was more than just a simple create-a-table, index-it, and forget-about-it problem :) Here's what I did, in case someone else faces the same problem (I have used IP addresses as the example, but it works for other data types too):

Problem: Your table has millions of entries and you need to add an index to it very fast.

Use case: Consider storing millions of IP addresses in a lookup table. Adding the IP addresses should not be a big problem, but creating an index on them takes more than 14 hours.
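
For context on the BIGINT column: IPv4 addresses are commonly stored as plain integers, and MySQL's built-in INET_ATON() and INET_NTOA() convert in both directions. A minimal illustration against the question's original table (the sample addresses here are made up):

-- Store dotted-quad IPv4 addresses as integers.
INSERT INTO IPAddresses (ipaddress)
VALUES (INET_ATON('192.168.0.1')), (INET_ATON('10.0.0.42'));

-- Convert back to dotted-quad form when looking a record up.
SELECT id, INET_NTOA(ipaddress) AS ip
FROM IPAddresses
WHERE ipaddress = INET_ATON('192.168.0.1');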

Solution: Partition the table using MySQL's partitioning support.

Case #1: When the table you want has not been created yet

CREATE TABLE IPADDRESSES(
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  ipaddress BIGINT UNSIGNED,
  PRIMARY KEY(id, ipaddress)
) ENGINE=MYISAM
PARTITION BY HASH(ipaddress)
PARTITIONS 20;
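
Note the composite PRIMARY KEY(id, ipaddress): MySQL requires every unique key on a partitioned table, the primary key included, to contain all columns used in the partitioning expression, which is exactly the requirement mentioned in the question. Once the table is loaded, one way to check that the hash is spreading rows evenly is to query INFORMATION_SCHEMA (available from MySQL 5.1 on; the table name matches the example above):

-- Row counts per hash partition.
-- TABLE_ROWS is exact for MyISAM, an estimate for InnoDB.
SELECT PARTITION_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'IPADDRESSES';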

Case #2: When the table you want is already created. There seems to be a way to use ALTER TABLE to do this, but I have not yet figured out a proper solution for it (a sketch of the documented ALTER TABLE route appears at the end of this answer). Instead, there is a slightly inefficient solution:

CREATE TABLE IPADDRESSES_TEMP(
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  ipaddress BIGINT UNSIGNED,
  PRIMARY KEY(id)
) ENGINE=MYISAM;

Insert your IP addresses into this table. And then create the actual table with partitions:

CREATE TABLE IPADDRESSES(
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  ipaddress BIGINT UNSIGNED,
  PRIMARY KEY(id, ipaddress)
) ENGINE=MYISAM
PARTITION BY HASH(ipaddress)
PARTITIONS 20;

And finally:

INSERT INTO IPADDRESSES(ipaddress) SELECT ipaddress FROM IPADDRESSES_TEMP;
DROP TABLE IPADDRESSES_TEMP;
ALTER TABLE IPADDRESSES ADD INDEX(ipaddress);
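
To confirm that the new index and partition pruning are actually used, a point lookup can be inspected with EXPLAIN PARTITIONS (the MySQL 5.1 to 5.6 syntax; from 5.7 a plain EXPLAIN reports partitions by default). The sample address is made up:

-- The "partitions" column should list a single partition,
-- and "key" should show the ipaddress index.
EXPLAIN PARTITIONS
SELECT id FROM IPADDRESSES
WHERE ipaddress = INET_ATON('192.168.0.1');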

And there you go... indexing on the new table took me about 2 hours on a 3.2GHz machine with 1GB RAM :) Hope this helps.
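
A closing note on the ALTER TABLE route left open under Case #2: the MySQL manual does document repartitioning an existing table in place (MySQL 5.1+). It still rewrites every row, so it is not necessarily faster than the copy above, and the primary key has to be widened first to satisfy the unique-key rule. A sketch against the question's original table, not benchmarked at this scale:

-- The partitioning column must appear in every unique key,
-- so fold ipaddress into the primary key before partitioning.
-- (This forces ipaddress to NOT NULL, so NULL rows must be cleaned up first.)
ALTER TABLE IPAddresses DROP PRIMARY KEY, ADD PRIMARY KEY (id, ipaddress);

-- Repartition in place; this still rebuilds the table.
ALTER TABLE IPAddresses PARTITION BY HASH (ipaddress) PARTITIONS 20;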
