SQLite insert speed slows as number of records increases due to an index
Original question

Background

It is well-known that SQLite needs to be fine tuned to achieve insert speeds on the order of 50k inserts/s. There are many questions here regarding slow insert speeds and a wealth of advice and benchmarks.

There are also claims that SQLite can handle large amounts of data, with reports of 50+ GB not causing any problems with the right settings.

I have followed the advice here and elsewhere to achieve these speeds and I'm happy with 35k-45k inserts/s. The problem I have is that all of the benchmarks only demonstrate fast insert speeds with < 1m records. What I am seeing is that insert speed seems to be inversely proportional to table size.

Issue

My use case requires storing 500m to 1b tuples ([x_id, y_id, z_id]) over a few years (1m rows / day) in a link table. The values are all integer IDs between 1 and 2,000,000. There is a single index on z_id.
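For concreteness, the table described above could look roughly like this. It is only a sketch using Python's `sqlite3` module (the project itself is in Perl); the table name `link` and the index name `link_z_idx` are made up for illustration and are not from my actual schema.

```python
import sqlite3


def create_link_table(conn: sqlite3.Connection) -> None:
    """Create a three-column link table with a single index on z_id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS link ("
        "x_id INTEGER NOT NULL, "
        "y_id INTEGER NOT NULL, "
        "z_id INTEGER NOT NULL)"
    )
    # The only index: every insert also has to update this B-tree.
    conn.execute("CREATE INDEX IF NOT EXISTS link_z_idx ON link (z_id)")


# Small in-memory demo.
conn = sqlite3.connect(":memory:")
create_link_table(conn)
conn.executemany(
    "INSERT INTO link (x_id, y_id, z_id) VALUES (?, ?, ?)",
    [(1, 2, 3), (4, 5, 3), (6, 7, 8)],
)
conn.commit()
rows = conn.execute(
    "SELECT x_id, y_id FROM link WHERE z_id = ? ORDER BY x_id", (3,)
).fetchall()
```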

Performance is great for the first 10m rows, ~35k inserts/s, but by the time the table has ~20m rows, performance starts to suffer. I'm now seeing about 100 inserts/s.

The size of the table is not particularly large. With 20m rows, the size on disk is around 500MB.

The project is written in Perl.

Question

Is this the reality of large tables in SQLite or are there any secrets to maintaining high insert rates for tables with > 10m rows?

Known workarounds which I'd like to avoid if possible

  • Drop the index, add the records, and re-index: This is fine as a workaround, but doesn't work when the DB still needs to be usable during updates. It won't work to make the database completely inaccessible for x minutes / day.
  • Break the table into smaller subtables / files: This will work in the short term and I have already experimented with it. The problem is that I need to be able to retrieve data from the entire history when querying which means that eventually I'll hit the 62 table attachment limit. Attaching, collecting results in a temp table, and detaching hundreds of times per request seems to be a lot of work and overhead, but I'll try it if there are no other alternatives.
  • Set SQLITE_FCNTL_CHUNK_SIZE: I don't know C (?!), so I'd prefer to not learn it just to get this done. I can't see any way to set this parameter using Perl though.
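The first workaround in the list (drop the index, add the records, re-index) might be sketched as follows. This is an illustration in Python rather than Perl, and the names `link` and `link_z_idx` are hypothetical.

```python
import sqlite3


def bulk_load_without_index(conn, rows):
    """Drop the z_id index, insert all rows in one transaction, rebuild the index."""
    conn.execute("DROP INDEX IF EXISTS link_z_idx")
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO link (x_id, y_id, z_id) VALUES (?, ?, ?)", rows
        )
    # Rebuilding scans the whole table; lookups by z_id have no index
    # to use until this statement has finished.
    conn.execute("CREATE INDEX link_z_idx ON link (z_id)")
    conn.commit()


# Small in-memory demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE link (x_id INTEGER, y_id INTEGER, z_id INTEGER)")
conn.execute("CREATE INDEX link_z_idx ON link (z_id)")
bulk_load_without_index(conn, [(1, 2, 3), (4, 5, 6)])
row_count = conn.execute("SELECT COUNT(*) FROM link").fetchone()[0]
```

The window between the `DROP INDEX` and the end of the `CREATE INDEX` is exactly the period in which the database is not usable for indexed queries.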

UPDATE

Following Tim's suggestion that an index was causing increasingly slow insert times despite SQLite's claims that it is capable of handling large data sets, I performed a benchmark comparison with the following settings:

  • inserted rows: 14 million
  • commit batch size: 50,000 records
  • cache_size pragma: 10,000
  • page_size pragma: 4,096
  • temp_store pragma: memory
  • journal_mode pragma: delete
  • synchronous pragma: off
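For anyone wanting to reproduce the settings listed above, they could be applied along these lines. This is a hedged sketch in Python's `sqlite3` module, not my Perl code; the helper names `open_tuned` and `insert_in_batches` and the table name `link` are invented for the example.

```python
import sqlite3
from itertools import islice


def open_tuned(path):
    """Open a database and apply the pragma values from the benchmark."""
    conn = sqlite3.connect(path)
    # page_size only takes effect on a new, empty database (or after VACUUM).
    conn.execute("PRAGMA page_size = 4096")
    conn.execute("PRAGMA cache_size = 10000")  # measured in pages
    conn.execute("PRAGMA temp_store = MEMORY")
    conn.execute("PRAGMA journal_mode = DELETE").fetchall()
    conn.execute("PRAGMA synchronous = OFF")  # fast, but unsafe on power loss
    return conn


def insert_in_batches(conn, rows, batch_size=50_000):
    """Insert rows, committing once per batch of batch_size records."""
    total = 0
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        with conn:  # one transaction (and one commit) per batch
            conn.executemany(
                "INSERT INTO link (x_id, y_id, z_id) VALUES (?, ?, ?)", batch
            )
        total += len(batch)
    return total


# Small in-memory demo with a tiny batch size.
conn = open_tuned(":memory:")
conn.execute("CREATE TABLE link (x_id INTEGER, y_id INTEGER, z_id INTEGER)")
inserted = insert_in_batches(conn, [(i, i, i) for i in range(5)], batch_size=2)
```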

In my project, as in the benchmark results below, a file-based temporary table is created and SQLite's built-in support for importing CSV data is used. The temporary table is then attached to the receiving database and sets of 50,000 rows are inserted with an insert-select statement. Therefore, the insert times do not reflect file to database insert times, but rather table to table insert speed. Taking the CSV import time into account would reduce the speeds by 25-50% (a very rough estimate, it doesn't take long to import the CSV data).
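The attach and insert-select step could be sketched roughly as below. Assumptions: the staging table is called `import_rows`, the target table is called `link`, and batches are cut by `rowid` range, which is only one possible way to take sets of 50,000 rows. The CSV import itself is not shown, since it is done with SQLite's own import tooling; the demo just fills the staging table directly.

```python
import os
import sqlite3
import tempfile


def copy_from_staging(conn, staging_path, batch_size=50_000):
    """Attach a staging database and copy its rows into main.link in batches."""
    conn.execute("ATTACH DATABASE ? AS staging", (staging_path,))
    copied = 0
    try:
        max_rowid = conn.execute(
            "SELECT COALESCE(MAX(rowid), 0) FROM staging.import_rows"
        ).fetchall()[0][0]
        for start in range(0, max_rowid, batch_size):
            with conn:  # one transaction per batch
                cur = conn.execute(
                    "INSERT INTO main.link (x_id, y_id, z_id) "
                    "SELECT x_id, y_id, z_id FROM staging.import_rows "
                    "WHERE rowid > ? AND rowid <= ?",
                    (start, start + batch_size),
                )
                copied += cur.rowcount
    finally:
        conn.execute("DETACH DATABASE staging")
    return copied


# Demo: a file-based staging database copied into an in-memory target.
with tempfile.TemporaryDirectory() as tmp:
    staging_path = os.path.join(tmp, "staging.sqlite")
    staging = sqlite3.connect(staging_path)
    staging.execute(
        "CREATE TABLE import_rows (x_id INTEGER, y_id INTEGER, z_id INTEGER)"
    )
    staging.executemany(
        "INSERT INTO import_rows VALUES (?, ?, ?)",
        [(i, i + 1, i + 2) for i in range(5)],
    )
    staging.commit()
    staging.close()

    main = sqlite3.connect(":memory:")
    main.execute("CREATE TABLE link (x_id INTEGER, y_id INTEGER, z_id INTEGER)")
    main.execute("CREATE INDEX link_z_idx ON link (z_id)")
    copied = copy_from_staging(main, staging_path, batch_size=2)
    total = main.execute("SELECT COUNT(*) FROM link").fetchone()[0]
```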

Clearly having an index causes the slowdown in insert speed as table size increases.

It's quite clear from the data above that Tim's answer is the correct one, rather than the assertions that SQLite just can't handle it. Clearly it can handle large datasets if indexing that dataset is not part of your use case. I have been using SQLite for exactly that for a while now, as a backend for a logging system that does not need to be indexed, so I was quite surprised at the slowdown I experienced.

Conclusion

If anyone finds themselves wanting to store a large amount of data using SQLite and have it indexed, using shards may be the answer. I eventually settled on using the first three characters of an MD5 hash of a unique column in z to determine assignment to one of 4,096 databases. Since my use case is primarily archival in nature, the schema will not change and queries will never require shard walking. There is a limit to database size since extremely old data will be reduced and eventually discarded, so this combination of sharding, pragma settings, and even some denormalisation gives me a nice balance that will, based on the benchmarking above, maintain an insert speed of at least 10k inserts / second.
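The shard assignment can be illustrated with a few lines of Python. The function names, the `shards` directory, and the `.sqlite` file naming are assumptions made for the example; the only part taken from the description above is the use of the first three hex characters of an MD5 hash, which gives 16^3 = 4,096 possible shards.

```python
import hashlib


def shard_for(key: str) -> str:
    """Return the first three hex characters of the MD5 hash of key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return digest[:3]  # "000" .. "fff": 16**3 = 4096 possible shards


def shard_path(key: str, root: str = "shards") -> str:
    """Map a key to the (hypothetical) database file that holds its rows."""
    return f"{root}/{shard_for(key)}.sqlite"


example_shard = shard_for("abc")
example_path = shard_path("abc")
```

Because the shard is derived from the key alone, a lookup for one key only ever opens one database file, which is why queries never need to walk the shards.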

Answer

If your requirement is to find a particular z_id and the x_ids and y_ids linked to it (as distinct from quickly selecting a range of z_ids) you could look into a non-indexed hash-table nested-relational db that would allow you to instantly find your way to a particular z_id in order to get its y_ids and x_ids -- without the indexing overhead and the concomitant degraded performance during inserts as the index grows. In order to avoid clumping (aka bucket collisions), choose a key hashing algorithm that puts greatest weight on the digits of z_id with greatest variation (right-weighted).
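To make the idea a little more concrete, here is a toy in-memory sketch in Python. It is not the implementation of any particular database product: the class name, the bucket count, and the use of `z_id % n_buckets` as the "right-weighted" hash (it depends only on the low-order digits of `z_id`) are all assumptions chosen for illustration.

```python
N_BUCKETS = 1000  # with a power of ten, the bucket is the last three digits


def bucket_for(z_id: int, n_buckets: int = N_BUCKETS) -> int:
    """Right-weighted hash: only the low-order digits of z_id matter."""
    return z_id % n_buckets


class HashedLinkTable:
    """Toy hash-addressed table: no index, each z_id holds nested tuples."""

    def __init__(self, n_buckets: int = N_BUCKETS):
        self.n_buckets = n_buckets
        # Each bucket maps z_id -> list of (x_id, y_id) tuples (the nested part).
        self.buckets = [dict() for _ in range(n_buckets)]

    def insert(self, x_id: int, y_id: int, z_id: int) -> None:
        bucket = self.buckets[bucket_for(z_id, self.n_buckets)]
        bucket.setdefault(z_id, []).append((x_id, y_id))

    def lookup(self, z_id: int):
        """Go straight to the bucket computed from z_id; no index traversal."""
        bucket = self.buckets[bucket_for(z_id, self.n_buckets)]
        return list(bucket.get(z_id, []))


table = HashedLinkTable()
table.insert(10, 20, 1234567)
table.insert(11, 21, 1234567)
table.insert(12, 22, 7654321)
links = table.lookup(1234567)
```

Finding all rows for a range of `z_id` values would mean visiting many unrelated buckets, which is the trade-off mentioned above.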

P.S. A database that uses a b-tree may at first appear faster than a db that uses linear hashing, say, but the insert performance will remain level with the linear hash as performance on the b-tree begins to degrade.

P.P.S. To answer @kawing-chiu's question: the core feature relevant here is that such a database relies on so-called "sparse" tables in which the physical location of a record is determined by a hashing algorithm which takes the record key as input. This approach permits a seek directly to the record's location in the table without the intermediary of an index. As there is no need to traverse indexes or to re-balance indexes, insert-times remain constant as the table becomes more densely populated. With a b-tree, by contrast, insert times degrade as the index tree grows. OLTP applications with large numbers of concurrent inserts can benefit from such a sparse-table approach. The records are scattered throughout the table. The downside of records being scattered across the "tundra" of the sparse table is that gathering large sets of records which have a value in common, such as a postal code, can be slower. The hashed sparse-table approach is optimized to insert and retrieve individual records, and to retrieve networks of related records, not large sets of records that have some field value in common.

A nested relational database is one that permits tuples within a column of a row.
