How to speed up bulk insert to MS SQL Server using pyodbc

This article explains how to speed up a bulk insert into MS SQL Server using pyodbc; the question and answer below may be a useful reference for anyone facing the same problem.

Problem Description

Below is my code that I'd like some help with. I have to run it over 1,300,000 rows, which means it takes up to 40 minutes to insert ~300,000 rows.

I figure bulk insert is the route to go to speed it up? Or is it because I'm iterating over the rows via the for data in reader: portion?

#modules needed for the file handling and CSV parsing below
import csv
import os

#Opens the prepped csv file
with open(os.path.join(newpath, outfile), 'r') as f:
    #hooks csv reader to file
    reader = csv.reader(f)
    #pulls out the columns (which match the SQL table)
    columns = next(reader)
    #trims any extra spaces
    columns = [x.strip(' ') for x in columns]
    #starts SQL statement
    query = 'bulk insert into SpikeData123({0}) values ({1})'
    #puts column names in SQL query 'query'
    query = query.format(','.join(columns), ','.join('?' * len(columns)))

    print('Query is: %s' % query)
    #starts cursor from cnxn (which works)
    cursor = cnxn.cursor()
    #uploads everything by row
    for data in reader:
        cursor.execute(query, data)
        cursor.commit()

I am dynamically picking my column headers on purpose (as I would like to create the most pythonic code possible).

SpikeData123 is the table name.
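
Aside (not from the original question or answer): before reaching for BULK INSERT, one common tweak to a loop like the one above is to stop committing after every single row and instead send the parameters in batches with cursor.executemany, committing once per batch. A minimal sketch of that idea, assuming cnxn, reader and columns are set up exactly as in the question, and using a plain parameterized INSERT; the insert_sql and BATCH_SIZE names are placeholders:

#plain parameterized INSERT built from the CSV header, as in the question
insert_sql = 'insert into SpikeData123({0}) values ({1})'.format(
    ','.join(columns), ','.join('?' * len(columns)))

BATCH_SIZE = 1000  #arbitrary batch size; tune for your data
cursor = cnxn.cursor()
batch = []
for data in reader:
    batch.append(data)
    if len(batch) >= BATCH_SIZE:
        cursor.executemany(insert_sql, batch)  #one call per batch instead of per row
        cnxn.commit()                          #one commit per batch instead of per row
        batch = []
if batch:                                      #flush any remaining rows
    cursor.executemany(insert_sql, batch)
    cnxn.commit()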

Recommended Answer

Update - July 2021: bcpyaz is a wrapper for Microsoft's bcp utility.

Update - April 2019: As noted in the comment from @SimonLang, BULK INSERT under SQL Server 2017 and later apparently does support text qualifiers in CSV files (ref: here).

BULK INSERT will almost certainly be much faster than reading the source file row-by-row and doing a regular INSERT for each row. However, both BULK INSERT and BCP have a significant limitation regarding CSV files in that they cannot handle text qualifiers (ref: here). That is, if your CSV file does not have qualified text strings in it ...

1,Gord Thompson,2015-04-15
2,Bob Loblaw,2015-04-07

... then you can BULK INSERT it, but if it contains text qualifiers (because some text values contain commas) ...

1,"Thompson, Gord",2015-04-15
2,"Loblaw, Bob",2015-04-07

... then BULK INSERT cannot handle it. Still, it might be faster overall to pre-process such a CSV file into a pipe-delimited file ...

1|Thompson, Gord|2015-04-15
2|Loblaw, Bob|2015-04-07

... or a tab-delimited file (where → represents the tab character) ...

1→Thompson, Gord→2015-04-15
2→Loblaw, Bob→2015-04-07

... and then BULK INSERT that file. For the latter (tab-delimited) file the BULK INSERT code would look something like this:

import pypyodbc
conn_str = "DSN=myDb_SQLEXPRESS;"
cnxn = pypyodbc.connect(conn_str)
crsr = cnxn.cursor()
#raw string so '\b' in the file path and the '\t' / '\n' terminators
#reach SQL Server as-is instead of being treated as Python escapes
sql = r"""
BULK INSERT myDb.dbo.SpikeData123
FROM 'C:\__tmp\biTest.txt' WITH (
    FIELDTERMINATOR='\t',
    ROWTERMINATOR='\n'
    );
"""
crsr.execute(sql)
cnxn.commit()
crsr.close()
cnxn.close()
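
The answer doesn't show the pre-processing step itself, but rewriting a quoted CSV as a tab-delimited file is straightforward with Python's csv module. A minimal sketch, assuming the data contains no literal tab characters; the source file name is a placeholder, and the output path matches the one used in the BULK INSERT above:

import csv

src_path = 'SpikeData123.csv'        #placeholder input file
dst_path = r'C:\__tmp\biTest.txt'    #output file read by the BULK INSERT above

with open(src_path, 'r', newline='') as src, open(dst_path, 'w', newline='') as dst:
    reader = csv.reader(src)  #correctly parses quoted values such as "Thompson, Gord"
    #write tab-delimited output with no text qualifiers
    writer = csv.writer(dst, delimiter='\t', quoting=csv.QUOTE_NONE, escapechar='\\')
    next(reader)              #skip the header row so the file holds data rows only
    for row in reader:
        writer.writerow(row)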

Note: As mentioned in a comment, executing a BULK INSERT statement is only applicable if the SQL Server instance can directly read the source file. For cases where the source file is on a remote client, see this answer.
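
For completeness: one widely used pyodbc-side alternative when the server cannot read the file (not necessarily what the linked answer describes) is to turn on cursor.fast_executemany, available in pyodbc 4.0.19 and later, which sends parameter batches to the server far more efficiently than a plain row-by-row loop. Note this is a pyodbc feature rather than pypyodbc. A minimal sketch, reusing the DSN and table name from the examples above; the CSV file name is a placeholder:

import csv
import pyodbc

cnxn = pyodbc.connect("DSN=myDb_SQLEXPRESS;")
crsr = cnxn.cursor()
crsr.fast_executemany = True  #requires pyodbc 4.0.19+; works best with Microsoft's ODBC Driver for SQL Server

with open('SpikeData123.csv', 'r', newline='') as f:  #placeholder file name
    reader = csv.reader(f)
    columns = [c.strip() for c in next(reader)]
    sql = 'insert into SpikeData123({0}) values ({1})'.format(
        ','.join(columns), ','.join('?' * len(columns)))
    crsr.executemany(sql, list(reader))  #all rows sent as one parameter array

cnxn.commit()
crsr.close()
cnxn.close()

For very large files it may be worth feeding executemany in chunks rather than with list(reader), to limit memory use.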
