Problem description
I am trying to write a multithreaded program in Python to speed up the copying of fewer than 1000 .csv files. The multithreaded code runs even slower than the sequential approach. I timed the code with profile.py. I am sure I must be doing something wrong, but I'm not sure what.
Environment:
- Quad-core CPU.
- 2 hard drives: one holds the source files, the other is the destination.
- 1000 csv files, ranging in size from a few KB to 10 MB.
Approach:
I put all the file paths in a Queue and create 4-8 worker threads that pull file paths from the queue and copy the designated file. In no case is the multithreaded code faster:
- Sequential copy takes 150-160 seconds
- Threaded copy takes over 230 seconds
I assume this is an I/O-bound task, so multithreading should speed up the operation.
Code:
import Queue
import threading
import cStringIO
import os
import shutil
import timeit  # time the code exec with gc disable
import glob  # file wildcards list, glob.glob('*.py')
import profile

fileQueue = Queue.Queue()  # global
srcPath = 'C:\\temp'
destPath = 'D:\\temp'
tcnt = 0
ttotal = 0

def CopyWorker():
    while True:
        fileName = fileQueue.get()
        fileQueue.task_done()
        shutil.copy(fileName, destPath)
        #tcnt += 1
        print 'copied: ', tcnt, ' of ', ttotal

def threadWorkerCopy(fileNameList):
    print 'threadWorkerCopy: ', len(fileNameList)
    ttotal = len(fileNameList)
    for i in range(4):
        t = threading.Thread(target=CopyWorker)
        t.daemon = True
        t.start()
    for fileName in fileNameList:
        fileQueue.put(fileName)
    fileQueue.join()

def sequentialCopy(fileNameList):
    #around 160.446 seconds, 152 seconds
    print 'sequentialCopy: ', len(fileNameList)
    cnt = 0
    ctotal = len(fileNameList)
    for fileName in fileNameList:
        shutil.copy(fileName, destPath)
        cnt += 1
        print 'copied: ', cnt, ' of ', ctotal

def main():
    print 'this is main method'
    fileCount = 0
    fileList = glob.glob(srcPath + '\\' + '*.csv')
    #sequentialCopy(fileList)
    threadWorkerCopy(fileList)

if __name__ == '__main__':
    profile.run('main()')
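One detail in the code above is separate from the speed question. CopyWorker calls fileQueue.task_done() before shutil.copy, so fileQueue.join() can return while copies are still in flight, and the daemon threads may be cut off when the program exits. A minimal sketch of the usual ordering follows. It is written for Python 3 (where the module is named queue) rather than the Python 2 used in the question, and the names copy_worker and threaded_copy are illustrative, not from the original post.

```python
import queue
import shutil
import threading


def copy_worker(file_queue, dest_dir):
    """Copy paths taken from file_queue into dest_dir, forever (daemon thread)."""
    while True:
        file_name = file_queue.get()
        try:
            shutil.copy(file_name, dest_dir)
        finally:
            # Mark the item done only after the copy attempt has finished,
            # so that join() actually waits for the copies.
            file_queue.task_done()


def threaded_copy(file_names, dest_dir, num_workers=4):
    """Copy file_names into dest_dir using num_workers daemon threads."""
    file_queue = queue.Queue()
    for _ in range(num_workers):
        worker = threading.Thread(
            target=copy_worker, args=(file_queue, dest_dir), daemon=True
        )
        worker.start()
    for file_name in file_names:
        file_queue.put(file_name)
    # Returns once every queued copy has been attempted.
    file_queue.join()
```

This only makes the timing measure complete copies. It does not by itself make the threaded version faster.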
Recommended answer
Of course it's slower. The hard drives are having to seek between the files constantly. Your belief that multithreading would make this task faster is completely unjustified. The limiting factor is how fast you can read data from or write data to the disk, and every seek from one file to another is time lost that could have been spent transferring data.
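One way to check this on a given machine is to time both strategies on the same set of files. The sketch below is only an illustration under stated assumptions: it uses Python 3, a ThreadPoolExecutor in place of the hand-rolled Queue workers from the question, and small generated sample files in temporary directories. The function names (copy_sequential, copy_threaded, time_copy) are hypothetical. Small files that sit in the OS cache will not reproduce seek behaviour on two physical hard drives, so for a meaningful comparison the file list and destination would need to point at real data on the real drives.

```python
import os
import shutil
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def copy_sequential(file_names, dest_dir):
    """Copy the files one after another."""
    for name in file_names:
        shutil.copy(name, dest_dir)


def copy_threaded(file_names, dest_dir, num_workers=4):
    """Copy the files using a pool of num_workers threads."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # list() drains the result iterator so any copy error is raised here.
        list(pool.map(lambda name: shutil.copy(name, dest_dir), file_names))


def time_copy(copy_func, file_names):
    """Return the seconds copy_func takes to copy file_names into a fresh temp dir."""
    dest_dir = tempfile.mkdtemp()
    try:
        start = time.perf_counter()
        copy_func(file_names, dest_dir)
        return time.perf_counter() - start
    finally:
        shutil.rmtree(dest_dir)


if __name__ == '__main__':
    # Generated sample data (an assumption for the demo, not the asker's files).
    src_dir = tempfile.mkdtemp()
    try:
        names = []
        for i in range(50):
            path = os.path.join(src_dir, 'sample_%d.csv' % i)
            with open(path, 'wb') as f:
                f.write(os.urandom(64 * 1024))
            names.append(path)
        print('sequential: %.3f s' % time_copy(copy_sequential, names))
        print('threaded:   %.3f s' % time_copy(copy_threaded, names))
    finally:
        shutil.rmtree(src_dir)
```

If the answer's reasoning holds for the hardware in question, the threaded timing should be no better than the sequential one when source and destination are single spinning disks.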