This article covers how to add each CSV file's name as a value in a column while merging 1,000+ files; the question and the recommended answer below may be a useful reference for anyone facing the same problem.
Problem Description
I am trying to merge 1,000+ CSV files with the following code:
import glob
import shutil

path = r'path_to_files/'
all_files = glob.glob(path + "/*.csv")

# Concatenate the raw bytes of every file; note that this also copies
# each file's header row into the combined output.
with open('updated_thirteen_jan.csv', 'wb') as wfd:
    for f in all_files:
        with open(f, 'rb') as fd:
            shutil.copyfileobj(fd, wfd)
I use the code above to avoid memory crashes, and it works well. However, I also want what the following code does for me:
import glob
import os
import pandas as pd

path = r'path_to_files/'
all_files = glob.glob(path + "/*.csv")
fields = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8']
li = []

first_one = True
for filename in all_files:
    if not first_one:  # if it is not the first csv file then skip the header row (row 0) of that file
        skip_row = [0]
    else:
        skip_row = []
    first_one = False

    df = pd.read_csv(filename, index_col=None, skiprows=skip_row, engine='python', usecols=fields)
    df = df[(df['lang'] == 'en')]
    filename = os.path.basename(filename)
    df['file_name'] = filename
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
With this code, I want to be able to do the column selection (fields), skip rows (skip_row), and add the file_name as a value.

Any suggestions?
Recommended Answer
If memory is the constraint, a pandas-based solution is to iterate over chunks of rows:
import os
import pandas as pd

print(pd.__version__)
# works with this version: '1.3.4'

# gen sample files
all_files = [f"{_}.csv" for _ in range(3)]
for filename in all_files:
    df = pd.DataFrame(range(3))
    df.to_csv(filename, index=False)

# combine into one
mode = "w"
header = True
for filename in all_files:
    with pd.read_csv(
        filename,
        engine="python",
        iterator=True,
        chunksize=10_000,
    ) as reader:
        for df in reader:
            filename = os.path.basename(filename)
            df["file_name"] = filename
            df.to_csv("some_file.csv", index=False, mode=mode, header=header)
            mode = "a"
            header = False
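If you also want the column selection, language filter, and file_name column from the question, the chunked approach above can be combined with them. Below is a minimal sketch, assuming your input files actually contain a 'lang' column (so it must be included in usecols for the filter to work) and reusing the placeholder names from the question (path_to_files/, col1..col8, updated_thirteen_jan.csv); adjust these to your data.

import glob
import os
import pandas as pd

path = r'path_to_files/'
all_files = glob.glob(path + "/*.csv")
# 'lang' is added here on the assumption that the files have such a column;
# the filter below would fail without it.
fields = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'lang']

mode = "w"
header = True
for filename in all_files:
    with pd.read_csv(
        filename,
        engine="python",
        iterator=True,
        chunksize=10_000,
        usecols=fields,  # column selection per chunk
    ) as reader:
        for df in reader:
            df = df[df['lang'] == 'en']                    # keep only English rows
            df['file_name'] = os.path.basename(filename)   # add the source file's name
            df.to_csv('updated_thirteen_jan.csv', index=False, mode=mode, header=header)
            mode = "a"       # append after the first write
            header = False   # write the column header only once

Because read_csv parses each file's header itself, there is no need to skip header rows manually; writing with header=False on every chunk after the first keeps the combined CSV from repeating the column names.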
That wraps up this question on how to add the CSV file's name as a value in a column while merging 1,000+ files; hopefully the recommended answer is helpful.