Dask "Metadata mismatch found in `from_delayed`" when reading a JSON file

Question

I'm just starting my adventure with Dask, and I'm learning on an example dataset in JSON format. I know that this is not the easiest data format in the world for a beginner :)

I have a dataset in JSON format. I loaded the data into a dataframe via dd.read_json and everything went well. The problem appears when I call, for example, compute() or len().

I get this error:

ValueError: Metadata mismatch found in `from_delayed`.

Partition type: `DataFrame`
+----------+-------+----------+
| Column   | Found | Expected |
+----------+-------+----------+
| column1  |   -   | object   |
| column2  |   -   | object   |
+----------+-------+----------+

I tried different things, but nothing helped. I don't know how to handle this error.

Please help, I would be very grateful!

Recommended answer

My guess is that your JSON data has different columns in different parts of the data. When Dask DataFrame loads your JSON data, it looks at the first chunk of data to determine the column names and datatypes. It then assumes that all of your data looks like that first chunk.

This assumption turns out to be wrong in your case; probably there is some column that only appears later on in the file.
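One way to check this guess before touching Dask at all is to count which sets of keys actually occur in the records. The sketch below is only an illustration: it assumes the file is line-delimited JSON (one object per line), which the question does not state, and both the helper name `column_sets` and the sample records are made up.

```python
import io
import json
from collections import Counter


def column_sets(lines):
    """Count how often each distinct set of keys occurs (hypothetical helper).

    Assumes one JSON object per line; blank lines are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # A frozenset is hashable, so it can serve as the Counter key.
        counts[frozenset(record)] += 1
    return counts


# Made-up sample data: the third record lacks column2 and adds column3.
sample_file = io.StringIO(
    '{"column1": "a", "column2": "b"}\n'
    '{"column1": "c", "column2": "d"}\n'
    '{"column1": "e", "column3": 1}\n'
)

counts = column_sets(sample_file)
for keys, n in counts.items():
    print(sorted(keys), n)
```

If more than one key set shows up, the records do not all share the same columns, which is the situation the answer suspects. For a real file, an open file handle can be passed in place of the `io.StringIO` object.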

You might consider increasing the size of the sample that Dask reads when determining metadata such as column names.

import dask.dataframe as dd
df = dd.read_json(..., sample=2**26)  # sample size in bytes: 2**26 = 64 MiB

The default is 1 MiB (2**20 bytes).
