将 NumPy 数组转换为集合需要太长时间

Converting NumPy array to a set takes too long(将 NumPy 数组转换为集合需要太长时间)
本文介绍了将 NumPy 数组转换为集合需要太长时间的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试执行以下操作

从 numpy 导入 *x = array([[3,2,3],[711,4,104],.......,[4,4,782,7845]]) # 大的 nparray对于 x 中的项目:设置(项目)

与以下相比,它需要很长时间:

x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]]) # 大 nparray对于 x 中的项目:item.tolist()

为什么将 NumPy 数组转换为 set 比转换为 list 需要更长的时间?我的意思是基本上两者都有复杂性O(n)?

解决方案

TL;DR: set() 函数使用 Python 的迭代协议创建一个集合.但是在 NumPy 数组上迭代(在 Python 级别上)非常慢,以至于在进行迭代之前使用 tolist() 将数组转换为 Python 列表会(快得多).

要了解迭代 NumPy 数组为何如此缓慢,了解 Python 对象、Python 列表和 NumPy 数组是如何存储在内存中的非常重要.

Python 对象需要一些簿记属性(如引用计数、指向其类的链接……)和它所代表的.例如整数 ten = 10 可能如下所示:

蓝色圆圈是您在 Python 解释器中为变量 ten 使用的名称",下部对象(实例)是实际表示整数的对象(因为簿记属性并不重要这里我在图片中忽略了它们).

一个 Python list 只是 Python 对象的集合,例如 mylist = [1, 2, 3] 会这样保存:

这次列表引用了 Python 整数 123,而名称 mylist 只是引用list 实例.

但是数组 myarray = np.array([1, 2, 3]) 不会将 Python 对象存储为元素:

123 直接存储在 NumPy array 实例中.

<小时>

有了这些信息,我可以解释为什么迭代 array 比迭代 list 慢得多:

每次访问 list 中的下一个元素时,list 只会返回一个存储的对象.这非常快,因为该元素已经作为 Python 对象存在(它只需要将引用计数加一).

另一方面,当你想要一个 array 的元素时,它需要在返回之前为包含所有簿记内容的值创建一个新的 Python盒子".当您遍历数组时,它需要为数组中的每个元素创建一个 Python 框:

创建这些盒子很慢,这是迭代 NumPy 数组比迭代存储值及其盒子的 Python 集合(列表/元组/集合/字典)慢得多的主要原因:

将 numpy 导入为 nparr = np.arange(100000)lst = 列表(范围(100000))定义迭代(obj):对于 obj 中的项目:经过%timeit 迭代(arr)# 每个循环 20.2 毫秒 ± 155 微秒(平均值 ± 标准偏差.7 次运行,每次 10 次循环)%timeit 迭代(lst)# 每个循环 3.96 毫秒 ± 26.6 微秒(平均值 ± 标准偏差.7 次运行,每次 100 次循环)

set构造函数"只是对对象进行迭代.

我不能肯定地回答的一件事是为什么 tolist 方法要快得多.最后,生成的 Python 列表中的每个值都需要位于Python 盒子"中,因此 tolist 可以 避免的工作并不多.但我肯定知道的一件事是 list(array)array.tolist() 慢:

arr = np.arange(100000)%timeit 列表(arr)# 每个循环 20 毫秒 ± 114 微秒(平均值 ± 标准偏差.7 次运行,每次 10 次循环)%timeit arr.tolist()# 每个循环 10.3 毫秒 ± 253 微秒(平均值 ± 标准偏差.7 次运行,每次 100 次循环)

其中每一个都有 O(n) 运行时复杂度,但常数因素非常不同.

在您的情况下,您确实将 set()tolist() 进行了比较 - 这不是一个特别好的比较.将 set(arr)list(arr)set(arr.tolist()) 进行比较会更有意义arr.tolist():

arr = np.random.randint(0, 1000, (10000, 3))def tosets(arr):对于 arr 中的行:设置(线)def tolists(arr):对于 arr 中的行:列表(行)def tolists_method(arr):对于 arr 中的行:line.tolist()def tosets_intermediatelist(arr):对于 arr 中的行:设置(line.tolist())%timeit tosets(arr)# 每个循环 72.2 ms ± 2.68 ms(平均值 ± 标准偏差.7 次运行,每次 10 个循环)%timeit tolists(arr)# 每个循环 80.5 毫秒 ± 2.18 毫秒(平均值 ± 标准偏差.7 次运行,每次 10 次循环)%timeit tolists_method(arr)# 每个循环 16.3 毫秒 ± 140 微秒(平均值 ± 标准偏差.7 次运行,每次 100 次循环)%timeit tosets_intermediatelist(arr)# 每个循环 38.5 毫秒 ± 200 微秒(平均值 ± 标准偏差.7 次运行,每次 10 次循环)

因此,如果您想要 set,最好使用 set(arr.tolist()).对于更大的数组,可以使用

集合比 lists 更复杂,因为它涉及 hashes 和空槽(Python 对集合使用开放寻址,所以我假设 numba 也会这样做).

与 NumPy array 一样,numba set 直接保存值.因此,当您将 NumPy array 转换为 numba set (或反之亦然)时,它根本不需要使用Python 盒子",所以当您创建sets 在 numba nopython 函数中它甚至比 set(arr.tolist()) 操作要快得多:

import numba as nb@nb.njitdef tosets_numba(arr):对于范围内的线(arr.shape [0]):设置(arr [lineno])tosets_numba(arr) # 热身%timeit tosets_numba(arr)# 每个循环 6.55 毫秒 ± 105 微秒(平均值 ± 标准偏差.7 次运行,每次 100 次循环)

这大约比 set(arr.tolist()) 方法快五倍.但重要的是要强调我没有从函数中返回 set.当您 return 从 nopython numba 函数到 Python Numba 的 set 会创建一个 python 集 - 包括为集中的所有值创建框"(这是 numba 隐藏的东西).

仅供参考:如果您将 lists 传递给 Numba nopython 函数或从这些函数返回列表,则会发生相同的装箱/拆箱.那么 Python 中的 O(1) 操作是 Numba 中的 O(n) 操作!这就是为什么通常最好将 NumPy 数组传递给 numba nopython 函数(即 O(1)).

我假设如果你从函数中返回这些集合(现在真的不可能,因为 numba 目前不支持集合列表)它会更慢(因为它会创建一个 numba 集合 并且 稍后将其转换为 python 集)或仅稍微快一点(如果转换 numbaset -> pythonset 真的非常快).

就我个人而言,只有当我不需要从函数中返回它们并在函数内对集合执行所有操作时,我才会将 numba 用于集合并且只有当集合上的所有操作都是在 nopython 模式下支持.在任何其他情况下,我都不会在这里使用 numba.

<小时>

请注意:from numpy import * 应该避免,这样做时会隐藏几个 python 内置函数(sum, min, max, ...) 并且它把很多东西放到你的全局变量中.最好使用 import numpy as np.函数调用前面的 np. 使代码更清晰,无需输入太多内容.

I'm trying to execute the following

from numpy import *
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]])  # large nparray
for item in x:
    set(item)

and it takes very long compared to:

x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]])  # large nparray
for item in x:
    item.tolist()

Why does it take much longer to convert a NumPy array to a set than to a list? I mean basically both have complexity O(n)?

解决方案

TL;DR: The set() function creates a set using Pythons iteration protocol. But iterating (on the Python level) over NumPy arrays is so slow that using tolist() to convert the array to a Python list before doing the iteration is (much) faster.

To understand why iterating over NumPy arrays is so slow it's important to know how Python objects, Python lists, and NumPy arrays are stored in memory.

A Python object needs some bookkeeping properties (like the reference count, a link to its class, ...) and the value it represents. For example the integer ten = 10 could look like this:

The blue circle is the "name" you use in the Python interpreter for the variable ten and the lower object (instance) is what actually represents the integer (since the bookkeeping properties aren't imporant here I ignored them in the images).

A Python list is just a collection of Python objects, for example mylist = [1, 2, 3] would be saved like this:

This time the list references the Python integers 1, 2 and 3 and the name mylist just references the list instance.

But an array myarray = np.array([1, 2, 3]) doesn't store Python objects as elements:

The values 1, 2 and 3 are stored directly in the NumPy array instance.


With this information I can explain why iterating over an array is so much slower compared to an iteration over a list:

Each time you access the next element in a list the list just returns a stored object. That's very fast because the element already exists as Python object (it just needs to increment the reference count by one).

On the other hand when you want an element of an array it needs to create a new Python "box" for the value with all the bookkeeping stuff before it is returned. When you iterate over the array it needs to create one Python box for each element in your array:

Creating these boxes is slow and the main reason why iterating over NumPy arrays is much slower than iterating over Python collections (lists/tuples/sets/dictionaries) which store the values and their box:

import numpy as np
arr = np.arange(100000)
lst = list(range(100000))

def iterateover(obj):
    for item in obj:
        pass

%timeit iterateover(arr)
# 20.2 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit iterateover(lst)
# 3.96 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The set "constructor" just does an iteration over the object.

One thing I can't answer definitely is why the tolist method is so much faster. In the end each value in the resulting Python list needs to be in a "Python box" so there's not much work that tolist could avoid. But one thing I know for sure is that list(array) is slower than array.tolist():

arr = np.arange(100000)

%timeit list(arr)
# 20 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr.tolist()
# 10.3 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Each of these has O(n) runtime complexity but the constant factors are very different.

In your case you did compare set() to tolist() - which isn't a particular good comparison. It would make more sense to compare set(arr) to list(arr) or set(arr.tolist()) to arr.tolist():

arr = np.random.randint(0, 1000, (10000, 3))

def tosets(arr):
    for line in arr:
        set(line)

def tolists(arr):
    for line in arr:
        list(line)

def tolists_method(arr):
    for line in arr:
        line.tolist()

def tosets_intermediatelist(arr):
    for line in arr:
        set(line.tolist())

%timeit tosets(arr)
# 72.2 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists(arr)
# 80.5 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists_method(arr)
# 16.3 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit tosets_intermediatelist(arr)
# 38.5 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So if you want sets you are better off using set(arr.tolist()). For bigger arrays it could make sense to use np.unique but because your rows only contain 3 items that will likely be slower (for thousands of elements it could be much faster!).


In the comments you asked about numba and yes, it's true that numba could speed this up. Numba supports typed sets (only numeric types), but that doesn't mean it will be always faster.

I'm not sure how numba (re-)implements sets but because they are typed it's likely they also avoid the "Python boxes" and store the values directly inside the set:

Sets are more complicated than lists because it they involve hashes and empty slots (Python uses open-addressing for sets, so I assume numba will too).

Like the NumPy array the numba set saves the values directly. So when you convert a NumPy array to a numba set (or vise-versa) it won't need to use "Python boxes" at all, so when you create the sets in a numba nopython function it will be much faster even than the set(arr.tolist()) operation:

import numba as nb
@nb.njit
def tosets_numba(arr):
    for lineno in range(arr.shape[0]):
        set(arr[lineno])

tosets_numba(arr)  # warmup
%timeit tosets_numba(arr)
# 6.55 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That's roughly five times faster than the set(arr.tolist()) approach. But it's important to highlight that I did not return the sets from the function. When you return a set from a nopython numba function to Python Numba creates a python set - including "creating the boxes" for all values in the set (that's something numba is hiding).

Just FYI: The same boxing/unboxing happens if you pass lists to Numba nopython functions or return lists from these functions. So what's a O(1) operation in Python is an O(n) operation with Numba! That's why it's generally better to pass NumPy arrays to numba nopython function (which is O(1)).

I assume that if you return these sets from the function (not really possible right now because numba doesn't support lists of sets currently) it would be slower (because it creates a numba set and later converts it to a python set) or only marginally faster (if the conversion numbaset -> pythonset is really, really fast).

Personally I would use numba for sets only if I don't need to return them from the function and do all operations on the set inside the function and only if all the operations on the set are supported in nopython mode. In any other case I wouldn't use numba here.


Just a note: from numpy import * should be avoided, you hide several python built-in functions when you do that (sum, min, max, ...) and it puts a lot of stuff into your globals. Better to use import numpy as np. The np. in front of function calls makes the code clearer and isn't much to type.

这篇关于将 NumPy 数组转换为集合需要太长时间的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!

本站部分内容来源互联网,如果有图片或者内容侵犯您的权益请联系我们删除!

相关文档推荐

Leetcode 234: Palindrome LinkedList(Leetcode 234:回文链接列表)
How do I read an Excel file directly from Dropbox#39;s API using pandas.read_excel()?(如何使用PANDAS.READ_EXCEL()直接从Dropbox的API读取Excel文件?)
subprocess.Popen tries to write to nonexistent pipe(子进程。打开尝试写入不存在的管道)
I want to realize Popen-code from Windows to Linux:(我想实现从Windows到Linux的POpen-code:)
Reading stdout from a subprocess in real time(实时读取子进程中的标准输出)
How to call type safely on a random file in Python?(如何在Python中安全地调用随机文件上的类型?)