Trying to save a DataFrame using Arrow.jl gives: ArgumentError: type does not have a definite number

Time:01-23

I have a dataframe that I'd like to save using Arrow.write().

I can save a subframe of it by omitting one column. But if I leave the column in, I get this error:

ArgumentError: type does not have a definite number of fields

The objects in this column are all 4-Tuples, and their elements are all either empty Tuples or 1- or 2-Tuples of Int64s. Typical examples would be ((1,), (), (2,), ()) and ((1, 2), (), (), ()). If I use Arrays of Arrays rather than Tuples of Tuples, it works just fine. I prefer to use tuples, and I would prefer not to have to process data before writing and after reading it (note that this also rules out things like using four separate columns -- plus I suspect having 2-tuples and 1-tuples and empty tuples in the same column would produce the same error).

I don't really understand the meaning of the error here, so I'm not sure how to fix it. Is there an easy fix? Or do I need to use arrays instead?

Here is a minimal working example which gives me this error:

using Arrow, DataFrames

x = ((1,), (1,), (), ());
y = ((1, 2), (), (), ());
df = DataFrame(col = [x, y]);
Arrow.write("test.arrow", df)

If I use col=[x] or col=[y], it works, so the problem stems from having both tuple shapes in the same vector. Maybe this is a fundamental limitation of Arrow?

More details on the error message: it comes from line 764 of reflection.jl, in fieldcount(@nospecialize t). This function is called by Arrow's arrowvector (in `arraytypes/struct.jl`). Here is the full function definition:

function arrowvector(::StructKind, x, i, nl, fi, de, ded, meta; kw...)
    len = length(x)
    validity = ValidityBitmap(x)
    T = Base.nonmissingtype(eltype(x))
    data = Tuple(arrowvector(ToStruct(x, j), i, nl + 1, j, de, ded, nothing; kw...) for j = 1:fieldcount(T))
    return Struct{withmissing(eltype(x), namedtupletype(T, data)), typeof(data)}(validity, data, len, meta)
end

fieldcount is called on line 5, but I don't know what T will be for my use case.
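
The element type can be inspected directly with Base functions, without calling Arrow at all. The following is a minimal sketch under the assumption that the failure happens when arrowvector recurses into the inner tuple positions; that link is inferred from the function quoted above, not confirmed against Arrow's documentation. The names v, T, and S are illustrative only.

```julia
# Same tuples as in the example above.
x = ((1,), (1,), (), ())
y = ((1, 2), (), (), ())

v = [x, y]

# The vector literal joins the two tuple types into a single element type.
T = eltype(v)
println(T)
println(isconcretetype(T))  # false: T is neither of the two concrete tuple types

# The outer tuple always has four positions, so this call succeeds.
println(fieldcount(T))

# Positions whose tuples differ in length have no fixed number of fields,
# so fieldcount throws for them; positions that are always () are fine.
for j in 1:fieldcount(T)
    S = fieldtype(T, j)
    try
        println(j, ": ", S, " has ", fieldcount(S), " field(s)")
    catch err
        println(j, ": ", S, " -> ", typeof(err))
    end
end
```

With col=[x] or col=[y] alone, every position has a concrete tuple type, which would be consistent with those cases working.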

CodePudding user response:

You probably need to update your packages: the problem is not reproducible with the current versions of these packages.

PS: It is hard to see a good reason to store such a structure in a data frame. Transform your data so that each column has a structure suited to data manipulation (such as Int, Float64, ...).

CodePudding user response:

The problem is fixed by explicitly typing the array before constructing the DataFrame. Here is a fixed working example:

using Arrow, DataFrames

x = ((1,), (1,), (), ());
y = ((1, 2), (), (), ());
T = Union{
    Tuple{Tuple{Int64}, Tuple{Int64}, Tuple{}, Tuple{}},
    Tuple{Tuple{Int64, Int64}, Tuple{}, Tuple{}, Tuple{}}
};
C = T[x, y];
df = DataFrame(col = C);
Arrow.write("test.arrow", df)
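
Writing the Union out by hand gets tedious with more than two tuple shapes. One possible shortcut, assuming all values are already in memory, is to build the element type from the values themselves. This sketch only constructs the typed vector and does not call Arrow; vals, U, and C2 are illustrative names.

```julia
x = ((1,), (1,), (), ())
y = ((1, 2), (), (), ())
vals = (x, y)

# Union of the concrete types of all values; repeated types collapse automatically.
U = Union{map(typeof, vals)...}

# Typed vector literal: the element type is exactly U, not an inferred abstract tuple type.
C2 = U[vals...]

println(eltype(C2) == U)                   # true
println(U == Union{typeof(x), typeof(y)})  # true
```

The resulting vector can then be passed to DataFrame(col = C2) in place of C.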