Allocating an array of Union{T, Missing} is very expensive in Julia. Is there any workaround it?
julia> @time Vector{Union{Missing, Int}}(undef, 10^7);
0.031052 seconds (2 allocations: 85.831 MiB)
julia> @time Vector{Union{Int}}(undef, 10^7);
0.000027 seconds (3 allocations: 76.294 MiB)
CodePudding user response:
Because if you make a Union of Missing with a bitstype like Int then Julia sets the flag that such a vector initially stores missing in each of its entries:
julia> Vector{Union{Missing, Int}}(undef, 10^7)
10000000-element Vector{Union{Missing, Int64}}:
missing
missing
⋮
missing
missing
If you used non-bitstype then such a flag for each entry does not have to be set as you can see here:
julia> Vector{Union{Missing, String}}(undef, 10^7)
10000000-element Vector{Union{Missing, String}}:
#undef
#undef
⋮
#undef
#undef
and in consequence the performance is the same:
julia> @btime Vector{Union{String}}(undef, 10^7);
11.672 ms (3 allocations: 76.29 MiB)
julia> @btime Vector{Union{Missing, String}}(undef, 10^7);
11.480 ms (2 allocations: 76.29 MiB)
CodePudding user response:
The difference is that union arrays get zero-initialized. You can see the code that decides this here:
https://github.com/JuliaLang/julia/blob/3f024fd0ab9e68b37d29fee6f2a9ab19819102c5/src/array.c#L191
This ends up as a call to memset:
So as a check, we can compare zeros vs allocating the union array:
julia> @time Vector{Union{Missing, Int}}(undef, 10^7);
0.020609 seconds (2 allocations: 85.831 MiB)
julia> @time zeros(Int, 10^7);
0.018375 seconds (2 allocations: 76.294 MiB)
Quite comparable timings.
However, I don't think this performance difference should end up mattering in your application unless you have structured it in a quite strange way. There is very little work you can do with that array until the allocation time becomes insignificant. For example, just setting the values of the uninitialized array makes the timing vs the union array quite similar:
julia> function f()
a = Vector{Int}(undef, 10^7)
for i in eachindex(a)
a[i] = 1
end
a
end;
julia> function f_union()
a = Vector{Union{Missing, Int}}(undef, 10^7)
for i in eachindex(a)
a[i] = 1
end
a
end;
julia> @time f();
0.015566 seconds (2 allocations: 76.294 MiB)
julia> @time f_union();
0.026414 seconds (2 allocations: 85.831 MiB)
