Support null values when doing batched marshaling #112
The reflect side of the problem seems to admit only one obvious restatement of the class:

```haskell
class BatchReflect a where
  ...
  reflectBatch :: Vector (Nullable a) -> IO (J (Batch a))
```

Update: nah, it could also be

```haskell
class BatchReflect a where
  ...
  reflectBatch :: Vector (Maybe a) -> IO (J (Batch a))
```

where `Nothing` causes the corresponding position in the batch to be left uninitialized. In contrast, the version with `Nullable` asserts that the position in the batch corresponds to a null value.
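To illustrate the difference, here is a sketch of how a `Maybe`-based reflection could split its input into a null mask and a payload vector. The helper below is an invention for illustration, not the library's actual API; null positions receive a dummy payload because primitive arrays must hold *something* there.

```haskell
import Data.Vector (Vector)
import qualified Data.Vector as V

-- Hypothetical sketch: split a vector of Maybe values into
-- (isnull mask, payloads), the pair of arrays a batch would store.
reflectBatchMaybe
  :: (a -> b)                 -- encode one non-null value
  -> b                        -- dummy payload for null positions
  -> Vector (Maybe a)
  -> (Vector Bool, Vector b)  -- (isnull mask, payloads)
reflectBatchMaybe enc dummy =
  V.unzip . V.map (\mx -> case mx of
    Nothing -> (True,  dummy)   -- position left "uninitialized"
    Just x  -> (False, enc x))
```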
The second approach turned out to be lighter than the first one. If we demand that all batches deal with null values, then when reifying a batch of a composite type we will need boolean arrays for every component of the composite type, while we might be interested only in handling nulls at the top level. Say we are batching pairs: the second approach allows us to use a null-aware instance only at the top level.
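To make the cost concrete, here is a sketch (these record types are invented for illustration) of the two encodings of a batch of `(Bool, Int)` pairs:

```haskell
-- Approach 1 (all batches are nullable): every component array
-- carries its own isnull mask, even when we only care about
-- top-level nulls.
data PairBatchAllNullable = PairBatchAllNullable
  { bools     :: [Bool], boolsNull :: [Bool]
  , ints      :: [Int],  intsNull  :: [Bool]
  }

-- Approach 2 (only `Nullable a` batches are nullable): a single
-- top-level mask suffices; the component arrays stay mask-free.
data PairBatchTopNullable = PairBatchTopNullable
  { isNull :: [Bool]
  , bools' :: [Bool]
  , ints'  :: [Int]
  }
```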
I don't know that I'm keen to have a separate `Nullable` type. Further, on the implementation side, how does Java or Spark do this kind of thing? Do they also resort to a side array indicating nullity? Are nulls simply banned from being stored in the container? Could you give a specific example you've encountered where nulls are something that we must store?
Datasets read from Parquet files can contain nulls. And I'm pretty sure that a dataset read from a database table which contains nulls will contain nulls as well. We are already dealing with these cases in data-analysis flows.

The side array is only necessary when batching; Spark isn't involved in it.
Right. And presumably these are represented in memory in an unboxed form? If so, I wonder how they do it.
I've spent some time today reading the Spark source. I haven't yet reached the place where this is needed, but I did find that a side array is used for this purpose in Spark's column vectors.

Any ideas on how to solve it otherwise?
The leanest approach I can think of is making `Nullable` a newtype around `Maybe`. This way, we can at least coerce between `Nullable a` and `Maybe a`.
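A minimal sketch of that idea: with `Nullable` as a newtype over `Maybe`, the two types share a runtime representation, so containers of one coerce to containers of the other at zero cost.

```haskell
import Data.Coerce (coerce)

-- Assumed representation: Nullable is a newtype over Maybe.
newtype Nullable a = Nullable (Maybe a)

toMaybes :: [Nullable a] -> [Maybe a]
toMaybes = coerce          -- zero-cost, no traversal

fromMaybes :: [Maybe a] -> [Nullable a]
fromMaybes = coerce
```

Because the coercion is representational, `coerce` works through any container whose type parameter has a representational role (lists, `Vector`, and so on), not just the element type itself.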
Grossly speaking, a batch is an encoding of multiple Haskell or Java values as a bunch of primitive arrays. If we have a pair in Haskell, `(True, 1)`, a batch for the type `(Bool, Int32)` will have an array of `Bool` and an array of `Int32` in which the components of the pair are stored at a given position.

On the Java side this works too: `new scala.Tuple2<Boolean, Integer>(true, 1)` can be stored in a couple of primitive arrays in the same way. Primitive arrays are cheap to pass from Java to Haskell.

But what do we do if the Java tuple is, or contains, `null`? There is no way to store `null` in primitive arrays, so we are forced to have a separate boolean array (`boolean isnull[]`) which tells, for each position in the batch, whether it corresponds to a null value.

This is the interface that we currently have to reify a batch:
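For concreteness, a plausible shape of such an interface, based on the surrounding discussion (the class and member names here are assumptions, not necessarily the library's actual API):

```haskell
-- Assumed shape of the current reification interface: a batch is a
-- Java object (J is a reference to a Java value) encoding n values,
-- reified into a Haskell vector.
class BatchReify a where
  type Batch a
  reifyBatch :: J (Batch a)  -- the Java-side batch
             -> Int32        -- number of elements in the batch
             -> IO (Vector a)
```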
There are a few alternatives to handle nulls.

1. All batches can contain null.

Our interface changes so that `reifyBatch` yields a `Vector (Nullable a)`, where `Nullable a` is isomorphic to `Maybe a`. All instances are forced to wrap values with the `Nullable` type.

2. Only batches of types of the form `Nullable a` may contain null.

We can have an instance for `Nullable a` defined in terms of the instance for `a`.
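A sketch of what such an instance might look like, assuming a class `BatchReify a` with an associated `Batch a` type, a method `reifyBatch :: J (Batch a) -> Int32 -> IO (Vector a)`, and an assumed helper `getIsNullArray` that reads the side boolean array from the batch (all of these names are assumptions for illustration):

```haskell
import Data.Vector (Vector)
import qualified Data.Vector as V

-- Sketch only: reify the payloads with the instance for `a` (null
-- positions yield dummy values), then consult the isnull array to
-- decide which positions become Nothing.
instance BatchReify a => BatchReify (Nullable a) where
  type Batch (Nullable a) = Batch a   -- reuse the underlying encoding
  reifyBatch jbatch n = do
    isnull <- getIsNullArray jbatch n  -- assumed helper: Vector Bool
    values <- reifyBatch jbatch n      -- dummies at null positions
    pure $ V.zipWith
      (\b x -> Nullable (if b then Nothing else Just x))
      isnull values
```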
Unfortunately, the above scheme requires producing dummy/default Haskell values at the positions of the vector `v` that correspond to nulls. Ideally, we would find a way to skip producing these values altogether.

We could change `reifyBatch` to take a predicate: `reifyBatch j sz p` produces a vector where some positions are yielded as `Nothing`; only those positions whose index satisfies `p` provide a `Just` value.

Any preferences?
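Under the same assumed names as above, the predicate-taking variant might have a signature like this:

```haskell
-- Sketch of the proposed variant: positions whose index fails the
-- predicate are skipped (yielded as Nothing) without ever producing
-- a dummy Haskell value for them.
reifyBatch
  :: J (Batch a)        -- the Java-side batch
  -> Int32              -- batch size
  -> (Int32 -> Bool)    -- which indices to actually reify
  -> IO (Vector (Maybe a))
```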