Support null values when doing batched marshaling #112

Open
facundominguez opened this issue Mar 2, 2018 · 7 comments

@facundominguez
Member

facundominguez commented Mar 2, 2018

Roughly speaking, a batch is an encoding of multiple Haskell or Java values as a set of primitive arrays. If we have a pair in Haskell (True, 1), a batch for the type (Bool, Int32) will have an array of Bool and an array of Int32, with the two components of the pair stored at the same position in their respective arrays.

On the Java side, this works too: new scala.Tuple2<Boolean, Integer>(true, 1) can be stored in a pair of primitive arrays in the same way. Primitive arrays are cheap to pass from Java to Haskell.

But what do we do if the Java tuple is or contains null? There is no way to store null in primitive arrays, so we are forced to have a separate boolean array (boolean isnull[]) which tells for each position in the batch if it corresponds to a null value or not.
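To make the side-array idea concrete, here is a minimal pure-Haskell sketch. It uses plain lists as stand-ins for primitive arrays, and the names PairBatch, encodePairs and decodePairs are hypothetical, not part of the library. It also shows the dummy values that have to be written at null positions.

```haskell
import Data.Int (Int32)
import Data.Maybe (isNothing)

-- Hypothetical columnar encoding of a batch of nullable (Bool, Int32) pairs.
-- Lists stand in for the primitive arrays that would be passed across the JNI.
data PairBatch = PairBatch
  { isNull :: [Bool]   -- True where the original pair was null
  , bools  :: [Bool]   -- first components (dummy False at null positions)
  , int32s :: [Int32]  -- second components (dummy 0 at null positions)
  } deriving (Eq, Show)

-- Split a list of possibly-null pairs into one column per component,
-- plus the null mask.
encodePairs :: [Maybe (Bool, Int32)] -> PairBatch
encodePairs xs = PairBatch
  { isNull = map isNothing xs
  , bools  = map (maybe False fst) xs
  , int32s = map (maybe 0 snd) xs
  }

-- Rebuild the pairs, consulting the mask to decide which positions are null.
decodePairs :: PairBatch -> [Maybe (Bool, Int32)]
decodePairs (PairBatch ns bs is) = zipWith3 pick ns bs is
  where
    pick True  _ _ = Nothing
    pick False b i = Just (b, i)
```

With these definitions, decodePairs (encodePairs xs) gives back xs for any input list.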

This is the interface that we currently have to reify a batch:

class BatchReify a where
  ...
  reifyBatch :: J (Batch a) -> Int32 -> IO (Vector a)

There are a few alternatives to handle nulls.

1. All batches can contain null.

Our interface changes to

class BatchReify a where
  ...
  reifyBatch :: J (Batch a) -> Int32 -> IO (Vector (Nullable a))

where Nullable a is isomorphic to Maybe a. All instances are forced to wrap values with the Nullable type.

2. Only batches of types of the form Nullable a may contain null.

We can have an instance like

  type instance Batch (Nullable a)
    = 'Class "scala.Tuple2" <>
         '[ 'Array ('Prim 'PrimBoolean)
          , Batch a
          ]

  instance BatchReify a => BatchReify (Nullable a) where
    ...
    reifyBatch jxs n = do
      isnull <- [java| $jxs._1() |]
      v <- [java| $jxs._2() |]
             -- reify a batch of values of type `a` and later pick the
             -- non-null values as told by the @isnull@ vector.
             >>= flip reifyBatch n
      return $ V.zipWith toNullable isnull v
      where
        toNullable :: Bool -> a -> Nullable a
        toNullable False a = NotNull a
        toNullable True  _ = Null

Unfortunately, the above scheme requires producing dummy/default Haskell values in the positions of the vector v that correspond to nulls. Ideally, we would find a way to skip producing these values at all.

We could change reifyBatch to:

class BatchReify a where
  ...
  reifyBatch :: J (Batch a) -> Int32 -> (Int32 -> Bool) -> IO (Vector (Maybe a))

reifyBatch j sz p produces a vector in which only the positions whose index satisfies p hold a Just value. Every other position is Nothing, and no value is reified for it.
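As a rough illustration of the predicate-based variant, the sketch below replaces the Java batch with a per-index decoding action and returns a list instead of a Vector. The name reifyWith and the decodeAt parameter are assumptions made for the example only.

```haskell
import Data.Int (Int32)

-- Hypothetical stand-in for the predicate-based reifyBatch.
-- decodeAt i : action that reifies the value stored at index i
-- n          : number of positions in the batch
-- p          : True for the indices that hold a real (non-null) value
reifyWith :: (Int32 -> IO a) -> Int32 -> (Int32 -> Bool) -> IO [Maybe a]
reifyWith decodeAt n p = traverse step [0 .. n - 1]
  where
    step i
      | p i       = Just <$> decodeAt i  -- decode only the selected positions
      | otherwise = pure Nothing         -- the decoder is never run here
```

Because the decoding action is only run where p holds, no dummy Haskell value has to be produced for the null positions.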


Any preferences?

@facundominguez
Member Author

facundominguez commented Mar 3, 2018

The reflect side of the problem seems to admit only one obvious restatement of the BatchReflect class.

class BatchReflect a where
  ...
  reflectBatch :: Vector (Nullable a) -> IO (J (Batch a))

Update: nah, it could also be

class BatchReflect a where
  ...
  reflectBatch :: Vector (Maybe a) -> IO (J (Batch a))

where Nothing causes the corresponding position in the batch to be left uninitialized. In contrast, the version with Nullable asserts that the position in the batch corresponds to a null value.

@facundominguez
Member Author

facundominguez commented Mar 6, 2018

The second approach turned out to be lighter than the first one.

If we require all batches to handle null values, then when reifying a batch of a composite type we will need a boolean array for every component of the composite type, even though we might only be interested in handling nulls at the top level. Say we are batching pairs (Int32, Int32), and we want to account for the possibility that the value for a pair in Java is null. The first approach would demand that we also deal with cases where the value for a pair is not null but one of its components is.

The second approach allows us to use an instance of BatchReify (Nullable (Int32, Int32)). And if we want to deal with nulls in the components, we use BatchReify (Nullable (Nullable Int32, Nullable Int32)) instead.
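One way to see the difference in cost is to count how many boolean side arrays each type would need under the second approach. The MaskCount class below is purely illustrative and is not part of the library; it adds one mask per Nullable wrapper and nothing else.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}

import Data.Int (Int32)
import Data.Proxy (Proxy (..))

-- Same shape as the Nullable type discussed in this thread.
newtype Nullable a = Nullable (Maybe a)

-- Hypothetical class: number of boolean null-mask arrays a batch of type a needs
-- when only Nullable-wrapped types may contain null.
class MaskCount a where
  maskCount :: Proxy a -> Int

-- A bare Int32 column needs no mask.
instance MaskCount Int32 where
  maskCount _ = 0

-- A pair needs whatever its components need, and nothing for itself.
instance (MaskCount a, MaskCount b) => MaskCount (a, b) where
  maskCount _ = maskCount (Proxy :: Proxy a) + maskCount (Proxy :: Proxy b)

-- Each Nullable wrapper adds exactly one mask.
instance MaskCount a => MaskCount (Nullable a) where
  maskCount _ = 1 + maskCount (Proxy :: Proxy a)
```

Under these assumptions, Nullable (Int32, Int32) needs one mask, while Nullable (Nullable Int32, Nullable Int32) needs three. The first approach would effectively force the three-mask layout on every batch of pairs.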

@mboes
Member

mboes commented Mar 6, 2018

I don't know that I'm keen to have a Nullable wrapper, either in solution 1 or solution 2. It seems strange to me, because Int32 is a boxed thing on the Haskell side and a boxed thing on the Java side. Boxed objects on the Java side can always be null. That situation is no different when batching than anywhere else.

Further, on the implementation side, how does Java or Spark do this kind of thing? Do they also resort to a side array indicating nullity? Are nulls simply banned from being stored in the container? Could you give a specific example you've encountered where nulls are something that we must store?

@facundominguez
Member Author

> Are nulls simply banned from being stored in the container?

Datasets read from parquet files can contain nulls. And I'm pretty sure that if a dataset is read from a database table which contains nulls, it will also contain nulls. We are already dealing with these cases in data analysis flows.

> Do they also resort to a side array indicating nullity?

The side array is only necessary when batching. Spark isn't involved in it.

@mboes
Member

mboes commented Mar 6, 2018

> Datasets read from parquet files can contain nulls. And I'm pretty sure that if a dataset is read from a database table which contains nulls, it will also contain nulls.

Right. And presumably these are represented in-memory in an unboxed form? If so, I wonder how they do it.

@facundominguez
Member Author

facundominguez commented Mar 9, 2018

> Do they also resort to a side array indicating nullity?

I've spent some time today reading the Spark source. I haven't yet reached the place where this is needed. I did find, though, that a side array is used for this purpose in org.apache.spark.sql.catalyst.expressions.UnsafeRow.

> I don't know that I'm keen to have a Nullable wrapper, either in solution 1 or solution 2

Any ideas on how to solve it otherwise?

@facundominguez
Member Author

facundominguez commented Mar 6, 2020

> I don't know that I'm keen to have a Nullable wrapper, either in solution 1 or solution 2.

The leanest option I can think of is using

{-# LANGUAGE PatternSynonyms #-}

import qualified Data.Coerce

newtype Nullable a = Nullable (Maybe a)

pattern Null :: Nullable a
pattern Null <- Nullable Nothing where
  Null = Nullable Nothing

pattern NotNull :: a -> Nullable a
pattern NotNull a <- Nullable (Just a) where
  NotNull a = Nullable (Just a)

f :: Nullable a -> Maybe a
f Null = Nothing
f (NotNull a) = Just a

f' :: Nullable a -> Maybe a
f' = Data.Coerce.coerce

This way, we can at least coerce between Nullable and Maybe with no cost.
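The zero-cost conversion also lifts through containers whose type parameter has a representational role, such as lists. A small sketch, with the hypothetical helper names toMaybes and fromMaybes:

```haskell
import Data.Coerce (coerce)

-- Same newtype as above, repeated so the snippet stands alone.
newtype Nullable a = Nullable (Maybe a)

-- Convert a whole list without traversing it: coerce reuses the same
-- runtime representation, because Nullable a and Maybe a are coercible
-- and the list type's parameter is representational.
toMaybes :: [Nullable a] -> [Maybe a]
toMaybes = coerce

fromMaybes :: [Maybe a] -> [Nullable a]
fromMaybes = coerce
```

This only type-checks where the Nullable constructor is in scope, so a module that hides the constructor would need to export conversions like these itself.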
