
Add Stream.group #10

Open

wants to merge 2 commits into master
Conversation

@pveber (Contributor) commented Jan 6, 2021

Here is a tentative implementation of Stream.group, similar to Core.List.group.

This PR has two commits, each with a different implementation. The first is very close to Stream.split, but since I noticed the comment above asking how efficient this implementation would be, I tried an alternative where the elements of each group are accumulated in a list. I believe the two behave the same, but I'd be glad to have confirmation. A quick benchmark suggests the second version is roughly twice as fast.

The PR certainly needs more work, please let me know what I can do.

@rizo (Member) commented Jan 8, 2021

Hi @pveber! Thanks for submitting this! 👍

Sorry in advance for the long reply. Here are my detailed thoughts on the proposed implementation and on the grouping operations in general.

Thoughts on Stream.group ~break

  1. The usage of group and break seems contradictory: in my view, when grouping it's more natural to say which items should go in the same group, not when they are distinct. Names such as split, segment or partition seem better suited for negative predicates. For example, "group contacts by country" reads more easily than "group contacts by breaking them when the country is different".
  2. The proposed implementation is not the real "group by". I'm not certain of the origin of "group by", but I think it is traditionally interpreted as SQL's GROUP BY operation, where the result is a map from keys to the items that match them. I think many people expect something like this from a "group by" operation. To be clear, the proposed implementation is also useful; I just want to be careful with naming.
  3. Should the inner group be a plain list instead of a Stream.t? I haven't fully decided on this, but in general I prefer to avoid implicit conversions: they add cost and might not be what the user wants in the first place.
  4. Happy to hear that the list-based concatenation is more efficient!

"group by" in other places

Before we continue with the API design, allow me to share my notes on how "group by" works in other libraries and languages. It is useful not only to define the grouping operation but also to understand how it plays with other operations.

iter (also containers)

Iter offers two variants with very distinct semantics:

val group_succ_by : ?eq:('a -> 'a -> bool) -> 'a t -> 'a list t
(** Group equal consecutive elements. Linear time. Formerly synonym to [group]. *)
  • The grouping is consecutive as opposed to the traditional complete grouping.
  • The ~eq argument defines similarity between items, as opposed to Core's ~break.
  • The return type contains lists and not nested 'a ts.
  • The order of the items in groups (i.e., lists) is reversed. This is not clear from the signature, but the implementation confirms it. I guess the rationale is that the items are "equal" according to ~eq, so the order shouldn't matter. It might be worth noting this in the docs.
val group_by : ?hash:('a -> int) -> ?eq:('a -> 'a -> bool) -> 'a t -> 'a list t
(** Group equal elements, disregarding their order of appearance.
    The result iterator is traversable as many times as required. *)
  • The grouping is not consecutive and consumes the entire input before producing the result.
  • The ~eq argument defines similarity between items, as in group_succ_by.
  • The return type contains lists and not nested 'a ts. There's also an implicit conversion from a Hashtbl.t in the implementation. Would it not be useful to return the table to be closer to the traditional "group by" and give more control to the user?
  • The order of the items in groups is reversed, similarly to group_succ_by.

group, groupBy and groupOn in Haskell

group :: Eq a => [a] -> [[a]]
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupOn :: Eq b => (a -> b) -> [a] -> [[a]]
  • In all of them the processing strategy is consecutive, and the equality function defines similarity.
  • The on suffix denotes a projection function to apply to the original items. This is similar to the traditional SQL definition of "group by", albeit with "on" instead of "by".
  • The naming convention is used consistently for other operations such as maximumOn and sortBy.
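
The consecutive semantics shared by these Haskell functions can be sketched over plain lists. The helper below, group_by_consecutive, is a hypothetical name (not part of any library discussed here); it compares each new element to the most recently added element of the current group:

```ocaml
(* A minimal sketch of consecutive grouping over plain lists, mirroring
   Haskell's groupBy. [group_by_consecutive] is an illustrative helper. *)
let group_by_consecutive eq xs =
  (* Groups are accumulated in reverse; [push] restores input order. *)
  let push group acc = List.rev group :: acc in
  let rec go acc group = function
    | [] -> List.rev (push group acc)
    | x :: rest ->
      (match group with
       | y :: _ when eq y x -> go acc (x :: group) rest
       | _ -> go (push group acc) [x] rest)
  in
  match xs with
  | [] -> []
  | x :: rest -> go [] [x] rest

let () =
  assert (group_by_consecutive ( = ) [1; 1; 2; 3; 3] = [[1; 1]; [2]; [3; 3]])
```

Note that, unlike Haskell's groupBy (which compares against the first element of the group), this sketch compares against the previous element; for a genuine equivalence relation the two coincide.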

Seq.groupBy in F#

groupBy (projection:'T->'Key) (source:seq<'T>) : seq<'Key * seq<'T>>
  • The entire input sequence is processed.
  • "by" means projection for a single item and not an equality check.
  • The return type is essentially a map similar to the traditional definition of "group by".

itertools.groupby in Python

>>> [(k, list(vs)) for k, vs in itertools.groupby("Mississippi")]
[('M', ['M']),
 ('i', ['i']),
 ('s', ['s', 's']),
 ('i', ['i']),
 ('s', ['s', 's']),
 ('i', ['i']),
 ('p', ['p', 'p']),
 ('i', ['i'])]
  • Similar in shape to F#'s Seq.groupBy, but the grouping is consecutive: equal keys that are not adjacent produce separate groups, as the repeated 'i' entries above show.

group-by in Clojure

(group-by count ["a" "as" "asd" "aa" "asdf" "qwer"])
{1 ["a"], 2 ["as" "aa"], 3 ["asd"], 4 ["asdf" "qwer"]}
  • Essentially the same as F#'s Seq.groupBy but produces a map as a result.

Elixir, Ruby, Julia, etc

  • All provide very similar functionality to F#'s Seq.groupBy where a map is produced.

Conclusions

  • When grouping things, it's best to provide predicates for similarity where groups are formed on true.
  • The groups do not have to be streams. The elements have to be buffered in the implementation anyway; converting the container back to a stream can be done by the user if they wish.
  • Keeping the keys that were used for grouping is useful. Essentially, I think the consecutive implementations of "group by" that do not produce a map are the least useful ones. I've certainly needed the consecutive version in the past (when splitting up time series, for example), but I wouldn't even call it "group by": the actual "group by" (which I typically implement as a fold on a stream accumulating into a map) should also be provided and is equally useful, or more so.
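
For reference, the "fold into a map" version mentioned in the last bullet could be sketched as follows. The name group_by_key is hypothetical, and a plain list stands in for the stream:

```ocaml
(* Sketch of a map-producing group-by: fold the input into a Hashtbl
   keyed by [key x], then reverse each accumulated group to restore
   insertion order. *)
let group_by_key key xs =
  let tbl = Hashtbl.create 16 in
  List.iter
    (fun x ->
       let k = key x in
       let existing = try Hashtbl.find tbl k with Not_found -> [] in
       Hashtbl.replace tbl k (x :: existing))
    xs;
  Hashtbl.fold (fun k group acc -> (k, List.rev group) :: acc) tbl []

let () =
  let groups = group_by_key String.length ["a"; "as"; "asd"; "aa"] in
  assert (List.assoc 1 groups = ["a"]);
  assert (List.assoc 2 groups = ["as"; "aa"])
```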

Here are some naming ideas for a function that groups consecutive elements based on similarity:

val group_succ : ('a -> 'a -> bool) -> 'a t -> 'a list t
val group_serial : ('a -> 'a -> bool) -> 'a t -> 'a list t
val group_with : ('a -> 'a -> bool) -> 'a t -> 'a list t
val groups : ('a -> 'a -> bool) -> 'a t -> 'a list t
val segments : where:('a -> 'a -> bool) -> 'a t -> 'a list t

(* Usage *)

let groups =
  [1; 2; 101; 102; 201; 202]
  |> Stream.of_list
  |> Stream.segments ~where:(fun a b -> b - a < 10) in
assert (Stream.to_list groups = [[1; 2]; [101; 102]; [201; 202]])

Alternatively with an inverted predicate:

val partition_by : ('a -> 'a -> bool) -> 'a t -> 'a list t
val split_between : where:('a -> 'a -> bool) -> 'a t -> 'a list t

(* Usage *)

let groups =
  [1; 2; 101; 102; 201; 202]
  |> Stream.of_list
  |> Stream.split_between ~where:(fun a b -> b - a > 10) in
assert (Stream.to_list groups = [[1; 2]; [101; 102]; [201; 202]])
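
One possible sketch of split_between, written over plain lists for self-containment (the streaming layer would wrap something like this); the helper and its behaviour here are illustrative, not a committed design:

```ocaml
(* Sketch: start a new group whenever [where prev next] holds between
   two adjacent elements. *)
let split_between ~where xs =
  let push group acc = List.rev group :: acc in
  let rec go acc group = function
    | [] -> List.rev (push group acc)
    | x :: rest ->
      (match group with
       | prev :: _ when where prev x -> go (push group acc) [x] rest
       | _ -> go acc (x :: group) rest)
  in
  match xs with
  | [] -> []
  | x :: rest -> go [] [x] rest

let () =
  assert (split_between ~where:(fun a b -> b - a > 10)
            [1; 2; 101; 102; 201; 202]
          = [[1; 2]; [101; 102]; [201; 202]])
```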

I'm not sure this is clearer. I still need to think about this and consider other grouping/splitting operations. Very curious to hear your opinion. I'm also happy to just add group ~break now and iterate on the design until 1.0.

Maybe @c-cube has any thoughts on this too based on his experience with Iter, gen, etc? 🙂

@c-cube commented Jan 8, 2021

I think providing a hashtable of keys to the elements with that key would be great, except that it's less convenient to express in OCaml, as you cannot return a polymorphic hashtable. (Well, you can, but it's bad taste, especially if you provide a custom equality/hash function.) I think the design decisions in iter/containers are fine for general-purpose APIs 🙂, and more specific needs can be met by redoing your own version (it's not a lot of work, after all).
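
To illustrate the constraint: the polymorphic Hashtbl.t is tied to structural equality and hashing, so honouring a custom eq/hash pair means going through the Hashtbl.Make functor, which cannot be applied for an unknown 'a inside an ordinary polymorphic function. A small example (the module name and length-based equality are made up for illustration):

```ocaml
(* A functorized hashtable whose notion of equality is "same length". *)
module StringByLength = Hashtbl.Make (struct
  type t = string
  let equal a b = String.length a = String.length b
  let hash s = String.length s
end)

let () =
  let tbl = StringByLength.create 16 in
  StringByLength.replace tbl "aa" 1;
  (* "bb" is "equal" to "aa" under this table's custom equality. *)
  assert (StringByLength.find tbl "bb" = 1)
```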

@pveber (Contributor, Author) commented Jan 11, 2021

@rizo Impressive summary, thanks!

First, I like the idea of representing inner groups as simple lists rather than a stream.

Second, while I followed Core's naming because the library is widely used, I also like some of the alternatives you propose, notably segment ~where. Keeping group ~by for something that returns a kind of dictionary would be very handy, and I would definitely use it.

When you've settled on a naming convention, I'll happily rework this PR.

@rizo (Member) commented Aug 7, 2022

Hi @pveber! Really sorry for such a delay in getting back to you on this. Life has been pushing me in some other directions...

I have been trying to do some thorough and future-proof thinking about the API and came up with a few conventions for both naming and semantics for the streaming operations. You can find the full analysis here: https://docs.google.com/spreadsheets/d/101Xe69CBLcdupRKXRaBk1DaFbMvjwpqwr_z7t-vdnXo

One main idea I had was to avoid committing to a particular group representation for group_by. I suggested above that we use simple lists for this, but then realised that this can be any generic sink! With this in mind our group_by operation can have the following signature:

val group_by : ('a -> 'key) -> ?hash:('key -> int) -> into:('a, 'r) sink -> 'a t -> ('key * 'r) t

And, of course, producing list groups would work like:

Stream.group_by String.length ~into:Sink.list stream
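
A rough sketch of how this sink-parameterised group_by could work. Note the sink type below (an init/push/stop triple with an existential accumulator) is a minimal stand-in, not the library's actual Sink.t, and a plain list stands in for Stream.t; all names here are illustrative assumptions:

```ocaml
(* A stand-in sink: an accumulator type hidden behind an existential. *)
type ('a, 'r) sink =
  | Sink : { init : 'acc; push : 'acc -> 'a -> 'acc; stop : 'acc -> 'r }
      -> ('a, 'r) sink

(* A sink that collects elements into a list in input order. *)
let list_sink : ('a, 'a list) sink =
  Sink { init = []; push = (fun acc x -> x :: acc); stop = List.rev }

(* Group elements by [key], feeding each group into its own copy of the
   sink's accumulator, then finalise each accumulator with [stop]. *)
let group_by key ~into xs =
  match into with
  | Sink s ->
    let tbl = Hashtbl.create 16 in
    List.iter
      (fun x ->
         let k = key x in
         let acc = try Hashtbl.find tbl k with Not_found -> s.init in
         Hashtbl.replace tbl k (s.push acc x))
      xs;
    Hashtbl.fold (fun k acc out -> (k, s.stop acc) :: out) tbl []

let () =
  let groups = group_by String.length ~into:list_sink ["a"; "as"; "aa"] in
  assert (List.assoc 1 groups = ["a"]);
  assert (List.assoc 2 groups = ["as"; "aa"])
```

The appeal of this shape is that swapping the sink changes the group representation (list, count, sum, ...) without touching the grouping logic itself.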

The linked spreadsheet also includes other operations I mentioned previously that split the stream "by" and "between" elements.

In addition to that, I have been thinking about adding more operators, some of which are described in the following diagram (also included in the linked spreadsheet):

[Screenshot: diagram of the proposed grouping/splitting operators, also included in the linked spreadsheet]

Would love to know your thoughts on this, and do let me know if you would still be happy to implement the group_by operation.
