Handle enums #52

marcenacp · 2023-06-06T09:50:59Z

marcenacp
Jun 6, 2023
Maintainer

Handle enums in Croissant

Problem

The Titanic dataset declares enums to:

Add semantic understanding for human and machine-readability;
Add a typing capability to check that a column has its value in a given set of values.

For instance, in the original dataset, the embarkation is a string in C, Q, S or U. The human would like to have the semantic translations: Cherbourg, Queenstown, Southampton or Unknown. The machine would like to have the semantic meaning using https://www.wikidata.org/wiki/Q3667188, https://www.wikidata.org/wiki/Q733093, https://www.wikidata.org/wiki/Q79848 and https://www.wikidata.org/wiki/Q24238356.

Current approach: we declare a record set and join on it. We do not yet have clear guidelines on how to do this; the goal of this discussion is to propose and discuss them.

Solution

After discussion, the solution is: #52 (reply in thread)

Proposal

Add a recommendation in the documentation that all enums SHOULD bring a semantic meaning. E.g., male->0 and female->1 doesn't bring any semantic meaning and should be dropped. However, the fact that the dataset declares two genders (https://www.wikidata.org/wiki/Q6581097 and https://www.wikidata.org/wiki/Q6581072) does bring semantic meaning. In particular, enums do not change the value output by Croissant (as in the male/female example).
We define an ml:Enum data type. The source is the actual source. ml:enum references the column of a record set that corresponds to the value in the source.
The reference record set can be defined either 1) by an external source (e.g., a CSV), or 2) by directly embedding the key/value in RDF using: https://schema.org/PropertyValue.
Human readability: Tools that display all semantic meanings (e.g., when hovering on a record).
Machine readability: Tools that display all semantic meanings thanks to the semantic meaning of the field (e.g., https://schema.org/GeoCoordinates can be placed on a map).

...
"recordSet":
    {
      "@type": "ml:RecordSet",
      "name": "embarkation_enum",
      "description": "Semantic meaning of embarkations.",
      "field": [
        {
          "name": "key",
          "description": "C, Q, S or ?",
          "@type": "ml:Field",
          "dataType": "sc:Text"
        },
        {
          "name": "label",
          "description": "Human-readable label",
          "@type": "ml:Field",
          "dataType": "sc:Text"
        },
        {
          "name": "url",
          "description": "Corresponding WikiData URL",
          "@type": "ml:Field",
          "dataType": [
            "sc:Url",
            "wd:Q515"
          ]
        }
      ],
      "itemListElement": [
        {
          "@type": "ListItem",
          "item": [
            {
              "@type": "PropertyValue",
              "name": "key",
              "value": "C"
            },
            {
              "@type": "PropertyValue",
              "name": "label",
              "value": "Cherbourg"
            },
            {
              "@type": "PropertyValue",
              "name": "url",
              "value": "wd:Q3667188"
            },
            ...
          ]
        },
        ...
      ]
    },
          ...
          {
            "name": "embarkation",
            "description": "Port of Embarkation (C: Cherbourg, Q: Queenstown, S: Southampton, ?: Unknown).",
            "@type": "ml:Field",
            "dataType": "ml:Enum",
            "source": "#{passengers-table/embarked}",
            "ml:enum": "#{embarkation_enum/key}"
          },
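
To make the intent of the proposal concrete, here is a minimal Python sketch of how a tool might attach the semantic meaning to a record without changing the raw value. It is not part of Croissant: the `annotate` helper, the in-memory layout of the records, and the `embarked_label` / `embarked_url` output names are all assumptions for illustration. The Wikidata identifiers are taken in the order listed in the problem statement, and the "unknown" entry is left out.

```python
# Hypothetical in-memory form of the embarkation_enum record set.
EMBARKATION_ENUM = [
    {"key": "C", "label": "Cherbourg", "url": "wd:Q3667188"},
    {"key": "Q", "label": "Queenstown", "url": "wd:Q733093"},
    {"key": "S", "label": "Southampton", "url": "wd:Q79848"},
]


def annotate(records, enum_records, field="embarked", key="key"):
    """Attach enum metadata to each record, leaving the raw value untouched."""
    by_key = {row[key]: row for row in enum_records}
    annotated = []
    for record in records:
        match = by_key.get(record[field])
        annotated.append({
            **record,
            f"{field}_label": match["label"] if match else None,
            f"{field}_url": match["url"] if match else None,
        })
    return annotated


# Illustrative passengers; "X" is deliberately not in the enum.
passengers = [{"name": "A", "embarked": "C"}, {"name": "B", "embarked": "X"}]
annotated = annotate(passengers, EMBARKATION_ENUM)
```

The raw `embarked` value stays as it was in the source, which matches the recommendation that enums do not change the value output by Croissant.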

pierrot0 · 2023-06-06T12:00:37Z

pierrot0
Jun 6, 2023
Maintainer

Overall I agree with this proposal, especially with the introduction of the ml:Enum data type.

The PropertyValue syntax is a bit ugly, but that's a limitation of JSON-LD...

For human labels, can we add the sc:name data type?
This way, tools know they can use that as a human label.

        {
          "name": "label",
          "description": "Human-readable label",
          "@type": "ml:Field",
          "dataType": ["sc:Text", "sc:name"] 
        }

For the dataType in the ml:Field object, I think it would make sense to specify the actual type of the data (here sc:Text), in addition to the ml:Enum type:

          {
            "name": "embarkation",
            "description": "Port of Embarkation (C: Cherbourg, Q: Queenstown, S: Southampton, ?: Unknown).",
            "@type": "ml:Field",
            "dataType": ["sc:Text", "ml:Enum"],
            "source": "#{passengers-table/embarked}",
            "ml:enum": "#{embarkation_enum/key}"
          }

Technically, one could just use the actual type of the data without the ml:Enum data type, and use the presence of the ml:enum property to infer that this is an enumeration. However, from a semantic point of view, I think it's cleaner to list the two types. Also, there might be cases of enums with no labels, and so no ml:enum property to set.

On the naming of the ml:enum property: would it make sense to use a more descriptive name? Something like ml:enum_reference?

Maybe if we have "dataType": ["sc:Text", "ml:Enum"], then we can use the existing references mechanism, and tools should be able to infer the labels from the sc:name data type?
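
As a rough illustration of that last idea, a tool could look for the field whose dataType includes sc:name and treat it as the human label. This is only a sketch under the assumption that sc:name is accepted as a dataType; `find_label_field` and `as_list` are hypothetical helpers, not existing Croissant functions.

```python
def as_list(value):
    """dataType may be a single string or a list of strings."""
    return value if isinstance(value, list) else [value]


def find_label_field(record_set):
    """Return the name of the first field typed as sc:name, or None."""
    for field in record_set.get("field", []):
        if "sc:name" in as_list(field.get("dataType", [])):
            return field["name"]
    return None


embarkation_enum = {
    "@type": "ml:RecordSet",
    "name": "embarkation_enum",
    "field": [
        {"name": "key", "@type": "ml:Field", "dataType": "sc:Text"},
        {"name": "label", "@type": "ml:Field", "dataType": ["sc:Text", "sc:name"]},
    ],
}
label_field = find_label_field(embarkation_enum)
```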

2 replies

marcenacp Jun 15, 2023
Maintainer Author

I agree we should have both the data type (sc:Text) and the semantic type (ml:Enum).

benjelloun Jun 16, 2023
Maintainer

@pierrot0 With sc:name, you are suggesting using a schema.org property as a dataType. So far we have only talked about reusing classes from other vocabularies. If needed, borrowing properties would likely need a different mechanism than the "dataType" property.

benjelloun · 2023-06-06T16:34:43Z

benjelloun
Jun 6, 2023
Maintainer

I like the overall approach, as we discussed offline.

I'm not convinced it's necessary to define "ml:Enum" as a data type. Also, the semantics of the "ml:enum" property is very close to the one of the "references" property used for joins. What does it buy us to define these explicitly?

1 reply

marcenacp Jun 15, 2023
Maintainer Author

I would tend to agree 100%, but what happens if you need to define a references and an ml:enum? What's the mechanism to define several references?

josvandervelde · 2023-06-08T16:36:03Z

josvandervelde
Jun 8, 2023

Good suggestions, I think! I do want to comment on the verbosity, though.

Previously we had

          { "key": 0, "label": "North" },
          { "key": 1, "label": "East" },

Now we have a list of ListItems, each with 10+ rows of json. It's very schema.org-compliant (and generic), but it's a lot of json.

We've discussed a possibility to alleviate the problem before: a simpler Croissant view, convertible to the schema.org-compliant JSON-LD. Do we want to go that route? If not, what do you think about the verbosity?

3 replies

marcenacp Jun 15, 2023
Maintainer Author

Valid point. I think we cannot avoid the verbosity, since we work within schema.org's syntax.

benjelloun Jun 19, 2023
Maintainer

Actually... json-ld has a provision for including json in content: https://w3c.github.io/json-ld-syntax/#json-literals

If schema.org supports this syntax (or just ignores it), then that may be a good way to keep things compact.
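
For illustration, here is a small Python sketch of what this could look like: the rows are stored compactly as a JSON literal (a value object with "@type": "@json", as defined by JSON-LD 1.1) and expanded to the verbose ListItem/PropertyValue form on demand. The "data" property name and the `to_item_list` helper are assumptions made for this example; nothing here is settled Croissant syntax.

```python
import json

compact = {
    "@type": "ml:RecordSet",
    "name": "embarkation_enum",
    # "data" is a hypothetical property name; "@json" marks a JSON literal.
    "data": {
        "@type": "@json",
        "@value": [
            {"key": "C", "label": "Cherbourg", "url": "wd:Q3667188"},
        ],
    },
}


def to_item_list(rows):
    """Expand compact rows into the ListItem/PropertyValue form."""
    return [
        {
            "@type": "ListItem",
            "item": [
                {"@type": "PropertyValue", "name": name, "value": value}
                for name, value in row.items()
            ],
        }
        for row in rows
    ]


verbose = to_item_list(compact["data"]["@value"])
print(json.dumps(verbose, indent=2))
```

Each compact row is one line of JSON, while the expanded form needs one PropertyValue object per column, which is the verbosity discussed above.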

marcenacp Jun 20, 2023
Maintainer Author

Good catch! It looks great, let me try it.

benjelloun · 2023-06-16T14:47:42Z

benjelloun
Jun 16, 2023
Maintainer

Just to clarify the alternative proposal:

Enums are just RecordSets defined in the dataset. Their data can be inlined as records, or provided through external files.
The standard references mechanism is used to link to them.

So your example becomes:

{
            "name": "embarkation",
            "description": "Port of Embarkation (C: Cherbourg, Q: Queenstown, S: Southampton, ?: Unknown).",
            "@type": "ml:Field",
            "dataType": "sc:Text",
            "source": "#{passengers-table/embarked}",
            "references": "#{embarkation_enum/key}"
          },

with the RecordSet of embarkation_enum defined the same way as in your example.

I don't think it's very likely, but if needed, we can have multiple 'references' defined for the same field.
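
A minimal sketch of how a consumer might use such a reference to check that every value of the field belongs to the enum. The parsing of the "#{record_set/field}" syntax is inferred from the examples in this thread; `parse_reference`, `invalid_values`, and the in-memory `record_sets` mapping are hypothetical.

```python
import re

# Matches references written as "#{record_set/field}".
REFERENCE = re.compile(r"^#\{(?P<record_set>[^/}]+)/(?P<field>[^}]+)\}$")


def parse_reference(reference):
    """Split "#{embarkation_enum/key}" into ("embarkation_enum", "key")."""
    match = REFERENCE.match(reference)
    if match is None:
        raise ValueError(f"Not a reference: {reference!r}")
    return match["record_set"], match["field"]


def invalid_values(values, reference, record_sets):
    """Return the values that are absent from the referenced column."""
    record_set, field = parse_reference(reference)
    allowed = {row[field] for row in record_sets[record_set]}
    return [value for value in values if value not in allowed]


# Hypothetical inlined records for the enum record set.
record_sets = {"embarkation_enum": [{"key": "C"}, {"key": "Q"}, {"key": "S"}]}
bad = invalid_values(["C", "S", "X"], "#{embarkation_enum/key}", record_sets)
```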

6 replies

marcenacp Jun 29, 2023
Maintainer Author

@benjelloun Thanks for the summary of our discussion. It completely makes sense! Enum/classes are an important notion in ML and should have their place in Croissant.

I still think the enum is a property of the Field, not of the RecordSet. If I take the example of the MovieLens dataset:

The RecordSet of movies could be perceived as a finite enumeration (example of ML task: predict which movie a user will watch in a finite catalogue).
It could also be perceived as a sample from an infinite set (example of ML task: predict the title of newly released movies given their features).

In that case, you want the RecordSet of movies to be reusable for both cases. Whether or not it is an enum is a property of the ml:Field of the RecordSet that declares it (not of the source RecordSet).

That's why I think Enum is a key ML concept for the features of a model, and should be held by Field as an ml:dataType.

Instead of writing:

{
      "@type": "ml:RecordSet",
      "name": "movies",
      "description": "The list of movies.",
      "ml:isEnumeration": true,  // <- enum is a property of ml:RecordSet
      "field": [
        {
          "@type": "ml:Field",
          "name": "id",
          "dataType": "sc:Integer"
        },
      ...
},
{
      "@type": "ml:RecordSet",
      "name": "movies_by_user",
      "field": [
        {
          "@type": "ml:Field",
          "name": "movie_id",
          "dataType": "sc:Integer",
          "references": "#{movies/id}"
        },
      ...
}

I would rather write:

{
      "@type": "ml:RecordSet",
      "name": "movies",
      "description": "The list of movies.",
      "field": [
        {
          "@type": "ml:Field",
          "name": "id",
          "dataType": "sc:Text"
        },
      ...
},
{
      "@type": "ml:RecordSet",
      "name": "movies_by_user",
      "field": [
        {
          "@type": "ml:Field",
          "name": "movie_id",
          "dataType": "ml:Enum",  // <- enum is a property of ml:Field
          "references": "#{movies/id}"
        },
      ...
}

This syntax also allows defining enums that do not have their own RecordSet, e.g. genre doesn't have its own RecordSet:

{
      "@type": "ml:RecordSet",
      "name": "movies_by_user",
      "field": [
        {
          "@type": "ml:Field",
          "name": "movie_genre",
          "dataType": "ml:Enum",  // <- interesting to have an ML model predicting the genre as an enum (not as a text)
          "references": "#{movies/genre}"
        },
      ...
}

I'd be happy to discuss this. What do you think?

benjelloun Jun 29, 2023
Maintainer

@marcenacp Thanks for describing your POV in detail with examples!

I still find it a bit strange to say that enum is a property of the field and not the recordset, but I can see that in some cases you want to use it that way. Still, the majority of enums should be fine to be specified at the recordset level.

How about supporting both, by allowing a Boolean ml:isEnumeration property to be specified either on a recordset or field level?

So you can either write:

{
      "@type": "ml:RecordSet",
      "name": "movies",
      "description": "The list of movies.",
      "ml:isEnumeration": true,  // <- enum is a property of ml:RecordSet
      "field": [
        {
          "@type": "ml:Field",
          "name": "id",
          "dataType": "sc:Integer"
        }]
}

or:

{
      "@type": "ml:RecordSet",
      "name": "movies_by_user",
      "field": [
        {
          "@type": "ml:Field",
          "name": "movie_genre",
          "dataType": "sc:Text", 
          "ml:isEnumeration": true, // <- enum is a property of ml:Field
          "references": "#{movies/genre}"
        }]
}

I think that's better than specifying enum as a dataType, since being an enumeration is orthogonal to the data type of the values.
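
To sketch how a consumer could support both levels: a field counts as an enumeration if it carries the flag itself, or if the record set it references carries it. This is an illustration under assumptions, not an agreed design; `is_enumeration` and the `record_sets` mapping (record set name to its JSON object) are hypothetical.

```python
def is_enumeration(field, record_sets):
    """True if the field, or the record set it references, is flagged as an enum."""
    if field.get("ml:isEnumeration") is True:
        return True
    reference = field.get("references")
    if reference is None:
        return False
    # "#{movies/genre}" -> "movies"
    target = reference.strip("#{}").split("/", 1)[0]
    return record_sets.get(target, {}).get("ml:isEnumeration") is True


record_sets = {
    "movies": {"@type": "ml:RecordSet", "name": "movies", "ml:isEnumeration": True},
    "ratings": {"@type": "ml:RecordSet", "name": "ratings"},
}

# Flag set on the field itself; the referenced record set is not an enum.
field_level = {
    "name": "rating_value",
    "dataType": "sc:Text",
    "ml:isEnumeration": True,
    "references": "#{ratings/value}",
}
# No flag on the field; the referenced record set is an enum.
record_set_level = {"name": "movie_id", "dataType": "sc:Integer", "references": "#{movies/id}"}
# Neither level declares an enum.
plain = {"name": "title", "dataType": "sc:Text"}
```

In both variants the dataType keeps describing the values only, so being an enumeration stays orthogonal to it.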

WDYT?

marcenacp Jun 29, 2023
Maintainer Author

This looks good!

I felt that a field being enumerated is strong typing information. This is why I proposed "dataType": "ml:Enum". We still have the actual data type through the reference (e.g., movies/genre is sc:Text). We could even support "dataType": ["ml:Enum", "sc:Text"].

benjelloun Jun 29, 2023
Maintainer

Cool! I suggest we keep the enum separate from dataType for now, and revisit this later if a strong need for combining them emerges.

pierrot0 Jun 29, 2023
Maintainer

Solution SGTM, thanks!
