Guidance on how to use sameAs
#702
Our own docs say to use the sameAs property to flag "The URL of another Web resource that represents the same dataset as this one." What level of "sameness" are we okay with here?

For example, Kaggle is working with OpenML to identify datasets that are the same across the two platforms, in order to link them in the UI (and in Croissant). For some otherwise valid matches, column names and ordering are the only differences. Is this still satisfactory for flagging with sameAs?

Let's take it one step further. Say we're okay with differing column names and ordering, but some column has also been converted from a string-based "T"/"F" to a boolean-style 1/0. Is that also good enough? Or is all of this up for interpretation on a per-user/repo basis?
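To make the two cases above concrete, here is a minimal sketch of a "same up to layout" check. It is not how Kaggle or OpenML actually match datasets; the function names, the column-dict representation, and the sample data are all hypothetical, and it assumes row order is preserved between the two copies.

```python
from collections import Counter


def normalize(value):
    """Map "T"/"F" strings and Python bools to 1/0; leave other values unchanged."""
    if isinstance(value, str) and value in ("T", "F"):
        return 1 if value == "T" else 0
    if isinstance(value, bool):
        return int(value)
    return value


def column_fingerprint(columns):
    """Multiset of normalized column contents, ignoring column names and column order."""
    return Counter(
        tuple(normalize(v) for v in values) for values in columns.values()
    )


def same_modulo_layout(a, b):
    """True if both tables hold the same columns up to renaming, reordering, and T/F vs 1/0."""
    return column_fingerprint(a) == column_fingerprint(b)


# Hypothetical copies of one dataset on two platforms (column name -> values).
kaggle_copy = {"id": [1, 2, 3], "is_active": ["T", "F", "T"]}
openml_copy = {"active": [1, 0, 1], "identifier": [1, 2, 3]}
other_data = {"id": [1, 2, 3], "is_active": ["T", "T", "T"]}

print(same_modulo_layout(kaggle_copy, openml_copy))
print(same_modulo_layout(kaggle_copy, other_data))
```

Whether a match under a check like this should count as "the same dataset" for sameAs is exactly the policy question being asked; the code only pins down what the two scenarios mean.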
Replies: 1 comment 1 reply
-
Great question! My take on it is that sameAs is talking about the object being described, so the question is: Is it the same dataset? I would tend to say that if the column order was changed or the representation of some field was modified, then it's still the same dataset, so using sameAs is okay.

That said... the definition of sameAs in schema.org is a bit more stringent: "URL of a reference Web page that unambiguously indicates the item's identity. E.g. the URL of the item's Wikipedia page, Wikidata entry, or official website."

So in theory, sameAs should be used to point to a "canonical" URL for the dataset (such as a DOI, or a reference page for the dataset, e.g., from its original provider), and not as a way to map between two non-canonical versions of the dataset. But there is no simple property readily available in schema.org that we can use instead. The solutions I see are:

I'm leaning towards 1, as it's the simplest approach, and it may be difficult for users to know which URL is the canonical one for a given dataset. WDYT?
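For reference, the stricter schema.org reading (sameAs pointing at a canonical URL) might look like the sketch below. This is a plain schema.org Dataset description, not a complete Croissant file, and the dataset name and DOI are made-up placeholders.

```python
import json

# Minimal schema.org-style description; name and DOI are hypothetical.
metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "example-dataset",
    # Stricter reading: sameAs identifies the dataset via a canonical URL
    # (a DOI or the original provider's page), not another mirror.
    "sameAs": "https://doi.org/10.0000/example",
}

serialized = json.dumps(metadata, indent=2)
print(serialized)
```

Under the looser reading discussed above, the same field would instead hold the URL of the matching dataset page on the other platform.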