This repository has been archived by the owner on May 7, 2021. It is now read-only.

Data Schema

Shreekantha Devasya edited this page May 19, 2020 · 1 revision

Data Schema is the mechanism used to describe the feature space in the Learning Agent. Before explaining how features are described, we should understand how to describe the input data using a data schema. The data schema was inspired by JSON Schema but adapted to the Learning Agent use case.

Getting Started with Data Schemas

A Data Schema consists of a tree of schemas in which each node can be a primitive schema, a structure schema, or an enumeration of strings. All nodes have a name, a type, needed (whether this node is required; true by default), and target (whether this node is part of the ground truth; false by default). Primitive nodes can be of type number, int/integer, double, string, or date/time, and they have no additional fields. Structure nodes can be of type object/map or array (see below). Finally, there are enumerations, which are of type enum.

Primitive example

{
  "name": "example",
  "type": "string",
  "needed": false
}

Structure Node

Structure nodes have all the fields that primitive nodes have, plus additional ones, and some fields can be interpreted differently. There are two types of structure node: object/map and array.

Object/Map

The object has a Map/Dictionary of properties in the field named properties, which holds additional schema nodes of any type. Additional fields related to the properties are:

  • required: a set/list of strings indicating which of the properties are needed. This is equivalent to setting needed=true on each property in the properties map that is named in the required list.
  • targets: a set/list of strings indicating which of the properties are targets or ground truth. This is equivalent to setting target=true on each property in the properties map that is named in the targets list.

Simple map example

{
   "name":"simpleMapSchema",
   "type":"object",
   "properties":{
      "property1":{
         "type":"string"
      },
      "property2":{
         "type":"int"
      }
   }
}
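
The equivalence between the required/targets lists and the per-property needed/target flags can be sketched in Python. This is only an illustration of the rule described above, not the Learning Agent's own code: the helper name expand_flags is hypothetical, and the sketch touches only the properties that are actually listed.

```python
import copy


def expand_flags(schema):
    """Return a copy of an object schema in which 'required' and 'targets'
    are folded into per-property 'needed' and 'target' flags.

    Hypothetical helper: it only illustrates the documented equivalence.
    """
    expanded = copy.deepcopy(schema)
    properties = expanded.get("properties", {})
    # Each name in 'required' becomes needed=true on that property.
    for name in expanded.pop("required", []):
        properties[name]["needed"] = True
    # Each name in 'targets' becomes target=true on that property.
    for name in expanded.pop("targets", []):
        properties[name]["target"] = True
    return expanded


schema = {
    "name": "simpleMapSchema",
    "type": "object",
    "properties": {
        "property1": {"type": "string"},
        "property2": {"type": "int"},
    },
    "required": ["property1"],
    "targets": ["property2"],
}

expanded = expand_flags(schema)
print(expanded["properties"])
```

The original schema is left unchanged; only the returned copy carries the explicit flags.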

Array

Named Array

The named array has a list of schema nodes described in the items list field. The additional fields of the array are the following:

  • targetSize: indicates how many of the items are targets. E.g., if the items field has size 10 and targetSize is 3, the last 3 elements are set as target=true and the first 7 as target=false.
  • minValue: if the arrays can vary in size, minValue indicates the minimum number of values the array can have. minValue sets all elements beyond that threshold to needed=false.

Simple named array example

{
   "name":"simpleListArraySchema",
   "type":"array",
   "items":[
      {
         "name":"item1",
         "type":"string"
      },
      {
         "name":"item2",
         "type":"int"
      }
   ]
}
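
The effect of targetSize and minValue on a named array can be sketched as follows. The helper apply_array_flags and the item names are hypothetical, and the sketch assumes that "over the threshold" means items at positions minValue and beyond (counting from zero).

```python
import copy


def apply_array_flags(schema):
    """Return a copy of a named array schema with per-item 'target' and
    'needed' flags derived from 'targetSize' and 'minValue'.

    Hypothetical helper illustrating the documented rules.
    """
    expanded = copy.deepcopy(schema)
    items = expanded["items"]
    target_size = expanded.get("targetSize", 0)
    min_size = expanded.get("minValue")
    first_target = len(items) - target_size
    for index, item in enumerate(items):
        # The last 'targetSize' items are targets, the others are not.
        item["target"] = index >= first_target
        # Assumption: items beyond the minimum size become optional.
        if min_size is not None and index >= min_size:
            item["needed"] = False
    return expanded


schema = {
    "name": "exampleArraySchema",  # hypothetical name
    "type": "array",
    "items": [{"name": f"item{i}", "type": "int"} for i in range(1, 11)],
    "targetSize": 3,
    "minValue": 8,
}

flagged = apply_array_flags(schema)
for item in flagged["items"]:
    print(item["name"], item["target"], item.get("needed", True))
```

With 10 items, targetSize 3 and minValue 8, the last three items are targets and the last two are optional.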

Anonymous Array

The anonymous array describes the items as a whole instead of one by one. In other words, an anonymous array describes how many items it has using size and of which type they are using ofType. Finally, targetSize and minValue work the same way as in a named array.

Simple anonymous array example

{
   "name":"simpleAnonymousSchema",
   "type":"array",
   "size":10,
   "ofType":"int"
}
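
One way to picture an anonymous array is as shorthand for a named array with size items of the same type. The sketch below makes that reading explicit; the helper expand_anonymous_array and the generated item names are hypothetical and are not taken from the Learning Agent.

```python
def expand_anonymous_array(schema):
    """Expand an anonymous array ('size' + 'ofType') into an equivalent
    named array with an explicit 'items' list.

    Hypothetical helper; the generated item names are placeholders.
    """
    item_type = schema["ofType"]
    items = [
        {"name": f"item{index}", "type": item_type}
        for index in range(schema["size"])
    ]
    # Keep every other field (name, type, targetSize, ...) unchanged.
    named = {
        key: value
        for key, value in schema.items()
        if key not in ("size", "ofType")
    }
    named["items"] = items
    return named


anonymous = {
    "name": "simpleAnonymousSchema",
    "type": "array",
    "size": 10,
    "ofType": "int",
}

named = expand_anonymous_array(anonymous)
print(len(named["items"]), named["items"][0])
```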

Special fields

There is a set of fields that behave differently depending on where they are used:

  • defaultValue: if the default is set in a primitive type, this value is used whenever the element has no value. Its real usefulness, however, comes with structured types: if the default value is set in a structured type, all missing elements in the properties or items are added in the correct position with the default value. In practice, this property makes minValue for the array type and required in maps/objects ineffective.
  • ceilingValue: if the ceiling value is set in a primitive type (only numeric types are allowed) and the element has a value above the ceiling value, the ceiling value is used instead. In structured types, it has the same effect on all elements in the properties map or items list.
  • floorValue: if the floor value is set in a primitive type (only numeric types are allowed) and the element has a value below the floor value, the floor value is used instead. In structured types, it has the same effect on all elements in the properties map or items list.
  • minValue: if the min value is set in a primitive type (only numeric types are allowed) and the element has a value below the min value, the validation fails. It can also be used in an array; see Array for more information.
  • maxValue: if the max value is set in a primitive type (only numeric types are allowed) and the element has a value above the max value, the validation fails.
  • skip: if the skip property is true, whatever is found in the data for this node and all its sub-nodes (if any) is accepted.
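
For a numeric primitive node, the special fields above can be sketched as a single function. This is a minimal illustration under stated assumptions: the function process_number and the example node are hypothetical, and the order used here (default first, then min/max validation, then floor/ceiling clamping) is an assumption, since the order is not specified above.

```python
def process_number(value, schema):
    """Apply the special fields of a numeric primitive node to one value.

    Hypothetical sketch; the order of the steps is an assumption.
    """
    if schema.get("skip", False):
        return value  # skipped nodes accept whatever is in the data
    if value is None:
        # defaultValue replaces a missing value.
        value = schema.get("defaultValue")
        if value is None:
            return None  # missing without default: 'needed' is not modelled here
    # minValue / maxValue make the validation fail.
    if "minValue" in schema and value < schema["minValue"]:
        raise ValueError(f"{value} is below minValue {schema['minValue']}")
    if "maxValue" in schema and value > schema["maxValue"]:
        raise ValueError(f"{value} is above maxValue {schema['maxValue']}")
    # floorValue / ceilingValue replace out-of-range values instead of failing.
    if "floorValue" in schema:
        value = max(value, schema["floorValue"])
    if "ceilingValue" in schema:
        value = min(value, schema["ceilingValue"])
    return value


node = {
    "name": "exampleNumber",  # hypothetical node
    "type": "double",
    "defaultValue": 20.0,
    "floorValue": 0.0,
    "ceilingValue": 100.0,
    "minValue": -50.0,
    "maxValue": 150.0,
}

print(process_number(None, node))   # missing value, default is used
print(process_number(120.0, node))  # above the ceiling, clamped
print(process_number(-10.0, node))  # below the floor, clamped
try:
    process_number(200.0, node)     # above maxValue, validation fails
except ValueError as error:
    print("validation failed:", error)
```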

Advanced anonymous array example

{
   "name":"boundedAnonymousSchema",
   "type":"array",
   "minValue":5,
   "maxValue":10,
   "defaultValue":7,
   "ofType":"int"
}

Defined Nodes

It's possible to define node types in the parent node using the field definition. Every time a node uses the field ofDefinition, the name is looked up in the definition dictionary and the node is replaced with what is found there.

Advanced definition usage example

{
   "name":"mapDefTest",
   "type":"object",
   "properties":{
      "property1":{
         "ofDefinition":"test1"
      },
      "property2":{
         "ofDefinition":"test2"
      }
   },
   "definition":{
      "test1":{
         "name":"simpleMapSchema",
         "type":"object",
         "properties":{
            "property1":{
               "type":"string"
            },
            "property2":{
               "type":"int"
            }
         }
      },
      "test2":{
         "name":"simpleListArraySchema",
         "type":"array",
         "items":[
            {
               "name":"item1",
               "type":"string"
            },
            {
               "name":"item2",
               "type":"int"
            }
         ]
      }
   }
}
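
The replacement of ofDefinition references can be sketched as a recursive walk over the schema tree. The helper resolve_definitions is hypothetical; the sketch assumes that definitions declared on a node are visible to all of its descendants and that definitions do not refer to themselves.

```python
import copy


def resolve_definitions(node, inherited=None):
    """Return a copy of the schema in which every 'ofDefinition' node is
    replaced by the matching entry of the 'definition' dictionary.

    Hypothetical sketch; assumes definitions are not self-referential.
    """
    # Definitions on this node extend those inherited from ancestors.
    scope = dict(inherited or {})
    scope.update(node.get("definition", {}))
    if "ofDefinition" in node:
        replacement = copy.deepcopy(scope[node["ofDefinition"]])
        return resolve_definitions(replacement, scope)
    resolved = {key: value for key, value in node.items() if key != "definition"}
    if "properties" in node:
        resolved["properties"] = {
            name: resolve_definitions(child, scope)
            for name, child in node["properties"].items()
        }
    if "items" in node:
        resolved["items"] = [
            resolve_definitions(child, scope) for child in node["items"]
        ]
    return resolved


schema = {
    "name": "mapDefTest",
    "type": "object",
    "properties": {
        "property1": {"ofDefinition": "test1"},
        "property2": {"ofDefinition": "test2"},
    },
    "definition": {
        "test1": {
            "name": "simpleMapSchema",
            "type": "object",
            "properties": {
                "property1": {"type": "string"},
                "property2": {"type": "int"},
            },
        },
        "test2": {
            "name": "simpleListArraySchema",
            "type": "array",
            "items": [
                {"name": "item1", "type": "string"},
                {"name": "item2", "type": "int"},
            ],
        },
    },
}

resolved = resolve_definitions(schema)
print(resolved["properties"]["property1"]["type"])
print(resolved["properties"]["property2"]["type"])
```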

Data Schema for describing the Feature space

To describe a feature space, the agent needs a data schema whose root node is an array or an object/map and which describes a target/ground truth (in the case of supervised learning).

Simple feature space for linear regression

{
   "type":"array",
   "size":2,
   "targetSize":1,
   "ofType":"int"
}
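
Following the targetSize rule for arrays, this schema reads each sample as one input feature followed by one target value. The sketch below shows that split; the function split_sample and the sample data are hypothetical and only illustrate how the schema partitions a sample.

```python
def split_sample(sample, schema):
    """Split one array sample into (features, targets) using the
    'size' and 'targetSize' fields of an anonymous array schema.

    Hypothetical helper illustrating the feature-space schema.
    """
    size = schema["size"]
    target_size = schema.get("targetSize", 0)
    if len(sample) != size:
        raise ValueError(f"expected {size} values, got {len(sample)}")
    # The last 'targetSize' elements are the targets (ground truth).
    cut = size - target_size
    return sample[:cut], sample[cut:]


feature_space = {
    "type": "array",
    "size": 2,
    "targetSize": 1,
    "ofType": "int",
}

# Made-up samples of the form [x, y] for a linear regression y = f(x).
samples = [[1, 2], [2, 4], [3, 6]]
for sample in samples:
    features, targets = split_sample(sample, feature_space)
    print(features, targets)
```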