[DOC] Document the Derived Field feature #6943

Closed
1 of 4 tasks
rishabhmaurya opened this issue Apr 11, 2024 · 6 comments · Fixed by #7329
Labels: 3 - Done (Issue is done/complete), v2.15.0

Comments

@rishabhmaurya
Contributor

rishabhmaurya commented Apr 11, 2024

What do you want to do?

  • Request a change to existing documentation
  • Add new documentation
  • Report a technical problem with the documentation
  • Other

Tell us about your request. Provide a summary of the request and all versions that are affected.

Derived fields allow users to add or manipulate existing indexed fields by running scripts on the _source document. This eliminates the need to index or store these fields separately while still enabling queries on them. However, this flexibility comes with a trade-off in query latency: each matching document is evaluated against these derived fields by loading its _source document and running the script to determine whether the document is a match.
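
As a rough mental model of this trade-off, a derived field behaves like a function applied to each candidate document's _source at query time. The Python sketch below is only an illustration under that assumption, not how OpenSearch implements the feature; search_derived and its arguments are hypothetical names.

```python
from typing import Any, Callable, Dict, List, Tuple

Source = Dict[str, Any]


def search_derived(
    sources: List[Source],
    script: Callable[[Source], Any],
    predicate: Callable[[Any], bool],
) -> Tuple[List[Source], int]:
    """Filter documents on a value computed at query time.

    Returns the matching documents and the number of script runs,
    which is the per-document cost an indexed field would avoid.
    (Hypothetical helper for illustration only.)
    """
    matches = []
    script_runs = 0
    for source in sources:      # load each candidate's _source
        value = script(source)  # run the script (the "emit" step)
        script_runs += 1
        if predicate(value):
            matches.append(source)
    return matches, script_runs


docs = [{"a": 2, "b": 3}, {"a": 4, "b": 5}, {"a": 1, "b": 1}]
hits, runs = search_derived(docs, lambda s: s["a"] * s["b"], lambda v: v >= 6)
print(hits, runs)
```

The script runs once per document examined, so in this model the cost grows with the number of documents the query has to check rather than with the number of hits returned.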

Meta issue: opensearch-project/OpenSearch#12281
RFC: opensearch-project/OpenSearch#1133

Key Benefits

  • Adds or modifies fields on the fly at query time.
  • Reduces storage requirements by avoiding direct indexing of derived fields.
  • Enables dynamic data transformations and enrichments.

Supported types

boolean, keyword, date, long, double, geo_point, ip.

What other resources are available?
Steps for testing -

Example 1

Step 1: Create Index Mapping with Derived Fields

curl -X PUT "localhost:9200/my_index?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "product_name": { "type": "keyword" },
      "price": { "type": "double" },
      "quantity": { "type": "integer" }
    },
    "derived": {
      "total_cost": {
        "type": "double",
        "script": {
          "source": "emit(doc[\"price\"].value * doc[\"quantity\"].value)"
        }
      }
    }
  }
}
'

In this example, we define a derived field total_cost that calculates the total cost by multiplying the price and quantity fields.

Step 2: Ingest Documents

curl -X POST "localhost:9200/my_index/_bulk?pretty" -H 'Content-Type: application/json' -d'
{"index":{}}
{"product_name": "Widget", "price": 10.5, "quantity": 100}
{"index":{}}
{"product_name": "Gadget", "price": 15.75, "quantity": 50}
{"index":{}}
{"product_name": "Tool", "price": 8.25, "quantity": 200}
{"index":{}}
{"product_name": "Appliance", "price": 50.0, "quantity": 10}
{"index":{}}
{"product_name": "Accessory", "price": 5.0, "quantity": 300}
'

We have indexed 5 documents representing various products with their prices and quantities.

Step 3: Query Based on Derived Field

curl -X GET "localhost:9200/my_index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "total_cost": {
        "gte": 1000
      }
    }
  },
  "fields" : ["*"]
}
'

This query retrieves documents where the total_cost is greater than or equal to 1000.
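
To sanity-check the expected hits without a cluster, the arithmetic behind total_cost can be reproduced locally. This is a plain Python sketch that mirrors the script's multiplication on the five documents from Step 2; it does not call OpenSearch, and the total_cost function here is only a stand-in for the derived field script.

```python
products = [
    {"product_name": "Widget", "price": 10.5, "quantity": 100},
    {"product_name": "Gadget", "price": 15.75, "quantity": 50},
    {"product_name": "Tool", "price": 8.25, "quantity": 200},
    {"product_name": "Appliance", "price": 50.0, "quantity": 10},
    {"product_name": "Accessory", "price": 5.0, "quantity": 300},
]


def total_cost(source):
    # Stand-in for: emit(doc["price"].value * doc["quantity"].value)
    return source["price"] * source["quantity"]


# Equivalent of the range query: total_cost >= 1000
matching = [p["product_name"] for p in products if total_cost(p) >= 1000]

for p in products:
    print(p["product_name"], total_cost(p))
print("matches:", matching)
```

Under these assumptions, Widget (1050.0), Tool (1650.0), and Accessory (1500.0) satisfy the range, while Gadget (787.5) and Appliance (500.0) do not.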

Example 2

User sentiment analysis

Key Components

  • Derived Field: sentiment - Calculates the sentiment of the text based on predefined rules.
  • Indexed Fields: Additional fields for context-based queries.
  • User ID Filter: Narrowing down search results based on user ID.

Step 1: Create Index Mapping with Derived Field

curl -X PUT "localhost:9200/sentiment_analysis?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "text": { "type": "text", "index": false },
      "category": { "type": "keyword" },
      "user_id": { "type": "keyword" }
    },
    "derived": {
      "sentiment": {
        "type": "keyword",
        "script": {
          "source": "if (params._source[\"text\"].contains(\"happy\")) { emit(\"positive\") } else if (params._source[\"text\"].contains(\"sad\")) { emit(\"negative\") } else { emit(\"neutral\") }"
        }
      }
    }
  }
}
'

This mapping defines a derived field sentiment that assigns sentiment labels (positive, negative, neutral) based on specific keywords in the text, along with indexed fields category and user_id for context-based queries and user filtering.

Step 2: Ingest Documents

curl -X POST "localhost:9200/sentiment_analysis/_bulk?pretty" -H 'Content-Type: application/json' -d'
{"index":{}}
{"text": "I am feeling happy today!", "category": "personal", "user_id": "123"}
{"index":{}}
{"text": "The news made me sad.", "category": "news", "user_id": "456"}
{"index":{}}
{"text": "The weather is neutral.", "category": "weather", "user_id": "789"}
'

We've indexed 3 documents with varying sentiments, categories, and user IDs for context-based queries and user filtering.

Step 3: Query Based on Sentiment, Indexed Fields, and User ID

Positive Sentiment in Personal Category for User ID "123"

curl -X GET "localhost:9200/sentiment_analysis/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "sentiment": "positive" } },
        { "match": { "category": "personal" } },
        { "match": { "user_id": "123" } }
      ]
    }
  },
  "fields" : ["*"]
}
'

This query retrieves documents with a positive sentiment in the personal category for the user with ID "123", combining sentiment analysis with indexed field queries and user filtering.
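
The rule encoded in the sentiment script, combined with the three must clauses, can be traced by hand. The Python sketch below mirrors that logic on the three sample documents; it is an offline illustration only, and it treats each match clause as an exact comparison, which is an assumption that holds for these keyword values.

```python
def sentiment(source):
    # Mirrors the script: a case-sensitive substring check on "text".
    text = source["text"]
    if "happy" in text:
        return "positive"
    elif "sad" in text:
        return "negative"
    return "neutral"


docs = [
    {"text": "I am feeling happy today!", "category": "personal", "user_id": "123"},
    {"text": "The news made me sad.", "category": "news", "user_id": "456"},
    {"text": "The weather is neutral.", "category": "weather", "user_id": "789"},
]

# Equivalent of the bool/must query: all three conditions must hold.
hits = [
    d for d in docs
    if sentiment(d) == "positive"
    and d["category"] == "personal"
    and d["user_id"] == "123"
]

print([sentiment(d) for d in docs])
print(hits)
```

Only the first document satisfies all three clauses. Because the check is a plain substring match, a text containing "Happy" with a capital H would be labeled neutral by this rule.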

Definition of derived field in search request

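The same derived field can also be defined in the search request instead of in the index mapping. Here is an example. If the sentiment_analysis index from Example 2 already exists, delete it first or use a different index name, because creating an index that already exists fails.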

Step 1: Create index

curl -X PUT "localhost:9200/sentiment_analysis?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "text": { "type": "text", "index": false },
      "category": { "type": "keyword" },
      "user_id": { "type": "keyword" }
    }
  }
}
'

Step 2: Ingest Documents

curl -X POST "localhost:9200/sentiment_analysis/_bulk?pretty" -H 'Content-Type: application/json' -d'
{"index":{}}
{"text": "I am feeling happy today!", "category": "personal", "user_id": "123"}
{"index":{}}
{"text": "The news made me sad.", "category": "news", "user_id": "456"}
{"index":{}}
{"text": "The weather is neutral.", "category": "weather", "user_id": "789"}
'

Step 3: Query Based on Sentiment, Indexed Fields, and User ID

curl -X GET "localhost:9200/sentiment_analysis/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "sentiment": "positive" } },
        { "match": { "category": "personal" } },
        { "match": { "user_id": "123" } }
      ]
    }
  },
  "fields" : ["*"],
  "derived": {
    "sentiment": {
      "type": "keyword",
      "script": {
        "source": "if (params._source[\"text\"].contains(\"happy\")) { emit(\"positive\") } else if (params._source[\"text\"].contains(\"sad\")) { emit(\"negative\") } else { emit(\"neutral\") }"
      }
    }
  }
}
'
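
When the derived definition travels with each search request, it can be easier to assemble the body programmatically than to hand-escape the script inside a shell string. The following Python sketch builds the same request body as the curl command above and serializes it with the standard json module; build_search_body is a hypothetical helper, and sending the request is left out.

```python
import json

# The same script source as in the request above, unescaped.
SENTIMENT_SCRIPT = (
    'if (params._source["text"].contains("happy")) { emit("positive") } '
    'else if (params._source["text"].contains("sad")) { emit("negative") } '
    'else { emit("neutral") }'
)


def build_search_body(sentiment, category, user_id):
    """Return a search body that defines the derived field inline.

    Hypothetical helper for illustration only.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"sentiment": sentiment}},
                    {"match": {"category": category}},
                    {"match": {"user_id": user_id}},
                ]
            }
        },
        "fields": ["*"],
        "derived": {
            "sentiment": {
                "type": "keyword",
                "script": {"source": SENTIMENT_SCRIPT},
            }
        },
    }


body = build_search_body("positive", "personal", "123")
print(json.dumps(body, indent=2))  # json.dumps escapes the inner quotes
```

The printed JSON can be passed as the body of a _search request against the sentiment_analysis index.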
@hdhalter
Contributor

Hi @rishabhmaurya , what is your ETA for producing the doc PR? To meet entrance criteria, the PR must be open and in review by 4/30. Thanks!

@getsaurabh02
Member

@rishabhmaurya to add the PR for 2.14 related doc content.

@hdhalter
Contributor

Doc in progress.

@rishabhmaurya
Contributor Author

rishabhmaurya commented Apr 30, 2024

Given this feature isn't complete yet (it's missing aggregation and scoring support), we will move it out of 2.14.

cc: @smacrakis @getsaurabh02

@hdhalter
Contributor

hdhalter commented Jun 6, 2024

@rishabhmaurya will submit PR by end of day 6/6. Thanks!

@rishabhmaurya
Contributor Author

rishabhmaurya commented Jun 7, 2024

@hdhalter Here is the draft PR: https://github.com/opensearch-project/documentation-website/pull/7329/files
Feel free to take a look. I need to work on the last section before I publish it.

@hdhalter hdhalter added 2 - In progress Issue/PR: The issue or PR is in progress. and removed 1 - Backlog - DEV Developer assigned to issue is responsible for creating PR. labels Jun 7, 2024
@hdhalter hdhalter added the 3 - Done Issue is done/complete label Jun 14, 2024
@hdhalter hdhalter removed the 2 - In progress Issue/PR: The issue or PR is in progress. label Jun 14, 2024
4 participants