[Idea] Introduce new compute role / phase (aka: dynamic fields) #1912

nknize · 2022-01-14T19:30:15Z

Dynamic fields (e.g., runtime fields) are useful for many different applications (search-time transforms, joins, scripted field search, etc.).

One way of achieving this is to create in-memory doc values at search time. This is what the FieldData implementation did before doc values were leveraged in Lucene. It is memory intensive and slow. Some users accept the performance penalty in exchange for flexibility. I postulate that "make it slow" (prioritizing flexibility over performance) should not be a trade-off when it can be avoided, even if there is a mechanism such as the async API.

I'd like to propose and brainstorm using IndexWriter and an fsync-free implementation of Lucene commit to write local segments to a temporary directory. These segments would hold an on-disk index or doc value representation of the search results.

In this manner the segments look just like a persisted index, with one difference: by default they're intended to be "volatile" and short-lived. The benefit is a reduced runtime view of the global index (based on a user-defined query) that can be further inspected or joined with any additional query or aggregation. Users can also choose to promote the volatile compute segments to a new index, creating a SQL-like ability to persist views in new indexes.
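The lifecycle described above (write a reduced view to a temporary location, query it, optionally promote it) can be sketched without Lucene. The following is a Python illustration only. It assumes a JSON-lines file as a stand-in for a segment, and the names `VolatileView`, `scan`, and `promote` are hypothetical. It does not use IndexWriter, and the only sense in which it is "fsync-free" is that it never calls fsync.

```python
import json
import shutil
import tempfile
from pathlib import Path


class VolatileView:
    """Hypothetical stand-in for a short-lived compute segment (not a Lucene API)."""

    def __init__(self, docs, predicate):
        # A temporary directory plays the role of the volatile segment directory.
        self._tmp = Path(tempfile.mkdtemp(prefix="volatile-view-"))
        self.path = self._tmp / "segment.jsonl"
        with self.path.open("w", encoding="utf-8") as fh:
            for doc in docs:
                if predicate(doc):
                    fh.write(json.dumps(doc) + "\n")
        # Deliberately no os.fsync(): the view is disposable, so durability is skipped.

    def scan(self):
        """Read back every document in the reduced view."""
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh]

    def promote(self, target_dir):
        """Persist the view by copying it to a durable location."""
        target = Path(target_dir)
        target.mkdir(parents=True, exist_ok=True)
        return Path(shutil.copy2(self.path, target / self.path.name))

    def close(self):
        """Discard the volatile data."""
        shutil.rmtree(self._tmp, ignore_errors=True)


# Made-up documents standing in for the global index.
global_index = [
    {"id": 1, "status": 404},
    {"id": 2, "status": 200},
    {"id": 3, "status": 404},
]
view = VolatileView(global_index, lambda d: d["status"] == 404)
ids = [d["id"] for d in view.scan()]
```

In this sketch, a view that is never promoted disappears on `close()`, while a promoted copy survives it. That mirrors the "volatile by default, persist on request" behavior in the proposal.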

to be continued...

penghuo · 2022-01-19T18:24:29Z

Adding one use case from the PPL parse command. For example, the parse command extracts field:value pairs from the raw_log field in raw_index, then uses the extracted fields for filtering or aggregation.

source = raw_index | parse raw_log "[timestamp] [ip] [status]" | filter status="404" | stats count() by ip

In our current design, we rewrite the parse command as a customized Script, so the query is rewritten as the following execution plan (it is not the actual plan, for explanation only). One major concern is that raw_log is parsed twice, once at query time and once at aggregation time. If I understand correctly, we could define timestamp, ip, and status as dynamic fields. These dynamic fields could then be used at both query and aggregation time.

{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": "return parse(raw_log, '[timestamp] [ip] [status]', status) == 404;"
        }
      }
    }
  },
  "aggs": {
    "genres": {
      "terms": {
        "script": {
          "source": "parse(raw_log, '[timestamp] [ip] [status]', ip)"
        }
      }
    }
  }
}
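The concern above (parsing raw_log once for the filter and again for the aggregation) can be illustrated outside the engine. The following is a minimal Python sketch, not OpenSearch or PPL code. The log layout, the `parse_raw_log` helper, and the sample documents are all hypothetical. It materializes timestamp, ip, and status once per document as dynamic fields, then reuses them for both the filter and the aggregation.

```python
import re
from collections import Counter

# Hypothetical log layout "[timestamp] [ip] [status]"; real PPL pattern syntax may differ.
LOG_PATTERN = re.compile(
    r"\[(?P<timestamp>[^\]]+)\] \[(?P<ip>[^\]]+)\] \[(?P<status>[^\]]+)\]"
)

parse_calls = 0  # counts how often raw_log is actually parsed


def parse_raw_log(raw_log):
    """Extract the dynamic fields from one raw_log value (empty dict if no match)."""
    global parse_calls
    parse_calls += 1
    match = LOG_PATTERN.match(raw_log)
    return match.groupdict() if match else {}


def materialize(docs):
    """Parse each document once and attach the extracted fields to it."""
    return [{**doc, **parse_raw_log(doc["raw_log"])} for doc in docs]


def count_404_by_ip(docs):
    """Rough equivalent of: filter status="404" | stats count() by ip."""
    view = materialize(docs)  # single parse pass
    matching = [d for d in view if d.get("status") == "404"]  # filter reuses fields
    return Counter(d["ip"] for d in matching)  # aggregation reuses fields


# Made-up documents standing in for raw_index.
raw_index = [
    {"raw_log": "[2022-01-19T18:00:00Z] [10.0.0.1] [404]"},
    {"raw_log": "[2022-01-19T18:00:01Z] [10.0.0.2] [200]"},
    {"raw_log": "[2022-01-19T18:00:02Z] [10.0.0.1] [404]"},
]
result = count_404_by_ip(raw_index)
```

Here `parse_calls` ends up equal to the number of documents, whereas the script-based plan would parse every document in the filter and then parse each matching document again in the aggregation.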

Bukhtawar · 2022-01-21T12:13:43Z

Is this similar to #1133, @nknize?

nknize · 2022-01-21T17:33:32Z

Is this similar to #1133

"Schema on read" is one use case for dynamic fields. This proposal is a mechanism for achieving "schema on read", along with query-time enrichment, joins, etc.

nknize added the discuss, feature, Indexing & Search, and Search:Aggregations labels Jan 14, 2022

anasalkouz removed the Indexing & Search label Sep 19, 2023
